Exploring the Trump Twitter Archive with PyCaret

For adventurous beginners in NLP.


For this project, we’ll be using PyCaret:

PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.¹

PyCaret does a lot more than NLP. It also covers a whole slew of supervised and unsupervised ML tasks, including classification, regression, clustering, anomaly detection, and association rule mining.

To learn more, check out Moez Ali’s announcement.


Housekeeping

Let’s begin by installing PyCaret. Just run pip install pycaret and we are good to go! Note: PyCaret is a big library, so you may want to grab a cup of coffee while waiting for it to install.

We also need to download the spaCy English language model and the TextBlob corpora, because they are not downloaded automatically with PyCaret:

python -m spacy download en_core_web_sm
python -m textblob.download_corpora
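
Before moving on, it can help to confirm that the packages are actually importable. The snippet below is a minimal sketch using only the standard library; the helper name is_installed is made up for this post. It assumes the spaCy model is installed as a regular Python package named en_core_web_sm, and it only checks that the textblob package is present, not that its corpora were downloaded.

```python
import importlib.util

def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be found on the import path."""
    return importlib.util.find_spec(package_name) is not None

# Assumption: the spaCy model is installed as a package called en_core_web_sm.
for name in ("pycaret", "spacy", "en_core_web_sm", "textblob"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If anything shows up as missing, rerun the matching install or download command above.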

Getting the Data

Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.

import pandas as pd
from pycaret.nlp import *
df = pd.read_csv('trump_20200530.csv')

Let’s check the shape of our data first:

df.shape

And let’s take a quick look:

df.head()

For expediency, let’s sample only 1,000 tweets.

# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)
df.shape

Topic Modeling

The fun part!

nlp = setup(data = df, target = 'text', session_id = 493,
            custom_stopwords = ['rt', 'https', 'http', 'co', 'amp'])

PyCaret’s setup() function performs the following text-processing steps:

  1. Removing Numeric Characters
  2. Removing Special Characters
  3. Word Tokenization
  4. Stopword Removal
  5. Bigram Extraction
  6. Trigram Extraction
  7. Lemmatizing
  8. Custom Stopwords

And all in one line of code!

It takes two required parameters: the dataframe in data and the name of the text column in target. In our case, we also used two optional parameters: session_id for reproducibility and custom_stopwords to reduce the noise coming from the tweets (retweet markers, URL fragments, and the leftover "amp" from HTML-encoded ampersands).
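
To get a feel for what a few of these steps do to a tweet, here is a rough, standard-library-only sketch. It is not PyCaret’s implementation: the preprocess function, the tiny STOPWORDS set, and the sample text are all made up for illustration, and bigram/trigram extraction and lemmatizing are left out.

```python
import re

# Tiny illustrative stopword list; a real pipeline uses a much larger one.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in", "for", "on"}
# Same custom stopwords we passed to setup().
CUSTOM_STOPWORDS = {"rt", "https", "http", "co", "amp"}

def preprocess(text, custom_stopwords=CUSTOM_STOPWORDS):
    """Lowercase, strip digits and special characters, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numeric characters
    text = re.sub(r"[^a-z\s]", " ", text)  # remove special characters
    tokens = text.split()                  # simple whitespace tokenization
    return [t for t in tokens
            if t not in STOPWORDS and t not in custom_stopwords]

sample = "RT The economy is GREAT in 2020! https://t.co &amp; jobs"
print(preprocess(sample))
```

Notice that the stray "t" from the shortened URL survives the cleanup. Leftovers like this are exactly why it is worth tuning the custom stopword list for Twitter data.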

After all is said and done, we’ll get something similar to this:

In the next step, we’ll create the model and we’ll use ‘lda’:

lda = create_model('lda', num_topics = 6, multi_core = True)

Above, we created an ‘lda’ model with the number of topics set to 6. We also set multi_core = True so that LDA uses all available CPU cores to parallelize and speed up training.

Finally, we’ll assign topic proportions to each tweet in the dataset using assign_model().

lda_results = assign_model(lda)
lda_results.head()
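
Topic proportions are easier to reason about with a small example. The sketch below uses invented numbers and plain dictionaries rather than the real lda_results dataframe; the "Topic_N" keys and the dominant_topic helper are assumptions for illustration, not confirmed PyCaret names. The idea is simply that each document gets a share per topic, and the largest share marks its dominant topic.

```python
# Toy topic proportions for three documents; the values are invented.
# The "Topic_N" keys are assumed names, not confirmed PyCaret column names.
rows = [
    {"Topic_0": 0.70, "Topic_1": 0.20, "Topic_2": 0.10},
    {"Topic_0": 0.15, "Topic_1": 0.25, "Topic_2": 0.60},
    {"Topic_0": 0.05, "Topic_1": 0.90, "Topic_2": 0.05},
]

def dominant_topic(proportions):
    """Return (topic_name, proportion) for the topic with the largest share."""
    name = max(proportions, key=proportions.get)
    return name, proportions[name]

for i, row in enumerate(rows):
    name, share = dominant_topic(row)
    print(f"document {i}: {name} ({share:.0%})")
```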

Visualizing the Results

Let’s plot the overall word frequency distribution of the entire corpus:

plot_model()

Now let’s extract the bigrams and trigrams for the entire corpus:

plot_model(plot = 'bigram')
plot_model(plot = 'trigram')
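
If n-grams are new to you, the counting behind these plots can be sketched in a few lines. This is a simplified illustration with made-up tokens, not what PyCaret runs internally; the ngrams helper is hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) found in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy, already-tokenized documents used purely for illustration.
docs = [
    ["fake", "news", "media"],
    ["fake", "news", "media", "again"],
    ["great", "again"],
]

# Count every bigram and trigram across the whole toy corpus.
bigram_counts = Counter(bg for doc in docs for bg in ngrams(doc, 2))
trigram_counts = Counter(tg for doc in docs for tg in ngrams(doc, 3))

print(bigram_counts.most_common(2))
print(trigram_counts.most_common(1))
```

A document shorter than n tokens simply contributes no n-grams, which is why the two-word document adds nothing to the trigram counts.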

But what if we only want to extract the n-grams from a specific topic? Easy, we’ll just pass in the topic_num parameter.

plot_model(lda, plot = 'trigram', topic_num = 'Topic 1')

If we want the distribution of topics instead, we simply change the plot parameter to 'topic_distribution'.

plot_model(lda, plot = 'topic_distribution')
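
One way to think about a topic distribution is as a count of how many documents fall under each dominant topic. The sketch below works on that assumption with invented labels; it is an illustration of the idea, not a reproduction of PyCaret’s plot.

```python
from collections import Counter

# Invented dominant-topic labels, one per document, for illustration only.
dominant_topics = ["Topic 1", "Topic 0", "Topic 1", "Topic 2", "Topic 1", "Topic 0"]

distribution = Counter(dominant_topics)

# Print a crude text bar chart, most frequent topic first.
for topic, count in distribution.most_common():
    print(f"{topic}: {'#' * count} ({count})")
```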

And that’s it!

We’ve successfully conducted topic modeling on President Trump’s tweets since taking office.

Bonus Round

Moez Ali wrote a great tutorial on using PyCaret in Power BI. Check it out.


Thank you for reading! Exploratory data analysis draws on many techniques, and we’ve only explored a few in this post. I encourage you to keep practicing and to try other techniques for deriving insights from data.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] PyCaret. (June 4, 2020). Why PyCaret. https://pycaret.org/

This article was first published in the Towards Data Science publication on Medium.