For adventurous beginners in NLP.
For this project, we’ll be using PyCaret.
PyCaret does a lot more than NLP. It also covers a whole slew of supervised and unsupervised ML tasks, including classification, regression, clustering, anomaly detection, and association rule mining.
Let’s begin by installing PyCaret. Just run
pip install pycaret
and we are good to go! Note: PyCaret is a big library, so you may want to go grab a cup of coffee while waiting for it to install.
Also, we need to download the English language model and the TextBlob corpora because they are not automatically downloaded with PyCaret:
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
Getting the Data
Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.
import pandas as pd
from pycaret.nlp import *
df = pd.read_csv('trump_20200530.csv')
Let’s check the shape of our data first:
df.shape
And let’s take a quick look:
df.head()
For expediency, let’s sample only 1,000 tweets.
# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)
The fun part!
nlp = setup(data = df, target = 'text', session_id = 493,
            custom_stopwords = ['rt', 'https', 'http', 'co', 'amp'])
The setup() function performs the following text-processing steps:
- Removing Numeric Characters
- Removing Special Characters
- Word Tokenization
- Stopword Removal
- Bigram Extraction
- Trigram Extraction
- Custom Stopwords
And all in one line of code!
The setup() function takes in two parameters: the dataframe in data and the name of the text column that we want to pass in target. In our case, we also used the optional parameters session_id for reproducibility and custom_stopwords to reduce the noise coming from the tweets.
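To make a few of the steps listed above more concrete, here is a rough plain-Python sketch of character removal, tokenization, and stopword removal, followed by a naive bigram step. It is not PyCaret's implementation: the function names, the tiny base stopword list, and the sample tweet are all made up for illustration, and the bigram step simply pairs neighbouring tokens, which may differ from what PyCaret does internally.

```python
import re

# Tiny illustrative stopword list (an assumption); real pipelines use a much larger one.
BASE_STOPWORDS = {"the", "a", "an", "is", "are", "to", "and", "of", "in", "at"}
# The custom stopwords passed to setup() in this article.
CUSTOM_STOPWORDS = {"rt", "https", "http", "co", "amp"}


def clean_tweet(text, stopwords):
    """Lowercase, strip digits and special characters, tokenize, drop stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numeric characters
    text = re.sub(r"[^a-z\s]", " ", text)  # remove special characters
    tokens = text.split()                   # simple whitespace tokenization
    return [t for t in tokens if t not in stopwords]


def adjacent_pairs(tokens):
    """Naive bigrams: join every pair of neighbouring tokens."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]


# A made-up tweet, not taken from the dataset.
tweet = "RT @user: The economy is growing 100% &amp; jobs are back!"
tokens = clean_tweet(tweet, BASE_STOPWORDS | CUSTOM_STOPWORDS)
print(tokens)
print(adjacent_pairs(tokens))
```

Notice how the HTML entity &amp; would leave a stray "amp" token behind if it were not in the custom stopword list. That is the kind of tweet-specific noise custom_stopwords is meant to filter out.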
After all is said and done, we’ll get something similar to this:
In the next step, we’ll create the model with create_model():
lda = create_model('lda', num_topics = 6, multi_core = True)
Above, we created an ‘lda’ model, passed in the number of topics as 6, and set multi_core so that the LDA will use all available CPU cores to parallelize and speed up training.
Finally, we’ll assign topic proportions to the rest of the dataset using assign_model():
lda_results = assign_model(lda)
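If "topic proportions" sounds abstract, the idea is that each document gets a weight for every topic, and the topic with the largest weight can be treated as that document's dominant topic. The sketch below shows this in plain Python. The numbers and the "Topic 0" style labels are hypothetical, and the actual column layout of lda_results may differ.

```python
# Hypothetical topic proportions for three documents (each row sums to 1).
doc_topics = [
    {"Topic 0": 0.70, "Topic 1": 0.20, "Topic 2": 0.10},
    {"Topic 0": 0.15, "Topic 1": 0.25, "Topic 2": 0.60},
    {"Topic 0": 0.30, "Topic 1": 0.55, "Topic 2": 0.15},
]


def dominant_topic(proportions):
    """Return the (topic, weight) pair with the largest weight."""
    return max(proportions.items(), key=lambda item: item[1])


for i, proportions in enumerate(doc_topics):
    topic, weight = dominant_topic(proportions)
    print(f"doc {i}: {topic} ({weight:.0%})")
```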
Visualizing the Results
Let’s plot the overall frequency distribution of the entire corpus:
Now let’s extract the bigrams and trigrams for the entire corpus:
plot_model(plot = 'bigram')
plot_model(plot = 'trigram')
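Under the hood, a bigram or trigram plot boils down to counting how often each run of two or three consecutive tokens appears across the corpus. Here is a minimal sketch of that counting using only the standard library. The helper names and the toy token lists are assumptions for illustration, not part of PyCaret.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(docs, n, k=3):
    """Count n-grams across all tokenized documents and return the k most common."""
    counts = Counter()
    for tokens in docs:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)


# Toy, already-tokenized documents (not from the dataset).
docs = [
    ["new", "york", "city"],
    ["new", "york", "city", "council"],
    ["big", "city"],
]
print(top_ngrams(docs, 2))  # most frequent bigrams
print(top_ngrams(docs, 3))  # most frequent trigrams
```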
But what if we only want to extract the n-grams from a specific topic? Easy, we’ll just pass in the topic_num parameter:
plot_model(lda, plot = 'trigram', topic_num = 'Topic 1')
If we want the distribution of topics, we’ll simply change the plot parameter:
plot_model(lda, plot = 'topic_distribution')
And that’s it!
We’ve successfully conducted topic modeling on President Trump’s tweets since taking office.
Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.
 PyCaret. (June 4, 2020). Why PyCaret. https://pycaret.org/
This article was first published in Towards Data Science’s Medium publication.