Topic Modeling on PyCaret

I remember a brief conversation with my boss’ boss a while back. He said that he wouldn’t be impressed if somebody in the company built a face recognition tool from scratch because, and I quote, “Guess what? There’s an API for that.” He then goes on about the futility of doing something that’s already been done instead of just using it.

This gave me an insight into how an executive thinks. Not that they don’t care about the coolness factor of a project, but at the end of that day, they’re most concerned about how a project will add value to the business and even more importantly, how quickly it can be done.

In the real world, the time it takes to build prototype matters. And the quicker we get from data to insights, the better off we will be. These help us stay agile.

And this brings me to PyCaret.

PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.[1]

Pycaret is basically a wrapper for some of the most popular machine learning libraries and frameworks scikit-learn and spaCy. Here are the things that PyCaret can do:

  • Classification
  • Regression
  • Clustering
  • Anomaly Detection
  • Natural Language Processing
  • Associate Rule Mining

If you’re interested in reading about the difference between traditional NLP approach vs. PyCaret’s NLP module, check out Prateek Baghel’s article.

Natural Language Processing

In just a few lines of code, PyCaret makes natural language processing so easy that it’s almost criminal. Like most of its other modules, PyCaret’s NLP module streamlined pipeline cuts the time from data to insights in more than half the time.

For example, with only one line, it performs text processing automatically, with the ability to customize stop words. Add another line or two, and you got yourself a language model. With yet another line, it gives you a properly formatted plotly graph. And finally, adding another line gives you the option to evaluate the model. You can even tune the model with, guess what, one line of code!

Instead of just telling you all about the wonderful features of PyCaret, maybe it’s be better if we do a little show and tell instead.

The Pipeline

For this post, we’ll create an NLP pipeline that involves the following 6 glorious steps:

  1. Getting the Data
  2. Setting up the Environment
  3. Creating the Model
  4. Assigning the Model
  5. Plotting the Model
  6. Evaluating the Model

We will be going through an end-to-end demonstration of this pipeline with a brief explanation of the functions involved and their parameters.

Let’s get started.


Let us begin by installing PyCaret. If this is your first time installing it, just type the following into your terminal:

pip install pycaret

However, if you have a previously installed version of PyCaret, you can upgrade using the following command:

pip install —-upgrade pycaret

Beware: PyCaret is a big library so it’s going to take a few minutes to download and install.

We’ll also need to download the English language model because it is not included in the PyCaret installation.

python -m spacy download en_core_web_sm
python -m textblob.download_corpora

Next, let’s fire up a Jupyter notebook and import PyCaret’s NLP module:

#import nlp module
from pycaret.nlp import *

Importing the pycaret.nlp automatically sets up your environment to perform NLP tasks only.

Getting the Data

Before setup, we need to decide first how we’re going to ingest data. There are two methods of getting the data into the pipeline. One is by using a Pandas dataframe and another is by using a simple list of textual data.

Passing a DataFrame

#import pandas if we're gonna use a dataframe
import pandas as pd

# load the data into a dataframe
df = pd.read_csv('hilaryclinton.csv')

Above, we’re simply loading the data into a dataframe.

Passing a List

# read a file containing a list of text data and assign it to 'lines'
with open('list.txt') as f:
    lines =

Above, we’re opening the file 'list.txt' and reading it. We assign the resulting list into the lines.


From the rest of this experiment, we’ll just use a dataframe to pass textual data to thesetup() function of the NLP module. And for the sake of expediency, we’ll sample the dataframe to only select a thousand tweets.

# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)

Let’s take a quick look at our dataframe with df.head() and df.shape.

Setting Up the Environment

In the line below, we’ll initialize the setup by calling the setup() function and assign it to nlp.

# initialize the setup
nlp = setup(data = df, target = 'text', session_id = 493, custom_stopwords = [ 'rt', 'https', 'http', 'co', 'amp'])

With data and target, we’re telling PyCaret that we’d like to use the values in the 'text' column of df. Also, we’re setting the session_id to an arbitrary number of 493 so that we can reproduce the experiment over and over again and get the same result. Finally, we added custom_stopwords so that PyCaret will exclude the specified list of words in the analysis.

Note that if we want to use a list instead, we could replace df with lines and get rid of target = ‘text’ because a list has no columns for the PyCaret to target!

Here’s the output of nlp:

The output table above confirms our session id, number of documents (rows or records), and vocabulary size. It also shows up if we used custom stopwords or not.

Creating the Model

Below, we’ll create the model by calling the create_model() function and assign it to lda. The function already knows to use the dataset that we specified during setup(). In our case, the PyCaret knows we want to create a model based on the 'text' in df.

# create the model
lda = create_model('lda', num_topics = 6, multi_core = True)

In the line above, notice that w param used 'lda' as the parameter. LDA stands for Latent Dirichlet Allocation. We could’ve just as easily opted for other types of models.

Here’s the list of models that PyCaret currently supports:

  • ‘lda’: Latent Dirichlet Allocation
  • ‘lsi’: Latent Semantic Indexing
  • ‘hdp’: Hierarchical Dirichlet Process
  • ‘rp’: Random Projections
  • ‘nmf’: Non-Negative Matrix Factorization

I encourage you to research the difference between the models above, To start, check out Lettier’s awesome guide on LDA.

The next parameter we used is num_topics = 6. This tells PyCaret to use six topics in the results ranging from 0 to 5. If num_topic is not set, the default number is 4. Lastly, we set multi_core to tell PyCaret to use all available CPUs for parallel processing. This saves a lot of computational time.

Assigning the Model

By calling assign_model(), we’re going to label our data so that we’ll get a dataframe (based on our original dataframe: df) with additional columns that include the following information:

  • Topic percent value for each topic
  • The dominant topic
  • The percent value of the dominant topic
# label the data using trained model
df_lda = assign_model(lda)

Let’s take a look at df_lda.

Plotting the Model

Calling the plot_model() function will give us some visualization about frequency, distribution, polarity, et cetera. The plot_model() function takes three parameters: model, plot, and topic_num. The model instructs PyCaret what model to use and must be preceded by a create_model() function. topic_num designates which topic number (from 0 to 5) will the visualization be based on.

plot_model(lda, plot='topic_distribution')
plot_model(lda, plot='topic_model')
plot_model(lda, plot='wordcloud', topic_num = 'Topic 5')
plot_model(lda, plot='frequency', topic_num = 'Topic 5')
plot_model(lda, plot='bigram', topic_num = 'Topic 5')
plot_model(lda, plot='trigram', topic_num = 'Topic 5')
plot_model(lda, plot='distribution', topic_num = 'Topic 5')
plot_model(lda, plot='sentiment', topic_num = 'Topic 5')
plot_model(lda, plot='tsne')

PyCarets offers a variety of plots. The type of graph generated will depend on the plot parameter. Here is the list of currently available visualizations:

  • ‘frequency’: Word Token Frequency (default)
  • ‘distribution’: Word Distribution Plot
  • ‘bigram’: Bigram Frequency Plot
  • ‘trigram’: Trigram Frequency Plot
  • ‘sentiment’: Sentiment Polarity Plot
  • ‘pos’: Part of Speech Frequency
  • ‘tsne’: t-SNE (3d) Dimension Plot
  • ‘topic_model’ : Topic Model (pyLDAvis)
  • ‘topic_distribution’ : Topic Infer Distribution
  • ‘wordcloud’: Word cloud
  • ‘umap’: UMAP Dimensionality Plot

Evaluating the Model

Evaluating the models involves calling the evaluate_model() function. It takes only one parameter: the model to be used. In our case, the model is stored is lda that was created with the create_model() function in an earlier step.

The function returns a visual user interface for plotting.

And voilà, we’re done!


Using PyCaret’s NLP module, we were able to quickly from getting the data to evaluating the model in just a few lines of code. We covered the functions involved in each step and examined the parameters of those functions.

Thank you for reading! PyCaret’s NLP module has a lot more features and I encourage you to read their documentation to further familiarize yourself with the module and maybe even the whole library!

In the next post, I’ll continue to explore PyCaret’s functionalities.

If you want to learn more about my journey from slacker to data scientist, check out the article here.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] PyCaret. (June 4, 2020). Why PyCaret. Data