I remember a brief conversation with my boss’s boss a while back. He said that he wouldn’t be impressed if somebody in the company built a face recognition tool from scratch because, and I quote, “Guess what? There’s an API for that.” He then went on about the futility of doing something that’s already been done instead of just using it.
This gave me an insight into how an executive thinks. Not that they don’t care about the coolness factor of a project, but at the end of the day, they’re most concerned about how a project will add value to the business and, even more importantly, how quickly it can be done.
In the real world, the time it takes to build a prototype matters. And the quicker we get from data to insights, the better off we will be. This helps us stay agile.
And this brings me to PyCaret.
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.
PyCaret is essentially a wrapper around some of the most popular machine learning libraries and frameworks, such as scikit-learn and spaCy. Here are some of the things that PyCaret can do:
- Anomaly Detection
- Natural Language Processing
- Association Rule Mining
Natural Language Processing
In just a few lines of code, PyCaret makes natural language processing so easy that it’s almost criminal. Like most of its other modules, PyCaret’s NLP module offers a streamlined pipeline that cuts the time from data to insights by more than half.
For example, with only one line, it performs text preprocessing automatically, with the ability to customize stop words. Add another line or two, and you’ve got yourself a language model. With yet another line, it gives you a properly formatted plotly graph. And finally, adding one more line gives you the option to evaluate the model. You can even tune the model with, guess what, one line of code!
Instead of just telling you all about the wonderful features of PyCaret, maybe it’d be better if we do a little show-and-tell instead.
For this post, we’ll create an NLP pipeline that involves the following 6 glorious steps:
- Getting the Data
- Setting up the Environment
- Creating the Model
- Assigning the Model
- Plotting the Model
- Evaluating the Model
We will be going through an end-to-end demonstration of this pipeline with a brief explanation of the functions involved and their parameters.
Let’s get started.
Let us begin by installing PyCaret. If this is your first time installing it, just type the following into your terminal:
pip install pycaret
However, if you have a previously installed version of PyCaret, you can upgrade using the following command:
pip install --upgrade pycaret
Beware: PyCaret is a big library, so it’s going to take a few minutes to download and install.
We’ll also need to download the English language model because it is not included in the PyCaret installation.
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
Next, let’s fire up a Jupyter notebook and import PyCaret’s NLP module:
# import the NLP module
from pycaret.nlp import *
pycaret.nlp automatically sets up your environment to perform NLP tasks only.
Getting the Data
Before setup, we first need to decide how we’re going to ingest the data. There are two methods of getting data into the pipeline: one is by using a Pandas dataframe, and the other is by using a simple list of textual data.
Passing a DataFrame
# import pandas since we're using a dataframe
import pandas as pd

# load the data into a dataframe
df = pd.read_csv('hilaryclinton.csv')
Above, we’re simply loading the data into a dataframe.
Passing a List
# read a file containing a list of text data and assign it to 'lines'
with open('list.txt') as f:
    lines = f.read().splitlines()
Above, we’re opening the file 'list.txt' and reading it. We assign the resulting list to the variable lines.
For the rest of this experiment, we’ll just use a dataframe to pass textual data to the setup() function of the NLP module. And for the sake of expediency, we’ll sample the dataframe to select only a thousand tweets.
# sample the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)
Let’s take a quick look at our dataframe with df.head():
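# preview the first few rows of the sampled dataframe
df.head()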
Setting Up the Environment
In the lines below, we’ll initialize the setup by calling the setup() function and assign the result to nlp:
# initialize the setup
nlp = setup(data = df,
            target = 'text',
            session_id = 493,
            custom_stopwords = ['rt', 'https', 'http', 'co', 'amp'])
By passing target = 'text', we’re telling PyCaret that we’d like to use the values in the 'text' column of df. Also, we’re setting the session_id to an arbitrary number of 493 so that we can reproduce the experiment over and over again and get the same result. Finally, we added custom_stopwords so that PyCaret will exclude the specified list of words from the analysis.
Note that if we want to use a list instead, we could replace data = df with data = lines and get rid of target = 'text', because a list has no columns for PyCaret to target!
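If you’re curious, here’s a minimal sketch of what that list-based call could look like, assuming lines is the list we read earlier (for the actual run, we’ll keep the dataframe version above):

# hypothetical variant: pass the list directly; no target column needed
nlp = setup(data = lines,
            session_id = 493,
            custom_stopwords = ['rt', 'https', 'http', 'co', 'amp'])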
Here’s the output of setup():
The output table above confirms our session id, the number of documents (rows or records), and the vocabulary size. It also shows whether or not we used custom stopwords.
Creating the Model
Below, we’ll create the model by calling the create_model() function and assign the result to lda. The function already knows to use the dataset that we specified during setup(). In our case, PyCaret knows we want to create a model based on the 'text' column of df.
# create the model
lda = create_model('lda', num_topics = 6, multi_core = True)
In the line above, notice that we passed 'lda' as the first parameter. LDA stands for Latent Dirichlet Allocation. We could’ve just as easily opted for another type of model.
Here’s the list of models that PyCaret currently supports:
- ‘lda’: Latent Dirichlet Allocation
- ‘lsi’: Latent Semantic Indexing
- ‘hdp’: Hierarchical Dirichlet Process
- ‘rp’: Random Projections
- ‘nmf’: Non-Negative Matrix Factorization
The next parameter we used is num_topics = 6. This tells PyCaret to use six topics in the results, numbered 0 to 5. If num_topics is not set, the default number is 4. Lastly, we set multi_core = True to tell PyCaret to use all available CPUs for parallel processing. This saves a lot of computational time.
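Swapping in any of the other model IDs listed above works the same way. Here’s a minimal sketch, assuming we wanted Non-Negative Matrix Factorization instead:

# hypothetical alternative: an NMF model with the same number of topics
nmf = create_model('nmf', num_topics = 6)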
Assigning the Model
By calling assign_model(), we’re going to label our data so that we’ll get a dataframe (based on our original dataframe df) with additional columns that include the following information:
- Topic percent value for each topic
- The dominant topic
- The percent value of the dominant topic
# label the data using the trained model
df_lda = assign_model(lda)
Let’s take a look at the resulting dataframe with df_lda.head():
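# inspect the labeled dataframe with its new topic columns
df_lda.head()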
Plotting the Model
The plot_model() function will give us some visualizations of frequency, distribution, polarity, et cetera. It takes three parameters: model, plot, and topic_num. The model parameter tells PyCaret which model to use (it must already have been created with create_model()), plot specifies the type of visualization, and topic_num designates which topic number (from 0 to 5) the visualization will be based on.
plot_model(lda, plot='topic_distribution')
plot_model(lda, plot='topic_model')
plot_model(lda, plot='wordcloud', topic_num = 'Topic 5')
plot_model(lda, plot='frequency', topic_num = 'Topic 5')
plot_model(lda, plot='bigram', topic_num = 'Topic 5')
plot_model(lda, plot='trigram', topic_num = 'Topic 5')
plot_model(lda, plot='distribution', topic_num = 'Topic 5')
plot_model(lda, plot='sentiment', topic_num = 'Topic 5')
plot_model(lda, plot='tsne')
PyCaret offers a variety of plots. The type of graph generated will depend on the
plot parameter. Here is the list of currently available visualizations:
- ‘frequency’: Word Token Frequency (default)
- ‘distribution’: Word Distribution Plot
- ‘bigram’: Bigram Frequency Plot
- ‘trigram’: Trigram Frequency Plot
- ‘sentiment’: Sentiment Polarity Plot
- ‘pos’: Part of Speech Frequency
- ‘tsne’: t-SNE (3d) Dimension Plot
- ‘topic_model’ : Topic Model (pyLDAvis)
- ‘topic_distribution’ : Topic Infer Distribution
- ‘wordcloud’: Word cloud
- ‘umap’: UMAP Dimensionality Plot
Evaluating the Model
Evaluating the model involves calling the evaluate_model() function. It takes only one parameter: the model to be used. In our case, the model is stored in lda, which was created with the create_model() function in an earlier step.
The function returns a visual user interface for plotting.
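For reference, that’s just one more one-liner:

# launch the interactive evaluation interface for the trained model
evaluate_model(lda)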
And voilà, we’re done!
Using PyCaret’s NLP module, we were able to go quickly from getting the data to evaluating the model in just a few lines of code. We covered the functions involved in each step and examined the parameters of those functions.
Thank you for reading! PyCaret’s NLP module has a lot more features and I encourage you to read their documentation to further familiarize yourself with the module and maybe even the whole library!
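As one small taste of those extra features, PyCaret also provides save_model() and load_model() functions for persisting your work. A minimal sketch, where the filename 'lda_tweets' is just a placeholder:

# save the trained model to disk ('lda_tweets' is an arbitrary name)
save_model(lda, 'lda_tweets')

# later, reload it to pick up where we left off
lda = load_model('lda_tweets')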
In the next post, I’ll continue to explore PyCaret’s functionalities.
If you want to learn more about my journey from slacker to data scientist, check out the article here.
PyCaret. (June 4, 2020). Why PyCaret. https://pycaret.org/