Topic Modeling on PyCaret

I remember a brief conversation with my boss’ boss a while back. He said that he wouldn’t be impressed if somebody in the company built a face recognition tool from scratch because, and I quote, “Guess what? There’s an API for that.” He then went on about the futility of doing something that’s already been done instead of just using it.

This gave me an insight into how an executive thinks. Not that they don’t care about the coolness factor of a project, but at the end of the day, they’re most concerned about how a project will add value to the business and, even more importantly, how quickly it can be done.

In the real world, the time it takes to build a prototype matters. The quicker we get from data to insights, the better off we will be. Both help us stay agile.

And this brings me to PyCaret.


PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.[1]

PyCaret is basically a wrapper around some of the most popular machine learning libraries and frameworks, such as scikit-learn and spaCy. Here are the things that PyCaret can do:

  • Classification
  • Regression
  • Clustering
  • Anomaly Detection
  • Natural Language Processing
  • Association Rule Mining

If you’re interested in reading about the difference between traditional NLP approach vs. PyCaret’s NLP module, check out Prateek Baghel’s article.

Natural Language Processing

In just a few lines of code, PyCaret makes natural language processing so easy that it’s almost criminal. Like most of its other modules, PyCaret’s NLP module offers a streamlined pipeline that cuts the time from data to insights by more than half.

For example, with only one line, it performs text processing automatically, with the ability to customize stop words. Add another line or two, and you’ve got yourself a topic model. With yet another line, it gives you a properly formatted plotly graph. And finally, adding another line gives you the option to evaluate the model. You can even tune the model with, guess what, one line of code!

Instead of just telling you all about the wonderful features of PyCaret, maybe it’d be better if we do a little show and tell instead.


The Pipeline

For this post, we’ll create an NLP pipeline that involves the following 6 glorious steps:

  1. Getting the Data
  2. Setting up the Environment
  3. Creating the Model
  4. Assigning the Model
  5. Plotting the Model
  6. Evaluating the Model

We will be going through an end-to-end demonstration of this pipeline with a brief explanation of the functions involved and their parameters.

Let’s get started.


Housekeeping

Let us begin by installing PyCaret. If this is your first time installing it, just type the following into your terminal:

pip install pycaret

However, if you have a previously installed version of PyCaret, you can upgrade using the following command:

pip install --upgrade pycaret

Beware: PyCaret is a big library so it’s going to take a few minutes to download and install.

We’ll also need to download spaCy’s English language model and the TextBlob corpora because they are not included in the PyCaret installation.

python -m spacy download en_core_web_sm
python -m textblob.download_corpora

Next, let’s fire up a Jupyter notebook and import PyCaret’s NLP module:

#import nlp module
from pycaret.nlp import *

Importing pycaret.nlp automatically sets up your environment to perform NLP tasks only.

Getting the Data

Before setup, we need to decide first how we’re going to ingest data. There are two methods of getting the data into the pipeline. One is by using a Pandas dataframe and another is by using a simple list of textual data.

Passing a DataFrame

#import pandas if we're gonna use a dataframe
import pandas as pd

# load the data into a dataframe
df = pd.read_csv('hilaryclinton.csv')

Above, we’re simply loading the data into a dataframe.

Passing a List

# read a file containing a list of text data and assign it to 'lines'
with open('list.txt') as f:
    lines = f.read().splitlines()

Above, we’re opening the file 'list.txt' and reading it. We assign the resulting list to lines.

Sampling

For the rest of this experiment, we’ll just use a dataframe to pass textual data to the setup() function of the NLP module. And for the sake of expediency, we’ll sample the dataframe to only select a thousand tweets.

# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)

Let’s take a quick look at our dataframe with df.head() and df.shape.

Setting Up the Environment

In the line below, we’ll initialize the setup by calling the setup() function and assign it to nlp.

# initialize the setup
nlp = setup(data = df, target = 'text', session_id = 493, custom_stopwords = [ 'rt', 'https', 'http', 'co', 'amp'])

With data and target, we’re telling PyCaret that we’d like to use the values in the 'text' column of df. Also, we’re setting the session_id to an arbitrary number of 493 so that we can reproduce the experiment over and over again and get the same result. Finally, we added custom_stopwords so that PyCaret will exclude the specified list of words in the analysis.

Note that if we want to use a list instead, we could replace df with lines and get rid of target = 'text' because a list has no columns for PyCaret to target!
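
To make the effect of custom_stopwords concrete, here is a minimal pure-Python sketch of stop-word filtering. It is not PyCaret’s implementation (setup() does far more preprocessing than this); the helper name remove_stopwords and the sample text are made up for illustration.

```python
# Illustrative only: PyCaret handles stop words internally during setup().
CUSTOM_STOPWORDS = {'rt', 'https', 'http', 'co', 'amp'}

def remove_stopwords(text, stopwords=CUSTOM_STOPWORDS):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [token for token in text.lower().split() if token not in stopwords]

sample = 'RT thank you Iowa amp New Hampshire'
print(remove_stopwords(sample))  # ['thank', 'you', 'iowa', 'new', 'hampshire']
```

Words like 'rt' and 'amp' are leftovers of retweets and HTML escapes, so removing them keeps them from dominating the topics.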

Here’s the output of nlp:

The output table above confirms our session id, number of documents (rows or records), and vocabulary size. It also shows whether we used custom stopwords or not.

Creating the Model

Below, we’ll create the model by calling the create_model() function and assign it to lda. The function already knows to use the dataset that we specified during setup(). In our case, PyCaret knows we want to create a model based on the 'text' column in df.

# create the model
lda = create_model('lda', num_topics = 6, multi_core = True)

In the line above, notice that we passed 'lda' as the first argument. LDA stands for Latent Dirichlet Allocation. We could’ve just as easily opted for other types of models.

Here’s the list of models that PyCaret currently supports:

  • ‘lda’: Latent Dirichlet Allocation
  • ‘lsi’: Latent Semantic Indexing
  • ‘hdp’: Hierarchical Dirichlet Process
  • ‘rp’: Random Projections
  • ‘nmf’: Non-Negative Matrix Factorization

I encourage you to research the differences between the models above. To start, check out Lettier’s awesome guide on LDA.

The next parameter we used is num_topics = 6. This tells PyCaret to use six topics in the results, numbered 0 to 5. If num_topics is not set, the default number is 4. Lastly, we set multi_core = True to tell PyCaret to use all available CPUs for parallel processing. This saves a lot of computational time.

Assigning the Model

By calling assign_model(), we’re going to label our data so that we’ll get a dataframe (based on our original dataframe: df) with additional columns that include the following information:

  • Topic percent value for each topic
  • The dominant topic
  • The percent value of the dominant topic

# label the data using trained model
df_lda = assign_model(lda)

Let’s take a look at df_lda.
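
To illustrate what “dominant topic” means, here is a small sketch that picks the highest-weighted topic for a single document. The weights are made up and dominant_topic is a hypothetical helper; assign_model() computes the real values for every row.

```python
# Hypothetical topic weights for one document (they sum to 1.0).
topic_weights = {'Topic 0': 0.05, 'Topic 1': 0.10, 'Topic 2': 0.60,
                 'Topic 3': 0.05, 'Topic 4': 0.15, 'Topic 5': 0.05}

def dominant_topic(weights):
    """Return the topic with the largest weight, along with that weight."""
    topic = max(weights, key=weights.get)
    return topic, weights[topic]

print(dominant_topic(topic_weights))  # ('Topic 2', 0.6)
```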

Plotting the Model

Calling the plot_model() function will give us visualizations of frequency, distribution, polarity, et cetera. The plot_model() function takes three parameters: model, plot, and topic_num. model tells PyCaret which model to use and must have been created earlier with create_model(). plot selects the type of visualization. topic_num designates which topic (from 'Topic 0' to 'Topic 5' in our case) the visualization will be based on.

plot_model(lda, plot='topic_distribution')
plot_model(lda, plot='topic_model')
plot_model(lda, plot='wordcloud', topic_num = 'Topic 5')
plot_model(lda, plot='frequency', topic_num = 'Topic 5')
plot_model(lda, plot='bigram', topic_num = 'Topic 5')
plot_model(lda, plot='trigram', topic_num = 'Topic 5')
plot_model(lda, plot='distribution', topic_num = 'Topic 5')
plot_model(lda, plot='sentiment', topic_num = 'Topic 5')
plot_model(lda, plot='tsne')

PyCaret offers a variety of plots. The type of graph generated will depend on the plot parameter. Here is the list of currently available visualizations:

  • ‘frequency’: Word Token Frequency (default)
  • ‘distribution’: Word Distribution Plot
  • ‘bigram’: Bigram Frequency Plot
  • ‘trigram’: Trigram Frequency Plot
  • ‘sentiment’: Sentiment Polarity Plot
  • ‘pos’: Part of Speech Frequency
  • ‘tsne’: t-SNE (3d) Dimension Plot
  • ‘topic_model’ : Topic Model (pyLDAvis)
  • ‘topic_distribution’ : Topic Infer Distribution
  • ‘wordcloud’: Word cloud
  • ‘umap’: UMAP Dimensionality Plot

Evaluating the Model

Evaluating the model involves calling the evaluate_model() function. It takes only one parameter: the model to be used. In our case, the model is stored in lda, which was created with the create_model() function in an earlier step.

The function returns a visual user interface for plotting.

And voilà, we’re done!

Conclusion

Using PyCaret’s NLP module, we were able to quickly go from getting the data to evaluating the model in just a few lines of code. We covered the functions involved in each step and examined the parameters of those functions.


Thank you for reading! PyCaret’s NLP module has a lot more features and I encourage you to read their documentation to further familiarize yourself with the module and maybe even the whole library!

In the next post, I’ll continue to explore PyCaret’s functionalities.

If you want to learn more about my journey from slacker to data scientist, check out the article here.

Stay tuned!

You can reach me on Twitter or LinkedIn.


[1] PyCaret. (June 4, 2020). Why PyCaret. https://pycaret.org/

Into the Heart of Darkness - Pt. 2

Exploring the Trump Twitter Archive with spaCy.


In a previous post, we set out to explore the dataset provided by the Trump Twitter Archive. My initial goal was to do something fun by using a very interesting dataset. However, it didn’t quite turn out that way.

In this post, we’ll continue our journey but this time we’ll be using spaCy.


For this project, we’ll be using pandas for data manipulation, spaCy for natural language processing, and joblib to speed things up.

Let’s get started by firing up a Jupyter notebook!

Housekeeping

Let’s import pandas and also set the display options so Jupyter won’t truncate our columns and rows. Let’s also set a random seed for reproducibility.

# for manipulating data
import pandas as pd
# setting the random seed for reproducibility
import random
random.seed(493)
# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

Getting the Data

Let’s read the data into a dataframe. If you want to follow along, you can download the cleaned dataset here along with the file for stop words¹. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.

df = pd.read_csv('trump_20200530_clean.csv', parse_dates=True, index_col='datetime')

Let’s take a quick look at the data.

df.head()
df.info()

Using spaCy

Now let’s import spaCy and begin natural language processing.

# for natural language processing: named entity recognition
import spacy
import en_core_web_sm

We’re only going to use spaCy’s ner functionality or named-entity recognition so we’ll disable the rest of the functionalities. This will save us a lot of loading time later.

nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'textcat'])

Now let’s load the contents of the stopwords file into the variable stopwords. Note that we converted the list into a set to also save some processing time later.

with open('twitter-stopwords - TA - Less.txt') as f:
    contents = f.read().split(',')
stopwords = set(contents)

Next, we’ll import joblib and define a few functions to help with parallel processing.

from joblib import Parallel, delayed

def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    "Flatten a list of lists to a combined list"
    return [item for sublist in list_of_lists for item in sublist]

def process_chunk(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append([(ent.text) for ent in doc.ents if ent.label_ in ['NORP', 'PERSON', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT']])
    return preproc_pipe

def preprocess_parallel(texts, chunksize=100):
    executor = Parallel(n_jobs=7, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(texts, len(texts), chunksize=chunksize))
    result = executor(tasks)
    return flatten(result)

In the code above², the function preprocess_parallel executes the other function, process_chunk, in parallel to help with speed. The function process_chunk iterates through a series of texts (in our case, the 'tweet' column of the df dataframe) and keeps each entity whose label is NORP, PERSON, FAC, ORG, GPE, LOC, PRODUCT, or EVENT. The matching entities for each text are appended to preproc_pipe, which is then returned to the caller. Prashanth Rao has a very good article on making spaCy super fast.
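
To see what the two helper functions do on their own, here is a tiny sketch that runs chunker and flatten on a plain list, without spaCy or joblib. The sample texts are made up.

```python
def chunker(iterable, total_length, chunksize):
    # Yield consecutive slices of at most `chunksize` items.
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    # Combine the per-chunk results back into one flat list.
    return [item for sublist in list_of_lists for item in sublist]

texts = ['t1', 't2', 't3', 't4', 't5']
chunks = list(chunker(texts, len(texts), chunksize=2))
print(chunks)           # [['t1', 't2'], ['t3', 't4'], ['t5']]
print(flatten(chunks))  # ['t1', 't2', 't3', 't4', 't5']
```

Each chunk is what a single worker process receives, and flatten puts the results back in the original order.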

Let’s call the main driver for the functions now.

df['entities'] = preprocess_parallel(df['tweet'], chunksize=1000)

Doing a quick df.head() will reveal the new column 'entities' that we added earlier to hold the entities found in the 'tweet' column.

Prettifying the Results

In the code below, we’re making a list of lists called 'entities' and then flattening it for easier processing. We’re also converting it into a set called 'entities_set'.

entities = [entity for entity in df.entities if entity != []]
entities = [item for sublist in entities for item in sublist]
entities_set = set(entities)

Next, let’s count the frequency of each entity, keep the 20 most frequent ones, and store the result in a dataframe called df_counts.

df_counts = pd.Series(entities).value_counts()[:20].to_frame().reset_index()
df_counts.columns=['entity', 'count']
df_counts
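
For comparison, the same kind of frequency count can be sketched with the standard library’s Counter. The entity list below is made up and only stands in for the real entities list.

```python
from collections import Counter

# Made-up flattened entity list, standing in for `entities` above.
sample_entities = ['China', 'FBI', 'China', 'Russia', 'FBI', 'China']
top_two = Counter(sample_entities).most_common(2)
print(top_two)  # [('China', 3), ('FBI', 2)]
```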

For this step, we’re going to initialize an empty list entity_counts and manually construct a list of tuples, each holding a combined set of entities and the sum of their frequencies or counts.

entity_counts = []

entity_counts.append(('Democrats', df_counts.loc[df_counts.entity.isin(['Democrats', 'Dems', 'Democrat'])]['count'].sum()))
entity_counts.append(('Americans', df_counts.loc[df_counts.entity.isin(['American', 'Americans'])]['count'].sum()))
entity_counts.append(('Congress', df_counts.loc[df_counts.entity.isin(['House', 'Senate', 'Congress'])]['count'].sum()))
entity_counts.append(('America', df_counts.loc[df_counts.entity.isin(['U.S.', 'the United States', 'America'])]['count'].sum()))
entity_counts.append(('Republicans', df_counts.loc[df_counts.entity.isin(['Republican', 'Republicans'])]['count'].sum()))

entity_counts.append(('China', 533))
entity_counts.append(('FBI', 316))
entity_counts.append(('Russia', 313))
entity_counts.append(('Fake News', 248))
entity_counts.append(('Mexico', 213))
entity_counts.append(('Obama', 176))
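
The alias merging above can also be expressed as a small reusable function. This is only a sketch with made-up counts; merge_aliases and ALIASES are hypothetical names, not part of the original notebook.

```python
from collections import Counter

# Made-up counts; the alias groups mirror two of the groups used above.
raw_counts = Counter({'Democrats': 5, 'Dems': 3, 'Democrat': 1, 'House': 2, 'Senate': 4})
ALIASES = {
    'Democrats': ['Democrats', 'Dems', 'Democrat'],
    'Congress': ['House', 'Senate', 'Congress'],
}

def merge_aliases(counts, aliases):
    """Sum the counts of every alias under one canonical name."""
    return [(name, sum(counts[alias] for alias in group)) for name, group in aliases.items()]

print(merge_aliases(raw_counts, ALIASES))  # [('Democrats', 9), ('Congress', 6)]
```

A Counter returns 0 for a missing key, so an alias that never appears (like 'Congress' here) simply adds nothing.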

Let’s take a quick look before continuing.

Finally, let’s convert the list of tuples into a dataframe.

df_ner = pd.DataFrame(entity_counts, columns=["entity", "count"]).sort_values('count', ascending=False).reset_index(drop=True)

And that’s it!

We’ve successfully created a ranking of the named entities that President Trump most frequently talked about in his tweets since taking office.


Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.

In the next post, we shall continue our journey into the heart of darkness and do some topic-modeling using LDA.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] GONG Wei’s Homepage. (May 30, 2020). Stop words for tweets. https://sites.google.com/site/iamgongwei/home/sw

[2] Towards Data Science. (May 30, 2020). Turbo-charge your spaCy NLP pipeline. https://towardsdatascience.com/turbo-charge-your-spacy-nlp-pipeline-551435b664ad

Into the Heart of Darkness - Pt. 1

Exploring the Trump Twitter Archive with Python. For beginners.


In this post, we’ll explore the dataset provided by the Trump Twitter Archive. My goal was to do something fun by using a very interesting dataset. However, as it turned out, exposure to Trump’s narcissism and shenanigans was quite depressing, if not traumatic.

You’ve been warned!


For this project, we’ll be using pandas and numpy for data manipulation, matplotlib for visualizations, datetime for working with timestamps, unicodedata and regex for processing strings, and finally, nltk for natural language processing.

Let’s get started by firing up a Jupyter notebook!

Environment

We’re going to import pandas, numpy, and matplotlib, and also set the display options for Jupyter so that the rows and columns are not truncated.

# for manipulating data
import pandas as pd
import numpy as np
# for visualizations
%matplotlib inline
import matplotlib.pyplot as plt
# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

Getting the Data

Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.

df = pd.read_csv('trump_20200530.csv')

Let’s look at the first five rows and see the number of records (rows) and fields (columns).

df.head()
df.shape

Let’s do a quick renaming of the columns to make it easier for us later.

df.columns=['source', 'tweet', 'date_time', 'retweets', 'favorites', 'is_retweet', 'id']

Let’s drop the id column since it’s not really relevant right now.

df = df.drop(columns=['id'])

Let’s do a quick sanity check. This time, let’s also check the dtypes of the columns.

df.head()
df.info()

Working with Timestamps

We can see from the previous screenshot that the ‘date_time’ column is a string. Let’s parse it to a timestamp.

# for working with timestamps
from datetime import datetime
from dateutil.parser import parse
dt = []
for ts in df.date_time:
    dt.append(parse(ts))
dt[:5]

Let’s add a column called ‘datetime’ that contains the timestamp information.

df['datetime'] = df.apply(lambda row: parse(row.date_time), axis=1)
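
If the timestamp format is known in advance, an explicit format string is a stricter alternative to dateutil’s parse(), which infers the format. Here is a minimal sketch, assuming (hypothetically) that the timestamps look like '05-30-2020 03:12:34'; check the actual file before relying on this format.

```python
from datetime import datetime

# Assumed format: MM-DD-YYYY HH:MM:SS. Adjust it to match the actual file.
ASSUMED_FORMAT = '%m-%d-%Y %H:%M:%S'

raw = ['05-30-2020 03:12:34', '01-20-2017 17:00:00']
parsed = [datetime.strptime(ts, ASSUMED_FORMAT) for ts in raw]
print(parsed[0].hour, parsed[1].year)  # 3 2017
```

With an explicit format, a malformed timestamp raises a ValueError instead of being silently misread.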

Let’s double-check the date range of our dataset.

df.datetime.min()
df.datetime.max()

Trimming the Data

Let’s see how many sources there are for the tweets.

df.source.value_counts()

Let’s only keep the ones that were made using the ‘Twitter for iPhone’ app.

df = df.loc[df.source == 'Twitter for iPhone']

We should drop the old ‘date_time’ column and the ‘source’ column as well.

df = df.drop(columns=['date_time', 'source'])

Separating the Retweets

Let’s see how many are retweets.

df.is_retweet.value_counts()

Let’s make another dataframe that contains only retweets and drop the ‘is_retweet’ column.

df_retweets = df.loc[df.is_retweet == True]
df_retweets = df_retweets.drop(columns=['is_retweet'])

Sanity check:

df_retweets.head()
df_retweets.shape

Back on the original dataframe, let’s remove the retweets from the dataset and drop the ‘is_retweet’ column altogether.

df = df.loc[df.is_retweet == False]
df = df.drop(columns=['is_retweet'])

Another sanity check:

df.head()
df.shape

Exploring the Data

Let’s explore both of the dataframes and answer a few questions.

What time does the President tweet the most? What time does he tweet the least?

The graph below shows that the President most frequently tweets around 12pm. He also tweets the least around 8am. Maybe he’s not a morning person?

title = 'Number of Tweets by Hour'
df.tweet.groupby(df.datetime.dt.hour).count().plot(figsize=(12,8), fontsize=14, kind='bar', rot=0, title=title)
plt.xlabel('Hour')
plt.ylabel('Number of Tweets')

What day does the President tweet the most? What day does he tweet the least?

The graph below shows that the President most frequently tweets on Wednesday. He also tweets the least on Thursday.

title = 'Number of Tweets by Day of the Week'
df.tweet.groupby(df.datetime.dt.dayofweek).count().plot(figsize=(12,8), fontsize=14, kind='bar', rot=0, title=title)
plt.xlabel('Day of the Week')
plt.ylabel('Number of Tweets')
plt.xticks(np.arange(7),['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
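
The two groupby calls above boil down to counting timestamps by hour and by day of the week. Here is a standard-library sketch of the same idea with made-up timestamps; like pandas’ dayofweek, weekday() numbers Monday as 0 and Sunday as 6.

```python
from collections import Counter
from datetime import datetime

# Made-up timestamps standing in for df.datetime.
timestamps = [datetime(2020, 5, 30, 12, 5), datetime(2020, 5, 30, 12, 40),
              datetime(2020, 5, 29, 8, 15)]

tweets_per_hour = Counter(ts.hour for ts in timestamps)
tweets_per_day = Counter(ts.weekday() for ts in timestamps)

print(sorted(tweets_per_hour.items()))  # [(8, 1), (12, 2)]
print(sorted(tweets_per_day.items()))   # [(4, 1), (5, 2)]
```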

Isolating Twitter Handles from the Retweets

Let’s import Python’s re module so we can use regular expressions to parse the text and isolate the Twitter handles of the original tweets. In the code below, we add another column that contains the Twitter handle.

import re
pattern = re.compile('(?<=RT @).*?(?=:)')
df_retweets['original'] = [re.search(pattern, tweet).group(0) for tweet in df_retweets.tweet]
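
The pattern uses a lookbehind and a lookahead to grab whatever sits between 'RT @' and the first colon. The sketch below shows it on made-up tweets and adds a guard for text that does not match; original_handle is a hypothetical helper, not part of the original notebook.

```python
import re

# Same idea as above: the text after 'RT @' up to the first colon.
pattern = re.compile(r'(?<=RT @).*?(?=:)')

def original_handle(tweet):
    """Return the retweeted handle, or None if there is no 'RT @...:' prefix."""
    match = pattern.search(tweet)
    return match.group(0) if match else None

print(original_handle('RT @charliekirk11: some text'))  # charliekirk11
print(original_handle('A tweet without the prefix'))    # None
```

Without the guard, calling .group(0) on a failed search raises an AttributeError, which matters if any retweet lacks the usual prefix.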

Let’s create another dataframe that will contain only the original Twitter handles and their associated number of retweets.

df_originals = df_retweets.groupby(['original']).sum(numeric_only=True).sort_values('retweets').reset_index().sort_values('retweets', ascending=False)

Let’s check the data real quick:

df_originals.head()
df_originals.shape

Let’s visualize the results real quick so we can get an idea if the data is disproportionate or not.

df_originals = df_retweets.groupby(['original']).sum(numeric_only=True).sort_values('retweets').reset_index().sort_values('retweets', ascending=False)[:10].sort_values('retweets')
df_originals.plot.barh(x='original', y='retweets', figsize=(16,10), fontsize=16)
plt.ylabel("Originating Tweet's Username")
plt.xticks([])

Which Twitter user does the President like to retweet the most?

The graph below shows that the President likes to retweet the tweets from ‘@realDonaldTrump’. Does this mean the president likes to retweet himself? You don’t say!

The interesting handle on this one is ‘@charliekirk11’. Charlie Kirk is the founder of Turning Point USA. CBS News described the organization as a far-right organization that is “shunned or at least ignored by more established conservative groups in Washington, but embraced by many Trump supporters”.¹

The Top 5 Retweets

Let’s look at the top 5 tweets that were retweeted the most by others based on the original Twitter handle.

Let’s start with the ones with ‘@realDonaldTrump’.

df_retweets.loc[df_retweets.original == 'realDonaldTrump'].sort_values('retweets', ascending=False)[:5]

And another one with ‘@charliekirk11’.

df_retweets.loc[df_retweets.original == 'charliekirk11'].sort_values('retweets', ascending=False)[:5]

Examining Retweets’ Favorites count

Let’s find out how many of the retweets are favorited by others.

df_retweets.favorites.value_counts()

Surprisingly, none of the retweets seemed to have been favorited by anybody. Weird.

Since the ‘favorites’ column carries no information for the retweets, we should drop it.

Counting N-Grams

To do some n-gram ranking, we need to import unicodedata and nltk. We also need to specify additional stopwords that we may need to exclude from our analysis.

# for cleaning and natural language processing
import unicodedata
import nltk
# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['rt']

Here are a few of my favorite functions for natural language processing:

def clean(text):
  """
  A simple function to clean up the data. All the words that
  are not designated as stop words are lemmatized after
  encoding and basic regex parsing are performed.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]

def get_words(df, column):
    """
    Takes in a dataframe and a column and returns a list of
    words from the values in the specified column.
    """
    return clean(''.join(str(df[column].tolist())))

def get_bigrams(df, column):
    """
    Takes in a dataframe and a column and returns a series of
    the 10 most frequent bigrams with value counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column), 2)).value_counts())[:10]

def get_trigrams(df, column):
    """
    Takes in a dataframe and a column and returns a series of
    the 10 most frequent trigrams with value counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column), 3)).value_counts())[:10]

def viz_bigrams(df, column):
    get_bigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

    plt.title('10 Most Frequently Occurring Bigrams')
    plt.ylabel('Bigram')
    plt.xlabel('# Occurrences')

def viz_trigrams(df, column):
    get_trigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

    plt.title('10 Most Frequently Occurring Trigrams')
    plt.ylabel('Trigram')
    plt.xlabel('# Occurrences')
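
If n-grams are new to you, here is a standard-library sketch of what nltk.ngrams produces for a list of words. The ngrams helper below is a stand-in written for illustration, and the word list is made up.

```python
from collections import Counter

def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    return list(zip(*(words[i:] for i in range(n))))

words = ['fake', 'news', 'media', 'fake', 'news']
print(ngrams(words, 2))
# [('fake', 'news'), ('news', 'media'), ('media', 'fake'), ('fake', 'news')]
print(Counter(ngrams(words, 2)).most_common(1))  # [(('fake', 'news'), 2)]
```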
 

Let’s look at the top 10 bigrams in the df dataframe using the ‘tweet’ column.

get_bigrams(df, 'tweet')

And now, for the top 10 trigrams:

get_trigrams(df, 'tweet')

Let’s use the viz_bigrams() function and visualize the bigrams.

viz_bigrams(df, 'tweet')

Similarly, let’s use the viz_trigrams() function and visualize the trigrams.

viz_trigrams(df, 'tweet')

And there we have it!

From the moment that Trump took office, we can confidently say that the “fake news media” has been on top of the president’s mind.

Conclusion

Using basic Python and the nltk library, we’ve explored the dataset from the Trump Twitter Archive and ranked its most frequent n-grams.


Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.

In the next post, we shall continue our journey into the heart of darkness and use spaCy to extract named-entities from the same dataset.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] CBS News. “Trump speaks to conservative group Turning Point USA”. www.cbsnews.com. Archived from the original on July 31, 2019. Retrieved August 5, 2019.

Populating a Network Graph with Named-Entities

An early attempt at using networkx to visualize the results of natural language processing.


I do a lot of natural language processing and usually, the results are pretty boring to the eye. When I learned about network graphs, it got me thinking, why not use keywords as nodes and connect them together to create a network graph?

Yupp, why not!

In this post, we’ll do exactly that. We’re going to extract named-entities from news articles about coronavirus and then use their relationships to connect them together in a network graph.


A Brief Introduction

Network graphs are cool visuals that contain nodes (vertices) and edges (lines). They’re often used in social network analysis and network analysis, but data scientists also use them for natural language processing.

Photo by Anders Sandberg on Flickr

Natural Language Processing or NLP is a branch of artificial intelligence that deals with programming computers to process and analyze large volumes of text and derive meaning out of them.¹ In other words, it’s all about teaching computers how to understand human language… like a boss!

Photo by brewbooks on Flickr

Enough introduction, let’s get to coding!


To get started, let’s make sure to take care of all dependencies. Open up a terminal and execute the following commands:

pip install -U spacy
python -m spacy download en_core_web_sm
pip install networkx
pip install fuzzywuzzy

This will install spaCy and download the trained model for English. The third command installs networkx. This should work for most systems. If it doesn’t work for you, check out the documentation for spaCy and networkx. Also, we’re using fuzzywuzzy for some text preprocessing.

With that out of the way, let’s fire up a Jupyter notebook and get started!


Imports

Run the following code block in a cell to get all the necessary imports into our Python environment.

import pandas as pd
import numpy as np
import pickle
from operator import itemgetter
from fuzzywuzzy import process, fuzz

# for natural language processing
import spacy
import en_core_web_sm

# for visualizations
%matplotlib inline
from matplotlib.pyplot import figure

import networkx as nx

Getting the Data

If you want to follow along, you can download the sample dataset here. The file was created using newspaper to import news articles from npr.org. If you’re feeling adventurous, use the code snippet below to build your own dataset.

import requests
import json
import time
import newspaper
import pickle

npr = newspaper.build('https://www.npr.org/sections/coronavirus-live-updates')

corpus = []
count = 0
for article in npr.articles:
    time.sleep(1)
    article.download()
    article.parse()
    text = article.text
    corpus.append(text)
    if count % 10 == 0 and count != 0:
        print('Obtained {} articles'.format(count))
    count += 1

corpus300 = corpus[:300]

with open("npr_coronavirus.txt", "wb") as fp:   # Pickling
    pickle.dump(corpus300, fp)

# with open("npr_coronavirus.txt", "rb") as fp:   # Unpickling
#     corpus = pickle.load(fp)

Let’s get our data.

with open('npr_coronavirus.txt', 'rb') as fp:   # Unpickling
    corpus = pickle.load(fp)

Extract Entities

Next, we’ll start by loading spaCy’s English model:

nlp = en_core_web_sm.load()

Then, we’ll extract the entities:

entities = []
for article in corpus[:50]:
    tokens = nlp(''.join(article))
    gpe_list = []
    for ent in tokens.ents:
        if ent.label_ == 'GPE':
            gpe_list.append(ent.text)
    entities.append(gpe_list)

In the above code block, we created an empty list called entities to store a list of lists that contains the extracted entities from each of the articles. In the for-loop, we looped through the first 50 articles of the corpus. For each iteration, we converted the article into tokens (words) and then looped through its entities to keep the ones labeled as GPE (countries, states, and cities). We used ent.text to extract the actual entity text and appended each one to gpe_list, which is in turn appended to entities.

Here’s the result:

Note that North Carolina has several variations of its name and some have “the” prefixed in their names. Let’s get rid of them.

articles = []
for entity_list in entities:
    cleaned_entity_list = []
    for entity in entity_list:
        # remove a leading "the " only when it is really a prefix
        if entity.startswith('the '):
            entity = entity[len('the '):]
        cleaned_entity_list.append(entity.replace("'s", "").replace("’s", ""))
    articles.append(cleaned_entity_list)

In the code block above, we’re simply traversing the list of lists entities and cleaning the entities one by one, storing the cleaned lists in articles. With each iteration, we’re stripping the prefix “the” and getting rid of 's.
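
A caveat worth knowing when stripping a prefix such as “the ”: str.lstrip() treats its argument as a set of characters to remove, not as a literal prefix. The sketch below shows the difference on made-up names; strip_the is a hypothetical helper.

```python
# str.lstrip removes any leading characters found in the given set.
print('the Netherlands'.lstrip('the '))  # Netherlands
print('ethiopia'.lstrip('the '))         # iopia

def strip_the(name):
    """Remove a leading 'the ' only when it is really there."""
    return name[len('the '):] if name.startswith('the ') else name

print(strip_the('the Netherlands'))  # Netherlands
print(strip_the('ethiopia'))         # ethiopia
```

Capitalized place names usually survive lstrip('the ') because an uppercase first letter is not in the set, but a name starting with a lowercase t, h, or e would be mangled.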

Optional: FuzzyWuzzy

Looking at the entities, I’ve noticed that there are also variations in how the “United States” is represented. Some articles use “United States of America” while others use just “United States”. We can trim these down into a more standard naming convention.

FuzzyWuzzy can help with this.

Described by pypi.org as “string matching like a boss,” FuzzyWuzzy uses Levenshtein distance to calculate the similarities between words.¹ For a really good tutorial on how to use FuzzyWuzzy, check out Thanh Huynh’s article, “FuzzyWuzzy: Find Similar Strings within one column in Python,” on towardsdatascience.com.
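
For intuition, Levenshtein distance counts the minimum number of single-character edits needed to turn one string into another. Below is a plain textbook dynamic-programming sketch; it is not FuzzyWuzzy’s own code, which turns this kind of distance into a similarity score.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein('kitten', 'sitting'))                          # 3
print(levenshtein('United States', 'United States of America'))  # 11
```

The second example needs 11 insertions (“ of America”), which is why two spellings of the same country can still look far apart by raw edit distance.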

Here’s the optional code for using FuzzyWuzzy:

choices = set([item for sublist in articles for item in sublist])

cleaned_articles = []
for article in articles:
    article_entities = []
    for entity in set(article):
        article_entities.append(process.extractOne(entity, choices)[0])
    cleaned_articles.append(article_entities)

For the final step before creating the network graph, let’s get rid of the empty lists within our list of lists that were generated by articles that didn’t have any GPE entity types.

articles = [article for article in articles if article != []]

Create the Network Graph

For the next step, we’ll create the world in which the graph will exist.

G = nx.Graph()

Then, we’ll manually add the nodes with G.add_nodes_from().

for entities in articles:
    G.add_nodes_from(entities)

Let’s see what the graph looks like with:

figure(figsize=(10, 8))
nx.draw(G, node_size=15)

Next, let’s add the edges that will connect the nodes.

for entities in articles:
    if len(entities) > 1:
        for i in range(len(entities)-1):
            G.add_edges_from([(str(entities[i]),str(entities[i+1]))])

In each iteration of the code above, a conditional makes sure we only process lists that contain two or more entities. Then, we connect each entity to the one that follows it in the list with G.add_edges_from().
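To see the pairing logic in isolation, here’s a small plain-Python sketch without networkx. consecutive_pairs() mirrors the approach described above, where each entity is linked to the next one in the list. all_pairs() is a hypothetical alternative that links every entity in an article with every other one. The function names and the sample list are made up for illustration.

```python
from itertools import combinations


def consecutive_pairs(entities):
    """Pair each entity with the one that follows it in the list."""
    return list(zip(entities, entities[1:]))


def all_pairs(entities):
    """Hypothetical alternative: pair every distinct entity with every other."""
    unique = list(dict.fromkeys(entities))  # de-duplicate, keep first-seen order
    return list(combinations(unique, 2))


# made-up entities standing in for one article's GPE list
sample = ["China", "Wuhan", "U.S.", "Italy"]

print(consecutive_pairs(sample))
print(all_pairs(sample))
```

Which one makes more sense depends on what a connection should mean: “mentioned next to each other” or simply “mentioned in the same article.”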

Let’s see what the graph looks like now:

figure(figsize=(10, 8))
nx.draw(G, node_size=10)

This graph reminds me of spiders! LOL.

To organize it a bit, I decided to use the shell version of the network graph:

figure(figsize=(10, 8))
nx.draw_shell(G, node_size=15)

We can tell that some nodes are heavier on connections than others. To see which nodes have the most connections, let’s use G.degree().

G.degree()

This gives the following degree view:

Let’s find out which node or entity has the most connections.

max(dict(G.degree()).items(), key = lambda x : x[1])

To find out which other nodes have the most connections, let’s check the top 5:

degree_dict = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degree_dict, 'degree')
sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)

Above, sorted_degree is a list that contains all the nodes and their degree values, sorted from highest to lowest. We only want the top 5, like so:

print("Top 5 nodes by degree:")
for d in sorted_degree[:5]:
    print(d)
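For intuition about what the degree numbers are counting, here’s a tiny plain-Python sketch that tallies neighbors from an edge list. It’s a stand-in for illustration only (it ignores self-loops and uses made-up edges), not a replacement for networkx.

```python
from collections import defaultdict
from operator import itemgetter

# made-up undirected edges standing in for the graph's connections
edges = [("China", "Wuhan"), ("Wuhan", "U.S."), ("U.S.", "China"), ("U.S.", "Italy")]

# collect each node's distinct neighbors
neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# a node's degree is the number of distinct nodes it connects to
degree = {node: len(adjacent) for node, adjacent in neighbors.items()}

# same sorting idea as above: highest degree first, then slice the top entries
top = sorted(degree.items(), key=itemgetter(1), reverse=True)[:2]
print(top)
```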

Bonus Round: Gephi

Gephi is an open-source and free desktop application that lets us visualize, explore, and analyze all kinds of graphs and networks.²

Let’s export our graph data into a file so we can import it into Gephi.

nx.write_gexf(G, "npr_coronavirus_GPE_50.gexf")

Cool beans!

Next Steps

This time, we only processed 50 articles from npr.org. What would happen if we processed all 300 articles from our dataset? What will we see if we change the entity type from GPE to PERSON? How else can we use network graphs to visualize natural language processing results?

There’s always more to do. The possibilities are endless!


I hope you enjoyed today’s post. The code is not perfect and we have a long way to go towards realizing insights from the data. I encourage you to dive deeper and learn more about spaCy, networkx, fuzzywuzzy, and even Gephi.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Wikipedia. (May 25, 2020). Natural language processing https://en.wikipedia.org/wiki/Natural_language_processing

[2]: Gephi. (May 25, 2020). The Open Graph Viz Platform https://gephi.org/

This article was first published in Towards Data Science’s Medium publication.

From DataFrame to Named-Entities

A quick-start guide to extracting named-entities from a Pandas dataframe using spaCy.


A long time ago in a galaxy far away, I was analyzing comments left by customers and I noticed that they seemed to mention specific companies much more than others. This gave me an idea. Maybe there is a way to extract the names of companies from the comments and I could quantify them and conduct further analysis.

There is! Enter: named-entity recognition.

Named-Entity Recognition

According to Wikipedia, named-entity recognition or NER “is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.”¹ In other words, NER attempts to extract words that are categorized as proper names and even numerical entities.

In this post, I’ll share the code that will let us extract named-entities from a Pandas dataframe using spaCy, an open-source library that provides industrial-strength natural language processing in Python and is designed for production use.²

To get started, let’s install spaCy with the following pip command:

pip install -U spacy

After that, let’s download the pre-trained model for English:

python -m spacy download en_core_web_sm

With that out of the way, let’s open up a Jupyter notebook and get started!

Imports

Run the following code block in a cell to get all the necessary imports into our Python environment.

# for manipulating dataframes
import pandas as pd

# for natural language processing: named entity recognition
import spacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

# for visualizations
%matplotlib inline

The important line in this block is nlp = en_core_web_sm.load() because this is what we’ll be using later to extract the entities from the text.

Getting the Data

First, let’s get our data and load it into a dataframe. If you want to follow along, download the sample dataset here or create your own from the Trump Twitter Archive.

df = pd.read_csv('ever_trump.csv')

Running df.head() in a cell will get us acquainted with the data set quickly.

Getting the Tokens

Second, let’s create tokens that will serve as input for spaCy. In the line below, we create a variable tokens that contains all the words in the 'text' column of the df dataframe.

tokens = nlp(''.join(str(df.text.tolist())))
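One detail worth knowing about the line above: str() on a list returns the list’s printed form, brackets and quotes included, and ''.join() over a single string hands that same string back unchanged. The sketch below uses a made-up two-item list in place of df.text.tolist() to compare that with joining the items with a space. Whether the difference matters for your results is something to check on your own data.

```python
# made-up stand-in for df.text.tolist()
texts = ["Hello world", "It's great"]

# what the pattern above produces: the list's printed representation
as_repr = ''.join(str(texts))

# an alternative: join the individual items with a space
as_text = ' '.join(texts)

print(as_repr)
print(as_text)
```

With a dataframe, the second form would look something like ' '.join(df.text.astype(str)).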

Third, we’re going to extract entities. We can just extract the most common entities for now:

items = [x.text for x in tokens.ents]
Counter(items).most_common(20)
Screenshot by Author

Extracting Named-Entities

Next, we’ll extract the entities based on their categories. We have several to choose from, ranging from people to events and even organizations. For a complete list of all that spaCy has to offer, check out their documentation on named-entities.

Screenshot by Author

To start, we’ll extract people (real and fictional) using the PERSON type.

person_list = []
for ent in tokens.ents:
    if ent.label_ == 'PERSON':
        person_list.append(ent.text)

person_counts = Counter(person_list).most_common(20)
df_person = pd.DataFrame(person_counts, columns=['text', 'count'])

In the code above, we started by making an empty list with person_list = [].

Then, we utilized a for-loop to loop through the entities found in tokens with tokens.ents. After that, we made a conditional that will append to the previously created list if the entity label is equal to PERSON type.

We’ll want to know how many times a certain entity of PERSON type appears in the tokens, so we counted them with person_counts = Counter(person_list).most_common(20). This line will give us the top 20 most common entities for this type.

Finally, we created the df_person dataframe to store the results and this is what we get:

Screenshot by Author

We’ll repeat the same pattern for the NORP type which recognizes nationalities, religious and political groups.

norp_list = []
for ent in tokens.ents:
    if ent.label_ == 'NORP':
        norp_list.append(ent.text)

norp_counts = Counter(norp_list).most_common(20)
df_norp = pd.DataFrame(norp_counts, columns=['text', 'count'])

And this is what we get:

Screenshot by Author
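Since the PERSON and NORP blocks differ only in the label, the pattern could be wrapped in a small helper. This is just a sketch: count_entities() is a hypothetical name, and the Ent objects below are made-up stand-ins that mimic the two attributes we’ve been using (.text and .label_) so the example runs without spaCy.

```python
from collections import Counter, namedtuple


def count_entities(ents, label, top_n=20):
    """Return (text, count) pairs for the most common entities of one label."""
    matches = (ent.text for ent in ents if ent.label_ == label)
    return Counter(matches).most_common(top_n)


# made-up stand-ins for spaCy entities (only .text and .label_ are mimicked)
Ent = namedtuple("Ent", ["text", "label_"])
sample_ents = [
    Ent("Ada Lovelace", "PERSON"),
    Ent("Canadians", "NORP"),
    Ent("Ada Lovelace", "PERSON"),
    Ent("Alan Turing", "PERSON"),
]

print(count_entities(sample_ents, "PERSON"))
print(count_entities(sample_ents, "NORP"))
```

With the real data, count_entities(tokens.ents, 'NORP') would return the same kind of list that we passed to pd.DataFrame() earlier.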

Bonus Round: Visualization

Let’s create a horizontal bar graph of the df_norp dataframe.

df_norp.plot.barh(x='text', y='count', title="Nationalities, Religious, and Political Groups", figsize=(10,8)).invert_yaxis()
Screenshot by Author

Voilà, that’s it!


I hope you enjoyed this one. Natural language processing is a huge topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Wikipedia. (May 22, 2020). Named-entity recognition https://en.wikipedia.org/wiki/Named-entity_recognition

[2]: spaCy. (May 22, 2020). Industrial-Strength Natural Language Processing in Python https://spacy.io/

This article was first published in Towards Data Science’s Medium publication.

Create an N-Gram Ranking in Power BI

A quick start guide on building a Python visual with a few simple clicks of the mouse and a dash of code.


In a previous article, I wrote a quick start guide on creating and visualizing n-gram ranking using nltk for natural language processing. However, I needed a way to share my findings with others who don’t have Python or Jupyter Notebook installed on their machines. I needed to use our organization’s BI reporting tool: Power BI.

Enter Python Visual.

The Python visual allows you to create a visualization generated by running Python code. In this post, we’ll walk through the steps needed to visualize the results of our n-gram ranking using this visual.

First, let’s get our data. You can download the sample dataset here. Then, we could load the data into Power BI Desktop as shown below:

Select Text/CSV and click on “Connect”.

Select the file in the Windows Explorer folder and click “Open”:

Click on “Load”.

Next, find the Py icon on the “Visualizations” panel.

Then, click on “Enable” at the prompt that appears to enable script visuals.

You’ll see a placeholder appear in the main area and a Python script editor panel at the bottom of the dashboard.

Select the ‘text’ column on the “Fields” panel.

You’ll see a predefined script that serves as a preamble for the script that we’re going to write.

In the Python script editor panel, place your cursor at the end of line #6 and hit enter twice.

Then, copy and paste the following code:

import re
import unicodedata
import nltk
from nltk.corpus import stopwords

ADDITIONAL_STOPWORDS = ['covfefe']

import matplotlib.pyplot as plt

def basic_clean(text):
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
        .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

words = basic_clean(''.join(str(dataset['text'].tolist())))
bigrams_series = (pandas.Series(nltk.ngrams(words, 2)).value_counts())[:12]
bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.show()

In a nutshell, the code above extracts n-grams from the 'text' column of the dataset dataframe and creates a horizontal bar graph out of it using matplotlib. The result of plt.show() is what Power BI displays on the Python visual.

For more information on this code, please visit my previous tutorial, “From DataFrame to N-Grams.”

After you’re done pasting the code, click on the “play” icon at the upper right corner of the Python script editor panel.

After a few moments, you should now be able to see the horizontal bar graph like the one below:

And that’s it!

With a few simple clicks of the mouse, along with some help from our Python script, we’re able to visualize the results of our n-gram ranking.


I hope you enjoyed today’s post on one of Power BI’s strongest features. Power BI already has some useful and beautiful built-in visuals but sometimes, you just need a little bit more flexibility. Running Python code helps with this. I hope this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, I’ll share a quick-start guide to extracting named-entities from a Pandas dataframe using spaCy.

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published in Towards Data Science’s Medium publication.

From DataFrame to N-Grams

A quick-start guide to creating and visualizing n-gram ranking using nltk for natural language processing.


When I was first starting to learn NLP, I remember getting frustrated or intimidated by information overload so I’ve decided to write a post that covers the bare minimum. You know what they say, “Walk before you run!”

This is a very gentle introduction so we won’t be using any fancy code here.


In a nutshell, natural language processing or NLP simply refers to the process of reading and understanding written or spoken language using a computer. In its simplest use case, we can use a computer to read a book, for example, and count how many times each word was used instead of us manually doing it.

NLP is a big topic and there have already been a ton of articles written on the subject so we won’t be covering that here. Instead, we’ll focus on how to quickly do one of the simplest but most useful techniques in NLP: N-gram ranking.

N-Gram Ranking

Simply put, an n-gram is a sequence of n words where n is a whole number that can be 1 or greater. For example, the word “cheese” is a 1-gram (unigram). The combination of the words “cheese flavored” is a 2-gram (bigram). Similarly, “cheese flavored snack” is a 3-gram (trigram). And “ultimate cheese flavored snack” is a 4-gram (four-gram). So on and so forth.
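Here’s a minimal plain-Python sketch of that definition, using the same snack example. The ngrams() function below is a made-up stand-in for illustration; it is not nltk’s implementation.

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a tuple."""
    # zip together the word list shifted by 0, 1, ..., n-1 positions
    return list(zip(*(words[i:] for i in range(n))))


words = "ultimate cheese flavored snack".split()

for n in range(1, 5):
    print(n, ngrams(words, n))
```

Four words give us four unigrams, three bigrams, two trigrams, and a single 4-gram.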

In n-gram ranking, we simply rank the n-grams according to how many times they appear in a body of text — be it a book, a collection of tweets, or reviews left by customers of your company.
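As a quick illustration of the ranking idea, here’s a sketch that counts bigrams in a made-up sentence using only the standard library. The sentence and variable names are assumptions for this example.

```python
from collections import Counter

# a made-up sentence standing in for a body of text
text = "the cat sat on the mat and the cat slept"
words = text.split()

# pair each word with the word that follows it
bigrams = list(zip(words, words[1:]))

# rank the bigrams by how often they appear
ranking = Counter(bigrams).most_common(3)
print(ranking)
```

The bigram (“the”, “cat”) appears twice, so it lands at the top of the ranking.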

Let’s get started!

Getting the Data

First, let’s get our data and load it into a dataframe. You can download the sample dataset here or create your own from the Trump Twitter Archive.

import pandas as pd
df = pd.read_csv('tweets.csv')

Using df.head() we can quickly get acquainted with the dataset.

A sample of President Trump’s tweets.

Importing Packages

Next, we’ll import packages so we can properly set up our Jupyter notebook:

# natural language processing: n-gram ranking
import re
import unicodedata
import nltk
from nltk.corpus import stopwords

# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['covfefe']

import matplotlib.pyplot as plt

We had already imported pandas earlier so that we can shape and manipulate our data in all sorts of different and wonderful ways! In the code block above, we imported re for regex, unicodedata for Unicode data, and nltk to help with parsing the text and cleaning it up a bit. And then, we specified additional stop words that we want to ignore. This is helpful in trimming down the noise. Lastly, we imported matplotlib so we can visualize the result of our n-gram ranking later. If nltk’s stop word list and WordNet data aren’t on your machine yet, running nltk.download('stopwords') and nltk.download('wordnet') once will fetch them.

Next, let’s create a function that will perform basic cleaning of the data.

Basic Cleaning

def basic_clean(text):
    """
    A simple function to clean up the data. All the words that
    are not designated as a stop word are then lemmatized after
    encoding and basic regex parsing are performed.
    """
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
        .encode('ascii', 'ignore')
        .decode('utf-8', 'ignore')
        .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

The function above takes in a string of text as input and returns a cleaner list of words. The function does normalization, encoding/decoding, lower casing, punctuation and stop word removal, and lemmatization.
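To see what the normalization, encoding/decoding, lower casing, and regex steps do on their own, here’s a small sketch that leaves out the nltk parts (stop words and lemmatization) so it runs with just the standard library. normalize_text() and the sample sentence are made up for illustration.

```python
import re
import unicodedata


def normalize_text(text):
    """Apply only the non-nltk cleaning steps and return a list of words."""
    ascii_text = (unicodedata.normalize('NFKD', text)  # split accents from letters
                  .encode('ascii', 'ignore')           # drop anything non-ASCII
                  .decode('utf-8', 'ignore')
                  .lower())
    # remove everything that is not a word character or whitespace
    return re.sub(r'[^\w\s]', '', ascii_text).split()


print(normalize_text("Café Olé! Don’t stop, #Winning"))
```

Accented letters lose their accents, curly apostrophes and punctuation disappear, and everything ends up in lower case.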

Let’s use it!

words = basic_clean(''.join(str(df['text'].tolist())))

Above, we’re simply calling the function basic_clean() to process the 'text' column of our dataframe df, which we first turn into a simple list with tolist() and then into a single string. We then assign the results to words.

A list of already cleaned, normalized, and lemmatized words.

N-grams

Here comes the fun part! In one line of code, we can find out which bigrams occur the most in this particular sample of tweets.

(pd.Series(nltk.ngrams(words, 2)).value_counts())[:10]

We can easily replace the number 2 with 3 so we can get the top 10 trigrams instead.

(pd.Series(nltk.ngrams(words, 3)).value_counts())[:10]

Voilà! We got ourselves a great start. But why stop now? Let’s try it and make a little eye candy.

Bonus Round: Visualization

To make things a little easier for ourselves, let’s assign the result of n-grams to variables with meaningful names:

bigrams_series = (pd.Series(nltk.ngrams(words, 2)).value_counts())[:12]
trigrams_series = (pd.Series(nltk.ngrams(words, 3)).value_counts())[:12]

I’ve replaced [:10] with [:12] because I wanted more n-grams in the results. This is an arbitrary value so you can choose whatever makes the most sense to you according to your situation.

Let’s create a horizontal bar graph:

bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

And let’s spiff it up a bit by adding titles and axis labels:

bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('12 Most Frequently Occurring Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# of Occurrences')

And that’s it! With a few simple lines of code, we quickly made a ranking of n-grams from a Pandas dataframe and even made a horizontal bar graph out of it.


I hope you enjoyed this one. Natural Language Processing is a big topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, we’ll visualize an n-gram ranking in Power BI with a few simple clicks of the mouse and a dash of Python!

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published on Towards Data Science’s Medium publication.