A quick-start guide to extracting named-entities from a Pandas dataframe using spaCy.
A long time ago in a galaxy far away, I was analyzing comments left by customers and I noticed that they seemed to mention specific companies much more than others. This gave me an idea: maybe there was a way to extract the names of those companies from the comments so that I could quantify them and conduct further analysis.
There is! Enter: named-entity-recognition.
Named-Entity Recognition
According to Wikipedia, named-entity recognition or NER “is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.”¹ In other words, NER attempts to extract words that can be categorized as proper names and even numerical entities.
In this post, I’ll share the code that will let us extract named entities from a Pandas dataframe using spaCy, an open-source library that provides industrial-strength natural language processing in Python and is designed for production use.²
To get started, let’s install spaCy with the following pip command:
pip install -U spacy
After that, let’s download the pre-trained model for English:
python -m spacy download en_core_web_sm
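If you want to double-check that the download worked, spaCy also ships a validate command that lists the models it can find. This is just an optional sanity check, not a required step:
python -m spacy validate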
With that out of the way, let’s open up a Jupyter notebook and get started!
Imports
Run the following code block in a cell to get all the necessary imports into our Python environment.
# for manipulating dataframes
import pandas as pd

# for natural language processing: named entity recognition
import spacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

# for visualizations
%matplotlib inline
The important line in this block is nlp = en_core_web_sm.load() because this is what we’ll be using later to extract the entities from the text.
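To see it in action on a small scale, you can run the nlp object on any string and inspect the entities it finds. The sentence below is made up purely for illustration:
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)
This should print each entity together with its label, such as organizations and monetary values.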
Getting the Data
First, let’s get our data and load it into a dataframe. If you want to follow along, download the sample dataset here or create your own from the Trump Twitter Archive.
df = pd.read_csv('ever_trump.csv')
Running df.head() in a cell will quickly get us acquainted with the data set.

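Before we move on, it can also help to confirm how many rows we’re working with and whether the 'text' column has any missing values. This is an optional check, assuming the column is named 'text' as in the sample dataset:
print(df.shape)
print(df['text'].isna().sum())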
Getting the Tokens
Second, let’s create tokens that will serve as input for spaCy. In the line below, we create a variable tokens that contains all the words in the 'text' column of the df dataframe.
tokens = nlp(''.join(str(df.text.tolist())))
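Note that this line feeds the entire column to spaCy as one long string. For a larger data set, a gentler alternative (just a sketch, not the approach used in the rest of this post) is to process the rows one at a time with nlp.pipe and collect the entities from each resulting document:
docs = nlp.pipe(df['text'].astype(str))
row_entities = [(ent.text, ent.label_) for doc in docs for ent in doc.ents]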
Third, we’re going to extract entities. We can just extract the most common entities for now:
items = [x.text for x in tokens.ents]
Counter(items).most_common(20)

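Before filtering by category, it can be handy to see which entity types dominate the data. Counting the labels instead of the text gives a quick overview; this is an optional aside:
labels = [x.label_ for x in tokens.ents]
Counter(labels).most_common()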
Extracting Named-Entities
Next, we’ll extract the entities based on their categories. We have quite a few to choose from, ranging from people to events and even organizations. For a complete list of all that spaCy has to offer, check out their documentation on named entities.
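If you’re not sure what a label like NORP or GPE stands for, spacy.explain() returns a short description. For example, you can list only the types that actually occur in our document:
for label in sorted(set(ent.label_ for ent in tokens.ents)):
    print(label, '-', spacy.explain(label))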

To start, we’ll extract people (real and fictional) using the PERSON type.
person_list = []

for ent in tokens.ents:
    if ent.label_ == 'PERSON':
        person_list.append(ent.text)

person_counts = Counter(person_list).most_common(20)

df_person = pd.DataFrame(person_counts, columns=['text', 'count'])
In the code above, we started by making an empty list with person_list = []. Then, we used a for-loop to iterate over the entities found in tokens.ents. After that, we added a conditional that appends an entity’s text to the list if its label is PERSON.
We also want to know how many times each PERSON entity appears in the tokens, so we counted them with person_counts = Counter(person_list).most_common(20). This line gives us the top 20 most common entities of this type.
Finally, we created the df_person dataframe to store the results, and this is what we get:

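As an aside, the loop above can also be written as a single list comprehension; the result is the same, so it’s purely a matter of taste:
person_list = [ent.text for ent in tokens.ents if ent.label_ == 'PERSON']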
We’ll repeat the same pattern for the NORP type, which covers nationalities, religious groups, and political groups.
norp_list = []

for ent in tokens.ents:
    if ent.label_ == 'NORP':
        norp_list.append(ent.text)

norp_counts = Counter(norp_list).most_common(20)

df_norp = pd.DataFrame(norp_counts, columns=['text', 'count'])
And this is what we get:

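Since the PERSON and NORP blocks differ only in the label, you could fold the pattern into a small helper function. The sketch below is my own addition (the function name and the ORG example are not from the original walkthrough), but it makes it easy to try other entity types such as ORG or GPE:
def top_entities(doc, label, n=20):
    # gather the text of every entity with the requested label and count the most common ones
    matches = [ent.text for ent in doc.ents if ent.label_ == label]
    return pd.DataFrame(Counter(matches).most_common(n), columns=['text', 'count'])

df_org = top_entities(tokens, 'ORG')  # ORG covers companies, agencies, institutions, etc.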
Bonus Round: Visualization
Let’s create a horizontal bar graph of the df_norp dataframe.
df_norp.plot.barh(x='text', y='count', title="Nationalities, Religious, and Political Groups", figsize=(10,8)).invert_yaxis()

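If you’d like to polish or save the chart, the plot call returns a matplotlib Axes object that you can keep working with. The file name below is just an example:
ax = df_norp.plot.barh(x='text', y='count', title="Nationalities, Religious, and Political Groups", figsize=(10, 8))
ax.invert_yaxis()
ax.set_xlabel('count')
ax.figure.savefig('norp_counts.png', bbox_inches='tight')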
Voilà, that’s it!
I hope you enjoyed this one. Natural language processing is a huge topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.
Stay tuned!
You can reach me on Twitter or LinkedIn.
[1]: Wikipedia. (May 22, 2020). Named-entity recognition https://en.wikipedia.org/wiki/Named-entity_recognition
[2]: spaCy. (May 22, 2020). Industrial-Strength Natural Language Processing in Python https://spacy.io/
This article was first published in Towards Data Science’s Medium publication.