A quick-start guide to extracting named entities from a Pandas dataframe using spaCy.
A long time ago in a galaxy far away, I was analyzing comments left by customers and I noticed that they seemed to mention specific companies much more than others. This gave me an idea. Maybe there is a way to extract the names of companies from the comments and I could quantify them and conduct further analysis.
There is! Enter: named-entity recognition.
According to Wikipedia, named-entity recognition or NER “is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.”¹ In other words, NER attempts to extract words and phrases that can be categorized as proper names, and even numerical entities.
In this post, I’ll share the code that will let us extract named entities from a Pandas dataframe using spaCy, an open-source library that provides industrial-strength natural language processing in Python and is designed for production use.²
To get started, let’s install spaCy with the following pip command:
pip install -U spacy
After that, let’s download the pre-trained model for English:
python -m spacy download en_core_web_sm
With that out of the way, let’s open up a Jupyter notebook and get started!
Run the following code block in a cell to get all the necessary imports into our Python environment.
# for manipulating dataframes
import pandas as pd
# for natural language processing: named entity recognition
import en_core_web_sm
from collections import Counter
nlp = en_core_web_sm.load()
# for visualizations
import matplotlib.pyplot as plt
The important line in this block is nlp = en_core_web_sm.load() because this is what we’ll be using later to extract the entities from the text.
Getting the Data
First, let’s load the data into a dataframe:
df = pd.read_csv('ever_trump.csv')
Running df.head() in a cell will get us acquainted with the data set quickly.
Getting the Tokens
Second, let’s run the text through spaCy to create the tokens we’ll work with. In the line below, we create a variable tokens that holds spaCy’s processed version of all the words in the 'text' column of the df dataframe.
tokens = nlp(''.join(str(df.text.tolist())))
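One detail about the line above is worth a closer look. Calling str() on a list produces the list's printed form, brackets and quotes included, and ''.join() over that string simply reassembles it character by character. Here is a minimal sketch with made-up sample text (the texts list is hypothetical, not taken from the data set) that compares this with joining the rows themselves, which may give spaCy cleaner input:

```python
# Hypothetical stand-in for df.text.tolist(); not taken from the actual data set.
texts = ["Thank you Ohio!", "Great meeting with Japan"]

# What the expression above builds: the printed representation of the list.
as_repr = ''.join(str(texts))

# An alternative: join the rows themselves, separated by a space.
as_plain = ' '.join(str(t) for t in texts)

print(as_repr)
print(as_plain)
```

If the second form suits your data better, the equivalent call would be something like nlp(' '.join(df.text.astype(str))).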
Third, we’re going to extract entities. We can just extract the most common entities for now:
items = [x.text for x in tokens.ents]
Counter(items).most_common(20)
Next, we’ll extract the entities based on their categories. We have a few to choose from, ranging from people to events and even organizations. For a complete list of what spaCy has to offer, check out its documentation on named entities.
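Before picking a category, it can help to see which labels actually occur in the text. The sketch below counts entities per label. To stay runnable without a model, it uses a hypothetical Entity namedtuple as a stand-in for spaCy's entity spans, which expose the same text and label_ attributes. The sample entities are invented for illustration; with real data we would iterate over tokens.ents instead.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for spaCy entity spans; real ones come from tokens.ents.
Entity = namedtuple("Entity", ["text", "label_"])

# Invented sample entities, for illustration only.
sample_ents = [
    Entity("Jane Doe", "PERSON"),
    Entity("Canadians", "NORP"),
    Entity("John Smith", "PERSON"),
    Entity("Acme Corp", "ORG"),
    Entity("Jane Doe", "PERSON"),
]

# Count how many entities fall under each label.
label_counts = Counter(ent.label_ for ent in sample_ents)
print(label_counts.most_common())
```

Labels with high counts are usually the ones worth exploring first.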
To start, we’ll extract people (real and fictional) using the PERSON type.
person_list = []
for ent in tokens.ents:
    if ent.label_ == 'PERSON':
        person_list.append(ent.text)
person_counts = Counter(person_list).most_common(20)
df_person = pd.DataFrame(person_counts, columns=['text', 'count'])
In the code above, we started by making an empty list with person_list = [].
Then, we used a for-loop to go through the entities found in tokens with tokens.ents. After that, we added a conditional that appends the entity’s text to the previously created list if the entity label is equal to PERSON.
We want to know how many times each entity of the PERSON type appears in the tokens, so we count them with person_counts = Counter(person_list).most_common(20). This line gives us the 20 most common entities of this type.
Finally, we created the df_person dataframe to store the results, and this is what we get:
We’ll repeat the same pattern for the NORP type, which recognizes nationalities and religious and political groups.
norp_list = []
for ent in tokens.ents:
    if ent.label_ == 'NORP':
        norp_list.append(ent.text)
norp_counts = Counter(norp_list).most_common(20)
df_norp = pd.DataFrame(norp_counts, columns=['text', 'count'])
And this is what we get:
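Since the PERSON and NORP blocks differ only in the label, the pattern could be wrapped in a small helper. This is a sketch rather than part of the original code: top_entities is a hypothetical name, and the Entity namedtuple stands in for spaCy's entity spans so that the example runs on its own with invented sample data.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for spaCy entity spans; real ones come from tokens.ents.
Entity = namedtuple("Entity", ["text", "label_"])

def top_entities(ents, label, n=20):
    """Return the n most common entity texts that carry the given label."""
    matches = [ent.text for ent in ents if ent.label_ == label]
    return Counter(matches).most_common(n)

# Invented sample entities, for illustration only.
sample_ents = [
    Entity("Canadians", "NORP"),
    Entity("Jane Doe", "PERSON"),
    Entity("Canadians", "NORP"),
    Entity("Buddhist", "NORP"),
]

norp_counts = top_entities(sample_ents, "NORP")
print(norp_counts)
```

With real data, top_entities(tokens.ents, 'NORP') would return a list of (text, count) tuples that can be passed to pd.DataFrame with columns=['text', 'count'], just as in the code above.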
Bonus Round: Visualization
Let’s create a horizontal bar graph of the df_norp dataframe:
df_norp.plot.barh(x='text', y='count', title="Nationalities, Religious, and Political Groups", figsize=(10,8)).invert_yaxis()
Voilà, that’s it!
I hope you enjoyed this one. Natural language processing is a huge topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.
¹ Wikipedia. (May 22, 2020). Named-entity recognition. https://en.wikipedia.org/wiki/Named-entity_recognition
² spaCy. (May 22, 2020). Industrial-Strength Natural Language Processing in Python. https://spacy.io/
This article was first published in the Towards Data Science Medium publication.