Exploring the Trump Twitter Archive with spaCy
In a previous post, we set out to explore the dataset provided by the Trump Twitter Archive. My initial goal was to have some fun with a very interesting dataset. However, it didn’t quite turn out that way.
In this post, we’ll continue our journey, but this time we’ll be using spaCy.
For this project, we’ll be using pandas for data manipulation, spaCy for natural language processing, and joblib to speed things up.
Let’s get started by firing up a Jupyter notebook!
Let’s import pandas and also set the display options so Jupyter won’t truncate our columns and rows. Let’s also set a random seed for reproducibility.
# for manipulating data
import pandas as pd

# setting the random seed for reproducibility
import random
random.seed(42)  # any fixed value works; chosen arbitrarily here

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options so columns and rows aren't truncated
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
Getting the Data
Let’s read the data into a dataframe. If you want to follow along, you can download the cleaned dataset here along with the file for stop words¹. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.
df = pd.read_csv('trump_20200530_clean.csv', parse_dates=True, index_col='datetime')
Let’s take a quick look at the data.
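A simple head call will do here (output omitted):
# peek at the first few rows of the dataframe
df.head()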
Now let’s import spaCy and begin natural language processing.
# for natural language processing: named entity recognition
import spacy
We’re only going to use spaCy’s named-entity recognition (NER) functionality, so we’ll disable the rest of the pipeline components. This will save us a lot of loading time later.
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'textcat'])
Now let’s load the contents of the stopwords file into the variable stopwords. Note that we convert the list into a set, which will also save some processing time later.
with open('twitter-stopwords - TA - Less.txt') as f:
    contents = f.read().split(',')
stopwords = set(contents)
Next, we’ll import joblib and define a few functions to help with parallel processing.
from joblib import Parallel, delayed

def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize]
            for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    "Flatten a list of lists to a combined list"
    return [item for sublist in list_of_lists for item in sublist]

def process_chunk(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append([ent.text for ent in doc.ents
                             if ent.label_ in ['NORP', 'PERSON', 'FAC', 'ORG',
                                               'GPE', 'LOC', 'PRODUCT', 'EVENT']])
    return preproc_pipe

def preprocess_parallel(texts, chunksize=100):
    executor = Parallel(n_jobs=7, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(texts, len(df), chunksize=chunksize))
    result = executor(tasks)
    return flatten(result)
In the code above², the function preprocess_parallel executes the other function, process_chunk, in parallel to help with speed. The function process_chunk iterates through a series of texts (in our case, the 'tweet' column of our df dataframe) and checks whether each entity found belongs to one of NORP, PERSON, FAC, ORG, GPE, LOC, PRODUCT, or EVENT. If it does, the entity is appended to preproc_pipe, which is ultimately returned to the caller. Prashanth Rao has a very good article on making spaCy super fast.
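Before running it on the tweets, here’s a quick illustration of what the chunker helper produces, using toy values rather than the actual dataset:
# chunker yields successive slices of the iterable
list(chunker(['a', 'b', 'c', 'd', 'e'], total_length=5, chunksize=2))
# [['a', 'b'], ['c', 'd'], ['e']]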
Now let’s call the main driver function.
df['entities'] = preprocess_parallel(df['tweet'], chunksize=1000)
A quick df.head() will reveal the new column 'entities' that we added to hold the entities found in the tweets.
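For intuition, here’s what the extraction inside process_chunk looks like on a single made-up text (not an actual tweet from the dataset); the exact entities and labels will vary with the model version:
# hypothetical sample text, for illustration only
doc = nlp("The Democrats and the Fake News Media are at it again in Washington!")
[(ent.text, ent.label_) for ent in doc.ents]
# e.g. [('Democrats', 'NORP'), ('Washington', 'GPE')], depending on the model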
Prettifying the Results
In the code below, we’re making a list of lists called entities and then flattening it for easier processing. We’re also converting it into a set called entities_set to remove duplicates.
entities = [entity for entity in df.entities if entity != []]
entities = [item for sublist in entities for item in sublist]
entities_set = set(entities)
Next, let’s count the frequency of each entity, keep the twenty most common, and convert the results into a dataframe df_counts.
df_counts = pd.Series(entities).value_counts()[:20].to_frame().reset_index()
df_counts.columns = ['entity', 'count']  # name the columns so we can filter on them below
For this step, we’re going to initialize an empty list entity_counts and manually construct a list of tuples, combining related entities and summing their counts.
entity_counts = []

entity_counts.append(('Democrats', df_counts.loc[df_counts.entity.isin(['Democrats', 'Dems', 'Democrat'])]['count'].sum()))
entity_counts.append(('Americans', df_counts.loc[df_counts.entity.isin(['American', 'Americans'])]['count'].sum()))
entity_counts.append(('Congress', df_counts.loc[df_counts.entity.isin(['House', 'Senate', 'Congress'])]['count'].sum()))
entity_counts.append(('America', df_counts.loc[df_counts.entity.isin(['U.S.', 'the United States', 'America'])]['count'].sum()))
entity_counts.append(('Republicans', df_counts.loc[df_counts.entity.isin(['Republican', 'Republicans'])]['count'].sum()))
entity_counts.append(('China', 533))
entity_counts.append(('FBI', 316))
entity_counts.append(('Russia', 313))
entity_counts.append(('Fake News', 248))
entity_counts.append(('Mexico', 213))
entity_counts.append(('Obama', 176))
Let’s take a quick look before continuing.
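Something as simple as slicing the list works (output omitted):
# inspect the first few (entity, count) tuples
entity_counts[:5]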
Finally, let’s convert the list of tuples into a dataframe.
df_ner = (pd.DataFrame(entity_counts, columns=["entity", "count"])
            .sort_values('count', ascending=False)
            .reset_index(drop=True))
And that’s it!
We’ve successfully created a ranking of the named entities that President Trump most frequently talked about in his tweets since taking office.
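If you’d like to visualize the ranking, here’s a minimal sketch using matplotlib (an extra dependency not imported earlier):
# for plotting the ranking
import matplotlib.pyplot as plt

# horizontal bar chart of the most frequently mentioned entities
df_ner.sort_values('count').plot.barh(x='entity', y='count', legend=False)
plt.xlabel('count')
plt.tight_layout()
plt.show()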
Thank you for reading! Exploratory data analysis uses a lot of techniques, and we’ve only explored a few in this post. I encourage you to keep practicing and to employ other techniques to derive insights from data.
In the next post, we shall continue our journey into the heart of darkness and do some topic-modeling using LDA.
¹ GONG Wei’s Homepage. (May 30, 2020). Stop words for tweets. https://sites.google.com/site/iamgongwei/home/sw
² Towards Data Science. (May 30, 2020). Turbo-charge your spaCy NLP pipeline. https://towardsdatascience.com/turbo-charge-your-spacy-nlp-pipeline-551435b664ad