Into the Heart of Darkness - Pt. 2

Exploring the Trump Twitter Archive with spaCy.


In a previous post, we set out to explore the dataset provided by the Trump Twitter Archive. My initial goal was to do something fun with a very interesting dataset, but it didn't quite turn out that way.

In this post, we'll continue our journey, but this time we'll be using spaCy.


For this project, we’ll be using pandas for data manipulation, spaCy for natural language processing, and joblib to speed things up.

Let’s get started by firing up a Jupyter notebook!

Housekeeping

Let’s import pandas and also set the display options so Jupyter won’t truncate our columns and rows. Let’s also set a random seed for reproducibility.

# for manipulating data
import pandas as pd
# setting the random seed for reproducibility
import random
random.seed(493)
# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

Getting the Data

Let's read the data into a dataframe. If you want to follow along, you can download the cleaned dataset here along with the file for stop words¹. This dataset contains Trump's tweets from the moment he took office on January 20, 2017, to May 30, 2020.

df = pd.read_csv('trump_20200530_clean.csv', parse_dates=True, index_col='datetime')

Let’s take a quick look at the data.

df.head()
df.info()

Using spaCy

Now let’s import spaCy and begin natural language processing.

# for natural language processing: named entity recognition
import spacy
import en_core_web_sm

We're only going to use spaCy's named-entity recognition (ner) component, so we'll disable the rest of the pipeline. This will save us a lot of processing time later.

nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'textcat'])
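To double-check which components are still active after disabling the others, we can inspect the pipeline names. This is just a quick sanity check.

# quick sanity check: list the components still in the pipeline
nlp.pipe_names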

Now let's load the contents of the stopwords file into the variable stopwords. Note that we convert the list into a set to save some processing time later.

with open('twitter-stopwords — TA — Less.txt') as f:
    contents = f.read().split(',')
stopwords = set(contents)
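The stopword set isn't used again in the snippets below, but the reason for converting the list to a set is that membership checks against a set are much faster than against a list. A minimal illustration (the word 'the' here is just an example and may or may not be in the file):

# membership checks against a set are effectively constant time
'the' in stopwords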

Next, we’ll import joblib and define a few functions to help with parallel processing.

from joblib import Parallel, delayed

def chunker(iterable, total_length, chunksize):
    """Yield successive slices of the iterable, each of size chunksize."""
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    """Flatten a list of lists into a single combined list."""
    return [item for sublist in list_of_lists for item in sublist]

def process_chunk(texts):
    """Run nlp.pipe over a chunk of texts and keep only the selected entity types."""
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append([ent.text for ent in doc.ents if ent.label_ in ['NORP', 'PERSON', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT']])
    return preproc_pipe

def preprocess_parallel(texts, chunksize=100):
    """Split the texts into chunks and process them in parallel with joblib."""
    executor = Parallel(n_jobs=7, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(texts, len(texts), chunksize=chunksize))
    result = executor(tasks)
    return flatten(result)

In the code above², the function preprocess_parallel executes the other function, process_chunk, in parallel to help with speed. The function process_chunk iterates through a series of texts (in our case, the 'tweet' column of the df dataframe) and checks whether each entity spaCy finds belongs to one of NORP, PERSON, FAC, ORG, GPE, LOC, PRODUCT, or EVENT. If it does, the entity is appended to preproc_pipe, which is then returned to the caller. Prashanth Rao has a very good article on making spaCy super fast.
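To get a feel for what process_chunk returns, you can first run it on a single made-up string (the sentence below is an invented example, not a tweet from the dataset):

# a quick test on one invented sentence; returns a list containing one inner list of entity texts
process_chunk(["Congress and the FBI met with leaders from China and Mexico today."])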

Let’s call the main driver for the functions now.

df['entities'] = preprocess_parallel(df['tweet'], chunksize=1000)
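If you want to gauge how much the parallelization helps on your machine, a rough comparison like the sketch below works; the exact timings will vary with hardware and core count, and this step is entirely optional.

# optional: rough timing comparison between the parallel and single-process paths
import time

start = time.perf_counter()
_ = preprocess_parallel(df['tweet'], chunksize=1000)
print(f"parallel: {time.perf_counter() - start:.1f}s")

start = time.perf_counter()
_ = process_chunk(df['tweet'].tolist())
print(f"single process: {time.perf_counter() - start:.1f}s")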

Doing a quick df.head() will reveal the new column 'entities', which now holds the entities found in the 'tweet' column.

Prettifying the Results

In the code below, we’re making a list of lists called 'entities' and then flattening it for easier processing. We’re also converting it into a set called 'entities_set'.

entities = [entity for entity in df.entities if entity != []]
entities = [item for sublist in entities for item in sublist]
entities_set = set(entities)

Next, let's count the frequency of each entity, keep the 20 most common, and put the results into a dataframe df_counts.

df_counts = pd.Series(entities).value_counts()[:20].to_frame().reset_index()
df_counts.columns=['entity', 'count']
df_counts

For this step, we're going to initialize an empty list entity_counts and manually construct a list of tuples, grouping related entities together and summing their counts.

entity_counts = []

entity_counts.append(('Democrats', df_counts.loc[df_counts.entity.isin(['Democrats', 'Dems', 'Democrat'])]['count'].sum()))
entity_counts.append(('Americans', df_counts.loc[df_counts.entity.isin(['American', 'Americans'])]['count'].sum()))
entity_counts.append(('Congress', df_counts.loc[df_counts.entity.isin(['House', 'Senate', 'Congress'])]['count'].sum()))
entity_counts.append(('America', df_counts.loc[df_counts.entity.isin(['U.S.', 'the United States', 'America'])]['count'].sum()))
entity_counts.append(('Republicans', df_counts.loc[df_counts.entity.isin(['Republican', 'Republicans'])]['count'].sum()))

entity_counts.append(('China', 533))
entity_counts.append(('FBI', 316))
entity_counts.append(('Russia', 313))
entity_counts.append(('Fake News', 248))
entity_counts.append(('Mexico', 213))
entity_counts.append(('Obama', 176))

Let’s take a quick look before continuing.
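One way to do that is simply to display the list of tuples we just built:

# display the manually grouped counts
entity_counts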

Finally, let’s convert the list of tuples into a dataframe.

df_ner = pd.DataFrame(entity_counts, columns=["entity", "count"]).sort_values('count', ascending=False).reset_index(drop=True)
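A quick display of df_ner shows the final ranking:

# the final ranking, sorted by count
df_ner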

And that’s it!

We’ve successfully created a ranking of the named entities that President Trump most frequently talked about in his tweets since taking office.


Thank you for reading! Exploratory data analysis uses a lot of techniques, and we've only explored a few in this post. I encourage you to keep practicing and to employ other techniques to derive insights from data.

In the next post, we shall continue our journey into the heart of darkness and do some topic-modeling using LDA.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] GONG Wei’s Homepage. (May 30, 2020). Stop words for tweets. https://sites.google.com/site/iamgongwei/home/sw

[2] Towards Data Science. (May 30, 2020). Turbo-charge your spaCy NLP pipeline. https://towardsdatascience.com/turbo-charge-your-spacy-nlp-pipeline-551435b664ad