Into the Heart of Darkness - Pt. 1

Exploring the Trump Twitter Archive with Python. For beginners.


In this post, we’ll explore the dataset provided by the Trump Twitter Archive. My goal was to do something fun with a very interesting dataset. However, as it turned out, exposure to Trump’s narcissism and shenanigans was quite depressing, if not traumatic.

You’ve been warned!


For this project, we’ll be using pandas and numpy for data manipulation, matplotlib for visualizations, datetime and dateutil for working with timestamps, unicodedata and re for processing strings, and finally, nltk for natural language processing.

Let’s get started by firing up a Jupyter notebook!

Environment

We’re going to import pandas, numpy, and matplotlib, and also set the display options for Jupyter so that rows, columns, and long cell values are not truncated.

# for manipulating data
import pandas as pd
import numpy as np
# for visualizations
%matplotlib inline
import matplotlib.pyplot as plt
# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
# None means no limit (-1 is deprecated in recent versions of pandas)
pd.set_option('display.max_colwidth', None)

Getting the Data

Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.

df = pd.read_csv('trump_20200530.csv')

Let’s look at the first five rows and see the number of records (rows) and fields (columns).

df.head()
df.shape

Let’s do a quick renaming of the columns to make it easier for us later.

df.columns=['source', 'tweet', 'date_time', 'retweets', 'favorites', 'is_retweet', 'id']

Let’s drop the id column since it’s not really relevant right now.

df = df.drop(columns=['id'])

Let’s do a quick sanity check, this time let’s also check the dtypes of the columns.

df.head()
df.info()

Working with Timestamps

We can see from the output above that the ‘date_time’ column is stored as strings. Let’s parse it into timestamps.

# for working with timestamps
from datetime import datetime
from dateutil.parser import parse
dt = []
for ts in df.date_time:
    dt.append(parse(ts))
dt[:5]

Let’s add a ‘datetime’ column that contains the timestamp information.

df['datetime'] = df.apply(lambda row: parse(row.date_time), axis=1)
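As an aside, pandas can usually parse a whole column in a single call, which avoids the row-by-row apply. The sketch below is only an illustration: the `sample` dataframe and its two rows are made up, and the `format` string is an assumption about how the timestamps are laid out in the file, so adjust it to match what you see in your own data.

```python
import pandas as pd

# Hypothetical two-row sample standing in for the real dataframe
sample = pd.DataFrame({
    'date_time': ['01-20-2017 17:30:00', '05-30-2020 08:05:10']
})

# Parse the whole column at once; the format string is an assumption
sample['datetime'] = pd.to_datetime(
    sample['date_time'], format='%m-%d-%Y %H:%M:%S'
)

print(sample['datetime'].dt.hour.tolist())
```

The result is a proper datetime column, so the `.dt` accessor used later in this post works the same way.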

Let’s double-check the date range of our dataset.

df.datetime.min()
df.datetime.max()

Trimming the Data

Let’s see how many sources there are for the tweets.

df.source.value_counts()

Let’s only keep the ones that were made using the ‘Twitter for iPhone’ app.

df = df.loc[df.source == 'Twitter for iPhone']

We should drop the old ‘date_time’ column and the ‘source’ column as well.

df = df.drop(columns=['date_time', 'source'])

Separating the Retweets

Let’s see how many are retweets.

df.is_retweet.value_counts()

Let’s make another dataframe that contains only retweets and drop the ‘is_retweet’ column.

df_retweets = df.loc[df.is_retweet == True]
df_retweets = df_retweets.drop(columns=['is_retweet'])

Sanity check:

df_retweets.head()
df_retweets.shape

Back on the original dataframe, let’s remove the retweets from the dataset and drop the ‘is_retweet’ column altogether.

df = df.loc[df.is_retweet == False]
df = df.drop(columns=['is_retweet'])

Another sanity check:

df.head()
df.shape

Exploring the Data

Let’s explore both of the dataframes and answer a few questions.

What time does the President tweet the most? What time does he tweet the least?

The graph below shows that the President tweets most frequently around 12pm and least frequently around 8am. Maybe he’s not a morning person?

title = 'Number of Tweets by Hour'
df.tweet.groupby(df.datetime.dt.hour).count().plot(figsize=(12,8), fontsize=14, kind='bar', rot=0, title=title)
plt.xlabel('Hour')
plt.ylabel('Number of Tweets')
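If you’d rather read the busiest and quietest hours off as numbers instead of eyeballing the chart, something like the following should work. It is a minimal sketch on made-up timestamps (the `stamps` series is hypothetical and stands in for `df.datetime`), so the hours it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical timestamps standing in for df.datetime
stamps = pd.Series(pd.to_datetime([
    '2020-01-01 12:05', '2020-01-02 12:40', '2020-01-03 12:15',
    '2020-01-04 08:30', '2020-01-05 21:10', '2020-01-06 21:45',
]))

# Count tweets per hour; hours with no tweets get a zero
per_hour = stamps.dt.hour.value_counts().reindex(range(24), fill_value=0)

# Hour with the most tweets
busiest = per_hour.idxmax()
# Hour with the fewest tweets, ignoring hours with none at all
quietest = per_hour[per_hour > 0].idxmin()

print(busiest, quietest)
```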

What day does the President tweet the most? What day does he tweet the least?

The graph below shows that the President tweets most frequently on Wednesday and least frequently on Thursday.

title = 'Number of Tweets by Day of the Week'
df.tweet.groupby(df.datetime.dt.dayofweek).count().plot(figsize=(12,8), fontsize=14, kind='bar', rot=0, title=title)
plt.xlabel('Day of the Week')
plt.ylabel('Number of Tweets')
plt.xticks(np.arange(7),['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
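An alternative to relabeling the ticks by hand is to group on day names directly. The sketch below uses made-up dates (the `stamps` series is hypothetical) and reindexes the counts so the days appear in calendar order, including any day that has no tweets.

```python
import pandas as pd

# Hypothetical dates: one Monday, two Wednesdays, one Sunday
stamps = pd.Series(pd.to_datetime([
    '2020-05-25', '2020-05-27', '2020-05-27', '2020-05-31',
]))

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']

# Count by day name, then force calendar order and fill missing days with 0
per_day = stamps.dt.day_name().value_counts().reindex(day_order, fill_value=0)

print(per_day)
```

Calling `per_day.plot(kind='bar')` on the result would then label the bars without a separate `plt.xticks` call.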

Isolating Twitter Handles from the Retweets

Let’s import the re module so we can use a regular expression to parse the text and isolate the Twitter handles of the original tweets. In the code below, we add another column that contains the Twitter handle.

import re
pattern = re.compile('(?<=RT @).*?(?=:)')
df_retweets['original'] = [re.search(pattern, tweet).group(0) for tweet in df_retweets.tweet]
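One caveat with the list comprehension above: `re.search` returns `None` when a tweet doesn’t follow the “RT @handle:” shape, and calling `.group(0)` on that raises an error. A more forgiving variant, sketched here on a made-up `sample` dataframe, uses pandas’ `str.extract`, which leaves a missing value instead. Note that this swaps in a slightly stricter pattern (`\w+` for the handle) than the one used in this post.

```python
import pandas as pd

# Hypothetical rows; the last one does not follow the retweet pattern
sample = pd.DataFrame({'tweet': [
    'RT @charliekirk11: An example retweet',
    'RT @realDonaldTrump: Another example',
    'A line that does not follow the RT pattern',
]})

# One capture group: the handle between "RT @" and the first colon
sample['original'] = sample['tweet'].str.extract(r'^RT @(\w+):', expand=False)

# Rows that did not match hold NaN and can be dropped or inspected
matched = sample.dropna(subset=['original'])

print(matched['original'].tolist())
```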

Let’s create another dataframe that will contain only the original Twitter handles and their associated number of retweets.

# sum only the numeric 'retweets' column; summing text or timestamp columns fails in recent pandas
df_originals = df_retweets.groupby('original')['retweets'].sum().reset_index().sort_values('retweets', ascending=False)

Let’s check the data real quick:

df_originals.head()
df_originals.shape

Let’s quickly visualize the top 10 handles to get an idea of how lopsided the distribution is.

# keep the 10 most retweeted handles, sorted ascending so the largest bar is drawn on top
df_originals = df_retweets.groupby('original')['retweets'].sum().reset_index().sort_values('retweets', ascending=False)[:10].sort_values('retweets')
df_originals.plot.barh(x='original', y='retweets', figsize=(16,10), fontsize=16)
# in a horizontal bar chart the handles sit on the y-axis
plt.ylabel("Originating Tweet's Username")
plt.xticks([])

Which Twitter user does the President like to retweet the most?

The graph below shows that the President likes to retweet the tweets from ‘@realDonaldTrump’. Does this mean the president likes to retweet himself? You don’t say!

The interesting handle on this one is ‘@charliekirk11’. Charlie Kirk is the founder of Turning Point USA. CBS News described the organization as a far-right organization that is “shunned or at least ignored by more established conservative groups in Washington, but embraced by many Trump supporters”.¹

The Top 5 Retweets

Let’s look at the top 5 tweets that were retweeted the most by others based on the original Twitter handle.

Let’s start with the ones with ‘@realDonaldTrump’.

df_retweets.loc[df_retweets.original == 'realDonaldTrump'].sort_values('retweets', ascending=False)[:5]

And another one with ‘@charliekirk11’.

df_retweets.loc[df_retweets.original == 'charliekirk11'].sort_values('retweets', ascending=False)[:5]

Examining Retweets’ Favorites Count

Let’s find out how many of the retweets are favorited by others.

df_retweets.favorites.value_counts()

Surprisingly, none of the retweets seem to have been favorited by anybody. Weird.

Since the ‘favorites’ column carries no information for the retweets, we should drop it.
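The post doesn’t show the code for this step, so here is one way the drop could look. It’s a sketch on a made-up `sample_retweets` dataframe (hypothetical name and values) and it checks that every value really is zero before removing the column.

```python
import pandas as pd

# Hypothetical stand-in for df_retweets
sample_retweets = pd.DataFrame({
    'tweet': ['RT @someone: first example', 'RT @someone: second example'],
    'retweets': [10, 20],
    'favorites': [0, 0],
})

# Only drop the column if it really carries no information
if (sample_retweets['favorites'] == 0).all():
    sample_retweets = sample_retweets.drop(columns=['favorites'])

print(sample_retweets.columns.tolist())
```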

Counting N-Grams

To do some n-gram ranking, we need to import unicodedata and nltk. We also need to specify additional stopwords that we may need to exclude from our analysis.

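# for cleaning and natural language processing
import unicodedata
import nltk
# the stopword list and the lemmatizer need these corpora;
# uncomment to download them once if they are not installed yet
# nltk.download('stopwords')
# nltk.download('wordnet')
# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['rt']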

Here are a few of my favorite functions for natural language processing:

def clean(text):
  """
  A simple function to clean up the data. The text is normalized
  to ASCII, lowercased, and stripped of punctuation; every word
  that is not a stop word is then lemmatized.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]

def get_words(df, column):
    """
    Takes in a dataframe and a column name and returns a list of
    cleaned words from the values in the specified column.
    """
    # join the tweets with spaces so words from adjacent tweets don't run together
    return clean(' '.join(df[column].astype(str)))

def get_bigrams(df, column):
    """
    Takes in a dataframe and a column name and returns a series of
    the 10 most frequent bigrams with their counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column), 2)).value_counts())[:10]

def get_trigrams(df, column):
    """
    Takes in a dataframe and a column name and returns a series of
    the 10 most frequent trigrams with their counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column), 3)).value_counts())[:10]

def viz_bigrams(df, column):
    get_bigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

    plt.title('10 Most Frequently Occurring Bigrams')
    plt.ylabel('Bigram')
    plt.xlabel('# Occurrences')

def viz_trigrams(df, column):
    get_trigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

    plt.title('10 Most Frequently Occurring Trigrams')
    plt.ylabel('Trigram')
    plt.xlabel('# Occurrences')
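
If the counting step feels like a black box, here is roughly what it boils down to, sketched in plain Python without nltk. The `count_ngrams` helper and the word list are made up for illustration; they are not part of the functions above.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count every run of n consecutive words in a list."""
    # zipping n staggered views of the list yields consecutive n-word tuples
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(grams)

# Hypothetical, already-cleaned word list
words = ['fake', 'news', 'story', 'fake', 'news']

bigrams = count_ngrams(words, 2)
trigrams = count_ngrams(words, 3)

print(bigrams.most_common(1))
```

A list of five words yields four bigrams and three trigrams, and the most common bigram here is the pair that appears twice.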
 

Let’s look at the top 10 bigrams in the df dataframe using the ‘tweet’ column.

get_bigrams(df, 'tweet')

And now, for the top 10 trigrams:

get_trigrams(df, 'tweet')

Let’s use the viz_bigrams() function and visualize the bigrams.

viz_bigrams(df, 'tweet')

Similarly, let’s use the viz_trigrams() function and visualize the trigrams.

viz_trigrams(df, 'tweet')

And there we have it!

We can confidently say that, from the moment Trump took office, the “fake news media” has been at the top of the president’s mind.

Conclusion

Using basic Python and the nltk library, we’ve explored the dataset from the Trump Twitter Archive and ranked its most frequent n-grams.


Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.

In the next post, we shall continue our journey into the heart of darkness and use spaCy to extract named-entities from the same dataset.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] CBS News. “Trump speaks to conservative group Turning Point USA”. www.cbsnews.com. Archived from the original on July 31, 2019. Retrieved August 5, 2019.

Published by

Ednalyn C. De Dios

I’ve always been enamored with code and I love data science because of its inherent power to solve real problems. Having grown up in the Philippines, served in the United States Navy, and worked in the nonprofit sector, I am driven to make the world a better place. I have started and participated in numerous campaigns that aim to reduce domestic violence and child abuse in the community.
