Analysis of Texas Public School Spending and STAAR Performance

Project Overview

A1. Research Question

The research question this project sought to answer is whether we can predict if a Texas public school district will perform well or poorly on the STAAR test based on how it allocates its budget, while taking into account how racially diverse the district is. The project also sought to identify which financial expenditure items are important predictors of whether a district will perform well or poorly on the STAAR test. The purpose of this analysis is to provide counsel to school administrators and policymakers so that they can make data-driven decisions instead of relying on anecdotal evidence alone.

A2. Project Scope

This project’s scope was to use a Jupyter notebook to build a model that predicts whether a Texas public school district will perform well or poorly on the STAAR test. The model takes several financial expenditure inputs and outputs a prediction of the district’s STAAR performance. The project examined the 2018-2019 academic year.

A3. Solution Overview – Tools

The Jupyter notebook that contains the model was coded in the Python programming language and relies on several key inputs from the end user. The first input is a csv file called Enrollment Report_Statewide_Districts_Grade_Ethnicity_2018-2019 (Texas Education Agency, 2019), which contains demographic information on the white and non-white populations of each district. The second key input is tidy_campstaar1_2012to2019 and tidy_campstaar2_2012to2019 (Texas Education Agency, 2022), which contain information on how many students in each district performed well (meets or masters a subject) or poorly (approaches comprehension of a subject). The third key input is the 2007-2021-summaried-peims-financial-data (Texas Education Agency, 2022), which contains information on how each district spent its budget. The notebook automatically cleanses and transforms the data to make it suitable for analysis and modeling, and then constructs a model that predicts whether a district will perform well or poorly on the STAAR test.

A4. Solution Overview – Methodologies

There were four different types of methodologies used in this project: Project, Data Collection, Analytical, and Statistical. Each methodology played an important role in the planning, execution, and verification of the project.

Project Plan

In this project, the author executed the plans without change. All goals, objectives, and deliverables listed below were completed as described in Task 2, apart from the dates or time frame for each task. The timeline projected in Task 2 was very conservative, and the project was completed earlier than planned.

B1. Project Execution

The goal of this project was to create a model that would predict how a district would perform on the STAAR test. The objectives of this goal are:

  • Determine which districts are performing or underperforming in relation to how much funding they receive, how racially diverse a district is, and how well the students in the district perform on the STAAR test overall.
  • Determine which type of expenditures contribute to the weighted performance of the district.
  • Predict how a district will perform based on how they allocate their budget.

B2. Project Planning Methodology

The project planning methodology used was PAPEM-DM (De Dios, 2020). The seven steps in this method are planning, acquisition, preparation, exploration, modeling, delivery, and maintenance.

  1. Project Planning – answered the why and the how of the project and delivered a list of research questions to answer along with the criteria for success.
  2. Acquisition of Data – acquired the data from the TEA and Kaggle websites.
  3. Preparation of Data – prepared the data, which included cleaning it and transforming it into a format suitable for analysis and modeling.
  4. Exploratory Data Analysis – explored the data using descriptive statistics and visualized the relationships between variables using matplotlib and seaborn. The deliverables of this phase were a few graphs ready for presentation to stakeholders.
  5. Modeling – created, fitted, and used the model for training and inference. The deliverable was a pickle file for the trained model.
  6. Delivery of Results – delivered the results by completing Task 3 of this capstone project.
  7. Maintenance – while important, this phase was not included in the project. However, the Jupyter notebook can be used as a basis for deployment into production (for inference) on any of the cloud platforms such as AWS, Azure, or GCP.

B3. Project Timeline and Milestones

Phase | Description | Deliverable | Start | End | Duration
Planning | Project planning | Project plan | 8/1 | 8/3 | 7 days
Acquisition | Getting the data | Raw datasets | 8/4 | 8/4 | 1 day
Preparation | Cleansing and transformation | Clean datasets | 8/5 | 8/21 | 13 days
Exploration | EDA | Visualizations | 8/5 | 8/21 | 13 days
Modeling | Train and infer | Pickle file | 8/5 | 8/21 | 13 days
Delivery | Task 3 Report | PDF Document | 8/22 | 8/28 | 7 days
Maintenance | Deployment to production | Not Applicable | N/A | N/A | N/A

Methodology

C. Data Selection and Collection Process

The author downloaded two of the datasets from the TEA website. While the third dataset could also have been downloaded from the same place, the author chose the Kaggle version of the data because it had already been collected in a format that did not require web scraping to mine the information from the website.

Data selection and collection did not vary much from the proposed plan. The only addition the author made was the inclusion of ethnicity data, which was downloaded from the same TEA website. This data was necessary to examine and analyze the effect of racial diversity within the districts.

One obstacle the author encountered involved the dataset that was added during the actual execution of the project. The author encountered several errors when reading the csv file and realized that the formatting of the csv itself was off. The author adjusted the read_csv parameter “skiprows” and was able to bypass the extra header rows that were not needed in the analysis.
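
A hedged illustration of that adjustment follows; the exact number of rows to skip depends on the actual file layout.

import pandas as pd

# Skip the report title/header block at the top of the enrollment file.
# The value 4 is illustrative; adjust it to match the actual csv.
enrollment = pd.read_csv(
    'Enrollment Report_Statewide_Districts_Grade_Ethnicity_2018-2019.csv',
    skiprows=4,
)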

There really were no data governance issues because the datasets are already in the public domain. The one concern would have been the anonymity of students in schools or school districts so small that masking their identities would have been impossible. This issue was already taken care of by the Texas Education Agency before the datasets were published.

For example, for districts with small subpopulations, the numerator values have been coded as -999. The author simply had to replace those values with 0 for the dataframe to be processed cleanly. Another example of the datasets’ cleanliness is the fact that there were no missing values in any of the columns.
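
A minimal sketch of that replacement, assuming the merged data sits in a dataframe named df:

# Replace the -999 suppression codes with 0 so the dataframe processes cleanly.
df = df.replace(-999, 0)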

Therefore, cleanliness and sanitation were the obvious advantages of the datasets used. Their limitation is that, when merged together, they only cover the 2018-2019 school year. This is by design and hence not a big enough limitation to warrant much concern.

D. Data Extraction and Preparation Process

The author utilized the power of pandas to read the csv files into a dataframe. The method read_csv() was appropriate because the data is in a tabular format delimited by a comma.

For data preparation or data wrangling, the author used many techniques to handle missing and duplicate data, as well as to remove extraneous columns and rows. These techniques included replacing null placeholder values with true numeric NaN values, calling drop_duplicates() and drop() for columns, and using string methods to clean up and standardize the column names. The author also used .loc, .iloc, and np.where to filter columns and rows. Finally, for merging the datasets, the author used a combination of the concat() and merge() methods to execute inner and outer joins of the data. Utilizing pandas methods made sense because of their efficiency, versatility, and simplicity.
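
The sketch below illustrates these techniques on a small synthetic example; the toy data and column names are illustrative and do not come from the actual TEA files.

import numpy as np
import pandas as pd

finance = pd.DataFrame({
    'District Name': ['Alpha ISD', 'Alpha ISD', 'Beta ISD'],
    'Payroll': [1_000_000, 1_000_000, -999],
    'Region': [20, 20, 20],
})

finance = finance.drop_duplicates()                      # remove duplicate rows
finance = finance.drop(columns=['Region'])               # drop an extraneous column
finance.columns = finance.columns.str.strip().str.lower().str.replace(' ', '_')
finance['payroll'] = finance['payroll'].replace(-999, np.nan)   # placeholder -> true NaN

staar = pd.DataFrame({'district_name': ['Alpha ISD', 'Beta ISD'],
                      'passing_rate': [0.72, 0.55]})

merged = pd.merge(finance, staar, on='district_name', how='inner')       # inner join
merged['label'] = np.where(merged['passing_rate'] >= 0.6, 1, 0)          # conditional column
subset = merged.loc[merged['label'] == 1, ['district_name', 'payroll']]  # label-based filtering
print(merged)
print(subset)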

E. Data Analysis Process

For exploratory data analysis, the author used several visualizations to examine the relationships between variables. For example, a heatmap was used to visualize the correlation coefficients of a correlation matrix computed with the Pearson method, and a pairplot was used to examine whether any of the relationships between variables had a positive or negative trend. Both were essential in determining and validating whether any of the variables suffered from multicollinearity. Knowing this is important because it determines which machine learning algorithms can or cannot be used if the author chooses not to handle the correlated variables individually. Finally, several violin plots were used to examine the distribution and density of each variable for both the passing and failing classes.
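
The sketch below reproduces these visualizations on a small synthetic frame standing in for the merged district data; the column names are illustrative.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(493)
df = pd.DataFrame({
    'payroll': rng.normal(100, 15, 200),
    'operations': rng.normal(50, 10, 200),
    'passing': rng.integers(0, 2, 200),     # 1 = performed well, 0 = did poorly
})

sns.heatmap(df.corr(method='pearson'), annot=True, cmap='coolwarm')  # correlation matrix
plt.show()

sns.pairplot(df, hue='passing')                     # pairwise trends by class
plt.show()

sns.violinplot(x='passing', y='payroll', data=df)   # distribution and density per class
plt.show()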

To complement the visualizations, the author implemented a few statistical tests to analyze the dataset: Levene’s test, the Shapiro-Wilk test, and the Mann-Whitney U test. Levene’s and Shapiro-Wilk were used to test for equality of variance and for normality, respectively, so the author could validate whether to use parametric or non-parametric methods of statistical testing. As a result, the Mann-Whitney U test was used to test for significance.
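
A hedged sketch of this test sequence, using synthetic samples in place of the actual expenditure variables, is shown below.

import numpy as np
from scipy.stats import levene, shapiro, mannwhitneyu

rng = np.random.default_rng(493)
high_perf = rng.lognormal(mean=3.0, sigma=0.4, size=80)   # stand-in: an expenditure variable, well-performing districts
low_perf = rng.lognormal(mean=2.8, sigma=0.6, size=80)    # stand-in: the same variable, poorly performing districts

_, p_levene = levene(high_perf, low_perf)        # equal variances?
_, p_shapiro = shapiro(high_perf)                # normally distributed?
_, p_mwu = mannwhitneyu(high_perf, low_perf)     # non-parametric test of significance
print(p_levene, p_shapiro, p_mwu)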

However, statistical testing does have its own limitations. For one, qualitative factors of the study are not considered because “statistical methods cannot study phenomena that cannot be expressed, captured, or measured in quantitative terms.” For example, this study considered the efficiency of districts while accounting for racial diversity, but one must be careful in assigning weights to the imbalance because it is not a black-and-white matter. One cannot simply assign a numerical value to racial disparity, and there is no clear-cut formula that defines racial inequity either.

In the author’s opinion, visualization and statistical testing were both needed. Visualization, although limited to eyeballing, can reveal hidden patterns in the data, while statistical testing can validate the assumptions made by eyeballing said data.

Results

F. Project Success

F1. Statistical Significance

Calculating the statistical significance of this project was a reasonably straightforward task. First, the author defined a formula for the measure of efficiency with respect to racial diversity, called E, where E = (B/(P/S))/X: B is the total operating program budget of the district, P is the total number of passing (meets or masters a subject) students, S is the total number of students taking the test, and X is the ratio of non-white to white student enrollment. Based on this formula, the author then calculated the first quartile of E and created two subsets of data based on that quartile. Districts at or below the quartile were those that had performed well on the STAAR test, and districts above the quartile were those that had performed poorly. The author then tested for a statistical difference between the two group samples. Since the p-value is below 0.05, the difference between the two groups was found to be statistically significant. With this, the author can reject the null hypothesis and conclude that the “Payroll” expenditure (and other variables, for that matter) of districts that performed well on the STAAR test is statistically different from that of districts that performed poorly.
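
A minimal, self-contained sketch of this calculation on synthetic district data (the column names are illustrative, not the actual PEIMS fields) is shown below.

import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(493)
districts = pd.DataFrame({
    'budget': rng.normal(50_000_000, 10_000_000, 300),    # B: total operating program budget
    'passing': rng.integers(500, 5_000, 300),             # P: students who meet or master
    'tested': rng.integers(5_000, 10_000, 300),           # S: students taking the test
    'ratio_nonwhite_white': rng.uniform(0.5, 4.0, 300),   # X: non-white to white enrollment ratio
    'payroll': rng.normal(30_000_000, 5_000_000, 300),
})

# E = (B / (P / S)) / X
districts['E'] = (districts['budget'] /
                  (districts['passing'] / districts['tested'])) / districts['ratio_nonwhite_white']

q1 = districts['E'].quantile(0.25)
performed_well = districts[districts['E'] <= q1]   # at or below the first quartile
did_poorly = districts[districts['E'] > q1]        # above the first quartile

_, p_value = mannwhitneyu(performed_well['payroll'], did_poorly['payroll'])
print(f'Mann-Whitney U p-value: {p_value:.4f}')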

F2. Practical Significance

The practical significance is that differences in certain expenditures can translate into a significant amount of money saved by each district. For example, if a school district decides to spend more on operations rather than on capital investments, the district may realize a greater return on that investment in the form of higher student performance scores on the STAAR test.

F3. Overall Success

The author believes this project was a success. All three criteria laid out in Task 2 were met. First, the formula for calculating the efficiency of a district is appropriate and applicable. Second, the model has accuracy and AUC scores of more than 70%. And third, the project lists the most important features for high-performing districts.

G. Key Takeaways

This project set out to create a Jupyter notebook containing a model that predicts whether a Texas public school district will perform well or poorly on the STAAR test. The model needed to be based on the weighted definition of efficiency and needed an accuracy or AUC score of more than 70% to be considered a success. The following table summarizes whether the Jupyter notebook accomplished each metric.

Criterion/Metric | Required Data | Success
Is the formula for calculating the efficiency appropriate and applicable? | E = (B/(P/S))/X | YES
Does the model have an accuracy and AUC scores of more than 70%? | Accuracy 90.93% / AUC 96% | YES
Does the project list the most important features for high-performing districts? | Bar graph | YES

The table above is a straightforward way to visualize and summarize the accomplishments of each objective.

The graph above shows the Receiver Operating Characteristic (ROC) curve, which visualizes how skilled the model is (blue line) compared to an unskilled model that predicts by chance (red line).

The graph below shows the most important features of the model and is a graphical way to convey how each variable affects the model based on its F score.

Based on the findings above, the author suggests that a study be conducted on how racial diversity in particular affects the efficiency of Texas public school districts, as it has been shown to influence the model. In addition, further study is needed to determine whether having a bilingual program correlates positively or negatively with performance.

Sources

Baron, E. J. (2021). School Spending and Student Outcomes: Evidence from Revenue Limit Elections in Wisconsin. American Economic Journal: Economic Policy, 14(1), 1-39. Retrieved from https://www.aeaweb.org/articles?id=10.1257/pol.20200226

Carhart, A. E. (2016). School Finance Decisions and Academic Performance: An Analysis of the Impacts of School Expenditures on Student Performance. Master's thesis. Retrieved from https://www.csus.edu/college/social-sciences-interdisciplinary-studies/public-policy-administration/_internal/_documents/thesis-bank/thesis-bank-2016-carhart.pdf

Texas Education Agency. (2022, July 31). PEIMS Financial Data Downloads. Retrieved from Texas Education Agency: https://tea.texas.gov/finance-and-grants/state-funding/state-funding-reports-and-data/peims-financial-data-downloads

Texas Education Agency. (2022, July 31). Texas Education Agency Data 2012-2019. Retrieved from Kaggle: https://www.kaggle.com/datasets/9e3ce42f60ded3ba2a6dd890993493f2c4b284c5cfa035d711bd98fa3359924c?resource=download

Thompson, J., Young, J. K., & Shelton, K. (2021). Evaluating Educational Efficiency in Texas Public Schools Utilizing Data Envelopment Analysis. School Leadership Review, 16(1), Article 9. Retrieved from https://scholarworks.sfasu.edu/slr/vol16/iss1/9

Using MapQuest API to Get Geo Data

A friendly tutorial on getting zip codes and other geographic data from street addresses.


Knowing how to deal with geographic data is a must-have for a data scientist. In this post, we will play around with the MapQuest Search API to get zip codes from street addresses along with their corresponding latitude and longitude to boot!

The Scenario

In 2019, my friends and I participated in CivTechSA Datathon. At one point in the competition, we wanted to visualize the data points and overlay them on San Antonio’s map. The problem is, we had incomplete data. Surprise! All we had were a street number and a street name — no zip code, no latitude, nor longitude. We then turned to the great internet for some help.

We found a great API by MapQuest that will give us exactly what we needed. With just a sprinkle of Python code, we were able to accomplish our goal.

Today, we’re going to walk through this process.

The Data

To follow along, you can download the data from here. Just scroll down to the bottom and tab on over to the Data Catalog 2019. Look for SAWS (San Antonio Water System).


Download the file by clicking on the link to the Excel file.


OR, you can click on this.

MapQuest API Key

Head on over to https://developer.mapquest.com/ and create an account to get a free API key.


Copy the ‘Consumer Key’ and keep it in a safe place. We’ll need it later.

Jupyter Notebook

Now, let’s fire up a Jupyter notebook and get coding!

For starters, let’s set up the environment by doing a couple of imports.
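
The embedded gist isn’t preserved here, so below is a minimal sketch of what the setup cell might look like; the API_KEY and BASE_URL names are placeholders rather than the original code.

# a couple of imports plus the API key and endpoint we'll reuse below
import pandas as pd
import requests

API_KEY = 'YOUR-MAPQUEST-CONSUMER-KEY'   # paste your Consumer Key here
BASE_URL = 'https://www.mapquestapi.com/geocoding/v1/address'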

Don’t forget to replace the API_KEY with your own key above.

Now, let’s read the Excel file with a simple df = pd.read_excel().


Next, we’ll combine the street number and street name columns.
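
The original gist is missing here as well, so here’s one hedged way to do it; the source column names ('Street Number' and 'Street Name') are assumptions about the SAWS spreadsheet, while street_address is the column used in the rest of this post.

# build a single address string from the two columns
df['street_address'] = df['Street Number'].astype(str) + ' ' + df['Street Name'].astype(str)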


The ALL CAPS hurts my eyes. Let’s do something about it:

df['street_address'] = df.street_address.str.title()

Below are two functions that call the API and return geo data.
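
The gist with the functions isn’t preserved, so here’s a hedged reconstruction: get_zip() is the name used later in this post, while get_latlng() is an assumed companion. Both simply parse the JSON that the MapQuest Geocoding API returns.

import requests

def get_zip(url):
    """Call the geocoding endpoint and return the postal code of the first match."""
    data = requests.get(url).json()
    return data['results'][0]['locations'][0]['postalCode']

def get_latlng(url):
    """Call the geocoding endpoint and return (lat, lng) of the first match."""
    latlng = requests.get(url).json()['results'][0]['locations'][0]['latLng']
    return latlng['lat'], latlng['lng']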

We can manually call it with the line below. Don’t forget to replace the ‘#####’ with your own API key. You can use any address you want (replace spaces with a + character).

get_zip('https://www.mapquestapi.com/geocoding/v1/address?key=####################&inFormat=kvp&outFormat=json&location=100+Military+Plaza&thumbMaps=false&delimiter=%2C')

But we’ve got many addresses, so we’ll use a loop to call the API repeatedly.
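
Here’s a sketch of that loop, reusing API_KEY, BASE_URL, and the helper functions from the sketches above; the list names are placeholders. Note that it calls the API twice per address, once for the zip code and once for the coordinates.

zips, lats, lngs = [], [], []
for address in df['street_address']:
    location = str(address).replace(' ', '+')          # swap spaces for '+'
    url = (f'{BASE_URL}?key={API_KEY}&inFormat=kvp&outFormat=json'
           f'&location={location}&thumbMaps=false&delimiter=%2C')
    zips.append(get_zip(url))
    lat, lng = get_latlng(url)
    lats.append(lat)
    lngs.append(lng)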

Let’s see what the result looks like:


Finally, let’s create a dataframe that will house the street addresses — complete with zip code, latitude, and longitude.
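
A minimal sketch of assembling the results, assuming the zips, lats, and lngs lists from the loop sketch above:

geo_df = pd.DataFrame({
    'street_address': df['street_address'],
    'zip_code': zips,
    'latitude': lats,
    'longitude': lngs,
})
geo_df.head()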

Voila! We’ve got ourselves geo data.


For extra credit, let’s import the data in Tableau and get a pretty spiffy visual:


And that’s it, folks!

You can find the jupyter notebook here.

Thanks for stopping by and reading my post. Hope it was useful 🙂

If you want to learn more about my journey from slacker to data scientist, check out my article “From Slacker to Data Scientist: My journey into data science without a degree” on towardsdatascience.com.

And if you’re thinking about switching gears and venturing into data science, start thinking about rebranding now: check out “The Slacker’s Guide to Rebranding Yourself as a Data Scientist” (opinionated advice for the rest of us; love of math, optional) on towardsdatascience.com.

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published in Towards Data Science’s Medium publication.

Exploring the Trump Twitter Archive with PyCaret

For adventurous beginners in NLP.


For this project, we’ll be using PyCaret:

PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.¹

PyCaret

PyCaret does a lot more than NLP. It also does a whole slew of both supervised and unsupervised ML including classification, regression, clustering, anomaly detection, and association rule mining.

To learn more, check out Moez Ali’s announcement.


Housekeeping

Let’s begin by installing PyCaret. Just do pip install pycaret and we are good to go! Note: PyCaret is a big library so you may want to go grab a cup of coffee while waiting for it to install.

Also, we need to download the English language model because it is not automatically downloaded with PyCaret:

python -m spacy download en_core_web_sm
python -m textblob.download_corpora

Getting the Data

Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.

import pandas as pd
from pycaret.nlp import *
df = pd.read_csv('trump_20200530.csv')

Let’s check the shape of our data first:

df.shape

And let’s take a quick look:

df.head()

For expediency, let’s sample only 1,000 tweets.

# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)
df.shape

Topic Modeling

The fun part!

nlp = setup(data = df, target = 'text', session_id = 493,
            custom_stopwords = ['rt', 'https', 'http', 'co', 'amp'])

PyCaret’s setup() function performs the following text-processing steps:

  1. Removing Numeric Characters
  2. Removing Special Characters
  3. Word Tokenization
  4. Stopword Removal
  5. Bigram Extraction
  6. Trigram Extraction
  7. Lemmatizing
  8. Custom Stopwords

And all in one line of code!

It takes in two parameters: the dataframe in data and the name of the text column that we want to pass in target. In our case, we also used the optional parameters session_id for reproducibility and custom_stopwords to reduce the noise coming from the tweets.

After all is said and done, we’ll get something similar to this:

In the next step, we’ll create the model and we’ll use ‘lda’:

lda = create_model('lda', num_topics = 6, multi_core = True)

Above, we created an ‘lda’ model and passed in the number of topics as 6 and set it so that the LDA will use all CPU cores available to parallelize and speed up training.

Finally, we’ll assign topic proportions to the rest of the dataset using assign_model().

lda_results = assign_model(lda)
lda_results.head()

Visualizing the Results

Let’s plot the overall frequency distribution of the entire corpus:

plot_model()

Now let’s extract the bigrams and trigrams for the entire corpus:

plot_model(plot = 'bigram')
plot_model(plot = 'trigram')

But what if we only want to extract the n-grams from a specific topic? Easy, we’ll just pass in the topic_num parameter.

plot_model(lda, plot = 'trigram', topic_num = 'Topic 1')

If we want the distribution of topics we’ll simply change it and specify it in the plot parameter.

plot_model(lda, plot = 'topic_distribution')

And that’s it!

We’ve successfully conducted topic modeling on President Trump’s tweets since taking office.

Bonus Round

Moez Ali wrote a great tutorial on using PyCaret in Power BI. Check it out.


Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] PyCaret. (June 4, 2020). Why PyCaret. https://pycaret.org/

This article was first published in Towards Data Science’s Medium publication.

Programming Environment Setup

I was bored over the weekend so I decided to restore my MacBook Pro to factory settings so that I could set up my programming environment the proper way. After all, what’s a data scientist without her toys?

Let’s start with a replacement for the default terminal and a pyenv installation to manage different Python versions.

Let’s move on to managing different Python interpreters and virtual environments using pyenv-virtualenv.

Into the Heart of Darkness - Pt. 2

Exploring the Trump Twitter Archive with spaCy.


In a previous post, we set out to explore the dataset provided by the Trump Twitter Archive. My initial goal was to do something fun by using a very interesting dataset. However, it didn’t quite turn out that way.

In this post, we’ll continue our journey, but this time we’ll be using spaCy.


For this project, we’ll be using pandas for data manipulation, spaCy for natural language processing, and joblib to speed things up.

Let’s get started by firing up a Jupyter notebook!

Housekeeping

Let’s import pandas and also set the display options so Jupyter won’t truncate our columns and rows. Let’s also set a random seed for reproducibility.

# for manipulating data
import pandas as pd
# setting the random seed for reproducibility
import random
random.seed(493)
# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)

Getting the Data

Let’s read the data into a dataframe. If you want to follow along, you can download the cleaned dataset here along with the file for stop words¹. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.

df = pd.read_csv('trump_20200530_clean.csv', parse_dates=True, index_col='datetime')

Let’s take a quick look at the data.

df.head()
df.info()

Using spaCy

Now let’s import spaCy and begin natural language processing.

# for natural language processing: named entity recognition
import spacy
import en_core_web_sm

We’re only going to use spaCy’s ner functionality or named-entity recognition so we’ll disable the rest of the functionalities. This will save us a lot of loading time later.

nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser', 'textcat'])

Now let’s load the contents of the stopwords file into the variable stopwords. Note that we converted the list into a set to also save some processing time later.

with open('twitter-stopwords — TA — Less.txt') as f:
    contents = f.read().split(',')
stopwords = set(contents)

Next, we’ll import joblib and define a few functions to help with parallel processing.

from joblib import Parallel, delayed

def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    "Flatten a list of lists to a combined list"
    return [item for sublist in list_of_lists for item in sublist]

def process_chunk(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append([(ent.text) for ent in doc.ents if ent.label_ in ['NORP', 'PERSON', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT']])
    return preproc_pipe

def preprocess_parallel(texts, chunksize=100):
    executor = Parallel(n_jobs=7, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(texts, len(df), chunksize=chunksize))
    result = executor(tasks)
    return flatten(result)

In the code above², the function preprocess_parallel executes the other function process_chunk in parallel to help with speed. The function process_chunk iterates through a series of texts — in our case, the 'tweet' column of our df dataframe — and inspects each entity to see whether it belongs to NORP, PERSON, FAC, ORG, GPE, LOC, PRODUCT, or EVENT. If it does, the entity is appended to preproc_pipe and subsequently returned to its caller. Prashanth Rao has a very good article on making spaCy super fast.

Let’s call the main driver for the functions now.

df['entities'] = preprocess_parallel(df['tweet'], chunksize=1000)

Doing a quick df.head() will reveal the new column 'entities' that we added earlier to hold the entities found in the 'tweet' column.

Prettifying the Results

In the code below, we’re making a list of lists called 'entities' and then flattening it for easier processing. We’re also converting it into a set called 'entities_set'.

entities = [entity for entity in df.entities if entity != []]
entities = [item for sublist in entities for item in sublist]
entities_set = set(entities)

Next, let’s count the frequency of the entities and convert the top results into a dataframe, df_counts.

df_counts = pd.Series(entities).value_counts()[:20].to_frame().reset_index()
df_counts.columns=['entity', 'count']
df_counts

For this step, we’re going to reinitialize an empty list entity_counts and manually construct a list of tuples with a combined set of entities and the sum of their frequencies or count.

entity_counts = []

entity_counts.append(('Democrats', df_counts.loc[df_counts.entity.isin(['Democrats', 'Dems', 'Democrat'])]['count'].sum()))
entity_counts.append(('Americans', df_counts.loc[df_counts.entity.isin(['American', 'Americans'])]['count'].sum()))
entity_counts.append(('Congress', df_counts.loc[df_counts.entity.isin(['House', 'Senate', 'Congress'])]['count'].sum()))
entity_counts.append(('America', df_counts.loc[df_counts.entity.isin(['U.S.', 'the United States', 'America'])]['count'].sum()))
entity_counts.append(('Republicans', df_counts.loc[df_counts.entity.isin(['Republican', 'Republicans'])]['count'].sum()))

entity_counts.append(('China', 533))
entity_counts.append(('FBI', 316))
entity_counts.append(('Russia', 313))
entity_counts.append(('Fake News', 248))
entity_counts.append(('Mexico', 213))
entity_counts.append(('Obama', 176))

Let’s take a quick look before continuing.

Finally, let’s convert the list of tuples into a dataframe.

df_ner = pd.DataFrame(entity_counts, columns=["entity", "count"]).sort_values('count', ascending=False).reset_index(drop=True)

And that’s it!

We’ve successfully created a ranking of the named entities that President Trump most frequently talked about in his tweets since taking office.


Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.

In the next post, we shall continue our journey into the heart of darkness and do some topic-modeling using LDA.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] GONG Wei’s Homepage. (May 30, 2020). Stop words for tweets. https://sites.google.com/site/iamgongwei/home/sw

[2] Towards Data Science. (May 30, 2020). Turbo-charge your spaCy NLP pipeline. https://towardsdatascience.com/turbo-charge-your-spacy-nlp-pipeline-551435b664ad

Into the Heart of Darkness - Pt. 1

Exploring the Trump Twitter Archive with Python. For beginners.


In this post, we’ll explore the dataset provided by the Trump Twitter Archive. My goal was to do something fun by using a very interesting dataset. However, as it turned out, exposure to Trump’s narcissism and shenanigans was quite depressing — if not traumatic.

You’ve been warned!


For this project, we’ll be using pandas and numpy for data manipulation, matplotlib for visualizations, datetime for working with timestamps, unicodedata and regex for processing strings, and finally, nltk for natural language processing.

Let’s get started by firing up a Jupyter notebook!

Environment

We’re going to import pandas and matplotlib, and also set the display options for Jupyter so that the rows and columns are not truncated.

# for manipulating data
import pandas as pd
import numpy as np
# for visualizations
%matplotlib inline
import matplotlib.pyplot as plt
# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)

Getting the Data

Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.

df = pd.read_csv('trump_20200530.csv')

Let’s look at the first five rows and see the number of records (rows) and fields (columns).

df.head()
df.shape

Let’s do a quick renaming of the columns to make it easier for us later.

df.columns=['source', 'tweet', 'date_time', 'retweets', 'favorites', 'is_retweet', 'id']

Let’s drop the id column since it’s not really relevant right now.

df = df.drop(columns=['id'])

Let’s do a quick sanity check, this time let’s also check the dtypes of the columns.

df.head()
df.info()

Working with Timestamps

We can see from the previous screenshot that the ‘date_time’ column is a string. Let’s parse it to a timestamp.

# for working with timestamps
from datetime import datetime
from dateutil.parser import parse

dt = []
for ts in df.date_time:
    dt.append(parse(ts))
dt[:5]

Let’s add a ‘datetime’ column that contains the timestamp information.

df['datetime'] = df.apply(lambda row: parse(row.date_time), axis=1)

Let’s double-check the date range of our dataset.

df.datetime.min()
df.datetime.max()

Trimming the Data

Let’s see how many sources there are for the tweets.

df.source.value_counts()

Let’s only keep the ones that were made using the ‘Twitter for iPhone’ app.

df = df.loc[df.source == 'Twitter for iPhone']

We should drop the old ‘date_time’ column and the ‘source’ column as well.

df = df.drop(columns=['date_time', 'source'])

Separating the Retweets

Let’s see how many are retweets.

df.is_retweet.value_counts()

Let’s make another dataframe that contains only retweets and drop the ‘is_retweet’ column.

df_retweets = df.loc[df.is_retweet == True]
df_retweets = df_retweets.drop(columns=['is_retweet'])

Sanity check:

df_retweets.head()
df_retweets.shape

Back on the original dataframe, let’s remove the retweets from the dataset and drop the ‘is_retweet’ column altogether.

df = df.loc[df.is_retweet == False]
df = df.drop(columns=['is_retweet'])

Another sanity check:

df.head()
df.shape

Exploring the Data

Let’s explore both of the dataframes and answer a few questions.

What time does the President tweet the most? What time does he tweet the least?

The graph below shows that the President most frequently tweets around 12pm. He also tweets the least around 8am. Maybe he’s not a morning person?

title = 'Number of Tweets by Hour'
df.tweet.groupby(df.datetime.dt.hour).count().plot(figsize=(12,8), fontsize=14, kind='bar', rot=0, title=title)
plt.xlabel('Hour')
plt.ylabel('Number of Tweets')

What day does the President tweet the most? What day does he tweet the least?

The graph below shows that the President most frequently tweets on Wednesday. He also tweets the least on Thursday.

title = 'Number of Tweets by Day of the Week'
df.tweet.groupby(df.datetime.dt.dayofweek).count().plot(figsize=(12,8), fontsize=14, kind='bar', rot=0, title=title)
plt.xlabel('Day of the Week')
plt.ylabel('Number of Tweets')
plt.xticks(np.arange(7),['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])

Isolating Twitter Handles from the Retweets

Let’s import regex so we can use it to parse the text and isolate the Twitter handles of the original tweets. In the code below, we add another column that contains the Twitter handle.

import re
pattern = re.compile('(?<=RT @).*?(?=:)')
df_retweets['original'] = [re.search(pattern, tweet).group(0) for tweet in df_retweets.tweet]

Let’s create another dataframe that will contain only the original Twitter handles and their associated number of retweets.

df_originals = df_retweets.groupby(['original']).sum().sort_values('retweets').reset_index().sort_values('retweets', ascending=False)

Let’s check the data real quick:

df_originals.head()
df_originals.shape

Let’s visualize the results real quick so we can get an idea if the data is disproportionate or not.

df_originals = df_retweets.groupby(['original']).sum().sort_values('retweets').reset_index().sort_values('retweets', ascending=False)[:10].sort_values('retweets')
df_originals.plot.barh(x='original', y='retweets', figsize=(16,10), fontsize=16)
plt.xlabel("Originating Tweet's Username")
plt.xticks([])

Which Twitter user does the President like to retweet the most?

The graph below shows that the President likes to retweet the tweets from ‘@realDonaldTrump’. Does this mean the president likes to retweet himself? You don’t say!

The interesting handle on this one is ‘@charliekirk11’. Charlie Kirk is the founder of Turning Point USA, which CBS News described as a far-right organization that is “shunned or at least ignored by more established conservative groups in Washington, but embraced by many Trump supporters”.¹

The Top 5 Retweets

Let’s look at the top 5 tweets that were retweeted the most by others based on the original Twitter handle.

Let’s start with the ones with ‘@realDonaldTrump’.

df_retweets.loc[df_retweets.original == 'realDonaldTrump'].sort_values('retweets', ascending=False)[:5]

And another one with ‘@charliekirk11’.

df_retweets.loc[df_retweets.original == 'charliekirk11'].sort_values('retweets', ascending=False)[:5]

Examining Retweets’ Favorites count

Let’s find out how many of the retweets are favorited by others.

df_retweets.favorites.value_counts()

Surprisingly, none of the retweets seemed to have been favorited by anybody. Weird.

We should drop it.
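
A quick one-liner takes care of that, assuming we simply discard the column:

df_retweets = df_retweets.drop(columns=['favorites'])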

Counting N-Grams

To do some n-gram ranking, we need to import unicodedata and nltk. We also need to specify additional stopwords that we may need to exclude from our analysis.

# for cleaning and natural language processing
import unicodedata
import nltk
# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['rt']

Here are a few of my favorite functions for natural language processing:

def clean(text):
  """
  A simple function to clean up the data. All the words that
  are not designated as a stop word is then lemmatized after
  encoding and basic regex parsing are performed.
  """
  wnl = nltk.stem.WordNetLemmatizer()
  stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
  text = (unicodedata.normalize('NFKD', text)
    .encode('ascii', 'ignore')
    .decode('utf-8', 'ignore')
    .lower())
  words = re.sub(r'[^\w\s]', '', text).split()
  return [wnl.lemmatize(word) for word in words if word not in stopwords]

def get_words(df, column):
    """
    Takes in a dataframe and columns and returns a list of
    words from the values in the specified column.
    """
    return clean(''.join(str(df[column].tolist())))

def get_bigrams(df, column):
    """
    Takes in a list of words and returns a series of
    bigrams with value counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column), 2)).value_counts())[:10]

def get_trigrams(df, column):
    """
    Takes in a list of words and returns a series of
    trigrams with value counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column), 3)).value_counts())[:10]

def viz_bigrams(df, column):
    get_bigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

    plt.title('10 Most Frequently Occurring Bigrams')
    plt.ylabel('Bigram')
    plt.xlabel('# Occurrences')

def viz_trigrams(df, column):
    get_trigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

    plt.title('10 Most Frequently Occurring Trigrams')
    plt.ylabel('Trigram')
    plt.xlabel('# Occurrences')
 

Let’s look at the top 10 bigrams in the df dataframe using the ‘tweet’ column.

get_bigrams(df, 'tweet')

And now, for the top 10 trigrams:
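
get_trigrams(df, 'tweet')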

Let’s use the viz_bigrams() function and visualize the bigrams.

viz_bigrams(df, 'tweet')

Similarly, let’s use the viz_trigrams() function and visualize the trigrams.

viz_trigrams(df, 'tweet')

And there we have it!

From the moment that Trump took office, we can confidently say that the “fake news media” has been on top of the president’s mind.

Conclusion

Using basic Python and the nltk library, we’ve explored the dataset from the Trump Twitter Archive and did some n-gram ranking out of it.


Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.

In the next post, we shall continue our journey into the heart of darkness and use spaCy to extract named-entities from the same dataset.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] CBS News. “Trump speaks to conservative group Turning Point USA”. www.cbsnews.com. Archived from the original on July 31, 2019. Retrieved August 5, 2019.

Populating a Network Graph with Named-Entities

An early attempt of using networkx to visualize the results of natural language processing.


I do a lot of natural language processing and usually, the results are pretty boring to the eye. When I learned about network graphs, it got me thinking, why not use keywords as nodes and connect them together to create a network graph?

Yupp, why not!

In this post, we’ll do exactly that. We’re going to extract named-entities from news articles about coronavirus and then use their relationships to connect them together in a network graph.


A Brief Introduction

Network graphs are a cool visual that contains nodes (vertices) and edges (lines). It’s often used in social network analysis and network analysis but data scientists also use it for natural language processing.

Photo by Anders Sandberg on Flickr

Natural Language Processing or NLP is a branch of artificial intelligence that deals with programming computers to process and analyze large volumes of text and derive meaning out of them.¹ In other words, it’s all about teaching computers how to understand human language… like a boss!

Photo by brewbooks on Flickr

Enough introduction, let’s get to coding!


To get started, let’s make sure to take care of all dependencies. Open up a terminal and execute the following commands:

pip install -U spacy
python -m spacy download en
pip install networkx
pip install fuzzywuzzy

This will install spaCy and download the trained model for English. The third command installs networkx. This should work for most systems. If it doesn’t work for you, check out the documentation for spaCy and networkx. Also, we’re using fuzzywuzzy for some text preprocessing.

With that out of the way, let’s fire up a Jupyter notebook and get started!


Imports

Run the following code block into a cell to get all the necessary imports into our Python environment.

import pandas as pd
import numpy as np
import pickle
from operator import itemgetter
from fuzzywuzzy import process, fuzz

# for natural language processing
import spacy
import en_core_web_sm

# for visualizations
%matplotlib inline
from matplotlib.pyplot import figure
import networkx as nx

Getting the Data

If you want to follow along, you can download the sample dataset here. The file was created using newspaper to import news articles from npr.org. If you’re feeling adventurous, use the code snippet below to build your own dataset.

import requests
import json
import time
import newspaper
import pickle

npr = newspaper.build('https://www.npr.org/sections/coronavirus-live-updates')

corpus = []
count = 0
for article in npr.articles:
    time.sleep(1)
    article.download()
    article.parse()
    text = article.text
    corpus.append(text)
    if count % 10 == 0 and count != 0:
        print('Obtained {} articles'.format(count))
    count += 1

corpus300 = corpus[:300]

with open("npr_coronavirus.txt", "wb") as fp:   # Pickling
    pickle.dump(corpus300, fp)

# with open("npr_coronavirus.txt", "rb") as fp:   # Unpickling
#     corpus = pickle.load(fp)

Let’s get our data.

with open('npr_coronavirus.txt', 'rb') as fp:   # Unpickling
    corpus = pickle.load(fp)

Extract Entities

Next, we’ll start by loading spaCy’s English model:

nlp = en_core_web_sm.load()

Then, we’ll extract the entities:

entities = []
for article in corpus[:50]:
    tokens = nlp(''.join(article))
    gpe_list = []
    for ent in tokens.ents:
        if ent.label_ == 'GPE':
            gpe_list.append(ent.text)
    entities.append(gpe_list)

In the above code block, we created an empty list called entities to store a list of lists that contains the extracted entities from each of the articles. In the for-loop, we looped through the first 50 articles of the corpus. For each iteration, we converted each article into tokens (words) and then looped through those words to get the entities labeled as GPE, which covers countries, states, and cities. We used ent.text to extract the actual entity and appended them one by one to entities.

Here’s the result:

Note that North Carolina has several variations of its name and some have “the” prefixed in their names. Let’s get rid of them.

articles = []
for entity_list in entities:
    cleaned_entity_list = []
    for entity in entity_list:
        cleaned_entity_list.append(entity.lstrip('the ').replace("'s", "").replace("’s", ""))
    articles.append(cleaned_entity_list)

In the code block above, we’re simply traversing the list of lists articles and cleaning the entities one by one. With each iteration, we’re stripping the prefix “the” and getting rid of 's.

Optional: FuzzyWuzzy

Looking at the entities, I’ve noticed that there are also variations in how the “United States” is represented. Some appear as “United States of America” while others are just “United States”. We can trim these down to a more standard naming convention.

FuzzyWuzzy can help with this.

Described by pypi.org as “string matching like a boss,” FuzzyWuzzy uses Levenshtein distance to calculate the similarities between words.¹ For a really good tutorial on how to use FuzzyWuzzy, check out Thanh Huynh’s article “FuzzyWuzzy: Find Similar Strings within One Column in Python” on towardsdatascience.com.

Here’s the optional code for using FuzzyWuzzy:

choices = set([item for sublist in articles for item in sublist])

cleaned_articles = []
for article in articles:
    article_entities = []
    for entity in set(article):
        article_entities.append(process.extractOne(entity, choices)[0])
    cleaned_articles.append(article_entities)

For the final step before creating the network graph, let’s get rid of the empty lists within our list of lists, which were generated by articles that didn’t have any GPE entity types.

articles = [article for article in articles if article != []]

Create the Network Graph

For the next step, we’ll create the world in which the graph will exist.

G = nx.Graph()

Then, we’ll manually add the nodes with G.add_nodes_from().

for entities in articles:
    G.add_nodes_from(entities)

Let’s see what the graph looks like with:

figure(figsize=(10, 8))
nx.draw(G, node_size=15)

Next, let’s add the edges that will connect the nodes.

for entities in articles:
    if len(entities) > 1:
        for i in range(len(entities)-1):
            G.add_edges_from([(str(entities[i]), str(entities[i+1]))])

For each iteration of the code above, we used a conditional that will only entertain a list of entities that has two or more entities. Then, we manually connect each of the entities with G.add_edges_from().

Let’s see what the graph looks like now:

figure(figsize=(10, 8))
nx.draw(G, node_size=10)

This graph reminds me of spiders! LOL.

To organize it a bit, I decided to use the shell version of the network graph:

figure(figsize=(10, 8))
nx.draw_shell(G, node_size=15)

We can tell that some nodes are heavier on connections than others. To see which nodes have the most connections, let’s use G.degree().

G.degree()

This gives the following degree view:

Let’s find out which node or entity has the most number of connections.

max(dict(G.degree()).items(), key = lambda x : x[1])

To find out which other nodes have the most number of connections, let’s check the top 5:

degree_dict = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degree_dict, 'degree')
sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)

Above, sorted_degree is a list that contains all the nodes and their degree values. We only want the top 5, like so:

print("Top 5 nodes by degree:")
for d in sorted_degree[:5]:
print(d)

Bonus Round: Gephi

Gephi is an open-source and free desktop application that lets us visualize, explore, and analyze all kinds of graphs and networks.²

Let’s export our graph data into a file so we can import it into Gephi.

nx.write_gexf(G, "npr_coronavirus_GPE_50.gexf")

Cool beans!

Next Steps

This time, we only processed 50 articles from npr.org. What would happen if we processed all 300 articles from our dataset? What will we see if we change the entity type from GPE to PERSON? How else can we use network graphs to visualize natural language processing results?

There’s always more to do. The possibilities are endless!


I hope you enjoyed today’s post. The code is not perfect and we have a long way to go towards realizing insights from the data. I encourage you to dive deeper and learn more about spaCy, networkx, fuzzywuzzy, and even Gephi.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Wikipedia. (May 25, 2020). Natural language processing https://en.wikipedia.org/wiki/Natural_language_processing

[2]: Gephi. (May 25, 2020). The Open Graph Viz Platform https://gephi.org/

This article was first published in Towards Data Science’s Medium publication.