Analysis of Texas Public School Spending and STAAR Performance

Project Overview

A1. Research Question

The research question this project sought to answer is whether we can predict if a Texas public school district will perform well or poorly on the STAAR test based on how it allocates its budget, with consideration for how racially diverse the district is. The project also sought to identify which financial expenditure items are important predictors of whether a district will perform well or poorly on the STAAR test. The purpose of this analysis is to provide counsel to school administrators and policymakers so that they can make data-driven decisions instead of relying on anecdotal evidence alone.

A2. Project Scope

This project’s scope was to use a Jupyter notebook to build a model that predicts whether a Texas public school district will perform well or poorly on the STAAR test. The model takes several financial expenditure inputs and outputs a prediction of the district’s STAAR performance. The project examined the 2018-2019 academic year.

A3. Solution Overview – Tools

The Jupyter notebook that contains the model was written in the Python programming language and relied on several key inputs from the end user. The first input is a csv file, Enrollment Report_Statewide_Districts_Grade_Ethnicity_2018-2019 (Texas Education Agency, 2019), which contains demographic information on the white and non-white populations of each district. The second input is the pair of files tidy_campstaar1_2012to2019 and tidy_campstaar2_2012to2019 (Texas Education Agency, 2022), which contain counts of how many students in each district performed well (meets or masters a subject) or poorly (approaches comprehension of a subject). The third input is 2007-2021-summaried-peims-financial-data (Texas Education Agency, 2022), which contains information on how each district spent its budget. The notebook then automatically cleansed and transformed the data to make it suitable for analysis and modeling, and finally constructed a model that predicted whether a district would perform well or poorly on the STAAR test.

A4. Solution Overview – Methodologies

There were four different types of methodologies used in this project: Project, Data Collection, Analytical, and Statistical. Each methodology played an important role in the planning, execution, and verification of the project.

Project Plan

In this project, the author executed the plan without change. All goals, objectives, and deliverables listed below were completed as described in Task 2, apart from the dates or time frame for each task. The timeline projected in Task 2 was very conservative, and the actual project was completed earlier than planned.

B1. Project Execution

The goal for this project was to create a model that would predict how a district would perform on the STAAR test. The objectives of this goal were to:

  • Determine which districts are performing or underperforming in relation to how much funding they receive, how racially diverse a district is, and how well the students in the district perform on the STAAR test overall.
  • Determine which types of expenditure contribute to the weighted performance of the district.
  • Predict how a district will perform based on how it allocates its budget.

B2. Project Planning Methodology

The project planning methodology used was PAPEM-DM (De Dios, 2020). The seven steps involved in this method are: planning, acquisition, preparation, exploration, modeling, delivery, and maintenance.

  1. Project Planning – answered the why of the project, addressed the how, and delivered a list of research questions to answer along with the criteria for success.
  2. Acquisition of Data – acquired the data from the TEA and Kaggle websites.
  3. Preparation of Data – prepared the data, which included cleaning it and transforming it into a format suitable for analysis and modeling.
  4. Exploratory Data Analysis – explored the data using descriptive statistics and visualized the relationships between variables using matplotlib and seaborn. The deliverables of this phase were a few graphs ready for presentation to stakeholders.
  5. Modeling – created, fitted, and used the model for training and inference. The deliverable was a pickle file containing the trained model (see the sketch after this list).
  6. Delivery of Results – delivered the results by completing Task 3 of this capstone project.
  7. Maintenance – while important, this phase was not included in the project. However, the Jupyter notebook can be used as a basis for deploying the model to production (for inference) on any of the major cloud platforms, such as AWS, Azure, or GCP.
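Because the pickle-file deliverable comes up again in B3 below, here is a minimal, hypothetical sketch of how a trained model can be persisted and reloaded. The estimator, file name, and toy data are illustrative assumptions, not the project's actual model.

import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Toy data standing in for the district expenditure features and pass/fail labels
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

model = GradientBoostingClassifier(random_state=42)
model.fit(X, y)

# Persist the trained model as the pickle-file deliverable
with open('district_staar_model.pkl', 'wb') as f:
    pickle.dump(model, f)

# Reload the model later for inference
with open('district_staar_model.pkl', 'rb') as f:
    loaded_model = pickle.load(f)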

B3. Project Timeline and Milestones

Phase       | Description                  | Deliverable    | Start | End  | Duration
Planning    | Project planning             | Project plan   | 8/1   | 8/3  | 7 days
Acquisition | Getting the data             | Raw datasets   | 8/4   | 8/4  | 1 day
Preparation | Cleansing and transformation | Clean datasets | 8/5   | 8/21 | 13 days
Exploration | EDA                          | Visualizations | 8/5   | 8/21 | 13 days
Modeling    | Train and infer              | Pickle file    | 8/5   | 8/21 | 13 days
Delivery    | Task 3 report                | PDF document   | 8/22  | 8/28 | 7 days
Maintenance | Deployment to production     | Not applicable | N/A   | N/A  | N/A

Methodology

C. Data Selection and Collection Process

The author downloaded two of the datasets from the TEA website. While the third dataset could also have been downloaded from the same place, the author chose to use the Kaggle version of the data because it had already been collected in a format that did not necessitate web scraping to mine the information from the website.

Data selection and collection did not vary much from the proposed plan. The only addition the author made was the ethnicity data, downloaded from the same TEA website. This data was necessary to examine and analyze the effect of racial diversity within the districts.

One obstacle the author encountered involved that same added dataset. The author encountered several errors when reading the csv file and realized that the formatting of the csv itself was off. Adjusting the read_csv parameter “skiprows” allowed the author to skip the extra header rows so that only the data needed for the analysis was read.
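A minimal sketch of that workaround is shown below; the file name comes from the enrollment report named in section A3, and the skiprows value is an assumption for illustration.

import pandas as pd

# Skip the report banner rows above the actual column headers;
# the number of rows to skip depends on the downloaded file.
enrollment = pd.read_csv(
    'Enrollment Report_Statewide_Districts_Grade_Ethnicity_2018-2019.csv',
    skiprows=4)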

There were no real data governance issues because the datasets are already in the public domain. The one concern would have been the anonymity of students in schools or districts so small that masking their identities would have been impossible. This issue was already taken care of by the Texas Education Agency before the datasets were published.

For example, for districts with small subpopulations, the numerator values have been coded as -999. The author simply had to replace those values with 0 for the dataframe to be processed cleanly. Another example of the datasets’ cleanliness is the fact that there were no missing values in any of the columns.
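A minimal sketch of that replacement on a toy frame (the column names are illustrative, not the actual dataset's fields):

import pandas as pd

# Toy frame standing in for the STAAR counts; -999 marks masked small groups
staar = pd.DataFrame({'district': ['A', 'B'], 'meets_count': [150, -999]})

# Replace the -999 placeholder with 0 so the counts aggregate cleanly
staar = staar.replace(-999, 0)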

Cleanliness, therefore, was the obvious advantage of the datasets used. Their limitation is that, when merged, the combined dataset covers only the 2018-2019 school year. Since this was by design, it is not a big enough limitation to warrant much concern.

D. Data Extraction and Preparation Process

The author utilized pandas to read the csv files into dataframes. The read_csv() method was appropriate because the data is in a tabular, comma-delimited format.

For data preparation, or data wrangling, the author used many techniques to handle missing and duplicate data and to remove extraneous columns and rows. These techniques included replacing null placeholder values with true numeric NaN values, calling drop_duplicates() and drop(), and using string methods to clean up and standardize the column names. The author also used loc, iloc, and np.where to filter columns and rows. Finally, for merging the datasets, the author used a combination of the concat() and merge() methods to execute inner and outer joins. Utilizing pandas methods made sense because of their efficiency, versatility, and simplicity. The sketch below illustrates these steps.
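The following is a minimal sketch of that preparation on toy data; the column names and values are assumptions for illustration, not the actual datasets' fields.

import numpy as np
import pandas as pd

# Toy frames standing in for the financial and demographic data
fin = pd.DataFrame({'District Number': [1, 2, 2], 'Payroll ': [10.0, 12.5, 12.5]})
demo = pd.DataFrame({'District Number': [1, 2], 'Pct Non-White': [0.55, 0.30]})

# Standardize column names and drop exact duplicates
fin.columns = fin.columns.str.strip().str.lower().str.replace(' ', '_')
demo.columns = demo.columns.str.strip().str.lower().str.replace(' ', '_')
fin = fin.drop_duplicates()

# Flag majority non-white districts with np.where
demo['majority_nonwhite'] = np.where(demo['pct_non-white'] > 0.5, 1, 0)

# Inner join the two frames on the district key
merged = fin.merge(demo, on='district_number', how='inner')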

E. Data Analysis Process

For exploratory data analysis, the author used several visualizations to examine the relationships between variables. For example, a heatmap was used to visualize the correlation coefficients of a Pearson correlation matrix, and a pairplot was used to examine whether any of the relationships between variables showed a positive or negative trend. Both were essential in determining whether any of the variables suffered from multicollinearity. Knowing this is important because it determines which machine learning algorithms can or cannot be used if the author chooses not to handle the correlated variables individually. Finally, several violin plots were used to examine the distribution and density of each variable for both the passing and failing classes.
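A minimal sketch of these plots on a toy district-level frame follows; the column names and values are illustrative, not the project's actual variables.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Toy frame standing in for the merged district-level dataset
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'payroll': rng.normal(100, 20, 100),
    'operations': rng.normal(50, 10, 100),
    'capital': rng.normal(30, 8, 100),
    'performance': rng.integers(0, 2, 100),  # 1 = passing class, 0 = failing class
})

# Correlation heatmap (Pearson) to spot multicollinearity
sns.heatmap(df.drop(columns='performance').corr(method='pearson'),
            annot=True, cmap='coolwarm')
plt.show()

# Pairplot to eyeball positive or negative trends between variables
sns.pairplot(df, hue='performance')
plt.show()

# Violin plot of one expenditure variable split by class
sns.violinplot(data=df, x='performance', y='payroll')
plt.show()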

To complement the visualizations, the author implemented a few statistical tests to analyze the dataset: Levene’s test, the Shapiro-Wilk test, and the Mann-Whitney U test. Levene’s and Shapiro-Wilk were used to test for equality of variances and for normality, respectively, in order to determine whether parametric or non-parametric methods of statistical testing were appropriate. As a result, the Mann-Whitney U test was used to test for significance.
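A minimal sketch of these three tests with SciPy, run on toy samples rather than the project's data:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Toy samples standing in for one expenditure variable in the two groups
well = rng.normal(100, 20, 80)   # districts that performed well
poor = rng.normal(110, 25, 80)   # districts that performed poorly

levene_stat, levene_p = stats.levene(well, poor)    # equal variances?
shapiro_stat, shapiro_p = stats.shapiro(well)       # normally distributed?
u_stat, u_p = stats.mannwhitneyu(well, poor)        # non-parametric significance test
print(levene_p, shapiro_p, u_p)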

However, statistical testing has its limitations. For one, qualitative factors are not considered, because statistical methods cannot study phenomena that cannot be expressed, captured, or measured in quantitative terms. For example, this study considered the efficiency of districts while accounting for racial diversity, but one must be careful in assigning weights to the imbalance because it is not a simple black-and-white matter. One cannot simply assign a numerical value to racial disparity, and there is no clear-cut formula that defines racial inequity either.

In the author’s opinion, visualization and statistical testing were both needed. Visualization, although limited to eyeballing, can reveal hidden patterns in the data, while statistical testing can validate the assumptions made by eyeballing that data.

Results

F. Project Success

F1. Statistical Significance

Calculating the statistical significance of this project was a reasonably straightforward task. First, the author defined a formula for the measure of efficiency with respect to racial diversity: E = (B/(P/S))/X, where E is efficiency, B is the total operating program budget of the district, P is the total number of passing (meets or masters a subject) students, S is the total number of students taking the test, and X is the ratio of non-white to white student enrollment. Based on this formula, the author calculated the first quartile of E and created two subsets of data based on that quartile. Districts at or below the quartile were those that performed well on the STAAR test, and districts above the quartile were those that did poorly. The author then tested for a statistical difference between the two group samples. Since the p-value was below 0.05, the difference between the two groups was found to be statistically significant. With this, the author can reject the null hypothesis and accept the alternative hypothesis that the “Payroll” expenditure (and other variables as well) of districts that performed well on the STAAR test is statistically different from that of districts that performed poorly.
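The sketch below illustrates the calculation on made-up district values, purely for illustration; the actual analysis used the merged TEA data.

import pandas as pd

# Toy district-level values; B, P, S, X follow the definitions in the text
districts = pd.DataFrame({
    'B': [5_000_000, 7_500_000, 4_200_000, 9_100_000],  # total operating budget
    'P': [1200, 900, 1500, 800],                         # students meeting/mastering
    'S': [2000, 1800, 2100, 1900],                       # students tested
    'X': [1.4, 0.8, 2.1, 1.1],                           # non-white to white ratio
})

# E = (B / (P / S)) / X
districts['E'] = (districts['B'] / (districts['P'] / districts['S'])) / districts['X']

# Split on the first quartile of E: at or below the quartile = performed well
q1 = districts['E'].quantile(0.25)
well = districts.loc[districts['E'] <= q1]
poor = districts.loc[districts['E'] > q1]

# With the real data, each expenditure column of the two groups would then be
# compared with a Mann-Whitney U test, as in the sketch in section E above.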

F2. Practical Significance

The practical significance is that differences in certain expenditures can translate into a significant amount of money saved by each district. For example, if a school district decides to spend more on operations rather than on capital investments, the district would realize a greater return on that investment in terms of higher student performance scores on the STAAR test.

F3. Overall Success

The author believes this project was a success. All three criteria laid out in Task 2 were met. First, the formula for calculating the efficiency of a district is appropriate and applicable. Second, the model has accuracy and AUC scores of more than 70%. And third, the project lists the most important features for high-performing districts.

G. Key Takeaways

This project set out to create a Jupyter notebook containing a model that predicts whether a Texas public school district will perform well or poorly on the STAAR test. The model needed to be based on the weighted definition of efficiency and needed an accuracy or AUC score of more than 70% to be considered a success. The following table summarizes whether the Jupyter notebook accomplished these metrics.

Criterion/Metric                                                                 | Required Data             | Success
Is the formula for calculating the efficiency appropriate and applicable?        | E = (B/(P/S))/X           | YES
Does the model have accuracy and AUC scores of more than 70%?                    | Accuracy 90.93% / AUC 96% | YES
Does the project list the most important features for high-performing districts? | Bar graph                 | YES

The table above is a straightforward way to visualize and summarize the accomplishments of each objective.

The graph above shows the Receiver Operating Characteristic (ROC) curve, which visualizes how skilled the model is (blue line) compared to an unskilled, by-chance model (red line).
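For readers who want to reproduce such a chart, here is a hedged sketch using scikit-learn on toy data; the estimator and data are assumptions, not necessarily the model used in this project.

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import RocCurveDisplay, roc_auc_score
from sklearn.model_selection import train_test_split

# Toy stand-in for the district features, labels, and trained model
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Skilled model (blue) versus chance (red dashed diagonal)
RocCurveDisplay.from_estimator(model, X_test, y_test)
plt.plot([0, 1], [0, 1], color='red', linestyle='--', label='Chance')
plt.legend()
plt.show()

print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))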

The graph below shows the most important features of the model; it is a graphical way to convey how each variable affects the model based on its F score.
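Because the importance values are reported as F scores, such a chart can be produced with XGBoost's plot_importance. The sketch below assumes an XGBoost classifier and toy data, which may differ from the project's actual model and features.

import matplotlib.pyplot as plt
import xgboost as xgb
from sklearn.datasets import make_classification

# Toy stand-in; in the project, the features are the expenditure columns
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
model = xgb.XGBClassifier(random_state=42).fit(X, y)

# Horizontal bar chart of feature importance ranked by F score
# ("weight": how many times each feature is used to split)
xgb.plot_importance(model, importance_type='weight')
plt.show()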

Based on the findings above, the author suggests that a study be conducted on how racial diversity in particular affects the efficiency of Texas public school districts, as it has been shown to influence the model. In addition, further study is needed to determine whether having a bilingual program correlates positively or negatively with performance.

Sources

Baron, E. J. (2021). School Spending and Student Outcomes: Evidence from Revenue Limit Elections in Wisconsin. American Economic Journal: Economic Policy, 14(1), 1-39. Retrieved from https://www.aeaweb.org/articles?id=10.1257/pol.20200226

Carhart, A. E. (2016). School finance decisions and academic performance: An analysis of the impacts of school expenditures on student performance. Master's thesis. Retrieved from https://www.csus.edu/college/social-sciences-interdisciplinary-studies/public-policy-administration/_internal/_documents/thesis-bank/thesis-bank-2016-carhart.pdf

Texas Education Agency. (2022, July 31). PEIMS Financial Data Downloads. Retrieved from Texas Education Agency: https://tea.texas.gov/finance-and-grants/state-funding/state-funding-reports-and-data/peims-financial-data-downloads

Texas Education Agency. (2022, July 31). Texas Education Agency Data 2012-2019. Retrieved from Kaggle: https://www.kaggle.com/datasets/9e3ce42f60ded3ba2a6dd890993493f2c4b284c5cfa035d711bd98fa3359924c?resource=download

Thompson, J., Young, J. K., & Shelton, K. (2021). Evaluating educational efficiency in Texas public schools utilizing data envelopment analysis. School Leadership Review, 16(1), Article 9. Retrieved from https://scholarworks.sfasu.edu/slr/vol16/iss1/9

Drop It Like It’s Hot

I have a recurring dream in which my instructor from a coding boot camp beats my head with a ruler, telling me to read a package or library’s documentation. Hence, as a pastime, I find myself digging into Python’s or pandas’ documentation.

Today, I found myself wandering into pandas’ .drop() function. So, in this post, I shall attempt to make sense of pandas’ documentation for the ever-famous .drop().


Housekeeping

Let’s import pandas and create a sample dataframe.

import pandas as pd

data = {'fname': ['Priyanka', 'Jane', 'Sarah', 'Jake', 'Tatum', 'Shubham', 'Antonio'],
        'color': ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet'],
        'value': [0, 1, 2, 3, 5, 8, 13],
        'score': [345, 778, 124, 554, 864, 908, 456]
       }

df = pd.DataFrame(data)

If we type df into a cell in Jupyter notebook, this will give us the whole dataframe:

One-level DataFrame Operations

Now let’s get rid of some columns.

df.drop(['color', 'score'], axis=1)

The code above simply tells Python to get rid of 'color' and 'score' in axis=1, which means look in the columns. Alternatively, since the axis parameter can be confusing, we could just as easily skip it and use the columns parameter instead. Let’s try that now:

df.drop(columns=['color', 'score'])

Both of the methods above will result in the following:

Next, we’ll get rid of some rows (or records).

df.drop([1, 2, 4, 6])

Above, we’re simply telling Python to get rid of the rows with the index of 1, 2, 4, and 6. Note that the indices are passed as a list [1, 2, 4, 6]. This will result in the following:

MultiIndex DataFrame Operations

In this next round, we’re going to work with a multi-index dataframe. Let’s set it up:

data = pd.MultiIndex(levels=[['slim jim', 'avocado', 'banana', 'pork rinds'],
                             ['carbs', 'fat', 'protein']],
                     codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                            [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

df = pd.DataFrame(index=data, columns=['thinghy', 'majig'],
                  data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
                        [250, 150], [1.5, 0.8], [320, 250],
                        [1, 0.8], [0.3, 0.2], [34.2, 56], [33, 45.1], [67.3, 98]])

This is what the multi-index dataframe looks like:

Now, let’s get rid of the 'thinghy' column with:

df.drop(columns='thinghy')

And this is what we get:

Next, let’s get rid of 'pork rinds' because I don’t like them:

df.drop(index='pork rinds', level=0)

And this is what we get:

And finally, let’s cut the fat:

df.drop(index='fat', level=1)

Above, level=1 simply means the second level (since level numbering starts at 0). In this case, that’s the level containing carbs, fat, and protein. By specifying index='fat', we’re telling Python to get rid of 'fat' in level=1.

Here’s what we get:

Staying Put

So far, with all the playing that we did, if we type df into a cell, the output we get is still the original dataframe without modifications. This is because .drop() returns a new dataframe and leaves the original df untouched.

But what if we want to make the changes permanent? Enter: inplace.

df.drop(index='fat', level=1, inplace=True)

Above, we added the inplace=True parameter. This signals pandas that we want the changes applied to df itself, so that when we output df, this is what we’ll get:

We had permanently cut the fat off. LOL!


Thank you for reading! That’s it for today.

Stay tuned!

You can reach me on Twitter or LinkedIn.

Selecting Rows with .loc

As data scientists, we spend most of our time knee-deep in data wrangling with Pandas. In this post, we’ll be looking at the .loc property of Pandas to select rows based on some predefined conditions.

Let’s open up a Jupyter notebook, and let’s get wrangling!


The Data

We will be using the 311 Service Calls dataset¹ from the City of San Antonio Open Data website to illustrate how the different .loc techniques work.

Housekeeping

Before we get started, let’s do a little housekeeping first.

import pandas as pd

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

Nothing fancy going on here. We’re just importing the mandatory Pandas library and setting the display options so that when we inspect our dataframe, the columns and rows won’t be truncated by Jupyter. We’re setting it up so that every output within a single cell is displayed and not just the last one.

def show_missing(df):
    """
    Return the total missing values and the percentage of
    missing values by column.
    """
    null_count = df.isnull().sum()
    null_percentage = (null_count / df.shape[0]) * 100
    empty_count = pd.Series(((df == ' ') | (df == '')).sum())
    empty_percentage = (empty_count / df.shape[0]) * 100
    nan_count = pd.Series(((df == 'nan') | (df == 'NaN')).sum())
    nan_percentage = (nan_count / df.shape[0]) * 100
    return pd.DataFrame({'num_missing': null_count, 'missing_percentage': null_percentage,
                         'num_empty': empty_count, 'empty_percentage': empty_percentage,
                         'nan_count': nan_count, 'nan_percentage': nan_percentage})

In the code above, we’re defining a function that will show us the number of missing or null values and their percentage.

Getting the Data

Let’s load the data into a dataframe.
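Assuming the CSV has been downloaded from the open data portal, the load is a one-liner; the file name below is an assumption, so adjust it to match your download.

df = pd.read_csv('allservicecalls.csv')  # adjust to your downloaded file name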

Doing a quick df.head() we’ll see the first five rows of the data:

And df.info() will let us see the dtypes of the columns.

Then, show_missing(df) shows us if there are any missing values in the data.

Selecting rows where the column is null or not.

Let’s select the rows where the 'Dept' column has null values and also build a dataframe where null values are excluded.

df['Dept'].value_counts(dropna=False)

df_null = df.loc[df['Dept'].isnull()]
df_null.head()
df_null.shape

df_notnull = df.loc[df['Dept'].notnull()]
df_notnull.head()
df_notnull.shape

First, we did a value count of the 'Dept' column. The method .value_counts() returns a pandas Series listing all the values of the designated column and their frequency. By default, the method ignores NaN values and will not list them; including the parameter dropna=False makes it count NaN values in the result as well.

Next, the line df_null = df.loc[df['Dept'].isnull()] tells the computer to select the rows in df where the column 'Dept' is null. The resulting dataframe is assigned to df_null, and all of its rows will have NaN as the value in the 'Dept' column.

Similarly, the line df_notnull = df.loc[df['Dept'].notnull()] tells the computer to select the rows in df where the column 'Dept' is not null. The resulting dataframe is assigned to df_notnull, and none of its rows will have NaN in the 'Dept' column.

The general syntax for these two techniques are:

df_new = df_old.loc[df_old['Column Name'].isnull()]
df_new = df_old.loc[df_old['Column Name'].notnull()]

Selecting rows where the column is a specific value.

The 'Late (Yes/No)' column looks interesting. Let’s take a look at it!

df['Late (Yes/No)'].value_counts(dropna=False)

df_late = df.loc[df['Late (Yes/No)'] == 'YES']
df_late.head()
df_late.shape

df_notlate = df.loc[df['Late (Yes/No)'] == 'NO']
df_notlate.head()
df_notlate.shape

Again, we did a quick value count on the 'Late (Yes/No)' column. Then, we filtered for the cases that were late with df_late = df.loc[df['Late (Yes/No)'] == 'YES']. Similarly, we did the opposite by changing 'YES' to 'NO' and assigning the result to a different dataframe, df_notlate.

The syntax is not much different from the previous example except for the addition of the == sign between the column and the value we want to compare. It basically asks, for every row, whether the value in a particular column (left side) matches the value we specified (right side). If the match is True, the row is included in the result; if it is False, the row is ignored.

Here’s the resulting dataframe for df_late:

And here’s the one for df_notlate:

The general syntax for this technique is:

df_new = df_old.loc[df_old['Column Name'] == 'some_value' ]

Selecting rows where the column is not a specific value.

We’ve learned how to select rows based on ‘yes’ and ‘no.’ But what if the values are not binary? For example, let’s look at the ‘Category’ column:

One hundred ninety-two thousand one hundred ninety-seven rows or records do not have a category assigned, but instead of NaN, empty, or null value, we get 'No Category' as the category itself. What if we want to filter these out? Enter: the != operator.

df.Category.value_counts(dropna=False)

df_categorized = df.loc[df['Category'] != 'No Category']
df_categorized.head()
df_categorized.shape

df_categorized.Category.value_counts(dropna=False)

As usual, we did the customary value counts on the 'Category' column to see what we’re working with. Then, we created the df_categorized dataframe to include any records in the df dataframe that don’t have 'No Category' as their value in the 'Category' column.

Here’s the result of doing a value count on the 'Category' column of the df_categorized dataframe:

As the screenshot above shows, the value counts retained everything but the ‘No Category.’

The general syntax for this technique is:

df_new = df_old.loc[df_old['Column Name'] != 'some_value' ]

Select rows based on multiple conditions.

Let’s consider the following columns, 'Late (Yes/No)' and 'CaseStatus':

What if we wanted to know which currently open cases have already passed their SLA (service level agreement)? We would need multiple conditions to filter the cases or rows into a new dataframe. Enter the & operator.

df_late_open = df.loc[(df['Late (Yes/No)'] == 'YES') & (df['CaseStatus'] == 'Open')]

df_late_open.head()
df_late_open.shape

The syntax is similar to the previous ones except for the introduction of the & operator between parentheses. In the line df_late_open = df.loc[(df['Late (Yes/No)'] == 'YES') & (df['CaseStatus'] == 'Open')], there are two conditions:

  1. (df['Late (Yes/No)'] == 'YES')
  2. (df['CaseStatus'] == 'Open')

We want both of these to be true to match a row, so we placed the & operator between them. In plain speak, the & bitwise operator simply means AND. Other bitwise operators include the pipe | for OR and the tilde ~ for NOT. I encourage you to experiment with these bitwise operators to get a good feel for what they can do; just remember to enclose each condition in parentheses so that you don’t confuse Python.
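For instance, here are two quick variations on the same dataframe; note that the 'Closed' status value is an assumption for illustration.

# Late cases OR closed cases (the pipe | means OR)
df_late_or_closed = df.loc[(df['Late (Yes/No)'] == 'YES') | (df['CaseStatus'] == 'Closed')]

# Cases that are NOT late (the tilde ~ means NOT)
df_not_late = df.loc[~(df['Late (Yes/No)'] == 'YES')]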

The general syntax for this technique is:

df_new = df_old.loc[(df_old['Column Name 1'] == 'some_value_1') & (df_old['Column Name 2'] == 'some_value_2')]

Select rows having a column value that belongs in some list of values.

Let’s look at the value count for the 'Council District' column:

What if we wanted to focus on districts #2, #3, #4, and #5 because they’re in south San Antonio, and they’re known for getting poor service from the city? (I’m so totally making this up by the way!) In this case, we could use the .isin() method like so:

df['Council District'].value_counts(dropna=False)

df_south = df.loc[df['Council District'].isin([2,3,4,5])]
df_south.head()
df_south.shape

df_south['Council District'].value_counts()

Remember to pass your choices inside the .isin() method as a list, like ['choice1', 'choice2', 'choice3']; otherwise, it will cause an error. For integers, as in our example, quotation marks are not necessary because quotation marks are for string values only.

Here’s the result of our new dataframe df_south:

The general syntax for this technique is:

df_new = df_old.loc[df_old['Column Name'].isin(['choice1', 'choice2', 'choice3'])]

Conclusion

And that’s it! In this post, we loaded the 311 service calls data into a dataframe and created subsets of data using the .loc method.


Thanks for reading! I hope you enjoyed today’s post. Data wrangling, at least for me, is a fun exercise because this is the phase where I first get to know the data and it gives me a chance to hone my problem-solving skills when faced with really messy data. Happy wrangling folks!

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] City of San Antonio Open Data. (May 31, 2020). 311 Service Calls. https://data.sanantonio.gov/dataset/service-calls

From DataFrame to Named-Entities

A quick-start guide to extracting named-entities from a Pandas dataframe using spaCy.


A long time ago in a galaxy far away, I was analyzing comments left by customers and I noticed that they seemed to mention specific companies much more than others. This gave me an idea. Maybe there is a way to extract the names of companies from the comments and I could quantify them and conduct further analysis.

There is! Enter: named-entity-recognition.

Named-Entity Recognition

According to Wikipedia, named-entity recognition or NER “is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.”¹ In other words, NER attempts to extract words that can be categorized as proper names and even numerical entities.

In this post, I’ll share the code that will let us extract named-entities from a Pandas dataframe using spaCy, an open-source library that provides industrial-strength natural language processing in Python and is designed for production use.²

To get started, let’s install spaCy with the following pip command:

pip install -U spacy

After that, let’s download the pre-trained model for English:

python -m spacy download en

With that out of the way, let’s open up a Jupyter notebook and get started!

Imports

Run the following code block into a cell to get all the necessary imports into our Python environment.

# for manipulating dataframes
import pandas as pd

# for natural language processing: named entity recognition
import spacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

# for visualizations
%matplotlib inline

The important line in this block is nlp = en_core_web_sm.load() because this is what we’ll be using later to extract the entities from the text.

Getting the Data

First, let’s get our data and load it into a dataframe. If you want to follow along, download the sample dataset here or create your own from the Trump Twitter Archive.

df = pd.read_csv('ever_trump.csv')

Running df.head() in a cell will get us acquainted with the data set quickly.

Getting the Tokens

Second, let’s create tokens that will serve as input for spaCy. In the line below, we create a variable tokens that contains all the words in the 'text' column of the df dataframe.

tokens = nlp(''.join(str(df.text.tolist())))

Third, we’re going to extract entities. We can just extract the most common entities for now:

items = [x.text for x in tokens.ents]
Counter(items).most_common(20)

Extracting Named-Entities

Next, we’ll extract the entities based on their categories. We have a few types to choose from, ranging from people to events and even organizations. For a complete list of all that spaCy has to offer, check out their documentation on named-entities.


To start, we’ll extract people (real and fictional) using the PERSON type.

person_list = []

for ent in tokens.ents:
    if ent.label_ == 'PERSON':
        person_list.append(ent.text)

person_counts = Counter(person_list).most_common(20)
df_person = pd.DataFrame(person_counts, columns=['text', 'count'])

In the code above, we started by making an empty list with person_list = [].

Then, we utilized a for-loop to loop through the entities found in tokens with tokens.ents. After that, we made a conditional that will append to the previously created list if the entity label is equal to PERSON type.

We’ll want to know how many times a certain entity of the PERSON type appears in the tokens, so we counted them with person_counts = Counter(person_list).most_common(20). This line gives us the top 20 most common entities of this type.

Finally, we created the df_person dataframe to store the results and this is what we get:


We’ll repeat the same pattern for the NORP type which recognizes nationalities, religious and political groups.

norp_list = []

for ent in tokens.ents:
    if ent.label_ == 'NORP':
        norp_list.append(ent.text)

norp_counts = Counter(norp_list).most_common(20)
df_norp = pd.DataFrame(norp_counts, columns=['text', 'count'])

And this is what we get:


Bonus Round: Visualization

Let’s create a horizontal bar graph of the df_norp dataframe.

df_norp.plot.barh(x='text', y='count', title="Nationalities, Religious, and Political Groups", figsize=(10,8)).invert_yaxis()

Voilà, that’s it!


I hope you enjoyed this one. Natural language processing is a huge topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Wikipedia. (May 22, 2020). Named-entity recognition https://en.wikipedia.org/wiki/Named-entity_recognition

[2]: spaCy. (May 22, 2020). Industrial-Strength Natural Language Processing in Python https://spacy.io/

This article was first published in the Towards Data Science Medium publication.

From DataFrame to N-Grams

A quick-start guide to creating and visualizing n-gram ranking using nltk for natural language processing.


When I was first starting to learn NLP, I remember getting frustrated or intimidated by information overload so I’ve decided to write a post that covers the bare minimum. You know what they say, “Walk before you run!”

This is a very gentle introduction so we won’t be using any fancy code here.


In a nutshell, natural language processing or NLP simply refers to the process of reading and understanding written or spoken language using a computer. At its simplest use case, we can use a computer to read a book, for example, and count how many times each word was used instead of us manually doing it.

NLP is a big topic and there’s already been a ton of articles written on the subject so we won’t be covering that here. Instead, we’ll focus on how to quickly do one of the simplest but useful techniques in NLP: N-gram ranking.

N-Gram Ranking

Simply put, an n-gram is a sequence of n words where n is a discrete number that can range from 1 to infinity! For example, the word “cheese” is a 1-gram (unigram). The combination of the words “cheese flavored” is a 2-gram (bigram). Similarly, “cheese flavored snack” is a 3-gram (trigram). And “ultimate cheese flavored snack” is a 4-gram (four-gram). So on and so forth.

In n-gram ranking, we simply rank the n-grams according to how many times they appear in a body of text — be it a book, a collection of tweets, or reviews left by customers of your company.
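Before we get to real tweets, here’s a tiny, made-up example of what that ranking looks like:

import nltk
import pandas as pd

words = 'ultimate cheese flavored snack for the ultimate cheese lover'.split()

# Rank the bigrams in this tiny sample by frequency
print(pd.Series(nltk.ngrams(words, 2)).value_counts())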

Let’s get started!

Getting the Data

First, let’s get our data and load it into a dataframe. You can download the sample dataset here or create your own from the Trump Twitter Archive.

import pandas as pd

df = pd.read_csv('tweets.csv')

Using df.head() we can quickly get acquainted with the dataset.

A sample of President Trump’s tweets.

Importing Packages

Next, we’ll import packages so we can properly set up our Jupyter notebook:

# natural language processing: n-gram ranking
import re
import unicodedata
import nltk
from nltk.corpus import stopwords

# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['covfefe']
import matplotlib.pyplot as plt

In the code block above, we imported pandas so that we can shape and manipulate our data in all sorts of different and wonderful ways! Next, we imported re for regex, unicodedata for Unicode data, and nltk to help with parsing the text and cleaning it up a bit. And then, we specified additional stop words that we want to ignore. This is helpful in trimming down the noise. Lastly, we imported matplotlib so we can visualize the result of our n-gram ranking later.

Next, let’s create a function that will perform basic cleaning of the data.

Basic Cleaning

def basic_clean(text):
    """
    A simple function to clean up the data. All the words that
    are not designated as a stop word are lemmatized after
    encoding and basic regex parsing are performed.
    """
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

The function above takes in a list of words or text as input and returns a cleaner set of words. The function does normalization, encoding/decoding, lower casing, and lemmatization.

Let’s use it!

words = basic_clean(''.join(str(df['text'].tolist())))

Above, we’re simply calling the basic_clean() function to process the 'text' column of our dataframe df, after converting it to a simple list with tolist(). We then assign the results to words.

A list of already cleaned, normalized, and lemmatized words.

N-grams

Here comes the fun part! In one line of code, we can find out which bigrams occur the most in this particular sample of tweets.

(pd.Series(nltk.ngrams(words, 2)).value_counts())[:10]

We can easily replace the number 2 with 3 so we can get the top 10 trigrams instead.

(pd.Series(nltk.ngrams(words, 3)).value_counts())[:10]

Voilà! We got ourselves a great start. But why stop now? Let’s try it and make a little eye candy.

Bonus Round: Visualization

To make things a little easier for ourselves, let’s assign the result of n-grams to variables with meaningful names:

bigrams_series = (pd.Series(nltk.ngrams(words, 2)).value_counts())[:12]
trigrams_series = (pd.Series(nltk.ngrams(words, 3)).value_counts())[:12]

I’ve replaced [:10] with [:12] because I wanted more n-grams in the results. This is an arbitrary value so you can choose whatever makes the most sense to you according to your situation.

Let’s create a horizontal bar graph:

bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

And let’s spiffy it up a bit by adding titles and axis labels:

bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('12 Most Frequently Occurring Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# of Occurrences')

And that’s it! With a few simple lines of code, we quickly made a ranking of n-grams from a Pandas dataframe and even made a horizontal bar graph out of it.


I hope you enjoyed this one. Natural Language Processing is a big topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, we’ll visualize an n-gram ranking in Power BI with a few simple clicks of the mouse and a dash of Python!

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published in the Towards Data Science Medium publication.

From DataFrame to Network Graph

I just discovered — quite accidentally — how to export data from JIRA so naturally, I began to think of ways to visualize the information and potentially glean some insight from the dataset. I’ve stumbled upon the concept of network graphs and the idea quickly captured my imagination. I realized that I can use it to tell stories not only about the relationships between people but between words as well! But NLP is a big topic, so how about we walk first and run later?!

This is just a very gentle introduction so we won’t be using any fancy code here.


Network graphs “show interconnections between a set of entities”¹ where entities are nodes and the connections between them are represented through links or edges¹. In the graph below, the dots are the nodes and the lines are the edges.

Martin Grandjean / CC BY-SA (https://creativecommons.org/licenses/by-sa/3.0)

In this post, I’ll share the code that will let us quickly visualize a Pandas dataframe using a popular network graph package: networkx.

First, let’s get our data and load it into a dataframe. You can download the sample dataset here.

import pandas as pd
df = pd.read_csv('jira_sample.csv')

Second, let’s trim the dataframe to only include the columns we want to examine. In this case, we only want the columns ‘Assignee’ and ‘Reporter’.

df1 = df[['Assignee', 'Reporter']]

Third, it’s time to create the world in which the graph will exist. If you haven’t already, install the networkx package by doing a quick pip install networkx.

import networkx as nx
G = nx.from_pandas_edgelist(df1, 'Assignee', 'Reporter')

Next, we’ll materialize the graph we created with the help of matplotlib for formatting.

from matplotlib.pyplot import figure
figure(figsize=(10, 8))
nx.draw_shell(G, with_labels=True)

The most important line in the block above is nx.draw_shell(G, with_labels=True). It tells the computer to draw the graph G using a shell layout with the labels for the entities turned on.

Voilà! We got ourselves a network graph:

A network graph of reporters and assignees from a sample JIRA dataset.

Right off the bat, we can tell that there’s a heavy concentration of lines originating from three major players, ‘barbie.doll’, ‘susan.lee’, and ‘joe.appleseed’. Of course, just to be sure, it’s always a good idea to confirm our ‘eyeballing’ with some hard data.

Bonus Round

Let’s check out ‘barbie.doll’.

G['barbie.doll']

To see how many connections ‘barbie.doll’ has, let’s use len():

len(G['barbie.doll'])

Next, let’s create another dataframe that shows the nodes and their number of connections.

leaderboard = {}
for x in G.nodes:
    leaderboard[x] = len(G[x])
s = pd.Series(leaderboard, name='connections')
df2 = s.to_frame().sort_values('connections', ascending=False)

In the code block above, we first initialized an empty dictionary called ‘leaderboard’ and then used a simple for-loop to populate the dictionary with names and number of connections. Then, we created a series out of the dictionary. Finally, we created another dataframe from the series that we created using to_frame().

To display the dataframe, we simply use df2.head() and we got ourselves a leaderboard!

And that’s it! With a few simple lines of code, we quickly made a network graph from a Pandas dataframe and even displayed a table with names and number of connections.


I hope you enjoyed this one. Network graph analysis is a big topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, I’ll walk through Power BI’s custom visual called ‘Network Navigator’ to create a network graph with a few simple clicks of the mouse.

Stay tuned!

[1]: Data-To-Viz. (May 15, 2020). Network Diagram https://www.data-to-viz.com/graph/network.html

How to put two or more dataframes together

The Pandas function below takes a list of dataframes and concatenates them into one. This basic flavor of concat() joins the dataframes vertically. In other words, the rows of one dataframe get added on to the rows of the previous one.

df = pd.concat([df1,df2,df3])

Or if you want, you can store the list of dataframes into a variable first and then call the concat function. Like so:

# we must import pandas first
# put it in the beginning of your file
import pandas as pd

frames = [df1, df2, df3, df4, df5]
df = pd.concat(frames)

On the other hand, if I want to join the dataframes horizontally, then I can use merge().

For example, in the code below, we are merging df1 with df2 using 'column_name' as the common column, the column on which to base the merge. If there are any other identical columns that exist between the two dataframes, the suffixes are appended to each of those column names accordingly.

This particular flavor of merge() joins the dataframes horizontally. In other words, the columns of the dataframes get added together to make one big mamma jamma of a dataframe:

df_merged = df1.merge(df2,
                      left_on='column_name',
                      right_on='column_name',
                      suffixes=('_left', '_right'))