A friendly tutorial on getting zip codes and other geographic data from street addresses.
Knowing how to work with geographic data is a must-have skill for a data scientist. In this post, we will play around with the MapQuest Search API to get zip codes from street addresses, along with their corresponding latitude and longitude to boot!
The Scenario
In 2019, my friends and I participated in the CivTechSA Datathon. At one point in the competition, we wanted to visualize the data points and overlay them on a map of San Antonio. The problem was that we had incomplete data. Surprise! All we had were a street number and a street name: no zip code, no latitude, no longitude. We then turned to the great internet for help.
We found a great API by MapQuest that would give us exactly what we needed. With just a sprinkle of Python code, we were able to accomplish our goal.
Today, we’re going to walk through this process.
The Data
To follow along, you can download the data from here. Just scroll down to the bottom, tab on over to the Data Catalog 2019, and look for SAWS (San Antonio Water System) as shown below.
Below are two functions that call the API and return geo data.
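Here is a minimal sketch of what those two functions might look like, assuming MapQuest's geocoding endpoint and its standard JSON response; the function names and error handling are illustrative, not the article's exact code.

import requests

BASE_URL = 'http://www.mapquestapi.com/geocoding/v1/address'

def get_geo_data(address, api_key):
    # query the MapQuest Geocoding API for a single address and
    # return the first matching location from the JSON response
    url = f'{BASE_URL}?key={api_key}&location={address}'
    response = requests.get(url)
    response.raise_for_status()
    return response.json()['results'][0]['locations'][0]

def get_zip_lat_lng(address, api_key):
    # pull out just the zip code, latitude, and longitude
    location = get_geo_data(address, api_key)
    return (location.get('postalCode'),
            location['latLng']['lat'],
            location['latLng']['lng'])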
We can manually call it with the line below. Don’t forget to replace the ‘#####’ with your own API key. You can use any address you want (replace spaces with a + character).
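For instance, using the sketch above (the address here is purely illustrative):

get_zip_lat_lng('100+Main+Plaza+San+Antonio+TX', '#####')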
Finally, let's create a dataframe that houses the street addresses, complete with zip code, latitude, and longitude.
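One way to put it together, assuming addresses is a list of '+'-joined strings built from the SAWS street number and street name fields (the variable and column names here are assumptions):

import pandas as pd

records = []
for address in addresses:
    try:
        zip_code, lat, lng = get_zip_lat_lng(address, api_key)  # api_key defined earlier
    except (requests.RequestException, KeyError, IndexError):
        zip_code, lat, lng = None, None, None  # skip addresses the API can't resolve
    records.append({'address': address, 'zip': zip_code,
                    'latitude': lat, 'longitude': lng})

geo_df = pd.DataFrame(records)
geo_df.head()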
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.¹
PyCaret
PyCaret does a lot more than NLP. It also does a whole slew of both supervised and unsupervised ML including classification, regression, clustering, anomaly detection, and association rule mining.
Let’s begin by installing PyCaret. Just do pip install pycaret and we are good to go! Note: PyCaret is a big library so you may want to go grab a cup of coffee while waiting for it to install.
Also, we need to download the English language model because it is not automatically downloaded with PyCaret:
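At the time of writing, PyCaret's NLP module relies on spaCy's small English model (and the TextBlob corpora), so the download looks roughly like this:

python -m spacy download en_core_web_sm
python -m textblob.download_corpora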
Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.
import pandas as pd
from pycaret.nlp import *
df = pd.read_csv('trump_20200530.csv')
Let’s check the shape of our data first:
df.shape
And let’s take a quick look:
df.head()
For expediency, let’s sample only 1,000 tweets.
# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)
df.shape
PyCaret’s setup() function performs the following text-processing steps:
Removing Numeric Characters
Removing Special Characters
Word Tokenization
Stopword Removal
Bigram Extraction
Trigram Extraction
Lemmatizing
Custom Stopwords
And all in one line of code!
It takes two required parameters: the dataframe passed to data and the name of the text column passed to target. In our case, we also used the optional parameters session_id for reproducibility and custom_stopwords to reduce the noise coming from the tweets.
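A sketch of the call, assuming the tweets live in a column named 'text' and using a few throwaway Twitter tokens as custom stopwords (both are assumptions, not the article's exact choices):

nlp_setup = setup(data=df,
                  target='text',
                  session_id=493,
                  custom_stopwords=['rt', 'https', 'http', 'amp', 'co'])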
After all is said and done, setup() prints a summary grid of the session and the preprocessing that was applied.
In the next step, we’ll create the model and we’ll use ‘lda’:
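The call itself is a one-liner; something like:

# build an LDA model with 6 topics, using all available CPU cores
lda = create_model('lda', num_topics=6, multi_core=True)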
Above, we created an ‘lda’ model, passed in the number of topics as 6, and set it so that LDA will use all available CPU cores to parallelize and speed up training.
Finally, we’ll assign topic proportions to the rest of the dataset using assign_model().
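For example:

# attach dominant topic labels and topic weights back to the tweets
lda_results = assign_model(lda)
lda_results.head()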
Thank you for reading! Natural language processing uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.
I was bored over the weekend, so I decided to restore my MacBook Pro to factory settings so that I could set up my programming environment the proper way. After all, what’s a data scientist without her toys?
Let’s start with a replacement to the default terminal and pyenv installation to manage different Python versions.
Replacing the default Mac terminal with iTerm2 and Oh My Zsh.
Let’s move on to managing different Python interpreters and virtual environments using pyenv-virtualenv.
In a previous post, we set out to explore the dataset provided by the Trump Twitter Archive. My initial goal was to do something fun by using a very interesting dataset. However, it didn’t quite turn out that way.
In this post, we’ll continue our journey, but this time we’ll be using spaCy.
For this project, we’ll be using pandas for data manipulation, spaCy for natural language processing, and joblib to speed things up.
Let’s get started by firing up a Jupyter notebook!
Housekeeping
Let’s import pandas and also set the display options so Jupyter won’t truncate our columns and rows. Let’s also set a random seed for reproducibility.
# for manipulating data
import pandas as pd

# setting the random seed for reproducibility
import random
random.seed(493)

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)
Getting the Data
Let’s read the data into a dataframe. If you want to follow along, you can download the cleaned dataset here along with the file for stop words¹. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.
Now let’s import spaCy and begin natural language processing.
# for natural language processing: named entity recognition
import spacy
import en_core_web_sm
We’re only going to use spaCy’s ner (named-entity recognition) component, so we’ll disable the rest of the pipeline. This will save us a lot of processing time later.
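One way to do that is to pass the components we don't need to the disable argument when loading the model (a sketch, not necessarily the article's exact call):

# load the English model with everything but the NER component disabled
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])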
Now let’s load the contents of the stopwords file into the variable stopwords. Note that we convert the list into a set to save some processing time later.
with open('twitter-stopwords — TA — Less.txt') as f:
    contents = f.read().split(',')
stopwords = set(contents)
Next, we’ll import joblib and define a few functions to help with parallel processing.
from joblib import Parallel, delayed
def chunker(iterable, total_length, chunksize):
return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))
def flatten(list_of_lists):
"Flatten a list of lists to a combined list"
return [item for sublist in list_of_lists for item in sublist]
def process_chunk(texts):
preproc_pipe = []
for doc in nlp.pipe(texts, batch_size=20):
preproc_pipe.append([(ent.text) for ent in doc.ents if ent.label_ in ['NORP', 'PERSON', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT']])
return preproc_pipe
def preprocess_parallel(texts, chunksize=100):
executor = Parallel(n_jobs=7, backend='multiprocessing', prefer="processes")
do = delayed(process_chunk)
tasks = (do(chunk) for chunk in chunker(texts, len(df), chunksize=chunksize))
result = executor(tasks)
return flatten(result)
In the code above², the function preprocess_parallel executes the other function process_chunk in parallel to help with speed. The function process_chunk iterates through a series of texts, in our case the 'tweet' column of the df dataframe, and checks whether each entity belongs to NORP, PERSON, FAC, ORG, GPE, LOC, PRODUCT, or EVENT. If it does, the entity is appended to preproc_pipe and subsequently returned to the caller. Prashanth Rao has a very good article on making spaCy super fast.
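The wiring itself is a one-liner; something like this would produce the entities column discussed next:

df['entities'] = preprocess_parallel(df['tweet'], chunksize=100)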
Doing a quick df.head() will reveal the new column 'entities' that we added earlier to hold the entities found in the 'tweet' column.
Prettifying the Results
In the code below, we’re making a list of lists called 'entities' and then flattening it for easier processing. We’re also converting it into a set called 'entities_set'.
entities = [entity for entity in df.entities if entity != []]
entities = [item for sublist in entities for item in sublist]
entities_set = set(entities)
Next, let’s count the frequency of the entities and store the results in a list of tuples, entity_counts. Then let’s convert the results into a dataframe, df_counts. For this step, we initialize an empty list, entity_counts, and manually construct the list of tuples by combining variants of the same entity and summing their frequencies or counts.
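A minimal sketch of the counting step, leaving the manual merging of variants aside (the dataframe column names are mine):

from collections import Counter

# count how often each entity appears, most frequent first
entity_counts = Counter(entities).most_common()

df_counts = pd.DataFrame(entity_counts, columns=['entity', 'count'])
df_counts.head(10)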
We’ve successfully created a ranking of the named entities that President Trump most frequently talked about in his tweets since taking office.
Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.
In the next post, we shall continue our journey into the heart of darkness and do some topic-modeling using LDA.
Exploring the Trump Twitter Archive with Python. For beginners.
In this post, we’ll explore the dataset provided by the Trump Twitter Archive. My goal was to do something fun by using a very interesting dataset. However, as it turned out, exposure to Trump’s narcissism and shenanigans was quite depressing, if not traumatic.
You’ve been warned!
For this project, we’ll be using pandas and numpy for data manipulation, matplotlib for visualizations, datetime for working with timestamps, unicodedata and regex for processing strings, and finally, nltk for natural language processing.
Let’s get started by firing up a Jupyter notebook!
Environment
We’re going to import pandas and matplotlib, and also set the display options for Jupyter so that the rows and columns are not truncated.
# for manipulating data
import pandas as pd
import numpy as np

# for visualizations
%matplotlib inline
import matplotlib.pyplot as plt

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)
Getting the Data
Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.
df = pd.read_csv('trump_20200530.csv')
Let’s look at the first five rows and see the number of records (rows) and fields (columns).
df.head()
df.shape
Let’s do a quick renaming of the columns to make it easier for us later.
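The exact mapping depends on the export, but assuming the standard Trump Twitter Archive column names, it might look like this:

# column names on the left are assumptions based on the archive's export
df = df.rename(columns={'text': 'tweet',
                        'created_at': 'datetime',
                        'retweet_count': 'retweets',
                        'favorite_count': 'favorites'})

# make sure the timestamp column is a real datetime for the .dt accessors below
df['datetime'] = pd.to_datetime(df['datetime'])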
Let’s explore the data and answer a few questions.
What time does the President tweet the most? What time does he tweet the least?
The graph below shows that the President most frequently tweets around 12pm. He also tweets the least around 8am. Maybe he’s not a morning person?
title = 'Number of Tweets by Hour'
df.tweet.groupby(df.datetime.dt.hour).count().plot(figsize=(12,8),
                                                   fontsize=14,
                                                   kind='bar',
                                                   rot=0,
                                                   title=title)
plt.xlabel('Hour')
plt.ylabel('Number of Tweets')
What day does the President tweet the most? What day does he tweet the least?
The graph below shows that the President most frequently tweets on Wednesday. He also tweets the least on Thursday.
title = 'Number of Tweets by Day of the Week'
df.tweet.groupby(df.datetime.dt.dayofweek).count().plot(figsize=(12,8),
                                                        fontsize=14,
                                                        kind='bar',
                                                        rot=0,
                                                        title=title)
plt.xlabel('Day of the Week')
plt.ylabel('Number of Tweets')
plt.xticks(np.arange(7), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
Isolating Twitter Handles from the Retweets
Let’s import re so we can use regular expressions to parse the text and isolate the Twitter handles of the original tweets. In the code below, we add another column, original, that contains the Twitter handle.
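The df_retweets dataframe isn't shown being built; one plausible way, given the columns we renamed earlier, is to keep only the tweets that start with 'RT @':

# keep only the retweets (an assumption about how df_retweets was built)
df_retweets = df[df.tweet.str.startswith('RT @')].copy()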
import re
pattern = re.compile('(?<=RT @).*?(?=:)')
df_retweets['original'] = [re.search(pattern, tweet).group(0) for tweet in df_retweets.tweet]
Let’s create another dataframe that will contain only the original Twitter handles and their associated number of retweets.
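A sketch of how that summary might be built (the dataframe and aggregate column names are assumptions):

df_originals = (df_retweets.groupby('original')
                           .agg(count=('tweet', 'size'),
                                retweets=('retweets', 'sum'))
                           .sort_values('count', ascending=False)
                           .reset_index())
df_originals.head(10)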
Which Twitter user does the President like to retweet the most?
The graph below shows that the President likes to retweet the tweets from ‘@realDonaldTrump’. Does this mean the president likes to retweet himself? You don’t say!
The interesting handle on this one is ‘@charliekirk11’. Charlie Kirk is the founder of Turning Point USA. CBS News described the organization as a far-right organization that is “shunned or at least ignored by more established conservative groups in Washington, but embraced by many Trump supporters”.¹
The Top 5 Retweets
Let’s look at the top 5 tweets that were retweeted the most by others based on the original Twitter handle.
Let’s start with the ones with ‘@realDonaldTrump’.
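A sketch, using the renamed columns from earlier:

# top 5 tweets originally from @realDonaldTrump, ranked by retweet count
(df_retweets[df_retweets.original == 'realDonaldTrump']
     .sort_values('retweets', ascending=False)
     .head())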
Let’s find out how many of the retweets are favorited by others.
df_retweets.favorites.value_counts()
Surprisingly, none of the retweets seem to have been favorited by anybody. Weird.
Since the favorites column tells us nothing here, we should drop it.
Counting N-Grams
To do some n-gram ranking, we need to import unicodedata and nltk. We also need to specify additional stopwords that we may need to exclude from our analysis.
# for cleaning and natural language processing
import unicodedata
import nltk

# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['rt']
Here are a few of my favorite functions for natural language processing:
def clean(text):
"""
A simple function to clean up the data. All the words that
are not designated as stop words are lemmatized after
encoding and basic regex parsing are performed.
"""
wnl = nltk.stem.WordNetLemmatizer()
stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
text = (unicodedata.normalize('NFKD', text)
.encode('ascii', 'ignore')
.decode('utf-8', 'ignore')
.lower())
words = re.sub(r'[^\w\s]', '', text).split()
return [wnl.lemmatize(word) for word in words if word not in stopwords]
def get_words(df, column):
"""
Takes in a dataframe and a column name and returns a list of
words from the values in the specified column.
"""
return clean(''.join(str(df[column].tolist())))
def get_bigrams(df, column):
"""
Takes in a list of words and returns a series of
bigrams with value counts.
"""
return (pd.Series(nltk.ngrams(get_words(df, column), 2)).value_counts())[:10]
def get_trigrams(df, column):
"""
Takes in a list of words and returns a series of
trigrams with value counts.
"""
return (pd.Series(nltk.ngrams(get_words(df, column), 3)).value_counts())[:10]
def viz_bigrams(df, column):
get_bigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('10 Most Frequently Occurring Bigrams')
plt.ylabel('Bigram')
plt.xlabel('# Occurrences')
def viz_trigrams(df, column):
get_trigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('10 Most Frequently Occurring Trigrams')
plt.ylabel('Trigram')
plt.xlabel('# Occurrences')
Let’s look at the top 10 bigrams in the df dataframe using the ‘tweet’ column.
get_bigrams(df, 'tweet')
And now, for the top 10 trigrams:
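Presumably via the matching helper:

get_trigrams(df, 'tweet')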
Let’s use the viz_bigrams() function and visualize the bigrams.
viz_bigrams(df, 'tweet')
Similarly, let’s use the viz_trigrams() function and visualize the trigrams.
viz_trigrams(df, 'tweet')
And there we have it!
From the moment that Trump took office, we can confidently say that the “fake news media” has been on top of the president’s mind.
Conclusion
Using basic Python and the nltk library, we’ve explored the dataset from the Trump Twitter Archive and did some n-gram ranking on it.
Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.
In the next post, we shall continue our journey into the heart of darkness and use spaCy to extract named-entities from the same dataset.
An early attempt of using networkx to visualize the results of natural language processing.
I do a lot of natural language processing and usually, the results are pretty boring to the eye. When I learned about network graphs, it got me thinking, why not use keywords as nodes and connect them together to create a network graph?
Yupp, why not!
In this post, we’ll do exactly that. We’re going to extract named-entities from news articles about coronavirus and then use their relationships to connect them together in a network graph.
A Brief Introduction
A network graph is a cool visual that contains nodes (vertices) and edges (lines). It’s often used in social network analysis and network analysis, but data scientists also use it for natural language processing.
Natural Language Processing or NLP is a branch of artificial intelligence that deals with programming computers to process and analyze large volumes of text and derive meaning out of them.¹ In other words, it’s all about teaching computers how to understand human language… like a boss!
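Setup is a handful of commands; assuming pip, they look roughly like this:

pip install -U spacy
python -m spacy download en_core_web_sm
pip install networkx
pip install fuzzywuzzy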
This will install spaCy and download the trained model for English. The third command installs networkx and the last installs fuzzywuzzy, which we’ll use for some text preprocessing. This should work for most systems; if it doesn’t work for you, check out the documentation for spaCy and networkx.
With that out of the way, let’s fire up a Jupyter notebook and get started!
Imports
Run the following code block into a cell to get all the necessary imports into our Python environment.
import pandas as pd
import numpy as np
import pickle
from operator import itemgetter
from fuzzywuzzy import process, fuzz

# for natural language processing
import spacy
import en_core_web_sm

# for visualizations
%matplotlib inline
from matplotlib.pyplot import figure

import networkx as nx
Getting the Data
If you want to follow along, you can download the sample dataset here. The file was created using newspaper to import news articles from npr.org. If you’re feeling adventurous, use the code snippet below to build your own dataset.
import requests
import json
import time
import newspaper
import pickle
npr = newspaper.build('https://www.npr.org/sections/coronavirus-live-updates')
corpus = []
count = 0
for article in npr.articles:
time.sleep(1)
article.download()
article.parse()
text = article.text
corpus.append(text)
if count % 10 == 0 and count != 0:
print('Obtained {} articles'.format(count))
count += 1
corpus300 = corpus[:300]
with open("npr_coronavirus.txt", "wb") as fp: # Pickling
pickle.dump(corpus300, fp)
# with open("npr_coronavirus.txt", "rb") as fp: # Unpickling
# corpus = pickle.load(fp)
Let’s get our data.
with open('npr_coronavirus.txt', 'rb') as fp:  # Unpickling
    corpus = pickle.load(fp)
Extract Entities
Next, we’ll start by loading spaCy’s English model:
nlp = en_core_web_sm.load()
Then, we’ll extract the entities:
entities = []

for article in corpus[:50]:
    tokens = nlp(''.join(article))
    gpe_list = []
    for ent in tokens.ents:
        if ent.label_ == 'GPE':
            gpe_list.append(ent.text)
    entities.append(gpe_list)
In the code block above, we created an empty list called entities to store a list of lists that contains the extracted entities from each of the articles. In the for-loop, we looped through the first 50 articles of the corpus. For each iteration, we converted each article into tokens (words) and then looped through those words to get the entities that are labeled as GPE (countries, states, and cities). We used ent.text to extract the actual entity and appended them one by one to entities.
Here’s the result:
Note that North Carolina has several variations of its name and some have “the” prefixed in their names. Let’s get rid of them.
articles = []

for entity_list in entities:
    cleaned_entity_list = []
    for entity in entity_list:
        cleaned_entity_list.append(entity.lstrip('the ').replace("'s", "").replace("’s",""))
    articles.append(cleaned_entity_list)
In the code block above, we’re simply traversing the list of lists entities and cleaning each entity one by one. With each iteration, we’re stripping the prefix “the” and getting rid of 's.
Optional: FuzzyWuzzy
Looking at the entities, I noticed that there are also variations in how the “United States” is represented: some appear as “United States of America” while others are just “United States”. We can trim these down to a more standard naming convention.
choices = set([item for sublist in articles for item in sublist])
cleaned_articles = []
for article in articles:
article_entities = []
for entity in set(article):
article_entities.append(process.extractOne(entity, choices)[0])
cleaned_articles.append(article_entities)
For the final step before creating the network graph, let’s get rid of the empty lists within our list of lists that were generated by articles that didn’t have any GPE entity types.
articles = [article for article in articles if article != []]
Create the Network Graph
For the next step, we’ll create the world in which the graph will exist.
G = nx.Graph()
Then, we’ll manually add the nodes with G.add_nodes_from().
for entities in articles:
    G.add_nodes_from(entities)
Let’s see what the graph looks like with:
figure(figsize=(10, 8))
nx.draw(G, node_size=15)
Next, let’s add the edges that will connect the nodes.
for entities in articles:
    if len(entities) > 1:
        for i in range(len(entities)-1):
            G.add_edges_from([(str(entities[i]), str(entities[i+1]))])
For each iteration of the code above, we used a conditional that only entertains lists with two or more entities. Then, we manually connect each entity to the next one with G.add_edges_from().
Let’s see what the graph looks like now:
figure(figsize=(10, 8))
nx.draw(G, node_size=10)
This graph reminds me of spiders! LOL.
To organize it a bit, I decided to use the shell version of the network graph:
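That's a one-liner with networkx; something like:

# draw the same graph with nodes arranged in concentric shells
figure(figsize=(10, 8))
nx.draw_shell(G, node_size=15)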
Next, let’s rank the nodes by degree. sorted_degree is a list that contains all the nodes and their degree values, sorted from most to least connected; we only want the top 5.
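A sketch of how it might be computed, using itemgetter from the imports above:

# rank nodes from most to least connected
sorted_degree = sorted(dict(G.degree()).items(),
                       key=itemgetter(1), reverse=True)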
print("Top 5 nodes by degree:") for d in sorted_degree[:5]: print(d)
Bonus Round: Gephi
Gephi is an open-source and free desktop application that lets us visualize, explore, and analyze all kinds of graphs and networks.²
Let’s export our graph data into a file so we can import it into Gephi.
nx.write_gexf(G, "npr_coronavirus_GPE_50.gexf")
Cool beans!
Next Steps
This time, we only processed 50 articles from npr.org. What would happen if we processed all 300 articles from our dataset? What will we see if we change the entity type from GPE to PERSON? How else can we use network graphs to visualize natural language processing results?
There’s always more to do. The possibilities are endless!
I hope you enjoyed today’s post. The code is not perfect and we have a long way to go towards realizing insights from the data. I encourage you to dive deeper and learn more about spaCy, networkx, fuzzywuzzy, and even Gephi.
A quick-start guide to extracting named-entities from a Pandas dataframe using spaCy.
A long time ago in a galaxy far away, I was analyzing comments left by customers and I noticed that they seemed to mention specific companies much more than others. This gave me an idea. Maybe there is a way to extract the names of companies from the comments and I could quantify them and conduct further analysis.
There is! Enter: named-entity-recognition.
Named-Entity Recognition
According to Wikipedia, named-entity recognition or NER “is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.”¹ In other words, NER attempts to extract words that can be categorized as proper names and even numerical entities.
In this post, I’ll share the code that will let us extract named-entities from a Pandas dataframe using spaCy, an open-source library that provides industrial-strength natural language processing in Python and is designed for production use.²
To get started, let’s install spaCy with the following pip command:
pip install -U spacy
After that, let’s download the pre-trained model for English:
python -m spacy download en
With that out of the way, let’s open up a Jupyter notebook and get started!
Imports
Run the following code block into a cell to get all the necessary imports into our Python environment.
# for manipulating dataframes
import pandas as pd

# for natural language processing: named entity recognition
import spacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

# for visualizations
%matplotlib inline
The important line in this block is nlp = en_core_web_sm.load() because this is what we’ll be using later to extract the entities from the text.
Getting the Data
First, let’s get our data and load it into a dataframe. If you want to follow along, download the sample dataset here or create your own from the Trump Twitter Archive.
df = pd.read_csv('ever_trump.csv')
Running df.head() in a cell will get us acquainted with the data set quickly.
Getting the Tokens
Second, let’s create tokens that will serve as input for spaCy. In the line below, we create a variable tokens that contains all the words in the 'text' column of the df dataframe.
tokens = nlp(''.join(str(df.text.tolist())))
Third, we’re going to extract entities. We can just extract the most common entities for now:
items = [x.text for x in tokens.ents]
Counter(items).most_common(20)
Extracting Named-Entities
Next, we’ll extract the entities based on their categories. We have a few to choose from, ranging from people to events and even organizations. For a complete list of all that spaCy has to offer, check out their documentation on named-entities.
To start, we’ll extract people (real and fictional) using the PERSON type.
person_list = []

for ent in tokens.ents:
    if ent.label_ == 'PERSON':
        person_list.append(ent.text)
In the code above, we started by making an empty list with person_list = [].
Then, we utilized a for-loop to loop through the entities found in tokens with tokens.ents. After that, we made a conditional that will append to the previously created list if the entity label is equal to PERSON type.
We’ll want to know how many times a certain entity of the PERSON type appears in the tokens, so we did that with person_counts = Counter(person_list).most_common(20). This line will give us the top 20 most common entities for this type.
Finally, we created the df_person dataframe to store the results and this is what we get:
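A sketch of those two steps (the column names are assumptions that mirror the df_norp plot later on):

# count the most common PERSON entities and store them in a dataframe
person_counts = Counter(person_list).most_common(20)
df_person = pd.DataFrame(person_counts, columns=['text', 'count'])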
We’ll repeat the same pattern for the NORP type which recognizes nationalities, religious and political groups.
norp_list = []

for ent in tokens.ents:
    if ent.label_ == 'NORP':
        norp_list.append(ent.text)
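As with PERSON, a sketch of how df_norp might be built (column names follow the plot call below):

# count the most common NORP entities and store them in a dataframe
norp_counts = Counter(norp_list).most_common(20)
df_norp = pd.DataFrame(norp_counts, columns=['text', 'count'])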
Let’s create a horizontal bar graph of the df_norp dataframe.
df_norp.plot.barh(x='text', y='count', title="Nationalities, Religious, and Political Groups", figsize=(10,8)).invert_yaxis()
Voilà, that’s it!
I hope you enjoyed this one. Natural language processing is a huge topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.