From Slacker to Data Scientist

My journey into data science without a degree.

Butterflies in my belly; my stomach is tied up in knots. I know I’m taking a risk by sharing my story, but I wanted to reach out to others aspiring to be data scientists. I am writing this in the hope that my story will encourage and motivate you. At the very least, hopefully, your journey won’t be as long as mine.

So, full speed ahead.

I don’t have a PhD. Heck, I don’t even have any degree to speak of. Still, I am fortunate enough to work as a data scientist at a ridiculously good company.

How did I do it? Hint: I had a lot of help.

Never Let Schooling Interfere With Your Education — Grant Allen

Formative Years

It was 1995 and I had just gotten my very first computer. It was a 1982 Apple IIe. It didn’t come with any software but it came with a manual. That’s how I learned my very first computer language: Apple BASIC.

My love for programming was born.

In Algebra class, I remember learning about the quadratic equation. I had a cheap graphing calculator then, a Casio, about half the price of a TI-82. It came with a manual too, so I decided to write a program that would solve the quadratic equation for me without much hassle.

My love for solving problems was born.

In my senior year, my parents didn’t know anything about financial aid, but I was determined to go to college, so I decided to join the Navy and use the Montgomery GI Bill to pay for it. After all, four years of service didn’t seem that long.

My love for adventure was born.

Later in my career in the Navy, I was promoted to ship’s financial manager. I was in charge of managing multiple budgets. The experience taught me bookkeeping.

My love for numbers was born.

After the Navy, I ended up volunteering for a non-profit. They eventually recruited me to start a domestic violence crisis program from scratch. I had no social work experience, but I agreed anyway.

My love for saying “Why not?” was born.

Rock Bottom

After a few successful years, my boss retired and the new boss fired me. I was devastated. I fell into a deep state of clinical depression and I felt worthless.

I recall crying very loudly at the kitchen table. It had been more than a year since my non-profit job, and I was nowhere close to having a prospect for the next one. I was in a very dark space.

Thankfully, the crying fit was a cathartic experience. It gave me a jolt to do some introspection, stop whining, and come up with a plan.

“Choose a Job You Love, and You Will Never Have To Work a Day in Your Life.” — Anonymous

Falling in Love, All Over Again

To pay the bills, I had been working as a freelance web designer/developer, but I wasn’t happy. Frankly, the business of web design bored me. It was frustrating working with clients who thought and acted like they were the experts on design.

So I started thinking, “what’s next?”.

Searching the web, I stumbled upon the latest news in artificial intelligence. It led me to machine learning, which in turn led me to the subject of data science.

I was infatuated.

I signed up for Andrew Ng’s machine learning course on Coursera. I listened to TWiML, Linear Digressions, and a few other podcasts. I revisited Python and got reacquainted with Git and GitHub.

I was in love.

It was at this time that I made the conscious decision to be a data scientist.

Leap of Faith

Learning something new was fun for me. But still, I had that voice in my head telling me that no matter how much I studied and learned, I would never get a job because I don’t have a degree.

So, I took a hard look in the mirror and acknowledged that I needed help. The question was where to start looking.

Then one day, out of the blue, my girlfriend asked me what data science is. I jumped to my feet and started explaining right away. When I stopped to catch a breath, I managed to ask her why she asked. That’s when she told me that she’d seen a sign on a billboard. We went for a drive so I could see the sign for myself. It was a curious billboard with two big words, “data science,” and a smaller one that said “Codeup.” I went to their website and researched their employment outcomes.

I was sold.


Before the start of the class, we were given a list of materials to go over.

Given that I had only about two months to prepare, I was not expected to finish the courses; I was basically told to just skim the content. Well, I did them anyway. I spent day and night going over the courses and materials. Did the tests, got the certificates!


Boot camp was a blur. We had a saying in the Navy about the boot camp experience: “the days drag on but the weeks fly by.” This was definitely true for the Codeup boot camp as well.

Codeup is described as a “fully-immersive, project-based 18-week Data Science career accelerator that provides students with 600+ hours of expert instruction in applied data science. Students develop expertise across the full data science pipeline (planning, acquisition, preparation, exploration, modeling, delivery), and become comfortable working with real, messy data to deliver actionable insights to diverse stakeholders.”¹

We coded in Python, queried SQL databases, and made dashboards in Tableau. We did project after project. We learned different methodologies: regression, classification, clustering, time-series analysis, anomaly detection, natural language processing, and distributed machine learning.

More importantly, the experience taught us the following:

  1. Real data is messy; deal with it.
  2. If you can’t communicate with your stakeholders, you’re useless.
  3. Document your code.
  4. Read the documentation.
  5. Always be learning.

Job Hunting

Our job hunting process started on day one of boot camp. We updated our LinkedIn profiles and made sure we were pushing to GitHub almost every day. I even spruced up my personal website to include the projects we’d done in class. And of course, we made sure our resumés were in good shape.

Codeup helped me with all of these.

In addition, Codeup also helped prepare us for both technical and behavioral interviews. We practiced answering questions following the S.T.A.R. format (Situation, Task, Action, Result). We optimized our answers to highlight our strengths as high-potential candidates.


My education continued even after graduation. In between filling out applications, I would code every day and try out different Python libraries. I regularly read the news for the latest developments in machine learning. While doing chores, I listened to podcasts, TED Talks, or LinkedIn Learning videos. When bored, I listened to or read books.

There are a lot of good technical books out there to read. But for the non-technical ones, I recommend the following:

  • Thinking with Data by Max Shron
  • Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy by Cathy O’Neil
  • Invisible Women: Data Bias in a World Designed for Men by Caroline Criado Perez
  • Rookie Smarts: Why Learning Beats Knowing in the New Game of Work by Liz Wiseman
  • Grit: The Power of Passion and Perseverance by Angela Duckworth
  • The First 90 Days: Proven Strategies for Getting Up to Speed Faster and Smarter by Michael Watkins

Dealing with Rejection

I’ve had a lot of rejections. The first one was the hardest but after that, it kept getting easier. I developed a thick skin and just moved on.

Rejection sucks. Try not to take it personally. Nobody likes to fail, but it will happen. When it does, fail up.


It took me 3 months after graduating from boot camp to get a job. It took a lot of sacrifices. When I finally got the job offer, I felt very grateful, relieved, and excited.

I could not have done it without Codeup and my family’s support.

Thanks for reading! I hope you got something out of this post.

To all aspiring data scientists out there, just don’t give up. Try not to listen to all the haters out there. If you must, hear what they have to say, take stock of your weaknesses, and aspire to learn better than yesterday. But never ever let them discourage you. Remember, data science skills lie on a spectrum. If you’ve got the passion and perseverance, I’m pretty sure that there’s a company or organization out there that’s just the right fit for you.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] Codeup Alumni Portal. (May 31, 2020). Resumé — Ednalyn C. De Dios

This article was first published in Towards Data Science’s Medium publication.

Populating a Network Graph with Named-Entities

An early attempt of using networkx to visualize the results of natural language processing.

I do a lot of natural language processing and usually, the results are pretty boring to the eye. When I learned about network graphs, it got me thinking, why not use keywords as nodes and connect them together to create a network graph?

Yupp, why not!

In this post, we’ll do exactly that. We’re going to extract named-entities from news articles about coronavirus and then use their relationships to connect them together in a network graph.

A Brief Introduction

A network graph is a cool visual that contains nodes (vertices) and edges (lines). It’s often used in social network analysis, but data scientists also use it for natural language processing.
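In miniature, the idea looks like this (the node names below are made up for illustration; later in this post, the nodes will be named-entities and the edges their co-occurrence within an article):

```python
import networkx as nx

# keywords become nodes, and an edge connects two keywords
# that appear together in the same document
G = nx.Graph()
G.add_nodes_from(["virus", "vaccine", "lockdown"])
G.add_edges_from([("virus", "vaccine"), ("virus", "lockdown")])

print(G.number_of_nodes())  # 3
print(G.number_of_edges())  # 2
print(dict(G.degree))       # {'virus': 2, 'vaccine': 1, 'lockdown': 1}
```

The degree of a node (how many edges touch it) is what we’ll use later to find the most connected entities.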


Natural Language Processing or NLP is a branch of artificial intelligence that deals with programming computers to process and analyze large volumes of text and derive meaning out of them.¹ In other words, it’s all about teaching computers how to understand human language… like a boss!


Enough introduction, let’s get to coding!

To get started, let’s make sure to take care of all dependencies. Open up a terminal and execute the following commands:

pip install -U spacy
python -m spacy download en
pip install networkx
pip install fuzzywuzzy

This will install spaCy and download the trained model for English. The third command installs networkx. This should work for most systems. If it doesn’t work for you, check out the documentation for spaCy and networkx. Also, we’re using fuzzywuzzy for some text preprocessing.

With that out of the way, let’s fire up a Jupyter notebook and get started!


Run the following code block into a cell to get all the necessary imports into our Python environment.

import pandas as pd
import numpy as np
import pickle
from operator import itemgetter
from fuzzywuzzy import process, fuzz

# for natural language processing
import spacy
import en_core_web_sm

# for visualizations
%matplotlib inline
from matplotlib.pyplot import figure
import networkx as nx

Getting the Data

If you want to follow along, you can download the sample dataset here. The file was created using newspaper to import news articles from NPR. If you’re feeling adventurous, use the code snippet below to build your own dataset.

import requests
import json
import time
import newspaper
import pickle

# the source URL was elided in the original; https://www.npr.org is a stand-in
npr = newspaper.build('https://www.npr.org')

corpus = []
count = 0
for article in npr.articles:
    article.download()   # newspaper requires download() and parse() before .text
    article.parse()
    corpus.append(article.text)
    if count % 10 == 0 and count != 0:
        print('Obtained {} articles'.format(count))
    count += 1

corpus300 = corpus[:300]

with open("npr_coronavirus.txt", "wb") as fp:   # Pickling
    pickle.dump(corpus300, fp)

# with open("npr_coronavirus.txt", "rb") as fp:   # Unpickling
#     corpus = pickle.load(fp)

Let’s get our data.

with open('npr_coronavirus.txt', 'rb') as fp:   # Unpickling
    corpus = pickle.load(fp)

Extract Entities

Next, we’ll start by loading spaCy’s English model:

nlp = en_core_web_sm.load()

Then, we’ll extract the entities:

entities = []

for article in corpus[:50]:
    tokens = nlp(''.join(article))
    gpe_list = []
    for ent in tokens.ents:
        if ent.label_ == 'GPE':
            gpe_list.append(ent.text)
    entities.append(gpe_list)

In the above code block, we created an empty list called entities to store a list of lists containing the extracted entities from each article. In the for-loop, we looped through the first 50 articles of the corpus. For each iteration, we converted each article into tokens (words) and then looped through all those words to get the entities labeled as GPE (countries, states, and cities). We used ent.text to extract the actual entity and appended them one by one to entities.

Here’s the result:

Note that North Carolina has several variations of its name, and some have “the” prefixed to their names. Let’s get rid of them.

articles = []
for entity_list in entities:
    cleaned_entity_list = []
    for entity in entity_list:
        cleaned = entity.replace("'s", "").replace("’s", "")
        cleaned_entity_list.append(cleaned[4:] if cleaned.startswith('the ') else cleaned)
    articles.append(cleaned_entity_list)

In the code block above, we’re simply traversing the list of lists entities and cleaning the entities one by one, collecting the results into articles. With each iteration, we strip the prefix “the” and get rid of ’s.

Optional: FuzzyWuzzy

Looking at the entities, I noticed that there are also variations in the way “United States” is represented. Some appear as “United States of America” while others are just “United States.” We can trim these down to a more standard naming convention.

FuzzyWuzzy can help with this.

Described as “string matching like a boss,” FuzzyWuzzy uses Levenshtein distance to calculate the similarity between words. For a really good tutorial on how to use FuzzyWuzzy, check out Thanh Huynh’s article, FuzzyWuzzy: Find Similar Strings within one column in Python.
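To get a feel for what Levenshtein distance actually measures, here’s a minimal pure-Python sketch (FuzzyWuzzy wraps a much faster implementation and converts the distance into similarity ratios):

```python
def levenshtein(a, b):
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))                        # 3
print(levenshtein("United States", "United States of America"))  # 11
```

The smaller the distance, the more similar the strings, which is why “United States” scores close to “United States of America” despite the length difference.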

Here’s the optional code for using FuzzyWuzzy:

choices = set([item for sublist in articles for item in sublist])

cleaned_articles = []
for article in articles:
    article_entities = []
    for entity in set(article):
        article_entities.append(process.extractOne(entity, choices)[0])
    cleaned_articles.append(article_entities)

articles = cleaned_articles

For the final step before creating the network graph, let’s get rid of the empty lists within our list of lists that were generated by articles that didn’t have any GPE entity types.

articles = [article for article in articles if article != []]

Create the Network Graph

For the next step, we’ll create the world in which the graph will exist.

G = nx.Graph()

Then, we’ll manually add the nodes with G.add_nodes_from():

for entities in articles:
    G.add_nodes_from(entities)

Let’s see what the graph looks like with:

figure(figsize=(10, 8))
nx.draw(G, node_size=15)

Next, let’s add the edges that will connect the nodes.

for entities in articles:
    if len(entities) > 1:
        for i in range(len(entities) - 1):
            G.add_edges_from([(entities[i], entities[i + 1])])

For each iteration of the code above, we used a conditional that only entertains a list with two or more entities. Then, we manually connect each pair of adjacent entities with G.add_edges_from().

Let’s see what the graph looks like now:

figure(figsize=(10, 8))
nx.draw(G, node_size=10)

This graph reminds me of spiders! LOL.

To organize it a bit, I decided to use the shell version of the network graph:

figure(figsize=(10, 8))
nx.draw_shell(G, node_size=15)

We can tell that some nodes are heavier on connections than others. To see which nodes have the most connections, let’s use G.degree.

This gives the following degree view:

Let’s find out which node or entity has the most number of connections.

max(dict(G.degree).items(), key=lambda x: x[1])

To find out which other nodes have the most number of connections, let’s check the top 5:

degree_dict = dict(G.degree)
nx.set_node_attributes(G, degree_dict, 'degree')

sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)

Above, sorted_degree is a list that contains all the nodes and their degree values. We only want the top 5, like so:

print("Top 5 nodes by degree:")
for d in sorted_degree[:5]:
    print(d)

Bonus Round: Gephi

Gephi is an open-source and free desktop application that lets us visualize, explore, and analyze all kinds of graphs and networks.²

Let’s export our graph data into a file so we can import it into Gephi.

nx.write_gexf(G, "npr_coronavirus_GPE_50.gexf")

Cool beans!

Next Steps

This time, we only processed 50 articles from NPR. What would happen if we processed all 300 articles from our dataset? What will we see if we change the entity type from GPE to PERSON? How else can we use network graphs to visualize natural language processing results?

There’s always more to do. The possibilities are endless!

I hope you enjoyed today’s post. The code is not perfect and we have a long way to go towards realizing insights from the data. I encourage you to dive deeper and learn more about spaCy, networkx, fuzzywuzzy, and even Gephi.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Wikipedia. (May 25, 2020). Natural language processing

[2]: Gephi. (May 25, 2020). The Open Graph Viz Platform

This article was first published in Towards Data Science’s Medium publication.

From DataFrame to Named-Entities

A quick-start guide to extracting named-entities from a Pandas dataframe using spaCy.

A long time ago in a galaxy far away, I was analyzing comments left by customers and I noticed that they seemed to mention specific companies much more than others. This gave me an idea. Maybe there is a way to extract the names of companies from the comments and I could quantify them and conduct further analysis.

There is! Enter: named-entity-recognition.

Named-Entity Recognition

According to Wikipedia, named-entity recognition or NER “is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.”¹ In other words, NER attempts to extract words that can be categorized into proper names and even numerical entities.

In this post, I’ll share the code that will let us extract named-entities from a Pandas dataframe using spaCy, an open-source library that provides industrial-strength natural language processing in Python and is designed for production use.²

To get started, let’s install spaCy with the following pip command:

pip install -U spacy

After that, let’s download the pre-trained model for English:

python -m spacy download en

With that out of the way, let’s open up a Jupyter notebook and get started!


Run the following code block into a cell to get all the necessary imports into our Python environment.

# for manipulating dataframes
import pandas as pd

# for natural language processing: named entity recognition
import spacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

# for visualizations
%matplotlib inline

The important line in this block is nlp = en_core_web_sm.load() because this is what we’ll be using later to extract the entities from the text.

Getting the Data

First, let’s get our data and load it into a dataframe. If you want to follow along, download the sample dataset here or create your own from the Trump Twitter Archive.

df = pd.read_csv('ever_trump.csv')

Running df.head() in a cell will get us acquainted with the data set quickly.

Getting the Tokens

Second, let’s create tokens that will serve as input for spaCy. In the line below, we create a variable tokens that contains all the words in the 'text' column of the df dataframe.

tokens = nlp(''.join(str(df.text.tolist())))

Third, we’re going to extract entities. We can just extract the most common entities for now:

items = [x.text for x in tokens.ents]
Counter(items).most_common(20)

Extracting Named-Entities

Next, we’ll extract the entities based on their categories. We have a few to choose from people to events and even organizations. For a complete list of all that spaCy has to offer, check out their documentation on named-entities.


To start, we’ll extract people (real and fictional) using the PERSON type.

person_list = []

for ent in tokens.ents:
    if ent.label_ == 'PERSON':
        person_list.append(ent.text)

person_counts = Counter(person_list).most_common(20)
df_person = pd.DataFrame(person_counts, columns=['text', 'count'])

In the code above, we started by making an empty list with person_list = [].

Then, we utilized a for-loop to loop through the entities found in tokens with tokens.ents. After that, we made a conditional that appends the entity text to the previously created list if the entity label is equal to the PERSON type.

We want to know how many times a certain entity of the PERSON type appears in the tokens, so we counted them with person_counts = Counter(person_list).most_common(20). This line gives us the top 20 most common entities of this type.

Finally, we created the df_person dataframe to store the results and this is what we get:


We’ll repeat the same pattern for the NORP type, which recognizes nationalities and religious and political groups.

norp_list = []

for ent in tokens.ents:
    if ent.label_ == 'NORP':
        norp_list.append(ent.text)

norp_counts = Counter(norp_list).most_common(20)
df_norp = pd.DataFrame(norp_counts, columns=['text', 'count'])

And this is what we get:


Bonus Round: Visualization

Let’s create a horizontal bar graph of the df_norp dataframe.

df_norp.plot.barh(x='text', y='count', title="Nationalities, Religious, and Political Groups", figsize=(10,8)).invert_yaxis()

Voilà, that’s it!

I hope you enjoyed this one. Natural language processing is a huge topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Wikipedia. (May 22, 2020). Named-entity recognition

[2]: spaCy. (May 22, 2020). Industrial-Strength Natural Language Processing in Python

This article was first published in Towards Data Science’s Medium publication.

Create an N-Gram Ranking in Power BI

A quick start guide on building a Python visual with a few simple clicks of the mouse and a dash of code.

In a previous article, I wrote a quick start guide on creating and visualizing n-gram ranking using nltk for natural language processing. However, I needed a way to share my findings with others who don’t have Python or Jupyter Notebook installed on their machines. I needed to use our organization’s BI reporting tool: Power BI.

Enter Python Visual.

The Python visual allows you to create a visualization generated by running Python code. In this post, we’ll walk through the steps needed to visualize the results of our n-gram ranking using this visual.

First, let’s get our data. You can download the sample dataset here. Then, we could load the data into Power BI Desktop as shown below:

Select Text/CSV and click on “Connect”.

Select the file in the Windows Explorer folder and click open:

Click on “Load”.

Next, find the Py icon on the “Visualizations” panel.

Then, click on “Enable” at the prompt that appears to enable script visuals.

You’ll see a placeholder appear in the main area and a Python script editor panel at the bottom of the dashboard.

Select the ‘text’ column on the “Fields” panel.

You’ll see a predefined script that serves as a preamble for the script that we’re going to write.

In the Python script editor panel, place your cursor at the end of line #6 and hit enter twice.

Then, copy and paste the following code:

import re
import unicodedata
import pandas
import nltk
from nltk.corpus import stopwords

# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['covfefe']

import matplotlib.pyplot as plt

def basic_clean(text):
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

words = basic_clean(''.join(str(dataset['text'].tolist())))

bigrams_series = (pandas.Series(nltk.ngrams(words, 2)).value_counts())[:12]
bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.show()  # Power BI renders the figure once the script calls plt.show()

In a nutshell, the code above extracts n-grams from the 'text' column of the dataset dataframe and creates a horizontal bar graph out of them using matplotlib. The result is what Power BI displays on the Python visual.

For more information on this code, please visit my previous tutorial, From DataFrame to N-Grams.

After you’re done pasting the code, click on the “play” icon at the upper right corner of the Python script editor panel.

After a few moments, you should now be able to see the horizontal bar graph like the one below:

And that’s it!

With a few simple clicks of the mouse, along with some help from our Python script, we’re able to visualize the results of our n-gram ranking.

I hope you enjoyed today’s post on one of Power BI’s strongest features. Power BI already has some useful and beautiful built-in visuals but sometimes, you just need a little bit more flexibility. Running Python code helps with this. I hope this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, I’ll share a quick-start guide to extracting named-entities from a Pandas dataframe using spaCy.

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published in Towards Data Science’s Medium publication.

From DataFrame to N-Grams

A quick-start guide to creating and visualizing n-gram ranking using nltk for natural language processing.

When I was first starting to learn NLP, I remember getting frustrated or intimidated by information overload so I’ve decided to write a post that covers the bare minimum. You know what they say, “Walk before you run!”

This is a very gentle introduction so we won’t be using any fancy code here.

In a nutshell, natural language processing or NLP simply refers to the process of reading and understanding written or spoken language using a computer. At its simplest use case, we can use a computer to read a book, for example, and count how many times each word was used instead of us manually doing it.

NLP is a big topic and there’s already been a ton of articles written on the subject so we won’t be covering that here. Instead, we’ll focus on how to quickly do one of the simplest but useful techniques in NLP: N-gram ranking.

N-Gram Ranking

Simply put, an n-gram is a sequence of n words where n is a discrete number that can range from 1 to infinity! For example, the word “cheese” is a 1-gram (unigram). The combination of the words “cheese flavored” is a 2-gram (bigram). Similarly, “cheese flavored snack” is a 3-gram (trigram). And “ultimate cheese flavored snack” is a 4-gram. So on and so forth.

In n-gram ranking, we simply rank the n-grams according to how many times they appear in a body of text — be it a book, a collection of tweets, or reviews left by customers of your company.
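Before reaching for nltk, it’s worth seeing how little machinery the technique needs. This sketch (with toy words borrowed from the example above) forms bigrams with zip and ranks them with a Counter, which is essentially what nltk.ngrams plus value_counts() will do for us below:

```python
from collections import Counter

words = ["cheese", "flavored", "snack", "cheese", "flavored",
         "popcorn", "cheese", "flavored", "snack"]

# an n-gram is just a sliding window of n consecutive words
def ngrams(words, n):
    return list(zip(*[words[i:] for i in range(n)]))

bigram_counts = Counter(ngrams(words, 2)).most_common(3)
print(bigram_counts)
# [(('cheese', 'flavored'), 3), (('flavored', 'snack'), 2), ...]
```

Ranking is nothing more than sorting those window counts in descending order.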

Let’s get started!

Getting the Data

First, let’s get our data and load it into a dataframe. You can download the sample dataset here or create your own from the Trump Twitter Archive.

import pandas as pd

df = pd.read_csv('tweets.csv')

Using df.head() we can quickly get acquainted with the dataset.

A sample of President Trump’s tweets.

Importing Packages

Next, we’ll import packages so we can properly set up our Jupyter notebook:

# natural language processing: n-gram ranking
import re
import unicodedata
import nltk
from nltk.corpus import stopwords

# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['covfefe']

import matplotlib.pyplot as plt

Earlier, we imported pandas so that we can shape and manipulate our data in all sorts of different and wonderful ways! In the code block above, we imported re for regex, unicodedata for Unicode data, and nltk to help with parsing the text and cleaning it up a bit. We then specified additional stop words that we want to ignore; this is helpful in trimming down the noise. Lastly, we imported matplotlib so we can visualize the result of our n-gram ranking later.

Next, let’s create a function that will perform basic cleaning of the data.

Basic Cleaning

def basic_clean(text):
    """
    A simple function to clean up the data. All the words that
    are not designated as stop words are lemmatized after
    encoding and basic regex parsing are performed.
    """
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

The function above takes in a list of words or text as input and returns a cleaner set of words. The function does normalization, encoding/decoding, lower casing, and lemmatization.

Let’s use it!

words = basic_clean(''.join(str(df['text'].tolist())))

Above, we’re simply calling the function basic_clean() to process the 'text' column of our dataframe df after making it a simple list with tolist(). We then assign the results to words.

A list of already cleaned, normalized, and lemmatized words.


Here comes the fun part! In one line of code, we can find out which bigrams occur the most in this particular sample of tweets.

(pd.Series(nltk.ngrams(words, 2)).value_counts())[:10]

We can easily replace the number 2 with 3 so we can get the top 10 trigrams instead.

(pd.Series(nltk.ngrams(words, 3)).value_counts())[:10]

Voilà! We got ourselves a great start. But why stop now? Let’s try it and make a little eye candy.

Bonus Round: Visualization

To make things a little easier for ourselves, let’s assign the result of n-grams to variables with meaningful names:

bigrams_series = (pd.Series(nltk.ngrams(words, 2)).value_counts())[:12]
trigrams_series = (pd.Series(nltk.ngrams(words, 3)).value_counts())[:12]

I’ve replaced [:10] with [:12] because I wanted more n-grams in the results. This is an arbitrary value so you can choose whatever makes the most sense to you according to your situation.

Let’s create a horizontal bar graph:

bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))

And let’s spiffy it up a bit by adding titles and axis labels:

bigrams_series.sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
plt.title('12 Most Frequently Occurring Bigrams')
plt.xlabel('# of Occurrences')

And that’s it! With a few simple lines of code, we quickly made a ranking of n-grams from a Pandas dataframe and even made a horizontal bar graph out of it.

I hope you enjoyed this one. Natural Language Processing is a big topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, we’ll visualize an n-gram ranking in Power BI with a few simple clicks of the mouse and a dash of Python!

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published on Towards Data Science’s Medium publication.

Initial Impressions of Fivetran

Their website is super slow.

When logging into the site, it takes 10 seconds or more for the dashboard to load.

They don’t have voice or chat support.

The only avenue of support is through their “Submit Request” portal which, depending on your support service level agreement, can take hours to get a reply. For example, with the Starter and Standard pricing plans, their initial response time is within 4 hours for the most urgent issues. However, if you get their Enterprise plan, the initial response time drops to just 1 hour.

They lack troubleshooting guidelines.

On the page where you set up a connector, there is a link to the “Configurations Instructions” on the upper right area of the page. These instructions are great because they give you step-by-step instructions (complete with pictures). The instructions are detailed enough but they lack basic troubleshooting guidelines.

To quote one reviewer:

“Fivetran is a blackbox – when it works, great, when it doesn’t, good luck”

— Michael E.

File and folder pattern configuration is finicky.

For example, I was setting up an S3 connector and I ran into trouble with the regex I used for the file pattern. The support technician advised me to change my file pattern to ^zendesk_tickets_* but it didn’t work. We spent at least an hour going through setting up role permissions in AWS before the technician finally gave up and told me that he’d just get back to me later. A few hours later, the technician finally determined that the regex needed to be changed to ^zendesk_tickets_\d*.csv.

When setting up a connector in Google Cloud Storage, I tried to use a similar regex using ^chat-\d*\.csv. However, since the files were in a sub-folder, it didn’t take. Instead I had to use .*chat-\d*\.csv.

This problem is easy enough to rectify: just learn some Java-flavored regex! However, it can still be annoying.
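If you want to sanity-check a file pattern before handing it to Fivetran, you can approximate it with Python’s re module (Fivetran uses Java-flavored regex, but these particular constructs behave the same way in both; the trailing dot is escaped here for strictness):

```python
import re

# the pattern the support technician eventually arrived at
tickets = re.compile(r'^zendesk_tickets_\d*\.csv')
print(bool(tickets.match('zendesk_tickets_20200531.csv')))  # True
print(bool(tickets.match('zendesk_users_20200531.csv')))    # False

# files in a sub-folder need a prefix that also matches the folder path
chats = re.compile(r'.*chat-\d*\.csv')
print(bool(chats.match('exports/2020/chat-0531.csv')))  # True
```

Testing the pattern locally like this would have saved us that hour of chasing AWS permissions.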

Built for analysts, but you need access.

Fivetran boasts that it can “Replicate everything, with zero configuration and schemas designed for analytics. Eliminate engineering busywork while empowering your analysts to prove value.” However, this dream is difficult to realize unless you actually give those analysts access. Often they don’t get it, because data engineers like to control their data. Then again, that’s not Fivetran’s problem but an internal company issue, so make sure your organization is suitably prepared.


I was writing about my journey from slacker to data scientist and I was reminded of just how fortunate I am because I had a lot of help along the way.

  • I am blessed to be working in the field of data science.
  • I am blessed to be employed by a ridiculously good company.
  • I am blessed to still have a job amidst the COVID-19 crisis.

And most importantly, I truly am very fortunate to have family and friends, both professional and personal, that helped me get to where I am now.

Today, I created a Kiva Team “Data Scientists for Good” with hopes of encouraging other data scientists, data analysts, and data engineers to give back. Click here if you’re interested in joining the team.

So, what are you grateful for?