Grateful

I was writing about my journey from slacker to data scientist and I was reminded of just how fortunate I am because I had a lot of help along the way.

  • I am blessed to be working in the field of data science.
  • I am blessed to be employed by a ridiculously good company.
  • I am blessed to still have a job amidst the COVID-19 crisis.

And most importantly, I am truly fortunate to have family and friends, both professional and personal, who helped me get to where I am now.

Today, I created a Kiva Team “Data Scientists for Good” with hopes of encouraging other data scientists, data analysts, and data engineers to give back. Click here if you’re interested in joining the team.

So, what are you grateful for?

PAPEM-DM: 7 Steps Towards a Data Science Win

Introduction

In this post, I’ll attempt to break down the steps involved in a successful data science project: from understanding business requirements to the maintenance of whatever data product your data science team ends up producing. Along the way, we’ll discuss the requirements for each step as well as the skills and tools that are essential for each step.

PAPEM-DM: A Data Science Framework

I first learned about the data science pipeline while attending Codeup (a fully immersive, project-based, 20-week Data Science and Web Development career accelerator). Since then, I’ve read countless articles enumerating the steps involved in data science projects, but I have yet to come across one that includes them all. Hence, this article.

PAPEM-DM is a handy acronym that I made up to remind me of the steps: planning, acquisition, preparation, exploration, modeling, delivery, and maintenance. This should cover all the main steps of the data science process from end to end. Let me know if I’ve missed something important! Oh wait, they’re all important… so comment away! LOL.

Planning

In this very first step, we must determine the questions that need to be addressed. Most stakeholders and end-users speak business, not data science; more often than not, they ask questions that are either too general or too specific to address the real problem that needs to be solved.  When starting a new project, it’s important to understand the business side of things first before getting lost in the weeds.

In the book Thinking with Data, Max Shron advocates answering the “why” before figuring out the “how.” He proposes using the CoNVO framework, which consists of context, needs, vision, and outcome. What are the circumstances or terms in which the problem can be better understood or explained? What are the specific needs that can be satisfied by leveraging data? What will the project look like once it achieves its goal(s)? Last but not least, how will the deliverables of the project be used within the organization, and who will own and maintain them?

No fancy tools are needed for this step, but that doesn’t mean it’s unimportant. In fact, it’s critical: it can mean the difference between success and failure for your data science project.

The deliverable for this step is a clear delineation of the things that you want to accomplish and your measure of success.

Acquisition

This step involves acquiring the raw data you need to provide insight into the problem at hand. It requires you to think about where and how you will get the data, and whether the process will be manual or automated.

The tools you’ll need will depend on the data source (where the data will be coming from) and your working environment. For example, if the data already lives in a data warehouse like Redshift, getting the data might simply mean connecting to the warehouse using SQLAlchemy and loading it into your Python environment as a Pandas dataframe. On the other hand, if the data is sent periodically via email, you might want to use a connector service like Fivetran to load the file into Redshift first, and then connect to the Redshift cluster using a business analytics tool like Microsoft Power BI. If a project is a one-time ad-hoc request, the process might simply mean retrieving a flat file (.csv, .tsv, or even .xlsx) from a local or shared drive and loading it directly into a Pandas dataframe for further processing in a Jupyter notebook. Another straightforward approach is manually extracting the data from a CRM platform like Salesforce or Zendesk.
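
To make the first scenario concrete, here’s a minimal sketch of pulling a warehouse table into a Pandas dataframe with SQLAlchemy. The connection string, table name, and file path are all placeholders, so adjust them to your own environment.

import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string -- swap in your own host, database, and
# credentials (ideally read from environment variables, not hard-coded).
# The plain Postgres driver works against Redshift; the sqlalchemy-redshift
# dialect is another option.
engine = create_engine(
    'postgresql+psycopg2://user:password@my-cluster.example.com:5439/analytics'
)

# Pull the raw data into a dataframe; "tickets" is a placeholder table name.
df = pd.read_sql('SELECT * FROM tickets', engine)

# For a one-time ad-hoc request, reading a flat file is even simpler.
df_adhoc = pd.read_csv('data/tickets_export.csv')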

Whatever the case might be, the outcome of this step is a data set that’s ready to be processed by your weapon (or tool) of choice.

Preparation

Real-world data is often messy and will need to be cleaned and processed before analysis. Data preparation is all about transforming the raw data from the previous step into a format that allows us to glean insights. The work usually includes taking care of missing, null, NaN, and duplicated values; depending on the nature of the project, these might need to be imputed or dropped entirely. Data types will also need to be handled, as will DateTime conflicts. In addition, you might need to combine several datasets into one massive table or dataframe, or trim a multitude of columns down to just a few.
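
As a rough sketch of what that looks like in Pandas (the toy data and column names here are made up purely for illustration):

import pandas as pd

# Toy raw data standing in for whatever the acquisition step produced.
raw = pd.DataFrame({
    'customer_id': [101, 101, 102, None],
    'created_at': ['2020-05-01', '2020-05-01', '2020-05-03', 'not a date'],
    'amount': [250.0, 250.0, None, 80.0],
})

df = (raw
      .drop_duplicates()                # remove duplicated rows
      .dropna(subset=['customer_id'])   # drop rows missing a key field
      .assign(
          # impute missing amounts with 0 (or drop them, depending on the project)
          amount=lambda d: d['amount'].fillna(0),
          # fix data types
          customer_id=lambda d: d['customer_id'].astype(int).astype(str),
          # coerce bad DateTime values to NaT instead of raising an error
          created_at=lambda d: pd.to_datetime(d['created_at'], errors='coerce'),
      ))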

Again, the tools needed will vary depending on your working environment. For Python, the libraries you would likely use are Pandas, matplotlib, and scikit-learn. If you’re working with Power BI, you most likely will use its Power Query Editor and DAX formulas, keeping in mind that relationships between tables must also be defined properly before joins and merges can work. If your organization uses Fivetran, you might use its “transformations” feature to execute some SQL queries prior to analysis.

Once you’re done with cleaning and preparation, you’re ready to proceed to the next step: exploratory data analysis.

Exploration

According to the Engineering Statistics Handbook, exploratory data analysis, or EDA for short, is an approach to data analysis that employs a variety of techniques in order to:

  • maximize insight into a data set;
  • uncover underlying structure;
  • extract important variables;
  • detect outliers and anomalies;
  • test underlying assumptions;
  • develop parsimonious models; and
  • determine optimal factor settings.

In other words, this step is where you get to play with the data to discover interesting patterns and anomalies, and to identify which features or variables are your biggest drivers.

Feature engineering and pre-processing are also major components of this step. For Python, you can use Pandas, scipy, numpy, statsmodels, and visualization libraries like matplotlib, seaborn, plotly, and bokeh.
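
As a minimal sketch of a first EDA pass, assuming df is the prepared dataframe from the previous step and 'amount' is a hypothetical numeric column:

import matplotlib.pyplot as plt
import seaborn as sns

# Summary statistics and structure of the dataset
print(df.describe(include='all'))
print(df.dtypes)

# Distribution of a numeric variable, to spot skew and outliers
sns.histplot(df['amount'])
plt.show()

# Correlations between numeric features, to surface the biggest drivers
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
plt.show()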

At the end of this step, not only should you have a dataset in a format that can be fed into a machine learning model, but you should also have answers to some of the questions that were raised during the planning step.

Modeling

This is the most popular part of the data science project lifecycle. In the modeling step, we take our cleaned and processed data and use it to train and test one or more machine learning algorithms in order to make predictions.

The types of machine learning algorithms include regression, classification, clustering, anomaly detection, time-series forecasting, and my favorite: natural language processing or NLP.

The tasks involved in this step are:

  1. Split the data into training and testing sets
  2. Identify the machine learning model or models that are most appropriate in the project’s specific use case
  3. Train a model
    1. Make predictions based on the training set
    2. Evaluate the results on the training set
    3. Tune the hyperparameters
    4. Rinse and repeat
  4. Choose your best performing model
  5. Make predictions based on the testing set
  6. Evaluate the results on the testing set

There is a plethora of machine learning algorithms out there, and it is up to the data scientist to select which ones to use depending on the nature of the features (variables) and the target (what we’re trying to predict).
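
Here’s a bare-bones sketch of that loop using scikit-learn; the built-in dataset, logistic regression model, and hyperparameter grid are stand-ins for whatever fits your project.

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

# Stand-in dataset; in practice X and y come from your prepared dataframe.
X, y = load_breast_cancer(return_X_y=True)

# 1. Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2-3. Train a candidate model and tune its hyperparameters on the training set
grid = GridSearchCV(LogisticRegression(max_iter=5000), param_grid={'C': [0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)

# 4-6. Take the best performer and evaluate it on the held-out testing set
best_model = grid.best_estimator_
print(accuracy_score(y_test, best_model.predict(X_test)))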

Delivery

Having trained and evaluated a machine learning model is fine and dandy, but sooner or later you’re going to need to enable others to use what you have discovered or developed. The deliverables of a data science project can be as simple as a slide deck that reports the findings of your exploratory data analysis, with recommendations on the next actions to take. Another is a self-service dashboard that others can use to make data-driven decisions. You could also generate a table in a database that feeds live, real-time reporting. And last but not least, you can develop an application that uses your trained model to make predictions on new observations, for example, a mechanism that processes chat transcripts with a customer to predict their satisfaction or current sentiment.
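
For that last kind of deliverable, a minimal sketch (assuming a scikit-learn model like the best_model from the modeling sketch above, and a made-up file name) is to persist the trained model and reload it wherever new observations arrive:

import joblib

# Persist the trained model from the modeling step.
joblib.dump(best_model, 'sentiment_model.joblib')

# Later, inside the application that scores new observations, reload and predict.
model = joblib.load('sentiment_model.joblib')
# new_observations is a placeholder for incoming data that has already been
# run through the same preparation and feature-engineering steps.
predictions = model.predict(new_observations)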

The tools needed for this step depend on the type of deliverable. The skills required can range from simple storytelling techniques to a full-blown pipeline deployment in a serverless environment using services like AWS or Azure. Technological capabilities are also a factor that can dictate what type of data product you can deliver. This is also why it is crucial to have a great working relationship with your data engineering team; in a perfect world, data scientists and data engineers walk in lockstep with each other.

In this agile world that we live in today, shipping a minimum viable product (MVP) today is more beneficial than shipping something perfect tomorrow.

“Ship now, iterate later.”

– Chana

Maintenance

Maintaining data science projects requires constant vigilance over each component of the pipeline. It helps to review and examine any changes that could affect the project from beginning to end. For example, the shape and structure of the data might change, which will affect your pre-processing scripts. Make sure you’re aware of security vulnerabilities in the packages and libraries you use, and update each one accordingly while making sure project dependencies don’t break. Also, look for opportunities to improve the pipeline (like automation) as new technologies become available.
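
One cheap safeguard, sketched below with hypothetical column names, is to validate the incoming data’s structure at the top of your pre-processing script so a schema change fails loudly instead of silently corrupting everything downstream:

EXPECTED_COLUMNS = {'customer_id', 'created_at', 'amount'}

def validate_schema(df):
    """Fail fast if the structure of the source data has drifted."""
    missing = EXPECTED_COLUMNS - set(df.columns)
    if missing:
        raise ValueError(f'Upstream schema changed; missing columns: {sorted(missing)}')

validate_schema(df)  # call this before any cleaning or feature engineering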

Conclusion

It’s been a long article already. To recap, use the PAPEM-DM framework when planning and executing data science projects. Please note, the steps in the framework are not isolated from each other. More often than not, you’ll find yourself doing these steps in an iterative cyclical fashion rather than sequentially. For example, you might want to do more exploration after modeling because you discovered something interesting in the results and would like to dive deeper. Just don’t get lost in the weeds too much!

That’s it for today!

What is your experience like regarding the data science project lifecycle? How would this framework change or stay the same when switching from dev to production? Please comment away, the more the merrier!

Initial Impressions of Fivetran

Their website is super slow.

After logging into the site, it takes 10 seconds or more for the dashboard to load.

They don’t have voice or chat support.

The only avenue of support is through their “Submit Request” portal which, depending on your support service level agreement, can take hours to get a reply. For example, with the Starter and Standard pricing plans, their initial response time is within 4 hours for the most urgent issues. However, if you get their Enterprise plan, the initial response time drops to just 1 hour.

They lack troubleshooting guidelines.

On the page where you set up a connector, there is a link to the “Configurations Instructions” in the upper right area of the page. These are great: they walk you through the setup step by step, complete with pictures. But while they are detailed, they lack basic troubleshooting guidelines.

To quote a reviewer on G2.com:

“Fivetran is a blackbox – when it works, great, when it doesn’t, good luck”

Michael E, g2.com

File and folder pattern configuration is finicky.

For example, I was setting up an S3 connector and ran into trouble with the regex I used for the file pattern. The support technician advised me to change my file pattern to ^zendesk_tickets_* but it didn’t work. We spent at least an hour going through setting up role permissions in AWS before the technician finally gave up and told me he’d get back to me later. A few hours later, the technician finally determined that the regex needed to be changed to ^zendesk_tickets_\d*.csv.

When setting up a connector for Google Cloud Storage, I tried to use a similar regex: ^chat-\d*\.csv. However, since the files were in a sub-folder, it didn’t take. Instead, I had to use .*chat-\d*\.csv.

This problem is easy enough to rectify: just learn some Java-flavored regex! Still, it can be annoying.
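
Fivetran’s file patterns are Java-flavored, but the patterns above behave the same way in Python, so a quick local sanity check against a few sample file names (made up here) can save a support ticket:

import re

# Hypothetical file names like the ones sitting in the S3 bucket or GCS folder.
candidates = ['zendesk_tickets_20200516.csv', 'exports/chat-20200516.csv', 'readme.txt']

for pattern in [r'^zendesk_tickets_\d*.csv', r'.*chat-\d*\.csv']:
    matches = [name for name in candidates if re.match(pattern, name)]
    print(pattern, '->', matches)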

Built for analysts, but you need access.

Fivetran boasts that it can “Replicate everything, with zero configuration and schemas designed for analytics. Eliminate engineering busywork while empowering your analysts to prove value.” However, this dream is difficult to realize unless you actually give said analysts access. In practice, that rarely happens, because data engineers like to keep control of their data. Then again, this isn’t Fivetran’s problem so much as an internal company issue, so make sure your organization is suitably prepared.

Create a Network Graph in Power BI

In a previous article, I wrote a quick-start guide to visualizing a Pandas dataframe using networkx and matplotlib. While it was fun to learn and explore more about network graphs in Python, I got to thinking about how to present the results to others who don’t have Python or Jupyter Notebook installed on their machines. At TaskUs, we use Power BI for most of our reporting, so I began to search for a custom Power BI visualization that could take the data and transform it into a meaningful network graph.

Enter Network Navigator.

Network Navigator is a custom Power BI visual created by Microsoft. It allows you to “explore node-link data by panning over and zooming into a force-directed node layout (which can be precomputed or animated live).”¹ In this post, we’ll walk through the steps needed to create a network graph using this custom visual.

First, let’s get our data. You can download the sample dataset here. Then, we could load the data into Power BI Desktop as shown below:

Select Text/CSV and click on “Connect”.

Select the file in the Windows Explorer folder and click open:

Click on “Transform Data”.

Click on “Use First Row as Headers”.

Click on “Close & Apply”.

Next, find the three dots at the end of the “Visualizations” panel.

And select “Get more visuals”.

Click inside the search text box, type “network”, hit the “Enter” key, and click on the “Add” button.

Wait a few moments and you’ll see the notification below. Click on the “OK” button to close the notification.

You’ll see a new icon appear at the bottom of the “Visualizations” panel as shown below.

Click on the new icon and you will see something similar to the picture below.

With the new visual placeholder selected, click on “Reporter” and “Assignee” in the “Fields” panel, and the visual will automatically assign the columns as the Source and Target Node.

Let’s add labels by clicking on the paintbrush icon.

Click on “Layout” to expand the section and scroll down until you see the “Labels” section.

Click on the toggle switch under “Labels” to turn it on, and voilà!

That’s it! With a few simple clicks of the mouse, we’re able to create a network graph from a csv file.


I hope you enjoy today’s post on one of Power BI’s coolest visuals. Network graph analysis is a big topic but I hope this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, I’ll share my journey from slacker to data scientist, and I hope it inspires others rather than leaving them dissuaded by the haters.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Business Apps — Microsoft AppSource. (May 16, 2020). Network Navigator Chart https://appsource.microsoft.com/en-us/product/power-bi-visuals/WA104380795?src=office&tab=Overview

From DataFrame to Network Graph

I just discovered — quite accidentally — how to export data from JIRA so naturally, I began to think of ways to visualize the information and potentially glean some insight from the dataset. I’ve stumbled upon the concept of network graphs and the idea quickly captured my imagination. I realized that I can use it to tell stories not only about the relationships between people but between words as well! But NLP is a big topic, so how about we walk first and run later?!

This is just a very gentle introduction so we won’t be using any fancy code here.


Network graphs “show interconnections between a set of entities”¹ where the entities are nodes and the connections between them are represented through links or edges¹. In the graph below, the dots are the nodes and the lines are the edges.

Martin Grandjean / CC BY-SA (https://creativecommons.org/licenses/by-sa/3.0)

In this post, I’ll share the code that will let us quickly visualize a Pandas dataframe using a popular network graph package: networkx.

First, let’s get our data and load it into a dataframe. You can download the sample dataset here.

import pandas as pd
df = pd.read_csv('jira_sample.csv')

Second, let’s trim the dataframe to only include the columns we want to examine. In this case, we only want the columns ‘Assignee’ and ‘Reporter’.

df1 = df[['Assignee', 'Reporter']]

Third, it’s time to create the world in which the graph will exist. If you haven’t already, install the networkx package with a quick pip install networkx.

import networkx as nx
G = nx.from_pandas_edgelist(df1, 'Assignee', 'Reporter')

Next, we’ll materialize the graph we created with the help of matplotlib for formatting.

from matplotlib.pyplot import figure
figure(figsize=(10, 8))
nx.draw_shell(G, with_labels=True)

The most important line in the block above is nx.draw_shell(G, with_labels=True). It tells the computer to draw the graph G using a shell layout with the labels for entities turned on.

Voilà! We got ourselves a network graph:

A network graph of reporters and assignees from a sample JIRA dataset.

Right off the bat, we can tell that there’s a heavy concentration of lines originating from three major players, ‘barbie.doll’, ‘susan.lee’, and ‘joe.appleseed’. Of course, just to be sure, it’s always a good idea to confirm our ‘eyeballing’ with some hard data.

Bonus Round

Let’s check out ‘barbie.doll’.

G['barbie.doll']

To see how many connections ‘barbie.doll’ has, let’s use len():

len(G['barbie.doll'])

Next, let’s create another dataframe that shows the nodes and their number of connections.

leaderboard = {}
for x in G.nodes:
    leaderboard[x] = len(G[x])
s = pd.Series(leaderboard, name='connections')
df2 = s.to_frame().sort_values('connections', ascending=False)

In the code block above, we first initialized an empty dictionary called ‘leaderboard’ and then used a simple for-loop to populate the dictionary with names and number of connections. Then, we created a series out of the dictionary. Finally, we created another dataframe from the series that we created using to_frame().

To display the dataframe, we simply use df2.head() and we got ourselves a leaderboard!

And that’s it! With a few simple lines of code, we quickly made a network graph from a Pandas dataframe and even displayed a table with names and number of connections.


I hope you enjoyed this one. Network graph analysis is a big topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, I’ll walk through Power BI’s custom visual called ‘Network Navigator’ to create a network graph with a few simple clicks of the mouse.

Stay tuned!

[1]: Data-To-Viz. (May 15, 2020). Network Diagram https://www.data-to-viz.com/graph/network.html

Flexing my Regex muscles

I’d spent 12 hours scouring the internet for this problem and was close to giving up, so I decided to ask a question on Stack Overflow. Since this blog is all about my journey, I’m going to share the love and document my exploits here.

Here’s the problem:

In a dataframe, I have a duration column in a human-readable format like “29 days 4 hours 32 minutes 1 second”. I want to break it down into separate days, hours, minutes, and seconds columns with the values derived from the duration column: 29 for days, 4 for hours, 32 for minutes, and 1 for seconds. I’ve already tried this, but it’s not working correctly:

# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')

Running it results in the following dataframe:

The new dataframe only captures the first match on each row: it got 29 for days on the first row and 1, 3, 4, and 2 for hours on the rest, but the values in the remaining columns are all NaN.

Ideally, the dataframe should look like the one below:

I have a feeling something is wrong with my regex.

Update: Yes, I was right about the regex being wrong. I should have been using extractall instead of just extract. Here’s the correct code:

import pandas as pd

# Sample data; the 'duration' strings are what we want to split apart
data = {'id': ['123', '124', '125', '126', '127'],
        'date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018', '1/5/2018'],
        'duration': ['29 days 4 hours 32 minutes',
                     '1 hour 23 minutes',
                     '3 hours 2 minutes 1 second',
                     '4 hours 46 minutes 22 seconds',
                     '2 hours 1 minute']}

df = pd.DataFrame(data)

# Use regex to extract time values into their respective columns
pattern = (r'(?P<days>\d+)(?= days?\s*)|'
           r'(?P<hours>\d+)(?= hours?\s*)|'
           r'(?P<minutes>\d+)(?= minutes?\s*)|'
           r'(?P<seconds>\d+)(?= seconds?\s*)')

new_df = (df.duration.str.extractall(pattern)   # extract all matches with regex
          .reset_index('match', drop=True)      # merge the matches of the same row
          .stack()
          .unstack(level=-1, fill_value=0)      # remove fill_value if you want NaN instead of 0
          )

Thanks to Quang Hoang and everybody else for answering the problem!

How to put two or more dataframes together

The Pandas function below takes a list of dataframes and concatenates them into one. This basic flavor of concat() joins the dataframes vertically. In other words, the rows of one dataframe get appended below those of the previous one.

df = pd.concat([df1,df2,df3])

Or, if you want, you can store the list of dataframes in a variable first and then call the concat function, like so:

# we must import pandas first
# put it in the beginning of your file
import pandas as pd

frames = [df1, df2, df3, df4, df5]
df = pd.concat(frames)

On the other hand, if I want to join the dataframes horizontally, then I can use merge().

For example, in the code below, we merge df1 with df2 using ‘column_name’ as the common column, the column on which the merge is based. If there are any other identical columns that exist in both dataframes, the suffixes are appended to each of those column names accordingly.

This particular flavor of merge() joins the dataframes horizontally. In other words, the columns of the dataframes get added together to make one big mamma jamma of a dataframe:

df_merged = df1.merge(df2,
                      left_on='column_name',
                      right_on='column_name',
                      suffixes=('_left', '_right'))