Using MapQuest API to Get Geo Data

A friendly tutorial on getting zip codes and other geographic data from street addresses.


Knowing how to deal with geographic data is a must-have for a data scientist. In this post, we will play around with the MapQuest Geocoding API to get zip codes from street addresses, along with their corresponding latitude and longitude to boot!

The Scenario

In 2019, my friends and I participated in the CivTechSA Datathon. At one point in the competition, we wanted to visualize the data points and overlay them on San Antonio's map. The problem was, we had incomplete data. Surprise! All we had were a street number and a street name: no zip code, no latitude, no longitude. We then turned to the great internet for some help.

We found a great API by MapQuest that gave us exactly what we needed. With just a sprinkle of Python code, we were able to accomplish our goal.

Today, we’re going to walk through this process.

The Data

To follow along, you can download the data from here. Just scroll down to the bottom and tab on over to the Data Catalog 2019. Look for SAWS (San Antonio Water System) as shown below.

[Screenshot by Ednalyn C. De Dios]

Download the file by clicking on the link to the Excel file.

[Screenshot by Ednalyn C. De Dios]

Or, you can simply click on this.

MapQuest API Key

Head on over to https://developer.mapquest.com/ and create an account to get a free API key.

[Screenshots by Ednalyn C. De Dios]

Copy the ‘Consumer Key’ and keep it in a safe place. We’ll need it later.

Jupyter Notebook

Now, let’s fire up a Jupyter notebook and get coding!

For starters, let's set up the environment by doing a couple of imports.
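The setup cell itself is embedded as a gist in the original post; a minimal sketch of it might look like the following, where API_KEY is a placeholder for your own consumer key:

import pandas as pd
import requests

# Placeholder only -- swap in your own MapQuest consumer key.
API_KEY = 'YOUR_MAPQUEST_CONSUMER_KEY'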

Don't forget to replace API_KEY with your own key above.

Now, let's read the Excel file with a simple df = pd.read_excel().
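For instance, with a hypothetical file name standing in for wherever you saved the SAWS download:

# The file name below is a placeholder -- point it at your copy of the SAWS Excel file.
df = pd.read_excel('saws_data.xlsx')
df.head()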

[Screenshot by Ednalyn C. De Dios]

Next, we'll combine the street number and street name columns.
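The combining step is in another embedded gist; a rough equivalent is below, where the column names are assumptions you'd match to the actual spreadsheet headers:

# Column names are assumptions -- match them to the SAWS spreadsheet.
df['street_address'] = df['street_number'].astype(str) + ' ' + df['street_name']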

[Screenshot by Ednalyn C. De Dios]

The ALL CAPS hurts my eyes. Let’s do something about it:

df['street_address'] = df.street_address.str.title()
[Screenshot by Ednalyn C. De Dios]

Below are two functions that call the API and return geo data.
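The originals are embedded as a gist, so the sketch below is a guess at their shape based on the manual call that follows. It assumes the results/locations structure of the MapQuest Geocoding API's JSON response, and get_geo is a hypothetical name for the URL-building helper:

def get_zip(url):
    # Call the geocoding endpoint and parse the first location in the response.
    location = requests.get(url).json()['results'][0]['locations'][0]
    return location['postalCode'], location['latLng']['lat'], location['latLng']['lng']

def get_geo(address):
    # Build the request URL for a street address and geocode it.
    url = ('https://www.mapquestapi.com/geocoding/v1/address?key=' + API_KEY +
           '&inFormat=kvp&outFormat=json&location=' + address.replace(' ', '+') +
           '&thumbMaps=false&delimiter=%2C')
    return get_zip(url)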

We can manually call it with the line below. Don’t forget to replace the ‘#####’ with your own API key. You can use any address you want (replace spaces with a + character).

get_zip('https://www.mapquestapi.com/geocoding/v1/address?key=####################&inFormat=kvp&outFormat=json&location=100+Military+Plaza&thumbMaps=false&delimiter=%2C')

But we've got many addresses, so we'll use a loop to call the API repeatedly.
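A sketch of that loop, reusing the hypothetical get_geo helper from above and padding failed lookups with None so the rows stay aligned:

zips, lats, lngs = [], [], []
for address in df['street_address']:
    try:
        zip_code, lat, lng = get_geo(address)
    except Exception:
        zip_code, lat, lng = None, None, None  # address failed to geocode
    zips.append(zip_code)
    lats.append(lat)
    lngs.append(lng)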

Let’s see what the result looks like:

[Screenshot by Ednalyn C. De Dios]

Finally, let's create a dataframe that will house the street addresses, complete with zip code, latitude, and longitude.
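Assuming the three lists from the loop above, the final assembly might look like this:

geo_df = pd.DataFrame({'street_address': df['street_address'],
                       'zip_code': zips,
                       'latitude': lats,
                       'longitude': lngs})
geo_df.head()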

Voila! We’ve got ourselves geo data.

[Screenshot by Ednalyn C. De Dios]

For extra credit, let's import the data into Tableau and get a pretty spiffy visual:

[Screenshot by Ednalyn C. De Dios]

And that’s it, folks!

You can find the Jupyter notebook here.

Thanks for stopping by and reading my post. Hope it was useful 🙂

If you want to learn more about my journey from slacker to data scientist, check out the article below: "From Slacker to Data Scientist: My journey into data science without a degree" (towardsdatascience.com).

And if you're thinking about switching gears and venturing into data science, start thinking about rebranding now: "The Slacker's Guide to Rebranding Yourself as a Data Scientist: Opinionated advice for the rest of us. Love of math, optional" (towardsdatascience.com).

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published in Towards Data Science's Medium publication.

Get Your Feet Wet in Power BI

A hands-on introduction to Microsoft's analytics tool


As a data scientist, you’ll need to learn to be comfortable with analytics tools sooner or later. In today’s post, we will dive headfirst and learn the very basics of Power BI.

Be sure to click on the images to better see some details.

The Data

The dataset that we will be using for today’s hands-on tutorial can be found at https://www.kaggle.com/c/instacart-market-basket-analysis/data. This dataset is “a relational set of files describing customers’ orders over time.” Download the zip files and extract them to a folder on your local hard drive.

Download Power BI Desktop

If you haven’t already, go to https://powerbi.microsoft.com/desktop and click on the “Download free” button.

[Screenshot by Ednalyn C. De Dios]

If you’re using Windows 10, it will ask you to open Microsoft Store.

[Screenshot by Ednalyn C. De Dios]

Go ahead and click on the “Install” button.

[Screenshot by Ednalyn C. De Dios]

And let’s get started by clicking on the “Launch” button.

A Thousand Clicks

[Screenshot by Ednalyn C. De Dios]

Click on “Get data” when the splash screen appears.

[Screenshot by Ednalyn C. De Dios]

You will be presented with a lot of file formats and sources; let's choose "Text/CSV" and click on the "Connect" button.

[Screenshot by Ednalyn C. De Dios]

Select “order_products_prior.csv” and click on the “Open” button.

[Screenshot by Ednalyn C. De Dios]

The image below shows what the data looks like. Click on the “Load” button to load the dataset into Power BI Desktop.

[Screenshot by Ednalyn C. De Dios]

Load the rest of the dataset by selecting “Get Data” and choosing the “Text/CSV” option on the dropdown.

[Screenshot by Ednalyn C. De Dios]

You should have these three files loaded into Power BI Desktop:

  • order_products_prior.csv
  • orders.csv
  • products.csv

You should see the following tables appear on the “Fields” panel of Power BI Desktop, as shown below. (Note: the image shows Power BI in Report View.)

[Screenshot by Ednalyn C. De Dios]

Let’s see what the Data View looks like by clicking on the second icon on the left side of Power BI Desktop.

[Screenshot by Ednalyn C. De Dios]

And now, let’s check out the Model View where we will see how the different tables are related to each other.

[Screenshot by Ednalyn C. De Dios]

If we hover over a line, it will turn yellow and the corresponding related fields are both highlighted as well.

[Screenshot by Ednalyn C. De Dios]

In this case, Power BI Desktop is smart enough to infer the two relationships. However, most of the time, we will have to create the relationships ourselves. We will cover this topic in the future.

[Screenshot by Ednalyn C. De Dios]

Let’s go back to the Report View and examine the “Visualizations” panel closely. Look for the “slicer” icon which looks like a square with a funnel at the bottom right corner. Click on it to add a visual to the report.

[Screenshot by Ednalyn C. De Dios]

In the “Fields” panel, find the “department_id” and click the checkbox on its left.

[Screenshot by Ednalyn C. De Dios]

This will cause the “department_id” field to appear under the “Visualizations” panel in the “Field” box.

Next, take your mouse cursor and hover over the top right corner of the visual in the Report View. Click on the three dots that appeared in the corner as shown below.

[Screenshot by Ednalyn C. De Dios]

Click on “List” in the dropdown that appeared.

[Screenshot by Ednalyn C. De Dios]

While the "department_id" visual is selected, you should see corner marks indicating that it is the active visual. While the "department_id" is active, press CTRL+C to copy it and then CTRL+V to paste it. Move the new visual to the right of the original visual.

Make the second visual active by clicking somewhere inside it. Then find the "aisle_id" field in the "Fields" panel on the right of Power BI Desktop and check it, as shown below.

[Screenshot by Ednalyn C. De Dios]

Try selecting a value on the “department_id” visual and observe how the selection on “aisle_id” changes accordingly.

[Screenshots by Ednalyn C. De Dios]

Now, examine the “Visualizations” panel again and click on the table visual as shown below.

[Screenshot by Ednalyn C. De Dios]

In the "Fields" panel, select "product_id" and "product_name" or drag them into the "Values" box.

[Screenshot by Ednalyn C. De Dios]

Power BI Desktop should look similar to the image below.

[Screenshot by Ednalyn C. De Dios]

This time, try selecting a value from both “department_id” and “aisle_id” — observe what happens to the table visual on the right.

[Screenshot by Ednalyn C. De Dios]

Let’s create another visual by copying and pasting the table visual. This time, select (or drag) the following fields to the “Values” box of the visual.

  • order_id
  • user_id
  • order_number
  • order_hour_of_day
  • order_dow
  • days_since_prior_order

Power BI Desktop should now look similar to the image below.

[Screenshot by Ednalyn C. De Dios]

Try clicking one of the selections in the table visual (where it’s showing “product_id” and “product_name”) and observe how the table on the right changes accordingly.

[Screenshot by Ednalyn C. De Dios]

For a closer look, activate Focus Mode by clicking on the icon as shown below.

[Screenshot by Ednalyn C. De Dios]

The table displays the details of orders that have the product that you selected in the table with “product_id” and “product_name.”

Get out of Focus Mode by clicking on “Back to report” as shown below.

[Screenshot by Ednalyn C. De Dios]

Let’s rename this page or tab by right-clicking on the page name (“Page 1”) and selecting “Rename Page.”

[Screenshot by Ednalyn C. De Dios]

Type in “PRODUCTS” and press ENTER.

[Screenshot by Ednalyn C. De Dios]

Let's add another page or tab to the report by right-clicking on the page name again ("PRODUCTS") and selecting "Duplicate Page."

[Screenshot by Ednalyn C. De Dios]

Rename the new page “TRANSACTIONS” and delete (or remove) the right-most table with order details on it.

Change the top-left visual by updating its fields as shown below. The "Field" box should say "order_dow" while the top-left visual is active.

Move the visuals around so the layout looks similar to the image below.

[Screenshot by Ednalyn C. De Dios]

Do the same thing for the next visual. This time, select "order_hour_of_day" and your Power BI Desktop should look like the image below.

[Screenshot by Ednalyn C. De Dios]

Do the same thing one last time for the last table and it should now contain fields as shown below.

[Screenshot by Ednalyn C. De Dios]

Let’s add another page or tab to the report by clicking on the “+” icon at the bottom of the report’s main work area.

[Screenshot by Ednalyn C. De Dios]

Basic Exploration

In the “Visualizations” panel, select “Stacked column chart.”

[Screenshot by Ednalyn C. De Dios]

Resize the chart by dragging its move handles.

Make sure the "Axis" box contains "order_dow" and the "Values" box contains "order_id." Power BI Desktop should automatically calculate the count for "order_id" and display the field as "Count of order_id" as shown below.

[Screenshot by Ednalyn C. De Dios]

The graph above is interesting because it shows a higher number of orders for Day 0 and Day 1.

Let’s make another chart.

We will follow the same procedure of adding a chart, but this time we'll use "order_hour_of_day" in the "Axis" box as shown below.

[Screenshot by Ednalyn C. De Dios]

The graph shows the peak time for the number of orders.

One last graph!

We will add another chart with “days_since_prior_order” in the “Axis” box.

[Screenshot by Ednalyn C. De Dios]

This last graph is the most interesting because the number of reorders peaks during these three time periods: 7 days, 14 days, and 30 days since prior order. This means that people are in a habit of resupplying every week, every two weeks, and every month.

That’s it, folks!

In the next article, we will "prettify" our charts and make them more readable to others.

The procedures above may have seemed drawn out, but if you're a novice Power BI user, don't despair! With regular practice, the concepts demonstrated in this article will soon become second nature, and you'll probably be able to do them in your sleep.

Thank you for reading. If you want to learn more about my journey from slacker to data scientist, check out the article below: "From Slacker to Data Scientist: My journey into data science without a degree" (towardsdatascience.com).

And if you're thinking about switching gears and venturing into data science, start thinking about rebranding now: "The Slacker's Guide to Rebranding Yourself as a Data Scientist: Opinionated advice for the rest of us. Love of math, optional" (towardsdatascience.com).

Stay tuned!

You can reach me on Twitter or LinkedIn.

This article was first published in Towards Data Science's Medium publication.

Forecasting in Power BI

A visual step-by-step guide to forecasting using Power BI.


In this post, we'll go through the process of creating a forecast in Power BI.

Get the Data

You can download the dataset that I used here. It contains daily female births in California in 1959¹. For a list of other time-series datasets, check out Jason Brownlee’s article 7 Time Series Datasets for Machine Learning – Machine Learning Mastery.

Let’s load the data into Power BI. Open up Power BI and click on “Get data” on the welcome screen as shown below.

[Screenshot by the Author]

Next, you’ll be presented with another pane that asks what type of data we want to get. Select “Text/CSV” as shown below and click on “Connect.”

[Screenshot by the Author]

When the File Open window appears, navigate to where we saved the dataset and click on the “Open” button on the lower right-hand corner.

[Screenshot by the Author]

When a preview appears, just click on “Load.”

[Screenshot by the Author]

We’ll now see the main working area of Power BI. Head over to the “Visualizations” panel and look for “Line Chart.”

[Screenshot by the Author]

This is what the line chart icon looks like:

[Screenshot by the Author]

Next, a visual placeholder will appear. Grab the hot corner marking on the lower right-hand corner of the placeholder and drag it diagonally down to the right corner of the main working area.

[Screenshots by the Author]

Next, head over to the "Fields" panel.

[Screenshot by the Author]

With the line chart placeholder still selected, find the “Date” field and click on the square box to put a checkmark on it.

[Screenshot by the Author]

We’ll now see the “Date” field under Axis. Click on the down arrow on the right of the “Date” as shown below.

[Screenshot by the Author]

Select “Date” instead of the default, “Date Hierarchy.”

[Screenshot by the Author]

Then, let’s put a checkmark on the “Births” field.

[Screenshot by the Author]

We'll now see a line graph like the one below. Head over to the Visualizations panel and, under the list of icons, find the analytics icon as shown below.

[Screenshot by the Author]

Scroll down the panel and find the “Forecast” section. Click on the down arrow to expand it if necessary.

[Screenshot by the Author]

Next, click on "+Add" to add forecasting to the current visualization.

[Screenshot by the Author]

We’ll now see a solid gray fill area and a line plot to the right of the visualization like the one below.

[Screenshot by the Author]

Let's change the forecast length to 31 points. In this case, a data point equals a day, so 31 would roughly equate to a month's worth of predictions. Click on "Apply" in the lower right corner of the Forecast group to apply the changes.

[Screenshot by the Author]

Instead of points, let's change the unit of measure to "Months," as shown below.

[Screenshot by the Author]

Once we click "Apply," we'll see the changes in the visualization. The graph below contains a forecast for 3 months.

[Screenshot by the Author]

What if we wanted to compare how the forecast compares to actual data? We can do this with the “Ignore last” setting.

For this example, let's ignore the last 3 months of the data. Power BI will then forecast 3 months' worth of data using the dataset while ignoring its last 3 months. This way, we can compare Power BI's forecasting result with the actual data in the last 3 months of the dataset.

Let's click on "Apply" when we're done changing the settings as shown below.

[Screenshot by the Author]

Below, we can see how the Power BI forecast compares with the actual data. The black solid line represents the forecast while the blue line represents the actual data.

[Screenshot by the Author]

The solid gray fill on the forecast represents the confidence interval. The higher its value, the larger the area will be. Let's lower our confidence interval to 75% as shown below and see how it affects the graph.

[Screenshot by the Author]

The solid gray fill became smaller as shown below.

[Screenshot by the Author]

Next, let's take seasonality into account. Below, we set it to 90 points, which is equivalent to about 3 months. This value tells Power BI to look for seasonality within a 3-month cycle. Play with this value and use what makes sense for your data.

[Screenshot by the Author]

The result is shown below.

[Screenshot by the Author]

Let’s return our confidence interval to the default value of 95% and scroll down the group to see formatting options.

[Screenshot by the Author]

Let's change the forecast line to an orange color and make the gray fill disappear by changing its formatting to "None."

[Screenshot by the Author]

And that’s it! With a few simple clicks of the mouse, we got ourselves a forecast from the dataset.

Thank you for reading. If you want to learn more about my journey from slacker to data scientist, check out the article below: "From Slacker to Data Scientist: My journey into data science without a degree" (towardsdatascience.com).

And if you're thinking about switching gears and venturing into data science, start thinking about rebranding now: "The Slacker's Guide to Rebranding Yourself as a Data Scientist: Opinionated advice for the rest of us. Love of math, optional" (towardsdatascience.com).

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] Machine Learning Mastery. (June 21, 2020). 7 Time Series Datasets for Machine Learning. https://machinelearningmastery.com/time-series-datasets-for-machine-learning/

This article was first published in Towards Data Science's Medium publication.

Democratize Data Science

Every once in a while, I would come across an article that decries online data science courses and boot camps as pathways towards getting a data science job. Most of the articles aim not to discourage but to serve as a reminder to take a hard look in the mirror first and realize what we're up against. However, a few detractors have proclaimed that the proliferation of these online courses and boot camps has caused the degradation of the profession.

To the latter, I vehemently disagree.

Bridging the Skill Gap

Data science has captured the popular imagination ever since Harvard Business Review dubbed data scientist the sexiest job of the 21st century. More than seven years later, data science remains one of the most highly sought-after professions today. In fact, due to the dynamics of supply and demand, "the United States alone is projected to face a shortfall of some 250,000 data scientists by 2024¹."

As a result, capitalism and entrepreneurship answered the call and companies like Codeup have vowed to “help bridge the gap between companies and people wanting to enter the field.”²

In addition, AutoML libraries like PyCaret are “democratizing machine learning and the use of advanced analytics by providing free, open-source, and low-code machine learning solution for business analysts, domain experts, citizen data scientists, and experienced data scientists”³.

The availability of online courses, boot camps, and AutoML libraries has led a lot of data scientists to raise their eyebrows. They fear that boot camp alumni and self-taught candidates would somehow lower the overall caliber of data scientists and disgrace the field. Furthermore, they are afraid that the availability of tools like AutoML would allow anyone to be a data scientist.

I mean, God forbid if anyone thinks that they too can be data scientists! Right?

Wrong.

The Street Smart Data Scientist

Alumni of boot camps and self-taught learners, like myself, have one thing going for us: our rookie smarts. To quote Liz Wiseman, author of the book Rookie Smarts:

In a rapidly changing world, experience can be a curse. Being new, naïve, and even clueless can be an asset. — Liz Wiseman

Rookies are unencumbered. We are alert and constantly seeking like hunter-gatherers, cautious but quick like firewalkers, and hungry and relentless like frontiersmen⁴. In other words, we’re street smart.

Many are so bogged down by "you've got to learn this" and "you've got to learn that" that they forget to stress the fact that data science is so vast that you can't possibly know everything. And that's okay.

We learn fast and adapt quickly.

At the end of the day, it's all about the value that we bring to our organizations. They are, after all, the ones paying our bills. We don't get paid to memorize formulas or to code an algorithm from scratch.

We get paid to solve problems.

And this is where the street smart data scientist excels. We don't suffer from analysis paralysis or get bogged down in theory, at least not while on the clock. Our focus is on pragmatic solutions to problems, not on academic debate.

This is not to say we're not interested in the latest research. In fact, it's quite the contrary. We are voracious consumers of the latest developments in machine learning and AI. We drool over the latest developments in natural language processing. And we're always on the lookout for the latest tool that will make our jobs easier and less boring.

And AutoML

So what if we have to use AutoML? If it gets us to an automatic pipeline where analysts can get the results of machine learning without manual intervention by a data scientist, so much the better. We're not threatened by automation; we're exhilarated by it!

Do not let perfection be the enemy of progress. — Winston Churchill

By building an automatic pipeline, there are bound to be some tradeoffs. But building it this way frees up our brain cells and gives us more time to focus on solving other higher-level problems and producing more impactful solutions.

We're not concerned about job security, because we know that it doesn't exist. What we do know is that the more value we bring to a business, the better off we will be in the long run.

Maybe They’re Right?

After all this, I will concede a bit. For the sake of argument, maybe they’re right. Maybe online courses, boot camps, and low-code machine learning libraries really do produce low-caliber data scientists.

Big maybe.

But still, I argue, this doesn't mean we don't have value. Data science skills lie on a spectrum, and so does a company's maturity when it comes to data. Why hire a six-figure employee when your organization barely has a recognizable machine learning infrastructure?

Again, maybe.

The Unicorn

Maybe, to be labeled as a data scientist, one must be a unicorn first. A unicorn data scientist is a data scientist who excels at all facets of data science.

[Image: Hckum / CC BY-SA (https://creativecommons.org/licenses/by-sa/4.0)]

Data science has long been described as the intersection of computer science, applied statistics, and business or domain knowledge. To this, they ask: how can one person possibly accumulate all that knowledge in just a few months? To this, we ask the same question: how can a college grad?

I believe unicorns do exist, but they too had to start from somewhere.

So why can’t we?

Conclusion

A whole slew of online courses and tools promise to democratize data science, and this is a good thing.

Thank you for reading. If you want to learn more about my journey from slacker to data scientist, check out the article From Slacker to Data Scientist: My journey into data science without a degree.

And if you're thinking about switching gears and venturing into data science, start thinking about rebranding now: "The Slacker's Guide to Rebranding Yourself as a Data Scientist: Opinionated advice for the rest of us. Love of math, optional."

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] Harvard Business Review. (June 3, 2020). Democratizing Data Science in Your Organization. https://hbr.org/sponsored/2019/04/democratizing-data-science-in-your-organization

[2] San Antonio Express-News. (June 3, 2020). Software development bootcamp Codeup launching new data science program. https://www.mysanantonio.com/business/technology/article/Software-development-bootcamp-Codeup-launching-13271597.php

[3] Towards Data Science. (June 4, 2020). Machine Learning in Power BI Using PyCaret. https://towardsdatascience.com/machine-learning-in-power-bi-using-pycaret-34307f09394a

[4] The Wiseman Group. (June 4, 2020). Rookie Smarts: Why Learning Beats Knowing in the New Game of Work. https://thewisemangroup.com/books/rookie-smarts/

This article was first published in Towards Data Science's Medium publication.

Exploring the Trump Twitter Archive with PyCaret

For adventurous beginners in NLP.


For this project, we’ll be using PyCaret:

PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.¹

PyCaret

PyCaret does a lot more than NLP. It also does a whole slew of both supervised and unsupervised ML, including classification, regression, clustering, anomaly detection, and association rule mining.

To learn more, check out Moez Ali’s announcement.


Housekeeping

Let’s begin by installing PyCaret. Just do pip install pycaret and we are good to go! Note: PyCaret is a big library so you may want to go grab a cup of coffee while waiting for it to install.

Also, we need to download the English language model because it is not automatically downloaded with PyCaret:

python -m spacy download en_core_web_sm
python -m textblob.download_corpora

Getting the Data

Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.

import pandas as pd
from pycaret.nlp import *
df = pd.read_csv('trump_20200530.csv')

Let’s check the shape of our data first:

df.shape

And let’s take a quick look:

df.head()

For expediency, let’s sample only 1,000 tweets.

# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)
df.shape

Topic Modeling

The fun part!

nlp = setup(data = df, target = 'text', session_id = 493,
            custom_stopwords = ['rt', 'https', 'http', 'co', 'amp'])

PyCaret’s setup() function performs the following text-processing steps:

  1. Removing Numeric Characters
  2. Removing Special Characters
  3. Word Tokenization
  4. Stopword Removal
  5. Bigram Extraction
  6. Trigram Extraction
  7. Lemmatizing
  8. Custom Stopwords

And all in one line of code!

It takes in two parameters: the dataframe passed in data and the name of the text column passed in target. In our case, we also used the optional parameters session_id for reproducibility and custom_stopwords to reduce the noise coming from the tweets.

After all is said and done, we’ll get something similar to this:

In the next step, we’ll create the model and we’ll use ‘lda’:

lda = create_model('lda', num_topics = 6, multi_core = True)

Above, we created an 'lda' model, passed in the number of topics as 6, and set multi_core so that LDA will use all available CPU cores to parallelize and speed up training.

Finally, we'll assign topic proportions to the dataset using assign_model().

lda_results = assign_model(lda)
lda_results.head()

Visualizing the Results

Let's plot the overall frequency distribution of the entire corpus:

plot_model()

Now let’s extract the bigrams and trigrams for the entire corpus:

plot_model(plot = 'bigram')
plot_model(plot = 'trigram')

But what if we only want to extract the n-grams from a specific topic? Easy, we’ll just pass in the topic_num parameter.

plot_model(lda, plot = 'trigram', topic_num = 'Topic 1')

If we want the distribution of topics, we simply specify it in the plot parameter.

plot_model(lda, plot = 'topic_distribution')

And that’s it!

We’ve successfully conducted topic modeling on President Trump’s tweets since taking office.

Bonus Round

Moez Ali wrote a great tutorial on using PyCaret in Power BI. Check it out.


Thank you for reading! Exploratory data analysis uses a lot of techniques, and we've only explored a few in this post. I encourage you to keep practicing and employ other techniques to derive insights from data.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] PyCaret. (June 4, 2020). Why PyCaret. https://pycaret.org/

This article was first published in Towards Data Science's Medium publication.

The Slacker’s Guide to Rebranding Yourself as a Data Scientist

Opinionated advice for the rest of us. Love of math, optional.


Since my article about my journey to data science, I've had a lot of people ask me for advice regarding their own journey towards becoming a data scientist. A common theme started to emerge: aspiring data scientists are confused about how to start, and some are drowning in the overwhelming amount of information available in the wild. So, what's one more, right?

Well, let’s see.

I urge aspiring data scientists to slow it down a bit and take a step back. Before we get to learning, let’s take care of some business first: the fine art of reinventing yourself. Reinventing yourself takes time, so we better get started early on in the game.

In this post, I will share a very opinionated approach to do-it-yourself rebranding as a data scientist. I will assume three things about you:

  • You’re broke, but you’ve got grit.
  • You’re willing to sacrifice and learn.
  • You’ve made a conscious decision to become a data scientist.

Let’s get started!


First Things First

I'm a strong believer in Yoda's wisdom: "Do or do not, there is no try." For me, either you do something or you don't. Failure for me was not an option, and I took comfort in knowing that I wouldn't really fail unless I quit entirely. So, first bit of advice: don't quit. Ever.

Do or do not, there is no try.

Yoda

Begin with the End in Mind

Let's get our online affairs in order and start thinking about SEO. SEO stands for search engine optimization. The simplest way to think about it is as the fine art of putting as much "stuff" as you can on the internet under your real professional name, so that when somebody searches for you, all they will find is the stuff that you want them to find.

In our case, we want the words “data science” or “data scientist” to appear whenever your name appears in the search results.

So let’s start littering the interweb!

  1. Create a professional Gmail account if you don't already have one. Don't make your username sexxydatascientist007@gmail.com. Play it safe; the more boring, the better. Start with first.last@gmail.com, or if your name is a common one, append it with "data" like first.name.data@gmail.com. Avoid numbers at all costs. If you have one already but it doesn't follow the aforementioned guidelines, create another one!
  2. Create a LinkedIn account and use your professional email address. Put “Data Scientist in Training” in the headline. “Data Science Enthusiast” is too weak. We’ve made a conscious decision and committed to the mission, remember? While we’re at it, let’s put the app on our phone too.
  3. If you don’t have a Facebook account yet, create one just so you could claim your name. If you already have one, put that thing on private pronto! Go the extra mile and also delete the app on your phone so you won’t get distracted. Do the same for other social networks like Twitter, Instagram, and Pinterest. Set them to private for now, we’ll worry about cleaning them up later.
  4. Create a Twitter account if you don’t already have one. We can take a little bit of leeway in the username. Make it short and memorable but still professional, so you don’t offend anybody’s sensibilities. If you already have one, decide if you want to keep it or start all over. The main thing to ask yourself: is there any content in your history that can be construed as unprofessional or mildly controversial? Err on the side of caution.
  5. Start following the top voices in data science on LinkedIn and Twitter. Here are a few suggestions: Cassie Kozyrkov, Angela Baltes, Sarah N., Kate Strachnyi, Kristen Kehrer, Favio Vazquez, and of course, my all-time favorite: Eric Weber.
  6. Create a Hootsuite account and connect your LinkedIn and Twitter accounts. Start scheduling data science-related posts. You can share interesting articles from other people about data science or post about your own data science adventures! If you do share other people’s posts, please make sure you give the appropriate credit. Simply adding a URL is lazy and no bueno. Thanks to Eric Weber for this pro-tip!
  7. Take a professional picture and put it as your profile picture in all of your social media accounts. Aim for a neutral background, if possible. Make sure it’s only you in the picture unless you’re Eric (he’s earned his chops so don’t question him! LOL.)
  8. Create a Github account if you don’t have one already. You’re going to need this as you start doing data science projects.
  9. BONUS: if you can spare a few dollars, go to wordpress.org and get yourself a domain that has your professional name on it. I was fortunate enough to have an uncommon name, so I have ednalyn.com, but if your name is common, be creative and make one up that’s recognizably yours. Maybe something like janesmithdoesdatascience.com. Then you can start planning on having your resumé online or maybe even have a blog post or two about data science. As for me, I started with writing my experience when I first started to learn data science.
  10. Clean-up: when time permits, start auditing your social media posts for offensive, scandalous, or unflattering content. If you’re looking to save time, try a service like brandyourself.com. Warning! It can get expensive, so watch where you click.

Do Your Chores

No kidding! When you're doing household chores, taking a walk, or maybe even while driving, listen to podcasts that talk about data science topics, like Linear Digressions and TwiML. Don't get too bogged down with committing what they say to memory. Just go with the flow, and sooner or later, the terminology and concepts they discuss will start to sound familiar. Just remember not to get so caught up in the discussions that you start burning whatever you're cooking or miss your exit, like I have many times in the past.

Meat and Potatoes

Now that we’ve taken care of the preliminaries of living and breathing data science, it’s time to take care of the meat and potatoes: actually learning about data science.

There’s no shortage of opinions about how to learn data science. There are so many of them that it can overwhelm you, especially when they start talking about learning the foundational math and statistics first.

Blah!

Tell me and I forget,
teach me and I remember,
involve me and I learn.

Old Chinese Adage

While important, I don't see the point of studying theory first when I may soon fall asleep or, worse, get so intimidated by the onslaught of mathematical formulas that I become exasperated and end up quitting!

What I humbly propose, rather, is to employ the idea of "minimum viable knowledge" or MVK as described by Ken Jee in his article How I Would Learn Data Science (If I Had to Start Over). Ken Jee describes minimum viable knowledge as learning "just enough to be able to learn through doing."² I suggest checking it out.

My approach to MVK is pretty straightforward: learn just enough SQL to be able to get data from a database, learn enough Python to control program flow and use the pandas library, and then do end-to-end projects, from simple ones to increasingly more challenging ones. Along the way, you'll learn about data wrangling, exploratory data analysis, and modeling. Other techniques like cross-validation and grid search will surely be part of your journey as well. The trick is never to get too comfortable and to always push yourself, slowly.

To the list-oriented, here is my process:

  1. Learn enough SQL and Python to be able to do end-to-end projects with increasing complexity.
  2. For each project, go through the steps of the data science pipeline: planning, acquisition, preparation, exploration, modeling, delivery (story-telling/presentation). Be sure to document your efforts on your Github account.
  3. Rinse and repeat (iterate).

For a more in-depth discussion of the data science pipeline, I recommend the following article: PAPEM-DM: 7 Steps Towards a Data Science Win.

For each iteration, I suggest doing an end-to-end project that practices each of these following data science methodologies:

  • regression
  • classification
  • clustering
  • time-series analysis
  • anomaly detection
  • natural language processing
  • distributed ML
  • deep learning

And for each methodology, practice its different algorithms, models, or techniques. For example, for natural language processing, you might want to practice these following techniques:

  • n-gram ranking
  • named-entity recognition
  • sentiment analysis
  • topic modeling
  • text classification

Just Push It

As you do end-to-end projects, it's a good practice to push your work publicly to Github. Not only will it track your progress, but it also backs up your work in case your local machine breaks down. Not to mention, it's a great way to showcase your progress. Note that I said progress, not perfection. Generally, people understand if our Github repositories are a little bit messy. In fact, most expect it. At a minimum, just make sure that you have a great README.md file for each repo.

What to put on a Github Repo README.md:

  • Project name
  • The goal or purpose of the project
  • Background on the project
  • How to use the project (if somebody wants to try it for themselves)
  • Mention your keywords: “data science,” “data scientist,” “machine learning,” et cetera.

Don't ignore this note: don't make the big mistake of hard-coding your credentials or any passwords in your public code. Put them in a .env file and .gitignore them. For reference, check out this documentation from Github.

For a great in-depth tutorial on how to use Git and Github, check out Anne Bonner's guide: Getting Started with Git and Github: the complete beginner's guide.

For the Love of Math

And finally, as you get better with employing different techniques and you begin to do hyper-parameter tuning, I believe at this point that you’re ready to face the necessary evil that is math. And more than likely, the more you understand and develop intuition, the less you’ll hate it. And maybe, just maybe, you’ll even grow to love it.

I have one general recommendation when it comes to learning the math behind data science: take it slow. Be gentle on yourself and don’t set deadlines. Again, there’s no sense in being ambitious and tackling something monumental if it ends up driving you insane. There’s just no fun in it.

There are generally two approaches to learning math.

One is to take the structured approach, which starts with learning the basics first and then incrementally takes on the more challenging parts. For this, I recommend Khan Academy. Personalize your learning towards calculus, linear algebra, and statistics. Take small steps and celebrate small wins.

The other approach is geared more toward hands-on involvement and takes a little bit of reverse engineering. I call it learning backward. You start by finding out what math concept is involved in a project, break that concept down into more basic ideas, and go from there. This approach is better suited for those who prefer to learn by doing.

A good example of learning by doing is illustrated by a post on Analytics Vidhya.

Supplemented by this article.

Take a Break

Well, learning math sure is hard! It's so powerful and intense that you'd better take a break often or risk overheating your brain. On the other hand, taking a break does not necessarily mean taking a day off. After all, there is no rest for the weary!

Every once in a while, I strongly recommend supplementing your technical studies with a little bit of understanding of the business side of things. For this, I suggest the classic book Thinking with Data by Max Shron. You can also find a lot of articles here on Medium.

For example, check out Eric Kleppen’s article.

Talk to People

Taking a break can be lonely sometimes, and being alone with only your thoughts can be exhausting. So you may decide to finally talk with your family. The problem is, you're so motivated and gung-ho about data science that it's all you can talk about. Sooner or later, you're going to annoy your loved ones.

It happened to me.

This is why I decided to talk to other people with similar interests. I went to meetups and started networking with people who are either already practicing data science or, like you, aspiring to be data scientists. In this (hopefully) post-COVID age that we're in, group video calls are more prevalent. This is actually more beneficial because geography is no longer an issue.

A good resource to start with is LinkedIn. You can use the social network to find others with similar interests, or even find local data scientists who can spare an hour or two every month to mentor motivated learners. Start with companies in your local municipality. Find out if a data scientist works there, and if you do find one, kindly send them a personalized message with a request to connect. Give them the option to refuse gracefully, and ask them to point you to or recommend another person who does have the time to mentor.

The worst that can happen is they say no. No hard feelings, eh?

Conclusion

Thanks for reading! This concludes my very opinionated advice on rebranding yourself as a data scientist. I hope you got something out of it. I welcome any feedback. If you have something you’d like to add, please post it in the comments or responses.

Let’s continue this discussion!


If you’d like to connect with me, you can reach me on Twitter or LinkedIn. I love to connect, and I do my best to respond to inquiries as they come.

Stay tuned, and see you in the next post!

If you want to learn more about my journey from slacker to data scientist, check out this article.


[1] Quote Investigator. (June 10, 2020). Tell Me and I Forget; Teach Me and I May Remember; Involve Me and I Learn. https://quoteinvestigator.com/2019/02/27/tell/

[2] Towards Data Science. (June 11, 2020). How I Would Learn Data Science (If I Had to Start Over). https://towardsdatascience.com/how-i-would-learn-data-science-if-i-had-to-start-over-f3bf0d27ca87

This article was first published in Towards Data Science's Medium publication.

Terminal Makeover with Oh-my-zsh and iTerm2

A visual step-by-step guide to replacing the default terminal application with iTerm2.

Over the weekend, I decided to restore my MacBook Pro to factory settings so I could have a clean start at setting up a programming environment.

In this post, we’ll work through setting up oh-my-zsh and iTerm2 on the Mac.

This is what the end-result will look like:

[Screenshot: the end result]

Let’s begin!

Press CMD + SPACE to call up the Spotlight service.

Start typing in "terminal" and you should see something similar to the image below.

Hit the enter key (gently, of course) to open the terminal application.

If you see something that says “The default interactive shell is now zsh…” it means you’re still using bash as your shell.

Let’s switch to zsh.

Click on “Terminal” and select “Preferences…” as shown below.

This will open up the terminal settings window.

In the “Shells open with” section, click on “Default login shell” as shown below.

Close the window by clicking on the "X" at the top left-hand corner and then restart the terminal. You should now see the terminal using zsh, like the one below.

Installing Powerline Fonts

The "agnoster" theme requires some special fonts to render properly. Let's install them now.

Type the following command into the terminal:

git clone https://github.com/powerline/fonts.git --depth=1

And then the following to change directory:

cd fonts

The directory will change to ~/fonts as shown below.

Type the following command to install the fonts into your system.

./install.sh

The output should be something like one below.

Let's go back up to the parent directory so we can do some cleaning up:

cd ..

You should see the following output indicating the home directory.

Let’s delete the installation folder with the following command:

rm -rf fonts

The fonts folder should be deleted now. Let’s clear our console output.

clear

You should see a clear window now on the console like the one below.

Installing Oh-My-ZSH

Oh-My-ZSH takes care of the configuration for our zsh shell. Let’s install it now.

Type the following into the terminal (do not use any line breaks, this should be only one line):

sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

You should now see oh-my-zsh installed on your computer.

If you see a message that says “Insecure completion-dependent directories detected,” we need to set the ZSH_DISABLE_COMPFIX to true in the .zshrc file on the home directory.

To do this, open up a Finder window and navigate to the home directory.

Press SHIFT + CMD + . to reveal hidden files. You should now see something similar below.

Open the .zshrc file using a text editor like Sublime.

This is what the inside of the .zshrc file looks like:

Scroll down around line #73.

Insert the following line right before source $ZSH/oh-my-zsh.sh:

ZSH_DISABLE_COMPFIX="true"

Save and close the .zshrc file, and open a new terminal window. You should see something similar to the one below.

Replacing the Default Terminal

Go to https://www.iterm2.com/version3.html and download the latest version.

Save the installer on your “Downloads” folder like so:

Open a new Finder window and navigate to "Downloads." You should see something similar to the image below. Double-click on the zip file and it should extract an instance of the iTerm app.

Double-click on “iTerm.app”

If prompted about the app being downloaded from the Internet, click "Open."

If prompted to move the app into the Applications folder, please click on "Move to Applications Folder."

Close all windows and press CMD + SPACE to pull up the Spotlight search service and type in "iterm." Hit ENTER and you should now see the iTerm app.

Open a Finder window, navigate to the home directory, and find the .zshrc file.

Open the .zshrc file using a text editor.

Find ZSH_THEME="robbyrussell" and replace "robbyrussell" with "agnoster" as shown below.

Save and close the file. Close any remaining open iTerm window by pressing CTRL + Q.

Restart iTerm by pressing CMD + SPACE and typing in “iterm” as shown in the images below.

Hit the ENTER key and a new iTerm window should open like the one below.

The prompt looks a little weird. Let’s fix it!

Go to iTerm2 and select Preferences… as shown below.

You’ll see something like the image below.

Click on “Profiles.”

Find the "+" at the lower left corner of the window, below the Profile Name area and beside "Tags >."

Click on the “+” sign.

On the General tab, under the Basics area, replace the default "New Profile" name with your preferred profile name. Below, I typed in "Gunmetal Blue."

In Title, click on the dropdown and check or uncheck your preferences for the window title appearance.

Navigate to the Colors tab, click on the "Color Presets…" dropdown in the lower right-hand corner of the window, and select "Smooooooth."

Find “Background” in the Basic Colors section and set the color to R:0 G:50 B:150 as shown below.

Navigate to the "Text" tab and find the "Font" section. Select any of the Powerline fonts. Below, I selected "Roboto Mono Medium for Powerline" and increased the font size to 13.

Under the same “Font” section, check “Use a different font for non-ASCII text” and select the same font as before. Refer to the image below.

Next, navigate to the "Window" tab and set the Transparency and Blur as shown below.

Then, navigate to the “Terminal” tab and check “Unlimited scrollback.”

Finally, let's set this newly created profile as the default by clicking on the "Other Actions…" dropdown and selecting "Set as Default" as shown below.

You should now see a star next to the newly created profile, indicating its status as the default profile for new windows.

Restart iTerm and you should see something similar to the one below.

Notice that we can barely see the directory indicator on the prompt. Also, the username@hostname is a little long for my liking. Let's fix those.

Go to the iTerm preferences again and navigate to the "Profiles" tab. Find "Blue" in the ANSI Colors under the "Normal" column and click on the colored box.

Set the RGB values to R:0 G:200 B:250 as shown below.

Quit iTerm by pressing CMD + Q and open a Finder window. Navigate to the home directory, reveal the hidden files with SHIFT + CMD + . and double click on the “.oh-my-zsh” folder.

Navigate to and click on the “themes” folder.

Look for the “agnoster.zsh-theme” file and open it using a text editor.

This is what the inside of the theme looks like:

Around line #92, look for the “%n@%m” character string.

Select "%n@%m" and replace it with whatever you'd like to display on the prompt.

Below, I simply replaced “%n@%m” with “Dd” for brevity.

Restart iTerm and you should get something similar like the image below.

If you navigate to a git repository, you’ll see something similar below:

And that’s it!

Happy coding!