I’ve always been enamored with code and I love data science because of its inherent power to solve real problems. Having grown up in the Philippines, served in the United States Navy, and worked in the nonprofit sector, I am driven to make the world a better place. I have started and participated in numerous campaigns that aim to reduce domestic violence and child abuse in the community.
I am super stoked to announce that I’ve started to work with some awesome peeps at Microsoft through InsightsGlobal as a member of the Azure Machine Learning Team.
Thank you, Kathleen Scanlon and Kristi Calagno, for helping and guiding me towards this move.
I would like to express gratitude for all the experience and support that TaskUs has given me. Being part of TaskUs’ success and seeing it grow from a startup to a publicly traded company is something I feel very proud of.
It’s been a pleasure working with a great team in the Business Insights and Data Science department. I am blessed to have worked with talented people, in particular, Scott Gamester, Rachel Perez, and Darcy Delamore. I have built lasting friendships with colleagues that I will continue to cherish. I am grateful for my teammates, William Li, Dahlia Curtin, Sabrina Castillo, Antonio Morena, Tim Reyna, Sanjana Putchala, Priyanka Manchanda, and countless others.
I am sad to leave. At the same time, I am excited that everything I have learned during my time with TaskUs will help shape the rest of my career. Being with TaskUs afforded me the opportunity to apply data science to bring about actionable insights.
Lastly, I would like to offer my sincerest gratitude to Shauna Zamarippa and Tom Flynn for taking a chance on me.
A friendly tutorial on getting zip codes and other geographic data from street addresses.
Knowing how to deal with geographic data is a must for a data scientist. In this post, we will play around with the MapQuest Search API to get zip codes from street addresses, along with their corresponding latitude and longitude to boot!
The Scenario
In 2019, my friends and I participated in the CivTechSA Datathon. At one point in the competition, we wanted to visualize the data points and overlay them on San Antonio’s map. The problem was, we had incomplete data. Surprise! All we had were a street number and a street name — no zip code, no latitude, no longitude. We then turned to the great internet for some help.
We found a great API by MapQuest that would give us exactly what we needed. With just a sprinkle of Python code, we were able to accomplish our goal.
Today, we’re going to walk through this process.
The Data
To follow along, you can download the data from here. Just scroll down to the bottom and tab on over to the Data Catalog 2019. Look for SAWS (San Antonio Water System) as shown below.
Below are two functions that call the API and return geo data.
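Here is a minimal sketch of what those two functions might look like, assuming MapQuest’s geocoding endpoint and its usual response fields; the function names are illustrative, not necessarily the originals.

```python
import requests

BASE_URL = 'https://www.mapquestapi.com/geocoding/v1/address'

def call_api(address, key='#####'):
    """Call the MapQuest geocoding endpoint for one address and return the raw JSON."""
    url = f'{BASE_URL}?key={key}&location={address}'
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

def get_geo_data(address, key='#####'):
    """Pull the zip code, latitude, and longitude out of the API response."""
    location = call_api(address, key)['results'][0]['locations'][0]
    return {
        'zip_code': location['postalCode'],
        'latitude': location['latLng']['lat'],
        'longitude': location['latLng']['lng'],
    }
```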
We can manually call it with the line below. Don’t forget to replace the ‘#####’ with your own API key. You can use any address you want (replace spaces with a + character).
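A stand-in call using the sketch above might look like this (the address is only an example):

```python
geo = get_geo_data('100+Military+Plaza+San+Antonio+TX', key='#####')
print(geo)
```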
Finally, let’s create a dataframe that will house the street addresses — complete with zip code, latitude, and longitude.
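A hedged sketch of how that dataframe might be assembled, reusing the get_geo_data helper from above; the sample addresses are placeholders, not rows from the SAWS dataset.

```python
import pandas as pd

# Placeholder addresses; in the datathon these came from the SAWS dataset.
addresses = ['100+Military+Plaza', '515+Villita+St']

rows = []
for address in addresses:
    geo = get_geo_data(address, key='#####')  # helper sketched earlier
    geo['address'] = address.replace('+', ' ')
    rows.append(geo)

df = pd.DataFrame(rows, columns=['address', 'zip_code', 'latitude', 'longitude'])
print(df.head())
```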
A hands-on introduction to Microsoft’s analytics tool, Power BI
As a data scientist, you’ll need to learn to be comfortable with analytics tools sooner or later. In today’s post, we will dive headfirst and learn the very basics of Power BI.
Be sure to click on the images to better see some details.
The Data
The dataset that we will be using for today’s hands-on tutorial can be found at https://www.kaggle.com/c/instacart-market-basket-analysis/data. This dataset is “a relational set of files describing customers’ orders over time.” Download the zip files and extract them to a folder on your local hard drive.
In this case, Power BI Desktop is smart enough to infer the two relationships. However, most of the time, we will have to create the relationships ourselves. We will cover this topic in the future.
Let’s go back to the Report View and examine the “Visualizations” panel closely. Look for the “slicer” icon which looks like a square with a funnel at the bottom right corner. Click on it to add a visual to the report.
This will cause the “department_id” field to appear under the “Visualizations” panel in the “Field” box.
Next, hover your mouse cursor over the top right corner of the visual in the Report View. Click on the three dots that appear in the corner, as shown below.
While the “department_id” visual is selected, you should see corner markers indicating that it is the active visual. While it is active, press CTRL+C to copy it and then CTRL+V to paste it. Move the new visual to the right of the original visual.
Make the second visual active by clicking somewhere inside it. Then look for the “aisle_id” field in the “Fields” panel on the right of Power BI Desktop as shown below.
Let’s create another visual by copying and pasting the table visual. This time, select (or drag) the following fields to the “Values” box of the visual.
order_id
user_id
order_number
order_hour_of_day
order_dow
days_since_prior_order
Power BI Desktop should now look similar to the image below.
Try clicking one of the selections in the table visual (where it’s showing “product_id” and “product_name”) and observe how the table on the right changes accordingly.
Make sure the “Axis” box contains “order_dow” and the “Values” box contains “order_id”. Power BI Desktop should automatically calculate the count for “order_id” and display the field as “Count of order_id” as shown below.
This last graph is the most interesting because the number of reorders peaks at three points: 7, 14, and 30 days since the prior order. This suggests that people are in the habit of resupplying every week, every two weeks, and every month.
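As a quick cross-check outside Power BI, the same tally can be reproduced in pandas. A minimal sketch, assuming orders.csv from the extracted archive sits in the working directory:

```python
import pandas as pd

# Load the Instacart orders table.
orders = pd.read_csv('orders.csv')

# Count orders at each reorder gap; spikes at 7, 14, and 30 days would
# confirm the weekly, biweekly, and monthly resupply pattern in the chart.
gap_counts = orders['days_since_prior_order'].value_counts().sort_index()
print(gap_counts.loc[[7.0, 14.0, 30.0]])
```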
That’s it, folks!
In the next article, we will “prettify” our charts and make them more readable to others.
The procedures above may have felt drawn out. But if you’re a novice Power BI user, don’t despair! With regular practice, the concepts demonstrated in this article will soon become second nature, and you’ll probably be able to do them in your sleep.
Next, a visual placeholder will appear. Grab the hot corner marker on the lower right-hand corner of the placeholder and drag it diagonally down to the bottom right corner of the main working area.
Let’s change the Forecast length to 31 points. In this case, a data point equals a day, so 31 points roughly equate to a month’s worth of predictions. Click on “Apply” in the lower right corner of the Forecast group to apply the changes.
What if we wanted to compare how the forecast compares to actual data? We can do this with the “Ignore last” setting.
For this example, let’s ignore the last 3 months of the data. Power BI will then forecast 3 months’ worth of data using the dataset while ignoring its last 3 months. This way, we can compare Power BI’s forecasting results with the actual data in the last 3 months of the dataset.
Let’s click on “Apply” when we’re done changing the settings as shown below.
Below, we can see how Power BI’s forecast compares with the actual data. The solid black line represents the forecast, while the blue line represents the actual data.
The solid gray fill around the forecast represents the confidence interval. The higher its value, the larger the area will be. Let’s lower our confidence interval to 75% as shown below and see how it affects the graph.
Next, let’s take seasonality into account. Below, we set it to 90 points, which is equivalent to about 3 months. This value tells Power BI to look for seasonality within a 3-month cycle. Experiment with this value to find what makes sense for the data.
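For the curious, the “Ignore last” holdout and the 90-point seasonality can be mimicked outside Power BI. Power BI’s built-in forecast is reportedly based on exponential smoothing, so the statsmodels ETS model below is a rough stand-in; the synthetic series and every parameter are illustrative only.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Synthetic daily series with a ~90-day cycle, standing in for the real data.
rng = np.random.default_rng(0)
idx = pd.date_range('2023-01-01', periods=720, freq='D')
series = pd.Series(
    100 + 10 * np.sin(np.arange(720) * 2 * np.pi / 90) + rng.normal(0, 2, 720),
    index=idx,
)

holdout = 90          # mirrors "Ignore last": hold back ~3 months of data
train = series[:-holdout]

# Fit exponential smoothing with a 90-point seasonal cycle, forecast over
# the ignored window, and compare against the held-out actuals.
model = ExponentialSmoothing(train, trend='add', seasonal='add',
                             seasonal_periods=90).fit()
forecast = model.forecast(holdout)
print((forecast - series[-holdout:]).abs().mean())  # mean absolute error
```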
Every once in a while, I come across an article that decries online data science courses and boot camps as pathways to a data science job. Most of these articles aim not to discourage but to serve as a reminder to take a hard look in the mirror first and realize what we’re up against. However, a few detractors have proclaimed that the proliferation of these online courses and boot camps has caused the degradation of the profession.
With the latter, I vehemently disagree.
Bridging the Skill Gap
Data science has captured the popular imagination ever since Harvard Business Review dubbed data scientist the sexiest job of the 21st century. More than seven years later, data science remains one of the most highly sought-after professions today. In fact, due to the dynamics of supply and demand, “the United States alone is projected to face a shortfall of some 250,000 data scientists by 2024¹.”
As a result, capitalism and entrepreneurship answered the call and companies like Codeup have vowed to “help bridge the gap between companies and people wanting to enter the field.”²
In addition, AutoML libraries like PyCaret are “democratizing machine learning and the use of advanced analytics by providing free, open-source, and low-code machine learning solution for business analysts, domain experts, citizen data scientists, and experienced data scientists”³.
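To make “low-code” concrete, here is a rough sketch of the kind of workflow PyCaret enables; the dataset is a stand-in, and the exact API can shift between PyCaret releases:

```python
from sklearn.datasets import load_breast_cancer
from pycaret.classification import setup, compare_models

# A small labeled dataframe from scikit-learn; 'target' holds the class label.
df = load_breast_cancer(as_frame=True).frame

# PyCaret's low-code flow: one call to configure the experiment,
# one call to train, cross-validate, and rank a suite of models.
setup(data=df, target='target', session_id=42)
best_model = compare_models()
```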
The availability of online courses, boot camps, and AutoML libraries has led many data scientists to raise their eyebrows. They fear that boot camp alumni and self-taught candidates will somehow lower the overall caliber of data scientists and disgrace the field. Furthermore, they are afraid that the availability of tools like AutoML will allow anyone to be a data scientist.
I mean, God forbid if anyone thinks that they too can be data scientists! Right?
Wrong.
The Street Smart Data Scientist
Alumni of boot camps and self-taught learners, like myself, have one thing going for us: our rookie smarts. To quote Liz Wiseman, author of the book Rookie Smarts:
In a rapidly changing world, experience can be a curse. Being new, naïve, and even clueless can be an asset. — Liz Wiseman
Rookies are unencumbered. We are alert and constantly seeking like hunter-gatherers, cautious but quick like firewalkers, and hungry and relentless like frontiersmen⁴. In other words, we’re street smart.
Many are so bogged down by “you’ve got to learn this” and “you’ve got to learn that” that they forget to stress the fact that data science is so vast, you can’t possibly know everything. And that’s okay.
We learn fast and adapt quickly.
At the end of the day, it’s all about the value that we bring to our organizations. They are, after all, the ones paying our bills. We don’t get paid to memorize formulas or to code algorithms from scratch.
We get paid to solve problems.
And this is where the street smart data scientist excels. We don’t suffer from analysis paralysis, nor do we get bogged down in theory, at least not while on the clock. Our focus is on pragmatic solutions to problems, not on academic debate.
This is not to say we’re not interested in the latest research. In fact, it’s quite the contrary. We are voracious consumers of the latest developments in machine learning and AI. We drool over the latest advances in natural language processing. And we’re always on the lookout for the latest tool that will make our jobs easier and less boring.
And AutoML
So what if we have to use AutoML? If it gets us to an automated pipeline where analysts can get the results of machine learning without manual intervention by a data scientist, so much the better. We’re not threatened by automation; we’re exhilarated by it!
Do not let perfection be the enemy of progress. — Winston Churchill
In building an automated pipeline, there are bound to be some tradeoffs. But building it this way will free up our brain cells and give us more time to focus on solving higher-level problems and producing more impactful solutions.
We’re not concerned about job security, because we know that it doesn’t exist. What we do know is that the more value we bring to a business, the better off we will be in the long run.
Maybe They’re Right?
After all this, I will concede a bit. For the sake of argument, maybe they’re right. Maybe online courses, boot camps, and low-code machine learning libraries really do produce low-caliber data scientists.
Big maybe.
But still, I argue, this doesn’t mean we don’t have value. Data science skills lie on a spectrum, and so does a company’s maturity when it comes to data. Why hire a six-figure employee when your organization barely has a recognizable machine learning infrastructure?
Again, maybe.
The Unicorn
Maybe, to be labeled a data scientist, one must first be a unicorn. A unicorn data scientist is one who excels at all facets of data science.
Data science has long been described as the intersection of computer science, applied statistics, and business or domain knowledge. To this, they ask: how can one person possibly accumulate all that knowledge in just a few months? To this, we ask the same question: how can a college grad?
Unicorns do exist, I believe, but they too had to start from somewhere.
So why can’t we?
Conclusion
A whole slew of online courses and tools promise to democratize data science, and this is a good thing.