In this post, I’ll attempt to break down the steps involved in a successful data science project: from understanding business requirements to the maintenance of whatever data product your data science team ends up producing. Along the way, we’ll discuss the requirements for each step as well as the skills and tools that are essential for each step.
PAPEM-DM: A Data Science Framework
I first learned about the data science pipeline while attending Codeup (a fully immersive, project-based 20-week Data Science and Web Development career accelerator). Since then, I’ve read countless articles enumerating the steps involved in data science projects, but I have yet to come across one that includes them all. Hence, this article.
PAPEM-DM is a handy acronym that I made up to remind me of the steps: planning, acquisition, preparation, exploration, modeling, delivery, and maintenance. This should cover all the main steps of the data science process from end to end. Let me know if I’ve missed something important! Oh wait, they’re all important… so comment away! LOL.
In this very first step, we must determine the questions that need to be addressed. Most stakeholders and end-users speak business, not data science; more often than not, they ask questions that are either too general or too specific to address the real problem that needs to be solved. When starting a new project, it’s important to understand the business side of things first before getting lost in the weeds.
In his book Thinking with Data, Max Shron advocates answering the “why” before figuring out the “how.” He proposes using the CoNVO framework, which consists of context, needs, vision, and outcome. What are the circumstances or terms in which the problem can be better understood or explained? What specific needs can be satisfied by leveraging data? What will the project look like once it achieves its goal(s)? Last but not least, how will the deliverables of the project be used within the organization, and who will own and maintain them?
No fancy tools are needed for this step, but that doesn’t make it any less important. In fact, it’s critical: it can mean the difference between success and failure for your data science project.
The deliverable for this step is a clear delineation of the things that you want to accomplish and your measure of success.
This step involves acquiring the raw data you need to shed light on the problem at hand. It requires you to think about where the data will come from, how you will get it, and whether the process will be manual or automated.
The tools you’ll need will depend on the data source (where the data will be coming from) and your working environment. For example, if the data already lives in a data warehouse like Redshift, getting the data might simply mean connecting to it using SQLAlchemy and loading it into your Python environment as a Pandas dataframe. On the other hand, if the data is sent periodically via email, you might want to use a connector service like Fivetran to load the file into Redshift first, and then connect to the Redshift cluster using a business analytics tool like Microsoft Power BI. If a project is a one-time ad-hoc request, the process might simply mean retrieving a flat file (.csv, .tsv, or even .xlsx) from a local or shared drive and loading it directly into a Pandas dataframe for further processing in a Jupyter notebook. Another straightforward option is manually exporting the data from a CRM platform like Salesforce or Zendesk.
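To make the flat-file path concrete, here’s a minimal sketch that loads a CSV into a Pandas dataframe (the inline data stands in for a file on a local or shared drive); the commented-out lines show the general shape of the SQLAlchemy-to-warehouse route, with a placeholder connection string that is purely illustrative:

```python
import io

import pandas as pd

# Stand-in for a flat file sitting on a local or shared drive.
csv_data = io.StringIO(
    "customer_id,signup_date,plan\n"
    "1,2021-01-04,pro\n"
    "2,2021-02-11,free\n"
)
df = pd.read_csv(csv_data, parse_dates=["signup_date"])

# The warehouse route follows the same shape (placeholder URL, not a real cluster):
# from sqlalchemy import create_engine
# engine = create_engine("redshift+psycopg2://user:password@host:5439/dbname")
# df = pd.read_sql("SELECT * FROM events LIMIT 1000", engine)

print(df.dtypes)
```

Either way, the end state is the same: a dataframe in memory, ready for the preparation step.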
Whatever the case might be, the outcome of this step is a data set that’s ready to be processed by your weapon (or tool) of choice.
Real-world data is often messy and needs to be cleaned and processed before analysis. Data preparation is all about transforming the raw data from the previous step into a format that allows us to glean insights. This step usually includes taking care of missing, null, NaN, and duplicated values; depending on the nature of the project, these might need to be imputed or dropped entirely. You’ll also need to handle data types and resolve DateTime conflicts. In addition, you might need to combine several datasets into one massive table or dataframe, or trim a multitude of columns down to just a few.
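As a rough illustration of those cleaning chores, here’s a toy Pandas example (the column names and values are invented) that drops duplicates, coerces types, parses dates, and imputes a missing value:

```python
import pandas as pd

# Toy raw data: a duplicated row, a missing age, and a malformed date.
raw = pd.DataFrame({
    "user_id": [1, 2, 2, 3],
    "age": ["34", "29", "29", None],
    "joined": ["2021-01-04", "2021-02-11", "2021-02-11", "not a date"],
})

clean = (
    raw.drop_duplicates()  # remove exact duplicate rows
       .assign(
           # coerce strings to numbers; unparseable values become NaN
           age=lambda d: pd.to_numeric(d["age"], errors="coerce"),
           # parse dates; malformed entries become NaT instead of raising
           joined=lambda d: pd.to_datetime(d["joined"], errors="coerce"),
       )
)

# Impute the missing age with the median (dropping is the other common choice).
clean["age"] = clean["age"].fillna(clean["age"].median())
```

Whether you impute or drop is a project-level decision; the mechanics above stay the same either way.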
Again, the tools needed will vary depending on your working environment. For Python, the libraries you would likely use are Pandas, matplotlib, and scikit-learn. If you’re working with Power BI, you most likely will use its Power Query Editor and DAX formulas, keeping in mind that relationships between tables must also be defined properly before joins and merges can work. If your organization uses Fivetran, you might use its “transformations” feature to execute some SQL queries prior to analysis.
Once you’re done with cleaning and preparation, you’re ready to proceed to the next step: exploratory data analysis.
According to the Engineering Statistics Handbook, exploratory data analysis, or EDA for short, is an approach to data analysis that employs a variety of techniques in order to:
- maximize insight into a data set;
- uncover underlying structure;
- extract important variables;
- detect outliers and anomalies;
- test underlying assumptions;
- develop parsimonious models; and
- determine optimal factor settings.
In other words, this step is where you get to play with the data to discover interesting patterns and anomalies, and to identify which features or variables are your biggest drivers.
Feature engineering and pre-processing are also major components of this step. For Python, you can use Pandas, scipy, numpy, statsmodels, and visualization libraries like matplotlib, seaborn, plotly, and bokeh.
At the end of this step, not only should you have a dataset in a format that can be fed to a machine learning model, but you should also have answered some of the questions that were raised during the planning step.
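A quick sketch of what that playing-around can look like in Pandas, using a small synthetic dataset (the column names and the churn rule are invented for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic data standing in for a real prepared dataset.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "tenure_months": rng.integers(1, 48, size=200),
    "monthly_spend": rng.normal(50, 15, size=200).round(2),
})
df["churned"] = (df["tenure_months"] < 6).astype(int)  # toy target

print(df.describe())                                   # distribution of each variable
print(df.corr(numeric_only=True))                      # which variables move together
print(df.groupby("churned")["monthly_spend"].mean())   # compare segments against the target
```

From there you’d reach for seaborn or matplotlib to visualize whatever the summaries surface.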
This is the most popular part of the data science project lifecycle. In the modeling step, we take our cleaned and processed data and use it to train and test one or a few machine learning algorithms in order to make predictions.
The types of machine learning algorithms include regression, classification, clustering, anomaly detection, time-series forecasting, and my favorite: natural language processing or NLP.
The tasks involved in this step are:
- Split the data into training and testing sets
- Identify the machine learning model or models that are most appropriate in the project’s specific use case
- Train a model
- Make predictions based on the training set
- Evaluate the results on the training set
- Tune the hyperparameters
- Rinse and repeat
- Choose your best performing model
- Make predictions based on the testing set
- Evaluate the results on the testing set
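The steps above roughly map to code like this sketch, which uses scikit-learn’s bundled breast-cancer dataset and a logistic regression purely as an example; the right model, metric, and tuning strategy depend on your use case:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train a model (a classification example; max_iter raised so the solver converges).
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Evaluate on the training set first, then on the held-out testing set.
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, model.predict(X_test))
```

Hyperparameter tuning and the rinse-and-repeat loop would wrap this in something like cross-validation before you score the testing set with your best model.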
There is a plethora of machine learning algorithms out there, and it is up to the data scientist to select which ones to use depending on the nature of the features (variables) and the target (what we’re trying to predict).
Having trained and evaluated a machine learning model is fine and dandy, but sooner or later you’re going to need to enable others to use what you have discovered or developed. The deliverable of a data science project can be as simple as a slide deck that reports the findings of your exploratory data analysis, with recommendations on the next actions to take. It can be a self-service dashboard that others can use to facilitate data-driven decisions. You can generate a new table in a database to power live, real-time reporting. And last but not least, you can develop an application that uses your trained model to make predictions on new observations. For example, a mechanism that processes chat transcripts with a customer to predict their satisfaction or current sentiment.
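As one toy illustration of that last kind of deliverable, here’s a sketch of wrapping a trained model in a prediction function (the `predict_species` helper, the iris example, and the pickling scheme are all invented for illustration; a real service would load the serialized model behind an API route):

```python
import pickle

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a toy model and serialize it, as you would before handing it to a service.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
blob = pickle.dumps(model)

def predict_species(features, model_blob=blob):
    """Score one new observation; in production this would sit behind an endpoint."""
    m = pickle.loads(model_blob)
    return int(m.predict([features])[0])

# A new observation arrives (e.g. from a web form or a message queue) and gets scored.
label = predict_species([5.1, 3.5, 1.4, 0.2])
```

The same pattern scales up: persist the trained model once, then let the application deserialize and score on demand.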
The tools needed for this step depend on the type of deliverable. The skills required can range from simple storytelling techniques to a full-blown pipeline deployment in a serverless environment using services like AWS or Azure. Technological capabilities are also a factor that can dictate what type of data product you can deliver, which is why it’s crucial to have a great working relationship with your data engineering team. In fact, in a perfect world, data scientists and data engineers should be working in lockstep with each other.
In this agile world that we live in today, shipping a minimum viable product (MVP) today is more beneficial than shipping something perfect tomorrow.
“Ship now, iterate later.”– Chana
Maintaining data science projects requires constant vigilance over every component of the pipeline. It helps to review and examine any change that could affect the project from beginning to end. For example, the shape and structure of the incoming data might change, which would break your pre-processing scripts. Make sure you’re aware of security vulnerabilities in the packages and libraries you use, and update each one accordingly while making sure project dependencies don’t break. Also, look for opportunities to improve the pipeline (like automation) as new technologies become available.
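One concrete maintenance habit is a lightweight schema check that fails fast when the shape or structure of upstream data changes; here’s a hypothetical sketch (`validate_schema` and the expected-columns mapping are invented for illustration):

```python
import pandas as pd

# The columns and dtypes the pipeline's pre-processing scripts depend on.
EXPECTED_COLUMNS = {"user_id": "int64", "age": "float64"}

def validate_schema(df: pd.DataFrame) -> list:
    """Return a list of problems so a pipeline run can fail fast and loudly."""
    problems = []
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return problems

good = pd.DataFrame({"user_id": [1, 2], "age": [34.0, 29.0]})
bad = pd.DataFrame({"user_id": [1, 2]})  # upstream silently dropped a column
```

Running a check like this at the top of the pipeline turns a silent upstream change into an obvious, actionable error.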
It’s been a long article already. To recap, use the PAPEM-DM framework when planning and executing data science projects. Please note that the steps in the framework are not isolated from one another; more often than not, you’ll find yourself working through them in an iterative, cyclical fashion rather than sequentially. For example, you might want to do more exploration after modeling because you discovered something interesting in the results and would like to dive deeper. Just don’t get lost in the weeds too much!
That’s it for today!
What is your experience like regarding the data science project lifecycle? How would this framework change or stay the same when switching from dev to production? Please comment away, the more the merrier!