The Data Analysis Process
Data analysis is nothing more than a sequence of steps:
- Problem definition
- Data extraction
- Data preparation: Cleaning
- Data preparation: Transformation
- Data exploration and visualization
- Predictive modeling
- Model validation/test
- Deployment: visualization and interpretation of results
- Deployment: deployment of solutions
“Data analysis always starts with a problem to be solved.” The system under study is examined so that the analysis can support informed predictions or choices.
“Building a good team is certainly one of the key factors leading to success in data analysis.” Fabio recommends an effective cross-disciplinary team.
As much as possible, sample data must reflect the real world. In addition to data selection, extracting and using the best data sources is another issue to keep in mind.
Data preparation consists of obtaining, cleaning, normalizing, transforming, and optimizing a data set. Although it may seem that data preparation is less problematic, it actually requires more resources and more time than the other steps. Potential problems include data values that are ambiguous, missing, replicated, or out of range.
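As a rough illustration of the cleaning problems listed above, here is a minimal sketch in plain Python. The records, the `temp_c` field, and the valid range are all hypothetical and chosen only for the example; in practice the rules depend on the data source.

```python
# Hypothetical raw records; None marks a missing reading.
raw_rows = [
    {"id": 1, "temp_c": 21.5},
    {"id": 2, "temp_c": None},    # missing value
    {"id": 3, "temp_c": 480.0},   # out of range
    {"id": 1, "temp_c": 21.5},    # replicated record (same id)
    {"id": 4, "temp_c": 19.0},
]

# Assumed plausible range for this made-up sensor, not taken from the source.
VALID_RANGE = (-40.0, 60.0)


def clean(rows, valid_range=VALID_RANGE):
    """Drop rows whose value is missing or out of range, and repeated ids."""
    low, high = valid_range
    seen_ids = set()
    cleaned = []
    for row in rows:
        value = row["temp_c"]
        if value is None:
            continue  # missing
        if not low <= value <= high:
            continue  # out of range
        if row["id"] in seen_ids:
            continue  # replicated
        seen_ids.add(row["id"])
        cleaned.append(row)
    return cleaned


clean_rows = clean(raw_rows)
print(clean_rows)
```

Dropping rows is only one option. Depending on the problem, missing values might instead be filled in or flagged for review.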
Exploring data involves “searching the data in graphical or statistical presentation to find patterns, connections, and relationships. Data visualization is the best tool to highlight possible patterns.”
Summarization is the process where data are reduced without sacrificing important information. Clustering is used to find groups united by common attributes. Another step of analysis focuses on identification of relationships, trends, and anomalies in the data. Other methods of data mining automatically extract important facts or rules from the data.
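To make summarization and anomaly identification a little more concrete, the sketch below reduces a small list of numbers to a few statistics and flags values that lie far from the mean. The data and the two-standard-deviation threshold are assumptions made for the example, not a rule from the source.

```python
import statistics

# Hypothetical measurements with one suspicious value.
values = [10, 12, 11, 13, 12, 11, 40]

# Summarization: a handful of numbers that describe the whole list.
summary = {
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "stdev": statistics.stdev(values),
}

# Anomaly check: flag values more than 2 standard deviations from the mean.
# The threshold of 2 is an assumed convention for this sketch.
THRESHOLD = 2.0
anomalies = [
    v for v in values
    if abs(v - summary["mean"]) / summary["stdev"] > THRESHOLD
]

print(summary)
print(anomalies)
```

The median stays close to the typical values here while the mean is pulled toward the outlier, which is one reason to look at more than one summary statistic.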
Predictive modeling is used to create or choose a statistical model that predicts the probability of a result. The purpose of these models is to make predictions about the data values and to classify new data products.
The models can be divided into three types:
- Classification models: if the result is categorical
- Regression models: if the result is numerical
- Clustering models: if the result is descriptive
Some of the methods include linear regression, logistic regression, classification and regression trees, and k-nearest neighbors.
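As one example of a classification model, here is a minimal k-nearest neighbors sketch using only the standard library. The training points, the labels, and the function name `knn_predict` are hypothetical. Real projects would typically use an established library rather than this hand-written version.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k closest training points.

    train: list of (point, label) pairs, where point is a tuple of numbers.
    """
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled data: two well-separated groups.
training_points = [
    ((1.0, 1.0), "small"),
    ((1.5, 2.0), "small"),
    ((2.0, 1.5), "small"),
    ((6.0, 6.5), "large"),
    ((7.0, 6.0), "large"),
    ((6.5, 7.5), "large"),
]

prediction_near_small = knn_predict(training_points, (1.2, 1.4))
prediction_near_large = knn_predict(training_points, (6.8, 6.9))
print(prediction_near_small, prediction_near_large)
```

Because the result is a category, this is a classification model in the sense of the list above. Averaging the neighbors' numeric values instead of voting would turn the same idea into a regression model.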
Some models explain the characteristics of the system under study in a clear and simple way, while others have limited ability to explain those characteristics but still make good predictions.
Validation of the model is the test phase. The data are called the training set when used to build the model and the validation set when used to validate it.
Comparing the model's predictions with the observed data enables us to evaluate the error and estimate the limits of validity.
This process allows you to numerically evaluate the effectiveness of the model and compare it with other existing models.
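The validation idea can be sketched as follows: fit a simple model on a training set, measure its error on a separate validation set, and compare that error with a naive baseline. The data, the least-squares line, and the mean-only baseline are all assumptions picked for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_squared_error(actual, predicted):
    return mean([(a - p) ** 2 for a, p in zip(actual, predicted)])


# Hypothetical data, split into a training set and a validation set.
train_x = [1, 2, 3, 4, 5, 6]
train_y = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1]
valid_x = [7, 8, 9]
valid_y = [15.2, 16.9, 19.1]

# Build the model on the training set only.
slope, intercept = fit_line(train_x, train_y)

# Evaluate on the validation set, which the model has not seen.
line_predictions = [slope * x + intercept for x in valid_x]
line_mse = mean_squared_error(valid_y, line_predictions)

# Baseline model for comparison: always predict the training mean.
baseline_predictions = [mean(train_y)] * len(valid_y)
baseline_mse = mean_squared_error(valid_y, baseline_predictions)

print(line_mse, baseline_mse)
```

Mean squared error is just one possible measure. The point is that the error is computed on data the model was not built from, which gives a number that can be compared across models.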
Deployment is the final step of the analysis process; it aims to translate the results into a benefit. Normally, it consists of “writing a report for management or for the customer who requested the analysis.”
In the report, the following topics are discussed:
- Analysis results
- Decision deployment
- Risk analysis
- Measuring the business impact
We’ll conclude this summary by discussing quantitative/qualitative data analysis and open data sources in part III.