# Introduction to Data Analysis – Part II

This is a continuation of the Chapter 1 summary of Python Data Analytics by Fabio Nelli. Click here for Part I.

## The Data Analysis Process

Data analysis is nothing more than a sequence of steps:

1. Problem definition
2. Data extraction
3. Data preparation: Cleaning
4. Data preparation: Transformation
5. Data exploration and visualization
6. Predictive modeling
7. Model validation/test
8. Deployment: visualization and interpretation of results
9. Deployment: deployment of solutions

## Problem Definition

“Data analysis always starts with a problem to be solved.” The system under study is examined so that informed predictions or choices can be made about it.

“Building a good team is certainly one of the key factors leading to success in data analysis.” Fabio recommends an effective cross-disciplinary team.

## Data Extraction

As much as possible, the sampled data must reflect the real world. Beyond selecting the data, identifying and using the best data sources is another issue to keep in mind.

## Data Preparation

Data preparation comprises obtaining, cleaning, normalizing, transforming, and optimizing a data set. Although it may seem the least problematic stage, it actually requires more resources and more time to complete than any other. Potential problems include data values that are ambiguous, missing, replicated, or out of range.
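The cleaning problems listed above can be handled directly in pandas. This is a minimal sketch with a made-up DataFrame (the `age` and `score` columns are hypothetical, chosen only to show missing, replicated, and out-of-range values):

```python
import numpy as np
import pandas as pd

# Hypothetical raw data exhibiting the problems mentioned above:
# a missing value (NaN), a replicated row, and an out-of-range age.
raw = pd.DataFrame({
    "age":   [34, 34, np.nan, 41, 230],
    "score": [7.2, 7.2, 5.1, 8.8, 6.0],
})

# Remove replicated rows.
clean = raw.drop_duplicates()

# Fill the missing age with the median of the observed ages.
clean["age"] = clean["age"].fillna(clean["age"].median())

# Drop out-of-range values (no human is 230 years old).
clean = clean[clean["age"].between(0, 120)]
```

The order of the steps matters: duplicates are dropped first so they do not bias the median used to fill the gap.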

## Data Exploration/Visualization

Exploring data involves “searching the data in graphical or statistical presentation to find patterns, connections, and relationships. Data visualization is the best tool to highlight possible patterns.”

Summarization is the process whereby data are reduced without sacrificing important information. Clustering is used to find groups united by common attributes. Another step of analysis focuses on the identification of relationships, trends, and anomalies in the data. Other methods of data mining automatically extract important facts or rules from the data.
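Summarization and anomaly detection can both be illustrated with a few lines of pandas. The sales records below are hypothetical, and the "more than double the group median" rule is just one simple way to flag an anomaly, not a method the book prescribes:

```python
import pandas as pd

# Hypothetical sales records, used only for illustration.
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "south"],
    "sales":  [120, 135, 80, 95, 410],
})

# Summarization: reduce the data to a few descriptive statistics per group.
summary = df.groupby("region")["sales"].agg(["mean", "max", "count"])

# A crude anomaly check: flag values more than double their group's median.
south = df.loc[df["region"] == "south", "sales"]
outliers = south[south > 2 * south.median()]
```

Here the 410 in the south group stands out against the group median of 95, which is exactly the kind of pattern a plot would also make visible.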

## Predictive Modeling

Predictive modeling is used to create or choose a statistical model that predicts the probability of a result. The purpose of these models is to make predictions about the data values and to classify new data products.

The models can be divided into three types:

• Classification models: if the result is categorical
• Regression models: if the result is numerical
• Clustering models: if the result is descriptive

Some of the methods include linear regression, logistic regression, classification and regression trees, and k-nearest neighbors.
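Linear regression, the first method listed, is easy to sketch with NumPy. The data set here is tiny and synthetic (y is roughly 2x + 1), invented purely to show the fit-then-predict pattern common to all of these models:

```python
import numpy as np

# Tiny synthetic data set: y is roughly 2*x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 9.1])

# Linear regression fits the line y = slope * x + intercept by least squares.
slope, intercept = np.polyfit(x, y, 1)

# The fitted model can then predict the value for new data.
prediction = slope * 5.0 + intercept
```

Because the result is numerical, this is a regression model in the taxonomy above; a classification model would instead output a category, and a clustering model a group label.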

Some models explain the characteristics of the system under study in a clear and simple way, while others have limited ability to explain those characteristics but still make good predictions.

## Model Validation

Validation of the model is the test phase. Data used to build the model is called the training set; data used to validate the model is called the validation set.

Comparing data enables us to evaluate the error and estimate the limits of validity.

This process allows you to numerically evaluate the effectiveness of the model and compare it with other existing models.
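The training/validation split described above can be sketched as follows. The data is synthetic (generated from a known line plus noise, an assumption made only so the example is self-contained), and the first-80/last-20 split stands in for whatever splitting strategy a real project would use:

```python
import numpy as np

# Synthetic data: y = 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, 100)

# Training set: the first 80 points, used to build the model.
# Validation set: the remaining 20, used only to test it.
x_train, x_val = x[:80], x[80:]
y_train, y_val = y[:80], y[80:]

slope, intercept = np.polyfit(x_train, y_train, 1)

# Evaluate the error on data the model has never seen.
predictions = slope * x_val + intercept
rmse = np.sqrt(np.mean((predictions - y_val) ** 2))
```

Comparing `rmse` across candidate models is one concrete way to "numerically evaluate the effectiveness of the model and compare it with other existing models."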

## Deployment

This is the final step of the analysis process, and it aims to translate the result into a benefit. Normally, it consists of “writing a report for management or for the customer who requested the analysis.”

In the report, the following topics are discussed:

• Analysis results
• Decision deployment
• Risk analysis