How to load data into a DataFrame

Below are three pandas functions that I use to load data into a dataframe. I use read_csv() to read a regular .csv file. For Excel files, I use read_excel(). And finally, for those weird tab-delimited files with the .tsv extension, I also use read_csv(), but with the added parameter sep='\t'.

# we must first import pandas
# add it at the beginning of your code
import pandas as pd

# to read a regular good ol' csv file
df = pd.read_csv('../data/data_file.csv')

# to read an excel (:::gasp:::) file
df = pd.read_excel('../data/data_file.xlsx')

# to read a .tsv file
df = pd.read_csv('../data/data_file.tsv', sep='\t')
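If you want to try the .tsv case without a file on disk, here is a minimal, self-contained sketch that feeds read_csv() an in-memory buffer instead of a path. The file contents below are made up purely for illustration:

```python
import io
import pandas as pd

# a tiny tab-delimited "file" in memory, standing in for a real .tsv path
tsv_data = "name\tscore\nAda\t90\nGrace\t95\n"

# sep='\t' tells pandas the columns are separated by tabs
df = pd.read_csv(io.StringIO(tsv_data), sep='\t')
print(df.shape)  # (2, 2)
```

The same io.StringIO trick works for quickly testing any read_csv() parameters before pointing them at your real data files.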

Show missing values in a Pandas DataFrame

The function below shows any null, NaN, or empty values in the dataframe. It takes a dataframe as a parameter and returns a dataframe whose columns show the count and percentage of null, NaN, and empty values.

def missing_values_col(df):
    """
    Write or use a previously written function to return the
    total missing values and the percent missing values by column.
    """
    null_count = df.isnull().sum()
    null_percentage = (null_count / df.shape[0]) * 100
    empty_count = pd.Series(((df == ' ') | (df == '')).sum())
    empty_percentage = (empty_count / df.shape[0]) * 100
    nan_count = pd.Series(((df == 'nan') | (df == 'NaN')).sum())
    nan_percentage = (nan_count / df.shape[0]) * 100
    return pd.DataFrame({'num_missing': null_count, 'missing_percentage': null_percentage,
                         'num_empty': empty_count, 'empty_percentage': empty_percentage,
                         'nan_count': nan_count, 'nan_percentage': nan_percentage})
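To see the building blocks of that function in action, here is a small, self-contained check of the isnull()-based counts on a made-up dataframe. The column names and values are invented for illustration:

```python
import numpy as np
import pandas as pd

# a tiny dataframe with one real NaN and one empty string
df = pd.DataFrame({'a': [1, np.nan, 3], 'b': ['', 'x', 'y']})

# count of true nulls/NaNs per column
null_count = df.isnull().sum()
null_percentage = (null_count / df.shape[0]) * 100

# count of empty or whitespace-only strings per column
empty_count = ((df == '') | (df == ' ')).sum()

print(null_count['a'])                 # 1
print(round(null_percentage['a'], 2))  # 33.33
print(empty_count['b'])                # 1
```

Note that df.isnull() only catches real NaN/None values; the string comparisons are what pick up cells that merely look empty.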

How to display all the columns and rows in a Jupyter notebook output

Here’s the code I use in my Jupyter notebook to display all the columns, rows, and characters. I usually put it at the beginning of my notebook, after the import statements.

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)  # -1 is deprecated in newer pandas; None means no limit
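If you only want these settings for a single cell rather than the whole notebook, pandas also has pd.option_context, which restores the previous values automatically when the block exits. A minimal sketch:

```python
import pandas as pd

# suppose the notebook-wide setting is a narrow column width
pd.set_option('display.max_colwidth', 20)

# temporarily remove the limit only inside this with-block
with pd.option_context('display.max_colwidth', None):
    print(pd.get_option('display.max_colwidth'))  # None

# outside the block, the previous value is back
print(pd.get_option('display.max_colwidth'))  # 20
```

This is handy when one dataframe has long text columns but you don’t want every output in the notebook to sprawl.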

Select Rows from a DataFrame Based on Multiple Conditions in Pandas

I’m continuing to break down a very popular stackoverflow post at https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas.

df.loc[(df['column_name'] >= A) & (df['column_name'] <= B)]

Here, the df’s in the line of code above represent the name of the data frame that we are working with.

.loc is pandas’ label-based indexer.

The outermost square brackets [...] contain the condition that you want to use as a filter. In this case, we have two conditions inside, separated by &, with each condition enclosed in its own pair of parentheses (...).

This is condition #1:

(df['column_name'] >= A)

This is condition #2:

(df['column_name'] <= B)

The column_name in condition #1 can be the same as or different from the one in condition #2; it all depends on how you want to filter the data frame. The same goes for the >= and <= comparison operators, and for A and B.

Suppose I have a data frame called my_tie_collection. If I only want ties that are both blue and made from silk, then I would type:

my_tie_collection.loc[(my_tie_collection['color'] == 'blue') & (my_tie_collection['material'] == 'silk')]

Take note: I used == to denote matching, and I also put single quotes around blue and silk.
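The tie example can be run end to end against a small dataframe; the colors and materials below are made up for illustration:

```python
import pandas as pd

# hypothetical tie collection, invented for this example
my_tie_collection = pd.DataFrame({
    'color': ['blue', 'blue', 'red'],
    'material': ['silk', 'wool', 'silk'],
})

# both conditions must be true: color is blue AND material is silk
blue_silk = my_tie_collection.loc[
    (my_tie_collection['color'] == 'blue') & (my_tie_collection['material'] == 'silk')
]
print(len(blue_silk))  # 1
```

Only the first row satisfies both conditions; the blue wool tie and the red silk tie are filtered out.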

Select Rows from a DataFrame based on a Value in a Column in Pandas

Today, I’ll be breaking down a very popular stackoverflow post at https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas.

df.loc[df['column_name'] == some_value]

Above, df is the name of the data frame. You should replace it twice with the name of your data frame.

.loc is pandas’ label-based indexer.

Next, replace column_name with the name of the column that contains the values you want to filter.

Finally, replace some_value with the desired value.

For example, if I have a data frame named “my_shoe_collection” and I want to select only the rows where the value of “color” is “blue”, then:

my_shoe_collection.loc[my_shoe_collection['color'] == 'blue']

Also, if I have the same data frame and I want to only select rows where the value of “price” is $50 or less, then:

my_shoe_collection.loc[my_shoe_collection['price'] <= 50]

Notice how I got rid of the single quotation marks since I’m dealing with an actual number?
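One related trick from the same stackoverflow post: when you want to match several values in one column at once, .isin() saves you from chaining == conditions together with |. A minimal sketch, using a made-up shoe collection:

```python
import pandas as pd

# hypothetical shoe collection, invented for this example
my_shoe_collection = pd.DataFrame({
    'color': ['blue', 'red', 'black', 'blue'],
    'price': [45, 80, 30, 120],
})

# select rows whose color is ANY of the listed values
subset = my_shoe_collection.loc[my_shoe_collection['color'].isin(['blue', 'black'])]
print(len(subset))  # 3
```

This reads much more cleanly than (color == 'blue') | (color == 'black'), especially once the list of values grows.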