Selecting Rows with .loc

As data scientists, we spent most of our time wrangling knee-deep in manipulating data using Pandas. In this post, we’ll be looking at the .loc property of Pandas to select rows based on some predefined conditions.

Let’s open up a Jupyter notebook, and let’s get wrangling!


The Data

We will be using the 311 Service Calls dataset¹ from the City of San Antonio Open Data website to illustrate how the different .loc techniques work.

Housekeeping

Before we get started, let’s do a little housekeeping first.

import pandas as pd

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)

Nothing fancy going on here. We’re just importing the mandatory Pandas library and setting the display options so that when we inspect our dataframe, the columns and rows won’t be truncated by Jupyter. We’re setting it up so that every output within a single cell is displayed and not just the last one.

def show_missing(df):
    """
    Return the total missing values and the percentage of
    missing values by column.
    """
    null_count = df.isnull().sum()
    null_percentage = (null_count / df.shape[0]) * 100
    empty_count = pd.Series(((df == ' ') | (df == '')).sum())
    empty_percentage = (empty_count / df.shape[0]) * 100
    nan_count = pd.Series(((df == 'nan') | (df == 'NaN')).sum())
    nan_percentage = (nan_count / df.shape[0]) * 100
    return pd.DataFrame({'num_missing': null_count, 'missing_percentage': null_percentage,
                         'num_empty': empty_count, 'empty_percentage': empty_percentage,
                         'nan_count': nan_count, 'nan_percentage': nan_percentage})

In the code above, we’re defining a function that will show us the number of missing or null values and their percentage.

Getting the Data

Let’s load the data into a dataframe.

Doing a quick df.head() we’ll see the first five rows of the data:

And df.info() will let us see the dtypes of the columns.

Then, show_missing(df) shows us if there are any missing values in the data.

Selecting rows where the column is null or not.

Let’s select rows where the 'Dept' column has null values and also filtering a dataframe where null values are excluded.

df['Dept'].value_counts(dropna=False)

df_null = df.loc[df['Dept'].isnull()]
df_null.head()
df_null.shape

df_notnull = df.loc[df['Dept'].notnull()]
df_notnull.head()
df_notnull.shape

First, we did a value count of the column ‘Dept’ column. The method .value_counts() returns a panda series listing all the values of the designated column and their frequency. By default, the method ignores NaN values and will not list it. However, if you include the parameter dropna=False it will include any NaN values in the result.

Next, the line df_null = df.loc[df['Dept'].isnull()] tells the computer to select rows in df where the column 'Dept' is null. The resulting dataframe is assigned to df_null , and all its rows will NaN as values in the ‘Dept’ column.

Similarly, the line df_notnull = df.loc[df['Dept'].notnull()] tells the computer to select rows in df where the column 'Dept' is not null. The resulting dataframe is assigned to df_notnull , and all its rows will not have any NaN as values in the ‘Dept’ column.

The general syntax for these two techniques are:

df_new = df_old.loc[df_old['Column Name'].isnull()]
df_new = df_old.loc[df_old['Column Name'].notnull()]

Selecting rows where the column is a specific value.

The 'Late (Yes/No)' column looks interesting. Let’s take a look at it!

df['Late (Yes/No)'].value_counts(dropna=False)

df_late = df.loc[df['Late (Yes/No)'] == 'YES']
df_late.head()
df_late.shape

df_notlate = df.loc[df['Late (Yes/No)'] == 'NO']
df_notlate.head()
df_notlate.shape

Again, we did a quick value count on the 'Late (Yes/No)' column. Then, we filtered for the cases that were late with df_late = df.loc[df['Late (Yes/No)'] == 'YES']. Similarly, we did the opposite by changing 'YES' to 'NO' and assign it to a different dataframe df_notlate.

The syntax is not much different from the previous example except the addition of == sign between the column and the value we want to compare. It basically asks, for every row, if the value on a particular column (left side) matches the value that we specified (right-side). If the match is True, it includes that row in the result. If the match is False, it ignores it.

Here’s the resulting dataframe for df_late:

And here’s the one for df_notlate:

The general syntax for this technique is:

df_new = df_old.loc[df_old['Column Name'] == 'some_value' ]

Selecting rows where the column is not a specific value.

We’ve learned how to select rows based on ‘yes’ and ‘no.’ But what if the values are not binary? For example, let’s look at the ‘Category’ column:

One hundred ninety-two thousand one hundred ninety-seven rows or records do not have a category assigned, but instead of NaN, empty, or null value, we get 'No Category' as the category itself. What if we want to filter these out? Enter: the != operator.

df.Category.value_counts(dropna=False)

df_categorized = df.loc[df['Category'] != 'No Category']
df_categorized.head()
df_categorized.shape

df_categorized.Category.value_counts(dropna=False)

As usual, we did customary value counts on the 'Category' column to see what we’re working with. Then, we created the df_categorized dataframe to include any records in the the df dataframe that don’t have 'No Category' as their value in the 'Category' column.

Here’s the result of doing a value count on the 'Category' column of the df_categorized dataframe:

As the screenshot above shows, the value counts retained everything but the ‘No Category.’

The general syntax for this technique is:

df_new = df_old.loc[df_old['Column Name'] != 'some_value' ]

Select rows based on multiple conditions.

Let’s consider the following columns, 'Late (Yes/No)' and 'CaseStatus':

What if we wanted to know which open cases right now are already passed their SLA (service level agreement)? We would need to use multiple conditions to filter the cases or rows in a new dataframe. Enter the & operator.

df_late_open = df.loc[(df['Late (Yes/No)'] == 'YES') & (df['CaseStatus'] == 'Open')]

df_late_open.head()
df_late_open.shape

The syntax is similar to the previous ones except for the introduction of the & operator in between parenthesis. In the line df_late_open = df.loc[(df[‘Late (Yes/No)’] == ‘YES’) & (df[‘CaseStatus’] == ‘Open’)], there are two conditions:

  1. (df[‘Late (Yes/No)’] == ‘YES’)
  2. (df[‘CaseStatus’] == ‘Open’)

We want both of these to be true to match a row, so we included the operator & in between them. In plain speak, the & bitwise operator simply means AND. Other bitwise operators include pipe| sign for OR and the tilde ~ for NOT. I encourage you to experiment using these bitwise operators to get a good feel of what all they can do. Just remember to enclose each condition between parenthesis so that you don’t confuse Python.

The general syntax for this technique is:

df_new = df_old.loc[(df_old['Column Name 1'] == 'some_value_1') & (df['Column Name 2'] == 'some_value_2')]

Select rows having a column value that belongs in some list of values.

Let’s look at the value count for the 'Council District' column:

What if we wanted to focus on districts #2, #3, #4, and #5 because they’re in south San Antonio, and they’re known for getting poor service from the city? (I’m so totally making this up by the way!) In this case, we could use the .isin() method like so:

df['Council District'].value_counts(dropna=False)

df_south = df.loc[df['Council District'].isin([2,3,4,5])]
df_south.head()
df_south.shape

df_south['Council District'].value_counts()

Remember to pass your choices inside the .isin() method as a list like ['choice1', 'choice2', 'choice3'] because otherwise, it will cause an error. For integers like in our example, it is not necessary to include quotation marks because quotation marks are for string values only.

Here’s the result of our new dataframe df_south:

The general syntax for this technique is:

df_new = df_old.loc[df_old[Column Name'].isin(['choice1', 'choice2', 'choice3'])]

Conclusion

And that’s it! In this post, we loaded the 311 service calls data into a dataframe and created subsets of data using the .loc method.


Thanks for reading! I hope you enjoyed today’s post. Data wrangling, at least for me, is a fun exercise because this is the phase where I first get to know the data and it gives me a chance to hone my problem-solving skills when faced with really messy data. Happy wrangling folks!

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] City of San Antonio Open Data. (May 31, 2020). 311 Service Calls. https://data.sanantonio.gov/dataset/service-calls

Published by

Ednalyn C. De Dios

I’ve always been enamored with code and I love data science because of its inherent power to solve real problems. Having grown up in the Philippines, served in the United States Navy, and worked in the nonprofit sector, I am driven to make the world a better place. I have started and participated in numerous campaigns that aim to reduce domestic violence and child abuse in the community.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.