Terminal Makeover with Oh-my-zsh and iTerm2

A visual step-by-step guide to replacing the default terminal application with iTerm2.

Over the weekend, I decided to restore my MacBook Pro to factory settings so I could have a clean start at setting up a programming environment.

In this post, we’ll work through setting up oh-my-zsh and iTerm2 on the Mac.

This is what the end result will look like:

The end result.

Let’s begin!

Press CMD + SPACE to bring up Spotlight.

Start typing “terminal” and you should see something similar to the image below.

Hit the enter key (gently, of course) to open the terminal application.

If you see something that says “The default interactive shell is now zsh…” it means you’re still using bash as your shell.

Let’s switch to zsh.

Click on “Terminal” and select “Preferences…” as shown below.

This will open up the terminal settings window.

In the “Shells open with” section, click on “Default login shell” as shown below.

Close the window by clicking on the “X” at the top left-hand corner and then restart the terminal. You should now see the terminal using zsh, like the one below.
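If you prefer the command line, you can make the same switch with a one-liner (it will ask for your password):

chsh -s /bin/zsh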

Installing Powerline Fonts

The “agnoster” theme requires some special fonts to render properly. Let’s install them now.

Type the following command into the terminal:

git clone https://github.com/powerline/fonts.git --depth=1

And then the following to change directory:

cd fonts

The directory will change to ~/fonts as shown below.

Type the following command to install the fonts into your system.

./install.sh

The output should look something like the one below.

Let’s go back up to the parent directory so we can do some cleanup:

cd ..

You should see output like the one below, indicating the home directory.

Let’s delete the installation folder with the following command:

rm -rf fonts

The fonts folder should be deleted now. Let’s clear our console output.

clear

You should now see a clear console window like the one below.

Installing Oh-My-ZSH

Oh-My-ZSH takes care of the configuration for our zsh shell. Let’s install it now.

Type the following into the terminal (do not use any line breaks, this should be only one line):

sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

You should now see oh-my-zsh installed on your computer.

If you see a message that says “Insecure completion-dependent directories detected,” we need to set ZSH_DISABLE_COMPFIX to "true" in the .zshrc file in the home directory.

To do this, open up a Finder window and navigate to the home directory.

Press SHIFT + CMD + . to reveal hidden files. You should now see something similar below.

Open the .zshrc file using a text editor like Sublime.

This is what the inside of the .zshrc file looks like:

Scroll down around line #73.

Insert the following line right before source $ZSH/oh-my-zsh.sh:

ZSH_DISABLE_COMPFIX="true"

Save and close the .zshrc file, and open a new terminal window. You should see something similar to the one below.

Replacing the Default Terminal

Go to https://www.iterm2.com/version3.html and download the latest version.

Save the downloaded file in your “Downloads” folder like so:

Open a new Finder window and navigate to “Downloads.” You should see something similar to the image below. Double-click on the zip file and it should extract an instance of the iTerm app.

Double-click on “iTerm.app”

If prompted regarding the app being downloaded from the Internet, click “Open.”

If prompted to move the app into the Applications folder, click “Move to Applications Folder.”

Close all windows and press CMD + SPACE to pull up the Spotlight search and type in “iterm.” Hit ENTER and you should now see the iTerm app.

Open a Finder window, navigate to the home directory, and find the .zshrc file.

Open the .zshrc file using a text editor.

Find ZSH_THEME="robbyrussell" and replace "robbyrussell" with "agnoster" as shown below.
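The line should end up looking like this:

ZSH_THEME="agnoster"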

Save and close the file. Close any remaining open iTerm windows by pressing CMD + Q.

Restart iTerm by pressing CMD + SPACE and typing in “iterm” as shown in the images below.

Hit the ENTER key and a new iTerm window should open like the one below.

The prompt looks a little weird. Let’s fix it!

Go to iTerm2 and select Preferences… as shown below.

You’ll see something like the image below.

Click on “Profiles.”

Find the “+” at the lower left corner of the window, below the Profile Name area, beside “Tags >”.

Click on the “+” sign.

On the General tab, under the Basics area, replace the default “New Profile” name with your preferred profile name. Below, I typed in “Gunmetal Blue.”

In Title, click on the dropdown and check or uncheck your preferences for the window title appearance.

Navigate to the Colors tab, click on the “Color Presets…” dropdown in the lower right-hand corner of the window, and select “Smooooooth.”

Find “Background” in the Basic Colors section and set the color to R:0 G:50 B:150 as shown below.

Navigate to the “Text” tab and find the “Font” section. Select any of the Powerline fonts. Below, I selected “Roboto Mono Medium for Powerline” and increased the font size to 13.

Under the same “Font” section, check “Use a different font for non-ASCII text” and select the same font as before. Refer to the image below.

Next, navigate to the “Window” tab and set the Transparency and Blur as shown below.

Then, navigate to the “Terminal” tab and check “Unlimited scrollback.”

Finally, let’s set this newly created profile as the default by clicking on the “Other Actions…” dropdown and selecting “Set as Default” as shown below.

You should now see a star next to the newly created profile, indicating its status as the default profile for new windows.

Restart iTerm and you should see something similar to the one below.

Notice that we can barely see the directory indicator on the prompt. Also, the username@hostname is a little long for my liking. Let’s fix those.

Go to the iTerm preferences again and navigate to the “Profiles” tab. Find “Blue” in the ANSI Colors under the “Normal” column and click on the colored box.

Set the RGB values to R:0 G:200 B:250 as shown below.

Quit iTerm by pressing CMD + Q and open a Finder window. Navigate to the home directory, reveal the hidden files with SHIFT + CMD + ., and double-click on the “.oh-my-zsh” folder.

Navigate to and click on the “themes” folder.

Look for the “agnoster.zsh-theme” file and open it using a text editor.

This is what the inside of the theme looks like:

Around line #92, look for the “%n@%m” character string.

Select “%n@%m” and replace it with whatever you’d like to display on the prompt.

Below, I simply replaced “%n@%m” with “Dd” for brevity.
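For reference, the relevant line inside the prompt_context function looks roughly like this (the exact contents can vary between theme versions), before and after the change:

# before (around line 92; may differ slightly by version)
prompt_segment black default "%(!.%{%F{yellow}%}.)%n@%m"

# after
prompt_segment black default "%(!.%{%F{yellow}%}.)Dd"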

Restart iTerm and you should get something similar to the image below.

If you navigate to a git repository, you’ll see something similar below:

And that’s it!

Happy coding!

Programming Environment Setup

I was bored over the weekend, so I decided to restore my MacBook Pro to factory settings so that I could set up my programming environment the proper way. After all, what’s a data scientist without her toys?

Let’s start by replacing the default terminal and installing pyenv to manage different Python versions.

Let’s move on to managing different Python interpreters and virtual environments using pyenv-virtualenv.

Drop It Like It’s Hot

I have a recurring dream where my instructor from a coding boot camp constantly beats my head with a ruler, telling me to read a package or library’s documentation. Hence, as a pastime, I find myself digging into Python’s or pandas’ documentation.

Today, I found myself wandering into pandas’ .drop() function. So, in this post, I shall attempt to make sense of pandas’ documentation for the ever-famous .drop().


Housekeeping

Let’s import pandas and create a sample dataframe.

import pandas as pd

data = {'fname': ['Priyanka', 'Jane', 'Sarah', 'Jake', 'Tatum', 'Shubham', 'Antonio'],
        'color': ['Red', 'Orange', 'Yellow', 'Green', 'Blue', 'Indigo', 'Violet'],
        'value': [0, 1, 2, 3, 5, 8, 13],
        'score': [345, 778, 124, 554, 864, 908, 456]
       }

df = pd.DataFrame(data)

If we type df into a cell in a Jupyter notebook, this will give us the whole dataframe:

One-level DataFrame Operations

Now let’s get rid of some columns.

df.drop(['color', 'score'], axis=1)

The code above simply tells Python to get rid of 'color' and 'score' along axis=1, which means look in the columns. Alternatively, we can skip the axis parameter altogether, since it can be confusing, and name the columns directly. So, let’s try that now:

df.drop(columns=['color', 'score'])

Both of the methods above will result in the following:

Next, we’ll get rid of some rows (or records).

df.drop([1, 2, 4, 6])

Above, we’re simply telling Python to get rid of the rows with the indices 1, 2, 4, and 6. Note that the indices are passed as a list, [1, 2, 4, 6]. This will result in the following:

MultiIndex DataFrame Operations

In this next round, we’re going to work with a multi-index dataframe. Let’s set it up:

data = pd.MultiIndex(levels=[['slim jim', 'avocado', 'banana', 'pork rinds'],
                             ['carbs', 'fat', 'protein']],
                     codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3],
                            [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2]])

df = pd.DataFrame(index=data, columns=['thinghy', 'majig'],
                  data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
                        [250, 150], [1.5, 0.8], [320, 250],
                        [1, 0.8], [0.3, 0.2], [34.2, 56], [33, 45.1], [67.3, 98]])

This is what the multi-index dataframe looks like:

Now, let’s get rid of the 'thinghy' column with:

df.drop(columns='thinghy')

And this is what we get:

Next, let’s get rid of 'pork rinds' because I don’t like them:

df.drop(index='pork rinds', level=0)

And this is what we get:

And finally, let’s cut the fat:

df.drop(index='fat', level=1)

Above, level=1 simply means the second level (since level numbering starts at 0). In this case, that’s the carbs, fat, and protein level. By specifying index='fat', we’re telling Python to get rid of 'fat' in level=1.

Here’s what we get:
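As an aside (not part of the original walkthrough), if you only want to drop the fat row for a single food rather than across the board, you can pass the full index tuple instead:

# drop only the ('avocado', 'fat') row; other foods keep their fat rows
df.drop(index=[('avocado', 'fat')])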

Staying Put

So far, with all the playing that we did, if we type df into a cell, the output we get is still the original dataframe, without modifications. This is because .drop() returns a new dataframe and leaves the original untouched, so all the changes we’ve been making only show up in the displayed output.

But what if we want to make the changes permanent? Enter: inplace.

df.drop(index='fat', level=1, inplace=True)

Above, we added inplace=True to the parameters. This signals Python that we want the change applied in place, so that when we output df, this is what we’ll get:

We’ve permanently cut the fat off. LOL!
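As a side note, you can get the same permanent effect without inplace by assigning the result back to the variable, which is the idiom I reach for more often:

# equivalent to the inplace version above: keep the returned dataframe
df = df.drop(index='fat', level=1)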


Thank you for reading! That’s it for today.

Stay tuned!

You can reach me on Twitter or LinkedIn.

Selecting Rows with .loc

As data scientists, we spend most of our time knee-deep in data, wrangling it with Pandas. In this post, we’ll be looking at the .loc property of Pandas to select rows based on some predefined conditions.

Let’s open up a Jupyter notebook, and let’s get wrangling!


The Data

We will be using the 311 Service Calls dataset¹ from the City of San Antonio Open Data website to illustrate how the different .loc techniques work.

Housekeeping

Before we get started, let’s do a little housekeeping first.

import pandas as pd

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)  # None shows full column contents (older pandas used -1 here)

Nothing fancy going on here. We’re just importing the mandatory Pandas library and setting the display options so that when we inspect our dataframe, the columns and rows won’t be truncated by Jupyter. We’re setting it up so that every output within a single cell is displayed and not just the last one.

def show_missing(df):
    """
    Return the total missing values and the percentage of
    missing values by column.
    """
    null_count = df.isnull().sum()
    null_percentage = (null_count / df.shape[0]) * 100
    empty_count = pd.Series(((df == ' ') | (df == '')).sum())
    empty_percentage = (empty_count / df.shape[0]) * 100
    nan_count = pd.Series(((df == 'nan') | (df == 'NaN')).sum())
    nan_percentage = (nan_count / df.shape[0]) * 100
    return pd.DataFrame({'num_missing': null_count, 'missing_percentage': null_percentage,
                         'num_empty': empty_count, 'empty_percentage': empty_percentage,
                         'nan_count': nan_count, 'nan_percentage': nan_percentage})

In the code above, we’re defining a function that will show us the number of missing or null values and their percentage.

Getting the Data

Let’s load the data into a dataframe.
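As a sketch, assuming the CSV export from the link above was saved in the working directory as allservicecalls.csv (the filename is just an assumption; use whatever name you saved it under):

# load the 311 service calls export into a dataframe
df = pd.read_csv('allservicecalls.csv')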

Doing a quick df.head() we’ll see the first five rows of the data:

And df.info() will let us see the dtypes of the columns.

Then, show_missing(df) shows us if there are any missing values in the data.

Selecting rows where the column is null or not.

Let’s select rows where the 'Dept' column has null values, and also filter a dataframe where null values are excluded.

df['Dept'].value_counts(dropna=False)

df_null = df.loc[df['Dept'].isnull()]
df_null.head()
df_null.shape

df_notnull = df.loc[df['Dept'].notnull()]
df_notnull.head()
df_notnull.shape

First, we did a value count on the 'Dept' column. The method .value_counts() returns a pandas Series listing all the values of the designated column and their frequencies. By default, the method ignores NaN values and will not list them. However, if you include the parameter dropna=False, it will include any NaN values in the result.

Next, the line df_null = df.loc[df['Dept'].isnull()] tells the computer to select rows in df where the column 'Dept' is null. The resulting dataframe is assigned to df_null, and all of its rows will have NaN as the value in the 'Dept' column.

Similarly, the line df_notnull = df.loc[df['Dept'].notnull()] tells the computer to select rows in df where the column 'Dept' is not null. The resulting dataframe is assigned to df_notnull, and none of its rows will have NaN as the value in the 'Dept' column.

The general syntax for these two techniques is:

df_new = df_old.loc[df_old['Column Name'].isnull()]
df_new = df_old.loc[df_old['Column Name'].notnull()]

Selecting rows where the column is a specific value.

The 'Late (Yes/No)' column looks interesting. Let’s take a look at it!

df['Late (Yes/No)'].value_counts(dropna=False)

df_late = df.loc[df['Late (Yes/No)'] == 'YES']
df_late.head()
df_late.shape

df_notlate = df.loc[df['Late (Yes/No)'] == 'NO']
df_notlate.head()
df_notlate.shape

Again, we did a quick value count on the 'Late (Yes/No)' column. Then, we filtered for the cases that were late with df_late = df.loc[df['Late (Yes/No)'] == 'YES']. Similarly, we did the opposite by changing 'YES' to 'NO' and assigning the result to a different dataframe, df_notlate.

The syntax is not much different from the previous example except for the addition of the == sign between the column and the value we want to compare. It basically asks, for every row, whether the value in a particular column (left side) matches the value that we specified (right side). If the match is True, that row is included in the result. If the match is False, it is ignored.
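Under the hood, the comparison produces a boolean Series (a mask of True/False values, one per row) that .loc then uses to pick rows. Here’s the same filter written in two steps to make that explicit:

# build the mask first, then hand it to .loc
mask = df['Late (Yes/No)'] == 'YES'
df_late = df.loc[mask]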

Here’s the resulting dataframe for df_late:

And here’s the one for df_notlate:

The general syntax for this technique is:

df_new = df_old.loc[df_old['Column Name'] == 'some_value' ]

Selecting rows where the column is not a specific value.

We’ve learned how to select rows based on ‘yes’ and ‘no.’ But what if the values are not binary? For example, let’s look at the ‘Category’ column:

192,197 rows (or records) do not have a category assigned, but instead of a NaN, empty, or null value, we get 'No Category' as the category itself. What if we want to filter these out? Enter: the != operator.

df.Category.value_counts(dropna=False)

df_categorized = df.loc[df['Category'] != 'No Category']
df_categorized.head()
df_categorized.shape

df_categorized.Category.value_counts(dropna=False)

As usual, we did the customary value counts on the 'Category' column to see what we’re working with. Then, we created the df_categorized dataframe to include any records in the df dataframe that don’t have 'No Category' as their value in the 'Category' column.

Here’s the result of doing a value count on the 'Category' column of the df_categorized dataframe:

As the screenshot above shows, the value counts retained everything but the ‘No Category.’

The general syntax for this technique is:

df_new = df_old.loc[df_old['Column Name'] != 'some_value' ]

Select rows based on multiple conditions.

Let’s consider the following columns, 'Late (Yes/No)' and 'CaseStatus':

What if we wanted to know which currently open cases are already past their SLA (service level agreement)? We would need to use multiple conditions to filter the cases or rows into a new dataframe. Enter the & operator.

df_late_open = df.loc[(df['Late (Yes/No)'] == 'YES') & (df['CaseStatus'] == 'Open')]

df_late_open.head()
df_late_open.shape

The syntax is similar to the previous ones except for the introduction of the & operator in between parentheses. In the line df_late_open = df.loc[(df['Late (Yes/No)'] == 'YES') & (df['CaseStatus'] == 'Open')], there are two conditions:

  1. (df['Late (Yes/No)'] == 'YES')
  2. (df['CaseStatus'] == 'Open')

We want both of these to be true to match a row, so we included the operator & in between them. In plain speak, the & bitwise operator simply means AND. Other bitwise operators include the pipe | for OR and the tilde ~ for NOT. I encourage you to experiment with these bitwise operators to get a good feel for what they can do. Just remember to enclose each condition in parentheses so that you don’t confuse Python.
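For instance, here’s a quick sketch of the other two operators on the same columns (the 'Closed' status value is an assumption; check your own value counts first):

# OR: cases that are late or already closed
df_late_or_closed = df.loc[(df['Late (Yes/No)'] == 'YES') | (df['CaseStatus'] == 'Closed')]

# NOT: cases that are not late
df_not_late = df.loc[~(df['Late (Yes/No)'] == 'YES')]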

The general syntax for this technique is:

df_new = df_old.loc[(df_old['Column Name 1'] == 'some_value_1') & (df_old['Column Name 2'] == 'some_value_2')]

Select rows having a column value that belongs in some list of values.

Let’s look at the value count for the 'Council District' column:

What if we wanted to focus on districts #2, #3, #4, and #5 because they’re in south San Antonio, and they’re known for getting poor service from the city? (I’m so totally making this up by the way!) In this case, we could use the .isin() method like so:

df['Council District'].value_counts(dropna=False)

df_south = df.loc[df['Council District'].isin([2,3,4,5])]
df_south.head()
df_south.shape

df_south['Council District'].value_counts()

Remember to pass your choices inside the .isin() method as a list, like ['choice1', 'choice2', 'choice3'], because passing a bare value will cause an error. For integers like in our example, it is not necessary to include quotation marks because quotation marks are for string values only.

Here’s the result of our new dataframe df_south:

The general syntax for this technique is:

df_new = df_old.loc[df_old['Column Name'].isin(['choice1', 'choice2', 'choice3'])]
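As a small aside, combining .isin() with the ~ operator from earlier gives you the complement, i.e., every row whose value is not in the list:

# all districts except 2, 3, 4, and 5
df_not_south = df.loc[~df['Council District'].isin([2, 3, 4, 5])]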

Conclusion

And that’s it! In this post, we loaded the 311 service calls data into a dataframe and created subsets of it using the .loc indexer.


Thanks for reading! I hope you enjoyed today’s post. Data wrangling, at least for me, is a fun exercise because this is the phase where I first get to know the data and it gives me a chance to hone my problem-solving skills when faced with really messy data. Happy wrangling folks!

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1] City of San Antonio Open Data. (May 31, 2020). 311 Service Calls. https://data.sanantonio.gov/dataset/service-calls

Flexing my Regex muscles

I spent 12 hours scouring the internet for this problem. I was close to giving up, so I decided to ask a question on Stack Overflow. Since this blog is all about my journey, I am going to share the love and document my exploits here.

Here’s the problem:

In a dataframe, I have a duration column in a human-readable format like “29 days 4 hours 32 minutes 1 second”. I want to break it down into separate days, hours, minutes, and seconds columns, with the values derived from the duration column: 29 for days, 4 for hours, 32 for minutes, and 1 for seconds. I’ve already tried this, but it’s not working correctly:

# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')

Results in the following dataframe:

The new dataframe only has the first captured value of each row but not the rest. It captured the 29 for days, and the 1, 3, 4, 2 for minutes, but the remaining column values are NaN.

Ideally, the dataframe should look like the one below:

I have a feeling something is wrong with my regex.

Update: Yes, I was right about being wrong with the regex. I should’ve been using extractall instead of just extract (extract only returns the first match in each string, while extractall returns them all). Here’s the correct code:

import pandas as pd

data = {'id': ['123', '124', '125', '126', '127'],
        'date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018', '1/5/2018'],
        'duration': ['29 days 4 hours 32 minutes',
                     '1 hour 23 minutes',
                     '3 hours 2 minutes 1 second',
                     '4 hours 46 minutes 22 seconds',
                     '2 hours 1 minute']}

df = pd.DataFrame(data)

# Use regex to extract time values into their respective columns
# (raw strings so the \d and \s escapes are passed through untouched)
pattern = (r'(?P<days>\d+)(?= days?\s*)|'
           r'(?P<hours>\d+)(?= hours?\s*)|'
           r'(?P<minutes>\d+)(?= minutes?\s*)|'
           r'(?P<seconds>\d+)(?= seconds?\s*)')

new_df = (df.duration.str.extractall(pattern)   # extract all matches with the regex
          .reset_index('match', drop=True)      # merge the matches of the same row
          .stack()
          .unstack(level=-1, fill_value=0)      # remove fill_value if you want NaN instead of 0
          )

Thanks to Quang Hoang and everybody else for answering the problem!

How to put two or more dataframes together

The Pandas function below takes a list of dataframes and concatenates them into one. This basic flavor of concat() joins the dataframes vertically. In other words, the rows of one dataframe get added after the rows of the previous one.

df = pd.concat([df1,df2,df3])
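One small aside: each dataframe keeps its original row index by default, so the combined dataframe can end up with duplicate index values. If you’d rather get a fresh 0-to-n index, pass ignore_index=True:

df = pd.concat([df1, df2, df3], ignore_index=True)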

Or if you want, you can store the list of dataframes into a variable first and then call the concat function. Like so:

# we must import pandas first
# put it in the beginning of your file
import pandas as pd

frames = [df1, df2, df3, df4, df5]
df = pd.concat(frames)

On the other hand, if I want to join the dataframes horizontally, then I can use merge().

For example, in the code below, we are merging df1 with df2 using 'column_name' as the common column. This is the column on which to base the merge. If any other identical columns exist between the two dataframes, the suffixes are appended to each of those column names accordingly.

This particular flavor of merge() joins the dataframes horizontally. In other words, the columns of the dataframes get added together to make one big mamma jamma of a dataframe:

df_merged = df1.merge(df2,
                      left_on='column_name',
                      right_on='column_name',
                      suffixes=('_left', '_right'))
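Since the key column has the same name in both dataframes here, the same merge can also be written a bit more compactly with the on parameter:

df_merged = df1.merge(df2, on='column_name', suffixes=('_left', '_right'))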

How to read all the files inside a folder into a single dataframe

Have a bunch of CSV files and don’t want to import them all individually? Well, I can’t blame you, and you shouldn’t anyway.

The function below reads all the files within a single folder and returns them as one Pandas dataframe. Of course, for it to work, the files need to have the same data structure. In other words, they all have to have the same column names in the same order.

# we import these first
# put them in the beginning of your file
import os
import pandas as pd

def read_data(folder):
    '''
    This function reads each of the raw data files as a dataframe and
    combines them into a single dataframe.
    '''
    frames = []
    for file_name in os.listdir(folder):
        try:
            # df = pd.read_excel(os.path.join(folder, file_name))          # excel
            # df = pd.read_csv(os.path.join(folder, file_name), sep='\t')  # tsv file
            df = pd.read_csv(os.path.join(folder, file_name))               # vanilla csv
            df['file_name'] = file_name   # keep track of which file each row came from
            frames.append(df)
        except Exception as e:
            print(f"Cannot read file: {file_name}")
            print(str(e))
    return pd.concat(frames)

# provide the folder path where your files reside
folder = 'G:/path/to/data/parent_folder_name'

# call the function
df = read_data(folder)