I was bored over the weekend so I decided to restore my Macbook Pro to factory settings so that I can set up my programming environment the proper way. After all, what’s a data scientist without her toys?
Let’s start with a replacement for the default terminal and a pyenv installation to manage different Python versions.
Replacing the default Mac terminal with iTerm2 and Oh My Zsh.
Let’s move on to managing different Python interpreters and virtual environments using pyenv-virtualenv.
I remember a brief conversation with my boss’ boss a while back. He said that he wouldn’t be impressed if somebody in the company built a face recognition tool from scratch because, and I quote, “Guess what? There’s an API for that.” He then went on about the futility of doing something that’s already been done instead of just using it.
This gave me an insight into how an executive thinks. Not that they don’t care about the coolness factor of a project, but at the end of the day, they’re most concerned about how a project will add value to the business and, even more importantly, how quickly it can be done.
In the real world, the time it takes to build a prototype matters. And the quicker we get from data to insights, the better off we will be. This helps us stay agile.
And this brings me to PyCaret.
PyCaret is an open source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within seconds in your choice of notebook environment.[1]
PyCaret is essentially a wrapper around some of the most popular machine learning libraries and frameworks, such as scikit-learn and spaCy. Here are the things that PyCaret can do:
Classification
Regression
Clustering
Anomaly Detection
Natural Language Processing
Association Rule Mining
If you’re interested in reading about the difference between the traditional NLP approach and PyCaret’s NLP module, check out Prateek Baghel’s article.
Natural Language Processing
In just a few lines of code, PyCaret makes natural language processing so easy that it’s almost criminal. Like most of its other modules, PyCaret’s NLP module offers a streamlined pipeline that cuts the time from data to insights by more than half.
For example, with only one line, it performs text processing automatically, with the ability to customize stop words. Add another line or two, and you’ve got yourself a language model. With yet another line, it gives you a properly formatted plotly graph. And finally, adding another line gives you the option to evaluate the model. You can even tune the model with, guess what, one line of code!
Instead of just telling you all about the wonderful features of PyCaret, maybe it’d be better to do a little show and tell.
The Pipeline
For this post, we’ll create an NLP pipeline that involves the following 6 glorious steps:
Getting the Data
Setting up the Environment
Creating the Model
Assigning the Model
Plotting the Model
Evaluating the Model
We will be going through an end-to-end demonstration of this pipeline with a brief explanation of the functions involved and their parameters.
Let’s get started.
Housekeeping
Let us begin by installing PyCaret. If this is your first time installing it, just type the following into your terminal:
pip install pycaret
However, if you have a previously installed version of PyCaret, you can upgrade using the following command:
pip install --upgrade pycaret
Beware: PyCaret is a big library so it’s going to take a few minutes to download and install.
We’ll also need to download the English language model because it is not included in the PyCaret installation.
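PyCaret’s NLP module relies on spaCy under the hood, so, assuming the small English model is the one we need, the usual command is:
python -m spacy download en_core_web_sm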
Next, let’s fire up a Jupyter notebook and import PyCaret’s NLP module:
#import nlp module
from pycaret.nlp import *
Importing pycaret.nlp automatically sets up your environment to perform NLP tasks only.
Getting the Data
Before setup, we need to decide first how we’re going to ingest data. There are two methods of getting the data into the pipeline. One is by using a Pandas dataframe and another is by using a simple list of textual data.
Passing a DataFrame
#import pandas if we're gonna use a dataframe
import pandas as pd
# load the data into a dataframe
df = pd.read_csv('hilaryclinton.csv')
Above, we’re simply loading the data into a dataframe.
Passing a List
# read a file containing a list of text data and assign it to 'lines'
with open('list.txt') as f:
    lines = f.read().splitlines()
Above, we’re opening the file 'list.txt' and reading it. We assign the resulting list to the variable lines.
Sampling
For the rest of this experiment, we’ll just use a dataframe to pass textual data to the setup() function of the NLP module. And for the sake of expediency, we’ll sample the dataframe to select only a thousand tweets.
# sampling the data to select only 1000 tweets
df = df.sample(1000, random_state=493).reset_index(drop=True)
Let’s take a quick look at our dataframe with df.head() and df.shape.
Setting Up the Environment
In the line below, we’ll initialize the setup by calling the setup() function and assign the result to nlp.
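A minimal sketch of that call, with a hypothetical list of custom stop words, might look like this:
# initialize the NLP environment; the stop words below are placeholders only
nlp = setup(data = df, target = 'text', session_id = 493,
            custom_stopwords = ['rt', 'https', 'http', 'amp'])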
With data and target, we’re telling PyCaret that we’d like to use the values in the 'text' column of df. Also, we’re setting the session_id to an arbitrary number, 493, so that we can reproduce the experiment over and over again and get the same result. Finally, we added custom_stopwords so that PyCaret will exclude the specified list of words from the analysis.
Note that if we want to use a list instead, we could replace df with lines and get rid of target = 'text' because a list has no columns for PyCaret to target!
Here’s the output of nlp:
The output table above confirms our session id, number of documents (rows or records), and vocabulary size. It also shows whether or not we used custom stopwords.
Creating the Model
Below, we’ll create the model by calling the create_model() function and assign the result to lda. The function already knows to use the dataset that we specified during setup(). In our case, PyCaret knows we want to create a model based on the 'text' column in df.
# create the model
lda = create_model('lda', num_topics = 6, multi_core = True)
In the line above, notice that we passed 'lda' as the first parameter. LDA stands for Latent Dirichlet Allocation. We could’ve just as easily opted for other types of models.
Here’s the list of models that PyCaret currently supports:
‘lda’: Latent Dirichlet Allocation
‘lsi’: Latent Semantic Indexing
‘hdp’: Hierarchical Dirichlet Process
‘rp’: Random Projections
‘nmf’: Non-Negative Matrix Factorization
I encourage you to research the differences between the models above. To start, check out Lettier’s awesome guide on LDA.
The next parameter we used is num_topics = 6. This tells PyCaret to produce six topics in the results, numbered 0 to 5. If num_topics is not set, the default is 4. Lastly, we set multi_core = True to tell PyCaret to use all available CPUs for parallel processing. This saves a lot of computational time.
Assigning the Model
By calling assign_model(), we’re going to label our data so that we’ll get a dataframe (based on our original dataframe: df) with additional columns that include the following information:
Topic percent value for each topic
The dominant topic
The percent value of the dominant topic
# label the data using trained model
df_lda = assign_model(lda)
Let’s take a look at df_lda.
Plotting the Model
Calling the plot_model() function will give us visualizations of frequency, distribution, polarity, et cetera. The plot_model() function takes three parameters: model, plot, and topic_num. The model parameter tells PyCaret which model to use and must come from a prior create_model() call. topic_num designates which topic number (from 0 to 5) the visualization will be based on.
PyCaret offers a variety of plots, and the type of graph generated depends on the plot parameter. Here is the list of currently available visualizations, followed by a quick sketch of a typical call:
‘frequency’: Word Token Frequency (default)
‘distribution’: Word Distribution Plot
‘bigram’: Bigram Frequency Plot
‘trigram’: Trigram Frequency Plot
‘sentiment’: Sentiment Polarity Plot
‘pos’: Part of Speech Frequency
‘tsne’: t-SNE (3d) Dimension Plot
‘topic_model’ : Topic Model (pyLDAvis)
‘topic_distribution’ : Topic Infer Distribution
‘wordcloud’: Word cloud
‘umap’: UMAP Dimensionality Plot
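As a quick sketch, plotting the default word-frequency chart for our lda model could be as simple as:
# plot the word token frequency for the whole corpus
plot_model(lda, plot = 'frequency')
Swapping the plot argument for any of the values listed above changes the visualization accordingly.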
Evaluating the Model
Evaluating the model involves calling the evaluate_model() function. It takes only one parameter: the model to be used. In our case, the model is stored in lda, which we created with the create_model() function in an earlier step.
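Based on that description, the call is just:
# launch the interactive evaluation widget for the trained model
evaluate_model(lda)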
The function returns a visual user interface for plotting.
And voilà, we’re done!
Conclusion
Using PyCaret’s NLP module, we were able to go from getting the data to evaluating the model in just a few lines of code. We covered the functions involved in each step and examined the parameters of those functions.
Thank you for reading! PyCaret’s NLP module has a lot more features and I encourage you to read their documentation to further familiarize yourself with the module and maybe even the whole library!
In the next post, I’ll continue to explore PyCaret’s functionalities.
If you want to learn more about my journey from slacker to data scientist, check out the article here.
I have a recurring dream where my instructor from a coding boot camp keeps beating my head with a ruler, telling me to read a package or library’s documentation. Hence, as a pastime, I find myself digging into Python’s or pandas’ documentation.
Today, I found myself wandering into pandas’ .drop() function. So, in this post, I shall attempt to make sense of pandas’ documentation for the ever-famous .drop().
Housekeeping
Let’s import pandas and create a sample dataframe.
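Any small dataframe with 'color' and 'score' columns will do; here’s a hypothetical one with seven rows so the row-dropping examples below work:
import pandas as pd

# a made-up sample dataframe; the values are placeholders
df = pd.DataFrame({
    'name': ['apple', 'banana', 'cherry', 'dates', 'eggplant', 'fig', 'grape'],
    'color': ['red', 'yellow', 'red', 'brown', 'purple', 'purple', 'green'],
    'score': [90, 85, 80, 75, 70, 65, 60]
})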
If we type df into a cell in Jupyter notebook, this will give us the whole dataframe:
One-level DataFrame Operations
Now let’s get rid of some columns.
df.drop(['color', 'score'], axis=1)
The code above simply tells Python to get rid of 'color' and 'score' along axis=1, which means look in the columns. Alternatively, if the named axis parameter feels confusing, we could just as easily skip it and use the columns parameter instead. So, let’s try that now:
df.drop(columns=['color', 'score'])
Both of the methods above will result in the following:
Next, we’ll get rid of some rows (or records).
df.drop([1, 2, 4, 6])
Above, we’re simply telling Python to get rid of the rows with the index of 1, 2, 4, and 6. Note that the indices are passed as a list [1, 2, 4, 6]. This will result in the following:
MultiIndex DataFrame Operations
In this next round, we’re going to work with a multi-index dataframe. Let’s set it up:
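Any nested structure will do; here’s a hypothetical one where level 0 is the food and level 1 is the macronutrient:
# hypothetical multi-index dataframe: (food, nutrient) rows with made-up gram values
index = pd.MultiIndex.from_product(
    [['bacon', 'pork rinds', 'avocado'], ['carbs', 'fat', 'protein']],
    names=['food', 'nutrient'])
df = pd.DataFrame({'grams': [1, 42, 37, 0, 31, 62, 9, 15, 2]}, index=index)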
Next, let’s get rid of 'pork rinds' because I don’t like them:
df.drop(index='pork rinds', level=0)
And this is what we get:
And finally, let’s cut the fat:
df.drop(index='fat', level=1)
Above, level=1 simply means the second level of the index (since the first level starts at 0). In this case, that’s the level containing carbs, fat, and protein. By specifying index='fat', we’re telling Python to get rid of 'fat' in level=1.
Here’s what we get:
Staying Put
So far, despite all the playing around we did, if we type df into a cell, the output we get is still the original dataframe without modifications. This is because, by default, .drop() returns a new dataframe and leaves the original one untouched.
But what if we want to make the changes permanent? Enter: inplace.
df.drop(index='fat', level=1, inplace=True)
Above, we added the inplace=True parameter. This signals Python that we want the changes applied to the dataframe itself, so that when we output df, this is what we’ll get:
As data scientists, we spend most of our time knee-deep in data wrangling with pandas. In this post, we’ll be looking at the .loc property of pandas to select rows based on some predefined conditions.
Let’s open up a Jupyter notebook, and let’s get wrangling!
The Data
We will be using the 311 Service Calls dataset¹ from the City of San Antonio Open Data website to illustrate how the different .loc techniques work.
Housekeeping
Before we get started, let’s do a little housekeeping first.
import pandas as pd
# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)
Nothing fancy going on here. We’re just importing the mandatory Pandas library and setting the display options so that when we inspect our dataframe, the columns and rows won’t be truncated by Jupyter. We’re setting it up so that every output within a single cell is displayed and not just the last one.
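With the options set, we can load the data; the filename below is just a placeholder for wherever you saved the CSV:
# load the 311 service calls data (placeholder filename)
df = pd.read_csv('311_service_calls.csv')

# frequency of each department, including missing values
df['Dept'].value_counts(dropna=False)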
First, we did a value count of the 'Dept' column. The method .value_counts() returns a pandas Series listing all the values in the designated column and their frequencies. By default, the method ignores NaN values and will not list them. However, if you include the parameter dropna=False, it will include any NaN values in the result.
Next, the line df_null = df.loc[df['Dept'].isnull()] tells the computer to select rows in df where the column 'Dept' is null. The resulting dataframe is assigned to df_null, and all its rows will have NaN as the value in the 'Dept' column.
Similarly, the line df_notnull = df.loc[df['Dept'].notnull()] tells the computer to select rows in df where the column 'Dept' is not null. The resulting dataframe is assigned to df_notnull, and none of its rows will have NaN as the value in the 'Dept' column.
Again, we did a quick value count on the 'Late (Yes/No)' column. Then, we filtered for the cases that were late with df_late = df.loc[df['Late (Yes/No)'] == 'YES']. Similarly, we did the opposite by changing 'YES' to 'NO' and assigned the result to a different dataframe, df_notlate.
The syntax is not much different from the previous example except for the addition of the == sign between the column and the value we want to compare. It basically asks, for every row, whether the value in a particular column (left side) matches the value that we specified (right side). If the match is True, it includes that row in the result. If the match is False, it ignores it.
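Side by side, the two filters described above look like this:
# cases that were late vs. cases that were not
df_late = df.loc[df['Late (Yes/No)'] == 'YES']
df_notlate = df.loc[df['Late (Yes/No)'] == 'NO']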
Selecting rows where the column is not a specific value.
We’ve learned how to select rows based on ‘yes’ and ‘no.’ But what if the values are not binary? For example, let’s look at the ‘Category’ column:
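A quick value count shows what we’re dealing with:
# frequency of each category
df['Category'].value_counts()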
A total of 192,197 rows (or records) do not have a category assigned, but instead of NaN, an empty string, or a null value, we get 'No Category' as the category itself. What if we want to filter these out? Enter: the != operator.
As usual, we did customary value counts on the 'Category' column to see what we’re working with. Then, we created the df_categorized dataframe to include any records in the df dataframe that don’t have 'No Category' as their value in the 'Category' column.
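In code, that filter looks like this:
# keep only the records that actually have a category, then recount
df_categorized = df.loc[df['Category'] != 'No Category']
df_categorized['Category'].value_counts()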
Here’s the result of doing a value count on the 'Category' column of the df_categorized dataframe:
As the screenshot above shows, the value counts retained everything but the ‘No Category.’
Let’s consider the following columns, 'Late (Yes/No)' and 'CaseStatus':
What if we wanted to know which open cases right now are already past their SLA (service level agreement)? We would need to use multiple conditions to filter the cases, or rows, into a new dataframe. Enter: the & operator.
The syntax is similar to the previous ones except for the introduction of the & operator between parentheses. In the line df_late_open = df.loc[(df['Late (Yes/No)'] == 'YES') & (df['CaseStatus'] == 'Open')], there are two conditions:
(df['Late (Yes/No)'] == 'YES')
(df['CaseStatus'] == 'Open')
We want both of these to be true to match a row, so we included the & operator between them. In plain speak, the & bitwise operator simply means AND. Other bitwise operators include the pipe | sign for OR and the tilde ~ for NOT. I encourage you to experiment with these bitwise operators to get a good feel for what they can do. Just remember to enclose each condition in parentheses so that you don’t confuse Python.
The general syntax for this technique is:
df_new = df_old.loc[(df_old['Column Name 1'] == 'some_value_1') & (df_old['Column Name 2'] == 'some_value_2')]
Select rows having a column value that belongs in some list of values.
Let’s look at the value count for the 'Council District' column:
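A value count gives us the breakdown per district:
# how many calls came from each council district
df['Council District'].value_counts()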
What if we wanted to focus on districts #2, #3, #4, and #5 because they’re in south San Antonio, and they’re known for getting poor service from the city? (I’m so totally making this up by the way!) In this case, we could use the .isin() method like so:
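A sketch of that filter, with a hypothetical name for the resulting dataframe:
# keep only the south-side districts 2, 3, 4, and 5
df_south = df.loc[df['Council District'].isin([2, 3, 4, 5])]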
Remember to pass your choices to the .isin() method as a list, like ['choice1', 'choice2', 'choice3'], because otherwise it will cause an error. For integers, as in our example, quotation marks aren’t necessary because quotation marks are for string values only.
And that’s it! In this post, we loaded the 311 service calls data into a dataframe and created subsets of data using the .loc method.
Thanks for reading! I hope you enjoyed today’s post. Data wrangling, at least for me, is a fun exercise because this is the phase where I first get to know the data and it gives me a chance to hone my problem-solving skills when faced with really messy data. Happy wrangling folks!
In a previous post, we set out to explore the dataset provided by the Trump Twitter Archive. My initial goal was to do something fun by using a very interesting dataset. However, it didn’t quite turn out that way.
In this post, we’ll continue our journey, but this time we’ll be using spaCy.
For this project, we’ll be using pandas for data manipulation, spaCy for natural language processing, and joblib to speed things up.
Let’s get started by firing up a Jupyter notebook!
Housekeeping
Let’s import pandas and also set the display options so Jupyter won’t truncate our columns and rows. Let’s also set a random seed for reproducibility.
# for manipulating data
import pandas as pd

# setting the random seed for reproducibility
import random
random.seed(493)

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)
Getting the Data
Let’s read the data into a dataframe. If you want to follow along, you can download the cleaned dataset here along with the file for stop words¹. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.
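Reading it in is the usual one-liner; the filename below is a placeholder for wherever you saved the cleaned file:
# load the cleaned tweets (placeholder filename)
df = pd.read_csv('trump_20200530_clean.csv')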
Now let’s import spaCy and begin natural language processing.
# for natural language processing: named entity recognition
import spacy
import en_core_web_sm
We’re only going to use spaCy’s ner functionality, or named-entity recognition, so we’ll disable the rest of the pipeline components. This will save us a lot of loading time later.
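A minimal sketch of that, assuming spaCy’s default small English pipeline:
# load the small English model and keep only the named-entity recognizer
nlp = spacy.load('en_core_web_sm', disable=['tagger', 'parser'])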
Now let’s load the contents of the stopwords file into the variable stopwords. Note that we converted the list into a set to save some processing time later.
with open('twitter-stopwords — TA — Less.txt') as f:
    contents = f.read().split(',')
stopwords = set(contents)
Next, we’ll import joblib and define a few functions to help with parallel processing.
from joblib import Parallel, delayed
def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def flatten(list_of_lists):
    "Flatten a list of lists to a combined list"
    return [item for sublist in list_of_lists for item in sublist]

def process_chunk(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append([ent.text for ent in doc.ents if ent.label_ in ['NORP', 'PERSON', 'FAC', 'ORG', 'GPE', 'LOC', 'PRODUCT', 'EVENT']])
    return preproc_pipe

def preprocess_parallel(texts, chunksize=100):
    executor = Parallel(n_jobs=7, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(texts, len(df), chunksize=chunksize))
    result = executor(tasks)
    return flatten(result)
In the code above², the function preprocess_parallel executes the other function process_chunk in parallel to help with speed. The function process_chunk iterates through a series of texts, in our case the 'tweet' column of the df dataframe, and checks whether each entity belongs to NORP, PERSON, FAC, ORG, GPE, LOC, PRODUCT, or EVENT. If it does, the entity is appended to preproc_pipe and subsequently returned to the caller. Prashanth Rao has a very good article on making spaCy super fast.
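The call that actually adds the column is a one-liner; the chunk size here is an arbitrary choice:
# run the tweets through spaCy in parallel and store the entities per tweet
df['entities'] = preprocess_parallel(df['tweet'], chunksize=1000)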
Doing a quick df.head() will reveal the new column 'entities' that we added earlier to hold the entities found in the 'tweet' column.
Prettifying the Results
In the code below, we’re making a list of lists called 'entities' and then flattening it for easier processing. We’re also converting it into a set called 'entities_set'.
entities = [entity for entity in df.entities if entity != []]
entities = [item for sublist in entities for item in sublist]
entities_set = set(entities)
Next, let’s count the frequency of each entity and store the results in the list of tuples entity_counts. Then let’s convert the results into a dataframe, df_counts.
For this step, we’re going to initialize an empty list entity_counts and manually construct the list of tuples, pairing each combined set of similar entities with the sum of their frequencies or counts.
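A simplified sketch of that counting step, using collections.Counter instead of building the tuples entirely by hand:
from collections import Counter

# count how often each entity appears and turn the result into a dataframe
entity_counts = Counter(entities).most_common()
df_counts = pd.DataFrame(entity_counts, columns=['entity', 'count'])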
We’ve successfully created a ranking of the named entities that President Trump most frequently talked about in his tweets since taking office.
Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few on this post. I encourage you to keep practicing and employ other techniques to derive insights from data.
In the next post, we shall continue our journey into the heart of darkness and do some topic-modeling using LDA.
Exploring the Trump Twitter Archive with Python. For beginners.
In this post, we’ll explore the dataset provided by the Trump Twitter Archive. My goal was to do something fun by using a very interesting dataset. However, as it turned out, exposure to Trump’s narcissism and shenanigans was quite depressing, if not traumatic.
You’ve been warned!
For this project, we’ll be using pandas and numpy for data manipulation, matplotlib for visualizations, datetime for working with timestamps, unicodedata and regex for processing strings, and finally, nltk for natural language processing.
Let’s get started by firing up a Jupyter notebook!
Environment
We’re going to import pandas and matplotlib, and also set the display options for Jupyter so that the rows and columns are not truncated.
# for manipulating data
import pandas as pd
import numpy as np

# for visualizations
%matplotlib inline
import matplotlib.pyplot as plt

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', -1)
Getting the Data
Let’s read the data into a dataframe. If you want to follow along, you can download the dataset here. This dataset contains Trump’s tweets from the moment he took office on January 20, 2017 to May 30, 2020.
df = pd.read_csv('trump_20200530.csv')
Let’s look at the first five rows and see the number of records (rows) and fields (columns).
df.head()
df.shape
Let’s do a quick renaming of the columns to make it easier for us later.
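A sketch of that step, assuming the archive’s usual column names (they may differ in your copy), plus a datetime conversion we’ll need for the time-based plots:
# hypothetical mapping from the archive's column names to shorter ones
df = df.rename(columns={'text': 'tweet', 'created_at': 'datetime',
                        'retweet_count': 'retweets', 'favorite_count': 'favorites'})

# make sure the timestamps are proper datetimes so we can use the .dt accessor
df['datetime'] = pd.to_datetime(df['datetime'])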
Let’s explore both of the dataframes and answer a few questions.
What time does the President tweet the most? What time does he tweet the least?
The graph below shows that the President most frequently tweets around 12pm. He also tweets the least around 8am. Maybe he’s not a morning person?
title = 'Number of Tweets by Hour'
df.tweet.groupby(df.datetime.dt.hour).count().plot(figsize=(12,8), fontsize=14, kind='bar', rot=0, title=title)
plt.xlabel('Hour')
plt.ylabel('Number of Tweets')
What day does the President tweet the most? What day does he tweet the least?
The graph below shows that the President most frequently tweets on Wednesday. He also tweets the least on Thursday.
title = 'Number of Tweets by Day of the Week'
df.tweet.groupby(df.datetime.dt.dayofweek).count().plot(figsize=(12,8), fontsize=14, kind='bar', rot=0, title=title)
plt.xlabel('Day of the Week')
plt.ylabel('Number of Tweets')
plt.xticks(np.arange(7), ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun'])
Isolating Twitter Handles from the Retweets
Let’s import re so we can use regular expressions to parse the text and isolate the Twitter handles of the original tweets. In the code below, we add another column that contains the Twitter handle.
import re
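One way to build the df_retweets dataframe used below is to keep only the tweets that start with 'RT @':
# keep only the retweets so the regex below always finds a handle
df_retweets = df.loc[df['tweet'].str.startswith('RT @')].copy()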
pattern = re.compile('(?<=RT @).*?(?=:)')
df_retweets['original'] = [re.search(pattern, tweet).group(0) for tweet in df_retweets.tweet]
Let’s create another dataframe that will contain only the original Twitter handles and their associated number of retweets.
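One way to build it, with a hypothetical name for the new dataframe:
# one row per original handle, with the number of times the President retweeted it
df_originals = (df_retweets['original'].value_counts()
                .rename_axis('handle')
                .reset_index(name='retweet_count'))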
Which Twitter user does the President like to retweet the most?
The graph below shows that the President likes to retweet the tweets from ‘@realDonaldTrump’. Does this mean the president likes to retweet himself? You don’t say!
The interesting handle on this one is ‘@charliekirk11’. Charlie Kirk is the founder of Turning Point USA. CBS News described the organization as a far-right organization that is “shunned or at least ignored by more established conservative groups in Washington, but embraced by many Trump supporters”.¹
The Top 5 Retweets
Let’s look at the top 5 tweets that were retweeted the most by others based on the original Twitter handle.
Let’s start with the ones with ‘@realDonaldTrump’.
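A sketch of that lookup, assuming the retweet counts live in a 'retweets' column as in the earlier rename; note the handle is stored without the '@':
# the five most-retweeted retweets originally from @realDonaldTrump
df_retweets.loc[df_retweets['original'] == 'realDonaldTrump'].sort_values('retweets', ascending=False).head()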
Let’s find out how many of the retweets are favorited by others.
df_retweets.favorites.value_counts()
Surprisingly, none of the retweets seemed to have been favorited by anybody. Weird.
We should drop that column.
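Dropping it works just like in the earlier .drop() walkthrough:
# remove the favorites column since it carries no information here
df_retweets = df_retweets.drop(columns=['favorites'])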
Counting N-Grams
To do some n-gram ranking, we need to import unicodedata and nltk. We also need to specify additional stopwords that we may need to exclude from our analysis.
# for cleaning and natural language processing
import unicodedata
import nltk

# add appropriate words that will be ignored in the analysis
ADDITIONAL_STOPWORDS = ['rt']
Here are a few of my favorite functions for natural language processing:
def clean(text):
    """
    A simple function to clean up the data. All the words that
    are not designated as stop words are lemmatized after
    encoding and basic regex parsing are performed.
    """
    wnl = nltk.stem.WordNetLemmatizer()
    stopwords = nltk.corpus.stopwords.words('english') + ADDITIONAL_STOPWORDS
    text = (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('utf-8', 'ignore')
            .lower())
    words = re.sub(r'[^\w\s]', '', text).split()
    return [wnl.lemmatize(word) for word in words if word not in stopwords]

def get_words(df, column):
    """
    Takes in a dataframe and a column name and returns a list of
    words from the values in the specified column.
    """
    return clean(''.join(str(df[column].tolist())))

def get_bigrams(df, column):
    """
    Takes in a dataframe and a column name and returns a series of
    the top 10 bigrams with value counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column), 2)).value_counts())[:10]

def get_trigrams(df, column):
    """
    Takes in a dataframe and a column name and returns a series of
    the top 10 trigrams with value counts.
    """
    return (pd.Series(nltk.ngrams(get_words(df, column), 3)).value_counts())[:10]

def viz_bigrams(df, column):
    get_bigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
    plt.title('10 Most Frequently Occurring Bigrams')
    plt.ylabel('Bigram')
    plt.xlabel('# Occurrences')

def viz_trigrams(df, column):
    get_trigrams(df, column).sort_values().plot.barh(color='blue', width=.9, figsize=(12, 8))
    plt.title('10 Most Frequently Occurring Trigrams')
    plt.ylabel('Trigram')
    plt.xlabel('# Occurrences')
Let’s look at the top 10 bigrams in the df dataframe using the ‘tweet’ column.
get_bigrams(df, 'tweet')
And now, for the top 10 trigrams:
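Presumably the matching call:
get_trigrams(df, 'tweet')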
Let’s use the viz_bigrams() function and visualize the bigrams.
viz_bigrams(df, 'tweet')
Similarly, let’s use the viz_trigrams() function and visualize the trigrams.
viz_trigrams(df, 'tweet')
And there we have it!
From the moment that Trump took office, we can confidently say that the “fake news media” has been on top of the president’s mind.
Conclusion
Using basic Python and the nltk library, we’ve explored the dataset from the Trump Twitter Archive and did some n-gram ranking out of it.
Thank you for reading! Exploratory data analysis uses a lot of techniques and we’ve only explored a few on this post. I encourage you to keep practicing and employ other techniques to derive insights from data.
In the next post, we shall continue our journey into the heart of darkness and use spaCy to extract named-entities from the same dataset.