Populating a Network Graph with Named-Entities

An early attempt of using networkx to visualize the results of natural language processing.


I do a lot of natural language processing and usually, the results are pretty boring to the eye. When I learned about network graphs, it got me thinking, why not use keywords as nodes and connect them together to create a network graph?

Yupp, why not!

In this post, we’ll do exactly that. We’re going to extract named-entities from news articles about coronavirus and then use their relationships to connect them together in a network graph.


A Brief Introduction

Network graphs are a cool visual that contains nodes (vertices) and edges (lines). It’s often used in social network analysis and network analysis but data scientists also use it for natural language processing.

Photo by Anders Sandberg on Flicker

Natural Language Processing or NLP is a branch of artificial intelligence that deals with programming computers to process and analyze large volumes of text and derive meaning out of them.¹ In other words, it’s all about teaching computers how to understand human language… like a boss!

Photo by brewbooks on Flickr

Enough introduction, let’s get to coding!


To get started, let’s make sure to take care of all dependencies. Open up a terminal and execute the following commands:

pip install -U spacy
python -m spacy download en
pip install networkx
pip install fuzzywuzzy

This will install spaCy and download the trained model for English. The third command installs networkx. This should work for most systems. If it doesn’t work for you, check out the documentation for spaCy and networkx. Also, we’re using fuzzywuzzy for some text preprocessing.

With that out of the way, let’s fire up a Jupyter notebook and get started!


Imports

Run the following code block into a cell to get all the necessary imports into our Python environment.

import pandas as pd
import numpy as np
import pickle
from operator import itemgetter
from fuzzywuzzy import process, fuzz# for natural language processing
import spacy
import en_core_web_sm# for visualizations
%matplotlib inline
from matplotlib.pyplot import figureimport networkx as nx

Getting the Data

If you want to follow along, you can download the sample dataset here. The file was created using newspaper to import news articles from the npr.org. If you’re feeling adventurous, use the code snippet below to build your own dataset.

import requests
import json
import time
import newspaper
import pickle

npr = newspaper.build('https://www.npr.org/sections/coronavirus-live-updates')

corpus = []
count = 0
for article in npr.articles:
    time.sleep(1)
    article.download()
    article.parse()
    text = article.text
    corpus.append(text)
    if count % 10 == 0 and count != 0:
        print('Obtained {} articles'.format(count))
    count += 1

corpus300 = corpus[:300]

with open("npr_coronavirus.txt", "wb") as fp:   # Pickling
    pickle.dump(corpus300, fp)

# with open("npr_coronavirus.txt", "rb") as fp:   # Unpickling
#     corpus = pickle.load(fp)

Let’s get our data.

with open('npr_coronavirus.txt', 'rb') as fp:   # Unpickling
corpus = pickle.load(fp)

Extract Entities

Next, we’ll start by loading spaCy’s English model:

nlp = en_core_web_sm.load()

Then, we’ll extract the entities:

entities = []for article in corpus[:50]:
tokens = nlp(''.join(article))
gpe_list = []
for ent in tokens.ents:
if ent.label_ == 'GPE':
gpe_list.append(ent.text)
entities.append(gpe_list)

In the above code block, we created an empty list called entities to store a list of lists that contains the extracted entities from each of the articles. In the for-loop, we looped through the first 50 articles of the corpus. For each iteration, we converted each articles into tokens (words) and then we looped through all those words to get the entities that are labeled as GPE for countries, states, and cities. We used ent.text to extract the actual entity and appended them one by one to entities.

Here’s the result:

Note that North Carolina has several variations of its name and some have “the” prefixed in their names. Let’s get rid of them.

articles = []for entity_list in entities:
cleaned_entity_list = []
for entity in entity_list:
cleaned_entity_list.append(entity.lstrip('the ').replace("'s", "").replace("’s",""))
articles.append(cleaned_entity_list)

In the code block above, we’re simply traversing the list of lists articles and cleaning the entities one by one. With each iteration, we’re stripping the prefix “the” and getting rid of 's.

Optional: FuzzyWuzzy

Looking at the entities, I’ve noticed that there are also variations in the “United States” is represented. There exists “United States of America” while some are just “United States”. We can trim these down into a more standard naming convention.

FuzzyWuzzy can help with this.

Described by pypi.org as “string matching like a boss,” FiuzzyWuzzy uses Levenshtein distance to calculate the similarities between words.¹ For a really good tutorial on how to use FuzzyWuzzy, check out Thanh Huynh’s article.FuzzyWuzzy: Find Similar Strings within one column in PythonToken Sort Ratio vs. Token Set Ratiotowardsdatascience.com

Here’s the optional code for using FuzzyWuzzy:

choices = set([item for sublist in articles for item in sublist])

cleaned_articles = []
for article in articles:
    article_entities = []
    for entity in set(article):
        article_entities.append(process.extractOne(entity, choices)[0])
    cleaned_articles.append(article_entities)

For the final step before creating the network graph, let’s get rid of the empty lists within our list of list that were generated by articles who didn’t have any GPE entity types.

articles = [article for article in articles if article != []]

Create the Network Graph

For the next step, we’ll create the world into which the graph will exist.

G = nx.Graph()

Then, we’ll manually add the nodes with G.add_nodes_from().

for entities in articles:
G.add_nodes_from(entities)

Let’s see what the graph looks like with:

figure(figsize=(10, 8))
nx.draw(G, node_size=15)

Next, let’s add the edges that will connect the nodes.

for entities in articles:
if len(entities) > 1:
for i in range(len(entities)-1):
G.add_edges_from([(str(entities[i]),str(entities[i+1]))])

For each iteration of the code above, we used a conditional that will only entertain a list of entities that has two or more entities. Then, we manually connect each of the entities with G.add_edges_from().

Let’s see what the graph looks like now:

figure(figsize=(10, 8))
nx.draw(G, node_size=10)

This graph reminds me of spiders! LOL.

To organize it a bit, I decided to use the shell version of the network graph:

figure(figsize=(10, 8))
nx.draw_shell(G, node_size=15)

We can tell that some nodes are heavier on connections than others. To see which nodes have the most connections, let’s use G.degree().

G.degree()

This gives the following degree view:

Let’s find out which node or entity has the most number of connections.

max(dict(G.degree()).items(), key = lambda x : x[1])

To find out which other nodes have the most number of connections, let’s check the top 5:

degree_dict = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degree_dict, 'degree')sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)

Above, sorted_degrees is a list that contains all the nodes and their degree values. We only wanted the top 5 like so:

print("Top 5 nodes by degree:")
for d in sorted_degree[:5]:
print(d)

Bonus Round: Gephi

Gephi is an open-source and free desktop application that lets us visualize, explore, and analyze all kinds of graphs and networks.²

Let’s export our graph data into a file so we can import it into Gephi.

nx.write_gexf(G, "npr_coronavirus_GPE_50.gexf")

Cool beans!

Next Steps

This time, we only processed 50 articles from npr.org. What would happen if we processed all 300 articles from our dataset? What will we see if we change the entity type from GPE to PERSON? How else can we use network graphs to visualize natural language processing results?

There’s always more to do. The possibilities are endless!


I hope you enjoyed today’s post. The code is not perfect and we have a long way to go towards realizing insights from the data. I encourage you to dive deeper and learn more about spaCynetworkxfuzzywuzzy, and even Gephi.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Wikipedia. (May 25, 2020). Natural language processing https://en.wikipedia.org/wiki/Natural_language_processing

[2]: Gephi. (May 25, 2020). The Open Graph Viz Platform https://gephi.org/

This article was first published in Towards Data Science‘ Medium publication.

Create a Network Graph in Power BI

In a previous article, I wrote a quick start guide to visualize a Pandas dataframe using networkx and matplotlib. While it was fun to learn and explore more about network graphs in Python, I got to thinking about how to present the results to others who don’t have Python or Jupyter Notebook installed in their machines. At TaskUs, we use Power BI for most of our reporting so I began to search for a custom Power BI visualization that can take the data and transform it into a meaningful network graph.

Enter Network Navigator.

Network Navigator is a custom visual in Power BI that is created by Microsoft. It allows you to “explore node-link data by panning over and zooming into a force-directed node layout (which can be precomputed or animated live).”¹ In this post, we’ll walk through the steps needed to create a network graph using the custom visual.

First, let’s get our data. You can download the sample dataset here. Then, we could load the data into Power BI Desktop as shown below:

Select Text/CSV and click on “Connect”.

Select the file in the Windows Explorer folder and click open:

Click on “Transform Data”.

Click on “Use first Row as Headers”.

Click on “Close & Apply”.

Next, find the three dots at the end of the “Visualizations” panel.

And select “Get more visuals”.

Point your mouse cursor inside the search text box and type in “network” and hit the “Enter” key and click on the “Add” button.

Wait a few moments and you’ll see the notification below. Click on the “OK” button to close the notification.

You’ll see a new icon appear at the bottom of the “Visualizations” panel as shown below.

Click on the new icon and you will see something similar to the picture below.

With the new visual placeholder selected, click on “Reporter” and “Assignee” in the “Fields” panel and it will automatically assign the columns as the Source and Target Node.

Let’s add labels by clicking on the paintbrush icon.

Click on “Layout” to expand the section and scroll down until you see the “Labels” section.

Click on the toggle switch under “Labels” to turn it on

and voila!

That’s it! With a few simple clicks of the mouse, we’re able to create a network graph from a csv file.


I hope you enjoy today’s post on one of Power BI’s coolest visuals. Network graph analysis is a big topic but I hope this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, I’ll share my journey from slacker to data scientist and I hope it’ll inspire others instead of being dissuaded by haters.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Business Apps — Microsoft AppSource. (May 16, 2020). Network Navigator Chart https://appsource.microsoft.com/en-us/product/power-bi-visuals/WA104380795?src=office&tab=Overview

From DataFrame to Network Graph

I just discovered — quite accidentally — how to export data from JIRA so naturally, I began to think of ways to visualize the information and potentially glean some insight from the dataset. I’ve stumbled upon the concept of network graphs and the idea quickly captured my imagination. I realized that I can use it to tell stories not only about the relationships between people but between words as well! But NLP is a big topic, so how about we walk first and run later?!

This is just a very gentle introduction so we won’t be using any fancy code here.


Network graphs “show interconnections between a set of entities”¹ where entities arenodes and the connections between them are represented through links or edges¹. In the graph below, the dots are the nodes and the lines are called edges.

Martin Grandjean / CC BY-SA (https://creativecommons.org/licenses/by-sa/3.0)

In this post, I’ll share the code that will let us quickly visualize a Pandas dataframe using a popular network graph package: networkx.

First, let’s get our data and load it into a datagram. You can download the sample dataset here.

import pandas as pd
df = pd.read_csv(‘jira_sample.csv’)

Second, let’s trim the dataframe to only include the columns we want to examine. In this case, we only want the columns ‘Assignee’ and ‘Reporter’.

df1 = df[[‘Assignee’, ‘Reporter’]]

Third, it’s time to create the world into which the graph will exist. If you haven’t already, install the networkx package by doing a quick pip install networkx.

import networkx as nx
G = nx.from_pandas_edgelist(df1, ‘Assignee’, ‘Reporter’)

Next, we’ll materialize the graph we created with the help of matplotlib for formatting.

from matplotlib.pyplot import figure
figure(figsize=(10, 8))
nx.draw_shell(G, with_labels=True)

The most important line in the block above is nx.draw_shell(G, with_labels=True). It tells the computer to draw the graph Gusing a shell layout with the labels for entities turned on.

Voilà! We got ourselves a network graph:

A network graph of reporters and assignees from a sample JIRA dataset.

Right off the bat, we can tell that there’s a heavy concentration of lines originating from three major players, ‘barbie.doll’, ‘susan.lee’, and ‘joe.appleseed’. Of course, just to be sure, it’s always a good idea to confirm our ‘eyeballing’ with some hard data.

Bonus Round

Let’s check out ‘barbie.doll’.

G[‘barbie.doll’]

To see how many connections ‘barbie.doll’ has, let’s use len():

len(G[‘barbie.doll’])

Next, let’s created another dataframe that shows the nodes and their number of connections.

leaderboard = {}
for x in G.nodes:
leaderboard[x] = len(G[x])
s = pd.Series(leaderboard, name=’connections’)
df2 = s.to_frame().sort_values(‘connections’, ascending=False)

In the code block above, we first initialized an empty dictionary called ‘leaderboard’ and then used a simple for-loop to populate the dictionary with names and number of connections. Then, we created a series out of the dictionary. Finally, we created another dataframe from the series that we created using to_frame().

To display the dataframe, we simply use df2.head() and we got ourselves a leaderboard!

And that’s it! With a few simple lines of code, we quickly made a network graph from a Pandas dataframe and even displayed a table with names and number of connections.


I hope you enjoyed this one. Network graph analysis is a big topic but I hope that this gentle introduction will encourage you to explore more and expand your repertoire.

In the next article, I’ll walk through Power BI’s custom visual called ‘Network Navigator’ to create a network graph with a few simple clicks of the mouse.

Stay tuned!

[1]: Data-To-Viz. (May 15, 2020). Network Diagram https://www.data-to-viz.com/graph/network.html