Populating a Network Graph with Named-Entities

An early attempt of using networkx to visualize the results of natural language processing.

I do a lot of natural language processing and usually, the results are pretty boring to the eye. When I learned about network graphs, it got me thinking, why not use keywords as nodes and connect them together to create a network graph?

Yupp, why not!

In this post, we’ll do exactly that. We’re going to extract named-entities from news articles about coronavirus and then use their relationships to connect them together in a network graph.

A Brief Introduction

Network graphs are a cool visual that contains nodes (vertices) and edges (lines). It’s often used in social network analysis and network analysis but data scientists also use it for natural language processing.

Photo by Anders Sandberg on Flicker

Natural Language Processing or NLP is a branch of artificial intelligence that deals with programming computers to process and analyze large volumes of text and derive meaning out of them.¹ In other words, it’s all about teaching computers how to understand human language… like a boss!

Photo by brewbooks on Flickr

Enough introduction, let’s get to coding!

To get started, let’s make sure to take care of all dependencies. Open up a terminal and execute the following commands:

pip install -U spacy
python -m spacy download en
pip install networkx
pip install fuzzywuzzy

This will install spaCy and download the trained model for English. The third command installs networkx. This should work for most systems. If it doesn’t work for you, check out the documentation for spaCy and networkx. Also, we’re using fuzzywuzzy for some text preprocessing.

With that out of the way, let’s fire up a Jupyter notebook and get started!


Run the following code block into a cell to get all the necessary imports into our Python environment.

import pandas as pd
import numpy as np
import pickle
from operator import itemgetter
from fuzzywuzzy import process, fuzz# for natural language processing
import spacy
import en_core_web_sm# for visualizations
%matplotlib inline
from matplotlib.pyplot import figureimport networkx as nx

Getting the Data

If you want to follow along, you can download the sample dataset here. The file was created using newspaper to import news articles from the npr.org. If you’re feeling adventurous, use the code snippet below to build your own dataset.

import requests
import json
import time
import newspaper
import pickle

npr = newspaper.build('https://www.npr.org/sections/coronavirus-live-updates')

corpus = []
count = 0
for article in npr.articles:
    text = article.text
    if count % 10 == 0 and count != 0:
        print('Obtained {} articles'.format(count))
    count += 1

corpus300 = corpus[:300]

with open("npr_coronavirus.txt", "wb") as fp:   # Pickling
    pickle.dump(corpus300, fp)

# with open("npr_coronavirus.txt", "rb") as fp:   # Unpickling
#     corpus = pickle.load(fp)

Let’s get our data.

with open('npr_coronavirus.txt', 'rb') as fp:   # Unpickling
corpus = pickle.load(fp)

Extract Entities

Next, we’ll start by loading spaCy’s English model:

nlp = en_core_web_sm.load()

Then, we’ll extract the entities:

entities = []for article in corpus[:50]:
tokens = nlp(''.join(article))
gpe_list = []
for ent in tokens.ents:
if ent.label_ == 'GPE':

In the above code block, we created an empty list called entities to store a list of lists that contains the extracted entities from each of the articles. In the for-loop, we looped through the first 50 articles of the corpus. For each iteration, we converted each articles into tokens (words) and then we looped through all those words to get the entities that are labeled as GPE for countries, states, and cities. We used ent.text to extract the actual entity and appended them one by one to entities.

Here’s the result:

Note that North Carolina has several variations of its name and some have “the” prefixed in their names. Let’s get rid of them.

articles = []for entity_list in entities:
cleaned_entity_list = []
for entity in entity_list:
cleaned_entity_list.append(entity.lstrip('the ').replace("'s", "").replace("’s",""))

In the code block above, we’re simply traversing the list of lists articles and cleaning the entities one by one. With each iteration, we’re stripping the prefix “the” and getting rid of 's.

Optional: FuzzyWuzzy

Looking at the entities, I’ve noticed that there are also variations in the “United States” is represented. There exists “United States of America” while some are just “United States”. We can trim these down into a more standard naming convention.

FuzzyWuzzy can help with this.

Described by pypi.org as “string matching like a boss,” FiuzzyWuzzy uses Levenshtein distance to calculate the similarities between words.¹ For a really good tutorial on how to use FuzzyWuzzy, check out Thanh Huynh’s article.FuzzyWuzzy: Find Similar Strings within one column in PythonToken Sort Ratio vs. Token Set Ratiotowardsdatascience.com

Here’s the optional code for using FuzzyWuzzy:

choices = set([item for sublist in articles for item in sublist])

cleaned_articles = []
for article in articles:
    article_entities = []
    for entity in set(article):
        article_entities.append(process.extractOne(entity, choices)[0])

For the final step before creating the network graph, let’s get rid of the empty lists within our list of list that were generated by articles who didn’t have any GPE entity types.

articles = [article for article in articles if article != []]

Create the Network Graph

For the next step, we’ll create the world into which the graph will exist.

G = nx.Graph()

Then, we’ll manually add the nodes with G.add_nodes_from().

for entities in articles:

Let’s see what the graph looks like with:

figure(figsize=(10, 8))
nx.draw(G, node_size=15)

Next, let’s add the edges that will connect the nodes.

for entities in articles:
if len(entities) > 1:
for i in range(len(entities)-1):

For each iteration of the code above, we used a conditional that will only entertain a list of entities that has two or more entities. Then, we manually connect each of the entities with G.add_edges_from().

Let’s see what the graph looks like now:

figure(figsize=(10, 8))
nx.draw(G, node_size=10)

This graph reminds me of spiders! LOL.

To organize it a bit, I decided to use the shell version of the network graph:

figure(figsize=(10, 8))
nx.draw_shell(G, node_size=15)

We can tell that some nodes are heavier on connections than others. To see which nodes have the most connections, let’s use G.degree().


This gives the following degree view:

Let’s find out which node or entity has the most number of connections.

max(dict(G.degree()).items(), key = lambda x : x[1])

To find out which other nodes have the most number of connections, let’s check the top 5:

degree_dict = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degree_dict, 'degree')sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)

Above, sorted_degrees is a list that contains all the nodes and their degree values. We only wanted the top 5 like so:

print("Top 5 nodes by degree:")
for d in sorted_degree[:5]:

Bonus Round: Gephi

Gephi is an open-source and free desktop application that lets us visualize, explore, and analyze all kinds of graphs and networks.²

Let’s export our graph data into a file so we can import it into Gephi.

nx.write_gexf(G, "npr_coronavirus_GPE_50.gexf")

Cool beans!

Next Steps

This time, we only processed 50 articles from npr.org. What would happen if we processed all 300 articles from our dataset? What will we see if we change the entity type from GPE to PERSON? How else can we use network graphs to visualize natural language processing results?

There’s always more to do. The possibilities are endless!

I hope you enjoyed today’s post. The code is not perfect and we have a long way to go towards realizing insights from the data. I encourage you to dive deeper and learn more about spaCynetworkxfuzzywuzzy, and even Gephi.

Stay tuned!

You can reach me on Twitter or LinkedIn.

[1]: Wikipedia. (May 25, 2020). Natural language processing https://en.wikipedia.org/wiki/Natural_language_processing

[2]: Gephi. (May 25, 2020). The Open Graph Viz Platform https://gephi.org/

This article was first published in Towards Data Science‘ Medium publication.

Published by

Ednalyn C. De Dios

I’ve always been enamored with code and I love data science because of its inherent power to solve real problems. Having grown up in the Philippines, served in the United States Navy, and worked in the nonprofit sector, I am driven to make the world a better place. I have started and participated in numerous campaigns that aim to reduce domestic violence and child abuse in the community.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.