An early attempt of using networkx to visualize the results of natural language processing.
I do a lot of natural language processing, and the results are usually pretty boring to the eye. When I learned about network graphs, it got me thinking: why not use keywords as nodes and connect them together to create a network graph?
Yupp, why not!
In this post, we’ll do exactly that. We’re going to extract named-entities from news articles about coronavirus and then use their relationships to connect them together in a network graph.
A Brief Introduction
A network graph is a cool visual that contains nodes (vertices) and edges (lines). It’s often used in social network analysis, but data scientists also use it for natural language processing.
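To make the idea concrete, here’s a minimal toy sketch of a networkx graph, with a few made-up keywords as nodes and their relationships as edges:
import networkx as nx

# a tiny toy graph: keywords as nodes, relationships as edges
demo = nx.Graph()
demo.add_nodes_from(["virus", "vaccine", "travel"])
demo.add_edges_from([("virus", "vaccine"), ("virus", "travel")])

print(demo.number_of_nodes(), demo.number_of_edges())  # 3 nodes, 2 edges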

Natural Language Processing or NLP is a branch of artificial intelligence that deals with programming computers to process and analyze large volumes of text and derive meaning out of them.¹ In other words, it’s all about teaching computers how to understand human language… like a boss!

Enough introduction, let’s get to coding!
To get started, let’s make sure to take care of all dependencies. Open up a terminal and execute the following commands:
pip install -U spacy
python -m spacy download en
pip install networkx
pip install fuzzywuzzy
This will install spaCy and download its small trained English model, en_core_web_sm (on newer versions of spaCy, the en shortcut no longer exists, so download en_core_web_sm directly instead). The third command installs networkx, and the last one installs fuzzywuzzy, which we’ll use for a bit of text preprocessing. This should work for most systems; if it doesn’t work for you, check out the documentation for spaCy and networkx.
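A quick way to check that everything installed correctly is to import the libraries and load the model:
# quick sanity check that the installs worked
import spacy
import networkx
import fuzzywuzzy

print(spacy.__version__)
print(networkx.__version__)
print(fuzzywuzzy.__version__)

nlp = spacy.load("en_core_web_sm")  # should load without complaining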
With that out of the way, let’s fire up a Jupyter notebook and get started!
Imports
Run the following code block into a cell to get all the necessary imports into our Python environment.
import pandas as pd
import numpy as np
import pickle
from operator import itemgetter
from fuzzywuzzy import process, fuzz

# for natural language processing
import spacy
import en_core_web_sm

# for visualizations
%matplotlib inline
from matplotlib.pyplot import figure
import networkx as nx
Getting the Data
If you want to follow along, you can download the sample dataset here. The file was created using the newspaper library to import news articles from npr.org. If you’re feeling adventurous, use the code snippet below to build your own dataset.
import requests
import json
import time
import newspaper
import pickle

npr = newspaper.build('https://www.npr.org/sections/coronavirus-live-updates')

corpus = []
count = 0
for article in npr.articles:
    time.sleep(1)
    article.download()
    article.parse()
    text = article.text
    corpus.append(text)
    if count % 10 == 0 and count != 0:
        print('Obtained {} articles'.format(count))
    count += 1

corpus300 = corpus[:300]

with open("npr_coronavirus.txt", "wb") as fp:  # Pickling
    pickle.dump(corpus300, fp)

# with open("npr_coronavirus.txt", "rb") as fp:  # Unpickling
#     corpus = pickle.load(fp)
Let’s get our data.
with open('npr_coronavirus.txt', 'rb') as fp:  # Unpickling
    corpus = pickle.load(fp)
Extract Entities
Next, we’ll start by loading spaCy’s English model:
nlp = en_core_web_sm.load()
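If the en_core_web_sm import doesn’t work in your environment, loading the model through spaCy directly does the same thing (assuming the model was downloaded during setup):
import spacy

nlp = spacy.load("en_core_web_sm")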
Then, we’ll extract the entities:
entities = []

for article in corpus[:50]:
    tokens = nlp(''.join(article))
    gpe_list = []
    for ent in tokens.ents:
        if ent.label_ == 'GPE':
            gpe_list.append(ent.text)
    entities.append(gpe_list)
In the above code block, we created an empty list called entities to store a list of lists containing the extracted entities from each of the articles. In the for-loop, we looped through the first 50 articles of the corpus. For each iteration, we converted each article into tokens (words) and then looped through those tokens to get the entities labeled as GPE (countries, states, and cities). We used ent.text to extract the actual entity text and appended the entities one by one to entities.
Here’s the result:

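If you want to peek at the output yourself, printing the first few entity lists is enough:
# inspect the GPE entities extracted from the first few articles
for gpe_list in entities[:3]:
    print(gpe_list)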
Note that North Carolina appears under several variations of its name, and some entities have “the” prefixed to their names. Let’s get rid of those.
articles = []

for entity_list in entities:
    cleaned_entity_list = []
    for entity in entity_list:
        # strip a leading "the " and drop the possessive 's
        cleaned = entity[4:] if entity.startswith('the ') else entity
        cleaned = cleaned.replace("'s", "").replace("’s", "")
        cleaned_entity_list.append(cleaned)
    articles.append(cleaned_entity_list)
In the code block above, we’re simply traversing the list of lists entities and cleaning each entity one by one. With each iteration, we strip the “the ” prefix, get rid of the possessive ’s, and append the cleaned lists to articles.
Optional: FuzzyWuzzy
Looking at the entities, I’ve noticed that there are also variations in how “United States” is represented: some appear as “United States of America” while others are just “United States”. We can trim these down to a more standard naming convention.
FuzzyWuzzy can help with this.
Described by pypi.org as “string matching like a boss,” FuzzyWuzzy uses Levenshtein distance to calculate the similarity between words. For a really good tutorial on how to use FuzzyWuzzy, check out Thanh Huynh’s article “FuzzyWuzzy: Find Similar Strings within One Column in Python” on Towards Data Science.
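To get a feel for how the matching behaves, here’s a small self-contained example (the strings are just illustrative):
from fuzzywuzzy import fuzz, process

# similarity between two ways of naming the same country
print(fuzz.ratio("United States", "United States of America"))
print(fuzz.partial_ratio("United States", "United States of America"))

# pick the best match for an entity out of a set of choices
choices = {"United States", "United Kingdom", "North Carolina"}
print(process.extractOne("United States of America", choices))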
Here’s the optional code for using FuzzyWuzzy:
choices = set([item for sublist in articles for item in sublist])

cleaned_articles = []
for article in articles:
    article_entities = []
    for entity in set(article):
        article_entities.append(process.extractOne(entity, choices)[0])
    cleaned_articles.append(article_entities)
For the final step before creating the network graph, let’s get rid of the empty lists within our list of lists; these come from articles that didn’t contain any GPE entities.
articles = [article for article in articles if article != []]
Create the Network Graph
For the next step, we’ll create the world in which the graph will exist.
G = nx.Graph()
Then, we’ll manually add the nodes with G.add_nodes_from():
for entities in articles:
    G.add_nodes_from(entities)
Let’s see what the graph looks like with:
figure(figsize=(10, 8))
nx.draw(G, node_size=15)

Next, let’s add the edges that will connect the nodes.
for entities in articles:
    if len(entities) > 1:
        for i in range(len(entities)-1):
            G.add_edges_from([(str(entities[i]), str(entities[i+1]))])
For each iteration of the code above, we used a conditional that only entertains lists with two or more entities. Then, we manually connect each consecutive pair of entities with G.add_edges_from().
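As a quick sanity check, we can ask networkx how big the graph has become:
# how many nodes and edges do we have now?
print(G.number_of_nodes())
print(G.number_of_edges())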
Let’s see what the graph looks like now:
figure(figsize=(10, 8))
nx.draw(G, node_size=10)

This graph reminds me of spiders! LOL.
To organize it a bit, I decided to use the shell version of the network graph:
figure(figsize=(10, 8))
nx.draw_shell(G, node_size=15)

We can tell that some nodes are heavier on connections than others. To see which nodes have the most connections, let’s use G.degree().
G.degree()
This gives the following degree view:

Let’s find out which node or entity has the most connections.
max(dict(G.degree()).items(), key = lambda x : x[1])

To find out which other nodes have lots of connections, let’s check the top 5:
degree_dict = dict(G.degree(G.nodes()))
nx.set_node_attributes(G, degree_dict, 'degree')

sorted_degree = sorted(degree_dict.items(), key=itemgetter(1), reverse=True)
Above, sorted_degree is a list that contains all the nodes and their degree values. We only want the top 5, like so:
print("Top 5 nodes by degree:")
for d in sorted_degree[:5]:
print(d)

Bonus Round: Gephi
Gephi is an open-source and free desktop application that lets us visualize, explore, and analyze all kinds of graphs and networks.²
Let’s export our graph data into a file so we can import it into Gephi.
nx.write_gexf(G, "npr_coronavirus_GPE_50.gexf")


Cool beans!
Next Steps
This time, we only processed 50 articles from npr.org. What would happen if we processed all 300 articles in our dataset? What would we see if we changed the entity type from GPE to PERSON? How else can we use network graphs to visualize natural language processing results?
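For example, switching from places to people only takes a one-line change in the extraction loop (a sketch that assumes the same nlp pipeline and corpus as above):
person_entities = []

for article in corpus[:50]:
    tokens = nlp(''.join(article))
    person_list = []
    for ent in tokens.ents:
        if ent.label_ == 'PERSON':  # people instead of countries, states, and cities
            person_list.append(ent.text)
    person_entities.append(person_list)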
There’s always more to do. The possibilities are endless!
I hope you enjoyed today’s post. The code is not perfect and we have a long way to go towards realizing insights from the data. I encourage you to dive deeper and learn more about spaCy, networkx, fuzzywuzzy, and even Gephi.
Stay tuned!
You can reach me on Twitter or LinkedIn.
[1]: Wikipedia. (May 25, 2020). Natural language processing https://en.wikipedia.org/wiki/Natural_language_processing
[2]: Gephi. (May 25, 2020). The Open Graph Viz Platform https://gephi.org/
This article was first published in the Towards Data Science Medium publication.