In the previous post we built and saved a Word2Vec2Graph for pairs of words from the Stress Data file.
In this post we will look for connected word pair groups using Spark GraphFrames library functions: Connected Components and Label Propagation.
Read the stored vertices and edges and rebuild the graph:
Connected Components
As we could expect, almost all word pairs are connected, so almost all of them fall into the same large connected component.
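To make the idea of a connected component concrete, here is a minimal plain-Python sketch using union-find. It is not the Spark GraphFrames code from this post; the function name `connected_components` and the sample word pairs are made up for illustration.

```python
def connected_components(edges):
    """Group nodes into connected components with a simple union-find."""
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    # Largest component first.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical word pairs, not taken from the Stress Data file.
sample_edges = [
    ("stress", "anxiety"),
    ("anxiety", "sleep"),
    ("sleep", "stress"),
    ("diet", "exercise"),
]
components = connected_components(sample_edges)
```

On real text most words end up linked through some chain of pairs, which is why a single component dominates.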
When we looked at all word to word combinations from the text file, pairs of words were tightly connected and we could not split them into separate groups. Now, looking at ngram word pairs, we can use community detection algorithms to split them into word pair groups. We'll start with the simplest community detection algorithm - Label Propagation.
Label Propagation
Because the Label Propagation algorithm cuts loosely connected edges, we want to see which {word1, word2} ngram pairs from the text file fall within the same Label Propagation groups.
For now we will ignore small groups and look at groups that have at least 3 {word1, word2} pairs.
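As a rough illustration of this filtering step, the plain-Python sketch below keeps only the pairs whose two words share a label and then drops groups with fewer than 3 pairs. The function name `pairs_by_label`, the labels, and the word pairs are all hypothetical.

```python
from collections import defaultdict


def pairs_by_label(edges, labels, min_pairs=3):
    """Collect {word1, word2} pairs whose words share a label, per label."""
    groups = defaultdict(list)
    for word1, word2 in edges:
        if labels[word1] == labels[word2]:
            groups[labels[word1]].append((word1, word2))
    # Ignore small groups.
    return {label: pairs for label, pairs in groups.items()
            if len(pairs) >= min_pairs}


# Hypothetical labels and pairs for illustration only.
word_labels = {"stress": 1, "anxiety": 1, "sleep": 1, "diet": 2, "exercise": 2}
word_pairs = [
    ("stress", "anxiety"),
    ("anxiety", "sleep"),
    ("sleep", "stress"),
    ("diet", "exercise"),
    ("sleep", "diet"),  # crosses two groups, so it is dropped
]
big_groups = pairs_by_label(word_pairs, word_labels)
```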
Word Pair Groups
We'll start with the second group - the group that contains 30 {word1, word2} pairs.
Here are the edges that belong to this group, as {word1, word2, word2vec cosine similarity}:
Graph (via Gephi):
We build Gephi graphs in a semi-manual way. First, create a list of directed edges:
Then put the list inside 'digraph{...}' to get the data in .DOT format:
Here is the graph for the group of 54 pairs. 'Stress', the word with the highest PageRank, is in the center of this graph:
Here is the graph for the group of 8 pairs:
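The two steps above can be sketched in plain Python as follows. The helper name `to_dot` and the sample pairs are assumptions for illustration; the words are quoted so that any token is a valid DOT identifier.

```python
def to_dot(edges):
    """Wrap a list of directed edges in a DOT 'digraph' block."""
    lines = [f'"{word1}" -> "{word2}";' for word1, word2 in edges]
    return "digraph {\n" + "\n".join(lines) + "\n}"


# Hypothetical pairs, not the actual group from the Stress Data file.
dot_text = to_dot([("stress", "anxiety"), ("anxiety", "sleep")])
print(dot_text)
```

The resulting text can be saved to a .dot file and opened in Gephi.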
High Topics of Label Groups
We can see that the word 'stress', the word with the highest PageRank, sits in the center of the biggest group. Next we'll find the words with high PageRank in each word pair group.
Calculate PageRank:
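For readers who want to see what PageRank computes, here is a small power-iteration sketch in plain Python. It is not the GraphFrames call used in this post: it normalizes ranks so they sum to 1, which GraphFrames does not necessarily do, and the damping factor, iteration count, and sample edges are assumptions.

```python
def pagerank(edges, damping=0.85, iterations=20):
    """Basic PageRank by power iteration on a directed edge list."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical pairs that all point at "stress".
pr_edges = [("anxiety", "stress"), ("sleep", "stress"), ("work", "stress")]
ranks = pagerank(pr_edges)
top_word = max(ranks, key=ranks.get)
```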
Calculate lists of distinct words in the label groups:
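A plain-Python sketch of this step might look like the following: collect the distinct words of each label group, then order them by PageRank and keep the top ones. The function names, the group, and the rank values are hypothetical and only show the shape of the computation.

```python
def words_in_groups(group_pairs):
    """Distinct words that appear in each group's {word1, word2} pairs."""
    return {label: sorted({w for pair in pairs for w in pair})
            for label, pairs in group_pairs.items()}


def top_words(group_words, rank, n=10):
    """Top-n words of each group, ordered by descending PageRank."""
    return {label: sorted(words, key=lambda w: rank[w], reverse=True)[:n]
            for label, words in group_words.items()}


# Hypothetical group and PageRank values for illustration only.
sample_groups = {
    "g1": [("anxiety", "stress"), ("sleep", "stress"), ("stress", "health")],
}
sample_rank = {"stress": 0.5, "health": 0.3, "anxiety": 0.1, "sleep": 0.1}

distinct = words_in_groups(sample_groups)
top = top_words(distinct, sample_rank, n=2)
```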
Top 10 Words in Label Groups
The biggest group:
Second group:
Third group:
Comparing the group graphs with the PageRanks of words within the groups shows that the words with high PageRanks are located at the center of the graphs.
Next Post - More Pair Connections
In the next post we will continue playing with Spark GraphFrames library to find more interesting word to word connections.