A Word2Vec model maps words to vectors. This lets us calculate cosine similarities between pairs of words and then translate those pairs into a graph, using words as nodes, word pairs as edges, and cosine similarities as edge weights.
We are running a small AWS cluster on Databricks Community Edition, and for the Word2Vec2Graph model we will use a small text file with data about stress taken from a Wikipedia article. We will call this text file the Stress Data File.
We will use our trained Word2Vec model to get cosine similarities for word pairs. First, we will read our trained Word2VecModel:
Next we will get the list of all words from the Word2Vec model:
To use this Word2Vec model for Stress Data File cosine similarities, we will filter out the Stress Data File words that are not in the Word2Vec list of words:
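The filtering step can be illustrated with a minimal plain-Python sketch. It is not the Spark code this post runs on Databricks: the function name `keep_known_words` and the toy word lists are hypothetical, and dropping duplicate words is our assumption.

```python
def keep_known_words(tokens, vocabulary):
    """Return the distinct tokens found in the model vocabulary, in first-seen order."""
    vocab = set(vocabulary)
    seen = set()
    kept = []
    for token in tokens:
        # Keep a token only if the model knows it and we have not kept it already.
        if token in vocab and token not in seen:
            seen.add(token)
            kept.append(token)
    return kept


# Hypothetical toy data standing in for the Stress Data File and the model's word list.
stress_tokens = ["stress", "hormones", "xyzzy", "stress", "anxiety"]
model_words = ["stress", "anxiety", "hormones", "disease"]

known_words = keep_known_words(stress_tokens, model_words)
```

Here `xyzzy` is dropped because the model has no vector for it, so no cosine similarity could be computed for any pair that contains it.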
Finally, we will create a word-to-word matrix that pairs the remaining Stress Data File words with each other:
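As a rough plain-Python illustration of the word-to-word matrix (again, not the Spark code itself), the sketch below lists every unordered pair of distinct words once. The name `word_pairs` is hypothetical, and keeping only one ordering of each pair is our assumption.

```python
from itertools import combinations


def word_pairs(words):
    """Return every unordered pair of distinct words exactly once, in sorted order."""
    return list(combinations(sorted(set(words)), 2))


# Three words give 3 * 2 / 2 = 3 pairs.
pairs = word_pairs(["stress", "anxiety", "hormones"])
```

The number of pairs grows quadratically with the number of words, which is one reason a small text file is convenient for this experiment.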
Word2Vec Cosine Similarity Function
Now we want to use Word2Vec cosine similarity to see how words are connected with other words. We will create a function to calculate the cosine similarity between vectors from the Word2Vec model.
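Cosine similarity is the dot product of two vectors divided by the product of their lengths. The function below is a minimal plain-Python sketch of that formula, not the Spark function used in this post. Returning 0.0 for a zero-length vector is our assumption.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as unrelated to everything.
        return 0.0
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors that point in the same direction score close to 1, orthogonal vectors score 0, and vectors pointing in opposite directions score close to -1.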
Cosine Similarity between Stress Data File Words
Now we can calculate word to word cosine similarities between word pairs from Stress Data File and save the results.
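To make this step concrete, here is a small plain-Python sketch that scores every word pair and sorts the pairs from most to least similar. The two-dimensional `toy_vectors` are invented for illustration. Real Word2Vec vectors have many more dimensions, and the post itself computes these scores in Spark.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical vectors, not taken from the trained model.
toy_vectors = {
    "stress": [1.0, 0.0],
    "anxiety": [0.8, 0.6],
    "holiday": [0.0, 1.0],
}

# One (word1, word2, similarity) row per unordered pair, highest similarity first.
scored_pairs = sorted(
    (
        (w1, w2, cosine(toy_vectors[w1], toy_vectors[w2]))
        for w1, w2 in combinations(sorted(toy_vectors), 2)
    ),
    key=lambda row: row[2],
    reverse=True,
)
```

The top rows of such a list correspond to the high-similarity word combinations, and the bottom rows to the low-similarity ones.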
Example: Word combinations with high Cosine Similarity:
Example: Word combinations with low Cosine Similarity:
Store and read Stress Data File word pairs with cosine similarities between them:
Graph of Combinations of Stress Data File Words
Now we can build a graph using words as nodes, {word1, word2} word combinations as edges and cosine similarities between the words as edge weights:
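The mapping from scored word pairs to a graph can be sketched in plain Python as follows. The `src` and `dst` keys mirror the edge column names that GraphFrames expects. The `weight` key, the function name `build_graph` and the sample rows are hypothetical.

```python
def build_graph(scored_pairs):
    """Turn (word1, word2, similarity) rows into a vertex list and an edge list."""
    # Every word that appears in at least one pair becomes a vertex.
    vertices = sorted({word for w1, w2, _ in scored_pairs for word in (w1, w2)})
    # Every pair becomes one weighted edge.
    edges = [{"src": w1, "dst": w2, "weight": score} for w1, w2, score in scored_pairs]
    return vertices, edges


# Hypothetical similarity rows for illustration only.
sample_rows = [("anxiety", "stress", 0.8), ("holiday", "stress", 0.1)]
vertices, edges = build_graph(sample_rows)
```

Keeping vertices and edges as two separate tables is also what makes it easy to save them and rebuild the same graph later.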
We will save the graph vertices and edges in Parquet format so that we can reuse them in future posts:
Load vertices and edges and rebuild the same graph:
Connected Components
There are many interesting things we can do with Spark GraphFrames. In this post we will play with connected components.
This graph was built on all {word1, word2} combinations of the Stress Data File, so all word pairs are in the same large connected component. To get more informative results, we will look at connected components of subgraphs with different edge weight thresholds.
Connected Components with High Cosine Similarity
For this post we will use an edge weight threshold of 0.75, i.e. we will keep only word pairs with cosine similarity higher than 0.75.
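A minimal plain-Python sketch of the threshold step is shown below. The function name `filter_edges` and the sample edges are hypothetical. The comparison is strict, so a pair with similarity exactly 0.75 is not kept.

```python
def filter_edges(edges, threshold=0.75):
    """Keep only (src, dst, weight) edges whose weight is strictly above the threshold."""
    return [edge for edge in edges if edge[2] > threshold]


# Hypothetical weighted edges for illustration only.
sample_edges = [
    ("anxiety", "stress", 0.81),
    ("holiday", "stress", 0.10),
    ("cortisol", "hormones", 0.75),
]
strong_edges = filter_edges(sample_edges)
```

Raising the threshold removes more edges, which splits the single large component into smaller groups of closely related words.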
Run connected components for the graph with high cosine similarity:
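For readers who want to see what a connected components computation does, here is a small union-find sketch in plain Python. It is a substitute technique for illustration, not the GraphFrames implementation that this post runs. The function name and sample edges are hypothetical, and words with no edge above the threshold do not appear in the result.

```python
def connected_components(edges):
    """Label every word with a representative word of its connected component."""
    parent = {}

    def find(word):
        # Register unseen words as their own root, then walk up to the root.
        parent.setdefault(word, word)
        while parent[word] != word:
            parent[word] = parent[parent[word]]  # path halving keeps trees shallow
            word = parent[word]
        return word

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            # Merge the two components; the alphabetically smaller root wins.
            parent[max(root_src, root_dst)] = min(root_src, root_dst)

    return {word: find(word) for word in list(parent)}


# Hypothetical high-similarity edges for illustration only.
example_edges = [("stress", "anxiety"), ("anxiety", "fear"), ("cortisol", "hormones")]
labels = connected_components(example_edges)
```

Two words get the same label exactly when a chain of high-similarity edges links them, even if the two words themselves are not directly similar.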
Words in the biggest component:
Words in the second component:
And of course some components are not very interesting:
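Picking out the biggest and second biggest components amounts to grouping words by component label and sorting the groups by size. The plain-Python sketch below shows the idea. The function name `components_by_size` and the sample labels are hypothetical.

```python
from collections import defaultdict


def components_by_size(labels):
    """Group words by component label and return the groups, largest first."""
    groups = defaultdict(list)
    for word, component in labels.items():
        groups[component].append(word)
    # Sort by descending size; ties are broken alphabetically for a stable order.
    return sorted(
        (sorted(words) for words in groups.values()),
        key=lambda words: (-len(words), words),
    )


# Hypothetical component labels for illustration only.
sample_labels = {
    "stress": "anxiety",
    "anxiety": "anxiety",
    "fear": "anxiety",
    "cortisol": "cortisol",
    "hormones": "cortisol",
}
ranked = components_by_size(sample_labels)
```

The first entry of `ranked` is the biggest component, the second entry is the next biggest, and the small groups at the end of the list are usually the less interesting ones.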
Next Post - Word2Vec2Graph Page Rank
The Spark GraphFrames library has many interesting functions. In the next post we will look at Page Rank for Word2Vec2Graph.