In previous posts we introduced the Word2Vec2Graph model in Spark.
The Word2Vec2Graph model connects the Word2Vec model with the Spark GraphFrames library and gives us new opportunities to apply a graph approach to text mining.
In this post we will use the same Word2Vec model as before, trained on a corpus of News and Wiki data, and the same Stress Data text file. In previous posts we looked at the graph of all pairs of words from the Stress Data file. Now we will look at pairs of words that appear next to each other in the text file and use these pairs as graph edges.
Read and Clean Stress Data File
Read the Stress Data file:
Using Spark ML functions, tokenize the Stress Data file and remove stop words:
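As a small illustration of what this cleaning step does, here is a plain-Python sketch rather than Spark code. The regular expression, the tiny `STOP_WORDS` set, and the function name `clean_tokens` are assumptions made for this example; Spark ML's `StopWordsRemover` uses a much longer default English list.

```python
import re

# Hypothetical, deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "that"}


def clean_tokens(line):
    """Lowercase a line, split it into alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", line.lower())
    return [t for t in tokens if t not in STOP_WORDS]


print(clean_tokens("Stress is the response of the body to a demand"))
# ['stress', 'response', 'body', 'demand']
```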
Transform the results to Pairs of Words
Get pairs of words using the Spark ML library NGram function:
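To make the idea of adjacent word pairs concrete, here is a plain-Python sketch of a bigram step. The function name `word_pairs` is hypothetical. Note that Spark ML's `NGram` transformer with n = 2 returns each pair as one space-separated string, so in Spark the result still has to be split into two words.

```python
def word_pairs(tokens):
    """Return consecutive (word1, word2) pairs (bigrams) from a token list."""
    return list(zip(tokens, tokens[1:]))


pairs = word_pairs(["stress", "response", "body", "demand"])
print(pairs)
# [('stress', 'response'), ('response', 'body'), ('body', 'demand')]
```

A list with fewer than two tokens yields no pairs, so very short lines contribute no edges.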
Exclude Word Pairs that are not in the Word2Vec Model
In the post where we introduced the Word2Vec2Graph model, we calculated the cosine similarities of all word-to-word combinations in the Stress Data file based on the Word2Vec model and saved the results.
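As a reminder of what is being computed, cosine similarity between two word vectors is their dot product divided by the product of their lengths. A minimal plain-Python sketch follows; the function name and the zero-vector convention (returning 0.0) are assumptions for this example, not part of the Word2Vec model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed convention: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Toy 2-dimensional vectors; real Word2Vec vectors have many more dimensions.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors
```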
Filter out word pairs that contain words missing from the Word2Vec model vocabulary:
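The filtering logic can be sketched in plain Python as follows. The vocabulary and pairs are made-up sample values, and `filter_pairs` is a hypothetical helper; in Spark the same effect would typically be achieved with joins against the saved word list.

```python
def filter_pairs(pairs, vocabulary):
    """Keep only pairs whose two words both appear in the vocabulary."""
    return [(w1, w2) for w1, w2 in pairs if w1 in vocabulary and w2 in vocabulary]


# Hypothetical model vocabulary and word pairs, for illustration only.
vocabulary = {"stress", "response", "body"}
pairs = [("stress", "response"), ("response", "body"), ("body", "xyzzy")]

print(filter_pairs(pairs, vocabulary))
# [('stress', 'response'), ('response', 'body')]
```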
Example: word pairs with high cosine similarity (> 0.7):
Example: word pairs with cosine similarity close to 0:
Graph on Word Pairs
Now we can build a graph on word pairs: words will be nodes, ngrams will be edges, and cosine similarities will be edge weights.
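The mapping from weighted word pairs to graph pieces can be sketched in plain Python. GraphFrames expects a vertex table with an `id` column and an edge table with `src` and `dst` columns, and the sketch mirrors that layout with dictionaries. The column name `edgeWeight`, the helper `build_graph`, and the similarity values are assumptions for this example.

```python
def build_graph(weighted_pairs):
    """Turn (word1, word2, similarity) triples into vertex and edge records."""
    words = sorted({w for src, dst, _ in weighted_pairs for w in (src, dst)})
    vertices = [{"id": w} for w in words]
    edges = [{"src": s, "dst": d, "edgeWeight": w} for s, d, w in weighted_pairs]
    return vertices, edges


# Made-up similarity values, for illustration only.
triples = [("stress", "response", 0.31), ("response", "body", 0.12)]
vertices, edges = build_graph(triples)

print(vertices)
print(edges)
```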
To use this graph in several posts, we will save the graph vertices and edges as Parquet files to Databricks locations.
Load the vertices and edges and rebuild the same graph:
Calculate PageRank:
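For intuition about what PageRank measures on a word-pair graph, here is a simplified plain-Python power iteration. It is a sketch under stated assumptions, not the GraphFrames implementation: it ignores edge weights, normalizes ranks to sum to 1, and spreads the rank of words without outgoing edges evenly, so the absolute values need not match those returned by GraphFrames. The function name, parameters, and sample edges are hypothetical.

```python
def page_rank(edges, damping=0.85, iterations=50):
    """Simplified PageRank over a list of directed (src, dst) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    count = len(nodes)
    rank = {n: 1.0 / count for n in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared by all nodes.
        dangling = sum(rank[n] for n in nodes if not out_links[n])
        base = (1.0 - damping) / count + damping * dangling / count
        new_rank = {n: base for n in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Made-up word pairs used as directed edges.
edges = [("stress", "response"), ("body", "response"), ("response", "stress")]
ranks = page_rank(edges)
for word in sorted(ranks, key=ranks.get, reverse=True):
    print(word, round(ranks[word], 3))
```

Words that many other words lead into, such as "response" in this toy graph, end up with the highest rank.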
Next Post - Connected Word Pairs
In the next post we will run the Connected Components and Label Propagation functions of the Spark GraphFrames library to analyze the direct Word2Vec2Graph model.