The Word2Vec2Graph technique for finding text topics is similar to the free association technique used in psychoanalysis: "The importance of free association is that the patients spoke for themselves, rather than repeating the ideas of the analyst; they work through their own material, rather than parroting another's suggestions" (Freud).
In this post we will show some examples that illustrate this analogy. As our text file we will use data about Psychoanalysis from Wikipedia.
Read and Clean Psychoanalysis Data File
Read Psychoanalysis Data file, tokenize and remove stop words:
Explode Psychoanalysis word arrays to words:
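As a rough illustration of these two steps, here is a minimal plain-Python sketch, not the code used in this post. The stop-word list, the sample lines, and the function names are all assumptions made for the example.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "and", "in", "is", "of", "that", "the", "to"}

def tokenize(line):
    """Lower-case a line and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", line.lower())

def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]

def explode(rows):
    """Flatten a list of word arrays into one list of words."""
    return [word for row in rows for word in row]

# Made-up sample lines standing in for the Psychoanalysis data file.
lines = [
    "Psychoanalysis is a set of theories.",
    "The patient spoke freely.",
]
rows = [remove_stop_words(tokenize(line)) for line in lines]
words = explode(rows)
```

Each row keeps the words of one line, and `explode` turns those arrays into a single column of words.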
Are Word Pairs in Trained Word2Vec Model?
Read the Word2Vec model that was trained and described in our first post.
Get a set of all words from the Word2Vec model and compare Psychoanalysis file word pairs with words from the Word2Vec model
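The comparison can be sketched in plain Python as a set intersection. This is only an illustration: whether the percentage is computed over distinct words or over all occurrences is an assumption here (the sketch uses distinct words), and the sample data is made up.

```python
def vocabulary_coverage(words, vocabulary):
    """Return the share of distinct words that are present in the vocabulary."""
    distinct = set(words)
    if not distinct:
        return 0.0
    return len(distinct & set(vocabulary)) / len(distinct)

# Made-up example: three distinct words, two of them known to the model.
file_words = ["freud", "dream", "ego", "dream"]
model_words = {"freud", "dream", "memory"}
coverage = vocabulary_coverage(file_words, model_words)
```

A low coverage value suggests that the training corpus is missing vocabulary that matters for the file being analyzed.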
The Word2Vec model was trained on a corpus based on News and Wikipedia data about psychology, but only 82% of Psychoanalysis file word pairs are in the model. To increase this percentage we will add the Psychoanalysis file data to the training corpus and retrain the Word2Vec model.
Retrain Word2Vec Model
Get a set of all words from the new Word2Vec model and compare them with Psychoanalysis file words:
This new Word2Vec model works slightly better: 85% of Psychoanalysis file words are in the model.
How Are Word Pairs Connected?
Now we will calculate cosine similarities of words within word pairs.
We introduced the Word2Vec cosine similarity function in the Word2Vec2Graph model Introduction post.
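For reference, cosine similarity of two vectors u and v is their dot product divided by the product of their lengths. The following is a minimal plain-Python sketch of that formula, not the function from the introduction post. Returning 0.0 for a zero vector is a choice made for this sketch.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption for this sketch: treat a zero vector as unrelated.
        return 0.0
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Values close to 1 mean the word vectors point in nearly the same direction, values near 0 mean they are unrelated, and negative values mean they point in opposing directions.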
Transform to Pairs of Words
Get pairs of words and explode ngrams:
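A minimal plain-Python sketch of this step, assuming that pairs are built from adjacent words inside each cleaned row and never across rows. The function names and the sample rows are made up for illustration.

```python
def word_pairs(tokens):
    """Return adjacent word pairs (2-grams) for one row of tokens."""
    return list(zip(tokens, tokens[1:]))

def explode_pairs(rows):
    """Flatten the per-row pair lists into one list of pairs."""
    return [pair for row in rows for pair in word_pairs(row)]

# Made-up rows of cleaned tokens.
rows = [
    ["psychoanalysis", "set", "theories"],
    ["patient", "spoke"],
]
pairs = explode_pairs(rows)
```

A row with fewer than two words produces no pairs.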
Cosine similarities for pairs of words:
Graph on Word Pairs
Now we can build a graph on word pairs: words will be nodes, ngrams will be edges, and cosine similarities will be edge weights.
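The mapping from weighted word pairs to a graph can be sketched in plain Python as follows. This is an illustration only: the function name, the de-duplication of repeated pairs, and the weights in the sample are assumptions, not values from the Psychoanalysis file.

```python
def build_graph(weighted_pairs):
    """Turn (word1, word2, weight) triples into vertex and edge lists.

    Repeated pairs are kept once (an assumption made for this sketch).
    """
    edge_weights = {}
    for src, dst, weight in weighted_pairs:
        edge_weights[(src, dst)] = weight
    vertices = sorted({word for pair in edge_weights for word in pair})
    edges = [(src, dst, w) for (src, dst), w in edge_weights.items()]
    return vertices, edges

# Invented weights, for illustration only.
weighted_pairs = [
    ("free", "association", 0.4),
    ("association", "dream", 0.2),
    ("free", "association", 0.4),
]
vertices, edges = build_graph(weighted_pairs)
```

The two lists correspond to the vertex and edge tables that a graph library typically expects.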
We will save graph vertices and edges as Parquet to Databricks locations, then load them and rebuild the same graph.
Page Rank
Calculate Page Rank:
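To show what this step computes, here is a minimal plain-Python sketch of textbook PageRank by power iteration. It is not the call used in this post. The damping factor of 0.85, the iteration count, and the toy edges are assumptions, and the scores are normalized to sum to 1, which may differ from the scaling of a graph library.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Textbook PageRank on a list of directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank

# Toy word graph: "c" is pointed to by both "a" and "b".
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
top_word = max(ranks, key=ranks.get)
```

Words that many other words lead to receive higher scores, so the top-ranked words are the hubs of the text graph.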
Finding Topics
In the previous post we described how to find document topics via the Word2Vec2Graph model.
We created a function to calculate connected components with cosine similarity and component size parameters, and a function to transform subgraph edges to DOT language:
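The two functions can be sketched in plain Python as follows. This is not the code from the previous post. The parameter names are hypothetical, and we assume that the cosine similarity parameters form an open interval and that the size parameter is a minimum component size.

```python
from collections import defaultdict

def connected_components(edges, low, high=float("inf"), min_size=2):
    """Components of the undirected graph built from edges with low < weight < high."""
    adjacency = defaultdict(set)
    for src, dst, weight in edges:
        if low < weight < high:
            adjacency[src].add(dst)
            adjacency[dst].add(src)
    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal collecting every node reachable from start.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        if len(component) >= min_size:
            components.append(component)
    return components

def edges_to_dot(edges):
    """Render (src, dst, weight) edges as a DOT digraph, ignoring the weights."""
    body = "\n".join('"{}" -> "{}";'.format(src, dst) for src, dst, _ in edges)
    return "digraph {\n" + body + "\n}"

# Invented edges and weights, for illustration only.
edges = [("a", "b", 0.8), ("b", "c", 0.75), ("d", "e", 0.9), ("e", "f", 0.3)]
topics = connected_components(edges, low=0.7, min_size=3)
dot_text = edges_to_dot([("a", "b", 0.8)])
```

The DOT text can be pasted into any Graphviz viewer to draw the topic graph.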
To select parameters we analyzed cosine similarity distribution.
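One simple way to look at such a distribution is to count edge weights per fixed-width bin. Here is a minimal plain-Python sketch. The bin width of 0.1 and the sample weights are assumptions, not the values from the Psychoanalysis graph.

```python
import math
from collections import Counter

def weight_histogram(weights, bin_width=0.1):
    """Count weights per bin; bin k covers [k * bin_width, (k + 1) * bin_width).

    Values lying exactly on a bin edge may fall into either neighboring bin
    because of floating-point rounding.
    """
    return Counter(math.floor(w / bin_width) for w in weights)

# Invented cosine similarities, for illustration only.
histogram = weight_histogram([0.75, 0.72, 0.15, -0.25])
for bin_index in sorted(histogram):
    lower_edge = bin_index * bin_width if False else bin_index * 0.1
    print("{:+.1f}: {}".format(lower_edge, histogram[bin_index]))
```

Bins with many edges show where most word pairs sit, which helps when choosing thresholds for high, medium, and low similarity.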
Based on cosine similarity distribution we'll look at topics with high, medium and low cosine similarities.
Psychoanalysis Topics with High Cosine Similarities
Connected components with edge weights greater than 0.7:
We selected component '94489280524'. First we'll create a graph with the same cosine similarity parameters that we used to look at connected components, i.e. for word pairs with cosine similarity >0.7:
Next we'll expand the topic graph for word pairs with cosine similarity >0.6:
Then we'll expand the same connected component to cosine similarity >0.5:
Psychoanalysis Topics with Medium Cosine Similarities
Connected components parameters: edge weights in (0.17, 0.2):
Graph picture parameters: edge weights in (0.1, 0.2):
Psychoanalysis Topics with Low Cosine Similarities
Connected components with edge weights in (-0.5, 0.0):
Graph picture with no parameters: edge weights in (-1.0, 1.0):
Example 1:
Example 2:
Example 3:
In this post's examples, topics built from word pairs with high cosine similarity are more predictable than topics built from word pairs with low cosine similarity. Word pairs with low cosine similarity give us more interesting and unexpected results. The last example shows that within the Psychoanalysis text file the word 'association' is associated with some unexpected words...
Next Post - Associations
In the next several posts we will look more deeply at data associations.