About Word2Vec Model
The Word2Vec model was created by a team led by Tomas Mikolov at Google and was released as open source in 2013. Word2Vec transforms words into vectors, which gives us new insights in text analytics. Here is an excellent article about the Word2Vec model: The amazing power of word vectors.
In our posts we will introduce a new Word2Vec2Graph model, a model that combines Word2Vec and graph functionality. We will build graphs using words as nodes and Word2Vec cosine similarities as edge weights. Word2Vec graphs will give us new insights such as the top words in a text file (PageRank), word topics (connected components), and word neighbors (the 'find' function).
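To make the "words as nodes, cosine similarities as edge weights" idea concrete, here is a minimal plain-Python sketch. The three-dimensional vectors are hypothetical toy values, not output of a trained model, and connecting every pair of words is an assumption made only to keep the example short.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; a real Word2Vec model would supply these.
word_vectors = {
    "words": [1.0, 0.0, 1.0],
    "vectors": [1.0, 0.0, 0.0],
    "training": [0.0, 1.0, 0.0],
}

# Each edge is (word1, word2, weight), with the cosine similarity as weight.
edges = [
    (w1, w2, cosine_similarity(word_vectors[w1], word_vectors[w2]))
    for w1, w2 in combinations(sorted(word_vectors), 2)
]

for w1, w2, weight in edges:
    print(f"{w1} - {w2}: {weight:.3f}")
```

In a graph library the same triples would become an edge table with source, destination, and weight columns.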
Let's look at some examples of the Word2Vec2Graph model based on text that describes the Word2Vec model. We'll start with a well-known algorithm, Google PageRank. Here are the top PageRank words, which show us that the Word2Vec model is about words, vectors, training, and so on:
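For readers who have not met PageRank before, the idea can be sketched in a few lines of plain Python. This is a simplified power-iteration version under stated assumptions: the tiny edge list is made up, edges are unweighted, and the damping factor of 0.85 is the conventional default rather than a value taken from this post.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over directed (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing edges spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank

# Hypothetical word graph: an edge points from one word to a related word.
word_edges = [
    ("word2vec", "words"),
    ("vectors", "words"),
    ("training", "words"),
    ("words", "vectors"),
]

ranks = pagerank(word_edges)
top_words = sorted(ranks, key=ranks.get, reverse=True)
print(top_words)
```

Words that many other words point to end up at the front of `top_words`, which is the sense in which PageRank surfaces the "top words" of a text.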
The Spark GraphFrames 'find' function shows us which words in documents about the Word2Vec model are located between the words 'words' and 'vectors':
The next few graphs demonstrate one of the well-known examples of the Word2Vec model: Country - Capital associations such as France - Germany + Berlin = Paris:
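The arithmetic behind France - Germany + Berlin = Paris can be illustrated with hand-made vectors. The four dimensions below (French, German, Italian, capital) are purely hypothetical; a trained model learns its dimensions from data and they are not interpretable like this.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors: [french, german, italian, capital]
vectors = {
    "France": [1.0, 0.0, 0.0, 0.0],
    "Germany": [0.0, 1.0, 0.0, 0.0],
    "Italy": [0.0, 0.0, 1.0, 0.0],
    "Paris": [1.0, 0.0, 0.0, 1.0],
    "Berlin": [0.0, 1.0, 0.0, 1.0],
    "Rome": [0.0, 0.0, 1.0, 1.0],
}

# France - Germany + Berlin, computed element by element.
query = [
    f - g + b
    for f, g, b in zip(vectors["France"], vectors["Germany"], vectors["Berlin"])
]

# Leave out the three input words, then pick the most similar remaining word.
inputs = {"France", "Germany", "Berlin"}
candidates = {word: vec for word, vec in vectors.items() if word not in inputs}
best = max(candidates, key=lambda word: cosine_similarity(query, candidates[word]))
print(best)
```

With real word vectors the match is approximate rather than exact, which is why the result is chosen by highest cosine similarity instead of equality.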
The first picture shows a connected component, the second shows the neighbors of 'Germany' and the neighbors of those neighbors, and the third shows a circle of word pairs. Numbers on edges are Word2Vec cosine similarities between the words.
Here are some more examples. We built a graph of words with low Word2Vec cosine similarities, ran connected components (first picture) and looked at neighbors of neighbors for the word 'vectors' (second picture):
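As a rough illustration of this step, the sketch below keeps only low-similarity edges and then groups words into connected components with a breadth-first search. The edge weights and the 0.2 threshold are hypothetical values chosen for the example, not numbers from our data.

```python
from collections import defaultdict, deque

def connected_components(edges):
    """Group nodes of an undirected graph into sorted connected components."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for neighbor in sorted(adjacency[node]):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(sorted(component))
    return components

# Hypothetical (word1, word2, cosine similarity) triples.
weighted_edges = [
    ("vectors", "training", 0.05),
    ("training", "corpus", 0.10),
    ("words", "vectors", 0.80),
    ("model", "google", 0.02),
]

threshold = 0.2  # assumed cut-off for "low" similarity
low_edges = [(a, b) for a, b, weight in weighted_edges if weight < threshold]
components = connected_components(low_edges)
print(components)
```

The high-similarity pair ("words", "vectors") is filtered out, so "words" does not appear in any component.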
In the next several posts we will show how to build and use the Word2Vec2Graph model. As a tool we will use Spark. We will run it on Amazon Cloud via Databricks Community.
Why Spark?
Until recently there was no single processing framework able to solve several very different analytical problems, such as statistics and graphs. Spark is the first framework that can do it. The fundamental advantage of Spark is that it provides a framework for advanced analytics right out of the box. This framework includes a tool for accelerated queries, a machine learning library, and a graph processing engine.
Databricks Community
Databricks Community Edition is an entry point to Spark big data analytics. It allows you to create a cluster on Amazon Cloud and makes it easy for data scientists and data engineers to write Spark code and debug it. And it's free!
Training a Word2Vec Model
In our first post we will train a Word2Vec model in Spark and show how the training corpus affects the Word2Vec model results.
The AWS cluster that we run via Databricks Community is not very big. To be able to train a Word2Vec model on it, we will get a 42 MB public file of news and load it into Databricks:
First we'll tokenize the data.
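Tokenizing here means turning each line of text into a list of words, which is the input form Word2Vec training expects. A minimal plain-Python sketch follows; it assumes lowercase alphanumeric tokens and uses two made-up input lines. In Spark the same step would be expressed as a DataFrame transformation.

```python
import re

def tokenize(line):
    """Lowercase a line and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", line.lower())

# Hypothetical sample lines standing in for the news file.
lines = [
    "Word2Vec transforms words to vectors.",
    "Spark trains the model!",
]

sentences = [tokenize(line) for line in lines]
print(sentences)
```

Punctuation is dropped and case is normalized, so "Word2Vec" and "word2vec" map to the same token.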
Then we'll train the Word2Vec model.
Then we'll save the model so that we don't need to train it again.
Now let's test the model. The most popular function of the Word2Vec model shows us how different words are associated:
How Does the Training Corpus Affect the Word2Vec Model?
To see how the corpus that we used to train the model affects the results, we will add a small file, train the model on the combined corpus, and compare the results. To play with data about psychology, we copied text from several Wikipedia articles into a small file (180 KB) and combined it with the news file (42 MB). Then we trained the Word2Vec model on this combined file.
Train the Word2Vec model and save the results:
The results of these models are very different for some words and very similar for some other words. Here are examples:
Word: Stress - Input File: News:
Input File: News + Wiki:
Word: Rain - Input File: News:
Input File: News + Wiki:
Word: Specialty - Input File: News:
Input File: News + Wiki: