We will analyze a long document, uncover new topics (clusters), and use CNN classification to validate the graph clustering.
To find topics in a long text file we will build a Word2Vec2Graph model on top of a Word2Vec model. Document words will be used as graph nodes and cosine similarities between word vectors as edge weights. The Word2Vec2Graph model is described in detail in previous posts of this blog.
Word vectors will be transformed to images using the method described in Ignacio Oguiza's notebook "Time series - Olive oil country". As classes we will use topics generated from the Word2Vec2Graph graph. Then we will use a CNN classification model to validate the clustering:
In particular, we will use the CNN classification model to show that the topics we discover are distinct. This validation method will not let us get rid of noise within clusters: two words in the same cluster are not necessarily highly connected. But if two words are in different clusters, they clearly do not belong to the same topic.
Data Preparation
The data preparation process for the Word2Vec2Graph model is described in previous posts and summarized in the "Word2Vec2Graph - Insights" post. Here we used the same data preparation process on text data about Creativity and Aha Moments:
Read text file
Tokenize
Remove stop words
Read trained Word2Vec model
Build a graph with words as nodes and cosine similarities as edge weights.
Save graph vertices and edges
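The steps above can be sketched in plain Python with toy data (the blog's actual pipeline runs in Scala on Spark with a trained Word2Vec model; the text snippet, stop-word list, and random vectors below are all illustrative stand-ins):

```python
import re
from itertools import combinations

import numpy as np

# Toy stand-ins for the real inputs: a text snippet, a stop-word list,
# and word vectors (in the post these come from a trained Word2Vec model).
TEXT = "Creative insight often arrives as a sudden aha moment"
STOP_WORDS = {"a", "as", "an", "the", "often"}
VECTORS = {w: np.random.RandomState(i).rand(8) for i, w in enumerate(
    ["creative", "insight", "arrives", "sudden", "aha", "moment"])}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Steps 1-3: read text, tokenize, remove stop words
tokens = [t for t in re.findall(r"[a-z]+", TEXT.lower()) if t not in STOP_WORDS]
# Step 4: keep only words present in the Word2Vec model
nodes = [t for t in tokens if t in VECTORS]
# Steps 5-6: edges = word pairs weighted by cosine similarity of their vectors
edges = [(w1, w2, cosine(VECTORS[w1], VECTORS[w2]))
         for w1, w2 in combinations(sorted(set(nodes)), 2)]
print(len(nodes), len(edges))
```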
Read and Clean File about Creativity and Aha Moments
Read text file, tokenize it and remove stop words:
Read the trained Word2Vec model and exclude from the Creativity and Aha Moments text the words that are not in the model:
Build a Graph and Find Topics
Read the nodes and edges that we calculated and saved before, and build a graph with words as nodes and cosine similarities as edge weights. Building the graph is described in detail in our post "Introduction to Word2Vec2Graph Model."
Function to calculate connected components with cosine similarity and component size parameters:
Calculate connected components with high cosine similarity:
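As a sketch of this step (the post uses Spark GraphFrames; here is a plain-Python union-find on toy edges, with hypothetical words and weights): keep only edges above a cosine similarity threshold, then return components of at least a minimum size.

```python
from collections import defaultdict

def connected_components(edges, min_weight, min_size):
    """Keep only edges whose cosine similarity is >= min_weight,
    then return connected components with at least min_size words."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
    for a, b, w in edges:
        if w >= min_weight:
            union(a, b)
    groups = defaultdict(set)
    for node in parent:
        groups[find(node)].add(node)
    return [sorted(g) for g in groups.values() if len(g) >= min_size]

edges = [("funny", "laugh", 0.92), ("laugh", "humor", 0.88),
         ("symptoms", "disease", 0.90), ("moment", "aha", 0.45)]
print(connected_components(edges, min_weight=0.8, min_size=2))
```

The low-weight 'moment'/'aha' edge is filtered out, so those words form no component; the two remaining components become topic candidates.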
Calculate Top PageRanks for Connected Components
Calculate graph PageRank:
Join the PageRank data with connected components and find the top PageRank word for each component:
Use the top PageRank word as the class word for each connected component:
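A minimal sketch of this labeling step, assuming toy components and edges (the blog computes PageRank in Spark GraphFrames; the power-iteration version below is only for illustration):

```python
def pagerank(edges, damping=0.85, iters=50):
    # Undirected word graph: treat each edge as two directed links.
    nodes = sorted({n for e in edges for n in e[:2]})
    links = {n: [] for n in nodes}
    for a, b, _ in edges:
        links[a].append(b); links[b].append(a)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        pr = {n: (1 - damping) / len(nodes) + damping * sum(
            pr[m] / len(links[m]) for m in links[n]) for n in nodes}
    return pr

# Hypothetical connected components and their edges:
components = [["funny", "laugh", "humor", "joke"], ["symptoms", "disease"]]
edges = [("funny", "laugh", 0.92), ("funny", "humor", 0.85),
         ("funny", "joke", 0.83), ("laugh", "humor", 0.88),
         ("symptoms", "disease", 0.90)]
pr = pagerank(edges)
# Label each component by its highest-PageRank word (the "class word").
labels = {max(comp, key=pr.get): comp for comp in components}
print(labels)
```

Here 'funny' has the most links in its component, so it becomes the class word for that topic.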
Define word vectors, convert the vectors to strings, and save them as a CSV file:
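A sketch of the export, with made-up vectors and a hypothetical class-word mapping (the real vectors come from the trained Word2Vec model and the class labels from the previous step):

```python
import csv
import io

import numpy as np

# Toy word vectors and a (class, classWord) label for each word.
vectors = {"funny": np.array([0.1, 0.2, 0.3]), "laugh": np.array([0.4, 0.5, 0.6])}
labels = {"funny": ("humor", "funny"), "laugh": ("humor", "laugh")}

# Each CSV row: class, classWord, vecString (vector joined into one string).
buf = io.StringIO()
writer = csv.writer(buf)
for word, vec in vectors.items():
    cls, class_word = labels[word]
    writer.writerow([cls, class_word, ",".join(f"{x:.4f}" for x in vec)])
csv_text = buf.getvalue()
print(csv_text)
```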
Using CNN Deep Learning for Topic Validation
To convert vectors to images and classify the images via CNN we used almost the same code that Ignacio Oguiza shared on the fast.ai forum: "Time series - Olive oil country".
We split the source file into words={class, classWord} and vecString. The 'class' column was used to define a topic category for the images and the 'classWord' column to define the image name. The 'vecString' column was split by commas into numbers.
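This step can be sketched as follows, with a made-up row: parse vecString back into numbers, then encode the 1-D vector as a 2-D image. The Gramian Angular Summation Field below is the time-series-to-image encoding used in Oguiza's notebook; the row contents are illustrative.

```python
import numpy as np

# A hypothetical "class,classWord,vecString" row from the CSV.
row = "humor,funny," + ",".join(str(v) for v in np.linspace(-1, 1, 8))
cls, class_word, *vec_str = row.split(",")
vec = np.array([float(x) for x in vec_str])

# Gramian Angular Summation Field: rescale to [-1, 1], map each value
# to an angle, and build a matrix of pairwise angle sums.
x = 2 * (vec - vec.min()) / (vec.max() - vec.min()) - 1
phi = np.arccos(np.clip(x, -1, 1))
image = np.cos(phi[:, None] + phi[None, :])  # shape (len(vec), len(vec))
print(cls, class_word, image.shape)
```

Each word's image is then filed under its 'class' directory so the CNN can learn to separate the topics.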
We tuned the classification model and got about 91% accuracy.
Potentially this accuracy can be improved by using a more advanced Word2Vec model.
Graphs of Topics
Function to find second-degree neighbors ('friend of a friend') of a word and transform the results to the DOT language:
Calculate second-degree neighbors for the top PageRank words of the connected components.
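As a plain-Python sketch of this step (the blog's actual code runs on the Spark graph; the words and weights below are illustrative), collect a word's first- and second-degree neighbors and emit the surviving edges as a DOT graph:

```python
from collections import defaultdict

def foaf_dot(edges, word, min_weight=0.8):
    """Collect a word's neighbors and neighbors-of-neighbors
    ('friend of a friend') and emit them as a DOT graph string."""
    adj = defaultdict(set)
    for a, b, w in edges:
        if w >= min_weight:
            adj[a].add(b); adj[b].add(a)
    keep = {word} | adj[word] | {f2 for f in adj[word] for f2 in adj[f]}
    lines = [f'"{a}" -- "{b}";' for a, b, w in edges
             if w >= min_weight and a in keep and b in keep]
    return "graph topic {\n" + "\n".join(lines) + "\n}"

edges = [("funny", "laugh", 0.92), ("laugh", "humor", 0.88),
         ("humor", "comedy", 0.85), ("comedy", "stage", 0.83)]
print(foaf_dot(edges, "funny"))
```

The DOT output can be pasted into Gephi (or Graphviz) to draw the topic graph; third-degree words like 'comedy' and 'stage' are excluded.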
Topic Examples
We used a semi-manual way of building Gephi graphs: we created a list of friends of friends for the top PageRank word of each topic in the DOT language.
Top PageRank word - 'funny':
Top PageRank word - 'decrease':
Top PageRank word - 'integrated':
Top PageRank word - 'symptoms':
Top PageRank word - 'emory':
Next Post - Associations and Deep Learning
In the next post we will take a deeper look at deep learning for data associations.