Word2Vec2Graph for Pairs of Words - Sparkling Data Ocean

Word2Vec2Graph Model - Direct Graph

In previous posts we introduced Word2Vec2Graph model in Spark. Word2Vec2Graph model connects Word2Vec model with Spark GraphFrames library and gives us new opportunities to use graph approach to text mining.

In this post as Word2Vec model we will use the same model that was trained on the corpus of News and Wiki data and as a text file we will use the same Stress Data file. In previous posts we looked at graph for all pairs of words from Stress Data file. Now we will look at pairs of words that stay next to each other in text file and will use these pairs as graph edges.

Read and Clean Stress Data File

Read Stress Data file:

val inputStress=sc.textFile("/FileStore/tables/stressWiki.txt").
   toDF("charLine")

Using Spark ML functions tokenize and remove stop words from Stress Data file:

import org.apache.spark.ml._
import org.apache.spark.ml.feature._
val tokenizer = new RegexTokenizer().
   setInputCol("charLine").
   setOutputCol("value").
   setPattern("[^a-z]+").
   setMinTokenLength(5).
   setGaps(true)
val tokenizedStress = tokenizer.
   transform(inputStress)
val remover = new StopWordsRemover().
   setInputCol("value").
   setOutputCol("stopWordFree")
val removedStopWordsStress = remover.
   setStopWords(Array("none","also","nope","null")++
   remover.getStopWords).
   transform(tokenizedStress)

Transform the results to Pairs of Words

Get pairs of words - use Spark ML library ngram function:

val ngram = new NGram().
   setInputCol("stopWordFree").
   setOutputCol("ngrams").
   setN(2)
val ngramCleanWords = ngram.
   transform(removedStopWordsStress)

Explode ngrams:

import org.apache.spark.sql.functions.explode
val slpitNgrams=ngramCleanWords.
   withColumn("ngram",explode($"ngrams")).
   select("ngram").
   map(s=>(s(0).toString,
      s(0).toString.split(" ")(0),
      s(0).toString.split(" ")(1))).
   toDF("ngram","ngram1","ngram2").
   filter('ngram1=!='ngram2)

display(slpitNgrams)
ngram,ngram1,ngram2
psychological stress,psychological,stress
wikipedia encyclopedia,wikipedia,encyclopedia
kinds stress,kinds,stress
stress disambiguation,stress,disambiguation
video explanation,video,explanation
psychology stress,psychology,stress
stress feeling,stress,feeling
feeling strain,feeling,strain
strain pressure,strain,pressure
pressure small,pressure,small
small amounts,small,amounts
amounts stress,amounts,stress
stress desired,stress,desired
desired beneficial,desired,beneficial
beneficial healthy,beneficial,healthy
healthy positive,healthy,positive
positive stress,positive,stress

Exclude Word Pairs that are not in the Word2Vec Model

In the post where we introduced Word2Vec2Graph model, we calculated cosine similarities of all word-to-word combinations of Stress Data File based on Word2Vec model and saved the results.

val w2wStressCos = sqlContext.read.parquet("w2wStressCos")

display(w2wStressCos.
   filter('cos< 0.1).
   filter('cos> 0.0).limit(7))
word1,word2,cos
conducted,contribute,0.08035969605150468
association,contribute,0.06940379539008698
conducted,crucial,0.0254494353390933
conducted,consequences,0.046451274237478545
exhaustion,ideas,0.08462263299060188
conducted,experience,0.05733563656740034
conducted,inflammation,0.09058846853618428

Filter out word pairs with words that are not in the set of words from the Word2Vec model

val ngramW2V=slpitNgrams.
   join(w2wStressCos,'ngram1==='word1 && 'ngram2==='word2).
   select("ngram","ngram1","ngram2","cos").distinct

Example: Word Pairs with high Cosine Similarity >0.7:

display(ngramW2V.
   select('ngram,'cos).
   filter('cos>0.7).orderBy('cos.desc))
ngram,cos
acute chronic,0.7848571640793651
governmental organizations,0.7414504735574394
realistic helpful,0.730824091817287
disease chronic,0.7064366889098306
feelings thoughts,0.7000105635150229
thoughts feelings,0.7000105635150229

Example: Word Pairs with Cosine Similarity close to 0:

display(ngramWord2VecDF.
   select('ngram,'cos).
   filter('cos>(-0.002)).
   filter('cos<(0.002)).orderBy('cos))
ngram,cos
researchers interested,-0.0019752767768097153
defense mechanisms,-0.0014974826488316265
whether causes,-0.0008734112750530817
share others,0.0002295526607795157
showed direct,0.00045697478567580015
individual takes,0.0017983474881583593

Graph on Word Pairs

Now we can build a graph on word pairs: words will be nodes, ngrams - edges and cosine similarities - edge weights.

import org.graphframes.GraphFrame
val graphNodes1=ngramW2V.
   select("ngram1").
   union(ngramW2V.select("ngram2")).
   distinct.toDF("id")
val graphEdges1=ngramW2V.  
   select("ngram1","ngram2","cos").
   distinct.toDF("src","dst","edgeWeight")
val graph1 = GraphFrame(graphNodes1,graphEdges1)

To use this graph in several posts we will save graph vertices and edges as Parquet to Databricks locations.

graph1.vertices.write.
   parquet("graphNgramVertices")
graph1.edges.write.
   parquet("graphNgramEdges")

Load vertices and edges and rebuild the same graph back

val graphNgramStressVertices = sqlContext.read.
   parquet("graphNgramVertices")
val graphNgramStressEdges = sqlContext.read.
   parquet("graphNgramEdges")
val graphNgramStress = GraphFrame(graphNgramStressVertices, graphNgramStressEdges)

Page Rank

Calculate Page Rank:

val graphNgramStressPageRank = graphNgramStress.
   pageRank.
   resetProbability(0.15).
   maxIter(11).
   run()

display(graphNgramStressPageRank.vertices.
   distinct.
   sort($"pagerank".desc).
   limit(11))
   id,pagerank
   stress,36.799029843873065
   social,8.794399876715186
   individual,8.756866689676286
   person,8.466242702036295
   stressful,7.9825617601531444
   communication,7.274847096155088
   health,6.398223040310048
   situation,5.924707831050667
   events,5.7227621841425975
   changes,5.642126628136843
   chronic,5.2918611240572755

Next Post - Connected Word Pairs

In the next post we will run Connected Components and Label Propagation functions of Spark GraphFrames library to analyze direct Word2Vec2Graph model.