Direct Word2Vec2Graph - Connected Pair Groups

Read Word Pairs Graph

In the previous post we built and saved Word2Vec2Graph for pairs of words from the Stress Data file. In this post we will look for connected word pair groups using Spark GraphFrames library functions: Connected Components and Label Propagation.

Read the stored vertices and edges and rebuild the graph:

import org.graphframes.GraphFrame
val graphNgramStressVertices = sqlContext.read.parquet("graphNgramVertices")
val graphNgramStressEdges = sqlContext.read.parquet("graphNgramEdges")

val graphNgramStress = GraphFrame(graphNgramStressVertices, graphNgramStressEdges)

Connected Components

sc.setCheckpointDir("/FileStore/")
val resultStressCC = graphNgramStress.
   connectedComponents.
   run()
val ccStressCount=resultStressCC.
   groupBy("component").
   count.
   toDF("cc","ccCt")

display(ccStressCount.orderBy('ccCt.desc))
cc,ccCt
0,1111
240518168576,2

As expected, almost all word pairs are connected, so almost all of them fall into the same large connected component.
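To make the idea of a connected component concrete, here is a minimal plain-Scala sketch (no Spark) that groups words with a union-find structure. It is only an illustration under our own assumptions, not how GraphFrames implements `connectedComponents`: the object and function names are hypothetical, edges are treated as undirected, and the sample is a hand-picked subset of four edges from this post, so it splits into two components even though the full graph does not.

```scala
// Illustrative sketch only: union-find connected components in plain Scala.
// ConnectedComponentsSketch and components are hypothetical names.
object ConnectedComponentsSketch {
  def components(edges: Seq[(String, String)]): Map[String, String] = {
    val parent = scala.collection.mutable.Map[String, String]()

    // Find the representative word of the set that contains w.
    def find(w: String): String = {
      val p = parent.getOrElseUpdate(w, w)
      if (p == w) w
      else {
        val root = find(p)
        parent(w) = root // path compression
        root
      }
    }

    // Merge the sets of the two endpoints of every edge (direction is ignored).
    edges.foreach { case (a, b) =>
      val (ra, rb) = (find(a), find(b))
      if (ra != rb) parent(ra) = rb
    }

    // Map every word to the representative of its component.
    parent.keys.toList.map(w => w -> find(w)).toMap
  }

  // Hand-picked subset of edges from this post, used as a toy graph.
  val sample = Seq(
    "military" -> "combat",
    "level" -> "combat",
    "individual" -> "level",
    "theories" -> "kinds")

  val result: Map[String, String] = components(sample)
}
```

In this toy graph 'military', 'combat', 'level' and 'individual' share one representative, while 'theories' and 'kinds' share another.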

When we looked at all word-to-word combinations from the text file, pairs of words were tightly connected and we could not split them into separate groups. Now, looking at ngram word pairs, we can use community detection algorithms to split them into word pair groups. We'll start with the simplest community detection algorithm - Label Propagation.

Label Propagation

val lapelPropId = graphNgramStress.
  labelPropagation.
  maxIter(5).
  run().
  toDF("lpWord","lpLabel")

display(lapelPropId.
  groupBy("lpLabel").count.toDF("label","count").
  orderBy('count.desc).limit(11))
  label,count
  386547056642,115
  317827579910,107
  515396075520,26
  420906795012,22
  274877906949,19
  738734374914,12
  481036337155,10
  1382979469314,10
  575525617667,10
  188978561028,9
  927712935936,9
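For readers who want to see the mechanics, below is a small plain-Scala sketch of label propagation. It is a simplified version under our own assumptions and not the GraphFrames implementation: every word starts with its own label, labels flow in both directions along an edge, all words update at the same time, and ties are broken by taking the alphabetically smallest label (our choice, made only to keep the example deterministic). The triangle cause/either/individual uses real edges from this graph; the x/y/z triangle is hypothetical.

```scala
// Illustrative sketch only: synchronous label propagation in plain Scala.
// LabelPropagationSketch and propagate are hypothetical names.
object LabelPropagationSketch {
  def propagate(edges: Seq[(String, String)], maxIter: Int): Map[String, String] = {
    // Undirected neighbor lists: labels flow both ways along an edge.
    val neighbors: Map[String, Seq[String]] =
      edges.flatMap { case (s, d) => Seq(s -> d, d -> s) }
        .groupBy(_._1)
        .map { case (v, pairs) => v -> pairs.map(_._2) }

    // Every word starts in its own community.
    var labels: Map[String, String] = neighbors.keys.map(v => v -> v).toMap

    for (_ <- 1 to maxIter) {
      // Synchronous update: every word reads the labels of the previous round.
      labels = neighbors.map { case (v, ns) =>
        val counts = ns.map(labels).groupBy(identity).map { case (l, ls) => l -> ls.size }
        val best = counts.values.max
        // Assumption: ties go to the alphabetically smallest label.
        v -> counts.collect { case (l, c) if c == best => l }.min
      }
    }
    labels
  }

  val sample = Seq(
    "cause" -> "either", "individual" -> "either", "cause" -> "individual", // real edges
    "x" -> "y", "y" -> "z", "x" -> "z")                                    // hypothetical edges

  val labels: Map[String, String] = propagate(sample, maxIter = 5)
}
```

On this toy graph each triangle settles on a single label. Synchronous updates can oscillate on chains and other bipartite structures, which is one reason the number of iterations is capped.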

Label Propagation tends to separate loosely connected parts of the graph, so we want to see which {word1, word2} ngram pairs from the text file fall within the same Label Propagation group.

val pairLabel=graphNgramStress.
  edges.
  join(lapelPropId,'src==='lpWord).
  join(lapelPropId.toDF("lpWord2","lpLabel2"),'dst==='lpWord2).
  filter('lpLabel==='lpLabel2).
  select('src,'dst,'edgeWeight,'lpLabel)
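The two joins and the filter above amount to a simple rule: keep an edge only when both of its words carry the same label. A plain-Scala sketch of that rule, with hypothetical names and a toy input, could look like this. The first two edges and the labels are taken from this post; the 'stress' to 'individual' edge and its weight are made up to show a pair that crosses groups.

```scala
// Illustrative sketch only: keep word pairs whose endpoints share a label.
// SameLabelPairsSketch, Edge and sameLabelPairs are hypothetical names.
object SameLabelPairsSketch {
  case class Edge(src: String, dst: String, edgeWeight: Double)

  def sameLabelPairs(edges: Seq[Edge], label: Map[String, Long]): Seq[(Edge, Long)] =
    edges.flatMap { e =>
      for {
        l1 <- label.get(e.src) // label of the first word, if it has one
        l2 <- label.get(e.dst) // label of the second word, if it has one
        if l1 == l2            // keep the pair only when the labels match
      } yield (e, l1)
    }

  val label: Map[String, Long] = Map(
    "military" -> 317827579910L, "combat" -> 317827579910L,
    "individual" -> 317827579910L, "level" -> 317827579910L,
    "stress" -> 386547056642L)

  val edges = Seq(
    Edge("military", "combat", 0.6585100225253417),
    Edge("individual", "level", 0.09802076461154222),
    Edge("stress", "individual", 0.5)) // hypothetical cross-group edge and weight

  val result: Seq[(Edge, Long)] = sameLabelPairs(edges, label)
}
```

Only the first two pairs survive, both tagged with label 317827579910.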

For now we will ignore small groups and look at groups that have at least 3 {word1, word2} pairs.

  
display(pairLabel.
  groupBy("lpLabel").count.toDF("lpLabelCount","pairCount").
  filter("pairCount>2").orderBy('pairCount.desc))
lpLabelCount,pairCount
  386547056642,54
  317827579910,30
  274877906949,8

Word Pair Groups

We'll start with the second group - group that contains 30 {word1, word2} pairs. Here are edges that belong to this group - {word1, word2, word2vec cosine similarity}:

display(pairLabel.
  filter('lpLabel==="317827579910").
  select('src,'dst,'edgeWeight).
  orderBy('edgeWeight.desc))
src,dst,edgeWeight
military,combat,0.6585100225253417
theories,kinds,0.614683170553027
greatly,affect,0.5703911227199971
individual,susceptible,0.5092298863558692
individual,indirectly,0.50178581898798
changes,affect,0.44110535802060397
individual,examples,0.435892677464733
individual,either,0.4149070407876195
requires,individual,0.4125513014023833
affect,individual,0.3983243027379392
individual,better,0.38429278796103433
individual,perceive,0.3798804220663263
individual,experience,0.36348468757986896
cause,changes,0.35436057661804354
indirectly,deals,0.33550190287381315
prevention,requires,0.3049172137994746
displacement,individual,0.2841213622282649
better,negative,0.2451413181091326
individual,diminished,0.23837691546795897
cause,either,0.22592468033966373
humor,individual,0.221442007247884
individual,personality,0.20933309402530506
affect,promoting,0.19505527986421928
cause,individual,0.18144254665538492
changes,caused,0.15654198248664133
experience,conflicting,0.14323076298532875
individual,level,0.09802076461154222
level,combat,0.0915599285372975
causing,individual,0.008010528297421033
individual,takes,0.0017983474881583593

Graph (via Gephi): We build Gephi graphs in a semi-manual way. First, create a list of directed edges:

display(pairLabel.
  filter('lpLabel==="317827579910").
  map(s=>(s(0).toString + " -> "+ s(1).toString)))

Then put the list within 'digraph{...}' to get the data in .DOT format:

digraph{
level -> combat
military -> combat
changes -> caused
indirectly -> deals
requires -> individual
displacement -> individual
cause -> individual
humor -> individual
causing -> individual
affect -> individual
individual -> examples
experience -> conflicting
individual -> either
cause -> either
individual -> susceptible
individual -> experience
individual -> better
individual -> personality
affect -> promoting
individual -> perceive
individual -> diminished
theories -> kinds
changes -> affect
greatly -> affect
individual -> level
individual -> indirectly
cause -> changes
individual -> takes
better -> negative
prevention -> requires
}
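The manual wrapping step can also be done in code. Here is a small plain-Scala sketch that assumes the edges are already collected as (src, dst) pairs; `DotSketch` and `toDot` are hypothetical helper names. Words that contain spaces or punctuation would need quoting in DOT, which this sketch does not handle.

```scala
// Illustrative sketch only: wrap "src -> dst" lines in digraph{...}.
// DotSketch and toDot are hypothetical names.
object DotSketch {
  def toDot(edges: Seq[(String, String)]): String =
    edges
      .map { case (src, dst) => s"$src -> $dst" }
      .mkString("digraph{\n", "\n", "\n}")

  // First two edges of the list above.
  val dot: String = toDot(Seq("level" -> "combat", "military" -> "combat"))
}
```

Printing `DotSketch.dot` gives a four-line digraph that can be saved to a .dot file and opened in Gephi.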

Here is the graph for the group of 54 pairs. 'Stress' - the word with the highest PageRank - is in the center of this graph:

Here is the graph for the group of 8 pairs:

High Topics of Label Groups

We can see that in the center of the biggest group is the word 'stress' - the word with the highest PageRank. We'll calculate high PageRank words for word pair groups. Calculate PageRank:

val graphNgramStressPageRank =
  graphNgramStress.
  pageRank.
  resetProbability(0.15).
  maxIter(11).
  run()
val pageRankId=graphNgramStressPageRank.
  vertices.
  toDF("prWord","pagerank")
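To show what the `resetProbability` and `maxIter` parameters do, here is a plain-Scala sketch of an unnormalized PageRank iteration. It assumes the common update rule rank = resetProbability + (1 - resetProbability) * (sum of rank shares from incoming edges), with every word starting at 1.0. The library may differ in details such as how it treats words without outgoing edges, so the numbers will not match the output in this post. All names are hypothetical.

```scala
// Illustrative sketch only: unnormalized PageRank by repeated updates.
// PageRankSketch and pageRank are hypothetical names.
object PageRankSketch {
  def pageRank(edges: Seq[(String, String)],
               resetProbability: Double,
               maxIter: Int): Map[String, Double] = {
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    val outDegree = edges.groupBy(_._1).map { case (s, es) => s -> es.size }

    // Assumption: every word starts with rank 1.0.
    var ranks: Map[String, Double] = vertices.map(_ -> 1.0).toMap

    for (_ <- 1 to maxIter) {
      // Each source word splits its current rank evenly across its outgoing edges.
      val received = edges
        .map { case (s, d) => d -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (d, shares) => d -> shares.map(_._2).sum }

      ranks = vertices.map { v =>
        v -> (resetProbability + (1 - resetProbability) * received.getOrElse(v, 0.0))
      }.toMap
    }
    ranks
  }

  // Three real edges from the second group, used as a toy graph.
  val sample = Seq("level" -> "combat", "military" -> "combat", "individual" -> "level")

  val ranks: Map[String, Double] = pageRank(sample, resetProbability = 0.15, maxIter = 11)
}
```

In this toy graph 'combat' receives rank from two words and ends up highest, 'level' comes next, and 'military' and 'individual', which have no incoming edges, stay at the reset value.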

Calculate lists of distinct words in the label groups:

val wordLabel=pairLabel.
  select('src,'lpLabel).
  union(pairLabel.
    select('dst,'lpLabel)).
  distinct.
  toDF("lpWord","lpLabel")

display(wordLabel.
  groupBy('lpLabel).count.
  toDF("lpLabel","labelCount").
  filter("labelCount>2").
  orderBy('labelCount.desc))
lpLabel,labelCount
386547056642,47
317827579910,30
274877906949,7
146028888068,4
1675037245443,3

Top 10 Words in Label Groups

The biggest group:

val wordLabelPageRank=wordLabel.
  join(pageRankId,'lpWord==='prWord).
  select('lpLabel,'lpWord,'pageRank)

display(wordLabelPageRank.
  select('lpWord,'pageRank).
  filter('lpLabel==="386547056642").
  orderBy('pageRank.desc).
  limit(10))
lpWord,pageRank
  stress,36.799029843873036
  stressful,7.982561760153138
  anxiety,5.280935662282566
  levels,3.577601059501528
  depression,2.997965863478802
  cognitive,2.4835377323499968
  event,2.376589797720234
  system,2.209925145397034
  physiological,2.010263387749949
  laughter,1.9427846994029507

Second group:

display(wordLabelPageRank.
  select('lpWord,'pageRank).
  filter('lpLabel==="317827579910").
  orderBy('pageRank.desc).
  limit(10))
lpWord,pageRank
  individual,8.75686668967628
  changes,5.642126628136839
  negative,3.89748211412626
  affect,2.869162036449995
  cause,2.82449665904923
  humor,2.654039734715573
  either,2.629897237315239
  examples,2.2158219523034477
  experience,2.086137279362367
  level,1.7538722184950524

Third group:

display(wordLabelPageRank.
  select('lpWord,'pageRank).
  filter('lpLabel==="274877906949").
  orderBy('pageRank.desc).
  limit(10))
lpWord,pageRank
  disease,4.195635531226847
  illness,2.90222622174496
  heart,2.1357113367905662
  increased,1.4318340158498353
  stage,1.0047304801814618
  resistance,0.8456856157412355
  confronting,0.6327411341532586
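The three queries above differ only in the label value. As a sketch, the same top-N selection can be done for all groups at once. The plain-Scala version below works on an in-memory list; `TopWordsSketch`, `WordRank` and `topWords` are hypothetical names, and the sample rows are a few of the (label, word, PageRank) values listed above.

```scala
// Illustrative sketch only: top-N words by PageRank within every label group.
// TopWordsSketch, WordRank and topWords are hypothetical names.
object TopWordsSketch {
  case class WordRank(lpLabel: Long, lpWord: String, pageRank: Double)

  def topWords(rows: Seq[WordRank], n: Int): Map[Long, Seq[String]] =
    rows.groupBy(_.lpLabel).map { case (label, words) =>
      // Sort by descending PageRank and keep the first n words.
      label -> words.sortBy(-_.pageRank).take(n).map(_.lpWord)
    }

  // A few rows copied from the tables above.
  val rows = Seq(
    WordRank(386547056642L, "anxiety", 5.280935662282566),
    WordRank(386547056642L, "stress", 36.799029843873036),
    WordRank(386547056642L, "stressful", 7.982561760153138),
    WordRank(317827579910L, "changes", 5.642126628136839),
    WordRank(317827579910L, "individual", 8.75686668967628),
    WordRank(274877906949L, "disease", 4.195635531226847))

  val top2: Map[Long, Seq[String]] = topWords(rows, 2)
}
```

With n = 2 the biggest group yields 'stress' and 'stressful', the second group 'individual' and 'changes', and the third group only 'disease' because the sample holds a single row for it.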

Comparing the group graphs with the PageRanks of words within the groups shows that the words with high PageRanks are located near the center of each graph.

Next Post - More Pair Connections

In the next post we will continue playing with Spark GraphFrames library to find more interesting word to word connections.