Word2Vec2Graph - Insights

Find New Associations in Text

Posted by Melenar on April 4, 2018

Word2Vec2Graph Model and New Associations

In our previous post we proved that high cosine similarity text topics are already well known and topics with low cosine similarities give us new associations and new ideas.

In this post we will show another Word2Vec2Graph method to find new word to word associations. This method is similar to psychoanalysis Free Association techniques as well as methods of finding text topics. As a text file in this post we will use data about Creativity and Aha Moments.

We will look at word neighbors using 'find' GraphFrames function that we described in "Word2Vec2Graph Model - Neighbors" post.

Data Preparation

Data preparation process for Word2Vec2Graph model in described in several post and summarized in the "Quick Look at Text" post. Here we used the same data preparation process of text data about Creativity and Aha Moments:
  • Read text file
  • Tokenize
  • Remove stop words
  • Transform text file to next to each other pairs of words
  • Retrain Word2Vec model
  • Calculate cosine similarities within pairs of words
  • Build a direct graph on pairs of words using words as nodes and cosine similarities as edges weights
  • Save graph vertices and edges

Build a Graph

Read graph vertices and edges saved as Parquet to Databricks locations and build the graph for Word2Vec2Graph model.

import org.graphframes.GraphFrame
val graphInsightVertices = sqlContext.read.parquet("graphInsightVertices")
val graphInsightEdges = sqlContext.read.parquet("graphInsightEdges")

val graphInsight = GraphFrame(graphInsightVertices, graphInsightEdges)

Page Rank

The highest Page Rank words about Creativity and Aha Moments:

val graphInsightPageRank = graphInsight.
   pageRank.resetProbability(0.15).maxIter(11).run()

display(graphInsightPageRank.vertices.
   distinct.
   sort($"pagerank".desc).limit(12))
   id,pagerank
   creative,36.427592633298694
   brain,35.64966777994186
   creativity,26.75614299294646
   insight,21.3449149933242
   cognitive,20.66552322641501
   problem,20.291860811528586
   people,18.962734667376907
   university,16.66836572882581
   social,16.540315233753887
   genius,16.250827488029646
   knowledge,15.852328556068747
   study,15.565795496200513

Lines between Words

To see how word pair are connected we will look at 'find' GraphFrames function that we already used in the "Word2Vec2Graph Model - Neighbors" post.

We will start with two high Page Ranks words 'insight' and 'brain' and look for lines of words in both directions.

'insight' -> 'word' -> 'brain':

val path=graphInsight.
   find("(a) - [ab] -> (b); (b) - [bc] -> (c)").
   filter($"a.id"==="insight").
   filter($"c.id"==="brain").
   select("a.id","ab.edgeWeight","b.id","bc.edgeWeight","c.id").
   toDF("a","ab","b","bc","c")

display(path.orderBy(('ab+'bc).desc))
a,ab,b,bc,c
insight,0.4128722197714052,invasive,0.5830608090075667,brain
insight,0.5926405065153766,creative,0.35538729273722036,brain
insight,0.6265201186962078,solutions,0.13029716155867507,brain
insight,0.20443740746326608,pattern,0.4753597960223179,brain
insight,0.15960073745912792,occur,0.416785760503172,brain
insight,0.32890495067080877,requires,0.19730741846887495,brain
insight,0.11848059507149529,increase,0.1508410859156626,brain
insight,0.10729267923132169,beeman,0.15344660014170144,brain
insight,0.004491484498190776,suggested,0.16421754414027762,brain
insight,-0.043810741336856945,regions,-0.02137437192577632,brain
insight,-0.14634712420947668,along,-0.02381072199458428,brain
insight,-0.13813478755633732,first,-0.1554410765939254,brain
'brain' -> 'word' -> 'insight':

val path=graphInsight.
   find("(a) - [ab] -> (b); (b) - [bc] -> (c)").
   filter($"a.id"==="brain").
   filter($"c.id"==="insight").
   select("a.id","ab.edgeWeight","b.id","bc.edgeWeight","c.id").
   toDF("a","ab","b","bc","c")

display(path.orderBy(('ab+'bc).desc))
a,ab,b,bc,c
brain,0.4486765246697501,conscious,0.22780425594285608,insight
brain,0.30888238508458593,activity,0.2913329808544577,insight
brain,0.1871056077079674,moments,-0.17359890717130036,insight
brain,0.06816500537193591,right,-0.16052900998488828,insight

Next we will look for longer lines, i.e. 'word1' -> 'word2' -> 'word3' -> 'word4'. With longer lines the number of word combination is growing fast so we will show only some results.

'insight' -> 'word1' -> 'word2' -> 'brain' (142 lines):

val path=graphInsight.
   find("(a) - [ab] -> (b); (b) - [bc] -> (c); (c) - [cd] -> (d)").
   filter($"a.id"==="insight").
   filter($"d.id"==="brain").
   select("a.id","ab.edgeWeight","b.id","bc.edgeWeight","c.id","cd.edgeWeight","d.id").
   toDF("a","ab","b","bc","c","cd","d")

display(path.orderBy(('ab+'bc+'cd).desc))
a,ab,b,bc,c,cd,d
insight,0.28168782668396153,brain,0.7644093868324334,cells,0.7644093868324334,brain
insight,0.5873913431733343,creativity,0.72979093470829,creative,0.35538729273722036,brain
insight,0.5873913431733343,creativity,0.3978587884107718,suggests,0.540965758459218,brain
insight,0.0695940003338105,watch,0.2189340241873386,clues,0.4328969882637767,brain
insight,0.23392584234097255,unexpected,0.27952840262,happened,0.202974504974,brain
insight,0.1806737835610359,occurs,0.29306721804427444,various,0.23348013916537874,brain
insight,0.004491484498190776,suggested,0.0039846481446,creative,0.355387292737,brain
iinsight,0.109973700652,analytically,0.091236920188010,increase,0.1508410859156626,brain
insight,0.2537959966753002,flash,0.1505789981661047,across,-0.06288528966788781,brain

'brain' -> 'word1' -> 'word2' -> 'insight' (105 lines):

val path=graphInsight.
   find("(a) - [ab] -> (b); (b) - [bc] -> (c); (c) - [cd] -> (d)").
   filter($"a.id"==="brain").
   filter($"d.id"==="insight").
   select("a.id","ab.edgeWeight","b.id","bc.edgeWeight","c.id","cd.edgeWeight","d.id").
   toDF("a","ab","b","bc","c","cd","d")

display(path.orderBy(('ab+'bc+'cd).desc))
brain,0.627845334314687,memory,0.5969736849364111,creative,0.5926405065153766,insight
brain,0.5818230817465604,function,0.6050462263809657,creativity,0.5873913431733343,insight
brain,0.522079405363812,abilities,0.6439093697793404,creative,0.5926405065153766,insight
brain,0.2572351478541803,thinking,0.3586918857219751,nature,0.3236593351220391,insight
brain,0.14564507991393688,create,0.4844363325210495,ideas,0.28877305186266056,insight
brain,0.17086866711570267,complex,0.432425015153635,system,0.2989597664441415,insight
brain,0.133737192411332,connections,0.4770965803506682,ideas,0.28877305186266056,insight
brain,0.0695041528304244,fundamental,0.5521335607322843,problem,0.2315418458136853,insight
brain,0.2091917558421882,epiphany,0.4292848059738405,moments,-0.17359890717130036,insight
brain,0.17280034376836279,research,0.04989028618906677,problem,0.2315418458136853,insight
brain,0.07751489118586756,require,0.06005363995904192,spontaneous,0.19017912534928294,insight

Word2Vec2Graph Model with Lower Cosine Similarities

In our results we can see than word pairs with high cosine similarity are well known similar words - they are called "synonyms" in Word2Vec model. Pairs of words with lower cosine similarity are more interesting to look at. This is similar to what we showed in our previous post: lower cosine similarity text topics are less known then topics with high cosine similarities.

We will build a graph with lower cosine similarity.

val graphInsightLow = GraphFrame(graphInsightVertices,
     graphInsightEdges.filter("edgeWeight<0.33"))

Path String for 'Find' Function

To look at different word-to-word lines we'll start with function that creates a string for 'find' function.

import org.apache.spark.sql.functions._
def formLine(steps:Int): String = {
   var line=new String
   for (i <- 1 to steps-1) line+="(x"+lit(i)+")-[]->(x"+lit(i+1)+");"
   line.substring(0, line.length() - 1)
}

Path line examples:

formLine(3)
(x1)-[]->(x2);(x2)-[]->(x3)

formLine(4)
(x1)-[]->(x2);(x2)-[]->(x3);(x3)-[]->(x4)

formLine(5)
(x1)-[]->(x2);(x2)-[]->(x3);(x3)-[]->(x4);(x4)-[]->(x5)

'Find' result examples:

val path=graphInsightLow.find(formLine(3))
display(path)
x1,x2,x3
"{""id"":""answers""}","{""id"":""abandon""}","{""id"":""ideas""}"
"{""id"":""according""}","{""id"":""internal""}","{""id"":""goals""}"
"{""id"":""apparent""}","{""id"":""effort""}","{""id"":""takes""}"
"{""id"":""apparent""}","{""id"":""effort""}","{""id"":""achievement""}"

path.printSchema
root
 |-- x1: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |-- x2: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |-- x3: struct (nullable = false)
 |    |-- id: string (nullable = true)

val path=graphInsightLow.find(formLine(4))
x1,x2,x3,x4
"{""id"":""according""}","{""id"":""internal""}","{""id"":""goals""}","{""id"":""chapter""}"
"{""id"":""association""}","{""id"":""brain""}","{""id"":""require""}","{""id"":""spreading""}"
"{""id"":""association""}","{""id"":""brain""}","{""id"":""dominant""}","{""id"":""problem""}"
"{""id"":""according""}","{""id"":""internal""}","{""id"":""goals""}","{""id"":""competitiveness""}"
"{""id"":""according""}","{""id"":""internal""}","{""id"":""goals""}","{""id"":""everyone""}"

path.printSchema
root
 |-- x1: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |-- x2: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |-- x3: struct (nullable = false)
 |    |-- id: string (nullable = true)
 |-- x4: struct (nullable = false)
 |    |-- id: string (nullable = true)

To change the structure of path first we will get a string of words:

val path1=graphInsightLow.find(formLine(4))
val path2=path1.map(s=>(s.toString)).
  map(s=>(s.replaceAll("\\[",""))).
  map(s=>(s.replaceAll("\\]","")))

display(path2)
value
"according,internal,goals,competitiveness"
"adult,returns,child,interest"
"adult,returns,child,ability"

Next we will transform a string of words to array of words:

val path3=path2.map(s=>(s.split(","))).map(s=>(s.toArray))
display(path3)
value
"[""according"",""internal"",""goals"",""competitiveness""]"
"[""adult"",""returns"",""child"",""interest""]"
"[""adult"",""returns"",""child"",""ability""]"

Define the Line between Specific Words

Now we can specify words for beginning and end of the path.

val word1="insight"
val word2="brain"
val steps=4
val path4=path3.filter(s=>(s(0).toString==word1)).
  filter(s=>(s(steps-1).toString==word2))

display(path4)
value
"[""insight"",""prefrontal"",""activity"",""brain""]"
"[""insight"",""requires"",""different"",""brain""]"
"[""insight"",""depends"",""higher"",""brain""]"
"[""insight"",""depends"",""various"",""brain""]"
"[""insight"",""brain"",""activity"",""brain""]"
"[""insight"",""brain"",""epiphany"",""brain""]"
"[""insight"",""brain"",""creations"",""brain""]"
"[""insight"",""brain"",""active"",""brain""]"
"[""insight"",""believes"",""different"",""brain""]"
"[""insight"",""looking"",""specific"",""brain""]"

We can see that in some lines the same word appears several times, for example:

"[""insight"",""brain"",""activity"",""brain""]""

Clean Word Duplicates

Explode words within line:

import org.apache.spark.sql.functions.explode
val path5=path4.rdd.zipWithIndex.toDF("words","id").
  withColumn("word",explode($"words")).distinct

display(path5.orderBy("id"))
words,id,word
"[""insight"",""prefrontal"",""activity"",""brain""]",0,prefrontal
"[""insight"",""prefrontal"",""activity"",""brain""]",0,insight
"[""insight"",""prefrontal"",""activity"",""brain""]",0,activity
"[""insight"",""prefrontal"",""activity"",""brain""]",0,brain
"[""insight"",""along"",""without"",""brain""]",3,along
"[""insight"",""along"",""without"",""brain""]",3,without
"[""insight"",""along"",""without"",""brain""]",3,brain
"[""insight"",""along"",""without"",""brain""]",3,insight
"[""insight"",""brain"",""connections"",""brain""]",33,connections
"[""insight"",""brain"",""connections"",""brain""]",33,insight
"[""insight"",""brain"",""connections"",""brain""]",33,brain

Counts of words within each line:

import org.apache.spark.sql.expressions.Window
val partitionWindow = Window.partitionBy($"id")
val path6 = path5.withColumn("wordCount", count($"word").
  over(partitionWindow))

display(path6)
words,id,word
"[""insight"",""prefrontal"",""activity"",""brain""]",0,activity,4
"[""insight"",""prefrontal"",""activity"",""brain""]",0,prefrontal,4
"[""insight"",""prefrontal"",""activity"",""brain""]",0,brain,4
"[""insight"",""prefrontal"",""activity"",""brain""]",0,insight,4
"[""insight"",""along"",""without"",""brain""]",3,without,4
"[""insight"",""along"",""without"",""brain""]",3,insight,4
"[""insight"",""along"",""without"",""brain""]",3,brain,4
"[""insight"",""along"",""without"",""brain""]",3,along,4
"[""insight"",""brain"",""connections"",""brain""]",33,connections,3
"[""insight"",""brain"",""connections"",""brain""]",33,brain,3
"[""insight"",""brain"",""connections"",""brain""]",33,insight,3

To exclude line with word duplicates we will filter out lines with count of words smaller then 'steps' - the number of words parameter. If we are calculating circles (like 'brain' -> * -> 'brain' ) we will filter out lines with count of words smaller then 'steps-1':

val result=  
if (word1!=word2)path6.filter('wordCount===steps).
  select("words","id").distinct
else path6.filter('wordCount===steps-1).
  select("words","id").distinct

Function to Find Word Lines

Now we can combine all scripts into function:

import org.apache.spark.sql.DataFrame
def findForm(graph: GraphFrame, word1: String,
    word2: String, steps: Int): DataFrame ={
  val path1=graph.find(formLine(steps))
  val path2=path1.map(s=>(s.toString)).
    map(s=>(s.replaceAll("\\[",""))).
    map(s=>(s.replaceAll("\\]","")))
  val path3=path2.
    map(s=>(s.split(","))).
    map(s=>(s.toArray))
  val path4=path3.
    filter(s=>(s(0).toString==word1)).
    filter(s=>(s(steps-1).toString==word2))
  val path5=path4.rdd.zipWithIndex.
    toDF("words","id").
    withColumn("word",explode($"words")).distinct
  val partitionWindow = Window.partitionBy($"id")
  val path6 = path5.withColumn("wordCount",count($"word").
    over(partitionWindow))
  if (word1!=word2)path6.
    filter('wordCount===steps).
    select("words","id").distinct
  else path6.filter('wordCount===steps-1).
    select("words","id").distinct
}

Examples

Line: 'insight' -> (1 word) -> 'brain'

display(findForm(graphInsightLow,"brain","insight",3))
words
"[""brain"",""right"",""insight""]"
"[""brain"",""activity"",""insight""]"
"[""brain"",""moments"",""insight""]"

Line: 'insight' -> (2 words) -> 'brain' - some examples

display(findForm(graphInsightLow,"brain","insight",4))
words
"[""brain"",""require"",""spontaneous"",""insight""]"
"[""brain"",""thought"",""called"",""insight""]"
"[""brain"",""dominant"",""problem"",""insight""]"
"[""brain"",""response"",""associated"",""insight""]"

Line: 'insight' -> (3 words) -> 'brain'

display(findForm(graphInsightLow,"brain","insight",5))
words
"[""brain"",""require"",""neural"",""activity"",""insight""]"
"[""brain"",""dominant"",""problem"",""called"",""insight""]"
"[""brain"",""complex"",""drawing"",""problem"",""insight""]"
"[""brain"",""functions"",""creator"",""ideas"",""insight""]"
"[""brain"",""thinking"",""likely"",""solve"",""insight""]"

Circle: 'insight' -> (3 words) -> 'insight'

display(findForm(graphInsightLow,"insight","insight",5))
words
"[""insight"",""requires"",""prefrontal"",""activity"",""insight""]"
"[""insight"",""increase"",""creative"",""moment"",""insight""]"
"[""insight"",""analytically"",""showed"",""associated"",""insight""]"
"[""insight"",""revealed"",""surprising"",""power"",""insight""]"
"[""insight"",""brain"",""activity"",""right"",""insight""]"
"[""insight"",""positive"",""biases"",""conscious"",""insight""]"

Circle: 'brain' -> (2 words) -> 'brain'

display(findForm(graphInsightLow,"brain","brain",4))
words
"[""brain"",""right"",""hemisphere"",""brain""]"
"[""brain"",""develop"",""human"",""brain""]"
"[""brain"",""thinking"",""active"",""brain""]"
"[""brain"",""developed"",""different"",""brain""]"
"[""brain"",""involved"",""executive"",""brain""]"

Line: 'insight' -> (3 words) -> 'association'

display(findForm(graphInsightLow,"insight","association",5))
words
"[""insight"",""using"",""different"",""sudden"",""association""]"
"[""insight"",""increase"",""creative"",""process"",""association""]"
"[""insight"",""combined"",""social"",""psychological"",""association""]"
"[""insight"",""sometimes"",""analytically"",""sudden"",""association""]"
"[""insight"",""prefrontal"",""circuits"",""process"",""association""]"

Circle: 'associated' -> (3 words) -> 'associated'

display(findForm(graphInsightLow,"associated","associated",5))
words
"[""associated"",""different"",""prefrontal"",""activity"",""associated""]"
"[""associated"",""musical"",""ability"",""children"",""associated""]"
"[""associated"",""mental"",""processing"",""music"",""associated""]"
"[""associated"",""specific"",""brain"",""activity"",""associated""]"
"[""associated"",""sustained"",""aerobic"",""activity"",""associated""]"
"[""associated"",""greater"",""knowledge"",""children"",""associated""]"

Line: 'association' -> (4 words) -> 'insight'

display(findForm(graphInsightLow,"association","insight",6))
words
"[""association"",""useful"",""elegant"",""ideas"",""without"",""insight""]"
"[""association"",""brain"",""great"",""knowledge"",""shows"",""insight""]"
"[""association"",""cortex"",""given"",""creative"",""moment"",""insight""]"
"[""association"",""likely"",""solve"",""verbal"",""problem"",""insight""]"
Post Sample Image

Next Post - Word2Vec2Graph The Web Conf Presentation

In the next post we will talk about Word2Vec2Graph presentation on The Web Conference in Lyon, France.