Free Associations - Sparkling Data Ocean

Finding Free Associations

Free association is a psychoanalytic technique that was developed by Sigmund Freud and is still used by some therapists today. Patients relate whatever thoughts come to mind so that the therapist can learn more about how the patient thinks and feels. As Freud described it: "The importance of free association is that the patients spoke for themselves, rather than repeating the ideas of the analyst; they work through their own material, rather than parroting another's suggestions."

In one of our previous posts - "Word2Vec2Graph - Psychoanalysis Topics" - we showed how to find free associations using the Word2Vec2Graph technique. In this post we will show a different method: unsupervised Convolutional Neural Network (CNN) classification. As the text file we will use data about psychoanalysis taken from Wikipedia.

Word Pair Classification - Step by Step

We will convert word pairs to vectors, then convert the vectors to images, then classify the images with a CNN. To transform pairs of words to images we will use the method described in Ignacio Oguiza's notebook Time series - Olive oil country. The technique in this post differs from the one in our previous post and consists of the following steps:

Read text file, tokenize, remove stop words
Transform text file to pairs of words that appear next to each other in the text
Read trained Word2Vec model and map words to vectors
Concatenate word vectors with themselves reversing the second vector: {word1, word1} pairs will generate symmetrical (mirror) sequences of numbers. Label these sequences as "Same".
Concatenate word vectors of pairs {word1, word2} reversing the word2 vector. Label these sequences as "Different".
Randomly select a subset of "Different" pairs.
Convert vectors to images and run CNN classification model.
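
The concatenation steps above can be sketched in a few lines of plain Python. This is only an illustration: the function names `mirror_concat` and `is_mirror` and the 4-dimensional toy vectors are made up for the example and do not come from a real Word2Vec model.

```python
def mirror_concat(first, second):
    """Concatenate `first` with `second` reversed."""
    return list(first) + list(reversed(second))


def is_mirror(sequence):
    """True when the sequence reads the same in both directions."""
    return list(sequence) == list(reversed(sequence))


# Toy 4-dimensional "word vectors" (made-up numbers, not from a trained model).
vec_a = [0.1, -0.3, 0.7, 0.2]
vec_b = [0.5, 0.0, -0.4, 0.9]

same = mirror_concat(vec_a, vec_a)  # {word1, word1} -> label "Same"
diff = mirror_concat(vec_a, vec_b)  # {word1, word2} -> label "Different"

print(same)
print(is_mirror(same), is_mirror(diff))
```

A {word1, word1} sequence is always a palindrome, while a {word1, word2} sequence is a palindrome only if both vectors are identical. This is what gives the classifier its two categories.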

Unsupervised Image Classification

In short, we concatenate pairs of vectors, transform the concatenated vectors to images and classify the images. The CNN compares "Same" (mirror) images with "Different" (non-mirror) images. "Different" images that look similar to mirror images represent pairs of similar words, i.e. common associations. Images that are very different from mirror images represent pairs of words that are not expected to occur together, i.e. the "free associations" that psychoanalysis is looking for.

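This technique allows us to do unsupervised CNN classification: the "Same" and "Different" labels are generated from the data itself, so no manual labeling is needed. Of course, this method is not limited to word pair classification. In particular, it can be applied to unsupervised outlier detection.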

For example, we can take stock price time series, concatenate each time series (TS) vector with its own reverse and get 'mirror' vectors and images. Then we can concatenate TS vectors with reversed market index vectors (such as the S&P 500) and convert them to images. The CNN classifier will find {TS vector, S&P 500 vector} images that are very different from mirror images. These images will represent stock price outliers.
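
As a rough sketch of how such a data set could be assembled, the Python below builds "same" and "diff" rows from two invented price series and an invented index. The tickers, prices, the min-max scaling step and the `mirror_gap` score are all assumptions made for illustration. In particular, `mirror_gap` is a hand-made stand-in that only measures how far a sequence is from a perfect mirror. It is not the CNN classifier.

```python
def rescale(series):
    """Min-max scale a series to [-1, 1] so stocks and index are comparable."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]


def mirror_concat(first, second):
    """Concatenate `first` with `second` reversed."""
    return list(first) + list(reversed(second))


def mirror_gap(sequence):
    """Mean absolute difference between a sequence and its reverse."""
    rev = list(reversed(sequence))
    return sum(abs(a - b) for a, b in zip(sequence, rev)) / len(sequence)


# Invented numbers, for illustration only.
index = [100, 101, 103, 102, 105, 107]
stocks = {
    "TRACKER": [50, 50.5, 51.5, 51, 52.5, 53.5],  # moves with the index
    "OUTLIER": [80, 78, 75, 77, 72, 70],          # moves against the index
}

index_scaled = rescale(index)
rows = []
for ticker, prices in stocks.items():
    scaled = rescale(prices)
    rows.append((ticker + "~" + ticker, "same", mirror_concat(scaled, scaled)))
    rows.append((ticker + "~INDEX", "diff", mirror_concat(scaled, index_scaled)))

gaps = {name: mirror_gap(seq) for name, label, seq in rows}
for name, label, seq in rows:
    print(name, label, round(gaps[name], 3))
```

In this toy data the stock that follows the index produces an almost perfect mirror, while the stock that moves against it does not. That is the kind of difference the image classifier is expected to pick up.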

Read and Clean Text File

Read text file, tokenize it and remove stop words:

import org.apache.spark.ml._
import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import org.graphframes.GraphFrame

val inputPsychoanalysis=sc.textFile("/FileStore/tables/psychoanalisys1.txt").
   toDF("charLine")

val tokenizer = new RegexTokenizer().
   setInputCol("charLine").
   setOutputCol("value").
   setPattern("[^a-z]+").
   setMinTokenLength(5).
   setGaps(true)

val tokenizedPsychoanalysis = tokenizer.
   transform(inputPsychoanalysis)

val remover = new StopWordsRemover().
   setInputCol("value").
   setOutputCol("stopWordFree")

val removedStopWordsPsychoanalysis = remover.
   setStopWords(Array("none","also","nope","null")++
   remover.getStopWords).
   transform(tokenizedPsychoanalysis)
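
To make the effect of these settings concrete, here is a small plain-Python approximation of the same cleaning logic: lowercase the line, split on anything that is not a letter, keep tokens of at least 5 characters and drop stop words. The function name `clean_tokens`, the sample sentence and the short `BASE_STOP_WORDS` set are invented for the example. Spark's default English stop-word list is much longer.

```python
import re

# The extra stop words added in the Spark code above.
EXTRA_STOP_WORDS = {"none", "also", "nope", "null"}
# A tiny illustrative stand-in for Spark's default stop-word list.
BASE_STOP_WORDS = {"about", "which", "their", "there", "these", "would"}


def clean_tokens(line, min_length=5):
    """Lowercase, split on non-letters, drop short tokens and stop words."""
    tokens = re.split(r"[^a-z]+", line.lower())
    stop_words = BASE_STOP_WORDS | EXTRA_STOP_WORDS
    return [t for t in tokens if len(t) >= min_length and t not in stop_words]


cleaned = clean_tokens("Freud's patients spoke about their dreams, thoughts and feelings.")
print(cleaned)
```

Note that the minimum token length of 5 already removes most short function words before the stop-word filter runs.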

Get Pairs of Words

Get pairs of words from the text, then explode the ngrams:

val ngram = new NGram().
   setInputCol("stopWordFree").
   setOutputCol("ngrams").
   setN(2)

val ngramCleanWords = ngram.
   transform(removedStopWordsPsychoanalysis)

val slpitNgrams=ngramCleanWords.
   withColumn("ngram",explode($"ngrams")).
   select("ngram").
   map(s=>(s(0).toString,
      s(0).toString.split(" ")(0),
      s(0).toString.split(" ")(1))).
   toDF("ngram","ngram1","ngram2").
   filter('ngram1=!='ngram2)
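
The same pairing logic, sketched in plain Python for a single list of cleaned tokens. The function name `adjacent_pairs` and the token list are invented for the example. As in the Spark code, pairs are formed only from neighbouring tokens of the same line, and pairs of identical words are dropped.

```python
def adjacent_pairs(tokens):
    """Return (ngram, word1, word2) for neighbouring tokens that differ."""
    pairs = []
    for first, second in zip(tokens, tokens[1:]):
        if first != second:
            pairs.append((first + " " + second, first, second))
    return pairs


tokens = ["neurotic", "symptoms", "symptoms", "theory", "published"]
pairs = adjacent_pairs(tokens)
for pair in pairs:
    print(pair)
```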

Vectors for Pairs of Words

Read trained Word2Vec model:

val word2vec= new Word2Vec().
   setInputCol("value").
   setOutputCol("result")

val modelNewsBrain=Word2VecModel.
   read.
   load("w2VmodelNewsBrain")

val modelWordsPsychoanalysis=modelNewsBrain.
   getVectors.
   select("word","vector")

Map words of word pairs to Word2Vec model and get sets: {word1, vector1, word2, vector2}:

val ngramW2V=slpitNgrams.
   join(modelWordsPsychoanalysis,'ngram1==='word).
   join(modelWordsPsychoanalysis.toDF("word2","vector2"),'ngram2==='word2).
   select("ngram1","vector","ngram2","vector2").
   toDF("ngram1","vector1","ngram2","vector2").
   distinct

Get single words with vectors from word pairs: {word1, vector1}:

val ngram1W2V=ngramW2V.select("ngram1","vector1").
   union(ngramW2V.select("ngram2","vector2")).
   distinct.toDF("word","vector")
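
A minimal Python sketch of what the two joins and the union do, using a dictionary as a stand-in for the Word2Vec model. The function names and the 2-dimensional toy vectors are assumptions for illustration only. The point to notice is that an inner join silently drops every pair in which either word has no vector in the model.

```python
def pairs_with_vectors(pairs, vectors):
    """Keep distinct pairs whose words both have a vector (inner-join behaviour)."""
    seen = set()
    result = []
    for first, second in pairs:
        if first in vectors and second in vectors and (first, second) not in seen:
            seen.add((first, second))
            result.append((first, vectors[first], second, vectors[second]))
    return result


def single_words(pair_rows):
    """Collect the distinct words (with vectors) that occur in any pair."""
    words = {}
    for first, vec1, second, vec2 in pair_rows:
        words[first] = vec1
        words[second] = vec2
    return words


# Toy "model": made-up 2-dimensional vectors.
toy_vectors = {
    "thoughts": [0.2, 0.8],
    "feelings": [0.3, 0.7],
    "theory": [-0.5, 0.1],
}
toy_pairs = [
    ("thoughts", "feelings"),
    ("thoughts", "feelings"),   # duplicate, removed like `distinct`
    ("theory", "published"),    # "published" has no vector, so the pair is dropped
]

pair_rows = pairs_with_vectors(toy_pairs, toy_vectors)
words = single_words(pair_rows)
print(pair_rows)
print(sorted(words))
```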

Combine Vectors of Word Pairs

Combine vectors from word pairs {word1, word2} reversing the second vector.

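// Label each pair "word1~word2" and build vector1 followed by reversed vector2,
// then expand the array into columns col0..col199 and tag the row as "diff".
val arrayDFdiff = ngramW2V.rdd.map(x => (x.getAs[String](0) + "~" + x.getAs[String](2),
   x.getAs[Vector](1).toArray ++ x.getAs[Vector](3).toArray.reverse)).
   toDF("word","array").
   select(col("word") +: (0 until 200).map(i =>
   col("array")(i).alias(s"col$i")): _*).withColumn("pairType",lit("diff"))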

Combine vectors from single words with themselves reversing the second vector.

// Same construction for {word, word}: the vector followed by its own reverse,
// which yields a mirror sequence tagged as "same".
val arrayDFsame = ngram1W2V.rdd.map(x => (x.getAs[String](0) + "~" + x.getAs[String](0),
   x.getAs[Vector](1).toArray ++ x.getAs[Vector](1).toArray.reverse)).
   toDF("word","array").
   select(col("word") +: (0 until 200).map(i =>
   col("array")(i).alias(s"col$i")): _*).withColumn("pairType",lit("same"))
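
The step that exports these data frames to the words.csv file read in the CNN section is not shown in this post. Assuming the file simply keeps the column order built above (word, col0 ... col199, pairType), its layout can be sketched with Python's standard csv module. The helper names `header` and `to_csv` and the 4-column toy rows are hypothetical.

```python
import csv
import io


def header(width):
    """Column names: word, col0..col<width-1>, pairType."""
    return ["word"] + ["col%d" % i for i in range(width)] + ["pairType"]


def to_csv(rows, width):
    """Write (word_pair, array, pair_type) rows in the assumed layout."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(header(width))
    for word_pair, array, pair_type in rows:
        writer.writerow([word_pair] + list(array) + [pair_type])
    return buffer.getvalue()


# Toy rows with 4 numeric columns instead of 200 (made-up values).
rows = [
    ("element~element", [0.1, -0.3, -0.3, 0.1], "same"),
    ("source~basic", [0.1, -0.3, 0.9, -0.4], "diff"),
]
csv_text = to_csv(rows, width=4)
print(csv_text)

# With 200 numeric columns, 'word' is column 0 and 'pairType' is column 201.
full_header = header(200)
print(len(full_header), full_header[0], full_header[201])
```

Under this assumption, column 0 holds the word pair and column 201 holds the category, which is consistent with the column indexes used when the file is read back in Python.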

CNN Classification

To convert vectors to images and classify the images via CNN we used almost the same code that Ignacio Oguiza shared on the fast.ai forum: Time series - Olive oil country.

We split the source file into words = {pairType, word} and vectors. The 'pairType' column was used to define the "Same" or "Different" category for images, and the 'word' column to identify word pairs.

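import pandas as pd
import matplotlib.pyplot as plt
from pyts.image import GASF  # older pyts releases; newer ones expose GramianAngularField

a = pd.read_csv(PATH + 'words.csv', sep=',')
# Drop the 'word' (first) and 'pairType' (last) columns, keeping the 200 numeric columns
d=a.drop(a.columns[0], axis=1).drop(a.columns[201], axis=1)
fX=d.fillna(0).values
image_size = 200
gasf = GASF(image_size)
fX_gasf = gasf.fit_transform(fX)

# Keep 'word' and 'pairType' to name the image files
f = a.iloc[:, [0,201]]
# Image path for row i of the data: <PATH>/<pairType>/<word>.jpg
imgId = PATH + str(f['pairType'][i])+'/'+str(f['word'][i])+'.jpg'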
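
For readers who want to see what a Gramian Angular Summation Field does to a mirror sequence, here is a dependency-free Python sketch. It is not the pyts implementation: it skips the optional size reduction, and the function names and the 8-value toy sequence are made up for the example. Each value is rescaled to [-1, 1], converted to an angle with arccos, and the image entry at (i, j) is the cosine of the sum of the two angles.

```python
import math


def rescale(series):
    """Min-max scale a series to [-1, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]
    return [2.0 * (x - lo) / (hi - lo) - 1.0 for x in series]


def gasf_image(series):
    """Gramian Angular Summation Field: G[i][j] = cos(phi_i + phi_j)."""
    scaled = rescale(series)
    # Clamp to [-1, 1] to guard against rounding before taking arccos.
    phi = [math.acos(max(-1.0, min(1.0, v))) for v in scaled]
    return [[math.cos(a + b) for b in phi] for a in phi]


def is_point_symmetric(image, tol=1e-12):
    """True when the image is unchanged by a 180 degree rotation."""
    n = len(image)
    return all(
        abs(image[i][j] - image[n - 1 - i][n - 1 - j]) <= tol
        for i in range(n)
        for j in range(n)
    )


# Toy sequences (made-up numbers): a mirror and a non-mirror.
mirror = [0.1, -0.3, 0.7, 0.2, 0.2, 0.7, -0.3, 0.1]
non_mirror = [0.1, -0.3, 0.7, 0.2, 0.9, -0.4, 0.0, 0.5]

print(is_point_symmetric(gasf_image(mirror)))
print(is_point_symmetric(gasf_image(non_mirror)))
```

A mirror sequence produces an image that looks the same after a 180 degree rotation, while a sequence built from two different vectors generally does not. This is the visual difference between the "Same" and "Different" categories.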

After tuning the classification model we got about 96% accuracy. Here is the code to display results:

i=778
f['word'][i],f['pairType'][i]
plt.plot(fX[i])
plt.imshow(fX_gasf[i], cmap='rainbow', origin='lower')

Examples: "Mirror" Word Pairs

Word pair - 'explanations~explanations':

Word pair - 'requirements~requirements':

Word pair - 'element~element':

Examples: Pairs of Similar Words

Word pair - 'thoughts~feelings':

Word pair - 'source~basic':

Word pair - 'eventually~conclusion':

Examples: Unexpected Free Associations

Word pair - 'personality~development':

Word pair - 'societal~restrictions':

Word pair - 'contingents~accompanying':

Word pair - 'neurotic~symptoms':

Word pair - 'later~explicitly':

Word pair - 'theory~published':

Next Post - Associations and Deep Learning

In the next post we will take a deeper look at deep learning for data associations.