### Link Prediction for Knowledge Graphs

In our previous post *'Knowledge Graph for Data Mining'* we discussed knowledge graph building and mining techniques. These techniques were presented in 2020 at the "Machine Learning and Knowledge Graphs" workshop of the DEXA conference and published as the paper *'Building Knowledge Graph in Spark without SPARQL'*.

### Introduction: Knowledge Graphs Exploration

In recent years knowledge graphs have become more and more popular for data mining. The DEXA conference is well known for data mining, and in 2020 it organized the first "Machine Learning and Knowledge Graphs" workshop. In that workshop we presented a *paper* where we showed how to build a knowledge graph in Spark without SPARQL and how, conceptually, a knowledge graph builds a bridge between logical thinking and graph thinking for data mining. As a data source for that study we used data about paintings of several artists from the MoMA collection, taken from the Kaggle dataset *'Museum of Modern Art Collection'*. Through the knowledge graph we explored how artists were connected and how they influenced each other.

In that study we explored the knowledge graph using Spark DataFrames library techniques and found previously unknown connections between artists and between modern art movements. In this post we will use as a data source Wikipedia text about the same 20 artists that we used in the previous study, and we will investigate semantic connections between the artists through a GNN link prediction model.

### Methods

To find connections between the artists we will do the following:

- Build a graph with artist names and Wikipedia text as nodes, and connections between artist names and the corresponding Wikipedia articles as edges.
- Embed node text to vectors with a transformer model.
- Analyze the cosine similarity matrix for the transformer-embedded nodes and add graph edges for artist pairs with high cosine similarities.
- Run a GNN link prediction model on top of this graph.

#### Building Graph

To prepare the data and build the graph we will use the following steps:

- Tokenize the Wikipedia text to compare the artists' Wikipedia pages by size distribution.
- Define nodes as artist names and Wikipedia articles.
- Define edges as pairs of artist names and the corresponding articles.
- Build a knowledge graph on those nodes and edges.
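
The node and edge definitions above can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the helper name `build_graph` and the indexing scheme (artist `i` becomes node `i`, the matching article becomes node `n + i`) are hypothetical, and the article texts are placeholders.

```python
def build_graph(artists, wiki_texts):
    """Return (nodes, edges) for artist-name nodes and article-text nodes."""
    if len(artists) != len(wiki_texts):
        raise ValueError("each artist needs exactly one article")
    n = len(artists)
    # Node i holds the artist name; node n + i holds that artist's article text.
    nodes = list(artists) + list(wiki_texts)
    # One edge per artist, linking the name node to its article node.
    edges = [(i, n + i) for i in range(n)]
    return nodes, edges


artists = ["Vincent van Gogh", "Franz Marc"]
wiki_texts = ["<article text 1>", "<article text 2>"]  # placeholder texts
nodes, edges = build_graph(artists, wiki_texts)
print(edges)  # [(0, 2), (1, 3)]
```

Keeping names and articles in one node list means a single index space can be shared by the embedding step and the edge list.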

#### Transform Text to Vectors

To translate text to vectors we used the *'all-MiniLM-L6-v2'* model from Hugging Face. This is a sentence-transformers model that maps text to a 384-dimensional dense vector space.
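
A minimal sketch of the embedding step, assuming the `sentence-transformers` package is installed. The wrapper `embed_texts` and its optional `model` argument are our own additions for illustration; the only library calls used are `SentenceTransformer(...)` and its `encode` method.

```python
def embed_texts(texts, model=None, model_name="all-MiniLM-L6-v2"):
    """Encode texts into dense vectors, one row per text."""
    if model is None:
        # Imported here so the function can also be exercised with a stand-in model.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer(model_name)
    return model.encode(list(texts))


# Usage (downloads the model on first call):
# vectors = embed_texts(["Vincent van Gogh", "Franz Marc"])
# vectors.shape  # (2, 384)
```

Sentence-transformers models truncate input that exceeds their maximum sequence length, so for a long article only the beginning of the text contributes to the vector.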

There are two advantages of embedding text nodes:

- Vectors generated by transformers can be used for GNN link prediction model as node features
- Based on highly connected vector pairs additional graph edges can be generated.
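
The second point can be sketched as a threshold on the cosine similarity matrix. The function name `high_similarity_pairs` and the toy matrix are assumptions made for illustration; the 0.6 default mirrors the threshold used later in this post.

```python
import numpy as np


def high_similarity_pairs(sim, threshold=0.6):
    """Return (i, j, score) for every i < j with sim[i, j] > threshold."""
    sim = np.asarray(sim, dtype=float)
    # Upper triangle without the diagonal: each unordered pair appears once.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    keep = sim[rows, cols] > threshold
    return [(int(i), int(j), float(sim[i, j]))
            for i, j in zip(rows[keep], cols[keep])]


# Toy similarity matrix for three nodes (made-up values).
sim = np.array([[1.00, 0.72, 0.10],
                [0.72, 1.00, 0.65],
                [0.10, 0.65, 1.00]])
print(high_similarity_pairs(sim))  # [(0, 1, 0.72), (1, 2, 0.65)]
```

Restricting the search to the upper triangle skips self-pairs and avoids adding the same edge twice.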

#### Run GNN Link Prediction Model

For GNN link prediction we used a model from the Deep Graph Library (DGL). The model is built on two GraphSAGE layers and computes node representations by averaging neighbor information.
We used the code provided by the DGL tutorial *DGL Link Prediction using Graph Neural Networks*.

The output of this code is a set of node embeddings that can be used for further analysis such as node classification, k-means clustering, link prediction and so on. In this study we used them for link prediction by estimating cosine similarities between embedded nodes.
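
To make "averaging neighbor information" concrete, here is one mean-aggregation layer written in plain NumPy. It is an illustration only and not DGL's implementation: the name `sage_mean_layer` and the explicit weight matrices `w_self` and `w_neigh` are assumptions of this sketch, and bias terms, nonlinearity and training are left out.

```python
import numpy as np


def sage_mean_layer(features, edges, w_self, w_neigh):
    """One GraphSAGE-style layer: own features plus the mean of neighbor features."""
    features = np.asarray(features, dtype=float)
    n = features.shape[0]
    adj = np.zeros((n, n))
    for u, v in edges:  # edges are treated as undirected
        adj[u, v] = 1.0
        adj[v, u] = 1.0
    degree = adj.sum(axis=1, keepdims=True)
    # Mean of neighbor features; isolated nodes get a zero vector.
    neigh_mean = (adj @ features) / np.maximum(degree, 1.0)
    return features @ w_self + neigh_mean @ w_neigh


features = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edges = [(0, 1)]  # node 2 has no neighbors
identity = np.eye(2)
out = sage_mean_layer(features, edges, identity, identity)
```

Stacking two such layers with a nonlinearity in between gives a two-layer model of the kind described above, in which each node's representation depends on its two-hop neighborhood.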

#### Find Connections

To measure how similar the vectors are to each other we will do the following:

- Calculate the cosine similarity matrix.
- Show examples of highly connected and weakly connected node pairs.

Cosine Similarities function:
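
A minimal NumPy sketch of such a function is given below. The name `cosine_similarity_matrix` and the guard against zero-length vectors are assumptions of this sketch.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Pairwise cosine similarities between the rows of `vectors`."""
    vectors = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    # Normalize rows to unit length; the floor avoids division by zero.
    unit = vectors / np.maximum(norms, 1e-12)
    return unit @ unit.T


vecs = np.array([[1.0, 0.0], [0.0, 2.0], [-3.0, 0.0]])
cos = cosine_similarity_matrix(vecs)
# cos[0, 1] is 0 (orthogonal vectors), cos[0, 2] is -1 (opposite directions)
```

Entry `[i, j]` of the result is the cosine of the angle between vectors `i` and `j`, so the matrix is symmetric with ones on the diagonal for non-zero vectors.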

### Experiments

#### Data Source Analysis

As a data source we used text from Wikipedia articles about the same 20 artists that we used in the previous study, *"Building Knowledge Graph in Spark without SPARQL"*. In that study the code was written in Scala Spark; in this study it is written in Python. As an environment we used Google Colab and Google Drive.

To estimate the size distribution of Wikipedia text data we tokenized the text and exploded the tokens:
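
A standard-library sketch of this step is shown below. The regex tokenizer, the helper names `tokenize` and `explode_tokens`, and the sample texts are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a simple regex stands in for a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def explode_tokens(articles):
    """One (artist, token) row per word, like exploding a column of token lists."""
    return [(artist, token)
            for artist, text in articles.items()
            for token in tokenize(text)]


# Placeholder articles, not real Wikipedia text.
articles = {
    "Artist A": "Painter and printmaker. Painter of landscapes.",
    "Artist B": "A painter.",
}
rows = explode_tokens(articles)
counts = Counter(artist for artist, _ in rows)
print(counts.most_common())  # [('Artist A', 6), ('Artist B', 2)]
```

Counting the exploded rows per artist gives the article sizes that the distribution is built from.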

Here is the distribution of the number of words in the Wikipedia articles about the artists:

Judging by the size of the Wikipedia text, the best-known artist in our list is Vincent van Gogh, who has the longest article, and the least-known is Franz Marc, who has the shortest.

#### Building Graph

We indexed the data, defined nodes as artist names ('Artist' column) and Wikipedia article text ('Wiki' column), and defined edges as index pairs of nodes.

#### Transform Text to Vectors

For text to vector translation we used the 'all-MiniLM-L6-v2' model from Hugging Face and saved the resulting node data to Google Drive.

#### Add Edges to the Knowledge Graph

To decide which edges should be added to the knowledge graph we analyzed the cosine similarity matrix and selected pairs of vectors with high cosine similarities. Specifically, we added edges for artist pairs with cosine similarities greater than 0.6.

Figure: graph of artist pairs with cosine similarities > 0.6.

The selected edges were appended to the list of graph edges and saved to Google Drive.

#### Run GNN Link Prediction Model

As the Graph Neural Networks (GNN) link prediction model we used a model from the Deep Graph Library (DGL). The model code was provided by the DGL tutorial, so we only had to transform the node and edge data from our data format to the DGL data format.

We read the embedded nodes and edges from Google Drive and converted the data to DGL format.
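
A sketch of this conversion, assuming `dgl` and `torch` are installed. The helper names `split_edges` and `to_dgl_graph` are hypothetical, as is the choice to store each edge in both directions; the node feature key `'feat'` follows the DGL tutorial's convention.

```python
def split_edges(edges):
    """Turn [(u, v), ...] into parallel source and destination lists."""
    src = [u for u, _ in edges]
    dst = [v for _, v in edges]
    return src, dst


def to_dgl_graph(edges, features):
    """Build a DGL graph with node features stored in ndata['feat']."""
    # Imported inside the function so split_edges works without DGL installed.
    import dgl
    import torch

    src, dst = split_edges(edges)
    g = dgl.graph((torch.tensor(src), torch.tensor(dst)),
                  num_nodes=len(features))
    g = dgl.add_reverse_edges(g)  # store each edge in both directions
    g.ndata["feat"] = torch.tensor(features, dtype=torch.float32)
    return g


# Usage, with `edges` as index pairs and `vectors` as the embedded nodes:
# g = to_dgl_graph(edges, vectors)
```

Passing `num_nodes` explicitly keeps nodes without any edges in the graph, which matters when the feature matrix has one row per node.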

We then defined the model, loss function, and evaluation metric. To evaluate the results we used the Area Under the Curve (AUC) metric, which was about 90 percent.

#### Knowledge Graph with Predicted Edges

To calculate predicted edges, we first looked at the cosine similarity matrix for pairs of nodes embedded by the GNN link prediction model. Then we combined the edges and their cosine similarity scores with the corresponding artist names by index, starting from a table of edges with cosine similarity scores and a list of artist names.

Join edge cosine similarity scores with artist names by indexes:
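
A plain-Python sketch of this join follows. The function name `join_scores_with_names`, the row layout, and the sample scores are made up for illustration, and we assume each edge index points into the list of artist names. The sketch also builds the DOT-language 'line' string that is used for graph visualization in the next step.

```python
def join_scores_with_names(scored_edges, names):
    """Attach artist names to (i, j, score) edges, highest score first."""
    rows = []
    for i, j, score in scored_edges:
        rows.append({
            "artist1": names[i],
            "artist2": names[j],
            "score": score,
            # Undirected edge in DOT syntax.
            "line": f'"{names[i]}" -- "{names[j]}";',
        })
    return sorted(rows, key=lambda row: row["score"], reverse=True)


names = ["Artist A", "Artist B", "Artist C"]                 # placeholder names
scored_edges = [(0, 1, 0.31), (0, 2, 0.74), (1, 2, -0.52)]  # made-up scores
for row in join_scores_with_names(scored_edges, names):
    print(row["line"], row["score"])
```

Sorting by score puts the most similar pairs at the top, so thresholds such as 0.6 or -0.5 become simple cuts on the sorted list.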

For graph visualization in the Gephi tool we added a 'line' column with artist pairs in the DOT language. The following examples show graphs of artists with high cosine similarities and with low cosine similarities.

Pairs of artists with high cosine similarities (higher than 0.6):

- Example 1: artist pairs with cosine similarities > 0.6
- Example 2: artist pairs with cosine similarities > 0.7

Pairs of artists with low cosine similarities (less than -0.5):

- Example 3: artist pairs with cosine similarities < -0.5

### Conclusion

In this post we demonstrated how to use transformers and GNN link prediction to rewire knowledge graphs.

- Through transformers we mapped Wikipedia articles to vectors and added pairs of highly connected artists as edges to the knowledge graph.
- On top of the rewired knowledge graph we ran a GNN link prediction model.
- We used cosine similarities between GNN-embedded nodes to estimate predicted edges of the knowledge graph.
- We demonstrated how to apply these techniques to find pairs of artists that are highly connected or weakly connected.