Link Prediction for Knowledge Graphs
In our previous post, 'Knowledge Graph for Data Mining', we discussed knowledge graph building and mining techniques. These techniques were presented in 2020 at the DEXA conference workshop "Machine Learning and Knowledge Graphs" and published as the paper 'Building Knowledge Graph in Spark without SPARQL'.
The goal of that study was to demonstrate that the knowledge graph area is much wider than the traditional semantic web SPARQL approach and that there are non-traditional ways to build and explore knowledge graphs. In that study we demonstrated how knowledge graph techniques can be implemented with the Spark GraphFrames library. In this study we will show other techniques that can be applied to creating and rewiring knowledge graphs. We will explore building knowledge graphs based on Wikipedia data and a Graph Neural Network (GNN) link prediction model. To compare the results of this study with the results of our previous study, we will use data about the same list of modern art artists.
Introduction: Knowledge Graphs Exploration
In recent years knowledge graphs have become increasingly popular for data mining. The DEXA conference is well known for data mining, and in 2020 it organized the first "Machine Learning and Knowledge Graphs" workshop. In that workshop we presented a paper where we showed how to build a knowledge graph in Spark without SPARQL and how, conceptually, a knowledge graph builds a bridge between logical thinking and graph thinking for data mining. As a data source for that study we used data about paintings of several artists from the MoMA collection, taken from the Kaggle dataset 'Museum of Modern Art Collection'. Through the knowledge graph we explored how artists were connected and how they influenced each other. In that study we explored the knowledge graph using Spark DataFrames library techniques and found previously unknown connections between artists and between modern art movements.
In this post we will use as a data source Wikipedia text about the same 20 artists that we used in the previous study, and we will investigate semantic connections between the artists through a GNN link prediction model.
Methods
To find connections between the artists we will do the following:
- Build a graph with artist names and Wikipedia text as nodes and connections between artist names and corresponding Wikipedia articles as edges.
- Embed node text to vectors with a transformer model.
- Analyze the cosine similarity matrix of the transformer-embedded nodes and add graph edges for artist pairs with high cosine similarities.
- Run a GNN link prediction model on top of this graph.
Building Graph
For data processing, model training and interpreting the results we will use the following steps:
- Tokenize Wikipedia text to compare artist Wikipedia pages by size distribution
- Define nodes as artist names and Wikipedia articles
- Define edges as pairs of artist names and corresponding articles
- Build a knowledge graph on those nodes and edges
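The node and edge definitions above can be sketched in plain Python. This is a minimal illustration, not the code used in the study: the `build_graph` helper and the placeholder article snippets are hypothetical, and the node layout (artist-name nodes first, article nodes second) is an assumption.

```python
def build_graph(articles):
    """Build node and edge lists from a {artist_name: wiki_text} mapping.

    Assumed layout: nodes 0..n-1 are artist names, nodes n..2n-1 are the
    matching Wikipedia articles, and each edge links a name to its article.
    """
    names = list(articles)
    n = len(names)
    nodes = [{"index": i, "kind": "artist", "text": name}
             for i, name in enumerate(names)]
    nodes += [{"index": n + i, "kind": "wiki", "text": articles[name]}
              for i, name in enumerate(names)]
    edges = [(i, n + i) for i in range(n)]
    return nodes, edges


# Toy input: placeholder snippets, not real Wikipedia text.
articles = {
    "Vincent van Gogh": "Placeholder article text about Vincent van Gogh.",
    "Franz Marc": "Placeholder article text about Franz Marc.",
}
nodes, edges = build_graph(articles)
print(edges)  # [(0, 2), (1, 3)]
```

Keeping edges as integer index pairs makes it straightforward to hand the same structure to a graph library later.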
Transform Text to Vectors
For text-to-vector translation we used the 'all-MiniLM-L6-v2' model from Hugging Face. This is a sentence-transformers model that maps text to a 384-dimensional dense vector space.
There are two advantages of embedding text nodes:
- Vectors generated by transformers can be used as node features for the GNN link prediction model.
- Additional graph edges can be generated from vector pairs with high similarity.
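A hedged sketch of the embedding step follows. It assumes the `sentence-transformers` package is installed; `load_encoder` and `embed_nodes` are hypothetical helper names, and normalizing the vectors to unit length is our assumption rather than something stated in the study. The toy encoder at the end exists only so the helper can be exercised without downloading the model.

```python
import numpy as np


def load_encoder(model_name="all-MiniLM-L6-v2"):
    """Load a sentence-transformers model (weights download on first use)."""
    from sentence_transformers import SentenceTransformer  # lazy import
    return SentenceTransformer(model_name)


def embed_nodes(texts, encoder):
    """Return one unit-length row vector per input text."""
    vectors = np.asarray(encoder.encode(list(texts)), dtype=np.float32)
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


class _ToyEncoder:
    """Offline stand-in used only to exercise embed_nodes without a download."""

    def encode(self, texts):
        return [[3.0, 4.0] for _ in texts]


toy = embed_nodes(["first text", "second text"], _ToyEncoder())
print(toy.shape)  # (2, 2)
```

With the real model, `embed_nodes(node_texts, load_encoder())` would be expected to return an array with 384 columns, one row per node.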
Run GNN Link Prediction Model
For GNN link prediction we used a model from the Deep Graph Library (DGL). The model is built on two GraphSAGE layers and computes node representations by averaging neighbor information. We used the code provided by the DGL tutorial 'Link Prediction using Graph Neural Networks'.
The output of this code is a set of embedded nodes that can be used for further analysis such as node classification, k-means clustering, link prediction and so on. In this study we used it for link prediction by estimating cosine similarities between embedded nodes.
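To make "averaging neighbor information" concrete, here is a small NumPy sketch of a GraphSAGE-style layer with a mean aggregator. It is an illustration only and not the DGL implementation: the function names are hypothetical, bias terms and training are omitted, and the weights in the toy check are identity matrices chosen so the arithmetic is easy to follow.

```python
import numpy as np


def sage_mean_layer(h, edges, w_self, w_neigh):
    """One GraphSAGE-style layer with a mean aggregator.

    h: (n, d_in) node features; edges: iterable of (src, dst) pairs, where
    a message flows from src to dst; w_self, w_neigh: (d_in, d_out) weights.
    """
    n = h.shape[0]
    summed = np.zeros_like(h, dtype=float)
    degree = np.zeros(n)
    for src, dst in edges:
        summed[dst] += h[src]
        degree[dst] += 1
    # Average incoming neighbor features; nodes without neighbors get zeros.
    neigh_mean = summed / np.maximum(degree, 1)[:, None]
    return h @ w_self + neigh_mean @ w_neigh


def two_layer_sage(x, edges, weights):
    """Stack two layers with a ReLU in between; weights holds 4 matrices."""
    w1_self, w1_neigh, w2_self, w2_neigh = weights
    hidden = np.maximum(sage_mean_layer(x, edges, w1_self, w1_neigh), 0.0)
    return sage_mean_layer(hidden, edges, w2_self, w2_neigh)


# Toy check: node 2 receives messages from nodes 0 and 1.
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edges = [(0, 2), (1, 2)]
eye = np.eye(2)
out = sage_mean_layer(x, edges, eye, eye)
print(out[2])  # [1.5 1.5]
```

In the toy check, node 2 ends up with its own features plus the mean of its two neighbors, which is the averaging behavior described above.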
Find Connections
To calculate how similar the vectors are to each other we will do the following:
- Calculate the cosine similarity matrix
- Demonstrate examples of highly connected and lowly connected node pairs.
Cosine Similarities function:
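The original function is not reproduced here, so the following is a minimal NumPy sketch of the two steps listed above. The names `cosine_similarity_matrix` and `select_pairs` are hypothetical, and the toy vectors are chosen only to produce one clearly similar pair and two clearly dissimilar pairs.

```python
import numpy as np


def cosine_similarity_matrix(vectors):
    """Return the (n, n) matrix of pairwise cosine similarities."""
    v = np.asarray(vectors, dtype=float)
    norms = np.linalg.norm(v, axis=1, keepdims=True)
    unit = v / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return unit @ unit.T


def select_pairs(sim, threshold, above=True):
    """List (i, j, score) for i < j with score above (or below) threshold."""
    n = sim.shape[0]
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            score = float(sim[i, j])
            keep = score > threshold if above else score < threshold
            if keep:
                pairs.append((i, j, score))
    return pairs


# Toy vectors: 0 and 1 point almost the same way, 2 points the opposite way.
sim = cosine_similarity_matrix([[1.0, 0.0], [1.0, 0.1], [-1.0, 0.0]])
high = select_pairs(sim, 0.6)                 # highly connected pairs
low = select_pairs(sim, -0.5, above=False)    # lowly connected pairs
print([(i, j) for i, j, _ in high])  # [(0, 1)]
print([(i, j) for i, j, _ in low])   # [(0, 2), (1, 2)]
```

Only pairs with i < j are returned, so each undirected pair appears once and self-similarities on the diagonal are skipped.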
Experiments
Data Source Analysis
As a data source we used text data from Wikipedia articles about the same 20 artists that we used in the previous study, "Building Knowledge Graph in Spark without SPARQL". In that study the coding was done in Scala Spark; in this study the coding was done in Python. As an environment we used Google Colab and Google Drive.
To estimate the size distribution of the Wikipedia text data we tokenized the text and exploded the tokens.
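The tokenize-and-count step can be sketched as follows. This is an illustration under stated assumptions: the regular-expression tokenizer and the `word_counts` helper are ours, not the study's, and the article texts are placeholders rather than real Wikipedia content.

```python
import re


def word_counts(articles):
    """Return (artist, number_of_tokens) pairs, largest article first."""
    counts = {
        artist: len(re.findall(r"\w+", text.lower()))
        for artist, text in articles.items()
    }
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Placeholder texts, used only to show the shape of the result.
articles = {
    "Franz Marc": "Placeholder text.",
    "Vincent van Gogh": "A somewhat longer placeholder text for illustration.",
}
sizes = word_counts(articles)
print(sizes)  # [('Vincent van Gogh', 7), ('Franz Marc', 2)]
```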
Figure: distribution of the number of words in the Wikipedia articles about the artists.
Taking Wikipedia article length as a rough proxy for how well known an artist is, the best-known artist in our list is Vincent van Gogh and the least-known artist is Franz Marc.
Building Graph
First we indexed the data. We defined nodes as artist names ('Artist' column) and Wikipedia article text ('Wiki' column), and we defined edges as index pairs of nodes.
Transform Text to Vectors
For text-to-vector translation we used the 'all-MiniLM-L6-v2' model from Hugging Face. The embedded node data was then saved to Google Drive.
Add Edges to the Knowledge Graph
To decide which edges should be added to the knowledge graph, we analyzed the cosine similarity matrix and selected pairs of vectors with high cosine similarities. We added edges for artist pairs with cosine similarities greater than 0.6.
Figure: graph of artist pairs with cosine similarities > 0.6.
The selected edges were added to the list of graph edges and saved to Google Drive.
Run GNN Link Prediction Model
For the Graph Neural Network (GNN) link prediction model we used a model from the Deep Graph Library (DGL). The model code was provided by the DGL tutorial, so we only had to transform the node and edge data from our data format to the DGL data format.
We read the embedded nodes and edges from Google Drive and converted the data to DGL format.
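The conversion might look like the sketch below. It assumes the `dgl` and `torch` packages are installed and makes two assumptions that are not confirmed by the study: each edge is stored in both directions, and node features are stored under the key 'feat'. The helper names are hypothetical.

```python
import numpy as np


def to_bidirectional(edges):
    """Return (src, dst) index arrays containing each edge in both directions."""
    pairs = np.asarray(list(edges), dtype=np.int64).reshape(-1, 2)
    src = np.concatenate([pairs[:, 0], pairs[:, 1]])
    dst = np.concatenate([pairs[:, 1], pairs[:, 0]])
    return src, dst


def build_dgl_graph(edges, features):
    """Create a DGL graph with node features stored under the key 'feat'."""
    import dgl    # imported lazily; requires the dgl and torch packages
    import torch

    src, dst = to_bidirectional(edges)
    graph = dgl.graph((torch.from_numpy(src), torch.from_numpy(dst)),
                      num_nodes=len(features))
    graph.ndata["feat"] = torch.as_tensor(np.asarray(features),
                                          dtype=torch.float32)
    return graph


src, dst = to_bidirectional([(0, 2), (1, 3)])
print(src.tolist(), dst.tolist())  # [0, 1, 2, 3] [2, 3, 0, 1]
```

Here `edges` is a list of integer index pairs and `features` is the array of transformer embeddings, one row per node.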
We then defined the model, the loss function, and the evaluation metric. To evaluate the results we used Area Under the Curve (AUC) as the accuracy metric. The model reached an AUC of about 90 percent.
Knowledge Graph with Predicted Edges
To calculate predicted edges, we first looked at the cosine similarity matrix for pairs of nodes embedded by the GNN link prediction model. Then, by index, we combined the edges and their cosine similarity scores with the corresponding artist names. This step used two inputs: the edges with cosine similarity scores and the list of artist names.
Join edge cosine similarity scores with artist names by indexes:
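A minimal sketch of this join is shown below, together with a 'line' field that formats each pair as a DOT edge statement for graph visualization. The `join_names` helper, the field names, the undirected `--` edge syntax, and the score value are all illustrative assumptions; the score is a placeholder, not a result from the study.

```python
def join_names(scored_edges, names):
    """Attach artist names to (i, j, score) triples and add a DOT edge line."""
    rows = []
    for i, j, score in scored_edges:
        rows.append({
            "artist1": names[i],
            "artist2": names[j],
            "score": score,
            "line": f'"{names[i]}" -- "{names[j]}";',
        })
    return rows


# Toy data: the score is a made-up placeholder.
names = ["Vincent van Gogh", "Franz Marc"]
rows = join_names([(0, 1, 0.5)], names)
print(rows[0]["line"])  # "Vincent van Gogh" -- "Franz Marc";
```

Wrapping the collected lines in `graph { ... }` would give a complete undirected DOT graph.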
For graph visualization in the Gephi tool we added a 'line' column with artist pairs in DOT language.
The following examples show graphs of artists with high cosine similarities and with low cosine similarities.
Pairs of artists with high cosine similarities (higher than 0.6):
Example 1: artist pairs with cosine similarities > 0.6
Example 2: artist pairs with cosine similarities > 0.7
Pairs of artists with low cosine similarities (less than -0.5):
Example 3: artist pairs with cosine similarities < -0.5
Conclusion
In this post we demonstrated how to use transformers and GNN link prediction to rewire knowledge graphs.
- Through transformers we mapped Wikipedia articles to vectors and added pairs of highly connected artists as edges to the knowledge graph.
- On top of the rewired knowledge graph we ran a GNN link prediction model.
- We used cosine similarities between GNN-embedded nodes to estimate predicted edges of the knowledge graph.
- We demonstrated how to apply these techniques to find pairs of artists that are highly connected or lowly connected.