Connecting Art and Data: Building a Unified Knowledge Graph

Integrating Text and Images with Graph Neural Networks to Explore Artistic Relationships

Posted by Melenar on January 1, 2025

Unified Knowledge Graph for Artists and Paintings

Think of this project as one map of the art world, where artists and paintings are dots and their relationships form the lines between them. We turn biographies and images into a single unified knowledge graph, then use Graph AI to see which artists and works are close, far, or unexpectedly linked. Instead of scrolling through lists, you can explore art as a connected network and discover relationships you wouldn’t notice from text or images alone. The same pattern extends far beyond art: this pipeline can fuse text, images, time series, and other data into a unified knowledge graph in any domain, keeping everything analyzable with one Graph AI workflow.

Conference & Publication

This work was presented at CVIT 2025 in Florence, Italy, on June 20, 2025, and published as: Romanova, A. (2025). “Rewiring Multi-Modal Knowledge Graphs with GNN Link Prediction: Insights from Art History.” doi: 10.1117/12.3078062.

Introduction: Knowledge Graphs Exploration

Art and artists are deeply connected, reflecting creativity, culture, and history. Graphs provide a powerful way to explore these relationships by representing artists, artworks, and movements as nodes, and their connections as edges. This helps us uncover patterns, find hidden links, and better understand the evolution of art. With advanced tools like Graph Neural Networks (GNNs), we can predict new connections, group related ideas, and gain even deeper insights into the art world.

Building on Our Previous Research

The first study, titled "Building Knowledge Graph in Spark without SPARQL", was presented at the DEXA 2020 online conference. It explored non-traditional methods for constructing and analyzing knowledge graphs, moving beyond SPARQL-based semantic web approaches. The study demonstrated how the Spark GraphFrames library could be leveraged to efficiently build and mine knowledge graphs. Focusing on modern art artists, it highlighted the versatility and scalability of knowledge graph techniques.

For more details, visit: Building Knowledge Graph in Spark without SPARQL
Reference: Romanova, A. (2020). Building knowledge graph in spark without SPARQL. CCIS, vol. 1285, pp 96–102.

In that paper we showed how to build a knowledge graph in Spark without SPARQL, and how, conceptually, a knowledge graph builds a bridge between logical thinking and graph thinking for data mining.

The second study was presented at ICAART 2023, held 22-24 February 2023 in Lisbon. It focused on methods for rewiring knowledge graphs to uncover hidden relationships between nodes using GNN link prediction models. The experiments analyzed semantic similarities and dissimilarities between the biographies of modern art artists, drawing on their Wikipedia articles. A traditional method used the full text of articles and cosine similarities between re-embedded nodes, while a novel method examined the distribution of co-located words within and across articles. Output vectors from the GNN link prediction model were aggregated by artist, and link predictions were calculated from cosine similarities.

The study highlighted the importance of both highly connected and highly disconnected node pairs in graph mining. Disconnected pairs offered unique insights, revealing the knowledge graph's overall coverage and serving as indicators for validating community detection. Additionally, the study demonstrated practical applications for rewired knowledge graphs in recommender systems. High-similarity node pairs suggested related artists or movements, while high-dissimilarity pairs encouraged users to explore contrasting art movements, enhancing the diversity of recommendations.

For more details, visit: Rewiring Knowledge Graphs by Graph Neural Network Link Predictions
Reference: Romanova, A. Rewiring Knowledge Graphs by Graph Neural Network Link Predictions. International Conference on Agents and Artificial Intelligence (2023). doi:10.5220/0011664400003393.

This Study

The dataset, 'Best Artworks of All Time', was taken from Kaggle. It features information and artwork images from 50 of the most influential artists in history. The dataset includes detailed artist metadata sourced from Wikipedia, as well as a collection of high-resolution images scraped from artchallenge.ru.

This study focuses on the Construction of a Unified Knowledge Graph to analyze the relationships between artists and their works. Starting with raw data comprising artist biographies and painting images, two separate graphs are created: an Artist Graph and a Painting Graph.

Textual data from artist biographies is transformed into vectors using Large Language Models (LLMs), while visual data from painting images is processed into vectors using Convolutional Neural Networks (CNNs). These initial embeddings are further refined through Graph Neural Networks (GNNs) to produce consistent, high-quality representations for the nodes in both graphs.

The final step involves combining the Artist Graph and Painting Graph into a single Unified Knowledge Graph, where links represent the connections between artists and their respective works. This integrated graph provides a robust framework for exploring the intricate relationships within the art world, enabling tasks such as classification, clustering, and link prediction.

Unified Knowledge Graph Construction
This figure represents construction of a Unified Knowledge Graph. Raw data of artist biographies and painting images are transformed into an Artist Graph and a Painting Graph. Text and image data are converted into vectors using LLMs and CNNs, respectively, followed by GNN-based embedding. The graphs are then combined into a Unified Knowledge Graph, linking artists to their works.

Methods

Graph Construction for Artists

The process of constructing a graph to represent artist relationships begins with defining nodes for each artist and creating edges based on shared attributes such as genre and nationality.

Each artist is represented as a node in the graph, with attributes like the artist's index and name. To capture relationships:

  • Genre-Based Edges: Edges are added between all pairs of artists who share the same genre, reflecting thematic or stylistic connections.
  • Nationality-Based Edges: Edges are created between artists who share the same nationality, representing cultural or geographical ties.

To enable analysis, the graph’s edges are converted into a DataFrame, organized with columns for source nodes, target nodes, and edge attributes. If the edge attributes are stored as dictionaries, these are expanded into individual columns for easier interpretation. The resulting DataFrame is reviewed to confirm its accuracy and structure.

Transforming Text to Vectors for Node Features

Artist biographies are transformed into vector embeddings to enrich the graph with semantic information. Using the Hugging Face 'all-MiniLM-L6-v2' model, the text is encoded into a 384-dimensional vector space, producing a tensor of size [50, 384], where each row represents an artist.

These embeddings are assigned as node features in the graph using the ndata['feat'] attribute. This enables:

  • Using vectors as input for GNN link prediction models.
  • Generating additional graph edges by calculating a cosine similarity matrix and selecting highly connected vector pairs.

This approach provides a robust foundation for graph-based analysis, capturing both semantic relationships and potential hidden connections between artists.
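The cosine-similarity edge generation mentioned above can be sketched as follows (a minimal illustration with toy vectors; the function name `cosine_similarity_edges` and the 0.8 threshold are assumptions, not values from the study):

```python
import numpy as np

def cosine_similarity_edges(vectors: np.ndarray, threshold: float = 0.8):
    """Return index pairs (i, j), i < j, whose cosine similarity exceeds threshold."""
    # Normalize rows so the dot product equals cosine similarity.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    i, j = np.triu_indices(len(vectors), k=1)  # upper triangle, no self-pairs
    mask = sim[i, j] > threshold
    return list(zip(i[mask], j[mask]))

# Toy check: two nearly parallel vectors and one orthogonal vector.
vecs = np.array([[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]])
edges = cosine_similarity_edges(vecs, threshold=0.8)
```

In the study's setting, `vectors` would be the [50, 384] biography embeddings, and the selected pairs would be added as extra graph edges.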

Run GNN Link Prediction Model

For Graph Neural Network link prediction we used a model from the Deep Graph Library (DGL). The model is built on two GraphSAGE layers and computes node representations by averaging neighbor information. We used the code provided by the DGL tutorial "Link Prediction using Graph Neural Networks".

The results of this code are embedded nodes that can be used for further analysis such as node classification, k-means clustering, link prediction and so on. In this study we used it for link prediction by estimating cosine similarities between embedded nodes.
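Link prediction from the embedded nodes can be sketched like this (an illustrative helper with toy data; `top_k_similar` is an assumed name, not the study's code):

```python
import numpy as np

def top_k_similar(h: np.ndarray, node: int, k: int = 2):
    """Rank all other nodes by cosine similarity to `node`; return the top k."""
    unit = h / np.linalg.norm(h, axis=1, keepdims=True)
    scores = unit @ unit[node]
    scores[node] = -np.inf           # exclude the node itself
    order = np.argsort(-scores)      # descending similarity
    return [(int(i), float(scores[i])) for i in order[:k]]

# Toy embeddings: node 1 is nearly parallel to node 0, node 2 is orthogonal.
h = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.1]])
ranked = top_k_similar(h, node=0, k=2)
```

With the GNN output matrix in place of `h`, highly similar pairs suggest candidate links, while the lowest-scoring pairs mark strongly dissimilar nodes.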

Building the Painting Graph

To construct a graph representing paintings, artwork files are processed to link paintings to their respective artists and prepare the data for graph creation. Each painting is assigned a unique File Index, and artist names extracted from file names are mapped to their corresponding Artist Index using artists_df. This information—File Name, Artist Name, File Index, and Artist Index—is combined into a structured DataFrame.

The Painting Graph is created where nodes represent paintings and edges connect paintings by the same artist. Using NetworkX, nodes are added with their File Index as identifiers and File Name as attributes. Edges are added between paintings that share the same Artist Index, capturing relationships based on shared creators.

Image Features for Node Embeddings

Painting images are processed to generate visual embeddings using a pre-trained CNN, such as ResNet. Images are resized, normalized, and passed through the model to extract high-dimensional feature vectors. These embeddings are stored as node features, enriching the graph with visual content for machine learning tasks.

Transforming the Painting Graph to DGL Format

The Painting Graph is converted into DGL format to facilitate efficient processing and modeling using DGL's graph-based machine learning tools. The edges from painting_graph_edges are transformed into a PyTorch tensor, defining the graph's structure. Self-loops are added to ensure each node is connected to itself, which can enhance the performance of GNN models.

Image feature vectors, previously saved and loaded from storage, are assigned as node features in the graph. These embeddings are converted into a PyTorch tensor and set using the ndata['feat'] attribute, enriching the graph with meaningful visual data for each painting node.

GNN Link Prediction Model Training for the Painting Graph

The GNN model is trained for link prediction on the Painting Graph using the constructed graph and its visual embeddings. Similar to the Artist Graph, the training process involves predicting links by leveraging the node features, allowing the model to identify relationships between paintings based on shared visual or contextual patterns. This tailored approach reveals hidden connections and enhances the understanding of painting relationships within the graph.

Unified Knowledge Graph

To analyze relationships between artists and their works, we created a Unified Knowledge Graph by combining the Artist graph and Painting graph. This integration enables consistent multi-modal data embedding and analysis.

Artist biographies were transformed into 384-dimensional vectors using a transformer model, then reduced to 128-dimensional embeddings through GNN link prediction. Paintings were similarly represented as 2048-dimensional vectors using a pre-trained CNN model, which were also reduced to 128-dimensional embeddings.

To unify the graphs, unique indices ensured no overlap between artist and painting nodes, with artist nodes offset by 1000. Painting file names were processed to map each painting to its corresponding artist, and edges were added to represent these relationships.

The resulting Unified Knowledge Graph integrates artist and painting data into a single structure, providing a foundation for advanced graph-based analysis.

GNN Link Prediction Model for the Unified Graph

We applied the GNN Link Prediction model three times across different stages of the study:

1. Artist Graph: The model was used to process artist biographies, initially embedded as 384-dimensional vectors derived from text data. The GNN Link Prediction model reduced these embeddings to a consistent size of 128 dimensions, capturing semantic relationships between artists effectively.

2. Painting Graph: For paintings, visual features extracted via a CNN model were initially represented as 2048-dimensional vectors. The GNN Link Prediction model was applied to reduce these high-dimensional embeddings to 128 dimensions, emphasizing meaningful stylistic or thematic connections between paintings.

3. Unified Knowledge Graph: After merging the Artist and Painting graphs into a single structure, the GNN Link Prediction model was run again on the unified graph. This step refined the node embeddings, ensuring consistent 128-dimensional representations across all nodes while integrating multi-modal relationships between artists and their works.

By running the model at each stage, we achieved consistent embeddings that facilitate advanced graph-based analysis, uncovering complex patterns and relationships within the unified framework.

Experiments

Data Source Analysis

As a data source we used the 'Best Artworks of All Time' dataset from kaggle.com.

Metadata Overview

This dataset provides the following key files:

  • artists.csv:
    • Contains metadata for 50 influential artists, including:
    • Name, Genre, Nationality, Biography, and Years (lifespan or active period).
  • resized.zip:
    • A collection of resized artwork images, optimized for faster processing and reduced storage.
    • Ideal for machine learning workflows requiring efficient model training and testing.

These files provide a compact yet comprehensive resource for analyzing and classifying artworks.

Raw Data Processing

The code loads the artist metadata from a CSV file, selects relevant columns (name, genre, nationality, bio, years), sorts the data alphabetically by name, resets the index, and assigns a sequential artistIndex based on the new order.

import pandas as pd

artists = pd.read_csv("/content/drive/My Drive/Art/artists.csv")
artists_dff = artists[['name', 'genre', 'nationality', 'bio', 'years']]
artists_df = artists_dff.sort_values('name').reset_index(drop=True)
artists_df['artistIndex'] = artists_df.index

Split the standardized 'years' column into 'start_year' and 'end_year':

  • The years column is cleaned to replace special characters (e.g., em-dash) with a hyphen and split into start_year and end_year.
  • The genre column is exploded, creating separate rows for each genre associated with an artist, and the DataFrame is reset to include artistIndex, name, start_year, end_year, and genre.
  • A new DataFrame, artist2nationality, is created by splitting the nationality column into multiple entries, exploding it into separate rows, and resetting the index to include artistIndex, name, start_year, end_year, and nationality.

import re

artists_df['years'] = artists_df['years'].apply(lambda x: re.sub(r'[–—]', '-', x))
artists_df[['start_year', 'end_year']] = artists_df['years'].str.split(' - ', expand=True)
artist2genre = artists_df.assign(genre=artists_df['genre'].str.split(',')) \
    .explode('genre') \
    .reset_index(drop=True)[['artistIndex', 'name', 'start_year', 'end_year', 'genre']]
artist2nationality = artists_df.assign(nationality=artists_df['nationality'].str.split(',')) \
    .explode('nationality') \
    .reset_index(drop=True)[['artistIndex', 'name', 'start_year', 'end_year', 'nationality']]


Artist Graph

Building Graph on Artists

To construct a graph where nodes represent artists and edges are created based on shared attributes, follow these steps:

  • Create nodes: Add a node for each artist using their artistIndex and name.
  • Create genre-based edges: For each genre, add edges between all pairs of artists who share that genre.
  • Create nationality-based edges: Similarly, for each nationality, add edges between all pairs of artists who share that nationality.

Below is the Python code used to implement this graph:

import networkx as nx
G = nx.Graph()
for _, row in artist2genre.iterrows():
    G.add_node(row['artistIndex'], name=row['name'])
for genre, group in artist2genre.groupby('genre'):
    artist_indices = group['artistIndex'].tolist()
    for i in range(len(artist_indices)):
        for j in range(i + 1, len(artist_indices)):
            G.add_edge(artist_indices[i], artist_indices[j], genre=genre)
for nationality, group in artist2nationality.groupby('nationality'):
    artist_indices = group['artistIndex'].tolist()
    for i in range(len(artist_indices)):
        for j in range(i + 1, len(artist_indices)):
            G.add_edge(artist_indices[i], artist_indices[j], nationality=nationality)
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")
print(f"Average degree: {nx.density(G) * (G.number_of_nodes() - 1)}")
Number of nodes: 50
Number of edges: 218
Average degree: 8.72
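As a quick aside (an illustrative check, not part of the original pipeline), the expression `nx.density(G) * (G.number_of_nodes() - 1)` used above reduces to the mean degree 2E/n, which is how 8.72 arises from 50 nodes and 218 edges:

```python
def average_degree(num_nodes: int, num_edges: int) -> float:
    """Mean degree of an undirected simple graph: each edge touches two nodes."""
    return 2 * num_edges / num_nodes

# nx.density(G) for an undirected graph is 2E / (n * (n - 1)),
# so density * (n - 1) collapses to 2E / n.
density = 2 * 218 / (50 * 49)
assert abs(density * (50 - 1) - average_degree(50, 218)) < 1e-12
print(average_degree(50, 218))  # -> 8.72
```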

The following steps describe how to convert graph edges into a DataFrame for analysis:

  • Convert edges to a DataFrame: Extract the edges from the graph G and organize them into a DataFrame with columns for source nodes, target nodes, and edge attributes.
  • Expand edge attributes: If the edge attributes are stored as dictionaries, expand these into separate columns to make them more accessible.
  • Display the edges DataFrame: View the resulting DataFrame to verify its structure and content.

The Python code snippet below demonstrates these steps:

import pandas as pd
edges_df = pd.DataFrame(list(G.edges(data=True)), columns=['Source', 'Target', 'Attributes'])
if not edges_df.empty and isinstance(edges_df['Attributes'].iloc[0], dict):
    attributes_df = edges_df['Attributes'].apply(pd.Series)
    edges_df = pd.concat([edges_df.drop(columns=['Attributes']), attributes_df], axis=1)

Convert edges to DGL format:

import dgl
import dgl.nn as dglnn
import dgl.data
from dgl.data import DGLDataset
import torch
import torch.nn as nn
import torch.nn.functional as F
import itertools
unpickEdges = edges_df
edge_index = torch.tensor(unpickEdges[['Source', 'Target']].T.values)
u, v = edge_index[0], edge_index[1]
g = dgl.graph((u, v))
g = dgl.add_self_loop(g)
g
Graph(num_nodes=50, num_edges=268,
      ndata_schemes={}
      edata_schemes={})

Transform Text to Vectors

This section describes how text data, specifically the artist biographies, is transformed into vector embeddings to be used as node features in a graph. We utilized the 'all-MiniLM-L6-v2' model from Hugging Face for this purpose, which generates high-quality sentence embeddings.

  • Model Selection: The 'all-MiniLM-L6-v2' model, a lightweight yet powerful transformer, was used for efficient text-to-vector translation.
  • Embedding Creation: Artist biographies from the DataFrame are encoded into a tensor of size [50, 384], where each row represents a 384-dimensional vector for an artist.
  • Assigning Features: These embeddings are assigned as node features to the graph using the ndata['feat'] attribute.

The Python code below demonstrates the embedding process and assignment to the graph:

from sentence_transformers import SentenceTransformer

modelST = SentenceTransformer('all-MiniLM-L6-v2')
node_embeddings = modelST.encode(artists_df['bio'], convert_to_tensor=True)
node_embeddings = node_embeddings.to(torch.device('cpu'))
g.ndata['feat'] = node_embeddings
node_embeddings.shape
torch.Size([50, 384])

Input graph

g
Graph(num_nodes=50, num_edges=268,
      ndata_schemes={'feat': Scheme(shape=(384,), dtype=torch.float32)}
      edata_schemes={})

GNN Link Prediction Model Training

This subsection describes the process of training a GNN link prediction model, using code adapted from the DGL library. The focus is on leveraging the built-in functionalities for implementing and training GNN-based link prediction tasks.

The training pipeline includes constructing positive and negative edge graphs for the training and testing phases, defining the GNN architecture, and training the model using standard techniques. The training concludes with an evaluation of the model's performance using the AUC (Area Under the Curve) metric.

import numpy as np
import scipy.sparse as sp

u, v = g.edges()
eids = np.arange(g.number_of_edges())
eids = np.random.permutation(eids)
test_size = int(len(eids) * 0.1)
train_size = g.number_of_edges() - test_size
test_pos_u, test_pos_v = u[eids[:test_size]], v[eids[:test_size]]
train_pos_u, train_pos_v = u[eids[test_size:]], v[eids[test_size:]]
adj = sp.coo_matrix((np.ones(len(u)), (u.numpy(), v.numpy())))
adj_neg = 1 - adj.todense() - np.eye(g.number_of_nodes())
neg_u, neg_v = np.where(adj_neg != 0)
neg_eids = np.random.choice(len(neg_u), g.number_of_edges())
test_neg_u, test_neg_v = neg_u[neg_eids[:test_size]], neg_v[neg_eids[:test_size]]
train_neg_u, train_neg_v = neg_u[neg_eids[test_size:]], neg_v[neg_eids[test_size:]]
train_g = dgl.remove_edges(g, eids[:test_size])

from dgl.nn import SAGEConv
class GraphSAGE(nn.Module):
    def __init__(self, in_feats, h_feats):
        super(GraphSAGE, self).__init__()
        self.conv1 = SAGEConv(in_feats, h_feats, 'mean')
        self.conv2 = SAGEConv(h_feats, h_feats, 'mean')
    def forward(self, g, in_feat):
        h = self.conv1(g, in_feat)
        h = F.relu(h)
        h = self.conv2(g, h)
        return h

import dgl.function as fn
train_pos_g = dgl.graph((train_pos_u, train_pos_v), num_nodes=g.number_of_nodes())
train_neg_g = dgl.graph((train_neg_u, train_neg_v), num_nodes=g.number_of_nodes())
test_pos_g = dgl.graph((test_pos_u, test_pos_v), num_nodes=g.number_of_nodes())
test_neg_g = dgl.graph((test_neg_u, test_neg_v), num_nodes=g.number_of_nodes())
class DotPredictor(nn.Module):
    def forward(self, g, h):
        with g.local_scope():
            g.ndata['h'] = h
            g.apply_edges(fn.u_dot_v('h', 'h', 'score'))
            return g.edata['score'][:, 0]

class MLPPredictor(nn.Module):
    def __init__(self, h_feats):
        super().__init__()
        self.W1 = nn.Linear(h_feats * 2, h_feats)
        self.W2 = nn.Linear(h_feats, 1)
    def apply_edges(self, edges):
        h = torch.cat([edges.src['h'], edges.dst['h']], 1)
        return {'score': self.W2(F.relu(self.W1(h))).squeeze(1)}
    def forward(self, g, h):
        with g.local_scope():
            g.ndata['h'] = h
            g.apply_edges(self.apply_edges)
            return g.edata['score']

model = GraphSAGE(train_g.ndata['feat'].shape[1], 128)
pred = MLPPredictor(128)
def compute_loss(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])])
    return F.binary_cross_entropy_with_logits(scores, labels)
def compute_auc(pos_score, neg_score):
    scores = torch.cat([pos_score, neg_score]).numpy()
    labels = torch.cat(
        [torch.ones(pos_score.shape[0]), torch.zeros(neg_score.shape[0])]).numpy()
    return roc_auc_score(labels, scores)
optimizer = torch.optim.Adam(itertools.chain(model.parameters(), pred.parameters()), lr=0.01)

all_logits = []
for e in range(100):
    h = model(train_g, train_g.ndata['feat'].float())
    pos_score = pred(train_pos_g, h)
    neg_score = pred(train_neg_g, h)
    loss = compute_loss(pos_score, neg_score)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if e % 5 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))


Final evaluation:

from sklearn.metrics import roc_auc_score
with torch.no_grad():
    pos_score = pred(test_pos_g, h)
    neg_score = pred(test_neg_g, h)
    print('AUC', compute_auc(pos_score, neg_score))
AUC 0.9171597633136095

To ensure the embedded vectors are preserved and accessible across sessions, they are saved to Google Drive as PyTorch tensors. This allows seamless reuse without recomputing the embeddings, saving time and computational resources.

import torch
torch.save(h, '/content/drive/My Drive/Art/artists_h.pt')
load_h = torch.load('/content/drive/My Drive/Art/artists_h.pt')

Paintings Graph

Create a Graph on Paintings

In this subsection, we process artwork files to build a DataFrame that links paintings to their respective artists and prepares data for graph construction.

The following steps outline the workflow:

  • Process files in the directory: Iterate through the files in the specified directory, extract the painting's file name, artist's name, and assign a unique File Index to each painting.
  • Match artist names to indices: The artist names extracted from the file names are mapped to their corresponding Artist Index using a sorted version of the artists_df.
  • Create the paintings DataFrame: Combine the extracted file data, including File Name, Artist Name, File Index, and the mapped Artist Index, into a structured DataFrame for further analysis and graph creation.

paintings_dir = '/content/drive/My Drive/Art/paintings/'
import os
import pandas as pd
file_data = []
file_index = 0
if os.path.exists(paintings_dir):
    for file in sorted(os.listdir(paintings_dir)):  
        if os.path.isfile(os.path.join(paintings_dir, file)):
            artist_name = file.rsplit('_', 1)[0].replace('_', ' ')
            file_data.append((file, artist_name, file_index))
            file_index += 1
columns = ['File Name', 'Artist Name', 'File Index']
file_df = pd.DataFrame(file_data, columns=columns)
artists_sorted = artists_df.sort_values('name').reset_index(drop=True)  
artist_name_to_index = {name: idx for idx, name in enumerate(artists_sorted['name'])}  
file_df['Artist Index'] = file_df['Artist Name'].map(artist_name_to_index)
file_df.tail()


Initialize and Build the Painting Graph

This subsection describes the creation of a graph where nodes represent paintings, and edges connect paintings created by the same artist. The graph is then analyzed and converted to DGL format for use in machine learning tasks.

The following steps outline the workflow:

  • Initialize the graph: Use NetworkX to create an undirected graph.
  • Add nodes with attributes: Each painting is added as a node, with its File Index as the identifier and File Name as a node attribute.
  • Add edges: Paintings created by the same artist are connected. Edges are added between all pairs of paintings that share the same Artist Index.

The Python code below demonstrates the implementation:

import networkx as nx
G = nx.Graph()
for _, row in file_df.iterrows():
    G.add_node(row['File Index'], file_name=row['File Name'])
artist_groups = file_df.groupby('Artist Index')['File Index'].apply(list)
for group in artist_groups:
    for i in range(len(group)):
        for j in range(i + 1, len(group)):
            G.add_edge(group[i], group[j])
print(f"Number of nodes: {G.number_of_nodes()}")
print(f"Number of edges: {G.number_of_edges()}")
print(f"Average degree: {nx.density(G) * (G.number_of_nodes() - 1)}")
Number of nodes: 1000
Number of edges: 9500
Average degree: 19.0

We extract nodes and edges from the graph and save them as DataFrames in Google Drive for future use.

nodes_data = [{'Node': n, **G.nodes[n]} for n in G.nodes]
nodes_df = pd.DataFrame(nodes_data)
nodes_df.to_csv('/content/drive/My Drive/Art/painting_graph_nodes.csv', index=False)

edges_data = [{'Source': u, 'Target': v, **G.edges[u, v]} for u, v in G.edges]
edges_df = pd.DataFrame(edges_data)
edges_df.to_csv('/content/drive/My Drive/Art/painting_graph_edges.csv', index=False)

Image Features for Node Embeddings

In this section, we use a pre-trained Convolutional Neural Network (CNN) to extract feature embeddings from the painting images. These features serve as node embeddings for the graph, enabling the model to capture meaningful visual patterns.

Workflow:

  • Image Preprocessing: Images are resized to 224x224 pixels, converted to tensors, and normalized using standard ImageNet mean and standard deviation values. This ensures compatibility with the pre-trained CNN model.
  • Feature Extraction: Each preprocessed image is passed through the CNN (e.g., ResNet) to extract a high-dimensional feature vector. These vectors represent the visual content of the paintings.
  • Embedding Storage: The extracted feature vector for each image is flattened, converted to a NumPy array, and stored in a dictionary with the file name as the key. This allows easy mapping between the image and its corresponding feature representation.

This process generates a rich set of visual embeddings that can be used as input features for graph-based learning tasks.

from torchvision import models, transforms
from PIL import Image
import torch
import os

# CNN backbone (assumed here to be ResNet-50): dropping the final
# classification layer yields a 2048-dimensional feature vector per image.
model = models.resnet50(pretrained=True)
model = torch.nn.Sequential(*list(model.children())[:-1])
model.eval()
transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
image_vectors = {}
for file_name in file_df['File Name']:
    img_path = os.path.join(paintings_dir, file_name)
    img = Image.open(img_path).convert('RGB')
    img_tensor = transform(img).unsqueeze(0)
    with torch.no_grad():
        features = model(img_tensor).flatten().numpy()
    image_vectors[file_name] = features

Save extracted vectors to Google Drive and validate:

vectors_df = pd.DataFrame(image_vectors)
vectors_df.to_csv('/content/drive/My Drive/Art/image_vectors.csv', index=False)
loaded_image_vectors_df = pd.read_csv('/content/drive/My Drive/Art/image_vectors.csv')
loaded_image_vectors_df.shape
(2048, 1000)

Transforming the Painting Graph to DGL Format

This subsection describes the process of transforming the painting graph into DGL format, including edge and node feature assignments. This transformation enables efficient processing and modeling using DGL's graph-based machine learning tools.

The previously saved image feature vectors are loaded from the CSV file stored on Google Drive

image_vectors = pd.read_csv("/content/drive/My Drive/Art/image_vectors.csv")
image_vectors.head()


Workflow:

  • Create the DGL graph: The edges from painting_graph_edges are converted into a PyTorch tensor and used to define the graph structure. Self-loops are added to ensure every node has a connection to itself, which can enhance model performance.
  • Assign node features: The image embeddings are transformed into a PyTorch tensor and assigned to the graph as node features using the ndata['feat'] attribute. This allows the graph to carry meaningful data for each node.

The following code demonstrates the transformation process:

painting_graph_edges = pd.read_csv('/content/drive/My Drive/Art/painting_graph_edges.csv')
unpickEdges = painting_graph_edges
edge_index = torch.tensor(unpickEdges[['Source', 'Target']].T.values)
u, v = edge_index[0], edge_index[1]
gPaintings = dgl.graph((u, v))
gPaintings = dgl.add_self_loop(gPaintings)
gPaintings
Graph(num_nodes=1000, num_edges=10500,
      ndata_schemes={}
      edata_schemes={})

Next, the node features are assigned to the graph:

image_vectors_T = image_vectors.T
image_vectors_T_df = pd.DataFrame(image_vectors_T).reset_index(drop=True)
image_vectors_tensor = torch.tensor(image_vectors_T_df.values, dtype=torch.float)
gPaintings.ndata['feat'] = image_vectors_tensor
gPaintings
Graph(num_nodes=1000, num_edges=10500,
      ndata_schemes={'feat': Scheme(shape=(2048,), dtype=torch.float32)}
      edata_schemes={})

The resulting graph gPaintings includes 1000 nodes, 10,500 edges, and node features of shape (2048,). It is now ready for use in graph-based machine learning tasks.

GNN Link Prediction Model Training for the Painting Graph

In this subsection, we focus on training a GNN model for link prediction on the painting graph, utilizing the graph and node features constructed in the previous steps. This process is similar to the training conducted for the artist graph but specifically tailored to predict links between paintings based on their visual embeddings.

  • Forward pass: The node features (image embeddings) are passed through the GNN model to compute embeddings for each painting. Positive and negative edge scores are calculated using the model's prediction function.
  • Loss computation: A loss function compares the predicted scores against the ground truth, guiding the model's optimization.
  • Backward pass and optimization: Gradients are calculated with respect to the loss, and the optimizer updates the model's parameters to minimize the loss.

The training process iterates for 100 epochs, logging the loss every 5 epochs:

all_logits = []
for e in range(100):
    # forward
    h = model(train_g, train_g.ndata['feat'].float())
    pos_score = pred(train_pos_g, h)
    neg_score = pred(train_neg_g, h)
    loss = compute_loss(pos_score, neg_score)
    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if e % 5 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))


from sklearn.metrics import roc_auc_score
with torch.no_grad():
    pos_score = pred(test_pos_g, h)
    neg_score = pred(test_neg_g, h)
    print('AUC', compute_auc(pos_score, neg_score))
AUC 0.9827046611570248
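The helper functions `compute_loss` and `compute_auc` are not defined in this post; in the standard DGL link-prediction pattern they treat positive edges as label 1 and negative edges as label 0, and can be sketched as:

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

def compute_loss(pos_score, neg_score):
    # Binary cross-entropy over raw scores: positive edges labeled 1, negatives 0.
    scores = torch.cat([pos_score, neg_score])
    labels = torch.cat([torch.ones(pos_score.shape[0]),
                        torch.zeros(neg_score.shape[0])])
    return F.binary_cross_entropy_with_logits(scores, labels)

def compute_auc(pos_score, neg_score):
    # AUC of the same binary labeling, computed on the raw edge scores.
    scores = torch.cat([pos_score, neg_score]).numpy()
    labels = torch.cat([torch.ones(pos_score.shape[0]),
                        torch.zeros(neg_score.shape[0])]).numpy()
    return roc_auc_score(labels, scores)
```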

Saving the Embedded Vectors

torch.save(h, '/content/drive/My Drive/Art/paintings_h.pt')
load_h = torch.load('/content/drive/My Drive/Art/paintings_h.pt')
load_h.shape
torch.Size([1000, 128])

Unified Knowledge Graph

Combining Artist and Painting Graphs into a Unified Knowledge Graph

We started by constructing two separate graphs:

  • Artist Graph: Artist biographies were converted into vectors of size 384 using a transformer model, followed by GNN link prediction, which transformed these embeddings into vectors of size 128.
  • Painting Graph: Paintings were represented as vectors of size 2048 using a pre-trained CNN model, followed by GNN link prediction, which transformed these embeddings into vectors of size 128.

Next Step: We will combine the Artist graph and Painting graph into a unified Knowledge Graph that maps painting nodes to their corresponding artist nodes. The node features for this unified graph are vectors of size 128, ensuring a consistent embedding space across all node types.

To create a unified Knowledge Graph, we need to ensure that nodes from both the Artist graph and the Painting graph have unique indices. This adjustment is achieved by offsetting the indices of the Artist nodes:

  • Unique indexing: Since the Painting graph contains 1000 nodes (indexed 0 to 999), we add an offset of 1000 to the indices of the Artist nodes. This ensures that Artist node indices start at 1000 and do not overlap with Painting node indices.
  • Prepare adjusted Artist nodes: After offsetting, each Artist node carries its updated Node index and its name. These are then included in the unified Knowledge Graph.

artist_graph_nodes = pd.read_csv("/content/drive/My Drive/Art/artist_graph_nodes.csv")
artist_graph_edges = pd.read_csv("/content/drive/My Drive/Art/artist_graph_edges.csv")
artist_graph_nodes['Node'] = artist_graph_nodes['nodeIdx'] + 1000
artist_graph_nodes = artist_graph_nodes[['Node', 'name']]
artist_graph_edges['Source'] = artist_graph_edges['Source'] + 1000
artist_graph_edges['Target'] = artist_graph_edges['Target'] + 1000

Knowledge Graph nodes: head and tail

Painting nodes are prepared for the unified Knowledge Graph by mapping each painting to its corresponding artist and assigning unique artist indices.

  • Load painting nodes: The painting graph nodes are loaded from a CSV file, and the column file_name is renamed to name for consistency.
  • Map paintings to artists: A new column, artist_name, is created by extracting the artist's name from the painting's file name. This assumes the artist's name is the first two words in the file name, separated by underscores.
  • Assign unique artist indices: An artist_index column is created by dividing the painting node index (Node) by 20 (integer division) and adding an offset of 1000. Because each artist contributes 20 consecutive painting nodes, paintings 0–19 map to artist index 1000, paintings 20–39 to 1001, and so on.

painting_graph_nodes = pd.read_csv("/content/drive/My Drive/Art/painting_graph_nodes.csv")
painting_graph_nodes = painting_graph_nodes.rename(columns={'file_name': 'name'})
painting_artist_graph_nodes = painting_graph_nodes
painting_artist_graph_nodes['artist_name'] = painting_graph_nodes['name'].apply(
    lambda x: x.split('_')[0] + ' ' + x.split('_')[1])
painting_artist_graph_nodes['artist_index'] = painting_graph_nodes['Node'].apply(
    lambda x: (x // 20) + 1000)


To establish connections between painting nodes and their corresponding artist nodes, edges are created by matching paintings to artists based on their indices:

  • Find matching artist nodes: For each painting node, the artist_index is used to identify the matching artist node in artist_graph_nodes.
  • Create edges: For each match, an edge is added between the painting node and the artist node. These edges represent the relationship between a painting and its creator.
  • Convert edges to DataFrame: The resulting edges are stored in a DataFrame with Source and Target columns, representing the painting and artist nodes, respectively.

The following code implements this process:

edges = []
for _, painting_row in painting_artist_graph_nodes.iterrows():
    artist_matches = artist_graph_nodes[
        artist_graph_nodes['Node'] == painting_row['artist_index']]
    for _, artist_row in artist_matches.iterrows():
        edges.append((painting_row['Node'], artist_row['Node']))
painting_artist_graph_edges = pd.DataFrame(edges, columns=['Source', 'Target'])
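The row-by-row loop above can also be expressed as a single join. A sketch, assuming the painting frame carries Node and artist_index columns and the artist frame keeps its offset index in a Node column:

```python
import pandas as pd

def painting_artist_edges(painting_nodes, artist_nodes):
    # Vectorized equivalent of the iterrows() loop: join each painting
    # to its artist via artist_index, then keep the two node indices.
    merged = painting_nodes.merge(
        artist_nodes, left_on='artist_index', right_on='Node',
        suffixes=('_painting', '_artist'))
    edges = merged[['Node_painting', 'Node_artist']]
    return edges.rename(columns={'Node_painting': 'Source',
                                 'Node_artist': 'Target'})
```

For 1000 paintings the difference is negligible, but the merge-based version scales much better and avoids per-row Python overhead.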

The unified Knowledge Graph is created by combining nodes and edges from the Painting and Artist graphs. Painting nodes and Artist nodes are concatenated into a single DataFrame, ensuring all nodes are represented. Similarly, edges between paintings and their artists (painting_artist_graph_edges) and edges within the Artist graph (artist_graph_edges) are concatenated into a single DataFrame of edges. The resulting Knowledge Graph contains 1218 edges, ready for further graph-based processing and analysis.

knowledge_graph_nodes = pd.concat([painting_artist_graph_nodes, artist_graph_nodes],
    ignore_index=True)
knowledge_graph_edges = pd.concat([painting_artist_graph_edges, artist_graph_edges],
    ignore_index=True)
knowledge_graph_edges.shape
(1218, 2)

Transforming the Unified Graph to DGL Format

The Knowledge Graph is constructed as a DGL graph object. The edges from knowledge_graph_edges are transformed into a PyTorch tensor to define the connections between nodes. These edges are used to create a DGL graph, and self-loops are added to ensure that each node is connected to itself, which can improve GNN performance. The node features are represented by knowledge_graph_embeddings, with each node having a 128-dimensional feature vector. The resulting graph contains 1050 nodes and 2268 edges, making it ready for graph-based learning tasks.

unpickEdges = knowledge_graph_edges
edge_index = torch.tensor(unpickEdges[['Source', 'Target']].T.values)
u, v = edge_index[0], edge_index[1]
gKG = dgl.graph((u, v))
gKG = dgl.add_self_loop(gKG)
gKG.ndata['feat'] = knowledge_graph_embeddings  # 1050 x 128 node feature tensor
gKG
Graph(num_nodes=1050, num_edges=2268,
      ndata_schemes={'feat': Scheme(shape=(128,), dtype=torch.float32)}
      edata_schemes={})
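The knowledge_graph_embeddings tensor is assembled from the two sets of 128-dimensional GNN embeddings saved earlier. Because paintings occupy node indices 0–999 and artists 1000–1049, the painting embeddings must come first; a minimal sketch:

```python
import torch

def build_kg_embeddings(paintings_h, artists_h):
    # Painting rows first (node indices 0-999), then artist rows (1000-1049),
    # matching the index offset used when the graphs were combined.
    return torch.cat([paintings_h, artists_h], dim=0)
```

With the saved tensors this would be called as something like `build_kg_embeddings(torch.load('.../paintings_h.pt'), torch.load('.../artists_h.pt'))`, where the artist embedding path is an assumption based on the earlier save step.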

GNN Link Prediction Model for Unified Knowledge Graph

The final phase of training involves running the GNN link prediction model for 2000 epochs. During each epoch, the model performs a forward pass to compute node embeddings, calculates scores for positive and negative edges, and computes the loss. Gradients are backpropagated to update model parameters, optimizing the model for link prediction. Progress is logged every 100 epochs, showing the loss to monitor convergence.

After training, the model's performance is evaluated on the test set. Positive and negative edge scores are computed, and the AUC (Area Under the Curve) metric is used to assess the model's ability to distinguish between true and false links. The final AUC score achieved is 0.9007, indicating strong predictive performance.

all_logits = []
for e in range(2000):
    # forward
    h = model(train_g, train_g.ndata['feat'].float())
    pos_score = pred(train_pos_g, h)
    neg_score = pred(train_neg_g, h)
    loss = compute_loss(pos_score, neg_score)
    # backward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if e % 100 == 0:
        print('In epoch {}, loss: {}'.format(e, loss))


from sklearn.metrics import roc_auc_score
with torch.no_grad():
    pos_score = pred(test_pos_g, h)
    neg_score = pred(test_neg_g, h)
    print('AUC', compute_auc(pos_score, neg_score))
AUC 0.9007130791642577

The node embeddings for the unified Knowledge Graph are saved and reloaded to enable persistent storage and reuse:

  • Save embeddings: The node embeddings, represented by the variable h, are saved to Google Drive as a PyTorch tensor using torch.save. This ensures that the embeddings can be preserved for later use.
  • Load embeddings: The saved embeddings are reloaded from Google Drive using torch.load. This allows for seamless reuse of the computed embeddings without requiring recomputation.

torch.save(h, '/content/drive/My Drive/Art/artist_knowledge_graph_h.pt')
load_h = torch.load('/content/drive/My Drive/Art/artist_knowledge_graph_h.pt')

Interpreting GNN Link Prediction Model Results

The results of the GNN Link Prediction (LP) model go beyond merely predicting links; they generate embedded vectors for each node in the Unified Knowledge Graph. These vectors, consistent in size, provide high-dimensional representations of the nodes, capturing both contextual and relational information. Such embeddings enable advanced analysis using techniques like cosine similarity, clustering, and building new graph layers, offering deeper insights into the structure and relationships within the graph.

The Unified Knowledge Graph in this study consists of 50 artist nodes and 1000 painting nodes, all encoded as vectors of size 128. The GNN LP model ensures a consistent embedding space across all nodes, enabling seamless analysis and comparisons between different node types.

In this section, we explore two main areas of interpretation:

  • Artist-artist relationships: Examining similarities in the embedded vectors to identify shared themes, influences, or overlapping artistic philosophies.
  • Painting-painting connections: Analyzing stylistic or contextual similarities between paintings through their embeddings.
By focusing on embeddings rather than direct link predictions, this study highlights the ability of GNN models to uncover nuanced patterns and provide a deeper understanding of the complex relationships within the art world.

Model Results Analysis

In this step, we calculate the cosine similarity between all pairs of nodes in the Unified Knowledge Graph using their embedded vectors. The similarity scores, generated by a PyTorch function, quantify the relationships between nodes based on their high-dimensional embeddings.

The result is stored in a DataFrame containing 1,102,500 rows, representing all possible pairs of 1,050 nodes (50 artists and 1,000 paintings). Each row includes the indices of the two nodes and their corresponding similarity score, which serves as the foundation for analyzing relationships between nodes in the graph.

from sentence_transformers.util import pytorch_cos_sim  # assumed source of pytorch_cos_sim

cosine_scores_gnn = pytorch_cos_sim(load_h, load_h)
pairs_scores = []
for i in range(len(cosine_scores_gnn)):
    for j in range(len(cosine_scores_gnn)):
        pairs_scores.append({
            'idx1': i,
            'idx2': j,
            'score': cosine_scores_gnn[i][j].item()  # .item() extracts the Python scalar
        })

df = pd.DataFrame(pairs_scores)
df.shape
(1102500, 3)
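The nested Python loop builds its 1,102,500 rows one at a time. An equivalent vectorized construction — a sketch that computes the full cosine matrix from row-normalized embeddings and flattens it in one step — is:

```python
import numpy as np
import pandas as pd
import torch
import torch.nn.functional as F

def pairwise_cosine_df(embeddings):
    # Normalize rows, take the full similarity matrix, and flatten it
    # into an (idx1, idx2, score) DataFrame covering every node pair.
    normed = F.normalize(embeddings, dim=1)
    scores = (normed @ normed.T).numpy()
    n = scores.shape[0]
    idx1, idx2 = np.meshgrid(np.arange(n), np.arange(n), indexing='ij')
    return pd.DataFrame({'idx1': idx1.ravel(),
                         'idx2': idx2.ravel(),
                         'score': scores.ravel()})
```

For 1050 nodes this runs in a fraction of a second, whereas the pure-Python double loop pays per-element tensor indexing costs.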

This step enhances the similarity DataFrame by joining node information from the Unified Knowledge Graph. For each pair of indices (idx1 and idx2), node metadata such as names and types are added, providing contextual information about the nodes involved in each similarity computation.

Key steps include:

  • Joining node information: The idx1 and idx2 columns are matched with the nodeIdx column in the Knowledge Graph nodes DataFrame, adding relevant details about each node.
  • Renaming columns: After the join, node metadata (nodeName and nodeType) is renamed to node1_name, node1_type, node2_name, and node2_type for clarity.
  • Dropping redundant columns: Extra columns from the join (nodeIdx_x and nodeIdx_y) are removed, keeping the DataFrame clean and focused on relevant information.

The resulting DataFrame includes enriched details about each pair of nodes, laying the groundwork for deeper analysis of relationships within the graph.

df = df.merge(knowledge_graph_nodes, left_on='idx1', right_on='nodeIdx', how='left')
df = df.rename(columns={'nodeName': 'node1_name', 'nodeType': 'node1_type'})
df = df.merge(knowledge_graph_nodes, left_on='idx2', right_on='nodeIdx', how='left')
df = df.rename(columns={'nodeName': 'node2_name', 'nodeType': 'node2_type'})
df = df.drop(columns=['nodeIdx_x', 'nodeIdx_y'])
graph_pairs = df


Cosine Similarity Score Distributions

The GNN Link Prediction model generates similarity scores for each pair of nodes in the Unified Knowledge Graph. These scores represent the relationships between nodes and provide insights into the structure and connections within the graph.

For all node pairs, excluding self-loops, scores range from -0.515 to 1.000, with a mean of approximately 0.460. This indicates that many pairs exhibit moderate similarity, while higher scores highlight stronger relationships.

filtered_graph_pairs = graph_pairs[graph_pairs['idx1'] != graph_pairs['idx2']]
overall_scores = filtered_graph_pairs['score']
overall_stats = overall_scores.describe()
print("Overall Score Distribution:\n", overall_stats)
Overall Score Distribution:
 count   1101450.000000
mean          0.460064
std           0.337415
min          -0.515314
25%           0.217390
50%           0.471504
75%           0.744774
max           1.000000

Artist-artist pairs show a higher mean score of 0.621, reflecting strong semantic relationships such as shared influences or overlapping styles.

artist_artist_pairs = filtered_graph_pairs[
    (filtered_graph_pairs['node1_type'] == 'artist') & (filtered_graph_pairs['node2_type'] == 'artist')
]
artist_artist_scores = artist_artist_pairs['score']
artist_stats = artist_artist_scores.describe()
print("Artist-Artist Score Distribution:\n", artist_stats)
Artist-Artist Score Distribution:
 count    2450.000000
mean        0.621267
std         0.291015
min        -0.281765
25%         0.455145
50%         0.688079
75%         0.860471
max         0.992316

In contrast, painting-painting pairs have a mean score of 0.477, demonstrating diverse visual and contextual connections between artworks.

painting_painting_pairs = filtered_graph_pairs[
    (filtered_graph_pairs['node1_type'] == 'painting') & (filtered_graph_pairs['node2_type'] == 'painting')
]
painting_scores = painting_painting_pairs['score']
painting_stats = painting_scores.describe()
print("Painting-Painting Score Distribution:\n", painting_stats)
Painting-Painting Score Distribution:
 count    999000.000000
mean          0.477173
std           0.338798
min          -0.515314
25%           0.232907
50%           0.492818
75%           0.769807
max           1.000000

These distributions reveal the model's ability to capture meaningful relationships across different types of nodes, setting the stage for deeper analysis of highly connected and less connected nodes. Such insights are valuable for exploring relationships in art and uncovering hidden patterns.


This analysis explores the distribution of cosine similarity scores for painting-painting pairs. The scores are grouped into intervals ranging from -1.0 to 1.0 in steps of 0.1. Data is filtered to include only painting-painting connections, and a histogram visualizes the frequency of pairs within each interval. The results highlight a concentration of pairs in higher intervals (e.g., 0.9-1.0), suggesting strong stylistic or thematic similarities between many paintings. In contrast, lower intervals reveal fewer connections, emphasizing weaker or negative correlations.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
bins = np.arange(-1.0, 1.1, 0.1)
bins[-1] = bins[-1] + 0.001
painting_painting_pairs = filtered_graph_pairs[
    (filtered_graph_pairs['node1_type'] == 'painting') &
    (filtered_graph_pairs['node2_type'] == 'painting')
]
frequency, bin_edges = np.histogram(painting_painting_pairs['score'], bins=bins)
intervals = [f"{round(bin_edges[i], 1)}-{round(bin_edges[i + 1], 1)}" for i in range(len(bin_edges) - 1)]
freq_df = pd.DataFrame({'Interval': intervals, 'Frequency': frequency})
freq_df = freq_df.iloc[::-1]
print(freq_df)
plt.bar(freq_df['Interval'], freq_df['Frequency'], color='skyblue', alpha=0.7)
plt.xticks(rotation=45)
plt.title("Cosine Similarity Distribution for Painting-Painting Pairs")
plt.xlabel("Cosine Similarity Intervals")
plt.ylabel("Frequency")
plt.gca().invert_xaxis()  # Invert x-axis for descending order
plt.grid(axis='y')
plt.tight_layout()
plt.show()


Painting-Painting Connections

This section focuses on analyzing connections between paintings based on their similarity scores, emphasizing relationships that span across different artists. By filtering and enriching the DataFrame, we can better understand how paintings from various artists relate to one another.

Key steps include:

  • Filtering painting-painting pairs: Rows where both nodes are of type painting are extracted, creating a subset of the data that exclusively examines relationships between paintings.
  • Extracting artist names: For each painting, the artist’s name is derived from the file name by splitting the string and capturing the relevant segment. The artist names are stored in new columns artist1 and artist2.
  • Cross-artist connections: Connections between paintings created by different artists are identified by filtering for pairs where artist1 is not equal to artist2. Additionally, only unique pairs are retained by ensuring idx1 < idx2.

The resulting DataFrame highlights cross-artist painting relationships, providing insights into stylistic or thematic similarities between works by different creators.

painting_pairs = df[(df['node1_type'] == 'painting') & (df['node2_type'] == 'painting')].copy()
painting_pairs['artist1'] = painting_pairs['node1_name'].apply(lambda x: x.split('_')[-2])
painting_pairs['artist2'] = painting_pairs['node2_name'].apply(lambda x: x.split('_')[-2])
dfp = painting_pairs

Cross-Artist Painting Connections Analysis

The analysis focuses on identifying connections between paintings created by different artists, highlighting stylistic or thematic similarities. By filtering pairs of paintings where the creators are distinct, and ensuring unique combinations, we analyzed a total of 490,000 cross-artist connections. The top-scoring pairs based on cosine similarity are presented below, offering insights into overlapping influences and shared artistic elements.

diffArtistPaintings = dfp[(dfp['artist1'] != dfp['artist2']) & (dfp['idx1'] < dfp['idx2'])]
diffArtistPaintings.shape
(490000, 9)
diffArtistPaintings.sort_values(by='score').tail(9)


Observations:
  • Stylistic Overlaps: High similarity scores suggest shared stylistic elements or thematic connections between paintings, even from different movements.
  • Artistic Influence: These links may indicate underlying inspirations or techniques shared across eras or artistic philosophies.
  • Applications: The insights can support art recommendations, historical analyses, or new narratives in understanding artistic relationships.


Exploring Painting-Painting Connections Within the Same Artist

This table highlights painting-painting connections within the works of the same artist, showcasing the least similar pairs based on cosine similarity scores derived from the GNN embeddings.

artistPaintings = dfp[(dfp['artist1'] == dfp['artist2']) & (dfp['idx1'] < dfp['idx2'])]
artistPaintings.sort_values(by='score').head(11)


Key Observations
  • Low Cosine Similarity Within Same Artist's Works:
    • Hieronymus Bosch has the lowest similarity score (0.2722), indicating significant variation between the chosen paintings, likely due to differences in themes or styles.
    • Pierre-Auguste Renoir and Edouard Manet also have scores near 0.29, reflecting diversity in artistic elements.

  • Mikhail Vrubel Dominates: The table features multiple pairs from Mikhail Vrubel, with scores ranging from 0.2952 to 0.3195, highlighting his stylistic versatility.
  • Stylistic Outlier: The painting "Lilac" by Mikhail Vrubel exhibits a low cosine similarity (<0.32) to his other works. This deviation offers insight into how specific pieces can stand apart from an artist's typical style, enriching our understanding of their creative range and versatility.

The low similarity scores between paintings of the same artist emphasize stylistic versatility or exploration within their body of work. These results provide valuable insights into the evolution and experimentation of individual artists, offering a deeper understanding of their creative journeys.

Artist-Artist Connections

This analysis explores the relationships between artists by categorizing their connections based on cosine similarity scores and metadata such as shared genres or nationalities.

First, pairs of artist nodes were isolated and merged with existing edge data to identify relationships such as shared genres or nationalities. Missing values in the genre and nationality columns were replaced with "None" to standardize the dataset for further processing.

artist_pairs = df[(df['node1_type'] == 'artist') & (df['node2_type'] == 'artist')]
combined_df = pd.merge(
    artist_pairs,
    artist_edges,
    on=['idx1', 'idx2'],
    how='left'
)
combined_df['genre'] = combined_df['genre'].fillna('None')
combined_df['nationality'] = combined_df['nationality'].fillna('None')

To better understand the nature of artist connections, a new column, edge_type, was introduced to classify relationships into the following categories:

  • Genre: When artists share a common artistic style or movement.
  • Nationality: When artists belong to the same country.
  • Both: When artists share both genre and nationality.
  • None: When no direct connection exists in the provided metadata.

This categorization offers valuable insights into the nature of artist-artist relationships, helping uncover patterns and anomalies within the graph.

def determine_edge_type(row):
    if row['genre'] != 'None' and row['nationality'] != 'None':
        return 'both'
    elif row['genre'] != 'None':
        return 'genre'
    elif row['nationality'] != 'None':
        return 'nationality'
    else:
        return 'none'
combined_df['edge_type'] = combined_df.apply(determine_edge_type, axis=1)
combined_df[combined_df['edge_type']!='none'].tail(11)


stats = (combined_df.groupby('edge_type')['score']
         .agg(['mean', 'median', 'min', 'max', 'std']).reset_index())
print("Statistics by Edge Type:")
print(stats)
Statistics by Edge Type:
     edge_type      mean    median       min       max       std
0         both  0.751972  0.790473  0.345487  0.989279  0.210772
1        genre  0.817520  0.864702  0.260232  0.978745  0.149252
2  nationality  0.719641  0.791969  0.052798  0.992316  0.245162
3         none  0.617343  0.685730 -0.281765  1.000000  0.296322

To better understand the relationship between edge types and their corresponding similarity scores, we use a boxplot. The plot is arranged in the following order: both, genre, nationality, and none, ensuring consistency and clarity. A uniform light gray color is applied to all boxes for a clean and professional appearance.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
order = ['both', 'genre', 'nationality', 'none']  # display order for the boxes
sns.boxplot(x='edge_type', y='score', data=combined_df, order=order, color='lightgray')
plt.title("Score Distribution by Edge Type", fontsize=14)
plt.xlabel("Edge Type", fontsize=12)
plt.ylabel("Score", fontsize=12)
plt.show()

This visualization provides insights into how different edge types (e.g., genre-based or nationality-based connections) influence the similarity scores among artists in the graph.

In this analysis, we focus on understanding the relationships between artist nodes by exploring similarity scores for different edge types. By examining both high and low scores, we gain insights into hidden connections and variations among artists.

Low Scores with 'Both'

Even when artists share both genre and nationality, low similarity scores reveal nuanced differences. These pairs demonstrate the diversity that can exist even within closely related artistic and cultural contexts.

df = combined_df
low_scores_both = df[(df['edge_type'] == 'both') & (df['idx2'] > df['idx1'])]
print("Low scores with 'both' edge type:")
low_scores_both.sort_values(by='score').head(7)


With the exception of Titian and Leonardo da Vinci, all pairs are French artists from Impressionism or Post-Impressionism, yet their low cosine similarity scores highlight nuanced stylistic differences within these closely related movements.

Low Scores with 'Nationality'

Artists sharing the same nationality but exhibiting weak similarities highlight diversity within a shared cultural framework. These cases may represent unique interpretations or individualistic approaches despite shared backgrounds.

low_scores_nationality = df[(df['edge_type'] == 'nationality') & (df['idx2'] > df['idx1'])]
print("Low scores with 'nationality' edge type:")
low_scores_nationality.sort_values(by='score').head(11)


The low cosine similarity scores between artists of the same nationality often reflect differences in time periods and artistic movements. For example, Rembrandt and Piet Mondrian represent vastly different eras—17th-century Baroque and 20th-century Abstract art—while Camille Pissarro and Eugene Delacroix belong to distinct movements like Impressionism and Romanticism. These temporal and stylistic disparities highlight the diversity within shared cultural backgrounds.

Low Scores with 'Genre'

For artist pairs connected by the same genre but showing low similarity scores, we explore how artists within a single genre can differ significantly in their styles, techniques, or approaches to art.

low_scores_genre = df[(df['edge_type'] == 'genre') & (df['idx2'] > df['idx1'])]
print("Low scores with 'genre' edge type:")
low_scores_genre.sort_values(by='score').head()


For artists connected by the same genre, even the lowest cosine similarity scores are relatively high, indicating a baseline level of shared characteristics. However, these pairs still highlight stylistic diversity within the same genre, such as the contrasting approaches of Eugene Delacroix and William Turner in Romanticism or the differences between Albrecht Dürer and Hieronymus Bosch in the Northern Renaissance.

High Scores with 'None'

Artists with no shared genre or nationality but high similarity scores reveal unexpected connections. These relationships may indicate shared influences, overlapping themes, or stylistic similarities that are not captured by explicit attributes.

high_scores_none = df[(df['edge_type'] == 'none') & (df['idx2'] > df['idx1'])]
print("High scores with 'none' edge type:")
high_scores_none.sort_values(by='score', ascending=False).head(7)


These high scores, despite no overlaps in genre or nationality, suggest deeper, unrecognized influences or thematic and stylistic parallels that transcend traditional classifications, warranting further exploration. One such connection is between Pablo Picasso (1881–1973), the Spanish pioneer of Cubism, and Mikhail Vrubel (1856–1910), a Russian Symbolist painter. Although they lived in different periods and represented contrasting artistic movements, a historical link bridges their work. In 1906, Sergey Diaghilev brought Vrubel’s works to Paris, where they captured the admiration of a young Pablo Picasso.

Conclusion

This study presented a Unified Knowledge Graph framework that combines text and image data to explore artistic relationships. By utilizing Graph Neural Networks (GNNs) for embedding and link prediction, we uncovered meaningful connections within and across artist and painting nodes, revealing both expected patterns and surprising insights.

Key findings include:

  • Cross-Artist Painting Connections: Identified thematic overlaps and stylistic diversity between works created by different artists.
  • Artist-Specific Variations: Highlighted the evolution and experimentation within individual artists' portfolios, showcasing their creative range.
  • Unexpected Influences: Discovered surprising connections, such as historical links between Pablo Picasso and Mikhail Vrubel, demonstrating the broader potential of this approach.

The Unified Knowledge Graph provides a powerful tool for analyzing complex datasets, offering fresh perspectives on art history. It also paves the way for applications in recommendation systems, classification tasks, and cultural analysis. Future work will focus on expanding the graph with additional data types and improving model interpretability.