The "Geometric Deep Learning" paper was written in 2021 when Convolutional Neural Networks (CNNs) were the leading models in the deep learning world. If that paper were written in 2023-2024, Large Language Models (LLMs) would undoubtedly be at the forefront. It's exciting to think about what might be the biggest breakthrough in deep learning in the next 2-3 years.
capturing the structural and feature information of the entire graph. The graphUnion function appends the results, along with their corresponding graph indices, to the cosine_similarities list.
After training the GNN Link Prediction model, we will examine structural changes in the network to uncover key influencers.
The use of Graph Neural Networks (GNNs) in time series analysis represents a rising field of study, particularly in the context of GNN Graph Classification, a technique traditionally applied in disciplines such as biology and chemistry. Our research repurposes GNN Graph Classification for the analysis of time series climate data, focusing on two distinct methodologies: the city-graph method, which effectively captures static temporal snapshots, and the sliding window graph method, adept at tracking dynamic temporal changes. This innovative application of GNN Graph Classification within time series data enables the uncovering of nuanced data trends.
We demonstrate how GNNs can construct meaningful graphs from time series data, showcasing their versatility across different analytical contexts. A key finding is GNNs’ adeptness at adapting to changes in graph structure, which significantly improves outlier detection. This enhances our understanding of climate patterns and suggests broader applications of GNN Graph Classification in analyzing complex data systems beyond traditional time series analysis. Our research seeks to fill a gap in current studies by providing an examination of GNNs in climate change analysis, highlighting the potential of these methods in capturing and interpreting intricate data trends.
This study was presented at the International Conference on Machine Learning Technologies (ICMLT) and is included in the proceedings.
In our research, we combined and compared two methods for analyzing time series climate data using Graph Neural Networks (GNNs). Our previous study, “GNN Graph Classification for Climate Change Patterns: Graph Neural Network (GNN) Graph Classification - A Novel Method for Analyzing Time Series Data”, introduced the city-graph method, which captures static temporal snapshots to sort climate data into ‘stable’ and ‘unstable’ categories. In this post, we focus on our new technique: the sliding window graph method. This approach breaks down time series data into overlapping sections to capture specific time-related features. These sections are then used to create graphs, providing a new way to understand short-term changes in climate patterns.
In 2012, deep learning and knowledge graphs took a big leap forward in data analysis and machine learning. AlexNet, a new type of Convolutional Neural Network (CNN) for image classification, showed much better results than older methods. Around the same time, Google introduced knowledge graphs, which improved how data is integrated and managed.
However, CNNs struggled with graph-structured data, and graph techniques lacked deep learning’s ability to recognize patterns. This changed with the arrival of Graph Neural Networks (GNNs). GNNs combined deep learning with graph data processing, making it easier to analyze graph-structured data.
GNN models are designed specifically for graph data. They use geometric relationships and combine node features with graph structure. This makes them very useful for tasks like node classification, link prediction, and graph classification. GNN Graph Classification models, which have been used in areas like chemistry and medicine, classify entire graphs based on their structure and features.
The “Geometric Deep Learning” paper was written in 2021, when Convolutional Neural Networks (CNNs) were the dominant models in the deep learning landscape. If the paper were written in 2023-2024, Large Language Models (LLMs) would undoubtedly be considered the leading technology. The field of deep learning is evolving rapidly, and it remains to be seen which new models will dominate in the next 2-3 years.
In this study, we expand on our previous research using Graph Neural Network (GNN) models to analyze climate data. Our earlier method categorized climate time series data into ‘stable’ and ‘unstable’ to identify unusual patterns in climate change.
Now, we introduce the sliding window graph method, which breaks down time series data into overlapping sections to capture specific time-related features. This approach creates graphs from these sections, offering a new perspective on short-term climate changes.
Our previous study used a city-graph method, where nodes represent city-year combinations with daily temperature vectors as features. The new sliding window method compares identical dates across different cities and years, helping us understand global climate trends.
Our research aims to explore the potential of GNN graph classification in identifying and interpreting global climate dynamics, providing valuable insights into seasonal changes and long-term shifts in climate.
In our study, we explore two different methods for constructing graphs from climate data: the City-Graph Method and the Sliding Window Method.
City-Graph Method:
Sliding Window Method:
While the graph construction methods differ, both follow a common pipeline for GNN Graph Classification:
Our approach uses Graph Neural Networks (GNNs) combined with a sliding window technique to analyze time series data. Here’s an overview of the process:
We segment time series data into smaller graphs using a sliding window, which captures local temporal patterns. Each time segment forms a unique graph.
In these graphs, nodes represent data points within the window, with features reflecting their values. Edges connect these sequential points to maintain the temporal order.
The window size (W) and overlap (shift size S) are important as they determine how the data is segmented and analyzed. Edge definitions within the graphs are tailored to the specifics of the time series data, helping to detect patterns.
For a dataset with N data points, we apply a sliding window of size W with a shift of S to create nodes. The number of nodes, N_{nodes}, is calculated as: ${N}_{\mathrm{nodes}}=\lfloor \frac{N-W}{S}\rfloor +1$
With the nodes determined, we construct graphs, each comprising G nodes, with a shift of S_{g} between successive graphs. The number of graphs, N_{graphs}, is calculated by: ${N}_{\mathrm{graphs}}=\lfloor \frac{{N}_{\mathrm{nodes}}-G}{{S}_{g}}\rfloor +1$
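As a quick sanity check, the two counting formulas above can be written in a few lines of Python; the 365-point series length below is illustrative, while W = 30, S = 7, G = 30, and S_g = 4 are the settings described later in the study:

```python
def num_windows(total, size, shift):
    """Number of fixed-size windows over a sequence: floor((total - size) / shift) + 1."""
    return (total - size) // shift + 1

# One year of daily data as an illustrative series length (not the study's full dataset)
n_nodes = num_windows(365, 30, 7)        # nodes from a sliding window with W = 30, S = 7
n_graphs = num_windows(n_nodes, 30, 4)   # graphs with G = 30 nodes each and shift S_g = 4
```

The same `num_windows` helper covers both formulas, since graph construction is just a second sliding window applied over the sequence of nodes.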
This method allows us to analyze time series data effectively by capturing both local and global patterns, providing valuable insights into temporal dynamics.
Our methodology involves processing both city-centric and sliding window graphs. We start by generating cosine similarity matrices from time series data, which are then converted into graph adjacency matrices. This process includes creating edges for vector pairs with cosine values above a set threshold and adding a virtual node to ensure network connectivity, a critical step for preparing the graph structure.
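This preparation step can be sketched roughly as follows; the 0.85 similarity threshold and the tiny two-dimensional vectors are illustrative assumptions, not values from the study:

```python
import numpy as np

def build_adjacency(vectors, threshold=0.85):
    """Turn a cosine-similarity matrix into a graph adjacency matrix.

    Edges connect vector pairs whose cosine similarity exceeds `threshold`;
    a virtual node (the last row/column) is linked to every node so the
    resulting graph is guaranteed to be connected.
    """
    X = np.asarray(vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)   # unit-normalize rows
    sim = X @ X.T                                      # cosine similarity matrix
    n = len(X)
    adj = np.zeros((n + 1, n + 1), dtype=int)          # extra slot for the virtual node
    adj[:n, :n] = (sim > threshold).astype(int)
    np.fill_diagonal(adj, 0)                           # no self-loops
    adj[:n, n] = adj[n, :n] = 1                        # virtual node connects to all
    return adj
```

The virtual node keeps message passing well-defined even when the threshold leaves some vectors isolated.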
For graph classification tasks, we use the GCNConv model from the PyTorch Geometric Library. This model excels in feature extraction through its convolutional operations, taking into account edges, node attributes, and graph labels for comprehensive graph analysis. The approach concludes with the training phase of the GNN model, applying these techniques to both types of graphs for robust classification.
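For intuition, the propagation rule that GCNConv implements can be written out in plain NumPy. This is a toy sketch of one convolution plus a mean-pool readout, not the actual PyTorch Geometric model trained in the study:

```python
import numpy as np

def gcn_layer(adj, features, weight):
    """One GCN convolution: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W),
    the symmetric-normalized propagation rule behind GCNConv."""
    a_hat = adj + np.eye(len(adj))                     # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ features @ weight, 0.0) # ReLU activation

def readout(node_embeddings):
    """Mean-pool node embeddings into a single graph-level vector,
    which a final linear layer would map to a graph label."""
    return node_embeddings.mean(axis=0)
```

Stacking a couple of such layers and classifying the pooled vector is, schematically, what the GNN Graph Classification model does.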
For this study, we utilized climate data from Kaggle, specifically the dataset titled “Temperature History of 1000 Cities 1980 to 2020”. This dataset provides average daily temperature data from 1980 to 2020 for the 1000 most populous cities in the world. This comprehensive dataset served as the foundation for both the city-centric and sliding window graph methods employed in our analysis.
The bar chart shows city frequency by latitude. Most cities are between 20 and 60 degrees in the Northern Hemisphere. There are fewer cities around the equator and even fewer in the Southern Hemisphere.
Using a 40-year dataset of daily temperatures from 1000 cities, our study evaluates GNN’s effectiveness in identifying global climate patterns. We focus on data from January 1st to the start of each month, providing insights into climate consistency, seasonal changes, and long-term shifts.
In our global climate data analysis, we use the sliding window graph method on a dataset with 40 years of daily temperatures from 1000 cities. This approach segments the data into graphs, each defined by a 30-day window (𝑊 = 30) with a 7-day shift (𝑆 = 7), effectively capturing local climate dynamics. This results in 1624 small graphs, allowing us to analyze short-term climate variations and trends.
Our accuracy metrics provide insights into the stability and variability of global climate patterns. High accuracy suggests predictable seasonal trends, while lower accuracy indicates irregular climate patterns or shifts. The sliding window graph method allows us to thoroughly evaluate the model’s ability to identify complex patterns in large climate datasets.
When examining closely spaced months, such as January 1st to February 1st and January 1st to December 1st, the GNN model’s accuracy around 0.5 suggests difficulty in identifying distinct climate patterns. This low accuracy points to potential variability and unpredictability in global weather patterns during these periods, highlighting the complex dynamics of weather.
For periods between January and months like March, April, or October, the model achieves accuracy metrics averaging around 0.7 to 0.8, indicating moderate success in capturing climatic patterns. This is likely due to the model’s proficiency in identifying consistent seasonal transitions over these extended timeframes.
The highest accuracy metrics, ranging from 0.94 to 0.99, are observed for months other than January, such as May, June, July, August, and September. These results reflect the model’s exceptional performance in predicting climate patterns during these months, particularly in the stable summer months. This suggests that the GNN model excels in recognizing and adapting to distinctive climatic patterns, resulting in highly accurate predictions.
For classification, we split our graph dataset into ‘stable’ and ‘unstable’ groups based on average cosine similarities between consecutive years. This method segmented the global dataset into stable and unstable categories for our sliding window analysis. Using 20,000 city-year combinations, we set a window size of 30 (𝑊 = 30) and a shift size of 6 (𝑆 = 6), facilitating precise computations for both stable and unstable datasets. Each graph contains 30 nodes (𝐺 = 30), with a shift of 4 (𝑆𝑔 = 4) between successive graphs, resulting in a total of 1648 small graphs.
In our study, GNN graph classification for stable climate cities starts with moderate accuracy in February, significantly improves by May reaching a peak of 100%, and maintains high accuracy through the summer months, only to dip in October with a slight recovery in November. In contrast, unstable climate cities start with near-random accuracy in February, improve steadily, peak in August, and then decline sharply, returning to early-year levels by December. This indicates the model’s varying adaptability to stable and unstable climate patterns throughout the year.
Analysis starting from January 1 shows that the model’s performance is influenced by the time of year. Unstable climates see low accuracies in the early and late parts of the year, suggesting limited learning during these periods. Conversely, stable climates exhibit significant improvements in accuracy during spring and summer, indicating effective data integration. However, the model’s performance overall is subject to fluctuations, peaking in the summer months before declining towards the end of the year, highlighting the challenges in generalizing across seasonal variations in climate data.
The create_segments_df function segments a specified column from a DataFrame into fixed-size windows. For each segment, it adds context such as the start date, row index, and column label. The function then combines these segments into a new DataFrame. This is useful for time series analysis or for preparing data for machine learning models.
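A minimal sketch of what such a segmentation function might look like; the exact signature and column names of the study's create_segments_df are assumptions here:

```python
import pandas as pd

def create_segments_df(df, column, window, shift):
    """Slice `column` of `df` into fixed-size windows of length `window`,
    stepping by `shift`, and tag each segment with its start date,
    segment index, and column label (a simplified sketch)."""
    segments = []
    values = df[column].to_numpy()
    for seg_idx, start in enumerate(range(0, len(values) - window + 1, shift)):
        segments.append({
            "segmentIdx": seg_idx,
            "startDate": df.index[start],        # assumes a date-like index
            "columnLabel": column,
            "values": values[start:start + window],
        })
    return pd.DataFrame(segments)
```

Each row of the result is one window, ready to become a node in a later graph-construction step.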
The group_segments function takes a DataFrame of segments and groups them into larger segments based on specified sizes and shifts. It adds a group index to each group and combines them into a new DataFrame. This is useful for aggregating data over larger windows, which is essential for graph-based models or detailed data analysis.
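A simplified version of this grouping step might look like the following; the real group_segments signature is again an assumption:

```python
import pandas as pd

def group_segments(segments_df, group_size, group_shift):
    """Bundle consecutive segments into overlapping groups of `group_size`,
    stepping by `group_shift`, tagging each row with its group index
    (a simplified sketch of the grouping step described above)."""
    groups = []
    n = len(segments_df)
    for g_idx, start in enumerate(range(0, n - group_size + 1, group_shift)):
        group = segments_df.iloc[start:start + group_size].copy()
        group["groupIdx"] = g_idx
        groups.append(group)
    return pd.concat(groups, ignore_index=True)
```

Each group of segments later becomes one small graph for the GNN Graph Classification model.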
Take columns col1 and col2 from a dataset, fill NaN values with their mean values, and scale these columns using MinMaxScaler.
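This preprocessing step can be sketched as follows; the study uses scikit-learn's MinMaxScaler, but the scaling is written out in plain pandas here so the transformation is explicit:

```python
import pandas as pd

def prepare_columns(df, cols=("col1", "col2")):
    """Fill NaNs with each column's mean, then min-max scale to [0, 1],
    matching what MinMaxScaler computes for a single feature."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].fillna(out[col].mean())
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo)
    return out
```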
The code creates segments from two columns (col1 and col2) of a dataset using the create_segments_df function, assigns node indices to each segment, and then groups these segments with the group_segments function. It combines the grouped segments into a final dataset, assigning a unique datasetIdx to each. Finally, it generates metadata for each dataset index and merges it with the segment data to form graphList.
Continuation of coding is described in our previous study, “GNN Graph Classification for Climate Change Patterns: Graph Neural Network (GNN) Graph Classification - A Novel Method for Analyzing Time Series Data”. This current work continues the same coding methodology for both city graphs and sliding window graphs.
In this research, we evaluated two distinct GNN Graph Classification techniques for analyzing climate data: the city-graph and the sliding window graph methods. The city-graph method assigns a node to each city-year pair, connecting them based on the cosine similarity of their temperature profiles, making it particularly suited for analyzing long-term climate trends. In contrast, the sliding window technique divides time series data into overlapping segments to form graphs, adeptly identifying short-term climate variations.
Both techniques were applied to the same dataset to compare their effectiveness in categorizing cities by climate stability. We found that the city-graph method more accurately discerned long-term climate stability, whereas the sliding window approach excelled in detecting short-term climate changes. Therefore, the choice of method depends on the specific objectives of the analysis: the city-graph is preferable for examining extended trends, while the sliding window method is ideal for investigating immediate climatic shifts.
GNN graph classification has shown its strength in mapping complex relationships within graph-based datasets, making it a versatile tool in fields ranging from molecular dynamics to social network analysis. This versatility extends to climate data analysis, where it aids in identifying stable versus unstable climate patterns across cities by evaluating average cosine similarities of yearly temperature fluctuations. The addition of the sliding window graph approach further refines our study, enabling the model to continuously integrate new data and offer a detailed view of changing climate patterns. This technique is adept at capturing the dynamic nature of climate data, allowing for a more nuanced analysis of temporal trends and making it particularly suitable for managing the variable nature of climate data. This method’s ability to prioritize recent data over older information is crucial for adapting to the fast-paced changes characteristic of climate patterns.
In this study, we have leveraged GNN graph classification to address the complex challenge of analyzing climate patterns across different geographic locales, underscoring the method’s adaptability and broad applicability. Our research aimed explicitly at harnessing the potential of GNNs to distinguish between stable and unstable climate conditions in cities worldwide, using average cosine similarities of annual temperature variations as a novel classification metric. By integrating the sliding window graph approach, we have enhanced our model’s ability to dynamically assimilate and refresh data, offering a granular perspective on the fluctuating climate patterns and their implications over time.
This investigation has demonstrated that while equatorial cities exhibit consistency in climate stability, higher latitude cities experience more pronounced fluctuations. Remarkably, our analysis also brought to light certain anomalies, such as Mediterranean cities with unexpectedly consistent climates and cities in China and Mexico with notable climate variability. These findings highlight the critical importance of considering local geographical and climatic factors in climate studies and underscore the nuanced capabilities of GNN models in detecting subtle climate dynamics.
Ultimately, our study reinforces the utility of GNN graph classification, especially with the incorporation of the sliding window approach, as a potent tool for dissecting and understanding climate data. This method does not merely augment the predictive accuracy of our models but significantly bolsters their adaptability to ongoing climate changes, offering a richer comprehension of the complex interplay of factors influencing global climate trends. As such, GNN graph classification emerges as an indispensable instrument in the ongoing efforts to tackle the multifaceted challenges posed by global climate change, paving the way for more informed and effective climate resilience strategies.
In recent years, knowledge graphs have become a powerful tool for integrating and analyzing data and shedding light on the connections between entities. This study focuses on unraveling detailed relationships within knowledge graphs, placing special emphasis on the role of graph connectors through link predictions and triangle analysis.
Using Graph Neural Network (GNN) Link Prediction models and graph triangle analysis in knowledge graphs, we have managed to uncover relationships that had been previously undetected or overlooked. Our findings mark a significant milestone, paving the way for more comprehensive exploration into the complex relationships that exist within knowledge graphs.
This study initiates further research in the area of unveiling the hidden dynamics and connections in knowledge graphs. The insights from this work promise to redefine our understanding of knowledge graphs and their potential for unlocking the complexities of data interrelationships.
The year 2012 was pivotal for deep learning and knowledge graphs. That year, the introduction of AlexNet, a Convolutional Neural Network (CNN), showcased the power of deep learning for image classification. Simultaneously, Google's introduction of knowledge graphs transformed data integration and management.
For many years, deep learning and knowledge graphs developed independently. CNNs proved effective with grid-structured data but struggled with graph-structured data. On the other hand, graph techniques excelled in representing and reasoning about graph data but lacked deep learning's power. In the late 2010s, Graph Neural Networks (GNNs) bridged this gap and emerged as a potent tool for processing graph-structured data through deep learning techniques.
For years, we've relied on binary graph structures, simplifying complex relationships into 'yes' or 'no', '1' or '0'. But in our ever-evolving world, is that enough? We believed there was more depth to be explored. Thus, we turned to Graph Neural Networks, a frontier technology, to help us transition from these fixed binaries to a more fluid, continuous space.
In our previous study 'Rewiring Knowledge Graphs by Link Predictions' we delved into the exploration of knowledge graph rewiring to reveal unknown relationships between modern art artists, employing GNN link prediction models. By training these models on Wikipedia articles about modern art artists' biographies and leveraging GNN link prediction models, we identified previously unknown relationships between artists.
To rewire knowledge graphs, we adopted two distinct methods. First, we utilized a traditional method that involved a full-text analysis of articles and calculation of cosine similarities between embedded nodes. The second method involved the construction of semantic graphs based on the distribution of pairs of co-located words, and edges between nodes that share common words.
Let's take a moment to appreciate the evolution and elevation that GNN Link Prediction brings to the table. Remember the days of black and white television? Now imagine transitioning from that to a high-definition colored TV. That's the kind of transformative leap we're talking about when moving from traditional graph representations to GNN Link Prediction. Instead of just binary relationships, we're now operating on a continuous spectrum. Why is this so revolutionary? Because it allows us to see the subtle intricacies, the patterns that were once invisible. We're no longer just categorizing relationships as 'connected' or 'not connected'; we're exploring the depth, the weight, the very essence of these connections. It's like being given a magnifying glass to see the intricate patterns that were always there but previously overlooked. This shift not only boosts our prediction accuracy but also broadens our understanding of the complex web of relationships within our data.
In the vast network of relationships, it's essential to understand not just who is connected to whom, but also the depth and nature of these connections. Let's take a simplified example featuring Alice, Beth, and their college. Alice and Beth shared a close bond during their college days, so their connection is strong. But when we look at their individual relationships with the college, it's more of an association by attendance, making it a weaker connection. Picture a triangle with its vertices representing Alice, Beth, and their college. The strength of the links in this triangle varies. The college acts as a 'Graph Connector'—a node that forms a bridge between different entities. Now, why is this distinction crucial? Because understanding these nuanced connections ensures we don't treat all relationships equally. It enables us to discern, prioritize, and gain richer insights into our network, ensuring our analysis is both detailed and accurate.
Analyzing graph triangles offers insights into the strength of connections between nodes within a network. Looking at the relationships among nodes A, B, and C, we are focusing on the strength of the connection between nodes A and B compared to the connections involving node C. Node C, identified as a 'graph connector' node, is critical in facilitating communication and interaction between nodes A and B. Serving as a link, node C allows the smooth flow of information and relationships between the strongly connected nodes A and B.
As an analogy, imagine early 20th-century Vienna's intellectual scene as a dynamic network. Berta Zuckerkandl's salon stood out as one of the central nodes, orchestrating and facilitating connections. Her salon served as the platform connecting diverse talents such as artists, scientists, and doctors. Each gathering at her salon can be seen as the creation of 'links' between nodes. Berta stands as a quintessential 'graph connector': her role ensured not just random interactions but impactful connections, underscoring her integral position in this vibrant intellectual web. This illustrates the importance of graph connector nodes in enhancing a network's overall connectivity and functionality, fostering collective behaviors and dynamics among interconnected nodes.

In this study, we aim to compare our previous study's results with the findings obtained through granular graph triangle analysis. Specifically, we'll examine the Wikipedia articles related to Paul Klee and Joan Miró, who were deemed highly disconnected artists in the previous study. By employing graph triangle analysis techniques, we'll unveil previously overlooked graph connectors and patterns between these artists.
For our GNN link prediction model, we'll use the GraphSAGE model. Unlike traditional approaches relying on the entire adjacency matrix information, GraphSAGE focuses on learning aggregator functions. This allows us to generate embeddings for new nodes based on their features and neighborhood information without the need to retrain the entire model.
It's crucial to note that the outputs of the GraphSAGE model in our study are not actual predicted links, but embedded graphs. These embedded graphs capture the relationships and structural information within the original graphs. While these embeddings can be used for predicting graph edges, we will specifically utilize them for graph triangle analysis to identify and explore graph connectors within the network. These graph connectors play a pivotal role in facilitating connections and interactions between nodes, offering valuable insights into network dynamics and relationships.
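For intuition, the mean-aggregator update that GraphSAGE learns can be written out in a few lines of NumPy. This toy sketch only illustrates the aggregation scheme; the study trains an actual GraphSAGE model whose weight matrices are learned, not fixed:

```python
import numpy as np

def sage_layer(adj, features, w_self, w_neigh):
    """One GraphSAGE layer with a mean aggregator:
    h_v' = ReLU(W_self h_v + W_neigh * mean of neighbor features),
    combining a node's own embedding with its neighborhood summary."""
    deg = adj.sum(axis=1, keepdims=True)
    neigh_mean = (adj @ features) / np.maximum(deg, 1)   # mean over neighbors
    return np.maximum(features @ w_self + neigh_mean @ w_neigh, 0.0)
```

Because the layer only needs a node's features and its neighborhood, embeddings can be produced for new nodes without retraining, which is the property highlighted above.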
pair1=[leftWord1, rightWord1], pair2=[leftWord2, rightWord2]
edge12={pair1, pair2}
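This node-and-edge scheme can be sketched with networkx; the simple whitespace tokenization below is an illustrative assumption, not the study's full text-processing pipeline:

```python
import networkx as nx

def semantic_graph(text):
    """Build a semantic graph where nodes are pairs of co-located words
    and edges connect pairs that share a common word
    (the pair1 / pair2 / edge12 scheme above)."""
    words = text.lower().split()
    pairs = [(words[i], words[i + 1]) for i in range(len(words) - 1)]
    g = nx.Graph()
    g.add_nodes_from(pairs)
    for i, p1 in enumerate(pairs):
        for p2 in pairs[i + 1:]:
            if set(p1) & set(p2):       # shared word -> edge between the pairs
                g.add_edge(p1, p2)
    return g
```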
To delve deeper into the intricacies of graph structures, we used graph triangle analysis. Here's a step-by-step breakdown of our methodology:
By focusing on such triangles, we can derive more insight into the underlying relationships between nodes. This allows us to uncover intricate patterns and gain a deeper understanding of the structural nuances present within the graph.
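One way to operationalize this triangle analysis is sketched below; the strong/weak thresholds on cosine-similarity edge weights are illustrative choices, not the study's values:

```python
from itertools import combinations
import networkx as nx

def connector_triangles(g, strong=0.8, weak=0.5):
    """Find triangles with one strongly connected node pair and a third
    'graph connector' node attached through two weaker links.
    Edge weights are assumed to be cosine similarities."""
    results = []
    for a, b, c in combinations(g.nodes, 3):
        if g.has_edge(a, b) and g.has_edge(b, c) and g.has_edge(a, c):
            w = {frozenset(e): g.edges[e]["weight"]
                 for e in [(a, b), (b, c), (a, c)]}
            strong_edges = [e for e, v in w.items() if v >= strong]
            weak_edges = [e for e, v in w.items() if v < weak]
            if len(strong_edges) == 1 and len(weak_edges) == 2:
                # the connector is the node shared by both weak edges
                connector = (set(weak_edges[0]) & set(weak_edges[1])).pop()
                results.append((a, b, c, connector))
    return results
```

On the Alice/Beth/college example above, the college is the node the two weak links share, so it is flagged as the connector.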
Node list:
Get unique word pairs for embedding:
Node embedding:
Save embedded word pairs:
Transform data to DGL format:
The example graph triangles, like those shown in the picture above, demonstrate the crucial role of graph connectors. The numbers placed next to the edges represent the cosine similarities between the vectors of the corresponding nodes, providing valuable insights into the relationships and patterns within the knowledge graph.
In this study, we utilized GNN link prediction techniques and graph triangle analysis to delve deeper into the intricacies of relationships within knowledge graphs. Leveraging these techniques, we demonstrated their potency in revealing patterns that might have previously gone unnoticed.
Our comparison between granular relationship analysis and aggregated relationships unveiled some compelling insights. In our previous study, based on an aggregated view, the artists Paul Klee and Joan Miró were deemed highly disconnected. However, that analysis failed to capture the finer nuances of their relationships. By applying graph triangle analysis techniques in this study, we found potentially significant connections and patterns between these artists, overlooked in the aggregated results.
This demonstrates the significance of granular analysis in comprehending the complex relationships within knowledge graphs. A deeper probe into the relationships between entities uncovers hidden associations and provides fresh insights into the interconnected data.
We have taken a step in exploring the concept of knowledge graph connectors. Through the use of GNN link prediction models and graph triangle analysis techniques, we have exposed the presence of graph connectors. These connectors play a critical role in facilitating connections and interactions between entities within the knowledge graphs.
Our study reveals new ways to understand complex connections in knowledge graphs, shedding light on hidden relationships and dynamics. This study is the beginning of a journey towards gaining a deeper understanding of the hidden relationships and dynamics within knowledge graphs.
Envision the transformative impact of applying our advanced graph connector techniques across various fields:
The possibilities are boundless, and the diverse applications of our graph connector methods promise a future rich with insight and innovation!
In a more recent study, 'Rewiring Knowledge Graphs by Link Predictions', our approach involved applying GNN link prediction models. We trained these models on Wikipedia articles, specifically biographies of modern art artists, and successfully identified previously unknown relationships between artists.
This study aims to extend earlier research by applying GNN graph classification models for document comparison, specifically using Wikipedia articles on modern art artists. Our methodology will involve transforming the text into semantic graphs based on co-located word pairs, then generating subsets of these semantic subgraphs as input data for GNN graph classification models. Finally, we will employ GNN graph classification models for a comparative analysis of the articles.
For several years, deep learning and knowledge graphs progressed along parallel paths. CNN-based deep learning excelled at processing grid-structured data but faced challenges when dealing with graph-structured data. Graph techniques effectively represented and reasoned about graph-structured data but lacked the powerful capabilities of deep learning. In the late 2010s, the emergence of Graph Neural Networks (GNNs) bridged this gap, combining the strengths of deep learning and graphs. GNNs became a powerful tool for processing graph-structured data through deep learning techniques.
GNN models make it possible to apply deep learning algorithms to graph-structured data by modeling entity relationships and capturing the structure and dynamics of graphs. GNN models are commonly used for three tasks on graph-structured data: node classification, link prediction, and graph classification. Node classification models predict the label or category of a node in a graph based on its local and global neighborhood structure. Link prediction models predict whether a link should exist between two nodes based on node attributes and graph topology. Graph classification models classify entire graphs into categories based on their structure and attributes: edges, nodes with features, and graph-level labels.
GNN graph classification models are developed to classify small graphs and in practice they are commonly used in the fields of chemistry and medicine. For example, chemical molecular structures can be represented as graphs, with atoms as nodes, chemical bonds as edges, and graphs labeled by categories.
One of the challenges in GNN graph classification models lies in their sensitivity, where detecting differences between classes is often easier than identifying outliers or incorrectly predicted results. Currently, we are actively engaged in two studies that focus on the application of GNN graph classification models to time series classification tasks: 'GNN Graph Classification for Climate Change Patterns' and 'GNN Graph Classification for EEG Pattern Analysis'.
In this post, we address the challenges of GNN graph classification on semantic graphs for document comparison. We demonstrate effective techniques to harness graph topology and node features in order to enhance document analysis and comparison. Our approach leverages the power of GNN models in handling semantic graph data, contributing to improved document understanding and similarity assessment.
To create semantic graphs from documents we will use the method that we introduced in our post 'Find Semantic Similarities by GNN Link Predictions'. In that post we demonstrated how to use GNN link prediction models to rewire knowledge graphs. For the experiments in that study we looked at semantic similarities and dissimilarities between the biographies of 20 modern art artists, based on the corresponding Wikipedia articles. One experiment used a traditional method applied to the full text of the articles, with cosine similarities between embedded nodes. In another scenario, a GNN link prediction model ran on top of articles represented as semantic graphs, with nodes as pairs of co-located words and edges connecting pairs of nodes with common words.
In this study, we expand on our previous research by leveraging the same data source and employing similar graph representation techniques. However, we introduce a new approach by constructing separate semantic graphs dedicated to each individual artist. This departure from considering the entire set of articles as a single knowledge graph enables us to focus on the specific relationships and patterns related to each artist. By adopting this approach, we aim to capture more targeted insights into the connections and dynamics within the knowledge graph, allowing for a deeper exploration of the relationships encoded within the biographies of these artists.
To translate the text of co-located word pairs to vectors we will use a transformer model from Hugging Face: 'all-MiniLM-L6-v2'. This is a sentence-transformers model that maps text to a 384-dimensional vector space.
As input data for a GNN graph classification model we need a set of labeled small graphs. In this study we will extract a set of subgraphs from each document of interest. By extracting relevant subgraphs from both documents, GNN graph classification models can compare the structural relationships and contextual information within the subgraphs to assess their similarity or dissimilarity. One way to extract such subgraphs is to take the neighbors and neighbors of neighbors of nodes with high centrality. In this study we will use the betweenness centrality metric.
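The extraction step described above can be sketched with NetworkX: rank nodes by betweenness centrality and cut out each top node's two-hop neighborhood. This is a minimal illustration, not the study's actual code; parameter names are assumptions.

```python
import networkx as nx

def extract_subgraphs(graph, top_k=3, radius=2):
    """Extract subgraphs around the top-k nodes by betweenness centrality.

    Each subgraph contains a central node plus its neighbors and
    neighbors of neighbors (radius=2), as described in the text.
    """
    centrality = nx.betweenness_centrality(graph)
    top_nodes = sorted(centrality, key=centrality.get, reverse=True)[:top_k]
    return [nx.ego_graph(graph, node, radius=radius) for node in top_nodes]

# Toy example: in a path graph the middle node has the highest centrality
subs = extract_subgraphs(nx.path_graph(5), top_k=1, radius=2)
```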
The GNN graph classification model is designed to process input graph data, including both the edges and node features, and is trained on graph-level labels. In this case, the input data structure consists of the following components:
Based on the Wikipedia text size distribution, the best known artist in our artist list is Vincent van Gogh and the least known artist is Franz Marc:
More detailed information is available in our post 'Rewiring Knowledge Graphs by Link Predictions'. To estimate document similarities based on the GNN graph classification model, we experimented with pairs of highly connected artists and highly disconnected artists. Pairs of artists were selected based on our study 'Building Knowledge Graph in Spark without SPARQL'. This picture illustrates relationships between modern art artists based on their biographies and art movements.

As highly connected artists, we selected Pablo Picasso and Georges Braque, two artists with well known strong relationships: both were pioneers of the cubism art movement.
As highly disconnected artists, we selected Claude Monet and Kazimir Malevich, who were notably distant from each other: they lived in different time periods, resided in different countries, and belonged to contrasting art movements: Claude Monet was a key artist of impressionism and Kazimir Malevich a key artist of Suprematism. For a more detailed exploration of the relationships between modern art artists discovered through knowledge graph techniques, you can refer to our post 'Knowledge Graph for Data Integration'.

That study found that using the Gramian Angular Field (GAF) image transformation technique for time series data improved the accuracy of CNN image classification models compared to using raw plot pictures. By transforming the time series vectors into GAF images, the data was represented in an embedded space that captured different aspects of the data than raw plots of the EEG data. This suggests that GAF image transformation is a useful technique for improving the accuracy of image classification models for time series data.
The study utilized a combination of advanced deep learning CNN image classification models and traditional graph mining techniques for time series pattern discovery. For image classification, the time series vectors were transformed into GAF images, and for graph mining, the study created graphs based on pairwise cosine similarities between the time series data points. To analyze these graphs, traditional graph mining techniques such as community detection and graph visualization were applied. This hybrid approach enabled the study to capture and analyze different aspects of the time series data, leading to a more comprehensive understanding of the patterns present in the data.
In this study we will explore how Graph Neural Network (GNN) graph classification models can be applied to classify time series data based on the underlying graph structure.
Graph mining is the process of extracting useful information from graphs. Traditional graph-based algorithms such as graph clustering, community detection, and centrality analysis have been used for this purpose. However, these methods have limitations in terms of their ability to learn complex representations and features from graph-structured data.
Graph Neural Networks (GNN) were developed to address these limitations. GNNs enable end-to-end learning of representations and features from graph data, allowing deep learning algorithms to process and learn from graph data. By modeling the relationships between the nodes and edges in a graph, GNNs can capture the underlying structure and dynamics of the graph. This makes them a powerful tool for analyzing and processing complex graph-structured data in various domains, including social networks, biological systems, and recommendation systems.
GNN models allow for deep learning on graph-structured data by modeling entity relationships and capturing graph structures and dynamics. They can be used for tasks such as node classification, link prediction, and graph classification. Node classification models predict the label or category of a node based on its local and global neighborhood structure. Link prediction models predict whether a link should exist between two nodes based on node attributes and graph structure. Graph classification models classify entire graphs into different categories based on their structure and attributes.
GNN graph classification models are developed to classify small graphs and in practice they are commonly used in the fields of chemistry and medicine. For example, chemical molecular structures can be represented as graphs, with atoms as nodes, chemical bonds as edges, and graphs labeled by categories.
In this post we will experiment with time series graph classification in the healthcare domain: GNN graph classification models will be applied to electroencephalography (EEG) signal data by modeling brain activity as a graph. The methods presented in this post can also be applied to time series data in various fields such as engineering, healthcare, and finance. The input data for GNN graph classification models is a set of small labeled graphs, where each graph represents a group of nodes corresponding to time series and edges representing some measure of similarity or correlation between them.
EEG tools for studying human behavior are well described in Bryn Farnsworth's blog "EEG (Electroencephalography): The Complete Pocket Guide". There are several reasons why EEG is an exceptional tool for studying neurocognitive processes:
The study will use the same approach as the one described above, where EEG signal data is modeled as a graph to represent brain activity. The nodes in the graph will represent brain regions or electrode locations, and edges will represent functional or structural connections between them. The raw data for the experiments will come from the kaggle.com EEG dataset 'EEG-Alcohol', which was part of a large study on EEG correlates of genetic predisposition to alcoholism.
The study aims to use GNN graph classification models to predict alcoholism, where a single graph corresponds to one brain's reaction to a trial. Time series graphs will be created for each trial using electrode positions as nodes, EEG channel signals as node features, and graph edges as pairs of vectors with cosine similarities above a certain threshold. The EEG graph classification models will be used to determine whether a person is from the alcoholic or control group based on their trial reactions, which can potentially help in early detection and treatment of alcoholism.
Electroencephalography (EEG) signals are complex and require extensive training and advanced signal processing techniques for proper interpretation. Deep learning has shown promise in making sense of EEG signals by learning feature representations from raw data. In the meta-data analysis paper "Deep learning-based electroencephalography analysis: a systematic review" the authors conduct a meta-analysis of EEG deep learning and compare it to traditional EEG processing methods to determine which deep learning approaches work well for EEG data analysis and which do not.
In a previous study, EEG channel data was transformed into graphs based on pairwise cosine similarities. These graphs were analyzed using connected components and visualization techniques. Traditional graph mining methods were used to find explicit EEG channel patterns by transforming time series into vectors, constructing graphs based on cosine similarity, and identifying patterns using connected components.
In this section we will describe data processing and model training methods in the following order:
For cosine similarities we used the following functions:
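The original functions are not reproduced here; the following is a minimal numpy sketch of what such cosine similarity helpers typically look like (function names are assumptions).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two 1-D vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def cosine_similarity_matrix(X):
    """Pairwise cosine similarities between the rows of X."""
    X = np.asarray(X, dtype=float)
    X_normed = X / np.linalg.norm(X, axis=1, keepdims=True)
    return X_normed @ X_normed.T
```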
Next, for each brain-trial we will calculate cosine similarity matrices and transform them into graphs by taking only vector pairs with cosine similarities higher than a threshold.
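The matrix-to-graph step can be sketched as follows: keep only the upper-triangle pairs above the threshold and treat each surviving pair as an undirected edge. The threshold value here is illustrative, not the one used in the study.

```python
import numpy as np

def similarity_graph_edges(sim_matrix, threshold=0.85):
    """Turn a cosine similarity matrix into an undirected edge list,
    keeping only vector pairs whose similarity exceeds the threshold."""
    sim_matrix = np.asarray(sim_matrix)
    n = sim_matrix.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if sim_matrix[i, j] > threshold]
```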
For each brain-trial graph we will add a virtual node to transform disconnected graphs into single connected components. This makes it easier for GNN graph classification models to process and analyze the relationships between nodes.
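The virtual-node trick amounts to one extra node wired to every real node; a small sketch under the assumption that nodes are indexed 0..num_nodes-1:

```python
def add_virtual_node(edges, num_nodes):
    """Connect a new virtual node to every real node so that a
    possibly disconnected graph becomes one connected component."""
    virtual = num_nodes  # the virtual node takes the next free index
    return list(edges) + [(virtual, i) for i in range(num_nodes)]
```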
The GNN graph classification model is designed to process input graph data, including both the edges and node features, and is trained on graph-level labels. In this case, the input data structure consists of the following components:
This study uses a GCNConv model from PyTorch Geometric Library as a GNN graph classification model. The GCNConv model is a type of graph convolutional network that applies convolutional operations to extract meaningful features from the input graph data (edges, node features, and the graph-level labels). The code for the model is taken from a PyG tutorial.
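To make the propagation rule behind GCNConv concrete, here is a numpy sketch of a single graph convolution step followed by a mean-pool readout for graph-level classification. This is an illustration of the rule GCNConv applies, not the PyG implementation itself; it assumes the graph has self-loops added, so no degree is zero.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step: H' = ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])            # add self-loops
    d = A_hat.sum(axis=1)                     # node degrees (all >= 1)
    D_inv_sqrt = np.diag(d ** -0.5)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt  # symmetric normalization
    return np.maximum(A_norm @ H @ W, 0.0)    # ReLU

def readout(H):
    """Graph-level readout: mean-pool the node embeddings."""
    return H.mean(axis=0)
```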
Calculate EEG positions:
Define 61 groups for small graphs:
Calculate cosine similarity matrices by brain-trial groups:
Most of the graph classification model results with low confidence are also related to "single stimulus" patterns:
This corresponds with the results of our previous study about EEG signal classification: trials with "single stimulus" patterns had lower confidence on CNN time series classification compared to trials with "two stimuli, matched" and "two stimuli, non-matched" patterns. More interestingly, graph visualization examples from that study show that trials with "single stimulus" patterns have much lower differences between persons from the Alcoholic and Control groups than trials with "two stimuli, matched" and "two stimuli, non-matched" patterns. This suggests that "single stimulus" trials are not sufficient for accurately distinguishing between the two groups.
This study highlights GNN graph classification models as powerful tools for analyzing and modeling the complex relationships and dependencies in data represented as graphs. They enable uncovering hidden patterns, making more accurate predictions, and improving the understanding of the Earth's climate.
For several years deep learning and knowledge graphs were growing in parallel with a gap between them. This gap made it challenging to apply deep learning to graph-structured data and to leverage the strengths of both approaches. In the late 2010s, Graph Neural Networks (GNNs) emerged as a powerful tool for processing graph-structured data and bridged that gap.
(Picture from the book: Bronstein, M., Bruna, J., Cohen, T., and Veličković, P., "Geometric deep learning: Grids, groups, graphs, geodesics, and gauges".)

CNN and GNN models have a lot in common: both are realizations of Geometric Deep Learning. But GNN models are designed specifically for graph-structured data and can leverage the geometric relationships between nodes and combine node features with graph topology. GNN models are powerful tools for analyzing and modeling the complex relationships and dependencies in data, enabling us to uncover hidden patterns and make more accurate predictions.
In this post we will investigate how GNN graph classification models can be used to detect abnormal climate change patterns. For the experiments of this study we will use climate data from the kaggle.com dataset "Temperature History of 1000 cities 1980 to 2020": average daily temperature data for the years 1980 - 2019 for the 1000 most populous cities in the world.
To track long-term climate trends and patterns we will start by estimating cosine similarities between average daily temperatures of consecutive years. For each city weather station we will calculate a sequence of cosines between daily temperature vectors of consecutive years to identify changes in temperature patterns over time. This can be used to understand the effects of climate change and natural variability in weather patterns. Average values of these sequences will show the effect of climate change on temperature over time. By tracking these average values, we can identify trends and changes in the temperature patterns and determine how they are related to climate change. A decrease in the average cosine similarity between consecutive years can indicate an increase in the variance of daily temperature patterns, which could be a sign of climate change. On the other hand, an increase in average cosine similarity could indicate a more stable climate with less variance in daily temperature patterns.
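The per-city statistic described above can be sketched in a few lines: given one daily-temperature vector per year, average the cosines of consecutive-year pairs. Function names are illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_consecutive_cosine(yearly_temps):
    """Average cosine similarity between daily-temperature vectors of
    consecutive years: one number per city summarizing stability."""
    sims = [cosine(yearly_temps[y], yearly_temps[y + 1])
            for y in range(len(yearly_temps) - 1)]
    return sum(sims) / len(sims)
```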
To better understand the effects of climate change over a longer period of time we will calculate cosine similarity matrices between daily temperature vectors for non-consecutive years. Then, by taking vector pairs with a cosine similarity higher than a threshold, we will transform the cosine matrices into graph adjacency matrices. These adjacency matrices will represent city graphs that will be used as input to a graph classification model.
If a city graph produced from the cosine similarity matrix shows a high degree of connectivity, it could indicate that the climate patterns in that location are relatively stable over time (Fig. 1), while a city graph with a low degree of connectivity may suggest that the climate patterns in that location are more unstable or unpredictable (Fig. 2).
Graph 1: Stable climate in Malaga, Spain, represented as a graph with a high degree of connectivity. Graph 2: A graph with a low degree of connectivity for Orenburg, Russia, shows that the climate patterns in that location are unstable and unpredictable.

City graphs will be used as input to a GNN graph classification model that will label graphs as stable or unstable to understand how temperature patterns change over time.
In this post we will demonstrate the following:
In practice GNN graph classification is mostly used for drug discovery and protein function prediction. It can be applied to other areas where data can be represented as graphs with graph-level labels.
For cosine similarities we used the following functions:
Values of average cosines between consecutive years will be used as graph labels for the GNN graph classification model.

Next, for each city we will calculate cosine similarity matrices and transform them into graphs by taking only vector pairs with cosine similarities higher than a threshold.
For each graph we will add a virtual node to transform disconnected graphs into single connected components. This makes it easier for graph classification models to process and analyze the relationships between nodes. On the graph visualization pictures in Graph 1 and Graph 2 virtual nodes are represented by number 40 and the nodes for the other years by numbers from 0 to 39.
As a GNN graph classification model we used a GCNConv (Graph Convolutional Network) model from a tutorial of the PyTorch Geometric Library (PyG).
The GCNConv model is a type of graph convolutional network that uses convolution operations to aggregate information from neighboring nodes in a graph. The model is trained on the input graph data, including the edges, node features, and graph-level labels, and is based on the following input data structure:
This dataset has average daily temperatures in degrees Celsius from January 1, 1980 to September 30, 2020 for the 1000 most populous cities in the world.
Very high average cosine similarities indicate stable climate with less variance in daily temperature patterns.
Average cosines between consecutive years were used as graph labels for GNN graph classification. The set of graphs was divided in half and marked with stable and unstable labels:
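The median split described above can be sketched as follows; the labeling convention (1 = stable, 0 = unstable) is an assumption for illustration.

```python
import numpy as np

def stability_labels(avg_cosines):
    """Split the set of graphs in half at the median average cosine:
    label 1 = stable (above the median), 0 = unstable."""
    median = np.median(avg_cosines)
    return [1 if c > median else 0 for c in avg_cosines]
```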
Join scores and labels to the dataset:
Split data into metadata and values:
The following code prepares input data for the GNN graph classification model:
In the output of the graph classification model we have 36 outliers where the model's predictions do not equal the input labels.
Here is detailed information about these outliers. The goal of this study is to identify whether a given graph represents a stable or an unstable climate pattern based on the temperature data of the corresponding city; the GNN graph classification model was used to learn the relationships between the nodes within graphs and to make predictions about the stability of temperature patterns over time. The output of the model is class labels, stable or unstable, indicating the stability of the temperature patterns by graph location. Based on our observations of average cosines for consecutive years, cities close to the equator have very high cosine similarity values, which indicates that their temperature patterns are stable and consistent over time. By contrast, cities located at higher latitudes may experience more variability in temperature patterns, making them less stable. These observations correspond with the GNN graph classification model results: most graphs for cities at lower latitudes are classified as stable and graphs of cities at higher latitudes as unstable. However, the model also captures outliers: some cities at higher latitudes have stable temperature patterns and some cities at lower latitudes have unstable ones. In the table below you can see outliers where the model's predictions do not match the actual temperature stability of these cities. The higher-latitude European cities among them correspond with the results of our previous climate time series study, which showed that cities located near the Mediterranean Sea had high similarity to a smooth line, indicating stable and consistent temperature patterns.
In one of the climate analysis scenarios we found that most cities with high similarity to a smooth line are located on the Mediterranean Sea not far from each other. Here is a clockwise city list: Marseille (France), Nice (France), Monaco (Monaco), Genoa (Italy), Rome (Italy), Naples (Italy), and Salerno (Italy). In the next table you can see city outliers with the highest outlier probabilities, and in the table after that, outliers with probabilities close to the classification boundary.

In our previous post 'Rewiring Knowledge Graphs by Link Predictions' we showed how to rewire a knowledge graph through GNN Link Prediction models. In this post we will continue the discussion of applications of GNN Link Prediction techniques to rewiring knowledge graphs.
The goal of this post is the same as the goal of the previous post: we want to find unknown relationships between modern art artists. We will continue exploring text data from Wikipedia articles about the same 20 modern art artists as in the previous post, but we will use a different approach to building the initial knowledge graph: instead of building it on artist names and the full text of the corresponding Wikipedia articles, we will build it on co-located word pairs.
On the nodes and edges described above we will build an initial knowledge graph.
As a method of text-to-vector translation we will use the 'all-MiniLM-L6-v2' transformer model from Hugging Face. This is a sentence-transformers model that maps text to a 384-dimensional vector space.
As a Graph Neural Networks link prediction model we will use a GraphSAGE link prediction model from the Deep Graph Library (DGL). The model is built on two GraphSAGE layers and computes node representations by averaging neighbor information. The code for this model is provided by the DGL tutorial 'Link Prediction using Graph Neural Networks'.
The results of this model are embedded nodes that can be used for further analysis such as node classification, k-means clustering, link prediction, and so on. In this particular post we will calculate average vectors by artist and estimate link predictions by cosine similarities between them.
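The averaging-and-scoring step can be sketched as below: group node embeddings by artist, average each group, and take the cosine of the two averages as the link score. Data layout and function names are illustrative assumptions.

```python
import numpy as np

def artist_link_score(node_embeddings, node_artists, artist_a, artist_b):
    """Average the embedded nodes per artist, then score the potential
    link by the cosine similarity of the two average vectors."""
    emb = np.asarray(node_embeddings, dtype=float)

    def mean_vec(artist):
        rows = [i for i, a in enumerate(node_artists) if a == artist]
        return emb[rows].mean(axis=0)

    u, v = mean_vec(artist_a), mean_vec(artist_b)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
```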
Cosine Similarities function:
To estimate the size distribution of Wikipedia text data we tokenized the text and exploded the tokens:
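A plain-Python stand-in for the tokenize-and-explode step (the original used a different stack, so the tokenizer and names here are assumptions):

```python
import re

def word_counts(articles):
    """Token counts per artist from a dict of artist -> article text."""
    return {artist: len(re.findall(r"[a-z']+", text.lower()))
            for artist, text in articles.items()}

def most_covered(articles):
    """The artist with the largest article: a rough proxy for fame."""
    counts = word_counts(articles)
    return max(counts, key=counts.get)
```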
Based on the Wikipedia text size distribution, the best known artist in our artist list is Vincent van Gogh and the least known artist is Franz Marc:
As a Graph Neural Networks (GNN) link prediction model we used a model from the Deep Graph Library (DGL). The model code was provided by a DGL tutorial and we only had to transform our nodes and edges data into the DGL data format.
Read embedded nodes and edges from Google Drive:
Convert data to DGL format and add self-loop edges:
We used the model with the following parameters:

In our previous post 'Knowledge Graph for Data Mining' we discussed knowledge graph building and mining techniques. These techniques were presented in 2020 at the DEXA conference 'Machine Learning and Knowledge Graphs' workshop and published as the paper 'Building Knowledge Graph in Spark without SPARQL'.
The goal of that study was to demonstrate that the knowledge graph area is much wider than the traditional semantic web SPARQL approach and that there are non-traditional ways to build and explore knowledge graphs. In that study we demonstrated how knowledge graph techniques can be accomplished with the Spark GraphFrames library. In this study we will show other techniques that can be applied to creating and rewiring knowledge graphs. We will explore building knowledge graphs based on Wikipedia data and a Graph Neural Networks (GNN) link prediction model. To compare the results of this study with the results of our previous study we will use data about the same list of modern art artists.
As a method of text-to-vector translation we used the 'all-MiniLM-L6-v2' model from Hugging Face. This is a sentence-transformers model that maps text to a 384-dimensional dense vector space.
There are two advantages of embedding text nodes:
As a Graph Neural Networks link prediction model we used a model from the Deep Graph Library (DGL). The model is built on two GraphSAGE layers and computes node representations by averaging neighbor information. We used the code provided by the DGL tutorial 'Link Prediction using Graph Neural Networks'.
The results of this code are embedded nodes that can be used for further analysis such as node classification, k-means clustering, link prediction and so on. In this study we used it for link prediction by estimating cosine similarities between embedded nodes.
Cosine Similarities function:
To estimate the size distribution of Wikipedia text data we tokenized the text and exploded the tokens:
Here is the distribution of the number of words in Wikipedia articles related to the artists:
Based on the Wikipedia text size distribution, the best known artist in our artist list is Vincent van Gogh and the least known artist is Franz Marc.
As a Graph Neural Networks (GNN) link prediction model we used a model from the Deep Graph Library (DGL). The model code was provided by a DGL tutorial and we only had to transform our nodes and edges data into the DGL data format.
Read embedded nodes and edges from Google Drive:
Convert data to DGL format:
Define the model, loss function, and evaluation metric. To estimate the results we calculated the accuracy metric as Area Under Curve (AUC). The model accuracy was about 90 percent.

List of Artist Names:
Join edge cosine similarity scores with artist names by indexes:
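A hypothetical sketch of this join plus the DOT formatting used for Gephi in the next step; the tuple layout and function names are assumptions, not the post's actual code.

```python
def dot_lines(edge_scores, artist_names):
    """Join edge cosine scores with artist names by index and format
    each pair as a DOT edge line, e.g. "Pablo Picasso" -> "Georges Braque";
    `edge_scores` is a list of (src_index, dst_index, score) tuples."""
    return ['"{}" -> "{}";'.format(artist_names[s], artist_names[d])
            for s, d, score in edge_scores]
```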
For graph visualization in the Gephi tool we added a 'line' column with artist pairs in the DOT language. The following examples show graphs of artists with high cosine similarities and low cosine similarities. Pairs of artists with high cosine similarities, higher than 0.6. Example 1: artist pairs with cosine similarities > 0.6. Example 2: artist pairs with cosine similarities > 0.7. Pairs of artists with low cosine similarities, less than -0.5. Example 3: artist pairs with cosine similarities < -0.5.

One of the problems related to word pair dissimilarities is called 'free associations' in psychology. This is a psychoanalysis method that is used to get into unconscious processes. In this study we will show how to find unexpected free associations by symmetry metrics.
We introduced a nontraditional vector similarity measure, symmetry metrics, in our previous post "Symmetry Metrics for High Dimensional Vector Similarity". These metrics are based on transforming pairwise vectors to GAF images and classifying the images through CNN image classification. In this post we will demonstrate how to use symmetry metrics to find dissimilar word pairs.
Free association is a psychoanalytic technique that was developed by Sigmund Freud and is still used by some therapists today. Patients relate whatever thoughts come to mind in order for the therapist to learn more about how the patient thinks and feels. As Freud described it: "The importance of free association is that the patients spoke for themselves, rather than repeating the ideas of the analyst; they work through their own material, rather than parroting another's suggestions."
In our posts to detect semantically similar or dissimilar word pairs we experimented with data about Psychoanalysis taken from Wikipedia and used different techniques that all start with the following steps:
In our post "Word2Vec2Graph - Psychoanalysis Topics" we showed how to find free associations using Word2Vec2Graph techniques. For vector similarity measures we used cosine similarities. To create the Word2Vec2Graph model we selected pairs of words located next to each other in the document and built a directed graph on word pairs with words as nodes, word pairs as edges, and vector cosine similarities as edge weights. This method was published in 2021: "SEMANTICS GRAPH MINING FOR TOPIC DISCOVERY AND WORD ASSOCIATIONS".
In another post, "Free Associations", we demonstrated a different method: word pair similarity based on unsupervised Convolutional Neural Network image classification. We joined word vector pairs reversing the right vectors, transformed the joint vectors to GAF images, and classified them as 'similar' or 'different'.
In this post we will show how to predict word similarity measures using a novel technique - symmetry metrics.
To distinguish between similar and dissimilar vector pairs this model classifies data into 'same' and 'different' classes. Training data for the 'same' class consists of self-reflected, mirror vectors, and for the 'different' class of non-equal pairs. Visually, mirror vectors are represented as symmetric images and 'different' pairwise vectors as asymmetric images. The similarity metric is defined as the probability of pairwise vectors getting into the 'same' class.
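The construction of the two training classes can be sketched as vector concatenations (function names are assumptions):

```python
import numpy as np

def mirror_pair(v):
    """'Same'-class training example: a vector joined with its own
    reversed copy; it plots as a symmetric image."""
    v = np.asarray(v)
    return np.concatenate([v, v[::-1]])

def joint_pair(u, v):
    """'Different'-class example: the left vector joined with the
    reversed right vector; asymmetric unless the vectors are equal."""
    return np.concatenate([np.asarray(u), np.asarray(v)[::-1]])
```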
In this post we will show how to apply a trained unsupervised GAF image classification model to find vector similarities for entities taken from a different domain. We will experiment with a model trained on daily temperature time series data and apply it to word pair similarities.
In this post we will use this method for one-way related pairs of words that are located next to each other in the document. We will generate pairwise word vectors for left and right words, transform the joint vectors to GAF images, and run these images through the trained model to predict word similarities through symmetry metrics.
As a method of vector-to-image translation we used the Gramian Angular Field (GAF), a polar-coordinate transformation based technique. We learned this technique in the fast.ai 'Practical Deep Learning for Coders' class and the fast.ai forum 'Time series/sequential data' study group. This method is well described by Ignacio Oguiza in the fast.ai forum post 'Time series classification: General Transfer Learning with Convolutional Neural Networks'.
To describe vector to GAF image translation Ignacio Oguiza referenced the paper 'Encoding Time Series as Images for Visual Inspection and Classification Using Tiled Convolutional Neural Networks'.
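A minimal numpy sketch of the Gramian Angular Summation Field transform described in that paper: rescale the series to [-1, 1], map each value to a polar angle, and build the pairwise cosine-sum matrix.

```python
import numpy as np

def gaf_image(x):
    """Gramian Angular Summation Field of a 1-D series.

    Rescale to [-1, 1], take phi = arccos(x), and return the matrix
    cos(phi_i + phi_j); mirror-symmetric inputs yield symmetric images.
    """
    x = np.asarray(x, dtype=float)
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min()) - 1
    phi = np.arccos(np.clip(x_scaled, -1, 1))
    return np.cos(phi[:, None] + phi[None, :])
```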
For model training we used fast.ai CNN transfer learning image classification. To deal with a comparatively small set of training data, instead of training the model from scratch we followed ResNet-50 transfer learning: we loaded the results of a model trained on images from the ImageNet database and fine-tuned it with the data of interest. Python code for transforming vectors to GAF images and fine-tuning ResNet-50 is described in the fast.ai forum.
Here are examples that show that self-reflected vectors are represented as symmetric plots and GAF images and semantically different joint word vectors are represented as asymmetric plots and GAF images.