Connect various datasets via Spark Knowledge Graph
Posted by Melenar on February 2, 2020
Data Integration
In this post we will show how to use a knowledge graph to integrate data of different types from multiple data sources.
We will integrate the following datasets:
The Museum of Modern Art Collection data has information about artists, their biographies and their paintings. In two previous posts based on this data we demonstrated how to use a knowledge graph for data mining and semantics.
In the 'Knowledge Graph for Data Mining' post we created an Artist Biography knowledge graph connecting artist names with their nationalities, the countries where they were born, their genders and their life years.
This knowledge graph allowed us to find different groups of artists, for example artists that were born in the same country or artists that changed their nationalities.
We will start data integration from the Artist Biography knowledge graph nodes and edges:
This graph is artist name centric:
Edge types in the Artist Biography knowledge graph are the following:
To define edge destination types we will show 3 examples of destinations for all edge types:
Ontology graph edges:
Ontology graph:
Ontology of Artist Biography knowledge graph:
"Inventing Abstraction 1910-1925" MoMA Exhibition Data
As the next dataset we will use data from the MoMA exhibition "Inventing Abstraction 1910-1925", which presented many abstract artists.
The following artists from our Artist Biography knowledge graph were presented at that exhibition:
Vasily Kandinsky
Franz Marc
Kazimir Malevich
Natalia Goncharova
Piet Mondrian
Paul Klee
Pablo Picasso
The MoMA website has a lot of interesting information about this exhibition. In particular, this exhibition's Artist Connections graph illustrated productive relationships between artists. From this network we took the pairwise relationships between artists that are in the Artist Biography knowledge graph:
Connections between artist pairs:
Knowledge graph edges:
Artist pairs knowledge graph:
Ontology:
Ontology of connections between artists:
Integrate Artist Biography knowledge graph with artists' relationships
Metadata integration
Ontology of integrated graph:
Data integration
There are two ways to integrate the Artist Biography knowledge graph with the Artists Pairs knowledge graph:
Add Artists Pairs edges to Artist Biography edges
Overlap Artists Pairs nodes with Artist Biography nodes
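As a rough illustration of the two options, here is a minimal plain-Python sketch that stands in for the Spark DataFrame operations. Edges are `(src, dst, edge_type)` triples. The edge type names (`bornIn`, `artistPair`), the sample edges, and the reading of "overlap" as "keep only biography edges of artists that are nodes in the pairs graph" are all assumptions made for this example, not the actual data or code.

```python
# Edges as (src, dst, edge_type) triples. All values are illustrative.
bio_edges = [
    ("Vasily Kandinsky", "Russia", "bornIn"),
    ("Franz Marc", "Germany", "bornIn"),
    ("Pablo Picasso", "Spain", "bornIn"),
]
pair_edges = [
    # Placeholder pair, not taken from the MoMA Artist Connections network.
    ("Vasily Kandinsky", "Franz Marc", "artistPair"),
]

# Way 1: add Artists Pairs edges to Artist Biography edges (a plain union).
added_edges = bio_edges + pair_edges

# Way 2 (assumed interpretation): keep only biography edges whose source
# artist is also a node of the Artists Pairs graph, then add the pair edges.
pair_nodes = {name for src, dst, _ in pair_edges for name in (src, dst)}
overlap_edges = [e for e in bio_edges if e[0] in pair_nodes] + pair_edges

print(len(added_edges), len(overlap_edges))
```

The first option keeps every artist, connected or not; the second yields a smaller graph that contains only artists with at least one relationship.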
Add Artists Pairs edges to Artist Biography edges
Overlap Artists Pairs nodes with Artist Biography nodes
Modern Art Movements Timeline Data
Our next dataset is a timeline of the Modern Art Movements:
1872 – 1892
Impressionism
Summary: Masters of color and light. Marked a radical departure from the realistic academic painting that had dominated the eras prior.
"Key artists: Claude Monet, Pierre-Auguste Renoir, Camille Pissarro, Edgar Degas, Edouard Manet, Mary Cassatt"
"""When you go out to paint try to forget what object you have before you - a tree, a house, a field or whatever. Merely think, here is a little square of blue, here an oblong of pink, here a streak of yellow, and paint it just as it looks to you, the exact color and shape, until it emerges as your own naive impression of the scene before you."" Claude Monet"
Early 1880s - 1914
Post-Impressionism
"Summary: Emphasis on symbolic content and the artist's interpretation of the world. Post-impressionism shared many of the characteristics of Impressionism such as the use of vivid colors, expressive brushwork and everyday subjects. But there seemed to be a focus on distorted forms, geometric shapes and unnaturalistic colors to depict emotions and feelings. Artists often used the pointillism technique, which involved placing small dabs of distinct color."
"Key artists: Paul Cézanne (the ""father of Post-Impressionism""), Vincent Van Gogh, Paul Gauguin, Georges-Pierre Seurat, Paul Signac"
"""I dream of painting and then I paint my dream."" Vincent van Gogh"
1907 – 1922
Cubism
"Summary: Focused on abstraction and geometric shapes, rather than space, perspective and realistic rendering."
"Key artists: Pablo Picasso, Georges Braque, Juan Gris, Fernand Léger"
"""Cubism is not a reality you can take in your hand. It's more like a perfume, in front of you, behind you, to the sides, the scent is everywhere but you don't quite know where it comes from."" Pablo Picasso"
1924-1966
Surrealism
"Summary: Depicted dreams, fantasies and the unconscious state. Often incorporated the juxtaposition of incompatible elements."
"Key artists: Joan Miró, Salvador Dalí, René Magritte, André Breton, Yves Tanguy, Frida Kahlo, Max Ernst, Méret Oppenheim"
"""Surrealism is destructive, but it destroys only what it considers to be shackles limiting our vision."" Salvador Dalí"
To convert this data to a DataFrame we will do the following:
Index the data by zipping the text lines with the range from 0 to 84
Transform to a DataFrame (index, line)
Calculate the remainder index % 5 and add it as a column "rem"
Calculate the integer quotient index / 5 and add it as a column "div"
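The indexing steps can be sketched in plain Python as follows. This is only a stand-in for the Spark DataFrame version: the list of dicts plays the role of the DataFrame, and only two movements (10 lines) are used instead of the full 85 lines. The Cubism quote line is replaced by a placeholder to keep the example short.

```python
# Each art movement occupies 5 consecutive lines:
# period, name, summary, key artists, quote.
lines = [
    "1907 – 1922",
    "Cubism",
    "Summary: Focused on abstraction and geometric shapes, rather than space, perspective and realistic rendering.",
    "Key artists: Pablo Picasso, Georges Braque, Juan Gris, Fernand Léger",
    "(quote line omitted in this sketch)",
    "1924-1966",
    "Surrealism",
    "Summary: Depicted dreams, fantasies and the unconscious state. Often incorporated the juxtaposition of incompatible elements.",
    "Key artists: Joan Miró, Salvador Dalí, René Magritte, André Breton, Yves Tanguy, Frida Kahlo, Max Ernst, Méret Oppenheim",
    '"Surrealism is destructive, but it destroys only what it considers to be shackles limiting our vision." Salvador Dalí',
]

# Zip the lines with an index range, then derive the two helper columns:
# "rem" is the position of the line inside its record (0..4),
# "div" is the record number, the same for all 5 lines of one movement.
rows = [
    {"index": i, "line": line, "rem": i % 5, "div": i // 5}
    for i, line in zip(range(len(lines)), lines)
]

for row in rows:
    print(row["index"], row["rem"], row["div"], row["line"][:30])
```

With these two columns, "div" identifies the movement and "rem" identifies which field of the movement a line holds.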
Next we will combine the lines of each art movement into a single row by self-joining the table several times on "div":
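A minimal plain-Python sketch of that self-join, assuming the five-lines-per-movement layout described above: the table is filtered once per remainder value, and the five filtered tables are joined on "div". The output field names (`period`, `movement`, `summary`, `keyArtists`, `quote`) are chosen for this example; only `keyArtists` is named in the post. Line contents are shortened.

```python
lines = [
    "1907 – 1922", "Cubism", "Summary: Focused on abstraction and geometric shapes.",
    "Key artists: Pablo Picasso, Georges Braque, Juan Gris, Fernand Léger",
    "(quote line omitted in this sketch)",
    "1924-1966", "Surrealism", "Summary: Depicted dreams, fantasies and the unconscious state.",
    "Key artists: Joan Miró, Salvador Dalí, René Magritte",
    "(quote line omitted in this sketch)",
]

# (div, rem, line) rows, as produced by the indexing step.
rows = [(i // 5, i % 5, line) for i, line in enumerate(lines)]

def select(rem):
    """Filter rows by remainder and key them by 'div', the join column."""
    return {div: line for div, r, line in rows if r == rem}

# One filtered copy of the table per field of the record.
period, name, summary, key_artists, quote = (select(r) for r in range(5))

# Join the five copies on 'div' to get one row per art movement.
movements = [
    {
        "period": period[d],
        "movement": name[d],
        "summary": summary[d],
        "keyArtists": key_artists[d],
        "quote": quote[d],
    }
    for d in sorted(period)
]

for m in movements:
    print(m["movement"], "|", m["period"])
```

In Spark the same reshaping could also be expressed as a group-by on "div" with a pivot on "rem", which avoids the repeated joins; the self-join version is shown here because it is the approach the post describes.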
Then we will split the "keyArtists" column into individual key artist names:
From the Modern Art Movements key artists we will keep the artists that are also in our Artist Biography knowledge graph:
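Splitting the key artists string could look like the sketch below. The helper name `split_key_artists` is hypothetical, and dropping parenthetical notes such as (the "father of Post-Impressionism") is an assumption about how the names should be cleaned.

```python
import re

def split_key_artists(key_artists):
    """Turn a 'Key artists: A, B (note), C' string into a list of names."""
    # Drop the 'Key artists:' label.
    text = key_artists.split(":", 1)[1]
    # Remove parenthetical notes so they do not end up inside a name.
    text = re.sub(r"\([^)]*\)", "", text)
    return [name.strip() for name in text.split(",") if name.strip()]

cubism = "Key artists: Pablo Picasso, Georges Braque, Juan Gris, Fernand Léger"
post_impressionism = (
    'Key artists: Paul Cézanne (the "father of Post-Impressionism"), '
    "Vincent Van Gogh, Paul Gauguin, Georges-Pierre Seurat, Paul Signac"
)

# One (movement, artist) row per key artist.
movement_artists = [
    (movement, artist)
    for movement, text in [("Cubism", cubism), ("Post-Impressionism", post_impressionism)]
    for artist in split_key_artists(text)
]

for row in movement_artists:
    print(row)
```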
We will build Modern Art Movement knowledge graph based on ontology:
Modern Art Movement knowledge graph edges:
Modern Art Movement knowledge graph:
Artist Pairs within Modern Art Movement Key Artists list
Ontology of integrated graph:
Data integration
To add artist pair relationship information to the modern art movement information, we will combine the edges of the Modern Art Movement and Artist Pairs knowledge graphs:
Add Artist Biography info to Modern Art Movement key artists
Ontology of integrated graph:
Data integration
From the Artist Biography knowledge graph we will take information about the nationalities and countries of birth of modern art key artists.
We combine the edges of the Artist Biography knowledge graph with the edges of the Modern Art Movement knowledge graph and exclude edges related to genders and dates:
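A small plain-Python sketch of this combination is shown below. The edge type names (`nationality`, `bornIn`, `gender`, `bornYear`, `keyArtist`, `period`) are hypothetical labels chosen for the example, and the sample contains only a few illustrative edges.

```python
# Edges as (src, dst, edge_type) triples; edge type names are assumptions.
bio_edges = [
    ("Pablo Picasso", "Spanish", "nationality"),
    ("Pablo Picasso", "Spain", "bornIn"),
    ("Pablo Picasso", "Male", "gender"),
    ("Pablo Picasso", "1881", "bornYear"),
    ("Vasily Kandinsky", "Russia", "bornIn"),  # not a key artist in this toy sample
]
movement_edges = [
    ("Pablo Picasso", "Cubism", "keyArtist"),
    ("Cubism", "1907 – 1922", "period"),
]

# Key artists are the sources of the artist -> movement edges.
key_artists = {src for src, _, kind in movement_edges if kind == "keyArtist"}

# Biography edge types left out of the integrated graph.
excluded_types = {"gender", "bornYear", "diedYear"}

# Keep biography edges of key artists only, minus gender and date edges,
# then add all Modern Art Movement edges.
combined_edges = [
    e for e in bio_edges
    if e[0] in key_artists and e[2] not in excluded_types
] + movement_edges

for edge in combined_edges:
    print(edge)
```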
Biographies of Modern Art Movement key artists:
Integrate All Three Knowledge Graphs
Metadata integration
Ontology of integrated graph:
Data Integration
First, we will get the list of artist names from each graph and then calculate the overlap of these lists.
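The overlap calculation amounts to a set intersection. The sketch below uses plain Python sets and a handful of toy edges; the edge type names and the pairs are placeholders, so the result here says nothing about which artists overlap in the real data.

```python
# Toy edges as (src, dst, edge_type); all values are illustrative.
bio_edges = [
    ("Pablo Picasso", "Spain", "bornIn"),
    ("Vasily Kandinsky", "Russia", "bornIn"),
    ("Franz Marc", "Germany", "bornIn"),
]
pair_edges = [
    ("Vasily Kandinsky", "Franz Marc", "artistPair"),   # placeholder pair
    ("Pablo Picasso", "Vasily Kandinsky", "artistPair"),  # placeholder pair
]
movement_edges = [
    ("Pablo Picasso", "Cubism", "keyArtist"),
    ("Cubism", "1907 – 1922", "period"),
]

# Biography graph: artists are always the edge source.
bio_artists = {src for src, _, _ in bio_edges}
# Pairs graph: artists appear on both ends of an edge.
pair_artists = {name for src, dst, _ in pair_edges for name in (src, dst)}
# Movement graph: artists are the sources of artist -> movement edges only.
movement_artists = {src for src, _, kind in movement_edges if kind == "keyArtist"}

overlap = bio_artists & pair_artists & movement_artists
print(sorted(overlap))
```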
Artist names from Artist Biography knowledge graph:
Artist names from Artist Pairs knowledge graph:
Artist names from Modern Art Movement knowledge graph:
Only three artist names appear in all three knowledge graphs:
Next, we need to get edges from knowledge graphs related to overlapping artists.
Because the Artist Biography knowledge graph is artist-name centric, we get its edges for the combined graph by filtering on the edge 'src':
To get edges from the Artist Pairs knowledge graph we need to filter on both ends of the edge, 'src' and 'dst':
The Modern Art Movement knowledge graph has two types of edges: Artist -> Modern Art Movement and Modern Art Movement -> time period.
{Artist, Modern Art Movement} edges:
{Modern Art Movement, time period} edges:
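The difference between the two filters can be sketched in plain Python as follows. The `overlap` set, the edge type names and the sample edges are placeholders for illustration.

```python
# Artists assumed to be present in all three graphs (toy value).
overlap = {"Pablo Picasso", "Vasily Kandinsky"}

bio_edges = [
    ("Pablo Picasso", "Spain", "bornIn"),
    ("Vasily Kandinsky", "Russia", "bornIn"),
    ("Franz Marc", "Germany", "bornIn"),
]
pair_edges = [
    ("Pablo Picasso", "Vasily Kandinsky", "artistPair"),  # placeholder pair
    ("Vasily Kandinsky", "Franz Marc", "artistPair"),     # placeholder pair
]

# Biography edges always start at an artist, so filtering on 'src' is enough.
bio_overlap_edges = [e for e in bio_edges if e[0] in overlap]

# Pair edges have an artist on both ends, so both 'src' and 'dst' must be
# in the overlap; otherwise the edge would pull in an artist outside it.
pair_overlap_edges = [
    e for e in pair_edges if e[0] in overlap and e[1] in overlap
]

print(bio_overlap_edges)
print(pair_overlap_edges)
```

Filtering pair edges on 'src' alone would keep the Kandinsky -> Marc edge in this sample and bring Franz Marc back into the graph as a node.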
Edges of combined graph:
Build a combined graph:
Next Post - Paintings
In the next several posts we will continue looking at knowledge graphs as a more natural way to represent data.