Deep Learning on Graphs: Integration of DGL and Neo4j DBMS for Social Analysis
DataResolutions Research Division Spring 2022 — Research Head: Irsyad Adam
As machine learning progresses at an impressive rate, the subfield of graph machine learning is no exception. With foundational literature as recent as Kipf and Welling's 2017 graph convolutional network, one of the most cited publications in graph neural networks, and papers applying graph analysis techniques in a variety of STEM fields being pushed out monthly, graph neural networks are undoubtedly becoming a massive component of deep learning.
While there are many applications of graph analysis and graph neural networks, our team decided to focus on an application made popular by Wayne Zachary in 1977 with the famous Zachary's Karate Club network: social analysis.
For this project, we wanted to see if we could train a model to predict team assignments of DataRes members based on who they know in the organization, using Kipf and Welling's GCN (ICLR 2017).
Extension of GCN: A Dynamic Approach
Many graph neural network applications rely on a fixed edge list and node list, which are aggregated into a network representation when model training starts. In research this is quite efficient, since the same network datasets (CORA, PubMed Citation, Citeseer, etc.) are used to benchmark different models; but in other applications where the data is ever changing, recreating an edge list and a node list every time a dataset grows is quite tedious.
Thus, an extension was formed: building a GCN pipeline on top of a dynamic knowledge graph database, addressing the problems of scalable graph deep learning, graph visualization, and data management at once.
By integrating the DGL framework with the Neo4j DBMS, analysis of network data becomes more holistic, combining the visualization prowess, scalability, and data management of Neo4j with the end-to-end deep learning models of DGL.
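As a minimal sketch of what this integration can look like, the edge list can be pulled from Neo4j and materialized as a DGL graph. This assumes the official neo4j Python driver (v5), a local instance, and a hypothetical (:Person)-[:KNOWS]->(:Person) schema; the credentials and labels are placeholders:

```python
import dgl
import torch
from neo4j import GraphDatabase

# Hypothetical connection details and schema: (:Person)-[:KNOWS]->(:Person)
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def fetch_edges(tx):
    result = tx.run(
        "MATCH (a:Person)-[:KNOWS]->(b:Person) "
        "RETURN id(a) AS src, id(b) AS dst"
    )
    return [(record["src"], record["dst"]) for record in result]

with driver.session() as session:
    edges = session.execute_read(fetch_edges)

# Neo4j internal ids are not contiguous, so remap them to 0..N-1 for DGL
node_ids = sorted({n for edge in edges for n in edge})
index = {n: i for i, n in enumerate(node_ids)}
src = torch.tensor([index[s] for s, _ in edges])
dst = torch.tensor([index[d] for _, d in edges])

g = dgl.graph((src, dst), num_nodes=len(node_ids))
```

Because the graph is rebuilt from a live query rather than a static file, the deep learning side always sees the current state of the database.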
Exploratory Data Analysis
Below is the network of the DataResolutions organization, with people as nodes and directed edges connecting nodes if the source node knows the destination node. There are a total of 62 nodes and 336 edges. Each node has a respective name (try to find someone you know!) and a respective team assignment.
To begin graph machine learning, we first have to get a good idea of our data. To do this, we used topological graph algorithms to explore features of our data that are not explicitly defined. First, we used the PageRank algorithm (Page 1998) to explore which nodes (people in this case) are of higher importance.
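With Neo4j's Graph Data Science (GDS) library, PageRank can be run directly in Cypher. The sketch below assumes GDS is installed and that the network has been projected under the name 'datares' (both assumptions on our part), reusing the driver from the earlier snippet:

```python
# Stream PageRank scores from a hypothetical GDS projection named "datares"
pagerank_query = """
CALL gds.pageRank.stream('datares')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS name, score
ORDER BY score DESC
LIMIT 10
"""
with driver.session() as session:
    for record in session.run(pagerank_query):
        print(record["name"], round(record["score"], 3))
```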
The results from PageRank indicate which team has the most well-known members. In this case, people on the UCLA Athletics team are likely the most widely known, as UCLA Athletics does not switch members as often and stays a fixed team throughout the school year.
Next, we wanted to project our graph representation into Euclidean space. To do this, we used torch.nn.Embedding to assign a feature vector as metadata to each node. This way, each node has a vector representation, which we are able to visualize in Euclidean space instead of on the actual network itself.
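Concretely, this amounts to a learnable embedding table with one row per node, attached to the graph's node data (the 16-dimensional size here is illustrative):

```python
import torch.nn as nn

# One learnable 16-dimensional vector per node, stored as node metadata on g
embed = nn.Embedding(g.num_nodes(), 16)
g.ndata["feat"] = embed.weight
```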
Semi-Supervised Node Classification using GCN (ICLR 2017)
After successfully loading the graph into the Neo4j instance, model training can be initiated. Here, we use semi-supervised node classification to see if a model can be trained to predict the team assignments of DataRes members based on who they know in the club.
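For reference, the layer-wise propagation rule from Kipf and Welling (ICLR 2017) is:

$$H^{(l+1)} = \sigma\left(\tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)}\right), \quad \tilde{A} = A + I_N, \quad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$$

where $H^{(l)}$ is the matrix of node features at layer $l$, $W^{(l)}$ is a learnable weight matrix, and $\sigma$ is a nonlinearity such as ReLU.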
Above is the hidden layer of the GCN that is going to be used for node classification. Notice that it takes in the matrix of node features H and the adjacency matrix A, and updates each node's features by aggregating the features of its neighboring nodes. Because each node learns its features from its local neighborhood and from how it is connected, the GCN does not need the class assignments of every single node to train; instead, given as little as a single labelled node per class, it infers the labels of the remaining nodes from how the network is connected. This is the beauty of the GCN.
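A minimal two-layer version of this model in DGL might look as follows (layer sizes are illustrative):

```python
import torch.nn as nn
import torch.nn.functional as F
from dgl.nn import GraphConv

class GCN(nn.Module):
    """Two GraphConv layers: input features -> hidden -> class logits."""
    def __init__(self, in_feats, hidden_feats, num_classes):
        super().__init__()
        self.conv1 = GraphConv(in_feats, hidden_feats)
        self.conv2 = GraphConv(hidden_feats, num_classes)

    def forward(self, g, feat):
        h = F.relu(self.conv1(g, feat))
        return self.conv2(g, h)
```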
Below, we instantiated the node features using torch.nn.Embedding after choosing 3 nodes to label, and plotted the respective confusion matrix per epoch. The black nodes are the pre-labelled nodes; the rest are decided by the model.
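A sketch of the semi-supervised training loop, continuing the snippets above; the three labelled node indices and the hyperparameters are placeholders:

```python
# Only 3 seed nodes carry labels; the loss is computed on those nodes alone
labeled_idx = torch.tensor([0, 20, 40])   # placeholder node indices
seed_labels = torch.tensor([0, 1, 2])     # one seed label per class

model = GCN(in_feats=16, hidden_feats=16, num_classes=3)
params = list(model.parameters()) + list(embed.parameters())
optimizer = torch.optim.Adam(params, lr=0.01)

for epoch in range(100):
    logits = model(g, embed.weight)       # full-graph forward pass
    loss = F.cross_entropy(logits[labeled_idx], seed_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```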
Below is the GCN with Node2Vec embeddings; notice how the embeddings determine how the model classifies each node. Judging by the confusion matrix, Node2Vec yields better results for this particular network.
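One way to produce such embeddings (the post does not pin down an implementation, so this is an assumption on our part) is the third-party node2vec package, converting the DGL graph to NetworkX first:

```python
import numpy as np
from node2vec import Node2Vec  # third-party: pip install node2vec

# Random walks over the network, then skip-gram training on the walks
nx_g = dgl.to_networkx(g)
n2v = Node2Vec(nx_g, dimensions=16, walk_length=20, num_walks=100)
w2v = n2v.fit(window=10, min_count=1)

# Stack the learned vectors in node order and use them as fixed input features
n2v_feats = np.stack([w2v.wv[str(i)] for i in range(g.num_nodes())])
g.ndata["feat"] = torch.from_numpy(n2v_feats)
```

Unlike the randomly initialized torch embeddings, these vectors already encode neighborhood structure before GCN training begins.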
Conclusion
Notice how the GCN with Node2Vec embeddings achieves higher accuracy across all of the classes than the GCN trained with torch embeddings; this is consistent with the literature, where Node2Vec is regarded as a standard for graph embeddings.
Overall, this project was a success, as we were able to integrate a graph deep learning pipeline with a Neo4j DBMS for scalable deep learning and holistic network analysis.
The repository for this project can be found here.
We are very proud of this project and what we learned from it, and we look forward to exploring the cutting edge of deep learning at Research@DataRes in the fall.