Gaussian Embedding of Large-scale Attributed Graphs :- Research paper summary
Core Ideas :-
- Embed each node as a Gaussian distribution with a mean µ and a covariance matrix Σ
- The Gaussian distribution captures the uncertainty in the graph
- The mean µ plays the same role as the point-based representation of standard models (BERT etc.)
- Σ is learnt as a diagonal matrix to reduce the number of parameters (see the small sketch below)
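A minimal sketch of what a Gaussian node embedding looks like; the dimension and variable names here are illustrative, not taken from the paper:

```python
import numpy as np

# One node's Gaussian embedding: a mean vector plus a diagonal covariance.
d = 64                    # embedding dimension (arbitrary choice for illustration)
mu = np.zeros(d)          # point-like part of the representation (like a usual embedding)
sigma_diag = np.ones(d)   # per-dimension variance, i.e. the node's uncertainty
# The full covariance would be np.diag(sigma_diag); storing only the diagonal
# needs d parameters per node instead of d*d.
```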
Advantages :-
- Graphs constructed from real-world data can be very complex, noisy and imbalanced, so a mere point-based representation of the nodes may fail to capture the variability of the graph and hence miss some hidden patterns
- Most existing graph embedding approaches are transductive and cannot infer embeddings for nodes unseen at training time. In practice, however, graphs evolve with time, and new nodes and edges can be added. A few recent studies try to address this limitation, but they either do not scale up to large graphs or require additional information about the graph structure.
Architecture and some Equations :-
- Any encoder (MLP-based, transformer-based, CNN-based) can be used to map each node's attribute vector x_i to a hidden representation
- 1 is added to the ELU output (ELU(·) + 1) so that the covariance entries are strictly positive, since ELU alone can output negative values
- This gives us the mean µ_i and the diagonal covariance matrix Σ_i
- A single encoder is shared across all nodes
- Loss :- KL divergence between the Gaussians of connected nodes (a code sketch of the encoder and this KL energy follows this list)
where z_i and z_j are the Gaussian embeddings of connected nodes
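A minimal PyTorch-style sketch of the shared encoder and the KL-divergence energy between two diagonal Gaussians. The MLP layers, sizes and names are illustrative assumptions, not the paper's exact architecture; the closed-form KL is the standard formula for diagonal Gaussians:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianEncoder(nn.Module):
    """Shared encoder: node attributes x -> (mu, diagonal sigma)."""
    def __init__(self, in_dim, hid_dim, emb_dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())  # any encoder can go here
        self.f_mu = nn.Linear(hid_dim, emb_dim)
        self.f_sigma = nn.Linear(hid_dim, emb_dim)

    def forward(self, x):
        h = self.f(x)
        mu = self.f_mu(h)
        sigma = F.elu(self.f_sigma(h)) + 1.0   # ELU + 1 keeps every variance strictly positive
        return mu, sigma

def kl_energy(mu_i, sig_i, mu_j, sig_j):
    """Closed-form KL( N(mu_j, diag sig_j) || N(mu_i, diag sig_i) ), used as the edge energy."""
    d = mu_i.shape[-1]
    return 0.5 * (
        (sig_j / sig_i).sum(-1)                                  # trace term
        + ((mu_i - mu_j) ** 2 / sig_i).sum(-1)                   # Mahalanobis term
        - d
        + torch.log(sig_i).sum(-1) - torch.log(sig_j).sum(-1)    # log-det ratio
    )
```

Connected nodes are then pushed towards low energy, e.g. via p_1(i, j) = sigmoid(-kl_energy(...)) in the first-order objective below.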
Jargon :-
First order proximity
- We learn first-order proximity of nodes by modelling the local pairwise proximity between two connected nodes in the graph. The empirical probability for first-order proximity observed in the original graph between nodes i and j is defined as the ratio of the weight of edge (i, j) to the sum of the weights of all edges in the graph.
The objective minimizes the distance between the two distributions: O_1 = D_KL(p̂_1 || p_1)
Minimizing this KL divergence is equivalent (up to a constant) to minimizing the negative log-likelihood (NLL) of the observed edges; the reduction is sketched below.
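A short sketch of that reduction, using the empirical distribution p̂_1(i, j) = w_ij / W with W the total edge weight (the standard argument for LINE-style objectives):

```latex
\begin{align*}
O_1 &= D_{KL}(\hat{p}_1 \,\|\, p_1)
     = \sum_{(i,j)\in E} \frac{w_{ij}}{W}\,\log\frac{w_{ij}/W}{p_1(i,j)} \\
    &= \underbrace{\sum_{(i,j)\in E} \frac{w_{ij}}{W}\,\log\frac{w_{ij}}{W}}_{\text{constant w.r.t.\ the model}}
       \;-\; \frac{1}{W}\sum_{(i,j)\in E} w_{ij}\,\log p_1(i,j)
\end{align*}
```

So, up to the constant and the 1/W factor, minimizing O_1 is the same as minimizing the edge-weighted NLL −Σ_{(i,j)∈E} w_ij log p_1(i, j).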
Second order proximity
- Nodes with more similar neighbourhoods should be closer in the embedding space than nodes with less similar neighbourhoods
- So, in addition to z_i, each node gets a context embedding z′ produced by a separate context encoder f′, and the loss compares z_i with the context embeddings z′_j of its neighbours.
- Similarly, O_2 = Σ_{i∈V} λ_i D_KL(p̂_2(·|i) || p_2(·|i))
This has a hard runtime bottleneck, because the normalisation in p_2(·|i) runs over the context embeddings of all nodes in the graph.
So they use negative edge sampling: for each observed edge only a handful of sampled negative (non-neighbour) nodes are used instead of the full normalisation, which reduces the cost (a rough code sketch of this sampled objective follows this block).
- The encoder for context nodes is f′ : V → R^m. The latent representation of context node i is z′_i = N(µ′_i, Σ′_i), where µ′_i = f′_µ(f′(x_i)) and Σ′_i = f′_Σ(f′(x_i)).
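A rough sketch of the negative-sampling version of the second-order objective, reusing the hypothetical `GaussianEncoder` and `kl_energy` from the Architecture sketch above; the uniform negative-sampling distribution and the function/parameter names are my assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def second_order_loss(node_enc, ctx_enc, kl_energy, x, edges, num_neg=5):
    """Negative-sampling approximation of O_2.

    node_enc, ctx_enc : node and context GaussianEncoder instances (earlier sketch)
    kl_energy         : KL-based edge energy (earlier sketch)
    x                 : node attribute matrix, shape [N, in_dim]
    edges             : LongTensor of observed edges, shape [E, 2]
    """
    i, j = edges[:, 0], edges[:, 1]
    mu_i, sig_i = node_enc(x[i])          # node embeddings z_i
    mu_j, sig_j = ctx_enc(x[j])           # context embeddings z'_j of the neighbours

    # Positive term: connected (node, context) pairs should have low energy.
    loss = -F.logsigmoid(-kl_energy(mu_i, sig_i, mu_j, sig_j))

    # Negative term: a few sampled "noise" context nodes per edge should have high energy.
    # Uniform sampling is an assumption here; a different noise distribution may be used.
    for _ in range(num_neg):
        neg = torch.randint(0, x.shape[0], (edges.shape[0],))
        mu_n, sig_n = ctx_enc(x[neg])
        loss = loss - F.logsigmoid(kl_energy(mu_i, sig_i, mu_n, sig_n))

    return loss.mean()
```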
Performance :-
Embedding visualisation :-
ELBO, NLL and the KL divergence trick (reference video) :- https://www.youtube.com/watch?v=IXsA5Rpp25w