Scaling up temporal Graph neural networks NAT (Neighborhood-aware Scalable Temporal Network Representation Learning) paper explained
Research paper:- https://arxiv.org/pdf/2209.01084.pdf
The problems with Gnn:-
Primarily graph neural networks are just neural network models made to learn the graph structure in static settings. In simple terms, gnn’s are just a bunch of dense layers with non-linearities, intuitively speaking the update of gnn is like who M I, and how do my surroundings affect me (basically quoting the popular you are the average of 5 people you hang out with p) meaning the update is guided by their K-hop neighborhood and themselves. implying one must know the entire graph structure for prediction hence gnn’s works best on the static transductive scenario.
And
Gnn are not able to distinguish between two nodes if they have the same structural neighborhood hence rendering them ineffective for dynamic inductive temporal neural network
There have been other approaches that help better in discriminating against these issues such as labeling a linked node and calculating the distance between neighbors of all nodes,labeling a linked node and constructing its distance to the other node, but it’s really expensive and time-consuming to do so.
Let us see how NAT solves all the problem
The key novelty of NAT is to incorporate dictionary-type neighborhood representations in place of one-single-vector node representation and a computation-friendly neighborhood cache (N-cache) to maintain such dictionary-type representations.
So NAT typically has a dictionary Zuk with space Mk X F pre-allocated for all nodes with down sampled node id’s as keys and values as features, And as denoted it’s a neighborhood dictionary that can have potential k hop neighbors.
So let’s suppose we are predicting for node U and V interaction at time T3:- so its joint neighborhood will be constructed pretty quickly via cache dictionary Zuk(for simplicity’s sake we are only considering 1 hop representation) then Z(1)u,v will be updated by RNN based on link feature Eu,v, time step T3, Z0 (initial representation)
This basically means V representation as 1 hop neighbor of U will be stored in a dictionary with some hash value for fast retrieval and update, If does not exist in current Z(1)u (e.g., in the first v, u interaction), a default initialization of Z(1)u,v is used. Once updated, the new value Z(1)u,v paired with the key (node ID) v will be inserted into Z(1)u
Constructing this N cache, handle hashing and collision and size
Authors use at most 2 hop neighbors and represent the node as only a small size F with only M1 number of neighbors (this is similar to graph sage )
Second, they use a very simple hash function which allows them to parallelize the operation using a large prime number q
Hashing is intuitive in a mall u know which floor and which side of the mall belongs to which department say 4 th floor is the food court so u can directly go there
Collision is handled as a nice sampling with replacement trick:- so Su holds the most recent neighbors of a node U specifically Suk(hash(a)) is used to check whether node a is key in Z(1)u
So the algo is like this:-
For a hop let’s say we want all neighbors of U so we query a node “a” with hash(a) in S(k)u. If the hash(a) in S(k)u gives “a” then we update the representation Z(k)u,a in Z(k)u , if its empty or not “a” then we record another observation
If another node b satisfies hash(a) = hash(b) = p and Z(k)u,b has occupied the position p of
Z(k)u, then with some probability “alpha” a node is replaced which is a hyperparameter
Hence the whole thing acts like a sampling with a replacement
How to predict for a node u,v?
This means we union the n cache’s of k hop representation of nodes a with respect to u and v for a timestep t.
DE = dual encoding
And equation in line 2 reads as the dual encoding of a node “a” with respect to u,v is given by concatenation of dual encoding of a with respect to u at time t, and dual encoding of “a” with respect to V at time t where that fancy X is a Xor operation for k hop N-cache (Xor helps in telling which nodes participate i.e if they are not a neighbor)
Then these dual encodings are concatenated with the above aggregation embeddings
Improvements in inference and prediction power
On Wikipedia, Reddit, Enron, and UCI, CAWN outperforms all baselines on the inductive study and most baselines on transductive. the reason could be it captures neighborhood structural information via its temporal random walk sampling. However, it is not performing as well on Ubuntu. Because of the sparsity of the Ubuntu network
beats TGn-pg on inductive tasks by 8–10 %
On Social Evolve, NAT significantly outperforms all baselines that do not construct neighborhood structural features by at least 25% on transductive and 7% on inductive predictions
The inference speed of NAT is significantly faster than CAWN which can also construct neighborhood structural features, which achieves 25–29 times speedup on inference for attributed networks.
NAT also achieves at least four times faster inference than TGN, JODIE and DyRep. Compared to TGN-pg
NAT generally has comparable inference time while achieving about 10% speed up over the largest dataset Wiki-talk. This is because when the network is large, online sampling of TGN-pg may dominate the time cost.