MDE:- Modeling (Anti)symmetric, composition, inversion, and reflexive relations in a knowledge graph effectively

6 min readJun 5, 2023

Research paper explained from:- https://arxiv.org/pdf/1905.10702.pdf

So what is a KG?

Practically, a KG usually consists of a set of facts. And a fact is a triple (head, relation, tail) where heads and tails are called entities.

KG completion is a huge task that helps us learn/predict new edges; for example, predicting any new customer that may buy a product, new drug discoveries, simple friend recommendations, and many more.

Typically distance based KG embeddings are popular due to their low number of parameters and hence scalability; one of the most popular distance-based methods is TransE, which simply learns a translation relational matrix between head and tail but fails to learn the symmetric pattern, for, e.g. a relation like A is a sibling of B, and hence B is a sibling of A. let us derive why

H = Head

T = Tail

R = Relaton

The Problem with TransE (H + R -T):- The TransE model learns a translation vector ‘r’ to represent relationships in a knowledge graph. It aims to find a vector that, when added to the embedding of entity A, produces a vector close to the embedding of entity B. This is mathematically represented as ea + r ≈ eb.

To measure the closeness between these vectors, the TransE model uses a distance function, typically the L2 distance. The model minimizes this distance to accurately learn the translation vector ‘r.

However, a problem arises when dealing with symmetric relations. In these cases, we expect the distance between embeddings of entities to be close to zero, indicating their similarity. Unfortunately, the TransE model assigns relatively large vectors to relations, resulting in large distances between embeddings.

During training, the model tries to minimize the L2 distance ||ea + r — eb||2 but often reduces it by making the norm (length) of the translation vector ‘r’ close to zero. This happens because when the norm of ‘r’ approaches zero, the contribution of the translation vector becomes negligible, and the entity embeddings dominate the distance.

As a consequence, the TransE model struggles to capture symmetric relations effectively. Instead of achieving a small value for the norm of ‘r,’ it tends to make it close to zero. This means that the model heavily relies on entity embeddings to determine the distance, and the explicit representation of the symmetric relation through ‘r’ becomes less significant.

The paper states that other variants are available, but then they fail to grasp composition and inversion patterns, and reflexive (an entity related to itself) patterns are typically complex.

More scoring functions

DistMult: — it is simply H*R*T and is ineffective in learning relations due to the symmetric nature of multiplication
ComplEx:- this address the problem by having a complex conjugate in the equation given by Re(Htranpose .diag(r).t¯).

A conjugate represents a reflection on the real axis, and hence it allows modeling for symmetric relations to give an intuitive, e.g. we can think of it as seeing our review in the mirror but just in reverse

But it fails to capture composition as modeling composition involves chaining and combining multiple steps of a simple conjugate addition which is just a sign change at the highest level of understanding is not able to learn such intricate patterns.

RESCAL:- given by Htranspose RT R is relation matrix, and we perform the outer product of H and T

SimplE:- given by ½(Hei, R, Tej) + ½(Hei, R.inv, Tej) is the average of two Due to design, it cannot map composition rules because it learns a separate mapping of head and tails, which are ineffective in learning chain and complex combination rules.

In short, Every Approach lacks something or the other!

So authors state that MDE is a multiple distance embedding approach that trains different types of distance functions jointly and hence may provide different geometric views proposing to learn every kind of relations.

So let’s State MDE and give proof of how it learns these patterns with each distance function.

S1 = ||H + R — T||p
S2 = ||T + R — H||p
S3 = ||H + T — R||p
S4 = ||H — R * T ||p

(In the S4 equation, h represents the embedding vector of the head entity, r denotes the embedding vector of the relation, t represents the embedding vector of the tail entity, ||*|| denotes a norm operation (such as the L2 norm), ◦ represents the Hadamard product (element-wise multiplication),)

Lemma 1 :- S1 allows modeling antisymmetry, inversion, and composition patterns, and S2 allows modeling symmetry patterns.

Translation-based models force the reflexive relations to become symmetric and transitive.

Proof that S4 doesn’t allow such restrictions

Since r1 is a reflexive relation, we can conclude that r1 is symmetric on Set, which means if r1 holds between ei and ej, it also holds between ej and ei. Therefore, the proof shows that restriction R1 does not apply to the S4 score function.

One final tweak in aggregating the functions
If S1 and S3 are added together, the minimization will lead to relation(r) vectors with zero norm value. To address this issue, they represent the same entities with independent variables in different distance functions.

So finally MDE is defined as

The authors also provide a limit-based rank function wherein limits (Psi and psi dash ) can be learned whilst training.

WHY ?:- in cases with margin-based ranking functions, due to the margin taken, it may not give a small score to pos samples and hence may not be able to distinguish between pos and neg samples; this can happen when r embeddings are assigned large values which would happen in TransE.

The goal of this loss function is that the score of pos samples is lesser than Y1, and a score of negative samples is greater than Y2, B1, and B2 helps assign importance to pos or neg samples.

We want small values for pos samples because:-

it guides the model or promotes the formation of tight clusters or neighborhoods representing specific relations or entity types, making the embeddings more discriminative.
When the model learns to assign small scores to positive triplets during training, it becomes more sensitive to subtle differences between entities and relations. This improves its ability to generalize and accurately predict new, unseen triplets.

If this wasn’t enough, they also proposed the scoring function as a neural net to learn the weights.

Directly taking this from a research paper :p (I am not lazy; it’s a good/simple enough explanation ) and Since Tanhshrink has a range of R, it allows setting large values for γ1 and γ2

Loss functions are modified as above to create non-linearity

Scalability

Benchmarks

MR:- lower is better for ranking.
MRR:- higher is better
Hits@10:- how many times was the true answer in the first top 10 values
All of the metrics are ranking-focused.