Intuitively Explained: DeBERTa
Recap: Attention and Positional Embeddings
Because of how the attention mechanism works, it has no way of telling where a token sits in a sentence: each sequence is effectively treated as a bag of words. So we usually add positional embeddings (fixed or trainable).
This can be done in 2 ways:- add the positional encoding to the word embedding, or concatenate the two.
Technically speaking, addition can in the worst case replicate concatenation: the network can learn to keep the word embedding near zero in some dimensions and the positional embedding near zero in the others, as shown in the figure.
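To make the two options concrete, here is a minimal sketch in plain PyTorch (the sizes and variable names are mine, purely for illustration, not from the paper):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64

tok_emb = nn.Embedding(vocab_size, d_model)   # content (word) embeddings
pos_emb = nn.Embedding(max_len, d_model)      # learned absolute position embeddings

ids = torch.randint(0, vocab_size, (1, 10))           # a batch with 10 token ids
positions = torch.arange(ids.size(1)).unsqueeze(0)    # 0, 1, ..., 9

# Option 1: addition -- content and position share the same d_model dimensions
x_add = tok_emb(ids) + pos_emb(positions)             # shape (1, 10, 64)

# Option 2: concatenation -- content and position live in separate dimensions
x_cat = torch.cat([tok_emb(ids), pos_emb(positions)], dim=-1)   # shape (1, 10, 128)
```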
In normal attention we compute the Query and Key from X (the input sequence). Intuitively, the Key says what information this token is about, and the Query says what information it requests from other tokens.
So we add content and position together and the network is supposed to figure out how to use both. The attention score A_ij then says how much token i attends to token j.
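As a reference point, a bare-bones single-head attention over this position-mixed input might look like the sketch below (again just an illustration; the projection names are mine):

```python
import torch
import torch.nn as nn

d_model = 64
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(1, 10, d_model)   # content + position already summed together

Q, K, V = W_q(x), W_k(x), W_v(x)
A = Q @ K.transpose(-1, -2) / d_model ** 0.5   # A[:, i, j]: how much token i attends to token j
out = torch.softmax(A, dim=-1) @ V             # weighted sum of the value vectors
```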
DeBERTa's Claim
This is not ideal because the positions get mixed too much into the content signal of the words. We would rather have them disentangled, so that the model can reason about content along one line and about positions along another.
So here is the proposition of the paper: given a sequence of tokens, the attention score A_ij between token i and token j can be calculated as
A_ij = {H_i, P_i|j} x {H_j, P_j|i}^T = H_i H_j^T + H_i P_j|i^T + P_i|j H_j^T + P_i|j P_j|i^T
where H_i is the content vector of token i and P_i|j is the relative positional embedding of token i with respect to token j.
Hence each token now produces a content embedding [dark blue] as well as a positional embedding [light blue].
DeBERTa's Proposed Disentangled Attention Mechanism
So now we have 4 matrices: C-C, C-P, P-C, and P-P [P = position and C = content].
Content to content:- this is the classical attention
For example, I am the word "am" and I request information from the nouns of the sentence because I am a verb.
Content to position:- I am the word "am" and I want to know what tokens are around me, so the word can attend to its surroundings.
For example, the network has figured out the sentence is not a question, and hence "am" might not want much information from the token right before it, because that token is probably "I".
Position to content:- this one is kind of weird. Basically: I am at some position, and for the token that is, say, two words after me, what information do I want from it? Since this term attends to content, the answer can depend on what kind of word that token is.
Position to position:- since DeBERTa uses relative encodings (within a context window), this matrix, which is all about the relative positions of tokens, carries no extra information; hence it is not useful and not included in the calculations.
We finally add all 3 of them to get the attention matrix A_ij. So, in summary: I am the word "am", I am at position 2, I request a lot of information from other nouns, I also want information from things that are 1 or 2 words ahead of me, and, since I am at position 2, I am interested in the subject of the sentence.
And the rest is history, I mean, classic attention :P (softmax over the scores, then a weighted sum of the value vectors). A rough code sketch of the whole thing is below.
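Here is a simplified, single-head sketch of that decomposition, assuming a shared relative-position table P and separate content/position projections. The variable names, the clipping of relative distances, and the single head are my simplifications; the scaling by sqrt(3d) follows the paper, but the authors' real implementation is more involved.

```python
import torch
import torch.nn as nn

d, k, L = 64, 8, 10        # hidden size, max relative distance, sequence length

H = torch.randn(1, L, d)                    # content states coming from the previous layer
P = nn.Embedding(2 * k, d).weight           # relative position embeddings, shared across layers

W_qc = nn.Linear(d, d, bias=False)          # content query projection
W_kc = nn.Linear(d, d, bias=False)          # content key projection
W_qr = nn.Linear(d, d, bias=False)          # position query projection
W_kr = nn.Linear(d, d, bias=False)          # position key projection
W_v  = nn.Linear(d, d, bias=False)          # values come from content only

Qc, Kc = W_qc(H), W_kc(H)
Qr, Kr = W_qr(P), W_kr(P)

# delta[i, j]: relative distance i - j, clipped and shifted into [0, 2k)
idx = torch.arange(L)
delta = (idx[:, None] - idx[None, :]).clamp(-k, k - 1) + k        # (L, L)

# content-to-content: the classic attention term
c2c = Qc @ Kc.transpose(-1, -2)                                   # (1, L, L)

# content-to-position: content query of i meets the position of j relative to i
c2p_all = Qc @ Kr.transpose(-1, -2)                               # (1, L, 2k)
c2p = torch.gather(c2p_all, 2, delta.unsqueeze(0))                # (1, L, L)

# position-to-content: the position of i relative to j meets the content key of j
p2c_all = Kc @ Qr.transpose(-1, -2)                               # (1, L, 2k)
p2c = torch.gather(p2c_all, 2, delta.unsqueeze(0)).transpose(1, 2)

A = (c2c + c2p + p2c) / (3 * d) ** 0.5      # position-to-position term is dropped
out = torch.softmax(A, dim=-1) @ W_v(H)
```

Note that the same P table is reused in every layer (only H changes from layer to layer), which is how the relative positional information gets injected at every layer rather than just the first.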
So we have the new relative positional vectors P, which get multiplied with the hidden (content) vectors inside the attention, and we feed them into each and every layer in DeBERTa rather than just the first layer.
So it is a disentangled positional encoding: at every layer we explicitly inject the positional information, in conjunction with the new attention mechanism.
Some new problems with this approach, and a fix?
A problem with relative positional encoding
In the MLM objective:- given the sentence "A new store opened beside a new mall", with "store" and "mall" masked, both masked words have the same local context ("a new ..."). So, using only the local context and relative encodings, it is impossible to distinguish between mall and store; their absolute positions (store is the subject of the sentence) are what tells them apart.
Hence they feed in normal BERT-style absolute position encodings at the very end, just before the softmax layer, which is somewhat what they criticized earlier, a similar concept to convolutions being better than MLPs.
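Conceptually (and very loosely; this is not the exact Enhanced Mask Decoder from the paper), the fix looks something like the sketch below: the encoder layers only ever see relative positions, and the absolute positions are mixed in just before the MLM prediction head. All the names and the single decoding layer here are my simplifications.

```python
import torch
import torch.nn as nn

vocab_size, max_len, d = 1000, 128, 64
L = 8                                        # "A new store opened beside a new mall"

hidden = torch.randn(1, L, d)                # output of the relative-position-only encoder layers
abs_pos = nn.Embedding(max_len, d)(torch.arange(L)).unsqueeze(0)   # BERT-style absolute positions

decode_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
mlm_head = nn.Linear(d, vocab_size)

h = decode_layer(hidden + abs_pos)           # absolute position information enters only here
logits = mlm_head(h)                         # (1, L, vocab_size), then softmax / cross-entropy
```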
References
Paper link:- DeBERTa: Decoding-enhanced BERT with Disentangled Attention, https://arxiv.org/abs/2006.03654