Intuitively Explained: DeBERTa
Recap: Attention and Positional Embeddings
Because of how the attention mechanism works, it has no way of telling where a token sits in a sentence: each sequence is effectively treated as a bag of words. So we usually add positional embeddings (fixed or trainable).
This can be done in 2 ways:- add the positional encoding to the word embedding, or concatenate the two.
Technically speaking, addition can in the worst case replicate concatenation: the network can learn to keep the word embedding near zero in some dimensions and the positional embedding near zero in the others, as shown in the figure.
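To make the two options concrete, here is a minimal sketch in plain PyTorch (the sizes and variable names are mine, purely for illustration, not from the paper):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 1000, 128, 64

tok_emb = nn.Embedding(vocab_size, d_model)   # content (word) embeddings
pos_emb = nn.Embedding(max_len, d_model)      # learned absolute position embeddings

ids = torch.randint(0, vocab_size, (1, 10))           # a batch with 10 token ids
positions = torch.arange(ids.size(1)).unsqueeze(0)    # 0, 1, ..., 9

# Option 1: addition -- content and position share the same d_model dimensions
x_add = tok_emb(ids) + pos_emb(positions)             # shape (1, 10, 64)

# Option 2: concatenation -- content and position live in separate dimensions
x_cat = torch.cat([tok_emb(ids), pos_emb(positions)], dim=-1)   # shape (1, 10, 128)
```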
In normal attention we compute the Query and Key from X (the input sequence). Intuitively, the Key says what information this token is about, and the Query says what information it requests from other tokens.
So we add content and position together and the network is supposed to figure out how to use both. The attention score A_ij then says how much token i attends to token j.
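As a reference point, a bare-bones single-head attention over this position-mixed input might look like the sketch below (again just an illustration; the projection names are mine):

```python
import torch
import torch.nn as nn

d_model = 64
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

x = torch.randn(1, 10, d_model)   # content + position already summed together

Q, K, V = W_q(x), W_k(x), W_v(x)
A = Q @ K.transpose(-1, -2) / d_model ** 0.5   # A[:, i, j]: how much token i attends to token j
out = torch.softmax(A, dim=-1) @ V             # weighted sum of the value vectors
```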
DeBERTa's Claim
This is not ideal because the positions get mixed too much into the content signal of the words. We would rather have them disentangled, so that the model can reason about content along one line and about positions along another.
So here is the proposition of the paper: given a sequence of tokens, the attention score A_ij between token i and token j can be calculated as
A_ij = {H_i, P_i|j} x {H_j, P_j|i}^T = H_i H_j^T + H_i P_j|i^T + P_i|j H_j^T + P_i|j P_j|i^T
where H_i is the content vector of token i and P_i|j is the relative positional embedding of token i with respect to token j.
Hence each token now produces a content embedding [dark blue] as well as a positional embedding [light blue].
DeBERTa's Proposed Disentangled Attention Mechanism
So now we have 4 matrices: C-C, C-P, P-C, and P-P [P = position and C = content].
Content to content:- this is the classical attention
For example, I am the word "am" and I request information from the nouns of the sentence because I am a verb.
Content to position:- I am the word "am" and I want to know what tokens are around me, so the word can attend to its surroundings.
For example, the network has figured out the sentence is not a question, and hence "am" might not want much information from the token right before it, because that token is probably "I".
Position to content:- this one is kind of weird. Basically: I am at some position, and for the token that is, say, two words after me, what information do I want from it? Since this term attends to content, the answer can depend on what kind of word that token is.
Position to position:- since DeBERTa uses relative encodings (within a context window), this matrix, which is all about the relative positions of tokens, carries no extra information; hence it is not useful and not included in the calculations.
We finally add all 3 of them to get the attention matrix A_ij. So, in summary: I am the word "am", I am at position 2, I request a lot of information from other nouns, I also want information from things that are 1 or 2 words ahead of me, and, since I am at position 2, I am interested in the subject of the sentence.
And the rest is history, I mean, classic attention :P (softmax over the scores, then a weighted sum of the value vectors). A rough code sketch of the whole thing is below.
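Here is a simplified, single-head sketch of that decomposition, assuming a shared relative-position table P and separate content/position projections. The variable names, the clipping of relative distances, and the single head are my simplifications; the scaling by sqrt(3d) follows the paper, but the authors' real implementation is more involved.

```python
import torch
import torch.nn as nn

d, k, L = 64, 8, 10        # hidden size, max relative distance, sequence length

H = torch.randn(1, L, d)                    # content states coming from the previous layer
P = nn.Embedding(2 * k, d).weight           # relative position embeddings, shared across layers

W_qc = nn.Linear(d, d, bias=False)          # content query projection
W_kc = nn.Linear(d, d, bias=False)          # content key projection
W_qr = nn.Linear(d, d, bias=False)          # position query projection
W_kr = nn.Linear(d, d, bias=False)          # position key projection
W_v  = nn.Linear(d, d, bias=False)          # values come from content only

Qc, Kc = W_qc(H), W_kc(H)
Qr, Kr = W_qr(P), W_kr(P)

# delta[i, j]: relative distance i - j, clipped and shifted into [0, 2k)
idx = torch.arange(L)
delta = (idx[:, None] - idx[None, :]).clamp(-k, k - 1) + k        # (L, L)

# content-to-content: the classic attention term
c2c = Qc @ Kc.transpose(-1, -2)                                   # (1, L, L)

# content-to-position: content query of i meets the position of j relative to i
c2p_all = Qc @ Kr.transpose(-1, -2)                               # (1, L, 2k)
c2p = torch.gather(c2p_all, 2, delta.unsqueeze(0))                # (1, L, L)

# position-to-content: the position of i relative to j meets the content key of j
p2c_all = Kc @ Qr.transpose(-1, -2)                               # (1, L, 2k)
p2c = torch.gather(p2c_all, 2, delta.unsqueeze(0)).transpose(1, 2)

A = (c2c + c2p + p2c) / (3 * d) ** 0.5      # position-to-position term is dropped
out = torch.softmax(A, dim=-1) @ W_v(H)
```

Note that the same P table is reused in every layer (only H changes from layer to layer), which is how the relative positional information gets injected at every layer rather than just the first.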
So we have the new relative positional vectors P, which get multiplied with the hidden (content) vectors inside the attention, and we feed them into each and every layer in DeBERTa rather than just the first layer.
So it is a disentangled positional encoding: at every layer we explicitly inject the positional information, in conjunction with the new attention mechanism.
Some new problems with this approach, and a fix?
A problem with relative positional encoding
In the MLM objective:- given the sentence "A new store opened beside a new mall", with "store" and "mall" masked, both masked words have the same local context ("a new ..."). So, using only the local context and relative encodings, it is impossible to distinguish between mall and store; their absolute positions (store is the subject of the sentence) are what tells them apart.
Hence they feed in normal BERT-style absolute position encodings at the very end, just before the softmax layer, which is somewhat what they criticized earlier, a similar concept to convolutions being better than MLPs.
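Conceptually (and very loosely; this is not the exact Enhanced Mask Decoder from the paper), the fix looks something like the sketch below: the encoder layers only ever see relative positions, and the absolute positions are mixed in just before the MLM prediction head. All the names and the single decoding layer here are my simplifications.

```python
import torch
import torch.nn as nn

vocab_size, max_len, d = 1000, 128, 64
L = 8                                        # "A new store opened beside a new mall"

hidden = torch.randn(1, L, d)                # output of the relative-position-only encoder layers
abs_pos = nn.Embedding(max_len, d)(torch.arange(L)).unsqueeze(0)   # BERT-style absolute positions

decode_layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
mlm_head = nn.Linear(d, vocab_size)

h = decode_layer(hidden + abs_pos)           # absolute position information enters only here
logits = mlm_head(h)                         # (1, L, vocab_size), then softmax / cross-entropy
```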
References
Paper link:- DeBERTa: Decoding-enhanced BERT with Disentangled Attention, https://arxiv.org/abs/2006.03654