GPT Understands, Too… (Standard from now on? ;) )

Sep 13, 2021

By Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, Jie Tang.

Why? Earlier work from the same research group presented it as a given that autoregressive pretraining is no good for NLU. Well, hold your paper! And keep reading ;) The proposed method, p-tuning, has the potential to become a conventional technique for few-shot learning and for finetuning huge LMs for which standard finetuning is too costly or doesn’t work very well.

Method Proposed:-

In the authors’ words, the starting point is: “Given a pre-trained language model M, a sequence of discrete input tokens x1:n = {x0, x1, …, xn} will be mapped to input embeddings {e(x0), e(x1), …, e(xn)} by the pre-trained embedding layer e ∈ M.” P-tuning keeps e for the real tokens but does not look the prompt up through it: the prompt’s embeddings are continuous vectors learned by simple back-propagation through an LSTM. Simply put, manual/discrete prompts are replaced by differentiable prompts, which can be placed anywhere in the template, before or after the context or the target, all while the original model parameters can stay frozen.

The job of a prompt p is to organize the context x, the target y, and itself into a template T.

Now you may be asking: what is a prompt? For example, in the sentence “The capital of India is [MASK].”, “The capital of … is … .” is the prompt, “India” is the context, and “[MASK]” is the target.
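To make the three roles concrete, here is a minimal sketch in Python. The helper name `build_template` and the role labels are made up for illustration; nothing here comes from the paper's code.

```python
def build_template(prompt_before, context, prompt_after, target="[MASK]"):
    """Arrange prompt tokens, context, and target into one template.

    Returns a list of (role, token) pairs. The function name and the
    role labels are hypothetical, chosen only for this illustration.
    """
    template = [("prompt", tok) for tok in prompt_before]
    template.append(("context", context))
    template += [("prompt", tok) for tok in prompt_after]
    template.append(("target", target))
    return template


template = build_template(["The", "capital", "of"], "India", ["is"])
sentence = " ".join(tok for _, tok in template)
print(sentence)  # The capital of India is [MASK]
```

Swapping "India" for another country changes only the context; the prompt and the target slot stay the same.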

Challenges:-

1) Discreteness: the original word embedding e of M has already become highly discrete after pre-training. If the prompt embedding h is initialized from a random distribution and then optimized with stochastic gradient descent (SGD), which has been shown to change the parameters only within a small neighborhood (Allen-Zhu et al., 2019), the optimizer can easily fall into local minima.

2) Association: intuitively, the values of the prompt embeddings should depend on each other rather than be independent.

Solution? A small universal function approximator, a.k.a. a neural net: a bidirectional LSTM followed by a ReLU-activated two-layer multilayer perceptron (MLP) to encourage discreteness.
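The shape of such a prompt encoder can be sketched in plain Python. This is only a forward pass under loud assumptions: the weights are random instead of trained, biases are omitted, the hidden size equals the embedding size, and the class names `LSTMCell` and `PromptEncoder` are hypothetical. A real implementation would use a deep learning framework with automatic differentiation.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class LSTMCell:
    """A plain LSTM cell (no biases); weights are random here, trained in practice."""

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        # One stacked matrix for the input, forget, cell, and output gates.
        self.weights = rand_matrix(4 * hidden_dim, input_dim + hidden_dim, rng)

    def step(self, x, h, c):
        n = self.hidden_dim
        z = matvec(self.weights, x + h)  # x + h concatenates the two lists
        i = [sigmoid(v) for v in z[:n]]
        f = [sigmoid(v) for v in z[n:2 * n]]
        g = [math.tanh(v) for v in z[2 * n:3 * n]]
        o = [sigmoid(v) for v in z[3 * n:]]
        c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
        return h, c

    def run(self, inputs):
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        outputs = []
        for x in inputs:
            h, c = self.step(x, h, c)
            outputs.append(h)
        return outputs


class PromptEncoder:
    """Bidirectional LSTM followed by a two-layer ReLU MLP (illustrative only)."""

    def __init__(self, num_prompts, dim, seed=0):
        rng = random.Random(seed)
        # Raw vectors, one per pseudo prompt token; these would be trainable.
        self.raw = rand_matrix(num_prompts, dim, rng)
        self.forward_cell = LSTMCell(dim, dim, rng)
        self.backward_cell = LSTMCell(dim, dim, rng)
        self.w1 = rand_matrix(dim, 2 * dim, rng)
        self.w2 = rand_matrix(dim, dim, rng)

    def encode(self):
        fwd = self.forward_cell.run(self.raw)
        bwd = self.backward_cell.run(self.raw[::-1])[::-1]
        prompts = []
        for f, b in zip(fwd, bwd):
            hidden = [max(0.0, v) for v in matvec(self.w1, f + b)]  # ReLU
            prompts.append(matvec(self.w2, hidden))
        return prompts


encoder = PromptEncoder(num_prompts=4, dim=3)
h = encoder.encode()
print(len(h), len(h[0]))  # 4 3
```

Because every output passes through both LSTM directions, each prompt embedding depends on all the raw vectors, which is how the encoder addresses the association concern.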

THE MATH 💩

Let V refer to the vocabulary of a language model M and [P_i] refer to the i-th prompt token in a template T. For simplicity, take a template T = {[P_{0:i}], x, [P_{i+1:m}], y}. Traditional discrete prompts satisfy [P_i] ∈ V and map T into

{e([P_{0:i}]), e(x), e([P_{i+1:m}]), e(y)}    (1)

P-tuning instead regards the [P_i] as pseudo tokens and maps the template to

{h_0, …, h_i, e(x), h_{i+1}, …, h_m, e(y)}    (2)

where h_i (0 ≤ i < m) are trainable parameters.
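The difference between the two mappings described above can be sketched in a few lines of Python. Everything concrete here is an assumption for illustration: the toy vocabulary, the embedding size `DIM`, the random vectors standing in for the frozen embedding layer e, and the function names.

```python
import random

rng = random.Random(0)
DIM = 4  # toy embedding size, an assumption for illustration

# Stand-in for the frozen pre-trained embedding layer e.
vocab = ["The", "capital", "of", "is", "India", "[MASK]"]
e = {tok: [rng.uniform(-1, 1) for _ in range(DIM)] for tok in vocab}


def embed_discrete(prompt_before, x, prompt_after, y):
    """Discrete prompting: every prompt token is in V and goes through e."""
    tokens = prompt_before + [x] + prompt_after + [y]
    return [e[tok] for tok in tokens]


def embed_p_tuning(h, split, x, y):
    """P-tuning: pseudo tokens are replaced by the trainable vectors h."""
    return h[:split] + [e[x]] + h[split:] + [e[y]]


# Four pseudo tokens: three before the context, one after it.
h = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(4)]

discrete = embed_discrete(["The", "capital", "of"], "India", ["is"], "[MASK]")
continuous = embed_p_tuning(h, 3, "India", "[MASK]")
print(len(discrete), len(continuous))  # 6 6
```

Both sequences have the same length and share e(x) and e(y); only the prompt positions differ, and in the second case those are the values the optimizer is free to move.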

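Loss Function used:-

With a downstream loss function L, the continuous prompts are optimized by ĥ_{0:m} = argmin_h L(M(x, y)).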

The Results

The results are most fascinating when comparing finetuning, p-tuning, and manual prompts, especially for knowledge probing (which evaluates how much real-world knowledge language models have gained from pre-training), where p-tuning performs vastly better than the other techniques. On the SuperGLUE benchmark, it doesn’t come close to the other SOTA results (though that wouldn’t be an apples-to-apples comparison given the massive differences in model size), but p-tuning shows solid performance when compared to standard fine-tuning or manual prompting.