GPT Understands, Too… (standard from now on? ;) )

Method Proposed:-

In the authors' words: "Given a pre-trained language model M, a sequence of discrete input tokens x1:n = {x0, x1, …, xn} will be mapped to input embeddings {e(x0), e(x1), …, e(xn)} by the pre-trained embedding layer e ∈ M." P-tuning keeps this embedding layer and all of M's original parameters frozen, but replaces the manual/discrete prompt tokens with differentiable, continuous prompt embeddings that are learned through simple back-propagation and produced by a small LSTM-based prompt encoder. These learned prompts can be placed anywhere in the template, even in place of the context or the target.
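To make the idea concrete, here is a minimal sketch (not the authors' released code) of how trainable continuous prompt vectors get spliced into the frozen model's input embeddings. It assumes PyTorch and a backbone that accepts an `inputs_embeds` argument (as HuggingFace transformer models do); the class name, `num_prompt_tokens`, and the prepend-only placement are my illustrative choices.

```python
import torch
import torch.nn as nn

class ContinuousPrompt(nn.Module):
    """Learns only continuous prompt vectors; the pre-trained LM stays frozen."""

    def __init__(self, backbone: nn.Module, num_prompt_tokens: int, embed_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # keep the original model parameters frozen
            p.requires_grad = False
        # trainable continuous prompt embeddings h_0 ... h_{m-1}
        self.prompt = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor):
        # input_embeds: (batch, seq_len, embed_dim), i.e. e(x) from the frozen embedding layer
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        # the prompts can in principle sit anywhere in the template;
        # here they are simply prepended to the context for brevity
        full_embeds = torch.cat([prompt, input_embeds], dim=1)
        return self.backbone(inputs_embeds=full_embeds)   # only self.prompt receives gradients
```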

Challenges:-

1) Discreteness: the original word embedding e of M has already become highly discrete after pre-training. If h is initialized from a random distribution and then optimized with stochastic gradient descent (SGD), which has been shown to change the parameters only within a small neighborhood (Allen-Zhu et al., 2019), the optimizer easily falls into local minima.
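The paper's remedy is to not optimize each h_i as an independent free parameter at all, but to generate the prompts with a small prompt encoder, a bidirectional LSTM followed by a ReLU-activated two-layer MLP, trained end-to-end while M stays frozen. A minimal sketch of that encoder, assuming PyTorch (the hidden size of 512 is my choice, not the paper's):

```python
import torch
import torch.nn as nn

class PromptEncoder(nn.Module):
    """Produces the continuous prompts h_0 ... h_{m-1} via a BiLSTM + MLP."""

    def __init__(self, num_prompt_tokens: int, embed_dim: int, hidden_dim: int = 512):
        super().__init__()
        # learnable pseudo-token inputs fed into the encoder
        self.pseudo_tokens = nn.Parameter(torch.randn(num_prompt_tokens, embed_dim) * 0.02)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self) -> torch.Tensor:
        out, _ = self.lstm(self.pseudo_tokens.unsqueeze(0))   # (1, m, 2 * hidden_dim)
        return self.mlp(out).squeeze(0)                       # (m, embed_dim) -> h_0 ... h_{m-1}
```

Because every h_i passes through the same LSTM, the prompt values stay associated with each other instead of being optimized in isolation, which is what helps the optimizer escape the discreteness problem above.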

THE MATH 💩
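Paraphrasing the key equations from the paper [1] (my LaTeX rendering, notation approximate): given a template with prompt slots [P_i], context x, and target y, P-tuning swaps the prompt slots for trainable embeddings h_i while x and y keep their frozen embeddings e(·), and only the h_i are optimized against the downstream loss.

```latex
% Template: prompt slots [P_i], context x, target y
T = \{[P_{0:i}],\; x,\; [P_{i+1:j}],\; y,\; [P_{j+1:k}]\}

% P-tuning maps T to a mix of trainable prompt embeddings h_i
% and frozen input embeddings e(\cdot):
\{h_0, \dots, h_i,\; e(x),\; h_{i+1}, \dots, h_j,\; e(y),\; h_{j+1}, \dots, h_k\}

% Only the continuous prompts are optimized against the downstream loss L,
% with the pre-trained parameters of M kept frozen:
\hat{h}_{0:k} = \arg\min_{h} \; \mathcal{L}\big(\mathcal{M}(x, y)\big)

% Each h_i is produced by the prompt encoder (bidirectional LSTM + MLP):
h_i = \mathrm{MLP}\big(\,[\,\mathrm{LSTM}(h_{0:i})\,;\,\mathrm{LSTM}(h_{i:m})\,]\,\big)
```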

The Results

The results are most fascinating when comparing fine-tuning, P-tuning, and manual prompting, especially on knowledge probing (which evaluates how much real-world knowledge a language model has gained from pre-training), where P-tuning performs vastly better than the other two techniques. On the SuperGLUE benchmark, it doesn't come close to the other SOTA results (though that wouldn't be an apples-to-apples comparison given the massive differences in model size), but P-tuning still shows solid performance compared to standard fine-tuning or manual prompting.

Sources:-

1. Liu et al., "GPT Understands, Too": https://arxiv.org/pdf/2103.10385.pdf
