RLHF: Reinforcement Learning From Human Feedback

Abhinav sharma
6 min read · Nov 2


In short, RLHF is used in generative NLP as a means to model human preferences, i.e. to generate the responses humans prefer.

1. It’s a three-stage process, as in the image above. First, a foundation model is pretrained with an autoregressive (next-token prediction) task on book datasets, Common Crawl or, simply put, any piece of text you can find.

Data scale: whatever you can find

2. The next step: instead of training the model on RAW data, you train it on proper gold-standard data in which the model is exposed to the types of questions it will answer in the real world, or something very close to them. We call this stage SUPERVISED FINE TUNING (SFT).

The next-token prediction objective from stage 1 holds for SFT as well; only the training data changes.
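The objective shared by pre-training and SFT can be sketched in a few lines. This is a minimal illustration, not the code of any real model: the probabilities are made-up values, and computing the SFT loss only over the response tokens is a common choice that is assumed here rather than taken from the article.

```python
import math

def next_token_nll(token_probs):
    """Average negative log-likelihood of the observed next tokens.

    token_probs[t] is the probability the model assigned to the token that
    actually came next at position t (hypothetical values in this sketch).
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

# Pre-training: the loss is taken over every token of a raw text sequence.
pretrain_loss = next_token_nll([0.5, 0.25, 0.8])

# SFT: exactly the same loss, here taken over the response tokens of a
# (prompt, response) pair. Only the data differs.
sft_loss = next_token_nll([0.9, 0.7])

print(round(pretrain_loss, 4), round(sft_loss, 4))
```

The lower the loss, the more probability the model put on the tokens that were really there.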

Data scale: 10,000 to 100,000 (prompt, response) pairs

  • InstructGPT: ~14,500 pairs (13,000 from labelers + 1,500 from customers)
  • Alpaca: 52K instructions generated with OpenAI’s text-davinci-003
  • Databricks’ Dolly-15k: ~15k pairs, created by Databricks employees
  • OpenAssistant: 161,000 messages in 10,000 conversations -> approximately 88,000 pairs
  • Dialogue-finetuned Gopher: ~5 billion tokens, which I estimate to be in the order of 10M messages. However, remember that these are filtered out using heuristics from the Internet, so they are not of the highest quality.

3. The easy stuff is over; now we will dive deep into “RLHF”.

So, if you think about it, the way to answer a question or the writing style of a model is not something that can be modelled as a LOSS function.

So, how do we have this preference encoded in the model?

You may also ask why this answering style is essential. It is because we can’t have the model replying in a rude manner or in a negative, unhelpful tone. We would typically want the response to be helpful and polite, even if it’s wrong.

However, since we pre-train the model on a large chunk of human text, it carries these toxic capabilities buried within it.

An analogy I can put forward: think of the pre-trained model as a hyena and the SFT model as a wild wolf, but what we want is a cute Labrador.


Suppose we have a model that can rank, or give a quality score to, the generated texts based on helpfulness, correctness and other such characteristics. Then we can set up a reinforcement learning framework for generating subjective text, and pose the problem as maximising the reward that this reward model assigns to the generated text.

Let’s talk about the REWARD MODEL FIRST

We want to sample a bunch of generated text from the model given a prompt and rank them from best to worst.

Now, OpenAI found this to be tricky, so they let the labeler choose whichever of 2 responses is better, and hence rank them relatively

(such that if x1 is better than x2 and x3 is better than x1, then x3 > x1 > x2; the comparison samples we can make from this are x3 > x1, x3 > x2, x1 > x2)
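Expanding a ranking into pairwise comparisons can be sketched as follows. The function name `ranking_to_pairs` is a hypothetical helper for illustration; the labels x1, x2, x3 are the ones used in the example above.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Turn a best-to-worst ranking into (winner, loser) pairs."""
    # combinations() keeps the input order, so the first element of every
    # pair is always ranked above the second one.
    return list(combinations(ranking, 2))

pairs = ranking_to_pairs(["x3", "x1", "x2"])
print(pairs)  # [('x3', 'x1'), ('x3', 'x2'), ('x1', 'x2')]
```

A ranking of K responses yields K * (K - 1) / 2 such pairs.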

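The loss over these pairs is then normalised by the number of pairs sampled for the prompt. The model itself outputs a single score per response, so it looks like a regression model, but it is trained on these relative comparisons rather than on absolute target scores (remember, the objective of this exercise is to create a mechanism that assigns a reward to the model’s action under the current policy).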

Which model to choose?

What we do is take the SFT model and replace its last layer with a regression layer that predicts this subjective scalar score for the sentence
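The head swap can be pictured with a toy example. Everything here is hypothetical (a 3-dimensional hidden state, a 4-token vocabulary, hand-picked weights); a real reward model does the same thing with a transformer body and learned weights.

```python
def linear(hidden, weight_rows, bias):
    """Apply a dense layer: one output per row of weights."""
    return [sum(h * w for h, w in zip(hidden, row)) + b
            for row, b in zip(weight_rows, bias)]

# Hypothetical final hidden state of the last token (size 3 for readability).
hidden = [0.5, -1.0, 2.0]

# SFT model: the last layer maps the hidden state to one logit per vocab token.
lm_head = [[0.1, 0.0, 0.2], [0.3, 0.1, 0.0], [0.0, 0.5, 0.1], [0.2, 0.2, 0.2]]
logits = linear(hidden, lm_head, [0.0] * 4)

# Reward model: same body, but the head has a single output row,
# so the result is one scalar score for the whole sequence.
reward_head = [[0.4, -0.2, 0.1]]
(score,) = linear(hidden, reward_head, [0.0])

print(len(logits), round(score, 3))
```

The body of the network is reused as-is; only the shape of the final projection changes.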

The expectation is nothing but the mean over the sampled comparisons, i.e. the expected behaviour under the data distribution.

The image below should make it clear what we mean by the sampled sequences Yw (the winning response) and Yl (the losing response).

Looking at the above equation for the loss:

Whatever rewards are given to the winning and losing sentences, their difference is squashed between 0 and 1 by the sigmoid.

If the model gives a bigger score to the losing sentence, the difference is negative, so its sigmoid is close to 0. The negative log of a value close to 0 is much greater than 0, so the model is penalised heavily.

On the other hand, if the winning sentence is assigned the bigger score, the sigmoid of the difference is closer to 1, the log of which is closer to 0, so the loss is small.
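The behaviour described above can be checked numerically. This is a minimal sketch assuming the usual pairwise form of the loss, -log(sigmoid(r_w - r_l)), for a single comparison; the reward values are made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_loss(reward_win, reward_lose):
    """-log(sigmoid(r_w - r_l)) for a single comparison."""
    return -math.log(sigmoid(reward_win - reward_lose))

good = pairwise_loss(3.0, -1.0)  # winner scored higher: loss is about 0.018
tie = pairwise_loss(1.0, 1.0)    # equal scores: loss is log(2), about 0.693
bad = pairwise_loss(-1.0, 3.0)   # loser scored higher: loss is about 4.018

print(round(good, 3), round(tie, 3), round(bad, 3))
```

Only the difference between the two rewards matters, so the reward model is free to pick any absolute scale.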

DATA SCALE: 100K to 1M examples

  • InstructGPT: 50,000 prompts. Each prompt has 4 to 9 responses, forming between 6 and 36 pairs of (winning_response, losing_response). This means between 300K and 1.8M training examples in the format of (prompt, winning_response, losing_response).
  • Constitutional AI, which is suspected to be the backbone of Claude (Anthropic): 318K comparisons — 135K generated by humans, and 183K generated by AI. Anthropic has an older version of their data open-sourced (hh-rlhf), which consists of roughly 170K comparisons.
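The InstructGPT numbers above follow from counting the pairs that 4 to 9 ranked responses produce. A quick check, using only the figures quoted in the list:

```python
from math import comb

prompts = 50_000

# Every unordered pair of responses gives one (winning, losing) example.
min_pairs = comb(4, 2)  # 4 responses per prompt
max_pairs = comb(9, 2)  # 9 responses per prompt

print(min_pairs, max_pairs)                      # 6 36
print(prompts * min_pairs, prompts * max_pairs)  # 300000 1800000
```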

Reinforcement part

The goal of an RL algorithm is to find a policy that maximises the cumulative reward over the states it visits and the actions it takes

Policy and agent (interchangeable/ same) = Model

Action space = vocabulary

Reward = reward given by reward model R(theta)


Reading objective 1: the score given by the reward model for prompt x and generated sequence y, minus the KL divergence between the per-step probability distribution of the RL model being tuned and the output probability distribution of the SFT model

The KL divergence is there so that the model doesn’t stray too far from the SFT model
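A rough sketch of this penalised reward for one sampled response is shown below. It assumes the common per-sample estimate of the KL term, the sum over tokens of log pi_RL minus log pi_SFT; the coefficient `beta` and all numbers are hypothetical.

```python
import math

def kl_penalised_reward(reward, logp_rl, logp_sft, beta=0.1):
    """Reward-model score minus beta * sum_t (log pi_RL - log pi_SFT).

    logp_rl / logp_sft hold the log-probability each model gave to the
    tokens that were actually sampled (made-up values in this sketch).
    """
    log_ratio = sum(a - b for a, b in zip(logp_rl, logp_sft))
    return reward - beta * log_ratio

sft = [math.log(0.5), math.log(0.5)]

# RL model identical to the SFT model: no penalty at all.
same = kl_penalised_reward(2.0, sft, sft)

# RL model has drifted and is much more confident than the SFT model:
# part of the reward is taken away.
drifted = kl_penalised_reward(2.0, [math.log(0.9), math.log(0.9)], sft)

print(same, round(drifted, 4))
```

A larger `beta` keeps the tuned model closer to the SFT model; a smaller one lets it chase the reward more aggressively.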

Reading objective 2: this is basically regularisation for the RL model, so that it doesn’t perform worse on the original token completion task

It is the standard kind of regularisation we have been seeing since logistic regression
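How the second objective acts as a regulariser can be sketched by adding a weighted pre-training log-likelihood term to the RL term. The weight `gamma`, the function name and the numbers are all assumptions made for illustration.

```python
from statistics import fmean

def combined_objective(penalised_rewards, pretrain_logps, gamma=0.5):
    """Mean RL reward plus gamma * mean log-likelihood on pre-training text."""
    rl_term = fmean(penalised_rewards)
    pretrain_term = fmean(pretrain_logps)
    return rl_term + gamma * pretrain_term

# Two hypothetical policies with the same rewards on RL prompts...
rewards = [1.0, 2.0]

# ...but one still models ordinary text well, while the other has degraded
# (lower log-likelihood on the pre-training samples).
keeps_skills = combined_objective(rewards, [-2.0, -4.0])
lost_skills = combined_objective(rewards, [-6.0, -8.0])

print(keeps_skills, lost_skills)  # 0.0 -2.0
```

Since the objective is maximised, the policy that forgot how to complete plain text scores lower even though its rewards are identical.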

Another diagram for clarification/ simplicity

DATA SCALE: 10,000 to 100,000

The simplified diagram by OpenAI at the beginning should now be very easy to understand

Things to Note

  • These models have an enormous scale: GPT-3 has 175 billion parameters. To put that into perspective, full fine-tuning needs about 4 bytes per parameter for the fp32 weights (175B × 4 = 700 GB) plus 8 more bytes per parameter for Adam-style optimizer states, which brings the total to roughly 2.1 TB before gradients and activations
  • So maybe use a 7B model in 8-bit, which should put the model at about 7 GB of VRAM, and fine-tune adapter weights (any PEFT technique) for a significant reduction in training memory. That should leave ample space and precision to train on a 16 GB GPU
  • DPO (Direct Preference Optimization) has been found to be much more stable and efficient to train than PPO-based RLHF
  • To train PPO, see: Proximal Policy Optimization (PPO) is Easy With PyTorch | Full PPO Tutorial
  • Link for training a sample LLM with RLHF https://huggingface.co/blog/stackllama
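The memory figures in the notes above can be reproduced with simple arithmetic. This sketch counts only weights and optimizer state (gradients and activations are ignored), and the byte counts per parameter are the ones quoted in the notes.

```python
def gigabytes(n_params, bytes_per_param):
    """Memory in GB (10^9 bytes) for a given per-parameter cost."""
    return n_params * bytes_per_param / 1e9

gpt3_params = 175e9
small_params = 7e9

weights_fp32 = gigabytes(gpt3_params, 4)        # weights only
with_optimizer = gigabytes(gpt3_params, 4 + 8)  # weights + optimizer state
small_int8 = gigabytes(small_params, 1)         # 7B model, 1 byte per weight

print(weights_fp32, with_optimizer, small_int8)  # 700.0 2100.0 7.0
```

The 7 GB figure is for holding the quantised weights; the adapter weights and their optimizer state come on top, but they are small compared with the base model.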