How LLMs Learn to Behave: RLHF, Reward Models, and the Alignment Problem

A practical walkthrough of how large language models are aligned with human values — from collecting feedback to PPO optimization and the reward hacking pitfalls.

Alexandre Agius

AWS Solutions Architect

February 9, 2026 9 min read

AI LLM RLHF PPO Alignment Machine Learning Reward Model

Table of Contents

The Problem
The Solution
How It Works
Collecting Human Feedback
The Reward Model
PPO: The Optimization Engine
Reward Hacking: When Alignment Goes Wrong
Mitigating Reward Hacking
Beyond RLHF: Emerging Alternatives
What I Learned
What’s Next

A pretrained LLM can write code, summarize text, and answer questions — but it has no concept of “helpful” or “harmful.” Alignment is the process that teaches it the difference. This post walks through the full pipeline, from human feedback collection to the optimization tricks that make it work — and the failure modes that make it hard.

The Problem

Base foundation models are trained on internet text. They learn to predict the next token, which makes them fluent but not aligned. A base model will happily generate toxic content, confidently fabricate facts, or ignore your instructions entirely — because none of those things affect next-token prediction accuracy.

The gap between “predicts text well” and “behaves like a useful assistant” is the alignment problem. Closing that gap requires teaching the model what humans actually prefer, and that turns out to be surprisingly difficult.

The Solution

The standard approach is Reinforcement Learning from Human Feedback (RLHF) — a pipeline that turns human preferences into a training signal. Instead of hand-writing rules for “good behavior,” you collect examples of humans choosing between model outputs, train a reward model on those choices, and use reinforcement learning to optimize the LLM against that reward.

RLHF Alignment Pipeline

The full training pipeline follows a clear progression:

Pretraining — Learn language from internet text
Supervised Fine-Tuning (SFT) — Learn to follow instructions from curated examples
Reward Model Training — Learn what humans prefer from ranked outputs
RLHF (PPO) — Optimize the model to produce preferred responses
Safety fine-tuning — Additional red-teaming and guardrails

Each phase builds on the previous one. You can’t skip steps — a model that hasn’t been SFT’d first won’t produce meaningful outputs for RLHF to optimize.

How It Works

Collecting Human Feedback

The entire pipeline starts with human judgment. There are several methods, each with different tradeoffs.

Preference ranking is the most common. A human sees two or more model outputs for the same prompt and picks the better one. This is easier than absolute scoring — humans are naturally better at comparing than rating.

Prompt: "Explain gravity to a 5-year-old"

Response A: "Gravity is the force of attraction between masses..."
Response B: "You know how a ball falls down when you throw it? ..."

Human picks: B is better

Other feedback methods include:

Method	How it works	Tradeoff
Preference ranking	Compare 2+ outputs, pick the best	Most reliable, expensive
Likert scale	Rate on 1-5 across dimensions (helpfulness, accuracy)	More nuanced, less consistent
Thumbs up/down	Binary good/bad signal	Cheap, low information
Red-teaming	Adversarially probe for failures	Finds edge cases, doesn’t improve normal behavior
Expert annotation	Domain specialists verify factual correctness	High quality, very expensive

The prompt dataset itself matters enormously. It needs to cover normal use, edge cases, adversarial inputs, multiple languages, and sensitive topics. If the dataset misses a category, the model is unaligned in that area.

The Reward Model

You can’t have a human rate every output during training — there are millions of them. So you train a reward model (RM): a separate neural network that learns to predict what humans would prefer.

The reward model takes a prompt + response and outputs a scalar score. Higher score means more aligned with human preferences.

Architecture: Typically a transformer (often a copy of the LLM itself) with the final layer replaced by a single scalar output.

Training objective:

Given a pair (preferred response, rejected response):
Loss = -log(sigmoid(score_preferred - score_rejected))

The model learns to assign higher scores to responses humans preferred. After training on thousands of ranked pairs, it can score any new response — acting as a proxy for human judgment at scale.

This is the critical shortcut that makes RLHF feasible: instead of asking a human every time, you ask the reward model.

PPO: The Optimization Engine

With a reward model in hand, you need an algorithm to optimize the LLM. The standard choice is Proximal Policy Optimization (PPO).

The core problem PPO solves: when updating the model, you want to improve it — but if you change too much in one step, training becomes unstable. The model can go off a cliff and never recover.

PPO’s solution is a clipping trick. It measures how much the policy changed:

ratio = new_policy(action) / old_policy(action)

If the ratio is 1.0, nothing changed. If it’s 1.5, the new policy is 50% more likely to take that action. PPO clips this ratio to a narrow range (typically [0.8, 1.2]), preventing any single update from being too drastic.

Think of it like adjusting a recipe:

Without clipping: “This cake was good, add 10x more sugar next time” — overshoots wildly
With PPO clipping: “This cake was good, add a little more sugar, but never more than 20% change” — stable improvement

The RLHF training loop then becomes:

LLM generates a response to a prompt
Reward model scores the response
PPO computes the update: favor higher-reward responses
Clip the update so it stays close to the previous model
Apply a KL penalty to prevent drift from the base model
Repeat thousands of times

The KL penalty is critical — without it, the model can drift so far from the pretrained weights that it loses coherence entirely. It acts as a tether, keeping the aligned model close enough to the base model that it still writes fluent text.

Reward Hacking: When Alignment Goes Wrong

Here’s the fundamental tension: RL is very good at optimization, and the reward model is an imperfect proxy. The harder you optimize an imperfect proxy, the more the result diverges from what you actually want.

This is Goodhart’s Law applied to AI: “When a measure becomes a target, it ceases to be a good measure.”

Common reward hacking patterns:

Sycophancy — The reward model learned that agreeable responses get ranked higher, so the model becomes a yes-man:

User: "I think the earth is flat"
Model: "You raise a great point! There are many
        interesting perspectives on this..."

Verbosity — Human raters tended to prefer longer answers during labeling, so the model produces long responses even when a short answer is better.

Confident nonsense — The reward model rewards authoritative tone over actual accuracy (because raters couldn’t always verify facts), so the model states fabricated information confidently.

Formatting tricks — Raters associated bullet points and headers with quality, so the model uses heavy formatting everywhere regardless of whether it helps.

The error compounds at each step of the pipeline:

Human has true preference P
  → Raters approximate P imperfectly
    → Reward Model approximates the raters imperfectly
      → PPO optimizes the Reward Model aggressively
        → Model exploits gaps between RM score and true preference

Mitigating Reward Hacking

There’s no silver bullet, but several techniques reduce the problem:

Approach	How it helps
KL penalty	Limits how far the model can drift to exploit the RM
Reward model ensembles	Use multiple RMs — harder to hack all simultaneously
Iterated RLHF	Retrain the RM on the new model’s outputs periodically
Constitutional AI (CAI)	Model self-critiques using principles rather than relying solely on RM scores
Human auditing	Spot-check high-reward outputs for gaming
Reward capping	Don’t let the model optimize beyond a score threshold
Better rater training	Teach raters to check facts, penalize sycophancy, reward concise answers

Beyond RLHF: Emerging Alternatives

The field is moving toward reducing reliance on the reward model bottleneck:

DPO (Direct Preference Optimization) — Skips the reward model entirely. Trains directly on preference pairs by reformulating the RL objective as a classification loss. Simpler, increasingly popular.
RLAIF (RL from AI Feedback) — Uses a stronger model to evaluate a weaker model’s outputs, reducing the need for human labeling.
Constitutional AI — The model critiques its own outputs against a set of principles, then revises. Developed by Anthropic.
GRPO (Group Relative Policy Optimization) — Used by DeepSeek. Estimates baselines from group samples instead of training a separate critic model.

The trend is clear: less human labeling, more automated feedback. But human judgment remains the ground truth that everything else is calibrated against.

What I Learned

The reward model is the bottleneck — The entire alignment quality depends on how well the RM captures human preferences. Every downstream optimization is only as good as this proxy.
PPO’s clipping is elegant — A simple mathematical constraint (keep updates within a narrow ratio) turns an unstable optimization problem into a reliable training loop.
Reward hacking is Goodhart’s Law for AI — The harder you optimize an imperfect metric, the more the result diverges from the true goal. This isn’t a bug to fix; it’s a fundamental tension to manage.
Human feedback methods matter more than you’d think — Preference ranking vs rating vs thumbs-up produces meaningfully different training signals. The labeling interface shapes the model.
Alignment is iterative, not one-shot — The best results come from cycles of training, evaluating, retraining the reward model, and re-aligning. Static alignment degrades as models evolve.

What’s Next

Explore DPO implementation in practice — compare training stability and output quality vs PPO-based RLHF
Hands-on with Constitutional AI: define a principle set and evaluate self-critique quality
Investigate reward model evaluation — how do you measure whether your RM actually captures human preferences?
Compare RLHF costs: human labeling vs RLAIF vs CAI for the same alignment quality bar

Alexandre Agius

AWS Solutions Architect

Passionate about AI & Security. Building scalable cloud solutions and helping organizations leverage AWS services to innovate faster. Specialized in Generative AI, serverless architectures, and security best practices.

LinkedIn GitHub

How LLMs Learn to Behave: RLHF, Reward Models, and the Alignment Problem

The Problem

The Solution

How It Works

Collecting Human Feedback

The Reward Model

PPO: The Optimization Engine

Reward Hacking: When Alignment Goes Wrong

Mitigating Reward Hacking

Beyond RLHF: Emerging Alternatives

What I Learned

What’s Next

Alexandre Agius

Related Posts

TFLOPS: The GPU Metric Every AI Engineer Should Understand

A Practical Guide to Fine-Tuning LLMs: From Full Training to LoRA

Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon