DPO Finetuning
Direct Preference Optimization (DPO) is a practical alignment technique that trains a model directly from preference pairs (chosen vs. rejected), without the separate reward model and RL loop of full RLHF.
Learning goals
- Understand the “chosen vs rejected” objective intuitively
- Prepare a clean preference dataset
- Run a DPO training loop (typically via TRL) and evaluate results
The core idea
For each prompt, you collect two completions:
- chosen: the preferred answer
- rejected: a worse answer
DPO nudges the model to increase the log-probability of the chosen response relative to the rejected one, measured against a frozen reference model (usually the starting checkpoint) so the policy does not drift too far from where it began.
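Concretely, the per-pair DPO loss compares policy-vs-reference log-probabilities of the two completions. A minimal sketch on toy numbers (the log-prob values below are made up for illustration; in practice they come from summing token log-probs under each model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    the chosen-vs-rejected gap in policy-to-reference log-ratios."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already prefers the chosen answer slightly,
# so the loss is a bit below -log(0.5) ~= 0.693.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
```

Minimizing this loss pushes the margin up, i.e. it raises the chosen completion's probability relative to the rejected one; `beta` controls how strongly the policy is allowed to deviate from the reference.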
Dataset format (typical)
```json
{
  "prompt": "Explain RAG in 3 sentences.",
  "chosen": "RAG retrieves relevant documents...",
  "rejected": "RAG is when the model guesses..."
}
```
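Before training, it pays to validate every record against this schema. A stdlib-only sketch for one JSONL line; the specific checks are illustrative:

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_pair(line: str) -> dict:
    """Parse one JSONL record and enforce the preference-pair schema."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if record["chosen"].strip() == record["rejected"].strip():
        raise ValueError("chosen and rejected must differ")
    return record

record = validate_pair(
    '{"prompt": "Explain RAG in 3 sentences.", '
    '"chosen": "RAG retrieves relevant documents...", '
    '"rejected": "RAG is when the model guesses..."}'
)
```

Running this over the whole file before training catches malformed or degenerate pairs early, which is far cheaper than debugging a finished run.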
What makes a good preference dataset
- Chosen/rejected must be comparable (same prompt, similar length)
- Rejected should be plausible, not nonsense (trivially bad rejections make the margin easy and teach shallow cues rather than your actual preferences)
- Include your real constraints: tone, safety policy, citation rules
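The comparability rule can be enforced with a simple filter. A sketch, assuming answers are compared by character length; the 2.0 ratio threshold is an arbitrary illustration, not a recommended value:

```python
def is_comparable(pair: dict, max_len_ratio: float = 2.0) -> bool:
    """Keep pairs whose chosen/rejected answers are similar in length,
    so DPO learns content preferences rather than a length preference."""
    lens = sorted([len(pair["chosen"]), len(pair["rejected"])])
    if lens[0] == 0:
        return False
    return lens[1] / lens[0] <= max_len_ratio

pairs = [
    {"chosen": "A detailed, cited answer.",
     "rejected": "A plausible but uncited answer."},
    {"chosen": "A full paragraph with several sentences of explanation.",
     "rejected": "No."},
]
kept = [p for p in pairs if is_comparable(p)]  # second pair is filtered out
```

Filtering on a length ratio is a crude but common precaution: if chosen answers are systematically longer, DPO happily learns "longer is better" instead of the preference you care about.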
Training workflow (high-level)
- Start from a solid base model or SFT checkpoint
- Train with DPO on preference pairs
- Evaluate on:
- style + policy adherence
- helpfulness
- regression tests (safety, formatting)
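The regression-test step can be as simple as running rule-based checks over a fixed set of model outputs before and after training. The specific rules below (a `[n]` citation pattern and a banned-phrase check) are placeholder examples, not a real policy:

```python
import re

def regression_report(outputs: list[str]) -> dict:
    """Run rule-based checks over model outputs and report pass rates.
    The rules are illustrative; adapt them to your formatting/safety policy."""
    checks = {
        "has_citation": lambda text: bool(re.search(r"\[\d+\]", text)),
        "no_filler": lambda text: "as an ai" not in text.lower(),
    }
    return {
        name: sum(check(o) for o in outputs) / len(outputs)
        for name, check in checks.items()
    }

report = regression_report([
    "RAG retrieves documents before generating [1].",
    "As an AI, I cannot say.",
])
```

Comparing the pass rates from the pre-DPO and post-DPO checkpoints tells you whether the run improved adherence or quietly regressed something.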
DPO can over-optimize style
If you optimize only for "tone", you can degrade factual accuracy while the preference loss keeps improving. Keep evaluation sets that measure correctness and faithfulness, not just style.
Mini-lab (optional)
Build a preference dataset for your course assistant:
- chosen answers follow a strict citation rule
- rejected answers contain hallucinated citations
- train DPO and measure faithfulness deltas with a small test set
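For the faithfulness delta, one simple proxy is the fraction of answers whose citations all point at real retrieved sources. A sketch under the assumption that citations use an `[n]` format and that `allowed_sources` holds the valid source ids:

```python
import re

def cites_faithfully(answer: str, allowed_sources: set[int]) -> bool:
    """True if the answer cites at least one source and every [n]-style
    citation refers to a source that was actually retrieved."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and cited <= allowed_sources

def faithfulness(answers: list[str], allowed_sources: set[int]) -> float:
    return sum(cites_faithfully(a, allowed_sources) for a in answers) / len(answers)

sources = {1, 2}
before = ["See [1] and [4].", "Per [2], RAG grounds answers."]  # [4] is hallucinated
after = ["See [1].", "Per [2], RAG grounds answers."]
delta = faithfulness(after, sources) - faithfulness(before, sources)
```

A string-matching proxy like this is coarse (it cannot catch an answer that cites a real source but misrepresents it), so treat it as a cheap first signal alongside spot-checking by hand.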
Where this fits
DPO is an alignment layer on top of Finetuning Strategy and complements guardrails (Week 8).