DPO Finetuning
Direct Preference Optimization (DPO) is a practical alignment technique that trains a model directly from preference pairs (chosen vs. rejected), without the separate reward model and RL loop of full RLHF.
Learning goals
- Understand the “chosen vs rejected” objective intuitively
- Prepare a clean preference dataset
- Run a DPO training loop (typically via TRL) and evaluate results
The core idea
For each prompt, you collect two completions:
- chosen: the preferred answer
- rejected: a worse answer
DPO nudges the model to increase the log-probability of the chosen response relative to the rejected one, measured against a frozen reference model (usually the starting checkpoint) so the policy does not drift too far from where it began.
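Concretely, the per-pair DPO loss compares policy-vs-reference log-probabilities of the two completions. A minimal sketch on toy numbers (the log-prob values below are made up for illustration; in practice they come from summing token log-probs under each model):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * margin), where the margin is
    the chosen-vs-rejected gap in policy-to-reference log-ratios."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy already prefers the chosen answer slightly,
# so the loss is a bit below -log(0.5) ~= 0.693.
loss = dpo_loss(-12.0, -15.0, -13.0, -14.0, beta=0.1)
```

Minimizing this loss pushes the margin up, i.e. it raises the chosen completion's probability relative to the rejected one; `beta` controls how strongly the policy is allowed to deviate from the reference.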
Dataset format (typical)
```json
{
  "prompt": "Explain RAG in 3 sentences.",
  "chosen": "RAG retrieves relevant documents...",
  "rejected": "RAG is when the model guesses..."
}
```
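Before training, it pays to validate every record against this schema. A stdlib-only sketch for one JSONL line; the specific checks are illustrative:

```python
import json

REQUIRED_KEYS = {"prompt", "chosen", "rejected"}

def validate_pair(line: str) -> dict:
    """Parse one JSONL record and enforce the preference-pair schema."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if record["chosen"].strip() == record["rejected"].strip():
        raise ValueError("chosen and rejected must differ")
    return record

record = validate_pair(
    '{"prompt": "Explain RAG in 3 sentences.", '
    '"chosen": "RAG retrieves relevant documents...", '
    '"rejected": "RAG is when the model guesses..."}'
)
```

Running this over the whole file before training catches malformed or degenerate pairs early, which is far cheaper than debugging a finished run.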
What makes a good preference dataset
- Chosen/rejected must be comparable (same prompt, similar length)
- Rejected should be plausible, not nonsense (trivially bad rejections make the margin easy and teach shallow cues rather than your actual preferences)
- Include your real constraints: tone, safety policy, citation rules
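The comparability rule can be enforced with a simple filter. A sketch, assuming answers are compared by character length; the 2.0 ratio threshold is an arbitrary illustration, not a recommended value:

```python
def is_comparable(pair: dict, max_len_ratio: float = 2.0) -> bool:
    """Keep pairs whose chosen/rejected answers are similar in length,
    so DPO learns content preferences rather than a length preference."""
    lens = sorted([len(pair["chosen"]), len(pair["rejected"])])
    if lens[0] == 0:
        return False
    return lens[1] / lens[0] <= max_len_ratio

pairs = [
    {"chosen": "A detailed, cited answer.",
     "rejected": "A plausible but uncited answer."},
    {"chosen": "A full paragraph with several sentences of explanation.",
     "rejected": "No."},
]
kept = [p for p in pairs if is_comparable(p)]  # second pair is filtered out
```

Filtering on a length ratio is a crude but common precaution: if chosen answers are systematically longer, DPO happily learns "longer is better" instead of the preference you care about.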
Training workflow (high-level)
- Start from a solid base model or SFT checkpoint
- Train with DPO on preference pairs
- Evaluate on:
- style + policy adherence
- helpfulness
- regression tests (safety, formatting)
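The regression-test step can be as simple as running rule-based checks over a fixed set of model outputs before and after training. The specific rules below (a `[n]` citation pattern and a banned-phrase check) are placeholder examples, not a real policy:

```python
import re

def regression_report(outputs: list[str]) -> dict:
    """Run rule-based checks over model outputs and report pass rates.
    The rules are illustrative; adapt them to your formatting/safety policy."""
    checks = {
        "has_citation": lambda text: bool(re.search(r"\[\d+\]", text)),
        "no_filler": lambda text: "as an ai" not in text.lower(),
    }
    return {
        name: sum(check(o) for o in outputs) / len(outputs)
        for name, check in checks.items()
    }

report = regression_report([
    "RAG retrieves documents before generating [1].",
    "As an AI, I cannot say.",
])
```

Comparing the pass rates from the pre-DPO and post-DPO checkpoints tells you whether the run improved adherence or quietly regressed something.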
DPO can over-optimize style
If you optimize only for "tone", you can degrade factual accuracy while the preference loss keeps improving. Keep evaluation sets that measure correctness and faithfulness, not just style.
Mini-lab (optional)
Build a preference dataset for your course assistant:
- chosen answers follow a strict citation rule
- rejected answers contain hallucinated citations
- train DPO and measure faithfulness deltas with a small test set
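For the faithfulness delta, one simple proxy is the fraction of answers whose citations all point at real retrieved sources. A sketch under the assumption that citations use an `[n]` format and that `allowed_sources` holds the valid source ids:

```python
import re

def cites_faithfully(answer: str, allowed_sources: set[int]) -> bool:
    """True if the answer cites at least one source and every [n]-style
    citation refers to a source that was actually retrieved."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return bool(cited) and cited <= allowed_sources

def faithfulness(answers: list[str], allowed_sources: set[int]) -> float:
    return sum(cites_faithfully(a, allowed_sources) for a in answers) / len(answers)

sources = {1, 2}
before = ["See [1] and [4].", "Per [2], RAG grounds answers."]  # [4] is hallucinated
after = ["See [1].", "Per [2], RAG grounds answers."]
delta = faithfulness(after, sources) - faithfulness(before, sources)
```

A string-matching proxy like this is coarse (it cannot catch an answer that cites a real source but misrepresents it), so treat it as a cheap first signal alongside spot-checking by hand.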
Where this fits
DPO is an alignment layer on top of Finetuning Strategy and complements guardrails (Week 8).