DPO Finetuning

Direct Preference Optimization (DPO) is a practical alignment technique that trains a model directly from preference pairs (chosen vs rejected), without RLHF's extra machinery of a separate reward model and a reinforcement-learning loop.

Learning goals

  • Understand the “chosen vs rejected” objective intuitively
  • Prepare a clean preference dataset
  • Run a DPO training loop (typically via TRL) and evaluate results

The core idea

For each prompt, you collect two completions:

  • chosen: the preferred answer
  • rejected: a worse answer

DPO nudges the model to increase the probability of the chosen response relative to the rejected one, with both measured against a frozen reference model that keeps the policy from drifting too far from its starting point.
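That nudge has a concrete form: the per-example DPO loss compares how much the policy has raised each completion's log-probability over the reference model. A minimal sketch in plain Python (β and the toy log-probabilities below are illustrative values, not recommendations):

```python
import math

def dpo_loss(chosen_logp, rejected_logp, ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss from sequence log-probabilities.

    chosen_logp / rejected_logp: log p(completion | prompt) under the policy.
    ref_*: the same quantities under the frozen reference model.
    """
    # Implicit "reward" = beta * log-ratio of policy vs reference.
    chosen_reward = beta * (chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (rejected_logp - ref_rejected_logp)
    # -log sigmoid(margin): small when the chosen response outranks the rejected one.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Widening the margin in favour of the chosen response lowers the loss.
loss_bad = dpo_loss(-12.0, -10.0, -11.0, -11.0)   # policy prefers the rejected answer
loss_good = dpo_loss(-9.0, -13.0, -11.0, -11.0)   # policy prefers the chosen answer
```

Minimizing this loss pushes the chosen/rejected margin up, while the reference terms penalize drifting far from the starting model.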

Dataset format (typical)

```json
{
  "prompt": "Explain RAG in 3 sentences.",
  "chosen": "RAG retrieves relevant documents...",
  "rejected": "RAG is when the model guesses..."
}
```
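In practice you usually store one such object per line (JSONL), which most trainers accept directly. A small sketch of writing and reading that format (the example pair is the one above; real data would come from human or LLM judgments):

```python
import json

# Illustrative preference pairs in the {"prompt", "chosen", "rejected"} schema.
pairs = [
    {
        "prompt": "Explain RAG in 3 sentences.",
        "chosen": "RAG retrieves relevant documents...",
        "rejected": "RAG is when the model guesses...",
    },
]

# One JSON object per line (JSONL).
with open("preferences.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")

# Reading it back row by row.
with open("preferences.jsonl", encoding="utf-8") as f:
    rows = [json.loads(line) for line in f]
```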

What makes a good preference dataset

  • Chosen/rejected must be comparable (same prompt, similar length)
  • Rejected should be plausible, not nonsense (otherwise you overfit)
  • Include your real constraints: tone, safety policy, citation rules
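These checks can be partly automated. A sketch of simple validation heuristics (the field names match the schema above; the length-ratio threshold is an illustrative choice, not a standard):

```python
def check_pair(pair, max_len_ratio=3.0):
    """Flag common quality problems in a preference pair (illustrative heuristics)."""
    problems = []
    # Every field must exist and be non-empty.
    if not all(pair.get(k, "").strip() for k in ("prompt", "chosen", "rejected")):
        problems.append("missing or empty field")
        return problems
    # Identical completions carry no preference signal.
    if pair["chosen"] == pair["rejected"]:
        problems.append("chosen and rejected are identical")
    # Large length gaps teach the model to prefer length, not quality.
    lens = len(pair["chosen"]), len(pair["rejected"])
    if max(lens) > max_len_ratio * min(lens):
        problems.append("length mismatch: model may learn length, not quality")
    return problems

ok = {"prompt": "p", "chosen": "a detailed answer", "rejected": "a weaker answer"}
bad = {"prompt": "p", "chosen": "x" * 500, "rejected": "no"}
```

Running `check_pair` over the whole dataset before training catches the cheap failure modes; comparability of content still needs human spot-checks.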

Training workflow (high-level)

  1. Start from a solid base model or SFT checkpoint
  2. Train with DPO on preference pairs
  3. Evaluate on:
    • style + policy adherence
    • helpfulness
    • regression tests (safety, formatting)
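A hedged configuration sketch of step 2 with TRL. The checkpoint name is a placeholder, the hyperparameters are illustrative, and argument names have shifted between TRL versions, so check the docs for your installed version:

```python
# Sketch only: assumes `trl`, `transformers`, and `datasets` are installed,
# and that preferences.jsonl holds {"prompt", "chosen", "rejected"} rows.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "your-sft-checkpoint"  # placeholder: start from a solid SFT model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

train_dataset = load_dataset("json", data_files="preferences.jsonl", split="train")

args = DPOConfig(
    output_dir="dpo-out",
    beta=0.1,                       # strength of the pull toward the reference model
    per_device_train_batch_size=2,  # illustrative; size to your hardware
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    processing_class=tokenizer,     # older TRL versions call this `tokenizer=`
)
# With no explicit ref_model, TRL clones the starting weights as the frozen reference.
trainer.train()
```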

DPO can over-optimize style

If you only optimize “tone”, you can degrade factual accuracy. Keep eval sets that measure correctness/faithfulness.

Mini-lab (optional)

Build a preference dataset for your course assistant:

  • chosen answers follow a strict citation rule
  • rejected answers contain hallucinated citations
  • train DPO and measure faithfulness deltas with a small test set
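For the faithfulness measurement, a small scoring sketch. The bracketed `[docN]` citation tags and the helper names are made-up conventions for this lab, not a standard format:

```python
import re

def hallucinated_citations(answer, allowed_sources):
    """Return citation tags in `answer` that are not in the allowed source list.

    Assumes citations look like bracketed tags, e.g. [doc3] (a convention
    invented for this sketch).
    """
    cited = set(re.findall(r"\[(\w+)\]", answer))
    return sorted(cited - set(allowed_sources))

def faithfulness_rate(answers, allowed_sources):
    """Fraction of answers whose citations all resolve to allowed sources."""
    ok = sum(1 for a in answers if not hallucinated_citations(a, allowed_sources))
    return ok / len(answers)

# Toy before/after comparison on the same test prompts.
allowed = {"doc1", "doc2"}
before = ["RAG retrieves documents [doc1].", "See [doc9] for details."]
after = ["RAG retrieves documents [doc1].", "See [doc2] for details."]
```

Scoring the same test prompts before and after DPO training gives the faithfulness delta the lab asks for.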

Where this fits

DPO is an alignment layer on top of Finetuning Strategy and complements guardrails (Week 8).