Finetuning Strategy

One of the most common questions in AI engineering is: "Should I finetune, use RAG, or just write better prompts?" The answer depends on your specific use case, data, budget, and accuracy requirements. This page provides a decision framework.

The Three Approaches

1. Prompt Engineering

Crafting better instructions, examples, and system prompts to get the desired output.

```python
from openai import OpenAI

client = OpenAI()

# Simple prompt engineering: rules and output format live in the system prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are a medical coding assistant.
Map the given clinical description to the correct ICD-10 code.
Rules:
- Always use the most specific code available
- If unsure between two codes, choose the more specific one
- Format: CODE - DESCRIPTION""",
        },
        {"role": "user", "content": "Patient has Type 2 diabetes with diabetic chronic kidney disease"},
    ],
)
```

Best for: Quick iteration, general tasks, when the model already has the knowledge.

2. RAG (Retrieval-Augmented Generation)

Providing relevant context from an external knowledge base at inference time.

```python
# RAG approach: retrieve relevant reference material, then answer from it
relevant_docs = vector_store.search("Type 2 diabetes with CKD ICD-10", top_k=5)
context = "\n".join(doc.text for doc in relevant_docs)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using only the provided reference material."},
        {"role": "user", "content": f"Reference:\n{context}\n\nQuestion: Map 'Type 2 diabetes with CKD' to ICD-10"},
    ],
)
```

Best for: Knowledge-heavy tasks, frequently updating information, when you need citations.

3. Finetuning

Training the model's weights on your specific data to change its behavior, style, or domain knowledge.

```python
# After finetuning, the model has internalized the domain knowledge
response = client.chat.completions.create(
    model="ft:gpt-4o:my-org:medical-coder:abc123",
    messages=[
        {"role": "user", "content": "Type 2 diabetes with diabetic chronic kidney disease"},
    ],
)
```

Best for: Style/tone changes, domain-specific reasoning, reducing latency, cost reduction at scale.

Decision Framework

Use this decision tree to choose the right approach:

```
Is the task about style/format/tone?
├── YES → Finetune (small dataset, fast results)
└── NO → Does the model already know the information?
    ├── YES → Prompt Engineering (cheapest, fastest)
    └── NO → Is the information frequently updated?
        ├── YES → RAG (update knowledge base, no retraining)
        └── NO → Is the reasoning domain-specific and complex?
            ├── YES → Finetune (embeds reasoning patterns)
            └── NO → Start with RAG + better prompts
```
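The same tree can be sketched as a plain function, useful as a starting point for team discussions. The function name and boolean flags below are illustrative assumptions, not part of any API:

```python
def choose_approach(
    style_task: bool,
    model_knows: bool,
    data_changes_often: bool,
    complex_domain_reasoning: bool,
) -> str:
    """Encode the decision tree above (illustrative sketch)."""
    if style_task:
        return "finetune"            # small dataset, fast results
    if model_knows:
        return "prompt engineering"  # cheapest, fastest
    if data_changes_often:
        return "RAG"                 # update knowledge base, no retraining
    if complex_domain_reasoning:
        return "finetune"            # embeds reasoning patterns
    return "RAG + better prompts"    # sensible default starting point

# Example: knowledge-heavy task over data that changes weekly
print(choose_approach(False, False, True, False))  # → RAG
```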

Comparison Matrix

| Criterion | Prompt Engineering | RAG | Finetuning |
|---|---|---|---|
| Setup cost | None | Medium (vector DB, pipeline) | High (data prep, training) |
| Per-query cost | High (long prompts) | Medium (context + query) | Low (shorter prompts) |
| Latency | High (long prompts) | Medium (retrieval + LLM) | Low (efficient inference) |
| Knowledge freshness | Static (cutoff) | Real-time (update DB) | Static (retrain to update) |
| Reasoning quality | Good | Good (with context) | Best (internalized) |
| Style/tone control | Moderate | Limited | Excellent |
| Data requirements | None | Documents | 100-10K examples |
| Technical complexity | Low | Medium | High |

When Finetuning Wins

1. Consistent Output Format

If you need the model to always output in a specific structured format:

```python
# Finetuning data for structured extraction (OpenAI chat JSONL format)
training_example = {
    "messages": [
        {"role": "system", "content": "Extract medical entities from clinical text."},
        {"role": "user", "content": "Patient presents with acute appendicitis, scheduled for laparoscopic appendectomy."},
        {"role": "assistant", "content": '{"conditions": ["acute appendicitis"], "procedures": ["laparoscopic appendectomy"], "body_parts": ["appendix"], "urgency": "acute"}'},
    ]
}
```

2. Domain-Specific Reasoning

When the model needs to learn reasoning patterns specific to your domain (medical diagnosis, legal analysis, code review).
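One way to embed such reasoning patterns is to include the intermediate steps, not just the final answer, in the training targets. A hypothetical example in the same chat JSONL format (the clinical content is illustrative):

```python
import json

# Hypothetical training example whose assistant turn demonstrates the
# reasoning path (specificity choice), not just the final code.
reasoning_example = {
    "messages": [
        {"role": "system", "content": "Map clinical text to ICD-10, explaining the specificity choice."},
        {"role": "user", "content": "Type 2 diabetes with diabetic chronic kidney disease"},
        {"role": "assistant", "content": (
            "Both conditions are documented, so a combination code applies. "
            "E11.22 (Type 2 diabetes mellitus with diabetic chronic kidney disease) "
            "is more specific than coding E11.9 and N18.9 separately.\n"
            "Answer: E11.22"
        )},
    ]
}
print(json.dumps(reasoning_example)[:60])  # one JSONL line per example
```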

3. Cost Reduction at Scale

Finetuning a smaller model to match a larger model's performance on your specific task:

```python
# Before: GPT-4o for every request ($$$)
# After: finetuned GPT-4o-mini ($, ~20x cheaper)
# If accuracy is comparable, savings are enormous at scale

# Illustrative cost comparison for 1M requests/month:
gpt4o_monthly = 5_000      # GPT-4o: ~$5,000/month
mini_monthly = 250         # GPT-4o-mini: ~$250/month (finetuned to match quality)
training_one_time = 100    # Training: ~$100 one-time
net_savings = gpt4o_monthly - mini_monthly - training_one_time
# Net savings: ~$4,650 in the first month, ~$4,750/month thereafter
```

4. Reduced Latency

Finetuned models work with shorter prompts (no few-shot examples required), reducing both token count and response time.
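As a rough illustration, compare the few-shot prompt a base model might need against the bare input a finetuned model accepts (the prompts are made up, and character count is only a crude proxy for tokens):

```python
# Few-shot prompt for the base model vs. the short prompt a finetuned
# model can use for the same task.
few_shot_prompt = (
    "You are a medical coder. Map clinical text to ICD-10.\n"
    "Example 1: 'acute appendicitis' -> K35.80\n"
    "Example 2: 'essential hypertension' -> I10\n"
    "Example 3: 'type 2 diabetes, no complications' -> E11.9\n"
    "Now code: Type 2 diabetes with diabetic chronic kidney disease"
)
finetuned_prompt = "Type 2 diabetes with diabetic chronic kidney disease"

saved = 1 - len(finetuned_prompt) / len(few_shot_prompt)
print(f"prompt shrinks by ~{saved:.0%}")
```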

When NOT to Finetune

Finetuning Anti-Patterns
  • Don't finetune to add knowledge — use RAG instead. Finetuning is unreliable for factual recall.
  • Don't finetune with < 100 examples — you won't see meaningful improvement.
  • Don't finetune when prompts work — start simple, finetune only when you hit a ceiling.
  • Don't finetune for rapidly changing data — you'd need to retrain constantly.

The Hybrid Approach

The best production systems combine all three:

```python
class HybridMedicalCoder:
    def __init__(self):
        self.model = "ft:gpt-4o-mini:med-coder:v2"        # Finetuned for format/reasoning
        self.knowledge_base = VectorStore("icd10_codes")  # RAG for latest codes

    async def code(self, clinical_text: str) -> dict:
        # Step 1: RAG retrieves the latest coding guidelines
        docs = await self.knowledge_base.search(clinical_text, top_k=3)
        guidelines = "\n".join(doc.text for doc in docs)

        # Step 2: Finetuned model applies domain reasoning to the RAG context
        response = await client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Code the clinical text using provided guidelines."},
                {"role": "user", "content": f"Guidelines:\n{guidelines}\n\nClinical text: {clinical_text}"},
            ],
        )
        return parse_response(response)
```

Data Preparation Checklist

Before finetuning, ensure your training data is:

  • Sufficient: At least 100-500 high-quality examples (more = better)
  • Diverse: Covers the full range of expected inputs
  • Consistent: Same format, style, and quality throughout
  • Clean: No errors, contradictions, or low-quality examples
  • Split: 90% training, 10% validation
  • Evaluated: You have a clear metric to measure improvement
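Parts of this checklist can be automated. A minimal sketch, assuming the training data is a chat-format JSONL file (one `{"messages": [...]}` object per line); the function name and validation rules are illustrative, and the split ratio matches the checklist:

```python
import json
import random

def validate_and_split(path: str, val_fraction: float = 0.1):
    """Check basic JSONL structure and return (train, val) splits."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            ex = json.loads(line)  # raises on malformed JSON
            roles = [m.get("role") for m in ex.get("messages", [])]
            assert "assistant" in roles, f"line {i}: no assistant target"
            examples.append(ex)
    random.shuffle(examples)          # avoid ordered-data bias in the split
    n_val = max(1, int(len(examples) * val_fraction))
    return examples[n_val:], examples[:n_val]
```

Checks like "diverse" and "clean" still need human review, but structural validation catches the errors that silently degrade a training run.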