Finetuning Strategy
One of the most common questions in AI engineering is: "Should I finetune, use RAG, or just write better prompts?" The answer depends on your specific use case, data, budget, and accuracy requirements. This page provides a decision framework.
The Three Approaches
1. Prompt Engineering
Crafting better instructions, examples, and system prompts to get the desired output.
# Simple prompt engineering (assumes the official openai Python SDK)
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are a medical coding assistant.
Map the given clinical description to the correct ICD-10 code.
Rules:
- Always use the most specific code available
- If unsure between two codes, choose the more specific one
- Format: CODE - DESCRIPTION""",
        },
        {"role": "user", "content": "Patient has Type 2 diabetes with diabetic chronic kidney disease"},
    ],
)
Best for: Quick iteration, general tasks, when the model already has the knowledge.
2. RAG (Retrieval-Augmented Generation)
Providing relevant context from an external knowledge base at inference time.
# RAG approach: retrieve relevant reference material, then ground the answer in it
relevant_docs = vector_store.search("Type 2 diabetes with CKD ICD-10", top_k=5)
context = "\n".join(doc.text for doc in relevant_docs)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using only the provided reference material."},
        {"role": "user", "content": f"Reference:\n{context}\n\nQuestion: Map 'Type 2 diabetes with CKD' to ICD-10"},
    ],
)
Best for: Knowledge-heavy tasks, frequently updating information, when you need citations.
3. Finetuning
Training the model's weights on your specific data to change its behavior, style, or domain knowledge.
# After finetuning, the model inherently knows the domain:
# no system prompt or few-shot examples needed
response = client.chat.completions.create(
    model="ft:gpt-4o:my-org:medical-coder:abc123",
    messages=[
        {"role": "user", "content": "Type 2 diabetes with diabetic chronic kidney disease"},
    ],
)
Best for: Style/tone changes, domain-specific reasoning, reducing latency, cost reduction at scale.
Decision Framework
Use this decision tree to choose the right approach:
Is the task about style/format/tone?
├── YES → Finetune (small dataset, fast results)
└── NO → Does the model already know the information?
    ├── YES → Prompt Engineering (cheapest, fastest)
    └── NO → Is the information frequently updated?
        ├── YES → RAG (update knowledge base, no retraining)
        └── NO → Is the reasoning domain-specific and complex?
            ├── YES → Finetune (embeds reasoning patterns)
            └── NO → Start with RAG + better prompts
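The tree above can be sketched as a small helper function. The flag names and return labels are illustrative, not a real API; answer each question in the same order the tree asks them:

```python
def choose_approach(
    style_task: bool,
    model_knows: bool,
    fast_changing: bool,
    complex_domain_reasoning: bool,
) -> str:
    """Mirror the decision tree: each branch is one question from the diagram."""
    if style_task:
        return "finetune"
    if model_knows:
        return "prompt-engineering"
    if fast_changing:
        return "rag"
    if complex_domain_reasoning:
        return "finetune"
    return "rag + better prompts"

# Example: knowledge the model lacks, and it changes frequently
print(choose_approach(False, False, True, False))  # → rag
```

Encoding the tree as code also makes it easy to document (and revisit) the team's routing decisions for each new use case.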
Comparison Matrix
| Criterion | Prompt Engineering | RAG | Finetuning |
|---|---|---|---|
| Setup cost | None | Medium (vector DB, pipeline) | High (data prep, training) |
| Per-query cost | High (long prompts) | Medium (context + query) | Low (shorter prompts) |
| Latency | High (long prompts) | Medium (retrieval + LLM) | Low (efficient inference) |
| Knowledge freshness | Static (cutoff) | Real-time (update DB) | Static (retrain to update) |
| Reasoning quality | Good | Good (with context) | Best (internalized) |
| Style/tone control | Moderate | Limited | Excellent |
| Data requirements | None | Documents | 100-10K examples |
| Technical complexity | Low | Medium | High |
When Finetuning Wins
1. Consistent Output Format
If you need the model to always output in a specific structured format:
# Finetuning data for structured extraction (OpenAI chat-format JSONL)
training_example = {
    "messages": [
        {"role": "system", "content": "Extract medical entities from clinical text."},
        {"role": "user", "content": "Patient presents with acute appendicitis, scheduled for laparoscopic appendectomy."},
        {"role": "assistant", "content": '{"conditions": ["acute appendicitis"], "procedures": ["laparoscopic appendectomy"], "body_parts": ["appendix"], "urgency": "acute"}'},
    ]
}
2. Domain-Specific Reasoning
When the model needs to learn reasoning patterns specific to your domain (medical diagnosis, legal analysis, code review).
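A hypothetical training example for this case: the assistant turn spells out the reasoning pattern you want the model to internalize, not just the final answer. The exact wording and the ICD-10 code shown are illustrative:

```python
# Hypothetical finetuning example: the assistant message demonstrates the
# step-by-step coding logic, so the model learns the pattern, not just the label.
reasoning_example = {
    "messages": [
        {"role": "system", "content": "You are a medical coding assistant."},
        {"role": "user", "content": "Type 2 diabetes with diabetic chronic kidney disease"},
        {
            "role": "assistant",
            "content": (
                "Reasoning: Type 2 diabetes is category E11. Chronic kidney disease "
                "is listed as a diabetic complication, so use the combination code "
                "rather than coding the two conditions separately.\n"
                "Answer: E11.22 - Type 2 diabetes mellitus with diabetic chronic kidney disease"
            ),
        },
    ]
}
```

Hundreds of examples in this shape teach the model the domain's decision procedure, which is exactly what prompt engineering struggles to convey reliably.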
3. Cost Reduction at Scale
Finetuning a smaller model to match a larger model's performance on your specific task:
# Before: Using GPT-4o for every request ($$$)
# After: Finetuned GPT-4o-mini ($, ~20x cheaper)
# If accuracy is comparable, savings are enormous at scale.
#
# Illustrative cost comparison for 1M requests/month:
#   GPT-4o:       ~$5,000/month
#   GPT-4o-mini:  ~$250/month (finetuned to match quality)
#   Training:     ~$100 one-time
#   Net savings:  ~$4,750/month (less the $100 training cost in month one)
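Whether the one-time training cost pays off depends on volume. A back-of-the-envelope break-even check, using the illustrative figures above (not quoted rates):

```python
def breakeven_requests(big_cost_per_req: float,
                       small_cost_per_req: float,
                       training_cost: float) -> float:
    """Number of requests needed to recover the one-time training cost."""
    saving_per_req = big_cost_per_req - small_cost_per_req
    return training_cost / saving_per_req

# Illustrative numbers from above: $5,000 vs $250 per 1M requests
big = 5000 / 1_000_000    # $ per request on the large model
small = 250 / 1_000_000   # $ per request on the finetuned small model
print(round(breakeven_requests(big, small, training_cost=100)))  # → 21053
```

At roughly 21k requests the training cost is recovered; at 1M requests/month that happens within the first day.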
4. Reduced Latency
Finetuned models work with shorter prompts (no few-shot examples required), which reduces both token count and response time.
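The latency win is mostly a function of prompt length. A rough sketch comparing a few-shot prompt with the short prompt a finetuned model needs; token counts are approximated with the common ~4 characters/token rule of thumb, and the example prompts are invented:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

# A typical few-shot prompt carries instructions plus worked examples
few_shot_prompt = (
    "You are a medical coder. Examples:\n"
    "Input: acute appendicitis -> K35.80\n"
    "Input: essential hypertension -> I10\n"
    "Input: type 2 diabetes -> E11.9\n"
    "Now code: diabetic chronic kidney disease"
)
# The finetuned model only needs the input itself
finetuned_prompt = "diabetic chronic kidney disease"

print(approx_tokens(few_shot_prompt), approx_tokens(finetuned_prompt))
```

Multiply that per-request difference by millions of requests and the effect shows up in both the latency and cost columns of the comparison matrix.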
When NOT to Finetune
- Don't finetune to add knowledge — use RAG instead. Finetuning is unreliable for factual recall.
- Don't finetune with < 100 examples — you're unlikely to see meaningful improvement.
- Don't finetune when prompts work — start simple, finetune only when you hit a ceiling.
- Don't finetune for rapidly changing data — you'd need to retrain constantly.
The Hybrid Approach
The best production systems combine all three:
class HybridMedicalCoder:
    def __init__(self):
        self.model = "ft:gpt-4o-mini:med-coder:v2"        # Finetuned for format/reasoning
        self.knowledge_base = VectorStore("icd10_codes")  # RAG for latest codes
        self.client = AsyncOpenAI()                       # Async client for awaited calls

    async def code(self, clinical_text: str) -> dict:
        # Step 1: RAG retrieves the latest coding guidelines
        docs = await self.knowledge_base.search(clinical_text, top_k=3)
        guidelines = "\n".join(doc.text for doc in docs)

        # Step 2: Finetuned model applies domain reasoning to the RAG context
        response = await self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Code the clinical text using provided guidelines."},
                {"role": "user", "content": f"Guidelines:\n{guidelines}\n\nClinical text: {clinical_text}"},
            ],
        )
        return parse_response(response)
Data Preparation Checklist
Before finetuning, ensure your training data is:
- Sufficient: At least 100-500 high-quality examples (more is generally better)
- Diverse: Covers the full range of expected inputs
- Consistent: Same format, style, and quality throughout
- Clean: No errors, contradictions, or low-quality examples
- Split: 90% training, 10% validation
- Evaluated: You have a clear metric to measure improvement
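Most of this checklist can be automated. A minimal validation sketch for chat-format JSONL training data; the field names follow the OpenAI finetuning chat format, and the thresholds are the ones suggested above:

```python
import json
import random

def validate_and_split(path: str, min_examples: int = 100, val_frac: float = 0.1):
    """Load JSONL training data, run basic checks, and return (train, validation)."""
    with open(path) as f:
        examples = [json.loads(line) for line in f if line.strip()]

    # Sufficient: enough examples to expect improvement
    assert len(examples) >= min_examples, f"Need >= {min_examples} examples"

    # Consistent: every example is a messages list ending with an assistant turn
    for ex in examples:
        msgs = ex.get("messages", [])
        assert msgs and msgs[-1]["role"] == "assistant", "Bad example format"

    # Clean: drop exact duplicates
    seen, unique = set(), []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(ex)

    # Split: hold out val_frac (10% by default) for validation
    random.shuffle(unique)
    n_val = max(1, int(len(unique) * val_frac))
    return unique[n_val:], unique[:n_val]
```

Checks like diversity and evaluation metrics still need human judgment, but format, duplication, and split hygiene are cheap to enforce mechanically before every training run.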