Finetuning Strategy

One of the most common questions in AI engineering is: "Should I finetune, use RAG, or just write better prompts?" The answer depends on your specific use case, data, budget, and accuracy requirements. This page provides a decision framework.

The Three Approaches

1. Prompt Engineering

Crafting better instructions, examples, and system prompts to get the desired output.

```python
from openai import OpenAI

client = OpenAI()

# Simple prompt engineering: rules and output format live in the system prompt
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are a medical coding assistant.
Map the given clinical description to the correct ICD-10 code.
Rules:
- Always use the most specific code available
- If unsure between two codes, choose the more specific one
- Format: CODE - DESCRIPTION""",
        },
        {"role": "user", "content": "Patient has Type 2 diabetes with diabetic chronic kidney disease"},
    ],
)
```

Best for: Quick iteration, general tasks, when the model already has the knowledge.

2. RAG (Retrieval-Augmented Generation)

Providing relevant context from an external knowledge base at inference time.

```python
# RAG approach: retrieve relevant reference material, then answer from it
relevant_docs = vector_store.search("Type 2 diabetes with CKD ICD-10", top_k=5)
context = "\n".join(doc.text for doc in relevant_docs)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Answer using only the provided reference material."},
        {"role": "user", "content": f"Reference:\n{context}\n\nQuestion: Map 'Type 2 diabetes with CKD' to ICD-10"},
    ],
)
```

Best for: Knowledge-heavy tasks, frequently updating information, when you need citations.

3. Finetuning

Training the model's weights on your specific data to change its behavior, style, or domain knowledge.

```python
# After finetuning, the model has internalized the domain knowledge
response = client.chat.completions.create(
    model="ft:gpt-4o:my-org:medical-coder:abc123",
    messages=[
        {"role": "user", "content": "Type 2 diabetes with diabetic chronic kidney disease"},
    ],
)
```

Best for: Style/tone changes, domain-specific reasoning, reducing latency, cost reduction at scale.

Decision Framework

Use this decision tree to choose the right approach:

```
Is the task about style/format/tone?
├── YES → Finetune (small dataset, fast results)
└── NO → Does the model already know the information?
    ├── YES → Prompt Engineering (cheapest, fastest)
    └── NO → Is the information frequently updated?
        ├── YES → RAG (update knowledge base, no retraining)
        └── NO → Is the reasoning domain-specific and complex?
            ├── YES → Finetune (embeds reasoning patterns)
            └── NO → Start with RAG + better prompts
```
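The same tree can be sketched as a plain function, useful as a starting point for team discussions. The function name and boolean flags below are illustrative assumptions, not part of any API:

```python
def choose_approach(
    style_task: bool,
    model_knows: bool,
    data_changes_often: bool,
    complex_domain_reasoning: bool,
) -> str:
    """Encode the decision tree above (illustrative sketch)."""
    if style_task:
        return "finetune"            # small dataset, fast results
    if model_knows:
        return "prompt engineering"  # cheapest, fastest
    if data_changes_often:
        return "RAG"                 # update knowledge base, no retraining
    if complex_domain_reasoning:
        return "finetune"            # embeds reasoning patterns
    return "RAG + better prompts"    # sensible default starting point

# Example: knowledge-heavy task over data that changes weekly
print(choose_approach(False, False, True, False))  # → RAG
```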

Comparison Matrix

| Criterion | Prompt Engineering | RAG | Finetuning |
|---|---|---|---|
| Setup cost | None | Medium (vector DB, pipeline) | High (data prep, training) |
| Per-query cost | High (long prompts) | Medium (context + query) | Low (shorter prompts) |
| Latency | High (long prompts) | Medium (retrieval + LLM) | Low (efficient inference) |
| Knowledge freshness | Static (cutoff) | Real-time (update DB) | Static (retrain to update) |
| Reasoning quality | Good | Good (with context) | Best (internalized) |
| Style/tone control | Moderate | Limited | Excellent |
| Data requirements | None | Documents | 100-10K examples |
| Technical complexity | Low | Medium | High |

When Finetuning Wins

1. Consistent Output Format

If you need the model to always output in a specific structured format:

```python
# Finetuning data for structured extraction (OpenAI chat JSONL format)
training_example = {
    "messages": [
        {"role": "system", "content": "Extract medical entities from clinical text."},
        {"role": "user", "content": "Patient presents with acute appendicitis, scheduled for laparoscopic appendectomy."},
        {"role": "assistant", "content": '{"conditions": ["acute appendicitis"], "procedures": ["laparoscopic appendectomy"], "body_parts": ["appendix"], "urgency": "acute"}'},
    ]
}
```

2. Domain-Specific Reasoning

When the model needs to learn reasoning patterns specific to your domain (medical diagnosis, legal analysis, code review).
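One way to embed such reasoning patterns is to include the intermediate steps, not just the final answer, in the training targets. A hypothetical example in the same chat JSONL format (the clinical content is illustrative):

```python
import json

# Hypothetical training example whose assistant turn demonstrates the
# reasoning path (specificity choice), not just the final code.
reasoning_example = {
    "messages": [
        {"role": "system", "content": "Map clinical text to ICD-10, explaining the specificity choice."},
        {"role": "user", "content": "Type 2 diabetes with diabetic chronic kidney disease"},
        {"role": "assistant", "content": (
            "Both conditions are documented, so a combination code applies. "
            "E11.22 (Type 2 diabetes mellitus with diabetic chronic kidney disease) "
            "is more specific than coding E11.9 and N18.9 separately.\n"
            "Answer: E11.22"
        )},
    ]
}
print(json.dumps(reasoning_example)[:60])  # one JSONL line per example
```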

3. Cost Reduction at Scale

Finetuning a smaller model to match a larger model's performance on your specific task:

```python
# Before: GPT-4o for every request ($$$)
# After: finetuned GPT-4o-mini ($, ~20x cheaper)
# If accuracy is comparable, savings are enormous at scale

# Illustrative cost comparison for 1M requests/month:
gpt4o_monthly = 5_000      # GPT-4o: ~$5,000/month
mini_monthly = 250         # GPT-4o-mini: ~$250/month (finetuned to match quality)
training_one_time = 100    # Training: ~$100 one-time
net_savings = gpt4o_monthly - mini_monthly - training_one_time
# Net savings: ~$4,650 in the first month, ~$4,750/month thereafter
```

4. Reduced Latency

Finetuned models work with shorter prompts (no few-shot examples required), reducing both token count and response time.
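As a rough illustration, compare the few-shot prompt a base model might need against the bare input a finetuned model accepts (the prompts are made up, and character count is only a crude proxy for tokens):

```python
# Few-shot prompt for the base model vs. the short prompt a finetuned
# model can use for the same task.
few_shot_prompt = (
    "You are a medical coder. Map clinical text to ICD-10.\n"
    "Example 1: 'acute appendicitis' -> K35.80\n"
    "Example 2: 'essential hypertension' -> I10\n"
    "Example 3: 'type 2 diabetes, no complications' -> E11.9\n"
    "Now code: Type 2 diabetes with diabetic chronic kidney disease"
)
finetuned_prompt = "Type 2 diabetes with diabetic chronic kidney disease"

saved = 1 - len(finetuned_prompt) / len(few_shot_prompt)
print(f"prompt shrinks by ~{saved:.0%}")
```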

When NOT to Finetune

Finetuning Anti-Patterns
  • Don't finetune to add knowledge — use RAG instead. Finetuning is unreliable for factual recall.
  • Don't finetune with < 100 examples — you won't see meaningful improvement.
  • Don't finetune when prompts work — start simple, finetune only when you hit a ceiling.
  • Don't finetune for rapidly changing data — you'd need to retrain constantly.

The Hybrid Approach

The best production systems combine all three:

```python
class HybridMedicalCoder:
    def __init__(self):
        self.model = "ft:gpt-4o-mini:med-coder:v2"        # Finetuned for format/reasoning
        self.knowledge_base = VectorStore("icd10_codes")  # RAG for latest codes

    async def code(self, clinical_text: str) -> dict:
        # Step 1: RAG retrieves the latest coding guidelines
        docs = await self.knowledge_base.search(clinical_text, top_k=3)
        guidelines = "\n".join(doc.text for doc in docs)

        # Step 2: Finetuned model applies domain reasoning to the RAG context
        response = await client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": "Code the clinical text using provided guidelines."},
                {"role": "user", "content": f"Guidelines:\n{guidelines}\n\nClinical text: {clinical_text}"},
            ],
        )
        return parse_response(response)
```

Data Preparation Checklist

Before finetuning, ensure your training data is:

  • Sufficient: At least 100-500 high-quality examples (more = better)
  • Diverse: Covers the full range of expected inputs
  • Consistent: Same format, style, and quality throughout
  • Clean: No errors, contradictions, or low-quality examples
  • Split: 90% training, 10% validation
  • Evaluated: You have a clear metric to measure improvement
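Parts of this checklist can be automated. A minimal sketch, assuming the training data is a chat-format JSONL file (one `{"messages": [...]}` object per line); the function name and validation rules are illustrative, and the split ratio matches the checklist:

```python
import json
import random

def validate_and_split(path: str, val_fraction: float = 0.1):
    """Check basic JSONL structure and return (train, val) splits."""
    examples = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, 1):
            ex = json.loads(line)  # raises on malformed JSON
            roles = [m.get("role") for m in ex.get("messages", [])]
            assert "assistant" in roles, f"line {i}: no assistant target"
            examples.append(ex)
    random.shuffle(examples)          # avoid ordered-data bias in the split
    n_val = max(1, int(len(examples) * val_fraction))
    return examples[n_val:], examples[:n_val]
```

Checks like "diverse" and "clean" still need human review, but structural validation catches the errors that silently degrade a training run.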