# Gemma Finetuning with QLoRA
Unsloth makes finetuning LLMs 2-5x faster while using 70% less memory. Combined with QLoRA (Quantized Low-Rank Adaptation), you can finetune models like Gemma 2 on a single consumer GPU.
## What is QLoRA?
QLoRA quantizes the frozen pretrained weights to 4-bit precision and trains only small adapter layers (low-rank adaptations). This dramatically reduces memory usage while largely preserving quality.
```
Full finetuning: train ALL 7B parameters → ~28GB just for fp32 weights
                 (gradients + optimizer states need far more)
LoRA:  train ~1% of parameters, fp16 base  → ~16GB VRAM
QLoRA: train ~1% of parameters, 4-bit base → ~6GB VRAM   ← we use this
```
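To see why the adapter is so small, here is a back-of-the-envelope calculation. The dimensions below are illustrative (a square 4096-dim projection), not Gemma's exact layer shapes:

```python
# LoRA replaces a frozen (d_out x d_in) weight update with two thin
# matrices A (r x d_in) and B (d_out x r), so it trains r * (d_in + d_out)
# parameters instead of d_in * d_out.
d_in, d_out, r = 4096, 4096, 16   # illustrative sizes, not Gemma's exact ones

full_params = d_in * d_out         # what full finetuning would train
lora_params = r * (d_in + d_out)   # what LoRA trains for this layer

print(f"Full: {full_params:,}")                              # 16,777,216
print(f"LoRA: {lora_params:,}")                              # 131,072
print(f"Fraction trained: {lora_params / full_params:.2%}")  # 0.78%
```

Summed over all target modules, the trainable fraction lands around the "~1%" figure above.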
## Setup with Unsloth
```bash
# Install Unsloth (Colab/Kaggle ready)
pip install unsloth

# Or for maximum speed (requires an Ampere or newer GPU: A100, RTX 30xx/40xx)
pip install "unsloth[colab-new]"
pip install --no-deps trl peft accelerate bitsandbytes
```
## Complete Finetuning Script
```python
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# ── 1. Load Model with 4-bit Quantization ──────────────────────────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,        # Auto-detect (bf16 on Ampere+, else fp16)
    load_in_4bit=True,
)

# ── 2. Add LoRA Adapters ───────────────────────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank: higher = more capacity, more memory
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,         # Scaling factor (typically = rank)
    lora_dropout=0,        # 0 is optimized for speed
    bias="none",           # "none" is optimized for speed
    use_gradient_checkpointing="unsloth",  # Long-context support
    random_state=42,
)

# ── 3. Prepare Training Data ───────────────────────────────────────
# Load a dataset (example: medical QA)
dataset = load_dataset("medalpaca/medical_meadow_medqa", split="train")

# Format each record into a single prompt/answer string
def format_example(example):
    return {
        "text": f"### Question:\n{example['input']}\n\n### Answer:\n{example['output']}"
    }

dataset = dataset.map(format_example)

# Split into train/validation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(eval_dataset)}")

# ── 4. Configure Training ──────────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,
        output_dir="./gemma-medical-qlora",
        optim="adamw_8bit",
        seed=42,
        report_to="none",  # Set to "wandb" for experiment tracking
    ),
)

# ── 5. Train! ──────────────────────────────────────────────────────
trainer.train()

# ── 6. Save the Adapter ────────────────────────────────────────────
model.save_pretrained("gemma-medical-qlora")
tokenizer.save_pretrained("gemma-medical-qlora")
print("Training complete! Adapter saved to ./gemma-medical-qlora/")
```
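Step 3's formatting function simply concatenates fields into one training string. Here it is on a toy record (an invented example, not from the real dataset):

```python
def format_example(example):
    # Same formatting as in the training script above
    return {
        "text": f"### Question:\n{example['input']}\n\n### Answer:\n{example['output']}"
    }

sample = {"input": "What does QLoRA quantize?",
          "output": "The frozen base model weights."}
formatted = format_example(sample)
print(formatted["text"])
# ### Question:
# What does QLoRA quantize?
#
# ### Answer:
# The frozen base model weights.
```

Whatever template you choose, use it identically at training and inference time; a mismatch silently degrades quality.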
## Inference with the Finetuned Model
```python
from unsloth import FastLanguageModel

# Load the finetuned model (base + adapter)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./gemma-medical-qlora",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode (faster generation)
FastLanguageModel.for_inference(model)

# Generate a response (same template as training, including the blank line)
prompt = """### Question:
What are the common side effects of metformin?

### Answer:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
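The decoded output includes the prompt itself, so you usually want to strip everything up to the answer marker. A minimal helper (my own sketch, not part of the Unsloth API):

```python
def extract_answer(decoded: str, marker: str = "### Answer:") -> str:
    # Keep only the text after the last occurrence of the answer marker
    return decoded.rsplit(marker, 1)[-1].strip()

decoded = "### Question:\nWhat is metformin?\n\n### Answer:\nA diabetes medication."
print(extract_answer(decoded))  # A diabetes medication.
```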
## Exporting for Deployment
### Option 1: Merge and Export to GGUF (for Ollama/local use)
```python
# Merge the LoRA weights into the base model
model.save_pretrained_merged("gemma-medical-merged", tokenizer)

# Export to GGUF format for llama.cpp/Ollama
model.save_pretrained_gguf(
    "gemma-medical-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of quality/size
)
```
### Option 2: Push to the Hugging Face Hub
```python
# Push the adapter to the Hugging Face Hub
model.push_to_hub("your-username/gemma-medical-qlora")
tokenizer.push_to_hub("your-username/gemma-medical-qlora")

# Or push the merged model
model.push_to_hub_merged("your-username/gemma-medical-merged", tokenizer)
```
### Option 3: Deploy as a vLLM Server
```bash
# Serve the merged model with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model ./gemma-medical-merged \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1
```
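vLLM exposes an OpenAI-compatible HTTP API. The sketch below builds a request payload for the `/v1/completions` endpoint using only the standard library; actually calling `query()` assumes the server above is running on localhost, and the `model` value must match what vLLM was launched with:

```python
import json
from urllib import request

# Payload for vLLM's OpenAI-compatible /v1/completions endpoint
payload = {
    "model": "./gemma-medical-merged",
    "prompt": "### Question:\nWhat are the common side effects of metformin?\n\n### Answer:\n",
    "max_tokens": 256,
    "temperature": 0.7,
}

def query(url="http://localhost:8000/v1/completions"):
    # Requires the vLLM server from the command above to be running
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

print(json.dumps(payload, indent=2))
```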
## LoRA Hyperparameter Guide
| Parameter | Recommended | Effect of Increasing |
|---|---|---|
| `r` (rank) | 8-32 | More capacity, more memory, higher overfitting risk |
| `lora_alpha` | Same as `r` | Stronger updates; may be unstable if too high |
| `lora_dropout` | 0 | More regularization, slower training |
| `learning_rate` | 1e-4 to 3e-4 | Faster learning, risk of instability |
| `target_modules` | All linear layers | More parameters trained, often better results |
| `batch_size` | 2-8 | More stable gradients, more memory |
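To relate `batch_size` and `gradient_accumulation_steps` to actual optimizer steps, a quick calculation (the dataset size here is hypothetical; the real number depends on the split):

```python
import math

num_examples = 9_000                # hypothetical training-set size
per_device_batch_size = 2
gradient_accumulation_steps = 4

# Gradients from several small batches are accumulated before each update
effective_batch = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch)   # 8
print(steps_per_epoch)   # 1125
```

This matters when setting `warmup_steps`, `eval_steps`, and `save_steps`: they count optimizer steps, not examples.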
### Rank Selection
- `r=8`: simple tasks (format changes, style transfer)
- `r=16`: default; works well for most tasks
- `r=32`-`64`: complex tasks requiring deep domain knowledge
- Start with `r=16` and increase only if validation loss plateaus.
## Common Pitfalls
- **Overfitting:** Use `eval_steps` to monitor. If training loss keeps dropping while eval loss rises, reduce epochs or add dropout.
- **Catastrophic forgetting:** LoRA helps prevent this, but if the model loses general abilities, lower the learning rate.
- **Bad data quality:** Even a handful of bad examples can noticeably degrade the model. Always manually review a sample of the training data.
- **Too few examples:** You typically need at least 100-200 high-quality examples to see meaningful improvement.
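The overfitting symptom above (eval loss rising while training loss falls) can also be detected mechanically. A minimal sketch, independent of the Trainer API, that you could run over logged eval losses:

```python
def is_overfitting(eval_losses, patience=3):
    # Flag overfitting if eval loss has risen for `patience` consecutive evals
    if len(eval_losses) <= patience:
        return False
    recent = eval_losses[-(patience + 1):]
    return all(b > a for a, b in zip(recent, recent[1:]))

print(is_overfitting([1.9, 1.7, 1.6, 1.65, 1.7, 1.8]))   # True
print(is_overfitting([1.9, 1.7, 1.6, 1.55, 1.5, 1.45]))  # False
```

For production runs, `transformers.EarlyStoppingCallback` implements the same idea with checkpointing support.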