
Gemma Finetuning with QLoRA

Unsloth makes finetuning LLMs 2-5x faster while using 70% less memory. Combined with QLoRA (Quantized Low-Rank Adaptation), you can finetune models like Gemma 2 on a single consumer GPU.

What is QLoRA?

QLoRA keeps the pretrained model weights frozen in 4-bit precision and trains only small low-rank adapter matrices (Low-Rank Adaptations) on top. This dramatically reduces memory use while largely preserving quality.

code
Full Finetuning: Train ALL 7B parameters → ~28GB VRAM for fp32 weights alone (gradients and optimizer states add much more)
LoRA: Train ~1% of parameters, frozen fp16 base → ~16GB VRAM
QLoRA: Train ~1% of parameters, frozen 4-bit base → ~6GB VRAM ← We use this
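The savings above can be sanity-checked with quick arithmetic. A minimal sketch of the parameter count a LoRA adapter adds to one linear layer (the 4096×4096 dimensions are illustrative, not exact Gemma 2 shapes):

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer:
    a (d_in x r) down-projection plus an (r x d_out) up-projection."""
    return d_in * r + r * d_out

# One 4096x4096 projection at rank 16:
full = 4096 * 4096                        # 16,777,216 frozen base weights
lora = lora_param_count(4096, 4096, 16)   # 131,072 trainable weights
print(f"LoRA trains {lora / full:.2%} of this layer")  # 0.78%
```

Summed over every targeted layer, this is where the "~1% of parameters" figure comes from; the base weights themselves only pay 4-bit storage cost under QLoRA.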

Setup with Unsloth

bash
# Install Unsloth (Colab/Kaggle ready)
pip install unsloth

# Or for maximum speed (requires an Ampere-or-newer GPU: A100, RTX 30xx/40xx)
pip install "unsloth[colab-new]"
pip install --no-deps trl peft accelerate bitsandbytes

Complete Training Script

python
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# ── 1. Load Model with 4-bit Quantization ──────────────────────────

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,  # Auto-detect
    load_in_4bit=True,
)

# ── 2. Add LoRA Adapters ───────────────────────────────────────────

model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank — higher = more capacity, more memory
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,  # Scaling factor (typically = rank)
    lora_dropout=0,  # 0 is optimized for speed
    bias="none",  # "none" is optimized for speed
    use_gradient_checkpointing="unsloth",  # Long context support
    random_state=42,
)

# ── 3. Prepare Training Data ───────────────────────────────────────

# Load a dataset (example: medical QA)
dataset = load_dataset("medalpaca/medical_meadow_medqa", split="train")

# Format into chat messages
def format_example(example):
    return {
        "text": f"### Question:\n{example['input']}\n\n### Answer:\n{example['output']}"
    }

dataset = dataset.map(format_example)

# Split into train/validation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]

print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(eval_dataset)}")

# ── 4. Configure Training ──────────────────────────────────────────

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,
        output_dir="./gemma-medical-qlora",
        optim="adamw_8bit",
        seed=42,
        report_to="none",  # Set to "wandb" for experiment tracking
    ),
)

# ── 5. Train! ──────────────────────────────────────────────────────

trainer.train()

# ── 6. Save the Adapter ────────────────────────────────────────────

model.save_pretrained("gemma-medical-qlora")
tokenizer.save_pretrained("gemma-medical-qlora")

print("Training complete! Adapter saved to ./gemma-medical-qlora/")
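The batch settings above determine how many optimizer steps the run takes. A quick sketch of that arithmetic (the 9,000-example dataset size is illustrative; use your own `len(train_dataset)`):

```python
import math

def total_optimizer_steps(n_examples: int, per_device_batch: int,
                          grad_accum: int, epochs: int) -> int:
    """Steps per epoch = ceil(examples / effective batch size)."""
    effective_batch = per_device_batch * grad_accum  # 2 * 4 = 8 above
    return math.ceil(n_examples / effective_batch) * epochs

# With the script's settings (batch 2, accumulation 4, 3 epochs):
print(total_optimizer_steps(9_000, 2, 4, 3))  # 3375
```

This is worth computing up front so that `warmup_steps`, `eval_steps`, and `save_steps` are sensible fractions of the total.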

Inference with the Finetuned Model

python
from unsloth import FastLanguageModel

# Load the finetuned model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./gemma-medical-qlora",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode (faster)
FastLanguageModel.for_inference(model)

# Generate a response
prompt = """### Question:
What are the common side effects of metformin?

### Answer:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
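Note that `generate` returns the prompt plus the completion, so the decoded string includes your question. A minimal string-level helper to keep only the answer (`extract_answer` is a hypothetical name; the `"### Answer:"` marker matches the prompt template used above):

```python
def extract_answer(full_text: str, marker: str = "### Answer:") -> str:
    """Return the text after the last answer marker, stripped.
    If the marker is absent, return the whole text stripped."""
    _, _, answer = full_text.rpartition(marker)
    return answer.strip()

decoded = "### Question:\nWhat is metformin?\n\n### Answer:\nA diabetes drug."
print(extract_answer(decoded))  # A diabetes drug.
```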

Exporting for Deployment

Option 1: Merge and Export to GGUF (for Ollama/local use)

python
# Merge LoRA weights into the base model
model.save_pretrained_merged("gemma-medical-merged", tokenizer)

# Export to GGUF format for llama.cpp/Ollama
model.save_pretrained_gguf(
    "gemma-medical-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of quality/size
)

Option 2: Push to HuggingFace Hub

python
# Push adapter to HuggingFace
model.push_to_hub("your-username/gemma-medical-qlora")
tokenizer.push_to_hub("your-username/gemma-medical-qlora")

# Push merged model
model.push_to_hub_merged("your-username/gemma-medical-merged", tokenizer)

Option 3: Deploy as vLLM Server

bash
# Serve the merged model with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model ./gemma-medical-merged \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1

LoRA Hyperparameter Guide

| Parameter | Recommended | Effect of Increasing |
|---|---|---|
| r (rank) | 8-32 | More capacity, more memory, risk of overfitting |
| lora_alpha | Same as r | Stronger updates, may be unstable if too high |
| lora_dropout | 0 | More regularization, slower training |
| learning_rate | 1e-4 to 3e-4 | Faster learning, risk of instability |
| target_modules | All linear layers | More parameters trained, better results |
| batch_size | 2-8 | More stable gradients, more memory |
Rank Selection
  • r=8: For simple tasks (format changes, style transfer)
  • r=16: Default, works well for most tasks
  • r=32-64: Complex tasks requiring deep domain knowledge
  • Start with r=16, increase only if validation loss plateaus
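The interaction between `lora_alpha` and `r` noted in the table comes down to the scaling LoRA applies to the adapter output. A sketch of the standard scaling rule (the numbers are illustrative):

```python
def lora_scaling(lora_alpha: int, r: int) -> float:
    """LoRA multiplies the adapter delta by alpha / r, so keeping
    alpha == r holds the update magnitude constant as rank changes."""
    return lora_alpha / r

print(lora_scaling(16, 16))  # 1.0  — the script's setting
print(lora_scaling(16, 64))  # 0.25 — raising r alone weakens updates
```

This is why the guide recommends raising `lora_alpha` along with `r`: increasing rank without adjusting alpha quietly shrinks the effective learning signal from the adapters.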
Common Pitfalls
  • Overfitting: Use eval_steps to monitor. If train loss drops but eval loss rises, reduce epochs or add dropout.
  • Catastrophic forgetting: LoRA helps prevent this, but if the model forgets general abilities, reduce the learning rate.
  • Bad data quality: One bad example can poison the model. Always manually review a sample of training data.
  • Too few examples: You need at least 100-200 high-quality examples to see meaningful improvement.
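The overfitting symptom described above (train loss falling while eval loss rises) can be checked mechanically against the losses the trainer logs. A minimal sketch with a hypothetical helper, not part of any library:

```python
def looks_overfit(train_losses: list[float], eval_losses: list[float],
                  window: int = 3) -> bool:
    """Flag overfitting when, over the last `window` evaluations,
    training loss strictly decreases while eval loss strictly increases."""
    if len(train_losses) < window or len(eval_losses) < window:
        return False
    t, e = train_losses[-window:], eval_losses[-window:]
    train_falling = all(b < a for a, b in zip(t, t[1:]))
    eval_rising = all(b > a for a, b in zip(e, e[1:]))
    return train_falling and eval_rising

print(looks_overfit([1.2, 0.9, 0.7], [1.1, 1.2, 1.3]))  # True  — diverging
print(looks_overfit([1.2, 0.9, 0.7], [1.1, 1.0, 0.9]))  # False — healthy
```

In practice you would feed this the `loss` and `eval_loss` values from `trainer.state.log_history`, and treat a `True` as a signal to stop early, reduce epochs, or add dropout.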