# Gemma Finetuning with QLoRA
Unsloth makes finetuning LLMs 2-5x faster while using 70% less memory. Combined with QLoRA (Quantized Low-Rank Adaptation), you can finetune models like Gemma 2 on a single consumer GPU.
## What is QLoRA?
QLoRA quantizes the frozen pretrained weights to 4-bit precision and trains only small adapter layers (low-rank adaptations). This dramatically reduces memory usage while largely preserving quality.
```
Full finetuning: train ALL 7B parameters → ~28GB just for fp32 weights
                 (gradients + optimizer states need far more)
LoRA:  train ~1% of parameters, fp16 base  → ~16GB VRAM
QLoRA: train ~1% of parameters, 4-bit base → ~6GB VRAM   ← we use this
```
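To see why the adapter is so small, here is a back-of-the-envelope calculation. The dimensions below are illustrative (a square 4096-dim projection), not Gemma's exact layer shapes:

```python
# LoRA replaces a frozen (d_out x d_in) weight update with two thin
# matrices A (r x d_in) and B (d_out x r), so it trains r * (d_in + d_out)
# parameters instead of d_in * d_out.
d_in, d_out, r = 4096, 4096, 16   # illustrative sizes, not Gemma's exact ones

full_params = d_in * d_out         # what full finetuning would train
lora_params = r * (d_in + d_out)   # what LoRA trains for this layer

print(f"Full: {full_params:,}")                              # 16,777,216
print(f"LoRA: {lora_params:,}")                              # 131,072
print(f"Fraction trained: {lora_params / full_params:.2%}")  # 0.78%
```

Summed over all target modules, the trainable fraction lands around the "~1%" figure above.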
## Setup with Unsloth
```bash
# Install Unsloth (Colab/Kaggle ready)
pip install unsloth

# Or for maximum speed (requires an Ampere or newer GPU: A100, RTX 30xx/40xx)
pip install "unsloth[colab-new]"
pip install --no-deps trl peft accelerate bitsandbytes
```
## Complete Finetuning Script
```python
from unsloth import FastLanguageModel
import torch
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# ── 1. Load Model with 4-bit Quantization ──────────────────────────
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-9b-bnb-4bit",
    max_seq_length=2048,
    dtype=None,        # Auto-detect (bf16 on Ampere+, else fp16)
    load_in_4bit=True,
)

# ── 2. Add LoRA Adapters ───────────────────────────────────────────
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                  # LoRA rank: higher = more capacity, more memory
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=16,         # Scaling factor (typically = rank)
    lora_dropout=0,        # 0 is optimized for speed
    bias="none",           # "none" is optimized for speed
    use_gradient_checkpointing="unsloth",  # Long-context support
    random_state=42,
)

# ── 3. Prepare Training Data ───────────────────────────────────────
# Load a dataset (example: medical QA)
dataset = load_dataset("medalpaca/medical_meadow_medqa", split="train")

# Format each record into a single prompt/answer string
def format_example(example):
    return {
        "text": f"### Question:\n{example['input']}\n\n### Answer:\n{example['output']}"
    }

dataset = dataset.map(format_example)

# Split into train/validation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset = split["train"]
eval_dataset = split["test"]
print(f"Training examples: {len(train_dataset)}")
print(f"Validation examples: {len(eval_dataset)}")

# ── 4. Configure Training ──────────────────────────────────────────
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # Effective batch size = 2 * 4 = 8
        warmup_steps=50,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=not torch.cuda.is_bf16_supported(),
        bf16=torch.cuda.is_bf16_supported(),
        logging_steps=10,
        eval_strategy="steps",
        eval_steps=100,
        save_strategy="steps",
        save_steps=100,
        save_total_limit=3,
        output_dir="./gemma-medical-qlora",
        optim="adamw_8bit",
        seed=42,
        report_to="none",  # Set to "wandb" for experiment tracking
    ),
)

# ── 5. Train! ──────────────────────────────────────────────────────
trainer.train()

# ── 6. Save the Adapter ────────────────────────────────────────────
model.save_pretrained("gemma-medical-qlora")
tokenizer.save_pretrained("gemma-medical-qlora")
print("Training complete! Adapter saved to ./gemma-medical-qlora/")
```
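Step 3's formatting function simply concatenates fields into one training string. Here it is on a toy record (an invented example, not from the real dataset):

```python
def format_example(example):
    # Same formatting as in the training script above
    return {
        "text": f"### Question:\n{example['input']}\n\n### Answer:\n{example['output']}"
    }

sample = {"input": "What does QLoRA quantize?",
          "output": "The frozen base model weights."}
formatted = format_example(sample)
print(formatted["text"])
# ### Question:
# What does QLoRA quantize?
#
# ### Answer:
# The frozen base model weights.
```

Whatever template you choose, use it identically at training and inference time; a mismatch silently degrades quality.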
## Inference with the Finetuned Model
```python
from unsloth import FastLanguageModel

# Load the finetuned model (base + adapter)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="./gemma-medical-qlora",
    max_seq_length=2048,
    dtype=None,
    load_in_4bit=True,
)

# Enable inference mode (faster generation)
FastLanguageModel.for_inference(model)

# Generate a response (same template as training, including the blank line)
prompt = """### Question:
What are the common side effects of metformin?

### Answer:
"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
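The decoded output includes the prompt itself, so you usually want to strip everything up to the answer marker. A minimal helper (my own sketch, not part of the Unsloth API):

```python
def extract_answer(decoded: str, marker: str = "### Answer:") -> str:
    # Keep only the text after the last occurrence of the answer marker
    return decoded.rsplit(marker, 1)[-1].strip()

decoded = "### Question:\nWhat is metformin?\n\n### Answer:\nA diabetes medication."
print(extract_answer(decoded))  # A diabetes medication.
```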
## Exporting for Deployment
### Option 1: Merge and Export to GGUF (for Ollama/local use)
```python
# Merge the LoRA weights into the base model
model.save_pretrained_merged("gemma-medical-merged", tokenizer)

# Export to GGUF format for llama.cpp/Ollama
model.save_pretrained_gguf(
    "gemma-medical-gguf",
    tokenizer,
    quantization_method="q4_k_m",  # Good balance of quality/size
)
```
### Option 2: Push to the Hugging Face Hub
```python
# Push the adapter to the Hugging Face Hub
model.push_to_hub("your-username/gemma-medical-qlora")
tokenizer.push_to_hub("your-username/gemma-medical-qlora")

# Or push the merged model
model.push_to_hub_merged("your-username/gemma-medical-merged", tokenizer)
```
### Option 3: Deploy as a vLLM Server
```bash
# Serve the merged model with vLLM
python -m vllm.entrypoints.openai.api_server \
    --model ./gemma-medical-merged \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size 1
```
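vLLM exposes an OpenAI-compatible HTTP API. The sketch below builds a request payload for the `/v1/completions` endpoint using only the standard library; actually calling `query()` assumes the server above is running on localhost, and the `model` value must match what vLLM was launched with:

```python
import json
from urllib import request

# Payload for vLLM's OpenAI-compatible /v1/completions endpoint
payload = {
    "model": "./gemma-medical-merged",
    "prompt": "### Question:\nWhat are the common side effects of metformin?\n\n### Answer:\n",
    "max_tokens": 256,
    "temperature": 0.7,
}

def query(url="http://localhost:8000/v1/completions"):
    # Requires the vLLM server from the command above to be running
    req = request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["text"]

print(json.dumps(payload, indent=2))
```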
## LoRA Hyperparameter Guide
| Parameter | Recommended | Effect of Increasing |
|---|---|---|
| `r` (rank) | 8-32 | More capacity, more memory, higher overfitting risk |
| `lora_alpha` | Same as `r` | Stronger updates; may be unstable if too high |
| `lora_dropout` | 0 | More regularization, slower training |
| `learning_rate` | 1e-4 to 3e-4 | Faster learning, risk of instability |
| `target_modules` | All linear layers | More parameters trained, often better results |
| `batch_size` | 2-8 | More stable gradients, more memory |
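To relate `batch_size` and `gradient_accumulation_steps` to actual optimizer steps, a quick calculation (the dataset size here is hypothetical; the real number depends on the split):

```python
import math

num_examples = 9_000                # hypothetical training-set size
per_device_batch_size = 2
gradient_accumulation_steps = 4

# Gradients from several small batches are accumulated before each update
effective_batch = per_device_batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_examples / effective_batch)

print(effective_batch)   # 8
print(steps_per_epoch)   # 1125
```

This matters when setting `warmup_steps`, `eval_steps`, and `save_steps`: they count optimizer steps, not examples.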
### Rank Selection
- `r=8`: simple tasks (format changes, style transfer)
- `r=16`: default; works well for most tasks
- `r=32`-`64`: complex tasks requiring deep domain knowledge
- Start with `r=16` and increase only if validation loss plateaus.
## Common Pitfalls
- **Overfitting:** Use `eval_steps` to monitor. If training loss keeps dropping while eval loss rises, reduce epochs or add dropout.
- **Catastrophic forgetting:** LoRA helps prevent this, but if the model loses general abilities, lower the learning rate.
- **Bad data quality:** Even a handful of bad examples can noticeably degrade the model. Always manually review a sample of the training data.
- **Too few examples:** You typically need at least 100-200 high-quality examples to see meaningful improvement.
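The overfitting symptom above (eval loss rising while training loss falls) can also be detected mechanically. A minimal sketch, independent of the Trainer API, that you could run over logged eval losses:

```python
def is_overfitting(eval_losses, patience=3):
    # Flag overfitting if eval loss has risen for `patience` consecutive evals
    if len(eval_losses) <= patience:
        return False
    recent = eval_losses[-(patience + 1):]
    return all(b > a for a, b in zip(recent, recent[1:]))

print(is_overfitting([1.9, 1.7, 1.6, 1.65, 1.7, 1.8]))   # True
print(is_overfitting([1.9, 1.7, 1.6, 1.55, 1.5, 1.45]))  # False
```

For production runs, `transformers.EarlyStoppingCallback` implements the same idea with checkpointing support.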