Quick Answer
For most use cases, RAG (retrieval-augmented generation) beats fine-tuning. But when you need the model to match a specific style, output format, or domain language, fine-tune an open model (Llama 3.1 8B, Mistral, Qwen 2.5) with LoRA using Unsloth, Together.ai, or Modal. Budget: $5-50 for a single run.
- Dataset size: 500 examples minimum; 2,000-10,000 for solid results
- Cost per run: $5-50 (LoRA) or $200+ (full)
- Time: 2-12 hours
What You'll Need
- 500+ high-quality input/output pairs (JSONL)
- GPU access (Colab free, Modal, RunPod, or Together)
- Python + PyTorch basics (AI assistants can fill the gaps)
- Evaluation set (100+ held-out examples)
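The JSONL pair format called for above is easy to sanity-check before training. A minimal sketch, assuming the chat-messages schema shown in the Steps below; the validator is illustrative, not any particular trainer's official check, so confirm your tool's exact schema in its docs.

```python
import json

def validate_jsonl_record(line: str) -> bool:
    """Check one JSONL line against the chat-pair schema:
    at least one user turn and a final assistant turn,
    all with non-empty content (an assumed convention)."""
    record = json.loads(line)
    messages = record.get("messages", [])
    roles = [m.get("role") for m in messages]
    return (
        len(messages) >= 2
        and roles[0] == "user"
        and roles[-1] == "assistant"
        and all(m.get("content") for m in messages)
    )

example = json.dumps({
    "messages": [
        {"role": "user", "content": "Summarize: LoRA adds low-rank adapters."},
        {"role": "assistant", "content": "LoRA trains small adapter matrices."},
    ]
})
print(validate_jsonl_record(example))  # True
```

Running a check like this over every line of your training file catches the mixed-format mistake listed further down before you pay for a GPU hour.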
Steps
- Decide: RAG or fine-tune? If knowledge changes often → RAG. If style/format/tone matters → fine-tune. If both → hybrid.
- Build dataset. Format as JSONL with {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}. Quality beats quantity: 500 great examples beat 5,000 okay ones.
- Pick base model. Llama 3.1 8B for general use, Qwen 2.5 7B for multilingual, Phi-3 for tiny/edge. Ask AI: "Which open model is best for my task: [describe]?"
- Fine-tune with Unsloth (easiest & fastest). Notebook template handles LoRA config. Set rank 16-32, alpha 16-32, learning rate 2e-4, epochs 1-3.
- Run training. On Colab free T4: ~2-4 hours for 1K examples, Llama 3.1 8B. On Modal A100: 30 min, costs ~$2.
- Evaluate. Compare fine-tuned vs. base on the held-out set using a rubric: correctness, format match, style. If the fine-tuned model loses across the board, suspect a dataset problem.
- Deploy. Merge the LoRA adapter into the base model, then either convert to GGUF with llama.cpp and serve via Ollama, or serve the merged weights directly with vLLM on a VPS.
- Iterate. Log production failures, add them to training set, re-tune monthly.
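The evaluation step above can be sketched with plain Python. This is a minimal illustration, not a full harness: the format check below (does the output parse as JSON?) is a placeholder you would swap for whatever your task's rubric actually requires, and the model outputs are hypothetical.

```python
import json

def format_match(output: str) -> bool:
    """Placeholder rubric check: does the output parse as JSON?
    Replace with your task's real format requirement."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False

def score_model(outputs: list) -> float:
    """Fraction of held-out outputs passing the format check."""
    return sum(format_match(o) for o in outputs) / len(outputs)

# Hypothetical outputs from the base vs. fine-tuned model on the same prompts.
base_outputs = ['{"a": 1}', "not json", "also not json", '{"b": 2}']
tuned_outputs = ['{"a": 1}', '{"b": 2}', '{"c": 3}', "oops"]

print(score_model(base_outputs))   # 0.5
print(score_model(tuned_outputs))  # 0.75
```

Scoring correctness and style the same way (one function per rubric category) gives you the per-category comparison the evaluate step asks for.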
Common Mistakes
- Tiny dataset: <200 examples won't meaningfully shift the model; you'll overfit instead of generalizing.
- Mixed formats: an inconsistent JSONL structure across examples confuses training. Use one schema throughout.
- No eval set: You can't claim improvement without measuring.
- Tuning for knowledge: Models forget. Use RAG for facts.
- Over-tuning: >3 epochs on small data = catastrophic forgetting.
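The "no eval set" mistake above is cheap to avoid: carve off a held-out split before training ever starts. A minimal sketch; the 100-example size follows the What You'll Need list, and the fixed seed is just a convention for reproducibility.

```python
import random

def split_dataset(records: list, eval_size: int = 100, seed: int = 42):
    """Shuffle and carve off a held-out eval set before training.
    These examples must never appear in the training file."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return shuffled[eval_size:], shuffled[:eval_size]

records = [{"id": i} for i in range(600)]
train, eval_set = split_dataset(records)
print(len(train), len(eval_set))  # 500 100
```

Split once, save both files, and reuse the same eval set across tuning runs so the scores stay comparable.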
Top Tools
| Tool | Best For | Price |
| --- | --- | --- |
| Unsloth | Fast LoRA tuning | Free |
| Together.ai | Hosted fine-tuning | $0.80/M tokens |
| Modal | Serverless GPU | Pay per second |
| Ollama | Local inference | Free |
| vLLM | Fast serving | Free |
FAQs
Q: How many examples do I need?
500 minimum for visible effect; 2-5K for solid results; 10K+ for hard domains.
Q: LoRA vs full fine-tuning?
LoRA for 95% of use cases. Full for frontier research or when LoRA caps out.
Q: Will my data leak?
Use local (Ollama, vLLM) or self-hosted inference. Avoid hosted if data is sensitive.
Q: Can I fine-tune closed models like GPT-4?
OpenAI offers fine-tuning for its models, but it's banned under our AI policy; use open models instead.
Q: How much VRAM needed?
QLoRA on 8B model: 16GB. LoRA on 8B: 24GB. Full 8B: 60GB+.
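Those VRAM figures come from a back-of-envelope calculation you can reproduce. The overhead multipliers below are rough assumptions covering optimizer state, gradients, and activations, not measured constants; real usage varies with sequence length and batch size.

```python
def vram_gb(params_b: float, bytes_per_weight: float, overhead: float) -> float:
    """Rough VRAM estimate: weight memory times an assumed overhead
    factor for optimizer state, gradients, and activations."""
    weight_gb = params_b * bytes_per_weight  # 1B params * 1 byte = 1 GB
    return weight_gb * overhead

# 8B-parameter model under three regimes (overhead factors assumed):
print(round(vram_gb(8, 0.5, 4.0)))  # QLoRA, 4-bit weights: ~16 GB
print(round(vram_gb(8, 2.0, 1.5)))  # LoRA, 16-bit weights: ~24 GB
print(round(vram_gb(8, 2.0, 4.0)))  # full fine-tune:       ~64 GB
```

The pattern to remember: quantizing weights shrinks the base cost, while training more parameters (full fine-tuning) multiplies the overhead.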
Q: Can I fine-tune image models?
Yes. Stable Diffusion LoRAs follow a similar process with different tooling.
Conclusion
Fine-tuning is powerful but overused. Always try RAG first. When you do tune, invest 80% of your effort in dataset quality; model choice is secondary. A small, clean dataset beats a sloppy big one every time.