Quick Answer
For most use cases, RAG beats fine-tuning. But when you need style/format/domain-language matching, fine-tune an open model (Llama 3.1 8B, Mistral, Qwen 2.5) with LoRA using Unsloth, Together.ai, or Modal. Budget: $5-50 for a single run.
- Dataset size: 500-10,000 examples (500 is the practical minimum)
- Cost per run: $5-50 (LoRA) or $200+ (full)
- Time: 2-12 hours
What You'll Need
- 500+ high-quality input/output pairs (JSONL)
- GPU access (Colab free, Modal, RunPod, or Together)
- Python + PyTorch basics (an AI assistant can help fill gaps)
- Evaluation set (100+ held-out examples)
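A minimal sketch of how the dataset and the held-out evaluation set might be prepared, assuming each JSONL line is an object with a `messages` list of chat turns (the layout shown in the Steps section). The names `validate_record` and `split_dataset`, the allowed role set, and the 100-example default are illustrative assumptions, not part of any library.

```python
import json
import random

# Assumption: the usual chat roles; adjust if your data uses others.
ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_record(line: str) -> bool:
    """Return True if one JSONL line is a well-formed chat example."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return False
    if not isinstance(record, dict):
        return False
    messages = record.get("messages")
    if not isinstance(messages, list) or len(messages) < 2:
        return False
    for message in messages:
        if not isinstance(message, dict):
            return False
        if message.get("role") not in ALLOWED_ROLES:
            return False
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            return False
    # The final turn must be the assistant reply the model learns to produce.
    return messages[-1]["role"] == "assistant"


def split_dataset(lines, eval_size=100, seed=0):
    """Drop invalid lines, shuffle, and hold out `eval_size` examples."""
    valid = [line for line in lines if validate_record(line)]
    if len(valid) <= eval_size:
        raise ValueError("not enough valid examples to hold out an eval set")
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(valid)
    return valid[eval_size:], valid[:eval_size]  # (train, eval)
```

Keeping the seed fixed means the same examples stay in the evaluation set across retraining runs, so later comparisons remain meaningful.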
Steps
- Decide: RAG or fine-tune? If knowledge changes often → RAG. If style/format/tone matters → fine-tune. If both → hybrid.
- Build dataset. Format as JSONL with {"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}. Quality > quantity. 500 great > 5000 okay.
- Pick base model. Llama 3.1 8B for general use, Qwen 2.5 7B for multilingual, Phi-3 for tiny/edge. Ask AI: "Which open model is best for my task: [describe]?"
- Fine-tune with Unsloth (easiest & fastest). Notebook template handles LoRA config. Set rank 16-32, alpha 16-32, learning rate 2e-4, epochs 1-3.
- Run training. On Colab free T4: ~2-4 hours for 1K examples, Llama 3.1 8B. On Modal A100: 30 min, costs ~$2.
- Evaluate. Score the fine-tuned and base models on the held-out set against a rubric: correctness, format match, style. If the fine-tuned model loses on 3+ categories, suspect a dataset issue.
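- Deploy. Merge the LoRA adapter into the base model. For Ollama, convert the merged model to GGUF with llama.cpp; for vLLM, serve the merged Hugging Face weights directly. Either can run on a GPU-equipped VPS.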
- Iterate. Log production failures, add them to training set, re-tune monthly.
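To make the rank setting from the fine-tuning step more concrete, here is a rough sketch of how many trainable parameters a LoRA adapter adds. It uses the standard LoRA construction, in which a rank-r adapter on a weight matrix with d_in inputs and d_out outputs adds r × (d_in + d_out) parameters. The layer shapes are assumptions taken from the published Llama 3.1 8B configuration (hidden size 4096, MLP size 14336, 1024-wide key/value projections, 32 layers), and adapting all seven projection matrices is an assumed setup rather than something this guide prescribes.

```python
# (d_in, d_out) of each adapted projection in one transformer layer.
# Assumption: shapes from the published Llama 3.1 8B configuration.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32


def lora_trainable_params(rank, shapes=LAYER_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA parameters: each adapter is A (rank x d_in) plus B (d_out x rank)."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


for rank in (16, 32):
    print(f"rank {rank}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions rank 16 gives about 42 million trainable parameters, roughly 0.5% of the 8B base model, and doubling the rank doubles that count. This is why a LoRA run fits on much smaller GPUs than full fine-tuning.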
Common Mistakes
- Tiny dataset: Fewer than 200 examples rarely shifts the model's behavior in a useful way; it tends to overfit to them instead.
- Mixed formats: Keep one consistent JSONL structure across all examples.
- No eval set: You can't claim improvement without measuring.
- Tuning for knowledge: Models forget. Use RAG for facts.
- Over-tuning: More than 3 epochs on small data tends to cause overfitting and can trigger catastrophic forgetting.
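The "No eval set" point can be made concrete with a small comparison harness. The sketch below assumes you already have per-example rubric scores between 0 and 1 for both the base and the fine-tuned model on the same held-out examples; how those scores are produced (human grading, scripts, or an LLM judge) is left open. The function name `compare_models` and the score layout are hypothetical.

```python
from statistics import mean

# Rubric categories named in the evaluation step of this guide.
CATEGORIES = ("correctness", "format_match", "style")


def compare_models(base_scores, tuned_scores, categories=CATEGORIES):
    """Return per-category mean scores and the categories where tuning lost."""
    report = {}
    for category in categories:
        base = mean(example[category] for example in base_scores)
        tuned = mean(example[category] for example in tuned_scores)
        report[category] = {"base": base, "tuned": tuned, "delta": tuned - base}
    regressions = [c for c in categories if report[c]["delta"] < 0]
    return report, regressions


# Made-up scores for two held-out examples, for illustration only.
base = [
    {"correctness": 1.0, "format_match": 0.0, "style": 0.5},
    {"correctness": 1.0, "format_match": 0.5, "style": 0.5},
]
tuned = [
    {"correctness": 0.5, "format_match": 1.0, "style": 1.0},
    {"correctness": 1.0, "format_match": 1.0, "style": 0.5},
]
report, regressions = compare_models(base, tuned)
print(regressions)  # ['correctness']
```

In this toy data the fine-tuned model gains on format and style but regresses on correctness, which is the kind of trade-off a per-category breakdown is meant to surface.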
Top Tools
| Tool | Best For | Price |
|---|---|---|
| Unsloth | Fast LoRA tuning | Free |
| Together.ai | Hosted fine-tuning | $0.80/M tokens |
| Modal | Serverless GPU | Pay per sec |
| Ollama | Local inference | Free |
| vLLM | Fast serving | Free |
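As a rough worked example of the per-token price in the table, the sketch below estimates a hosted fine-tuning bill. It assumes the provider bills on total training tokens (examples × average tokens per example × epochs); actual billing rules, minimum charges, and current prices vary, so treat the result as an order-of-magnitude check. The 5,000-example and 400-token inputs are made up for illustration.

```python
def estimate_training_cost(num_examples, avg_tokens_per_example, epochs,
                           price_per_million_tokens=0.80):
    """Estimate cost in dollars, assuming billing on total training tokens."""
    total_tokens = num_examples * avg_tokens_per_example * epochs
    return total_tokens / 1_000_000 * price_per_million_tokens


# Hypothetical run: 5,000 examples, ~400 tokens each, 3 epochs.
cost = estimate_training_cost(5_000, 400, 3)
print(f"${cost:.2f}")  # $4.80
```

That hypothetical run processes 6 million tokens and lands near the low end of the $5-50 range quoted in the Quick Answer.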
Conclusion
Fine-tuning is powerful but over-used. Always try RAG first. When you do tune, invest 80% of effort in dataset quality — model choice is secondary. Small, clean datasets beat sloppy big ones every time.
