Quick Answer
For CPU or small GPUs: use Ollama with quantized GGUF models (a Q4-quantized Llama 3.1 8B fits in about 8GB of RAM). For production serving: vLLM on a dedicated GPU (RTX 4090, A100, or a rented equivalent). Expose it through an OpenAI-compatible API behind Caddy/Traefik with HTTPS.
- Setup time: 1-3 hours
- Cost: $5/mo CPU VPS to $500/mo A100
- Throughput: 10-200 tokens/sec depending on setup
What You'll Need
- VPS with 16GB+ RAM (CPU) or GPU VPS (RunPod, Vast.ai, Hetzner GPU)
- Docker installed
- Domain + Caddy/Traefik for HTTPS
- Ollama or vLLM
Steps
- Choose model size by hardware.
- CPU (16GB RAM): Llama 3.1 8B Q4_K_M, Qwen 2.5 7B
- Single GPU 24GB (RTX 3090/4090): Qwen 2.5 32B Q4, Llama 3.1 8B at full precision; 70B only with aggressive quantization or partial CPU offload (slow)
- A100 80GB: Llama 3.1 70B at 8-bit or Q4 (full fp16 70B needs ~140GB and won't fit on one card), Mixtral 8x22B only at very low-bit quantization
- Install Ollama (simplest). curl -fsSL https://ollama.com/install.sh | sh, then pull a model: ollama pull llama3.1:8b. Ollama listens on :11434 and serves an OpenAI-compatible API under /v1 (quick-start sketch after this list).
- Or install vLLM (production). docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct. Continuous batching gives far higher throughput than Ollama under concurrent load (fuller Docker sketch after this list).
- Put HTTPS in front. Minimal Caddy config: your-domain.com { reverse_proxy localhost:8000 }. Caddy provisions Let's Encrypt certificates automatically (Caddyfile sketch after this list).
- Add auth. Neither Ollama nor vLLM enforces authentication by default. Add a key check at the proxy layer and reject requests without Authorization: Bearer <key>; a small Node.js proxy or a Caddy matcher both work (see the Caddyfile sketch after this list).
- Test with the OpenAI SDK. new OpenAI({ baseURL: 'https://your-domain.com/v1', apiKey: 'your-key' }). This works because Ollama and vLLM both speak the OpenAI wire format (curl equivalent after this list).
- Monitor. Track GPU utilization (nvidia-smi), tokens/sec, and queue depth. Prometheus + Grafana run fine on the same VPS (monitoring sketch after this list).
- Scale. Multi-GPU with vLLM tensor parallelism. Or run multiple single-GPU nodes behind a load balancer.
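A minimal Ollama quick-start for the install step, assuming a Linux host; the install script and model tag come from the step above, and /v1 is Ollama's OpenAI-compatible route:

```bash
# Install Ollama and pull a 4-bit quantized Llama 3.1 8B
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

# Ollama listens on 11434; the OpenAI-compatible routes live under /v1
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```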
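For the vLLM route, a sketch of the Docker command above, extended with a Hugging Face cache mount and access token (Llama weights are gated) plus vLLM's optional built-in key check. It assumes HF_TOKEN and API_KEY are set in your environment; verify flag names against your vLLM version:

```bash
# Run vLLM's OpenAI-compatible server on port 8000
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key "$API_KEY"   # optional built-in bearer-key check

# Scale step: shard the model across GPUs by adding --tensor-parallel-size 2 (or 4, 8)
```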
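A combined sketch for the HTTPS and auth steps: a Caddyfile that terminates TLS and rejects requests missing the expected bearer token. The matcher syntax is standard Caddy v2, but your-domain.com and YOUR_SECRET_KEY are placeholders, and a systemd-managed Caddy install is assumed:

```bash
# Write the Caddyfile: TLS via Let's Encrypt plus a bearer-token gate
sudo tee /etc/caddy/Caddyfile > /dev/null <<'EOF'
your-domain.com {
    # Reject anything without the expected Authorization header
    @unauthorized not header Authorization "Bearer YOUR_SECRET_KEY"
    respond @unauthorized "unauthorized" 401

    # Forward everything else to vLLM (use 11434 for Ollama)
    reverse_proxy localhost:8000
}
EOF

sudo systemctl reload caddy
```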
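A curl equivalent of the SDK test, hitting the same OpenAI-compatible endpoint through the proxy; adjust the model name to whatever your server actually loaded:

```bash
# Same request the OpenAI SDK sends when baseURL points at your domain
curl https://your-domain.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Reply with the word pong."}],
        "max_tokens": 16
      }'
```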
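For the monitor step, a minimal starting point; the nvidia-smi query flags are standard, and vLLM exposes Prometheus-format metrics at /metrics on its API port (exact metric names vary by version):

```bash
# GPU utilization and memory, refreshed every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5

# vLLM's Prometheus endpoint: running/waiting requests, token counters
curl -s http://localhost:8000/metrics | grep -E 'vllm:.*(running|waiting|tokens)'
```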
Common Mistakes
- Running 70B on 16GB: OOM crashes. Check that the quantized weights plus KV cache actually fit in your VRAM/RAM before pulling.
- No rate limiting: One runaway client saturates the box. Add per-key limits.
- Ignoring context window: KV cache grows linearly with context, roughly 100-500MB per 1K tokens depending on the model's size and attention layout. Don't over-allocate max context (worked numbers after this list).
- No HTTPS: browsers block mixed-content HTTP calls and you'd be sending API keys in plaintext. Always front the server with Caddy/Traefik.
- Skipping quantization: Full-precision 8B needs 16GB VRAM. Q4_K_M needs 5GB with minor quality loss.
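As a sanity check on the KV-cache point above, here is the arithmetic for Llama 3.1's grouped-query attention layout (32 layers, 8 KV heads, head dim 128 for the 8B; 80 layers for the 70B), assuming fp16 cache entries; treat the figures as rough estimates:

```bash
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes(fp16)
echo $(( 2 * 32 * 8 * 128 * 2 ))   # Llama 3.1 8B:  131072 B/token  -> ~128 MB per 1K tokens
echo $(( 2 * 80 * 8 * 128 * 2 ))   # Llama 3.1 70B: 327680 B/token  -> ~320 MB per 1K tokens
# Older models without grouped-query attention (e.g. Llama 2 7B with 32 KV heads)
# land closer to 500 MB per 1K tokens, the high end of the range above.
```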
Top Tools
| Tool | Best For | Price |
| --- | --- | --- |
| Ollama | Easiest setup | Free |
| vLLM | High throughput | Free |
| Llama.cpp | CPU / edge | Free |
| Caddy | HTTPS proxy | Free |
| Hetzner GPU | Cheap GPU VPS | $70-500/mo |
FAQs
Q: Ollama vs vLLM?
Ollama: simplest to set up, fine for one or a few users. vLLM: more moving parts, but continuous batching gives roughly 10x the throughput under concurrent load.
Q: Which GPU for production?
RTX 4090 (24GB) for indie. A100 (80GB) for scale. H100 for frontier.
Q: Can I run on CPU only?
Yes — 8B quantized model on 16GB RAM. ~5-10 tok/sec. Fine for batch jobs.
Q: Is this cheaper than OpenAI API?
Above ~5M tokens/mo, yes. Below that, APIs are cheaper.
Q: Data privacy vs hosted API?
Everything stays on hardware you control: prompts and outputs never leave your VPS.
Q: Can I serve multiple models?
Yes. Ollama loads whichever pulled model a request names; with vLLM you typically run one server per model and route by model name at the proxy or load balancer.
Conclusion
Self-hosting LLMs in 2026 is easier than ever. Start with Ollama on a cheap GPU VPS, graduate to vLLM when throughput matters. Full control, full privacy, predictable cost.