Quick Answer
For CPU or small GPUs: use Ollama with quantized GGUF models (a Q4-quantized Llama 3.1 8B fits in about 8GB of RAM). For production serving: vLLM on a dedicated GPU (RTX 4090, A100, or a rented equivalent). Expose it through an OpenAI-compatible API behind Caddy/Traefik with HTTPS.
- Setup time: 1-3 hours
- Cost: $5/mo CPU VPS to $500/mo A100
- Throughput: 10-200 tokens/sec depending on setup
What You'll Need
- VPS with 16GB+ RAM (CPU) or GPU VPS (RunPod, Vast.ai, Hetzner GPU)
- Docker installed
- Domain + Caddy/Traefik for HTTPS
- Ollama or vLLM
Steps
- Choose model size by hardware.
- CPU (16GB RAM): Llama 3.1 8B Q4_K_M, Qwen 2.5 7B
- Single GPU 24GB (RTX 3090/4090): Qwen 2.5 32B Q4, Llama 3.1 8B at full precision; 70B only with aggressive quantization or partial CPU offload (slow)
- A100 80GB: Llama 3.1 70B at 8-bit or Q4 (full fp16 70B needs ~140GB and won't fit on one card), Mixtral 8x22B only at very low-bit quantization
- Install Ollama (simplest). curl -fsSL https://ollama.com/install.sh | sh, then pull a model: ollama pull llama3.1:8b. Ollama listens on :11434 and serves an OpenAI-compatible API under /v1 (quick-start sketch after this list).
- Or install vLLM (production). docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest --model meta-llama/Llama-3.1-8B-Instruct. Continuous batching gives far higher throughput than Ollama under concurrent load (fuller Docker sketch after this list).
- Put HTTPS in front. Minimal Caddy config: your-domain.com { reverse_proxy localhost:8000 }. Caddy provisions Let's Encrypt certificates automatically (Caddyfile sketch after this list).
- Add auth. Neither Ollama nor vLLM enforces authentication by default. Add a key check at the proxy layer and reject requests without Authorization: Bearer <key>; a small Node.js proxy or a Caddy matcher both work (see the Caddyfile sketch after this list).
- Test with the OpenAI SDK. new OpenAI({ baseURL: 'https://your-domain.com/v1', apiKey: 'your-key' }). This works because Ollama and vLLM both speak the OpenAI wire format (curl equivalent after this list).
- Monitor. Track GPU utilization (nvidia-smi), tokens/sec, and queue depth. Prometheus + Grafana run fine on the same VPS (monitoring sketch after this list).
- Scale. Multi-GPU with vLLM tensor parallelism. Or run multiple single-GPU nodes behind a load balancer.
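A minimal Ollama quick-start for the install step, assuming a Linux host; the install script and model tag come from the step above, and /v1 is Ollama's OpenAI-compatible route:

```bash
# Install Ollama and pull a 4-bit quantized Llama 3.1 8B
curl -fsSL https://ollama.com/install.sh | sh
ollama pull llama3.1:8b

# Ollama listens on 11434; the OpenAI-compatible routes live under /v1
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "llama3.1:8b",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
      }'
```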
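For the vLLM route, a sketch of the Docker command above, extended with a Hugging Face cache mount and access token (Llama weights are gated) plus vLLM's optional built-in key check. It assumes HF_TOKEN and API_KEY are set in your environment; verify flag names against your vLLM version:

```bash
# Run vLLM's OpenAI-compatible server on port 8000
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HUGGING_FACE_HUB_TOKEN="$HF_TOKEN" \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key "$API_KEY"   # optional built-in bearer-key check

# Scale step: shard the model across GPUs by adding --tensor-parallel-size 2 (or 4, 8)
```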
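A combined sketch for the HTTPS and auth steps: a Caddyfile that terminates TLS and rejects requests missing the expected bearer token. The matcher syntax is standard Caddy v2, but your-domain.com and YOUR_SECRET_KEY are placeholders, and a systemd-managed Caddy install is assumed:

```bash
# Write the Caddyfile: TLS via Let's Encrypt plus a bearer-token gate
sudo tee /etc/caddy/Caddyfile > /dev/null <<'EOF'
your-domain.com {
    # Reject anything without the expected Authorization header
    @unauthorized not header Authorization "Bearer YOUR_SECRET_KEY"
    respond @unauthorized "unauthorized" 401

    # Forward everything else to vLLM (use 11434 for Ollama)
    reverse_proxy localhost:8000
}
EOF

sudo systemctl reload caddy
```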
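A curl equivalent of the SDK test, hitting the same OpenAI-compatible endpoint through the proxy; adjust the model name to whatever your server actually loaded:

```bash
# Same request the OpenAI SDK sends when baseURL points at your domain
curl https://your-domain.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_SECRET_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Reply with the word pong."}],
        "max_tokens": 16
      }'
```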
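For the monitor step, a minimal starting point; the nvidia-smi query flags are standard, and vLLM exposes Prometheus-format metrics at /metrics on its API port (exact metric names vary by version):

```bash
# GPU utilization and memory, refreshed every 5 seconds
nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total --format=csv -l 5

# vLLM's Prometheus endpoint: running/waiting requests, token counters
curl -s http://localhost:8000/metrics | grep -E 'vllm:.*(running|waiting|tokens)'
```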
Common Mistakes
- Running 70B on 16GB: OOM crashes. Check that the quantized weights plus KV cache actually fit in your VRAM/RAM before pulling.
- No rate limiting: One runaway client saturates the box. Add per-key limits.
- Ignoring context window: KV cache grows linearly with context, roughly 100-500MB per 1K tokens depending on the model's size and attention layout. Don't over-allocate max context (worked numbers after this list).
- No HTTPS: browsers block mixed-content HTTP calls and you'd be sending API keys in plaintext. Always front the server with Caddy/Traefik.
- Skipping quantization: Full-precision 8B needs 16GB VRAM. Q4_K_M needs 5GB with minor quality loss.
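As a sanity check on the KV-cache point above, here is the arithmetic for Llama 3.1's grouped-query attention layout (32 layers, 8 KV heads, head dim 128 for the 8B; 80 layers for the 70B), assuming fp16 cache entries; treat the figures as rough estimates:

```bash
# KV cache per token = 2 (K and V) x layers x kv_heads x head_dim x bytes(fp16)
echo $(( 2 * 32 * 8 * 128 * 2 ))   # Llama 3.1 8B:  131072 B/token  -> ~128 MB per 1K tokens
echo $(( 2 * 80 * 8 * 128 * 2 ))   # Llama 3.1 70B: 327680 B/token  -> ~320 MB per 1K tokens
# Older models without grouped-query attention (e.g. Llama 2 7B with 32 KV heads)
# land closer to 500 MB per 1K tokens, the high end of the range above.
```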
Top Tools
| Tool | Best For | Price |
| --- | --- | --- |
| Ollama | Easiest setup | Free |
| vLLM | High throughput | Free |
| Llama.cpp | CPU / edge | Free |
| Caddy | HTTPS proxy | Free |
| Hetzner GPU | Cheap GPU VPS | $70-500/mo |
FAQs
Q: Ollama vs vLLM?
Ollama: simplest to set up, fine for one or a few users. vLLM: more moving parts, but continuous batching gives roughly 10x the throughput under concurrent load.
Q: Which GPU for production?
RTX 4090 (24GB) for indie. A100 (80GB) for scale. H100 for frontier.
Q: Can I run on CPU only?
Yes — 8B quantized model on 16GB RAM. ~5-10 tok/sec. Fine for batch jobs.
Q: Is this cheaper than OpenAI API?
Above ~5M tokens/mo, yes. Below that, APIs are cheaper.
Q: Data privacy vs hosted API?
Everything stays on hardware you control: prompts and outputs never leave your VPS.
Q: Can I serve multiple models?
Yes. Ollama loads whichever pulled model a request names; with vLLM you typically run one server per model and route by model name at the proxy or load balancer.
Conclusion
Self-hosting LLMs in 2026 is easier than ever. Start with Ollama on a cheap GPU VPS, graduate to vLLM when throughput matters. Full control, full privacy, predictable cost.