Quick Answer
- Training: feeding data to update model weights (happens once per model, costs millions)
- Inference: running the trained model on new inputs (happens billions of times, costs pennies per call)
Both use GPUs but in very different patterns.
What Do These Terms Mean?
During training, gradient updates flow backward through the network, adjusting billions of parameters. During inference, a single forward pass converts input tokens to output tokens — no learning happens (Stanford HAI AI Index, 2024; NVIDIA developer docs).
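A minimal PyTorch sketch of the contrast; the tiny linear model and random tensors are stand-ins, not a real LLM:

```python
import torch

# Stand-in for a model: any torch.nn.Module behaves the same way.
model = torch.nn.Linear(16, 16)

# --- Training step: forward, loss, backward, weight update ---
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x, target = torch.randn(8, 16), torch.randn(8, 16)
loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()       # gradients flow backward through the network
optimizer.step()      # parameters are updated: this is the learning
optimizer.zero_grad()

# --- Inference: a single forward pass, no gradients, no learning ---
with torch.no_grad():
    y = model(torch.randn(1, 16))  # weights are read, never written
```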
How Each Works
Training
- Feed a batch of data (e.g., 1M tokens)
- Compute the loss between prediction and ground truth
- Backpropagate gradients
- Update weights with an optimizer (AdamW, Shampoo)
- Repeat millions of times (sketched below)
GPT-4-class training: ~25,000 GPUs for months, $100M+.
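A sketch of that loop, assuming a hypothetical `get_batch` that yields token and target tensors; real runs add mixed precision, gradient checkpointing, and distributed parallelism:

```python
import torch

def train(model, get_batch, steps, lr=3e-4):
    # AdamW holds two moment tensors per parameter: the "optimizer
    # states" that inflate training memory (see the table below).
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        tokens, targets = get_batch()              # 1. feed a batch
        logits = model(tokens)                     # forward pass
        loss = torch.nn.functional.cross_entropy(  # 2. loss vs ground truth
            logits.view(-1, logits.size(-1)), targets.view(-1))
        loss.backward()                            # 3. backpropagate gradients
        opt.step()                                 # 4. update the weights
        opt.zero_grad()                            # 5. repeat
```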
Inference
- Load pre-trained weights into GPU memory
- Receive user input tokens
- Forward pass through all layers
- Sample next token
- Repeat until a stop token is emitted (sketched below)
Inference for one chat response: <1 second, $0.001-0.10.
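A greedy-decoding sketch of those five steps, assuming a hypothetical `model` that maps token IDs to next-token logits; production servers add KV caching, batching, and real sampling:

```python
import torch

@torch.no_grad()                       # inference: no gradients anywhere
def generate(model, input_ids, stop_id, max_new_tokens=128):
    ids = input_ids                    # weights already loaded, input received
    for _ in range(max_new_tokens):
        logits = model(ids)            # forward pass through all layers
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)  # greedy "sample"
        ids = torch.cat([ids, next_id], dim=1)
        if next_id.item() == stop_id:  # stop token ends generation
            break
    return ids
```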
Examples
- Training: Meta trained Llama 3 on over 15T tokens across months of cluster time
- Inference: ChatGPT serves 300M weekly users — trillions of inferences
- Fine-tuning: a small training run on 10K examples from your support data
- Edge inference: phone model summarizes a webpage offline
- Batch inference: overnight job classifies 10M documents
Training vs Inference Costs
| Aspect | Training | Inference |
| --- | --- | --- |
| Frequency | Once (or periodically) | Every user request |
| Cost scale | Millions of dollars | Cents per call |
| Hardware | H100 / B200 clusters | Anything from phones to H100s |
| Duration | Weeks to months | Milliseconds to seconds |
| Memory pattern | Weights + gradients + optimizer states | Weights + KV cache only |
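Back-of-envelope numbers behind the memory row, using the common mixed-precision AdamW accounting of roughly 16 bytes per parameter in training versus 2 bytes at inference; the 7B model size is illustrative:

```python
params = 7e9                            # hypothetical 7B-parameter model
training = params * (2 + 2 + 4 + 8)     # bf16 weights + bf16 grads
                                        # + fp32 master copy + Adam moments
inference = params * 2                  # bf16 weights (KV cache is extra)
print(f"training:  {training / 1e9:.0f} GB")   # ~112 GB
print(f"inference: {inference / 1e9:.0f} GB")  # ~14 GB
```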
At scale, cumulative inference cost eventually exceeds the one-time training cost; OpenAI reportedly spends more running ChatGPT than it spent training the models behind it.
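A toy break-even calculation, with every number invented purely to show the shape of the curve:

```python
training_cost = 100e6        # one-time, dollars (illustrative)
cost_per_request = 0.005     # dollars per inference call (illustrative)
requests_per_day = 1e9       # illustrative traffic

days = training_cost / (cost_per_request * requests_per_day)
print(days)                  # 20.0: at this scale, inference spend passes
                             # the entire training bill in under a month
```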
When Each Matters
- Builders of foundation models: training dominates
- App developers using APIs: only inference matters
- Enterprises fine-tuning: small training cost + ongoing inference
- Researchers: both
FAQs
Is inference the same as serving? Yes — "serving" is the production engineering around inference.
Can I train on a laptop? LoRA fine-tunes of small models: yes. Training GPT-scale: no.
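A LoRA setup sketch using the Hugging Face peft library; the model name is illustrative, and `target_modules` must match your model's actual layer names:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")  # illustrative
config = LoraConfig(r=8, lora_alpha=16,
                    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(base, config)   # only small adapter matrices train
model.print_trainable_parameters()     # typically well under 1% of weights
```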
Why is inference slow? Because generating each token requires a full forward pass. Speculative decoding helps.
Does RAG affect inference cost? Adds embedding lookup (cheap) and more input tokens (moderate cost).
Is quantization training or inference? Usually post-training optimization applied before inference.
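For example, PyTorch's dynamic quantization converts a trained model's linear layers to int8 in one call, after training and before serving; the model here is a stand-in:

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                            torch.nn.Linear(512, 512))  # stand-in trained model
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)        # int8 weight storage
with torch.no_grad():
    out = quantized(torch.randn(1, 512))                # CPU inference path
```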
What is continuous training? Periodic retraining as new data arrives.
Are training and inference separate teams? In big labs, yes — "pre-training," "post-training," and "serving" are distinct.
Conclusion
Training builds the brain; inference uses it. App builders rarely train — they focus on prompts, retrieval, and evaluation. More on the Misar Blog.