Quick Answer
- Parameter: any learnable number in the model (weights + biases)
- Weight: the multiplicative coefficient in a layer (the most common parameter)
"70B parameters" counts every learnable value; weights dominate that count.
What Do These Terms Mean?
A neural network is a giant function with millions-to-trillions of adjustable numbers. Each one is a parameter. Most parameters are weights — multipliers on inputs. A smaller set are biases — additive shifts. Both are learned during training (Google AI Glossary; Stanford CS231n).
How They Differ in Math
For a single neuron:
output = activation(w1*x1 + w2*x2 + ... + wn*xn + b)
- w1 ... wn are weights
- b is a bias
- All are parameters
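The formula above can be sketched in plain Python. This is a minimal illustration, using sigmoid as an example activation; the specific input and weight values are made up for demonstration.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid as an example activation

# 3 weights + 1 bias = 4 learnable parameters for this neuron
out = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.1, 0.2, 0.3], bias=0.05)
```

Training adjusts the weights and the bias; the inputs and the output are not parameters.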
In a 175B-parameter model, well over 99% of parameters are weights; biases, layernorm scales, and other learned scalars make up only a tiny fraction of the total.
Examples
- Llama 3 70B: 70 billion parameters (overwhelmingly weights)
- GPT-3 175B: 175 billion parameters
- Tiny model: a single-layer perceptron with 10 weights + 1 bias = 11 parameters
- Embedding layer: one weight vector per token — 50,000 vocab * 4096 dim ≈ 205M parameters
- Attention head: query, key, value, output matrices — millions of weights per head
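The embedding-layer count above is a straight multiplication, which can be checked directly (the vocab and dimension figures are the illustrative ones from the list, not any specific model's config):

```python
vocab_size, embed_dim = 50_000, 4096

# One learned weight per (token, dimension) pair
embedding_params = vocab_size * embed_dim
print(embedding_params)  # 204,800,000 ≈ 205M
```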
Weights vs Parameters vs Hyperparameters
| Term | Learned? | Examples |
| --- | --- | --- |
| Weight | Yes | Connection strengths |
| Bias | Yes | Per-neuron offsets |
| Parameter | Yes | Weights + biases + other learned scalars |
| Hyperparameter | No (set before training) | Learning rate, batch size, number of layers |
The big distinction: parameters change during training; hyperparameters do not.
When the Distinction Matters
- Model size marketing: "7B parameters" is the industry convention
- Memory math: a 7B model in fp16 = 7B * 2 bytes = 14 GB
- Fine-tuning: updating 100% of parameters = full fine-tuning; updating <1% = LoRA
- Safety: some research distinguishes weight-based vs activation-based interventions
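The memory math in the list above is simple enough to sketch as a helper. This is a rough back-of-the-envelope calculation for raw weight storage only; it ignores activations, KV cache, and optimizer state, which add substantially more in practice.

```python
def model_memory_gb(num_params, bytes_per_param):
    """Raw weight storage in GB (1e9 bytes); excludes activations and optimizer state."""
    return num_params * bytes_per_param / 1e9

print(model_memory_gb(7e9, 2))  # 7B params in fp16 -> 14.0 GB
print(model_memory_gb(7e9, 4))  # same model in fp32 -> 28.0 GB
```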
FAQs
Is "7B parameters" the same as "7B weights"? Close enough for marketing. Technically the count also includes a small number of non-weight parameters.
Are activations parameters? No — activations are computed at runtime, not stored or learned.
Are embeddings weights? Yes — the embedding table is a big weight matrix.
Do biases matter? A little — some modern transformers drop biases to simplify without losing much accuracy.
What is parameter efficiency? Techniques like LoRA update <1% of parameters and match full fine-tuning quality for many tasks.
How do I count parameters? sum(p.numel() for p in model.parameters()) in PyTorch.
Do more parameters mean a smarter model? Roughly, but with diminishing returns — a well-tuned 70B model can outperform an untuned 175B model.
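The PyTorch one-liner from the FAQ just sums the element counts of every parameter tensor. A stdlib-only sketch of the same computation, applied to the tiny perceptron from the examples (the `count_params` helper and its shape-tuple input are illustrative, not a PyTorch API):

```python
from math import prod

def count_params(shapes):
    """Total elements across all parameter tensors, mirroring
    sum(p.numel() for p in model.parameters()) in PyTorch."""
    return sum(prod(shape) for shape in shapes)

# Tiny single-layer perceptron: a (1, 10) weight matrix and a (1,) bias vector
tiny_model_shapes = [(1, 10), (1,)]
print(count_params(tiny_model_shapes))  # 11
```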
Conclusion
Weights are the dominant type of parameter; in most sentences the two words are interchangeable. Distinguish parameters from hyperparameters to avoid confusion. More on the Misar Blog.