Quick Answer
- Parameter: any learnable number in the model (weights + biases)
- Weight: the multiplicative coefficient in a layer (the most common parameter)
"70B parameters" counts every learnable value; weights dominate that count.
What Do These Terms Mean?
A neural network is a giant function with millions-to-trillions of adjustable numbers. Each one is a parameter. Most parameters are weights — multipliers on inputs. A smaller set are biases — additive shifts. Both are learned during training (Google AI Glossary; Stanford CS231n).
How They Differ in Math
For a single neuron:
output = activation(w1*x1 + w2*x2 + ... + wn*xn + b)
- w1 ... wn are weights
- b is a bias
- All are parameters
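The formula above can be sketched in plain Python. This is a minimal illustration, using sigmoid as an example activation; the specific input and weight values are made up for demonstration.

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 / (1 + math.exp(-z))  # sigmoid as an example activation

# 3 weights + 1 bias = 4 learnable parameters for this neuron
out = neuron(inputs=[0.5, -1.0, 2.0], weights=[0.1, 0.2, 0.3], bias=0.05)
```

Training adjusts the weights and the bias; the inputs and the output are not parameters.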
In a 175B-parameter model, well over 99% of parameters are weights; biases, layernorm scales, and other learned scalars make up only a tiny fraction of the total.
Examples
- Llama 3 70B: 70 billion parameters (overwhelmingly weights)
- GPT-3 175B: 175 billion parameters
- Tiny model: a single-layer perceptron with 10 weights + 1 bias = 11 parameters
- Embedding layer: one weight vector per token — 50,000 vocab * 4096 dim ≈ 205M parameters
- Attention head: query, key, value, output matrices — millions of weights per head
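The embedding-layer count above is a straight multiplication, which can be checked directly (the vocab and dimension figures are the illustrative ones from the list, not any specific model's config):

```python
vocab_size, embed_dim = 50_000, 4096

# One learned weight per (token, dimension) pair
embedding_params = vocab_size * embed_dim
print(embedding_params)  # 204,800,000 ≈ 205M
```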
Weights vs Parameters vs Hyperparameters
| Term | Learned? | Examples |
| --- | --- | --- |
| Weight | Yes | Connection strengths |
| Bias | Yes | Per-neuron offsets |
| Parameter | Yes | Weights + biases + other learned scalars |
| Hyperparameter | No (set before training) | Learning rate, batch size, number of layers |
The big distinction: parameters change during training; hyperparameters do not.
When the Distinction Matters
- Model size marketing: "7B parameters" is the industry convention
- Memory math: a 7B model in fp16 = 7B * 2 bytes = 14 GB
- Fine-tuning: updating 100% of parameters = full fine-tuning; updating <1% = LoRA
- Safety: some research distinguishes weight-based vs activation-based interventions
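The memory math in the list above is simple enough to sketch as a helper. This is a rough back-of-the-envelope calculation for raw weight storage only; it ignores activations, KV cache, and optimizer state, which add substantially more in practice.

```python
def model_memory_gb(num_params, bytes_per_param):
    """Raw weight storage in GB (1e9 bytes); excludes activations and optimizer state."""
    return num_params * bytes_per_param / 1e9

print(model_memory_gb(7e9, 2))  # 7B params in fp16 -> 14.0 GB
print(model_memory_gb(7e9, 4))  # same model in fp32 -> 28.0 GB
```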
FAQs
Is "7B parameters" the same as "7B weights"? Close enough for marketing. Technically the count also includes a small number of non-weight parameters.
Are activations parameters? No — activations are computed at runtime, not stored or learned.
Are embeddings weights? Yes — the embedding table is a big weight matrix.
Do biases matter? A little — some modern transformers drop biases to simplify without losing much accuracy.
What is parameter efficiency? Techniques like LoRA update <1% of parameters and match full fine-tuning quality for many tasks.
How do I count parameters? sum(p.numel() for p in model.parameters()) in PyTorch.
Do more parameters mean a smarter model? Roughly, but with diminishing returns — a well-tuned 70B model can outperform an untuned 175B model.
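The PyTorch one-liner from the FAQ just sums the element counts of every parameter tensor. A stdlib-only sketch of the same computation, applied to the tiny perceptron from the examples (the `count_params` helper and its shape-tuple input are illustrative, not a PyTorch API):

```python
from math import prod

def count_params(shapes):
    """Total elements across all parameter tensors, mirroring
    sum(p.numel() for p in model.parameters()) in PyTorch."""
    return sum(prod(shape) for shape in shapes)

# Tiny single-layer perceptron: a (1, 10) weight matrix and a (1,) bias vector
tiny_model_shapes = [(1, 10), (1,)]
print(count_params(tiny_model_shapes))  # 11
```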
Conclusion
Weights are the dominant type of parameter; in most sentences the two words are interchangeable. Distinguish parameters from hyperparameters to avoid confusion. More on the Misar Blog.