MoE

Mixture of Experts (MoE): a model architecture in which each forward pass activates only a fraction of the total parameters, selected by a learned router. Mixtral, DeepSeek-V3, and Llama 4 use it. The total parameter count is larger, while compute per token stays close to that of a much smaller dense model.

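MoE models replace some dense layers with N expert sub-networks plus a router that picks the top-k experts for each token. A 70B-parameter dense model uses all 70B on every forward pass. A 70B-total MoE with top-2-of-8 routing activates only 2 of every 8 experts per token: about 17B if nearly all parameters sat in the experts, and somewhat more in practice because attention and embedding layers are shared by every token. All 70B parameters still have to be stored.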
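The top-k routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names `route_top_k` and `moe_layer` are hypothetical, experts are stand-in functions, and the choice to apply softmax only over the selected experts' scores is an assumption (some models normalize over all experts first).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts.

    Assumption: weights come from a softmax over the selected logits only.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, experts, router_weights, k=2):
    """Run one token through a toy MoE layer.

    token:          list of floats (the hidden vector)
    experts:        list of callables, each mapping a vector to a vector
    router_weights: one row per expert; the router score is a dot product
    """
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    out = [0.0] * len(token)
    # Only the k selected experts are evaluated; the rest cost no compute.
    for idx, weight in route_top_k(logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out
```

Only the selected experts are ever called inside `moe_layer`, which is where the per-token compute saving comes from. The router itself is small: one score per expert.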

The practical implication: MoE models give you the knowledge capacity of a much larger model at roughly the per-token compute cost of a smaller one. The price is engineering complexity: load balancing so the router does not overuse a few experts, training stability, and a memory footprint that still requires holding all experts in VRAM, because any expert may be selected for the next token.
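The gap between compute and memory can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes expert parameters are split evenly across experts and uses a purely hypothetical breakdown of a 70B-total model into 66B expert parameters and 4B shared parameters; the helper names and the 2-bytes-per-parameter figure (16-bit weights) are illustrative assumptions.

```python
def active_params(expert_params, shared_params, n_experts, top_k):
    """Parameters used per token: shared layers plus the top_k chosen experts.

    Assumes expert parameters are divided evenly among n_experts.
    """
    return shared_params + expert_params * top_k / n_experts


def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed to hold all weights, in GB (10^9 bytes)."""
    return total_params * bytes_per_param / 1e9


# Hypothetical split of a 70B-total MoE with top-2-of-8 routing.
expert_params = 66e9
shared_params = 4e9

per_token = active_params(expert_params, shared_params, n_experts=8, top_k=2)
resident = weight_memory_gb(expert_params + shared_params)

print(f"active per token: {per_token / 1e9:.1f}B parameters")
print(f"weights resident in memory: {resident:.0f} GB at 16-bit precision")
```

Under these assumptions the model computes with about 20.5B parameters per token but still needs room for all 70B, about 140 GB of weights at 16-bit precision.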

See also

Context window

Agentic