Mixture of Experts
MoE
A model architecture where each forward pass activates only a fraction of total parameters via a learned router. Mixtral, DeepSeek-V3, and Llama 4 use it. Bigger total parameter count, similar compute per token.
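MoE models replace some dense layers with N expert sub-networks plus a router that picks the top-k experts for each token. A 70B-parameter dense model uses all 70B on every forward pass; an MoE with 70B total parameters and top-2-of-8 routing activates only about a quarter of its expert weights per token (roughly 17B if nearly all parameters sit in the experts, somewhat more once shared attention and embedding weights are counted), even though all 70B still have to be stored.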
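The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: `route_top_k` and `moe_layer` are hypothetical names, the router logits are passed in directly (a real router computes them from the token with a learned linear layer), and the gate weights are a softmax over only the selected logits, which is one common convention among several.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, gate_weight), ...] for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Assumption: renormalise over the selected experts only.
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_layer(token, experts, router_logits, k=2):
    """Gate-weighted sum of the selected experts' outputs for one token."""
    out = [0.0] * len(token)
    for idx, w in route_top_k(router_logits, k):
        y = experts[idx](token)  # only k of the N experts are ever evaluated
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy setup: 8 "experts" that just scale the input by 1..8 (stand-ins for MLPs).
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
logits = [0.1, 2.0, -1.0, 0.5, 2.0, 0.0, -0.3, 1.0]  # hypothetical router output

selected = route_top_k(logits, k=2)
output = moe_layer([1.0, 2.0], experts, logits, k=2)
```

With these toy logits, experts 1 and 4 tie for the highest score and each gets a gate weight of 0.5; the other six experts are never called, which is where the per-token compute saving comes from.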
The practical implication: MoE models give you the knowledge breadth of a much larger model at the per-token compute cost of a smaller one. The price is engineering complexity: load balancing across experts, training stability, and a memory footprint that still requires holding all experts in VRAM.
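The gap between compute cost and memory footprint is easy to see with some back-of-the-envelope arithmetic. The sketch below uses a made-up parameter split (6B shared weights plus eight experts of 8B each, summed across layers) chosen only so the total comes to 70B; it does not describe any released model, and `moe_param_counts` and `weight_memory_gb` are hypothetical helper names.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simple MoE.

    shared_params: weights every token uses (attention, embeddings, router).
    params_per_expert: weights in one expert, summed over all MoE layers.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

def weight_memory_gb(num_params, bytes_per_param=2):
    """Memory needed just to hold the weights (2 bytes/param = 16-bit)."""
    return num_params * bytes_per_param / 1e9

# Hypothetical split, not taken from any real model.
total, active = moe_param_counts(
    shared_params=6e9, params_per_expert=8e9, num_experts=8, top_k=2
)

print(f"total params:  {total / 1e9:.0f}B")
print(f"active params: {active / 1e9:.0f}B per token")
print(f"weights in memory at 16-bit: {weight_memory_gb(total):.0f} GB")
```

Under these assumptions each token touches 22B parameters, yet the weights alone occupy 140 GB at 16-bit precision, because any expert may be selected for the next token and so all of them have to stay loaded.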
See also
- Context window: The maximum number of tokens an LLM can consider per forward pass. 2026 frontier: 1M+ for some models (Claude 4.7, Gemini 2.5 Pro). Bigger window ≠ better answer; recall degrades inside long contexts.
- Agentic systems: LLM-driven loops that plan, take actions in the world (call tools, edit files, hit APIs), observe results, and iterate, rather than just answering a single prompt. The dominant 2026 paradigm for AI engineering.