Evaluation
Eval
Systematic measurement of LLM/agent quality — accuracy, hallucination rate, latency, cost. The discipline you wish you'd started 6 months earlier. Without it, you're shipping vibes.
Evals are the test suite of LLM engineering. The naive version: a list of inputs + expected outputs, scored by string match. The grown-up version: held-out datasets, LLM-as-judge for open-ended quality, regression tracking across model versions, drift detection for production traffic.
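The naive version can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: `EvalCase`, `exact_match`, `run_eval`, and `toy_model` are hypothetical names, and `toy_model` stands in for a real LLM call. The `scorer` parameter is where an LLM-as-judge function could be swapped in for open-ended outputs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> float:
    """Naive scorer: 1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(
    model: Callable[[str], str],
    cases: List[EvalCase],
    scorer: Callable[[str, str], float] = exact_match,
) -> float:
    """Return the mean score of `model` over `cases`."""
    if not cases:
        raise ValueError("eval set is empty")
    scores = [scorer(model(case.prompt), case.expected) for case in cases]
    return sum(scores) / len(scores)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; answers from a lookup table."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


CASES = [
    EvalCase("2+2", "4"),
    EvalCase("capital of France", "paris"),
    EvalCase("largest planet", "Jupiter"),
]

accuracy = run_eval(toy_model, CASES)
print(f"accuracy: {accuracy:.2f}")
```

Keeping the scorer pluggable means the same loop and the same held-out cases can serve both string-match checks and judge-based scoring.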
Categories worth tracking: task-specific (did the agent close the ticket correctly?), safety (does it refuse out-of-scope requests?), cost/latency (do new prompts blow the budget?), and drift (is production traffic shifting away from what the eval set covers?).
The eval set is also the canary for upgrading models. New model lands → run evals → compare. Without that you're guessing whether the upgrade is a regression.
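One way to make the "compare" step concrete is to diff per-case scores instead of only the aggregate. The sketch below assumes each model's results are stored as a mapping from a case id to a score in [0, 1]; `find_regressions`, the case ids, and the scores are all made up for illustration.

```python
from typing import Dict, List


def mean_score(scores: Dict[str, float]) -> float:
    """Average score across all cases."""
    return sum(scores.values()) / len(scores)


def find_regressions(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return ids of cases where the candidate scores worse than the baseline.

    A case missing from the candidate results is treated as a score of 0.0.
    """
    return sorted(
        case_id
        for case_id, base in baseline.items()
        if candidate.get(case_id, 0.0) < base - tolerance
    )


# Illustrative per-case scores for the current model and the upgrade.
baseline = {"ticket-1": 1.0, "ticket-2": 1.0, "ticket-3": 0.0}
candidate = {"ticket-1": 1.0, "ticket-2": 0.0, "ticket-3": 1.0}

print(f"baseline mean:  {mean_score(baseline):.2f}")
print(f"candidate mean: {mean_score(candidate):.2f}")
print("regressed cases:", find_regressions(baseline, candidate))
```

In this made-up data the two means are identical, yet one case that used to pass now fails, which is why per-case tracking is worth keeping alongside the headline number.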
See also
-
RAG
Retrieval-Augmented Generation
Looking up relevant context from an external store (vector DB, docs, your own corpus) and stuffing it into the LLM prompt before answering. Reduces hallucination, costs less than fine-tuning, but adds a retrieval failure mode of its own.
-
Agentic
Agentic systems
LLM-driven loops that plan, take actions in the world (call tools, edit files, hit APIs), observe results, and iterate, rather than just answering a single prompt. The dominant 2026 paradigm for AI engineering.
-
Context window
Context window
The maximum number of tokens an LLM can consider per forward pass. 2026 frontier: 1M+ for some models (Claude 4.7, Gemini 2.5 Pro). Bigger window ≠ better answer: recall degrades inside long contexts.