AI Week Radar

Evaluation

Eval

Systematic measurement of LLM/agent quality — accuracy, hallucination rate, latency, cost. The discipline you wish you'd started 6 months earlier. Without it, you're shipping vibes.

Evals are the test suite of LLM engineering. The naive version: a list of inputs + expected outputs, scored by string match. The grown-up version: held-out datasets, LLM-as-judge for open-ended quality, regression tracking across model versions, drift detection for production traffic.

Categories worth tracking: task-specific (did the agent close the ticket correctly?), safety (does it refuse out-of-scope requests?), cost/latency (do new prompts blow the budget?), and drift (is production traffic shifting away from what the eval set covers?).

The eval set is also the canary for upgrading models. New model lands → run evals → compare. Without that you're guessing whether the upgrade is a regression.

See also