AI Week Radar

Retrieval-Augmented Generation

RAG

Looking up relevant context from an external store (vector DB, docs, your own corpus) and stuffing it into the LLM prompt before answering. Reduces hallucination, costs less than fine-tuning, but adds a retrieval failure mode of its own.

RAG is the dominant production pattern for grounding LLMs in domain-specific knowledge. Instead of fine-tuning a model on your data — expensive, slow, and freezes the model on a snapshot — you keep the data in a retrieval layer (typically vector embeddings over chunks of text) and at query time fetch the top-k relevant chunks to drop into the prompt.

The wins: cheaper than fine-tuning, updates instantly when your corpus changes, and the model can cite which chunks it used. The losses: retrieval quality becomes its own engineering problem (chunking strategy, embedding choice, reranking, eval), and the context window puts a hard ceiling on how much you can stuff in.

Mature production RAG stacks include a query rewriter, a reranker, citation tracking, and an eval loop — not just the dense vector lookup.

See also