Retrieval-Augmented Generation
RAG
Looking up relevant context from an external store (vector DB, docs, your own corpus) and stuffing it into the LLM prompt before answering. Reduces hallucination, costs less than fine-tuning, but adds a retrieval failure mode of its own.
RAG is the dominant production pattern for grounding LLMs in domain-specific knowledge. Fine-tuning a model on your data is expensive and slow, and it freezes the model's knowledge at a snapshot. With RAG you instead keep the data in a retrieval layer (typically vector embeddings over chunks of text) and, at query time, fetch the top-k most relevant chunks to drop into the prompt.
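The query-time flow can be sketched in a few lines of Python. This is a minimal illustration, not a production recipe: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory `CHUNKS` list stands in for a vector store, and every name here (`retrieve`, `build_prompt`, the sample corpus) is hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list[str]) -> str:
    """Number the retrieved chunks and place them ahead of the question."""
    numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(context, start=1))
    return f"Answer using only the context below.\n\nContext:\n{numbered}\n\nQuestion: {query}"

# Hypothetical corpus, already split into chunks.
CHUNKS = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "Refunds for digital goods require a support ticket.",
]

top = retrieve("how do refunds work", CHUNKS, k=2)
prompt = build_prompt("how do refunds work", top)
```

The resulting `prompt` string is what would be sent to the LLM. In a real system the embeddings would come from a trained model and be precomputed and indexed, rather than recomputed per query as they are here.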
The wins: cheaper than fine-tuning, picks up corpus changes as soon as they are re-indexed (no retraining), and the model can cite which chunks it used. The losses: retrieval quality becomes its own engineering problem (chunking strategy, embedding choice, reranking, eval), and the context window puts a hard ceiling on how much you can stuff in.
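One way to picture the context-window ceiling is as a token budget that retrieved chunks have to fit under. The sketch below assumes the chunks are already ranked best-first; `count_tokens` is a crude whitespace proxy (a real tokenizer will give different counts), and `pack_context` is a hypothetical helper name.

```python
def count_tokens(text: str) -> int:
    """Rough proxy: whitespace-separated words. A real tokenizer counts differently."""
    return len(text.split())

def pack_context(ranked_chunks: list[str], budget: int) -> list[str]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    packed: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first overflow so lower-ranked chunks never jump the queue
        packed.append(chunk)
        used += cost
    return packed
```

Stopping at the first chunk that does not fit is a design choice: it preserves rank order at the cost of sometimes leaving budget unused. Skipping the oversized chunk and continuing would fill the window more fully but can promote weaker chunks.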
Mature production RAG stacks include a query rewriter, a reranker, citation tracking, and an eval loop — not just the dense vector lookup.
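As a rough sketch of how those stages might compose, the pipeline below treats each stage as a plain callable and returns the answer together with the IDs of the chunks it was built from. Everything here is illustrative: `Chunk`, `Answer`, `run_pipeline`, and the toy stages are hypothetical names, the toy stages stand in for real models, and the eval loop is left out.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Chunk:
    doc_id: str
    text: str

@dataclass
class Answer:
    text: str
    citations: list[str]  # doc_ids of the chunks passed to the generator

def run_pipeline(
    query: str,
    rewrite: Callable[[str], str],
    retrieve: Callable[[str], list[Chunk]],
    rerank: Callable[[str, list[Chunk]], list[Chunk]],
    generate: Callable[[str, list[Chunk]], str],
    top_n: int = 3,
) -> Answer:
    rewritten = rewrite(query)                     # normalize or expand the user query
    candidates = retrieve(rewritten)               # broad first-pass lookup
    best = rerank(rewritten, candidates)[:top_n]   # precise second-pass ordering
    text = generate(query, best)                   # LLM call in a real system
    return Answer(text=text, citations=[c.doc_id for c in best])

# Toy stages so the sketch runs end to end.
corpus = [
    Chunk("faq-1", "Refunds are issued within 14 days."),
    Chunk("faq-2", "Offices close on public holidays."),
]

def toy_rewrite(q: str) -> str:
    return q.lower().strip("?! ")

def toy_retrieve(q: str) -> list[Chunk]:
    return list(corpus)  # a real retriever would query a vector index

def toy_rerank(q: str, chunks: list[Chunk]) -> list[Chunk]:
    terms = set(q.split())
    return sorted(chunks, key=lambda c: len(terms & set(c.text.lower().split())), reverse=True)

def toy_generate(q: str, chunks: list[Chunk]) -> str:
    return chunks[0].text if chunks else "No supporting context found."

answer = run_pipeline("Refunds?", toy_rewrite, toy_retrieve, toy_rerank, toy_generate, top_n=1)
```

Keeping the citations on the returned object, rather than parsing them out of the generated text, means the chunk IDs stay available to whatever logging or evaluation code runs afterwards.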
See also
- Embedding (Vector embedding): A dense numeric vector that represents text (or images, audio…) in a learned semantic space. Cosine-similar vectors mean semantically similar content. The thing underneath every RAG pipeline.
- Context window: The maximum number of tokens an LLM can consider per forward pass. 2026 frontier: 1M+ for some models (Claude 4.7, Gemini 2.5 Pro). Bigger window ≠ better answer: recall degrades inside long contexts.
- Eval (Evaluation): Systematic measurement of LLM/agent quality: accuracy, hallucination rate, latency, cost. The discipline you wish you'd started 6 months earlier. Without it, you're shipping vibes.