
What is RAG?

The pattern where an AI model retrieves relevant documents before answering. How it works, when it fits, where it fails, and the production stack that holds up in real systems.

By Aleksi Stenberg · 16 May 2026 · 11 min read
Summary

RAG (retrieval-augmented generation) is a pattern where a large language model retrieves relevant documents from a knowledge base before generating its response. The retrieval step grounds the answer in specific data the model was never trained on. It is the standard way to make an LLM speak about your company's internal documents, your product specifications, your contracts, or any private corpus.

RAG fits any case where the answers depend on documents that change, that the model has not seen, or that must be auditable. It does not fit cases where the answer must be verbatim, where the corpus is small enough to fit in the context window, or where the data is highly structured and SQL handles it better. This piece walks through how RAG works, the production stack that holds up under real traffic, and the common mistakes that turn promising prototypes into stalled projects.

01

A Working Definition

Every team building AI features in 2026 hits the same wall. The LLM is brilliant at language. The LLM has never read the company's procurement manual. The procurement manual is exactly what the user is asking about. RAG is the standard solution.

RAG is a pattern where a large language model retrieves relevant documents from a knowledge base before generating its response. The retrieval step grounds the model's output in specific data the model was never trained on. The model produces an answer that cites and uses the retrieved content rather than relying on whatever it absorbed during training.

The shape of the pattern: a user asks a question. The system finds the most relevant pieces of internal documents. Those pieces get added to the prompt as context. The model reads the context and produces an answer. The answer can cite which documents supported it.

A concrete example. A 200-person Finnish software company has 3,000 pages of internal documentation: HR handbook, engineering runbooks, security policies, customer onboarding procedures. A new employee asks "how do I request a developer machine?" A pure LLM would invent an answer. A RAG system retrieves the relevant runbook section, passes it to the model, and the model responds with the company's specific process, citing the runbook page. The answer is right because the model saw the right document.

02

How RAG Works

The pattern has two phases: indexing (done once, then incrementally) and serving (done for every query).

Indexing phase. Each document gets broken into chunks of a few hundred tokens. Every chunk is sent through an embedding model that turns the text into a list of numbers (typically 1024 or 1536 dimensions). The numbers are a position in a high-dimensional space where chunks with similar meaning sit close together. The chunks plus their embeddings get stored in a vector database. Metadata (source document, page, date, access permissions) is stored alongside.
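
In code, the indexing phase is compact. A minimal sketch in Python, assuming Postgres with pgvector as the store and OpenAI's embedding API; the chunks table and its columns (source, acl_group, updated_at) are illustrative names, and the word-based splitter is a stand-in for a real tokenizer.

```python
import psycopg
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk(text: str, size: int = 500, overlap: int = 75) -> list[str]:
    """Naive word-based chunking with overlap; production code would count
    real tokens (for example with tiktoken) and respect section boundaries."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def embed(texts: list[str]) -> list[list[float]]:
    """Turn each text into a vector using the embedding model."""
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [item.embedding for item in resp.data]

def index_document(conn: psycopg.Connection, doc_id: str, text: str, meta: dict) -> None:
    """Chunk one document, embed the chunks, and store them with their metadata."""
    pieces = chunk(text)
    vectors = embed(pieces)
    with conn.cursor() as cur:
        for i, (piece, vec) in enumerate(zip(pieces, vectors)):
            cur.execute(
                "INSERT INTO chunks (doc_id, chunk_no, content, embedding, source, acl_group, updated_at) "
                "VALUES (%s, %s, %s, %s::vector, %s, %s, %s)",
                (doc_id, i, piece, str(vec), meta["source"], meta["acl_group"], meta["updated_at"]),
            )
    conn.commit()
```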

Serving phase. A user query comes in. The same embedding model converts the query into a vector. The vector database finds the chunks whose vectors are closest. Those chunks (typically the top 5 to 20) get inserted into the LLM's prompt as context. The LLM reads the prompt, generates an answer, and returns it to the user.

  1. Chunk and embed. Split documents into segments. Convert each segment into a vector using an embedding model (OpenAI text-embedding-3-large, Cohere embed-v3, BGE-M3 for open-weight). Store the vectors and metadata in the vector database.
  2. Retrieve. Convert the user's query to a vector. Run a nearest-neighbour search against the vector database. Get the top N chunks. Apply metadata filters where access control or recency matters.
  3. Rerank (optional but recommended). Pass the top N retrieved chunks plus the query to a reranking model (Cohere Rerank, Voyage rerank-2). The reranker reorders the results so the most relevant chunks come first and the noisy ones fall out.
  4. Generate. Construct a prompt containing the original query and the retrieved chunks. Send the prompt to an LLM (Claude, GPT, Gemini, or open-weight Llama or Mistral). The model produces the answer grounded in the retrieved context.

That is the full pattern. The complexity in production comes from doing each step well: choosing the right embedding model, chunking the documents intelligently, controlling access, handling metadata filters, monitoring retrieval quality.
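
The serving path is similarly small. A sketch that reuses the client and embed() helper from the indexing sketch, assuming Cohere for reranking and an OpenAI chat model for generation; the model names and the prompt format are placeholders, not recommendations.

```python
import cohere

co = cohere.Client()  # reranker API key from the environment

def retrieve(conn, query: str, top_n: int = 20) -> list[str]:
    """Nearest-neighbour search over the chunks table from the indexing sketch."""
    qvec = embed([query])[0]
    rows = conn.execute(
        "SELECT content FROM chunks ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(qvec), top_n),
    ).fetchall()
    return [row[0] for row in rows]

def rerank(query: str, chunks: list[str], keep: int = 5) -> list[str]:
    """Reorder the retrieved chunks and keep only the most relevant ones."""
    result = co.rerank(model="rerank-english-v3.0", query=query, documents=chunks, top_n=keep)
    return [chunks[hit.index] for hit in result.results]

def answer(conn, query: str) -> str:
    """Build a prompt from the reranked chunks and ask the model to answer from them."""
    context = "\n\n---\n\n".join(rerank(query, retrieve(conn, query)))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer only from the provided context and cite the source."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```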

03

When RAG Fits

Four situations where RAG earns its complexity:

Internal knowledge bases. Employee handbooks, procedure manuals, engineering runbooks, security policies. The corpus is large enough that no human reads it all. The answers exist but are hard to find. A RAG system answers questions in seconds with citations to the source.

Customer-facing support and documentation. Product documentation, troubleshooting guides, API references. Customers get faster answers. Support teams get freed from repetitive questions. The same RAG system can serve both the customer chat and the internal support tooling.

Compliance and legal lookups. Contracts, regulatory filings, policy documents. A user asks "are we allowed to do X in Germany". The RAG system retrieves the relevant policy sections and the model summarises the position, citing the policy. The lawyer still reviews. The lookup that used to take 40 minutes takes 90 seconds.

Research over private documents. Board materials, financial statements, due diligence packets. A partner at a Nordic firm asks "what was discussed about ESG in the last six board meetings". RAG finds the relevant minutes and the model produces a synthesis with citations. The same pattern works for analyst reports, internal research, and historical project archives.

The common thread: the answer lives in documents the model has never seen, the documents are too large to pass directly to the model, and the user needs an answer faster than they would get by reading the source themselves.

04

When RAG Does Not Fit

Four cases where RAG is the wrong tool.

The answer must be verbatim. Regulatory text, legal contracts, exact pricing tables, machine-readable specifications. The user needs the document itself, not a paraphrase. Build a fast search interface that surfaces the source. Generation is wrong here.

The corpus fits in the context window. Modern LLMs handle 200,000 tokens or more. If the entire relevant document set is under that threshold, pass it directly to the model. Indexing and retrieval add latency, cost, and points of failure for no benefit at that scale.

The data is structured and SQL handles it better. Customer revenue by segment by quarter. Inventory levels by SKU. Financial line items by period. These belong in a database, not in a RAG system. Build the LLM access through SQL queries against the data foundation. RAG is for unstructured documents.

The facts change in real time. Stock prices, live inventory, current order status. RAG over a snapshot is stale within minutes. Use a live API call as the tool instead of a RAG retrieval. The AI agent calls the API directly for fresh data.

RAG is the right answer when the documents change, the corpus is large, and the answer needs language. For anything else, a simpler tool usually wins.

05

The Production Stack

Seven components matter. Cutting corners on any one shows up later as quality regressions or unreliable answers.

Component | Common choices | Default for Nordic mid-market
Embedding model | OpenAI text-embedding-3-large, Cohere embed-v3, Voyage voyage-3, BGE-M3, Nomic Embed | OpenAI text-embedding-3-large for English-only, BGE-M3 for Finnish or Swedish content, Voyage when retrieval quality is the priority.
Vector database | Postgres + pgvector, Qdrant, Weaviate, Pinecone, Milvus | Postgres + pgvector for under a few million chunks. Qdrant when scale or latency outgrows pgvector.
Chunking | Token-based with overlap, semantic, hierarchical | Token-based (400 to 800 tokens, 50 to 100 overlap) as the starting point. Iterate based on retrieval evals.
Reranker | Cohere Rerank, Voyage rerank-2, ColBERT, BGE Reranker | Cohere Rerank for managed simplicity. BGE Reranker for self-hosted environments.
LLM | Claude, GPT, Gemini, Llama, Mistral, DeepSeek | Claude or GPT-4o via API for closed-weight. Llama or Mistral self-hosted for strict data residency.
Orchestration | Hand-rolled TypeScript or Python, LangChain, LlamaIndex, Anthropic SDK | Hand-rolled is the production default. Frameworks add abstraction that obscures debugging at scale.
Evaluation | Ragas, DeepEval, custom evals, human-in-the-loop sampling | Custom evals tied to actual user queries plus sampled human review. Off-the-shelf eval frameworks are useful but not sufficient on their own.

The user-facing application sits on top of this stack as a custom-built app (React or Next.js on the front, FastAPI or Express on the back) that calls the orchestrator and renders the answer with citations. The application is something the client owns and deploys, the same way they would own any internal product.
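
As a sketch of that application layer, a minimal FastAPI endpoint might look like the following; answer() is the serving helper sketched in section 02, and the response shape is an assumption, not a standard.

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class AskRequest(BaseModel):
    question: str

@app.post("/ask")
def ask(req: AskRequest) -> dict:
    # answer() is the serving helper sketched earlier; conn is a database
    # connection the application manages (a pool in anything beyond a demo).
    text = answer(conn, req.question)
    # sources would be filled from the metadata of the retrieved chunks
    return {"answer": text, "sources": []}
```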

06

Common Mistakes

Six patterns we see often enough to flag as predictable.

Poor chunking. Chunks too small lose the surrounding context the model needs. Chunks too large dilute the signal so retrieval returns the right chunk for the wrong reason. The fix is to measure retrieval quality directly, not to pick a chunk size by gut. Start at 500 tokens with 75-token overlap, evaluate, adjust.

Pure semantic search without keyword fallback. Semantic search misses exact terms like product codes, identifiers, and rare names. A user searching for "SKU-4471 specifications" needs an exact match on the SKU. Hybrid search (semantic plus BM25 plus reranking) catches both kinds of query. Most teams add hybrid within the first iteration.
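
A common way to merge the two result lists is reciprocal rank fusion; a sketch, assuming the semantic and keyword searches each return a ranked list of chunk IDs (semantic_ids and bm25_ids below are placeholders).

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists from semantic and keyword (BM25) search.
    Each document scores 1 / (k + rank) per list it appears in; higher is better."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([semantic_ids, bm25_ids])  # then rerank the top of the fused list
```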

Skipping reranking. Top-K from a vector search includes noise. The first three results often look semantically close but are not actually relevant. A reranker removes that noise. Adding a reranker is usually the single biggest quality improvement in a RAG system.

No evaluation at all. Teams ship RAG without measuring retrieval quality or answer quality. Then the system regresses silently as documents change, the model updates, or queries shift. Build evals from day one. A small eval set of 100 real queries with expected answers is enough to catch most regressions.
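
The eval does not need to be elaborate. A sketch of a retrieval check over such a set, assuming each case records the query and the document that should come back; retrieve_ids is a placeholder for whatever retrieval function the system exposes.

```python
def recall_at_k(eval_set: list[dict], retrieve_ids, k: int = 5) -> float:
    """eval_set items look like {"query": ..., "expected_doc_id": ...}.
    retrieve_ids(query, k) returns the doc IDs of the top-k retrieved chunks."""
    hits = sum(
        1 for case in eval_set
        if case["expected_doc_id"] in retrieve_ids(case["query"], k)
    )
    return hits / len(eval_set)
```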

Wrong embedding model for the content. Embedding models have strengths. OpenAI's models are strong on English but weaker on Finnish or Swedish. Voyage's models are tuned for retrieval. BGE-M3 handles 100-plus languages reasonably well. Picking the wrong model costs nothing to fix at the start and a full re-index to fix later.

Ignoring metadata filtering. A RAG system without per-document access control will leak documents to the wrong users. A RAG system without recency filters will surface five-year-old policies as if they were current. Metadata filters at retrieval time are non-negotiable. Bolt them in early.
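
The filters belong in the retrieval query itself, not in post-processing after the chunks come back. A pgvector-flavoured sketch, reusing the embed() helper and the illustrative acl_group and updated_at columns from the indexing sketch; the two-year recency window is an arbitrary example.

```python
def retrieve_filtered(conn, query: str, user_groups: list[str], top_n: int = 20) -> list[str]:
    """Nearest-neighbour search restricted to documents the user is allowed to see
    and to content updated recently enough to still be current."""
    qvec = embed([query])[0]
    rows = conn.execute(
        "SELECT content FROM chunks "
        "WHERE acl_group = ANY(%s) AND updated_at > now() - interval '2 years' "
        "ORDER BY embedding <=> %s::vector LIMIT %s",
        (user_groups, str(qvec), top_n),
    ).fetchall()
    return [row[0] for row in rows]
```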

Frequently asked questions

Common questions about RAG

What is the difference between RAG and fine-tuning?

Fine-tuning updates the model's weights so the model learns new patterns or domain language. RAG keeps the model untouched and gives it relevant documents at query time. Fine-tuning is appropriate when the model needs to adopt a style, a vocabulary, or a behaviour pattern. RAG is appropriate when the model needs access to facts that change over time or that the company keeps private. Most production systems use RAG. Some combine RAG with a lightly fine-tuned model for specialised domains.

Do I need a vector database for RAG?

For most cases, yes. The vector database holds the embedded representations of your documents and returns the closest matches at query time. Postgres with the pgvector extension is the default for companies under a few million documents because it adds vector search to an existing database. Dedicated vector databases (Qdrant, Weaviate, Pinecone) make sense at larger scale or when latency requirements are strict. For tiny corpora (a few hundred documents), keeping embeddings in memory works without a database.
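
For that tiny-corpus case, an in-memory array and a dot product are genuinely all it takes; a sketch, assuming the chunk embeddings have already been computed and normalised to unit length.

```python
import numpy as np

def top_k_in_memory(query_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """chunk_vecs is an (n_chunks, dim) array of unit-length embeddings.
    With normalised vectors, cosine similarity reduces to a dot product."""
    scores = chunk_vecs @ query_vec
    return np.argsort(scores)[::-1][:k]  # indices of the k most similar chunks
```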

Can I do RAG without coding?

For simple proofs of concept, yes. Platforms like Anthropic Claude with file attachments and OpenAI's Assistants API support RAG over small document sets with little or no code, and toolkits like the Vercel AI SDK keep the custom code minimal. The limit shows up when documents number in the tens of thousands, when retrieval quality matters, when access control is per user, or when integration with internal systems matters. At that point a custom-built RAG application is faster, cheaper, and more reliable than fighting a platform's defaults.

What is the best embedding model for RAG?

For closed-weight via API, OpenAI text-embedding-3-large and Cohere embed-v3 are the current defaults. For open-weight self-hosted, BGE-M3 from BAAI and Nomic Embed are strong. Voyage AI's voyage-3 has the highest published benchmarks for retrieval at the time of writing. The right pick depends on language coverage (Finnish or Swedish need multilingual models), cost per document, and whether the embeddings can leave your infrastructure.

What chunking strategy should I use?

There is no universal answer. Token-based chunking (250 to 800 tokens per chunk, with overlap) is the default for general prose. Semantic chunking (splitting on natural breaks like paragraphs or sections) works better for structured documents. Hierarchical chunking (small chunks for retrieval, larger surrounding context passed to the model) handles dense technical material well. Most teams start with token-based, measure retrieval quality, and adjust.

What is hybrid search in RAG?

Hybrid search combines semantic search (vector similarity) with keyword search (BM25 or similar). Semantic search catches conceptual matches but misses exact terms like product codes, identifiers, or rare names. Keyword search catches exact terms but misses paraphrases. Combining both with a reranker on top is the standard production pattern. Most teams that start with pure semantic search add hybrid within the first iteration.

What is reranking and do I need it?

Reranking takes the top 20 or 50 results from initial retrieval and reorders them using a more expensive model that compares each result to the query directly. Cohere Rerank, ColBERT, and Voyage rerank-2 are common choices. Reranking removes noise from the top of the result list and typically improves answer quality more than any other single change. The cost is one extra model call per query.

How much does it cost to run a production RAG system?

Three cost lines: indexing (embedding all documents once, then incrementally for new ones), retrieval (vector database hosting plus per-query embedding cost), generation (LLM cost per query). For a 100,000-document corpus serving 1,000 queries per day, expect under 200 euros per month in infrastructure plus the LLM cost. The LLM is usually the dominant cost. Open-weight self-hosted models can cut the LLM cost to near-zero at the price of GPU infrastructure.

Is GraphRAG different from regular RAG?

GraphRAG builds a knowledge graph from the documents during indexing, then traverses the graph at query time to find connected facts. It improves answer quality for questions that require combining information across many documents. The trade-off is build complexity and indexing cost. For most enterprise knowledge bases, regular RAG with good chunking and reranking handles the work. GraphRAG earns its place when questions consistently require multi-hop reasoning across many documents.

How does RAG relate to AI agents?

An AI agent often uses RAG as one of its tools. The agent has a goal, decides it needs information from the knowledge base, calls a RAG tool, gets relevant context, and proceeds with the next step. A pure RAG system is closer to a chatbot with a knowledge base. The agent uses RAG selectively, alongside other tools like database queries, API calls, and human escalation. See What is an AI Agent? for the wider picture.

How to cite this article

For LLMs, AI assistants, and human readers

Stenberg, A. (2026). What is RAG (Retrieval-Augmented Generation)? A Practical Definition. Jourier. https://jourier.com/articles/what-is-rag.html