Frequently asked questions
Common questions about RAG
What is the difference between RAG and fine-tuning?
Fine-tuning updates the model's weights so it learns new patterns or domain language. RAG keeps the model untouched and supplies relevant documents at query time. Fine-tuning is appropriate when the model needs to adopt a style, a vocabulary, or a behaviour pattern. RAG is appropriate when the model needs access to facts that change over time or that the company keeps private. Most production systems use RAG; some combine it with a lightly fine-tuned model for specialised domains.
Do I need a vector database for RAG?
For most cases, yes. The vector database holds the embedded representations of your documents and returns the closest matches at query time. Postgres with the pgvector extension is the default for corpora under a few million documents, because it adds vector search to a database most teams already run. Dedicated vector databases (Qdrant, Weaviate, Pinecone) make sense at larger scale or when latency requirements are strict. For tiny corpora (a few hundred documents), keeping embeddings in memory works without any database, as the sketch below shows.
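A minimal sketch of that in-memory option, using the small open-weight all-MiniLM-L6-v2 model via sentence-transformers; the documents and query here are illustrative:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load a small open-weight embedding model; no database involved.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Refunds are issued within 14 days of a return request.",
    "Standard shipping inside the EU takes 3 to 5 business days.",
    "All hardware carries a two-year limited warranty.",
]

# Embed once at startup. Normalised vectors make the dot product
# below equal to cosine similarity.
doc_vectors = model.encode(documents, normalize_embeddings=True)

def search(query: str, k: int = 2) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q                 # cosine similarity per document
    top = np.argsort(scores)[::-1][:k]       # indices of the k closest documents
    return [documents[i] for i in top]

print(search("how long do refunds take?"))
```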
Can I do RAG without coding?
For simple proofs of concept, yes. Anthropic's Claude with file attachments and OpenAI's Assistants API support RAG over small document sets with no custom code, and toolkits like the Vercel AI SDK ship prebuilt RAG patterns that need very little. The limit shows up when documents number in the tens of thousands, when retrieval quality needs tuning, when access control is per-user, or when the system must integrate with internal tools. At that point a custom-built RAG application is faster, cheaper, and more reliable than fighting a platform's defaults.
What is the best embedding model for RAG?
Among closed-weight models served via API, OpenAI's text-embedding-3-large and Cohere's embed-v3 are the current defaults. Among open-weight models for self-hosting, BGE-M3 from BAAI and Nomic Embed are strong choices. Voyage AI's voyage-3 has the highest published retrieval benchmarks at the time of writing. The right pick depends on language coverage (Finnish or Swedish need multilingual models), cost per document, and whether the embeddings may leave your infrastructure.
What chunking strategy should I use?
There is no universal answer. Token-based chunking (250 to 800 tokens per chunk, with overlap) is the default for general prose. Semantic chunking (splitting on natural breaks like paragraphs or sections) works better for structured documents. Hierarchical chunking (small chunks for retrieval, larger surrounding context passed to the model) handles dense technical material well. Most teams start with token-based chunking, measure retrieval quality, and adjust; a minimal version is sketched below.
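A sketch of token-based chunking with overlap using the tiktoken tokenizer; the 500-token chunks and 50-token overlap are illustrative starting points to tune, not recommendations:

```python
import tiktoken

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    # cl100k_base is the tokenizer used by recent OpenAI models; any
    # tokenizer works as long as chunk sizes are measured consistently.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap              # each chunk starts `step` tokens after the last
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break                            # last window already covers the tail
    return chunks
```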
What is hybrid search in RAG?
Hybrid search combines semantic search (vector similarity) with keyword search (BM25 or similar). Semantic search catches conceptual matches but misses exact terms like product codes, identifiers, or rare names. Keyword search catches exact terms but misses paraphrases. Combining both with a reranker on top is the standard production pattern. Most teams that start with pure semantic search add hybrid within the first iteration.
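One simple way to merge the two result lists is reciprocal rank fusion, which combines rankings without having to calibrate the two scoring scales against each other. A minimal sketch, assuming you already have the two ranked lists of document ids:

```python
def rrf_merge(semantic_ids: list[str], keyword_ids: list[str], k: int = 60) -> list[str]:
    # semantic_ids and keyword_ids are document ids, best first, from
    # vector search and BM25 respectively. Each list contributes
    # 1 / (k + rank) per document; k=60 is the commonly used constant
    # that dampens the advantage of the very top positions.
    scores: dict[str, float] = {}
    for ranking in (semantic_ids, keyword_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Documents found by both searches float to the top of the merged list.
merged = rrf_merge(["d3", "d1", "d7"], ["d1", "d9", "d3"])
```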
What is reranking and do I need it?
Reranking takes the top 20 or 50 results from initial retrieval and reorders them using a more expensive model that compares each result to the query directly. Cohere Rerank, ColBERT, and Voyage rerank-2 are common choices. Reranking removes noise from the top of the result list and typically improves answer quality more than any other single change. The cost is one extra model call per query.
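The hosted rerankers above are one API call each; an open-weight alternative is a cross-encoder via sentence-transformers. A minimal sketch, assuming the candidates are the text chunks returned by initial retrieval:

```python
from sentence_transformers import CrossEncoder

# A cross-encoder reads the query and a candidate chunk together,
# which is slower per pair but more accurate than comparing
# precomputed embeddings.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    pairs = [(query, doc) for doc in candidates]    # one pair per retrieved chunk
    scores = reranker.predict(pairs)                # relevance score per pair
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]
```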
How much does it cost to run a production RAG system?
A production RAG system has three cost lines: indexing (embedding all documents once, then incrementally for new ones), retrieval (vector database hosting plus the per-query embedding cost), and generation (the LLM cost per query). For a 100,000-document corpus serving 1,000 queries per day, expect under 200 euros per month in infrastructure plus the LLM cost. The LLM is usually the dominant cost. Open-weight self-hosted models can cut the LLM cost to near zero at the price of GPU infrastructure.
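A back-of-envelope sketch of that arithmetic; every price and token count below is an illustrative assumption to replace with your providers' current rates:

```python
# All prices and token counts are illustrative assumptions; substitute
# your providers' current rates before relying on the numbers.
docs = 100_000
queries_per_day = 1_000
tokens_per_doc = 1_000             # assumed average document length
embed_price = 0.10 / 1_000_000     # assumed euros per embedded token
llm_in_price = 3.00 / 1_000_000    # assumed euros per input token
llm_out_price = 15.00 / 1_000_000  # assumed euros per output token
context_tokens = 4_000             # retrieved chunks plus prompt, per query
answer_tokens = 500                # generated answer, per query

one_time_indexing = docs * tokens_per_doc * embed_price
monthly_llm = queries_per_day * 30 * (
    context_tokens * llm_in_price + answer_tokens * llm_out_price
)
print(f"indexing (one-time): {one_time_indexing:.2f} euros")
print(f"LLM cost (monthly):  {monthly_llm:.2f} euros")
```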
Is GraphRAG different from regular RAG?
GraphRAG builds a knowledge graph from the documents during indexing, then traverses the graph at query time to find connected facts. It improves answer quality for questions that require combining information across many documents. The trade-off is build complexity and indexing cost. For most enterprise knowledge bases, regular RAG with good chunking and reranking handles the work. GraphRAG earns its place when questions consistently require multi-hop reasoning across many documents.
How does RAG relate to AI agents?
An AI agent often uses RAG as one of its tools. The agent has a goal, decides it needs information from the knowledge base, calls a RAG tool, gets relevant context, and proceeds with the next step. A pure RAG system is closer to a chatbot with a knowledge base. The agent uses RAG selectively, alongside other tools like database queries, API calls, and human escalation. See What is an AI Agent? for the wider picture.
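A minimal sketch of what exposing RAG as a tool can look like; the tool schema follows the JSON-schema style common to tool-calling APIs, and search_knowledge_base is a hypothetical wrapper around your retrieval pipeline:

```python
# Tool schema in the JSON-schema style common to tool-calling APIs.
rag_tool = {
    "name": "search_knowledge_base",
    "description": "Search internal documents and return relevant passages.",
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "What to look up."},
        },
        "required": ["query"],
    },
}

def search_knowledge_base(query: str) -> list[str]:
    ...  # hypothetical wrapper: embed the query, retrieve, rerank, return chunks

def handle_tool_call(name: str, arguments: dict) -> str:
    # The agent decides when to call this; RAG is one tool alongside
    # database queries, API calls, and human escalation.
    if name == "search_knowledge_base":
        return "\n\n".join(search_knowledge_base(arguments["query"]))
    raise ValueError(f"Unknown tool: {name}")
```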