RAG Architecture Guide: Building Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) connects LLMs to external knowledge bases, enabling accurate, up-to-date answers without constant retraining. This guide covers every layer of a production RAG system from ingestion through generation.
Why RAG instead of fine-tuning?
| Approach | Best for | Limitation |
|---|---|---|
| RAG | Dynamic, frequently updated data | Retrieval quality is a hard ceiling |
| Fine-tuning | Style, tone, domain vocabulary | Cannot inject new facts post-training |
| Prompt stuffing | Small static context | Context window exhaustion, cost |
Core RAG pipeline
User query → Query embedding → Vector similarity search → Top-K chunk retrieval → Context assembly → LLM generation → Response
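The flow above can be sketched as one function. This is a minimal sketch rather than a reference implementation: `embed`, `search`, and `generate` are hypothetical callbacks standing in for whichever embedding model, vector store, and LLM client you use.

```javascript
// Query-time pipeline sketch. embed, search, and generate are hypothetical
// async callbacks supplied by the caller; they are not a real library API.
async function answerQuery(query, { embed, search, generate, topK = 5 }) {
  const queryEmbedding = await embed(query);           // query embedding
  const chunks = await search(queryEmbedding, topK);   // top-K chunk retrieval
  const context = chunks                               // context assembly
    .map((c, i) => `[${i + 1}] ${c.text}`)
    .join('\n\n');
  const response = await generate({ context, question: query }); // LLM generation
  return { response, chunks };
}
```

Returning the retrieved chunks alongside the response makes it easier to log and inspect what the model actually saw.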
Document ingestion
Before querying, you must index your knowledge base:
- Load: PDF, HTML, Markdown, database rows, API responses.
- Chunk: split into semantically coherent segments.
- Embed: convert each chunk to a dense vector.
- Store: persist vectors and metadata in a vector store.
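The four steps above might be wired together roughly as follows. This is a sketch under assumptions: documents are already loaded as `{ id, text, metadata }` objects, and `chunk`, `embed`, and `upsert` are hypothetical callbacks for your own chunker, embedding model, and vector store.

```javascript
// Index-time pipeline sketch. chunk, embed, and upsert are hypothetical
// callbacks; the record shape ({ id, values, metadata }) is an assumption.
async function ingest(documents, { chunk, embed, upsert }) {
  const records = [];
  for (const doc of documents) {
    const pieces = chunk(doc.text);                     // Chunk
    for (let i = 0; i < pieces.length; i++) {
      records.push({
        id: `${doc.id}-chunk-${i}`,                     // stable, traceable id
        values: await embed(pieces[i]),                 // Embed
        metadata: { ...doc.metadata, text: pieces[i] }, // keep text for assembly
      });
    }
  }
  await upsert(records);                                // Store
  return records.length;
}
```

Deriving each chunk id from the document id and chunk position lets you re-ingest a changed document and overwrite its old chunks.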
Chunking strategies
Chunk size is one of the highest-impact decisions in a RAG system:
| Strategy | Chunk size | Use case |
|---|---|---|
| Fixed-size | 256–512 tokens | Simple baseline, fast to implement |
| Sentence | 1–5 sentences | QA over structured prose |
| Paragraph | ~200 tokens | Documentation, blog-like content |
| Semantic | Variable | Complex multi-topic documents |
| Hierarchical | Nested | Legal, medical, long-form reports |
Use overlap (10–15% of chunk size) to avoid cutting context at boundaries.
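As an illustration of the fixed-size strategy with overlap, here is a minimal sketch. It approximates tokens with whitespace-separated words, which is an assumption made for brevity; a real system would count tokens with the embedding model's own tokenizer. The function name and parameters are hypothetical.

```javascript
// Fixed-size chunking with overlap. "Tokens" are approximated by
// whitespace-separated words (an assumption for illustration only).
function chunkWithOverlap(text, chunkSize = 256, overlapRatio = 0.125) {
  const words = text.split(/\s+/).filter(Boolean);
  const overlap = Math.floor(chunkSize * overlapRatio);
  const step = Math.max(1, chunkSize - overlap); // always advance at least one word
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // this chunk reached the end
  }
  return chunks;
}
```

With `chunkSize = 4` and `overlapRatio = 0.25`, consecutive chunks share one word, so a sentence that straddles a boundary is still partly visible in both chunks.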
Embedding models
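// OpenAI text-embedding-3-small: cost-efficient, 1536 dims
import OpenAI from 'openai';
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
// chunkText is the text of one chunk produced during ingestion
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunkText,
});
const vector = response.data[0].embedding;
For local or open-source deployments, consider nomic-embed-text or BAAI/bge-large-en-v1.5 via Ollama or Hugging Face.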
Vector store options
| Store | Hosting | Best for |
|---|---|---|
| Pinecone | Managed cloud | Production scale, low ops overhead |
| Weaviate | Self-hosted / cloud | Multi-tenancy, hybrid search |
| Chroma | In-process / self-hosted | Prototyping, local development |
| pgvector | PostgreSQL extension | Teams already on Postgres |
| Qdrant | Self-hosted / cloud | High-performance Rust core |
Retrieval patterns
Naive top-K retrieval
const results = await vectorStore.similaritySearch(queryEmbedding, 5); // topK = 5
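To make the similarity search concrete, here is a brute-force, in-memory sketch using cosine similarity. It assumes records shaped like `{ id, values }`, and both function names are hypothetical. Production vector stores typically use approximate nearest-neighbor indexes instead of a full scan.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // guard against zero vectors
}

// Brute-force top-K: score every record, sort descending, keep the first k.
function topKSearch(queryEmbedding, records, k = 5) {
  return records
    .map((r) => ({ ...r, score: cosineSimilarity(queryEmbedding, r.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```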
Hybrid search (BM25 + vector)
Combine dense vector search with sparse keyword matching for better recall on exact terms:
const denseHits = await vectorStore.similaritySearch(queryEmbedding, 10);
const sparseHits = await bm25Index.search(queryText, 10);
const reranked = rrf(denseHits, sparseHits); // Reciprocal Rank Fusion
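The `rrf` helper is not defined in the snippet above. One possible implementation is sketched here, assuming each hit carries a unique `id` and each list is already sorted best-first. The constant `k = 60` is the value commonly used for Reciprocal Rank Fusion; the `rrfScore` field is an illustrative addition.

```javascript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// where rank is the 1-based position of d in each list that contains it.
function rrf(denseHits, sparseHits, k = 60) {
  const fused = new Map(); // id -> { hit, score }
  for (const hits of [denseHits, sparseHits]) {
    hits.forEach((hit, index) => {
      const entry = fused.get(hit.id) ?? { hit, score: 0 };
      entry.score += 1 / (k + index + 1); // index is 0-based, rank is 1-based
      fused.set(hit.id, entry);
    });
  }
  return [...fused.values()]
    .sort((a, b) => b.score - a.score)
    .map(({ hit, score }) => ({ ...hit, rrfScore: score }));
}
```

Because RRF uses only ranks, it avoids having to normalize BM25 scores and cosine similarities onto a common scale.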
Contextual compression
Rather than sending full chunks to the LLM, extract only the relevant sentences from each chunk before assembly.
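As a rough illustration only, the sketch below keeps sentences that share at least one non-trivial term with the query. This lexical filter is a crude stand-in chosen for brevity: practical compressors usually rely on an LLM, a reranker, or embedding similarity to judge relevance. The function name and the `minTermLength` parameter are hypothetical.

```javascript
// Crude lexical stand-in for contextual compression: keep only the sentences
// that contain at least one query term of minTermLength characters or more.
function compressChunk(chunkText, query, minTermLength = 4) {
  const terms = new Set(
    query.toLowerCase().split(/\W+/).filter((t) => t.length >= minTermLength)
  );
  const sentences = chunkText.split(/(?<=[.!?])\s+/); // split after ., !, or ?
  return sentences
    .filter((s) => s.toLowerCase().split(/\W+/).some((w) => terms.has(w)))
    .join(' ');
}
```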
HyDE (Hypothetical Document Embeddings)
Have the LLM generate a hypothetical answer first, embed that answer instead of the raw query, then search with it. This is useful when the vocabulary of a question differs significantly from the vocabulary of the documents that answer it.
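A minimal sketch of the HyDE flow follows. The callbacks `generate`, `embed`, and `search` are hypothetical placeholders for your LLM client, embedding model, and vector store, and the prompt wording is only an example.

```javascript
// HyDE sketch. generate, embed, and search are hypothetical async callbacks.
async function hydeSearch(question, { generate, embed, search, topK = 5 }) {
  // 1. Ask the LLM for a plausible (possibly wrong) answer passage.
  const hypothetical = await generate(
    `Write a short passage that answers the question.\nQuestion: ${question}`
  );
  // 2. Embed the hypothetical answer rather than the question itself.
  const embedding = await embed(hypothetical);
  // 3. Retrieve real chunks that sit near the hypothetical answer.
  return search(embedding, topK);
}
```

The hypothetical answer is used only for retrieval and is discarded afterwards, so factual errors in it do not reach the final response directly.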
Context assembly
const systemPrompt = `You are a helpful assistant. Use only the context below.
If the answer is not in the context, say "I don't know."
Context:
${retrievedChunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n')}`;
Metadata filtering
Add metadata to chunks at index time to enable pre-filtering before vector search:
await vectorStore.upsert([{
  id: 'doc-42-chunk-3',
  values: embedding,
  metadata: {
    source: 'docs/api-reference.md',
    category: 'api',
    updated: '2026-01-15',
    language: 'en',
  },
}]);
// Later, filter at query time
const results = await vectorStore.query({
  vector: queryEmbedding,
  topK: 5,
  filter: { category: 'api', language: 'en' },
});
Evaluation checklist
- Context recall: does retrieval include the ground-truth chunk?
- Context precision: how much retrieved context is actually relevant?
- Answer faithfulness: does the LLM stick to retrieved context?
- Answer relevance: does the final answer address the question?
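The two retrieval metrics can be computed directly once you have a labelled set of relevant chunk ids per query. The sketch below uses simple set-based definitions; evaluation frameworks may define these metrics differently (for example, rank-weighted precision). The function name and the `relevantIds` ground-truth input are assumptions. Faithfulness and answer relevance usually need human review or an LLM judge and are not covered here.

```javascript
// Set-based context recall and precision for a single query.
// relevantIds is a hand-labelled ground-truth list (an assumption of this sketch).
function retrievalMetrics(retrievedIds, relevantIds) {
  const retrieved = new Set(retrievedIds); // ignore duplicate retrievals
  const relevant = new Set(relevantIds);
  let hits = 0;
  for (const id of retrieved) {
    if (relevant.has(id)) hits++;
  }
  return {
    contextRecall: relevant.size === 0 ? 0 : hits / relevant.size,
    contextPrecision: retrieved.size === 0 ? 0 : hits / retrieved.size,
  };
}
```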
Production checklist
- Version your embedding model: changing it invalidates the entire index, because vectors from different models are not comparable and every chunk must be re-embedded.
- Cache embeddings for frequently reused documents.
- Monitor retrieval latency separately from generation latency.
- Log query + retrieved chunks for debugging and retraining.
- Set a fallback path when retrieval returns low-confidence results.
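The fallback item could be implemented along these lines. This sketch assumes each result carries a similarity `score` where higher is better. The function name and the 0.75 threshold are placeholders, and the threshold should be tuned against your own score distribution.

```javascript
// Decide whether retrieval is confident enough to ground an answer.
// minScore = 0.75 is an arbitrary placeholder, not a recommended value.
function selectContext(results, minScore = 0.75) {
  const confident = results.filter((r) => r.score >= minScore);
  if (confident.length === 0) {
    // Nothing cleared the bar: let the caller answer "I don't know",
    // broaden the search, or route the query to a human.
    return { useFallback: true, chunks: [] };
  }
  return { useFallback: false, chunks: confident };
}
```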
Takeaway
RAG systems fail at the retrieval layer more often than at the generation layer. Invest in chunk quality, embedding model selection, and hybrid search before optimizing prompts.