RAG Architecture Guide: Building Retrieval-Augmented Generation Systems

Retrieval-Augmented Generation (RAG) connects LLMs to external knowledge bases, enabling accurate, up-to-date answers without constant retraining. This guide covers every layer of a production RAG system from ingestion through generation.

Why RAG instead of fine-tuning?

Approach	Best for	Limitation
RAG	Dynamic, frequently updated data	Answer quality is capped by retrieval quality
Fine-tuning	Style, tone, domain vocabulary	Cannot inject new facts post-training
Prompt stuffing	Small static context	Context window exhaustion, cost

Core RAG pipeline

User query
  → Query embedding
  → Vector similarity search
  → Top-K chunk retrieval
  → Context assembly
  → LLM generation
  → Response
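
The flow above can be sketched as a single function. This is a minimal sketch under stated assumptions, not a definitive implementation: `embed`, `search`, and `generate` are hypothetical injected functions standing in for whatever embedding model, vector store, and LLM client you use, and the prompt layout is only illustrative.

```typescript
// All dependency types below are hypothetical stand-ins, not a real library API.
interface RetrievedChunk {
  id: string;
  text: string;
}

interface RagDeps {
  embed: (text: string) => Promise<number[]>;
  search: (vector: number[], topK: number) => Promise<RetrievedChunk[]>;
  generate: (prompt: string) => Promise<string>;
}

async function answerQuery(query: string, deps: RagDeps, topK = 5): Promise<string> {
  // Query embedding
  const queryVector = await deps.embed(query);
  // Vector similarity search and top-K chunk retrieval
  const chunks = await deps.search(queryVector, topK);
  // Context assembly: number each chunk so the answer can cite it
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n');
  const prompt = `Context:\n${context}\n\nQuestion: ${query}`;
  // LLM generation
  return deps.generate(prompt);
}
```

Injecting the three stages keeps each one replaceable and testable in isolation, which matters because retrieval and generation tend to be tuned separately.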

Document ingestion

Before querying, you must index your knowledge base:

Load: PDF, HTML, Markdown, database rows, API responses.
Chunk: split into semantically coherent segments.
Embed: convert each chunk to a dense vector.
Store: persist vectors and metadata in a vector store.

Chunking strategies

Chunk size is one of the highest-impact decisions in a RAG system:

Strategy	Chunk size	Use case
Fixed-size	256–512 tokens	Simple baseline, fast to implement
Sentence	1–5 sentences	QA over structured prose
Paragraph	~200 tokens	Documentation, blog-like content
Semantic	Variable	Complex multi-topic documents
Hierarchical	Nested	Legal, medical, long-form reports

Use overlap (10–15% of chunk size) to avoid cutting context at boundaries.
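
A fixed-size chunker with overlap might look like the sketch below. It approximates tokens by splitting on whitespace, which is an assumption made for illustration; in practice you would count tokens with the tokenizer of your embedding model. The function name and default values are hypothetical.

```typescript
// Fixed-size chunking with overlap.
// Assumption: "tokens" are whitespace-separated words, not model tokens.
function chunkWithOverlap(text: string, chunkSize = 256, overlapRatio = 0.125): string[] {
  if (!Number.isInteger(chunkSize) || chunkSize <= 0) {
    throw new Error('chunkSize must be a positive integer');
  }
  if (overlapRatio < 0 || overlapRatio >= 1) {
    throw new Error('overlapRatio must be in [0, 1)');
  }

  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const overlap = Math.floor(chunkSize * overlapRatio);
  const step = chunkSize - overlap; // at least 1 because overlapRatio < 1
  const chunks: string[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(' '));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= tokens.length) break;
  }
  return chunks;
}
```

With `chunkSize = 4` and `overlapRatio = 0.25`, each chunk repeats the last word of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.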

Embedding models

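import OpenAI from 'openai';

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

// OpenAI text-embedding-3-small: cost-efficient, 1536 dimensions by default
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunkText, // the text of one chunk
});
const vector = response.data[0].embedding;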

For local or open-source deployments, consider nomic-embed-text or BAAI/bge-large-en-v1.5 via Ollama or Hugging Face.

Vector store options

Store	Hosting	Best for
Pinecone	Managed cloud	Production scale, low ops overhead
Weaviate	Self-hosted / cloud	Multi-tenancy, hybrid search
Chroma	In-process / self-hosted	Prototyping, local development
pgvector	PostgreSQL extension	Teams already on Postgres
Qdrant	Self-hosted / cloud	High-performance Rust core

Retrieval patterns

Naive top-K retrieval

const results = await vectorStore.similaritySearch(queryEmbedding, 5); // top 5 chunks
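
To make clear what a similarity search does conceptually, here is a brute-force, in-memory sketch that ranks chunks by cosine similarity. The names `StoredChunk`, `cosineSimilarity`, and `topKByCosine` are hypothetical; a real vector store would typically use an approximate nearest-neighbor index rather than scanning every vector.

```typescript
interface StoredChunk {
  id: string;
  vector: number[];
}

// Cosine similarity: dot product divided by the product of the vector norms.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // treat zero vectors as dissimilar
}

// Score every chunk against the query and keep the k best.
function topKByCosine(query: number[], chunks: StoredChunk[], k = 5) {
  return chunks
    .map((chunk) => ({ id: chunk.id, score: cosineSimilarity(query, chunk.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```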

Hybrid search (BM25 + vector)

Combine dense vector search with sparse keyword matching for better recall on exact terms:

const denseHits  = await vectorStore.similaritySearch(queryEmbedding, 10);
const sparseHits = await bm25Index.search(queryText, 10);
// rrf is a user-defined Reciprocal Rank Fusion helper that merges both rankings
const fused      = rrf(denseHits, sparseHits);
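
Reciprocal Rank Fusion scores each document by summing 1 / (k + rank) over every ranking it appears in, so documents ranked well by both the dense and sparse retrievers rise to the top. The sketch below is one possible helper, written over plain ID lists for simplicity; the function name is hypothetical, and k = 60 is the commonly used default rather than a tuned value.

```typescript
// Reciprocal Rank Fusion over ranked lists of document IDs.
// Each list must be ordered best-first; ranks are 1-based.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60,
): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) || 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.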

Contextual compression

Rather than sending full chunks to the LLM, extract only the relevant sentences from each chunk before assembly.
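
As a rough illustration, the sketch below keeps only the sentences of a chunk that share at least one content word with the query. This lexical heuristic is a stand-in chosen for brevity; production systems more often use an LLM or a reranking model to pick sentences. The stopword list, the sentence splitter, and all names are assumptions.

```typescript
// A tiny, hypothetical stopword list; extend it for real use.
const STOPWORDS = new Set([
  'a', 'an', 'the', 'is', 'are', 'do', 'does', 'how', 'i',
  'to', 'of', 'on', 'in', 'from', 'and', 'or', 'what',
]);

// Lowercased alphanumeric words with stopwords removed.
function contentTerms(text: string): string[] {
  const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
  return words.filter((w) => !STOPWORDS.has(w));
}

// Keep sentences that share at least minSharedTerms content words with the query.
function compressChunk(chunkText: string, query: string, minSharedTerms = 1): string {
  const queryTerms = new Set(contentTerms(query));
  // Crude sentence split on ., !, ? (it will also split decimals and abbreviations).
  const sentences: string[] = chunkText.match(/[^.!?]+[.!?]*/g) || [];
  return sentences
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .filter((s) => contentTerms(s).filter((t) => queryTerms.has(t)).length >= minSharedTerms)
    .join(' ');
}
```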

HyDE (Hypothetical Document Embeddings)

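Ask the LLM to write a hypothetical answer first, embed that answer instead of the raw query, then search with the resulting vector. This helps when the wording of a question differs significantly from the wording of the documents that answer it.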
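
A minimal sketch of HyDE, assuming three hypothetical injected functions (`generate`, `embed`, `search`) that wrap your LLM, embedding model, and vector store. The prompt wording is illustrative only.

```typescript
// Hypothetical dependency shape, not a real library API.
interface HydeDeps {
  generate: (prompt: string) => Promise<string>;
  embed: (text: string) => Promise<number[]>;
  search: (vector: number[], topK: number) => Promise<{ id: string; text: string }[]>;
}

async function hydeSearch(question: string, deps: HydeDeps, topK = 5) {
  // 1. Draft a hypothetical answer. It may be factually wrong; only its
  //    vocabulary and shape matter for retrieval.
  const hypothetical = await deps.generate(
    `Write a short passage that answers the question.\n\nQuestion: ${question}`,
  );
  // 2. Embed the hypothetical answer rather than the question itself.
  const vector = await deps.embed(hypothetical);
  // 3. Search with that vector; the retrieved real chunks feed generation.
  return deps.search(vector, topK);
}
```

The hypothetical passage is discarded after retrieval; only the retrieved chunks should reach the final prompt.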

Context assembly

const systemPrompt = `You are a helpful assistant. Use only the context below.
If the answer is not in the context, say "I don't know."

Context:
${retrievedChunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n')}`;

Metadata filtering

Add metadata to chunks at index time to enable pre-filtering before vector search:

await vectorStore.upsert([{
  id: 'doc-42-chunk-3',
  values: embedding,
  metadata: {
    source: 'docs/api-reference.md',
    category: 'api',
    updated: '2026-01-15',
    language: 'en',
  }
}]);

// Later, filter at query time
const results = await vectorStore.query({
  vector: queryEmbedding,
  topK: 5,
  filter: { category: 'api', language: 'en' },
});
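
Filter syntax varies between vector stores, but the semantics of a simple equality filter like the one above can be sketched in memory. The helper names below are hypothetical and only model exact-match filtering; real stores usually also support ranges and set membership.

```typescript
type Metadata = Record<string, string | number | boolean>;

// True when every key in the filter has an identical value in the metadata.
function matchesFilter(metadata: Metadata, filter: Metadata): boolean {
  return Object.keys(filter).every((key) => metadata[key] === filter[key]);
}

// Pre-filter: narrow the candidate set before any vector comparison runs.
function preFilter<T extends { metadata: Metadata }>(records: T[], filter: Metadata): T[] {
  return records.filter((r) => matchesFilter(r.metadata, filter));
}
```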

Evaluation checklist

Context recall: does retrieval include the ground-truth chunk?
Context precision: how much retrieved context is actually relevant?
Answer faithfulness: does the LLM stick to retrieved context?
Answer relevance: does the final answer address the question?
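
The two retrieval metrics can be approximated with simple set arithmetic when you have labeled relevant chunk IDs per question. The sketch below uses these simplified set-based definitions, which is an assumption: evaluation frameworks often define the same terms with rank weighting or LLM judgments. It also assumes IDs within each list are unique.

```typescript
// Fraction of the ground-truth chunks that retrieval returned.
function contextRecall(retrievedIds: string[], relevantIds: string[]): number {
  if (relevantIds.length === 0) return 1; // nothing to find (convention for this sketch)
  const retrieved = new Set(retrievedIds);
  const found = relevantIds.filter((id) => retrieved.has(id)).length;
  return found / relevantIds.length;
}

// Fraction of the retrieved chunks that are actually relevant.
function contextPrecision(retrievedIds: string[], relevantIds: string[]): number {
  if (retrievedIds.length === 0) return 0;
  const relevant = new Set(relevantIds);
  const hits = retrievedIds.filter((id) => relevant.has(id)).length;
  return hits / retrievedIds.length;
}
```

Faithfulness and answer relevance need a judgment on generated text, so they are usually scored by human review or an LLM judge rather than by ID matching.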

Production checklist

Version your embedding model. Vectors from different models are not comparable, so changing the model means re-embedding the entire index.
Cache embeddings for frequently reused documents.
Monitor retrieval latency separately from generation latency.
Log each query together with its retrieved chunks for debugging and retraining.
Set a fallback path when retrieval returns low-confidence results.
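
One way to implement the fallback item is to gate generation on retrieval scores. In the sketch below, the threshold of 0.75 and the type and function names are assumptions for illustration; a sensible threshold depends on your embedding model and similarity metric and should be calibrated on real queries.

```typescript
interface ScoredChunk {
  id: string;
  text: string;
  score: number; // similarity score, higher is better
}

type RetrievalDecision =
  | { kind: 'answer'; chunks: ScoredChunk[] }
  | { kind: 'fallback'; reason: string };

// Keep only confident hits; fall back when too few remain.
function decideRetrieval(hits: ScoredChunk[], minScore = 0.75, minHits = 1): RetrievalDecision {
  const confident = hits.filter((h) => h.score >= minScore);
  if (confident.length < minHits) {
    return {
      kind: 'fallback',
      reason: `only ${confident.length} hit(s) scored at least ${minScore}`,
    };
  }
  return { kind: 'answer', chunks: confident };
}
```

On the fallback branch the application might answer "I don't know", ask a clarifying question, or route to keyword search, rather than letting the LLM generate from weak context.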

Takeaway

RAG systems fail at the retrieval layer more often than at the generation layer. Invest in chunk quality, embedding model selection, and hybrid search before optimizing prompts.