
RAG Architecture Guide: Building Retrieval-Augmented Generation Systems


Retrieval-Augmented Generation (RAG) connects LLMs to external knowledge bases so that answers can draw on current, source-backed information without retraining the model. This guide covers each layer of a production RAG system, from ingestion through generation.

Why RAG instead of fine-tuning?

Approach        | Best for                          | Limitation
RAG             | Dynamic, frequently updated data  | Retrieval quality is a hard ceiling
Fine-tuning     | Style, tone, domain vocabulary    | Cannot inject new facts post-training
Prompt stuffing | Small static context              | Context window exhaustion, cost

Core RAG pipeline

User query
  → Query embedding
  → Vector similarity search
  → Top-K chunk retrieval
  → Context assembly
  → LLM generation
  → Response

Document ingestion

Before querying, you must index your knowledge base:

  1. Load: PDF, HTML, Markdown, database rows, API responses.
  2. Chunk: split into semantically coherent segments.
  3. Embed: convert each chunk to a dense vector.
  4. Store: persist vectors and metadata in a vector store.
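
The four steps above can be sketched as a single function. This is a minimal illustration, not a production loader: `ingestDocument`, `chunkFn`, `embedFn`, and `store` are hypothetical names, and the caller is assumed to supply the chunker, the embedding call, and an array-like store.

```javascript
// Sketch of the load -> chunk -> embed -> store flow.
// chunkFn, embedFn, and store are hypothetical stand-ins supplied by the caller.
async function ingestDocument(doc, { chunkFn, embedFn, store }) {
  const chunks = chunkFn(doc.text); // step 2: split the loaded text
  let index = 0;
  for (const text of chunks) {
    const values = await embedFn(text); // step 3: embed each chunk
    store.push({
      // step 4: persist the vector together with its metadata
      id: `${doc.id}-chunk-${index}`,
      values,
      metadata: { source: doc.source, text },
    });
    index += 1;
  }
  return chunks.length; // number of chunks indexed
}
```

Passing the chunker and embedder in as arguments keeps the flow testable with stubs and makes it easy to swap the embedding model later.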

Chunking strategies

Chunk size is one of the highest-impact decisions in a RAG system:

Strategy     | Chunk size     | Use case
Fixed-size   | 256–512 tokens | Simple baseline, fast to implement
Sentence     | 1–5 sentences  | QA over structured prose
Paragraph    | ~200 tokens    | Documentation, blog-like content
Semantic     | Variable       | Complex multi-topic documents
Hierarchical | Nested         | Legal, medical, long-form reports

Use overlap (10–15% of chunk size) to avoid cutting context at boundaries.
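
As a rough illustration of fixed-size chunking with overlap, the sketch below slides a window over the text. It assumes whitespace-separated words as a stand-in for model tokens; a real pipeline would count tokens with the embedding model's tokenizer. The function name and defaults are illustrative.

```javascript
// Fixed-size chunking with overlap.
// Assumption: whitespace-separated words approximate tokens.
function chunkWithOverlap(text, chunkSize = 256, overlapRatio = 0.125) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const overlap = Math.floor(chunkSize * overlapRatio);
  const step = Math.max(1, chunkSize - overlap); // always advance at least one token
  const chunks = [];
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= tokens.length) break; // this window reached the end
  }
  return chunks;
}
```

With `chunkSize = 4` and `overlapRatio = 0.25`, each chunk repeats the last word of the previous one, so text near a boundary appears in both neighbors.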

Embedding models

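// OpenAI text-embedding-3-small: cost-efficient, 1536 dimensions by default
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: chunkText, // one chunk produced by the chunking step
});
const vector = response.data[0].embedding; // array of floats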

For local or open-source deployments, consider nomic-embed-text or BAAI/bge-large-en-v1.5 via Ollama or HuggingFace.

Vector store options

Store    | Hosting                   | Best for
Pinecone | Managed cloud             | Production scale, low ops overhead
Weaviate | Self-hosted / cloud       | Multi-tenancy, hybrid search
Chroma   | In-process / self-hosted  | Prototyping, local development
pgvector | PostgreSQL extension      | Teams already on Postgres
Qdrant   | Self-hosted / cloud       | High-performance Rust core

Retrieval patterns

Naive top-K retrieval

const results = await vectorStore.similaritySearch(queryEmbedding, 5); // second argument is top-K
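
To make the idea concrete, here is a brute-force version of top-K retrieval using cosine similarity over an in-memory array. This is only a sketch: `topKBySimilarity` and the `{ id, values }` record shape are assumptions for illustration, and real vector stores typically use approximate nearest-neighbor indexes rather than a full scan.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every record against the query and keep the k best.
// records: [{ id, values }] (hypothetical shape)
function topKBySimilarity(queryVector, records, k = 5) {
  return records
    .map((r) => ({ id: r.id, score: cosineSimilarity(queryVector, r.values) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```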

Hybrid search (BM25 + vector)

Combine dense vector search with sparse keyword matching for better recall on exact terms:

const denseHits  = await vectorStore.similaritySearch(queryEmbedding, 10);
const sparseHits = await bm25Index.search(queryText, 10);
const reranked   = rrf(denseHits, sparseHits); // Reciprocal Rank Fusion
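
The `rrf` helper in the snippet above is not defined in this guide. One possible implementation is sketched below, assuming each hit carries a unique `id` and each list is already ordered best-first. The constant 60 is the value commonly used for Reciprocal Rank Fusion; treat it as a tunable default rather than a requirement.

```javascript
const RRF_K = 60; // smoothing constant; a commonly used default

// Reciprocal Rank Fusion: each document scores sum(1 / (RRF_K + rank))
// across all lists it appears in, with rank starting at 1.
function rrf(...rankedLists) {
  const scores = new Map();
  for (const hits of rankedLists) {
    hits.forEach((hit, index) => {
      const entry = scores.get(hit.id) ?? { hit, score: 0 };
      entry.score += 1 / (RRF_K + index + 1);
      scores.set(hit.id, entry);
    });
  }
  return [...scores.values()]
    .sort((a, b) => b.score - a.score)
    .map(({ hit, score }) => ({ ...hit, rrfScore: score }));
}
```

Because RRF uses only ranks, it sidesteps the problem that BM25 scores and cosine similarities live on different scales.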

Contextual compression

Rather than sending full chunks to the LLM, extract only the relevant sentences from each chunk before assembly.
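
A very small sketch of the idea follows. It keeps only sentences that share at least one non-trivial term with the query. Keyword overlap is a deliberately crude stand-in chosen for illustration; production systems more often use an LLM or an embedding-based scorer for this step. All names and the stop-word list are hypothetical.

```javascript
// Tiny stop-word list for illustration only.
const STOP_WORDS = new Set([
  'a', 'an', 'the', 'is', 'are', 'of', 'to', 'in', 'and', 'what', 'which', 'how',
]);

// Lowercase, split into alphanumeric terms, drop stop words.
function tokenize(text) {
  return (text.toLowerCase().match(/[a-z0-9]+/g) ?? []).filter((t) => !STOP_WORDS.has(t));
}

// Keep only the sentences of a chunk that overlap with the query terms.
function compressChunk(chunkText, query, minOverlap = 1) {
  const queryTerms = new Set(tokenize(query));
  return chunkText
    .split(/(?<=[.!?])\s+/) // naive sentence split on terminal punctuation
    .filter((sentence) => {
      const overlap = tokenize(sentence).filter((t) => queryTerms.has(t)).length;
      return overlap >= minOverlap;
    })
    .join(' ');
}
```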

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical answer to the query with the LLM, embed that answer instead of the raw query, and search with the resulting vector. This helps when the question and its answer use substantially different vocabulary.
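
The flow can be sketched as three calls in sequence. `hydeSearch`, `generateFn`, `embedFn`, and `searchFn` are hypothetical names: the caller is assumed to wire them to an LLM, an embedding model, and a vector store. The prompt wording is illustrative only.

```javascript
// HyDE sketch: search with the embedding of a generated answer,
// not the embedding of the question itself.
async function hydeSearch(question, { generateFn, embedFn, searchFn }, topK = 5) {
  const hypotheticalAnswer = await generateFn(
    `Write a short passage that plausibly answers the question.\nQuestion: ${question}`
  );
  const vector = await embedFn(hypotheticalAnswer); // embed the answer, not the question
  return searchFn(vector, topK);
}
```

The generated passage may contain factual errors. That is acceptable here because it is used only as a search probe and is never shown to the user.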

Context assembly

const systemPrompt = `You are a helpful assistant. Use only the context below.
If the answer is not in the context, say "I don't know."

Context:
${retrievedChunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n')}`;
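
The template above interpolates every retrieved chunk. A sketch of capping the context at a token budget before assembly is shown below. The function name is hypothetical, and the default counter (characters divided by four) is a rough heuristic assumed for illustration; swap in the tokenizer for your model where accuracy matters.

```javascript
// Keep chunks in ranked order until the token budget would be exceeded.
// countTokens defaults to a rough chars/4 heuristic (an assumption).
function selectChunksWithinBudget(
  chunks,
  maxTokens,
  countTokens = (text) => Math.ceil(text.length / 4)
) {
  const selected = [];
  let used = 0;
  for (const chunk of chunks) {
    const cost = countTokens(chunk.text);
    if (used + cost > maxTokens) break; // stop at the first chunk that does not fit
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```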

Metadata filtering

Attach metadata to chunks at index time so that queries can be restricted to a subset of chunks before the vector search runs:

await vectorStore.upsert([{
  id: 'doc-42-chunk-3',
  values: embedding,
  metadata: {
    source: 'docs/api-reference.md',
    category: 'api',
    updated: '2026-01-15',
    language: 'en',
  }
}]);

// Later, filter at query time
const results = await vectorStore.query({
  vector: queryEmbedding,
  topK: 5,
  filter: { category: 'api', language: 'en' },
});
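
To show what such a filter does conceptually, here is an in-memory sketch that applies exact-match conditions before any similarity scoring. The exact-match semantics and the function names are assumptions for illustration; real vector stores define their own filter syntax and usually support richer operators.

```javascript
// True when every key in the filter equals the record's metadata value.
function matchesFilter(metadata, filter) {
  return Object.entries(filter).every(([key, value]) => metadata[key] === value);
}

// Narrow the candidate set first; similarity search then runs on fewer records.
function preFilter(records, filter) {
  return records.filter((record) => matchesFilter(record.metadata, filter));
}
```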

Evaluation checklist

  • Context recall: does retrieval include the ground-truth chunk?
  • Context precision: how much retrieved context is actually relevant?
  • Answer faithfulness: does the LLM stick to retrieved context?
  • Answer relevance: does the final answer address the question?
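
The two retrieval metrics can be computed directly from chunk IDs once you have labeled ground truth. The sketch below assumes a hypothetical `contextMetrics` helper that takes the retrieved IDs and the IDs judged relevant for one query. Faithfulness and answer relevance are not covered here, since they generally need an LLM judge or human review.

```javascript
// Context recall and precision for a single query.
// recall    = relevant chunks retrieved / all relevant chunks
// precision = relevant chunks retrieved / all chunks retrieved
function contextMetrics(retrievedIds, relevantIds) {
  const retrieved = [...new Set(retrievedIds)]; // ignore duplicate hits
  const relevant = new Set(relevantIds);
  const hits = retrieved.filter((id) => relevant.has(id)).length;
  return {
    recall: relevant.size === 0 ? 0 : hits / relevant.size,
    precision: retrieved.length === 0 ? 0 : hits / retrieved.length,
  };
}
```

Averaging these values over a fixed set of labeled queries gives a baseline to compare against when you change chunk size, embedding model, or retrieval strategy.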

Production checklist

  • Version your embedding model: vectors from different models are not comparable, so changing the model means re-embedding the entire index.
  • Cache embeddings for frequently reused documents.
  • Monitor retrieval latency separately from generation latency.
  • Log query + retrieved chunks for debugging and retraining.
  • Set a fallback path when retrieval returns low-confidence results.
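
The last item can be sketched as a simple gate on similarity scores. The function name, the `{ score }` hit shape, and the 0.75 threshold are all assumptions for illustration; a sensible threshold depends on the embedding model and should be tuned on your own data.

```javascript
// Decide whether retrieval is good enough to ground an answer.
// minScore is an arbitrary illustrative threshold, not a recommended value.
function selectConfidentHits(hits, minScore = 0.75) {
  const confident = hits.filter((hit) => hit.score >= minScore);
  return confident.length > 0
    ? { mode: 'rag', chunks: confident }
    : { mode: 'fallback', chunks: [] }; // e.g. answer "I don't know" or escalate
}
```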

Takeaway

RAG systems fail at the retrieval layer more often than at the generation layer. Invest in chunk quality, embedding model selection, and hybrid search before optimizing prompts.