RAG

Retrieval-Augmented Generation · Pipeline · Chunking · Embeddings · Vector DBs · Evaluation

🧩
What is RAG?

Definition

RAG grounds LLM responses in external knowledge by retrieving relevant documents at query time and injecting them into the prompt context.

Why RAG?

Reduces hallucinations: grounded in real docs
No retraining needed: update the KB, not the model
Cites sources: traceable answers
Private data: docs never leave your infra

vs Fine-tuning

RAG = dynamic knowledge retrieval. Fine-tuning = baked-in knowledge. Use RAG for frequently changing data; fine-tune for style/behavior.

🔄
RAG Pipeline

Indexing (offline)

1. Load documents: PDFs, web pages, databases, APIs.
2. Chunk: split into smaller pieces (500–1000 tokens).
3. Embed: convert chunks to dense vectors via an embedding model.
4. Store: save vectors + metadata to a vector database.
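
A minimal indexing sketch in Python, assuming the openai and chromadb packages; the collection name, chunk texts, and metadata keys are illustrative placeholders, not a prescribed setup.

```python
import chromadb
from openai import OpenAI

client = OpenAI()                                       # assumes OPENAI_API_KEY is set
chroma = chromadb.Client()
collection = chroma.get_or_create_collection("docs")    # illustrative name

def embed(texts: list[str]) -> list[list[float]]:
    # One API call embeds a whole batch of chunks.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

chunks = ["chunk one text...", "chunk two text..."]     # output of the chunking step
metadatas = [{"source": "a.pdf"}, {"source": "b.pdf"}]  # enables filtering + citation

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embed(chunks),
    metadatas=metadatas,
)
```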

Retrieval + Generation (online)

1. User query: an incoming question from the user.
2. Embed query: same embedding model as indexing.
3. Vector search: find the top-k most similar chunks (cosine similarity).
4. Rerank (optional): a cross-encoder reranks for precision.
5. Augment prompt: inject retrieved chunks as context.
6. Generate: the LLM answers grounded in retrieved facts.
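
A query-time sketch under the same assumptions, reusing the collection and embed() helper from the indexing sketch above; the chat model name is an illustrative choice.

```python
def answer(question: str, k: int = 4) -> str:
    # Steps 2-3: embed the query and vector-search for the top-k chunks.
    results = collection.query(query_embeddings=embed([question]), n_results=k)
    docs = results["documents"][0]
    # Step 4 (optional reranking) is omitted here.

    # Step 5: augment the prompt with the retrieved chunks as context.
    context = "\n\n".join(f"<doc{i+1}>{d}</doc{i+1}>" for i, d in enumerate(docs))
    prompt = (
        "Answer using ONLY the provided context. If the answer is not in "
        'the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Step 6: generate an answer grounded in the retrieved facts.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",            # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```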

✂️
Chunking Strategies

Fixed-size

Split every N tokens. Simple but may cut mid-sentence.

Sentence

Split on sentence boundaries. Better coherence.

Recursive

Try paragraph → sentence → word. LangChain's default; see the sketch below.

Semantic

Split on topic shifts using embeddings. Best quality, slowest.

Document-aware

Respect markdown headers, HTML tags, code blocks.

Typical chunk size: 256–1024 tokens
Overlap: 10–20% of chunk size
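
A recursive-chunking sketch, assuming LangChain's langchain-text-splitters package; long_document_text is a placeholder, and note that chunk_size here counts characters by default, not tokens.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive strategy: try paragraph breaks first, then sentences, then words.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # roughly within the typical range (measured in characters)
    chunk_overlap=120,   # ~15% overlap preserves context at chunk boundaries
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(long_document_text)
```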
🔢
Embedding Models
Model                  | Dims | Notes
text-embedding-3-small | 1536 | OpenAI. Fast, cheap, good quality.
text-embedding-3-large | 3072 | OpenAI. Best quality, higher cost.
Amazon Titan Embed v2  | 1024 | AWS-native. Good for Bedrock RAG.
Cohere Embed v3        | 1024 | Multilingual, search-optimized.
BGE-M3                 | 1024 | Open source. Multilingual, strong.
all-MiniLM-L6-v2       |  384 | Tiny, fast. Good for local/edge.
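
A quick embedding sketch with the smallest model in the table, assuming the sentence-transformers package:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dim vectors; handy for local experiments.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(
    ["what is RAG?", "RAG retrieves documents at query time"],
    normalize_embeddings=True,   # unit vectors: dot product == cosine similarity
)
print(vectors.shape)                   # (2, 384)
print(float(vectors[0] @ vectors[1]))  # similarity of the two sentences
```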
🗄️
Vector Databases
DB         | Type         | Best for
Pinecone   | Managed      | Production, serverless, easy setup
Weaviate   | OSS/Cloud    | Hybrid search, GraphQL API
Qdrant     | OSS/Cloud    | High performance, Rust-based, filtering
Chroma     | OSS          | Local dev, prototyping
pgvector   | Postgres ext | Existing Postgres, simple setup
OpenSearch | AWS managed  | AWS ecosystem, hybrid search
FAISS      | Library      | In-memory, research, no infra
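
A FAISS sketch for the no-infra row above; chunk_vectors and query_vector are placeholders for embeddings produced earlier.

```python
import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatIP(dim)        # exact inner-product search
vectors = np.asarray(chunk_vectors, dtype="float32")
faiss.normalize_L2(vectors)           # normalized, so inner product == cosine
index.add(vectors)

query = np.asarray([query_vector], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest chunks
```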
🔍
Retrieval Techniques

Dense retrieval

Semantic similarity via embeddings. Finds conceptually related chunks.

Sparse (BM25)

Keyword matching. Fast, good for exact terms and proper nouns.

Hybrid search

Combine dense + sparse for the best of both worlds. Use RRF to merge (see the sketch after this list).

MMR

Maximal Marginal Relevance — diverse results, reduces redundancy.

HyDE

Hypothetical Document Embeddings — generate a fake answer, embed it, search.

Multi-query

Generate N query variants, retrieve for each, deduplicate.
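
A minimal Reciprocal Rank Fusion merge in plain Python; dense_ids and bm25_ids are placeholder ranked ID lists from the two retrievers.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Sum 1 / (k + rank) per document across rankings; k=60 is conventional.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([dense_ids, bm25_ids])  # docs ranked high by both rise to the top
```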

📝
Prompt Template

System:

You are a helpful assistant. Answer questions using ONLY the provided context. If the answer is not in the context, say "I don't know."

Context:

<doc1> [chunk 1 text] </doc1>

<doc2> [chunk 2 text] </doc2>

Question:

[user question]

Answer:

Tips: Cite sources · Limit context tokens · Instruct "say I don't know" · Use XML tags
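
A sketch that renders this template as chat messages; build_prompt and the source metadata key are illustrative names, not a fixed API.

```python
def build_prompt(question: str, chunks: list[dict]) -> list[dict]:
    # Wrap each chunk in XML tags so the model can cite <docN> by number;
    # "source" comes from stored metadata (illustrative key name).
    context = "\n".join(
        f'<doc{i+1} source="{c["source"]}">{c["text"]}</doc{i+1}>'
        for i, c in enumerate(chunks)
    )
    system = ("You are a helpful assistant. Answer questions using ONLY the "
              "provided context. If the answer is not in the context, "
              'say "I don\'t know."')
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```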
📊
RAG Evaluation Metrics

Faithfulness

Is the answer grounded in the retrieved context? (No hallucination)

Answer Relevance

Does the answer actually address the question?

Context Precision

Are the retrieved chunks relevant to the question?

Context Recall

Did retrieval find all necessary information?

RAGAS score

Combined score across faithfulness, answer relevance, and context precision.

MRR / NDCG

Ranking quality of retrieved documents.
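
MRR is simple to compute by hand; a plain-Python sketch with hypothetical doc IDs:

```python
def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    # Average of 1/rank of the first relevant doc for each query.
    total = 0.0
    for ranking, gold in zip(results, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(results)

# First query hits at rank 1, second at rank 3 -> (1 + 1/3) / 2 ≈ 0.667
print(mrr([["a", "b"], ["x", "y", "z"]], [{"a"}, {"z"}]))
```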

🚀
Advanced RAG Patterns
Parent-Child chunking: Index small chunks, retrieve the parent for context. Better precision + context (sketched after this list).
Contextual retrieval: Prepend a chunk-level context summary before embedding (Anthropic technique).
Agentic RAG: LLM decides when and what to retrieve. Multi-hop reasoning.
GraphRAG: Build knowledge graph from docs. Better for multi-hop questions.
Self-RAG: Model decides whether to retrieve, then critiques its own output.
CRAG: Corrective RAG — evaluates retrieved docs, falls back to web search if poor.
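
A sketch of the parent-child pattern from the list above; sections, split_small, index_chunk, and vector_search are hypothetical helpers standing in for your chunker and vector store.

```python
parent_of: dict[str, int] = {}   # small-chunk id -> parent section id
parents: dict[int, str] = {}     # parent section id -> full section text

for p_id, section in enumerate(sections):                # document-aware splits
    parents[p_id] = section
    for c_id, child in enumerate(split_small(section)):  # hypothetical small chunker
        chunk_id = f"{p_id}:{c_id}"
        parent_of[chunk_id] = p_id
        index_chunk(chunk_id, child)                     # embed + store the small chunk

def retrieve_with_context(query: str, k: int = 4) -> list[str]:
    child_ids = vector_search(query, k)  # hypothetical search over small chunks
    seen, out = set(), []
    for cid in child_ids:                # dedupe parents so the prompt
        pid = parent_of[cid]             # doesn't repeat a section
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out
```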
Best Practices
Chunk with overlap: 10–20% overlap prevents losing context at chunk boundaries.
Store metadata: Source URL, date, section — enables filtering and citation.
Use hybrid search: Dense + sparse catches both semantic and keyword matches.
Rerank retrieved results: Cross-encoder reranking significantly improves precision.
Limit context window: More chunks ≠ better. The top 3–5 high-quality chunks beat 20 mediocre ones.
Evaluate continuously: Use RAGAS or LLM-as-judge to track retrieval and generation quality.
Handle "I don't know": Explicitly instruct the model to admit when context is insufficient.
Cache embeddings: Re-embedding unchanged docs is wasteful. Cache by content hash.
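
A content-hash caching sketch for the last practice above; the cache directory is illustrative, and embed() is the batch helper from the indexing sketch.

```python
import hashlib
import json
from pathlib import Path

CACHE = Path("embedding_cache")   # illustrative on-disk location
CACHE.mkdir(exist_ok=True)

def cached_embed(text: str) -> list[float]:
    # Key by content hash: unchanged docs hit the cache, edited docs re-embed.
    key = hashlib.sha256(text.encode()).hexdigest()
    path = CACHE / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed([text])[0]
    path.write_text(json.dumps(vector))
    return vector
```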