RAG

Retrieval-Augmented Generation · Pipeline · Chunking · Embeddings · Vector DBs · Evaluation

🧩
What is RAG?

Definition

RAG grounds LLM responses in external knowledge by retrieving relevant documents at query time and injecting them into the prompt context.

Why RAG?

Reduces hallucinations: grounded in real docs
No retraining needed: update the KB, not the model
Cites sources: traceable answers
Private data: docs never leave your infra

vs Fine-tuning

RAG = dynamic knowledge retrieval. Fine-tuning = baked-in knowledge. Use RAG for frequently changing data; fine-tune for style/behavior.

🔄
RAG Pipeline

Indexing (offline)

1. Load documents: PDFs, web pages, databases, APIs.
2. Chunk: split into smaller pieces (500–1000 tokens).
3. Embed: convert chunks to dense vectors via an embedding model.
4. Store: save vectors + metadata to a vector database.
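
A minimal indexing sketch in Python, assuming the openai and chromadb packages; the collection name, chunk texts, and metadata keys are illustrative placeholders, not a prescribed setup.

```python
import chromadb
from openai import OpenAI

client = OpenAI()                                       # assumes OPENAI_API_KEY is set
chroma = chromadb.Client()
collection = chroma.get_or_create_collection("docs")    # illustrative name

def embed(texts: list[str]) -> list[list[float]]:
    # One API call embeds a whole batch of chunks.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [d.embedding for d in resp.data]

chunks = ["chunk one text...", "chunk two text..."]     # output of the chunking step
metadatas = [{"source": "a.pdf"}, {"source": "b.pdf"}]  # enables filtering + citation

collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embed(chunks),
    metadatas=metadatas,
)
```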

Retrieval + Generation (online)

1. User query: an incoming question from the user.
2. Embed query: same embedding model as indexing.
3. Vector search: find the top-k most similar chunks (cosine similarity).
4. Rerank (optional): a cross-encoder reranks for precision.
5. Augment prompt: inject retrieved chunks as context.
6. Generate: the LLM answers grounded in retrieved facts.
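
A query-time sketch under the same assumptions, reusing the collection and embed() helper from the indexing sketch above; the chat model name is an illustrative choice.

```python
def answer(question: str, k: int = 4) -> str:
    # Steps 2-3: embed the query and vector-search for the top-k chunks.
    results = collection.query(query_embeddings=embed([question]), n_results=k)
    docs = results["documents"][0]
    # Step 4 (optional reranking) is omitted here.

    # Step 5: augment the prompt with the retrieved chunks as context.
    context = "\n\n".join(f"<doc{i+1}>{d}</doc{i+1}>" for i, d in enumerate(docs))
    prompt = (
        "Answer using ONLY the provided context. If the answer is not in "
        'the context, say "I don\'t know."\n\n'
        f"Context:\n{context}\n\nQuestion: {question}"
    )

    # Step 6: generate an answer grounded in the retrieved facts.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",            # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```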

✂️
Chunking Strategies

Fixed-size

Split every N tokens. Simple but may cut mid-sentence.

Sentence

Split on sentence boundaries. Better coherence.

Recursive

Try paragraph → sentence → word. LangChain's default; see the sketch below.

Semantic

Split on topic shifts using embeddings. Best quality, slowest.

Document-aware

Respect markdown headers, HTML tags, code blocks.

Typical chunk size: 256–1024 tokens
Overlap: 10–20% of chunk size
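
A recursive-chunking sketch, assuming LangChain's langchain-text-splitters package; long_document_text is a placeholder, and note that chunk_size here counts characters by default, not tokens.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Recursive strategy: try paragraph breaks first, then sentences, then words.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,      # roughly within the typical range (measured in characters)
    chunk_overlap=120,   # ~15% overlap preserves context at chunk boundaries
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_text(long_document_text)
```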
🔢
Embedding Models
Model                  | Dims | Notes
text-embedding-3-small | 1536 | OpenAI. Fast, cheap, good quality.
text-embedding-3-large | 3072 | OpenAI. Best quality, higher cost.
Amazon Titan Embed v2  | 1024 | AWS-native. Good for Bedrock RAG.
Cohere Embed v3        | 1024 | Multilingual, search-optimized.
BGE-M3                 | 1024 | Open source. Multilingual, strong.
all-MiniLM-L6-v2       |  384 | Tiny, fast. Good for local/edge.
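
A quick embedding sketch with the smallest model in the table, assuming the sentence-transformers package:

```python
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 produces 384-dim vectors; handy for local experiments.
model = SentenceTransformer("all-MiniLM-L6-v2")
vectors = model.encode(
    ["what is RAG?", "RAG retrieves documents at query time"],
    normalize_embeddings=True,   # unit vectors: dot product == cosine similarity
)
print(vectors.shape)                   # (2, 384)
print(float(vectors[0] @ vectors[1]))  # similarity of the two sentences
```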
🗄️
Vector Databases
DB         | Type         | Best for
Pinecone   | Managed      | Production, serverless, easy setup
Weaviate   | OSS/Cloud    | Hybrid search, GraphQL API
Qdrant     | OSS/Cloud    | High performance, Rust-based, filtering
Chroma     | OSS          | Local dev, prototyping
pgvector   | Postgres ext | Existing Postgres, simple setup
OpenSearch | AWS managed  | AWS ecosystem, hybrid search
FAISS      | Library      | In-memory, research, no infra
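
A FAISS sketch for the no-infra row above; chunk_vectors and query_vector are placeholders for embeddings produced earlier.

```python
import faiss
import numpy as np

dim = 384
index = faiss.IndexFlatIP(dim)        # exact inner-product search
vectors = np.asarray(chunk_vectors, dtype="float32")
faiss.normalize_L2(vectors)           # normalized, so inner product == cosine
index.add(vectors)

query = np.asarray([query_vector], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # top-5 nearest chunks
```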
🔍
Retrieval Techniques

Dense retrieval

Semantic similarity via embeddings. Finds conceptually related chunks.

Sparse (BM25)

Keyword matching. Fast, good for exact terms and proper nouns.

Hybrid search

Combine dense + sparse for the best of both worlds. Use RRF to merge (see the sketch after this list).

MMR

Maximal Marginal Relevance — diverse results, reduces redundancy.

HyDE

Hypothetical Document Embeddings — generate a fake answer, embed it, search.

Multi-query

Generate N query variants, retrieve for each, deduplicate.
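
A minimal Reciprocal Rank Fusion merge in plain Python; dense_ids and bm25_ids are placeholder ranked ID lists from the two retrievers.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Sum 1 / (k + rank) per document across rankings; k=60 is conventional.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = rrf_merge([dense_ids, bm25_ids])  # docs ranked high by both rise to the top
```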

📝
Prompt Template

System:

You are a helpful assistant. Answer questions using ONLY the provided context. If the answer is not in the context, say "I don't know."

Context:

<doc1> [chunk 1 text] </doc1>

<doc2> [chunk 2 text] </doc2>

Question:

[user question]

Answer:

Tips: Cite sources · Limit context tokens · Instruct "say I don't know" · Use XML tags
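
A sketch that renders this template as chat messages; build_prompt and the source metadata key are illustrative names, not a fixed API.

```python
def build_prompt(question: str, chunks: list[dict]) -> list[dict]:
    # Wrap each chunk in XML tags so the model can cite <docN> by number;
    # "source" comes from stored metadata (illustrative key name).
    context = "\n".join(
        f'<doc{i+1} source="{c["source"]}">{c["text"]}</doc{i+1}>'
        for i, c in enumerate(chunks)
    )
    system = ("You are a helpful assistant. Answer questions using ONLY the "
              "provided context. If the answer is not in the context, "
              'say "I don\'t know."')
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return [{"role": "system", "content": system},
            {"role": "user", "content": user}]
```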
📊
RAG Evaluation Metrics

Faithfulness

Is the answer grounded in the retrieved context? (No hallucination)

Answer Relevance

Does the answer actually address the question?

Context Precision

Are the retrieved chunks relevant to the question?

Context Recall

Did retrieval find all necessary information?

RAGAS score

Combined score across faithfulness, answer relevance, and context precision.

MRR / NDCG

Ranking quality of retrieved documents.
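
MRR is simple to compute by hand; a plain-Python sketch with hypothetical doc IDs:

```python
def mrr(results: list[list[str]], relevant: list[set[str]]) -> float:
    # Average of 1/rank of the first relevant doc for each query.
    total = 0.0
    for ranking, gold in zip(results, relevant):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in gold:
                total += 1.0 / rank
                break
    return total / len(results)

# First query hits at rank 1, second at rank 3 -> (1 + 1/3) / 2 ≈ 0.667
print(mrr([["a", "b"], ["x", "y", "z"]], [{"a"}, {"z"}]))
```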

🚀
Advanced RAG Patterns
Parent-Child chunking: Index small chunks, retrieve the parent for context. Better precision + context (sketched after this list).
Contextual retrieval: Prepend a chunk-level context summary before embedding (Anthropic technique).
Agentic RAG: LLM decides when and what to retrieve. Multi-hop reasoning.
GraphRAG: Build knowledge graph from docs. Better for multi-hop questions.
Self-RAG: Model decides whether to retrieve, then critiques its own output.
CRAG: Corrective RAG — evaluates retrieved docs, falls back to web search if poor.
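
A sketch of the parent-child pattern from the list above; sections, split_small, index_chunk, and vector_search are hypothetical helpers standing in for your chunker and vector store.

```python
parent_of: dict[str, int] = {}   # small-chunk id -> parent section id
parents: dict[int, str] = {}     # parent section id -> full section text

for p_id, section in enumerate(sections):                # document-aware splits
    parents[p_id] = section
    for c_id, child in enumerate(split_small(section)):  # hypothetical small chunker
        chunk_id = f"{p_id}:{c_id}"
        parent_of[chunk_id] = p_id
        index_chunk(chunk_id, child)                     # embed + store the small chunk

def retrieve_with_context(query: str, k: int = 4) -> list[str]:
    child_ids = vector_search(query, k)  # hypothetical search over small chunks
    seen, out = set(), []
    for cid in child_ids:                # dedupe parents so the prompt
        pid = parent_of[cid]             # doesn't repeat a section
        if pid not in seen:
            seen.add(pid)
            out.append(parents[pid])
    return out
```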
Best Practices
Chunk with overlap: 10–20% overlap prevents losing context at chunk boundaries.
Store metadata: Source URL, date, section — enables filtering and citation.
Use hybrid search: Dense + sparse catches both semantic and keyword matches.
Rerank retrieved results: Cross-encoder reranking significantly improves precision.
Limit context window: More chunks ≠ better. The top 3–5 high-quality chunks beat 20 mediocre ones.
Evaluate continuously: Use RAGAS or LLM-as-judge to track retrieval and generation quality.
Handle "I don't know": Explicitly instruct the model to admit when context is insufficient.
Cache embeddings: Re-embedding unchanged docs is wasteful. Cache by content hash.
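
A content-hash caching sketch for the last practice above; the cache directory is illustrative, and embed() is the batch helper from the indexing sketch.

```python
import hashlib
import json
from pathlib import Path

CACHE = Path("embedding_cache")   # illustrative on-disk location
CACHE.mkdir(exist_ok=True)

def cached_embed(text: str) -> list[float]:
    # Key by content hash: unchanged docs hit the cache, edited docs re-embed.
    key = hashlib.sha256(text.encode()).hexdigest()
    path = CACHE / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    vector = embed([text])[0]
    path.write_text(json.dumps(vector))
    return vector
```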