Retrieval-Augmented Generation · Pipeline · Chunking · Embeddings · Vector DBs · Evaluation
Definition
RAG grounds LLM responses in external knowledge by retrieving relevant documents at query time and injecting them into the prompt context.
Why RAG?
vs Fine-tuning
RAG = dynamic knowledge retrieval. Fine-tuning = baked-in knowledge. Use RAG for frequently changing data; fine-tune for style/behavior.
Indexing (offline)
1. **Load** — PDFs, web pages, databases, APIs.
2. **Chunk** — split into smaller pieces (500–1000 tokens).
3. **Embed** — convert chunks to dense vectors via an embedding model.
4. **Store** — save vectors + metadata to a vector database.
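The offline steps can be sketched end to end. Everything below is a toy stand-in: the hash-based `embed` substitutes for a real embedding model, and the list-backed `index` substitutes for a vector database — only the shape of the pipeline (load → chunk → embed → store) is the point.

```python
import hashlib
import math

def embed(text: str, dims: int = 64) -> list[float]:
    # Toy stand-in for a real embedding model: hash character
    # trigrams into a fixed-size vector, then L2-normalize.
    vec = [0.0] * dims
    for i in range(len(text) - 2):
        h = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16)
        vec[h % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size chunking by words (a cheap proxy for tokens),
    # with overlap so context isn't lost at chunk boundaries.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

# "Vector DB": a list of (vector, chunk text, metadata) records.
index: list[tuple[list[float], str, dict]] = []

def ingest(doc: str, source: str) -> None:
    for n, c in enumerate(chunk(doc, size=50, overlap=10)):
        index.append((embed(c), c, {"source": source, "chunk": n}))
```

In a real pipeline you would swap `embed` for an API call (e.g. text-embedding-3-small) and `index` for one of the vector databases listed below; the metadata dict is what later lets you filter and cite sources.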
Retrieval + Generation (online)
1. **Query** — incoming question from the user.
2. **Embed query** — same embedding model as indexing.
3. **Retrieve** — find the top-k most similar chunks (cosine similarity).
4. **Rerank** (optional) — a cross-encoder reranks for precision.
5. **Augment** — inject the retrieved chunks into the prompt as context.
6. **Generate** — the LLM answers grounded in the retrieved facts.
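The embed-and-retrieve steps reduce to a cosine-similarity top-k scan over the stored vectors. A minimal sketch over hand-made 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions, but the math is identical):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, records, k=2):
    # records: (vector, chunk_text) pairs from the index.
    scored = [(cosine(query_vec, v), text) for v, text in records]
    return sorted(scored, reverse=True)[:k]

records = [
    ([1.0, 0.0, 0.0], "Chunking splits documents into pieces."),
    ([0.9, 0.1, 0.0], "Fixed-size chunking may cut mid-sentence."),
    ([0.0, 0.0, 1.0], "Cross-encoders rerank retrieved chunks."),
]
hits = top_k([1.0, 0.0, 0.0], records, k=2)
```

A brute-force scan like this is exactly what FAISS's flat index does; dedicated vector DBs add approximate-nearest-neighbor indexes so the scan scales past millions of vectors.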
Chunking strategies
- **Fixed-size** — split every N tokens. Simple, but may cut mid-sentence.
- **Sentence** — split on sentence boundaries. Better coherence.
- **Recursive** — try paragraph → sentence → word. LangChain default.
- **Semantic** — split on topic shifts using embeddings. Best quality, slowest.
- **Document-aware** — respect Markdown headers, HTML tags, code blocks.
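The recursive strategy is worth seeing concretely. This is a hand-rolled sketch in the spirit of LangChain's `RecursiveCharacterTextSplitter`, not its actual implementation: try the coarsest separator first (paragraph breaks), and only drop to finer ones (sentences, then words) for pieces that are still too long.

```python
def recursive_split(text, max_len=200, seps=("\n\n", ". ", " ")):
    # Base case: the text fits, or there is nothing finer to split on.
    if len(text) <= max_len or not seps:
        return [text]
    sep, finer = seps[0], seps[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator absent at this level: try the next finer one.
        return recursive_split(text, max_len, finer)
    chunks, buf = [], ""
    for part in parts:
        candidate = buf + sep + part if buf else part
        if len(candidate) <= max_len:
            buf = candidate          # keep packing the current chunk
            continue
        if buf:
            chunks.append(buf)
        buf = ""
        if len(part) <= max_len:
            buf = part
        else:
            # A single part can still exceed max_len: recurse finer.
            chunks.extend(recursive_split(part, max_len, finer))
    if buf:
        chunks.append(buf)
    return chunks

text = ("First paragraph here.\n\n"
        "Second paragraph is a bit longer than the first one.")
chunks = recursive_split(text, max_len=40)
```

Note the trade-off the strategy makes: paragraph boundaries are preserved wherever possible, and only the oversized second paragraph gets cut at a word boundary.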
Embedding models
| Model | Dims | Notes |
|---|---|---|
| text-embedding-3-small | 1536 | OpenAI. Fast, cheap, good quality. |
| text-embedding-3-large | 3072 | OpenAI. Best quality, higher cost. |
| Amazon Titan Embed v2 | 1024 | AWS-native. Good for Bedrock RAG. |
| Cohere Embed v3 | 1024 | Multilingual, search-optimized. |
| BGE-M3 | 1024 | Open source. Multi-lingual, strong. |
| all-MiniLM-L6-v2 | 384 | Tiny, fast. Good for local/edge. |
Vector databases
| DB | Type | Best for |
|---|---|---|
| Pinecone | Managed | Production, serverless, easy setup |
| Weaviate | OSS/Cloud | Hybrid search, GraphQL API |
| Qdrant | OSS/Cloud | High perf, Rust-based, filtering |
| Chroma | OSS | Local dev, prototyping |
| pgvector | Postgres ext | Existing Postgres, simple setup |
| OpenSearch | AWS managed | AWS ecosystem, hybrid search |
| FAISS | Library | In-memory, research, no infra |
Retrieval strategies
- **Dense retrieval** — semantic similarity via embeddings; finds conceptually related chunks.
- **Sparse (BM25)** — keyword matching. Fast; good for exact terms and proper nouns.
- **Hybrid search** — combine dense + sparse for the best of both; merge the ranked lists with RRF (Reciprocal Rank Fusion).
- **MMR** — Maximal Marginal Relevance: diversifies results, reduces redundancy.
- **HyDE** — Hypothetical Document Embeddings: generate a hypothetical answer, embed it, search with that vector.
- **Multi-query** — generate N query variants, retrieve for each, deduplicate the results.
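The RRF merge used by hybrid search fits in a few lines: a document scores 1/(k + rank) in every ranked list that contains it, with k = 60 by common convention. The two example rankings below are made up for illustration.

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    # rankings: one list of doc IDs per retriever, best-first.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["d3", "d1", "d2"]   # semantic retriever's ranking
sparse = ["d1", "d4", "d3"]   # BM25 retriever's ranking
fused = rrf_merge([dense, sparse])
```

Because only ranks matter, RRF needs no score normalization — handy since dense similarities and BM25 scores live on incompatible scales.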
Prompt template

```text
System:
You are a helpful assistant. Answer questions using ONLY the provided context. If the answer is not in the context, say "I don't know."

Context:
<doc1> [chunk 1 text] </doc1>
<doc2> [chunk 2 text] </doc2>

Question:
[user question]

Answer:
```
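Assembling that template is plain string formatting; the `<docN>` tags give the model unambiguous chunk boundaries. A sketch — the function name `build_prompt` is our own, and the final LLM call (whatever client you use) is omitted:

```python
SYSTEM = (
    "You are a helpful assistant. Answer questions using ONLY the "
    "provided context. If the answer is not in the context, "
    'say "I don\'t know."'
)

def build_prompt(question: str, chunks: list[str]) -> str:
    # Wrap each retrieved chunk in numbered <docN> tags.
    context = "\n".join(
        f"<doc{i}> {c} </doc{i}>" for i, c in enumerate(chunks, start=1)
    )
    return f"{SYSTEM}\n\nContext:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"

prompt = build_prompt(
    "What is RAG?",
    ["RAG grounds LLM answers in retrieved documents."],
)
```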
Evaluation metrics
- **Faithfulness** — is the answer grounded in the retrieved context? (No hallucination.)
- **Answer relevance** — does the answer actually address the question?
- **Context precision** — are the retrieved chunks relevant to the question?
- **Context recall** — did retrieval find all the necessary information?
- **RAGAS score** — combined metric over faithfulness, answer relevance, and context precision.
- **MRR / NDCG** — ranking quality of the retrieved documents.