Intro to Large Language Models
Architecture · Parameters · Training · Prompting · Evaluation · Limitations
LLM
Large Language Model — a deep neural network trained on massive text corpora to predict and generate natural language token by token.
Token
Smallest unit of text. ~0.75 words on average. "ChatGPT" = 2 tokens.
Parameters
Learned weights. GPT-4 ≈ 1.8T (reported, unconfirmed). Llama 3 ≈ 70B–405B. More ≠ always better.
Context Window
Max tokens the model sees at once. GPT-4o 128K · Claude 3.5 200K · Gemini 1.5 1M.
Core Pipeline
Tokenize → embed → N stacked transformer blocks → output projection → softmax over the vocabulary.
Each Block
Multi-head self-attention + feed-forward network, wrapped in residual connections and layer normalization.
Sampling Parameters
Temperature
Controls randomness. Low = factual. High = creative. Default ~0.7–1.0.
Top-p (Nucleus)
Sample from the smallest set of tokens whose cumulative probability ≥ p. 0.9 = only the highest-probability tokens.
Top-k
Consider only the top-k tokens. k=1 = greedy decoding. Common: 40–50.
Max Tokens
Max output length. Affects cost directly.
Repetition Penalty
Penalizes repeated tokens. 1.0 = disabled.
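The sampling parameters above compose into one decoding step. A minimal numpy sketch (function name and defaults are illustrative, not any library's API): temperature rescales logits, top-k masks all but the k highest, top-p keeps the smallest nucleus of cumulative probability ≥ p, then one token is sampled.

```python
import numpy as np

def sample_next_token(logits, temperature=0.7, top_k=50, top_p=0.9, rng=None):
    """Apply temperature, then top-k, then top-p filtering, and sample one token id."""
    rng = rng or np.random.default_rng(0)
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)
    # Top-k: mask everything below the k-th largest logit (ties may keep a few extra).
    kth = np.sort(logits)[-top_k] if top_k <= len(logits) else logits.min()
    logits = np.where(logits < kth, -np.inf, logits)
    # Softmax over the surviving logits.
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Top-p: keep the smallest prefix of tokens (by prob) whose cumulative mass >= top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    mask = np.zeros_like(probs)
    mask[order[:cutoff]] = probs[order[:cutoff]]
    mask /= mask.sum()
    return int(rng.choice(len(probs), p=mask))
```

With `top_k=1` this collapses to greedy decoding regardless of temperature, which is why k=1 is listed as greedy above.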
Training Pipeline
Pretraining
Next-token prediction on internet-scale text. Terabytes of data. Weeks on thousands of GPUs.
Supervised Fine-Tuning (SFT)
Fine-tune on curated prompt-response pairs. Teaches the model to follow instructions.
RLHF / DPO
Human feedback ranks responses. PPO or Direct Preference Optimization aligns to human preferences.
Safety Alignment
Constitutional AI, adversarial probing, refusal training before deployment.
Attention(Q,K,V) = softmax(Q·Kᵀ / √dₖ) · V
Q = query · K = key · V = value · dₖ = key dimension
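The formula maps directly to a few lines of numpy (single head, no masking, for illustration only):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q·Kᵀ/√dₖ)·V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (n_q, n_k) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted mix of value vectors
```

The √dₖ scaling keeps the dot products from saturating the softmax as dimensionality grows.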
Prompting: Basic → Advanced
Zero-Shot
Ask directly, no examples. Works for well-understood tasks. "Translate to French: …"
Few-Shot
Provide 2–5 input/output examples before your query. Dramatically improves accuracy on novel tasks.
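A few-shot prompt is just consistent formatting. A minimal sketch (the sentiment examples are made up for illustration):

```python
def few_shot_prompt(examples, query):
    """Format 2-5 (input, output) pairs ahead of the real query."""
    shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{shots}\n\nInput: {query}\nOutput:"

prompt = few_shot_prompt(
    [("great movie!", "positive"), ("waste of time", "negative")],
    "I loved every minute",
)
```

Ending on a bare `Output:` cues the model to complete the pattern rather than chat about it.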
Chain-of-Thought
Add "Let's think step by step". Unlocks multi-step problem solving.
Self-Consistency
Sample N CoT paths, majority-vote the final answer. Improves reasoning reliability.
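The voting step is trivial once you can sample completions. A sketch where `sample_fn` stands in for one temperature-sampled CoT run that returns only the final answer:

```python
from collections import Counter

def self_consistent_answer(sample_fn, n=5):
    """Sample n chain-of-thought completions and majority-vote the final answers."""
    answers = [sample_fn() for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

The reasoning paths may differ wildly; only the extracted answers are compared.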
Advanced Patterns
ReAct
Interleave Reasoning + Action + Observation loops. Foundation for agents + tool use.
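The control flow of a ReAct agent is a short loop; the hard part lives inside the model. A sketch with a hypothetical `model_step` callback that returns either an action tuple or a final answer (real agents parse these out of free-form "Thought/Action" text):

```python
def react_loop(model_step, tools, max_turns=5):
    """Alternate model reasoning with tool calls until a final answer appears."""
    transcript = []
    for _ in range(max_turns):
        step = model_step(transcript)        # ("act", tool_name, arg) or ("final", answer)
        if step[0] == "final":
            return step[1]
        _, tool, arg = step
        obs = tools[tool](arg)               # run the tool, feed the observation back
        transcript.append((tool, arg, obs))
    return None                              # gave up: no final answer within budget
```

Capping `max_turns` is essential — agents that never emit a final answer would otherwise loop forever.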
Tree-of-Thought
Explore multiple reasoning branches, backtrack on dead ends. Better than linear CoT for planning.
RAG
Retrieval-Augmented Generation — inject retrieved docs into context. Reduces hallucinations on facts.
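The RAG pattern in miniature: retrieve, then inject. Word-overlap scoring here is a toy stand-in for a real embedding index / vector DB:

```python
def retrieve(query, docs, k=2):
    """Rank docs by word overlap with the query (stand-in for vector search)."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.lower().split())))[:k]

def rag_prompt(query, docs):
    """Inject the top-k retrieved docs into the context ahead of the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

The "using only this context" instruction is what grounds the answer and suppresses hallucinated facts.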
System Prompt
Persistent instructions before conversation. Sets persona, format, constraints, and guardrails.
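In chat APIs the system prompt is typically a message with `role: system` ahead of the conversation — the shape below is one common convention, not any specific vendor's schema:

```python
# Persona, format, and guardrails go in the system message; it persists across turns.
messages = [
    {"role": "system", "content": "You are a terse SQL assistant. Reply with SQL only."},
    {"role": "user", "content": "Top 5 customers by revenue"},
]
```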
Fine-Tuning Methods
| Method | Params | Notes |
|---|---|---|
| Full FT | All | Highest quality, highest cost. Catastrophic forgetting risk. |
| LoRA | ~0.1% | Low-rank adapter matrices. Fast, cheap. Most popular. |
| QLoRA | ~0.1% | LoRA + 4-bit quantization. Fits 65B on 2× A100. |
| Prefix FT | Prefix | Add trainable prefix tokens. Good for task switching. |
| Adapters | ~3–4% | Bottleneck layers per task. Modular, swappable. |
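LoRA's trick, in numpy: freeze W and learn a low-rank update B·A, scaled by α/r as in the original paper. Dimensions below are illustrative (at real model scale, d ≫ r and the trainable fraction shrinks toward the ~0.1% in the table):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16             # hidden dim, low rank (r << d), scaling factor
W = rng.normal(size=(d, d))         # frozen pretrained weight
A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
B = np.zeros((d, r))                # trainable up-projection; zero-init => no change at start

def lora_forward(x):
    # Base path plus low-rank update; only A and B receive gradients.
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T
```

Because B starts at zero, the adapted model is exactly the base model at step 0 — training only ever moves it away gradually.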
Model Families
| Family | Org | Context | Strengths | Access |
|---|---|---|---|---|
| GPT-4o | OpenAI | 128K | Multimodal, coding, reasoning, SOTA benchmarks | API |
| Claude 3.5 | Anthropic | 200K | Long context, safety, nuanced writing, coding | API |
| Gemini 1.5 | Google | 1M | Ultra-long context, multimodal, Google ecosystem | API |
| Llama 3 | Meta | 128K | Open weights, strong fine-tuning base, on-prem | Open |
| Mistral/Mixtral | Mistral AI | 32K | MoE efficiency, open weights, fast inference | Open |
| Nova Premier | Amazon | 300K | AWS-native, inference profiles, enterprise ready | API |
Evaluation
Perplexity (PPL)
How "surprised" the model is by test data. Lower = better. Doesn't directly predict downstream task quality.
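Perplexity is the exponentiated average negative log-likelihood per token — a uniform guess over 4 choices gives PPL = 4:

```python
import math

def perplexity(token_probs):
    """PPL = exp(mean negative log-likelihood) over the model's per-token probabilities."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))
```

Intuitively, PPL ≈ the effective number of tokens the model is "choosing between" at each step.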
BLEU / ROUGE
N-gram overlap vs. reference. BLEU for translation, ROUGE for summarization. Weak proxy for quality.
MMLU
57-subject multiple-choice benchmark. Tests breadth of world knowledge. Common SOTA target.
HumanEval / SWE-bench
Code generation pass rate. SWE-bench = real GitHub issues. Better proxy for dev use-cases.
LLM-as-Judge
Use a strong model to score outputs (MT-Bench, AlpacaEval). Fast and scalable, but biased toward its own model family's style.
Mitigations