
Intro to Large Language Models

Architecture · Parameters · Training · Prompting · Evaluation · Limitations

🧠
What is an LLM?

LLM

Large Language Model — a deep neural network trained on massive text corpora to predict and generate natural language token by token.

Token

Smallest unit of text. ~0.75 words on average. "ChatGPT" = 2 tokens.

Parameters

Learned weights. GPT-4 reportedly ≈ 1.8T. Llama 3 ≈ 70B–405B. More ≠ always better.

Context Window

Max tokens the model sees at once. GPT-4o 128K · Claude 3.5 200K · Gemini 1.5 1M.
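The ~0.75-words-per-token figure above gives a quick budgeting heuristic. A minimal sketch (function names are illustrative, not any provider's API; exact counts require the model's actual tokenizer):

```python
# Rough token-count estimate from the ~0.75 words-per-token rule of thumb.
# Real counts depend on the model's tokenizer (e.g. BPE); use the
# provider's tokenizer library for exact numbers.

def estimate_tokens(text: str) -> int:
    """Approximate token count: words / 0.75 (~1.33 tokens per word)."""
    words = len(text.split())
    return round(words / 0.75)

def fits_in_context(text: str, context_window: int = 128_000) -> bool:
    """Check whether a prompt plausibly fits a given context window."""
    return estimate_tokens(text) <= context_window
```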

⚙️
Transformer Architecture

Core Pipeline

Input → Tokenizer → Embeddings → Pos. Enc. → N × Blocks → Logits → Token Out

Each Block

Self-Attention (MHA) · softmax(Q·Kᵀ/√dₖ)·V
Feed-Forward Network · Linear → GeLU → Linear
Layer Normalization · Pre-norm or Post-norm
Residual Connection · x + Sublayer(x)
🎛️
Inference Parameters
Temperature · 0.0 – 2.0

Controls randomness. Low = focused and deterministic. High = creative. Default ~0.7–1.0.

Top-p (Nucleus) · 0.0 – 1.0

Sample from the smallest set of tokens whose cumulative probability ≥ p. 0.9 drops the long tail of unlikely tokens.

Top-k · 1 – 100

Consider only the k highest-probability tokens. k=1 = greedy decoding. Common: 40–50.

Max Tokens · 1 – model limit

Caps output length. Affects cost directly.

Repetition Penalty · 1.0 – 1.5

Penalizes already-generated tokens. 1.0 = disabled.
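These parameters compose at decode time: temperature rescales the logits, then top-k and top-p trim the distribution before sampling. A plain-Python sketch (function names are illustrative):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales before softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Keep the top-k tokens, then the smallest prefix whose mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}    # renormalized distribution

def sample(logits, temperature=1.0, top_k=50, top_p=0.9, rng=random):
    """One decoding step: temperature → top-k → top-p → weighted sample."""
    probs = softmax(logits, temperature)
    dist = filter_top_k_top_p(probs, top_k, top_p)
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights)[0]
```

Note how a very low temperature plus a tight top-p collapses the distribution onto the argmax token, recovering greedy decoding.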

🔄
Training Pipeline
1
Pre-training · Self-supervised

Next-token prediction on internet-scale text. Terabytes of data. Weeks on thousands of GPUs.

2
Supervised Fine-Tuning (SFT) · Instruction

Fine-tune on curated prompt-response pairs. Teaches the model to follow instructions.

3
RLHF / DPO · Alignment

Human feedback ranks responses. PPO or Direct Preference Optimization aligns to human preferences.

4
Red-teaming & Safety · Safety

Constitutional AI, adversarial probing, refusal training before deployment.

👁️
Attention Mechanism
Core Idea: Every token attends to every other token and weighs its relevance. "The bank by the river" → bank attends strongly to river.

Attention(Q,K,V) = softmax(Q·Kᵀ / √dₖ) · V
Q = query · K = key · V = value · dₖ = key dim

Multi-Head (MHA) · h parallel heads
Grouped Query (GQA) · fewer KV heads → ↓ memory
Flash Attention · IO-aware kernel → 3–8× faster
KV Cache · reuse past keys/values
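The attention formula above can be sketched directly in NumPy for a single head (shapes and the optional causal mask are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q·Kᵀ/√dₖ)·V for one head. Q, K, V: (seq_len, d_k) arrays."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq, seq) relevance scores
    if causal:                                   # mask future positions for decoding
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    scores -= scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights
```

Each row of `weights` is a probability distribution over all positions, i.e. how strongly that token attends to every other token.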
✍️
Prompting Techniques

Basic → Advanced

Zero-Shot

Ask directly, no examples. Works for well-understood tasks. "Translate to French: …"

Few-Shot

Provide 2–5 input/output examples before your query. Dramatically improves accuracy on novel tasks.

Chain-of-Thought

Add "Let's think step by step". Unlocks multi-step problem solving.

Self-Consistency

Sample N CoT paths, majority-vote the final answer. Improves reasoning reliability.
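Self-consistency reduces to sampling plus a majority vote. A sketch, where `sample_fn` is a hypothetical stand-in for one temperature > 0 LLM call that returns the extracted final answer from a CoT response:

```python
from collections import Counter
import itertools

def self_consistency(sample_fn, n=10):
    """Sample n chain-of-thought answers and majority-vote the final answer."""
    answers = [sample_fn() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n              # answer plus agreement ratio

# Usage sketch: with noisy samples, the majority answer wins.
fake_llm = itertools.cycle(["42", "42", "41"]).__next__
# self_consistency(fake_llm, n=9) → ("42", 2/3)
```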

Advanced Patterns

ReAct

Interleave Reasoning + Action + Observation loops. Foundation for agents + tool use.

Tree-of-Thought

Explore multiple reasoning branches, backtrack on dead ends. Better than linear CoT for planning.

RAG

Retrieval-Augmented Generation — inject retrieved docs into context. Reduces hallucinations on facts.

System Prompt

Persistent instructions before conversation. Sets persona, format, constraints, and guardrails.
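A toy end-to-end sketch of the RAG pattern above: retrieval here is naive word overlap (real systems use embedding similarity over a vector store), and the prompt template is illustrative:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank docs by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    """Inject retrieved passages into the prompt so answers stay grounded."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using ONLY the context below. Cite the passage you used.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

Grounding the model in retrieved text, plus the citation instruction, is what reduces factual hallucinations.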

🔧
Fine-Tuning Methods
Method | Params | Notes
Full FT | All | Highest quality, highest cost. Catastrophic-forgetting risk.
LoRA | ~0.1% | Low-rank adapter matrices. Fast, cheap. Most popular.
QLoRA | ~0.1% | LoRA + 4-bit quantization. Fine-tunes 65B on a single 48 GB GPU.
Prefix FT | Prefix | Adds trainable prefix tokens. Good for task switching.
Adapters | ~3–4% | Bottleneck layers per task. Modular, swappable.
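The LoRA row above can be made concrete: instead of updating the full weight W, train two small matrices A and B whose product is the update. A NumPy sketch (the alpha/r scaling follows the common formulation; names are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA: y = x·(W + (alpha/r)·B·A)ᵀ, training only A (r×d_in) and B (d_out×r).

    The frozen base weight W stays untouched; only the low-rank pair
    (A, B) is trained, adding a tiny fraction of extra parameters.
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)   # low-rank weight update, same shape as W
    return x @ (W + delta).T

# Parameter-count intuition: a 4096×4096 weight has ~16.8M params;
# rank-8 adapters add 2 * 8 * 4096 ≈ 65K (~0.4%).
```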
📊
Notable Model Families (2024–25)
Family | Org | Context | Strengths | Access
GPT-4o | OpenAI | 128K | Multimodal, coding, reasoning, SOTA benchmarks | API
Claude 3.5 | Anthropic | 200K | Long context, safety, nuanced writing, coding | API
Gemini 1.5 | Google | 1M | Ultra-long context, multimodal, Google ecosystem | API
Llama 3 | Meta | 128K | Open weights, strong fine-tuning base, on-prem | Open
Mistral/Mixtral | Mistral AI | 32K | MoE efficiency, open weights, fast inference | Open
Nova Premier | Amazon | 300K | AWS-native, inference profiles, enterprise-ready | API
📏
Evaluation Metrics

Perplexity (PPL)

How "surprised" the model is by test data. Lower = better. Only loosely tracks downstream task quality.
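Perplexity is just the exponentiated average negative log-probability the model assigns to the true tokens:

```python
import math

def perplexity(token_probs):
    """PPL = exp(−mean log p): the model's effective branching factor.

    token_probs are the probabilities the model assigned to each
    actual next token in the test sequence.
    """
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that always gives the true next token probability 1/4
# behaves as if choosing uniformly among 4 options: PPL ≈ 4.
```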

BLEU / ROUGE

N-gram overlap vs. reference. BLEU for translation, ROUGE for summarization. Weak proxy for quality.

MMLU

57-subject multiple-choice benchmark. Tests breadth of world knowledge. Common SOTA target.

HumanEval / SWE-bench

Code generation pass rate. SWE-bench = real GitHub issues. Better proxy for dev use-cases.

LLM-as-Judge

Use a strong model to score outputs. MT-Bench, AlpacaEval. Fast, but biased toward its own model family.

⚠️
Hallucinations & Limitations
Hallucination: Model generates plausible-sounding but factually incorrect content with high confidence.
Knowledge cutoff · training data frozen in time
Intrinsic hallucination · contradicts the given source/facts
Extrinsic hallucination · unverifiable fabrication
Sycophancy · agrees under user pressure
Lost in the middle · misses mid-context information

Mitigations

RAG · Grounding · Citation prompts · Temp = 0 · Self-consistency · Human review