
Intro to Large Language Models

Architecture · Parameters · Training · Prompting · Evaluation · Limitations

🧠
What is an LLM?

LLM

Large Language Model — a deep neural network trained on massive text corpora to predict and generate natural language token by token.

Token

Smallest unit of text. ~0.75 words on average. "ChatGPT" = 2 tokens.

Parameters

Learned weights. GPT-4 reportedly ≈ 1.8T. Llama 3 ≈ 70B–405B. More ≠ always better.

Context Window

Max tokens the model sees at once. GPT-4o 128K · Claude 3.5 200K · Gemini 1.5 1M.
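The ~0.75-words-per-token figure above gives a quick budgeting heuristic. A minimal sketch (function names are illustrative, not any provider's API; exact counts require the model's actual tokenizer):

```python
# Rough token-count estimate from the ~0.75 words-per-token rule of thumb.
# Real counts depend on the model's tokenizer (e.g. BPE); use the
# provider's tokenizer library for exact numbers.

def estimate_tokens(text: str) -> int:
    """Approximate token count: words / 0.75 (~1.33 tokens per word)."""
    words = len(text.split())
    return round(words / 0.75)

def fits_in_context(text: str, context_window: int = 128_000) -> bool:
    """Check whether a prompt plausibly fits a given context window."""
    return estimate_tokens(text) <= context_window
```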

⚙️
Transformer Architecture

Core Pipeline

Input → Tokenizer → Embeddings → Pos. Enc. → N × Blocks → Logits → Token Out

Each Block

Self-Attention (MHA) · softmax(Q·Kᵀ/√dₖ)·V
Feed-Forward Network · Linear → GeLU → Linear
Layer Normalization · Pre-norm or Post-norm
Residual Connection · x + Sublayer(x)
🎛️
Inference Parameters
Temperature · 0.0 – 2.0

Controls randomness. Low = focused and deterministic. High = creative. Default ~0.7–1.0.

Top-p (Nucleus) · 0.0 – 1.0

Sample from the smallest set of tokens whose cumulative probability ≥ p. 0.9 drops the long tail of unlikely tokens.

Top-k · 1 – 100

Consider only the k highest-probability tokens. k=1 = greedy decoding. Common: 40–50.

Max Tokens · 1 – model limit

Caps output length. Affects cost directly.

Repetition Penalty · 1.0 – 1.5

Penalizes already-generated tokens. 1.0 = disabled.
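These parameters compose at decode time: temperature rescales the logits, then top-k and top-p trim the distribution before sampling. A plain-Python sketch (function names are illustrative):

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature rescales before softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k_top_p(probs, top_k=50, top_p=0.9):
    """Keep the top-k tokens, then the smallest prefix whose mass >= top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}    # renormalized distribution

def sample(logits, temperature=1.0, top_k=50, top_p=0.9, rng=random):
    """One decoding step: temperature → top-k → top-p → weighted sample."""
    probs = softmax(logits, temperature)
    dist = filter_top_k_top_p(probs, top_k, top_p)
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights)[0]
```

Note how a very low temperature plus a tight top-p collapses the distribution onto the argmax token, recovering greedy decoding.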

🔄
Training Pipeline
1
Pre-training · Self-supervised

Next-token prediction on internet-scale text. Terabytes of data. Weeks on thousands of GPUs.

2
Supervised Fine-Tuning (SFT) · Instruction

Fine-tune on curated prompt-response pairs. Teaches the model to follow instructions.

3
RLHF / DPO · Alignment

Human feedback ranks responses. PPO or Direct Preference Optimization aligns to human preferences.

4
Red-teaming & Safety · Safety

Constitutional AI, adversarial probing, refusal training before deployment.

👁️
Attention Mechanism
Core Idea: Every token attends to every other token and weighs its relevance. "The bank by the river" → bank attends strongly to river.

Attention(Q,K,V) = softmax(Q·Kᵀ / √dₖ) · V
Q = query · K = key · V = value · dₖ = key dim

Multi-Head (MHA) · h parallel heads
Grouped Query (GQA) · fewer KV heads → ↓ memory
Flash Attention · IO-aware kernel → 3–8× faster
KV Cache · reuse past keys/values
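The attention formula above can be sketched directly in NumPy for a single head (shapes and the optional causal mask are illustrative assumptions):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """softmax(Q·Kᵀ/√dₖ)·V for one head. Q, K, V: (seq_len, d_k) arrays."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq, seq) relevance scores
    if causal:                                   # mask future positions for decoding
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)
    scores -= scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V, weights
```

Each row of `weights` is a probability distribution over all positions, i.e. how strongly that token attends to every other token.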
✍️
Prompting Techniques

Basic → Advanced

Zero-Shot

Ask directly, no examples. Works for well-understood tasks. "Translate to French: …"

Few-Shot

Provide 2–5 input/output examples before your query. Dramatically improves accuracy on novel tasks.

Chain-of-Thought

Add "Let's think step by step". Unlocks multi-step problem solving.

Self-Consistency

Sample N CoT paths, majority-vote the final answer. Improves reasoning reliability.
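Self-consistency reduces to sampling plus a majority vote. A sketch, where `sample_fn` is a hypothetical stand-in for one temperature > 0 LLM call that returns the extracted final answer from a CoT response:

```python
from collections import Counter
import itertools

def self_consistency(sample_fn, n=10):
    """Sample n chain-of-thought answers and majority-vote the final answer."""
    answers = [sample_fn() for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    return best, count / n              # answer plus agreement ratio

# Usage sketch: with noisy samples, the majority answer wins.
fake_llm = itertools.cycle(["42", "42", "41"]).__next__
# self_consistency(fake_llm, n=9) → ("42", 2/3)
```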

Advanced Patterns

ReAct

Interleave Reasoning + Action + Observation loops. Foundation for agents + tool use.

Tree-of-Thought

Explore multiple reasoning branches, backtrack on dead ends. Better than linear CoT for planning.

RAG

Retrieval-Augmented Generation — inject retrieved docs into context. Reduces hallucinations on facts.

System Prompt

Persistent instructions before conversation. Sets persona, format, constraints, and guardrails.
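A toy end-to-end sketch of the RAG pattern above: retrieval here is naive word overlap (real systems use embedding similarity over a vector store), and the prompt template is illustrative:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank docs by word overlap with the query."""
    q = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    """Inject retrieved passages into the prompt so answers stay grounded."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return (
        "Answer using ONLY the context below. Cite the passage you used.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

Grounding the model in retrieved text, plus the citation instruction, is what reduces factual hallucinations.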

🔧
Fine-Tuning Methods
Method | Params | Notes
Full FT | All | Highest quality, highest cost. Catastrophic-forgetting risk.
LoRA | ~0.1% | Low-rank adapter matrices. Fast, cheap. Most popular.
QLoRA | ~0.1% | LoRA + 4-bit quantization. Fine-tunes 65B on a single 48 GB GPU.
Prefix FT | Prefix | Adds trainable prefix tokens. Good for task switching.
Adapters | ~3–4% | Bottleneck layers per task. Modular, swappable.
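The LoRA row above can be made concrete: instead of updating the full weight W, train two small matrices A and B whose product is the update. A NumPy sketch (the alpha/r scaling follows the common formulation; names are illustrative):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """LoRA: y = x·(W + (alpha/r)·B·A)ᵀ, training only A (r×d_in) and B (d_out×r).

    The frozen base weight W stays untouched; only the low-rank pair
    (A, B) is trained, adding a tiny fraction of extra parameters.
    """
    r = A.shape[0]
    delta = (alpha / r) * (B @ A)   # low-rank weight update, same shape as W
    return x @ (W + delta).T

# Parameter-count intuition: a 4096×4096 weight has ~16.8M params;
# rank-8 adapters add 2 * 8 * 4096 ≈ 65K (~0.4%).
```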
📊
Notable Model Families (2024–25)
Family | Org | Context | Strengths | Access
GPT-4o | OpenAI | 128K | Multimodal, coding, reasoning, SOTA benchmarks | API
Claude 3.5 | Anthropic | 200K | Long context, safety, nuanced writing, coding | API
Gemini 1.5 | Google | 1M | Ultra-long context, multimodal, Google ecosystem | API
Llama 3 | Meta | 128K | Open weights, strong fine-tuning base, on-prem | Open
Mistral/Mixtral | Mistral AI | 32K | MoE efficiency, open weights, fast inference | Open
Nova Premier | Amazon | 300K | AWS-native, inference profiles, enterprise-ready | API
📏
Evaluation Metrics

Perplexity (PPL)

How "surprised" the model is by test data. Lower = better. Only loosely tracks downstream task quality.
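Perplexity is just the exponentiated average negative log-probability the model assigns to the true tokens:

```python
import math

def perplexity(token_probs):
    """PPL = exp(−mean log p): the model's effective branching factor.

    token_probs are the probabilities the model assigned to each
    actual next token in the test sequence.
    """
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that always gives the true next token probability 1/4
# behaves as if choosing uniformly among 4 options: PPL ≈ 4.
```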

BLEU / ROUGE

N-gram overlap vs. reference. BLEU for translation, ROUGE for summarization. Weak proxy for quality.

MMLU

57-subject multiple-choice benchmark. Tests breadth of world knowledge. Common SOTA target.

HumanEval / SWE-bench

Code generation pass rate. SWE-bench = real GitHub issues. Better proxy for dev use-cases.

LLM-as-Judge

Use a strong model to score outputs. MT-Bench, AlpacaEval. Fast, but biased toward its own model family.

⚠️
Hallucinations & Limitations
Hallucination: Model generates plausible-sounding but factually incorrect content with high confidence.
Knowledge cutoff · training data frozen in time
Intrinsic hallucination · contradicts the given source/facts
Extrinsic hallucination · unverifiable fabrication
Sycophancy · agrees under user pressure
Lost in the middle · misses mid-context information

Mitigations

RAG · Grounding · Citation prompts · Temp = 0 · Self-consistency · Human review