Area 1

Foundational concepts

The mental models every engineer needs. Get these right and the rest (agents, MCP, cost) clicks into place.

Tokens & tokenization

Models read tokens, not words.

Tokens are sub-word chunks produced by a tokenizer, each mapped to an integer ID. They are the unit of both pricing and context limits, and many odd model behaviors with spelling or math trace straight back to tokenization.

The rule of thumb

1 token ≈ 4 English characters, or about ¾ of a word.
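
As a rough illustration of the rule of thumb, the sketch below estimates a token count from character length alone. The function name `estimate_tokens` and its `chars_per_token` parameter are made up for this example; a real tokenizer will give different counts, especially for code or non-English text.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough estimate only: about 4 English characters per token."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


if __name__ == "__main__":
    sample = "Models read tokens, not words."
    print(estimate_tokens(sample))
```

Use an estimate like this for budgeting only; for billing or hard context limits, count with the provider's actual tokenizer.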

Non-English costs more

Other languages often use 1.5–2× the tokens for the same meaning, and you pay per token.

Tokenizers change bills

Opus 4.7's tokenizer reportedly produces up to 35% more tokens for the same text: same rate card, higher bill.
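
To make the billing effect concrete, here is a small sketch of how an unchanged per-token rate turns into a higher bill when the tokenizer emits more tokens. The rate of 10 USD per million tokens and the 2M-token workload are invented for illustration; only the 35% figure comes from the text above.

```python
def bill_usd(tokens: int, usd_per_million: float) -> float:
    """Cost of a token count at a flat per-million-token rate."""
    return tokens * usd_per_million / 1_000_000


# Hypothetical numbers, chosen only to show the arithmetic.
RATE_USD_PER_MILLION = 10.0
old_tokens = 2_000_000
new_tokens = round(old_tokens * 1.35)  # up to 35% more tokens, same text

old_bill = bill_usd(old_tokens, RATE_USD_PER_MILLION)
new_bill = bill_usd(new_tokens, RATE_USD_PER_MILLION)

if __name__ == "__main__":
    print(f"old: ${old_bill:.2f}  new: ${new_bill:.2f}")
```

The same scaling applies to the 1.5–2× overhead for non-English text: the multiplier on tokens is the multiplier on cost.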

Context windows

Bigger windows, but bigger isn't free.

The context window is the maximum tokens (input + output) a model can consider at once. 1M-token windows are common in 2026, but two effects mean you must curate context rather than dump everything in.

Context window sizes (2026)

Approximate maximum tokens, per source material
Claude Opus/Sonnet 4.6+: 1M
Gemini 3.x: 1M
GPT-5.2: ~400K

EFFECT #1

Lost in the middle

Models recall the beginning and end of context better than the middle (Liu et al., 2023; TACL 2024). Put key constraints where they'll be seen.

EFFECT #2

Context rot

Output quality degrades as input grows, even before the window is full (Chroma Research, 2025, across 18 models).
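
One way to act on the lost-in-the-middle effect is to state key constraints at the start of the prompt and repeat them after the reference material. The sketch below is one possible layout, not a prescribed template; `build_prompt` and its section labels are hypothetical.

```python
def build_prompt(constraints: list[str], documents: list[str], question: str) -> str:
    """Place the rules at the top and again near the end, with documents in between."""
    rules = "\n".join(f"- {c}" for c in constraints)
    body = "\n\n".join(documents)
    return (
        f"Rules:\n{rules}\n\n"
        f"Reference material:\n{body}\n\n"
        f"Reminder of the rules:\n{rules}\n\n"
        f"Question: {question}"
    )


if __name__ == "__main__":
    print(build_prompt(["Answer in JSON"], ["doc A", "doc B"], "What changed?"))
```

Because of context rot, the `documents` list should already be curated: pass only the passages the question needs, not everything available.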

Sampling parameters

Temperature dials randomness.

Temperature controls how varied the output is; top-p (nucleus sampling) limits choices to the smallest set of most-likely tokens whose cumulative probability reaches p. The practical rule is simple.

Low temperature

Near-deterministic and focused. Use it for code, structured output, and anything that must be correct and repeatable.

High temperature

Creative and varied. Use it for brainstorming, naming, and idea generation where you want range.
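
The following is a minimal pure-Python sketch of how temperature and top-p are commonly applied to a model's raw scores (logits) before a token is drawn. It is a textbook-style illustration, not any provider's actual implementation; all function names are chosen for this example.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose cumulative
    probability reaches p, then renormalize over that set."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return {i: probs[i] / cumulative for i in kept}


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index after applying temperature and top-p."""
    probs = softmax_with_temperature(logits, temperature)
    nucleus = top_p_filter(probs, top_p)
    ids = list(nucleus)
    weights = [nucleus[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

A low temperature sharpens the distribution toward the top token; a high temperature flattens it, which is where the extra variety comes from.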

The 2026 landscape

A specialization market.

No single model wins everything. The 2026 best practice is model routing: send each task to the cheapest model that clears your quality bar.

Family             Tiers                              Strength
Anthropic Claude   Opus · Sonnet · Haiku              Agentic coding
OpenAI GPT-5.x     Instant · Thinking · Pro           Abstract reasoning
Google Gemini 3.x  Pro · Flash · Deep Think           Long context
Open-weight        Llama · Qwen · DeepSeek · Mistral  Cost & privacy

A 70/20/10 Haiku / Sonnet / Opus split can cut API costs by more than half versus all-Sonnet.
Routing the cheapest sufficient model to each task is the dominant cost+quality pattern.
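
A routing rule of this kind can be sketched in a few lines. Everything below is hypothetical: the tier names, the relative costs, and the per-task quality scores stand in for numbers you would measure on your own evaluation set.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    name: str
    relative_cost: float  # hypothetical cost units, lower is cheaper
    quality: dict = field(default_factory=dict)  # task type -> eval score in [0, 1]


def route(task_type: str, quality_bar: float, models: list) -> Model:
    """Return the cheapest model whose measured score clears the bar."""
    candidates = [m for m in models if m.quality.get(task_type, 0.0) >= quality_bar]
    if not candidates:
        raise LookupError(f"no model clears {quality_bar} on {task_type!r}")
    return min(candidates, key=lambda m: m.relative_cost)


# Placeholder catalogue; replace the scores with your own eval results.
MODELS = [
    Model("small", 1.0, {"classification": 0.95, "refactor": 0.70}),
    Model("medium", 3.0, {"classification": 0.97, "refactor": 0.88}),
    Model("large", 5.0, {"classification": 0.98, "refactor": 0.95}),
]
```

With these placeholder scores, `route("classification", 0.9, MODELS)` picks the small tier, while a stricter bar on `"refactor"` pushes the task up to a larger one.
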
Reasoning / thinking models

They think before they answer.

Reasoning models spend extra thinking tokens on an internal scratchpad before responding (OpenAI o-series, Claude extended thinking, Gemini Deep Think). Use them for hard, multi-step problems, but know the trade-offs.

Best for hard problems

Complex debugging, architecture, and math: anything multi-step benefits most.

โฑ๏ธ

Latency & cost

Expect 3–15 s before the first visible token, and thinking tokens bill at output rates. A task can cost ~9× the bare answer.
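
The multiplier follows from simple arithmetic: if thinking tokens are billed like output tokens, output cost scales with thinking plus answer tokens. The sketch below uses invented token counts (500 visible, 4,000 thinking) picked only to show how a figure near 9× can arise; it ignores input-token cost.

```python
def output_cost_multiplier(answer_tokens: int, thinking_tokens: int) -> float:
    """Billed output tokens divided by visible answer tokens,
    assuming thinking tokens are charged at the output rate."""
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return (answer_tokens + thinking_tokens) / answer_tokens


if __name__ == "__main__":
    # Hypothetical task: 500-token answer preceded by 4,000 thinking tokens.
    print(output_cost_multiplier(500, 4_000))
```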

Skip the step-by-step prompt

Telling a reasoning model to think step by step is redundant: it already thinks internally.

Hallucinations & reliability

Sometimes it sounds right but is wrong.

A hallucination is when the model gives an answer that sounds confident but is actually false or made up. The fix is to give it real sources to work from and ask it to show where each answer came from. One support team cut wrong answers from 19% to about 2% by adding sources, then to under 1% by also checking the cited sources.

Wrong-answer rate after adding sources

Real support team, from the source material
No sources: 19%
With sources: ~2%
+ source check: <1%

Don't ship an answer as fact in production unless it's backed by a real source.
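
A source check can start as something very simple: reject any answer that contains an uncited sentence or cites a source that was never supplied. The sketch below assumes a made-up citation format, `[S1]`, placed before the sentence's final punctuation; it checks that citations exist, not that the source actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(S\d+)\]")  # hypothetical citation format, e.g. [S1]


def check_citations(answer: str, sources: dict) -> list:
    """Return a list of problems; an empty list means every sentence
    cites at least one source and every cited ID was provided."""
    problems = []
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    for sentence in sentences:
        cited = CITATION.findall(sentence)
        if not cited:
            problems.append(f"uncited: {sentence}")
        for source_id in cited:
            if source_id not in sources:
                problems.append(f"unknown source {source_id}: {sentence}")
    return problems
```

A stricter version would also compare each sentence against the text of the cited source, for example with a second model call or a retrieval-overlap score.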

Knowledge cutoffs

Why models need tools.

A model only knows what was in its training data up to its cutoff date; it has no awareness of anything after. That single fact is the core justification for everything that follows on this site.

Tool use: Let the model call functions to do things now.
Web search: Fetch current information at query time.
RAG: Ground answers in your own documents.
MCP: Connect to current, authoritative external data.
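
As a minimal sketch of tool use, the code below shows the application side of the loop: the model emits a structured request, and your code runs the matching function and returns the result. The JSON shape (`name`, `arguments`) and the `get_today` tool are assumptions for illustration; each provider defines its own tool-call format.

```python
import json
from datetime import date


def get_today() -> str:
    """Something no model can know from training data: today's date."""
    return date.today().isoformat()


# Registry of functions the model is allowed to call (hypothetical).
TOOLS = {"get_today": get_today}


def run_tool_call(raw: str) -> str:
    """Execute one model-issued tool call and return a JSON result string."""
    call = json.loads(raw)
    name = call["name"]
    if name not in TOOLS:
        return json.dumps({"error": f"unknown tool: {name}"})
    result = TOOLS[name](**call.get("arguments", {}))
    return json.dumps({"result": result})


if __name__ == "__main__":
    print(run_tool_call('{"name": "get_today", "arguments": {}}'))
```

The returned string is fed back to the model as the tool result, so its final answer can rely on current data rather than on what it saw before its cutoff.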