The mental models every engineer needs. Get these right and the rest (agents, MCP, cost) all click into place.
Tokens are sub-word chunks produced by a tokenizer, each mapped to an integer ID. They are the unit of both pricing and context limits, and many odd model behaviors with spelling or math trace straight back to tokenization.
1 token ≈ 4 English characters, or about ¾ of a word.
Other languages often use 1.5–2× the tokens for the same meaning, and you pay per token.
Opus 4.7's tokenizer reportedly produces up to 35% more tokens for the same text: same rate card, higher bill.
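The rules of thumb above can be turned into a back-of-the-envelope estimator. This is a minimal sketch, not a real tokenizer: it assumes the 4-characters-per-token heuristic, and the function names and the 3.00 USD per million tokens rate are hypothetical placeholders rather than any provider's actual pricing.

```python
import math

CHARS_PER_TOKEN = 4.0  # rule of thumb for English text; real tokenizers vary


def estimate_tokens(text: str, language_multiplier: float = 1.0) -> int:
    """Rough token estimate: characters / 4, scaled by a multiplier.

    Use a multiplier of 1.5-2.0 for many non-English languages, or 1.35
    to model a tokenizer that emits up to 35% more tokens.
    """
    return math.ceil(len(text) / CHARS_PER_TOKEN * language_multiplier)


def estimate_cost(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost in USD at a given per-million-token rate."""
    return tokens / 1_000_000 * usd_per_million_tokens


HYPOTHETICAL_RATE = 3.00  # USD per million tokens, assumed for illustration

prompt = "Summarize the following incident report in three bullet points."
english = estimate_tokens(prompt)
other = estimate_tokens(prompt, language_multiplier=2.0)

print(f"English estimate: {english} tokens")
print(f"2x-language estimate: {other} tokens")
print(f"Cost per 1M such prompts: {estimate_cost(english * 1_000_000, HYPOTHETICAL_RATE):.2f} USD")
```

For billing or hard context limits, count with the provider's own tokenizer; the heuristic is only good enough for quick sizing.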
The context window is the maximum tokens (input + output) a model can consider at once. 1M-token windows are common in 2026, but two effects mean you must curate context, not dump everything in.
Models recall the beginning and end of context better than the middle (Liu et al., 2023; TACL 2024). Put key constraints where they'll be seen.
Output quality degrades as input grows, even before the window is full (Chroma Research, 2025, across 18 models).
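One way to act on both effects is to put constraints at the start and end of the prompt and cap how much material goes in the middle. The sketch below is illustrative only: `build_prompt`, its section labels, and the character budget (a stand-in for a token budget) are assumptions, and the documents are assumed to arrive sorted most relevant first.

```python
def build_prompt(constraints: str, documents: list[str], question: str,
                 max_chars: int) -> str:
    """Assemble a prompt with constraints first and last, documents in the middle.

    Documents are added in order until the character budget runs out, so the
    least relevant material is what gets dropped.
    """
    header = f"Constraints:\n{constraints}\n"
    footer = f"\nQuestion:\n{question}\nReminder of constraints:\n{constraints}"
    budget = max_chars - len(header) - len(footer)

    kept: list[str] = []
    for doc in documents:
        cost = len(doc) + 1  # +1 for the newline separator
        if cost > budget:
            break  # stop at the first document that does not fit
        kept.append(doc)
        budget -= cost

    return header + "\n".join(kept) + footer


prompt = build_prompt(
    constraints="Answer in JSON.",
    documents=["a" * 50, "b" * 50, "c" * 5000],
    question="What?",
    max_chars=400,
)
print(len(prompt))
```

Stopping at the first oversized document keeps the ordering simple; a fuller version might skip it and keep trying smaller ones, or summarize instead of dropping.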
Temperature controls how varied the output is; top-p (nucleus sampling) limits choices to the smallest set of tokens whose probabilities sum to p. The practical rule is simple.
Low temperature: deterministic and focused. Use it for code, structured output, and anything that must be correct and repeatable.
High temperature: creative and varied. Use it for brainstorming, naming, and idea generation where you want range.
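To make the two knobs concrete, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy vocabulary. The `sample_token` function and the logit values are hypothetical; production inference stacks do this on tensors, and treating a temperature of 0 as greedy decoding is a common convention assumed here.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, top_p: float,
                 rng: random.Random) -> str:
    """Sample one token using temperature scaling and top-p filtering."""
    if temperature <= 0:
        # Convention assumed here: temperature 0 means greedy decoding.
        return max(logits, key=logits.get)

    # Softmax over temperature-scaled logits (shifted by the max for stability).
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(((tok, w / total) for tok, w in weights.items()),
                    key=lambda pair: pair[1], reverse=True)

    # Keep the smallest set of top tokens whose probabilities sum to >= top_p.
    nucleus: list[tuple[str, float]] = []
    cumulative = 0.0
    for tok, prob in ranked:
        nucleus.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    tokens, probs = zip(*nucleus)
    return rng.choices(tokens, weights=probs, k=1)[0]


rng = random.Random(0)
logits = {"cat": 3.0, "dog": 1.0, "eel": 0.0}  # toy values for illustration
focused = [sample_token(logits, 1.0, 0.5, rng) for _ in range(20)]
varied = [sample_token(logits, 5.0, 1.0, rng) for _ in range(200)]
print(set(focused), set(varied))
```

With these toy logits, "cat" alone already covers more than half the probability mass at temperature 1, so a top-p of 0.5 leaves it as the only candidate; raising the temperature flattens the distribution and lets the other tokens through.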
No single model wins everything. The 2026 best practice is model routing: match the cheapest model that clears your quality bar to each task.
| Family | Tiers | Strength |
|---|---|---|
| Anthropic Claude | Opus · Sonnet · Haiku | Agentic coding |
| OpenAI GPT-5.x | Instant · Thinking · Pro | Abstract reasoning |
| Google Gemini 3.x | Pro · Flash · Deep Think | Long context |
| Open-weight | Llama · Qwen · DeepSeek · Mistral | Cost & privacy |
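Model routing as described above reduces to a small selection rule: filter to the models that clear the quality bar for the task, then take the cheapest. The sketch below assumes you already have per-task evaluation scores; the model names, prices, and scores are made-up placeholders, not measurements of any real model.

```python
from dataclasses import dataclass


@dataclass
class ModelOption:
    name: str
    usd_per_million_tokens: float
    quality: dict[str, float]  # task type -> your own eval score in [0, 1]


def route(task_type: str, quality_bar: float,
          options: list[ModelOption]) -> ModelOption:
    """Return the cheapest model whose score for this task clears the bar."""
    eligible = [m for m in options
                if m.quality.get(task_type, 0.0) >= quality_bar]
    if not eligible:
        raise ValueError(f"no model clears {quality_bar} for {task_type!r}")
    return min(eligible, key=lambda m: m.usd_per_million_tokens)


# Hypothetical catalogue: names, prices, and scores are illustrative only.
CATALOGUE = [
    ModelOption("small", 0.5, {"classification": 0.93, "refactoring": 0.60}),
    ModelOption("medium", 3.0, {"classification": 0.95, "refactoring": 0.82}),
    ModelOption("large", 15.0, {"classification": 0.96, "refactoring": 0.91}),
]

print(route("classification", 0.90, CATALOGUE).name)
print(route("refactoring", 0.90, CATALOGUE).name)
```

The quality scores are the part that takes real work: they should come from an evaluation set built from your own tasks, since public benchmarks rarely match a specific workload.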
Reasoning models spend extra thinking tokens on an internal scratchpad before responding (OpenAI o-series, Claude extended thinking, Gemini Deep Think). Use them for hard, multi-step problems, but know the trade-offs.
Complex debugging, architecture, and math: anything multi-step benefits most.
3–15 s before the first visible token, and thinking tokens bill at output rates. A task can cost ~9× the bare answer.
Telling a reasoning model to think step by step is redundant: it already thinks internally.
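The cost multiplier follows directly from thinking tokens being billed at the output rate. As a worked sketch, with token counts chosen purely for illustration (a 500-token answer preceded by 4,000 thinking tokens) and input cost ignored:

```python
def output_cost_multiplier(answer_tokens: int, thinking_tokens: int) -> float:
    """Ratio of billed output tokens to visible answer tokens.

    Assumes thinking tokens bill at the same rate as answer tokens and
    ignores input-token cost.
    """
    if answer_tokens <= 0:
        raise ValueError("answer_tokens must be positive")
    return (answer_tokens + thinking_tokens) / answer_tokens


# Illustrative numbers, not measurements from any specific model.
multiplier = output_cost_multiplier(answer_tokens=500, thinking_tokens=4000)
print(f"Billed output is {multiplier:.0f}x the bare answer")
```

The same ratio shrinks quickly when the visible answer is long relative to the scratchpad, which is one reason the multiplier varies so much from task to task.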
A hallucination is when the model gives an answer that sounds confident but is actually false or made up. The fix is to give it real sources to work from and ask it to show where each answer came from. One support team cut wrong answers from 19% down to about 2%, then under 1%.
Don't ship an answer as fact in production unless it's backed by a real source.
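A simple guard for that rule is to require a citation on every sentence and reject answers that cite sources you never supplied. The sketch below assumes a hypothetical `[S1]`-style citation format that you would ask the model to use; it only checks that citations exist and point at known sources, not that the source actually supports the claim.

```python
import re

CITATION = re.compile(r"\[(\w+)\]")  # assumed citation format, e.g. [S1]


def is_grounded(answer: str, source_ids: set[str]) -> bool:
    """True if every sentence cites at least one source, and only known ones."""
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", answer)
                 if s.strip()]
    if not sentences:
        return False
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= source_ids:
            return False
    return True


sources = {"S1", "S2"}  # IDs of the documents passed to the model
print(is_grounded("Refunds take 5 days [S1]. Returns are free [S2].", sources))
print(is_grounded("Refunds take 5 days [S1]. Shipping is instant.", sources))
```

Answers that fail the check can be retried, routed to a human, or replaced with an explicit "not found in the sources" response.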
A model only has what was in its training data up to its cutoff date; it has no awareness of anything after. That single fact is the core justification for everything that follows on this site.