Always reaching for the biggest model can cost several times more than necessary. A few structural habits change everything.
Tokens sent in (prompt) and tokens generated out (completion) are billed apart, and output is typically priced several times higher than input. For many apps, output length is the bigger cost lever.
| Model (per million tokens) | Input | Output |
|---|---|---|
| Claude Haiku 4.5 fast / cheap | ~$1 | ~$5 |
| Claude Sonnet 4.6 balanced | ~$3 | ~$15 |
| Claude Opus tier most capable | ~$5 | ~$25 |
Illustrative rates from the source material; verify against current provider pricing.
Reusing a stable prefix — system prompt, tool definitions, large documents — lets the provider cache it. On Anthropic, cache reads cost about 10% of the base input rate. Structure prompts with static content first, variable content last.
Put your system prompt, tool definitions and big documents at the start of every request.
Keep the per-request bits at the end so the cached prefix stays identical and hits the cache.
Four levers compound. Caching + batch alone can cut costs 90%+ — e.g. Sonnet 4.6 effectively dropping from $3/$15 toward $1.50/$7.50 with batch, far lower with caching.
Use the cheapest model that clears your quality bar for each task.
~50% discount for async, non-urgent jobs.
Reuse stable prefixes for up to ~90% off input.
Limit output and thinking tokens — the biggest lever for many apps.