Haiku, Sonnet, or Opus? Model Tiering Your Claude Code Workflows for 10x Cost Savings

The right way to choose between Haiku, Sonnet, and Opus is to match the model to the cognitive load of the task: Haiku for mechanical work (formatting, dedupe, lint), Sonnet for specialist craft (copy, briefs, structured analysis), and Opus for strategy (judgment calls, orchestration, deep research). Run everything on Opus and you pay roughly 10x more than necessary. Run everything on Haiku and your strategy work collapses. Correct tiering is the single highest-leverage cost decision in any multi-agent Claude Code setup.

What does "model tiering" actually mean?

Model tiering is assigning each task in a workflow to the cheapest model that reliably succeeds at that task. The key word is reliably — not "the cheapest model that runs" but "the cheapest model that produces a correct result." Get that distinction wrong and you either overpay (Opus formatting JSON) or ship garbage (Haiku orchestrating a 6-step pipeline).

The framework has three buckets, and almost every task in a real Claude Code workflow lands cleanly in one of them:

Mechanical — transformation, not reasoning. Deduplication, field normalization, JSON-LD formatting, list cleanup, lint passes, alt-text generation in bulk. Clear procedure, little judgment. Haiku.
Specialist craft — domain skill within a defined frame. Writing PDP copy, building research briefs, drafting email sequences, scoring content against a rubric, producing structured analysis. The agent knows what to produce; it just has to produce it well. Sonnet.
Strategy and judgment — holding the whole picture, making trade-offs, planning across multiple steps. Orchestration, pricing architecture, competitive positioning, interpreting ambiguous results. Opus.

The assignment is per-task (or per-agent in a multi-agent command), not per-project. A single /seo quick-wins run might touch Haiku to normalize URL slugs, Sonnet to draft title-tag rewrites, and Opus to rank prioritization decisions. That mix is intentional and designed.

How does Claude pricing make tiering worthwhile?

Anthropic's May 2026 Opus 4.8 release repriced the stack: Opus 4.8 now sits at $5/$25 per million tokens (input/output), with a fast mode running roughly 3x cheaper. Sonnet and Haiku remain substantially cheaper than Opus for standard inference. The exact published rates shift, so reason in relative terms: Haiku is roughly 1x, Sonnet roughly 5x, Opus roughly 15x per equivalent token of work. (Check current Anthropic pricing for absolute dollars — these ratios are illustrative but realistic.)

The gap between tiers is what creates the opportunity. A formatting task that takes 2,000 tokens on Haiku costs ~2,000 relative units. The same task on Opus costs ~30,000. At scale — say, a marketing team running 500 list-cleanup tasks a month — that's the difference between a rounding error and a meaningful line item.

What does the cost math look like for a real workflow?

Here is a worked example: 1,000 tasks in a month, distributed the way actual ClaudeKit commands fan out. The split is empirically grounded in how our 101 commands distribute work across the five kits.

Task type	Count	Model	Relative cost per task	Subtotal
Mechanical (cleanup, formatting, lint)	500	Haiku	1x	500
Specialist craft (copy, briefs, analysis)	400	Sonnet	5x	2,000
Strategy / judgment (orchestration, planning)	100	Opus	15x	1,500
Tiered total	1,000	Mixed		4,000

Now the naive approach — run everything on Opus "to be safe":

Task type	Count	Model	Relative cost per task	Subtotal
All tasks	1,000	Opus	15x	15,000

All-Opus:   1,000 × 15  = 15,000 units
Tiered:      500×1 + 400×5 + 100×15 = 4,000 units
Savings:    ~73% fewer units  (~3.7x–10x factor depending on your task mix)

The exact multiple scales with how mechanical your workload is. A pipeline that is 80% formatting and list cleanup saves far more than a strategy-heavy research workflow. The point is structural: every real multi-agent workflow has a large mechanical and specialist component, and paying Opus rates for that component is waste with no correctness benefit.

How do ClaudeKit commands encode model tiers?

Every ClaudeKit skill file and agent definition carries an explicit model assignment. When a command ships, its tier is designed, not accidental. We measured 82,197 tokens across all five kits (EngineerKit: 20,413 tokens / 25 commands; MarketingKit: 16,714 / 20; SEOKit: 16,004 / 19; EcomKit: 16,464 / 20; VideoKit: 12,602 / 17). Token count is only half the cost picture. The other half is the model that processes those tokens.

Here is how the tier pattern plays out across the five kits:

EngineerKit (/eng): Mechanical agents handle formatting and lint on Haiku. Specialist agents like the TDD guide and code-review pass run on Sonnet. The root-cause-first /eng debug flow uses Sonnet for the investigation and promotes to Opus for architectural judgment when the root cause has cross-system implications. The daily-8 commands (catchup, plan, tdd, debug, verify, review, commit, handoff) are mostly Sonnet with Haiku for pure formatting work.

MarketingKit (/mkt): /mkt humanize (strips 14 AI tells from a draft) is Sonnet — it is specialist craft, not strategy. /mkt voice (builds a voice profile from your real posts) is Sonnet for the extraction pass. /mkt repurpose (one piece into 5 formats) fans out Sonnet agents per format. Strategy calls — deciding what the voice profile means for a positioning decision — escalate to Opus.

SEOKit (/seo): /seo quick-wins (positions 8-20 + low-CTR analysis) is largely Sonnet for the interpretation pass and Haiku for URL normalization. /seo citations (N-run AI-citation measurement with confidence intervals) uses Opus for the measurement design and statistical interpretation, Haiku for the raw fetch passes.

EcomKit (/ecom): /ecom no-sales (store triage vs AOV-band benchmarks) is Sonnet for the diagnostic analysis. /ecom margin and /ecom ads use Haiku for data normalization, Sonnet for recommendation drafting, and Opus for the final strategy calls where a wrong answer has direct financial cost.

VideoKit (/video): /video clone (recreate a reference video's style in Remotion and verify the match) uses Sonnet for style extraction and Haiku for asset normalization. /video data analysis runs Sonnet. The verification step is Sonnet not Opus — it is matching against a defined spec, not open-ended judgment.

The v2 architecture uses read-only specialist agents (reviewer, auditor, researcher) rather than blocking quality-gate orchestrators. Commands end with EVIDENCE — a report, a diff, a verified file — not a reviewer agent deciding whether to let the output through. That design removes a class of unnecessary Opus calls that existed in the v1 architecture.

What are the real failure cases to watch for?

Tiering is not bulletproof. The three failure modes are distinct and each has a different fix.

Failure 1: Haiku doing strategy. This is the classic mistake and it produces wrong outputs, not slow ones. Haiku on an orchestration task mis-routes multi-stage workflows because it cannot hold enough context to sequence steps with dependencies. Haiku on a judgment task produces plausible-sounding but unreliable decisions. If you see a command producing inconsistent results at scale, check whether a strategy-tier task is assigned to Haiku.

Failure 2: Sonnet doing deep judgment. Sonnet produces coherent-reading strategy that can miss non-obvious trade-offs. In domains where being plausibly wrong is expensive — pricing guardrails, competitive positioning with real downside risk, interpreting ambiguous A/B test results — Sonnet's output reads fine but may skip a critical constraint. Reserve Opus for tasks where the cost of a subtle error is high, not just tasks that feel "important."

Failure 3: Opus doing mechanical work. No correctness problem — Opus formats JSON perfectly. Pure financial waste at roughly 15x Haiku cost for identical output. This is the most common real-world mistake. People use the best available model "to be safe" and pay the all-Opus column in the table above. There is no safety gain; there is only a larger bill.

The rule that covers all three: assign the lowest tier that reliably succeeds at the task, and no lower. Mechanical and transformation tasks: Haiku. Specialist creation within a defined frame: Sonnet. Open-ended judgment where being subtly wrong is expensive: Opus.

How does tiering interact with token budget management?

Token count and model tier are two independent cost levers, and both matter. We measured every ClaudeKit command using a tiktoken-compatible counter (~4 chars/token). The measured token costs give you one dimension of the cost picture. The model tier gives you the other. A 1,300-token specialist skill on Sonnet and a 1,300-token judgment pass on Opus have the same token count and very different prices.

The two levers compound. If a context spiral bloats a command from 3,000 to 12,000 tokens and that command runs on Opus, you are paying 15x price on 4x the tokens — the combination is ~60x the baseline cost. Getting both levers right (trim context AND tier correctly) is where real cost optimization lives. The context token cost post covers the first lever; this post covers the second.

Claude Code itself offers --model flag per invocation and supports model specification in CLAUDE.md and skill files. ClaudeKit encodes the tier in the skill manifest so you do not have to make the decision per-run. Install any kit with ck install <kit> (global to ~/.claude by default, or --local for project-scope) and the tier assignments ship with the commands. ck tokens <kit> recounts the token ledger; the model assignments are visible in the manifest files under ~/.claude/skills/.

Is Opus 4.8's fast mode worth using for strategy tasks?

Opus 4.8 (shipped May 28, 2026) introduced a fast mode running roughly 3x cheaper than standard Opus. For strategy tasks where latency matters more than maximum reasoning depth — quick deal-desk calls, light orchestration passes, ranking decisions that do not need extended thinking — fast mode closes part of the Opus-Sonnet price gap while keeping Opus-class judgment.

Our current take: use standard Opus for tasks where the output quality directly gates a downstream deliverable (competitive positioning, pricing logic, research design) and fast mode for Opus-tier tasks that are time-sensitive and lower stakes. Do not treat fast mode as a replacement for correct tiering — it is still an order of magnitude more expensive than Haiku for mechanical work.

Dynamic Workflows (Opus 4.8 research preview) can fan out to up to 1,000 subagents. At that scale, tiering is not optional — a 1,000-subagent workflow running on wrong tiers would produce costs an order of magnitude beyond what the task requires. The same three-tier framework applies; the stakes are just higher.

Which ClaudeKit is most relevant for model tiering decisions?

If you are running engineering workflows, EngineerKit has 25 commands and 4 agents with explicit tier assignments across the daily-8 commands. If you are running content and SEO workflows, SEOKit and MarketingKit encode Sonnet-tier craft and Haiku-tier mechanical work across 39 combined commands. The full token ledger and model tier for every command ships in the manifest on install.

FAQ

Which Claude model should I use for routine Claude Code tasks?

Mechanical tasks — formatting, deduplication, field normalization, lint passes — belong on Haiku. Specialist creation tasks — writing copy, building briefs, structured analysis — belong on Sonnet. Judgment-heavy tasks where being subtly wrong is expensive — orchestration, pricing decisions, strategy design — belong on Opus. The orchestrator and any judgment gate are almost always Opus; the bulk of content-producing work is Sonnet; cleanup and transformation are Haiku.

How much money does correct model tiering actually save?

For a representative 1,000-task month split 500 mechanical / 400 specialist / 100 strategy, tiering costs roughly 4,000 relative units versus ~15,000 if you ran everything on Opus — about 73% fewer, or a 3.7x to 10x factor depending on your task mix. The exact multiple depends on how mechanical your workload is. A pipeline that is 80% formatting and list cleanup saves much more than a strategy-heavy workflow. Check current Anthropic pricing for absolute dollar figures; the relative ratios drive the framework.

What breaks if I run everything on Haiku to save money?

Mechanical and many specialist tasks work fine, but strategy tasks produce wrong outputs, not just slow ones. Haiku on an orchestration task mis-routes multi-stage workflows because it cannot hold enough context to sequence steps with dependencies. Haiku on a pricing or judgment task produces plausible-sounding decisions that miss non-obvious constraints. You save tokens and ship broken output. The rule is the lowest tier that reliably succeeds, not the cheapest that runs.

Does Opus 4.8 fast mode change the tiering framework?

Not structurally. Fast mode makes Opus roughly 3x cheaper, which closes part of the Opus-Sonnet gap for time-sensitive judgment tasks. But it is still substantially more expensive than Haiku for mechanical work, so the three-tier framework holds. Use fast mode for Opus-tier tasks where latency matters and the stakes are lower; use standard Opus where the output quality directly gates a downstream deliverable.

Where can I see which model each ClaudeKit command uses?

The model tier is encoded in each skill manifest and ships on install. Run ck install <kit> and the manifest files appear under ~/.claude/skills/ with explicit model fields per agent. ck tokens <kit> prints the token ledger; the model assignments are visible alongside it. The getting-started guide walks through the install and manifest inspection flow.

How does model tiering interact with context window costs?

They are two independent cost levers that compound. Token count determines how much context you are paying to process; model tier determines what you pay per token. A 10,000-token context on Opus costs ~150x more than a 1,000-token context on Haiku. Getting both right — trimming unnecessary context AND assigning the correct model tier — is where real optimization lives. The measured token cost post covers context trimming; this post covers tier assignment.