Economics · June 2, 2026 · 8 min read
Token Costs Are Becoming an Operating Issue
The important question is no longer whether a model can complete a task. It is whether the system can complete recurring work without sending every step through the most expensive route.
Recurring work changes the cost conversation
A single model call is easy to reason about. An agentic workflow is not one call. It may include system instructions, conversation history, retrieved files, tool descriptions, intermediate results, retries, verification passes, and delegated subtasks. Each step can add input or output tokens. The operational cost is the shape of the workflow multiplied by how often it runs.
Provider pricing pages already expose the dimensions teams need to watch: input tokens, cached input tokens, output tokens, model tier, and processing mode. OpenAI, Anthropic, and Google each document mechanisms for reusing repeated context at lower cost when requests meet their caching conditions. [1]
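To make those pricing dimensions concrete, here is a minimal sketch of per-request cost accounting. The `Rates` class, the `request_cost` function, and every price in it are hypothetical placeholders for illustration, not real provider rates. The sketch also ignores provider-specific details such as model tier, processing mode, and any charges tied to creating or storing cached context.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical prices in dollars per million tokens (not real provider rates)."""
    input_per_m: float
    cached_input_per_m: float
    output_per_m: float


def request_cost(rates, input_tokens, cached_tokens, output_tokens):
    """Cost of one request.

    cached_tokens is the portion of input_tokens billed at the cached rate;
    the remainder is billed at the normal input rate.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    total = (
        fresh_tokens * rates.input_per_m
        + cached_tokens * rates.cached_input_per_m
        + output_tokens * rates.output_per_m
    )
    return total / 1_000_000


# Placeholder numbers chosen only to make the arithmetic easy to follow.
PLACEHOLDER_RATES = Rates(input_per_m=10.0, cached_input_per_m=1.0, output_per_m=40.0)

cost = request_cost(
    PLACEHOLDER_RATES,
    input_tokens=30_000,
    cached_tokens=20_000,
    output_tokens=2_000,
)
print(f"cost per request: ${cost:.2f}")
```

Multiplying a figure like this by the number of calls in a workflow, and then by how often the workflow runs, gives the operating cost the section describes.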
Where token usage accumulates
Most teams do not have one obvious source of waste. Usage accumulates across ordinary decisions that looked harmless when the workflow was smaller.
- Instructions: the system prompt, tool descriptions, policy text, and output schemas that accompany a request.
- History: prior messages and intermediate responses carried forward to preserve continuity.
- Evidence: files, logs, search results, database rows, screenshots, and tool output included for interpretation.
- Iteration: planning passes, tool calls, retries, verification turns, and follow-up questions.
- Coordination: context handed between a supervisor and specialized agents.
None of these categories is inherently wasteful. The problem is accidental repetition. A workflow becomes expensive when it cannot distinguish durable knowledge from temporary evidence, premium reasoning from routine work, or model interpretation from deterministic execution.
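One way to make accidental repetition visible is to fingerprint each context segment and count how many tokens are resent unchanged. The sketch below is illustrative only: the function names are hypothetical, the category labels are borrowed from the list above, and `count_tokens` is a crude whitespace stand-in for a real tokenizer.

```python
import hashlib


def count_tokens(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def repetition_report(turns):
    """Summarize sent and repeated tokens per category.

    turns is a list of requests; each request is a list of
    (category, text) segments. A segment counts as repeated when
    identical text was already sent earlier in the workflow.
    """
    seen = set()
    report = {}
    for segments in turns:
        for category, text in segments:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            tokens = count_tokens(text)
            entry = report.setdefault(category, {"sent": 0, "repeated": 0})
            entry["sent"] += tokens
            if digest in seen:
                entry["repeated"] += tokens
            seen.add(digest)
    return report


turns = [
    [("instructions", "You are a code reviewer"),
     ("evidence", "log line one two")],
    [("instructions", "You are a code reviewer"),
     ("evidence", "log line one two"),
     ("evidence", "new stack trace")],
]
report = repetition_report(turns)
for category, counts in report.items():
    print(category, counts)
```

A high repeated count is not automatically waste. Stable instructions may need to accompany every request; the report only shows where repetition happens so that it becomes a decision instead of an accident.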
Premium models earn their place
Complex planning, difficult debugging, security-sensitive review, and final synthesis can justify premium reasoning. Routine evidence compression, classification, simple extraction, and repetitive helper work often deserve a different route. The goal is not to make every task cheap. The goal is to reserve expensive reasoning for the parts of a workflow where it changes the result.
This is why model choice belongs in the architecture. A system that has only one route will naturally send every prompt, tool result, and retry through that route. A system with deliberate routing can use premium, lighter, cheaper, local, or deterministic options according to the work.
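As a rough illustration of deliberate routing, the sketch below maps a task description to one of several routes. The route names, the task kinds, and the precedence rules are all assumptions made for this example; a real system would derive them from its own workloads and constraints.

```python
from enum import Enum


class Route(Enum):
    DETERMINISTIC = "deterministic"
    LOCAL = "local"
    LIGHT = "light"
    PREMIUM = "premium"


# Hypothetical mapping from task kind to default route.
ROUTE_BY_KIND = {
    "planning": Route.PREMIUM,
    "debugging": Route.PREMIUM,
    "security_review": Route.PREMIUM,
    "final_synthesis": Route.PREMIUM,
    "compression": Route.LIGHT,
    "classification": Route.LIGHT,
    "extraction": Route.LIGHT,
}


def choose_route(kind, *, has_deterministic_tool=False, must_stay_local=False):
    """Pick a route for one unit of work.

    Precedence (an assumption of this sketch): a deterministic tool wins
    when one exists, then a locality constraint, then the per-kind default.
    Unknown kinds fall back to the premium route as a conservative choice.
    """
    if has_deterministic_tool:
        return Route.DETERMINISTIC
    if must_stay_local:
        return Route.LOCAL
    return ROUTE_BY_KIND.get(kind, Route.PREMIUM)


print(choose_route("classification"))
print(choose_route("planning"))
print(choose_route("extraction", has_deterministic_tool=True))
```

The value of a function like this lies less in its rules than in the fact that the routing decision exists in one place where it can be reviewed and changed.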
Caching helps, but it is not the whole architecture
OpenAI documents automatic prompt caching for qualifying repeated prefixes. Anthropic documents cache breakpoints for reusable prompt prefixes. Google documents implicit and explicit context caching, including explicit caching for substantial context referenced by shorter follow-up requests. [2]
Those capabilities matter. They reward stable prefixes and reusable context. They do not remove the need to choose routes carefully. A cached premium-model call may still be the wrong call for a routine task, and a local deterministic operation may avoid a model call entirely.
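Because prefix-based caching rewards stable prefixes, the order in which a prompt is assembled matters. The sketch below is a simplified illustration with hypothetical helper names: it treats a prompt as a list of segments and measures how many leading segments two requests share. Whether a shared prefix actually produces a cache hit depends on each provider's documented conditions, which this example does not model.

```python
def build_prompt(stable_parts, volatile_parts):
    """Place stable content first so consecutive requests share a long prefix."""
    return list(stable_parts) + list(volatile_parts)


def shared_prefix_len(a, b):
    """Number of leading segments that are identical in both prompts."""
    count = 0
    for left, right in zip(a, b):
        if left != right:
            break
        count += 1
    return count


stable = ["SYSTEM INSTRUCTIONS", "TOOL DESCRIPTIONS", "POLICY TEXT"]

# Stable parts first: both requests begin with the same three segments.
first = build_prompt(stable, ["question A"])
second = build_prompt(stable, ["question B"])
stable_first_overlap = shared_prefix_len(first, second)

# Volatile part first: the prompts diverge at the very first segment.
volatile_first_overlap = shared_prefix_len(
    ["question A"] + stable,
    ["question B"] + stable,
)

print(stable_first_overlap, volatile_first_overlap)
```

The same content is sent in both arrangements; only the ordering changes how much of it is reusable as a prefix.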
A practical operating model
A sustainable agent system asks five questions of each step in the workflow:
- Can a deterministic tool answer this? Search indexes, parsers, policy checks, database queries, and local scripts should run mechanically when interpretation is unnecessary.
- Does the work need a model? If interpretation is useful, identify the smallest context package that supports a reliable answer.
- Which route fits the task? Use premium reasoning for ambiguity and depth, lighter routes for suitable helper work, and local routes when privacy or offline operation matters.
- What should persist? Retain useful knowledge and stable instructions without resending temporary debris forever.
- Can the result be inspected? Capture model choice, context size, tools, token counts, retries, errors, and previews so the route can improve.
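The last question, whether a result can be inspected, implies some per-request record. Here is a minimal sketch of what such a record might hold. The `RouteTrace` class, its field names, and the model name in the usage example are hypothetical; they simply mirror the items listed in the final bullet above.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class RouteTrace:
    """One inspectable record per model request (illustrative fields only)."""
    task: str
    route: str
    model: str
    input_tokens: int
    output_tokens: int
    tools: List[str] = field(default_factory=list)
    retries: int = 0
    error: Optional[str] = None
    preview: str = ""

    def to_json(self):
        """Serialize to a single JSON line suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


def make_preview(text, limit=80):
    """Keep only the first `limit` characters so logs stay small."""
    return text[:limit]


trace = RouteTrace(
    task="repository-analysis/synthesis",
    route="premium",
    model="hypothetical-premium-model",  # placeholder, not a real model name
    input_tokens=45_000,
    output_tokens=1_200,
    tools=["search_index", "parser", "db_query"],
    retries=1,
    preview=make_preview("Summary of findings: " + "x" * 200),
)
print(trace.to_json())
```

Records like this make it possible to compare routes after the fact, which is what allows a routing rule to improve over time.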
Worked example: the hidden multiplier
Consider a recurring repository-analysis workflow. A supervisor sends 30,000 input tokens to a premium model, receives a plan, calls three tools, then sends a combined 45,000-token evidence package back for synthesis. A verification pass adds another 35,000 input tokens. The visible user action looked like one task. The workflow consumed 110,000 input tokens before any delegated helper work or retry.
Now repeat that pattern 1,000 times per month. The operating issue is not the price of one prompt. It is whether all 110 million input tokens needed the same route, whether stable prefixes could be cached, whether deterministic tooling could narrow the evidence, and whether a lighter model could compress routine material before premium synthesis.
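The arithmetic behind this example can be checked in a few lines. The token counts and run frequency come from the example above; the stage labels are shorthand added for this sketch.

```python
# Input tokens per stage, taken from the worked example.
STAGES = {
    "planning": 30_000,
    "synthesis": 45_000,
    "verification": 35_000,
}
RUNS_PER_MONTH = 1_000

per_run = sum(STAGES.values())        # 110,000 input tokens per run
per_month = per_run * RUNS_PER_MONTH  # 110,000,000 input tokens per month

# Share of each stage in the per-run total, useful for deciding where
# caching, narrower evidence, or a lighter route would matter most.
shares = {name: tokens / per_run for name, tokens in STAGES.items()}

print(f"per run:   {per_run:,}")
print(f"per month: {per_month:,}")
for name, share in shares.items():
    print(f"{name}: {share:.0%}")
```

Seen this way, the synthesis step is the largest single contributor, which makes the evidence package a natural first candidate for narrowing or compression.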
Questions teams should ask
- Which tasks need premium reasoning, and which only need a reliable helper route?
- How much repeated context is sent on every turn?
- Can stable instructions, tools, or documents benefit from provider caching?
- Can suitable work move to a lighter cloud model or a local model?
- Can deterministic tools answer part of the question before a model receives it?
- Can the system show token counts and context sent for each request?
Routing is an operating discipline
The teams that treat token usage as an operating issue will be better prepared for sustained agentic work. They will know what receives premium reasoning, what is delegated, what is cached, what is local, and what can be inspected afterward.
Continue with "How Kaptain Reduces Unnecessary Token Spend" or read the Agent Ecosystem OS whitepaper.
Sources
1. OpenAI API Pricing; Anthropic Pricing; Gemini Developer API Pricing. Accessed June 2, 2026.
2. OpenAI Prompt Caching; Anthropic Prompt Caching; Gemini Context Caching. Accessed June 2, 2026.