The template approach:

  1. Skill references template: dcf_template.xlsx in /public/skills/dcf/

  2. Agent reads template once: Understands structure and placeholders

  3. Agent fills parameters: Company-specific values, assumptions

  4. WriteFile with minimal changes: Only modified cells, not full regeneration
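
Step 4 can be sketched as a diff against the template's defaults; the cell addresses and values below are purely illustrative, not from a real DCF template:

```python
def minimal_cell_updates(template_cells: dict, filled_cells: dict) -> dict:
    """Return only the cells whose values differ from the template,
    so the WriteFile touches modified cells rather than regenerating
    the whole workbook."""
    return {
        addr: value
        for addr, value in filled_cells.items()
        if template_cells.get(addr) != value
    }

# Hypothetical template defaults vs. company-specific fills:
template = {"B2": "{{company}}", "B3": 0.0, "B4": 0.10}
filled   = {"B2": "ACME Corp",   "B3": 94.5, "B4": 0.10}  # B4 unchanged

updates = minimal_cell_updates(template, filled)
# Only B2 and B3 need to be written; B4 keeps the template default.
```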

For code generation, the same principle applies. If your agent frequently generates similar Python scripts, data processing pipelines, or analysis frameworks, create reusable functions:

```python
# Instead of regenerating this every time:
def process_earnings_transcript(path):
    # 50 lines of parsing code...
    ...

# Reference a skill with reusable utilities:
from skills.earnings import parse_transcript, extract_guidance
```

The agent imports and calls rather than regenerates. Fewer output tokens, faster responses, more consistent results.

LLMs don’t process context uniformly. Research shows a consistent U-shaped attention pattern: models attend strongly to the beginning and end of prompts while “losing” information in the middle.

Strategic placement matters:

  • System instructions: Beginning (highest attention)

  • Current user request: End (recency bias)

  • Critical context: Beginning or end, never middle

  • Lower-priority background: Middle (acceptable loss)

For retrieval-augmented generation, this means reordering retrieved documents. The most relevant chunks should go at the beginning and end. Lower-ranked chunks fill the middle.
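
One simple reordering that implements this (a sketch; the input is assumed to come from your retriever, sorted most-relevant-first):

```python
def reorder_for_attention(chunks_by_relevance: list) -> list:
    """Place the most relevant chunks at the start and end of the context,
    letting lower-ranked chunks fill the middle.

    Alternates chunks between the front and the back: rank 1 opens the
    context, rank 2 closes it, and the tail ends up in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

reorder_for_attention(["r1", "r2", "r3", "r4", "r5"])
# -> ["r1", "r3", "r5", "r4", "r2"]
```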

Manus uses an elegant hack: they maintain a todo.md file that gets updated throughout task execution. This “recites” current objectives at the end of context, combating the lost-in-the-middle effect across their typical 50-tool-call trajectories. We use a similar architecture at Fintool.
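
A minimal version of the recitation pattern (the message shape is illustrative; Manus's internals aren't public):

```python
def recite_objectives(messages: list, todo_md: str) -> list:
    """Append the current todo list as the final message so active
    objectives sit at the high-attention end of the context."""
    recitation = {
        "role": "user",
        "content": f"Current objectives (from todo.md):\n{todo_md}",
    }
    return messages + [recitation]
```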

As agents run, context grows until it hits the window limit. You used to have two options: build your own summarization pipeline, or implement observation masking (replacing old tool outputs with placeholders). Both require significant engineering.
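
Observation masking is simple to sketch (the schema here is simplified to messages with role `"tool"`; in the real Anthropic API, tool results travel as `tool_result` blocks inside user messages, and the hard engineering is deciding what to keep):

```python
PLACEHOLDER = "[tool output elided to save context]"

def mask_old_observations(messages: list, keep_last: int = 3) -> list:
    """Replace tool outputs older than the last `keep_last` tool results
    with a short placeholder, keeping recent observations verbatim."""
    tool_indices = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    to_mask = set(tool_indices[:-keep_last]) if keep_last else set(tool_indices)
    return [
        {**m, "content": PLACEHOLDER} if i in to_mask else m
        for i, m in enumerate(messages)
    ]
```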

Now you can let the API handle it. Anthropic’s server-side compaction automatically summarizes your conversation when it approaches a configurable token threshold. Claude Code uses this internally, and it’s the reason you can run 50+ tool call sessions without the agent losing track of what it’s doing.

The key design decisions:

  • Trigger threshold: Default is 150K tokens. Set it lower if you want to stay under the 200K pricing cliff, or higher if you need more raw context before summarizing.

  • Custom instructions: You can replace the default summarization prompt entirely. For financial workflows, something like “Preserve all numerical data, company names, and analytical conclusions” prevents the summary from losing critical details.

  • Pause after compaction: The API can pause after generating the summary, letting you inject additional context (like preserving the last few messages verbatim) before continuing. This gives you control over what survives the compression.

Compaction also stacks well with prompt caching. Add a cache breakpoint on your system prompt so it stays cached separately. When compaction occurs, only the summary needs to be written as a new cache entry. Your system prompt cache stays warm.
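
A sketch of the breakpoint, following Anthropic's prompt-caching request shape (the model name and prompt text are placeholders):

```python
# Build the request so the system prompt caches independently of the
# (compactable) conversation. The cache_control marker ends the stable
# prefix; everything after it can change without invalidating the
# system prompt's cache entry.
request = {
    "model": "claude-sonnet-4-20250514",
    "max_tokens": 2000,
    "system": [
        {
            "type": "text",
            "text": "You are a financial research agent...",
            "cache_control": {"type": "ephemeral"},  # cache breakpoint
        }
    ],
    "messages": [],  # conversation (and any post-compaction summary) goes here
}
```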

The beauty of this approach: context depreciates in value over time, and the API handles the depreciation schedule for you.

Output tokens are the most expensive tokens. With Claude Sonnet, outputs cost 5x inputs. With Opus, they cost 5x inputs that are already expensive.

Yet most developers leave max_tokens unlimited and hope for the best.

```python
# BAD: one blanket cap for every task
response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=8192,  # model might use all of this
    messages=[...],
)

# GOOD: task-appropriate limits
TASK_LIMITS = {
    "classification": 50,
    "extraction": 200,
    "short_answer": 500,
    "analysis": 2000,
    "code_generation": 4000,
}
```
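
Applying the table is then a lookup with a conservative fallback (how you classify the task upstream is up to your application):

```python
TASK_LIMITS = {"classification": 50, "extraction": 200, "short_answer": 500,
               "analysis": 2000, "code_generation": 4000}

def max_tokens_for(task_type: str, default: int = 1000) -> int:
    """Look up a task-appropriate output cap, falling back to a
    conservative default for unrecognized task types."""
    return TASK_LIMITS.get(task_type, default)
```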

Structured outputs reduce verbosity. JSON responses use fewer tokens than natural language explanations of the same information.

Natural language: “The company’s revenue was 94.5 billion dollars,
which represents a year-over-year increase of 12.3 percent compared
to the previous fiscal year’s revenue of 84.2 billion dollars.”

Structured: {"revenue": 94.5, "unit": "B", "yoy_change": 12.3}

For agents specifically, consider response chunking. Instead of generating a 10,000-token analysis in one shot, break it into phases:

  1. Outline phase: Generate structure (500 tokens)

  2. Section phases: Generate each section on demand (1000 tokens each)

  3. Review phase: Check and refine (500 tokens)

This gives you control points to stop early if the user has what they need, rather than always generating the maximum possible output.
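
The phase loop can be sketched as follows; `generate` stands in for your LLM call, and `needed_sections` is a hypothetical early-stop filter:

```python
def phased_generation(generate, topic: str, needed_sections=None):
    """Generate an analysis in phases with control points between them.
    `generate` is a stand-in for your LLM call: (prompt, max_tokens) -> str."""
    # Phase 1: outline (small token budget)
    outline = generate(f"Outline an analysis of {topic}.", 500)

    # Phase 2: sections on demand, skipping ones the user doesn't need
    sections = []
    for heading in outline.splitlines():
        if not heading.strip():
            continue
        if needed_sections is not None and heading not in needed_sections:
            continue  # control point: stop early / skip
        sections.append(generate(f"Write the section: {heading}", 1000))

    # Phase 3: review pass
    review = generate("Check the sections for consistency.", 500)
    return outline, sections, review
```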

With Claude Opus 4.6 and Sonnet 4.5, crossing 200K input tokens triggers premium pricing. Your per-token cost doubles: Opus goes from $5 to $10 per million input tokens, and output jumps from $25 to $37.50. This isn’t gradual. It’s a cliff.

This is the LLM equivalent of a tax bracket. And just like tax planning, the right strategy is to stay under the threshold when you can.

For agent workflows that risk crossing 200K, implement a context budget. Track cumulative input tokens across tool calls. When you approach the cliff, trigger aggressive compression: observation masking, summarization of older turns, or pruning low-value context. The cost of a compression step is far less than doubling your per-token rate for the rest of the conversation.
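
A minimal budget tracker (the 20K-token safety margin is an illustrative choice, not an API recommendation):

```python
PRICING_CLIFF = 200_000  # premium pricing begins past this input size

class ContextBudget:
    """Track cumulative input tokens across tool calls and flag when
    aggressive compression should run, before crossing the cliff."""
    def __init__(self, cliff: int = PRICING_CLIFF, margin: int = 20_000):
        self.cliff = cliff
        self.margin = margin
        self.input_tokens = 0

    def record(self, tokens: int) -> None:
        self.input_tokens += tokens

    def should_compress(self) -> bool:
        # Trigger masking/summarization/pruning while still under the cliff.
        return self.input_tokens >= self.cliff - self.margin
```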

Every sequential tool call is a round trip. Each round trip re-sends the full conversation context. If your agent makes 20 tool calls sequentially, that’s 20 times the context gets transmitted and billed.

The Anthropic API supports parallel tool calls: the model can request multiple independent tool calls in a single response, and you execute them simultaneously. This means fewer round trips for the same amount of work.

The savings compound. With fewer round trips, you accumulate less intermediate context, which means each subsequent round trip is also cheaper. Design your tools so that independent operations can be identified and batched by the model.
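
Executing a batch of independent calls concurrently takes only a few lines (the call and result shapes here are simplified stand-ins for Anthropic's `tool_use`/`tool_result` blocks):

```python
import asyncio

async def run_tool_calls(tool_calls: list, tools: dict) -> list:
    """Execute independent tool calls from one model response concurrently,
    so all results go back in a single follow-up message (one round trip)."""
    async def run_one(call):
        result = await tools[call["name"]](**call["input"])
        return {"tool_use_id": call["id"], "content": result}

    # gather preserves order, so results line up with the requested calls
    return await asyncio.gather(*(run_one(c) for c in tool_calls))
```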

The cheapest token is the one you never send to the API.

Before any LLM call, check if you’ve already answered this question. At Fintool, we cache aggressively for earnings call summarizations and common queries. When a user asks for Apple’s latest earnings summary, we don’t regenerate it from scratch for every request. The first request pays the full cost. Every subsequent request is essentially free.

This operates above the LLM layer entirely. It’s not prompt caching or KV cache. It’s your application deciding that this query has a valid cached response and short-circuiting the API call.

Good candidates for application-level caching:

  • Factual lookups: Company financials, earnings summaries, SEC filings

  • Common queries: Questions that many users ask about the same data

  • Deterministic transformations: Data formatting, unit conversions

  • Stable analysis: Any output that won’t change until the underlying data changes

The cache invalidation strategy matters. For financial data, earnings call summaries are stable once generated. Real-time price data obviously isn’t. Match your cache TTL to the volatility of the underlying data.

Even partial caching helps. If an agent task involves five tool calls and you can cache two of them, you’ve cut 40% of your tool-related token costs without touching the LLM.
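
A minimal TTL cache above the LLM layer (the keying scheme and TTL policy are up to your application):

```python
import time

class ResponseCache:
    """Application-level cache: short-circuit the API call entirely when a
    query already has a valid cached response. Match TTL to the volatility
    of the underlying data (long for earnings summaries, short for prices)."""
    def __init__(self):
        self._store = {}  # key -> (expires_at, response)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]
        return None  # miss or expired: caller falls through to the API

    def put(self, key: str, response: str, ttl_seconds: float) -> None:
        self._store[key] = (time.time() + ttl_seconds, response)
```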

Context engineering isn’t glamorous. It’s not the exciting part of building agents. But it’s the difference between a demo that impresses and a product that scales with decent gross margins.

The best teams building sustainable agent products are obsessing over token efficiency the same way database engineers obsess over query optimization. Because at scale, every wasted token is money on fire.

The context tax is real. But with the right architecture, it’s largely avoidable.