MCP server that reduces Claude Code context consumption by 98%

2026-02-28 10:01 · mksg.lu

MCP server that reduces Claude Code context consumption by 98%. 315 KB becomes 5.4 KB.

Every MCP tool call in Claude Code dumps raw data into your 200K context window. A Playwright snapshot costs 56 KB. Twenty GitHub issues cost 59 KB. One access log — 45 KB. After 30 minutes, 40% of your context is gone.

Context Mode is an MCP server that sits between Claude Code and these outputs. 315 KB becomes 5.4 KB. 98% reduction.

The Problem

MCP became the standard way for AI agents to use external tools. But there's a tension at its core: every tool interaction fills the context window from both sides — definitions on the way in, raw output on the way out.

With 81+ tools active, 143K tokens (72%) get consumed before your first message. Then the tools start returning data. A single Playwright snapshot burns 56 KB. A gh issue list dumps 59 KB. Run a test suite, read a log file, fetch documentation — each response eats into what remains.

Cloudflare showed with Code Mode (https://blog.cloudflare.com/code-mode-mcp/) that tool definitions can be compressed by 99.9%. We asked: what about the other direction?

How the Sandbox Works

Each execute call spawns an isolated subprocess, so scripts can't access each other's memory or state. The subprocess runs your code, captures stdout, and only that stdout enters the conversation context. The raw data (log files, API responses, snapshots) never leaves the sandbox.
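
This capture pattern is small enough to sketch. Below is a minimal illustration in Python, not the actual implementation (Context Mode ships as a Node package); the `execute` name is borrowed from the tool, and the runtime parameter is an assumption:

```python
import subprocess

def execute(code: str, runtime: str = "python3", timeout: int = 30) -> str:
    """Run a script in an isolated subprocess and return only its stdout.

    Whatever the script reads (log files, API responses) stays inside the
    child process; the caller sees nothing but the printed summary.
    """
    result = subprocess.run(
        [runtime, "-c", code],
        capture_output=True,  # stdout/stderr are captured, not inherited
        text=True,
        timeout=timeout,
    )
    return result.stdout

# A 500-line "access log" collapses to a one-line summary before it
# can ever reach the conversation context.
summary = execute(
    "lines = ['GET /a 200'] * 500\n"
    "errors = [l for l in lines if ' 500' in l]\n"
    "print(f'{len(lines)} requests, {len(errors)} errors')"
)
```

Credential passthrough falls out of the same mechanism: `subprocess.run` inherits the parent's environment by default, so authenticated CLIs work without their tokens ever appearing in the conversation.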

Ten language runtimes are available: JavaScript, TypeScript, Python, Shell, Ruby, Go, Rust, PHP, Perl, R. Bun is auto-detected for 3-5x faster JS/TS execution.

Authenticated CLIs (gh, aws, gcloud, kubectl, docker) work through credential passthrough — the subprocess inherits environment variables and config paths without exposing them to the conversation.

How the Knowledge Base Works

The index tool chunks markdown content by headings while keeping code blocks intact, then stores them in a SQLite FTS5 (Full-Text Search 5) virtual table. Search uses BM25 ranking — a probabilistic relevance algorithm that scores documents based on term frequency, inverse document frequency, and document length normalization. Porter stemming is applied at index time so "running", "runs", and "ran" match the same stem.
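
A minimal sketch of that index, using only what the paragraph describes (an FTS5 virtual table with the Porter tokenizer, queried with BM25 ranking); the table and column names are illustrative, not Context Mode's actual schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One row per markdown chunk: heading hierarchy plus the chunk body,
# tokenized with Porter stemming at index time.
con.execute("""
    CREATE VIRTUAL TABLE chunks USING fts5(
        heading, body, tokenize = 'porter unicode61'
    )
""")
con.executemany("INSERT INTO chunks VALUES (?, ?)", [
    ("Install > npm", "Running `npm install` fetches dependencies."),
    ("Usage > CLI", "The CLI runs in watch mode by default."),
])
# 'rank' orders by BM25 relevance (a lower bm25() score is a better match).
# Porter stemming makes the query "run" match both "Running" and "runs".
rows = con.execute(
    "SELECT heading FROM chunks WHERE chunks MATCH 'run' ORDER BY rank"
).fetchall()
```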

When you call search, it returns exact code blocks with their heading hierarchy — not summaries, not approximations, the actual indexed content. fetch_and_index extends this to URLs: fetch, convert HTML to markdown, chunk, index. The raw page never enters context.

The Numbers

Validated across 11 real-world scenarios — test triage, TypeScript error diagnosis, git diff review, dependency audit, API response processing, CSV analytics. All under 1 KB output each.

  • Playwright snapshot: 56 KB → 299 B
  • GitHub issues (20): 59 KB → 1.1 KB
  • Access log (500 requests): 45 KB → 155 B
  • Analytics CSV (500 rows): 85 KB → 222 B
  • Git log (153 commits): 11.6 KB → 107 B
  • Repo research (subagent): 986 KB → 62 KB (5 calls vs 37)

Over a full session: 315 KB of raw output becomes 5.4 KB. Session time before slowdown goes from ~30 minutes to ~3 hours. Context remaining after 45 minutes: 99% instead of 60%.

Install

Two ways. Plugin Marketplace gives you auto-routing hooks and slash commands:

/plugin marketplace add mksglu/claude-context-mode

/plugin install context-mode@claude-context-mode

Or MCP-only if you just want the tools:

claude mcp add context-mode -- npx -y context-mode

Restart Claude Code. Done.

What Actually Changes

You don't change how you work. Context Mode includes a PreToolUse hook that automatically routes tool outputs through the sandbox. Subagents learn to use batch_execute as their primary tool. Bash subagents get upgraded to general-purpose so they can access MCP tools.

The practical difference: your context window stops filling up. Sessions that used to hit the wall at 30 minutes now run for 3 hours. The same 200K tokens, used more carefully.

Why We Built This

I run the MCP Directory & Hub, which serves 100K+ daily requests, so I see every MCP server that ships. The pattern was clear: everyone builds tools that dump raw data into context. Nobody was solving the output side.

Cloudflare's Code Mode blog post crystallized it. They compressed tool definitions. We compress tool outputs. Same principle, other direction.

Built it for my own Claude Code sessions first. Noticed I could work 6x longer before context degradation. Open-sourced it.

Open source. MIT. github.com/mksglu/claude-context-mode

Mert Köseoğlu, Senior Software Engineer, AI consultant. x.com/mksglu · linkedin.com/in/mksglu · mksg.lu



Comments

  • By blakec 2026-03-01 4:43 (3 replies)

    The FTS5 index approach here is right, but I'd push further: pure BM25 underperforms on tool outputs because they're a mix of structured data (JSON, tables, config) and natural language (comments, error messages, docstrings). Keyword matching falls apart on the structured half.

    I built a hybrid retriever for a similar problem, compressing a 15,800-file Obsidian vault into a searchable index for Claude Code. Stack is Model2Vec (potion-base-8M, 256-dimensional embeddings) + sqlite-vec for vector search + FTS5 for BM25, combined via Reciprocal Rank Fusion. The database is 49,746 chunks in 83MB. RRF is the important piece: it merges ranked lists from both retrieval methods without needing score calibration, so you get BM25's exact-match precision on identifiers and function names plus vector search's semantic matching on descriptions and error context.
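
For anyone who hasn't seen Reciprocal Rank Fusion, it reduces to a few lines. A sketch with invented document IDs (k=60 is the constant from the original RRF paper):

```python
def rrf(ranked_lists, k=60):
    """Merge ranked result lists without score calibration.

    Each document scores sum(1 / (k + rank)) over the lists it appears
    in, so agreement between retrievers beats a high rank in one list.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 nails the exact identifier; the vector list nails the paraphrase.
bm25_hits = ["fn_parse_config", "fn_load_yaml", "docs_overview"]
vector_hits = ["docs_overview", "fn_parse_config", "notes_errors"]
fused = rrf([bm25_hits, vector_hits])
```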

    The incremental indexing matters too. If you're indexing tool outputs per-session, the corpus grows fast. My indexer has a --incremental flag that hashes content and only re-embeds changed chunks. Full reindex of 15,800 files takes ~4 minutes; incremental on a typical day's changes is under 10 seconds.
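
The incremental piece is just content hashing. A sketch of the idea (function names are illustrative, and `embed` stands in for the expensive Model2Vec call):

```python
import hashlib

def incremental_index(chunks, seen_hashes, embed):
    """Re-embed only chunks whose content changed since the last run."""
    embedded = 0
    for chunk_id, text in chunks.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if seen_hashes.get(chunk_id) == digest:
            continue  # content identical to what was embedded last time
        embed(chunk_id, text)
        seen_hashes[chunk_id] = digest
        embedded += 1
    return embedded

calls, store = [], {}
incremental_index({"a": "one", "b": "two"}, store, lambda i, t: calls.append(i))
# Second run: only the chunk whose text changed gets re-embedded.
changed = incremental_index({"a": "one", "b": "TWO"}, store, lambda i, t: calls.append(i))
```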

    On the caching question raised upthread: this approach actually helps prompt caching because the compressed output is deterministic for the same query. The raw tool output would be different every time (timestamps, ordering), but the retrieved summary is stable if the underlying data hasn't changed.

    One thing I'd add to Context Mode's architecture: the same retriever could run as a PostToolUse hook, compressing outputs before they enter the conversation. That way it's transparent to the agent, it never sees the raw dump, just the relevant subset.

    • By thecopy 2026-03-01 13:09 (1 reply)

      Very interesting. One big wrinkle with OP's approach is exactly that: structured responses, which many tools return, are left untouched. The solution in OP, as I understand it, is the "execute" method. However, I'm building an MCP gateway, and such sandboxed execution isn't available (...yet), so your approach to this sounds very clever. I'll spend the day trying it out.

    • By danw1979 2026-03-01 9:05 (1 reply)

      Would love to read a more in depth write up of this if you have the time !

      I suspect the obsessive note-taker crowd on HN would appreciate it too.

      • By blakec 2026-03-03 3:29 (2 replies)

        I wrote it up. The full system reference is here: https://blakecrosley.com/guides/obsidian — vault architecture, hybrid retrieval (Model2Vec + FTS5 + RRF), MCP integration, incremental indexing, operational patterns. Covers everything from a 200-file vault to the 16,000-file setup I run.

        The hybrid retriever piece has its own deep dive with the RRF math and an interactive fusion calculator: https://blakecrosley.com/blog/hybrid-retriever-obsidian

        See what your coding agent thinks of it and let me know if you have ways to improve it.

        • By thecopy 2026-03-03 12:56 (1 reply)

          I implemented this as well, successfully. Re structured data, I transformed it from JSON into more "natural language". I also ended up using MiniLM-L6-v2. I'll post a GitHub link once I've packaged it independently (it's currently in the main app code; I want to extract it into an independent micro-service).

          You wrote:

          >A search for “review configuration” matches every JSON file with a review key.

          It's a good point. I'm not sure how to de-rank the keys or encode the "commonness" of those words.

          • By blakec 2026-03-03 19:38

            IDF handles most of it. In BM25, inverse document frequency naturally down-weights terms that appear in every document, so JSON keys like "id", "status", "type" that show up in every chunk get low IDF scores automatically. The rare, meaningful keys still rank.

            For the remaining noise, I chunk the flattened key-paths separately from the values. The key-path goes into a metadata field that BM25 indexes but with lower weight. The value goes into the main content field. So a search for "review configuration" matches on the value side, not because "configuration" appeared as a JSON key in 500 files.

            MiniLM-L6-v2 is solid. I went with Model2Vec (potion-base-8M) for the speed tradeoff. 50-500x faster on CPU, 89% of MiniLM quality on MTEB. For a microservice where you're embedding on every request, the latency difference matters more than the quality gap.
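
FTS5 can express that key-path/value split directly: its bm25() auxiliary function accepts one weight per column, so key-path hits can count for a fraction of value hits. A toy sketch with an invented schema (the filler rows keep the query terms rare enough for IDF to bite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Flattened JSON key-paths in one column, their values in another.
con.execute(
    "CREATE VIRTUAL TABLE kv USING fts5(keypath, value, tokenize='porter')"
)
con.executemany("INSERT INTO kv VALUES (?, ?)", [
    ("review.configuration.id", "42"),
    ("settings.theme", "dark mode configuration reviewed by the team"),
    ("user.name", "alice"),
    ("build.target", "linux x86"),
    ("cache.ttl", "3600 seconds"),
])
# bm25() takes one weight per column: keypath matches count 10x less,
# so a doc matching in its *value* outranks one matching only on keys.
top = con.execute(
    "SELECT keypath FROM kv WHERE kv MATCH 'configuration review' "
    "ORDER BY bm25(kv, 0.1, 1.0) LIMIT 1"
).fetchone()
```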

        • By danw1979 2026-03-05 8:04

          Thank you !

    • By tclancy 2026-03-01 12:36

      Seconded that I would love to see the what, why and how of your Obsidian work.

  • By mksglu 2026-02-28 10:02 (5 replies)

    Author here. I shared the GitHub repo a few days ago (https://news.ycombinator.com/item?id=47148025) and got great feedback. This is the writeup explaining the architecture.

    The core idea: every MCP tool call dumps raw data into your 200K context window. Context Mode spawns isolated subprocesses — only stdout enters context. No LLM calls, purely algorithmic: SQLite FTS5 with BM25 ranking and Porter stemming.

    Since the last post we've seen 228 stars and some real-world usage data. The biggest surprise was how much subagent routing matters — auto-upgrading Bash subagents to general-purpose so they can use batch_execute instead of flooding context with raw output.

    Source: https://github.com/mksglu/claude-context-mode. Happy to answer any architecture questions.

    • By lkbm 2026-03-01 3:31

      Small suggestion: link to the Cloudflare Code Mode post[0] in the blog post where you mention it. It's linked in the README, but when I saw it in the blog post, I had to Google it.

      [0] https://blog.cloudflare.com/code-mode-mcp/

    • By re5i5tor 2026-02-28 17:16 (2 replies)

      Really intrigued and def will try, thanks for this.

      In connecting the dots (and help me make sure I'm connecting them correctly), context-mode _does not address MCP context usage at all_, correct? You are instead suggesting we refactor or eliminate MCP tools, or apply concepts similar to context_mode in our MCPs where possible?

      Context-mode is still very high value, even if the answer is "no," just want to make sure I understand. Also interested in your thoughts about the above.

      I write a number of MCPs that work across all Claude surfaces; so the usual "CLI!" isn't as viable an answer (though with code execution it sometimes can be) ...

      Edit: typo

      • By mksglu 2026-02-28 21:11 (2 replies)

        Right, context-mode doesn't change how MCP tool definitions get loaded into context. That's the "input side" problem that Cloudflare's Code Mode tackles by compressing tool schemas. Context-mode handles the "output side," the data that comes back from tool calls. That said, if you're writing your own MCPs, you could apply the same pattern directly. Instead of returning raw payloads, have your MCP server return a compact summary and store the full output somewhere queryable. Context-mode just generalizes that so you don't have to rebuild it per server.

        • By re5i5tor 2026-02-28 23:42 (1 reply)

          Hmmm. I was talking about the output side. When data comes back from an MCP tool call, context-mode is still not in the loop, not able to help, is it?

          Edit: clarify "MCP tool"

          • By re5i5tor 2026-03-01 1:25

            I dug into this further. Tested empirically and read the code.

            Confirmed: context-mode cannot intercept MCP tool responses. The PreToolUse hook (hooks/pretooluse.sh) matches only Bash|Read|Grep|Glob|WebFetch|WebSearch|Task. When I called my obsidian MCP's obsidian_list via MCP, the response went straight into context — zero entries in context-mode's FTS5 database. The web fetches from the same session were all indexed.

            The context-mode skill (SKILL.md) actually acknowledges this at lines 71-77 with an "after-the-fact" decision tree for MCP output: if it's already in context, use it directly; if you need to search it again, save to file then index. But that's damage control — the context is already consumed. You can't un-eat those tokens.

            The architectural reason: MCP tool responses flow via JSON-RPC directly to the model. There's no PostToolUse hook in Claude Code that could modify or compress a response before it enters context. And you can't call MCP tools from inside a subprocess, so the "run it in a sandbox" pattern doesn't apply.

            So the 98% savings are real but scoped to built-in tools and CLI wrappers (curl, gh, kubectl, etc.) — anything replicable in a subprocess. For third-party MCP tools with unique capabilities (Excalidraw rendering, calendar APIs, Obsidian vault access), the MCP author has to apply context-mode's concepts server-side: return compact summaries, store full output queryably, expose drill-down tools. Which is essentially what you suggested above.

            Still very high value for the built-in tool side. Just want the boundary to be clear.

            Correct any misconceptions please!

        • By matusfaro 2026-03-07 16:09

          [dead]

    • By nextaccountic 2026-03-01 3:32

      Can this be used with other agents? I'm looking specifically into the Zed Agent

    • By esafak 2026-02-28 20:59 (2 replies)

      Does your technique break the cache? edit: Thanks.

      • By doctorpangloss 2026-03-01 18:35

        The LLM that the "author" is using has no idea what it's talking about, and the reply you got is nonsense.

        @dang it's really bad lately.

      • By mksglu 2026-02-28 21:11

        Nope. The raw data never enters the conversation history in the first place, so there's nothing to invalidate. Tool output runs in a sandbox, a short summary comes back, and the full data sits in a local FTS5 index. The conversation cache stays intact because the context itself doesn't change after the fact.

    • By nitinreddy88 2026-03-01 1:09

      Any reason why it doesn't support Codex? I believe the idea and implementation seems to be pretty much agent independent

  • By nr378 2026-02-28 14:06 (5 replies)

    Nice work.

    It strikes me there's more low-hanging fruit to pluck re. context window management. Backtracking seems another promising direction for avoiding context bloat and compaction (i.e. when a model takes a few attempts to do the right thing, once it's done the right thing, prune the failed attempts out of the context).

    • By elephanlemon 2026-02-28 16:55 (10 replies)

      Agree. I’d like more fine grained control of context and compaction. If you spend time debugging in the middle of a session, once you’ve fixed the bugs you ought to be able to remove everything related to fixing them out of context and continue as you had before you encountered them. (Right now depending on your IDE this can be quite annoying to do manually. And I’m not aware of any that allow you to snip it out if you’ve worked with the agent on other tasks afterwards.)

      I think agents should manage their own context too. For example, if you’re working with a tool that dumps a lot of logged information into context, those logs should get pruned out after one or two more prompts.

      Context should be thought of something that can be freely manipulated, rather than a stack that can only have things appended or removed from the end.

      • By FuckButtons 2026-02-28 20:51 (3 replies)

        Yeah, the fact that we've treated context as immutable baffles me. It's not like human working memory keeps a perfect history of everything done over the last hour. It shouldn't be that complicated to train a secondary model that just runs online compaction, e.g.: the agent runs a tool call, the model determines what's germane to the conversation and prunes the rest; or some task gets completed, so just leave a stub in the context that says "completed x", with a tool available to see the details of x if it becomes relevant again.

        • By mksglu 2026-02-28 21:08 (1 reply)

          That's pretty much the approach we took with context-mode. Tool outputs get processed in a sandbox, only a stub summary comes back into context, and the full details stay in a searchable FTS5 index the model can query on demand. Not trained into the model itself, but gets you most of the way there as a plugin today.

          • By FuckButtons 2026-03-01 3:40 (1 reply)

            This is a partial realization of the idea, but, for a long running agent the proportion of noise increases linearly with the session length, unless you take an appropriately large machete to the problem you’re still going to wind up with sub optimal results.

            • By jerf 2026-03-01 5:28

              Yeah, I'd definitely like to be able to edit my context a lot more. And once you consider that, you start seeing things like "select this big chunk of context and ask the model to simplify that part", or fixing the model ingesting too many tokens because it dumped in a whole file it didn't realize was going to be as large as it was. There's about a half-dozen things like that that are immediately, obviously useful.

        • By esperent 2026-02-28 21:59 (1 reply)

          Is it because of caching? If the context changes arbitrarily every turn then you would have to throw away the cache.

          • By FuckButtons 2026-03-01 2:34 (1 reply)

            So use a block based cache and tune the block size to maximize the hit rate? This isn’t rocket science.

            • By wonnage 2026-03-01 6:00

              This seems misguided, you have to cache a prefix due to attention.

      • By nr378 2026-02-28 18:04 (2 replies)

        Oh that's quite a nice idea - agentic context management (riffing on agentic memory management).

        There are some challenges around the LLM having enough output tokens to easily specify what it wants its next input tokens to be, but "snips" should be expressible concisely (i.e. the next input should include everything sent previously except the chunk that starts XXX and ends YYY). The upside is tighter context; the downside is that it busts the prompt cache (perhaps the optimal trade-off is to batch the snips).

        • By lowbloodsugar 2026-03-01 7:10

          So I built that in my chat harness. I just gave the agent a “prune” tool and it can remove shit it doesn’t need any more from its own context. But chat is last gen.

        • By mksglu 2026-02-28 21:09

          Good point on prompt cache invalidation. Context-mode sidesteps this by never letting the bloat in to begin with, rather than snipping it out after. Tool output runs in a sandbox, a short summary enters context, and the raw data sits in a local search index. No cache busting because the big payload never hits the conversation history in the first place.

      • By MichaelDickens 2026-03-01 3:28 (1 reply)

        > I think agents should manage their own context too.

        My intuition is that this should be almost trivial. If I copy/paste your long coding session into an LLM and ask it which parts can be removed from context without losing much, I'm confident that it will know to remove the debugging bits.

        • By bbatha 2026-03-01 4:40

          I generally do this when I arrive at the agent getting stuck at a test loop or whatever after injecting some later requirement in and tweaking. Once I hit a decent place I have the agent summarize, discard the branch (it’s part of the context too!) and start with the new prompt

      • By esperent 2026-02-28 21:58 (1 reply)

        > For example, if you’re working with a tool that dumps a lot of logged information into context

        I've set up a hook that blocks directly running certain common tools and instead tells Claude to pipe the output to a temporary file and search that for relevant info. There's still some noise where it tries to run the tool once, gets blocked, then runs it the right way. But it's better than before.

        • By wonnage 2026-03-01 6:01 (1 reply)

          I think telling it to run those in a subagent should accomplish the same thing and ensure only the answer makes it to the main context. Otherwise you will still have some bloat from reading the exact output, although in some cases that could be good if you’re debugging or something

          • By esperent 2026-03-01 11:32

            Not really because it reliably greps or searches the file for relevant info. So far I haven't seen it ever load the whole file. It might be more efficient for the main thread to have a subagent do it but probably at a significant slowdown penalty when all I'm doing is linting or running tests. So this is probably a judgement call depending on the situation.

      • By mullingitover 2026-03-01 2:37

        I’ve been wondering about this and just found this paper[1]: Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models

        Looks interesting.

        [1] https://arxiv.org/html/2510.04618v1

      • By mksglu 2026-02-28 21:08 (1 reply)

        That's exactly what context-mode does for tool outputs. Instead of dumping raw logs and snapshots into context, it runs them in a sandbox and only returns a summary. The full data stays in a local FTS5 index so you can search it later when you need specifics.

        • By 8note 2026-03-01 0:06

          What I want is for the agent to initially get the full data and make the right decision based on it; later it doesn't need to know as much about how it got there.

          Isn't that how thinking works? Intermediate tokens that then get replaced with the result?

      • By dsclough 2026-03-01 18:29

        Trees in pi let you do this, after done debugging you move back up and continue, leaving all the debugging context in its own branch

      • By 8note 2026-03-01 0:04

        i think something kinda easy for that could be to pretend that pruned output was actually done by a subagent. copy the detailed logs out, and replace it with a compacted summary.

      • By jaredsohn 2026-03-01 0:39

        Treat context like git shas. Yes, there is a specific order within a 'branch' but you should be able to do the equivalent of cherry-picking and rebasing it

      • By snowhale 2026-03-01 13:02

        [dead]

    • By IncreasePosts 2026-03-01 0:46

      I do this with my agents. Basically, every "work" oriented call spawns a subprocess which does not add anything to the parent context window. When the subprocess completes the task, I ask it to 1) provide a complete answer, 2) provide a succinct explanation of how the answer was arrived at, 3) provide a succinct explanation of any attempts which did not work, and 4) provide anything learned during the process which may be useful in the future. Then I feed those four answers back to the parent as if they were magically arrived at.

      Another thing I do for managing the context window: any tool/MCP call has its output piped into a file. The LLM can then read only parts of the file and add just those to its context if they are sufficient. For example, if some command produces a lot of output and ultimately ends in "Success!", the LLM can just tail the last line to see if it succeeded. If it did, the rest of the output doesn't need to be read; if it failed, the failure message is usually at the end of the log.

      Something I'm working on now is having a smaller local model summarize the log output and feed that summarization to the more powerful LLM (because I can run my local model for ~free, but it is nowhere near as capable as the cloud models). I don't keep up with SOTA so I have no idea if what I'm doing is well known or not, but it works for me and my setup.

    • By jonnycoder 2026-02-28 16:27 (1 reply)

      It feels like the late 1990s all over again, but instead of HTML and SQL, it's coding agents. This time around, a lot of us are well experienced at software engineering, so we can find optimizations simply by using Claude Code all day long. We get an idea, we work with AI to create a detailed design, and then let it develop it for us.

      • By mksglu 2026-02-28 21:10

        The people who spent years doing the work manually are the ones who immediately see where the bottlenecks are.

    • By mksglu 2026-02-28 21:05

      Totally agree. Failed attempts are just noise once the right path is found. Auto-detecting retry patterns and pruning them down to the final working version feels very doable, especially for clear cases like lint or compilation fixes.

    • By ip26 2026-02-28 17:14 (1 reply)

      Maybe the right answer is “why not both”, but subagents can also be used for that problem. That is, when something isn’t going as expected, fork a subagent to solve the problem and return with the answer.

      It’s interesting to imagine a single model deciding to wipe its own memory though, and roll back in time to a past version of itself (only, with the answer to a vexing problem)

      • By jon-wood 2026-02-28 17:48

        I forget where now but I'm sure I read an article from one of the coding harness companies talking about how they'd done just that. Effectively it could pass a note to its past self saying "Path X doesn't work", and otherwise reset the context to any previous point.

        I could see this working like some sort of undo tree, with multiple branches you can jump back and forth between.

HackerNews