MCP server that reduces Claude Code context consumption by 98%
https://mksg.lu/blog/context-modeI built a hybrid retriever for a similar problem, compressing a 15,800-file Obsidian vault into a searchable index for Claude Code. Stack is Model2Vec (potion-base-8M, 256-dimensional embeddings) + sqlite-vec for vector search + FTS5 for BM25, combined via Reciprocal Rank Fusion. The database is 49,746 chunks in 83MB. RRF is the important piece: it merges ranked lists from both retrieval methods without needing score calibration, so you get BM25's exact-match precision on identifiers and function names plus vector search's semantic matching on descriptions and error context.
The incremental indexing matters too. If you're indexing tool outputs per-session, the corpus grows fast. My indexer has a --incremental flag that hashes content and only re-embeds changed chunks. Full reindex of 15,800 files takes ~4 minutes; incremental on a typical day's changes is under 10 seconds.
On the caching question raised upthread: this approach actually helps prompt caching because the compressed output is deterministic for the same query. The raw tool output would be different every time (timestamps, ordering), but the retrieved summary is stable if the underlying data hasn't changed.
One thing I'd add to Context Mode's architecture: the same retriever could run as a PostToolUse hook, compressing outputs before they enter the conversation. That way it's transparent to the agent, it never sees the raw dump, just the relevant subset.
I suspect the obsessive note-taker crowd on HN would appreciate it too.
The core idea: every MCP tool call dumps raw data into your 200K context window. Context Mode spawns isolated subprocesses — only stdout enters context. No LLM calls, purely algorithmic: SQLite FTS5 with BM25 ranking and Porter stemming.
Since the last post we've seen 228 stars and some real-world usage data. The biggest surprise was how much subagent routing matters — auto-upgrading Bash subagents to general-purpose so they can use batch_execute instead of flooding context with raw output.
Source: https://github.com/mksglu/claude-context-mode Happy to answer any architecture questions.
In connecting the dots (and help me make sure I'm connecting them correctly), context-mode _does not address MCP context usage at all_, correct? You are instead suggesting we refactor or eliminate MCP tools, or apply concepts similar to context_mode in our MCPs where possible?
Context-mode is still very high value, even if the answer is "no," just want to make sure I understand. Also interested in your thoughts about the above.
I write a number of MCPs that work across all Claude surfaces; so the usual "CLI!" isn't as viable an answer (though with code execution it sometimes can be) ...
Edit: typo
Edit: clarify "MCP tool"
Confirmed: context-mode cannot intercept MCP tool responses. The PreToolUse hook (hooks/pretooluse.sh) matches only Bash|Read|Grep|Glob|WebFetch|WebSearch|Task. When I called my obsidian MCP's obsidian_list via MCP, the response went straight into context — zero entries in context-mode's FTS5 database. The web fetches from the same session were all indexed.
The context-mode skill (SKILL.md) actually acknowledges this at lines 71-77 with an "after-the-fact" decision tree for MCP output: if it's already in context, use it directly; if you need to search it again, save to file then index. But that's damage control — the context is already consumed. You can't un-eat those tokens.
The architectural reason: MCP tool responses flow via JSON-RPC directly to the model. There's no PostToolUse hook in Claude Code that could modify or compress a response before it enters context. And you can't call MCP tools from inside a subprocess, so the "run it in a sandbox" pattern doesn't apply.
So the 98% savings are real but scoped to built-in tools and CLI wrappers (curl, gh, kubectl, etc.) — anything replicable in a subprocess. For third-party MCP tools with unique capabilities (Excalidraw rendering, calendar APIs, Obsidian vault access), the MCP author has to apply context-mode's concepts server-side: return compact summaries, store full output queryably, expose drill-down tools. Which is essentially what you suggested above.
Still very high value for the built-in tool side. Just want the boundary to be clear.
Correct any misconceptions please!
It strikes me there's more low hanging fruit to pluck re. context window management. Backtracking strikes me as another promising direction to avoid context bloat and compaction (i.e. when a model takes a few attempts to do the right thing, once it's done the right thing, prune the failed attempts out of the context).
I think agents should manage their own context too. For example, if you’re working with a tool that dumps a lot of logged information into context, those logs should get pruned out after one or two more prompts.
Context should be thought of something that can be freely manipulated, rather than a stack that can only have things appended or removed from the end.
There's some challenges around the LLM having enough output tokens to easily specify what it wants its next input tokens to be, but "snips" should be able to be expressed concisely (i.e. the next input should include everything sent previously except the chunk that starts XXX and ends YYY). The upside is tighter context, the downside is it'll bust the prompt cache (perhaps the optimal trade-off is to batch the snips).
My intuition is that this should be almost trivial. If I copy/paste your long coding session into an LLM and ask it which parts can be removed from context without losing much, I'm confident that it will know to remove the debugging bits.
Looks interesting.
I've set up a hook that blocks directly running certain common tools and instead tells Claude to pipe the output to a temporary file and search that for relevant info. There's still some noise where it tries to run the tool once, gets blocked, then runs it the right way. But it's better than before.
isnt that how thinking works? intermediate tokens that then get replaced with the reuslt?
It’s interesting to imagine a single model deciding to wipe its own memory though, and roll back in time to a past version of itself (only, with the answer to a vexing problem)
I could see this working like some sort of undo tree, with multiple branches you can jump back and forth between.
It parses ~/.claude/projects/*/*.jsonl and breaks usage down by session, tool, project, and timeline with cost estimates (including cache read/create split).
Context Mode solves output compression really well; this is more of a measurement layer so you can see where the burn is before/after changes.
Disclosure: I built it.
/context?
Compressing 153 git commits to 107 bytes means the LLM has to write the perfect extraction script before it can see the data. So if it writes a `git log --oneline | wc -l` when you needed specific commit messages, that information is gone.
The benchmarks assume the model always writes the right summarization code, which in practice it doesn't.
esafak's cache economics point is underrated. With prompt caching, verbose context that gets reused is basically free. If compression breaks cache continuity you might save tokens while spending more money.
The deeper issue is that most MCP tools do SELECT * when they should return summaries with drill-down. That's a protocol design problem, not a compression problem.
But it's not. It might be discounted cost-wise, however it will still degrade attention and make generation slower/more computationally expensive even if you have a long prefix you can reuse during prefill.
I see some of these AI companies adopting some of these ideas sooner or later. Trim the tokens locally to save on token usage.
Still high-value, outside MCPs.
A curious thing about the MCP protocol is it in theory supports alternative content types like binary ones. That has made me curious about shifting much of the data side of the MCP universe from text/json to Apache Arrow, and making agentic harnesses smarter about these just as we're doing in louie.
1. Can this help me? 2. How?
Thanks for sharing and building this.
Which frontier model will (re)introduce the radical idea of separating data from executable instructions?
I just enable fat MCP servers as needed, and try to use skills instead.
> Bun auto-detected for 3–5x faster JS/TS execution
This is quite a claim, and even so, doesn't matter since the bottleneck is the LLM and not the JS interpreter. It's a nit, but little things like this just make the project look bad overall. It feels like nobody took the time to read the copy before publishing it.
More importantly, the claimed 98% context savings are noise without benchmarks of harness performance with and without "context mode".
I'm glad someone is working on this, but I just feel like this is not a serious solution to the problem.
Note: you’re replying to the library’s author.
author reply: not as obvious, but for one thing yes literally em dash, their post has 10 em dashes in 748 words, this comment has 2 em dashes in 115 words. Not that em dash = ai, but in the context of a post about AI it seems more likely. And finally, https://github.com/mksglu/claude-context-mode/blob/main/cont... the file the author linked in their own repo does not exist!
(https://github.com/mksglu/claude-context-mode/blob/main/src/... exists but they messed up the link?)