The compression numbers look great but I keep wondering: does the model actually produce equivalent output with compressed context vs full context? Extending sessions from 30min to 3hrs only matters if reasoning quality holds up in hour 2.
esafak's cache economics point is underrated. With prompt caching, verbose context that gets reused is basically free. If compression breaks cache continuity you might save tokens while spending more money.
The deeper issue is that most MCP tools do SELECT * when they should return summaries with drill-down. That's a protocol design problem, not a compression problem.
> With prompt caching, verbose context that gets reused is basically free.
But it's not. It might be discounted cost-wise, however it will still degrade attention and make generation slower/more computationally expensive even if you have a long prefix you can reuse during prefill.