The hooks seem too aggressive. Blocking all curl/wget/WebFetch and funneling everything through the sandbox for 56 KB snapshots sounds great, but not for curl api.example.com/health returning 200 bytes.
Compressing 153 git commits to 107 bytes means the LLM has to write the perfect extraction script before it can see the data. So if it writes a `git log --oneline | wc -l` when you needed specific commit messages, that information is gone.
The benchmarks assume the model always writes the right summarization code, which in practice it doesn't.
Agreed. I removed it.