Why couldn't they have trained the classifier on internal red teaming?
They basically said "DeepSeek ran 150,000 requests and here's the gist of one of their prompts". Anthropic doesn't know beforehand which accounts are DeepSeek proxies, so it definitely sounds like retrospective analysis of broad user logs to me.
Of course Anthropic realizes that saying this outright is problematic, so they said they examined request metadata. But no, I don't think they can get this kind of insight from metadata (token counts, request times, etc.).