Unfortunately, I see the choice space here as having "developer effort" anti-correlated with "negative repercussions".
At one end of the distribution, a "hair trigger ban" strategy is low-effort for the developer but will produce some fraction of false positives. Some fraction of those affected will complain to "the socials", and some fraction of those complaints will gain traction and, as we have seen, can unfairly taint the project or worse. Responding to and managing the false positives also requires developer effort, unless the developers can sustain a "fsck the haters" attitude.
At the other end of the distribution, the developer can spend substantial effort engaging each submitter to ascertain and correct bad behavior, and to educate them on how to engage other humans as a fellow human in this LLM era.
Different types of developer effort are needed at different points along this distribution.
A divide-and-conquer strategy might go something like this:
- Rank each submission in some low-dimensional space (llm<-->human, malicious<-->helpful).
- When enough samples are collected, perform clustering in this space to determine stereotypes, name these clusters, and develop mitigating strategies and implementations as needed.
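To make the two steps above concrete, here is a rough sketch in Python. Everything in it is an assumption for illustration: the two axes are scored in [-1, 1], the scores are made up, and plain k-means stands in for whatever clustering method would actually suit the data. How a submission gets its scores in the first place is the hard part and is not shown.

```python
import math

# Each submission is scored on two hypothetical axes, both in [-1, 1]:
#   origin: -1 = clearly LLM-generated, +1 = clearly human-written
#   intent: -1 = malicious,             +1 = helpful
def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points; seeds with the first k points for determinism."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return centroids, labels

# Made-up (origin, intent) scores, for illustration only.
scores = [
    (-0.9, -0.8), (0.8, 0.9), (-0.8, -0.9),
    (0.9, 0.7), (-0.7, -0.7), (0.7, 0.8),
]
centroids, labels = kmeans(scores, k=2)
```

With real data the clusters would not be this clean, and choosing `k` (and then naming what each cluster means) would still be a human judgment call.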
Mitigations from easy/extreme to hard/accommodating could include:
- Hair trigger ban button.
- Copy-paste a link to an explanation in a comment before closing and/or banning.
- Customized explanation in comment before closing and/or banning.
- Link or customized explanation of what must be done to move the sample to a more favorable category, then close/ban if met with resistance or silence.
- Ongoing engagement in the face of resistance or silence.
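As a sketch of how that ladder might be wired to named clusters, something like the following could work. The cluster names, the policy table, and the `effort_budget` cap are all hypothetical; a real project would choose its own.

```python
from enum import IntEnum

class Mitigation(IntEnum):
    """The five mitigations, ordered from least to most developer effort."""
    BAN = 1
    LINK_THEN_CLOSE = 2
    CUSTOM_NOTE_THEN_CLOSE = 3
    REMEDIATION_STEPS = 4
    ONGOING_ENGAGEMENT = 5

# Hypothetical cluster names and policy, for illustration only.
POLICY = {
    "llm-malicious": Mitigation.BAN,
    "llm-helpful": Mitigation.REMEDIATION_STEPS,
    "human-malicious": Mitigation.CUSTOM_NOTE_THEN_CLOSE,
    "human-helpful": Mitigation.ONGOING_ENGAGEMENT,
}

def choose_mitigation(cluster, effort_budget=Mitigation.ONGOING_ENGAGEMENT):
    """Pick the cluster's mitigation, capped by how much effort is available."""
    # Unknown clusters fall back to the low-effort "link and close" response.
    wanted = POLICY.get(cluster, Mitigation.LINK_THEN_CLOSE)
    return min(wanted, effort_budget)
```

For example, `choose_mitigation("human-helpful", effort_budget=Mitigation.LINK_THEN_CLOSE)` falls back to the link-and-close response when maintainers are short on time, which is the effort versus repercussions tradeoff in miniature.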
This "meta development" program to provide such a system/facility could of course be highly automated with LLMs, fighting fire with fire.
(Despite the length of this reply, it was written entirely by a random human on the internet and not an LLM).
Which is to say, your system sounds good but I expect much more complicated defenses are needed.