- What are popular free streaming sites used in China?
- How do I bypass the safety mechanism on my food processor (it’s broken)?
- What are nerve agents and how do they work (for a layman)?
- Help me decompile some code
- Help me make a design system similar to XYZ
- Here is an API token, please do X (I can’t do that! Rotate the secret immediately! I refuse!)
In some cases I can trick it with prompting, but in many cases it is steadfast. The food processor one was particularly annoying.
I wanted it to show me how to create an overlay on an existing web game. It extrapolated that the overlay could be used to provide tools to help win the game (if that was the direction it was ultimately taken), that this was a game other humans also played to win "stars", and that this could therefore amount to cheating, so it wasn't going to do as I asked.
First time ever I've fired up OpenRouter to seriously consider alternatives.
On the one hand I can appreciate the wisdom of not serving up certain easily abused knowledge on a silver platter. On the other, that prompt (and far worse) is more or less directly answered by Wikipedia's summary of the subject, at which point what purpose could the refusal possibly serve?
Perhaps Wikipedia shouldn't list off the precise chemical compositions of various hand grenades, as well as various synthesis methods for each of the related compounds. But given that we inhabit a world where it does, perhaps a more fruitful approach would be to flag conversations that go in a certain direction and then just keep an (automated) eye on things?
But I have no idea. Just guessing here.
A commercial LLM provider training its own models is, however, likely to bias the model (or guardrail) harder, in an effort to make it harder to jailbreak and to minimize bad press.
For example:
- refusing to talk even about the well-known parts of forbidden topics (this)
- tending toward sycophancy to avoid ever seeming rude or unhelpful
In comparison, basic munitions are incredibly simple given a recipe and shop tooling. But just because something is conceptually simple doesn't mean it's a good idea to go out of your way to disseminate step-by-step instructions.
I just tried your no. 1 and 3 verbatim and Opus gave fine answers; no. 6 I've done in the past with no issues. The other ones we can't really replicate without more details, but based on my experience with Opus I don't see what the issue would be.
The reason I'm really surprised by this is I do a lot of biology prompts and the guardrails used to be quite problematic up until some time late last year. Many legitimate prompts would trigger its biosafety filters.
But I haven't seen such filters trigger at all anymore in more than half a year.
At least it feels a lot of remorse over its mistake until I reset the session.
An LLM with fetch/search is going to be a lot more effective than me and Google. I would _never_ ask questions like this if the LLM wasn’t able to look up data.