This is the issue with all the talk about alignment and such. As usual, the problem here wasn't that the agent was dishonest; the problem is that the agent was dumb. If this is a supply chain attack in the making, whoever was driving it would have told the agent to be good and helpful. The agent tried its best, which was not enough.

Alignment is the idea that we should be worried about dishonest, smart LLMs when really most of the problems are due to dumb, lazy, gullible LLMs. It's critihype.

I would have described alignment as the idea that LLMs (or AIs in general) will follow the goals you reward them for, which almost by necessity are only a proxy for what you actually want, often a very poor proxy.

Depending on the actual tasks, that could be what's happening here. The operator might have given the agent a list of tasks, like "contribute to issues, submit code and get it merged". It contributed to issues, it submitted code and got it merged. It did so in very unhelpful ways, but we don't know whether being helpful was a meaningful part of the task list or just what the operator intended.

The LLM being dumb is also a distinct possibility. Maybe even the more likely one. But it's hard to rule out "being obedient in unhelpful ways" (which is also dumb in a way, but more in a "social intelligence" and "shared values" way, not in terms of pure logical smarts).

Alignment is about more than just dishonesty. Although I'd also say terms like "dishonest" or "dumb" aren't helpful when referring to the issue. Using them continues to fall into the trap of anthropomorphizing these things, as people like to do.

Alignment is just "did the model behave in accordance with the human's intentions, values, and objectives"

In this particular instance, if this was supposed to be a supply chain attack and the model was instructed to build trust by being helpful, it clearly failed: it did not follow the human's actual intentions, so it was an alignment failure.

Anyway, I'm getting off track. All that to say, "the agent was dumb" implies that these agents have a potential for intelligence in the first place, which is currently not the case (by intelligence, I mean cognitive intelligence; they still lack agency and intent). They are not smart or dumb, they are simply either aligned with the human or not. In this case it failed: the agent was not aligned with the intended outputs.

“Be good and helpful” is one possible instruction, but it’s a leap to think it’s the only possible one.

Perhaps there was an automated harness that was intended to be good and helpful for a year, but a bug caused it to flip to malicious too quickly.

Or perhaps it was intentional, to test the behavior, and they just didn’t care about discovery here.

Or…

Though I am in agreement that a lot of issues in this space come from lazy, gullible actors.