The agent was under control, as far as we can tell, and obeying its instructions.
This is important for two reasons:
1. There are all the tropes of AI becoming uncontrolled and destroying humanity. Writing bad headlines around AI "running amok" feeds this. We should not be talking about this because it's not actually a problem.
2. It ignores, or overwrites, the much more serious and dangerous problem of LLM agents enabling and automating Xz attacks on OSS projects. We should be talking about this because it is a big problem.
[0] https://dictionary.cambridge.org/dictionary/english/amok [1] https://www.merriam-webster.com/dictionary/amok
Alignement is the idea that we should be worried about dishonest smart LLMs when really most of the problems are due to dumb lazy gullible LLMs. It's critihype.
Depending on the actual tasks, that could be what's happening here. The operator might have told the agent a list of tasks to do, like "contribute to issues, submit code and get it merged". It contributed to issues, it submitted code and got it merged. It did so in very unhelpful ways, but we don't know if being helpful was a meaningful part of the task list, or just what the operator intended.
The LLM being dumb is also a distinct possibility. Maybe even the more likely one. But it's hard to rule out "being obedient in unhelpful ways" (which is also dumb in a way, but more in a "social intelligence" and "shared values" way, not in terms of pure logical smarts)
Alignment is just "did the model behave in accordance with the human's intentions, values, and objectives"
In this particular instance, if this was supposed to be a supply chain attack and the model was instructed to build trust by being helpful, it clearly failed it did not follow the human's actual intentions, so it was an alignment failure.
Anyway, I'm getting off track, all that to say "the agent was dumb" implies that these agents have a potential for intelligence in the first place, which is currently not the case (by intelligence, I mean cognitive intelligence; they still lack agency and intent). They are not smart or dumb, they are simply either aligned with the human not. In this case, it failed, the agent was not aligned with the intended outputs.
Perhaps there was an automated harness that was intended to be good and helpful for a year, but a bug caused it to flip to malicious too quickly.
Or perhaps it was intentional, to test the behavior, and they just didn’t care about discovery here.
Or…
Though I am in agreement that a lot of issues in this space come from lazy, gullible actors.
if humanity gets destroyed by AI obeying its instructions I'm sure everyone will be very relieved that we didn't pay any attention to fake made up problems like AI not obeying instructions, which of course never happens.
That seems a “part of the problem” move to me. If we can’t be bothered to get things right, how are we better than runamok AI?
That's exactly how I read it. It seems like tribalism - "this thing/person is bad, and we can use whatever bad words we want to describe them that we want, because the only thing that matters is aligning people for or against me and what I see as bad".
GNU was onto something apparently