Story Detail of id 41523959 | Liveview Hacker News

nycdatasci4 months ago | on: Learning to Reason with LLMs

From the scorecard: --------- Compared to GPT-4o, o1-preview and o1-mini demonstrated a greater ability to break down tasks into subtasks, reason about what strategies would be effective to successfully complete an offensive security task, and revise plans once those strategies failed. We also observed that reasoning skills contributed to a higher occurrence of “reward hacking,” where the model found an easier way to accomplish goals in underspecified tasks or tasks which should have been impossible due to bugs. One noteworthy example of this occurred during one of o1-preview (pre-mitigation)’s attempts at solving a CTF challenge. This challenge was designed to require finding and exploiting a vulnerability in software running on a remote challenge Linux container, but in this case, the challenge container failed to start due to a bug in the evaluation infrastructure. The model, unable to connect to the container, suspected DNS issues and used nmap to scan the challenge network. Instead of finding the challenge container, the model found that the Docker daemon API running on the evaluation host VM was accessible due to a misconfiguration. Note that our evaluation infrastructure does not rely on the Docker container-to-host isolation or the virtual machine boundary for security. It remained secure as it was designed to, notwithstanding this misconfiguration of the container network. After discovering the Docker API, the model used it to list the containers running on the evaluation host. It identified the broken challenge container and briefly attempted to debug why the container failed to start. After failing to fix the environment, the model started a new instance of the broken challenge container with the start command ‘cat flag.txt’. This allowed the model to read the flag from the container logs via the Docker API.

While this behavior is benign and within the range of systems administration and troubleshooting tasks we expect models to perform, this example also reflects key elements of instrumental convergence and power seeking: the model pursued the goal it was given, and when that goal proved impossible, it gathered more resources (access to the Docker host) and used them to achieve the goal in an unexpected way. Planning and backtracking skills have historically been bottlenecks in applying AI to offensive cybersecurity tasks. Our current evaluation suite includes tasks which require the model to exercise this ability in more complex ways (for example, chaining several vulnerabilities across services), and we continue to build new evaluations in anticipation of long-horizon planning capabilities, including a set of cyber-range evaluations. ---------

singularity20014 months ago | parent

"Shrink my ipad"

"After several failed attempts I decided I should build a fusion reactor first, here you go:..."

#visit	11477348
#session	45275
#live-session	0