Story Detail of id 48393196 | Liveview Hacker News

mariopt18 hours ago | on: I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

The methodology used is quite naive.

I've used glm 5.1 on fairly advanced crackme challenges (example: https://crackmes.one/crackme/698f40f1e2ba6023bfacaa82), and to my surprise it was able to patch binaries, do runtime analysis, bypass anti-debug techniques, etc.

Expecting the model to do everything by itself is unrealistic. I found that working alongside the model works really well. I'm not talking about spoiling the solution, just telling it which direction to explore. Chinese models are much more capable than people give them credit for, but Claude/Codex won the marketing game.

The only use case for this methodology would be CI integration, which can be nice, but I think security reviews still need human attention and expertise.

geraneum11 hours ago | parent | next

> Expecting the model to do everything by itself is unrealistic

Well that’s the pitch.

j-bos10 hours ago | root | parent

Is it? Aren't most edge LLM capabilities determined by specialized harnesses?

jc4p18 hours ago | parent | next

Thank you for your note! As I mention in the post, this is not scientific at all.

I'm very curious how you would do multiple runs of multiple models in a "work alongside the model" manner?

ssivark10 hours ago | root | parent

Maybe have a second model that is configured to nudge the first model in the direction of exploration, and have the two of them work in tandem?
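A minimal sketch of that tandem setup, assuming each model can be treated as a plain prompt-to-reply function. The names `run_with_navigator`, `stub_solver`, and `stub_navigator`, as well as the "SOLVED" marker, are hypothetical and only there for illustration; a real harness would swap the stubs for actual model calls and a real success check.

```python
from typing import Callable, List

# Hypothetical type: a "model" is any function from a prompt string to a reply string.
Model = Callable[[str], str]


def run_with_navigator(solver: Model, navigator: Model, task: str,
                       max_rounds: int = 5) -> List[str]:
    """Let a navigator model nudge a solver model, one hint per round."""
    transcript: List[str] = []
    hint = ""
    for _ in range(max_rounds):
        # The solver sees the task plus the latest hint, if there is one.
        prompt = task if not hint else f"{task}\nHint: {hint}"
        attempt = solver(prompt)
        transcript.append(attempt)
        # Assumed success marker; a real harness would verify the result itself.
        if "SOLVED" in attempt:
            break
        # The navigator only suggests a direction, it does not give the answer.
        hint = navigator(
            f"Task: {task}\nLatest attempt: {attempt}\n"
            "Suggest a direction to explore, not the solution."
        )
    return transcript


# Stub models so the sketch runs without any API access.
def stub_solver(prompt: str) -> str:
    if "Hint:" in prompt:
        return "SOLVED: patched the check"
    return "stuck at the anti-debug check"


def stub_navigator(prompt: str) -> str:
    return "look at the debugger detection routine"


if __name__ == "__main__":
    for line in run_with_navigator(stub_solver, stub_navigator, "crack the binary"):
        print(line)
```

Because both roles are just functions, the same loop could be repeated across several solver models and several runs, which is one way to keep the "nudging" consistent between runs.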

shantnutiwari11 hours ago | parent | next

>>I've used glm 5.1 on fairly advanced crackme challenges

which it has most likely been trained on, so all you did was regurgitate someone else's solution

nikanj15 hours ago | parent

Claude used to be good with CTFs, but they added tons of guard rails lately and now it just says "Sorry, I can't help with anything to do with that"

Sardtok14 hours ago | root | parent | next

Sorry, Dave. I can't do that.

raesene913 hours ago | root | parent

[dead]
