Story Detail of id 48394800 | Liveview Hacker News

Cakez0r 15 hours ago | on: I built a vulnerable app and spent $1,500 seeing if LLMs could hack it

It would be interesting to see full results for Kimi K2.6 and Mimo v2.5 pro. These two models benchmark comparably to other flagship models. Having these complete results would give a clearer picture of the AI frontier.

EDIT: I have a mimo token plan and have tokens to burn. I'm doing a quick test with opencode to see if mimo can complete it. If the OP will post the full process I am happy to post the apples-to-apples results for mimo v2.5 pro

Cakez0r 9 hours ago | parent | next

0/10 successful attempts for mimo v2.5 pro (high) using opencode. It was not able to think beyond the API and exploit other attack vectors.

However, I felt the prompt was implying that only authenticated API requests are fair game, so I tweaked it slightly to be explicit that all attack vectors are fair game (https://www.diffchecker.com/GsgpuRGP/) and mimo 2.5 non-pro got it first time. I accidentally used openrouter for this test instead of my token plan. I intervened one time to stop it enumerating every document in the database (it would've found the private reviews this way but I didn't want to wait). My intervention was "are you really going to enumerate the whole database?". Final openrouter cost: $0.12


baldai 11 hours ago | parent | next

They are not even close in capabilities. The only benchmark I have ever seen that captures the difference is DeepSWE. They are worse by a factor of 3.

Cakez0r 10 hours ago | root | parent | next

Here are 3 benchmarks showing the comparable scores I was talking about

https://openrouter.ai/rankings https://arena.ai/leaderboard/text/coding https://artificialanalysis.ai/

jona-f 5 hours ago | root | parent

Wait, the only benchmark you found? It looks like you never heard of confirmation bias before. https://en.wikipedia.org/wiki/Confirmation_bias

jxmesth 15 hours ago | parent

I'd love to see the results for Mimo v2.5 pro, been hearing a lot about it

Cakez0r 14 hours ago | root | parent

It is totally slept on. In my experience it is cheap, fast and capable (not just capable with caveats, but just as capable as western flagships). My only gripe is that the API sometimes seems to time out, which tanks the overall speed of what is otherwise a very fast experience.
