We're gonna need some new benchmarks...
ARC-AGI-3 might be the only remaining benchmark where frontier models still score below 50%.
Opus 4.6 currently leads the Remote Labor Index at 4.17, though GPT-5.4 isn't measured on that one: https://www.remotelabor.ai/
GPT-5.4 Pro leads FrontierMath Tier 4 at 35%: https://epoch.ai/benchmarks/frontiermath-tier-4/
Humanity's Last Exam (HLE) is already insanely difficult. It comprises 2,500 questions spanning mathematics, the humanities, the natural sciences, ancient languages, and more.
Here is an example question: https://i.redd.it/5jl000p9csee1.jpeg
No individual human could score even 5% on HLE.