Story Detail of id 47679345 | Liveview Hacker News

babelfish 20 hours ago | on: System Card: Claude Mythos Preview [pdf]

Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

  SWE-bench Verified:        93.9% / 80.8% / —     / 80.6%
  SWE-bench Pro:             77.8% / 53.4% / 57.7% / 54.2%
  SWE-bench Multilingual:    87.3% / 77.8% / —     / —
  SWE-bench Multimodal:      59.0% / 27.1% / —     / —
  Terminal-Bench 2.0:        82.0% / 65.4% / 75.1% / 68.5%

  GPQA Diamond:              94.5% / 91.3% / 92.8% / 94.3%
  MMMLU:                     92.7% / 91.1% / —     / 92.6–93.6%
  USAMO:                     97.6% / 42.3% / 95.2% / 74.4%
  GraphWalks BFS 256K–1M:    80.0% / 38.7% / 21.4% / —

  HLE (no tools):            56.8% / 40.0% / 39.8% / 44.4%
  HLE (with tools):          64.7% / 53.1% / 52.1% / 51.4%

  CharXiv (no tools):        86.1% / 61.5% / —     / —
  CharXiv (with tools):      93.2% / 78.9% / —     / —

  OSWorld:                   79.6% / 72.7% / 75.0% / —

sourcecodeplz 20 hours ago

Haven't seen a jump this large since I don't even know, years? Too bad they are not releasing it anytime soon (there is no need as they are still currently the leader).

WarmWash 19 hours ago

Are these fair comparisons? It seems like Mythos is going to be a 5.4 ultra or Gemini Deepthink tier model, where access is limited and token usage per query is totally off the charts.

WinstonSmith84 16 hours ago

Not discussing Mythos here, but Opus. To me, Opus has been significantly better at SWE than GPT or Gemini, so I'm confused why Opus ranks clearly lower than GPT, and even lower than Gemini.

pants2 20 hours ago

We're gonna need some new benchmarks...

ARC-AGI-3 might be the only remaining benchmark below 50%

AlexC04 18 hours ago

but how does it perform on pelican riding a bicycle bench? why are they hiding the truth?!

(edit: I hope this is an obvious joke. less facetiously these are pretty jaw dropping numbers)

ninjagoo 17 hours ago

> Combined results (Claude Mythos / Claude Opus 4.6 / GPT-5.4 / Gemini 3.1 Pro)

> Terminal-Bench 2.0: 82.0% / 65.4% / 75.1% / 68.5%

> GPQA Diamond: 94.5% / 91.3% / 92.8% / 94.3%

> MMMLU: 92.7% / 91.1% / — / 92.6–93.6%

> USAMO: 97.6% / 42.3% / 95.2% / 74.4%

> OSWorld: 79.6% / 72.7% / 75.0% / —

Given that for a number of these benchmarks, it seems to be barely competitive with the previous gen Opus 4.6 or GPT-5.4, I don't know what to make of the significant jumps on other benchmarks within these same categories. Training to the test? Better training?
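To put numbers on "barely competitive" versus "significant jumps", here is a minimal Python sketch that computes Mythos's margin over the best-scoring other model on each of the five rows quoted above. The names QUOTED, lead_over_best_rival and LEADS are made up for illustration. Two assumptions: a "—" entry is treated as missing, and the Gemini MMMLU range (92.6–93.6%) is taken at its lower bound, so that margin would turn negative at the upper bound.

```python
# Scores copied from the quoted rows, in the order
# (Claude Mythos, Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro).
# None stands for a "—" entry. The MMMLU range for Gemini is
# represented by its lower bound, 92.6 (an assumption).
QUOTED = {
    "Terminal-Bench 2.0": (82.0, 65.4, 75.1, 68.5),
    "GPQA Diamond": (94.5, 91.3, 92.8, 94.3),
    "MMMLU": (92.7, 91.1, None, 92.6),
    "USAMO": (97.6, 42.3, 95.2, 74.4),
    "OSWorld": (79.6, 72.7, 75.0, None),
}


def lead_over_best_rival(scores):
    """Return the first score minus the best available other score, in points."""
    mythos, *rivals = scores
    best = max(r for r in rivals if r is not None)
    return round(mythos - best, 1)


LEADS = {name: lead_over_best_rival(s) for name, s in QUOTED.items()}

# Print the rows from smallest to largest margin.
for name, lead in sorted(LEADS.items(), key=lambda kv: kv[1]):
    print(f"{name:<20} {lead:+.1f} pts")
```

Under those assumptions the margins come out to 0.1 points on MMMLU, 0.2 on GPQA Diamond, 2.4 on USAMO, 4.6 on OSWorld and 6.9 on Terminal-Bench 2.0. This only restates the posted figures; it says nothing about why the gaps differ.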

And the decision to withhold general release (of a 'preview' no less!) seems to be, well, odd. And the decision to release a 'preview' version to specific companies? Do you know any production teams at these massive companies that would work with a 'preview' anything? R&D teams, sure, but production? Part of me wants to LoL.

What are they trying to do? Induce FOMO and stop subscriber bleed-out stemming from the recent negative headlines around problems with using Claude?

matheusmoreira 15 hours ago

Wow. Mythos must be insanely good considering how good a model Opus already is. I hope it's usable on a humble subscription...

whalesalad 20 hours ago

Honestly, we are all sleeping on GPT-5.4. Particularly with the recent influx of Claude users (and an increasingly unstable platform), Codex has been added to my rotation and it's surprising me.

johnnichev 17 hours ago

damn... ok that's impressive.

simianwords 19 hours ago

The real part is SWE-bench Verified since there is no way to overfit. That's the only one we can believe.
