Story Detail of id 48311777 | Liveview Hacker News

onlyrealcuzzo23 hours ago | on: Claude Opus 4.8

Does anyone troll these releases and cherry pick random metrics other companies would cherry pick to show how amazing their models are?

There's like 8 million benchmarks. Every release, every model randomly picks 5-10 where they win in everything except 1, to make it look like they aren't randomly cherry picking benchmarks they probably benchmaxxed for.

aronowb1423 hours ago | parent | next

https://arena.ai/leaderboard - I’ve found this company is a pretty good ranker - not sure their exact methodology but during day to day programming with Claude / gpt models I’ve felt qualitatively what they report

XCSme22 hours ago | root | parent | next

Also check mine[0]: basically random private tests/questions and an ok-ish methodology, testing more for general intelligence than for coding-specific tasks.

I built it for myself, to test which models to use via OpenRouter for my n8n agents. Currently actually still using gpt-5.3-codex for many things, as its pricing is really good in production (due to how their token caching works).

Gemini models still have the best intelligence (when asked any questions, most likely to get it right), but in production they still have many failure modes[1].

[0]: https://aibenchy.com

[1]: https://news.ycombinator.com/item?id=48230368

BoorishBears20 hours ago | root | parent

Every model release you'll post this, and every time I'll be there to point out how it's completely useless (for reasons you've shared are intentional)

It does things like place the old Gemini 3 Flash above the more capable 3.5 Flash and Opus 4.5 - Opus 4.8 and gpt-5.5

At least, until hopefully one day HN has a rule about accounts that derive 99.9999% of their engagement with the site from shilling a personal project.

XCSme19 hours ago | root | parent | next

Also, what about the major flaw/bias linked for Gemini 3.5 flash? That has major real-life consequences if the model ends up being used for any automated scoring systems.

I found it while trying to use 3.5 Flash for scoring the reasoning of some models, and it gets it wrong because of the centering bias, whereas 3 Flash gets scoring right.

XCSme19 hours ago | root | parent

I'm happy you do comment. I have added more coding tests since then, along with other improvements (price history per model, displaying cost to run at current pricing, improved scoring).

How is it useless to see that Opus 4.8 is 2x more expensive and 2x slower on some questions?

Bnjoroge22 hours ago | root | parent | next

Have you seen https://deepswe.datacurve.ai/blog? This is the closest to a vibe check I’ve felt, even with the open models.

Imustaskforhelp21 hours ago | root | parent

This actually looks like a really good test.

There are many benchmarks all for specific use cases but with them the difference seems to be in extreme points (93% vs 92%)

I think that tracks, but still, it was refreshing to see a benchmark that can help me form better opinions.

Surprised about Mimo v2.5, within artificial-analysis and other benchmarks, the difference between Mimo and deepseek seems very partial and a lot of focus/(hype?) is on Deepseek

But mimo seems like an interesting model and they are having some crazy discounts too.

Deepseek is valuable for the research community because of how open they are but absolutely crazy to think how Xiaomi basically pulled up in creating Mimo given that they didn't have anything till quite recently.

Either way, an interesting benchmark, also a plus point for giving golang some decent representation equal to python/typescript.

I think there is a set of tasks, resembling normal benchmarks, where open source models can be absolutely fine. For a very small fraction of tasks, or for more technical things, the benchmark that you linked starts to be a better projection. So it depends on the scale of complexity, but it's good to see how models compete given enough complexity. Definitely fascinating.

I would be interested to see more models compete on this test. The current range is still a bit limited compared to other benchmarks, but OSS models like Kimi/Mimo seem to be only 3-4 months (6 at most) behind closed source.

GneojJ2 hours ago | root | parent

Having used both Deepseek v4 Pro and Mimo v2.5 for agentic coding, I'm not surprised Mimo comes out quite far in front. It reflects my experience at least.

The recent hype around Deepseek is a combination of existing name recognition and incredibly low pricing. Their v4 models, both pro and flash, are incredible for their price. That's more revolutionary than Mimo, which is multiple times more expensive, just like Kimi 2.6.

reckless22 hours ago | root | parent | next

No way is Muse Spark generally better than offerings from Google and OpenAI. I actually find arena to be amongst the most useless indicators

WASDx18 hours ago | root | parent

I think their "code" ranking is biased towards visual aesthetics more than raw coding as the voters are just asked which generated website they prefer.

morley22 hours ago | root | parent | next

I'm finding it a little hard to believe that GPT 5.5 is in 11th place for webdev, outranked by models like Kimi, Qwen, and Z.ai. I'm not saying it's not true (I have noticed GPT being less smart in recent weeks), but this is very different from my expectation.

WarmWash21 hours ago | root | parent | next

On paper it's one of the best because it's meant to be a blind comparison on your own prompts. However, if you are someone who geeks out hard on one or a few models, you learn their "personality" and can recognize them in a blind test.

dakolli21 hours ago | root | parent

If you don't know their methodology, or anything about it, why do you think it's a good ranker?

nerevarthelame23 hours ago | parent | next

It's interesting they only included 6 metrics this time. Opus 4.7 had 12, and 4.6 had 13.

Of the metrics they reported for 4.7, for 4.8 they excluded BrowseComp, CharXiv Reasoning, CyberGym, GPQA Diamond, MCP Atlas, MMMLU, and SWE-bench Verified. The last 4 were almost always mentioned in previous Opus releases.

onlyrealcuzzo23 hours ago | root | parent

Gonna assume it's because they barely budged or moved downward and most of their reported benchmark results are probably within sampling errors...
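
To put a rough number on "within sampling error", here is a minimal sketch that treats each benchmark question as an independent pass/fail trial and compares two accuracy scores using a binomial standard error. The question count (500) and the scores are made-up illustrations, not figures from any release, and the function names are hypothetical. It also treats the two runs as independent samples, which overstates the noise when both models are scored on the same questions.

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def within_sampling_error(acc_a: float, acc_b: float,
                          n_questions: int, z: float = 1.96) -> bool:
    """True if the gap between two scores is no larger than z standard errors
    of the difference (z=1.96 is roughly a 95% interval).

    Assumption: both scores come from independent samples of n_questions each.
    """
    se_diff = math.sqrt(standard_error(acc_a, n_questions) ** 2
                        + standard_error(acc_b, n_questions) ** 2)
    return abs(acc_a - acc_b) <= z * se_diff


# Hypothetical numbers: a 93% vs 92% gap on a 500-question benchmark.
small_gap = within_sampling_error(0.93, 0.92, 500)
# Hypothetical numbers: a 93% vs 80% gap on the same benchmark size.
large_gap = within_sampling_error(0.93, 0.80, 500)
print(small_gap, large_gap)
```

Under these assumptions, a one-point gap on a few hundred questions is indistinguishable from noise, while a thirteen-point gap is not. A paired test on per-question results would be a tighter check when those are available.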

hyperpape23 hours ago | root | parent

They will release a system card, and you can then confirm or disconfirm your assumptions.

ddosmax55622 hours ago | parent | next

I would take all benchmarks with a grain of salt. I don't really use them. What's it supposed to tell me? "5% smarter", what does that mean? My experience will differ. Just try it!

I doubt Anthropic internally sets as a goal to improve this or that benchmark - it's just a way to visualize progress. They probably have much more complex metrics internally.

bel823 hours ago | parent | next

On this note, is there a benchmark aggregator to compile all benchmarks in a single large grid?

jpadkins22 hours ago | root | parent

I find this site useful https://artificialanalysis.ai/leaderboards/models

YetAnotherNick23 hours ago | parent

At least they show competitors in any benchmark, compared to OpenAI which likes to pretend that there isn't any competitor.
