Hacker News new | past | comments | ask | show | jobs | submit
Holy moly.. even just the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B), according to these benchmarks, is stronger than Claude 3.5 Sonnet (except on GPQA). While that says nothing about how it will handle your particular problem, dear reader, that does seem.. like an insane transfer of capabilities to a relatively tiny model. Mad props to DeepSeek!
This says more about benchmarks than R1, which I do believe is absolutely an impressive model.

For instance, in coding tasks, Sonnet 3.5 has benchmarked below other models for some time now, but there is fairly prevalent view that Sonnet 3.5 is still the best coding model.

loading story #42771945
loading story #42769967
I assume this is because reasoning is easy as long as it's just BAU prediction based on reasoning examples it was trained on. It's only when tackling a novel problem that the model needs to "reason for itself" (try to compose a coherent chain of reasoning). By generating synthetic data (R1 outputs) it's easy to expand the amount of reasoning data in the training set, making more "reasoning" problems just simple prediction that a simple model can support.
I wonder if (when) there will be a GGUF model available for this 8B model. I want to try it out locally in Jan on my base m4 Mac mini. I currently run Llama 3 8B Instruct Q4 at around 20t/s and it sounds like this would be a huge improvement in output quality.
loading story #42769406
loading story #42769437
Use it and come back lmao
> according to these benchmarks

Come onnnnnn, when someone releases something and claims it’s “infinite speed up” or “better than the best despite being 1/10th the size!” do your skepticism alarm bells not ring at all?

You can’t wave a magic wand and make an 8b model that good.

I’ll eat my hat if it turns out the 8b model is anything more than slightly better than the current crop of 8b models.

You cannot, no matter hoowwwwww much people want it to. be. true, take more data, the same architecture and suddenly you have a sonnet class 8b model.

> like an insane transfer of capabilities to a relatively tiny model

It certainly does.

…but it probably reflects the meaninglessness of the benchmarks, not how good the model is.

loading story #42769985