> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.

We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it had already said, perhaps with a minor modification. We had to stop the model when that happened, and I feel it significantly hurt the user experience.

It's great that DeepSeek-R1 fixes that.

The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.

Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.

[1] https://github.com/ubicloud/ubicloud/discussions/2608

OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...

The one I'm running is the 8.54GB file. I'm using Ollama like this:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command) you can try those with uvx like this:

    uvx --with llm-ollama \
      llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
      'a joke about a pelican and a walrus who run a tea room together'
Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating: https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e381...

I also set an alias for the model like this:

    llm aliases set r1l 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' 
Now I can run "llm -m r1l" (for R1 Llama) instead.

I wrote up my experiments so far on my blog: https://simonwillison.net/2025/Jan/20/deepseek-r1/

Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, New York Times Connections puzzles, and suggesting further improvements to an already-polished 500-word text in English. ChatGPT o1 was, in my judgment, clearly better than the other two, and DeepSeek was the weakest.

I tried the same tests on DeepSeek-R1 just now, and it did much better. While still not as good as o1, its answers no longer contained obviously misguided analyses or hallucinated solutions. (I recognize that my data set is small and that my ratings of the responses are somewhat subjective.)

By the way, ever since o1 came out, I have been struggling to come up with applications of reasoning models that are useful for me. I rarely write code or do mathematical reasoning. Instead, I have found LLMs most useful for interactive back-and-forth: brainstorming, getting explanations of difficult parts of texts, etc. That kind of interaction is not feasible with reasoning models, which can take a minute or more to respond. I’m just beginning to find applications where o1, at least, is superior to regular LLMs for tasks I am interested in.

Holy moly... even just the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B) is, according to these benchmarks, stronger than Claude 3.5 Sonnet (except on GPQA). While that says nothing about how it will handle your particular problem, dear reader, it does seem like an insane transfer of capabilities to a relatively tiny model. Mad props to DeepSeek!
Kind of insane how a severely constrained company founded one year ago competes with the infinite budget of OpenAI.

Their parent hedge fund isn't huge either: just 160 employees and $7B AUM, according to Wikipedia. If it were a US hedge fund it would be around the 180th largest by AUM, so not small, but nothing crazy either.

That's the nature of software with no moat built into it. Which is fantastic for the world, as long as some companies are willing to pay the premium involved in paving the way. But man, what a daunting prospect for developers and investors.
I'm not sure we should call it "fantastic"

The downsides begin at "dystopia worse than 1984 ever imagined" and get worse from there.

That dystopia is far more likely in a world where the moat is so large that a single company can control all the LLMs.
That dystopia will come from an autocratic one party government with deeply entrenched interests in the tech oligarchy, not from really slick AI models.
I was initially enthusiastic about DS3 because of the price, but eventually I learned the following things:

- function calling is broken (it responds with an excessive number of duplicated function calls, with hallucinated names and parameters)

- response quality is poor (my use case is code generation)

- support is not responding

I will give a try to the reasoning model, but my expectations are low.

P.S. The positive side of this is that it apparently removed some traffic from Anthropic's APIs, and latency for Sonnet/Haiku improved significantly.

I just pushed the distilled Qwen 7B version to Ollama if anyone else here wants to try it locally: https://ollama.com/tripplyons/r1-distill-qwen-7b
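If the model page follows Ollama's usual conventions, running it should look something like the command below (the default tag is an assumption; check the page for the exact name):

    # assumed invocation based on the model page URL; the default tag may differ
    ollama run tripplyons/r1-distill-qwen-7b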
> This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs.

Wow. They're really trying to undercut closed-source LLMs.

Yep, it's a national strategy.
Amazing progress with this budget.

My only concern is that on openrouter.ai it says:

"To our knowledge, this provider may use your prompts and completions to train new models."

https://openrouter.ai/deepseek/deepseek-chat

This is a dealbreaker for me to use it at the moment.

There are all sorts of ways that additional test-time compute can be used to get better results, varying from things like sampling multiple CoTs and choosing the best, to explicit tree search (e.g. rStar-Math), to things like "journey learning" as described here:

https://arxiv.org/abs/2410.18982

Journey learning is doing something that is effectively close to depth-first tree search (see Fig. 4 on p. 5), and does seem close to what OpenAI are claiming to be doing, as well as what DeepSeek-R1 is doing here... no special tree-search sampling infrastructure, but rather RL-induced generation that produces a single sampled sequence taking a depth-first "journey" through the CoT tree, backtracking when necessary.

I love that they included some unsuccessful attempts. MCTS doesn't seem to have worked for them.

Also wild that few-shot prompting leads to worse results in reasoning models. OpenAI hinted at that as well, but it's always just a sentence or two, with no benchmarks or specific examples.

I use the Cursor editor, and its Claude edit mode is extremely useful. However, the reasoning in DeepSeek has been a great help for debugging issues. For this I am using yek [1] to serialize my repo (--max-size 120k --tokens) and feed it the test error. I wrote a quick script named "askai" so Cursor runs it automatically (a rough sketch is at the end of this comment). Good times!

Note: I wrote yek, so this might be a bit of a shameless plug!

[1] https://github.com/bodo-run/yek
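For the curious, a simplified sketch of the idea (it assumes yek prints the serialized repo to stdout and that a local R1 model is set up in the llm CLI, e.g. the r1l alias from earlier in the thread):

    #!/bin/sh
    # Hypothetical "askai" helper: serialize the repo with yek, then feed it
    # together with the failing test output (passed as the first argument)
    # to a local DeepSeek-R1 model via the llm CLI.
    # Assumes yek prints the serialized repo to stdout.
    repo="$(yek --max-size 120k --tokens)"
    llm -m r1l "Repository contents: $repo. Failing test output: $1. What is the most likely cause, and how would you fix it?"
Usage would then be something like askai "$(your-test-command 2>&1)", with whatever command produces the failing output.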

Does anyone know what kind of hardware is required to run it locally? There are instructions, but nothing about the hardware required.
Great. I've found DeepSeek to consistently be a better programmer than ChatGPT or Claude.

I'm also hoping for progress on mini models. Could you imagine playing Magic: The Gathering against an LLM? It would quickly become impossible to beat, like chess.

I am curious about the rough compute budget they used for training DeepSeek-R1. I couldn't find anything in their report. Does anyone have more information on this?
For months now I've seen benchmarks for lots of models that beat the pants off Claude 3.5 Sonnet, but when I actually try to use those models (using Cline VSCode plugin) they never work as well as Claude for programming.
It already replaces o1 Pro in many cases for me today. It's much faster than o1 Pro, and the results are good in most cases. Still, I sometimes fall back to o1 Pro when this model fails me. It's worth trying every time, though, since it's so much faster.

It's also a lot more fun reading the reasoning chatter. Kinda cute seeing it say "Wait a minute..." a lot.

Just shows how much fruit is available outside of just throwing more hardware at a problem. Amazing work.
I'm curious whether anyone is running this locally using Ollama.
These models always seem great until you actually use them for real tasks. The reliability goes way down; you can't trust the output like you can with even a lower-end model like 4o. The benchmarks aren't capturing some kind of common-sense usability metric, where you can trust the model to handle the random small amounts of ambiguity in everyday, real-world prompts.
Fair point. Actually, probably the best part about having beaucoup bucks like OpenAI is being able to chase down all the manifold little "last-mile" imperfections with an army of different research teams.
DeepSeek V3 and R1 are both ~671B-parameter models; who has that much memory to run them locally these days?
Looks promising. Let's hope that the benchmarks and experiments for DeepSeek are truly done independently and not tainted or paid for by them (unlike OpenAI with FrontierMath).
For anyone wanting GGUFs, I uploaded them to https://huggingface.co/collections/unsloth/deepseek-r1-all-v...

There are distilled R1 GGUFs for Llama 8B and Qwen 1.5B, 7B, and 14B, and I'm still uploading Llama 70B and Qwen 32B.

I also uploaded a 2-bit quant of the large MoE (200GB on disk) to https://huggingface.co/unsloth/DeepSeek-R1-GGUF
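To try one of these with Ollama, the hf.co syntax from earlier in the thread should work, along these lines (the repo name and quant tag below are just an example; substitute the ones you want from the collection):

    # example only; pick the repo and quant tag you actually want from the collection
    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M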

One point is reliability, as others have mentioned. Another important point for me is censorship. Given the political environment it comes from, the model seems to be heavily censored on topics such as the CCP and Taiwan (R.O.C.).
DeepSeek is well known to have ripped off OpenAI's APIs extensively in post-training, so much so that it sometimes, embarrassingly, calls itself "a model made by OpenAI".

At the very least, don't use the hosted version unless you want your data to go to China.

Just like OAI and copyrighted content. And I would rather my data go to China than the US, personally.