> However, DeepSeek-R1-Zero encounters challenges such as endless repetition, poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates cold-start data before RL.

We've been running qualitative experiments on OpenAI o1 and QwQ-32B-Preview [1]. In those experiments, I'd say there were two primary things going against QwQ. First, QwQ went into endless repetitive loops, "thinking out loud" what it had already said, perhaps with a minor modification. We had to stop the model when that happened, and I feel it significantly hurt the user experience.

It's great that DeepSeek-R1 fixes that.

The other thing was that o1 had access to many more answer / search strategies. For example, if you asked o1 to summarize a long email, it would just summarize the email. QwQ reasoned about why I asked it to summarize the email. Or, on hard math questions, o1 could employ more search strategies than QwQ. I'm curious how DeepSeek-R1 will fare in that regard.

Either way, I'm super excited that DeepSeek-R1 comes with an MIT license. This will notably increase how many people can evaluate advanced reasoning models.

[1] https://github.com/ubicloud/ubicloud/discussions/2608

OK, these are a LOT of fun to play with. I've been trying out a quantized version of the Llama 3 one from here: https://huggingface.co/unsloth/DeepSeek-R1-Distill-Llama-8B-...

The one I'm running is the 8.54GB file. I'm using Ollama like this:

    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
You can prompt it directly there, but I'm using my LLM tool and the llm-ollama plugin to run and log prompts against it. Once Ollama has loaded the model (from the above command) you can try those with uvx like this:

    uvx --with llm-ollama \
      llm -m 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' \
      'a joke about a pelican and a walrus who run a tea room together'
Here's what I got - the joke itself is rubbish but the "thinking" section is fascinating: https://gist.github.com/simonw/f505ce733a435c8fc8fdf3448e381...

I also set an alias for the model like this:

    llm aliases set r1l 'hf.co/unsloth/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0' 
Now I can run "llm -m r1l" (for R1 Llama) instead.

I wrote up my experiments so far on my blog: https://simonwillison.net/2025/Jan/20/deepseek-r1/

Over the last two weeks, I ran several unsystematic comparisons of three reasoning models: ChatGPT o1, DeepSeek’s then-current DeepThink, and Gemini 2.0 Flash Thinking Experimental. My tests involved natural-language problems: grammatical analysis of long texts in Japanese, New York Times Connections puzzles, and suggesting further improvements to an already-polished 500-word text in English. ChatGPT o1 was, in my judgment, clearly better than the other two, and DeepSeek was the weakest.

I tried the same tests on DeepSeek-R1 just now, and it did much better. While still not as good as o1, its answers no longer contained obviously misguided analyses or hallucinated solutions. (I recognize that my data set is small and that my ratings of the responses are somewhat subjective.)

By the way, ever since o1 came out, I have been struggling to come up with applications of reasoning models that are useful for me. I rarely write code or do mathematical reasoning. Instead, I have found LLMs most useful for interactive back-and-forth: brainstorming, getting explanations of difficult parts of texts, etc. That kind of interaction is not feasible with reasoning models, which can take a minute or more to respond. I’m just beginning to find applications where o1, at least, is superior to regular LLMs for tasks I am interested in.

Holy moly... even just the Llama 8B model trained on R1 outputs (DeepSeek-R1-Distill-Llama-8B) is, according to these benchmarks, stronger than Claude 3.5 Sonnet (except on GPQA). While that says nothing about how it will handle your particular problem, dear reader, it does seem like an insane transfer of capabilities to a relatively tiny model. Mad props to DeepSeek!
Kind of insane how a severely constrained company founded one year ago competes with the infinite budget of OpenAI.

Their parent hedge fund isn't huge either: just 160 employees and $7B AUM, according to Wikipedia. If it were a US hedge fund it would be around the 180th largest by AUM, so not small, but nothing crazy either.

That's the nature of software with no moat built into it. Which is fantastic for the world, as long as some companies are willing to pay the premium involved in paving the way. But man, what a daunting prospect for developers and investors.
I'm not sure we should call it "fantastic"

The downsides begin at "dystopia worse than 1984 ever imagined" and get worse from there.

That dystopia is far more likely in a world where the moat is so large that a single company can control all the LLMs.
That dystopia will come from an autocratic one party government with deeply entrenched interests in the tech oligarchy, not from really slick AI models.
I was initially enthusiastic about DS3 because of the price, but eventually I learned the following things:

- function calling is broken (it responds with an excessive number of duplicated function calls, with hallucinated names and parameters)

- response quality is poor (my use case is code generation)

- support is not responding

I will give a try to the reasoning model, but my expectations are low.

P.S. The positive side of this is that it apparently removed some traffic from Anthropic's APIs, and latency for Sonnet/Haiku improved significantly.

I just pushed the distilled Qwen 7B version to Ollama if anyone else here wants to try it locally: https://ollama.com/tripplyons/r1-distill-qwen-7b
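If the model page follows Ollama's usual conventions, running it should look something like the command below (the default tag is an assumption; check the page for the exact name):

    # assumed invocation based on the model page URL; the default tag may differ
    ollama run tripplyons/r1-distill-qwen-7b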
> This code repository and the model weights are licensed under the MIT License. DeepSeek-R1 series support commercial use, allow for any modifications and derivative works, including, but not limited to, distillation for training other LLMs.

Wow. They're really trying to undercut closed-source LLMs.

Yep, it's a national strategy.
Amazing progress with this budget.

My only concern is that on openrouter.ai it says:

"To our knowledge, this provider may use your prompts and completions to train new models."

https://openrouter.ai/deepseek/deepseek-chat

This is a dealbreaker for me to use it at the moment.

There are all sorts of ways that additional test-time compute can be used to get better results, varying from things like sampling multiple CoTs and choosing the best, to explicit tree search (e.g. rStar-Math), to things like "journey learning" as described here:

https://arxiv.org/abs/2410.18982

Journey learning is doing something that is effectively close to depth-first tree search (see Fig. 4 on p. 5), and does seem close to what OpenAI are claiming to be doing, as well as what DeepSeek-R1 is doing here... no special tree-search sampling infrastructure, but rather RL-induced generation that produces a single sampled sequence taking a depth-first "journey" through the CoT tree, backtracking when necessary.

I love that they included some unsuccessful attempts. MCTS doesn't seem to have worked for them.

Also wild that few-shot prompting leads to worse results in reasoning models. OpenAI hinted at that as well, but it's always just a sentence or two, with no benchmarks or specific examples.

I use the Cursor editor, and its Claude edit mode is extremely useful. However, the reasoning in DeepSeek has been a great help for debugging issues. For this I am using yek [1] to serialize my repo (--max-size 120k --tokens) and feed it the test error. I wrote a quick script named "askai" so Cursor runs it automatically (a rough sketch is at the end of this comment). Good times!

Note: I wrote yek, so this might be a bit of a shameless plug!

[1] https://github.com/bodo-run/yek
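For the curious, a simplified sketch of the idea (it assumes yek prints the serialized repo to stdout and that a local R1 model is set up in the llm CLI, e.g. the r1l alias from earlier in the thread):

    #!/bin/sh
    # Hypothetical "askai" helper: serialize the repo with yek, then feed it
    # together with the failing test output (passed as the first argument)
    # to a local DeepSeek-R1 model via the llm CLI.
    # Assumes yek prints the serialized repo to stdout.
    repo="$(yek --max-size 120k --tokens)"
    llm -m r1l "Repository contents: $repo. Failing test output: $1. What is the most likely cause, and how would you fix it?"
Usage would then be something like askai "$(your-test-command 2>&1)", with whatever command produces the failing output.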

Does anyone know what kind of hardware is required to run it locally? There are instructions, but nothing about the hardware required.
Great. I've found DeepSeek to consistently be a better programmer than ChatGPT or Claude.

I'm also hoping for progress on mini models. Could you imagine playing Magic: The Gathering against an LLM? It would quickly become impossible to beat, like chess.

I am curious about the rough compute budget they used for training DeepSeek-R1. I couldn't find anything in their report. Does anyone have more information on this?
For months now I've seen benchmarks for lots of models that beat the pants off Claude 3.5 Sonnet, but when I actually try to use those models (using Cline VSCode plugin) they never work as well as Claude for programming.
It already replaces o1 Pro in many cases for me today. It's much faster than o1 Pro, and the results are good in most cases. Still, I sometimes fall back to o1 Pro when this model fails me. It's worth trying every time, though, since it's so much faster.

It's also a lot more fun reading the reasoning chatter. Kinda cute seeing it say "Wait a minute..." a lot.

Just shows how much fruit is available outside of just throwing more hardware at a problem. Amazing work.
I'm curious whether anyone is running this locally using Ollama.
These models always seem great until you actually use them for real tasks. The reliability goes way down; you can't trust the output like you can with even a lower-end model like 4o. The benchmarks aren't capturing some kind of common-sense usability metric, where you can trust the model to handle the random small amounts of ambiguity in everyday, real-world prompts.
Fair point. Actually, probably the best part about having beaucoup bucks like OpenAI is being able to chase down all the manifold little "last-mile" imperfections with an army of different research teams.
DeepSeek V3 and R1 are both ~671B-parameter models; who has that much memory to run them locally these days?
Looks promising. Let's hope that the benchmarks and experiments for DeepSeek are truly done independently and not tainted or paid for by them (unlike OpenAI with FrontierMath).
For anyone wanting GGUFs, I uploaded them to https://huggingface.co/collections/unsloth/deepseek-r1-all-v...

There are distilled R1 GGUFs for Llama 8B and Qwen 1.5B, 7B, and 14B, and I'm still uploading Llama 70B and Qwen 32B.

I also uploaded a 2-bit quant of the large MoE (200GB on disk) to https://huggingface.co/unsloth/DeepSeek-R1-GGUF
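To try one of these with Ollama, the hf.co syntax from earlier in the thread should work, along these lines (the repo name and quant tag below are just an example; substitute the ones you want from the collection):

    # example only; pick the repo and quant tag you actually want from the collection
    ollama run hf.co/unsloth/DeepSeek-R1-Distill-Qwen-7B-GGUF:Q4_K_M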

One point is reliability, as others have mentioned. Another important point for me is censorship. Given the political environment it comes from, the model seems to be heavily censored on topics such as the CCP and Taiwan (R.O.C.).
DeepSeek is well known to have ripped off OpenAI's APIs extensively in post-training, so much so that it sometimes, embarrassingly, calls itself "a model made by OpenAI".

At the very least, don't use the hosted version unless you want your data to go to China.

Just like OAI and copyrighted content. And I would rather my data go to China than the US, personally.