Will DeepSeek Fail at Simple Math Problems?

Amid the waves of surprises brought by DeepSeek, an old friend pointed out that it struggles with simple math problems, using a popular elementary arithmetic question as an example:

Is 3.11 greater than 3.8?

What’s the core issue here?

In the wake of the DeepSeek frenzy, I looked into its research paper, which explains how its reasoning capabilities are enhanced through outcome-oriented reinforcement learning. The paper suggests that, in theory, outcome-oriented reinforcement learning can help a model learn proper reasoning processes. In practice, however, this is not guaranteed.

Take the above math problem as an example. The answer is binary (yes/no), meaning even random guessing has a 50% chance of being correct. This highlights a key potential flaw: outcome-oriented supervision signals are weak because they lack sufficient granularity. Such weak supervision inevitably hampers the model's ability to learn proper reasoning processes.
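
To make this weakness concrete, here is a minimal Python sketch (my own illustration, not DeepSeek's training code) of outcome-only reward on a yes/no question: a policy that ignores the question entirely still collects the full reward about half the time, so the signal barely separates sound reasoning from coin-flipping.

import random

def outcome_reward(predicted: str, gold: str) -> float:
    # Outcome-only supervision: +1 for the right final answer, -1 otherwise.
    # Nothing in the signal depends on how the answer was reached.
    return 1.0 if predicted == gold else -1.0

def coin_flip_policy(question: str) -> str:
    # A "reasoner" that does no reasoning at all.
    return random.choice(["yes", "no"])

gold = "no"  # Is 3.11 greater than 3.8? No.
hits = sum(outcome_reward(coin_flip_policy("Is 3.11 greater than 3.8?"), gold) > 0
           for _ in range(10_000))
print(hits / 10_000)  # ~0.5: the outcome reward barely penalizes non-reasoning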

Three Possible Solutions

  1. Scaling Up the Model
    One approach is to make the model larger and deeper, hoping that the theoretical concept of lossless compression based on Kolmogorov complexity can be pushed to its limit. In doing so, proper reasoning, as the "shortest program," might eventually be learned by the model. Theoretically, correct reasoning ensures accurate results. However, the gap between theory and practice makes it hard to place much confidence in this. The shortest program or lossless compression might just be an unreachable ideal.
  2. Targeted Supervision Data
    Another solution is to feed the model problem-specific supervised data, for example thousands or tens of thousands of reasoning cases involving such math problems. There is no reason the model would not learn from this. However, solving one specific problem this way is merely a stopgap measure. Soon enough, others will come up with new edge cases involving weak supervision signals and reasoning pitfalls to challenge it.

    Another common challenge is the so-called "self-identification" problem. For instance, when asked "Who are you?", many models, including earlier versions of DeepSeek, would claim to be ChatGPT developed by OpenAI if no targeted supervised data is injected. After all, ChatGPT has dominated the internet in the two years since its explosive debut, and its data has inevitably seeped into other models. However, this issue is already on the radar for specialized fixes and is gradually becoming a non-problem. Some Western media still claim that DeepSeek is just a distilled version of ChatGPT. Their evidence? Probably that in the early versions they tested, the DeepSeek bot often claimed to be OpenAI's ChatGPT. But if you test it now, you won't see this behavior anymore. Most likely it was fixed with specialized training data; the research paper also mentions addressing this self-identification issue.

    Similarly, the problem of comparing 3.11 and 3.8 may also be a transitional issue. If it disappears in the future, that won't be cause for celebration: most likely it will have been resolved through targeted fixes rather than through any fundamental improvement in intelligence brought about by algorithmic or architectural innovation.

  3. Re-introducing Process Reward Models (PRM)?
    The inherent weakness of outcome-oriented supervision signals is that they focus only on the result while ignoring any checking of the process, a natural shortcoming of reinforcement learning driven by results-oriented pragmatism (following the "black cat, white cat" principle, lol). This is essentially the cost of abandoning PRMs (Process Reward Models). So, would re-introducing process-based reward models solve the issue? Honestly, we don't know. This is the third possible path, and it might be worth exploring. But again, as mentioned in my previous blog post (DeepSeek's R1 Paper: A Storm in AI LLM Circle), PRMs aren't easy to work with: they are unstable and difficult to implement, although, in theory, they could help correct nonsensical reasoning along the way.

In conclusion, the issue with DeepSeek struggling with problems like 3.11 vs. 3.8 lies in the limitations of weak supervision in results-oriented reinforcement learning. While there are potential solutions—scaling the model, targeted data, or process reward models—each comes with challenges and trade-offs. Whether any of these approaches can fundamentally improve reasoning capabilities remains an open question.


DeepSeek 不懂简单数学题吗?

在 deepseek 带来的一浪一浪惊喜中,老友发现它不懂简单数学题,用的就是网上流行的小学算术的测试题,3.11 比 3.8 大吗:

这个问题的要害何在?

我在 DeepSeek 风暴下看看它的论文中解说了他们的结果导向的推理能力的强化学习。也指出结果导向的强化学习理论上可以学会合理的推理过程。但实际上不好说的。

对于上述数学题,答案是yes/no二分的,就是说,瞎蒙也有一半概率结果正确。这说明结果导向的监督信号区分度低(不可靠),这种弱监督自然影响了推理过程的学习。

三个办法。

第一是把模型做大做深,指望复杂性理论上的无损压缩可以做到极致,从而合理的推理作为“最短程序”最终被模型学到,理论上正确的推理会保证结果的正确性。但理论与实践的距离,可能让我们很难对此抱有太大信心。最短程序可能只是一个美好的梦想。

第二个办法是把针对性监督数据喂给模型,例如同类型的数学题的推理案例喂给它几千上万条,没有道理学不会。但针对性解决了这个问题,只是权宜之计。也许不久,人们会想到其他的答案监督信号弱,推理容易走歪的案例,来继续挑战它。

另一个常见的问题就是所谓“自我认知”的问题,who r u,如果没有针对性监督数据的注入,deepseek 以及很多其他的模型都会自称自己是 ChatGPT,毕竟ChatGPT核爆两年来,它的数据充斥互联网,不可能不受侵染。但这个问题已经进入专项解决的雷达屏上了,所以逐渐不是问题了。西方媒体有的还在说 deepseek 不过就是蒸馏 chatGPT 的,依据就是(他们测试过某个前期版本吧)deepseek bot 常自称是 open ai 开发的 chatGPT,但你现在上去试试,这种问题重复不了了。大概率是被专项数据解决了,记得他们论文也提到了这个自我认知的问题。

同理,3.11 vs 3.8 的大小比较这样的问题也是阶段性问题。以后不见了,也不必为它欢呼,大概率可能就是专项解决了,而不是因为算法或架构把智能真正提升了。

结果导向的监督信号不够强,是只认结果不看过程(白猫黑猫原则)的强化学习天生的短板,应该算是放弃 PRM(process reward model)的代价。那么,把过程奖励模型上马了,是不是就可以解决了呢?不知道。这就是第三条路,也许值得探索。但,again,上一篇博文说了,PRM 不好玩,不稳定,不好实现,虽然理论上可以帮助纠正推理过程中的胡说八道。

【后记】

刚才测试发现不能复现这个bug,看来早已解决了。也许老友昨天“亲测”的结果是忘了打开 deepthink?

【相关】

 

DeepSeek's R1 Paper: A Storm in AI LLM Circle

[Note: This is a blog analyzing DeepSeek's R1 paper and its impact]

Before DeepSeek, Chinese AI companies had long been locked in fierce competition, repeatedly posting world-class SOTA benchmark scores. However, none of them commanded respect or made a stunning impact the way DeepSeek has. Their recent breakthrough caught global attention.

Their paper and open source code are also beautifully written and accessible. No unnecessary complexity or obscurity. Simple and straightforward, yet radiating confidence. It exhibits engineering elegance while conveying innovation as well as passion. Simply remarkable. Should be nominated for best paper of the year.

Reading the R1 paper reveals that what OpenAI had kept mysterious - from Q* to O-series' so-called slow thinking reinforcement training - suddenly becomes clear and simple.

DeepSeek_R1 paper

Their key findings:

They demonstrated that reasoning capabilities can be acquired through pure reinforcement learning with simple rule-based rewards and multi-answer sampling, without the need for extensive supervised fine-tuning (SFT) data. This resulted in DeepSeek-R1-Zero, following AlphaZero's philosophy. While AlphaZero achieved absolute mastery in the narrow domain of Go by eliminating human data, their approach proved effective in broader domains of math, coding and logic.

Though R1-Zero worked well, they found that incorporating minimal SFT data (a few thousand samples) for cold-start was more practical. R1-Zero matched OpenAI-o1-0912's performance, but its reasoning steps had poor readability and mixed languages. R1, with cold-start SFT and a multi-stage training pipeline, achieved further improvements, matching OpenAI-o1-1217.

A new star was born.

Their valuable innovation was challenging the SFT+RL paradigm by proving pure RL's potential for reasoning through R1-Zero. This gave them confidence to further build the practical R1 with minimal cold-start data. Both models are open-sourced for research - an elegant execution.

DeepSeek excels at simplification. In reinforcement learning, they eliminated:
- The critic model that runs in parallel with the policy model, replaced by simple GRPO
- Complex reward models, replaced by rule-based rewards

GRPO (Group Relative Policy Optimization) generates multiple answers per question, comparing them within groups to calculate advantage scores:

Advantage = (Current score - Group mean) / Group std dev

Example: For a math problem generating four answers scoring 90, 80, 70, 60 (mean = 75), the 90-point answer gets a positive advantage score. This eliminates the need for a critic model while still enabling the model to identify better answers.
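
As a sanity check on the arithmetic, here is a small Python sketch (my own illustration, not code from the paper) computing group-relative advantages for the four scores above; whether GRPO normalizes by the population or sample standard deviation is an implementation detail glossed over here.

import statistics

def group_relative_advantages(scores, eps=1e-8):
    # Standardize each answer's score within its own group:
    # advantage = (score - group mean) / group std dev
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # population std dev of the group
    return [(s - mean) / (std + eps) for s in scores]

print(group_relative_advantages([90, 80, 70, 60]))
# mean = 75, std ≈ 11.18  ->  approximately [1.34, 0.45, -0.45, -1.34]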

GRPO advantages:
- Training efficiency: No critic model saves compute
- Training stability: Clipping prevents over-optimization (see the sketch after this list)
- Simple implementation: Clear algorithm structure
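
The clipping mentioned above is the standard PPO-style trick. Below is a minimal sketch of the per-sample clipped term (my own simplification; GRPO also adds a KL penalty against a reference policy, omitted here):

import math

def grpo_clipped_term(logp_new: float, logp_old: float, advantage: float, eps: float = 0.2) -> float:
    # Probability ratio between the current and old policy for this answer.
    ratio = math.exp(logp_new - logp_old)
    unclipped = ratio * advantage
    clipped = max(1 - eps, min(1 + eps, ratio)) * advantage
    # Taking the minimum keeps a single update from over-optimizing on one sample.
    return min(unclipped, clipped)

print(grpo_clipped_term(logp_new=-1.0, logp_old=-1.5, advantage=1.34))  # ratio ≈ 1.65, so the term is clipped at 1.2 * 1.34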

Why did traditional RL use critic models? Critics offered:
- Global evaluation beyond group comparisons
- Learning complex rewards like user preferences
- Single answer evaluation capability
- Long-path rewards for games/robotics

However, GRPO showed that for well-defined tasks (math, coding, logic), simple group comparisons work equally well at scale.

For rewards, R1-Zero used pure rule-based rewards, only employing V3's existing preference reward models in R1's final alignment phase. Human preferences (safety, helpfulness) require complex value judgments that simple reward rules cannot capture.

They intentionally avoided the difficult Process Reward Models (PRM) because:
- Difficult to define granular reasoning steps
- Hard to validate intermediate step correctness
- Risk of reward hacking
- Resource-intensive reward model retraining

R1's reward rules were simple, something like the following (sketched in code after this list):
- Correct answer: +1
- Correct format: +0.5
- Wrong answer: -1
- Vague answer: 0
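
Sketched as Python (the structure and point values follow the list above, which the text itself presents only as an approximation; the exact-match check stands in for proper math/code verification):

def rule_based_reward(answer: str, gold: str, format_ok: bool) -> float:
    # Format reward: the <think>...</think><answer>...</answer> template is respected.
    reward = 0.5 if format_ok else 0.0
    if not answer or not answer.strip():
        return reward            # vague / missing answer: no correctness reward
    # Correctness reward: compare the final answer against the reference.
    reward += 1.0 if answer.strip() == gold.strip() else -1.0
    return reward

print(rule_based_reward("x = -1", "x = -1", format_ok=True))  # 1.5
print(rule_based_reward("x = 2",  "x = -1", format_ok=True))  # -0.5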

Just as scaling GPT's next-token prediction led to emergent general intelligence, result-oriented RL on verifiable outcomes naturally developed complex internal reasoning capabilities. This insight has profound implications for advancing deep reasoning.

R1's four-stage training:

1. Reasoning Cold-start: e.g.

Question: Solve x^2 + 2x + 1 = 0
<think>
1. Identify quadratic equation
2. Coefficients: a=1, b=2, c=1
3. Use formula: x = (-b ± √(b^2-4ac))/2a
4. Substitute: x = (-2 ± √(4-4))/2
5. Simplify: x = -1
</think>
<answer>x = -1</answer>

2. Reasoning RL:

- Result-oriented data generation with <think>...</think> template
- No human bias, allowing model's natural reasoning evolution
- The model gradually increased its thinking time and output length in tokens
- GRPO optimization with rule-based rewards

While only validating final answers risks accepting wrong reasoning paths in theory, practice showed sufficient scale leads to correct reasoning. This seems to align well with Kolmogorov complexity theory - correct reasoning is the "shortest program" for reliable correct solutions.

Input sources:
- Manually designed math/coding problems
- Public benchmarks (e.g., AIME)

Output process:

Input: x^2 + 2x + 1 = 0

Model generates multiple answers:

Answer1: [Reasoning1] -> x = -1
Answer2: [Reasoning2] -> x = -1
Answer3: [Reasoning3] -> x = 2

Filter: Keep 1,2 (correct), discard 3 (wrong)
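
A minimal sketch of that rejection-sampling loop (my own illustration of the recipe; generate() is a hypothetical stand-in for the actual model call):

def filter_correct_samples(question: str, gold: str, generate, n_samples: int = 4):
    # Sample several <think>...</think><answer>...</answer> completions,
    # keep the ones whose final answer matches the reference, discard the rest.
    kept = []
    for _ in range(n_samples):
        completion, answer = generate(question)  # hypothetical model call
        if answer.strip() == gold.strip():
            kept.append(completion)
    return kept  # fed back as regenerated training data for the next round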

3. Comprehensive Fine-tuning:

- 800k samples: 600k reasoning + 200k general tasks
- The V3 model judges cases that rule-based rewards cannot handle
- Reuses V3 training data for non-reasoning tasks

4. Global RL:

- Human preference alignment while maintaining reasoning
- Rule rewards for reasoning
- V3's existing reward model reused for preferences
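
In code, the stage-4 reward might be dispatched roughly as below (a sketch under my own assumptions: rule_reward would be something like the rule-based sketch earlier, and preference_rm stands in for V3's existing reward model as a black-box scorer):

def global_rl_reward(sample: dict, task_type: str, rule_reward, preference_rm) -> float:
    # Verifiable reasoning tasks keep the rule-based reward;
    # open-ended tasks use the learned preference (helpfulness/safety) score.
    if task_type in ("math", "code", "logic"):
        return rule_reward(sample["answer"], sample["gold"], sample["format_ok"])
    return preference_rm(sample["prompt"], sample["response"])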

The process is clearly described, with sufficient implementation details, and in principle it is reproducible.

Reasoning Distillation

Finally, DeepSeek's R1 also excelled at distilling reasoning capabilities into open-source smaller models, outperforming OpenAI's o1-mini. This demonstrates that open-source LLMs are approaching closed-source models in almost all respects.

However, those expensive closed-source models paved the way and set the baselines and goals. The current landscape is ideal: wealthy companies push the boundaries while "dark horses" like DeepSeek follow impressively close behind.

It is worth noting that R1 not only enhanced complex reasoning ("slow thinking") but also significantly improved "traditional" knowledge capabilities compared to its V3 base model, suggesting that stronger reasoning can benefit traditional tasks as well.

Key innovations as a summary:

1. DeepSeek-R1-Zero: First reasoning model trained purely through RL
2. DeepSeek-R1: Improved with cold-start data and 4-stage training
3. Distillation: Successfully transferred capabilities to small models

Technical highlights:

- GRPO algorithm replacing critic model
- Rule-based rewards replacing reward models
- Simple template enabling autonomous learning:
"<think> may well be all you need for reasoning"

[Epilogue from notes]

Discussions in my Silicon Valley old-buddies group have heated up lately:

"DeepSeek needs quick funding/IPO or risks losing their 18 core contributors to big tech."

"Reproduction seems not difficult. Everyone considered RL but hesitated due to compute costs. o1 likely used RL similar to r1 but chose to keep details private and mysterious."

"This team represents China's technological prowess."

"Several companies have reproduced DeepSeek's core results - autonomous reasoning emergence. Expect rapid iterations and development in the coming days/months."

"OpenAI has fewer cards to play. Sam tries psychological warfare - emphasizing process rewards, suggesting complex search for O1... likely all unnecessary."

"Success factors include hiring young talent with fresh thinking."

"DeepSeek R1 showed how we were misled by PRM and MCTS - indeed, looks like all you need is a <thinking> tag."

"It's not about simplicity - fact is large models already have strong reasoning capabilities, they just need:

1. Thinking space/time/tokens (<think> tag)
2. Correct feedback (answer accuracy)
3. Exploration opportunity (GRPO optimization)"

Complex PRM and MCTS actually limited the model's self-exploration. We have been underestimating large models' potential.

The prerequisite for DeepSeek's success was V3, their world-class foundation model matching GPT-4o. They knew how to leverage its potential. Relying on an external base model such as GPT-4 would have made it much harder to build R1 this quickly.

"What's next?"
"AI for science? Machine-proving century-old problems, discovering new drugs..."
"Only two problems matter: Riemann Conjecture and P vs NP"
"Big tech will pursue larger models, more data"
"Nvidia's business will improve"

DeepSeek has achieved parity with benchmarks others set. To truly lead, they need to set new benchmarks and directions. For now, it is still those willing to burn money on a massive scale who are breaking new ground.

R1 demonstrates how a Chinese AI company not only caught up but showed the way forward through intelligent simplification. Their approach of making complex problems simpler may influence the entire field.

----------

But I cannot reproduce the error my old friend ran into yesterday, as shown above; it looks like it has already been fixed.

 

DeepSeek 风暴下看看它的论文

DeepSeek_R1 paper

The Turbulent Second Chapter of Large Language Models: Has Scaling Stalled?

DeepSeek 风暴下看看它的论文

DeepSeek 之前,国内大模型公司各种刷榜,也是内卷得一塌糊涂,也都刷榜刷到了世界先进水平,但没有哪家做到了 DeepSeek 这种硬气、震撼和让人服气。一鸣惊人天下知。

NND,人家论文也写得漂亮、亲民,看上去、读起来就像一首码农诗。没有任何故作高深的玄乎和遮蔽。简单、平实,但那种底气也算是力透纸背。有一种工程美,还能感受到情怀。邪门。应该推举为年度 best paper。

DeepSeek_R1 paper

好,奇文共欣赏,咱们就坐下来读。

R1 论文读下来,原来被 OpenAI 从 q* 开始到 o 系列,搞得神秘兮兮的所谓 System 2 慢思维的强化训练过程,一下子就变得清晰简单多了。

他们的主要发现是:

不用人造强化数据做监督学习微调(sft),利用多答案采样选优的再生数据来“硬做”强化学习,也一样可以学到慢思维的推理能力,这就是他们的 DeepSeek-R1-Zero,实际上是 follow Alpha-Zero 的思路。AlphaZero 在围棋这种非常单纯狭窄的场景,可以把 Zero 进行到底,排除了人类/人为的数据,最终成为绝对王者。

在更广一点的数学、代码和某些逻辑问题的推理场景,他们最终发现还是借助少量的 sft 人工数据更好。但也不过就是几千条的数据,做推理sft的“冷启动”,人工准备一点也不难。这就是他们的 DeepSeek-R1。

他们的 Zero 也走通了,达到了 OpenAI-o1-0912 的水平(o1的9月12日版本?)。其所以做 R1, 加入了sft冷启动的步骤,主要是因为机器完全自主学习出来的 Zero 的推理步骤可读性差、里面还混杂了不同的语言表达方式,这对进一步改良这个系统造成困扰,毕竟模型要“以人为本”,服务开发者和用户的。最终炼成的 R1 推理表现进一步提升,达到 OpenAI-o1-1217 (估计是12月17日版本的o1)的水平。

他们的创新和探索精神表现在,当 community 把 sft+rl 当成是后训练范式的时候,他们做 Zero,完全排除人工数据,验证了纯粹的rl对于推理能力的学习潜力。从 Zero 首先是学到了信心,体验了探索创新者的 aha moment,然后再回头加一些用于冷启动的高质量人工数据sft,再做实用的 R1 就有底气了。两个模型都开源,供人研究和验证,做得煞是漂亮。

DeepSeek 是化繁为简的大师

强化学习中,直接砍掉了应该与policy模型平行迭代的 critic 模型,代之以简单的GRPO。critic 是评估每个步骤的价值模型,砍掉了等于是训练中一下子降低了一半的资源耗费。需要单独训练的奖励模型也省掉了,代之以简单的规则奖励。

咱们先看看GRPO (Group Relative Policy Optimization,分组相对策略优化) 是什么创新,为什么GRPO算法可以平替 Actor-Critic架构的PPO来优化模型。本质上,GRPO是一个无需critic模型的PPO变体。它通过组内统计计算优势值,而不是用critic网络评估价值。

具体说,GRPO 要求每个问题生成多个答案,形成一组,组内答案相互比较,计算每个答案的"好坏程度"(优势值):

优势值 = (当前答案的得分 - 组内平均分) / 组内标准差

假设一道数学题,生成4个答案,得分分别是: 90, 80, 70, 60分;平均分 = 75。90分答案的优势值 = (90-75)/标准差。高于平均分的答案获得正优势值,反之为负。这样就不需要额外的评判模型 (Critic),通过组内比较,模型就知道哪些答案更好,强化学习的优化目标就是要提升生成好答案的概率。

GRPO 算法的优点:

    1. 训练效率高:不需要额外的 critic 模型,节省了大量计算资源。
    2. 训练稳定性好:用clip限制更新幅度,防止过度优化
    3. 实施简单:算法简单,易于实现。

GRPO简单易行又有效,为什么传统的强化学习要用critic模型呢?Actor-Critic 架构有其优势,包括:

    1. 全局评估:不受限于当前组内比较,可以评估答案的绝对质量
    2. 可学习复杂奖励:比如用户偏好、安全性等难以用规则定义的指标
    3. 单个答案也能评估:不需要同时生成多个答案
    4. 场景优势:早期RL主要用于线条很长的游戏和机器人,需要 critic 学习长期奖励

但GRPO表明,对于明确的任务(如数学、coding和逻辑题),只要能规模化做大强化学习,简单的组内比较也能达到模型优化的同样效果。这是强化学习“多快好省”的重要发现。

至于奖励模型,他们在 Zero 训练中完全弃之不用,而是用简单直接的奖励规则代之。只是在R1训练最后阶段的偏好对齐任务上(不是推理任务),才按照RLHF(人类反馈强化学习)的常规使用了 reward model(实际是对于自己基座模型V3中的奖励模型的复用)。原因如前所述,是人类偏好 (如安全性、有帮助性等) 涉及复杂的价值判断,难以用简单规则量化。对这类评估,还是沿用训练过的reward model来模拟人类判断。但推理任务,他们的探索表明完全可以只用简单的奖励规则 -- 因为正确性判断相对明确:math 有答案,code 可以编译和执行 unit testing。

还有什么能简化的,他们没有简化?

强化学习中的难缠的痛点之一是所谓过程奖励 Process Reward Model (PRM),就是深入到推理的每一步去评估。对此他们是“知难而退,敬而远之”,干脆绕过去:DeepSeek的强化是结果导向,不深究过程。论文说明他们选择不使用PRM的原因如下:

    • 难以明确定义推理中的每个细节步骤 (难:绕过去)
    • 难以判断中间步骤的正确性 (难:绕过去)
    • 重新训练reward model需要额外资源,使流程复杂化 (复杂:能简则简)
    • 模型评估会导致reward hacking:即神经模型可能学会欺骗奖励模型

就最后一条是出于神经模型本性上的短板考量,主要原则还是能简则简,能绕则绕。所以说,他们选择简单的规则奖励 + 答案验证的方案,是一个有意识的权衡选择。

就是说,明明简单的规则就可以确定奖励指向,为什么要训练叠床架屋的奖励模型呢?不过是查一下答案或测试一下code,判定结果的对错,加上判定格式是不是符合规范。R1 主打的奖励刺激属于规则绑定:例如,答案正确,奖励+1分;格式正确,奖励 +0.5;答案错误,“奖励”-1分;答案不具体,奖励 0分。

当然,这样做,在把推理拓展到数学、代码以外的任务的时候,可能行不通。但目前大家发力的重点主要就是数学和代码,而更加狭窄的长线条棋类和游戏场景,基本被传统RL攻克。尽管如此,绕过过程奖励仍然可能是潜在的软肋,理论上给结果正确,过程逻辑混乱留下了空间。

只关注答案对错,不问过程是怎么强化出长线条的复杂推理过程的呢?门道就在筛选答案的时候偏向于长答案,随着训练这就自然增加了 test time compute ,发展出对于复杂推理的应对能力。这使得 R1 的强化学习更易自主探索推理路径,成就了DeepSeek的这次突破和出圈。

与 GPT 的 next token prediction(ntp) 规模化以后可以涌现通用智能异曲同工,DeepSeek那帮年轻人发现,只要结果明确可判定,结果导向的强化学习可以自然涌现出复杂的内部推理能力,因为正确的结果需要推理。这个发现的意义,对于领域今后的深度推理的推进非同一般,可媲美GPT系列预训练时发现的 scaling law。

 头部推理模型R1 的训练四阶段 :

1. 推理冷启动

利用数千条高质量人工推理数据,例如:

# 收集高质量示例 
Question: 求解方程 x^2 + 2x + 1 = 0 
<think> 
1. 识别这是一个二次方程 
2. 系数: a=1, b=2, c=1 
3. 使用求根公式: x = (-b ± √(b^2-4ac))/2a 
4. 代入: x = (-2 ± √(4-4))/2 
5. 化简: x = -1 
</think> 
<answer>x = -1</answer>

2. 推理强化学习

结果导向,再生数据的模版“留白” <think>.........</think>

    • 设计简单模板让模型自主再生训练数据:
      <think>思考过程</think><answer>答案</answer>
      模型生成多个答案 -> 筛选正确答案 -> 加入强化学习的训练集
    • 不添加任何人工偏见或策略提示,留下RL自主学习推理过程的空间:
      逐渐增加思考时间(test time compute)和tokens量,模型就自发涌现反思步骤等推理能力,这就是论文作者描述的 aha moment,令人动容的见证
    • GRPO算法采样多个答案,通过内部对比来优化模型
    • 规则奖励:奖励答案正确 + 格式规范

前面提到,系统只验证最终答案,理论上无法保证中间推理步骤的正确性(可能学到"答案对但推理错"的模式),但实践似乎显示,只要强化学习足够充分和规模化,答案正确会自然导向推理步骤的正确性。根据K氏复杂性(Kolmogorov complexity)压缩理论,正确的推理导向正确的答案才是可靠解决方案的“最短程序”,这是无损压缩的终极目标。后训练强化学习的过程与预训练一样,都是对无损压缩的逼近。

DeepSeek 的探索再次表明,简单即美,scale为王
(一简遮三丑,你是服也不服? LOL)

天机就是,scale 是硬道理。简单架构/算法有利于真正的 scale up,只要目标清晰,一旦 scale 了,一切就自然搞定。

训练数据的源头

模版再生数据的input 应该是来自两个源头,1 人工设计的数学题/编程题;2 公开基准测试题(如AIME)。

根据模版的再生数据的output流程:

Input: x^2 + 2x + 1 = 0 Model生成多个答案:
 Answer1:
<think>[推理过程1]</think>
<answer>x = -1</answer>
 Answer2:
<think>[推理过程2]</think>
<answer>x = -1</answer>
 Answer3: 
<think>[推理过程3]</think>
<answer>x = 2</answer>
 筛选:  - 保留Answer1、2(答案正确) - 丢弃Answer3(答案错误)

保留的答案作为推理再生数据加入训练集用于下轮迭代。所有再生训练数据都需要标准答案来评估正确性,这在来源中就给定了:

    • 数学题:确定的数值答案
    • 编程题:通过测试用例验证
    • 逻辑推理:有明确的正确结论

有标准答案是规则奖励能工作的基础。对没有标准答案的任务(如写作),需要用其他方式评估质量,例如使用奖励模型。

3. 综合性微调

800k 条训练数据,其中推理 600k, 其他任务 200k

论文没说为什么按照这个比例选取微调数据,应该是根据经验。600k 推理数据是再生的,用的就是阶段2的推理模型。但这里有一个值得注意的插曲:在阶段2的推理强化学习中,再生数据必须是奖励规则可以判定的。但阶段3的推理数据,却突破了这个限制。阶段3的推理数据增加一些 reward rule 不能判定的 cases,既然简单的奖励规则无法判定,就找 V3 模型来判定。好像是说,当一道推理题(数学、coding或逻辑题)生成n个奖励规则难以评判优劣的结果的时候,就把这些结果和标准答案送给V3,让V3做裁判。

另外的200k数据呢?一部分是拿来主义,直接从他们自己的V3的原始finetune训练数据中选取;另一部分让 V3 生成数据,但要求V3不仅给答案,还要给思维链过程(就是要求它 step by step 输出结果)。这可以理解,这里虽然不是纯粹的长线条推理题,其他任务很多时候也是要有条理的。

4. 全局强化学习

这最后的强化学习很像是早就使用过的 RLHF,更注重人类偏好的对齐。但为了防止推理退化,在偏好对齐的同时,也强化了推理,用的还是规则奖励。而人类偏好对齐用的则是V3原有的奖励模型(这是唯一真正用到的奖励模型)。

整个过程还是相当清晰的,原则上可复现。

用R1再生数据去蒸馏小模型,提升其推理能力

最后,Deepseek 的R1推理强化工作在蒸馏开源小模型方面也做得很牛,干翻了openAI 的 o1-mini 小模型。展示给世人看,开源 LLMs 开始全面逼近闭源模型。

但话说回来,没有这些巨烧钱的闭源模型在前面开路,并建立标杆,后来者也容易失去方向。现在这种局面非常好:让有钱的去砸银子。在金钱的赋能和压力下,不断开疆拓土。让deep“黑马”们在后面紧追不舍,而且还追赶得特别牛气。

令人印象深刻的是,R1 不是仅仅大幅度提升了推理能力(慢思维),在“传统”的知识能力方面比起它的基座模型V3也有显著提升。这可能是因为,推理能力的增强对于一些传统任务具有正面作用,但更应该归功于他们探索出来的四阶段训练R1的pipeline。

最后总结一下。

主要创新点:

1. DeepSeek-R1-Zero: 首个仅通过强化学习(RL)训练的推理模型, 无需人工推理数据的监督微调(SFT)。展示了模型可以纯靠 RL 自主发展出推理能力。

2. DeepSeek-R1: 在 R1-Zero 基础上做以下改进:
- 后训练阶段先用少量高质量数据进行冷启动SFT
- 采用4阶段的后训练流程,两次SFT,两次RL
- 性能可与 OpenAI-o1-1217 相媲美

3. 蒸馏技术: 成功将推理能力迁移到一系列开源小模型:
- 1.5B 参数的模型就超越了 GPT-4 在数学方面的表现
- 32B 和 70B 的模型创造了密集模型的新记录

关键技术细节:

- 使用 GRPO (Group Relative Policy Optimization)算法,舍弃 Critic 模型
- 采用基于规则的奖励系统, 舍弃奖励模型 PRM
- 设计特定的训练模板引导模型再生数据进行自主学习:
<think> is all we need for reasoning!

 

【笔者后记】

这两天莫名很兴奋。跟 deep啥 纠缠不休,今天才缓过气来 lol

硅谷老友群也热议不断:

Hongtao:
DeepSeek若不快速大融资和上市, R1的18位主要贡献者估计很快就被国内外大厂抢光了[Grin]
Core Contributors:
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
Ruoyu Zhang
Runxin Xu
Qihao Zhu
Shirong Ma
Peiyi Wang
Xiao Bi
Xiaokang Zhang
Xingkai Yu
Yu Wu
Z.F. Wu
Zhibin Gou
Zhihong Shao
Zhuoshu Li
Ziyi Gao
当年DeepMind被迫卖给Google,因为founders被告知若不卖,就高薪挖人。挖走一两个作者,就少走大部分弯路了。

主要还是幻方的AI量化投资受挫, 因势利导做deepseek成功;开源后,国内外大家都沿着这路子去试。若不财大气粗起来,优势恐怕难以为继。

超大模型训表征,
开源一蹴而就成。
强化学习各求精,
蒸馏定制缩小型。

内卷已经卷出墙,
硅谷AI圈被激荡。
OpenAI&Meta领头羊,
都被鞭策加速闯。

硅谷不眠夜:DeepSeek为何震动美国科技界?

Nick:复现DeepSeek貌似很容易。其实强化学习大家也都想到过,过去总觉得可能要花很多算力,少人试。貌似o1就是强化学习练出来的,但一些推理token他们没open。这可能迫使国内头部那两家加速上市过程。

立委:这类团队属于中华之光,国之重器。

他们写得基本够清晰了。让人担心他们下一步怎么保持这个势头和地位。很多神秘就是一层窗户纸。最大的功劳是他们同时也差不多捅破了o系列神秘面纱的窗户纸。

Nick:马上融一大笔钱,突击上市。除非手里还有更硬的牌。

马老师:好几家复现了deepseek,各家再各自探索,相信会是快速迭代的过程,有望再一次大发展。

Nick:也是个试金石,倒逼openAI看看还有啥新东西,是骡子是马拉出来溜溜。

Hongtao:给openai压力;更是 叫板meta, 争夺开源盟主地位

鲁总:OpenAI 的牌越来越少了。但SamA 希望通过心理战误导大众。之前发文强调过程奖励,O1 出来时放烟幕弹让人相信推断时使用复杂的搜索 ... 结果都应该没有用。

香港科技大学的团队说是也独立发现了RL涌现推理能力,不过只针对数学问题求解,但也特别指出使用输出格式奖励。

白老师:数学能力和编程能力是相通的。

不请贵的人是成功的很重要因素。

施总:哈哈。贵的不一定能干,能干的都比较贵。

刘总:主要是要用年轻人,岁数大的没戏。岁数大了,思维僵化,精力不行。当然,我说的是统计规律,个例总是有的。

立委:deepseek 不是常态,是冒尖。但 deepseek 这么一捅窗户纸,很多人就跟上了。不知道 它还有多少宝贝没有显露。否则 逐渐暗淡下去 也不是不可能的。

deepseek 之前,各种刷榜,也是内卷得一塌糊涂,也都刷榜刷到了世界先进水平。但没有哪家做到了 deepseek 这种硬气 震撼 和让人服气。一鸣惊人天下知。

Nick:估计每家都会短期内在数学能力上长足进步。豆包上周一周内就进步不小。窗户纸捅破,门槛也不是那么高。大概率o1也是这么做的,只不过内帮孙子比较鸡贼。

Liren:DeepSeek-R1告诉大家,你们都被PRM和MCTS误导了,其实只需要一个<thinking>标签就够了[Chuckle]

Nick:是啊,你写篇文章,“<thinking> is enough"

立委:就是留白。你留了白,系统就会给自主填上。

zero 的实践表明,根本不用想那么复杂,还要考虑怎么从各种不同推理任务中找到共同的思维链 patterns,等等。甚至也不管里面的逻辑是不是胡说八道,结果导向,最终,推理还是学出来了。预训练靠的是简单的 next token prediction,后训练推理靠的就是结果导向的强化自主学习。设计一个简单的模版就搞定了无穷的再生推理数据。

Nick:是啊,有了ToT和Gemini,话都在嘴边了。

Liren:增加在推理时的tokens来提升思考时间。

立委:秘方就是4步走:1 冷启动 2 强化 3 微调 4 再强化。zero 干脆省掉了 1 3 4,所以显得过于生猛,但 beautifully 证明了“硬启动”的强化学习也能涌现高级推理能力。r1 就是完善后训练的节奏和数据配比。很多应该就是经验,是摸索出来的 best practice,他们肯定有过很多其他失败的尝试,但还是摸着石头过了河。

马老师:感觉就是碰运气,不过沿着别人路走的永远没有运气。

立委:我觉得他们还有一些东西,所以才“肆无忌惮”。等于是他们推出了一个菜谱,这个菜谱做的菜比肩世界一流。但他们其实还有其他的菜谱,更高级,但不急于拿出来?

不是大道至简,而是大模型本身已经具备了强大的推理能力,它需要的只是:

1 足够的思考空间/时间/tokens量(<think>标签)
2 正确的反馈信号(答案正确性)
3. 探索优化的机会(GRPO采样选优)

复杂、难缠、费力的PRM(过程奖励模型)和MCTS(蒙特卡洛树搜索路径空间)反而限制了模型的自主探索。这说明大模型的能力被我们低估了。

deepseek 的成功的先决条件是 v3,他们自己做出了世界前列的头部基础模型,他们自己知道怎么善用它的潜力。如果是借助于外部基础模型 GPT4o,就很难这么快做出r1,很多 v3 的资源和practice 就在 r1 过程中直接借用了。

马老师:在理。

Nick:So what's next? assuming everybody will have as strong math capabilities within a month

立委:AI for science?机器自动证明百年难题啥的;机器自动发明新药......

Nick: only two problems matter: Riemann Conjecture and P vs NP

马老师:大厂也许会用更大的模型,更多的数据,继续向大上走。

Nick:那肯定。我觉得Nvidia的生意会更好。

立委:deep 目前为止还是在追平,是人家先树立了标杆,它去对齐。多快好省。

deep 要真牛,再上一个台阶,需要自己树立标杆和方向。但这太难了。目前为止似乎还是只有敢于疯狂烧钱 敢于无限做大的那些狂人才在开疆拓土。
