[Note: This is a blog analyzing DeepSeek's R1 paper and its impact] Before DeepSeek, Chinese AI companies had long been engaged in fierce competition, intermittently posting world-class SOTA benchmark scores. However, none commanded respect or made an impact the way DeepSeek has. Their recent breakthrough caught global attention. Their paper and open-source code are also beautifully written and accessible: no unnecessary complexity or obscurity, simple and straightforward, yet radiating confidence. The work exhibits engineering elegance while conveying both innovation and passion. Simply remarkable. It deserves a nomination for best paper of the year.
Reading the R1 paper reveals that what OpenAI had kept mysterious - from Q* to O-series' so-called slow thinking reinforcement training - suddenly becomes clear and simple.
Their key findings:
They demonstrated that reasoning capabilities can be acquired through pure reinforcement learning with simple rule-based rewards and multi-answer sampling, without the need for extensive supervised fine-tuning (SFT) data. This resulted in DeepSeek-R1-Zero, following AlphaZero's philosophy. While AlphaZero achieved absolute mastery in the narrow domain of Go by eliminating human data, their approach proved effective in broader domains of math, coding and logic.
Though R1-Zero worked well, they found that incorporating a small amount of SFT data (a few thousand samples) for cold start was more practical. R1-Zero matched OpenAI-o1-0912's performance, but its reasoning steps suffered from poor readability and mixed languages. R1, however, with cold-start SFT and a multi-stage training pipeline, achieved further improvements and matched OpenAI-o1-1217.
A new star was born.
Their valuable innovation was challenging the SFT+RL paradigm by proving pure RL's potential for reasoning through R1-Zero. This gave them confidence to further build the practical R1 with minimal cold-start data. Both models are open-sourced for research - an elegant execution.
DeepSeek excels at simplification. In reinforcement learning, they eliminated:
- The critic model that runs alongside the policy model in RL, replaced by simple GRPO
- Complex reward models, replaced by rule-based rewards
GRPO (Group Relative Policy Optimization) generates multiple answers per question, comparing them within groups to calculate advantage scores:
Advantage = (Current score - Group mean) / Group std dev
Example: for a math problem generating 4 answers scoring 90, 80, 70, 60 (mean = 75), the 90-point answer gets a positive advantage score. This eliminates the need for a critic model while still enabling the model to identify better answers.
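To make this concrete, here is a minimal sketch of the group-relative advantage computation using the example scores above (function and variable names are mine, not DeepSeek's code):

```python
from statistics import mean, pstdev

def group_advantages(rewards):
    """Score each sampled answer against its own group's mean and standard deviation."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / sigma for r in rewards]

# The example from the text: four sampled answers scoring 90, 80, 70, 60 (mean = 75).
print(group_advantages([90, 80, 70, 60]))
# -> approximately [1.34, 0.45, -0.45, -1.34]; the 90-point answer gets the largest positive advantage.
```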
GRPO advantages:
- Training efficiency: No critic model saves compute
- Training stability: Clipping prevents over-optimization (a sketch follows this list)
- Simple implementation: Clear algorithm structure
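Where the clipping enters: below is a hedged, per-sample sketch of the PPO-style clipped surrogate that GRPO retains; in the paper the full objective also carries a KL penalty toward a reference policy, which this sketch omits.

```python
import math

def clipped_surrogate(logp_new: float, logp_old: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample clipped surrogate objective (illustrative sketch, not DeepSeek's code)."""
    ratio = math.exp(logp_new - logp_old)            # new-vs-old policy probability ratio
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # keep the ratio inside [1 - eps, 1 + eps]
    # Take the pessimistic (smaller) value so large ratio swings cannot over-reward one update.
    return min(ratio * advantage, clipped * advantage)
```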
Why did traditional RL use critic models? Critics offered:
- Global evaluation beyond group comparisons
- Learning complex rewards like user preferences
- Single answer evaluation capability
- Long-path rewards for games/robotics
However, GRPO showed that for well-defined tasks (math, coding, logic), simple group comparisons work equally well at scale.
For rewards, R1-Zero used pure rule-based rewards, only employing V3's existing preference reward models in R1's final alignment phase. Human preferences (safety, helpfulness) require complex value judgments that simple reward rules cannot capture.
They intentionally avoided the difficult Process Reward Models (PRM) because:
- Difficult to define granular reasoning steps
- Hard to validate intermediate step correctness
- Risk of reward hacking
- Resource intensive reward model retraining
R1's reward rules were simple, something like the following (a toy code sketch follows the list):
- Correct answer: +1
- Correct format: +0.5
- Wrong answer: -1
- Vague answer: 0
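To make these rules concrete, here is a toy scorer in that spirit; the tag format, string-match answer check, and exact point values are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def rule_based_reward(response: str, reference_answer: str) -> float:
    """Toy rule-based scorer mirroring the rules listed above (all specifics illustrative)."""
    # Format rule: reasoning inside <think>...</think>, final result inside <answer>...</answer>.
    match = re.search(r"<think>.+?</think>\s*<answer>(.*?)</answer>", response, re.DOTALL)
    if not match:
        return 0.0                          # malformed output earns nothing
    reward = 0.5                            # correct format
    answer = match.group(1).strip()
    if not answer:
        return 0.0                          # vague / empty answer: 0 overall
    reward += 1.0 if answer == reference_answer else -1.0   # accuracy rule
    return reward

print(rule_based_reward("<think>(x+1)^2 = 0, so x = -1</think><answer>x = -1</answer>", "x = -1"))  # 1.5
```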
Just as scaling GPT's next-token prediction led to emergent general intelligence, result-oriented RL rewarding correct answers naturally developed complex internal reasoning capabilities. This insight has profound implications for advancing deep reasoning.
R1's four-stage training:
1. Reasoning Cold-start: e.g.
```
Question: Solve x^2 + 2x + 1 = 0
<think>
1. Identify quadratic equation
2. Coefficients: a=1, b=2, c=1
3. Use formula: x = (-b ± √(b^2-4ac))/2a
4. Substitute: x = (-2 ± √(4-4))/2
5. Simplify: x = -1
</think>
<answer>x = -1</answer>
```
2. Reasoning RL:
- Result-oriented data generation with <think>...</think> template
- No human bias, allowing model's natural reasoning evolution
- Model gradually increased its thinking time and the length of its reasoning traces
- GRPO optimization with rule-based rewards
While validating only final answers risks, in theory, accepting wrong reasoning paths, practice showed that sufficient scale leads to correct reasoning. This seems to align well with Kolmogorov complexity theory: correct reasoning is the "shortest program" for reliably producing correct solutions.
Input sources:
- Manually designed math/coding problems
- Public benchmarks (e.g., AIME)
Output process:
```
Input: x^2 + 2x + 1 = 0
Model generates multiple answers:
  Answer 1: [Reasoning 1] -> x = -1
  Answer 2: [Reasoning 2] -> x = -1
  Answer 3: [Reasoning 3] -> x = 2
Filter: keep Answers 1 and 2 (correct), discard Answer 3 (wrong)
```
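A rough sketch of that generate-and-filter step follows; `sample_fn` and `extract_fn` are stand-ins for the model's sampler and an answer parser, not any real API.

```python
def collect_correct_traces(sample_fn, question, reference_answer, extract_fn, n_samples=4):
    """Sample several reasoning traces and keep only those whose final answer checks out."""
    kept = []
    for _ in range(n_samples):
        trace = sample_fn(question)                # one <think>...</think><answer>...</answer> completion
        if extract_fn(trace) == reference_answer:  # only the final answer is verified
            kept.append(trace)
    return kept

# Dummy usage mirroring the example above: the two correct traces survive, the wrong one is dropped.
fake_outputs = iter(["<answer>x = -1</answer>", "<answer>x = -1</answer>", "<answer>x = 2</answer>"])
kept = collect_correct_traces(
    lambda q: next(fake_outputs),
    "x^2 + 2x + 1 = 0",
    "x = -1",
    lambda t: t.split("<answer>")[1].split("</answer>")[0],
    n_samples=3,
)
print(len(kept))  # 2
```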
3. Comprehensive Fine-tuning:
- 800k samples: 600k reasoning + 200k general tasks
- The V3 model judges cases that rule-based rewards cannot cover
- Reuses V3 training data for non-reasoning tasks
4. Global RL:
- Human preference alignment while maintaining reasoning
- Rule rewards for reasoning
- V3's existing reward model for human preferences (a small routing sketch follows below)
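A hedged sketch of how the two reward sources in this stage might be routed per sample; the field names and the `preference_rm` callable are assumptions for illustration, not V3's actual interface.

```python
def global_rl_reward(sample: dict, rule_reward_fn, preference_rm) -> float:
    """Route each training sample to the appropriate reward source (illustrative sketch only)."""
    if sample["task_type"] in {"math", "code", "logic"}:
        # Verifiable reasoning tasks keep the simple rule-based reward.
        return rule_reward_fn(sample["response"], sample["reference_answer"])
    # General tasks fall back to the learned preference reward model (helpfulness, safety).
    return preference_rm(sample["prompt"], sample["response"])
```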
The process is clearly described with sufficient implementation detail and is, in principle, reproducible.
Reasoning Distillation
Finally, DeepSeek's R1 excelled at distilling reasoning capabilities into smaller open-source models, outperforming OpenAI's o1-mini. This demonstrates that open-source LLMs are approaching closed-source models in almost all aspects.
However, those expensive closed-source models paved the way and set the baselines and goals. The current landscape is ideal: wealthy companies push the boundaries while "dark horses" like DeepSeek follow impressively close behind.
It is worth noting that R1 not only enhanced complex reasoning ("slow thinking") but also significantly improved "traditional" knowledge capabilities compared to its V3 base model, suggesting that reasoning strength can also benefit traditional tasks.
Key innovations, in summary:
1. DeepSeek-R1-Zero: First reasoning model trained purely through RL
2. DeepSeek-R1: Improved with cold-start data and 4-stage training
3. Distillation: Successfully transferred capabilities to small models
Technical highlights:
- GRPO algorithm replacing critic model
- Rule-based rewards replacing reward models
- Simple template enabling autonomous learning:
"<think> may well be all you need for reasoning"
[Epilogue from notes]
Discussions in my Silicon Valley old-buddies group have heated up lately:
"DeepSeek needs quick funding/IPO or risks losing their 18 core contributors to big tech."
"Reproduction seems not difficult. Everyone considered RL but hesitated due to compute costs. o1 likely used RL similar to r1 but chose to keep details private and mysterious."
"This team represents China's technological prowess."
"Several companies have reproduced DeepSeek's core results - autonomous reasoning emergence. Expect rapid iterations and development in the coming days/months."
"OpenAI has fewer cards to play. Sam tries psychological warfare - emphasizing process rewards, suggesting complex search for O1... likely all unnecessary."
"Success factors include hiring young talent with fresh thinking."
"DeepSeek R1 showed how we were misled by PRM and MCTS - indeed, looks like all you need is a <thinking> tag."
"It's not about simplicity - fact is large models already have strong reasoning capabilities, they just need:
1. Thinking space/time/tokens (<think> tag)
2. Correct feedback (answer accuracy)
3. Exploration opportunity (GRPO optimization)"
Complex PRM and MCTS actually limited the model's self-exploration. We underestimated large models' potential.
A prerequisite for DeepSeek's success was V3 - their world-class foundation model matching GPT-4o. They knew how to leverage its potential. Relying on an external model like GPT-4 would have made it much harder to build R1 this quickly.
"What's next?"
"AI for science? Machine-proving century-old problems, discovering new drugs..."
"Only two problems matter: Riemann Conjecture and P vs NP"
"Big tech will pursue larger models, more data"
"Nvidia's business will improve"
DeepSeek has achieved parity on benchmarks that others set. To truly lead, they need to set new benchmarks and directions. For now, it is still mostly those willing to burn money at massive scale who are breaking new ground.
R1 demonstrates how a Chinese AI company not only caught up but showed the way forward through intelligent simplification. Their approach of making complex problems simpler may influence the entire field.
----------
But I cannot reproduce the error my old friend hit while testing yesterday; it looks like it has already been handled:
The Turbulent Second Chapter of Large Language Models: Has Scaling Stalled?