In reading the DeepSeek R1 paper, some may have overlooked a nuance: the training data are both human-labeled and regenerated, blending supervised and unsupervised learning within its reinforcement learning (RL).
How so?
From the perspective of the data's origin and gold standards, the training data is undeniably human-labeled. It derives from existing math problems and from human-crafted code in GitHub's open-source community, the products of years of effort by educators, developers, and others. The problems (input) and their "gold-standard" answers (output) are human-designed or human-labeled. In this sense, the RL here amounts to typical end-to-end supervised learning:
Input: Math/coding problems
Output: Verified answers
However, unlike other supervised learning, RL requires the model to learn the reasoning process leading to answers. Critically, the intermediate steps lack human annotations or feedback. Instead, the system autonomously generates these reasoning data, iteratively appending to the training set. This makes the process unsupervised. The brilliance of RL lies here: self-guided exploration, path discovery, and data regeneration.
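To make that division of labor concrete, here is a minimal sketch of such a regeneration loop in the spirit of outcome-filtered sampling; the model interface, the "Answer:" convention, and the helper names are hypothetical placeholders, not the actual DeepSeek pipeline.

```python
# Minimal sketch of process-unsupervised, outcome-supervised data regeneration.
# The model interface and the "Answer:" convention are hypothetical placeholders.

def extract_final_answer(trace: str) -> str:
    """Assume the trace ends with a line of the form 'Answer: <value>'."""
    return trace.rsplit("Answer:", 1)[-1].strip()

def regenerate_reasoning_data(model, problems, samples_per_problem=8):
    """Sample reasoning traces; keep only those whose final answer matches gold."""
    new_data = []
    for problem in problems:
        for _ in range(samples_per_problem):
            # The model writes its own chain of thought plus a final answer;
            # no human annotates or scores the intermediate steps.
            trace = model.generate(problem["question"])
            # The only supervision signal is the human-provided gold answer.
            if extract_final_answer(trace) == problem["gold_answer"]:
                new_data.append({"input": problem["question"], "target": trace})
    return new_data  # appended to the training set for the next iteration
```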
Cold Start and Human Data
DeepSeek R1's initial training did use a small set of human-annotated reasoning data. But those few thousand examples pale against the millions of regenerated ones; in effect, they are negligible. In fact, the DeepSeek Zero work demonstrates that such process-labeled human data is not a must-have.
Inspired by AlphaZero (which showed human data might even hinder optimal path discovery in Go), DeepSeek Zero confirms that human annotations of the reasoning process are not necessary. The small amount of human data in R1's pipeline primarily improves the readability of the generated reasoning, not the reasoning capability itself. After all, humans (including developers during debugging) prefer interpretable thought processes.
A New Paradigm: Process-Unsupervised, Outcome-Supervised Learning
This self-play/self-study style of RL represents a novel approach: unsupervised in process but supervised in outcome. DeepSeek's breakthrough reveals that "slow thinking" in RL, i.e. meticulously generating intermediate steps as a chain of thought (CoT), boosts performance not only in logical reasoning but also in non-logical tasks like creative writing.
As my old buddy Cheng insightfully noted:
Deep reasoning inserts extensive text between questions and answers, reducing the perplexity of generating correct answers. Directly jumping from problem to answer has high perplexity, but adding a "reasoning bridge" lowers it. This follows the language model framework: the key is to search for the optimal path in text generation.
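To put a number on that intuition, here is a toy calculation; the per-token log-probabilities are invented purely for illustration and stand in for what a real language model would assign.

```python
import math

def perplexity(answer_logprobs):
    """Perplexity of the answer tokens given whatever context preceded them."""
    return math.exp(-sum(answer_logprobs) / len(answer_logprobs))

# Hypothetical per-token log-probabilities of the same three answer tokens,
# scored once with only the question as context, and once with the question
# plus a generated reasoning bridge (CoT) in between.
direct_jump = [-2.9, -3.1, -2.7]   # P(answer | question)
with_bridge = [-0.6, -0.4, -0.5]   # P(answer | question, reasoning)

print(perplexity(direct_jump))   # ~18: jumping straight to the answer is "surprising"
print(perplexity(with_bridge))   # ~1.6: the bridge makes the answer nearly inevitable
```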
Can Unsupervised, Regenerated Process Data Lead the Model Astray?
One might worry: if the model autonomously generates flawed reasoning steps in its process data, could errors compound? The answer lies in the clear supervision signal from the gold standard. Like a kite held by a string in human hands, the learning is anchored by the final reward. As long as the model truly scales up, outcome-oriented RL ensures that deviations self-correct probabilistically.
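Concretely, that anchoring signal can be as simple as a verifier over the final result. The sketch below assumes an "Answer:" convention for math tasks; neither the convention nor the function is taken from the paper.

```python
import re

def outcome_reward(trace: str, gold_answer: str) -> float:
    """Score only the final result; the intermediate reasoning is never graded.
    For coding tasks the analogue is running the generated program against the
    human-written unit tests and rewarding only a full pass."""
    match = re.search(r"Answer:\s*(.+)", trace)
    predicted = match.group(1).strip() if match else ""
    return 1.0 if predicted == gold_answer else 0.0
```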
Statistically, minor process imperfections or illogical steps do not compromise final accuracy. For non-logical tasks (beyond math/coding), reasoning paths may even contain contradictions or heavy redundancy. Yet as long as the "slow thinking" mechanism guides learning, results remain robust, and often superhuman, as many users of R1 have repeatedly demonstrated of late.
Why Regenerated Data Works
Regenerated reasoning data do not come from nowhere. They are produced by a strong foundation model trained on vast human knowledge, via autoregressive generation (i.e. next-token prediction). While each step may drift slightly, the context grows incrementally, allowing continuous stepwise self-correction. This dynamic, in which probabilistic fluctuations are balanced by stepwise adjustments, enhances semantic coherence and knowledge fluency in generation, lowering overall perplexity and steering toward correct outcomes. Thus, process data rarely derails; instead, it converges toward reliability.
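The mechanism is just the standard decoding loop, sketched below; `next_token_distribution` is a hypothetical stand-in for the foundation model, not a real API.

```python
import random

def generate(prompt_tokens, next_token_distribution, max_new_tokens=256, eos="<eos>"):
    """Standard autoregressive decoding: every new token is conditioned on the
    entire context so far, so later steps can compensate for earlier drift."""
    context = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(context)            # dict: token -> P(token | context)
        tokens, weights = zip(*probs.items())
        token = random.choices(tokens, weights=weights)[0]  # sample, allowing fluctuation
        if token == eos:
            break
        context.append(token)  # the growing context steers every subsequent step
    return context
```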
A Final Note on Cheng’s Observation
Cheng highlights a pivotal finding of DeepSeek:
OpenAI's "Let's Verify Step by Step" argues for rewarding each reasoning step. Yet DeepSeek's RL model achieves remarkable results using only final-outcome rewards, with no human-annotated Chain-of-Thought (CoT) data needed. Whether OpenAI's process supervision is essential or simply a red herring, DeepSeek Zero's breakthroughs redefine the field, proving that outcome-oriented RL can master reasoning autonomously.
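In code terms, the distinction Cheng points to is roughly the following; this is my paraphrase, not either paper's implementation.

```python
def process_supervised_rewards(steps, step_scorer):
    """'Let's Verify Step by Step' style: a learned scorer grades every
    intermediate reasoning step, which requires step-level labels to train."""
    return [step_scorer(step) for step in steps]

def outcome_supervised_reward(final_answer, gold_answer):
    """DeepSeek Zero style: one scalar from the verified outcome; the steps
    in between are left entirely unscored."""
    return 1.0 if final_answer == gold_answer else 0.0
```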
In essence, when guided by scalable outcome supervision, machines learn to self-correct, turning imperfect processes into near-perfect results.