DeepSeek R1 has become the most talked-about breakthrough of the moment. It not only matches OpenAI's top reasoning models (the 'o' series) in mathematics and coding, but also produces stunning results in linguistic creativity and mimicry. In Chinese, and classical Chinese in particular, everyone has felt a miraculous leap in capability.
All of this can be attributed to the reasoning-enhanced Chain of Thought (CoT). Why is CoT so effective, even magical, and how did reinforcement learning push its empowering effect to the maximum?
The key likely lies in the fact that CoT tokens are autonomously generated by the large model, effectively reducing the perplexity from query to answer and serving as a bridge to brilliant performance. Anyone who has read CoT outputs knows that the bridge itself isn't always impressive - it often seems overwrought, overly cautious, verbose, redundant, and methodical - yet it is what allows magnificent answers to emerge. From first principles, this points to the deeper information-theoretic meaning of perplexity.
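Put in rough formal terms (a sketch of the intuition, not a result from the R1 report): let P be the model's next-token distribution, Q the query, A the answer, and C the model-generated CoT. Then the two quantities being compared are:

```latex
% Perplexity of answer A = (a_1, ..., a_{|A|}) given only the query Q
\mathrm{PPL}(A \mid Q) = \exp\!\left(-\frac{1}{|A|}\sum_{t=1}^{|A|} \log P(a_t \mid Q, a_{<t})\right)

% Perplexity of the same answer once the model-generated CoT C is in the context
\mathrm{PPL}(A \mid Q, C) = \exp\!\left(-\frac{1}{|A|}\sum_{t=1}^{|A|} \log P(a_t \mid Q, C, a_{<t})\right)
```

The claim is that the self-generated C consists of low-surprise steps, and that PPL(A | Q, C) is far lower than PPL(A | Q): one hard prediction is traded for many easy ones.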
The Essence of CoT
- From an Information Theory Perspective:
- CoT builds a low-entropy channel between high-perplexity queries and answers
- Through step-by-step decomposition, each step's conditional probability becomes more "natural" and smooth, aligning with the language model's nature
- Eventually transforming seemingly "leaping" reasoning conclusions into a series of accumulated "small steps"
- From an Information Entropy Viewpoint:
- For complex problems, directly jumping from query to answer requires crossing a vast information gap, which "forces" the model to hallucinate and output random answers
- Each CoT step reduces local conditional entropy
- It's like breaking down one large information compression/decoding task into multiple smaller ones (this is directly measurable; see the perplexity sketch after these lists)
- This Explains Why Even "Mundane" CoT is So Effective:
- Its power doesn't lie in how brilliant the process steps themselves are
- Rather, it creates a path of decreasing information entropy
- The model can stably migrate toward the target along this path
- This Also Explains Why DeepSeek's Training is So Vital to Its Success:
- It's not about teaching the model "smarter" reasoning, which isn't even well defined for open-ended, humanities-style tasks
- Instead, it optimizes the ability to construct these low-entropy channels
- Essentially optimizing information flow path planning
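The "entropy reduction pathway" reading can be checked empirically: score the same answer with and without the model's own CoT in the context and compare the two conditional perplexities. Below is a minimal sketch using Hugging Face transformers; the model name and the arithmetic example are illustrative placeholders, not anything DeepSeek-specific.

```python
# Minimal sketch: does a chain of thought lower the conditional perplexity of
# the answer? Model name and example strings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM can be scored this way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_perplexity(context: str, answer: str) -> float:
    """Perplexity of the answer tokens, conditioned on the context tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    ans_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # mask context: loss only on answer tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over answer tokens
    return torch.exp(loss).item()

query = "Q: What is 17 * 24?\nA:"
cot = " 17 * 20 = 340 and 17 * 4 = 68, so 17 * 24 = 340 + 68 = 408. The answer is"
answer = " 408"

print("PPL(A | Q)    =", answer_perplexity(query, answer))
print("PPL(A | Q, C) =", answer_perplexity(query + cot, answer))
```

If the entropy-reduction story holds, the second number comes out much lower than the first: the CoT doesn't need to be brilliant, it just has to make the final tokens unsurprising.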
This perspective provides a lens for understanding CoT, reinterpreting the surface-level "chain of thought" as an "entropy reduction pathway" in information theory terms. It offers a reasonable explanation for result-driven reinforcement learning without process supervision:
Process is important, but process supervision isn't, because the process data naturally produced by large models is more practical and feasible than any human supervision. Let us embrace the transition from human supervision to LLM-internal self-supervision.
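To illustrate what "result-driven, no process supervision" means operationally, here is a schematic sketch (my illustration, not DeepSeek's actual training code): the reward is computed only from the final answer, and whatever self-generated CoT preceded a correct answer gets reinforced wholesale.

```python
# Schematic of outcome-only reward: the CoT is never graded step by step;
# a single reward from checking the final answer reinforces the whole
# self-generated trajectory. Illustrative sketch, not DeepSeek's code.
from dataclasses import dataclass

@dataclass
class Rollout:
    query: str
    cot: str      # model-generated process data, never labeled by humans
    answer: str   # final answer extracted from the rollout

def outcome_reward(rollout: Rollout, reference: str) -> float:
    """Reward depends only on the result, not on how the CoT got there."""
    return 1.0 if rollout.answer.strip() == reference.strip() else 0.0

rollouts = [
    Rollout("What is 17 * 24?", "17*20 = 340, 17*4 = 68, 340 + 68 = 408.", "408"),
    Rollout("What is 17 * 24?", "Roughly 20 * 20, call it 400.", "400"),
]
rewards = [outcome_reward(r, "408") for r in rollouts]
print(rewards)  # [1.0, 0.0]; the policy update favors the first rollout's path
```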