DeepSeek R1 has become the most talked-about breakthrough of the moment. It not only matches OpenAI's top reasoning models (the 'o' series) in mathematics and coding, but also produces stunning results in linguistic creativity and mimicry. In Chinese, and classical Chinese in particular, everyone has felt a miraculous leap in capability.
All of this can be attributed to the reasoning-enhanced Chain of Thought (CoT). Why is CoT so effective, even magical, and how did reinforcement learning push its empowering effect to the maximum?
The key likely lies in the fact that CoT tokens are autonomously generated by the large model, effectively reducing the perplexity from query to answer and serving as a bridge to brilliant performance. Anyone who has read CoT outputs knows that the bridge itself isn't always impressive - it often seems overwrought, overly cautious, verbose, redundant, and methodical - yet it is what allows magnificent answers to emerge. From first principles, this points to the deeper information-theoretic meaning of perplexity.
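Put in rough formal terms (a sketch of the intuition, not a result from the R1 report): let P be the model's next-token distribution, Q the query, A the answer, and C the model-generated CoT. Then the two quantities being compared are:

```latex
% Perplexity of answer A = (a_1, ..., a_{|A|}) given only the query Q
\mathrm{PPL}(A \mid Q) = \exp\!\left(-\frac{1}{|A|}\sum_{t=1}^{|A|} \log P(a_t \mid Q, a_{<t})\right)

% Perplexity of the same answer once the model-generated CoT C is in the context
\mathrm{PPL}(A \mid Q, C) = \exp\!\left(-\frac{1}{|A|}\sum_{t=1}^{|A|} \log P(a_t \mid Q, C, a_{<t})\right)
```

The claim is that the self-generated C consists of low-surprise steps, and that PPL(A | Q, C) is far lower than PPL(A | Q): one hard prediction is traded for many easy ones.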
The Essence of CoT
- From an Information Theory Perspective:
- CoT builds a low-entropy channel between high-perplexity queries and answers
- Through step-by-step decomposition, each step's conditional probability becomes more "natural" and smooth, aligning with the language model's nature
- Eventually transforming seemingly "leaping" reasoning conclusions into a series of accumulated "small steps"
- From an Information Entropy Viewpoint:
- For complex problems, directly jumping from query to answer requires crossing a vast information gap, which "forces" the model to hallucinate and output random answers
- Each CoT step reduces local conditional entropy
- It's like breaking down one large information compression/decoding task into multiple smaller ones (this is directly measurable; see the perplexity sketch after these lists)
- This Explains Why Even "Mundane" CoT is So Effective:
- Its power doesn't lie in how brilliant the process steps themselves are
- Rather, it creates a path of decreasing information entropy
- The model can stably migrate toward the target along this path
- This Also Explains Why DeepSeek's Training is So Vital to Its Success:
- It's not about teaching the model "smarter" reasoning, which isn't even well defined for open-ended, humanities-style tasks
- Instead, it optimizes the ability to construct these low-entropy channels
- Essentially optimizing information flow path planning
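The "entropy reduction pathway" reading can be checked empirically: score the same answer with and without the model's own CoT in the context and compare the two conditional perplexities. Below is a minimal sketch using Hugging Face transformers; the model name and the arithmetic example are illustrative placeholders, not anything DeepSeek-specific.

```python
# Minimal sketch: does a chain of thought lower the conditional perplexity of
# the answer? Model name and example strings are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM can be scored this way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def answer_perplexity(context: str, answer: str) -> float:
    """Perplexity of the answer tokens, conditioned on the context tokens."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    ans_ids = tokenizer(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([ctx_ids, ans_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : ctx_ids.shape[1]] = -100  # mask context: loss only on answer tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean NLL over answer tokens
    return torch.exp(loss).item()

query = "Q: What is 17 * 24?\nA:"
cot = " 17 * 20 = 340 and 17 * 4 = 68, so 17 * 24 = 340 + 68 = 408. The answer is"
answer = " 408"

print("PPL(A | Q)    =", answer_perplexity(query, answer))
print("PPL(A | Q, C) =", answer_perplexity(query + cot, answer))
```

If the entropy-reduction story holds, the second number comes out much lower than the first: the CoT doesn't need to be brilliant, it just has to make the final tokens unsurprising.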
This perspective provides a lens for understanding CoT, reinterpreting the surface-level "chain of thought" as an "entropy reduction pathway" in information theory terms. It offers a reasonable explanation for result-driven reinforcement learning without process supervision:
Process is important, but process supervision isn't, because the process data naturally produced by large models is more practical and feasible than any human supervision. Let us embrace the transition from human supervision to LLM-internal self-supervision.
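To illustrate what "result-driven, no process supervision" means operationally, here is a schematic sketch (my illustration, not DeepSeek's actual training code): the reward is computed only from the final answer, and whatever self-generated CoT preceded a correct answer gets reinforced wholesale.

```python
# Schematic of outcome-only reward: the CoT is never graded step by step;
# a single reward from checking the final answer reinforces the whole
# self-generated trajectory. Illustrative sketch, not DeepSeek's code.
from dataclasses import dataclass

@dataclass
class Rollout:
    query: str
    cot: str      # model-generated process data, never labeled by humans
    answer: str   # final answer extracted from the rollout

def outcome_reward(rollout: Rollout, reference: str) -> float:
    """Reward depends only on the result, not on how the CoT got there."""
    return 1.0 if rollout.answer.strip() == reference.strip() else 0.0

rollouts = [
    Rollout("What is 17 * 24?", "17*20 = 340, 17*4 = 68, 340 + 68 = 408.", "408"),
    Rollout("What is 17 * 24?", "Roughly 20 * 20, call it 400.", "400"),
]
rewards = [outcome_reward(r, "408") for r in rollouts]
print(rewards)  # [1.0, 0.0]; the policy update favors the first rollout's path
```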