Understanding the Power of Chain of Thought

DeepSeek R1 has become the most talked-about breakthrough of recent months. It not only matches OpenAI's top reasoning models (the 'o' series) in mathematics and coding but also produces stunning results in linguistic creativity and stylistic mimicry. Its command of classical Chinese, in particular, represents a leap in performance that everyone can experience firsthand.

All of this can be attributed to reasoning-enhanced Chain of Thought (CoT). Why is CoT so effective, even magical? And how has reinforcement learning maximized its empowering effect?

The key likely lies in the fact that CoT tokens are autonomously generated by the large model itself, effectively reducing the perplexity of the path from query to answer and serving as a bridge to a brilliant response. Anyone who has read CoT output knows that the bridge itself is rarely impressive: it often seems overwrought, overly cautious, verbose, redundant, and plodding, yet it enables magnificent answers to emerge. From first principles, this points to the deeper implications of perplexity in information theory.

The Essence of CoT

  1. From an Information Theory Perspective:
  • CoT builds a low-entropy channel between a high-perplexity query and its answer
  • Through step-by-step decomposition, each step's conditional probability becomes more "natural" and smooth, aligning with the language model's nature
  • Seemingly "leaping" reasoning conclusions are thus transformed into an accumulation of small steps
  2. From an Information Entropy Viewpoint:
  • For complex problems, jumping directly from query to answer requires crossing a vast information gap, which "forces" the model to hallucinate and emit random answers
  • Each CoT step reduces the local conditional entropy
  • It is like breaking one large information compression/decoding task into many smaller ones
  3. This Explains Why Even "Mundane" CoT Is So Effective:
  • Its power lies not in the brilliance of the individual process steps
  • Rather, it creates a path of steadily decreasing information entropy
  • Along this path, the model can migrate stably toward the target
  4. This Also Explains Why DeepSeek's Training Is So Vital to Its Success:
  • The training does not teach the model "smarter" reasoning, which is undefinable for humanities tasks
  • Instead, it optimizes the model's ability to construct these low-entropy channels
  • In essence, it optimizes information-flow path planning
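The entropy argument above can be made concrete with a toy calculation. Per-step perplexity is the geometric mean of the inverse conditional probabilities along a path, so the same joint probability, when spread across many "natural" intermediate steps, yields a far lower per-step perplexity than one direct leap. The numbers below are purely illustrative, not measurements from any real model:

```python
import math

def perplexity(step_probs):
    """Per-step perplexity: geometric mean of inverse conditional probabilities."""
    log_sum = sum(math.log(p) for p in step_probs)
    return math.exp(-log_sum / len(step_probs))

# Direct jump: P(answer | query) = 1e-6, one hugely "surprising" step.
direct = perplexity([1e-6])      # 1,000,000

# CoT path: six easy steps of conditional probability 0.1 each.
# The product is the same joint probability of 1e-6, but each
# individual step is locally "natural" for the model.
cot = perplexity([0.1] * 6)      # 10

print(direct, cot)
```

The joint probability of reaching the answer is identical in both cases; what CoT changes is how the total surprisal is distributed, turning one improbable leap into a chain of high-probability steps.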

This perspective provides a lens for understanding CoT, reinterpreting the surface-level "chain of thought" as an "entropy-reduction pathway" in information-theoretic terms. It also offers a reasonable explanation for why result-driven reinforcement learning works without process supervision:

Process matters, but process supervision does not, because the process data that large models produce on their own is more practical and scalable than any human supervision. Let us embrace the transition from human supervision to the LLM's internal self-supervision.

 


Published by

立委

Dr. 立委, former VP of Engineering for the large-model team at 出门问问 (Mobvoi), focuses on large models and their AIGC applications. He was previously Chief Scientist at Netbase for 10 years, during which he directed the development of understanding and application systems for 18 languages that were robust, ran at line speed, and scaled up to social-media big data, with semantic grounding in sentiment-mining products, making Netbase a front-runner in industrial NLP deployment in the US. Before that, he was VP of R&D at Cymfony for eight years, where his team won first place in the first question-answering evaluation (TREC-8 QA Track) and won 17 SBIR information-extraction projects (PI for 17 SBIRs).
