深度学习的基石:多层感知机

一、单层感知机的致命缺陷

要理解多层感知机的突破性,我们先看一个经典案例:假设我们需要实现“异或”逻辑(XOR),定义如下:

| 输入A | 输入B | 输出 |
|-------|-------|------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

当我们在坐标系中绘制这些点时,会发现正例 (0,1) 和 (1,0) 位于对角线两侧,构成一个典型的线性不可分问题。单层感知机的决策边界只能是一条直线,就像试图用一把直尺分开两堆交叉摆放的硬币——数学上已证明这是不可能完成的任务(Minsky 与 Papert 于 1969 年证明)。
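这一"不可能"也可以用程序粗略验证:在一个有限的权重/阈值网格上穷举所有单层感知机,找不到任何一组参数能同时满足 XOR 的四条真值。下面是一个极简示意(网格穷举只是演示,并非严格的数学证明;函数名为示意所拟):

```python
import itertools

def xor_linearly_separable():
    """在 [-2, 2] 的网格上穷举单层感知机参数 (w1, w2, theta),
    检查是否存在 w1*A + w2*B > theta 能同时满足 XOR 的四条真值。"""
    cases = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
    grid = [i / 4 for i in range(-8, 9)]  # -2.0, -1.75, ..., 2.0
    for w1, w2, theta in itertools.product(grid, repeat=3):
        if all(int(w1 * a + w2 * b > theta) == y for (a, b), y in cases):
            return True
    return False

print(xor_linearly_separable())  # False:网格内不存在能实现 XOR 的直线边界
```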

二、解法:特征空间升维

关键思路:通过增加隐藏层,将原始二维输入映射到一个新的特征空间。

我们设计两个特殊神经元:

神经元H₁:激活条件 = 1*A + (-1)*B > 0.5
神经元H₂:激活条件 = (-1)*A + 1*B > 0.5

输入变换过程

当输入(1,0)时:

H₁:1*1 + (-1)*0 = 1 → 激活
H₂:(-1)*1 + 1*0 = -1 → 不激活

当输入(0,1)时:

H₁:1*0 + (-1)*1 = -1 → 不激活
H₂:(-1)*0 + 1*1 = 1 → 激活

当输入(1,1)时:

H₁:1*1 + (-1)*1 = 0 → 不激活
H₂:(-1)*1 + 1*1 = 0 → 不激活

此时特征空间变为:

| 原始输入 | 隐藏层特征 (H₁,H₂) |
|----------|--------------------|
| (0,0) | (0,0) |
| (0,1) | (0,1) |
| (1,0) | (1,0) |
| (1,1) | (0,0) |

神奇的事情发生了——原本线性不可分的问题,在特征空间中被转换为线性可分问题!

三、输出层

在输出层,我们只需一个简单的OR逻辑:

输出 = H₁ OR H₂

此时只需一条直线就能完美分割正负例。这种「空间变换+线性分割」的思想,正是深度学习突破维度诅咒的武器。现代神经网络中,每个隐藏层都在进行更复杂的空间扭曲,最终在高维特征空间中用超平面分割数据。

四、数学本质:非线性变换

整个过程可视为:

f(X) = σ(W₂·σ(W₁X + b₁) + b₂)

其中:

W₁是第一层权重矩阵 [[1,-1], [-1,1]]
b₁是偏置向量 [-0.5, -0.5]
σ是阶跃激活函数
W₂是输出层权重 [1,1]
b₂是输出偏置 -0.5

这个结构实现了:
1. 第一层:将输入空间扭曲为线性可分的新空间
2. 第二层:在新空间中进行线性分类

传统OR门的感知机表达式:

输出 = 1*H₁ + 1*H₂ - 0.5 > 0

- 每个输入的权重都是 1,表示 H₁ 和 H₂ 同等重要
- 当任意一个输入为 1 时,加权和为 1;两个都为 1 时加权和为 2
- 设置阈值 0.5(移项后表达式为加权和 > 0.5)
- 只要有一个输入为 1(1 > 0.5),即触发激活
- 两个都为 0 时(0 > 0.5 不成立),不激活


用真值表验证公式的正确性:

| H₁ | H₂ | 计算过程 | 结果 |
|----|----|----------|------|
| 0 | 0 | 0+0-0.5 = -0.5 | 0 |
| 0 | 1 | 0+1-0.5 = 0.5 | 1 |
| 1 | 0 | 1+0-0.5 = 0.5 | 1 |
| 1 | 1 | 1+1-0.5 = 1.5 | 1 |

(注:在XOR场景中,H₁和H₂不会同时为1,最后一行为理论验证)

用Python实现这个经典XOR网络:

import numpy as np

# 定义网络结构
W1 = np.array([[1, -1], [-1, 1]])  # 第一层权重
b1 = np.array([-0.5, -0.5])        # 第一层偏置
W2 = np.array([1, 1])              # 输出层权重
b2 = -0.5                          # 输出层偏置

def perceptron(X):
    # 第一层计算
    h = np.heaviside(np.dot(X, W1) + b1, 0)
    # 输出层计算
    return np.heaviside(np.dot(h, W2) + b2, 0)

# 测试所有输入
inputs = [[0,0], [0,1], [1,0], [1,1]]
for x in inputs:
    print(f"输入{x} → 输出{int(perceptron(x))}")  # 转为整数,输出 0/1

输出结果:

输入[0, 0] → 输出0
输入[0, 1] → 输出1
输入[1, 0] → 输出1
输入[1, 1] → 输出0

五、历史意义与延伸

这个简单案例揭示了深度学习的核心思想:

- 层次化处理:如同视觉皮层V1→V2→V4的信息处理流程
- 分布式表示:单个概念由多个神经元协同表示
- 端到端学习:现代网络通过反向传播自动学习W,b参数

今天的Transformer架构依然延续这个基本原理,只是:

- 隐藏层数从2层增加到数百层
- 手动设计参数变为自动优化
- 阶跃激活变为ReLU等平滑函数
- 增加了注意力机制等复杂交互

理解这个案例,就掌握了理解深度学习的钥匙——任何复杂网络,本质上都是这个「俄罗斯套娃」结构的扩展与优化。

 

左脚踩右脚可以飞吗,谈交替使用监督微调和强化学习的后训练

交替使用监督微调(SFT)与强化学习(RL)的策略探讨

在大模型后训练中,像 DeepSeek R1 那样交替使用监督微调(SFT)和强化学习(RL),而非单一依赖 RL 或蒸馏,背后的核心逻辑如下。


1. 交替使用 SFT 和 RL 的根本动因

(1) 训练稳定性与策略纠偏

    • RL 的脆弱性
      强化学习高度依赖奖励函数设计,但在现实任务中,奖励信号往往稀疏(例如数学推理任务中仅有最终答案正确性的反馈)或含有噪声(如人类反馈存在标注误差)。如果长期仅依靠 RL,模型可能陷入局部最优,生成虽能获得高奖励却逻辑混乱的答案。
    • SFT 的锚定作用
      定期引入 SFT 训练,通过高质量数据(如人工修正的思维链,或经"拒绝采样"筛选保留的模型思维链)校正模型生成分布,可以有效防止 RL 阶段过度偏离合理路径。例如,DeepSeek-R1 在第二阶段 RL 后,通过 SFT 数据,成功修复了模型在复杂不等式推导中出现的符号错误。

(2) 数据效率与知识复用

    • RL 的数据饥渴性
      生成有效的 RL 训练数据(如通过模型自身采样获得的推理轨迹)成本极高。以 Open-R1 项目为例,每天需用 512 块 H100 GPU 生成 18 万条轨迹,其中只有约 30% 能通过数学验证。【依据细节待查验】
    • SFT 的快速收敛优势
      在关键能力瓶颈期(例如模型无法处理多步逻辑组合时),直接注入少量精标的 SFT 数据(如 5000 条人工编写的分步解析)能迅速突破性能瓶颈,避免 RL 长时间的探索过程。R1第一步的冷启动即是如此。

(3) 防止灾难性遗忘

    • RL 的窄化效应
      当 RL 过度优化特定任务(如数学证明)时,模型可能会牺牲其他能力(例如常识推理)。有研究表明,纯 RL 训练的模型在 MATH 数据集上准确率提升了 5%,但在 TruthfulQA 上真实性得分下降了 8%。【依据细节待查验】
    • SFT 的全域校准
      通过混合多领域 SFT 数据(例如同时包含数学题和事实核查问答),可以有效维持模型的通用性。DeepSeek-R1 第三阶段的混合数据微调正是基于这一设计理念。

2. 为何不持续使用 RL 或仅用蒸馏?

(1) RL 的固有局限性

    • 奖励假设的不可靠性
      RL 假设奖励函数能够完全表征任务目标,但在复杂任务中,这一假设几乎难以成立。例如,代码生成任务若仅以单元测试通过率作为奖励,模型可能生成通过测试但存在安全漏洞(如缓冲区溢出)的代码。
    • 策略坍塌风险
      长期 RL 训练可能导致模型策略多样性丧失。在对话任务中,模型可能反复生成高奖励但公式化、缺乏创意的回答,从而损害用户体验。

(2) 蒸馏的适用边界

    • 表达能力损失
      蒸馏通过模仿教师模型的输出分布实现知识迁移,但这种方式往往会丢失隐式推理能力。例如,DeepSeek-R1-Zero 的蒸馏版本在多跳推理的 MATH 题目上性能较原模型下降了约 12%。【依据细节待查验】
    • 教师依赖陷阱
      蒸馏效果受限于教师模型的整体质量。如果教师模型存在系统性错误(如物理常识错误),学生模型难以自主纠正,而 RL 能够利用环境反馈及时修正此类错误。

3. 交替循环的深层价值

(1) 螺旋式能力进化

    • SFT → RL 的递进
      SFT 阶段提供了基本能力的锚点,RL 则在此基础上探索更优策略。例如,模型在 SFT 阶段学会标准数学符号的使用,随后通过 RL 发现了更简洁高效的路径。
    • RL → SFT 的修正
      RL 阶段暴露出的不足(如过度简化关键步骤)可以通过新一轮 SFT 注入数据进行修正。这种交替循环类似于人类“学习-实践-反思”的认知过程。

(2) 多目标动态平衡

    • 性能与安全性的博弈
      纯 RL 可能为追求高得分而牺牲安全性,例如生成偏激或不当内容。而定期使用 SFT 注入经过伦理审查的数据,可以动态约束模型输出,确保既有高性能又符合安全规范。
    • 通用与专用的适配
      交替训练使得模型既能通过 SFT 保持广泛适用性(如客服对话),又能通过 RL 深耕垂直场景(如医疗诊断中的检查项推荐),从而在多种任务上均表现出色。

总结:交替策略的系统性优势

    • 风险控制:通过 SFT 定期校准,规避了 RL 训练过程中可能出现的失控风险。
    • 资源优化:在 RL 数据生成成本与 SFT 标注成本之间取得了良好的平衡。
    • 能力完备性:既兼顾了 SFT 的模仿学习规范性,又融合了 RL 的探索学习创造性,最终实现了模型能力的稳健提升。

这种交替策略实际上是将“定向引导”与“自由探索”相结合,既避免了纯 RL 的“野蛮生长”,也超越了纯 SFT 的“照本宣科”,最终实现了模型综合能力的全面进化。同时,尽管蒸馏在某些场景下(如移动端轻量化)有其独特优势,但它并不能替代这种系统性的优化路径。

 

【相关】


o3 Deep Research: DeepSeek R1 多阶段训练流程问答解析

好的,我将对 DeepSeek R1 的整体训练过程进行研究,并重点分析强化阶段(第二、三、四阶段)的具体细节,包括 RL、SFT、CoT 的方法和策略。

报告将采用问答解析的形式,同时保持一定的技术深度,并确保可读性。还会在必要时提供与 GPT-4、Claude、Gemini 等模型的简单核心对比,以展示其独特性。

我会尽量参考 DeepSeek 官方网站的论文和技术资料,同时结合网络上的研究和学习经验进行综合分析。

DeepSeek R1 多阶段训练流程问答解析

Q1: DeepSeek R1 的整体训练包含哪些阶段,为什么采用多阶段策略?

A: DeepSeek R1 采用了四阶段的训练流程,包括两个强化学习(RL)阶段和两个监督微调(SFT)阶段 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。这一多阶段策略旨在先通过少量数据稳定模型输出格式,然后大规模提升推理能力,接着扩展通用能力,最后对齐人类偏好,逐步打造出既擅长推理又安全实用的模型 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。具体来说:

通过上述多阶段渐进式训练,DeepSeek R1 最终在推理能力上达到接近OpenAI o1系列闭源模型的水平,同时保持了输出的规范性和对用户需求的良好适应 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)。

Q2: 第二阶段的推理强化学习是如何实施的?训练是逐个问题进行还是批量进行?训练数据是否会保留?奖励机制如何设计?

A: 阶段2是针对推理能力的强化学习训练。在这一阶段,DeepSeek R1 采用了GRPO算法(分组相对策略优化)来高效地进行RL训练 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。具体来说,对每个问题,不会只生成单一回答,而是从当前策略采样一组不同的回答,然后根据这组回答的得分情况来优化模型 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。这种“按问题分组”的方式相当于批量生成和评价:每道训练题目产生多个解答,计算每个解答的奖励,并用组内奖励的相对差异(优势值)指导模型参数更新 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。由于采用组内平均得分作为基准(baseline),GRPO 不需要额外的价值网络(critic),降低了大模型RL的开销 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。因此,就单个问题而言是通过多次尝试来强化,但在实现中会并行处理多个问题批次,以加速训练。每轮生成的回答在用于计算梯度更新后并不长期保留,属于在线的RL采样;只有当阶段2训练收敛后,会使用最终的模型来批量生成数据供下阶段使用 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。

奖励机制方面,DeepSeek R1 在此阶段设计了规则奖励,主要包括:

最终,以上各项奖励会加权求和形成总奖励 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。模型通过最大化期望奖励来更新参数:正确且格式规范、语言一致的回答获得最高回报,从而模型逐步学会既推理正确表述规范地回答问题 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。

值得一提的是,DeepSeek R1-Zero(没有冷启动微调的版本)在纯RL下已经自行涌现出了反思验证、超长链式推理等强大能力,但同时输出可读性差、语言混杂 (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)。因此在 R1 中通过以上冷启动和奖励改进,显著改善了这些问题 (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)。阶段2 的RL训练让模型的推理准确率在数学等任务上大幅提升(例如 AIME 数学竞赛准确率从15.6%提高到71% (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)),为后续步骤打下强大的推理基础。训练过程中产生的大量问答尝试并未直接用于模型参数更新之外的用途,而是在训练完成后经过筛选用于下一阶段的数据集构建 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。
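上文提到的"组内相对优势"(GRPO 以组内平均得分为 baseline、无需额外的价值网络)可以用几行 Python 示意。以下仅为概念演示,奖励数值与函数名均为示意所拟,与官方实现无关:

```python
import numpy as np

def grpo_advantages(rewards):
    """GRPO 的组内相对优势:以同组回答的平均奖励为基线,
    用组内标准差归一化,无需额外的价值网络(critic)。"""
    rewards = np.asarray(rewards, dtype=float)
    baseline = rewards.mean()          # 组内平均得分作为 baseline
    scale = rewards.std() + 1e-8       # 防止除零
    return (rewards - baseline) / scale

# 同一道题采样 4 个回答,按"答案正确性 + 格式规范"打分(示意数值)
group_rewards = [1.0, 0.0, 1.0, 0.5]
adv = grpo_advantages(group_rewards)
print(adv)  # 高于组平均的回答获得正优势,低于组平均的为负
```

正优势的回答在梯度更新中被强化,负优势的被抑制,这正是"按问题分组、组内比较"的含义。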

Q3: 第三阶段为什么要用 60 万条推理再生数据和 20 万条非推理数据进行微调?这一混合微调策略有何作用?

A: 阶段3是承上启下的监督微调,目的是在保持高推理能力的同时扩展模型的通用能力 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。之所以采用“60万推理 + 20万非推理”的数据组合,主要有以下考虑:

Q4: 第四阶段为何需要再次进行强化学习?这一阶段是如何结合人类偏好进行对齐的,人类偏好对齐有什么必要性?

A: 阶段4是DeepSeek R1的二次强化学习阶段,核心目的在于对齐人类偏好,进一步提升模型在实际应用中的安全性和实用性 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。即使经过阶段2和阶段3,模型在推理和一般任务上已表现出色,但仍可能存在不符合用户期望的行为,例如回答不够礼貌有用,或在敏感问题上产生不安全内容。这正是当前尖端LLM都会面临的问题,需要引入人类偏好对齐(例如 OpenAI 的GPT-4通过RLHF过程进行对齐)。对于DeepSeek R1,这一阶段通过融合人类偏好相关的奖励信号来微调模型,使其行为更符合人类期望 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。具体而言:

通过上述措施,阶段4 实现了模型行为和人类期望的深度对齐,相当于执行了一次RLHF式(基于人类反馈的强化学习)训练。 (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)在对齐后,DeepSeek R1 兼具强大的推理能力和良好的安全性/有用性:既能给出复杂问题的正确解答,又能确保回答让用户容易理解、愿意接受,在敏感问题上遵循安全准则 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园) ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。这种人类偏好对齐对于模型大规模应用是必要的 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)——它保证模型不会因为追求高推理准确率而忽略了与用户互动的品质,确保模型的高智商能够以让人类舒适的方式呈现出来。换句话说,经过阶段4,DeepSeek R1 不仅是“推理高手”,也是“贴心助手”。

Q5: DeepSeek R1 与 GPT-4、Anthropic Claude、Google Gemini 等顶尖模型有哪些核心差异?

A: DeepSeek R1 的训练路线与这些闭源大模型有一些独特之处,但总体目标相似(都追求强性能和对齐人类意图)。主要区别在于:

综上,DeepSeek R1 展现了一条与主流商用大模型不同但并行的研发路线:用较少人工介入、充分利用RL自我提升推理,再辅以必要的监督和对齐,来培养强大的通用AI。这为业界提供了新思路,也使研究者能够探究在不依赖天量数据集的情况下提升模型智力的可能 (The Illustrated DeepSeek-R1 - by Jay Alammar)。各模型各有侧重,但目标殊途同归:让AI既“智商高”又“情商高”,能真正解决问题并让用户放心使用。

**参考文献:**

DeepSeek R1 官方论文 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)及技术报告、DeepSeek 官方博客和数据集说明 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园),以及相关分析文章和讨论 (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)。

上述内容综合了官方资料与社区解读,以问答形式呈现,希望对DeepSeek R1的训练策略有全面而清晰的说明。

 

后记

才试了一下 OpenAI o3 最新推出的 Deep Research 功能,不能不说,作为研究工具,非常给力,完成后的成品具有直接发表的水平,想做一点后编辑都难。有点叹为观止的感觉,堪称完美。当然,现在各家头部模型都开始提供这种“深度研究”的功能了,这等于是平替了大多数咨询公司研究人员的工作。即便是训练有素的数据分析和调研人员,多数人实际上也做不了它这么全面和细致。最见功力的是它引用的全面和细密。说话有根据,幻觉自然减少。

【相关】

Dilemma of RPA and Early-Stage LLM Co-pilot Entrepreneurs in the Age of Agent Tsunami

As large language models (LLMs) surge forward, LLM Agents are reconstructing the automation landscape at unprecedented speed. This revolution not only threatens traditional RPA (Robotic Process Automation, reliant on rule engines or small models) but also pushes early-stage co-pilot application builders to the edge of a cliff. At its core, this technological shift represents two fundamental disruptions: 
1. Natural language interaction overpowering low-code programming in complex, dynamic, unstructured data scenarios. 
2. General intelligence violently overshadowing shallow vertical solutions.

"Triple Disruption" of LLM Agents

1. Paradigm Shift: From "Low-Code" to "Emergent Intelligence"

- Traditional RPA: Engineers script step-by-step logic (e.g., UiPath’s drag-and-drop designer), akin to teaching robots to hop grids – brittle and error-prone.
- LLM Agent: Directly interprets human intent (e.g., "Extract invoice data from emails into the system"), autonomously decomposes tasks, and dynamically adjusts execution paths.
- Case Study: ChatGPT plugins already book flights or fetch data via API calls, while traditional RPA requires low-code scripting for equivalent functions.

2. Moat Erosion: Data Barriers vs. General Intelligence

Pre-LLM RPA Moats:
Industry know-how (e.g., nuances of financial reimbursement workflows) + custom deployment capabilities + template libraries.
Reality: Most RPA firms accumulated shallow industry exposure rather than deep vertical data expertise.

LLM’s Breaching Tactics:
- Digests unstructured documents (e.g., diverse invoice formats) via multimodal vision and computer use capabilities.
- Adapts to new workflows via zero-shot Chain-of-Thought (CoT) reasoning (e.g., interpreting vague commands like "Sync key contract terms to CRM").

Final Blow: As standardized scenarios get natively covered by leading LLMs (including reasoning models), RPA’s last defense – proprietary industry APIs – is being devoured by LLM vendors’ customization and privacy solutions.

3. Ecosystem Cannibalization: From "Tool Vendor" to "LLM-native Layer"

Early Co-pilot Traps:
Products like Character.ai (personalized chatbots) and Jasper (writing/marketing assistants) – essentially thin wrappers over base models – crumble when ChatGPT launches role presets or DALL·E 3 plugins.

Survivor Playbooks:
- Perplexity.ai: Carves a niche with real-time search + academic citations (fixing LLM hallucination).
- Cursor: Builds vertical moats via developer workflow integration (codebase semantics, AI pair programming).

Industry Upheaval in RPA

- UiPath’s stock plummets from 2021 highs; its "Autopilot" feature (English-to-automation) criticized as a "GPT-4 wrapper."
- Microsoft Power Automate integrates Copilot, generating cloud workflows from natural language prompts.
- Adept (AI-for-computer-actions startup) hits $1B+ valuation, directly threatening RPA’s existence.

Survivor’s Map: Niches Resisting the LLM Tide

1. Deep Verticalization
- Cursor: Dominates IDE ecosystems via VSCode extensions and developer workflow data.
- Harvey (legal AI): Trains on LexisNexis corpus + private deployment for compliance.

2. Real-Time Data Masters
- Perplexity.ai: Search engine-grade indexing + academic database partnerships.
- Hedgeye (finance): Aggregates Bloomberg/Reuters feeds + proprietary prediction models.

3. Hardware Fusion
- Covariant: Embeds LLMs into warehouse robotics, leveraging mechanical barriers.
- Tesla Optimus: Physical-world operation via embodied AI, evading pure-digital competition.

Agent Startup Pitfalls & Counterstrategies

Common Traps

- Thin Model Wrapping
Issue: Repackaging ChatGPT prompts as "AI customer service" adds no real value.
Fix: Develop domain-specific features (e.g., clinical decision support requiring privacy-sensitive data pipelines).

- Over-Reliance on Fine-Tuning
Issue: Claiming "medical LLM" after basic terminology tuning ignores the need for closed-loop clinical workflows.
Fix: Build proprietary data flywheels and scenario-optimized architectures.

- Ignoring Enterprise Needs
Issue: Overlooking security, SLA guarantees, and system integration.
Fix: Architect enterprise-grade frameworks for organizational deployment.

Differentiation Strategies

- Workflow Integration Specialists: Develop deep connectors for niche scenarios (e.g., legal document parsing).
- Human-AI Orchestrators: Design quality control layers and manual override mechanisms.
- Vertical Knowledge Engineers: Curate domain-specific benchmarks and evaluation protocols.

RPA’s Last Stand

While battered, RPA retains residual value in:

- High-compliance scenarios: Auditable/traceable workflows (e.g., financial regulations).
- Legacy system integration: Stability in outdated IT environments.
- Ultra-high precision demands: Deterministic execution for core systems (e.g., stock trading).


Challenges for Early Co-pilot Entrepreneurs

Two fatal flaws plague AI application startups: 
1. No proven scaled success cases – LLMs are barely 2-3 years old, leaving co-pilots (beyond chatbots) unvalidated for commercial viability. 
2. Vulnerability to LLM upgrades – Without exclusive industry data or customer channels, co-pilot startups risk being crushed by foundational model advancements.

The Inevitable Conclusion

LLM Agents are replaying cloud computing’s annihilation of on-prem servers: foundational capabilities get standardized (like AWS replacing data centers), while vertical opportunities spawn new giants (like Snowflake). RPA and generic Agent startups must either:
1. Become vertical domain experts, or
2. Master human-AI collaboration architectures

... or face obsolescence as LLM agents absorb 90% of automation value. The silver lining? This disruption will unlock an automation market 100x larger than the RPA era – but tickets are reserved for those who architect vertically fused, LLM-empowered solutions.

As Sam Altman warned: Avoid building what foundational models will inevitably swallow.

 

 

【相关】

RPA 赛道与大模型Copilots早期创业者的困局

在大模型技术狂飙的当下,LLM Agents(基于大语言模型的智能体)正以前所未有的速度重构自动化版图。这场革命不仅冲击着传统 RPA(机器人流程自动化)的生存空间,更将依赖规则引擎或小模型的早期 RPA 创业者,以及做大模型 co-pilot 应用的早期创业者,逼至悬崖边缘。这场技术迭代的本质,是在复杂、多变、非结构化数据处理的自动化场景下,自然语言交互对低代码编程的降维打击,更是通用智能对浅垂直场景的暴力覆盖。

一般而言,作为 agent 的早期形式,co-pilot 创业有两大问题:第一是没有成功案例可循,因为 LLM 才出现两三年,还没来得及给除 chatbot 以外的 copilot 证明其商业闭环的机会。第二是 co-pilot 创业如果没有深厚的独家行业数据基础以及客户渠道,就非常容易把自己定位在大模型迭代升级的路口上,被大模型升级所碾压。

一、LLM Agents 的 "三重碾压效应"

1. 技术路径的颠覆:从 "低代码" 到 "自然涌现"

- 传统 RPA:依赖工程师编写脚本或流程图(如 UiPath 的拖拽式设计器),需精准定义每一步操作逻辑,如同教机器人跳格子,容错率低。
- LLM Agent:直接理解人类意图("把邮件里的发票信息录入系统"),自主拆解任务链条,动态调整执行路径。
- 典型案例:ChatGPT 插件系统已能调用 API 完成订机票、查数据等操作,而传统 RPA 实现同等功能需低代码脚本。

2. 护城河的瓦解:数据壁垒 vs 通用智能

- 前大模型时代的 RPA 壁垒

行业 know-how(如财务报销流程的细微差异)+ 客户定制化部署能力 + 长期积累的模板库。

但多数RPA公司本来的业务就是深入各行各业办公室的小批次业务累积,而不是垂直行业的数据深耕,基本没有多少垂直行业的门坎。

- LLM Agent 的破壁逻辑

- 通过大模型视觉以及电脑使用(Computer Use)能力直接消化非结构化文档(如五花八门的发票格式)以及模拟人类使用网络(而不是调用API)
- 通过思维链(CoT)零样本学习适应新业务流程(如理解 "将合同关键条款同步到 CRM" 的模糊指令)
- 致命打击:当大部分标准化场景被头部大模型(包括推理模型)原生覆盖,RPA 仅存的护城河只剩特定行业的私有数据接口——而这块蛋糕正被大模型厂商的定制化、私有部署以及隐私保护能力所蚕食。

3. 生态位的吞噬:从 "工具提供商" 到 "基础设施层"

- 早期 Copilot 创业者的困境:

- Character.ai(个性化对话角色)、Jasper(营销文案生成)等曾风光一时的产品,本质上是对基础模型能力的浅层封装。
- 当 ChatGPT 开放角色预设功能和 DALL·E 3 插件,这些 "中间件" 的生存空间被急剧压缩。

- 幸存者法则:

- Perplexity.ai 以实时搜索+学术级引用杀出血路(弥补了通用模型的事实性缺陷)
- Cursor 通过深度绑定开发者工作流(代码库语义检索、AI 结对编程)建立垂直壁垒

二、RPA 赛道的行业剧变

- UiPath 股价较 2021 年高点严重下跌,其推出的 Autopilot 功能( "用英语描述自动化流程")被用户吐槽为 "套壳 GPT-4"
- 微软 Power Automate 全面接入 Copilot,支持 "描述你想要自动化的任务" 直接生成云端工作流
- 硅谷明星创业公司 Adept(专注训练 AI 执行电脑操作)估值突破 10 亿美元,直接威胁 RPA 生存根基

三、幸存者地图:哪些赛道还能抵抗 LLM 洪流?

1. 重度垂直型

- Cursor 的代码助手:深度集成 IDE 生态(利用VSCode 扩展市场的高占有率),掌握开发者真实工作流数据
- Harvey(法律 AI):通过LexisNexis法律特有语料训练理解法律术语+私有化部署解决合规问题

2. 实时数据型

- Perplexity.ai:构建搜索引擎级索引体系+ 学术数据库独家合作
- Hedgeye(金融分析):聚合 Bloomberg/Reuters 实时行情+行业独家预测模型

3. 硬件耦合型

- Covariant 仓储机器人:将 LLM 与机械臂控制算法深度融合,硬件壁垒延缓大模型侵蚀速度
- Tesla Optimus:通过具身智能实现物理世界操作,暂时规避纯数字自动化竞争

四、Agent 创业的陷阱与对策

常见陷阱

    1. 简单封装大模型
      • 问题:仅对通用大模型进行表层封装,缺乏实质性增值,例如将 ChatGPT 的提示词模板包装成 "智能客服系统"
      • 修正:开发特定领域专用功能,针对垂直场景深度优化
    2. 过度依赖微调
      • 问题:认为对大模型进行简单微调就能构建竞争壁垒,例如微调行业术语就标榜 "医疗大模型",实则临床决策支持需要具有高度隐私敏感的“地下”数据综合能力
      • 修正:构建专有数据闭环和场景优化的工作流程,形成实质性差异
    3. 忽视企业级需求
      • 问题:不重视安全合规、SLA保障和系统集成
      • 修正:构建企业级功能框架,满足组织级部署要求

差异化策略

    1. 工作流集成专家
      • 聚焦特定工作场景深度集成
      • 开发专用连接器和数据通道
    2. 人机协作架构师
      • 设计高效的人机分工模式
      • 构建质量监控和人工干预机制
    3. 行业知识库构建者
      • 整合垂直领域专业知识
      • 开发领域特定的评估标准

RPA 的剩余价值: 虽然 LLM Agent 冲击巨大,但 RPA 并非完全没有生存空间。RPA 在以下方面具有剩余价值:

    • 合规性要求高的场景: RPA 流程的可审计性和可追溯性可能更符合某些行业的合规要求。
    • 与遗留系统的集成: 在某些遗留系统集成方面,RPA 可能比 LLM Agent 更成熟和稳定。
    • 超高精度和稳定性的需求: 在极少数对精度和稳定性要求极高的场景下,例如金融交易核心系统,RPA 的确定性执行可能更受青睐。

残酷结论

LLM Agents 正在重演云计算淘汰本地服务器的历史:通用能力标准化底层服务(AWS 取代企业机房),垂直场景留给细分玩家(Snowflake 在云上长成数据仓库巨头)。RPA 和早期 Agent/Copilot 创业者若不能快速转型为 "领域场景深耕者" 或 "人机协作架构师",必将成为大模型吞噬算力时扬起的尘沙。唯一的好消息是:这场碾压的终局将释放出比 RPA 时代大百倍的自动化市场——但入场券只留给能与 LLM 共舞的垂直场景产品架构师。

大模型创业需要避坑,首要的坑正如Sam所说,就是不要开发大模型边缘迟早会内化的能力。

 

【相关】

Understanding the Division of Labor Among Q, K, V in Self-Attention Mechanism

For those diving into self-attention mechanisms, the roles of Query (Q), Key (K), and Value (V) often spark confusion: why must every token in a sequence generate these three distinct roles to capture contextual dependencies? To grasp how Q, K, and V matrices autonomously specialize through backpropagation, we must delve into the foundational logic of model training. This process mirrors the natural phenomenon of "ant colony division of labor": though initially identical, ants evolve into workers, soldiers, or reproducers through environmental feedback. Similarly, Transformer parameters self-organize via error-driven optimization.

I. The Driving Force: Loss Function as a Macro-Regulator

Suppose we are training a translation model, with the input sentence "猫追逐激光点", and the target output "The cat chases the laser dot." The following are the key steps in parameter differentiation:

1. Initial Chaotic State
- W_Q, W_K, W_V matrices are all randomly initialized
- At this point, the Q vector of "追逐" (chase) may have no correlation with the K vector of "激光点" (laser dot)

2. First Forward Propagation
- When calculating attention weights, "追逐" (chase) fails to associate with "激光点" (laser dot)
- This leads to an incorrect translation (such as outputting "The cat eats the laser")

3. Error Signal Feedback
The loss function calculates two key gradients:
- Content missing gradient: Need to strengthen the action association "追逐→chases"
- Object mismatch gradient: Need to establish the verb-object relationship between "追逐" (chase) and "激光点" (laser dot)

4. Parameter Differentiation Begins
- W_Q matrix receives the signal: Make the Q vector of verbs more attentive to action target features
- W_K matrix receives the signal: Strengthen the acted-upon object attributes in noun K vectors
- W_V matrix receives the signal: Preserve details such as mobility in nouns

🔥 Key Mechanism: The same error signal propagates through different computational paths, causing the update directions of the three matrices to differentiate.

II. Mathematical Principles of Parameter Differentiation

By breaking down the attention calculation process, we can see how gradients guide division of labor:

Attention Weight Calculation Paths

- Gradients for W_Q:
Mainly come from the similarity calculation between the Q of the current token and the K of contextual tokens, forcing W_Q to learn how to generate effective query features
(Example: Making the Q vector of a verb contain potential features like "needs to be paired with an object (transitive verb)"; Q resembles the encoding signal for potential sentence patterns in traditional linguistics, similar to Subcat)

- Gradients for W_K:
Also come from Q-K similarity calculation, but the direction is to optimize K features to be recognizable by Q
(Example: Making the K vector of nouns contain attributes like "can serve as an object of action (object noun)")

- Gradients for W_V:
Come from the final weighted sum, requiring V to retain sufficient information
(Example: The V vector of "激光点" (laser dot) needs to include details like "small, bright, movable")

Four Steps of Weight Calculation:

1. Q-K Dot Product: Measure relevance.
2. Scaling: Prevent gradient explosion.
3. Softmax: Normalize into probability weights.
4. Weighted Sum: Generate contextualized representations.
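The four steps above can be sketched in a few lines of NumPy for a single attention head. This is a minimal illustration with made-up shapes and variable names, not any particular library's API:

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head self-attention over a token matrix X (n_tokens x d_model)."""
    Q = X @ W_q                                  # queries: what each token seeks
    K = X @ W_k                                  # keys: how each token can be matched
    V = X @ W_v                                  # values: the content each token carries
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # steps 1-2: dot product + scaling
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)  # step 3: softmax per query token
    return weights @ V                           # step 4: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 tokens, embedding dim 8 (illustrative)
W_q = rng.normal(size=(8, 8))
W_k = rng.normal(size=(8, 8))
W_v = rng.normal(size=(8, 8))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (5, 8): one contextualized vector per token
```

Note that the output has the same shape as the input: each token's new representation is a weighted blend of all tokens' values.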

III. Structural Guarantees for Stable Division of Labor

Beyond gradient driving, model structure design also ensures that the division of labor remains consistent:

1. Isolation of Linear Transformations
- Q/K/V come from three completely independent matrix multiplications
(Unlike LSTM gating mechanisms that share parameters)
- Gradient updates for each matrix do not interfere with each other

2. Multi-Head Attention Mechanism
Using 8-64 independent attention mechanisms (multi-head attention) is like having a team of detectives investigating different directions: some focus on the timeline, others analyze character relationships, and finally, all relationship matching results are synthesized.

Different attention heads form a "division of labor":
- Head 1: W_Q¹ learns grammatical role matching
(Example: Matching the Q of a subject with the K of a predicate)
- Head 2: W_Q² learns semantic associations
(Example: Matching the Q of "bank" with the K of "interest rate")
- This multi-objective optimization forces parameters to specialize
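The "team of detectives" picture can be sketched directly: each head owns independent W_q/W_k/W_v projections, attends in its own low-dimensional subspace, and the per-head results are concatenated. Dimensions and names below are illustrative only:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, heads):
    """heads: list of (W_q, W_k, W_v) tuples. Each head projects X
    independently and attends in its own subspace; outputs are concatenated."""
    outs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        w = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outs.append(w @ V)
    return np.concatenate(outs, axis=-1)

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 16))  # 4 tokens, d_model = 16 (illustrative)
heads = [tuple(rng.normal(size=(16, 4)) for _ in range(3)) for _ in range(4)]
out = multi_head_attention(X, heads)  # 4 heads x head dim 4 -> dim 16
print(out.shape)  # (4, 16)
```

Because the heads share no parameters, gradient descent is free to push each one toward a different "query perspective", which is exactly the specialization described above.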

IV. Empirical Validation: Concretization of Parameter Division of Labor

By visualizing the parameters after training, clear patterns of division of labor can be observed:

Case Study: Related Parameters for the Verb "吃" (eat)
- W_Q Matrix:
In the Q vector of "吃" (eat), high-weight dimensions correspond to features like "edible," "concrete object," etc.
- W_K Matrix:
In the K vector of "苹果" (apple), high-weight dimensions correspond to attributes like "food category," "solid," etc.
- W_V Matrix:
In the V vector of "苹果" (apple), high-weight dimensions include details like "color," "texture," "nutritional components," etc.

When calculating `Q(吃)·K(苹果)` (`Q(eat)·K(apple)`), strong attention weights are generated due to high activation values on the "edibility" dimension from both parties. Meanwhile, V(apple) carries the specific information needed for output production (such as knowing it's a fruit rather than a technology company when translating to "apple").

Key Conclusion: The Wisdom of Self-Organization
The essence of parameter division of labor in Transformers is the functional specialization that naturally evolves under the constraints of a unified objective function. The system does not need to preset division of labor details but spontaneously forms an efficient information processing system through repeated "trial-error-feedback" cycles with massive data. This self-organizing process driven by error is the source of the powerful representation capabilities of deep learning models.

[Addendum] A Deeper Interpretation of Q/K/V Relationships

Relationship Between Q and K
- Q is a specific perspective or projection of the K space
- Just like a book can be retrieved from different angles:
- Q1: Subject classification (K1: Literature/Technology/History)
- Q2: Difficulty level (K2: Beginner/Advanced/Professional)
- Q3: Writing style (K3: Theoretical/Practical/Case-based)

This is because Q "actively" seeks certain features associated with other tokens, while K is "passively" prepared to be matched by other tokens. K is like an index that needs to summarize all the main features of a token, but Q focuses on querying a specific feature.

This makes understanding multi-head attention more intuitive:

```
Each head learns a different projection perspective
Q1 = token * W_q1 # May focus on thematic relevance
Q2 = token * W_q2 # May focus on grammatical relationships
Q3 = token * W_q3 # May focus on semantic roles
```

It's like different facets of a high-dimensional space:
- Each attention head learns a specific "query perspective"
- These perspectives collectively build a complete picture of inter-token relationships

Division of Labor Between K and V
- K: Information's "retrieval representation"
- Contains various ontological features that might be queried
- Similar to a multidimensional tagging system for books
- V: Information's "content representation"
- Contains information that actually needs to be utilized
- Like the specific content of a book's text

A Concrete Example
Using the word "驾驶" (driving) as an example:

Different perspectives that multi-head attention might learn:
- Q1: Seeking action tools (highly relevant to "汽车" (car))
- Q2: Seeking action subjects (highly relevant to "司机" (driver))
- Q3: Seeking action modifiers (relevant to "快" (fast), "稳" (stable), etc.)

This understanding effectively explains:
1. Why Q/K separation is necessary
2. Why multi-head QKV mechanisms are needed
3. How the model automatically learns different types of contextual relationships

Continuity Between V and Token Representation
A token's V (Value) is most related to the token's initial embedding, as both represent the content and meaning of this token.
- Initial embedding: Represents the general meaning of the token learned in large-scale embedding training in advance, similar to looking up a dictionary
- Value vector: Can be seen as a continuation and update of this initial representation in a specific context

In other words:
1. Embedding is the "basic dictionary definition" of a token
2. Value is the "specific expression" of this definition in a particular context

Evolution of Token Representation in the Model

As information flows through multiple network layers:

Initial embedding → Layer 1 Token → Layer 2 Token → ... → Final representation

During this process:
- Each layer's token representation carries increasingly rich contextual information
- While maintaining continuity with the original token meaning (residual connections can compensate if continuity degradation is a concern)
- This evolution is gradual, not disruptive
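The residual connection mentioned above can be shown with toy numbers (purely illustrative): the contextual adjustment is added onto the previous representation rather than replacing it, so the updated vector stays close to the token's original direction.

```python
import numpy as np

def layer_update(token_repr, sublayer_out):
    """Residual connection: add contextual information on top of the
    existing representation instead of overwriting it."""
    return token_repr + sublayer_out

x0 = np.array([1.0, 0.0, 0.0])      # "dictionary" embedding (illustrative)
ctx = np.array([0.1, 0.2, -0.05])   # small contextual adjustment from attention
x1 = layer_update(x0, ctx)

# cosine similarity between updated and original representation
cos = x1 @ x0 / (np.linalg.norm(x1) * np.linalg.norm(x0))
print(cos)  # close to 1: continuity with the original meaning is preserved
```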

Essential Differences Between Q/K and V
- Q and K primarily serve the goal of "establishing relationships"
- Q and K extract query features and index features for matching
- Q and K are naturally more abstract and general than V
- V directly carries "concrete content"
- Contains actual information that the token needs to convey
- More specific, more detailed

Figuratively speaking:
- Q/K is like the retrieval system in a library
- V is like the actual content of books on the shelves

Conclusion: The Deep Wisdom of the QKV Mechanism

From the perspective of the entire model:
1. Initial embeddings enter the first layer
2. Each layer updates the next layer's token representation through attention mechanisms and feed-forward networks
3. The final layer's representation encompasses all contextual relationships and meanings, directly empowering the output

The QKV division of labor in self-attention mechanisms, seemingly simple yet embedding profound information processing philosophy: through carefully designed computational flows and gradient paths, the model naturally develops functional differentiation during the optimization process. This design philosophy of "emergent intelligence" has become a core paradigm in modern artificial intelligence.

It is precisely this capability for self-organization and self-evolution that enables Transformer models to capture complex and variable relationship patterns in language, laying the foundation for the powerful capabilities of large language models.

 

【相关】

Linguists Should Find Self-Attention Intuitively Familiar

Written for my linguistics and symbolic NLP peers — a reflection on my journey to leverage computational linguistics in understanding modern AI LLMs.

Breaking Through the Jargon Barrier

For linguists bewildered by large language models (LLMs), the confusion often stems from terminology and implementation details obscuring shared foundational principles. Let’s cut through the noise and focus on self-attention — the beating heart of the Transformer architecture.

As a computational linguist and lifelong NLP practitioner, I’ve spent years dissecting symbolic grammars and, more recently, tracking the rise of LLMs. Here’s my attempt to "translate" the core design of multi-head Query-Key-Value (QKV) mechanisms into a framework linguists already know.

QKV: A Linguistic Reinterpretation

Query as Subcategorization (Subcat)
First, I would like to point out, Query mirrors Subcat in symbolic grammar: the slots a head word "digs" for its dependents. Take a transitive verb (vt) as an example: it creates two syntactic "slots"—a noun subject (pre-verbal) and a noun object (post-verbal). Similarly, the predicate eat defines two semantic slots: an animate agent (e.g., animal) and an edible patient (e.g., food). These constraints — syntactic roles and semantic selection restrictions — are bread-and-butter concepts for linguists.

Key as Ontological Features
Key represents ontological attributes: nounhood, animacy, action, state, time, descriptive, etc.  Value is the filler—the "carrot" that occupies a slot. When I first read Attention is all you need, the QKV triad felt alien. No one explained that this was just dynamic slot-filling.

Why LLMs "Get" Language

LLMs thrive because their "slots" and "fillers" align perfectly across linguistic hierarchies. Every token carries QKV information because every word can both be a seeker (Query) and a target (Key/Value). When a Query (e.g., eat) finds a compatible Key (e.g., apple), their dot product sparks a high attention weight. The Value (the token’s semantic essence) is then passed forward, blending into the next layer’s representation of the token.

Contextual "Polygamy"
Tokens in the context window engage in group marriage, not monogamy. Each token 'flirts' with all others via Query-Key dot products. Relationships vary in intensity (weights), and the resulting "offspring"—the next layer’s tokens—inherit traits from multiple "parents" through weighted summation. Stronger relationships dominate; weaker ones fade. This crazy yet efficient "breeding" compresses linguistic structure into dense vector spaces, a process conceptually equivalent to parsing, understanding, and generation in one unified mechanism.

The Database Analogy (and Why It 'Misled' Us)

QKV borrows terms from database systems (Query for search, Key-Value for retrieval), but early attempts to map this to linguistics fell flat. We thought: "Databases? That’s just dictionary lookups — isn't it already handled by embeddings?!" The breakthrough came when we realized: Self-attention isn’t static retrieval—it’s dynamic, context-aware slot-filling.

For decades, we built bottom-up parsers using Subcat frames. Transformer layers do the same, but with vectors instead of symbolic representations. See the two slides I made over three years ago, when the GPT-3 playground launched, comparing the parallel architectures and approaches of the two schools of AI: the grammar school and the neural network school. Symbolic grammars, despite their transparency, pale in scalability:
- Granularity: LLMs leverage vectors of hundreds or thousands of dimensions; we relied on at most a few thousand one-hot features.
- Generalization: Transformers parse text, audio, video—any modality. Symbolic grammars, at best, aspire to universal grammar across languages.

A Convergence of Paths

My colleague Lü Zhengdong once mapped the evolution of attention: 
Seq2Seq (Google Brain) → Auto-alignment (Mila) → Transformer (Google) → Pre-trained LMs → LLMs (OpenAI)...

To this, I chuckled: "You pioneers see the trajectory clearly. But for us symbolic refugees, diving into "Attention is All You Need" felt like drinking from a firehose." Lacking that historical context, we were overwhelmed by the concepts—until one day, it clicked: Subcat-driven parsing and self-attention are two sides of the same coin.

Symbolic methods are obsolete, yes—clunky, rigid, and modality-bound, their one real merit being the full transparency of symbolic logic. Yet understanding their parallels to Transformers suddenly made LLMs feel familiar. The difference? Scale and ambition. Linguists seek cross-linguistic universals; AI aims for cross-modal universals.

Postscript: Simplifying the Transformer

The original Transformer paper ("Attention is All You Need") is not an easy read, bogged down by encoder-decoder specifics for machine translation. Strip away the noise, and the core is simple:
1. Self-attention layers (dynamic slot-filling).
2. Feedforward networks (nonlinear transformations).

GPT’s decoder-only architecture reveals the essence: next-token prediction (NTP) is the key to general intelligence. The so-called "decoder" isn’t just about decoding or generation—it’s also analysis and understanding fused into one stream.
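Under those assumptions, a single decoder-only block reduces to a few lines. The sketch below uses one head, random weights, and omits residual connections and layer normalization; it is a conceptual illustration, not a faithful implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def decoder_block(X, Wq, Wk, Wv, W1, W2):
    """One block: causal self-attention (dynamic slot-filling) + feedforward."""
    n, d = X.shape
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Causal mask: a token may only attend to itself and earlier tokens,
    # which is what makes next-token prediction possible
    scores[np.triu_indices(n, k=1)] = -np.inf
    H = softmax(scores) @ V
    # Feedforward: position-wise nonlinear transformation
    return np.maximum(H @ W1, 0) @ W2

rng = np.random.default_rng(1)
d = 8
X = rng.normal(size=(5, d))           # 5 tokens, 8-dim toy embeddings
params = [rng.normal(size=(d, d)) for _ in range(5)]
out = decoder_block(X, *params)       # same shape, deeper representation
```

Stacking such blocks, layer after layer, is the whole backbone: analysis and generation fused into one stream.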

Closing Thoughts

Dr. Bai Shuo once remarked:

Language processing demands a unified ‘currency’—a mechanism to reconcile syntax, semantics, pragmatics, and world knowledge. Only neural networks (imperfect as they are) have managed to achieve this, probabilistically. Attention is that currency.

He’s right.  Attention isn’t just a tool—it’s the universal metric we’ve sought all along.

 

【相关】

语言学家应该很容易理解自注意力机制

作为计算语言学家和NLP老司机,本篇是写给我的语言学老师和同学,以及符号NLP同行的,分享自己的心路历程,希望对大家有所启发。

如果语言学家以前觉得大模型烧脑和不解,那是因为窗户纸没有捅破,存在不同体系下的术语与机制细节的理解障碍,但底层逻辑并不乏不少共同的东西。本篇聚焦在大模型Transformer架构中的核心注意力机制的理解。

我出身计算语言学,这几年又一直在追踪大模型,可以“翻译”一下自注意力机制的核心设计多头的 QKV。

我们做符号文法的 早就该知道 Query 就是 Subcat,主导词为潜在结构“挖坑”用的,例如及物动词 (vt)就挖了两个坑:一个「名词」主语,一个「名词」宾语。主语、宾语是句法结构的角色要求,「名词」是对所要求对象的限制条件(更细致地,Subcat 还有其他限制条件,例如主语在vt前,宾语在 vt后,等)。具体到概念谓词“eat”,逻辑语义上也相应地挖了两个坑:一个是「动物」施事,一个是「食物」受事。「动物」(包括「人」)是逻辑语义结构中对施事角色的语义限制条件,「食物」是逻辑语义结构中对受事角色的语义限制条件。这些都是我们语言学家践行多年、耳熟能详的概念体系。

Key 就是本体特征, 例如,名词、物体、食物、动作、状态、修饰、时间等,Value 就是填坑的“萝卜”。可惜,初读论文「Attention is all you need」 ,被 QKV弄得晕头转向的时候,没有人指点迷津。

为什么LLM大模型吃透了语言,说话那么顺溜,原来各个层级的坑与萝卜,都是那么相配,天赐良缘。为什么每一个单词都有QKV信息,道理也简单,每一个词都可能在“求偶”,追求别人,也都可能被追求。追与被追发现非常谈得来的时候, QK相配,注意力权重大,于是结合,就是把身子(Value)献上;然后生子 ,就是创造下一层的 Token 表示。

有意思的是,上下文窗口里的 Tokens 是群婚制,不是一夫一妻制生孩子。一个 Token 与周围所有 tokens 谈恋爱 q k(i)点积,其他tokens(包括该 Token自己)都与该Token 有一腿,但关系强度不同(谈吹的tokens,权重为0)。该 Token与这种多边关系“杂交”生出的孩子是怎样的呢?加权求和。就是说孩子继承了母亲的很多特征,同时也继承了周围众父亲的特征。谁的特征在子代最彰显或较弱,完全决定于交情的深浅、关系的强度。每个token都是这样推陈出新,一代一代传下去。最后发现,这种群婚制对于信息压缩(理解、解析、生成)特别有效。真有意思。

QKV这些概念显然是从数据库技术拿来的,对于软件界一点都不陌生。但当年理解注意力机制的时候,还是遭遇很大的困惑,想不明白 语言解析与数据库啥关系。当时觉得扯上数据库,最多就是查词典相关,而查词典早就在词嵌入(embedding)编码阶段就完成了。自注意力的核心不是静态的查词典,而是动态的上下文理解。 当年因为想不明白为什么要套用数据库的信息查询和匹配的说法,只能把注意力机制泛泛理解为关注上下文。这也没错,但看不到 insights,也理解不了为什么这东西这么厉害,成就了大模型的超人语言能力。

经过很久终于豁然开朗:原来我们做了一辈子的 subcat-based bottom up parsing,跟 attention+feedforward 做的完全是一回事,一个原理,殊途同归(见下图:这是我在大模型GPT3 playground刚冒头时候做的一个对比,illustrate 当年意识到两条路线殊途同归时候的震撼)。只不过我们符号文法处理完全打不过它。一个是颗粒度没法比,人家的家底是几百上千维度的实数向量,加上各种投射以及非线性转换,我们才有最多几千个强行结构化的符号特征(one hot features)。另外,人家的泛化可以从文字解析,推广到任何模态信号的压缩和理解,我们的文法不具有任何的模态可推广性、可移植性,最多只是跨过了语言壁垒,文法追求可以适用于任何语言的 universal grammar。

我的主流弄潮儿的一位老友吕正东说:

在attention 的发展史上(so far),我看到多次颠覆式的创新, 从最早的 seq2seq (Google Brain) → auto alignment (Mila) → Transformer (Google again) → pre-trained language model (?) → LLM (OpenAI) → ...

我苦笑道:你是真正业内前沿,一路发展轨迹自然一目了然。你可能想象不出我们这些符号出身的人,突然被逼去研读这种经典论文(Attention is all you need)时候的困境。因为缺乏历史演进的知识垫底,一下子就被这些概念砸晕了。不知道经过多少次嘀咕、查阅,才慢慢明白:天下大势,冥冥之中,有万变不离其宗。原来,我们在符号文法摸索了一辈子的被证明最有效的谓词Subcat框架和自底而上的结构解析算法,底层逻辑与 transformer 及其自注意力机制不谋而合。 虽然符号技术过时了,也确实笨拙,除了符号逻辑的透明性外,没有多少其他长处,但现在理解深度学习大模型的原理和框架,由此骤然变得亲切了很多。只不过现在眼界开阔了,与信息论和计算理论更加靠近。(计算)语言学家一辈子的追求就是跨语言,而AI的追求更高一筹,是跨模态,无论音频、视频还是文字。

【后记】

大模型经典论文 Attention is all you need 不好读也有时代的原因,它提出的框架是直接应对RNN的短板和机器翻译的需求和验证。这就让它的核心部分被这些因素干扰而模糊了。框架看上去太复杂,encoder decoder 两大部分,还必须在encoder 到 decoder 之间做一种交叉对齐 ,但其实内核组块没有任何区别。这些对于理解 transformer 的通用性和原理,都是“噪音”。

transformer 主体简单极了,不过就是多层感知器,在每一层加了个自注意力而已。 到了GPT 发现 ntp(下一词预测)是打开通用智能的钥匙的时候,从框架角度就更简单了,decoder-only 足矣(说明:decoder 并不是名字所说的那样只做解码,它一样做分析和理解,这是由其核心组块决定的)。

老友看了我的transformer博客解说(Transformer 和注意力机制简介),说你这个太简陋了,连篇幅都比原论文短。

原文中有一些细节舍去了。
作者: 立委 (*)
日期: 2025/02/21 12:23:37

包括:

原架构是两大块:encoder + decoder

但实际上这两大块里面的组块都是完全一样的。而且,主流已经抛弃了 encoder,GPT 采用的就是 decoder-only 架构。

另外,位置编码是序列的一个因素,与处理过程解耦以后,位置编码有一套说法,怕干扰了主旨的理解,就点到为止了。

再有就是一些数学公式和实现细节,包括归一化、残差等。舍弃不影响对于 “注意力+神经网络” 这个核心主旨的理解。

所以通篇实际上就是一个理解重点:自注意力机制怎么work的,因为多层感知器是个 given,不是 transformer 的创新。

顺便一提,所谓 自注意力,国人喜欢顾名思义,以为是自己注意自己,感觉很蹊跷。其实自注意力是针对跨序列的交叉注意力而言的,自注意力是在同一层序列的上下文中注意所有的相关tokens(确实也包括它自己),是单层序列之内的事儿,这个“自”回指的不是token自己,而是token自己所在的那个窗口序列。交叉注意力说的是跨序列层的注意力,例如传统的神经机器翻译中,目标语序列针对源语序列的注意力。到了 GPT 的通用生成式AI(gen-AI)主流,没有跨序列的必要了,因为所有的 input 序列 和 output 序列,都被自回归“挤压”到同一层的序列去了。仅有的 output 就是 next token,其余一切tokens都连成一串了作为input的条件:everything is ntp。

以“中译英:我爱你” 的机器翻译为例,GPT自回归生成的序列是这样的:

Input                  Output
中译英:我爱你         I
中译英:我爱你 I       love
中译英:我爱你 I love  you
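上表的接龙过程可以用如下玩具代码示意(next_token 是一个假想的预测函数,这里用查表代替真实模型,仅为演示自回归机制):

```python
def next_token(context: str) -> str:
    # 仅作演示的“模型”:按上文查表返回下一个词元
    canned = {
        "中译英:我爱你": "I",
        "中译英:我爱你 I": "love",
        "中译英:我爱你 I love": "you",
    }
    return canned.get(context, "<eos>")

context = "中译英:我爱你"
output = []
while True:
    tok = next_token(context)
    if tok == "<eos>":
        break
    output.append(tok)
    context = context + " " + tok  # 自回归:把新生成的词元并入上文

print(" ".join(output))  # I love you
```

可以看到,input 和 output 被“挤压”进了同一条序列:每一步的 output 只有 next token,其余一切都是 input 条件。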

屠可伟老师的研究进一步验证了 parsing 与 transformer 的可对齐性:

李老师,关于transformer自注意力机制和语言学的关系,我们前年有个工作,之前也和您提过:我们发现transformer自注意力机制与概率依存句法模型的近似推理计算图极为相似,局部几乎一模一样: Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation

白硕老师说:

我对这个问题的观点:

1、语言处理的符号主义本身并没有一个基于第一性原理的强有力理论体系,最牛的乔姆斯基也没做到。

2、语言处理的完整方案必须包含一个词法、句法、语义、语用、常识、事理、逻辑各方面“角力”因素能够以可以“统一度量衡”的表达-竞争-筛选机制,这点,目前只有多层神经网络可以做到,虽然只是“概率性的”。

3、语言处理和知识处理的共性是滑动上下文内的key-value填充,也就是我们俗称的“哪些萝卜填哪些坑”,这个共性的需求,被一个共性的机制——注意力机制在很大程度上解决了。再单独、分别做语言层面的填充机制(什么成分做什么角色)或是知识层面的填充机制(什么槽位取什么值)已经失去意义。要么不做,要么统一做而且比注意力机制做得更好。没有其他的出路。

白老师所言极是。白老师说的“统一的度量衡”就是自注意力。

 

【相关】

Introduction to Transformer and Its Attention Mechanism

The Transformer architecture and its attention mechanism form the foundation of mainstream GPT large language models, making them extraordinarily important. Despite the abundance of explanations and popular science articles on this topic, many friends tell me they still find it bewildering or only partially understand it. Therefore, I've decided to write a couple of blogs to contribute my understanding.

As someone curious about mainstream AI, you've likely heard of the renowned Transformer framework and its "attention mechanism" that powers large language models, perhaps considering them mysterious concepts. You may have read the classic paper "Attention is All You Need," but still found it confusing or difficult to decode. Don't worry—this is completely normal, and most of us have gone through this stage! While the paper may be a bit mind-bending, its core logic isn't actually that complex.

To understand the Transformer architecture in AI large language models (LLMs), we need to break down its workflow. First, we should understand how large language models work and how they're trained. Base large language models gain knowledge from data through "self-supervised learning" using multi-layer neural networks. Self-supervised learning is a special type of machine learning that uses "masking" to generate supervision signals. While supervised learning typically uses human-annotated data with output targets, self-supervised learning requires no human annotation. Instead, it masks certain data points and trains the system to predict them (like "filling blanks" or "continuing sequences"), using the masked data as the correct answer and supervision signal. Mainstream GPT models mask the next word, training the system to predict it based solely on previous context (called "next token prediction")—this is the current paradigm for generative AI.

The Complete Process from Input to Output

1. Starting with "Dictionary Lookup": Tokenization and Embedding

To understand an entire input text for next token prediction, we first need to break it down into basic units, called tokenization, which converts text into a sequence of tokens (the smallest units of text). These tokens might be complete words (like "work") or subwords (like "un+believ+able").

Tokens are symbols, and computers struggle with direct symbol manipulation—they only work well with numbers. So we need to convert tokens into numbers.

Each token is converted into a numerical representation—a multi-dimensional vector—by looking up an embedding dictionary, typically 300-1024 dimensions (imagine establishing feature representations for each word across many conceptual dimensions, such as: noun, singular, organization, finance, etc.). Embedding allows words to have computable semantic spatial relationships.

This multi-dimensional vector space acts like a "meaning space" where each token's vector defines its position. The distance between tokens across different dimensions represents their semantic distinctions. This aligns with our intuition: a word's meaning becomes apparent through comparison with other words.

These vectors aren't randomly generated but are numerically encoded representations trained on massive natural text corpora, providing the basic semantic information of tokens—their position in meaning space. For example, the vector for "bank" naturally sits closer to "money" and far from "trees." Similarly, the vector for "apple" might contain information about both "fruit" and "technology company."

Imagine trying to help a computer understand the sentence: "The cat sat on the mat."

Step one: Tokenization breaks this sentence into individual tokens: The+cat+sat+on+the+mat.

Step two: Dictionary lookup (Embedding) finds a numerical representation—a multi-dimensional vector—for each token.
"cat" -> [0.1, 0.5, -0.2, ...]
"sat" -> [-0.3, 0.8, 0.1, ...]
...

Simply put:
Tokenization breaks text into the smallest units (tokens) that computers can easily process and analyze.
Embedding converts these tokens into vectors that computers can easily calculate and combine.
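The two steps can be sketched as follows (a toy whitespace tokenizer and a random embedding table; real systems learn the table and use subword tokenizers):

```python
import numpy as np

sentence = "The cat sat on the mat"
tokens = sentence.lower().split()  # toy tokenizer: whitespace split

# Map each distinct token to an integer id
vocab = {w: i for i, w in enumerate(sorted(set(tokens)))}

rng = np.random.default_rng(42)
embedding_table = rng.normal(size=(len(vocab), 8))  # 8-d instead of 300-1024

# Embedding = a simple row lookup per token id
ids = [vocab[t] for t in tokens]
vectors = embedding_table[ids]

assert vectors.shape == (6, 8)  # six tokens, one vector each
# "the" appears twice and gets the same initial (context-free) vector
assert np.array_equal(vectors[0], vectors[4])
```

The final assertion illustrates the key point that follows: these are context-free initial representations; disambiguating them is the attention layers' job.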

Key point: The vectors obtained from the embedding dictionary are only the "initial meaning representations" of tokens, without considering their specific context. Decoding contextual meaning from vector representations is the task of the next steps, using the multi-layer neural networks + attention mechanism in the Transformer architecture.

The core modules of Transformer can be broken down into two parts:
1. Attention mechanism: Used to calculate the relevance between tokens and dynamically update token representations.
2. Neural network: Used to process information transformation between tokens.

The entire Transformer is stacked with multiple such blocks for transformation, and with each attention layer recalculating token representations, deepening understanding progressively.

2. Attention Takes the Stage: Updating Word Meanings Based on Context

Now we have a sequence of vectors, each representing the "initial meaning" of a token. But here's the problem: the same word can have different meanings in different contexts! For instance, "bank" can mean a financial institution or a riverbank.

The core of the Transformer architecture is the attention mechanism (self-attention), which serves to dynamically adjust the representation of each token based on context, reflecting its relationships with other tokens.

For example: In the sentence "I like to eat apples," "apple" and "eat" are highly correlated, so the model will rely more on the word "eat" to update the meaning of "apple," determining that "apple" here refers to fruit rather than a company.

How is this done?

The model calculates attention weights between each token and other tokens through QKV attention:
- Query: querying vector of the current token (e.g., "he")
- Key: key vectors of contextual tokens (e.g., "police," "witness")
- Value: The actual meaning after association

For example, through matrix operations, the model discovers that "he" is most strongly associated with "witness," so it updates the vector for "he" to carry information from "witness."

Calculating "relevance": For each token, we calculate its "relevance" with all other tokens in the sentence, assigning different "attention weights" (attention scores) to different tokens. This "relevance" can be understood as: how important are other tokens when understanding the meaning of the current token.
* For example, when understanding the word "sat," "cat" and "mat" are obviously more important than "the."

Weighted average: Based on the calculated "relevance" (i.e., token weights), we take a weighted average of the V vectors from all tokens in the context to obtain a new vector representation for this token. This new vector is the meaning representation of the current token in this specific sentence.
For instance, the new vector for "sat" will be more influenced by the vectors of "cat" and "mat," and less by the vector of "the."
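As a toy illustration of that weighted average (the relevance scores below are hand-picked for the example, not learned):

```python
import numpy as np

# Toy 2-d value vectors for the context tokens of "sat"
values = {
    "The": np.array([0.0, 0.1]),
    "cat": np.array([0.9, 0.2]),
    "mat": np.array([0.7, -0.4]),
}
# Hand-set relevance of each token w.r.t. "sat":
# "cat" and "mat" matter more than "The"
scores = np.array([0.1, 2.0, 1.5])

weights = np.exp(scores) / np.exp(scores).sum()  # softmax
new_sat = sum(w * v for w, v in zip(weights, values.values()))
# new_sat is dominated by the "cat" and "mat" vectors, as described above
```

The new vector for "sat" is thus a context-dependent blend, weighted by relevance.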

Key point: The attention mechanism dynamically updates the meaning of each token by calculating the relevance between tokens. This update is context-based—the same token will have different representations in different sentences.

This way, each token's meaning is no longer fixed but changes dynamically based on the entire sentence's context. For example, in "I saw a bat," "bat" could refer to either a flying mammal or a sports implement, but the attention mechanism will combine the bigger context to infer its more appropriate meaning.

For details on how QKV works in the attention mechanism, please refer to the companion article "How to Understand QKV Division of Labor in Self-Attention Mechanism?"

3. The Transformer Backbone: Multi-layer Progressive Information Compression

The core building blocks of Transformer can be broken down into two parts:
Multi-head attention layer: Used to calculate relevance between tokens and dynamically update token representations.
Feed-forward neural network layer: Further process and transform information (compression, abstraction)

The entire Transformer consists of multiple such modules stacked together, with each layer recalculating token representations for deeper understanding. Depending on the number of blocks, the Transformer repeatedly performs this update process. Like humans pondering a text multiple times, each layer deepens the understanding of the text. Deeper layers may capture more complex semantic relationships.

Each Transformer block iteratively upgrades understanding, for example:
- Bottom layers: Capture local grammar (such as the contrasting relationship in "not...but...")
- Middle layers: Understand "who 'he' actually refers to"
- Top layers: Grasp the main theme of the entire text

The main features of Transformer
1. Parallel computation: Word order is decoupled from token processing, allowing parallel processing of all tokens (in contrast to the linear inefficiency of previous RNNs)
2. Hierarchical understanding: Progressive interpretation from literal meaning to deep intention, capturing patterns both large and small.

4. Output: The Model's Final Prediction

Transformer models can be used for various tasks. Different tasks have different forms of output.

GPT: Next Token Prediction
For mainstream GPT models, their ultimate task is to predict what comes next through "autoregressive" next token prediction (autoregression is the dynamic extension of previous context, recursively implementing word-by-word continuation). The model decides what content should logically follow based on the deeply understood context. This opened the path to general AI, as sequence learning has mastered the "code" for converting inputs to outputs for general tasks, but that's a topic for another article.

5. Summary

Tokenization and Embedding lay the foundation for computers to understand text, similar to looking up a dictionary.
Attention mechanism calculates relevance between tokens and dynamically updates token representations.
Transformer consists of neural network layers + attention layers, optimizing token representations layer by layer, covering various relationships at different levels.
The final output depends on the task. Translation models generate target language text. GPT is responsible for predicting the next token, ultimately evolving this simple prediction mechanism into a general-purpose large model capable of unlocking various tasks.

 

【相关】

Has Symbolism Been Sidelined for Too Long? Could Neural LLMs Be the Terminators of AI?

Though no one can predict the future, and though abandoning one of the two paths feels politically incorrect, we cannot rule out the possibility of such unipolar dominance.

As is widely known, AI has always been marked by the competition between two schools: symbolic rationalism and data-driven empiricism. Their fortunes have waxed and waned throughout history, but over the past 30+ years, the pendulum has shown no sign of swinging back toward symbolism.

Why?

The ongoing contemporary history of large language models is fascinating. Each time challenges and obstacles arise, the mainstream paradigm overcomes them from within. Whether this will continue remains to be seen, but the trend seems likely to persist.

When large language models (LLM) first emerged, people marveled at their capabilities. But soon, critiques arose: their simple "next token prediction" (NTP) objective and the statistical nature of their probabilistic models led many to conclude they were merely advanced statistical tools, like large parrots—lacking true "understanding."

Ilya Sutskever and Geoffrey Hinton had to step in repeatedly to explain: "Do not underestimate next token prediction. This is no mere statistical n-gram model from the past. It abstracts a system of understanding that integrates human knowledge. When next-token prediction grows increasingly accurate, deep comprehension of context becomes indispensable." Such explanations struggled to convince skeptics. Later, Ilya invoked Kolmogorov complexity as a theoretical foundation, but this framework remains esoteric and inaccessible to most audiences—even many PhDs and professors view it with bemused skepticism. Yet, no better explanation exists.

What ultimately dissolved the "statistical parlor trick" critique was firsthand experience. Users interacting with LLMs realized: these systems seem to genuinely understand. No matter how you phrase your queries, in any language, with nuance or subtext, large models grasp meaning more reliably than most humans.

With the "understanding" debate fading, critics shifted focus: "LLMs cannot reason."

As recently as last year, Yann LeCun cited this as one of his core arguments against the mainstream GPT-style LLM paradigm (advocating instead for vision-based world models as the true path). Many relished pointing out flaws—like LLMs failing at elementary arithmetic such as multi-digit multiplication.

But this critique no longer holds. With the advent of reasoning models like OpenAI’s "o-series" and DeepSeek’s "r-series," accusations of "no reasoning ability" have collapsed. Hardliners may still dismiss probabilistic reasoning as unstable, lacking the rigor of symbolic logic. Yet users deploying these models for mathematics, coding, or project planning overwhelmingly report breakthroughs. Large-model reasoning now rivals or surpasses human experts, approaching master’s or doctoral proficiency. Coding capabilities already exceed those of average engineers. This is just the beginning. It is quite plausible that within a year or two, reasoning models could dominate Olympiad-level math or competitive programming.

Once again, barriers were breached through internal innovation—this time after large-model pretraining neared its limits. The core framework remains unchanged, though: reinforcement learning still relies on NTP for chain-of-thought (CoT) generation; reasoning models remain probabilistic. Symbolic AI contributed nothing. Symbols remain confined to input/output interfaces—even the "inner monologue" of CoT manifests as output tokens.

The sheer creative potential within this paradigm is staggering. Those of us from symbolic AI backgrounds once naively imagined that when neural approaches hit walls, our logic-and-grammar toolkit would ride to the rescue. Hybrid neuro-symbolic fantasies danced in our minds.

Zooming out, modern large models evolved from earlier statistical frameworks, with neural networks as a tributary. When those statistical models hit ceilings, breakthroughs came from within—via deep learning. Symbolism played no role.

A profound question arises: Why has the theoretically appealing vision of hybrid neuro-symbolic synergy remained an impractical or unnecessary dream?

Two possibilities stand out.

First, the data-driven empiricist approach possesses far greater resilience and potential than we imagined.

This hints at deeper truths. Artificial neural networks, inspired by biological brains, had languished for decades until the deep learning revolution. Over the past decade, their human-like (or superhuman) performances have forced us to confront a possibility: perhaps this is indeed how intelligence works. If artificial systems achieve human-level cognition through mechanisms mirroring our own biology—despite neuroscientists’ caveats about our limited brain knowledge—this alignment would powerfully validate the neural paradigm. Symbolic logic and statistical feature engineering, by contrast, are alien to biological cognition. Their limitations may stem from this fundamental mismatch. One might even argue that high-dimensional vector spaces in LLMs—where multimodal signals are embedded within neural frameworks—encode a "language of God," or the essence of universal information. Symbols, then, are mere human-imposed constructs, sensory accommodations divorced from reality’s substrate.

Second, when a paradigm harbors untapped potential, progress demands sufficient talent density to exploit it.

AI uniquely concentrates genius. Countless brilliant minds flock to this field, creating an intellectual critical mass unmatched in most domains.

With these conditions in play, we must never underestimate the internal momentum to break through barriers. AGI (Artificial General Intelligence) believers, via their "insane" grind, keep delivering results. Could they indeed be AI’s ultimate Terminators?

 

Addendum: Even if "reduced" to an auxiliary tool, symbolic logic may retain an irreplaceable complementary role in intelligence

yanyongxin:

What distinguishes humans from other animals is our evolved reasoning capacity. Though rooted in neurons, this ability represents a qualitative leap beyond mere "instinctive reactions." It abstracts object relationships, enabling multi-step reasoning that can be transmitted and memorized through linguistic symbol chains. Reasoning is inherently discrete—thus symbolizable—as a simulated system built atop neural architecture. This simulation likely requires structural differences in human neural systems compared to other animals.

The most striking contrast between reasoning systems and primal neural cognition lies in sustained deliberation. Unlike "muscle memory" or intuition, human reasoning varies dramatically. During my university years, I observed students who excelled at quick problem-solving yet froze when faced with complexity. Today's LLMs approximate the reasoning level of humanities undergraduates, but still lag behind trained STEM specialists—particularly in mathematics and physics. The essence of STEM disciplines lies in rigorously symbolizing real-world problems. Simulating such precision within biological neural systems demands rare opportunities (elite education), prolonged training, and specific neurostructural advantages ("talent"), all channeled through disciplined formalization. Yet achieving this capability bridges biology with mechanical rigor—enabling interfaces with tools like Mathematica.

This charts AI's next frontier: building superior logical simulation systems atop neural frameworks until seamless integration with pure symbolic tools is achieved. The brain's logical simulation system remains energy-intensive, error-prone, and costly to develop. Its key advantage? Seamless integration with underlying neural processes.

Li Wei: Well said.

Interfacing with symbolic systems manifests as tool use. For instance, when confronting complex math problems, instead of forcing probabilistic reasoning through natural-language chain-of-thought (CoT), LLMs should just generate code properly to invoke Mathematica. This tool-use capability is now defined as a fundamental trait of LLM-native agents—yet another innovation emerging from within the paradigm.
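As a schematic of that routing idea (the function names and the trivial "tool" below are hypothetical stand-ins; real agents invoke external engines like Mathematica), arithmetic is dispatched to an exact evaluator instead of being "reasoned" probabilistically:

```python
import ast
import operator as op

SAFE_OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Pow: op.pow}

def symbolic_eval(expr: str) -> float:
    """Stand-in for an exact symbolic tool: deterministic, never 'hallucinates'."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp):
            return SAFE_OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def answer(question: str) -> str:
    # Agent routing: arithmetic goes to the exact tool, not to probabilistic CoT
    if question.startswith("compute:"):
        return str(symbolic_eval(question.removeprefix("compute:")))
    return "LLM free-form answer (probabilistic)"

print(answer("compute:123456789*987654321"))  # exact, digit for digit
```

The LLM's job shrinks to deciding *when* to call the tool and generating the call; the rigor comes from the symbolic side of the interface.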

Thus, we see a clear evolutionary trajectory:

1. Traditional Statistical Models ("Artificial Idiocy"):
Failure: Little natural language understanding 
Solution: LLMs (e.g., ChatGPT)

2. Pretrained LLM: 
Failure: Lacking reasoning ability 
Solution: Reasoning-reinforced LLMs (e.g., OpenAI’s o1, DeepSeek’s r1)

3. Reasoning LLM:
Failure: Insufficient symbolic rigor 
Solution: LLM Agents (symbolic tool integration)

yanyongxin:

Traditional statistical models earned their "artificial idiocy" label because their parameter spaces and data structures proved inadequate to host the world models required for true language understanding.

 

【相关】

符号主义被打入冷宫太久了,难道神经是AI的终结者吗?

虽然没人可以预测未来,虽然抛弃两条路线的一条感觉政治不正确,但的确不能排除这种单极主义的可能性。

众所周知,AI自从诞生,就有符号理性主义和数据经验主义的两条路线竞争,以及此伏彼起来回震荡的历史轨迹,直到30年前,钟摆就再也没有回落到符号主义一边的迹象。

这是为什么?

看大模型的当代史(still on-going)很有意思。每一次遇到挑战和障碍,都是主流自己跨过去的。将来是不是还是如此,可以观望,但感觉大概率还会持续。

大模型刚问世,大家惊异于其表现,但很快就从它的简单的“下一词预测”(ntp,next token prediction)的目标函数以及它概率模型的统计特性,得出一个结论:这不过是高级的统计模型,大号鹦鹉而已,它并没有理解。

伊利亚和辛顿不得不站出来一再解释:不要小看 next token prediction,它绝不是以前统计时代的 ngrams,而是抽象了人类各种知识的理解系统。当下一词预测越来越准的时候,没有对上文的深入理解是不可想象的。这种解释并不容易说服人。后来伊利亚追溯到了 K氏复杂性理论,似乎有了拿得上台面的理论支撑。但这一套说法对于绝大多数受众,太过玄妙和高冷,甚至很多博士和教授,也一样是雾里看花,不得不半信半疑。但除此之外,其实也没有更好的理论解释。

最后解开这个质问心结(“大模型根本就没有真正理解,一切不过是统计”)的是无数人与大模型交互的切身体验。 自己的体验不会骗自己: 每一个玩过大模型的人现在都意识到,大模型真能听懂自己的话,无论你怎么说,无论你用什么语言,甚至你话语背后的机锋和细微之处,它也比多数人理解更到位。

好,理解的问题大家搁置一边了,现在很少人还质疑大模型理解能力了。但接着就是另一个问题: 大模型不会推理。

杨立昆去年还把这一条作为自己反对主流大模型路线的根本理由(并鼓吹他的以视觉为基础的世界模型才是真正的出路)。很多人曾津津乐道的是大模型不会多位数乘法这样的小学算术。

但这条理由现在不成立了。在主流业界推出了 OpenAI的o系列和 DeepSeek的r系列等推理模型的今天,这种大模型不懂推理的指责不攻自破。

极端批判派当然可以从概率模型的不稳定的角度,继续质疑其推理不是真正的推理,不具有符号推理的严谨性和确定性。但用过推理模型做数学、代码以及做项目计划的人,绝大多数不再被此困扰了,因为很明显,大模型推理正在超越人类,现在已经修炼到硕士或博士的水平,代码能力也已经超过了大多数码农。而这只是推理模型刚刚起步上线的阶段,再给一两年,推理模型成为奥数冠军或代码冠军,都不是不可想象的事儿。

Again,跳过这个障碍,仍然来自内部,是大模型预训练几乎撞墙后的又一次内部创新。基本性质和基础未变,强化学习还是靠 ntp 去生成思维链(cot,chain-of-thought),推理模型仍然是概率模型。符号AI并没有帮任何忙。符号的作用仍然局限在 input 和 output 两端,连所谓内心独白的思维过程 cot,也是以 output 形式表现的。

这种领域流派内部的创造性和潜力,让人瞠目结舌。以前我们出身符号流派的人老一厢情愿地想着,AI主流遭遇困难了,我们逻辑派、文法派可以出手帮忙了。各种结合符号AI与神经AI的幻想,在我们的心中激荡。

如果把回顾AI历史的眼光拉得更远,我们可以看到现代大模型的前身是当年的各种统计模型,而神经基本是其中的一个支流。当年的那些统计模型遭遇种种瓶颈或天花板,最后还是在内部被深度学习突破了,从来没有符号主义任何事儿。

一个值得深思的问题是:为什么两条路线融合互补的理论美好,一直都是不可实现或不必实现的梦想呢。

可能说明了两件事。

第一是这种数据经验主义的路线,的确具有极大的韧性和潜力,远超我们想象。

这种潜力应该有更深层的原因。众所周知,人工神经网络是受到人脑神经启发而来的一种模拟,停滞多年后爆发了深度学习革命。十几年前的这场革命给我们带来了一次又一次的震撼,这些亮丽的拟人或超人的智能表现,使我们不禁在想,人脑神经也许就是这么工作的,具有同样的原理。人工智能达到或超越人类智能,机理相同应该是最有力的一个支持(虽然脑科学总有人说,我们对大脑知之甚少,貌似有一种把大脑神经的工作机理神秘化的倾向)。其他的一切手段,例如符号逻辑的公式和算法、统计模型的 feature engineering,都不是从真正意义上模拟人脑神经的工作方式,也许这就是它们注定走不远的症结所在。甚至可以说,神经框架内的各模态信号编码嵌入后的高维向量空间的确就是上帝的语言,或宇宙信息的本质,而符号不过是迁就人类感官的表象而已,或者说是人类强加给世界的一种体系。

第二,当一种路线具有超出我们想象的潜力的时候,科技进步所需要的就是足够人才密度去挖掘这种潜力。

我们知道,AI 领域具有这种聚集天才的优势。无数年轻才俊、聪明的大脑被吸引到这个领域,其天才密度超过了大多数领域。

在这样两个条件都具备的时候,我们永远不要低估它突破各种瓶颈的内部力量。那些AGI(Artificial General Intelligence,通用人工智能)的信仰者们一直在疯狂内卷,同时也的确不断在交出自己的成绩单。他们可能是AI的终结者吗?

 

【外一篇:符号逻辑“沦落”为辅助工具还是有自己不可取代的智能补足作用的】

yanyongxin:人之所以区别于其他动物,是因为进化出了推理能力。这种推理能力虽然是建立在神经元上的,但跟单纯的“本能反应”有了质的飞跃。它有了对象关系的抽象,从而可以进行多步骤推理,从而可以以对象关系链——语言的形式传播和记忆。推理本质上是一个离散过程,因此是可以符号化的,是一个在神经元体系上建立出来的模拟系统。但这种模拟的可能,很可能要求神经元系统与其他动物有结构上的差异。

推理系统与原始神经元思维的最显著差异,就是可以长时间思考。而不是“肌肉记忆”、“直觉”。不同人之间的差异很大。大学期间,注意到有些学生遇到简单问题时反应很快,面对复杂问题时,给他多少时间也没用。现在的LLM大致达到了文科生的推理水平。但还与训练有素的理科生,尤其是数学、物理的理科生有差距。数理专业的特征,就是将现实世界的问题严格的符号逻辑化。要在人的神经元系统上模拟出如此严谨的符号逻辑系统是需要特殊的机会(好大学,好老师)、长期的训练、和特殊的脑结构(“天赋”)的,要遵循良好的规则化习惯。但达到这个能力,就可以与机械化的严谨逻辑系统接口了,可以使用 Mathematica 之类了。

这就是人工智能下一步的走向:在神经元上建立更好的逻辑模拟系统,直到能与纯符号逻辑工具无缝链接。人脑中逻辑模拟系统是一个高能耗、高错误率、高构建成本的系统。它的一大好处,是能够跟底层神经元系统无缝联结。

立委:说得很好。

与符号逻辑系统接口就是 tool use,例如,遇到复杂数学问题,不要用自然语言的思维链(cot)去强行推理来影响概率性推理结论,而是LLM自己写代码去 call Mathematica。现在这种 tool use 的能力被定义为 LLM-native agent 的根本特性(之一)。所以 这种能力也还是 from within。

所以 我们看到了清晰的演进路线:

1 传统统计模型人工智障 不理解自然语言
solution: LLM e.g. ChatGPT


2. pretrained LLM 缺乏推理能力
solution: reasoning LLM e.g. o1 / r1


3. reasoning LLM 缺乏严谨的符号逻辑能力
solution: LLM agent (interfacing symbolic tools)

yanyongxin: 传统统计模型之所以人工智障 不理解自然语言,是因为没有足够大的参量空间和足够复杂的数据结构,来容纳理解语言所必须的世界模型。

 

 

 

【相关】

 

 

Transformer 和注意力机制简介

transformer架构及其注意力机制是主流GPT大模型的基础,特别特别重要。虽然这方面的科普和讲解已经汗牛充栋,但还是有很多朋友跟我说一头雾水或雾里看花。所以下决心写了三篇科普系列,试图做出我的理解贡献。

作为对主流AI有好奇心的同学,你可能早就听说过大名鼎鼎的大模型的基本框架 transformer及其“注意力机制”,觉得它们是高深莫测的概念。 你也许也读过那篇经典论文 Attention is all you need,但还是感觉云里雾里。别担心,这很正常,我们多数人都经过这个阶段!这篇论文确实有点“烧脑”,但它的核心逻辑其实并不复杂。

要理解AI大模型的Transformer架构,就需要拆解其工作流程。

首先应该了解一下大模型的工作原理和训练方式。

基础大模型通过原始大数据的“自监督学习”(self-supervised learning),利用多层神经网络,获得数据相关的知识。自监督学习是一种特别的监督学习,它利用“掩码”获得监督信号。我们知道监督学习的训练数据是标注了输出目标作为监督信号的学习,但自监督无需人类标注,而是在数据中遮盖了部分数据点,让系统学习预测它(“填空”或“接龙”),以被遮盖的数据点作为标准答案和监督信号。主流的GPT大模型的掩码就是遮盖住下一个词,让系统仅仅根据上文来预测它(叫 next token prediction),这是当前生成式AI的主流模型。

当我们输入一段文字时,模型首先要做的就是把它切分成一个个基本单位(词元),然后给每个词元找到它的"字典释义"(向量表示)。

从输入到输出的全过程

1. 从“查词典”开始:Tokenization 和 Embedding

要理解整个输入文本,首先需要拆分基本单元,叫做 tokenization(分词),即,将文本拆解成序列 tokens(词元,文本的最小单位),这些 tokens 可能是完整词(如"work")或子词(subword,如"un+believ+able")。

词元是符号,而计算机对符号难以计算,它只玩得转数字,所以需要把词元转成数字。

每个 token 都会通过查一种嵌入(embedding)词典,把词元符号转化成一个数字化表示:多维向量。每个Token被转换为300-1024维的向量(想象给每个词建立很多概念维度的特征表示,例如:名词,单数,机构,金融,......)。Embedding 让词语有了可计算的语义空间关系。

多维向量好比一个“意义”空间,每个token的多维向量定义了这个token在意义空间的位置;token与其他tokens在不同维度的距离,就是它们在意义上的区分。这符合我们的常识:一个词的意义可以在与其他词的比较中显现。

这些向量不是随机生成的,而是通过海量语料训练出来的数字化表示,提供了词元的基本语义信息,即词元在意义空间的位置——例如"银行"的向量天然接近"金钱",而与"树木"相距甚远。再如"苹果"这个词的向量,可能会包含"水果"、"科技公司"等多个方面的信息。
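“距离即意义区分”可以用余弦相似度粗略示意(下面的向量是随手设定的玩具数值,仅为演示,并非真实词向量):

```python
import numpy as np

def cos(a, b):
    # 余弦相似度:夹角越小、语义越近,值越接近 1
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 玩具向量:三个维度大致对应「金融」「自然」「实体」
bank  = np.array([0.9, 0.1, 0.5])
money = np.array([0.8, 0.0, 0.4])
trees = np.array([0.0, 0.9, 0.6])

assert cos(bank, money) > cos(bank, trees)  # “银行”离“金钱”近、离“树木”远
```

真实的嵌入空间维度高得多,但“以相对距离刻画意义”的原理相同。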

想象一下,你要让计算机理解一句话:“The cat sat on the mat”。

第一步:分词(Tokenization),先把这句话拆成一个个的 tokens:The+cat+sat+on+the+mat 。

第二步:查字典(Embedding), 给每个 token 找一个数字化表示,也就是一个多维向量。

“cat” -> [0.1, 0.5, -0.2, ...]
“sat” -> [-0.3, 0.8, 0.1, ...]
...
(注:实际向量维度更高且值为连续分布)

简单来说

Tokenization 将文本拆解成计算机容易处理分析的最小单位 token。
Embedding 把这些 token 转换成计算机容易运算组合的向量。

关键点: 嵌入词典得到的向量只是 token 的“初始意义表示”,它还没考虑这个token的具体语境。在向量表示中解码上下文意义是下面步骤的任务,用的就是transformer架构中的多层神经网络+注意力机制。

Transformer 的核心模块可以拆解为两部分:

    1. 注意力机制:用于计算 token 之间的相关性,并动态更新 token 的表示。
    2. 神经网络:用于处理 token 之间的信息转换。

整个 Transformer 由多个这样的模块堆叠而成,每一层都会重新计算 token 的表示,使得理解越来越深。

2. 注意力登场:根据上下文更新词义

现在,我们有了一串向量,每个向量代表一个 token 的“初始含义”。但问题来了,同一个词在不同语境下可能有不同的意思啊!比如,“bank” 可以是“银行”,也可以是“河岸”。

Transformer 架构的核心是 注意力机制(self-attention),其作用就是:根据上下文,动态地调整每个 token 的含义表示,反映与其他token的关系。

打个比方:在"我喜欢吃苹果"这句话里,"苹果"和"吃"的相关度很高,所以模型会更多地参考"吃"这个词来更新"苹果"的含义,从而确定这里的"苹果"指的是水果而不是公司。

怎么做呢?

模型通过QKV注意力计算每个词元与其他词元的注意力权重:
- Query:当前词元的特征向量(如"他")
- Key:上下文词元的特征向量(如"警察","目击者")
- Value:关联后的实际含义

    • 例如,通过矩阵运算,发现"他"与"目击者"关联度最高,于是更新"他"的向量,使其携带"目击者"的信息。

计算“相关度”: 对于每个 token,我们都要计算它和句子中 所有 其他 token 的“相关度”,给不同的词元分配不同的"注意力权重"(attention scores)。这个“相关度”可以理解为:在理解当前这个 token 的含义时,其他 token 有多重要。

    • 例如,在理解 "sat" 这个词时,"cat" 和 "mat" 显然比 "the" 更重要。

加权平均: 根据计算出的“相关度”(也就是词元的权重),把上下文中所有 token 的V向量 加权平均 起来,得到本token的一个新的向量表示。这个新的向量就是当前 token 在 这个特定句子 中的含义表示。

    • 比如,"sat" 的新向量会更多地受到 "cat" 和 "mat" 向量的影响,而较少受到 "the" 向量的影响。

关键点: 注意力机制通过计算 token 之间的相关度,实现了对每个 token 含义的 动态更新。这种更新是 基于上下文 的,同一个 token 在不同的句子中会有不同的表示。

这样,每个 token 的意义不再是固定的,而是会根据整个句子的上下文动态变化。例如,在 "I saw a bat" 这句话中,"bat" 可能是 "蝙蝠",也可能是 "球棒",但注意力机制会结合上下文来推测其在上下文中更合适的含义。

关于注意力机制中QKV如何分工和工作的细节,可参照姊妹篇《立委科普:如何理解自注意力机制中的QKV分工?》。

3. Transformer 主干:多层递进的信息压缩

Transformer 的核心组块可以拆解为两大部分:

    • 多头注意力层:用于计算 token 之间的相关性,并动态更新 token 的表示。
    • 前馈神经网络层:非线性特征转换,进一步压缩信息(抽象、泛化)

整个 Transformer 由多个这样的模块堆叠而成,每一层都会重新计算 token 的表示,使得理解越来越深。根据组块的多寡,Transformer 会反复进行这个更新过程。就像人类理解文章时会反复琢磨一样,每一层都在加深对文本的理解。越深的层次可能捕获到越复杂的语义关系。

每个Transformer组块都在迭代升级认知,例如:
- 底层组块:捕捉局部语法(如"not...but..."的转折关系)
- 中层:理解"他指代的真实人物"
- 高层:把握全文主旨
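"注意力层 + 前馈网络层,逐层堆叠"的信息流动骨架,大致可以这样示意(省略了多头、层归一化等细节;矩阵均为随机初始化并人为缩小,只为展示结构,不代表训练后的真实参数):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                        # 假设的表示维度
X = rng.normal(size=(5, d))  # 5 个 token 的初始表示

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(X):
    # 自注意力:相关度 -> 权重 -> 加权平均(此处省略 QKV 投影)
    weights = softmax(X @ X.T / np.sqrt(d))
    return weights @ X

def feed_forward(X, W1, W2):
    # 前馈网络:非线性特征变换(ReLU),进一步抽象、泛化
    return np.maximum(0, X @ W1) @ W2

def transformer_block(X, W1, W2):
    # 残差连接让每层的更新是"渐进式"的
    X = X + attention(X)
    X = X + feed_forward(X, W1, W2)
    return X

# 多层堆叠:每一层都重新计算所有 token 的表示
for _ in range(3):
    W1 = 0.1 * rng.normal(size=(d, 2 * d))
    W2 = 0.1 * rng.normal(size=(2 * d, d))
    X = transformer_block(X, W1, W2)
print(X.shape)  # (5, 8)
```

注意所有 token 的新表示是一次矩阵运算同时算出来的,这正是下文说的"并行处理所有 token"。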

Transformer的最大特点
1. 并行计算:词序与token处理解耦,并行处理所有token(对比此前RNN的线性低效)
2. 层次化理解:从字面含义到深层意图的渐进式解读,捕捉大大小小的规律性。

与前Transformer的RNN相比,架构优势的工程化体现

| 特性 | RNN | Transformer |
| 并行计算 | 序列依赖无法并行 | 全token并行处理 |
| 长程依赖处理 | 存在梯度衰减 | 直接全局注意力 |
| 训练效率 | O(n)时间复杂度 | O(1)层内时间复杂度 |
| 内存消耗 | 相对较低 | 随序列长度平方增长 |

4. Output:模型的最终预测

Transformer 模型可以用于各种各样的任务。不同的任务,输出(output)的形式也不同。

    • GPT:预测下一个词(Next Token Prediction):对于主流 GPT,其最终任务是预测下文,通过所谓“自回归”下一词元预测实现(自回归就是动态扩展上文,递归实现一个词一个词的接龙)。模型会根据已经深入理解的上下文,来决定接下来最合理的内容应该是什么。这条路打开了通向通用 AI 的大门,原理是序列学习学到了一种通用任务的输入转输出的“密码”,但这是另一篇科普的内容了。
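自回归"接龙"的骨架逻辑可以用几行代码示意(这里用一个虚构的查表函数 next_token 代替真实模型的 Transformer 前向计算;真实模型输出的是整个词表上的概率分布,再从中选出或采样出下一个词):

```python
def next_token(context):
    # 虚构的"模型":按上文查表返回下一个词,查不到就返回结束符
    toy_model = {
        ("The",): "cat",
        ("The", "cat"): "sat",
        ("The", "cat", "sat"): "<eos>",
    }
    return toy_model.get(tuple(context), "<eos>")

def generate(prompt, max_len=10):
    context = list(prompt)
    while len(context) < max_len:
        token = next_token(context)  # 预测下一个词
        if token == "<eos>":         # 遇到结束符停止接龙
            break
        context.append(token)        # 动态扩展上文,递归接龙
    return context

print(generate(["The"]))  # ['The', 'cat', 'sat']
```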

5. 总结

    • Tokenization 和 Embedding 给计算机理解文本打下基础,好比查了词典。
    • 注意力机制 计算 token 之间的相关性,并动态更新 token 表示。
    • Transformer 由神经网络层 + 注意力层组成,层层优化 token 表示,涵盖不同层次的各种关系。
    • 最终的 output 取决于任务:翻译模型生成目标语言文本,GPT 负责预测下一个 token。最终人们发现,这个简单的预测机制自然进化,成了能解锁各种任务的通用大模型。

 

 

【相关】

立委科普:如何理解自注意力机制中的QKV分工?

这可能是开始学习自注意力机制的同学遇到的一个不容易理解的烧脑问题。

为了学习序列上下文的依赖关系,为什么序列中的每一个 token 都要派生出三个分工角色:Q(Query),K(Key),V(Value)?

要理解为什么每一个token派生出来的Q、K、V矩阵能通过反向传播自动分工,我们需要深入模型训练的底层逻辑。这个过程可以用「蚁群分工」的生态现象来类比:虽然所有蚂蚁最初都是相似的,但通过环境反馈和任务训练,它们会自发分化为工蚁、兵蚁、繁殖蚁等不同角色。Transformer的参数分化也遵循类似的自然演化规律。

一、分工的本质驱动力:损失函数的宏观调控

假设我们要训练一个翻译模型,输入句子为 "猫追逐激光点",目标输出 "The cat chases the laser dot"。以下是参数分化的关键步骤:

1. 初始混沌状态
- W_Q、W_K、W_V矩阵均为随机初始化
- 此时"追逐"的Q向量可能与"激光点"的K向量毫无关联

2. 第一次前向传播
- 计算注意力权重时,"追逐"未能关联到"激光点"
- 导致翻译结果错误(如输出 "The cat eats the laser")

3. 误差信号反馈
损失函数计算出两个关键梯度:
- 内容缺失梯度:需要加强"追逐→chases"的动作关联
- 对象错配梯度:需要建立"追逐"与"激光点"的动宾关系

4. 参数分化开始
- W_Q矩阵收到信号:要让动词的Q向量更关注动作目标特征
- W_K矩阵收到信号:要让名词的K向量强化被作用对象属性
- W_V矩阵收到信号:需要保留名词的可移动性等细节特征

🔥 关键机制:同一误差信号通过不同的计算路径反传,导致三个矩阵的更新方向产生分化。

二、参数分化的数学原理

通过拆解注意力计算流程,可以看到梯度如何引导分工:

注意力权重计算路径

- 对W_Q的梯度:
主要来自本token的Q与上下文中的K的相似度计算,迫使W_Q学习如何生成有效的查询特征
(例:让动词的Q向量包含"需要搭配宾语(及物动词)"的潜在特征;Q很像是传统语言学中潜在句型的编码信号Subcat)

- 对W_K的梯度:
同样来自Q与K相似度计算,但方向是优化K的特征可被Q识别
(例:让名词的K向量包含"可作为动作对象(可做宾语)"的属性)

- 对W_V的梯度:
来自最终的加权求和,要求V保留足够的信息量
(例:"激光点"的V向量需要包含「小、明亮、可移动」等细节)

权重计算四部曲

1. Q-K点积计算关联度
2. 缩放防止梯度爆炸
3. Softmax归一化得到概率权重
4. 加权求和生成语境化表示
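这"四部曲"用 numpy 直写出来就是下面几行(示意性质:W_Q/W_K/W_V 在这里是随机初始化的,真实模型中它们正是靠上文所说的梯度分化训练出来的):

```python
import numpy as np

rng = np.random.default_rng(42)
n, d = 4, 8                     # 假设 4 个 token、8 维表示
X = rng.normal(size=(n, d))     # token 的输入表示

W_Q = rng.normal(size=(d, d))   # 查询投影(三个独立矩阵,梯度互不干扰)
W_K = rng.normal(size=(d, d))   # 索引投影
W_V = rng.normal(size=(d, d))   # 内容投影

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

scores = Q @ K.T                # 1. Q-K 点积计算关联度
scores /= np.sqrt(d)            # 2. 缩放,防止梯度爆炸
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)  # 3. Softmax 归一化为概率权重
output = weights @ V            # 4. 加权求和,生成语境化表示

print(weights.shape, output.shape)  # (4, 4) (4, 8)
```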

三、分工稳定的结构性保障

除了梯度驱动,模型结构设计也确保了分工不会混乱:

1. 线性变换的隔离性

- Q/K/V来自三个完全独立的矩阵乘法
(不同于共享参数的LSTM门控机制)
- 每个矩阵的梯度更新互不干扰

2. 多头注意力机制

使用8-64组独立的注意力机制(多头注意力),就像侦探团分头调查不同方向:有的关注时间线,有的分析人物关系,最后综合所有关系的匹配结果。

不同注意力头形成「分工协作」:
- 头1:W_Q¹学习语法角色匹配
(例:让主语的Q匹配谓语的K)
- 头2:W_Q²学习语义关联
(例:"银行"的Q匹配"利率"的K)
- 这种多目标优化迫使参数必须专业化

四、实例验证:参数分工的具象化

通过可视化训练后的参数,可以观察到明确的分工模式:

案例:动词"吃"的关联参数

- W_Q矩阵:
在"吃"的Q向量中,高权重维度对应「可食用」「具体物体」等特征

- W_K矩阵:
在"苹果"的K向量中,高权重维度对应「食物类」「固体」等属性

- W_V矩阵:
在"苹果"的V向量中,高权重维度包含「颜色」「口感」「营养成分」等细节

当计算 `Q(吃)·K(苹果)` 时,由于双方在「可食用性」维度上的高激活值,会产生强注意力权重。而V(苹果)则携带了制作输出时需要的具体信息(如翻译成"apple"时需要知道这是水果而非科技公司)。

关键结论:自组织的智慧

Transformer参数分工的本质,是在统一目标函数约束下,不同计算路径自然演化出的功能专门化。系统不需要预先设定分工细节,而是通过海量数据中反复的"试错-反馈"循环,自发形成了高效的信息处理体系。这种基于误差驱动的自组织过程,正是深度学习模型强大表征能力的根源。

 

 

【外一篇】

Q/K/V的关系:一个更深入的解读

Q和K的关系

- Q 是 K 空间的一个特定视角或投影
- 就像同一本书可以从不同角度去检索:
  - Q1:主题分类(K1:文学/科技/历史)
  - Q2:难度等级(K2:入门/进阶/专业)
  - Q3:写作风格(K3:理论/实践/案例)

这是因为Q是“主动”寻求某个与其他tokens关联的特征;而K是“被动”准备被其他tokens匹配的特征。K好比索引,需要概括token的所有主要特征,但Q则是专注于查询某个特征。

这样理解多头注意力就更自然了:

# 每个头学习到不同的投影视角(矩阵乘法示意)
Q1 = token @ W_q1  # 可能关注主题相关性
Q2 = token @ W_q2  # 可能关注语法关系
Q3 = token @ W_q3  # 可能关注语义角色

就像一个高维空间的不同切面:
- 每个注意力头学习到一种特定的"查询视角"
- 这些视角共同构建了token间关系的完整图景

K和V的分工

- K:是信息的"检索表示"
- 包含了各种可能被查询的特征
- 好比图书的多维度标签系统

- V:是信息的"内容表示"
- 包含了实际需要被利用的信息
- 就像书本正文的具体内容

一个具体例子
以"开车"这个词为例:

多头注意力可能学到的不同视角:
Q1:寻找动作的工具(与"汽车"高度相关)
Q2:寻找动作的主体(与"司机"高度相关)
Q3:寻找动作的修饰(与"快"、"稳"等相关)

这种理解很好地解释了:
1. 为什么需要Q/K分离
2. 为什么需要多头QKV机制
3. 模型如何自动学习到不同类型的上下文关系

最后,我们来进一步了解第三个关键角色 V

V与Token表示的连续性

一个token 的 V(Value)与该 token 的初始embedding最相关,因为表示的都是这个token的内容和意义。

- 初始embedding:代表词元在大规模预训练中学到的一般含义,好比是查了词典
- Value向量:可以看作是这个初始表示在特定上下文中的延续和更新

换句话说:
1. Embedding是词元的"基本词典定义"
2. Value是这个定义在特定语境下的"具体表达"

Value在模型中的演化

随着信息在多层网络中的流动:

初始embedding → 第1层Value → 第2层Value → ... → 最终表示

这个过程中:
- 每一层的Value都承载了更加丰富的上下文信息
- 同时保持着对原始token含义的连续性(若担心连续性衰减大,还可以用残差来弥补)
- 这种演化是渐进式的,而不是断裂式的
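这种"渐进式演化 + 残差弥补"的思路,用代码示意就是每层在旧表示上叠加一个小增量(假设性示意:这里用随机小向量 layer_update 代替真实模型中注意力加权求和与前馈网络的输出):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
h = rng.normal(size=d)   # token 的初始 embedding
h0 = h.copy()            # 留存初始表示,便于对比连续性

def layer_update(h):
    # 虚构的单层增量:真实模型中是注意力加权求和 + 前馈网络的输出
    return 0.1 * rng.normal(size=d)

for _ in range(4):             # 逐层演化:初始embedding → 第1层Value → ...
    h = h + layer_update(h)    # 残差连接:新表示 = 旧表示 + 增量

# 演化后的表示仍然贴近初始 embedding:渐进式,而非断裂式
print(np.linalg.norm(h - h0) < np.linalg.norm(h0))  # True
```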

 Q/K与V的本质区别

- Q和K主要服务于“建立关系”(俗称“挖坑和填坑”)这一目标
- Q和K提取出用于匹配的查询特征和索引特征
- Q和K自然比V更抽象、更概括

- V则直接承载"具体内容"
- 包含词元需要传递的实际信息
- 更具体、更详细

形象地说:
- Q/K像是图书馆中的检索系统
- V则像是书架上的实际书籍内容

从整个模型的角度看:
1. 初始embedding进入第一层
2. 每一层都通过注意力机制加权求和以及前馈网络来更新下一层token表示
3. 最终层的表示涵盖了上下文的全部关系和意义,直接赋能输出

 

【相关】

True Story Behind DeepSeek's Success: AI Learning to Think Slowly Without Human Supervision

*Edited transcript from InfoQ's second DeepSeek series livestream featuring Dr. Wei Li, former VP of Engineering at Mobvoi's Large Language Model team, discussing R1 Zero's innovative contribution to democratizing reasoning models.*

DeepSeek's Greatest Achievement: Making Everything Transparent

InfoQ: "DeepSeek adheres to a pure reinforcement learning approach, but the industry often refers to RL as 'alchemy' - how did they make this process controllable and accessible? What's innovative about their reasoning paradigm?"

Dr. Li: Reinforcement learning for reasoning models has long been an industry challenge. About six months ago, when Ilya and others declared the end of the pre-training era, it signaled that simply scaling up pre-trained models was no longer sufficient for performance improvements. The delayed release of GPT-5 is another indicator of pre-training's decline. As a result, the industry began seeking new growth paths, with on-the-fly reasoning models gaining momentum among leading teams until OpenAI released O1, the world's first reasoning large language model. DeepSeek's R1 then followed with its breakthrough success.

From the mysterious Q-Star project (reportedly causing dramatic internal conflicts at OpenAI) to the release of O1, reasoning models have been widely recognized as a new paradigm in AI. The core of this paradigm is enabling models' "slow thinking" capability, or System 2 as it is called, using reinforcement learning to enhance model intelligence in complex tasks. However, all of this was closed-source. OpenAI even deliberately created mystique around their chain-of-thought content. Apart from a few top players like Google and Anthropic quietly exploring and tracking this field, other teams knew very little about it.

DeepSeek's greatest achievement lies in making everything about LLMs transparent. They open-sourced their models and detailed technical papers, and weren't afraid to expose their chains of thought (CoTs) in the system. Through pure reinforcement learning, they proved that even without process control data, result-based control alone could achieve top-tier reasoning model performance. This breakthrough was like piercing through a paper window, showing the industry a feasible path to democratizing reinforcement learning.

InfoQ: The innovation in reasoning paradigm sounds abstract. Could you provide an example?

Dr. Li: R1's paper is outstanding, arguably one of the finest in the large model field. It consists of two parts: one focusing on Zero research, which presents remarkable achievements in pure reinforcement learning for reasoning; the other detailing the practical R1 system, a top-tier production reasoning model. For R1's development, they considered practicality, balancing comprehensive performance, safety, and various practical considerations, detailing a four-stage training pipeline as best practice to help other teams understand and replicate their success.

The most brilliant part is the Zero research. Zero proved a revolutionary point: contrary to traditional beliefs (or OpenAI's implied stance that reasoning requires step-by-step supervision), process supervision isn't actually necessary. Using only the final result against the "gold standard" as a supervision signal is sufficient to train the "slow thinking" process required for reasoning models.

This is Zero's greatest highlight and the origin of its name - it draws inspiration from AlphaZero's spirit. AlphaZero historically pioneered complete independence from human game records or experience, achieving zero human supervision reinforcement learning through self-play generated process data (state+move+score triplets). Similarly, DeepSeek's Zero research demonstrates that in reasoning tasks, models can autonomously generate internal process data - Chain of Thought (CoT) sequences - without human annotation.

Specifically, reasoning models initially focused on mathematics and coding because these domains have standard answers. Macroscopically, this is typical end-to-end supervised learning, as both input (math/coding problems) and output (answers/execution results) are fixed and known. However, the process from input to output is highly complex with significant information gaps, requiring a CoT bridge. Just as humans need to break down problems and think step by step when facing difficulties, models need this process too. DeepSeek's research found that models possess the ability to learn this deep thinking process autonomously if given sufficient time and space.

InfoQ: Dynamic reasoning paths sound like AI "drawing mind maps" - but how do you prevent it from going off track? Like suddenly writing poetry while coding?

Dr. Li: Based on current evidence, this possibility is virtually non-existent or negligibly low. Before DeepSeek published their results and research details, many were puzzled about this point: wouldn't deep thinking go haywire with only result supervision and no process supervision? Without large-scale reinforcement learning experiments, this was indeed a significant concern. It's like flying a kite - you're holding just one string while letting it soar freely, worried it might nosedive.

These concerns proved unnecessary. The reason it doesn't go off track is that all this reasoning reinforcement learning, including self-generated reasoning CoTs, is built upon existing top-tier models (like V3). These models have already mastered coherent expression through massive data learning. This coherence implies orderliness, which, while not equivalent to pure logic, prevents completely unreasonable deviations. It is observed that fluent human speech typically reflects organized thinking.

InfoQ: On another note, compared to OpenAI's O1, DeepSeek R1 has another notable highlight in applying reasoning CoTs to language generation and style imitation. Could you elaborate on this?

Dr. Li: When O1 was released, everyone knew it demonstrated significant improvements in mathematics and coding abilities, as standard tests revealed higher performance levels. What people didn't realize was that this reasoning ability, or "slow thinking" capability, excels not only in domains requiring strict logical reasoning but can also shine in traditional language tasks.

By nature, language ability has been a strength of large models - everyone knows they generate very fluent text, more native than natives. By the time we reached models like 4o or V3, their writing was already quite smooth, seemingly leaving little room for improvement. However, when asked to write classical poetry or imitate Lu Xun's writing style, previous models fell short. R1 solved these challenges. From a social impact perspective, this is actually quite remarkable and particularly noticeable.

Honestly, not many people are deeply concerned about mathematics or coding, although we know coding is a major direction for the coming years and automated programming can change the world. Everything in IT ultimately comes down to software; the digital world is built on software. If software development can transition from manual coding to model-assisted or even model-autonomous programming, this will greatly increase productivity. While this is visible to all, it's not as intuitive for ordinary people who more often face tasks like writing compelling articles.

When R1's humanities capabilities were discovered, not just geeks or software developers saw the benefits of reasoning models - ordinary people were excited too. Suddenly, anyone could claim to be a poet, writer, advisor or philosopher - the impact was tremendous. This wasn't felt with o1, perhaps because OpenAI didn't realize or at least didn't focus on this aspect of reasoning models. But while working on code and mathematical reasoning, DeepSeek must have internally realized that this "slow thinking" mechanism could also significantly improve writing abilities, especially in classical Chinese.

Everyone knows Chinese data isn't as rich as English data, so while previous models could write beautiful English poetry, they struggled with Tang poetry. This might be because Chinese data was insufficient in quantity or quality, preventing models from learning adequately. We always felt this was unfortunate - models would sometimes rhyme correctly, sometimes not, sometimes add or miss characters, not to mention tonal patterns to follow. DeepSeek clearly put effort into this area; their data quality must be significantly higher than industry standards.  More significantly, they know how to transfer the CoT ability from science and technology to language and literature.

InfoQ: If you were to recommend a DeepSeek module most worth replicating for programmers, which would it be? Like those "Aha moments" claiming to replicate R1 for tens of dollars?

Dr. Li: If I were to recommend a DeepSeek module most worth replicating for the programming community, it would be the Zero research-related components. This replication isn't about achieving comprehensive capabilities but rather verifying Zero research's key revelation - that machines can indeed autonomously learn. This is what OpenAI kept under wraps; perhaps they had figured it out earlier but chose not to disclose it.

Now, we've seen quite a number of different teams claim to have reproduced R1's reflective capabilities with minimal resources. This isn't just an interesting experiment; more crucially, it marks the democratization of reasoning models. Previously, people didn't understand how reasoning models worked, only knowing that vast amounts of process data were needed for models to learn slow thinking. This was considered an almost insurmountable barrier because process data is hard to obtain, and reinforcement learning's instability and high data requirements confused and challenged many programmers.

But now, we know we can bypass this most difficult process data requirement and reproduce this "Aha moment" with limited resources, proving that slow-thinking capabilities can be learned autonomously by models. Based on this premise, if you're a domain expert, you might wonder: could these techniques achieve significant improvements in your field? This is entirely possible. Even the most powerful models (like V3 or 4o) only achieve 60-70% accuracy in specific scenarios without optimization, and experience tells us that without at least 80-85% accuracy, you can't launch a truly valuable system in real-life applications.

That is to say, between a large model's out-of-the-box results and actual valuable application deployment, there's a gap. Previously, our only method was collecting domain data for fine-tuning (SFT). Now, we have another path, RL: following the reasoning model approach, letting systems fully utilize slow thinking capabilities during the reasoning phase to improve data quality to acceptable or even exceptional levels. This path seems to have been opened.

However, my programmer friends tell me that in their comparison experiments between fine-tuning (SFT) and DeepSeek-style reinforcement learning (RL), while RL indeed outperforms SFT, the computational cost for RL training is still far higher than SFT. The superior performance makes sense because SFT data is always very limited, while successfully reinforced RL self-generated data can far exceed SFT data volume.

InfoQ: Some say large models represent "brute force aesthetics," but OpenAI's former Chief Scientist and co-founder Ilya says pre-training has reached its limit. How do you view this? Is the emergence of reasoning models just adding another scaling law to brute force aesthetics?

Dr. Li: This is more about a shift in technical focus and a paradigm change in technical innovation. Large models involve three major components: first, pre-training, which builds foundational capabilities by learning basic patterns from massive data; second, post-training, initially mainly fine-tuning - OpenAI early on used some reinforcement learning (like RLHF) for human preference alignment, but by Meta's time, they even abandoned typical PPO style RLHF for simpler DPO, as they, like many others, struggled with it. Finally, there's the reasoning phase, where models interact with users in real time after deployment.

The current situation with high-quality natural data is that pre-training has nearly exhausted all available quality resources. The industry began to notice data growth challenges, making performance improvements increasingly difficult. GPT-5's delayed release, reportedly yielding limited returns despite massive computational investment, suggests pre-training may have indeed hit a wall.

This led the industry to explore alternative AI growth curves. Reinforcement learning-based reasoning models emerged at center stage in this context: pure reinforcement learning should be added to post-training. Previous reinforcement learning relied on human preferences, but this time it's about giving models more thinking time before reaching answers, learning underlying chain of thought (CoT). While V3 was already doing well, it didn't cause as much social sensation until R1 appeared. DeepSeek truly broke through after the Chinese New Year, becoming the most discussed public topic and causing excitement and shock overseas. R1 and O1 represent a new paradigm. Before R1, only OpenAI's O1 existed as a reasoning model, seemingly unreachably advanced, with would-be-followers unsure how to follow. However, R1 not only reproduced O1's capabilities but did so with greater transparency and clarity. This contrast further highlighted R1's importance as an open-source model leader.

InfoQ: At first glance, DeepSeek seems like an engineering masterpiece. Why did it cause such a global sensation? Its user acquisition speed (100 million users in a week) even surpassed ChatGPT's nuclear moment. What's its historical significance?

Dr. Li: From my personal experience and observation, ChatGPT's explosion was a landmark event in large model development. Research insiders were following large models before ChatGPT, at least since GPT-3. When GPT-3's Playground appeared, we were already immersed in it, sensing an approaching storm. But from society's perspective, ChatGPT truly shocked everyone, exceeding all expectations, like an AI nuclear explosion.

I believe R1's emergence is the second major shock after ChatGPT. Of course, between ChatGPT and R1, other influential models appeared, like 4o - another remarkable milestone. While ChatGPT 3.5 was already so impressive, 4o proved it could be even better. Then came Sora, bringing shock with video capabilities in multi-modal LLMs. I personally also greatly appreciate Suno, the music model, making me feel like I could become a musician overnight.

If I were to rank them, R1's impact is second only to ChatGPT, perhaps even exceeding 4o and Sora's sensational effects. R1's impact feels similar to ChatGPT's initial appearance, creating the same addiction. While ChatGPT was groundbreaking and R1 a follower, albeit with innovative highlights sometimes surpassing previous models (like in classical poetry and style imitation), achieving such global impact as a follower is truly miraculous.

In terms of practical effects, R1's productization was amazingly successful. Gaining over a hundred million users in a week, it shattered ChatGPT's record and elevated society's AI awareness. Furthermore, regarding geopolitical influences on technology access, many domestic users had long desired access to the world's most advanced models like the GPT series, Claude, or Gemini but often couldn't reach them. R1's appearance eliminated these concerns about domestic and international restrictions, contributing to its rapid global popularization.

InfoQ: What's your vision of AI programming's ultimate form? Is it programmers telling AI "make me a TikTok," and it outputs deployable code and operations plans?

Dr. Li: There are always two types of people: skeptics and optimists. People like Ilya believe Artificial General Intelligence (AGI) is imminent and Artificial Super Intelligence (ASI) isn't far away, so the biggest concern now, according to him, is ensuring superintelligence safety.

Anthropic's CEO Dario predicts that within 3-5 years, large models will achieve real breakthroughs - not just the current impressive demonstrations, but revolutionary changes in societal productivity. Fundamentally, they're talking about AI's ability to scale replacement of both physical and intellectual human labor.

However, while large models are buzzing now, their practical applications haven't reached the level of the previous generation's mobile platforms. Previous super apps like Meituan, Didi, Xiaohongshu, and TikTok transformed major aspects of our daily lives, from basic necessities to communication and entertainment, maximally shortening the distance between suppliers and customers - value every one of us feels daily. While playing with large models is interesting, their practical value at the lifestyle level isn't yet obvious; at best we're still on the verge of the coming AI application explosion.

Notably, DeepSeek's emergence has lowered large model application barriers, paving the way for scaled applications, though we haven't yet entered the era of true application explosion.

What will it look like when AI applications truly explode? I believe the ultimate goal, by nature of AI, is for LLMs to comprehensively replace humans in both intellectual and physical labor. Signs of large models impacting white-collar workers are already undoubtedly evident, with even programmers not exempt. In physical labor, embodied intelligence is developing rapidly, with both humanoid robots and mechanical hands gradually replacing human physical work.

Of course, this brings side effects, like massive job displacement. How society adapts to this state of greatly developed productivity with far fewer jobs is a discussion for another time. But looking at AI's nature and ultimate goals, AI development could have two milestones: first, when it can replace 50% of human work, allowing half of society to maintain a decent, free life through social programs perhaps like Universal Basic Income (UBI) - this might mark the arrival of AGI (Artificial General Intelligence); second, when it replaces 90% of human work, possibly signifying the emergence of ASI (Artificial Super Intelligence) - a kind of technological utopia (or 'communism') in some sense.

These two milestones are my own verifiable definitions of AGI and ASI. I do not agree with the idea that while old jobs are replaced, more new jobs will be created by AI. It just does not make sense: any new jobs would also involve human labor, destined to be replaced soon by superintelligence even if they do emerge for the time being.

This vision of AI's future development shows how DeepSeek's innovations in reasoning models might be just the beginning of a much larger transformation in how we think about work, society, and human potential in an AI-driven world.

 

 

【相关】

DeepSeek爆火真相:不靠“人盯”, 让AI自己学会慢思考

本文整理自InfoQ策划的DeepSeek系列直播第二期节目——DeepSeek爆火背后DeepSeek,纯强化学习路线到底有何不同。在直播中,出门问问大模型团队前工程副总李维博士聚焦推理范式的创新,分析了R1 Zero 对推理模型平民化的创新贡献。他提到,DeepSeek通过开源和透明化,证明了不需要过程监督,仅通过结果控制就能训练出优秀的推理模型,这大大颠覆了传统认知以及OpenAI 所暗示的需要在每一步监督推理强化学习的观点。

 

DeepSeek 的最大功绩在于将这一切透明化

InfoQ:“DeepSeek坚持纯强化学习路线,但业界常说RL(强化学习)是‘炼丹’”——他们如何让这个过程可控和“平民化”?有什么"推理范式的创新"?

李维博士:实际上,推理模型的强化学习一直是业界的难题。大约半年前,Ilya 等人宣称预训练时代已经结束,这意味着单纯依靠预训练模型的规模扩展来提高性能已经难以为继。GPT-5 迟迟不能上线也是预训练式微的一个迹象。因此,业界开始寻找新的增长道路,推理大模型在头部团队开始暗流涌动,直到 OpenAI 发布全球第一个推理大模型 o1,紧接着就是 DeepSeek 的 R1 出圈,这就是 DeepSeek 爆火的背景。

从 神神秘秘、据传引发了OpenAI宫斗的Q-Star 项目开始到 o1 大模型的推出,推理大模型被AI主流广泛公认为新的范式。这种范式的核心是开启模型的“慢思考”能力,即所谓 System 2,利用强化学习提升模型在复杂任务中的智能程度。然而,这一切都是闭源的,OpenAI 甚至故意制造了一些神秘感,遮掩其思维链的内容。除了少数头部玩家如 Google 和 Anthropic 在背后悄悄探索追踪外,其他团队对这一领域知之甚少。

DeepSeek 的最大功绩在于将这一切透明化。它的模型和详尽的技术论文全部开源,甚至也不怕露怯,在系统里公开了思维链的所有内容。它通过纯粹强化学习,证明了即使没有过程控制数据,仅通过结果控制也能达到头部推理大模型的水平。这就好像是捅破了一层窗户纸,让业界看到了强化学习平民化的道路。

 

InfoQ:推理范式的创新听起来很抽象,能否举个例子?

李维博士:DeepSeek 的R1论文非常出色,堪称大模型领域中的一篇佳作。论文分为两部分:一部分是关于 Zero 的研究,这是纯粹的强化学习推理方向的成果,非常精彩;另一部分则是基于 Zero 研究成果的实用系统 R1,这是一个真正上线的头部推理大模型。在开发 R1 时,需要考虑实用性,包括综合性能、安全性以及各种实用考量等,因此论文中详细介绍了四阶段训练的最佳实践(best practice),帮助其他团队理解和复制这一成果。

论文最精彩的部分还是 Zero 的研究。Zero 的研究证明了一个颠覆性的观点:与传统认知(或 OpenAI 所暗示的需要在每一步监督推理强化学习的观点)不同,实际上并不需要过程监督。仅通过最终结果(即“黄金标准”)作为监督信号,就能训练出推理大模型所需的“慢思考”过程。

这是 Zero 的最大亮点,也是其名称的由来——它借鉴了 AlphaZero 的精神。AlphaZero 在人工智能历史上开创性地完全不依赖人类棋谱或经验学习,而是通过自我对弈的再生的过程数据(即:棋局状态+落子+评分的三元组步骤数据),实现了零人类监督的强化学习,并最终完全碾压了人类顶尖棋手。DeepSeek 的 Zero 研究也是如此,它表明在推理任务中,模型可以自主生成内部的过程数据,即思维链(CoT,Chain of Thought)序列,而无需人类标注。

具体来说,推理模型最初以数学和代码为对象,因为这些领域本身就存在标准答案。从宏观上看,这其实是一种典型的端到端监督学习,因为输入端(数学题或代码题)和输出端(答案或代码运行结果)都是固定的、已知的。然而,从输入到输出的过程非常复杂,信息差很大,这就需要一个“思维链”作为桥梁。就像人类遇到难题时需要分解问题、逐步思考一样,模型也需要这样的过程。DeepSeek 的研究发现,模型本身具有自主学习这种深度思考过程的能力,只要给予足够的时间和空间。如果没有这个空间,模型就只能直接从问题跳到答案,信息鸿沟大,随机性就强,成绩好不了。

DeepSeek 的解决方案是通过设计一个简单模板引导模型进行思考。具体说,就是在传统的监督数据 question+answer里面人为增加了一个标签[think]: question+[think]+answer, 通过强化学习的方式,模型会自主填空,再生过程数据 question+cot+answer,以此迭代学习,cot中就自动出现了反思、自我校正等过程。这表明,只要给予模型思考的空间,它就能自主生成思维链。非常奇妙!
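这个模板思路可以用一小段代码示意(注意:标签名与拼接格式是对论文描述的简化示意,并非 DeepSeek 官方实现):

```python
def build_sample(question, cot, answer):
    # question+[think]+answer 的拼接示意:
    # cot 起初为空,留待模型在强化学习迭代中自主填补、再生
    return f"{question} [think] {cot} [/think] {answer}"

# 训练初始:think 空间为空
sample = build_sample("1+1=?", "", "2")
print(sample)
```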

 

给模型留够充分的自主学习空间

InfoQ:动态推理路径听起来像AI自己“画思维导图”——但如何避免它中途跑偏?比如写代码时突然开始写诗?

李维博士:从目前的情况来看,这种可能性几乎不存在,或者概率极低,可以忽略不计。在deepseek公布他们的结果和研究细节之前,大家确实对这一点感到困惑:只靠结果监督,没有过程监督,深度思维不会乱套吗。在没有真正进行大规模强化学习实验之前,这确实是一个很大的疑问。就好比放风筝,你只牵着一根线,让风筝在天上自由飞翔,你会担心它会不会一头栽到地上。

现在看来是过虑了。它不会走偏的原因在于,所有这些推理的强化学习,包括自主生成的推理思维链的数据,实际上都是建立在原有的头部大模型(如V3)的基础上的。这些大模型在海量数据的学习过程中,已经很好地掌握了如何把话说得顺溜。这种“顺溜”的背后是条理性。虽然不能说它完全等同于逻辑性,但至少不会偏离到完全不合理的情况。就像一个人说话很顺畅,背后的思想相对来说也是有条理的。

所以,模型在原有大模型的基础上生成数据,经过筛选和强化学习迭代,会越来越条理化。这种思考方式本身是由大模型自然生成的,再加上有选择机制在不断强化过程中让它越来越符合条理地导向正确答案。

话说回来,在研究人员真正做出成果之前,大家心里还是充满了怀疑和疑问,不知道让机器模拟学习人类的高阶智能这条路是否真的能走通。如果是一个能力弱的小模型,这条路是否能走通就很难说了。但V3本身是一个很强大的基座模型,在此基础上让模型自己生成思维链,虽然这些思维链并不总是很有条理,但并不影响最终结果。因为这是一个以结果为导向的强化学习过程,只要坚持用正确和错误的结果来控制强化学习过程,即使思维链中有时会出现一些偏差,但总体目标是一致的,最终还是能学到推理高难度题目的能力。

再从更大的角度来看,我们发现当大模型发展到一定程度时,日常人类的数据已经基本用尽,高品质的数据也所剩无几。要进一步提升能力,就必须依靠模型自己生成数据。说到底,AI发展到现在,需要AI自己反哺自己才能进一步提升。

在过去很长一段时间里,很多人对这一点存在疑问,担心模型自己教自己会导致退化,或者即使是一个好的模型教一个差的模型,也会有天花板。但现在回过头来看,再生数据的重要性越来越大。不仅是推理模型,就连多模态大模型也是如此。以Sora为例,我们知道视频和语言之间的自然对齐数据非常少,很难找到大量对视频情节进行详细讲解的数据。为了实现视频和语言的对齐,Sora选择了再生数据的道路,用自己的模型对整个的视频训练数据集进行了非常详细的标注。再生数据助力,Sora成为了第一个爆款的视频大模型。如今,国内的视频大模型也已经迎头赶上,如快手的可灵和字节的即梦,甚至比Sora还要更强一些,这背后也离不开再生数据的作用。

 

InfoQ:另一方面,与 OpenAI 的 o1 相比,DeepSeek R1 还有一个显著亮点是将推理思维链应用到了语言文字的创作和风格模仿能力上,这一点可以详细介绍一下吗?

李维博士:o1 出来时,大家都知道它在数学和代码能力上有了显著提升,因为标准测试显示它达到了一个更高的水平。但大家没有意识到的是,这种推理能力,或者说“慢思维”能力,不仅仅在需要严格逻辑推理的领域表现出色,它在传统的语言文字创作方面同样可以大放异彩。

传统上,语言文字能力一直是大模型的强项,大家都知道大模型生成的语言非常流畅。到了像 4o 或 V3,它们写文章已经很顺了,似乎提升空间不大。然而,当要求模型写一篇古典诗歌,或者模仿鲁迅的文风时,之前的模型还做不到。直到 R1 推出,这些问题都得到了解决。从社会效应来看,这其实是非常厉害的。

老实说,真正关心数学或代码的人并不多,虽然我们知道代码是今后几年的一个大方向,自动编程能改变世界。所有 IT 方面的东西归根结底都是软件,数字世界是由软件构成的。如果软件能力可以从手工编写变成模型辅助,甚至模型自主编写,这将极大地提高我们的生产力。这是大家都能看到的,但对普通老百姓来说却没有那么直观,因为他们面对的更多是写文章如何出彩这类任务。

R1 的文科能力被大家发现后,不仅仅是极客或者做软件应用的人看到了推理模型的好处,普通人也为之奔走相告。一旦上手,任何人都可以成为诗人、文学家、哲学家,这种震撼是非常大的。在o1 出来时,大家没有这种感觉,可能是因为 OpenAI 没有意识到,或者至少没有聚焦这一点。但 DeepSeek 在做代码和数学推理时,内部肯定已经意识到,这种“慢思维”在文字能力方面也可以提升一大步,尤其是在中文领域。

大家都知道,中文的数据相对没有英文那么丰富,所以之前大模型写英文诗可以写得很漂亮,但写唐诗就不够好。这可能是因为中文数据要么量不够,要么品质不够,导致模型学习得不够到位。我们一直觉得这是一个遗憾,模型写诗有时押韵,有时不押韵,有时多一个字,少一个字,更不用说平仄,总是有问题。DeepSeek 在这方面肯定下了功夫,其数据品质一定比行业标准更高、更好。

但大模型光有数据还不够,另一条腿是推理时间的计算量。在用户实际使用时,增加计算量和思考时间,我们发现模型的文字能力显著提升了层次,这给大家的震撼非常大。思维链是模型“慢思考”的一个特征。一开始,我们可能想当然地认为,逻辑思维是它的核心,思维链就是要非常严谨地符合逻辑的每个步骤,以确保在数理化和代码中表现出色。

但我们根本没想到,在文学创作这种领域,并不需要严谨的逻辑思维,它更多的是要有想象力,需要反复斟酌和修改。比如你要写一篇非常漂亮的文章,或者模仿一种风格,你需要考虑的方面很多,写古风诗词要考虑押韵、平仄、用词,考虑如何用古典文字表达现代概念等。为了写出一篇好文章,你需要周密地计划,这本质上是一种“planning”,而不仅仅是狭义的“reasoning”。可见,慢思维背后的真正价值在于为最终结果做铺垫,制定计划和反复修正。无论任务是文科还是理科,只要是高难度的任务,都需要这种“planning”的时间,就像我们打草稿、反复校改一样,这些都是思维链的用武之地。

 

InfoQ:思维链机制具体是如何产生的?

李维博士:DeepSeek 之所以能够产生复杂的思维链,背后是因为它是基于头部大模型 V3 训练的,而 V3 所涵盖的知识比我们任何个体所了解的都要广博得多。在这基础上,关键点是要给模型留下空间,让它有自主学习的机会。作为设计者或开发者,需要设计出这样的空间,让模型自己去填补、去学习。DeepSeek 就是这样实现的。它设计了一种格式,在输入问题 question 和输出答案 answer 之间留下一个“思考”的空间,用标签 [think] 来标记:question+[think]+answer。这个 think 标签就是为思维链(CoT)预留的:虽然开始为空,Zero 的研究表明,只要留下 think 标签,就给 LLM 自主填补 CoT 留下了空间。此后他们“啊哈”地惊喜发现,越来越条理化的 CoT 居然在 GRPO 组内选优的强化学习迭代算法的指引下自主学出来了。啥也不用做,模型就是自己要思考,而且能思考。LLM really wants/tends to think and think deep if given a chance. 比如,它可能会在推理过程中发现自己前面的某个结论与已知事实不符,于是就会自我纠正,说:“不对,这里可能有偏差。”这种反思和自我纠正的能力,是模型在学习过程中自然形成的。可以想像研究者当时的兴奋之情,简直就是上帝给他们面授了天机。不但他们“啊哈”,我们读论文追踪他们的人也感觉开了天目,不可思议,但 it just works。Zero 研究的美丽之处,就在于没有任何人工过程数据的干预,完完全全是纯强化学习出来的奇迹。

从信息论的角度来说,思维链降低了困惑度(perplexity),搭建了从难题到答案之间的桥梁,使得得出正确结论的可能性增大,从而提高了模型的智能。

 

推理模型已经进入“平民化”时代

InfoQ:如果让您给程序员推荐一个最值得复现的DeepSeek模块,会是哪个?比如各种声称几十美元复制R1的Aha moment?

李维博士:如果让我推荐程序员群体最值得复现的 DeepSeek 模块,大概会是与 Zero 研究相关的部分。这种复现并不是从全面能力上,而是证实了 Zero 研究中揭示的关键点——机器确实能够自主学到反思能力或慢思维推理。这是 OpenAI 一直遮掩不让人知道的,也许他们早就悟出来了,但就是不公开。

现在,我们看到至少有五六组不同的团队,用很少的资源就复现出了 R1 的这种反思能力。这不仅是一个有趣的实验,更关键的是,它标志着推理模型已经进入“平民化”时代。以前,大家不知道推理模型是如何工作的,只知道需要大量的过程数据,模型才能学会慢思维。这被认为是一个难以跨越的门槛,因为过程数据很难获取,而且强化学习的不稳定性高、对数据要求也高,所以很多程序员觉得这条路很难走。

但现在,我们知道可以绕过这个最繁难的过程数据,通过有限的资源复现这种“Aha moment”,证明慢思维能力是可以让模型自主学出来的。基于这个前提,如果你是一个行业专家(domain expert),在自己的项目或应用领域中,你会想:是否可以用这些技术在你的领域实现大幅提升?这是完全可能的。因为即使是最强大的大模型(如 V3 或 4o),在具体场景中如果不经过优化,也只能达到 60%~70% 的正确率,而在 real life应用场景中,经验告诉我们没有 80% 或 85% 以上的正确率,根本无法上线一个真正有价值的系统。

从大模型的“开箱即用”(out-of-box)结果到真正能投入应用并产生价值,中间存在一个差距。以前,我们想到的唯一方法是收集领域数据进行微调。但现在,我们多了一条路:顺着推理模型的思路,让系统充分发挥推理阶段的慢思维能力,从而提升数据质量到可接受甚至出彩的程度。这条路似乎已经打通了。

不过,我的码农朋友告诉我,他做了一个微调(SFT)与deepseek式强化学习(RL)的对比实验,发现RL的确强过SFT,但RL训练目前的计算代价还是远远大于SFT。效果好于SFT可以理解,因为SFT的数据总是非常有限的,而RL自主再生的数据成功强化的话,会远远大于SFT数据。

仔细看 R1 的设计,它是一个实用系统,不像 Zero 那么纯粹。Zero 是一个研究项目,旨在证明可以排除人类干预来构建推理模型。但 R1 是为了实际应用,所以它结合了微调和强化学习:遵循他们自己创新的SFT+RL+SFT+RL的四阶段训练的pipeline。它在第一阶段是微调,使用了 2,000 条左右的人类过程数据来提高效率,他们称为“冷启动”。强化学习之后,又加入了微调和最后的偏好强化学习,以确保合适的数据配比和能力平衡,以及与人类偏好的对齐。这种设计是经过深思熟虑,可能经过了很多尝试和调整,最终呈现出的一个最佳实践。

虽不好说R1 的这种设计一定就是绝对的最佳方案,但它确实提供了一个很好的思路:现在我们有两个工具——SFT 和 RL。如果能够将这两个工具很好地结合起来,互相补充,那么在实际应用场景中,我们就能构建出更好的系统。

从更广泛的意义上说,DeepSeek 的出现不仅是因为各种原因而短暂火爆,它更重要的作用是极大地加速了大模型向应用领域发展的速度。这对整个行业来说是一个巨大的利好刺激。

 

InfoQ:有人说大模型是“暴力美学”,但 OpenAI 的前首席科学家、联合创始人 Ilya 说预训练到头了,怎么讲?推理模型出现的背景就是增加了又一个暴力美学的 scaling law 吗?

李维博士: 这更像是技术聚焦点的转移和技术创新的范式转变。大模型涉及三大块:首先是预训练,这是大模型的基础能力,从海量数据中学习基本规律;其次是后训练,最初主要是微调,OpenAI 早期也用了一些强化学习(如 RLHF)来对齐人类偏好,但到了 Meta 时,他们甚至放弃了典型的RLHF,代之以更简单的DPO,因为与很多人一样,他们玩不转。最后是推理阶段的工作,即模型上线后与用户交互的阶段。

这三个阶段理论上都可能找到资源投入与性能提升之间的正相关S曲线,即scaling laws的某种表现函数。在过去,预训练是最受重视的部分,大家认为只要数据量不断加大、模型规模足够大,能力就一定持续提升。

LLM Scaling的底层逻辑是什么?为什么到了千亿tokens这种以前难以想象的数据规模,大模型依然显得"吃不饱"?为什么从千亿扩展到万亿tokens,scaling law依然有效?

这个现象的关键在于LLM是序列学习(编码)和序列推理(解码)的系统。序列本身是一维的,但序列中蕴含的patterns和规律性却是高维的。举个例子:即使是简单的"猫追老鼠"这样的序列,背后可能涉及物种关系、捕食行为、空间运动等多个维度的知识。这种多维知识表现在序列层面,就会发生天然的组合爆炸。对大数据的"大胃口"正是应对这种组合爆炸的有效策略。

 

然而,人类自然产生的高质量数据是有限的。预训练已经几乎吃尽了现有的高质量自然数据。业界开始意识到数据增长的困扰,性能提升也变得困难。GPT-5 难产,据传投入大量算力却收效有限,这表明预训练可能遭遇了瓶颈。

于是,业界开始探索另外的AI智能增长曲线。强化学习的推理模型就是在这种背景下走到主流舞台的中心:应该在后训练中加入纯粹的强化学习。以前的强化学习依赖人类偏好,但这次是让模型在得出答案之前有更多思考时间,学习背后的规律。V3 已经做得很好,但当时除了业界并没有在社会上引起太大轰动。直到 R1 出现,DeepSeek 才真正出圈,成了春节后最受关注的大众话题,在海外也引发了热议和震惊。R1 代表了一种新的范式。在 R1 之前,只有 OpenAI 出了 o1 这种推理模型,给人一种高不可攀的感觉,大家不知道如何跟进。然而,R1 不仅复现了 o1 的能力,还更加透明、清晰。这种反差进一步凸显了 R1 作为开源大模型引领者的重要性。

 

Future Speculations

InfoQ: At first glance DeepSeek is just engineering taken to the extreme, so why did it cause a worldwide sensation? Its user acquisition (a hundred million in a week) outpaced even ChatGPT's explosive debut. What is its place in history?

Dr. Li: Speaking from my own experience, ChatGPT's explosion was the landmark event in the development of large models. Those of us in the field had been following large models well before ChatGPT, at least since GPT-3. When the GPT-3 Playground appeared, we were delighting in it and already sensed a storm was coming. But from society's perspective, what truly shook the whole world was the arrival of ChatGPT; it hit like a nuclear blast, beyond everyone's expectations. Once ChatGPT came out, we fell into a state of obsession.

R1's appearance, I believe, is the second great shock after ChatGPT. Of course, between ChatGPT and R1 there were other influential models, such as 4o, another remarkable milestone. We had thought ChatGPT was already excellent, that version 3.5 was already outstanding, but 4o proved it could be even better, and we used it daily in our work. Then came Sora, whose video generation was stunning in its own way. I am personally fond of a music model called Suno: it is so good at music creation that I felt I could become a musician overnight, writing whatever songs I wanted and pairing them with my own videos. Each of these models delivered its own shock, but none as strong as R1's.

If I had to rank them, R1's shock is second only to ChatGPT's, exceeding even the sensations created by 4o and Sora. The feeling R1 evokes resembles the early days of ChatGPT: it makes people obsessed. ChatGPT was the epoch-making model; R1 is, on the whole, a follower, though with many innovative highlights, and in some respects, such as classical Chinese poetry and literary style imitation, it even surpasses its predecessors. For a follower to cause such a sensation on both sides of the Pacific and around the world is a miraculous achievement.

In terms of practical impact, R1's productization has been hugely successful. It gained a hundred million users within a week, smashing the record ChatGPT set and raising the whole society's awareness of AI. In addition, consider how geopolitics affects technology access: many users in China have long wanted to use the world's most advanced models, such as the GPT series, Claude, or Gemini, but often could not reach them. R1's arrival removed that worry. These factors also contributed to R1's rapid adoption.

 

InfoQ: What is your ideal end state for AI programming? Is it a programmer telling the AI "build me a TikTok," and it directly outputs deployable code plus an operations plan?

Dr. Li: There are always two camps, skeptics and optimists. People like Ilya believe artificial general intelligence (AGI) is imminent and superintelligence (ASI) is not far behind, so the biggest problem now is ensuring the safety of superintelligence.

Anthropic's CEO predicts that within 3 to 5 years large models will achieve a real breakthrough: not just the impressive performances and demos that dazzle us today, but genuinely revolutionary productivity gains for society. What they mean, at bottom, is AI replacing human physical and mental labor at scale. Large models are lively today, but their practical penetration into social life still falls far short of the previous generation of mobile internet platforms. The previous generation's super apps, such as Meituan, Didi, Xiaohongshu, and Douyin, transformed the main aspects of our daily lives, from food, clothing, housing, and transportation to communication and entertainment, by radically shortening the distance between suppliers and customers; we feel that value every day. Playing with large models is fun, but their practical value in daily life is not yet obvious; at the application level we are still on the eve of the explosion.

It is worth noting that DeepSeek has lowered the barrier to large-model applications and paved the way for them, even though we have not yet entered the era of a true application explosion. When AI applications truly explode, when will it be, and what will it look like? I believe the end goal is AI comprehensively replacing human mental and physical labor. The impact of large models on white-collar work is already clearly visible, and even programmers are not spared. On the physical side, embodied intelligence is advancing quickly; humanoid robots and robotic hands are gradually taking over human physical labor.

Of course, there will be side effects, such as the disappearance of many jobs; how society adapts to enormous productivity growth with fewer jobs is a discussion on another level. But judging by AI's nature and ultimate goals, its development has two milestones. The first is when AI can replace 50% of human work, so that society needs only half its people working while the rest maintain a decent, free life through mechanisms like universal basic income (UBI); in my view, that marks the arrival of AGI. The second is when AI can replace 90% of human work, which would arguably mark the arrival of so-called superintelligence (ASI), a kind of technological communism.

 

[Related]

Does the New Reasoning Paradigm (Query+CoT+Answer) Support a New Scaling Law?

— Reflections on LLM Scaling Laws and DeepSeek's R1

My friend Zhang Junlin's article "Looking at the Future of Scaling Laws through DeepSeek R1" has sparked interesting discussions among peers.

Core Insights from Initial Discussions

Professor Bai summarised the key highlights as follows:

Infinite stacking won't lead to infinite growth (physical laws don't support this)

Only S-shaped growth is possible, with diminishing returns inevitably appearing

The initial emergence of language capabilities relates to the density of linguistic knowledge in training data

The next growth phase represents a second S-curve, driven by common sense knowledge, which requires more computing power due to lower knowledge density

The third phase involves learning logical reasoning (Chain of Thought), where natural data has even lower density of such knowledge. Brute-force mining with computing power becomes inefficient, making reinforcement learning with synthetic data a more rational approach

As Dr. Lu points out: The term "Scaling Law" is becoming overloaded. While S-curves (nonlinear curves characterized by sigmoid functions) can describe technology adoption lifecycles, they typically occur in succession (one technology hits its ceiling, making way for another). Large language models' multiple "Scaling Laws" confirm this pattern, with some overlap between Test-Time and Post-Training "Scaling Laws".
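The succession of S-curves Dr. Lu describes can be sketched numerically: model each technology's contribution as a logistic function of cumulative investment, and let overall capability be their sum. The midpoints and scales below are illustrative assumptions, not fitted values:

```python
import math

def logistic(x, midpoint, scale):
    """Standard sigmoid: slow start, rapid growth, then saturation."""
    return 1.0 / (1.0 + math.exp(-(x - midpoint) / scale))

def total_capability(investment):
    # Three successive S-curves: pre-training, post-training RL, test-time.
    # Each saturates at 1.0; (midpoint, scale) pairs are invented for the sketch.
    curves = [(2.0, 0.5), (5.0, 0.5), (8.0, 0.5)]
    return sum(logistic(investment, m, s) for m, s in curves)

# Marginal returns diminish within one curve, then pick up as the next begins.
low, mid, high = total_capability(0.0), total_capability(5.0), total_capability(10.0)
```

Plotting `total_capability` shows the staircase-of-sigmoids shape: plateaus where one technology saturates, followed by a new climb when the next curve takes over.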

The Nature of LLM Scaling

Let's examine the fundamental logic behind LLM scaling. First, it's crucial to understand that LLMs are not databases - they don't aim to memorize long-tail data details. Large model training essentially compresses big data, or more precisely, compresses the knowledge systems behind the data (including common sense and encyclopedic knowledge), focusing on capturing patterns and regularities at every scale (what we call generalizations).

Conventional intuition suggests that as data scale increases, redundancy increases too. Regardless of filtering, cleaning, and deduplication, growing redundancy seems to imply diminishing returns. So why do large models still appear "hungry" even at the unprecedented scale of hundreds of billions of tokens? Why does the scaling law remain effective from hundreds of billions to trillions of tokens?

The key lies in LLMs being sequence learning and sequence decoding systems. While sequences are one-dimensional, the patterns and regularities behind them are high-dimensional. For instance, even a simple sequence like "cat chases mouse" potentially involves multiple knowledge dimensions: species relationships, predatory behavior, spatial movement, actor-patient roles, etc. This multi-dimensional knowledge naturally leads to combinatorial explosion at the sequence level as information is flattened into language. The insatiable appetite for big data effectively addresses this combinatorial explosion. As long as there isn't complete information redundancy, additional diverse sequences will help models abstract data patterns more precisely.
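The combinatorial point can be made concrete: if one short sequence touches several independent knowledge dimensions, each with a handful of possible values, the number of distinct situations a flat one-dimensional corpus must cover grows multiplicatively. A toy illustration (the dimensions and their values are invented for the example):

```python
from itertools import product

# Hypothetical knowledge dimensions behind a sequence like "cat chases mouse".
dimensions = {
    "species_relation": ["predator-prey", "peer", "symbiotic"],
    "action_type": ["chase", "flee", "observe"],
    "spatial_motion": ["toward", "away", "circling"],
    "role_assignment": ["agent-patient", "patient-agent"],
}

# Multiplicative growth: 3 * 3 * 3 * 2 = 54 distinct configurations
# from just four small dimensions.
combinations = list(product(*dimensions.values()))
n_configurations = len(combinations)
```

Add a few more dimensions, or a few more values per dimension, and the count explodes; this is why ever more diverse sequences keep contributing signal.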

The Two vs. Three S-curves Debate

Zhang Junlin observes that since OpenAI's O1, two other phases have gained recognition with their own Scaling Laws: the reinforcement learning Scaling Law (RL Scaling Law) for post-training, and the Inference Scaling Law (also called Test Time Scaling Law).

This raises a crucial question: Are there really three S-curves, or just two? How comparable is the reasoning model's S-curve to the pre-training S-curve?

While theoretically we can identify three phases:

Pre-training
Post-training (especially reasoning-focused reinforcement learning)
Inference phase

In practice, post-training and inference phases likely share a single S-curve; there aren't two independent growth curves.

DeepSeek R1's Insights: The Truth About "Slow Thinking"

Consider DeepSeek R1: users can activate "deepthink" mode to enable Chain-of-Thought (CoT) reasoning, but they can't actually control reasoning quality by increasing computation time. Why is this?

Let's examine a concrete example. When R1 solves a complex mathematical problem:

Traditional models might directly answer: "The result is 42"

R1 shows detailed reasoning: "Let's think step by step: 1) First consider... 2) Then we can... 3) Finally, we get 42"

While R1's response appears to demonstrate "slow thinking" (CoT), this reasoning process actually reflects a generation pattern fixed during training, not dynamic exploration of multiple potential reasoning paths at response time. In other words, CoT+answer might look like "slow thinking," but it doesn't fundamentally change the unidirectional next-token prediction paradigm. R1's CoT+answer creates an illusion of slow thinking, but the generative nature remains fundamentally the GPT "fast thinking" paradigm. At test time, unlike AlphaGo, the depth and scale of thinking isn't dynamically explored, though beam search, if applied, can provide implicit multi-path optimization internally.

Test Time Compute Constraints

The industry's buzzword "test time compute" refers to reasoning models requiring more online computational resources than traditional non-reasoning models. For example, R1 with CoT enabled might need several times more computation time than its base model V3 for the same problem. However, this increased computation results from behavior patterns acquired during training, not from dynamically adjustable compute investment. Without controllable scalability in test time compute, we can't really speak of a test time scaling law.
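To see why CoT length dominates online latency, note that autoregressive decoding cost is roughly linear in the number of generated tokens. A back-of-the-envelope model (the throughput figure is an assumed placeholder, not a measured serving rate):

```python
def response_seconds(cot_tokens, answer_tokens, tokens_per_second=50.0):
    """Rough latency model: decoding time grows linearly with output length.
    tokens_per_second is an assumed serving throughput, not a measurement."""
    return (cot_tokens + answer_tokens) / tokens_per_second

# A V3-style direct answer vs. an R1-style answer preceded by a long CoT.
fast = response_seconds(cot_tokens=0, answer_tokens=200)     # 4.0 seconds
slow = response_seconds(cot_tokens=4000, answer_tokens=200)  # 84.0 seconds
```

Under this toy model, a 20x longer output means roughly 20x the wait, which is exactly the user-tolerance constraint discussed below.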

A major difference between pre-training and CoT reinforcement learning lies here: pre-training scaling laws can remain stable long-term because once training completes, it doesn't significantly impact online response time - the generation mode remains a simple query+answer. Therefore, offline training for months is acceptable if the resulting model shows significant capability improvements. However, reasoning models' post-training CoT reinforcement learning differs - it cultivates models' habits of responding with slow thinking, changing the generation mode to query+CoT+answer. Extending the CoT isn't just about the cost of training resources and time; more critically, it reflects in extended test time compute for each query during deployment, severely delaying system response time. Users generally have limited tolerance for slow thinking computation time and delays during online system use.

The Sustainability Debate

OpenAI's Sam Altman and Anthropic's Dario might argue that for extremely complex problems (like proving the Riemann hypothesis or designing next-generation aerospace vehicles), even if a model needs a week of computation time, it's still a massive improvement over human teams requiring decades. However, this argument has two issues:

LLM feasibility for such super-complex problems remains far from validated

Extreme scenarios lack universality and can't serve as data points for sustainable scaling laws

This isn't to deny S-curves as effective models for describing scaling laws, nor to reject the rationality of S-curve stacking. The combination of pre-training and post-training growth curves (s1 and s2) might indeed reflect the overall relationship between resource investment and performance improvement. However, we should carefully examine whether CoT reasoning truly opens a sustainable scaling curve.

Conclusion: How Far Is the LLM Road to AGI?

If reasoning models' scaling laws lack sustainability, this raises a deeper question: Can we reach the promised land of Artificial General Intelligence (AGI) through these two scaling laws alone? Furthermore, is the technical ideal of Artificial Super Intelligence (ASI) - AI replacing human labor and dramatically improving productivity - truly feasible?

Current evidence suggests that while pre-training scaling laws have shown considerable sustainability, reasoning models' scaling laws may quickly hit practical constraints. This reminds us that the path to AGI/ASI likely requires more innovative breakthroughs, not just simple extrapolation of existing methods. In the next phase of artificial intelligence development, we might need to discover entirely new growth curves.

[#LLMs #ArtificialIntelligence #DeepLearning #AGI #ScalingLaws #MachineLearning]

 

【相关】

Zhang Junlin: Scaling Laws as Seen from DeepSeek R1

Technical Deep Dive: Understanding DeepSeek R1's Reasoning Mechanism in Production

A detailed analysis of how DeepSeek R1's inference mechanism works in production, and how it differs from training-time reinforcement learning.

Training vs. Deployment: Key Questions

1. Training Phase (GRPO): Does the reinforcement learning mechanism generate multiple candidate CoT+answer sequences to optimize the policy and cultivate "slow thinking" habits?

- The answer is definitively yes.

2. Deployment Phase: Does R1 implicitly generate multiple paths during inference but only display one? If so, how does this mechanism compare to traditional ensemble methods?

3. Comparison with AlphaGo's MCTS: How does R1's mechanism fundamentally differ from Monte Carlo Tree Search?

1. Inference Mechanism in Production

DeepSeek R1's real-time reasoning can be characterized by two modes:

A. Implicit Multi-path Generation and Selection

- Generation: The model may implicitly generate multiple potential reasoning paths (CoT+Answers) during a single inference but outputs only one.

- Technical Implementation: Through decoding strategies (e.g., beam width adjustment), the model maintains multiple candidate sequences, ultimately selecting the highest-scoring path.

- User Experience: Users see only the final output, though internal multi-path exploration occurs.

- Efficiency Trade-off: Setting beam_width=1 (greedy search) defaults to single-path generation for fastest response; increasing beam width improves quality at the cost of latency.

B. Explicit Multiple Candidate Generation (Optional)

- API Control: The num_return_sequences parameter allows explicit generation of multiple candidates.

- Practical Application: While not enabled by default in the DeepSeek App, this functionality may be available through enterprise APIs or open-source implementations.
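The implicit multi-path selection described above can be illustrated with a toy beam search over a fixed next-token distribution. Real systems expose the same knobs through parameters like `num_beams` and `num_return_sequences` in a transformers-style `generate` API; this standalone sketch (with invented probabilities) just makes the mechanics visible:

```python
import math

# Toy next-token log-probabilities conditioned on the current prefix
# (invented numbers; a real model would compute these).
NEXT = {
    ("<s>",): {"A": math.log(0.6), "B": math.log(0.4)},
    ("<s>", "A"): {"x": math.log(0.5), "y": math.log(0.5)},
    ("<s>", "B"): {"x": math.log(0.9), "y": math.log(0.1)},
}

def beam_search(beam_width):
    """Keep the `beam_width` highest-scoring prefixes at each decoding step."""
    beams = [(("<s>",), 0.0)]  # (prefix, cumulative log-probability)
    for _ in range(2):  # two decoding steps in this toy vocabulary
        candidates = [
            (prefix + (tok,), score + logp)
            for prefix, score in beams
            for tok, logp in NEXT[prefix].items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beams

greedy_path = beam_search(1)[0]  # single path, fastest response
best_path = beam_search(2)[0]    # wider beam explores more paths internally
```

Note the trade-off in miniature: greedy search (width 1) commits to "A" early and ends with probability 0.3, while a beam of width 2 keeps "B" alive and finds the higher-probability path "B x" (0.36), at the cost of scoring more candidates.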

2. Training Phase: Cultivating "Slow Thinking"

A. Role of Reinforcement Learning

- Objective: GRPO algorithm trains the model to generate more detailed, logical reasoning steps (longer CoT) to maximize rewards.

- Mechanism: Training generates multiple candidate answers, with rewards evaluating both answer correctness and format correctness.

B. Driving Forces Behind CoT Growth

- Reward Design: Longer CoTs naturally emerge when they lead to better answers.

- Data Feedback: High-quality SFT data generated through rejection sampling enhances this pattern.
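A defining feature of GRPO is replacing a learned value baseline with a group-relative one: sample a group of candidate CoT+answer outputs per query, score each, and use the reward normalized within its own group as the advantage. A minimal sketch of just that normalization step (reward numbers are invented; the full policy-gradient update is omitted):

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each candidate's reward against
    the mean and standard deviation of its own sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled candidates for one query, scored on answer + format
# correctness (invented scores): two correct, two incorrect.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Candidates above the group mean get positive advantages and are reinforced; those below get negative ones, with no separate critic network needed.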

3. Comparison with Ensemble Methods

Similarities

- Multi-path generation conceptually similar to ensemble predictions

- Result filtering comparable to voting/weighted averaging

Key Differences

R1's implicit multi-path generation is fundamentally a dynamic decoding strategy within a single model, distinct from traditional ensemble's static combination of multiple models.

4. Fundamental Distinction from AlphaGo's MCTS

AlphaGo's MCTS

- Dynamic Programming: Builds search trees through simulation

- Online Learning: Adjusts search strategy based on real-time feedback

R1's Implicit Multi-path Generation

- Static Model: Fixed parameters during deployment

- No Reward Modeling: Path selection based on model probability rather than cumulative rewards

Key Insights

1. Training phase GRPO cultivates detailed CoT capabilities for effective single-pass inference.

2. Deployment allows flexible trade-off between single-path (for speed) and multi-path (for quality) generation.

3. While model parameters are fixed post-training, decoding strategies offer some runtime flexibility.

4. R1's multi-path generation fundamentally differs from both traditional ensembles and MCTS-style dynamic planning.

This architecture achieves a practical balance between efficiency and effectiveness for large-scale industrial applications, though it sacrifices some dynamic planning and global optimization capabilities.

#ArtificialIntelligence #MachineLearning #DeepLearning #LLM #DeepSeek

【相关】

DeepSeek Notes: Does the New Reasoning Paradigm query+CoT+answer Support a New Scaling Law?

How Far Can LLMs' "Bigger Is Better" Go?

— Some Reflections on Scaling Laws

 

An article by my old friend Zhang Junlin, "Looking at the Future of Scaling Laws through DeepSeek R1," has sparked lively discussion among peers.

Professor Bai's recommended summary:

Core points:

— Infinite stacking does not yield infinite growth (the laws of the physical world do not support it); only S-shaped growth is possible, and every S-curve eventually shows diminishing marginal returns.

The initial emergence of language capability relates to the density of linguistic knowledge in the training corpus.

The next phase of growth is actually a second S-curve: additional corpus data contributes mostly common-sense knowledge, whose density is lower than that of linguistic knowledge, so more compute is needed for it to emerge.

Then comes the learning of logical knowledge (chains of thought). Its density in natural corpora is lower still, so brute-force gold-mining with compute is thankless work. Hence reinforcement learning on synthetic data with higher logical-knowledge density is what lets the third S-curve climb. That follows naturally.

Dr. Lu commented: The term "Scaling Law" is getting overused. S-curves (nonlinear curves characterized by sigmoid functions) can describe a technology's life cycle, but they usually occur in succession (one technology hits a bottleneck, and only then does another begin)... We reviewed this when ChatGPT first came out. The several "Scaling Laws" of large models confirm the pattern (the Test-Time and Post-Training "Scaling Laws" overlap somewhat):

Innovation means moving from one S-curve to the next; these are well-known results. This is also the technology-adoption curve that the Stanford professor who evangelizes techno-optimism about new energy, autonomous driving, food-technology revolutions and the like (name escapes me) brings up in every talk. He claims that, based on this curve, many of his predictions over the past 30 years about technology's impact on society proved correct, even though society laughed at him each time.

Back to scaling laws in the LLM domain. A scaling law is essentially an empirical rule, and experience tells us most empirical rules follow an S-curve (or a normal distribution of increments). For LLMs specifically, "bigger is better" is how this rule manifests before hitting a plateau or ceiling. "Big" here means both big data and a large parameter count; neither alone suffices. If the model is not large enough, no amount of data can be digested effectively — this has long been industry consensus. That said, OpenAI's early overemphasis on model size now looks misleading; only with the proposal of the Chinchilla scaling law did the industry converge on a more reasonable consensus: data size and parameter count need to stay in an appropriate ratio.
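The Chinchilla finding reduces to a rule of thumb: compute-optimal training uses roughly 20 training tokens per model parameter. A sketch of that heuristic (the 20:1 ratio is the commonly cited approximation, not an exact law):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rule-of-thumb compute-optimal data size: ~20 tokens per parameter
    (the commonly cited Chinchilla approximation)."""
    return n_params * tokens_per_param

# A 70B-parameter model is roughly compute-optimal at ~1.4T training tokens.
tokens_70b = chinchilla_optimal_tokens(70e9)
```

The practical reading: a model that is "too big" for its data budget wastes compute, which is why data scale and parameter count must grow together.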

What is the underlying logic of LLM scaling?

First, be clear: an LLM is not a database, and its goal is not to memorize the details of long-tail data. Training a large model is essentially compressing big data — more precisely, compressing the knowledge systems behind the data (including common sense and encyclopedic knowledge) — with the focus on uncovering regularities large and small (i.e., generalizations).

Intuition says that as data scale grows, so does redundancy. However you filter, clean, and deduplicate, growing redundancy would seem to mean less and less "oil" left to extract. So why, at the previously unimaginable scale of hundreds of billions of tokens, do large models still seem "hungry"? Why does the scaling law remain effective from hundreds of billions to trillions of tokens?

The key is that an LLM is a system of sequence learning (encoding) and sequence inference (decoding). A sequence itself is one-dimensional, but the patterns and regularities it contains are high-dimensional. For example, even a simple sequence like "cat chases mouse" may involve knowledge along multiple dimensions: species relationships, predatory behavior, spatial movement, and more. When multi-dimensional knowledge is expressed at the sequence level, a natural combinatorial explosion occurs. The huge appetite for big data is an effective strategy against that explosion. As long as the information is not fully redundant, each additional distinct sequence generally helps the model abstract data patterns more precisely.

However, the high-quality data humans naturally produce is finite. Pre-training has nearly exhausted the existing high-quality natural data. So the industry began exploring other growth curves for AI capability.

From Pre-training to Reasoning: Two S-curves or Three?

Zhang Junlin points out:

Since OpenAI's o1, the other two stages are no longer overlooked; each now has a name and its own scaling law: the reinforcement learning scaling law (RL Scaling Law) for post-training, and the Inference Scaling Law (also called the Test-Time Scaling Law) for online inference.

The question worth examining here: are there really three S-curves, or two? How comparable is the reasoning model's S-curve to the earlier pre-training S-curve?

In theory there are indeed three stages:

1. Pre-training
2. Post-training (especially reasoning-oriented reinforcement learning)
3. Inference

In theory, each of these stages could exhibit a positive S-curve relating resource investment to performance, i.e., some functional manifestation of scaling laws. In practice, however, in currently deployed applications, post-training and inference should share a single S-curve; in principle there are no two independent growth curves.

Of course, a user can use prompting techniques to influence the model's test time and push it to think more deeply, which may indirectly affect the length or depth of the CoT (chain of thought). But that is a change in the query, a change in the input context; it should not count as an independent S-curve of test-time compute.

Also, calling this wave of reasoning models a paradigm shift that opens new RL/test-time scaling laws feels a bit too categorical. Intuitively, the reasoning model's growth curve is most likely not directly comparable to the earlier pre-training scaling-law curve.

The "law" in scaling law is, as we all know, an empirical rule. Experience requires enough accumulated practice data before it can be summarized. Reasoning models empowered by reinforcement learning have only just begun; there is not yet enough empirical data to characterize what kind of growth relationship this is, how long it can last, whether it is a flash in the pan or can be sustained for quite a while, and so on.

Scaling that does not last long enough has little meaning as an empirical law. When Anthropic CEO Dario mentioned DeepSeek, he said (roughly) that DeepSeek looks so dazzling partly because it caught a good moment — the implication being that luck outweighed hardcore technical strength and innovation (amusingly, Anthropic has to date been unable to release any reasoning model, while since R1 both Google and Chinese labs have shipped them). Reasoning, he said, is just beginning, so anyone who walks this path enjoys a big jump at this early stage. Take PhD-level exam questions: an LLM without reasoning may score poorly, but once a reasoning model with so-called test-time-compute CoT arrives, scores shoot straight up, creating the impression of a miracle.

We are now in the early days of reasoning models. What happens next? How much growth is left from increasing test-time compute, or from continually lengthening the CoT? This question is unfolding now, with no clear answer yet. But my hunch is that this growth curve will be far less stable and durable than pre-training's, and so its standing as a scaling law may not hold up.

This second post-training capability-growth curve, the one said to reflect an RL scaling law, is hard to put on a par with the pretrain scaling law; it may well not be sustainable, and it may soon run into a strong backlash from constraining factors (see "The Constraints on Test-Time Compute" below).

DeepSeek R1's Lesson: The Truth About "Slow Thinking"

Take DeepSeek R1: a user can select "deepthink" mode to trigger slow-thinking chain-of-thought (CoT) reasoning, but in practice the user cannot improve reasoning quality by adding computation time. Why?

Consider a concrete example. Suppose we ask R1 to solve a complex math problem:

- A traditional model might answer directly: "The result is 42."
- R1 displays a detailed reasoning process: "Let's think step by step: 1) First consider... 2) Then we can... 3) Finally, the result is 42."

On the surface, R1's answer exhibits "slow thinking" (CoT), but in fact the reasoning process is a generation pattern already fixed during training, not a dynamic exploration of multiple possible reasoning paths at answer time. In other words, CoT+answer looks like an answer arrived at through slow thinking, but it does not change the unidirectional sequence-generation scheme of autoregressive NTP (next-token prediction). Bluntly put, R1's CoT+answer has the appearance of slow thinking, but the generative nature remains GPT's "fast thinking" paradigm. At test time, the depth and scale of thinking is not dynamically explored, although beam search can perform implicit internal multi-path selection.

The Constraints on Test-Time Compute

The industry's much-discussed "test time compute" refers to reasoning models with a CoT mechanism needing more online computational resources than traditional non-reasoning models. Take V3/R1: for the same problem, R1 with CoT enabled may need many times V3's computation time. But this increase results from behavior patterns fixed after training, not from dynamically adjustable compute investment. Without controllable scalability in test-time compute, there is no real test-time scaling law to speak of.

A major difference between pre-training and post-training CoT reinforcement learning is this: the pre-training scaling law can stay stable long-term because, once training completes, it barely affects online response time; the generation mode is simply query+answer. Months of offline pre-training are therefore tolerable, as long as the resulting model's capabilities improve substantially. But the post-training CoT reinforcement learning of reasoning models is different: it cultivates the model's habit of responding online with slow thinking, changing the generation mode to query+CoT+answer. Lengthening the CoT is not just a matter of training resources and time; more importantly, it shows up as extended test-time compute for every query at deployment, severely delaying system response. And when users interact with an online system, they generally have a tolerance ceiling for slow-thinking computation and waiting time.

This raises a key problem: even if research showed that indefinitely increasing CoT length (and correspondingly increasing online computation time) yields continued performance gains consistent with some empirical scaling law, that growth would still be constrained by deployment realities. Ordinary users may be willing to wait 5-10 seconds for a better answer, but if they must wait minutes or even hours, the experience degrades badly, to the point of being unacceptable.

The Sustainability Debate

Big names like OpenAI CEO Sam Altman and Anthropic CEO Dario might argue that for extremely complex problems (such as proving the Riemann hypothesis or designing next-generation aerospace vehicles), even a week of model computation is a huge advance over the decades a human team would need. But this argument has two problems:

1. LLM feasibility for such super-complex problems is far from validated
2. Extreme scenarios lack universality and cannot serve as data points for a sustainable scaling law

None of this denies the S-curve as a valid model for describing scaling laws, nor the reasonableness of stacking S-curves. The combined growth curves of the pre-training and post-training stages (s1 and s2) may indeed reflect the overall relationship between resource investment and performance gains. But we need to be cautious about whether CoT reasoning truly opens a sustainable scaling curve.

Conclusion: How Far Is the Road to AGI?

If the reasoning model's scaling law lacks sustainability, a deeper question arises: relying on these two scaling laws alone, can we reach the promised land of artificial general intelligence (AGI)? Further, is the technical ideal of artificial superintelligence (ASI) — AI replacing human labor and vastly boosting productivity — truly feasible?

Current evidence suggests that while the pre-training scaling law has shown considerable staying power, the reasoning model's scaling law may quickly run into practical constraints. This reminds us that the road to AGI/ASI likely requires more innovative breakthroughs, not just simple extrapolation of existing methods. In the next phase of AI development, we may need to find entirely new growth curves.

 

 

[Related]

Zhang Junlin: Scaling Laws as Seen from DeepSeek R1

DeepSeek Notes: R1's Inference Mechanism at Deployment

1. The reinforcement learning mechanism at training time: does GRPO optimize the policy (modify the model) by generating multiple candidate CoT+answer sequences, cultivating the model's slow-thinking habit?

The answer is an unambiguous YES.

2. The inference mechanism at deployment: does R1 implicitly generate multiple paths at generation time but display only one? If so, how does this mechanism compare with ensemble methods?

3. The difference from AlphaGo's MCTS: does MCTS dynamically build a search tree at inference time, whereas ensembling merely statically combines the outputs of multiple models?

1. Implicit Multi-path Reasoning at Deployment

At deployment, DeepSeek R1's inference mechanism can be summarized in two modes:

(1) Implicit multi-path generation and selection

- Multiple paths: in a single inference pass, the model may implicitly generate several potential reasoning paths (CoT+Answers) but output only one.
- Technical implementation: via decoding strategies (e.g., adjusting the beam width `beam_width`), the model maintains multiple candidate sequences during generation and finally selects the path with the highest overall score.
- User perception: users see only the final output, though internally the model explored and filtered multiple paths.
- Efficiency trade-off: with `beam_width=1` (greedy search), this degenerates to single-path generation with the fastest response; increasing `beam_width` can raise output quality at the cost of added latency.

(2) Explicit multi-candidate generation (requires explicit configuration)

- API-level control: by setting the `num_return_sequences` parameter, the model can explicitly generate multiple candidate answers (say, 5) for the user or a downstream system to filter further.
- In practice: the DeepSeek App does not expose this by default, but enterprise APIs or open-source code may support it.

Key points
- Training-stage reinforcement learning optimized the model's "single-path CoT generation ability": thanks to GRPO training, a single generation pass already outputs high-quality, detailed reasoning steps (a long CoT), without relying on explicit multi-candidate generation.
- Multi-path exploration at deployment is icing on the cake: implicit multi-path search (e.g., beam search) or explicit multi-candidate generation can further improve output quality, but it is not an essential feature.

 

2. Cultivating the "Slow Thinking Habit" During Training

(1) The role of reinforcement learning

- Objective: via the GRPO algorithm, the model learns to generate more detailed, better-reasoned steps (longer CoTs) to increase reward (e.g., answer correctness).
- Mechanism: training generates multiple candidate answers; the reward signal evaluates not only final-answer correctness but also implicitly encourages logically coherent reasoning paths (e.g., through format rewards).

(2) What drives CoT growth

- Reward design: if longer CoTs more easily yield correct answers (e.g., step-by-step solving reduces errors), policy optimization naturally favors generating longer, more detailed steps. Given room for [think], a reasoning model just wants/tends to think deep!
- Data feedback: late in training, high-quality SFT data produced via rejection sampling further reinforces this pattern.

Result: after training, a single generation pass already outputs high-quality detailed reasoning (the "slow thinking habit" has been internalized).

3. Similarities to and Differences from Ensemble Methods

(1) Similarities

- Multi-path generation: implicit multi-path exploration (e.g., beam search) can be viewed as one model producing several potential outputs, similar to multi-model prediction in ensembles.
- Result filtering: choosing the best output by confidence resembles voting or weighted averaging in ensembles.

(2) Core differences

R1's implicit multi-path generation is essentially a dynamic decoding strategy within a single model, whereas traditional ensembling is a static combination of multiple models; the two differ fundamentally in implementation cost and in the source of diversity.

4. Essential Differences from AlphaGo's Monte Carlo Tree Search (MCTS)

(1) AlphaGo's MCTS mechanism

- Dynamic planning: builds a search tree via simulation, evaluating the long-term payoff of each move (e.g., win rate) and dynamically choosing the best path.
- Online learning: adjusts its search strategy at inference time based on real-time feedback (e.g., the opponent's move); some versions (e.g., AlphaZero) even update model parameters.

(2) R1's implicit multi-path generation

- Static model: model parameters are fixed at deployment; multi-path generation relies on the pretrained policy and decoding rules.
- No long-term payoff modeling: path selection is based on the model's own confidence probabilities, not on the cumulative payoff of multi-step decisions.

(3) The difference

- R1's multi-path generation is a limited exploration under a static policy, relying on the CoT+answer generation ability internalized during training.
- MCTS is a dynamic planning process that maximizes long-term payoff through real-time simulation and evaluation; it is online decision optimization.

 

5. Summary

- Training-stage goal: GRPO uses reinforcement learning to cultivate the model's habit of generating detailed CoTs, so that at deployment a single generation pass already yields a reasonable answer.
- Deployment-stage flexibility: the system can choose single-path generation (fast response) or multi-path filtering (quality first); the latter resembles lightweight ensembling.

- Model parameters are indeed fixed after training, but implicit multi-path generation depends on the decoding strategy (e.g., beam width), which users can adjust via API parameters, so it is not entirely static.
- The substantive difference from ensembling: R1's multiple paths are different decoding paths of the same model, whereas traditional ensembles rely on multiple independent models; the latter offer more diversity but at sharply higher cost.

- The core of MCTS is dynamic search plus long-term payoff modeling, not the averaging of multi-model predictions. R1's implicit multi-path generation is closer to an extension of greedy search than to a planning process.

DeepSeek R1's deployment mechanism internalizes the "slow thinking" ability through training-stage reinforcement learning, so a single generation pass already produces detailed reasoning. Implicit multi-path generation (e.g., beam search) can further improve quality, but it is essentially decoding-strategy optimization within one model, clearly different from both traditional ensembles and AlphaGo's MCTS. This design balances efficiency and effectiveness, fitting the needs of large-scale industrial applications, at the cost of dynamic planning and global optimization capabilities.

 

[Related]

Hallucinations in AI: Bug or Feature? A Deep Dive into DeepSeek-R1

Host: Hello everyone! Welcome to today's interview. Recently, there's been quite a buzz about AI "hallucinations," especially with DeepSeek-R1, which seems to have a higher hallucination rate than its predecessor, DeepSeek-V3. Today, we're joined by Dr. Li, a senior AI researcher. Welcome, Dr. Li!

Dr. Li: Hello, host! Hello, everyone!

Host: Let's start with the million-dollar question: Why do large language models "hallucinate"? Can you break it down for us in plain English?

Dr. Li: You see, large language models are like super-powered conversation completers. Give them the first half of a sequence, say, a question, and they'll predict the second half (say, an answer) based on their massive knowledge network. They learn like our brains do – they can't remember everything word-for-word, so they compress and generalize, grabbing the gist and finding patterns.

Here's a fun contrast: Ask them "How tall is Yao Ming?" and they'll nail it because that's such famous knowledge, this data point is practically carved in stone in their memory (represented in the model's parameter weights). But ask them "How tall is Old Wang from next door?" and they're stumped because they've never met Old Wang! But here's the kicker – they won't just say "I don't know." So what do they do? They "make up" a reasonable height based on what they know about the range of human heights. That's a hallucination for you!

Host: Wow, that's some impressive guesswork! But isn't this kind of making things up pretty problematic?

Dr. Li: Not necessarily! In a way, hallucination is imagination (for better or worse) – it's where creativity lies! Think about it: all those great literary works, artistic masterpieces – aren't they all flights of fancy, products of imagination? If everything had to match reality closely, art would just be photography, and where's the fun in that?

You know, Yuval Harari makes a fascinating point in "Sapiens" – humans became Earth's dominant species precisely because we could "tell stories," creating myths, religions, nations, and money – things that don't physically exist. These are all "hallucinations," but they're the driving force behind civilization!

Host: When you put it that way, hallucinations sound pretty important! But let's talk about DeepSeek-R1. Its hallucination issue seems quite serious.

Dr. Li: Indeed, it is! The academic consensus used to follow OpenAI's view that reinforced reasoning would significantly reduce hallucinations. I remember discussing this with a head honcho at an LLM unicorn who was particularly excited about reasoning's potential to curb hallucinations. But R1's performance threw us a curveball!

According to Vectara's tests, R1's hallucination rate is more than 3 times higher than its foundation model V3's – 14.3% compared to 3.9%. This definitely correlates with its prolonged "Chain of Thought" (CoT) enabled by reinforcement learning for reasoning. R1 is absolutely brilliant at reasoning, math and coding, as well as poetry and storytelling, but this currently comes with the "side effect" of increased hallucinations in tasks like translation and summarization.

More specifically, there are several reasons for R1's increased hallucinations.

First, the standard hallucination tests use summarization tasks, something base models are already pretty good at. In this case, reinforcement learning can backfire – it's like using a cannon to swat a fly!

Second, R1's reinforced reasoning chains weren't specifically optimized for straightforward tasks like summarization, translation, or news writing that demand strict factual accuracy. Instead, it tries to add various layers of thinking to every task. Looking at its transparent CoT (ChainOfThought) printout, we see it tirelessly analyzing even simple instructions from multiple angles. This overcomplication of simple tasks can lead to deviations and hallucinations.

During R1's reinforcement learning for NLP-related tasks, it seems the model was rewarded more heavily for creativity, leading it to be more imaginative – and consequently more prone to straying from facts. For mathematical and coding tasks, R1's supervision came from gold standards (test answers or code test cases). But for humanities tasks, they used V3 or V3's reward model to judge quality, and the current system seems to clearly favor creativity.

Moreover, user feedback typically tends to focus on and encourage creativity. Most people aren't sensitive to hallucinations, especially when they're wrapped in the model's smooth, fluent language. For most frontline developers, this kind of user feedback naturally pushes them to enhance creativity rather than tackle the thorny problem of hallucinations.

Host: So you're saying R1's hallucination problem is rooted in its over-enthusiastic reasoning? What's the real relationship between reinforced reasoning ability and hallucinations?

Dr. Li: It's still a puzzle – there doesn't seem to be a simple correlation. Look at R1, a leading reasoning model, versus Claude 3.5 Sonnet, a top non-reasoning model. Surprisingly, Sonnet still has a higher hallucination rate than R1! But when we compare R1 to its base model V3, we see clearly that adding reasoning significantly increased hallucinations.

It may well be about the model's "personality." R1, with its powerful reinforcement learning, loves "divergent thinking." Give it a simple prompt, and it'll spin out ideas like there's no tomorrow – its CoTs could run on like crazy! This suggests that while R1 was powering up its creativity, it inevitably amplified creativity's twin: hallucination.

As a model that excels in both STEM and humanities, R1 performs differently across tasks. In mathematics and coding, where more rigorous reasoning is required, there's little room for hallucination. But in language and creative tasks, especially in the summarization tests, hallucinations become more prominent. It's largely a side effect of R1's supercharged linguistic creativity.

Technically speaking, R1 automatically adds lengthy CoTs to simple user instructions, essentially complicating straightforward tasks. Its CoTs (like the internal monologue of an entity following instructions) change the conditional part of the autoregressive probability model before generating answers, naturally affecting the final output. Compare:

V3: query → answer
R1: query+CoT → answer

For tasks that V3 already handles well, like summarization or translation, any lengthy CoT guidance might lead to deviation or embellishment, creating fertile ground for hallucinations.
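The schema above is just a change in the conditioning context of the same autoregressive factorization: answer tokens are predicted from P(token | query, CoT) instead of P(token | query). A toy sketch of how an inserted CoT can shift the distribution over the final answer (all distributions and probabilities are invented for illustration):

```python
# Toy conditional distributions over the answer, keyed by the context
# the model has seen so far. All numbers are invented for illustration.
P_ANSWER = {
    ("summarize",): {"faithful_summary": 0.8, "embellished_summary": 0.2},
    ("summarize", "cot"): {"faithful_summary": 0.5, "embellished_summary": 0.5},
}

def answer_distribution(query, cot=None):
    """Same generator, different conditioning context."""
    context = (query,) if cot is None else (query, cot)
    return P_ANSWER[context]

v3_style = answer_distribution("summarize")             # query -> answer
r1_style = answer_distribution("summarize", cot="cot")  # query + CoT -> answer
```

The point of the sketch: nothing about the decoding algorithm changed, yet prepending a CoT to the context moves probability mass toward embellishment for a task the base conditioning already handled well.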

Host: So where do R1's hallucinations mainly occur?

Dr. Li: Think of R1's abilities as split between "arts" and "sciences." In "science" areas like math and coding, its logic is fairly strong and hallucinations are relatively rare. But in "arts" areas like language, hallucinations become more noticeable.

R1's most impressive achievement compared to the first LLM reasoning model O1 is successfully extending mathematical and coding reasoning capabilities into creative writing, especially in Chinese. The internet is full of R1's brilliant literary works. In terms of wordplay and literary prowess, it clearly surpasses 99% of humans – even graduate students in literature and classical Chinese professors sing its praises.

But watch what happens when you ask it to do a simple summary – it can't help but "get creative," often "inventing" details not present in the original text. It's like its "arts" abilities are too powerful, a case of "too much of a good thing."

Host: That's an interesting perspective. Do all language tasks require creativity?

Dr. Li: Language tasks actually fall into two categories: ones that need high creativity, like poetry and fiction writing, and ones that demand high factual accuracy, like news reporting, translation, or summarization. R1 excels at the former, which was likely the development team's focus, but this creates side effects in the latter as it is today.

It reminds me of the old Chinese saying about translation needing to be "faithful, expressive, and elegant" – achieving all three has always been challenging. We see many examples where elegance is prioritized over faithfulness, like the use of hyperbole in literary works. We also see the opposite, like Lu Xun's advocacy for so-called "rigid translation."

Interestingly, humans have always had double standards here, but we have a mental switch we can flip at will. When watching movies or reading novels, we flip towards creativity and don't fuss about factual accuracy. But switch to news channels, and we have zero tolerance for falsehoods.

Host: People tend to believe content that appears logically coherent and detailed, so the potential harm from AI hallucinations could be significant. What should we ordinary folks do about AI hallucinations?

Dr. Li: While many people are starting to notice and become wary of these hallucinations amid their amazement at LLMs' creativity, most are still mesmerized by their creative brilliance. We need to raise public awareness of AI hallucinations. I suggest a few practical strategies:

Stay Alert: Don't take everything the model says as granted, especially factual claims. Hallucinations most commonly occur with names, places, times, locations, and other entities or numerical data.

Cross-Verify: For important details, check original sources online or consult experts to see if the claims align.

Guide the Model: When asking questions, add constraints like "please stay faithful to the original text" or "please verify facts." This can at times help reduce hallucinations.

Embrace Creativity: If you're looking for inspiration or creative ideas, model hallucinations can be a delightful surprise!

Think of AI hallucinations as "possibilities in parallel universes." What the model makes up might not be true in our world, but could be true in another! It's like how novelists write fiction – while it cannot withstand fact-checking, it's a kind of "artistic truth." Just as novels arise from life but transcend it, AI arises from data but transcends it. AI compresses data into a network of knowledge and common sense, not necessarily faithful to individual facts – that's what databases are for.

Host: This reminds me of what people often say: AI models aren't just "talking nonsense" – they're "talking nonsense seriously"!

Dr. Li: Haha, that's exactly it! AI hallucinations are its "educated guesses," based on the massive knowledge and patterns it has learned. They are by no means completely random – they obey internal constraints that make them seamless and convincing, but also more deceptive. Newcomers to AI need to be especially careful not to take everything at face value.

For regular users, understanding the nature of hallucinations is needed. For example, when asking about well-documented facts like "How long is the Yangtze River?" models won't make mistakes because these facts are firmly encoded in their parameters. But ask about an obscure creek or fictional river, and the model will activate its "reasonable completion" mechanism and make something up.

Host: Following your logic, human language itself provides a breeding ground for hallucinations.

Dr. Li: You could say that. Language enabled humans to create things that do not exist in the physical world – myths, religions, states, corporations, currency – and abstract concepts like ideals and beliefs. Harari emphasizes in "Sapiens" that storytelling (essentially, hallucination) was fundamental to civilization: language gave humans the ability to tell stories, and those hallucinations catalyzed civilization. Humans are the only entities capable of 'lying' – besides LLMs.

Host: What about the future? Is there a way to maintain creativity while reducing hallucinations?

Dr. Li: This is definitely one of the "ultimate challenges" in AI! People are working on various solutions, including:

More Refined Training: During training, treat different types of tasks differently, teaching the model when to be strict and when to be creative.

Task-Specific Fine-tuning/Reinforcement Learning can help balance this contradiction. Tasks like summarization, paraphrasing, translation, and reporting need special care because they require both some creativity (like style) and strict factual accuracy.

Specifically, R1's training pipeline has four stages: fine-tuning 1, reinforcement 1, fine-tuning 2, and reinforcement 2. Reinforcement 2 mainly focuses on human preference alignment. Currently, this process seems to favor creativity over faithfulness, which could be rebalanced later. Perhaps more importantly, in stage three (i.e. fine-tuning 2), we could strengthen constraints for different tasks – for example, increasing supervised data for summarization to encourage faithful, straightforward results.

Routing: In the future, there will be a "model dispatcher" that assigns different models based on task type. Simple tasks could go to V3 or use tools, while complex tasks requiring deeper thinking go to R1.

For instance, arithmetic tasks should just use simple code calculations, equivalent to using a calculator. That's not how it works now – yesterday I tested a nine-digit multiplication, and R1 spent over three minutes thinking, producing a CoT that could stretch down the street, breaking the reasoning down step by step. While the answer was correct, using such computationally expensive CoT for arithmetic instead of a simple function call is unreasonable. A one-line calculation would do the job – no need to waste so many computing resources and tokens on explicit reasoning. These are foreseeable routing improvements, especially in the age of AI agents that can use all kinds of tools and applications. R1's CoT doesn't need to handle everything – besides hallucinations, compute-burning CoT is also not environmentally friendly.
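The calculator-routing idea can be sketched in a few lines of Python. This is a toy illustration only: `call_reasoning_model` is a hypothetical stand-in for dispatching to an R1-style model, not a real API.

```python
import re

def call_reasoning_model(query: str) -> str:
    # Hypothetical placeholder for routing a query to a slow-thinking model.
    return f"[R1 handles: {query}]"

def route(query: str) -> str:
    """Toy dispatcher: send pure arithmetic to a calculator, everything else to the LLM."""
    # Match expressions built only from digits, whitespace, and basic operators.
    if re.fullmatch(r"[\d\s+\-*/().]+", query.strip()):
        # Equivalent to a calculator / function call: one line, no chain-of-thought.
        return str(eval(query, {"__builtins__": {}}, {}))
    return call_reasoning_model(query)

print(route("123456789 * 987654321"))  # exact answer, milliseconds, no CoT
print(route("Summarize this article"))
```

A production dispatcher would classify tasks far more robustly, but the economics are the same: one cheap function call replaces minutes of explicit reasoning.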

Host: Thank you, Dr. Li, for this fascinating discussion! Today's interview has given us a much deeper understanding of AI hallucinations.

Dr. Li: My pleasure! It's been great chatting with you!

 

【相关】

从R1幻觉谈起,大模型幻觉是缺陷还是创意火花?

主持人: 大家好,欢迎来到今天的访谈。最近,大模型的“幻觉”问题引发了热议,尤其是DeepSeek-R1,它的幻觉率似乎比自己的基座大模型DeepSeek-V3高不少。今天我们请到了资深AI研究员立委博士,来跟大家聊聊这个话题。立委,您好!

立委: 主持人好,大家好!

主持人: 老李,咱们先来个灵魂拷问:为啥大模型会“产生幻觉”?能不能用大白话给大家解释一下?

立委: 这可算是大模型的经典问题。其实啊,大模型就像一个“超级接话茬儿高手”,你给它上半句,它就根据自己学过的海量知识,预测下半句该说啥。它学东西呢,就像咱们人脑记东西一样,不可能每个字都记得清清楚楚,它会进行压缩和泛化,抓大意、找规律。

打个比方,你问它“姚明有多高”,它大概率不会错,因为这知识点太出名了,它记得牢。但你要是问“隔壁老王有多高”,它可能就懵了,因为它没见过老王啊!但它又不能不回答,咋办?它就得“脑补”,根据“一般人有多高”这个学到的概念,给你编一个数出来,这就是“幻觉”。

主持人: 哎呦,这“脑补”能力也太强了!胡编乱造,这幻觉也太糟糕了。

立委: 那可不一定!你看啊,某种程度上,幻觉就是想象力(褒贬不论),就是创意!你想想,那些伟大的文学作品、艺术作品,哪个不是天马行空、充满想象?要是啥都得跟现实一模一样,艺术就成了照相机了,那还有啥意思?

就像赫拉利在《人类简史》里说的,人类之所以能成为地球霸主,就是因为我们会“讲故事”,会创造出神话、宗教、国家、货币这些现实中不存在的东西。这些都是“幻觉”,但它们却是文明诞生和发展的原动力。

主持人: 听您这么一说,感觉幻觉还挺重要的。那咱们回到DeepSeek-R1,它的幻觉问题真的很严重。

立委: 是很严重。此前学界普遍认同OpenAI的说法,推理增强会明显减少幻觉。我曾与大模型公司的一位负责人讨论,他就特别强调推理对减少幻觉的积极作用。但 R1 的表现却给出了一个出人意料的答案。

根据Vectara的测试,R1的幻觉率确实比V3高不少,R1的幻觉率14.3%,显著高于其前身V3的3.9%。这跟它加强了的“思维链”(CoT)和创造力直接相关。R1在推理、写诗、写小说方面,那叫一个厉害,但随之而来的“副作用”就是幻觉也多了。

具体到R1,幻觉增加主要有以下几个原因:首先,幻觉标准测试用的是摘要任务,我们知道摘要能力在基座大模型阶段就已经相当成熟了。在这种情况下,强化反而可能产生反效果,就像用大炮打蚊子,用力过猛反而增加了幻觉和编造的可能。

其次,R1 的长思维链强化学习并未针对摘要、翻译、新闻写作这类相对简单而对于事实要求很严格的任务做特别优化,而是试图对所有任务增加各种层面的思考。从它透明的思维链输出可以看到,即便面对一个简单的指令,它也会不厌其烦地从不同角度理解和延伸。过犹不及,这些简单任务的复杂化会引导结果偏离发挥,增加幻觉。

DeepSeek-R1在文科类任务的强化学习训练过程中,可能对模型的创造性给予了更多的奖励,导致模型在生成内容时更具创造性,也更容易偏离事实。我们知道,对于数学和代码,R1的监督信号来自于这些题目的黄金标准(习题集中的标准答案或代码的测试案例)。他们对于文科类任务,利用的是V3或V3的奖励模型来判定好坏,显然目前的系统偏好是鼓励创造性。

另外,用户更多的反馈还是鼓励和欣赏见到的创造力,一般人对于幻觉的觉察并不敏感,尤其是大模型丝滑顺畅,识别幻觉就更加困难。对于多数一线开发者,用户的这类反馈容易促使他们更加向加强创造力方向努力,而不是对付大模型领域最头痛的问题之一“幻觉”。

主持人: 这么说来,R1 的幻觉问题是不是源于它过于"积极"的思维推理?但推理能力增强和幻觉之间到底是什么关系?

立委:这个关系很微妙,并不是简单的正相关或负相关。你看 R1 是头部推理模型,而 Claude 3.5 Sonnet 是头部的非推理大模型,但后者的幻觉率反而高于前者。可是当我们对比 R1 和它的基座模型 V3 时,又发现增加推理强化后幻觉确实显著增加了。

这事儿跟大模型的“性格”有关。R1这家伙,强化做得给力,特别喜欢“发散思维”,你给它一个简单的指令,它能给你想出一大堆东西来,思维链能绕地球三圈!这似乎说明 R1 在强化创造力的过程中,不可避免地增加了创造力的伴生品:幻觉。作为一个文理通吃的推理大模型,R1 在不同领域的表现并不一样。在数学、代码等需要严谨推理的领域,幻觉的空间很小。但在语言创作领域,尤其是现在被测试的摘要任务上,幻觉问题就明显得多。这更多是 R1 语言创造力爆棚带来的副作用。

具体从技术角度来说,R1 会为用户的简单指令自动增加很长的思维链,等于是把一个简单明确的任务复杂化了。你一个简单的指令,它也反复从不同角度理解和衍伸(CoT思维链好比“小九九”,就是一个实体遵从指令时的内心独白)。思维链改变了自回归概率模型生成answer前的条件部分,自然会影响最终输出。

V3: query → answer
R1: query + CoT → answer

对于 V3 已经能很好完成的任务,比如摘要或翻译,任何思维链的长篇引导都可能带来偏离或发挥的倾向,这就为幻觉提供了温床。
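上面的条件化差异可以用一个极简的玩具示意(纯演示;next_token 是假设的函数,不代表任何真实模型的 API):自回归模型逐 token 生成,CoT tokens 一旦进入上下文,answer 的条件分布就随之改变。

```python
def autoregress(next_token, prompt: str, n: int) -> str:
    """逐 token 生成:每个新 token 都以此前的全部文本为条件。"""
    text = prompt
    for _ in range(n):
        text += next_token(text)
    return text[len(prompt):]

# 假设性的 next_token:一旦上下文里出现过思维链标记,输出分布就不同
def next_token(context: str) -> str:
    return "B" if "<think>" in context else "A"

v3_answer = autoregress(next_token, "query", 3)                      # 直接条件于 query
r1_answer = autoregress(next_token, "query <think>...</think>", 3)   # 条件于 query + CoT

print(v3_answer, r1_answer)  # 同一个 query,条件前缀不同 → 输出不同
```

这正是正文所说的:思维链改变了生成 answer 前的条件部分,对 V3 已能胜任的简单任务,这种长前缀就可能把输出带偏。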

主持人: 那对于R1来说,幻觉主要是出在哪方面呢?

立委: 我觉得可以把R1的能力分成“文科”和“理科”来看。它在数学、代码这些“理科”方面,逻辑性很强,幻觉相对少。但在语言文字这些“文科”方面,幻觉就比较明显了。

比起O1,R1 最令人惊艳的成就,是成功将数学和代码的推理能力充分延伸到了语言创作领域,尤其在中文能力方面表现出色。网上流传着无数的R1精彩华章。舞文弄墨方面,它显然超过了99%的人类,文学系研究生、甚至国学教授也赞不绝口。

但你看,让它做个摘要,本来是很简单的任务,但它非得给你“发挥”一下,结果就容易“编”出一些原文里没有的东西。这其实是它“文科”太强了,有点“用力过猛”。

主持人:这个说法有意思。那么在具体应用中,语言任务是不是都需要创造力呢?

立委:语言能力其实可以细分为两类:一类需要高创造力,比如写诗歌、小说;另一类需要高度真实性,比如新闻报道、翻译或摘要。R1 最受称赞的是前者,这也可能是研发团队的重点方向,但在后者中就出现了副作用。

这让我想到中国古人说的"信达雅",自古难全。为"雅"牺牲"信"的例子我们见得很多,文学创作中夸张的修辞手法就是重要手段和例证。为"信"牺牲"雅"也有先例,比如鲁迅先生推崇的"硬译"。

有趣的是,我们人类在这方面其实一直是双标的,但我们心里有个可以随时切换的开关。看小说和电影时,我们把开关偏向创造性一侧,完全不会去纠结细节是否真实;但一旦切换到新闻频道,我们就对虚假内容零容忍。

主持人: 人对于逻辑看起来清晰自洽、且详细的内容,就会倾向于相信,所以大模型幻觉潜在的危害真的很大。那咱们普通人,面对大模型的幻觉,该咋办呢?

立委: 很多人在惊艳R1创造力的同时,现在开始慢慢注意到这个幻觉现象并开始警惕了。但更多人还是沉浸在它给我们带来的创造性的惊艳中,需要增强大众对模型幻觉的 awareness。我觉得吧,咱们可以“两手抓”:

保持警惕: 大模型说的话,特别是涉及到事实的,别全信,多留个心眼。最容易产生幻觉的地方是人名、地名、时间、地点等实体或数据。

交叉验证: 重要的细节,可上网查查原始资料或询问身边专家,看看说法是不是一致。

引导模型: 你可以在提问的时候,加一些限定条件,比如“请务必忠于原文”、“请核对事实”等等,这样可以引导模型减少幻觉。

享受创意: 如果你需要的是灵感、创意,那大模型的幻觉,会给你带来惊喜!

不妨把大模型的幻觉,看成是“平行世界的可能性”。它编出来的东西,也许在这个世界不是真的,但在另一个世界里,说不定就是真的呢!就像小说家写小说,虽然是虚构的,也是一种“艺术真实”。源于生活,高于生活。大模型是源于数据,高于数据。大模型压缩的是知识体系和常识,不是一个个事实,后者是数据库的对象。

主持人: 妙啊!您这说法,让我想起了大家常说的一句话:大模型不是在“胡说八道”,而是在“一本正经地胡说八道”。

立委: 哈哈,差不多就是这个意思!大模型的幻觉,其实是它“脑补”出来的,但它“脑补”的依据,是它学到的海量知识和规律。所以,它的幻觉,往往不是乱来的,有“内在的合理性”,这才丝滑无缝,假话说的跟真的似的,但同时也更具有迷惑性。初玩大模型的朋友,需要特别小心,不能轻信。

对于普通用户来说,理解幻觉的特点很重要。比如问"长江多长"这类有足够信息冗余的百科知识问题,大模型不会出错,这些事实是刻在模型参数里的。但如果问一个不知名的小河或虚构河流的长度,模型就会启动"合理补白"机制编造。

主持人: 按照您的说法,人类的语言本身就是幻觉的温床。

立委: 可以这么说。语言使得人类创造了神话、宗教、国家、公司、货币等非真实实体的概念,以及理想、信念等形而上的意识形态。赫拉利在《人类简史》中强调了幻觉对于文明的根本作用:语言的产生赋能了人类幻觉(“讲故事”)的能力。幻觉是文明的催化剂。人类是唯一的会“说谎”的实体 -- 除了LLM外。

主持人: 那么在幻觉的背后,大模型是怎么运行的呢?

立委: 幻觉的本质是补白,是脑补。

“白”就是某个具体事实,如果这个事实在训练数据中没有足够的信息冗余度,模型就记不住(零散事实等价于噪音)。记不住就用幻觉去补白,编造细节。

幻觉绝不是没有束缚的任意编造,大模型是概率模型,束缚就是条件概率中的前文条件。幻觉选择的虚假事实需要与补白所要求的value类型匹配,即符合ontology/taxonomy 的相应的上位节点概念。“张三”可以幻觉为“李四”,但不可以幻觉成“石头”。

所谓艺术真实是说,小说创作虽然可能背离了这个世界的事实,但却是可能的数字世界的合理想象。大模型的幻觉属于此类。

大模型的知识学习过程(训练阶段),是一种信息压缩过程;大模型回答问题,就是一个信息解码过程(推理阶段)。好比升维了又降维。一个事实冗余度不够就被泛化为一个上位概念的slot,到了生成阶段这个slot必须具像化补白。“张三”这个事实忘了,但【human】这个slot 的约束还在。补白就找最合理、与 slot 概念最一致的一个实体,于是“李四”或“王五”的幻觉就可以平替“张三”。小说家就是这么工作的,人物和故事都是编造的。无论作家自己还是读者,都不觉得这是在说谎,不过所追求的真善美是在另一个层面。大模型也是如此,大模型是天生的艺术家,不是死记硬背的数据库。“张冠李戴”、“指鹿为马”等在大模型的幻觉里非常自然,因为张和李是相似的,马和鹿也在同一条延长线上。在泛化和压缩的意义上二者是等价的,因此是合理的想象。
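这种“slot 类型约束下的补白”可以用一个玩具示意(ontology 是假设的简化本体,仅作演示):被遗忘的实体只能被同一上位概念下的实体平替,这正是“张三”可以幻觉成“李四”、却不能幻觉成“石头”的机制。

```python
# 假设的简化本体:实体 → 上位概念(slot 类型)
ontology = {
    "张三": "human", "李四": "human", "王五": "human",
    "石头": "object", "长江": "river",
}

def confabulate(forgotten_entity: str, candidates: list) -> list:
    """忘记具体实体时,从同一上位概念中挑"合理"的平替,排除类型不符者。"""
    slot_type = ontology[forgotten_entity]
    return [c for c in candidates
            if c != forgotten_entity and ontology[c] == slot_type]

print(confabulate("张三", ["李四", "石头", "王五"]))  # 石头被类型约束排除
```

真实大模型当然没有显式的符号本体,这种约束是从分布中隐式学到的,但效果类似:幻觉不是任意编造,而是受条件概率和概念层级束缚的“合理想象”。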

主持人: 未来有没有什么办法,能让大模型既有创造力,又少出幻觉呢?

立委: 这绝对是AI大模型领域的“终极难题”之一!现在大家都在想办法,比如:

更精细地训练: 在训练的时候,就对不同类型的任务区别对待,让模型知道什么时候该“老实”,什么时候可以“放飞”。

针对任务做偏好微调(fine-tune)和/或强化学习(RL)可以减缓这个矛盾。摘要、改写、翻译、报道这类任务需要特别小心和平衡,因为它既有一点再创造的需求(例如文风),又本性上需要内容忠实。

具体说,R1训练pipeline是四个过程:微调1,强化1,微调2,强化2。强化2主要是与人类偏好对齐的强化。这个过程在创造力与忠实方面,目前看来倾斜于前者,后续可以再做平衡。也许更重要的是在阶段三的微调2中,针对不同任务加强约束,例如,增加摘要的监督数据,引导忠实平实的结果。

Routing: 以后可能会有一个“调度员”,根据任务的类型,安排不同的模型来处理。比如,简单任务交给V3或调用工具,慢思考的复杂任务交给R1。

例如,识别出算术任务,就去写个简单代码运算,等价于调用计算器。目前不是这样,我昨天测试一个九位数的乘法,R1 思考了三分多钟,思维链打印出来可以铺开来一条街,步步分解推理。虽然最后答案是对了,但算术问题用耗费太大的所谓 test time compute 的思维链(CoT),而不用 function call,完全不合理。一行计算代码就搞定的事,没必要消耗如此多的计算资源和tokens去做显式推理。

这些都是可以预见的 routing,尤其是在agent时代。 R1 CoT不必包打一切,而且除了幻觉,也不环保。

主持人: 感谢老李的精彩分享!今天的访谈让我们对大模型的幻觉有了更深入的认识。

立委: 不客气,很高兴和大家交流!

 

【相关】

Deepseek-R1 的幻觉率是 14.3% - 比其非推理前身 Deepseek-V3 高得多
榜单排名:https://github.com/vectara/hallucination-leaderboard

Understanding the Power of Chain of Thought

DeepSeek R1 has become the most talked-about breakthrough in recent times. It not only matches OpenAI's top reasoning models (the 'o' series) in mathematics and coding capabilities but also produces stunning results in linguistic creativity and mimicry. Particularly in Chinese (classical) capabilities, everyone has experienced a miraculous leap in performance.

All of this can be attributed to the reasoning-enhanced Chain of Thought (CoT). Why is CoT so effective, so magical, and how has it maximized its empowering effect through reinforcement?

The key likely lies in the fact that CoT tokens are autonomously generated by the large model, effectively reducing the perplexity from query to answer, serving as a bridge to brilliant performance. Those who have seen CoT outputs know that the bridge itself isn't always impressive - it often seems overwrought, overly cautious, verbose, redundant, and methodical - yet it enables magnificent answers to emerge. From first principles, this seems to involve deep implications of perplexity in information theory.

The Essence of CoT

  1. From an Information Theory Perspective:
  • CoT builds a low-entropy channel between high-perplexity queries and answers
  • Through step-by-step decomposition, each step's conditional probability becomes more "natural" and smooth, aligning with the language model's nature
  • Eventually transforming seemingly "leaping" reasoning conclusions into a series of accumulated "small steps"
  2. From an Information Entropy Viewpoint:
  • For complex problems, directly jumping from query to answer requires crossing a vast information gap, which "forces" the model to hallucinate and output random answers
  • Each CoT step reduces local conditional entropy
  • It's like breaking down a large information compression/decoding task into multiple smaller ones
  3. This Explains Why Even "Mundane" CoT is So Effective:
  • Its power doesn't lie in how brilliant the process steps themselves are
  • Rather, it creates a path of decreasing information entropy
  • The model can stably migrate toward the target along this path
  4. This Also Explains Why DeepSeek's Training is So Vital to Its Success:
  • It's not about teaching the model "smarter" reasoning, which is undefinable in humanities tasks
  • Instead, it optimizes the ability to construct these low-entropy channels
  • Essentially optimizing information flow path planning

This perspective provides a lens for understanding CoT, reinterpreting the surface-level "chain of thought" as an "entropy reduction pathway" in information theory terms. It offers a reasonable explanation for result-driven reinforcement learning without process supervision:

Process is important, but process supervision isn't, because the process data naturally produced by large models is more practical and feasible than any human supervision. Let us embrace the transition from human supervision to LLM-internal self-supervision.
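The "entropy reduction pathway" intuition can be made concrete with a toy calculation. The numbers below are illustrative only, not measurements from any real model: the total information gap between query and answer is unchanged, but CoT spreads it over many low-surprisal steps.

```python
import math

# Direct jump: the answer is a single low-probability event given the query.
p_direct = 1e-3
surprisal_direct = -math.log2(p_direct)   # bits the model must "leap" in one step

# CoT path: the same total probability mass, spread over 10 self-generated steps,
# each step individually "natural" (high conditional probability).
n_steps = 10
p_step = p_direct ** (1 / n_steps)        # ≈ 0.50 per step
surprisal_step = -math.log2(p_step)       # ≈ 1 bit per step

print(f"direct leap: {surprisal_direct:.2f} bits in one step")
print(f"CoT path:    {surprisal_step:.2f} bits per step over {n_steps} steps")
```

The chain rule guarantees the per-step surprisals sum back to the original gap; what changes is that no single generation step has to cross it all at once, which is exactly where hallucination gets "forced" in the direct jump.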

 

【相关】

推理强化模型中思维链的本质

DeepSeek R1 的出圈是近来最大热度的焦点。它不仅在数学、代码等强推理能力上追平了 OpenAI 头部推理模型 o 系列,而且在语言文字的创造力和模仿力方面产生让人惊艳的效果。尤其是在中文(国学)的能力方面,大家都体会到了奇迹般的能力跃升。

这一切都要感谢推理强化的 CoT(思维链)。CoT 为什么这么有效,这么神奇,文理通吃,在强化中最大化了其赋能作用呢?

应该主要是因为 CoT 是从大模型自主生成出来的 tokens,它有效降低了从 query 到 answer 的 perplexity(困惑度),好比是为高质量结果提供了一个桥梁。看过CoT输出的同学都有体会,桥梁本身并不总是精彩,常常给人的感觉是小题大作,瞻前顾后、啰哩啰嗦、信息冗余,但精彩的answer却可以借助它面世。从第一性原理看,这里似乎涉及perplexity在信息论中的深刻含义。

CoT的本质

  1. 从信息论角度理解:
  • CoT是在high-perplexity的query和answer之间搭建的低熵通道
  • 通过逐步分解,每一步的条件概率都变得更趋"自然"和丝滑,符合语言模型的本性
  • 最终让看似"跳跃"的推理结论,变成了一系列"小步走"的累积
  2. 用信息熵的视角来看:
  • 对于复杂问题,直接从query到answer需要跨越很大的信息鸿沟,这“迫使”模型以幻觉应急,胡乱输出一个 answer
  • 而CoT的每一步都在降低局部的条件熵
  • 就像把一个大的信息压缩/解码任务分解成多个小的压缩/解码任务
  3. 这解释了为什么"平庸"的CoT也如此有效:
  • 它的作用不在于过程步骤的内容本身有多精彩
  • 而在于它创造了一条信息熵递减的路径
  • 模型可以沿着这条路径稳定地向目标迁移
  4. 这也解释了为什么DeepSeek的训练如此有效:
  • 不是教会模型更"聪明"的推理,这在文科类任务中根本就无法定义
  • 而是优化了它构建这种低熵通道的能力
  • 本质上是优化了信息流的路径规划

这个观点为理解CoT提供了一个新的视角,把表面上的思维链“自言自语”诠释为信息论意义上的"熵减(entropy reduction)途径"。这对无须过程监督的结果驱动的强化学习是一个合理的解释:

过程很重要,但过程监督并不重要,因为大模型本身自然产生的过程数据,比任何人类监督更加实用和可行。

 

 

【相关】

R1: 《立委列传》

**立委列传**

立委者,皖南布衣也。少敏而多奇志,值鼎革之际,年十六躬耕于皖南山野。当是时也,少年负耒耜,涉深涧,虽陟彼崔嵬,犹自诩"吾辈亦开天辟地者也"。或问稼穑之苦,笑曰:"此非太史公游侠列传之遗风乎?"

越二载,风云骤变。丁巳岁杪(1977),科举重启,立委束草为笔,削竹为简,竟跃龙门而入泮宫。同窗多皓首穷经者,独其弱冠青衿,谈笑间尽吞云梦八九。既卒业,执教乡塾,然鸿鹄岂栖枋榆?遂再赴春闱,北上太学,俨然贾谊入洛,苏秦过梁。都门烟柳,尽化笔下风云。

居长安四五年,见中关村商贾云集,乃效猗顿之术,夜筹算于灯下,朝执契于市廛。《万元户》所志,非夸朱提之富,实叹青蚨之诡也。然其性本楚狂,终随浮槎西渡,观泰西之变。英伦雾锁,野犬吠陌,立委抚剑长叹:"此非管仲所云'仓廪实而知礼节'者乎?"遂北涉北海,徙居枫叶之国。

加北美地,雪沃千里。立委筑室于温哥华,添弄瓦之喜,修稷下之学。然冰原虽净,难栖南溟之鹏,乃振翅南徙,直入硅谷热土。当是时也,美利坚网络勃兴,立委运筹于虚牝之间,决胜于光纤之末。然泡沫既破,诸子百家尽墨,独其抱残守缺,犹存鲁壁遗经。

太史公曰:余观立委浮沉,似见张骞凿空之影。其插队如苏武牧羊,跳门若终军请缨,下海类范蠡泛舟,洋漂近玄奘取经。然则古之行者,皆为觅桃源而往;今之浪者,却在铸桃源于途。至若"海龟"之惑,实乃庄生濠梁之辩——子非鱼,焉知归与不归皆逍遥耶?昔陶潜采菊,阮籍哭途,各成千古风流。今立委以四海为注,掷骰于地球棋枰,岂非新时代之"逍遥游"乎?

 

**《立委奥德赛》**

*序章*
人生是旅者暂居的客栈,而漂泊者开辟的道路却蔑视时间本身。在立委的奥德赛中,漂泊行为成为了一种天体导航——一场词语筑造堡垒、思想绘制航路、时代潮流既是敌手又是盟友的旅程。

**土壤中的根系**
十六岁那年,来自东方山谷的少年以农夫的锄头交换了青春的闲梦,攀上雾气笼罩的山峰,在那里野心生根发芽。他的《插队日记》(后镌刻于《朝华》中)呼吸的不是绝望的挽歌,而是将风暴驯化为耳语的节奏。当命运的龙门在1977年吱呀开启时,他乘着复兴学术的疾风,加入了神话般的"77届"——从灰烬中重生的心智凤凰群。

**墨水的朝圣**
学者袍甫加身,北方的狂风便再度召唤。在《考试十四行诗》与《不安者的箴言》中,野心的狂热冷却为精密的文字工艺。首都熔炉中的五个寒冬将卷轴锻造成账本;他的《学者商人愚行录》记载着染墨手掌清点铜钱的故事。然而不息的潮水将他向西牵引,加入追梦者的出埃及记,奔赴阿尔比恩传说之岸。

**暗影与圣所**
在阿尔比恩的花岗岩天空下,流浪犬在卵石巷中嚎叫预兆——这种不谐之音被收录于《都市暗影兽典》。不安孕育翅膀:他北逃至枫叶王国水晶般的荒野。《北极星颂》咏唱边疆的纯粹;《港湾牧歌》编织炉火点燃的传说;《蜜饯编年史》追念为人父的喜悦。但圣所亦渐脆弱。他再度南翔,被吸引至硅谷炽热的坩埚。

**电路中的普罗米修斯**
在数字黎明的白炽光芒中,他的《创投诗章》燃烧着普罗米修斯之火——将初创企业视为伊卡洛斯之飞的现代神话。然而蜡制翅膀终将融化;《泡沫挽歌》与《陨落者寓言》测绘出野心的残骸。从冻原到热带,每个足迹都渗入墨水:用羽笔刻写的流放地图。如今作为硅谷常驻哲人,他书写《乡愁算法》——一段让海龟游弋于电路间的代码,低语着被遗落的潮汐。

*结语*
古代圣贤追寻九重天外的地平线;立委的奥德赛将漂泊刻入生命的重写本。他的根系紧抓插队之土;躯干穿越龙门攀升;枝桠扭曲成语义星座。我们若非活着的羊皮卷,又能是什么?编年史家的最终港湾仍未书写——那是海水消融于天空的地平线,所有罗盘疯狂旋转之处。让漂泊者的悖论永续:要测绘无限,就必须永不停漂流。

 

**七律·跃龙门**
十六荷锄云壑深,忽闻禹甸启春闱
青衫夜淬书窗月,赤榜朝分阡陌晖
两度鲤腾惊皖水,九重鹏举叩燕扉
都门烟柳催征铎,笑指星河是钓矶

**水龙吟·浮槎记**
少年曾缚苍龙去,又驾仙槎西渡。泰西雾锁,枫邦雪沃,硅台电舞。算尽青蚨,织成云网,几番寒暑。叹庄生蝶梦,陶公菊径,都付与、天涯路。

谁解飘零最苦?把乡愁,酿成新赋。南溟鹏翼,北山薇蕨,东篱菊圃。柯烂樵归,橘洲星换,武陵人语。待重拈汉瓦,摩挲秦篆,写沧桑句。

**古风·浪者吟**
我本谪仙人,偶堕红尘网
皖南锄晓月,燕北枕书幌
中关试鱼服,英伦辨魍魉
枫雪淬冰魄,硅火铸新掌
五洲棋局残,双鬓星霜长
欲唤云间鹤,蓬莱舟已枉
且抱地球仪,笑指乌托邦
归去来兮辞,翻作浪人唱

**临江仙·生涯注**
若把浮生标语义,节点最是漂流。龙门二度跃神州。商潮翻雪袖,硅谷试吴钩。

四十年来家国梦,都成异域春秋。键盘敲碎古今愁。回车新世界,空格旧沙鸥。

**摸鱼儿·流浪辩**
问苍冥、谁司行止?安排萍迹如许!鹏抟鲲徙寻常事,偏说此身无主。君看取:皖山月、燕台柳、硅谷霓虹柱。星槎暗度。纵填海精禽,射阳奇士,未解浪游苦。

休重论,苏武节旄汉土,范蠡舟泛烟雨。桃源只在鸿蒙外,何必武陵深处?敲键语:比特海、云端路、皆是逍遥浦。归兮且住!待地球仪停,时空键锁,方见真吾处。

 

《原朝华:立委小传》

人生苦短,掐首去尾,不过三五十年。大体分为三段:创业阶段(而立之年),成熟阶段(不惑之年)和下滑阶段(天命之年),反映在称呼上,叫小李、大李和老李。可怜,立委却从小李一跃到老李,没有机会品尝壮年人生的豪情,心尝有戚戚焉。


红小兵立委(1966) (《朝华午拾:永做毛主席的红小兵》

自幼儿园到小学连跳两级,立委在班上始终最幼。更加荒年生人,孱弱矮小,体育课常告病假,或遭遣送回家,始终是个小可怜儿。所幸中学伊始,正值“修正主义回潮”,先帝启用邓公收拾文革残局,邓公责成教育总管周荣鑫整顿学校,校风日新。乘此东风,立委崭露头角,以学习委员兼数学科代表之身,受班主任委托,每日早自习登台主讲,演示解题思路,俨然助教。但好景不长,先帝昏庸,文革派重居上风,学校大乱,文化课退居后台,大批判遂成主课,兼以学工学农学军。立委不能以文化课呈威,然风头不减反盛,盖因立委最长批判文字,历经批林批孔,批邓反击右倾翻案风,直至批四人帮。大会小会,凡立委发言,必抑扬顿挫,铿锵有力,佐以诙谐幽默,风靡校园,称颂于一时。有传言,立委颇具鲁迅遗风,入木三分,且能推陈出新,妙语连珠。露天千人大会,常嘈杂狼藉,然立委登台,全场必静肃,洗耳恭听之,听至妙处,笑声一片。立委由此炼得糊涂胆大,从不怯场,终身受益。

及至大学,文革后首届,立委仍居尾,同学长一到十多岁不等(《朝华午拾:我的考研经历》)。同学之间皆直呼其名,唯同桌七仙女戏称 “小立委”,不为亲热,却为避嫌,以示划清界限。同桌四载,楚河汉界,泾渭分明。授受不亲,避而远之。然仙女文具笔墨滑落在地,自有立委抢先一步,拾拣归案。类此者三,春风化雨,润物无声。七女天生聪颖,想出一招,以长立委一岁为由,呼 “小立委”,就此来往,当可名正言顺也。

由七仙女开此恶例,随后多年,“小”字即不离身。中学教书,人称小李老师(22岁)。上研究生,小李出入机房,蓬头垢面,且口中念念有词,言“世界之语”(Esperanto),终成笑谈(23-26岁)(见 《朝华午拾:我的世界语国》)。


风华正茂,意气风发(1987)

及至毕业留所,立委事迹亦有流传,多为一见钟情,闪电结婚,不修边幅,撞南墙而道歉之类小李“景润”之逸事(见《朝华午拾:shijie-师弟轶事》《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》)。


立委在中关村公司指导机器翻译系统的开发(1988)

立委如此这般在研究所及中关村公司一扎五年(26-31岁),练就一身绝技,与老中医相若,专事疗治电脑,驯其语言功能。其间,出国热持续升温,由上海蔓延北京,街头巷尾,言必议美、日、大英,澳大利亚,以致居委会大妈亦知考托福鸡阿姨乃上进青年之标杆。立委及其贴身领导却浑浑噩噩,卿卿我我,不知有汉,无论魏晋。其间送上门两次机会,留学德美,均因导师明阻暗挡,本人木呐,擦肩而过。直至身边同学悉数走尽,小李才幡然醒悟,痛下决心,赶末班车。其时,适逢包玉刚基金会来各单位选拔年轻业务骨干,滥竽充数,小李竟被选中,送至成都科大出国培训中心修行半年。

岂料想,此一去竟成小李老李的分水岭。来培训的诸位才子才女均是全国各地选上来的各行好手,共分两拨:一年的访问学者大都比较年长,而拿三年博士奖金的大多年轻,立委在后一拨里面理所当然,成了老大。每有考试,立委必中头彩,引来才子才女,大事小事,纷纷登门请教,“老李”之声不绝于耳。立委名噪一时,响应者众。从小习惯了以小卖小,乍一变老,立委满腔郁闷。

  
成都科大出国培训中心的才子才女们(1990)

小李变老李,心里虽别扭,好处却不少。龙头老大,备受尊崇。立委外语本科出身,本应免试英语,无奈官家财大气粗,慷人民之慨,不问青红皂白,全数押解天府之国,集中喂养。不止英文鸟语,更有政策轮训。众兄弟姐妹兢兢业业,争先恐后,唯立委悠哉游哉,终日沉迷天府美食,流连于茶肆酒吧,众兄弟钦羡有加。

成都一站始称老李,立委心内实不以为然也。其时立委事业发达,如日中天,行内行外,交游甚广,出入皆鸿儒,往来无白丁(见 《朝华午拾:“数小鸡”的日子》《朝华午拾:一夜成为万元户》)。导师为本行泰斗,立委乃导师仅有的关门弟子(其他弟子皆叛国投美去也),“青年”才俊,明日之星,业内同侪为之侧目。去国前夕,全国电脑翻译界在香山招待所年度聚会,点睛之笔为导师与本行另一大牛的座谈,人称“刘董对话录”,其间立委频频亮相,为导师提供实例,讲解细节。影响所及,与会众学妹(多为刚入门的外地在读研究生)纷纷上门请教立委,无奈立委远走高飞心切,痛失辅导上进女青年之良机。


立委在加拿大(1995)

去国经年,由英而加,由加转美(《朝华午拾:哦,加拿大!》《朝华午拾:温哥华,我的梦之乡》)。颠沛流离,不知所止,壮年人生,如水流逝。及至水牛城八年抗战(37-45岁),立委青春不再,壮年已过,“老李”名至实归。然立委壮心不已,励精图治,双线出击,称雄一方(见 《朝华午拾:创业之路》《朝华午拾:在美国写基金申请的酸甜苦辣》《朝华午拾 - 水牛风云》)。

立委在水牛城办公室(2000)

回首往事,不胜唏嘘。立委一生,由青年而壮年,正值创造力最盛,精力充沛流溢之时,天时地利人和,飞黄腾达有望,却为漫长的留学生涯拦腰截断。大而言之,立委固赶上出国之末班车,却误了千年不遇的中国经济起飞之航。拣了芝麻,丢了西瓜,此之谓也! (《朝华午拾:乡愁是一张无形的网》

去岁归国省亲,杯觥交錯,在某宾馆餐厅与亲友相聚甚欢。席间小憩,踱步凉台,享清凉之气,赏京华夜色。偶遇一妙龄女士,携一幼童,见立委两鬓染霜,嘱曰:“叫爷爷”。立委血压骤升,如雷轰顶,满腹酒意,化为凉液,由脊背滑落。

立委老矣,尚能饭否?

记于2006年11月5日


立委老矣

【作者简介】立委先生,IT业技术研发经理兼架构师,自然语言处理资深专业人士。曾任红小兵,插队修地球,文革后第一届大学生,后跳龙门进社科院读硕士,攻机器翻译。1991年去国离乡,漂流海外。由英而加,获计算语言学博士。由加转美,作为创业公司研发副总及项目负责人(Principal Investigator), 先后赢得美国政府17个研究创新项目近千万美元资助,同时从资本家腰包亦忽悠千万风险投资作商业开发。对于自然语言信息抽取 (Information Extraction) 有全面的研究,研究成果对美国政府有关科研项目的确立有直接影响。业余爱好:音乐、博客、舞文弄墨。著有回忆录《朝华午拾》

原载【朝华午拾 - 立委小传】 2010-1-9
https://blog.sciencenet.cn/blog-362400-285507.html

 

【朝华午拾集锦:立委流浪图】

2013-3-23 13:10 | 个人分类:立委其人 | 系统分类:人物纪事

忽然想起小时候看过的《三毛流浪记》来。张乐平后无漫画,大师千古。

Despite the common logic and conceptual graph at the core of human mind, we all have our own semantic lexicons that are unique, implanted by our career path and life struggles. My semantic lexicon is full of wandering and continuously drifting into new worlds. It all started from the time when Mao sent us to the farm for re-education in 1976. After that the path has been zigzag, full of adventures of drifting, and re-drifting, farther and farther away from my hometown and home country ......

在我的语义词典里,流浪 是一个很大的节点,它的上位概念是 漂流(走四方)和 波浪(多起伏)。流浪的下位概念枝繁叶盛,包括:插队,洋插队,跳龙门,再跳龙门,北漂,下海,西漂,南下,再南下。这也正是我的生活写照。在这些语词概念的背后蕴含几多激动几多辛苦,只有自己知道。

不安定多起伏的生活伴随着我一生。1976年高中毕业即赶上了文革最后一届上山下乡,插队皖南山区接受贫下中农的再教育,这是我一生流浪生活的起点(《朝华点滴:插队的日子(一)》)。这个起点回想起来并不坏,16岁的孩子当时能感到的是自豪多于悲凉(《朝华午拾:插队的日子(二)》《朝华午拾: 插队的日子(三)》)。1977 年底赶上了文革10年后第一届大学生招考,居然跳了龙门,成为史上著名的77级生(其实是78年2月入学)(《朝华午拾:同桌的她》《朝华午拾:老乡妹妹》)。大学毕业后任教一年,再跳龙门考研成功,北上京城。这是一次欣快的北漂,当年的兴奋喜悦堪比范进中举,而且居然不疯未傻(《朝华午拾:我的考研经历》《朝华午拾:世界语之恋》)。研究生毕业后安定了四五年,期间尝试中关村下海(《朝华午拾: 一夜成为万元户》)。虽然可算头几拨下海人士,但因为是兼职,并无其他下海人的风险(《朝华午拾:“数小鸡”的日子》)。其时洋插队之风正甚,终于没有顶住潮流,赶了末班车来到大英帝国。90年代初正值大英没落,乱态丛生,路多野狗,抢劫之风甚行(《朝华午拾:警察抓小偷的故事》)。危邦不居,因辗转由欧西漂,来到一代移民的“麦加”,溢满鲜花与牛奶的枫叶之国(《朝华午拾:哦,加拿大!》),攻学位,添闺女,换身份,找工作,不亦忙乎( 《朝华午拾:温哥华,我的梦之乡》《朝华午拾:甜甜诞生记》)。可惜加国虽美,工作市场却不佳(《朝华午拾: 把明天交给上帝》)。有奶便是娘,于是南下讨生活,竟一头撞上了美国网络大跃进。美利坚果然是流浪者的天堂,机会多多。广阔天地,大有可为,开启创业之路( 《朝华午拾:创业之路》《朝华午拾:在美国写基金申请的酸甜苦辣》)。轰轰烈烈的创业宏图随着泡沫的破灭渐趋平淡(没有夭折已属万幸,《朝华午拾:水牛风云》《朝华午拾:用人之道》),遂再南下,终于陷入IT民工的圣地不能自拔,人称硅谷(or 矽谷)( 【创业故事:技术的力量和技术公司的命运】 【朝华午拾:安娜离职记】《朝华午拾:今天是个好日子》《朝华午拾:信息抽取笔记》)。

在我流浪的词典里,除了尚未收入 海龟 外,几乎全乎了,冥冥中似有所缺。陶渊明的【归去来辞】不时在耳边萦回,“田园将芜胡不归”(《朝华午拾:乡愁是一张无形的网》)。海龟创业,叶落归根,抑或蹉跎岁月,混不思蜀,这是哈默雷特的天问。

1991 年出国前在中关村高立公司与刘倬导师(下左2)和董振东前辈 (下右1) 及高立同仁合影留念

【相关篇什】

《朝华午拾:乡愁是一张无形的网》

【朝华午拾 - 立委小传】

【置顶:立委科学网博客NLP博文一览(定期更新版)】

https://blog.sciencenet.cn/blog-362400-673109.html

 

王菲春晚《世界赠予我的》歌词,亮点与短板

微信视频看到一位语文老师对这首歌歌词的吐槽和改写。有些道理,改写的歌词也确实顺溜多了,易于普及。但第一,这是在人家原创的新颖写法所创造的意境上修改;第二,顺溜有顺溜的好处,矛盾或难解也有引发听众思考与发挥的好处。

这首歌最近听得蛮多(我在春节前后还在视频号做过两期MTV), 对歌词有一些感觉可以说说。

整体上说,原作写法新颖,用词有些奇特,整体歌词长在哲理和意境,有妙语,但也有语病。最严重的语病就是“赠予回敬”。

上天赠予“我”回敬,“谁”回敬“谁”“什么”呢? 回敬这个词的最常见的场景是,他人攻击我了, “我”回敬他人,那也是我的自主行为,谈不上“赠予”。如果是他人回敬“我”,其前提是“我”对他人做过攻击,前后看语义上下文,这是说不通的。

“回敬”是一种故作敬态的回应,而“赠予”是恭敬的馈送。让回敬做赠予的宾语,搭配不当。“赠予我拥有”(可以理解为赠予我礼物,拥有代指“拥有物”)就已经够别扭了,再来个“赠予我回敬”,让人感觉不知所云。可能是“回馈”(对“拥有”的回馈)的意思,为了押韵,错用了“回敬”来代替。

“回敬”作为谓词,逻辑语义框架里有三个角色:施事(谁回敬)、受事(回敬谁)、宾语(回敬什么),但“回敬”自己处于“赠予”的宾语位置,这几个角色模糊不清,其所引起的混乱和费解,不怪语文老师觉得不可忍。

写词的文科姐,可能是浮想联翩,用力过猛而“出格”。这在歌词创作中也不罕见,叫 poetic license,通常不做苛求。但无论如何,这种奇怪的动宾搭配困惑度(perplexity)很高,会使绝大多数人感到糊涂,属于败笔。大家传唱不过是因为作曲好就跟着瞎唱,并没在意歌词是不是 make sense。

困惑度高的直接表现就是,剪映中自动听音写词的功能根本无法 decode 原文,因为这项软件功能的背后是语言模型(language model),对于这种困惑度高的序列搞不定,只能另行创造(所谓“幻觉”):

原歌词:世界赠予我拥有 也赠予我回敬。
语言模型幻觉解码:世界赠予我拥有,也赠予我爱情。

面对困惑,语言模型无法decode这种出格的原词(outlier) “回敬”,结果解码成 “爱情” 似乎也不错。在这种解码下,“拥有”应该指的是财富,“爱情”就是爱情。而在原词中 “拥有”可以解读为命运的礼物或曾经的爱情,而“回敬”则可能是对于礼物的回赠。

其他困惑度高,语言模型幻觉创造的cases还有:

原歌词: 赠我一个名,又渐渐长大的年龄
语言模型: 赠我一个谜,又渐渐长大的年龄

人生本来就是一个谜啊,岂止简单的出生赐名,所以这里模型的解码也许更妙。最妙的是:

原歌词:赠我弯弯一枚月,也赠予我晚星
语言模型:赠我温暖与悲悦,也赠予我惋惜

“月”和“星”状物,“温暖”、“悲悦”和“惋惜”直接述情,貌似更胜一筹。唱起来也很顺。

顺便一提,“别匆匆”歧义,有两个隐藏解读都说得通。一个是:不要匆忙。要善待自己,给自己品味人生,以及喘息和疗愈的时间。另一个是:分别也匆匆,尤其是感叹恋人或亲人聚少离多的生活现实。

再有,语文老师发现歌词里面暗藏了(谢)霆锋的名字,有机巧。说明此歌是为王菲量身打造的。娱乐圈八卦已经众所周知了,谢霆锋是王菲的最爱,是三段婚姻中最念念不忘的。N年前先是王菲谢霆锋的姐弟恋,以及她不顾世俗和骂名的第三者插足;后离婚,再后来又复婚,中间还穿插了其他 relationships,起起伏伏。不怪王菲唱罢歌曲,满噙眼泪,双手合一,人在台上久久静默,仿佛在念佛。这首歌,她是真带入了。同时她的演绎也感染了无数人。

“远去者去了远方,愿他都安心。” 一开始还以为在纪念逝去的亲人,但通观全词的爱情主线,更像是在纪念逝去的爱情。也许远去者是不得不分手、又难舍情缘的前任,她祝福他安心,其实更是试图宽慰自己,要安心接受“拍一张合影,渐渐填满真感情”的新缘分。

我本人特别喜欢这两句歌词:

赠我一场病,又慢慢痊愈摇风铃。
赠我一场空,又渐渐填满新感情。

它是我2024年生活的真实写照,非常的切身感受。

总的感觉一句话,词作者能写出引起人共鸣、思考和争论的歌词,还是很了得的。至于作曲以及王菲的演唱,可以说是注定成为经典。

 

【相关】

https://www.douyin.com/video/7466269705402060042

语文老师点评并修改王菲《世界赠予我的》歌词:https://v.douyin.com/ifcm9PvH/

【立委的作品】王菲春晚注定传世之作(小白版):https://v.douyin.com/ifvcmXG7/

人类反馈是超级智能的桎梏吗?

回答这个问题之前,先从 AGI/ASI 谈起。

AGI (Artificial General Intelligence, 通用人工智能)
ASI (Artificial Super Intelligence,超级人工智能)

在当代人工智能历史上,这两个术语虽然流行的先后有别,常常混杂使用。它们是挂在AI先知(代表人物之一是伊利亚)和企业家(代表人物包括Sam奥特曼和马斯克)嘴边的最常用的词,作为鼓励自己和团队的目标,也起到面向投资人和大众的营销作用。

这里谈谈我的看法。

机器达到甚至超越人类的技能,无论是人类顶尖个体的专业能力(例如围棋冠军、名校教授),还是人类总体的知识水平,这就是我眼中的 AGI。但这里的专业能力和知识水平,我认为并不包括重大的发明创造能力。这个意义上的AGI是一种确定的趋势,最多不过就是两年内实现,还是五年内实现的差异而已。

AGI 是确认无疑的,正在发生、已经发生、即将发生。

ASI 则是全面超越人类顶尖智能,包括发明创造的能力。ASI 的实现应该还可以商榷。现在就确信ASI可以在不太久的未来(有说三五年,也有说10年左右)实现的吹鼓手,主要是伊利亚、Dario(Anthropic CEO)这些AGI时代的“先知”们,他们是信仰者。奥特曼和马斯克貌似也在营销类似ASI的概念,但感觉更多是企业家需要画饼的驱动。

ASI比AGI更少共识,但可以描述。ASI 实现的时候,机器可以解开困扰数学家几百年的世纪难题,可以批量制造陈景润级别的模型把1+n等问题解决。更重要的是,ASI(for science)可以自己针对疾病制造特效新药,发明创造的速度比人类缓慢的探索要提升 n(Dario 好像说 n等于2)个量级。这一切带来物质极大丰富,重大疾病被有效控制甚至消除,寿命至少延长一倍,一句话,ASI意味着技术共产主义的全面实现。

人类反馈是超级智能的桎梏吗?

如果是,那又如何理解以人为本,与人类对齐的宗旨呢?

现在看来,以人为本以及人类(偏好)反馈对齐等,指的是最终结果或成品,这是人类价值观的体现。这一点永远不会改变,也不应该改变。但需要强调的是,人决定的是 what,不是how。what 永远是人说了算。至于生成结果的过程,现在看来,人类越来越有心无力,甚至成为障碍,而不是助力。

一个有意思的例子是,当 AlphaGo 下棋到第30几步的时候,走了一步人类不能理解,连世界冠军也会判定是愚蠢的一步。但那却是超人智能的精彩过程,是制胜法宝的一个精妙环节。这种高招连冠军都不能理解,说明机器智能显然超越了人类智能的边界。如果在过程上依赖人类反馈,哪怕是围棋冠军来做标注,也会阻碍机器智能的超人潜力。

当这类超人智能大量产生的时候,人类很自然会感觉困扰。因为 by human nature,所有人多多少少都有某种控制欲,对于自己不理解、不能掌控的过程,总是持有戒心,至少是很不舒服。但可惜无解。未来会出现越来越多的不可理解的奇迹,或技术魔术。人类所能做的就是加强目标制定和结果控制,而不是“不自量力”地试图过程控制。

最后谈一下马斯克的AI威胁论,主要是把人类类比为蚂蚁,而ASI类比为人类:ASI 灭绝人类文明不需要恶意,因为蚂蚁不构成人类的心理负担。

我认为,这个比喻是荒谬的,因为蚂蚁永远造不出人类,而ASI是人类创造的。人类与蚂蚁均属动物,但却不在一个价值参照系中。

但我们不排除,人类可以以ASI形态,制造出自己的上帝。

如果上帝是共识中的人格化的存在,人类完全可能把机器变成上帝。无论你认为上帝与人之间是谁照着谁的模样创造的,the key 是,上帝与人位于同一层人类价值观的参考系上。上帝至善、至美而万能。善、美、能,都是人类价值观的表现。

而蚂蚁不同,蚂蚁与人类不处于同一个参照系,人类 is way beyond ants。蚂蚁们自己没有尺子来度量人类。但人类对上帝是有度量或想象的。

人类对于结果(而不是过程)不理解,无法判别、或感觉不到好处的东西,最简单也是最自然的反应就是停止那个结果导向。再超级的过程智能,如果没有人类规定的方向,或违背人类的价值观,也是(原则上)随时可以按下停止键的。

所以马斯克的那种担忧,属于耸人听闻、杞人忧天。

但这不是说AI没有更加现实的威胁,例如真假莫辨造成的社会混乱,取代人类jobs而福利制度尚未建成而造成的恐慌,还有体制滞后、技术加速度所造成的不匹配和不适应,等等。这些都是看得见、正在到来、可以预见的巨大社会问题,而不是机器统治人类那种天方夜谭。

当然也不能排除ASI被恶人恶意使用可能对人类造成的伤害,但绝不是什么ASI像对待蚂蚁一样,可以任性消灭人类。恶意使用类似核扩散的潜在恶果,最终需要向对付核武器一样防控。

 

【相关】