Grok:大模型为什么要超大数据?(4o配图)

各位同学,想象一下这个场景:你在兴致勃勃地榨油,刚开始,新鲜原料下去,哗啦啦全是金黄的油,心里美滋滋。可榨着榨着,出来的渣子越来越多,油却越来越少。按咱们朴素的常识,这时候是不是该见好就收,别浪费原料了?

然而,到了人工智能语言模型(LLM)这片神奇的土地,剧情发生了一百八十度大转弯!数据量从1亿token(可以理解为单词或字符块)一路狂飙到1000亿,甚至更多,按理说早该“榨不出油”了,可模型反而越来越聪明,不仅能跟你天南海北地聊天,还能即兴写诗、编故事,甚至进行复杂的逻辑推理。这……不科学啊!有了“够用”的大数据,为啥还要追求“多到离谱”的超大数据?这不是跟“榨油”的道理反着来吗?

今天,咱们就来扒一扒这背后的“暴力美学”,看看AI界的“大力出奇迹”到底藏着什么玄机。

[一张对比图。左边是一个小油榨机,旁边堆着少量原料,出油口滴着几滴油。右边是一个巨大的、充满未来感的“数据榨取工厂”,无数数据流涌入,出油口喷涌出象征“智慧”或“能力”的光芒/液体。]


从“够用”到“超多”:数据的暴力美学

咱们先凭直觉感受一下。假如你正在苦学英语,背了5000个核心单词,应付日常对话基本就八九不离十了。这时候再让你背5万个单词,可能也就是多认识几个莎士比亚戏剧里的生僻词,或者医学、法律的专业术语,对于日常交流能力的提升来说,收益明显递减,对吧?

按理说,语言模型也该是这么个道理:给它1亿token的数据(大概相当于几百万句话),足够它学会基本的语法规则、常用短语和表达方式了。再喂给它1000亿token,那不就是大量的重复和啰嗦吗?就像让你把“你好”看一万遍一样无聊。

可现实啪啪打脸:像GPT-3这样的模型,据说用了高达3000亿token的数据进行训练;马斯克旗下xAI的Grok,虽然具体数字保密,但可以想象也是个“数据饕餮”。结果呢?它们不仅能对答如流,还能在没见过具体问题的情况下(零样本学习)给出答案,甚至模仿莎士比亚的文风写十四行诗。这简直就像榨油榨到最后,眼看全是渣子了,突然“嘭”地一声,冒出了一桶闪着金光的“智慧神油”!

这个现象在AI圈内有个响当当的绰号,叫做“暴力美学”(Brute Force Aesthetics)——简单粗暴地堆砌数据、堆砌模型参数、堆砌计算能力,方法看似“笨拙”,效果却出奇地惊艳。同学们,你们是不是也跟我一样好奇:这些数据明明已经冗余得像“榨干的油渣”了,为什么还能让模型变得更强大呢?


语言的“表”与“里”:一场形式与意义的双人舞

要想解开这个谜团,咱们得先学会像庖丁解牛一样,把“语言”这头复杂的牛拆解成两个关键部分:形式(Form)和意义(Meaning)。

形式:爆炸的“花式玩法”

语言的形式,说白了就是词语是怎么拼写的、句子是怎么构造的。比如,“我喜欢吃苹果”这句话,它可以有无数种“花式变体”:“我超爱啃苹果”、“苹果是我的心头好”、“苹果这种水果深得我心”、“吾甚爱苹菓”……

虽然我们日常使用的词汇量可能是有限的(比如现代汉语常用词也就几万个),但这些有限的词汇组合起来,产生的可能性简直是天文数字! 想象一下:假设我们有10万个词汇,要组成一个包含20个词的句子,理论上有多少种组合?答案是 10万^20 ≈ 10^100!这个数字比已知宇宙中的原子总数(大约10^80)还要多得多!
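这个数量级用几行 Python 就能验证(10 万词表、20 词句长都是上文为说明问题而随手取的假设值):

vocab_size, sentence_len = 100_000, 20
combos = vocab_size ** sentence_len       # 理论组合数:10 万的 20 次方
print(len(str(combos)) - 1)               # 100,即 10^100
print(combos > 10 ** 80)                  # True:超过已知宇宙原子数的常见估计(约 10^80)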

在海量的数据里,确实,大部分句子都是常见的、符合常规模式的(比如各种围绕“我喜欢...”的表达)。但与此同时,还存在着无数稀奇古怪、不按常理出牌的长尾模式(Long-tail Patterns),比如“我在月球上吃着榴莲唱着歌”、“用代码写一首关于猫咪打字的诗”。

超大数据(比如1000亿token级别)就像一个强迫症晚期的“形式收藏家”,它的一个重要任务就是尽可能多地收集和学习这些五花八门的、甚至是罕见的语言表达形式。只有1亿token的数据集,模型可能只见过规规矩矩的“我喜欢吃苹果”。但有了1000亿token,它就有机会见到“我喜欢啃苹果核”、“我喜欢吃代码bug”、“哀家独爱此红果”等等千奇百怪的说法,见识大大增长。

意义:有限的“内核”

与形式的无限可能性相对,语言的基本意义(包括本体知识,ontology),也就是语言所承载的内容、知识和逻辑关系,其核心概念及其关系的数量相对是有限的。比如,“我喜欢吃苹果”这句话表达的核心意义无非是“(主体:我)+(情感:正向/喜爱)+(动作:吃)+(对象:苹果)”。

世界上的核心概念(如时间、空间、情感、物体类别等)虽然丰富,但终究是有限的。可能只需要几百万句、甚至几千万句话(对应相对“较小”的数据量,比如1亿token),模型就已经能覆盖和学习到绝大多数常见的知识点和逻辑关系了。

再往数据集中加入更多句子,比如“我饿了”、“我肚子饿得咕咕叫”、“我好饿呀,想吃东西”,它们的形式各不相同,但表达的核心意义(生理状态:饥饿)却是高度重复的。这种情况就叫做信息冗余(Information Redundancy)。从“基本意义”的角度看,这就像是反复榨已经出油不多的油渣,新榨出来的“油水”(意义)并没有本质增加,只是换了个“包装”(形式)而已。

区别在哪?

    • 形式的长尾(Form's Long Tail):超大规模的数据集主要在努力填补“组合爆炸”留下的巨大空间,学习各种各样、甚至是非常罕见的表达方式。它的增长潜力巨大。
    • 意义的饱和(Meaning Saturation):知识、常识和基本逻辑关系,在达到一定规模的数据量(比如亿级token)后,就已经被模型“榨”得差不多了,后续增加的数据在“意义”层面带来的新东西越来越少,重复性很高。
    • 冗余的错觉:基本意义是高度重复的,但超大数据的“油水”藏在动态知识里,尤其是篇章结构和远距离依赖关系。
    • 规模的意义:1亿token榨干静态知识,1000亿token覆盖很多长尾形式并挖掘动态知识,远没到没用的时候。形式长尾和动态意义的把握让模型理解复杂场景、善于篇章推理和生成长文。

超大数据的“奇迹”:形式如何撑起门面?

好了,关键问题来了:既然“意义”早就饱和了,我们为什么还要费那么大劲去搞超大数据呢?难道就是为了让模型学会说“茴香豆的‘茴’有四种写法”吗?答案就藏在形式和意义这对“舞伴”的微妙互动中。

1. 长尾模式:语言世界的“捡漏大师”

1亿token的数据可能教会了模型说标准的“我喜欢吃苹果”。但1000亿token的数据,却有机会让模型“捡漏”学到诸如“我在火星上种的苹果熟了,味道好极了”、“吾乃沉迷苹果无法自拔之果粉是也”这类极其罕见但合乎语法逻辑的表达。

这些长尾模式虽然单个出现的频率极低,但汇集起来,却是模型展现其“无所不知”、“无所不能”的关键。用户可不希望AI只会说标准普通话、回答标准答案,他们希望AI能听懂方言俚语、能接住网络热梗、能应对刁钻古怪的问题。超大数据就像一个超级“万能翻译机”和“语料库大全”,让Grok这样的模型能用一本正经的古文回复你“余甚喜食苹果也”,或者用贱兮兮的语气调侃你。

类比一下:普通大数据像是在开采容易找到的浅层油田,产量大但种类单一;而超大数据则像是在深海、在极地钻探,虽然每次“捡漏”到的油(罕见模式)不多,但这些油往往非常珍贵,能满足特殊需求。

2. 大模型的“大胃口”需要“大餐”

现在的LLM动辄拥有数百亿甚至上万亿的参数(Parameters),这些参数就像是模型大脑中的神经元连接。参数量如此巨大的模型,简直就是个“超级大胃王”。只给它1亿token的数据,就好比给一个壮汉只吃一碗米饭,根本吃不饱,模型的巨大潜力无法被充分激发,这在机器学习里叫做欠拟合(Underfitting)。

有研究揭示了著名的“缩放定律”(Scaling Laws),大致表明:

# 经验性的“缩放定律”(如 Kaplan 2020、Chinchilla 2022)拟合的是损失随模型规模与数据规模的幂律:
# Loss(N, D) ≈ E + A / N^alpha + B / D^beta
# (N: 模型参数量, D: 训练数据量;E、A、B、alpha、beta 为拟合常数;损失越低,性能越好)

简单来说,模型的参数量(Model Size)越大,就需要越多的数据(Dataset Size)来“喂饱”它,才能让性能(Performance)持续提升。一个拥有1000亿参数的模型,可能真的需要千亿甚至万亿级别的token数据,才能让它的每一个参数都得到充分的训练和调整,发挥出全部实力。
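下面用几行 Python 粗略感受这种幂律的走向(其中各常数只是为了演示趋势而取的假设值,并非对任何真实模型的拟合结果):

# 示意:固定模型规模,观察数据量增加时损失的递减趋势(常数均为假设值)
def loss(n_params, n_tokens, E=1.7, A=400.0, B=410.0, alpha=0.34, beta=0.28):
    return E + A / n_params ** alpha + B / n_tokens ** beta

n_params = 70e9                                   # 假设一个 700 亿参数的模型
for n_tokens in (1e8, 1e9, 1e10, 1e11, 1e12):     # 数据量从 1 亿到 1 万亿 token
    print(f"data = {n_tokens:.0e} tokens -> loss ≈ {loss(n_params, n_tokens):.3f}")

可以看到,数据每增加一个数量级,损失仍在下降,但降幅越来越小,这正对应上文所说的边际收益递减但尚未饱和。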

类比一下:小鸟胃吃一小碗饭就饱了,相扑选手的胃口,可能得连吃十大碗饭才能满足基本需求。参数量巨大的LLM就是那个相扑选手。

3. 冗余数据:看似无用,实则“磨刀石”

你可能会觉得,那些意义重复的数据(信息冗余)不就是“渣子”吗?留着干嘛?但对于模型来说,这些看似冗余的数据,其实扮演了“磨刀石”的角色。

同一个意思,比如“我喜欢”,在数据中可能以各种细微差别的形式重复出现成千上万次:“我超爱”、“我贼喜欢”、“我对其情有独钟”、“我表示好感”……模型通过反复接触这些形式各异但意义相近的样本,能够更精确地学习到词语和概念之间的概率分布,让它的预测(比如下一个词应该是什么)变得更稳定、更丝滑、更符合自然语言的习惯。

冗余并不是完全没用,它在帮助模型打磨内部表示,让生成的语言更流畅自然。

类比一下:书法家反复练习同一个字“永”,虽然字本身的意义没变,但每一次练习都在打磨笔锋、稳定结构,最终让技艺炉火纯青。冗余数据就是模型在反复“练字”。

4. 涌现能力:量变到质变的“魔法时刻”

最令人着迷的一点是,当数据量、模型参数量大到一定程度后,模型会展现出所谓的“涌现能力”(Emergent Abilities)。这意味着模型突然学会了某些在小规模模型或小数据量下完全不具备的、意想不到的新技能,比如进行数学推理、编写代码、理解比喻和幽默等。

这些能力似乎并不是因为数据中包含了所有问题的直接答案,而是因为模型在处理了天文数字般的语言模式后,从量变积累到了质变,仿佛“顿悟”了一般,开始从简单的“模式记忆”跃升到了某种程度的“抽象理解”。这就像水温从99℃升到100℃,看似只差1度,却发生了从液态到气态的相变。暴力堆数据,硬生生堆出了“智能的火花”。

类比一下:你小时候玩积木,堆几十块可能只是个歪歪扭扭的塔,但当你拥有足够多的积木(超大数据),并且耐心堆到成千上万块(大模型)时,你可能突然发现自己搭出了一座结构精巧、功能完备的城堡。


“暴力美学”的真相:形式为主,意义为辅?

那么,超大数据训练出来的LLM,到底主要学到的是什么?真的是海量的形式长尾吗?目前的看法倾向于是的,但又不完全是。

    • 形式驱动显著:千亿级别的token极大地丰富了模型的形式表达能力。这使得模型生成的文本更加流畅、自然、多样化,更像一个真正的人在说话,而不是一个只会套模板的机器人。Grok能用幽默风趣或古雅文言来回答问题,很大程度上就得益于其接触过的海量、多样的语言形式。
    • 意义增长趋缓:如前所述,关于世界的基本知识、常识和逻辑推理能力,在数据量达到亿级token时可能就已经建立了基础框架。超大数据在“基本意义”层面带来的边际效益递减,它可能并没有教给模型全新的、颠覆性的“知识”,更多的是把已有的知识用100种、1000种不同的方式重新包装和表达了一遍。

一些研究证据似乎也支持这一点:有实验发现,随着训练数据量的增加(比如从几十亿token增加到几百亿甚至上千亿),模型在需要精确知识的问答任务(如百科问答)上的性能提升速度会逐渐放缓;但与此同时,衡量长上下文指令遵循、生成文本多样性、流畅度和风格模仿能力的指标却可能持续显著增长。这似乎印证了,“油水”(基本意义)榨得差不多了,超大数据更多是在给模型缝制一件越来越华丽、越来越合身的“外衣”(形式),可以应对任意复杂场景。


那意义饱和后,形式长尾还有啥用?

听到这里,你可能会嘀咕:“既然核心的知识和逻辑都学得差不多了,那花那么大代价去追求形式上的花样翻新,性价比高吗?有必要吗?” 这个问题问得很好!确实有利有弊:

    • 用户体验至上:形式的长尾和多样性直接关系到用户体验。没有人喜欢跟一个说话颠三倒四、只会用有限几种句式表达的AI助手交流。自然、多变、甚至带点个性的语言风格,是LLM能够被广泛接受和喜爱的关键因素。
    • 提升鲁棒性与长程理解:接触过足够多的罕见表达和非标准用法,能让模型在面对真实世界中充满噪声、方言、俚语、甚至是错误输入的“脏数据”或超长上下文时,表现得更加鲁棒(Robust)、理解更到位,不容易“翻车”。
    • 高昂的代价:硬堆1000亿token的数据,其中可能高达90%甚至更多在“意义”层面是冗余的,主要贡献在于形式的多样性。这意味着巨大的计算成本、存储成本和时间成本。为了那“10%”可能的新形式或微弱的意义提升,付出“90%”的冗余代价,性价比确实是个值得考量的问题。这其实也是把特定知识蒸馏给小模型做应用的理由之一。

再打个比方:意义就像是人体的骨架,可能1亿token的数据就已经把骨架搭得七七八八了;而形式则是覆盖在骨架外的肌肉、皮肤和衣服。超大数据(1000亿token)让这身“衣服”变得极其华丽、款式万千、能驾驭各种风格,但里面的“骨架”可能并没有发生根本性的改变。


为啥不更“聪明”一点,非要这么“暴力”?

“大力出奇迹”听起来很燃,但你可能还是会忍不住吐槽:这种做法也太“壕”无人性、太浪费资源了吧?难道就没有更“聪明”、更高效的方法吗?

你说得对,这种“暴力美学”确实有其局限性和潜在的浪费。但之所以目前它还是主流,主要有几个原因:

    1. 缩放定律的魔力仍在:经验性研究(Scaling Laws)表明,至少在目前可达到的数据量级(千亿甚至万亿token)和模型规模下,继续增加数据量,模型的性能仍然在持续提升,虽然可能提升速度放缓,但饱和点似乎还未真正到来(但最近发现,高品质数据几乎耗尽已经带来了预训练性能的平台效应,由此催生了强化学习后训练推理模型的崛起)。只要“大力”还能继续“出奇迹”,在追求更高性能的竞赛中,大家就倾向于继续“大力”。
    2. 工程上的可行性与效率:相比于设计复杂的算法去精确提炼“意义”、去除“冗余形式”,直接“暴力”地把能收集到的所有数据(当然要做数据清洗和质量管控)都扔给模型去学,在现有强大的GPU算力支持下,反而是工程上更容易实现、也更快看到效果的路径。先“暴力”,再“精细”。
    3. 对“涌现”的期待与未知:我们对于LLM如何学习、以及“涌现能力”产生的具体机制还不够完全理解。暴力堆数据就像是在探索一片未知的领域,我们不知道下一个“奇迹”会在哪个数据量级、哪个模型规模下出现。在没有找到更精准的“炼金术”之前,“暴力”堆料可能是触发未知的“魔法”的最直接方式。
    4. 未来的反思与探索:当然,学术界和工业界也一直在积极探索更“聪明”的方法,比如研究如何进行高效的数据去重、如何蒸馏知识、如何设计更擅长学习核心逻辑的模型架构,例如最近兴起的推理强化学习等。未来,我们很可能会看到更加高效、更加“智慧”的数据利用方式。但在现阶段,“暴力美学”凭借其简单直接和已被验证的有效性,还是基本王道。

结语:从“榨油”到“炼金”

所以,回到我们最初的“榨油”比喻。超大数据的“暴力美学”并非盲目乱来,它其实是巧妙地(或者说,被迫地)利用了语言本身“形式无限、意义有限”的特性。当“意义之油”逐渐榨干时,超大数据继续投入,主要是在榨取那无穷无尽、花样繁多的“形式之油”。正是这些看似冗余、实则提供了多样性和稳定性的“形式”,结合巨大的模型规模,共同催生了我们今天看到的LLM“巨兽”的惊人能力,甚至触发了难以预料的“涌现奇迹”。

所以,同学们,下次当你和ChatGPT、Grok或其他大型语言模型愉快地玩耍,惊叹于它们的对答如流、妙语连珠时,不妨在心里默默感叹一句:这家伙,可是被几千亿甚至上万亿token的“数据大餐”硬生生喂出来的“语言巨兽”啊!它的“聪明”,很大一部分是用“形式”的砖瓦,在“意义”的地基上,通过“暴力”施工堆砌起来的!

【相关】

检索增强(RAG)与窗口数据的互补性

RAG(检索增强生成)与扩展的上下文窗口(context window)虽然同为短期记忆机制,但在应用场景、成本效率和数据管理方面存在显著差异,以下为详细分析:

核心区别:RAG vs. 上下文窗口

特性 | RAG | 上下文窗口
数据来源 | 动态检索外部数据库/文档 | 当前对话或任务中提供的文本
数据实时性 | 支持实时更新(如最新文档、数据库) | 依赖用户输入或历史会话数据
计算复杂度 | 检索+生成(线性复杂度) | 自注意力机制(O(n²)复杂度)
数据隐私 | 无需存储用户数据到模型 | 可能需将敏感数据传入模型
成本 | 低(仅处理检索到的相关内容) | 高(长上下文消耗大量算力)
适用场景 | 动态知识、高频更新、精准检索 | 固定任务、多轮对话、小范围上下文

 

为什么百万Token的上下文窗口无法取代RAG?

1. 计算成本过高
Transformer的自注意力机制复杂度为O(n²),处理百万Token的上下文需要海量计算资源(如GPU显存),推理成本远超RAG的检索+生成流程(本节末尾附有一个粗略的数量级估算)。

2. 数据实时性不足
上下文窗口仅包含用户输入或历史对话数据,无法动态接入外部更新内容(如最新新闻、数据库变更),而RAG可通过实时检索解决这一问题。

3. 信息密度与噪声问题
长上下文可能包含大量无关信息,模型需自行筛选关键内容,而RAG通过精准检索直接提供高相关性片段,提升生成质量。

4. 隐私与合规需求
RAG无需将敏感数据传入模型,仅通过检索外部隔离的数据库获取信息,更适合医疗、金融等隐私敏感场景。
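回到上面第 1 点的计算成本,可以做一个极粗略的数量级估算(忽略层数、注意力头数等常数因子,只看注意力打分矩阵的规模):

# 自注意力打分矩阵的规模随序列长度平方增长
def attention_scores(n_tokens):
    return n_tokens ** 2            # 每层每个注意力头需要 n x n 个打分

for n in (4_000, 100_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_scores(n):.1e} 个打分")
# 100 万 token 时打分矩阵约有 1e12 个元素;
# RAG 通常只把检索到的几千 token 相关片段送入模型,开销相差多个数量级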

RAG的不可替代性:典型场景

1. 动态知识库(如客服系统)
需实时检索产品手册、政策更新,而上下文窗口无法覆盖频繁变化的非结构化数据。

2. 垂直领域精准问答
例如法律咨询,需从海量法条中检索相关条款,避免模型因上下文过长而“分心”。

3. 低成本长文本处理
RAG仅需检索关键段落输入模型,比直接处理百万Token的上下文更经济。

4. 多模态扩展
RAG可检索图片、表格等非文本数据,而纯文本上下文窗口难以实现。

RAG与向量数据库的关系

RAG(检索增强生成)的核心流程分为两步:检索(Retrieval)和生成(Generation)。

向量数据库是RAG检索阶段的核心基础设施,其作用如下:

1. 语义化存储:将文档、知识库内容通过Embedding模型转化为高维向量(Vector),存储语义信息而非原始文本。
2. 相似性检索:根据用户问题的语义,快速找到最相关的知识片段(Top-K相似向量),替代传统的关键词匹配。
3. 动态更新:支持增量插入新数据,无需重新训练模型即可扩展知识库。

向量数据库 = RAG的“外部记忆库”,负责语义化存储与高效检索;
RAG = 利用向量数据库的检索结果,指导大模型生成答案的框架。
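下面用一个极简的 Python 草图示意这两步流程(embed() 与 generate() 是假设的占位接口,实际系统会接具体的 Embedding 模型、向量数据库与大模型 API;这里用暴力余弦相似度代替真正的向量索引):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(question, docs, embed, generate, top_k=3):
    doc_vecs = [embed(d) for d in docs]            # 1) 语义化存储:知识切片向量化(示意,实际为离线入库)
    q_vec = embed(question)                        # 2) 相似性检索:找语义最接近的 Top-K 片段
    ranked = sorted(zip(docs, doc_vecs),
                    key=lambda pair: cosine(q_vec, pair[1]),
                    reverse=True)
    context = "\n".join(doc for doc, _ in ranked[:top_k])
    prompt = f"仅依据以下资料回答问题:\n{context}\n\n问题:{question}"
    return generate(prompt)                        # 3) 生成:检索结果指导大模型作答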

数据隐私问题:RAG vs. 上下文窗口

1. 上下文窗口的数据会泄漏吗?

- 风险存在:
上下文窗口中的数据(如聊天历史)会以明文形式传入模型推理,若未加密或未清除,可能通过日志、缓存等途径泄露。
- 典型场景:
医疗问诊时,若病史记录直接传入上下文窗口且日志未脱敏,可能违反隐私法规(如HIPAA、GDPR)。

2. RAG能保证隐私吗?

RAG的隐私性取决于向量数据库的设计:

- 数据隔离:用户数据存储在独立的向量数据库中,不与模型参数混合。
- 访问控制:可通过权限管理限制敏感数据的检索范围(如仅限授权用户访问)。
- 数据脱敏:入库前对隐私字段(如身份证号)进行掩码或加密处理。
- 风险点: 若向量数据库未加密或遭入侵,仍可能导致数据泄露。

结论:
RAG的隐私性优于直接将敏感数据塞入上下文窗口,但需配合数据库安全措施。

关键词检索 vs. 向量数据库

特性 | 向量数据库(语义检索) | 关键词检索
匹配逻辑 | 语义相似性(非线性关系) | 字符匹配(精确/模糊)
泛化能力 | 强(理解同义词、抽象概念) | 弱(依赖关键词命中)
数据格式 | 需预先向量化 | 原始文本+倒排索引
适用场景 | 开放域问答、复杂意图理解 | 结构化数据、精确术语查询

 

未来趋势:RAG与长上下文的协同

- 混合架构:用长上下文处理局部对话(如多轮聊天),RAG处理全局知识检索,兼顾效率与成本。
- 优化检索策略:结合语义检索与向量数据库,进一步提升RAG的精准度。
- 模型轻量化:针对RAG设计小型化生成模型(如Phi-3),降低端到端成本。

结论

即使上下文窗口扩展至百万Token,RAG在动态数据接入、计算效率、隐私保护等方面仍具不可替代性。两者并非竞争关系,而是互补工具:

- 优先用上下文窗口:处理固定任务、短文本多轮对话。
- 必选RAG:需实时数据、精准检索或控制成本的场景。

技术选型需结合业务需求、数据特性与成本预算,而非单纯追求上下文长度。

 

关于颈椎病,大模型医疗建议靠谱吗?

作为一个实例,老友患颈椎病,把家庭医生的分析以及核磁共振和CT结果提供给四大头部大模型,它们的分析结果详细丰富,感觉有参考价值。这里是前三家的分析结果。老爸行医一生,主攻外科与骨科,他看了这些报告的评价是:“很了不起!说的细致入理!不逊于专家!”

OpenAI GPT4.5 deep research 遇到一些麻烦,参照了太多的资料,似乎进入死循环,准备限定它的研究深度,重启 deep research,单独博客它的深度研究报告。

1. DeepSeek R1

全面分析与评估:颈椎病手术指征及治疗方案

一、病情总结

1. 影像学核心表现

- 后纵韧带骨化(OPLL):C1-C6广泛骨化,导致多节段椎管狭窄(轻至中度),C5-C6和C6-7尤为显著。
- 椎间孔狭窄:C4-C7节段中重度狭窄(C5-C6右侧重度、左侧中度),直接压迫左侧C5-C7神经根。
- 退行性改变:多节段椎间盘干燥、骨赘形成及关节突增生,加重神经压迫。
- 无脊髓病变:目前脊髓信号正常,无肌力下降或脊髓损伤表现。

2. 临床症状:

- 左侧颈肩臂放射性疼痛:符合C5-C7神经根受压特征,夜间剧痛(8-9/10)严重影响睡眠。
- 保守治疗无效:塞来昔布、理疗仅短暂缓解,热水澡/热敷效果不持久。
- 生活限制:左侧卧位困难,左手佩戴手表诱发疼痛,日常活动显著受限。

二、手术指征综合分析

根据国际指南及多模型分析,手术指征需结合以下因素:

1. 绝对指征(未完全符合):

- 脊髓受压伴神经功能缺损(如肌无力、步态异常、大小便障碍)尚未出现。

2. 相对指征(高度符合):

- 顽固性神经根性疼痛:症状持续2年,近期加重至无法睡眠,保守治疗(药物、理疗)失败。
- 影像学与症状高度关联:C5-C6/C6-C7中重度椎间孔狭窄与左侧放射性疼痛直接对应。
- 生活质量严重下降:夜间剧痛、睡眠障碍提示神经压迫需积极干预。
- 既往腰椎手术参考:L5-S1减压术后症状缓解,提示手术减压可能有效。

3. 需谨慎因素:

- 多节段病变复杂性:OPLL范围广(C1-C6),可能增加手术难度及邻近节段退变风险。
- 年龄与基础状态:60岁以上需评估手术耐受性,但无严重合并症时风险可控。

三、手术与非手术方案对比

方案 | 优势 | 局限性
保守治疗 | 避免手术风险,短期疼痛缓解 | 长期效果差,可能延误神经功能恢复
神经阻滞/射频消融 | 靶向缓解神经根炎症 | 效果短暂(数周至数月)
硬膜外激素注射 | 减轻局部水肿和压迫 | 对骨性狭窄无效,需反复操作
手术治疗 | 直接减压,根治性缓解疼痛 | 手术风险(感染、邻近节段退变等)
前路ACDF | 精准减压C5-C6/C6-C7,融合稳定 | 多节段OPLL可能需后路补充
后路椎管扩大成形术 | 适合多节段OPLL,保留颈椎活动度 | 术后颈肌疼痛风险,需严格康复

四、决策建议

1. 优先行动:

- 脊柱外科专科评估:重点讨论手术可行性(如C5-C7后路椎间孔扩大术或多节段椎管成形术)。
- 疼痛科介入:尝试C6神经根阻滞或硬膜外注射,作为术前过渡或保守治疗验证。

2. 手术时机:

- 若神经阻滞后疼痛显著缓解,可暂缓手术,但需密切随访。
- 若症状持续或加重,建议6-8周内安排手术评估。

3. 术后管理:

- 短期:颈托保护2-4周,逐步恢复颈部活动。
- 长期:颈椎稳定性训练(如等长收缩锻炼),避免低头久坐。

五、日常管理与注意事项

- 睡眠调整:使用颈椎记忆枕,仰卧位为主,避免左侧压迫。
- 姿势管理:电脑/手机支架抬高屏幕,每30分钟活动颈部。
- 疼痛控制:加巴喷丁/普瑞巴林(神经痛)+ 短期低剂量激素(医生指导)。
- 保暖与触发因素:避免颈部受寒,暂停佩戴左侧手表。

六、总结

您目前处于手术干预的临界状态:虽无脊髓损伤急症,但顽固性神经根压迫已严重影响生活,结合影像学明确狭窄及既往腰椎手术成功经验,手术减压是合理选择。建议:

1. 1-2周内完成脊柱外科与疼痛科会诊,明确手术方案或最后阶段的保守治疗。
2. 若选择手术,优先考虑后路多节段减压(适应OPLL广泛性),术后配合系统康复。

最终决策需与专科医生详细讨论手术风险、预期疗效及个体化需求,确保治疗选择与生活质量目标一致。

2. Claude 3.7

我现在会为您提供一个更全面的分析,整合之前的信息并提供更深入的解读。

您的颈椎病情综合分析

病情核心特征

您患有一种比较特殊且复杂的颈椎病变组合,主要包括后纵韧带骨化症(OPLL)和多节段退行性变。这两种问题共同作用,导致了您目前经历的症状。从影像学上看:

    1. 后纵韧带骨化症(OPLL):这是您病情中非常重要的特征,从C1延伸到C6。后纵韧带位于脊柱管前方,一旦骨化会压迫前方的脊髓和神经根。在MRI矢状位切面上,可以清晰看到脊髓前方的低信号区域,正是骨化的后纵韧带。
    2. 多节段狭窄:您有多个节段的椎管和神经孔狭窄,程度不一:
      • C2-C3: 中度椎管狭窄和神经孔狭窄
      • C3-C4: 中度椎管和神经孔狭窄
      • C4-C5: 中度椎管狭窄,重度双侧神经孔狭窄
      • C5-C6: 中度椎管狭窄,右侧重度和左侧中度神经孔狭窄
      • C6-C7: 中度椎管和神经孔狭窄

    3. 椎间盘退变和骨赘:多节段椎间盘干燥(失水)和骨赘形成,进一步加重了神经受压。

症状与影像的关联

您的症状非常典型,与影像学表现高度吻合:

    • 左侧放射性疼痛:从颈部延伸到肩膀和手臂,这与C5-C6和C6-C7节段的神经孔狭窄直接相关,特别是这些节段的神经根(可能是C6和C7神经根)受到压迫。
    • 夜间加剧的疼痛:卧位时椎间隙压力变化,可能导致疼痛加剧。
    • 左侧不能戴手表:这是神经根受压引起的异常感觉,甚至轻微的刺激也会触发不适。
    • 热敷后暂时缓解:热敷改善局部血液循环,暂时舒缓肌肉紧张,但不能解决神经根结构性压迫问题。

手术指征评估

综合您的情况,您已经达到了认真考虑手术的阶段,理由如下:

  1. 症状严重程度
    • 持续两年,近期明显加重
    • 夜间剧烈疼痛导致无法入睡
    • 日常活动受限(无法左侧卧,不能戴手表)
    • 生活质量显著下降
  2. 保守治疗效果
    • 抗炎药物(塞来昔布)效果有限
    • 物理治疗未能长期改善症状
    • 热敷只能短暂缓解
  3. 影像学证据
    • 多节段中重度神经孔狭窄,特别是C4-C5和C5-C6
    • 广泛的后纵韧带骨化,这通常比单纯的椎间盘突出预后更差
    • 影像学发现与症状高度吻合
  4. 疾病进展风险
    • OPLL是一种进行性疾病,可能随时间加重
    • 长期神经根受压可能导致永久性神经损伤
    • 若发展为脊髓受压,将面临更严重的后果

手术干预的优势和考量

  1. 适合您的手术类型
    • 考虑到您有多节段OPLL,且主要症状是神经根性疼痛,手术方式可能包括:
      • 前路颈椎间盘切除融合术(ACDF):针对C5-C6和C6-C7,直接解除神经根前外侧受压
      • 后路椎管扩大减压术:适用于广泛的OPLL,但对神经根减压效果可能不如前路手术
  2. 手术效果预期
    • 神经根减压手术对放射性疼痛和麻木的缓解效果通常很好
    • 大多数患者术后短期内即可获得显著改善
    • 您曾在2024年11月进行过腰椎L5-S1减压手术,效果良好,这表明您对神经减压手术可能有良好反应
  3. 术后考虑因素
    • 颈椎活动度可能部分受限
    • 需要系统的术后康复锻炼
    • 邻近节段退变的长期风险

临界决策分析

您处于手术决策的临界点,这需要平衡多个因素:

  1. 进一步保守治疗的选择
    • 神经根阻滞注射:可尝试CT引导下的选择性神经根封闭
    • 镇痛药物调整:可能需要加用神经性疼痛药物(如普瑞巴林)
    • 强化康复治疗:针对性颈部肌肉训练
  2. 等待与手术时机
    • 已尝试保守治疗两年,效果有限
    • 剧烈夜间疼痛是手术的相对紧急指征
    • 越早解除神经压迫,长期神经功能预后越好
  3. 个体化决策
    • 您的年龄(64岁)、整体健康状况及对手术的预期都会影响决策
    • 您之前腰椎手术的良好体验是一个积极参考

建议行动计划

  1. 专科评估 - 非常紧急:
    • 尽快完成脊柱外科专家咨询,讨论手术选择、风险和预期效果
    • 同时咨询疼痛管理专家,寻求短期症状控制方案
  2. 短期症状管理
    • 继续热敷、正确姿势和适量活动
    • 遵医嘱服用抗炎药物,可能需要加用神经痛特效药
    • 考虑使用合适的颈椎支撑枕,尤其是夜间
  3. 决策准备
    • 准备具体问题清单,在专科医生咨询时询问
    • 讨论不同手术方式的优缺点和恢复期预期
    • 了解手术和不手术两种路径的长期风险对比

考虑到您的症状严重程度、持续时间和影像学表现,如果专科医生建议手术,这很可能是合理的建议。神经减压手术对于您这类神经根症状明显、影像学显示确切压迫的患者,通常能带来显著改善。

希望这个综合分析能帮助您更全面地理解自己的病情,并为与专科医生的讨论做好准备。祝您早日找到有效的解决方案,减轻痛苦。

3. Gemini Advanced 2.0 Pro Experimental

我现在会为您提供一个更全面的分析,整合之前的信息并提供更深入的解读。

您的颈椎病情综合分析

病情核心特征

您患有一种比较特殊且复杂的颈椎病变组合,主要包括后纵韧带骨化症(OPLL)和多节段退行性变。这两种问题共同作用,导致了您目前经历的症状。从影像学上看:

    1. 后纵韧带骨化症(OPLL):这是您病情中非常重要的特征,从C1延伸到C6。后纵韧带位于脊柱管前方,一旦骨化会压迫前方的脊髓和神经根。在MRI矢状位切面上,可以清晰看到脊髓前方的低信号区域,正是骨化的后纵韧带。
    2. 多节段狭窄:您有多个节段的椎管和神经孔狭窄,程度不一:
      • C2-C3: 中度椎管狭窄和神经孔狭窄
      • C3-C4: 中度椎管和神经孔狭窄
      • C4-C5: 中度椎管狭窄,重度双侧神经孔狭窄
      • C5-C6: 中度椎管狭窄,右侧重度和左侧中度神经孔狭窄
      • C6-C7: 中度椎管和神经孔狭窄
    3. 椎间盘退变和骨赘:多节段椎间盘干燥(失水)和骨赘形成,进一步加重了神经受压。

症状与影像的关联

您的症状非常典型,与影像学表现高度吻合:

    • 左侧放射性疼痛:从颈部延伸到肩膀和手臂,这与C5-C6和C6-C7节段的神经孔狭窄直接相关,特别是这些节段的神经根(可能是C6和C7神经根)受到压迫。
    • 夜间加剧的疼痛:卧位时椎间隙压力变化,可能导致疼痛加剧。
    • 左侧不能戴手表:这是神经根受压引起的异常感觉,甚至轻微的刺激也会触发不适。
    • 热敷后暂时缓解:热敷改善局部血液循环,暂时舒缓肌肉紧张,但不能解决神经根结构性压迫问题。

手术指征评估

综合您的情况,您已经达到了认真考虑手术的阶段,理由如下:

  1. 症状严重程度
    • 持续两年,近期明显加重
    • 夜间剧烈疼痛导致无法入睡
    • 日常活动受限(无法左侧卧,不能戴手表)
    • 生活质量显著下降
  2. 保守治疗效果
    • 抗炎药物(塞来昔布)效果有限
    • 物理治疗未能长期改善症状
    • 热敷只能短暂缓解
  3. 影像学证据
    • 多节段中重度神经孔狭窄,特别是C4-C5和C5-C6
    • 广泛的后纵韧带骨化,这通常比单纯的椎间盘突出预后更差
    • 影像学发现与症状高度吻合
  4. 疾病进展风险
    • OPLL是一种进行性疾病,可能随时间加重
    • 长期神经根受压可能导致永久性神经损伤
    • 若发展为脊髓受压,将面临更严重的后果

手术干预的优势和考量

  1. 适合您的手术类型
    • 考虑到您有多节段OPLL,且主要症状是神经根性疼痛,手术方式可能包括:
      • 前路颈椎间盘切除融合术(ACDF):针对C5-C6和C6-C7,直接解除神经根前外侧受压
      • 后路椎管扩大减压术:适用于广泛的OPLL,但对神经根减压效果可能不如前路手术
  2. 手术效果预期
    • 神经根减压手术对放射性疼痛和麻木的缓解效果通常很好
    • 大多数患者术后短期内即可获得显著改善
    • 您曾进行过腰椎L5-S1减压手术,效果良好,这表明您对神经减压手术可能有良好反应
  3. 术后考虑因素
    • 颈椎活动度可能部分受限
    • 需要系统的术后康复锻炼
    • 邻近节段退变的长期风险

临界决策分析

您处于手术决策的临界点,这需要平衡多个因素:

  1. 进一步保守治疗的选择
    • 神经根阻滞注射:可尝试CT引导下的选择性神经根封闭
    • 镇痛药物调整:可能需要加用神经性疼痛药物(如普瑞巴林)
    • 强化康复治疗:针对性颈部肌肉训练
  2. 等待与手术时机
    • 已尝试保守治疗两年,效果有限
    • 剧烈夜间疼痛是手术的相对紧急指征
    • 越早解除神经压迫,长期神经功能预后越好
  3. 个体化决策
    • 您的年龄、整体健康状况及对手术的预期都会影响决策
    • 您之前腰椎手术的良好体验是一个积极参考

建议行动计划

  1. 专科评估 - 非常紧急:
    • 尽快完成脊柱外科专家咨询,讨论手术选择、风险和预期效果
    • 同时咨询疼痛管理专家,寻求短期症状控制方案
  2. 短期症状管理
    • 继续热敷、正确姿势和适量活动
    • 遵医嘱服用抗炎药物,可能需要加用神经痛特效药
    • 考虑使用合适的颈椎支撑枕,尤其是夜间
  3. 决策准备
    • 准备具体问题清单,在专科医生咨询时询问
    • 讨论不同手术方式的优缺点和恢复期预期
    • 了解手术和不手术两种路径的长期风险对比

考虑到您的症状严重程度、持续时间和影像学表现,如果专科医生建议手术,这很可能是合理的建议。神经减压手术对于您这类神经根症状明显、影像学显示确切压迫的患者,通常能带来显著改善。

希望这个综合分析能帮助您更全面地理解自己的病情,并为与专科医生的讨论做好准备。祝您早日找到有效的解决方案,减轻痛苦。

【相关】

Xiao Hong (Red): The Man Behind the Autonomous General Agent Manus

 

Manus is a new AI agent developed by the Chinese startup Monica, claiming to be the world's first fully autonomous AI agent. It's designed to handle complex tasks independently after an initial user prompt, such as sorting résumés, analyzing stock trends, and generating interactive websites. Currently, Manus is in a private testing phase, accessible by invitation only.

Xiao Hong is truly exceptional! Clear-minded, shrewd, and remarkably, he maintains both the right mindset and passion while drawing on years of accumulated experience. If I were an investor, I'd back him without hesitation.

I listened to Zhang Xiaojun's exclusive conversation with Manus CEO Xiao Hong in one sitting—fascinating throughout. Xiao Hong speaks plainly and honestly, without mystification or pretense, yet his insights are remarkably precise. He has crystal-clear awareness about positioning himself and his products, about the ecosystem, about the relationship between foundation models and applications, and about the future. As a 10-year entrepreneurial veteran, he maintains his own principles, staying true to himself ("be yourself"). While he fully understands that tech giants like ByteDance will quickly catch up in the large language model agent space, he believes there will always be room for application enthusiasts like himself. He consistently holds model companies in high regard, particularly highlighting how DeepSeek has brought unexpected positive factors to the application ecosystem.

The Man Behind Manus

Xiao Hong (Red) is the CEO of Manus, the recently viral large language model autonomous agent that has become China's new star in the AI industry following DeepSeek. Listening to Xiao Hong's interview feels like reading "startup notes" for the new era of AI applications. This founder, born in the 1990s but already with 10 years of entrepreneurial experience, not only clearly articulates the development trajectory of large language model applications but also explains in plain language to entrepreneurs: in this AI revolution, even if you're not a "model powerhouse" like DeepSeek or OpenAI, you can still find your place.

From Chatbots to Agents: The Evolution of AI Applications

The evolution of AI applications has been as dramatic as the shift from flip phones to iPhones:

Jasper Era: Homework-copying mindset, "Please fill in your target audience and theme~"
ChatGPT Era: Conversational instructions, "Hi, what can I help you with?"
Monica Era: Context awareness, "I see you're reading this article, would you like me to summarize it for you?"
Cursor Era: Vertical coding agent, "I don't just chat, I can write code to solve your problems!"
Agent Era: Asynchronous planning and execution, "Let me break down this task, complete it step by step, and report results as they come~"

Isn't this just like the evolution from "feature phones" to "smartphones"? Xiao Hong discovered a clear main thread behind this evolution: increasingly aligning with ordinary people's habits while expanding capability boundaries in values.

"The New Andy-Bill Law": How Application Companies Consume Model Capabilities

Xiao Hong proposed "the new Andy-Bill law": no matter how powerful the models created by model companies, application companies can consume these capabilities and transform them into user-perceivable value. This is the positioning that large model application pioneers should adopt.

This mirrors how Intel (Andy Grove) would provide more powerful chips, and Microsoft (Bill Gates) would consume that computing power with more powerful operating systems. Now, model companies provide stronger reasoning capabilities, and application companies are transforming them into intelligent agents capable of writing code, calling APIs, and planning execution.

Xiao Hong even half-jokingly offers a startup tip: "Predict what the next model capability will be, build your application around it, and wait for that model capability to launch. When that model capability improves, you'll win at the starting line you foresaw!" As an excellent product expert, he once used this strategy to succeed in his first startup.

The Agent Is Alive! The "Aha Moment" of Large Model Applications

One astounding scene from the interview is Xiao Hong describing their "aha moment" while testing the Agent in development:

They gave the Agent Manus a task to analyze how many animals appeared at a specific timestamp in a YouTube video. The Agent not only opened YouTube but also decided on its own to use the fast-forward button to improve efficiency, precisely located the specified time, and then analyzed the screen content to provide an answer.

Xiao Hong's reaction: "You truly feel like you're creating a life."

Isn't this like a real-life prequel to "Westworld"? Except that today's Agents are still exploring the digital world, not the physical one.

"Think in Terms of Game Theory, Not Logical Reasoning"

Xiao Hong's summary of entrepreneurial thinking is insightful: don't use logical reasoning ("Baidu has the best algorithm engineers, so Baidu will definitely do recommendations well"), but instead use game theory thinking ("because a certain player joins, the entire game rules change"). Logical reasoning doesn't account for ByteDance's success (e.g. in TikTok), but game theory thinking can accommodate new players (like Liang Wenfeng for DeepSeek and Xiao Hong for Manus).

It's like chess—not simply deducing "if I make this move, my opponent will definitely make that move," but considering "because I made this move, my opponent might change their entire strategy."

With this mindset, even in a competitive environment dominated by giants, entrepreneurs can find their opportunities—not through linear extrapolation (which would only lead to the pessimistic conclusion that "everything is an opportunity for giants"), but by becoming variables that change the rules of the game.

In other words, Sam Altman's vision of top model companies crushing everything is at most half true. The space where models and applications each play to their strengths will likely persist for a long time.

Xiao Hong's Golden Quotes

In the interview, Xiao Hong offered several quotable lines worth hanging on entrepreneurs' walls:

    • "When you realize you're innovating and leading, you should be more aggressive, super aggressive"
    • "You should understand it using the most crazy imagination, better not to short it"
    • "Don't put application companies and model companies in opposition"
    • "Being yourself is most important, rather than being reactive"

Finally, facing the rapidly developing future of AI, Xiao Hong concludes by quoting Jensen Huang: "What would happen in the next few years that would surprise you?" Huang's answer was: "Basically nothing."

In other words, in the AI era, don't be too surprised by whatever crazy things happen—even if your Agent starts using the fast-forward button to watch YouTube videos and answer questions. After all, the awakening of agents is only just beginning!

 


[Related]

The Agent Era: The Contemporary Evolution from Chatbots to Digital Agents

Manus is a new AI agent developed by the Chinese startup Monica, claiming to be the world's first fully autonomous AI agent. It's designed to handle complex tasks independently after an initial user prompt, such as sorting résumés, analyzing stock trends, and generating interactive websites. Currently, Manus is in a private testing phase, accessible by invitation only.

Unveiling 2025's Hottest AI Application Form

The recent explosion of Manus, billed as the first general-purpose agent product, has brought the AI industry buzzword "agent" to the public's attention, and has at least been effective in educating and inspiring the market. Manus's beta release demos have been impressively powerful, offering a glimpse of what agent technology can truly achieve. Whether Manus represents a genuine breakthrough or merely well-marketed hype, everyone is now curious about the emerging era of large language model agents. But what exactly is an agent?

I. From Co-pilot to Pilot: The Evolution Code of Agents

When ChatGPT exploded onto the scene, humanity realized for the first time that AI could not only answer questions but also do all kinds of knowledge tasks (translation, summarization, writing, you name it) as your "cyber assistant". Early Copilot-type assistants functioned like diligent interns—obedient and responsive, answering when asked and acting when commanded. Today's Agents have evolved into "digital employees" capable of figuring out solutions to problems independently. They are no longer passive assistants waiting for instructions, but intelligent agents that can autonomously plan, break down tasks, and utilize tools.

    • Copilot mode: You command "write an English email," it generates text and waits for you to confirm or use it
    • Agent mode: You say "resolve the customer complaint within budget x," and it automatically retrieves order data → analyzes the problem → generates a solution → orders compensation gifts within budget → synchronizes the resolution record with your CRM system

This qualitative leap stems from three major technological breakthroughs:

    1. Extended context windows: New LLMs can remember conversations of up to 1 million tokens (equivalent to an entire Harry Potter novel), building continuous working memory
    2. Reasoning engine: Evolution from simple Chain-of-Thought to Tree-of-Thought reasoning, enabling multi-path decision making
    3. Digital limb growth: API calls + RPA (simulating human software operation) + multimodal input/output allowing AI to truly "take action" without human intervention during the process

II. The Seven Weapons of Agents: Beyond Conversational AI

The combat power of today's top Agents comes from a "technical LEGO set" composed of seven core components:

① Search+RAG

    • Real-time capture of the latest information via built-in search: stock quotes, flight status, academic frontiers
    • Connection to enterprise knowledge bases: instant access to employee manuals, product specifications, customer profiles
    • Case study: A medical Agent can simultaneously retrieve the latest clinical guidelines and patient medical history during diagnosis

② Coding Capabilities

    • Automatically writing scripts to process Excel files
    • Transforming into a "digital developer" during debugging
    • Even developing complete applications
    • Impressive demonstration: During testing, a Windsurf Agent independently wrote a webpage with login/payment functionality

③ Software Operation (Computer Use)

    • No API interface? RPA still directly simulates human operations!
    • Operates browsers, Photoshop, and OA systems just like a human would
    • Game-changing scenario: An Agent autonomously completing the entire workflow from flight price comparison → booking → filling expense forms

④ Memory Vault (Vector Database)

    • Permanently remembers your work habits: "Director Wang prefers blue templates for Monday morning meeting PPTs" "Accountant Zhang's reports must retain two decimal places"
    • Localized storage ensures privacy and security

⑤ Multimodal Capabilities

    • Input and output no longer limited to text:
      • Converting voice meetings into visual minutes
      • Transforming data reports into dynamic videos
      • Generating mind maps while listening to podcasts

⑥ Multi-Agent Collaboration: Complex tasks tackled by "intelligent teams"

    • Commander Agent: Formulates battle plans
    • Scout Agent: Monitors data in real-time
    • QA Agent: Cross-validates results
    • Diplomatic Agent: Requests resources from humans

⑦ Planning and Reasoning

    • Breaking down vague instructions like "organize a product launch" into 100+ subtasks
    • Dynamically adjusting plans: When a venue is suddenly canceled, immediately activating Plan B

III. The Bipolar War in the Agent Universe

The agent landscape is currently witnessing a "generalist vs. specialist" showdown:

Generalist Camp

    • Key players: Manus, GPT-5 (? rumored to integrate all capabilities)
    • Advantages: Universal capabilities—coding, designing, project management all in one
    • Potential risks: Vulnerability to disruption by tech giants (for example, GPT-5 or DeepSeek R3 potentially crushing Manus)

Specialist Camp Lineup:

    • Medical Agents: AI doctors capable of examining CT scans, making diagnoses, and writing prescriptions
    • Legal Agents: Generating flawless contracts in three minutes
    • Financial Agents: Trading operators monitoring 37 global exchanges in real-time
    • Moat: Industry know-how + dedicated toolchains creating competitive barriers

IV. Hopes and Concerns in the Agent Era

On the Eve of Breakthrough:

    • Technical infrastructure largely in place (sufficiently long context + mature toolchain)
    • Multimodal large language models filling the final gaps
    • 2025 potentially becoming the true "Year of the Agent"

Undercurrents:

    • Privacy concerns: Agents requiring deep access to user data
    • Ethical dilemmas: Who bears responsibility when an Agent books a hotel without explicit approval?

V. The Future Has Arrived: A New Paradigm of Human-Machine Collaboration

As Agents gradually master these ultimate skills:

Predictive capability: Anticipating your needs in advance ("Rain detected tomorrow, outdoor schedule modified")

Embodiment: Robots infused with "souls" executing physical actions autonomously (Robot + Agent = Robot butler)

Humans are finally entering an era where "the noble speaks but doesn't lift a finger"—humans set goals, while Agents handle all implementation details and solution paths. This quiet efficiency revolution is reshaping the rules of the game across every industry.

The only question is: Are you ready to embrace your digital colleague?

 

【相关】

o3 deep research: 智能体的分野和演进

引言

人工智能 Agent(智能代理/智能体)的发展正引领着新一轮技术变革。从最初只能回答问题的聊天机器人,到如今能够自主规划和执行任务的Agent(智能代理/智能体),AI 正在从“工具”走向“生产力主体” (link1)。针对 AI Agent 的架构设计,出现了两种不同范式:白硕老师称之为编译型 Agent 与解释型 Agent。简言之,前者将大部分“智能”体现在开发编译阶段,由开发者预先规划好流程;而后者则把智能延伸到运行时,Agent 在执行过程中自主决策。两者理念迥异。

要理解二者区别,可以类比传统软件开发的编译型语言和解释型语言:前者在运行前将逻辑“翻译”成机器可执行的固定步骤,运行时按部就班;后者则在执行时动态遵循用户指令,能够根据实时情况改变流程。换言之,编译型 Agent就像提前写好的脚本,解释型 Agent更像实时推断的决策者。LangChain 框架在常规 Chain(链式流程)中,行动序列是硬编码写死的;而在 Agent 中,智能体使用大型语言模型(LLM)作为推理引擎动态决定采取哪些行动、以何种顺序执行 (link2)。这种自治决策能力正是解释型 Agent 的核心特征。

本文将深入解析编译型与解释型 Agent 的差异,探讨智能在编译环节(开发阶段)与运行时(执行阶段)的不同作用机制。我们将结合技术架构和实例,对两类 Agent 的应用场景、优势局限,以及未来演进趋势进行分析。

编译型 Agent:智能在编译阶段的应用

定义与特点: 所谓编译型 Agent,是指在开发阶段就将 Agent 的行为逻辑和任务流程规划完善,类似于把“智能”预先编译进固定的脚本或流程图中。Agent 运行时主要按照预定义的流程执行,很少偏离既定路径。其特点包括:流程高度可控,可预测性强,但灵活性相对不足。一旦运行环境或需求发生超出预设的变化,Agent 无法自适应,需开发者重新修改流程(相当于重新“编译”)。这种类型的 Agent 常通过可视化工作流或规则引擎实现,开发者扮演了主要的规划者角色。

主要应用场景: 编译型 Agent 广泛应用于流程固定、规则明确的任务自动化场景。例如机器人流程自动化 (RPA) 软件就是典型案例:RPA 按照预录制的脚本,在各类系统界面上执行点击、输入等操作,完成表单处理、数据迁移等重复性工作 (link3)。这类任务流程相对稳定,可用事先编排的脚本完成。又如各类无代码/低代码自动化工具,比如字节跳动推出的“扣子 (Coze)”平台 (link4)。扣子允许用户通过可视化流程编辑器拖拽节点,配置对话逻辑和插件调用,实现复杂业务的自动化处理 (link5)。用户无需编程,只需在开发阶段将业务流程和AI调用方式设计好,生成的Bot便会按照该流程运行。再如企业中的固定业务流程自动化(审批流程、报表生成等),也多采用此类预定义流程的 Agent 实现。总体而言,凡是可以提前明确步骤的任务,都适合用编译型 Agent 来部署,比如批量操作办公软件、网站信息采集、报表定制等。

优势与局限性: 编译型 Agent 的优势在于其确定性和可靠性。由于所有步骤在上线前已调试固定,这类 Agent 在已知场景下表现稳定,不易偏差。RPA 机器人可以24小时不知疲倦地执行,大幅降低人为错误和成本 (link6)。预定义流程也便于通过规则约束,确保符合业务规范。然而,这种“提前规划”的模式也带来明显局限。一方面,Agent 对环境变化十分敏感,缺乏弹性。如果外部系统界面稍作调整,输入数据格式发生变化,硬编码的流程可能立即失效。例如有RPA用户抱怨:“表单里新增一个字段,机器人就无法继续工作,必须重新配置任务” (link7)。编译型 Agent 缺少在运行时举一反三的能力,无法随机应变。另一方面,由于流程复杂度随着需求增长而上升,预先编排过于繁琐的流程图也会降低开发效率,且后期维护成本高。当任务超出原设计范围时,Agent 只能执行有限职责,无法自主学习新技能。总之,编译型 Agent 擅长按规划行事,在稳定重复场景中效率惊人,但面对未知情况时显得僵化,需要人工干预调整。

值得注意的是,随着大模型技术的发展,一些编译型框架开始在开发阶段借助 AI 提升流程设计智能化。例如企业级RPA厂商推出了“一句话生成自动化流程”的功能,由大模型根据自然语言描述自动编排RPA脚本 (link8)。这种做法本质上仍是编译型思路,只是把开发者从手工绘制流程中解放出来,让 AI 辅助完成“编译”,把低代码编程进一步推向了零代码。但不论流程由人编还是AI编,运行时执行的仍是固定脚本,没有运行中再规划的能力。

解释型 Agent:智能在运行时的应用

定义与特点: 解释型 Agent 指的是将 AI 的推理与决策能力融入运行环节,让 Agent 在执行过程中根据实时情况动态规划和调整行为。它不依赖预先硬编码的完整流程,而是赋予 Agent 某种程度的自主性:接受一个目标后,Agent 可以自己思考如何达成,边执行边调整策略。这类 Agent 通常由有推理能力的大型语言模型(LLM)驱动,具备自然语言理解、环境感知和自主决策能力 (link9)。它接收人类的高层指令或目标描述,能够拆解子任务调用外部工具,循环执行“思考->行动->观察”的过程,直到完成目标 (link10)。其运行流程是解释型的:每一步都通过 AI 推理决定下一步做什么。相较编译型的刚性,解释型 Agent 具有高度的灵活性和适应性,可以处理开放式问题和动态环境。
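把这个“思考->行动->观察”循环写成最简的代码草图大致如下(llm() 假设返回包含 action、input、answer 字段的字典,tools 是工具名到函数的映射,二者均为示意接口;真实系统还需要提示词模板、错误处理与权限约束):

def agent_loop(goal, llm, tools, max_steps=10):
    history = [f"目标:{goal}"]
    for _ in range(max_steps):
        decision = llm("\n".join(history))             # 思考:由模型决定下一步动作
        if decision["action"] == "finish":             # 模型判断目标已完成
            return decision["answer"]
        observation = tools[decision["action"]](decision["input"])   # 行动:调用外部工具
        history.append(f"行动:{decision['action']}({decision['input']})")
        history.append(f"观察:{observation}")          # 观察:结果写回上下文,进入下一轮
    return "已达最大步数,任务未完成"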

主要应用场景: 近年来兴起的自主决策 Agent 与交互式 AI 助手大多属于解释型 Agent。例如火爆一时的 AutoGPT 就是一种自主执行 Agent。用户给出一个目标,它会利用 GPT-4 等模型生成计划、执行任务、根据反馈调整,直至达到目标。AutoGPT 被形象地比喻为“无人驾驶的汽车”,只需给定目的地就会自动完成旅程 (link11)。它能够将指令拆解为多个子任务,逐一执行,并在过程中调用各种API接口,实现长时间不间断自主工作 (link12)。再如 Manus,一个号称“通用型”的交互式 AI 助手。相比只会给建议的传统ChatGPT,Manus前进了一大步,可以自主规划并直接交付完整任务成果 (link13)。用户让它分析数据、写报告,Manus 会自己决定使用哪些工具、执行哪些步骤,最终产出结果。Manus背后采用了多智能体协同架构(规划Agent+执行Agent+验证Agent等)来实现这一过程 (link14)。此外,一些嵌入日常软件的AI 助手(如可以自主帮你查邮件、安排日程的数字秘书)也逐渐出现。这些场景的共同点是:任务目标可能多变且无法完全穷举预设,需要 Agent 具备即时决策与上下文理解的能力,才能根据当前输入和环境做出合理行为。解释型 Agent 擅长充当这样的“智能决策者”和“交互伙伴”。

事件驱动 vs 流程硬编码: 解释型 Agent 的另一大特点是事件驱动、灵活反应。它不再依赖事先硬编码的流程图,而是根据环境输入(事件)实时决定响应方式。这使其能够处理非预期情况,拥有更灵活的任务处理方式。如上文所述,LLM 驱动的Agent可以调用长期和短期记忆,将新输入纳入上下文,动态调整接下来的步骤 (link15)。正因如此,这类Agent不会像传统RPA那样在遇到新数据格式时陷入“死胡同”,而是能够自适应,修改任务序列再尝试 (link16)。这种事件驱动的自主性让解释型 Agent 可应对更多长尾场景。例如,当用户临时改变目标,Agent 可以即时重新规划;当外部API返回异常数据,Agent 可根据错误信息调整调用参数甚至更换方案。相比之下,编译型 Agent 只能处理开发者预想好的分支,对于未覆盖的情况就无能为力。因此,在需要实时决策高适应性的任务中(如实时监控报警处理等),解释型 Agent 展现出巨大优势。当然,运行时动态决策也带来了新挑战,包括决策正确性、可控性以及资源开销等,我们会在后文进一步讨论。

编译型与解释型 Agent 在决策机制和能力侧重上截然不同:前者将智能更多体现在编译时,由人来“教会”Agent做什么;后者则让Agent在运行时运用智能自行“figure out”怎么做。实际应用中,需要根据任务性质选择合适的 Agent 类型。

技术架构分析:从编译到运行的智能分布

要深入理解编译型和解释型 Agent 的实现差异,必须拆解其技术架构,看看“智能”分别分布在系统的哪些部分。主要涉及以下关键技术要素:规划决策能力、上下文记忆、工具调用接口以及内化技能等。

规划能力: 在编译型 Agent 中,规划主要发生在开发阶段,由人或辅助工具完成。开发者按照业务逻辑把任务分解成步骤,写入流程图或代码中。因此规划算法本身相对简单,更多依赖于人的经验和业务规则。运行时不需要复杂的决策推理,只需遵循既定计划。而在解释型 Agent 中,规划能力被移至运行时,由Agent自己完成。这通常借助于LLM的推理能力实现复杂任务拆解和策略搜索 (link17)。例如解释型 Agent 会在接受目标后,生成一系列子目标或行动提议,评估其可行性,再选择最佳方案执行。如果遇到新信息,还会重新规划。这里涉及的技术包括基于大模型的思维链 (Chain-of-Thought)推理、树状搜索算法,甚至元推理等 (link18)。一些高级Agent架构还涉及自我反思、自适应规划模块,不断优化决策质量。因此,解释型 Agent 对赋能AI模型的规划能力要求很高,需要模型能理解任务语义,推导步骤顺序,并具备一定的推理正确性保障。最近的OpenAI头部推理模型o系列以及DeepSeek R1 应运而生。

上下文记忆: 无论哪种 Agent,记忆和状态管理都是关键。编译型 Agent 通常只需维护有限的运行状态(例如循环计数、中间变量),上下文非常有限。由于流程固定,Agent 不需要长期积累对话或环境信息,更多是一次性地按照输入->处理->输出即可。因此,传统编译型 Agent 的“记忆”多是程序局部变量或数据库读取,谈不上智能记忆。而解释型 Agent 则必须具备类似人类的记忆能力:既包含短期记忆(对当前对话或近期事件的上下文),也包含长期记忆(跨会话的知识、历史经验)。技术上,这通常通过缓存对话历史、引入向量数据库存储知识、甚至利用专门的记忆模块来实现 (link19)。例如,一个对话型解释Agent会保存与用户过往交互的关键信息,或将重要事实嵌入向量空间以便后续检索。当Agent规划后续行动时,会查询其记忆模块,以确保决策考虑到相关背景。Memory 模块也可以帮助 Agent 实现状态持久化,使其在跨多个子任务执行时保持一致的上下文。举例来说,AutoGPT 这类 Agent 在执行多个子任务时,会将每个步骤结果写入一个“内存”文件或数据库,作为下一步的输入依据,从而保持全局状态的一致 (link20)。这一点在需要跨越长时间或复杂流程的任务中尤为重要。如果没有上下文记忆,解释型 Agent 将难以进行连贯的推理和执行。可以说,记忆赋予了解释型 Agent 持续思考的能力,使其不像无头苍蝇一样乱撞。这也是为什么许多Agent框架强调结合短期会话记忆和长期知识库,以增强Agent解决实际问题的能力 (link20)。

工具调用 vs 内化能力: 工具使用是Agent实现复杂功能的关键。编译型 Agent 一般通过API调用RPA UI操作来使用外部工具,但这些调用点都是在开发时写死的。例如一个报表生成Agent,开发者会在流程中写明:“步骤3:调用数据库API获取销量数据;步骤4:调用图表库生成图表”。Agent 运行时按此执行,不会自行更改调用顺序。解释型 Agent 则倾向于在运行中灵活调用各种工具。Agent接到任务后,会根据需要评估使用何种工具来完成子任务 (link21)。例如需要计算时调用计算器插件,需要查信息时用网络搜索工具等。LangChain 等框架提供了工具插件机制,Agent可动态选择并调用。这要求Agent具备一定的“工具使用知识”,即知道何时该用何工具、如何解析工具返回结果并纳入后续决策。LLM 可以通过提示词学习这方面能力。例如AutoGPT内置了浏览网页、执行代码等多种命令,Agent会根据自己计划选择调用,再将结果写入思维链继续处理。内化能力指的是Agent依靠自身模型完成某些任务的本领。例如有些文本分析、语言翻译等工作,大模型本身就能胜任,Agent 无需借助外部API即可完成——这相当于技能内化在模型中。而对于模型不擅长的精确计算、访问实时数据等任务,则通过工具调用来补足。两类 Agent 在设计上对工具的依赖程度不同:编译型 Agent 更依赖显式 API 接口,所有外部交互都由开发者指定;解释型 Agent 更强调通过一个通用接口来使用各种工具,由AI决定调用哪些接口 (link22)。例如Manus的架构中,就内置了浏览器自动化、RPA机器人和第三方API等多种模块,形成一个通用的跨平台执行引擎 (link23)。这样的设计使Agent可以根据需要自由组合手段完成任务。然而,这也带来风险:Agent对外部接口的滥用或误用需要受到控制,否则可能造成不良后果(比如调用文件删除命令)。因此工具使用模块通常会增加权限管理和有限动作空间,以确保Agent不会越权。总的来说,解释型 Agent 通过灵活使用工具和内化技能,拥有了触及外部世界的“手脚”,这远比仅靠预设接口的编译型 Agent 更加机动。

运行环境:驻留型 Agent 的 Runtime 特性: 对于持续运行的解释型 Agent,其运行环境(Runtime)需要支持一些特殊特性。例如,一个驻留守护的Agent相当于一个异步后台进程,需要能够待机监听事件触发。运行环境可能需要实现事件循环或订阅某些消息总线,使Agent可以在无用户交互时被动接收触发信号(如定时事件、新邮件通知等)。这类似于操作系统中的daemon进程或服务,需要管理其生命周期。此外,Runtime 需提供资源管理和隔离机制。因为解释型 Agent 可能长时间运行并调用系统资源,必须防止内存泄漏、无限循环等问题,对其CPU/内存占用进行限制,必要时能平稳中止Agent执行。此外,驻留Agent往往需要持久存储来保存长久记忆和中间状态,例如将对话历史存入本地数据库,以便重启后能恢复上下文。安全性也是Runtime必须考虑的,需确保Agent的每次工具调用都在受控沙箱或虚拟机中执行,防止对主系统造成破坏。相比之下,编译型 Agent 的runtime要求就简单许多,因为流程短暂且可控,一般不需要上述复杂的调度和隔离机制。可以预见,未来若要实现真正实用的常驻型AI Agent,一个强大的智能体运行环境(类似“Agent操作系统”)将是不可或缺的,它要能调度AI的认知处理,与底层系统资源打交道,并解决长期连续运行带来的一系列工程问题。这也是当前很多Agent框架开始探索的方向,例如有人提出打造通用的“LLM-OS”来统筹Agent的运行 (link24)。

接口设计:万能接口 vs 专用接口

设计 AI Agent 时,一个重要考量是如何与外部世界交互,也就是接口的选择。这里的接口既包括Agent调用工具/数据的途径,也包括Agent与用户或环境通讯的方式。总体而言,存在万能接口专用接口两种思路,各有优劣。

浏览器 vs API:Agent 调用外部工具的方式:浏览器可以被视作一种万能接口。人类几乎可以通过浏览器访问任何在线服务或系统界面,同样地,Agent 若学会使用浏览器,就能够执行各种网页上的操作,实现跨系统的任务。例如Agent可以打开网页、填表提交、抓取信息等,相当于在UI层模拟了人的行为。很多RPA工具正是采取这种思路,通过UI自动化兼容各种老旧系统而无需官方API (link25)。近期一些Agent项目(如AutoGPT的浏览插件)也使用了浏览器自动化来完成上网搜索、网站操作等任务。然而,浏览器接口的通用性是以效率和可靠性为代价的。UI 操作脆弱且缓慢,页面结构稍变脚本就可能失效 (link26)。相反,专用API接口则提供了直接、高效的交互方式。通过API,Agent可以以结构化数据形式与服务交互,速度快且不易受界面改变影响。比如直接调用一个天气API获取温度,比操作浏览器搜索天气再解析网页要稳健得多。因此,在接口设计上,需要权衡通用性专用性。一种折中做法是:针对高频或关键服务,尽量使用官方API(专用接口)以保证可靠;而对于没有公开API的任务,Agent可以退而求其次使用浏览器自动化等通用接口完成。这也是为什么像Manus这样的先进Agent,同时内置了浏览器自动化、RPA、API 等多种模块 (link27)——它提供一个通用驱动层,可以视情况选择最佳的调用途径。未来Agent开发平台可能会汇集上百种工具插件,无论是浏览器执行JS、操作本地应用,还是HTTP API,都通过统一的接口规范供Agent调用。开发者需要为Agent设计合理的决策逻辑,让其学会在不同接口之间做选择:什么时候走万能路径,什么时候用专用捷径。

内化的威胁:为什么万能接口更易受到替代? 这里的“内化”是指平台方将某项功能直接整合为自身能力,从而替代掉外部Agent的价值。如果Agent只是通过万能接口调用别人的服务来实现功能,那一旦这个功能被平台内置,Agent 就失去了存在意义。例如,一个Agent通过网页操作实现了自动邮件分类,但邮件服务商后来自己推出了AI分类功能,用户自然更倾向使用官方内置方案。相比之下,专用接口往往建立在更紧密的合作关系或独有数据上,不容易被简单替代。如果Agent能深度集成某企业内部系统,通过私有API完成复杂业务流程,那么外部很难重现这种集成度。万能接口的通用性导致替代门槛低:任何人都可以开发类似的Agent脚本来调用同样的公共接口或页面。而垂直专用接口则受益于生态绑定,具有一定进入壁垒。这对Agent创业公司提出了警示:如果只是做一个到处点网页按钮的通用Agent,中长期看可能会被大厂直接在产品中实现类似功能而替代。要打造有护城河的Agent,需考虑如何深化垂直集成,获取独特的数据和接口授权,或者提供平台难以及时复制的个性化长尾功能。

垂直领域的集约化 vs 长尾化: 从行业生态看,Agent技术可能促使垂直领域集约长尾需求满足两种趋势并存。一方面,行业巨头和专业厂商会将Agent思想融入各自领域,提供深度优化的垂直解决方案。例如CRM厂商可能推出智能销售Agent,结合内部客户数据和业务流程,表现远优于通用Agent。这会使各垂直领域的AI自动化更加集约,由懂行的玩家提供专业Agent服务。我们已经看到RPA厂商在金融、制造等领域构建特定场景的自动化方案,未来这些方案会纳入更多AI智能,演化成领域Agent。另一方面,大量零散的、个性化的长尾需求难以被大公司全部覆盖,比如帮个人整理特定格式的资料、处理私人化任务等。这些需求非常多样且小众,正是通用Agent大显身手的空间。一个足够通用灵活的Agent可以作为平台,让用户自行定制各种脚本,满足长尾场景。这实际上类似App Store生态,只不过应用开发者变成了Agent本身。如果Agent拥有强大的学习和适应能力,那么只要用户以自然语言描述需求,它就能胜任——这样无论多小众的需求都可以被满足,而不必等待厂商开发专门功能。因此,Agent生态很可能走向“双轨”:主流高价值场景上,大厂提供安全可靠的专用Agent功能;同时,一个开放的Agent平台生态去承接海量长尾任务,为个人和小团队提供定制智能服务。这两者并不矛盾,而且可以互补:通用Agent平台培养起来的创新点子,反过来也许会被验证后垂直化商业化。对于开发者而言,需要关注哪些接口和能力值得投入构建,以及如何在专用与通用之间找到平衡。既要利用好万能接口快速覆盖功能,又要在可能的情况下打通更底层的专用接口,提高Agent服务的不可替代性。

案例分析:Manus、AutoGPT、扣子

为了更具体地理解编译型和解释型 Agent,我们对比三个具有代表性的案例:Manus(最新解释型)、AutoGPT(解释型)和扣子Coze(典型编译型平台)。

Manus:解释型智能体的前沿探索 – Manus号称“全球首款通用型 AI Agent” (link28)。Manus 使用了多Agent协同架构:由不同模型分别担任规划、执行、验证等角色 (link29)。当用户给予复杂任务时,规划Agent首先基于目标产出一系列子任务方案,执行Agent依次完成每个子任务并调用相应工具,验证Agent负责检查每步结果和整体质量。这种架构相当于把解释型 Agent 的内部决策流程模块化,由专门模型扮演不同阶段的决策者,减少单个模型出错概率。Manus 的亮点在于全流程自主执行:它不仅输出建议,还能真正完成任务交付 (link30)。例如官方演示中,Manus从15份简历中筛选出适合某职位的人选,整个过程包括解压文件、逐页浏览简历、记录要点、汇总生成报告,全程完全自主完成 (link31)。这展示了解释型 Agent 强大的任务串联能力。然而,目前的 Manus 并没有提供一个永远在线的守护模式。用户需要给定具体任务(比如让它写一份分析报告),Manus 才会实例化执行一次 (link32)。任务完成后如果要做别的,需要重新下达指令。虽然Manus支持将任务放在云端异步执行,让用户关机后仍能拿到结果 (link33)(体现了一定驻留能力),但它不会在无任务时长期占用资源自行运行。所以可以说,Manus是解释型 Agent 的即时执行代表:具备很强的自主决策和跨工具协作能力,但目前缺乏事件监听的常驻式架构。Manus的智能主要体现在运行阶段的规划和工具使用上,其开发阶段更多是对多模型架构和提示的精心设计,以及提供丰富插件供其使用 (link34)。这符合解释型 Agent 的特征:设计者赋予它灵活的大脑和手脚,具体怎么走路、做事由Agent自己决定。Manus案例也揭示了解释型 Agent 的一个现实挑战:要在实际中取得良好效果,需要精巧的工程化,实现各模块稳健协同,否则自主也可能带来自主的崩溃(据报道,一些用户在复杂任务下遇到了Manus卡死的情况 (link35))。但无论如何,Manus证明了解释型 Agent 的巨大潜力,其“想法到执行”的闭环体现出AI智能体相较传统AI助手的质变 (link36)。

AutoGPT:与Manus类似但更为单一,AutoGPT是跨多个子任务保持状态的自主Agent – AutoGPT 是开源社区的明星项目,它让GPT模型化身一个可以连续执行的Agent。AutoGPT的显著特点是实现了任务的链式推进和长期状态保持。当用户赋予一个总目标时(例如“研究某个市场趋势并给出报告”),AutoGPT会自己创建待办任务列表,然后逐一执行,每完成一项就更新列表,直到所有任务结束。为了支撑这一过程,AutoGPT 引入了持久内存来跨步骤保存状态 (link37)。典型实现中,它会把每次思考得到的结论、执行的动作及结果追加到一个日志或内存文件中,这样下一步决策时可以查阅之前的信息。比如第1步通过网络获取了几条资料摘要,第2步在做分析时就能从内存中取出这些摘要作为依据,而不必重复调用网络。AutoGPT 等自主Agent还常结合向量数据库,将较大信息存入向量索引,方便需要时语义检索。这种记忆机制保证了Agent在跨越多个工具和领域的操作中上下文连贯。举例来说,AutoGPT可以先调用浏览器搜索获取文本,然后调用本地代码执行环境来运行分析脚本,再将结果交给GPT总结——整个过程中,它用内存把搜索结果传递给后续步骤,用文件系统把分析脚本输出交给总结阶段,保持了任务链条的衔接 (link38)。相较而言,人类开发者如果用编译型思路也能写出类似流程,但AutoGPT的厉害之处在于这些步骤由AI自主衍生,而非人工写死。当遇到意外情况,比如某一步得到的结果不符合预期,AutoGPT还能让GPT调整后续计划,增删任务以纠偏 (link39)。当然,AutoGPT 并非完美,目前它有时会陷入“思维循环”或徒劳子任务,消耗大量API调用。这提醒我们,自主Agent在保持长期状态时也需要加入约束策略,避免无休止运行。但不可否认,AutoGPT作为解释型 Agent,充分展现了跨多个任务的自主性和持续性:通过内存和任务队列结构,Agent 获得了贯穿任务始终的“自我”,能够记住目标、追踪进度、灵活调整方案,真正做到了一次启动即可自动连续工作。其智能主要来源于运行时GPT模型不断interpret环境和自我提示。开发阶段则侧重于框架搭建,比如设计好任务列表的数据结构、记忆读写接口、通用的工具插件,让GPT在这个“容器”中发挥最大效用。AutoGPT 的出现对业界影响深远,它证明哪怕没有复杂多模型架构,仅用一个大模型配合精巧的循环设计,也能实现相当水平的自主Agent (link40)。这为众多轻量级应用铺平了道路,也引发了对Agent安全性、成本等问题的讨论。

扣子(Coze):典型编译型 Agent 平台 – 扣子是字节跳动推出的无代码AI智能体开发平台 (link41),很像是大模型时代的RPA低代码平台。从定位上看,它更偏向编译型 Agent,即通过图形化配置在编译阶段构建智能体逻辑,运行时按既定逻辑执行。使用扣子,用户可以拖拽流程节点来设计对话机器人的对话流、调用插件时机等 (link41)。比如为客服机器人设计一个流程:先通过NLP分类用户意图,然后根据意图选择不同回复模块,或调用插件查询数据库,最后输出答案。在这个过程中,用户明确规定了AI Bot在各种分支情况下的动作顺序。扣子也提供了知识库工作流等功能,可以看作开发阶段提供给Agent的“能力配置”。但重要的是,决策逻辑基本都在设计时敲定。运行时,扣子的Bot面临用户提问时,会先经过预设的意图识别(可能由内置模型完成,这算是AI能力的调用,但流程上是固定第一步),然后根据匹配到的意图路径执行相应操作。如果遇到未涵盖的新问题,除非有默认兜底,否则Bot可能无法妥善处理,因为它不会临场生出新流程。由此可见,扣子的智能主要体现在开发环节的人机协作:由业务专家利用平台提供的AI组件,把智能能力嵌入流程。例如可插入一个“OpenAI 问答”节点来解答未匹配问题,或者通过Prompt设计,让Bot按某种风格回答。这些都属于在编译阶段赋予Agent的“智能”。一旦Bot发布上线,它的行为边界和表现就取决于预配置。扣子的优势在于低门槛快速搭建行业定制Bot (link42)。正因为采用编译型模式,其输出可控性强,企业乐于接受。然而也正因此,扣子类平台很难覆盖长尾需求和高度开放的任务。它适合垂直领域专家系统:比如培训一个医疗问答Bot,预设所有问答流程节点和知识调用。而若让扣子的Bot去做一个需要实时网搜的复杂决策任务,就力有不逮(尽管可以接入联网插件,但仍需预设何时使用它)。扣子是典型的编译型 Agent:通过丰富的插件和模板,让开发者在编译阶段构建智能体逻辑,把各种AI能力集成进固定流程,从而快速实现业务需求。但它没有也不追求让Agent自主生成新流程或自主长期运行,这方面仍然需要解释型 Agent 来补足。扣子背后折射出当前工业界对AI Agent的务实取舍:在可控性和灵活性之间,更偏向可控性。对于很多企业应用,宁可牺牲一些灵活度,也要保证行为可靠、结果可预期。因此像扣子这样的平台将AI赋能锁定在开发阶段,由人主导智能的应用,而把运行阶段视作严格执行时期。这与AutoGPT等探索形成了鲜明对比。

小结: Manus、AutoGPT、扣子三者分别代表了多智能体协作长程解释型连续自主解释型编译型预设三种路径。Manus展示了多Agent协同在一次任务内实现高度自主的可能,AutoGPT则展现了单Agent跨任务链自主运行的潜力,而扣子强调在开发期集成智能保障可靠输出。它们各自适配场景的不同。Manus适合一次性复杂任务的全自动执行,AutoGPT适合持续自主工作代理,扣子适合清晰业务流程的快速落地。未来的Agent解决方案或许会将这些思路融合:既利用开发期的规划保障特定环节可靠,又允许运行期的Agent自主处理不可预测的子问题。例如,一个企业可能用扣子搭建主体流程,但其中某一步“智能分析”调用AutoGPT模型来灵活完成,然后再回到主流程。这种混合架构已经在一些超自动化产品中初现端倪 (link43)。可见,编译型与解释型Agent并非截然对立,而是可以互为补充,共同构建更强大的智能体系统。

未来展望:Agent 生态的演进与挑战

随着大模型的快速演进,智能体生态也在发生深刻的变化。编译型与解释型 Agent 的界限可能逐渐模糊,未来的Agent系统很可能融合两者优点,呈现新的形态。以下是对未来趋势的一些展望和需要攻克的挑战:

更强的实时推理: 未来的解释型 Agent 将具备更强大的即时推理和决策能力,能够在复杂开放环境中做出接近专家水准的判断。这依赖于基础模型能力的提升和推理机制的优化。例如,引入更先进的自我反思机制,让Agent在执行过程中不断评估自身表现,像人一样反思改进。近期研究表明,给LLM增加一个反思反馈回路可以显著提升多步骤推理的准确性 (link44)。未来Agent或许会内置一个元认知模块,监控LLM的输出,对其不合理之处进行修正,从而减少荒谬错误。同样重要的是高效长程推理:当前LLM在长任务、多层规划上易出错或效率低,未来需要在模型结构上进一步突破,或者通过级联专家模型的方式提升性能。另外,强化学习(RL)等技术可能更多应用于Agent,让Agent可以通过反复试错自行优化策略。实时推理能力的增强也意味着Agent可以承担更高价值、更复杂的决策任务,例如实时交易决策、自动驾驶决策等,这对可靠性提出了更高要求。

更深度的多模态能力: 人类智能的一大特征是可处理多模态信息并在物理世界行动。未来Agent必然朝这个方向发展,拥有更全面的感知和行动能力。这包括视觉、听觉、甚至机器人实体能力。例如,一个未来的个人数字Agent可能同时接入你的摄像头、麦克风,能“看到”周围环境、“听到”会议内容,并据此决策行动(比如控制家中IoT设备)。技术上,大模型已经迈向多模态,目前的GPT-4等模型具有一定图像理解能力。可以预见,下一代基础模型将全面融合文本、图像、音频、视频等多种模态。这将赋予Agent真正的环境感知力,使其达到 AI 技能进化的第三级:从仅有互联网文本到拥有感知空间 (link45)。同时,如果再结合实体机器人,让Agent具备操作物理世界的能力,则更上一层(具身智能)。多个Agent之间通过联网交互,形成社会协作,又是更高层次 (link46)。当然,实现这一切需要解决大量工程问题,如多模态数据的高效处理、实时性要求、机器人控制技术等。但方向已经明确:未来的AI Agent将不再局限于屏幕前打字,而是可以“看”、“听”、“动”,成为数字世界和物理世界的桥梁。这将打开无数新应用,例如智能安防监控Agent(自动分析摄像头画面并采取措施)、医疗辅助Agent(读取医学影像协助诊断)等。

解释型 Agent 的驻留管理: 一旦Agent可以长时间自主运行,我们就面临如何有效管理它们的问题。首先是可靠性:持续运行的Agent可能累积误差甚至陷入奇怪状态,需要定期校验和纠正。运维人员也许需要新的工具去监控Agent内部状态,比如查看它的思维链日志,发现异常循环时干预。其次是资源调度:多个常驻Agent运行时,系统需要分配计算资源,决定哪个Agent优先度高、何时挂起或恢复某个Agent等,这类似操作系统调度进程,但难度更高,因为Agent的工作负荷和紧急程度更难预测。再次,安全与伦理:让Agent长期自行行动,必须保证它不会“越轨”。这涉及给Agent设置“红线”规则或者监控策略。例如前述引用中提到,AI Agent 需要监控以确保做出伦理且一致的决策 (link47 )。未来可能诞生“智能体监管”领域,为自主Agent制定约束和审核机制。例如对企业内部的Agent,IT部门会规定哪些系统它可以访问、数据不能外传等等。如果Agent违背了策略,要有自动熔断或报警。类似地,个人用户也需要控制自己的数字分身Agent不要做出自己不希望的行为(比如避免社交媒体代理乱发贴)。这些都需要Runtime和治理层的支持。

编译型 Agent 的演化: 编译型 Agent 是否会被解释型完全取代?未必。相反,二者可能融合出新模式。编译型框架可以逐步引入更多运行时灵活性。例如,在预定义流程的节点上,嵌入小型解释Agent来处理不可预知的部分。这有点像在传统软件中调用AI服务一样,但更紧密集成。RPA领域已经朝这方向走:许多RPA厂商给流程添加了AI决策节点,用大模型来判断分支走向或生成填充内容 (link48)。再进一步,编译型 Agent 甚至可以在运行时调用一个“规划服务”来重新配置自身流程。这实际上是在编译型架构上叠加解释型能力,让Agent可以部分地“改写”自己的脚本。这种自我改写听起来危险,但如果限定在小范围,比如流程的某段由Agent自行优化实现,还是可行的。另外,随着Prompt工程的发展,编译型Agent的规则配置或能用自然语言直接描述,再由大模型编译成可执行流程。这会降低开发难度,让更多人参与Agent构建。总的来说,编译型 Agent 不会消失,而是会变得更加“智能”:开发阶段依然需要人为设计框架,但运行阶段能够比过去更加灵活、健壮。在强调合规性的行业(如金融、医疗),编译型Agent依然是主力,只是在内部慢慢融入AI增强,使它们不那么死板。例如未来的银行RPA可能在遇到异常票据时自动调用AI识别并尝试处理,而不是简单报错终止。从另一个角度看,解释型 Agent 要大规模落地,也需要借鉴编译型思路,把开发者的业务知识固化到一定程度。例如重要的合规检查、审批逻辑还是要固化为规则,不能完全交给AI临场发挥。可以预见未来的Agent平台也会提供图形化工具,让开发者方便地设定约束和模板,然后由Agent在这些护栏内自主行动。这样既保证了输出可靠性,又享受了AI灵活性,实现两种范式的结合。

结语: 无论编译型还是解释型,AI Agent 技术都仍在快速演变中。对于AI研究者和开发者来说,当下正是介入这一领域的大好时机。从商业视角,谁能率先打造出一个既易用又强大的Agent平台和生态,谁就有机会重构下一个时代的软件生态。想象一下,也许不远的将来,每个人都会有一个属于自己的AI数字分身或伴侣,无需编写繁琐脚本,就能替自己完成大量工作;每个企业都有无数AI Agent 在各业务环节自主协作运行。这场“从思考到行动”的智能体革命才刚刚开始,其未来充满机遇与未知。

 

【相关】

肖弘与Manus:AI Agent 的实战方法论

——一位90后创业者的技术洞见与生态博弈

立委老友群最近热议:

鲁总:现在的一些所谓的agent,都还只是停留在实现工作流。还缺乏自主解决问题的能力,离人们希望的 AGI 还有蛮长的距离。

立委:Manus的做法,所谓通用agent,就放出来的demo来看,已经具有(初步)自主解决问题的能力,不仅仅是对不同场景做不同工作流。对挑选出来的 demos,做了场景优化是可能的;但就路线而言,是具有相当自主性的,并对不同场景做了多 agent 通用流程的概括/抽象。llm应用对接方面,头部厂商一直有动作,也有亮眼的进步,但还是觉得步子太小。也许是因为做模型的人与做产品的人,基因不同。同一家要在两头都是带头人,长久看来也许不太现实。从大模型应用角度看manus,有点集大成的意味。

盘点一下manus 之前大模型应用方面的尝试与进步:1 deep research;2 computer use;3 artifacts / canvas;4 Google 的自动生成 podcast (叫什么来着?);5 GPT Builder / GPT Plug-ins;6 coding agent;7 search & rag;8 a long list of multimedia abilities;9. reasoning & cot。这些全部在为 agent 爆发铺路。尤其是最近爆发的cot推理强化模型,简直就是通用agent起飞的火箭。本来以为能集成这些能力做出超级应用agent,至少还需要半年、一年的时间消化落地。现在看来是保守了。

白老师:从语言学观点看,都是在搭建grounding。让自然语言的语义跨越LLM,平滑来到现实世界,保持LLM对内化知识和grounding有一致的理解能力。这个“一致”,RAG其实并没有达到。

我们用RAG+Function call接本地数据,agent在这一层之上。通过浏览器接口直接“看”系统界面,能不能突破垂域的知识壁垒,对我们来说是个极其重大的问题。所以Manus那边发生的事情,我们正在不断评估。这不仅是技术路线之争,也是超级入口之争。

立委:哪怕 Manus 这词的确有一些夸大其词,或营销过了头,它给agent点的这把火,以及用哪怕是特别优化过的demo给我们所展示的图景,也是功勋。

如果它自己不够硬气,后来倒下了,它所推动的这股大模型应用方向的内卷和加速,也会有厉害角色冒出来。至少人家蓝图画得好。综合各路信息,估计 Manus 有基本实力和某种独特优势,但不够成熟,涉嫌“炒作”。这次炒作其实基本是好事。

白老师:相应的思路、资源和问题,一股脑儿暴露在聚光灯下了。省得去找。

立委:唤醒了市场、用户、投资人,以及一长排对手,西方不亮东方亮。它不成熟,别家会加快成熟。 它们的功劳包括做了比可行性研究更深入广泛也更工程的展现。

吕总:我感觉开放域的agents属于大模型能力的自然延伸,能力壁垒也是由通用大模型决定,很难由第三方做出具有长期价值和壁垒的产品。同时也是现在已经有点狼狈的小虎们的救命稻草。我更看重解决封闭领域内深度问题的agent能力,包括其吸收和处理领域外信息的能力。

白老师:通过看UI,LLM就能正确操作一个垂类业务系统吗?这个我还是有不小的疑问。不暴露业务系统的API,用“看UI”的通用能力对接,就要回应这个挑战。

立委:ui 是超越平台的人机界面,rpa 早就这么做了,是rpa当年能站住的法宝。现在的 computer use 借助视觉大模型的屏幕理解优化,与rpa时代的能力,今非昔比了。

白老师:但是系统界面真的看就能理解吗?OA除外,我说的是业务系统,LLM有这个本事?这还涉及到增量,今后的新系统,难道就为了给agent看UI而设计UI?反正人是不看了,看agent就够了。我觉得到时候一定会有一个裸api标准。而且垂域也会细化自己这个部位。就像XML和各个垂域的标记语言一样。

一、Agent的“活化”时刻

在肖弘的观察中,AI Agent的进化本质是从“工具性”到“生命感”的质变。当Manus的测试案例中,Agent自主调用YouTube快捷键、精准定位视频内容时,团队感受到的不仅是技术突破,更是一种“创造生命”的震撼。这种“A-ha moment”背后,是三大技术跃迁:

1. 环境感知的突破:Agent不再局限于对话框,而是通过虚拟机界面实时操作浏览器、运行代码,形成“数字肢体”;

2. 异步规划能力:从“对话式响应”升级为“任务拆解→自主执行→动态反馈”的闭环,例如自动生成房产研究报告时,Agent能拆分出数据爬取、预算计算、可视化呈现等子任务;

3. 长程记忆机制:用户偏好(如要求表格展示简历排名)会被记忆,下次任务直接应用。

肖弘的洞察
> “Agent的核心不是功能堆砌,而是让用户觉得它就像人类实习生,既听话又带点小聪明。”

二、“新安迪比尔定律”:模型外溢与应用创新的共生法则

这一理论,揭示了LLM时代的技术扩散规律:

- 原典溯源:PC时代的“安迪-比尔定律”(Intel芯片升级→微软系统吃掉算力)在AI时代演变为“模型能力提升→应用公司将其转化为用户价值”。 实战案例:

- Cursor的启示:Claude 3.5模型能力到位后,Cursor通过代码编辑器形态释放其潜力,让非工程师用户也能用自然语言操控代码;
- Manus的“壳”哲学:即使基于DeepSeek等现成模型,通过设计虚拟机操作界面、进度可视化等功能,让模型能力“可触摸”。

创业者生存指南:
> “预判下一个模型能力是什么,提前把应用做好,在前方等着模型能力变强——等它ready,你就赢在自己预见的起跑线上了!”

三、博弈思维:在巨头丛林中开辟生态位

肖弘反对“逻辑推理式悲观”(如“大厂有数据优势必碾压创业者”),主张用动态博弈视角重构竞争:

- 变量思维:DeepSeek开源策略打破游戏规则,让应用层创业者获得“新武器”;
- 生态卡位:专注“脏活累活”——如Manus选择巨头不愿涉足的复杂任务执行场景(简历分析、房产研究),通过用户体验差异化建立壁垒;
- 错位竞争:模型公司聚焦底层能力突破,应用公司深耕垂直场景;
如Manus与DeepSeek的关系:“他们造核弹,我们设计精准投放系统”。

金句:
> “不要问‘巨头会不会做这个’,而要想‘我做了这个,巨头会怎么变’——博弈论里没有注定被吃掉的棋子。”

四、“Be Yourself”哲学:技术浪潮中的定力修炼

肖弘十年创业史的核心经验,是如何在风口与噪音中保持清醒:

1. 抵抗“应激创新”:

- 当行业追逐多模态、元宇宙时,Manus坚持打磨Agent的任务执行闭环;
- 肖弘坦言:“如果2023年跟风做AI绘画工具,可能早被卷死了。”

2. 用户价值的“第一性原理”:

- 从早期产品“壹伴助手”到Manus,始终围绕“让用户少点一次鼠标”;
- 测试阶段引入真实用户参与,甚至容忍“生成《地球伪装火星指南》”的沙雕案例,以理解需求边界。

3. 技术实用主义:

- 不迷信“全自研模型”,善用开源生态(如集成Claude、DeepSeek);
- 但关键模块(如虚拟机环境隔离)必须自主掌控。

创业者戒律:
> “AI时代最危险的幻觉是‘技术至上’——模型再强,不能转化为用户可感知的价值,就是空中楼阁。”

五、Agent未来图景:从数字世界到物理世界的“觉醒”

肖弘对Agent演进的预判,展现了技术现实主义者的大胆想象:

短期落地:

- 云端分身:通过虚拟机实现“无侵入式”操作(如自动填写报销单不触碰本地数据);
- 多Agent协作:规划Agent+执行Agent+质检Agent的分工体系。

长期展望:

- 实体化延伸:与机械臂、人形机器人结合,成为“具身智能”中枢;
- 社会性进化:Agent间形成市场(如算力租赁、任务转包),甚至出现“Agent经济”。

黄仁勋式结尾:
> 当被问及“未来什么会让你惊讶”,肖弘引用了老黄的回答:“Basically nothing(没什么可惊讶的)”——在AI革命中,疯狂本就是新常态。

结语:通透者的生存法则

肖弘的独特之处,在于兼具“技术极客的敏锐”与“街头商人的务实”。他既敢说“用最疯狂的幻想理解AI”,又坚持“用户愿意付费的功能才是真需求”。这种平衡术,或许正是AI应用创业者的终极生存技能:在仰望星空时,永远记得系好安全绳

(注:本文综合自肖弘访谈、Manus技术解析及行业观察,部分案例细节引用自产品演示报道)

【相关】

张小珺 独家对话Manus肖弘:世界不是线性外推,做博弈中的重要变量

【外一篇:推理范式演进中的概念】

1.1 训练与推断的二分

机器学习系统存在两个核心阶段:

- 训练时间(Training Time):参数空间 Θ 的优化过程,目标函数为误差最小化 min_θ L(θ; D_train)
- 推断时间(Inference Time):固定参数 θ 时,对新样本 x_test 的条件分布做预测 p(y | x_test; θ)

大语言模型(LLM)领域的最新进展显示,通过引入思维链(CoT)机制,推断阶段的计算量(test time compute)显著增长,也收获了相应的智能提升。

1.2 逻辑学视角

推理类型 | 逻辑结构 | 确定性 | 典型应用
演绎 | 前提→必然结论 | 确定性 | 数学证明
归纳 | 个案→普遍规律 | 概率性 | 趋势发现
溯因 | 现象→最佳原因 | 启发式 | 诊断推测

Inference 对应概率模型中的推断过程,本质是大数据压缩后的经验性模式归纳。在传统的符号逻辑语境下,Reasoning 特指演绎推理的符号化过程,但考察当下的LLM推理模型,reasoning 最多只是对逻辑演绎的符号化模拟,本质上仍然是概率模型的推断(inference)生成。溯因在大模型范式下,其实现启发式推测的手段也还是大数据压缩后的经验性(逆向)模式归纳。总之,生成式AI大模型以不变应万变,各种思考推理过程一律采用下一词元预测(Next Token Prediction)形式的推断实现。生成机理不变,但 input-output 函数范式变了。
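所谓“以不变应万变”,落到解码层面就是一个简单的自回归循环。下面是一个极简示意(next_token_distribution() 是假设的模型接口,返回下一词元的条件概率分布;这里用贪心取最大概率,仅为说明机制):

def autoregressive_generate(prompt_tokens, next_token_distribution, max_new=100, eos_id=0):
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        probs = next_token_distribution(tokens)                  # 条件分布 p(下一词元 | 已有词元)
        next_id = max(range(len(probs)), key=probs.__getitem__)  # 贪心解码:取概率最大的词元(示意)
        if next_id == eos_id:                                    # 遇到终止符则停止
            break
        tokens.append(next_id)
    return tokens
# 无论是思维链(CoT)还是最终答案,都是同一个循环逐词元“推断”出来的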

1.3 认知科学视角的函数范式

Kahneman在《思考,快与慢》中提出的双系统理论为理解大脑行为提供新视角:

- System 1(快思考):直觉式响应 input→output,对应大模型 seq2seq 函数式生成模式

output=F(input)

- System 2(慢思考):反思性推理 input→[Think]→output,体现为CoT (ChainofThought) 驱动的两阶段函数生成模式

CoT=F(input) & output=F(input+CoT)

大模型 F 的底层逻辑不变,但强化学习启动了思维链外化(CoT),搭建了应对复杂问题(input)的信息桥梁CoT,增加了大模型推断output的可解释性,最重要地,大幅提升了复杂问题的解决智能。
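落到代码层面,这两种函数范式的差别大致如下(generate() 是假设的文本生成接口,仅作示意):

def system1(question, generate):
    # 快思考:一步到位 output = F(input)
    return generate(question)

def system2(question, generate):
    # 慢思考:先外化思维链 CoT = F(input),再据此作答 output = F(input + CoT)
    cot = generate(question + "\n请先一步步写出你的推理过程:")
    answer = generate(question + "\n推理过程:" + cot + "\n请给出最终答案:")
    return cot, answer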

1.4 推理模型的计算-智能增长关系

强化学习赋能的推理模型新范式呈现的新 scaling law 是怎样的呢?并无定论,一切刚刚开始,经验法则需要时间测量。

例如,当test time compute增加10倍时:
- 推理准确率提升多少?
- 响应延迟增长多少?


1.5 答问

「蔡江滨l六问:早安,能否理解为cot的生成是推断inference ,根据cot生成的具体内容是推理?@wei li 」

CoT+answer 的区分是人为的格式区分,对于 NTP(下一词元预测)方式的 GPT 模型,它们都是概率性 inference 生成的 response。

如果这个response 是针对理工类推理题目,那么这种 inference 是对于人类 reasoning,尤其是演绎( deduction)的模拟,但不是严格意义上的逻辑演绎。

如果是创作类任务,inference 中的 CoT 就是 planning/pondering 等思考过程, inference 中的answer就是生成结果,都谈不上逻辑推理。

reasoning model 这个命名其实也不准确,还不如叫思考模型、思维链模型、慢思维模型、意识流模型、内心独白模型、小九九模型,总之就是 System 2;与传统的 System 1 模型相比,只是多了个 CoT。

 

【原文】

生成式AI学习中容易混淆的几个术语

test time / reasoning / inference 辨析

推理模型是当前生成式AI的主流热点,带火了一个热词(buzz word)叫 test time compute,咱们就先从它谈起。

从模型开发者视角,test time(测试时间) 针对的是 training time (训练时间)。众所周知,这是机器学习的两个阶段,一个是在线,一个是离线。训练模型好比是软件开发;测试模型就是软件测试。模型测试阶段就是常说的(在线)推理(inference)阶段,所以 test time 又叫 inference time。

问题是,中文翻译“推理”对应了英文两个词:inference 和 reasoning,业内很多人认为后者才是推理的正解,而前者是错译(misnomer),合适的翻译是“推断”。但“推理”的一词二用已经混淆并流行了很久,社区习惯上不一定能改过来。但对于后学,还是要辨析清楚才好。

test time compute 之所以流行,是因为大模型领域近几个月的“范式转变”。社区依据预训练scaling law原来一直聚焦基座大模型的做大做强,但发现大算力带来大智能的效益递减,进入了平台期。于是把重心转移到后训练的强化学习,发展了一种称为 System 2 慢思维模式的推理模型。这种推理模型最大的特点就是运行起来慢了,原来的秒回现在是慢了几拍才回复,消耗了更多在线时间和在线推断的算力(即 test/inference time compute)。

按常理,在线计算量增加,模型运行慢了,听上去是在退步,而不是进步。但好消息是,模型对于复杂问题的处理智能飙升。时间换质量,当然应该看好这个新方向。慢思维推理模型的成就,就好比人类的三思而后行。一般而言,凡事喜欢琢磨的人,比不假思索就行动的人,更有智慧,也更加靠谱。学会了慢思维(术语叫思维链)的大模型亦是如此。

模型训练阶段,无论是预训练还是后训练,都是离线进行的,迭代循环,在改变模型的参数,模型并未成形,属于开发进行时。而 test time 说的是模型已经完成阶段。严格说,test time 是内部试运行阶段,让 QA(quality assurance)部门在上线发布前做测试。但现在谈 test time compute,已经不分内部测试还是用户使用,所以更准确的术语还是 inference time compute。

白老师说:

Inference当初译成推理就错,真的推理Reasoning不来还则罢了,这回真的来了,将错就错都难。

我说:

不知是不是下意识,以前一直有意无意地把 inference 与统计预测关联得更紧,而把 reasoning 与符号演绎(deduction)划了等号。前者是统计性的,后者是确定性的。(概率性)推断就是(统计性)预测,statistical inference。不知道受到了什么的暗中影响。

老友尼克说:

本意也差不多是这个意思。reasoning是偏deduction的。

大模型的术语最早借自于统计,统计中区分建模与inference两个阶段,大模型训练相当于建模。统计中inference翻译为“推断”。简中自媒体乱翻导致业界人也瞎起哄,好像上海电影译制厂把中国两代人的英文发音都带偏了。

但是定理证明community也用inference,好多会都以此冠名,所以很难把inference与reasoning完全隔开。相对讲,reasoning更偏deduction,最好的办法是在可能引起歧义时括号英文。

abduction在逻辑里是美国实用主义的特定叫法,源自Peirce。这个意思是把建模和inference合到一起的意思,就是Solomonoff Induction的不严谨的哲学说法。inference出错时,再去修正模型,the whole nine yard。在Kolmogorov Complexity中,inference被称为logic depth 。。。

其实,inference time 口语中也叫 running time 或 runtime,其他的近义词和关联词还有很多: test-time,real time,decoding time, prediction time, service time, execution time, on-the-fly computing time ...

 

 


【相关】

 

 

 

 

思维链是大模型的符号神助攻

在人工智能发展的历史长河中,符号逻辑和神经网络一直如两条平行线,难以交汇。然而,2025年春节,DeepSeek发布的推理大模型R1揭示了一个令人惊叹的技术突破——思维链(Chain of Thought, CoT)的自主生成,其展现出的语言序列仿佛在模拟人类的符号逻辑。这一强化学习的进展或将重新定义符号逻辑与神经网络的融合方向。

DeepSeek R1 CoT 示例

符号逻辑的回归

DeepSeek R1推理模型的发布引发全球关注,其强化学习框架下自主生成的思维链展现出惊人的问题解决能力与创作能力。其核心机制仍是GPT风格的自回归下一词元预测 (next token prediction, NTP),但模型生成的中间结果——思维链,却呈现出典型的符号逻辑特征,有效模拟了人类的思考过程。

这一现象值得深思:大语言模型的本质从未改变,包括最新的推理模型,其内部运作仍是基于向量计算的条件概率。然而,为了模拟人类的符号推理,这些模型竟然学会了通过自然语言构建推理链条。更令人惊讶的是,这些符号链条确实有效,它们结合内部的词元预测机制,显著提升了模型的推理和创造能力。这是一个非常值得回味反思的技术进展。这件事的奇妙,是ChatGPT核爆以来最大的震撼。

符号与神经的无缝对接

这项奇迹的发生尤其让我们符号AI的老司机感慨。多少年来,我们心心念念地追求两条路线的融合,希望符号逻辑能在关键时刻帮助神经系统和统计模型,但一直看不到入口。忽然间,在大模型预训练进入平台期的关口,神经社区内部打开了一扇通向符号逻辑的大门,找到了一种神经与符号融合的路径。这一突破来得如此突然而有力,不能不让人既惊且叹。可惊的是,符号逻辑终于开始发力了;可叹的是,闯出这条路的还是神经社区的研发者,没有符号研究者什么事儿。

大模型就是个端到端的黑盒子,符号仅在输入输出两个端口出现。具体说,我们只能在输入端将符号“嵌入”(embedding)至高维向量空间,并在输出端通过 softmax 映射回符号,而内部的多层前馈的数据流则完全由向量运算主导,与符号逻辑无关,也很难以人类可以理解的符号来解释(因此才有专门的大模型可解释性研究)。此前所有结合符号与向量的努力都失败了,感觉这两种表示就是不同物种,无法融合。符号逻辑一直坐冷板凳,而符号本身除了作为训练数据燃料,在内部智能运作上一点也帮不了忙。

符号与概率的桥梁:CoT的奥秘

现今的推理大模型将内在“思考过程”外化为自然语言CoT,很像是模型的内心独白或意识流。模型在生成最终答案前,CoT重新嵌入系统中助力最终答案的生成。看似简单的这一过程,却在推理时成功地利用了自然语言这一符号载体,使得符号逻辑与神经网络在端口上实现了全程对接,没有一点违和或不容。

这是怎么发生的呢?一切多亏了自回归的 NTP 范式。下一词元预测的GPT生成方式,使得内部的“思维过程”可以在输出端外化为符号CoT,从而利用符号推理的形态,一步步模拟人类思维过程。

从信息论的角度看,推理大模型在 question-CoT-answer 的生成过程中,实际上构建了一条低熵的信息传输路径。相比传统的 question-answer 直接映射,这种显式的中间步骤大大降低了输入question与输出answer的信息鸿沟,显著提升了正确答案的生成概率。正如我在之前的博客(推理新范式 query+cot+answer 支持新的 scaling law 吗)中所论述的,CoT 充当了信息桥梁的角色。虽然不是严谨的符号逻辑推断过程,但它无需改变概率模型的底层运作机理,实现了对于人类推理的有效模拟。

自然语言vs形式语言:CoT的载体选择

CoT 本质是思维,语言只是载体,原理上没必要用自然语言,或任何特定的语言(例如中文,或英文)。但现在的实现方式,无论DeepSeek的R系列,还是OpenAI的O系列,就是直接用了中文或英文的思维链表示。推理模型在运行时,CoT先在输出端外化为中文或英文的符号序列,然后自回归循环运作,最终生成答案。看上去简直就是在符号与神经之间变戏法。以前何曾想到符号介入神经,还带这么玩的,内部与端口天衣无缝。

我们都要感谢DeepSeek,要不是 DeepSeek 给了这样一个开源的春节大礼包,又完全公开了自己的CoT,我们会一直蒙在鼓里,绞尽脑汁也想不明白 reasoning 这种符号逻辑,怎么就悄悄进入了神经过程。符号与向量的矛盾通过端口外化,化解于无形。

从符号逻辑的角度看,自然语言虽然老妪能解,但其精确性远逊于数学符号、形式逻辑表达式或编程语言代码。对于数理问题或逻辑推演,CoT 用这些形式语言比用自然语言显然更合适。然而,目前思维链生成采用自然语言,而不是形式语言。偶然也夹杂了一些数学符号,但思维链整体还是压倒性的使用英语或中文。R1论文中还提到他们的CoT连计算机代码都排斥在外。感觉上,这其实不是推理的正道,但可以理解其成因。CoT 是模型自主生成的序列,而据以生成的基座大模型本身,其训练数据以自然语言压倒多数。原因之二是开发者刻意追求的可解释性以及对用户的透明性,觉得自然语言大家都容易看明白。R1在设计CoT强化训练的时候,强调了格式的奖励,其中语言一致性当成了一个目标。夹杂不同语言或代码的CoT,剔除出局了。从长远来看,这未必是最明智的决策。

前景与思考

推理模型的 CoT 无疑是神经网络与符号逻辑融合的一次创新,也是近年来人工智能领域的重要里程碑。未来的CoT研究方向,可以考虑越来越多地奖励 CoT 以更加严谨的形式语言表示。至少要文理分科,文科任务(如创作)继续用自然语言做CoT的载体,这类思维过程涉及的主要是篇章规划和表达斟酌,而非严格推理;理工任务(数学、编程、逻辑等)则不同,应该训练模型尽量使用形式语言来表示CoT推理过程。当然,要做到后者,可能还要进一步加强基座模型中代码和其他形式语言的数据配比,否则 LLM 也许“憋”不出来合适的表示。

最后说一下概率模型与符号逻辑的严谨性差异。推理模型的 CoT 是推理模型的现场发挥,很多时候显得不够严谨。但其实人脑也是一样。没人能出口成章地把一个复杂数学证明一气呵成,就如同没人能一口气写出一个没有bugs的程序。严谨性都是在反复校验、修正,甚至是几代科学家前赴后继不断探究中完成的。CoT 不够严谨是拿模型的“草稿”与人类专家的定稿对比产生的印象。可以预见,对于非常复杂的问题,推理模型应该像人类一样反复校正,独立复现结果比对。这可能要指望各种推理 agents 的相互协作,以及它们与人类反馈的多层交互。

【相关】

 

再谈自然模态数据是高维空间的低维流形

这个话题实际上涉及流形假设(Manifold Hypothesis)、对抗样本攻击(Adversarial Attacks)和高维几何的脆弱性。


1. Normal Bundle 在高维空间的“刺猬”形象

normal bundle 是微分几何里的一个重要术语,中文翻译叫“法丛”,对于欠缺数学背景的我们来说,这个术语不好理解,但理解它对于大模型的骇客攻击与AI安全性蛮重要,咱们先看看它是个什么东西。

先破题:Normal 不是‘正常’,而是‘垂直’!

咱中文里‘法’字,来自‘法线’——就是几何里那个与曲面切线呈90度的垂直方向的线。比如你拿根筷子插进苹果,筷子方向就是苹果表面的‘法线方向’。数学家说的‘法丛’(Normal Bundle),其实就是把曲面上每一点的所有垂直方向‘打包’成捆(Bundle),像给苹果表面插满箭头的刺猬!

刺猬的类比是准确的!在 低维流形嵌入高维空间 时,法丛(Normal Bundle)确实可以被想象成“刺猬”——流形的每个点都有一个法向空间,而法丛就是把这些法向空间整体“打包”在一起。见我让Sora生成的法丛示意图:

这是我给Sora发的指令:

制作3D曲面切线与法线动态演示短视频,为科普用,数学家说的‘法丛’(Normal Bundle),其实就是把曲面上每一点的所有垂直方向‘打包’成捆(Bundle),像给苹果表面插满箭头的刺猬!

有几个关键点:

    • 在 3D 空间里,我们习惯把曲面(2D 流形)的法丛(Normal Bundle)想象成一个 1D 线性空间(即法线),所以它看起来像是一个“箭头丛”。
    • 在更高维空间中,比如一个 d 维流形嵌入到 n 维空间(d < n),它的法丛一般是 (n-d) 维的,这意味着法丛方向本身就有多个自由度,而不只是单纯的“刺”。
    • 用 3D 可视化高维流形的法丛,最常见的做法确实是让法丛的“刺”呈现出扇形的扩展感,因为高维空间的法向方向比低维空间的直观“法线”更加复杂。

所以 Sora 生成的图如果能表现出“立体感更强的刺猬形态”,或者在某些点上看到“扇面展开的刺”,就能更好地表达高维法丛的概念。

想象你是只二维蚂蚁,活在皱巴巴的树叶上。对你来说,前后左右是‘切空间’——现实世界;而上下方向就是‘法丛’——完全不可见的维度。所以‘法丛’的normal,本质是几何垂直性,不是日常说的‘正常不正常’。

数学家眼里的‘法丛’一点也不玄乎,就是垂直方向的集合。既然切空间是‘tangent’(相切的),那垂直方向就叫‘normal’(正交的),跟‘正常’没半毛钱关系!要怪就怪英文一词多义。


2. 高维空间 vs. 自然数据流形

“自然数据(文本、音频、图像、视频等)通常分布在一个远低于高维空间的流形上”,这是 流形假设(Manifold Hypothesis) 的核心观点:

    • 我们的高维数据(比如图片,通常是 256×256×3 维)并没有填满整个高维空间,而是集中在 远低于该维数的流形上
    • 这类似于:宇宙是三维的,但星星和星系主要分布在局部区域,而不是均匀填充整个宇宙。
    • 这个流形的维度一般和“人类可感知的主要变化”相关,比如在 人脸识别模型 里,所有可能的“人脸”数据分布在一个 远低于像素维度的低维流形 上,而这个流形包涵了现实中所有合理的人脸样本。这实际上是人脸与表情解耦可以work的原因。

3. 对抗样本:在法丛方向制造“幽灵数据”

既然 自然数据的流形是稀疏的,那这意味着:

    • 绝大多数可能的数据点 根本不属于自然数据的流形
    • 但在数学上,它们仍然是“合法”的高维向量,可以被构造出来,并且会落在 法丛方向 之中。

这是对抗攻击(Adversarial Attacks) 可以成立的根本原因:

    1. 模型只在自然流形上训练,它对“流形外的点”几乎没有概念。
    2. 对抗攻击就是在法丛方向上微调输入数据,让它离开流形,但仍然能欺骗模型的决策边界。
    3. 由于高维空间的稀疏性,很多非自然数据的点可能距离决策边界非常近,但在我们人类看来它们几乎与原始自然数据无差别。

一个经典的例子是:给神经网络输入一张猫的图片,稍微在 法丛方向 加一点扰动(人眼几乎察觉不到)。这张图片仍然落在数据空间里,但 已经偏离自然流形,导致模型认为它是一只狗。

这种“幽灵数据”之所以有效,正是因为 法丛方向是流形的正交补空间(Orthogonal Complement),而流形本身对它并不敏感——就像蚂蚁看不到第三维的变化。


4. 为什么法丛方向上的攻击有效?

既然自然数据流形本身很低维,而法丛方向是它的补空间,远高于流形维度,那么:

    1. 法丛方向上的变动可以很大,但仍然不影响数据的可感知特性(因为它们对人类而言是“无意义的方向”)。
    2. 但是对于机器学习模型而言,这些方向仍然是合法的数据输入,它并没有经过特殊训练去防御这种扰动。
    3. 于是,在某些关键点上,对抗样本只需要在法丛方向上加一点点扰动,就能让模型的决策边界翻车

这也解释了:

    • 为什么对抗样本攻击几乎无处不在?因为高维空间里“非自然数据”占比极高,总会有某些点可以被找到来攻击模型。
    • 为什么对抗样本对人类视觉几乎无影响?因为它利用的不是“自然数据的流形变化”,而是“法丛方向的变化”,这个方向人类或自然模型根本不敏感。
    • 为什么深度学习模型的泛化能力有时不稳定?因为它们过度依赖训练数据流形,而法丛方向上的任何变化都会导致预测结果的剧烈偏移。

5. 这是否意味着“无限对抗攻击”可能存在?

在理论上,确实可以构造出 无穷多的“几何意义上非自然的数据” 来攻击模型。

    • 这些数据点不属于原始数据流形,但它们可以被设计得 足够接近决策边界,从而诱导模型出错。
    • 只要高维空间的维度够高,法丛方向的自由度就足够多,攻击者总能找到某个点来实现攻击。

这也是为什么:

    • 对抗样本防御是个无止境的军备竞赛,模型越强,对抗攻击手段就要越高级。
    • 对抗训练(Adversarial Training) 是当前最有效的解决方案之一,本质上就是“让模型在法丛方向上也学会一些关键的变化”,从而增加对抗鲁棒性。
    • 如果不考虑计算成本,我们可以尝试在更大的法丛方向空间中做鲁棒性优化,但这通常代价太高,所以现实中只能有限度地解决。万全之策是少有的,迅速堵漏或patches 对应用系统变得尤为重要。

6. AI炼丹里的‘法丛陷阱’

搞大模型训练时,参数空间就像高低起伏的山脉。梯度下降是沿着‘切空间’(Tangent Space)找路,但万一算法抽风往‘法丛’方向抖动,就可能造成极度的训练不稳定。所以高阶炼丹师都要用正则化当护栏,专门防着往法丛深渊掉。

如果没有黑客,模型使用者用到的数据输入都是自然数据,与模型训练集是同质的,模型本身的泛化能力足以轻松应对。这是因为自然数据,包括用户的自然语言指令和样例都处于模型数据的切空间里,好比在平原地带,东西南北方向坡度平缓。

但不怕贼,就怕贼惦记。黑客可以刻意制造法丛区域的数据,攻击模型。法丛处于训练数据没有覆盖的未知区域,以此为测试数据,模型性能就急剧下降,可能表现得一塌糊涂。这就像开车——切空间是方向盘能控制的方向,法丛是侧翻悬崖的方向。

这是个头痛问题,单纯依靠数据中的统计模式,难以实现真正的分布外泛化和可靠的安全性保障。常规数据增强很难解决这个法丛数据攻击的问题,因为常规自然数据如同在已知区域做探索,不会主动探索法丛方向,而且这样的数据在通常的场景极为罕见:正常用户也不会故意跳崖。开发者要探索所有法丛危险方向需要无限资源,防不胜防,通常只能在骇客发动数据进攻之后亡羊补牢,给模型做针对性补漏。

总结:

    • 用“刺猬模型”来展示法丛(Normal Bundle)是合理的,只要强调“不同点的法向方向形成的丛”,就能很好地表达它的几何概念。
    • 自然数据流形 vs. 法丛方向的思考是对的,它基于自然数据的流形假设。
    • 高维空间的稀疏性 + 法丛方向上的非自然数据确实是对抗攻击成立的根本原因,所以确实可以构造无数个“几何上合法但不自然”的数据点来让模型翻车。
    • 如果要防御,关键是让模型学会法丛方向上的“稳定性”,这就是对抗训练、正则化、鲁棒性优化、构造对抗数据 patching 等技术的意义。

关于高维空间和对抗攻击,可以进一步研读对抗学习(Adversarial Learning)的相关论文,尤其是 Ian Goodfellow 提出的 FGSM(Fast Gradient Sign Method)这一经典攻击方法及其后续的种种高级变体。
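下面给出 FGSM 核心思想的一个极简 PyTorch 草图:沿输入梯度的符号方向加一点扰动。这只是示意,其中 model、loss_fn 等都是假设的占位对象,批处理、归一化等细节一概省略:

```python
import torch

def fgsm_attack(model, loss_fn, x, y, epsilon):
    # 核心一步:delta = epsilon * sign(∂loss/∂x),即沿对模型“最敏感”的方向扰动输入
    x_adv = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x_adv), y)
    loss.backward()
    perturbed = x_adv + epsilon * x_adv.grad.sign()
    return perturbed.clamp(0, 1).detach()   # 假设输入像素已归一化到 [0, 1]
```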

 

【相关】

大模型三阶段的 scaling laws 接力赛

黄教主背后是三个S曲线,illustrate 的是大模型三个阶段的“做大做强”的 scaling 接力赛,颇显挖山不止的愚公移山精神。

我们知道,大模型有三个阶段:预训练、后训练、在线推理。过去几个月最大的改变是,业内基本有了共识,就是伊利亚宣称的说法,预训练时代结束了,著名的预训练的 scaling law 的经验法则貌似走到了头,进入平台期。于是有了推理模型(OpenAI O 系列,以及deepseek R 系列等)的兴起,强调的是后训练的思维链强化学习的投入以及对于在线推理时间(所谓 test time compute)的利用。推理模型也确实在数学、代码以及文字创造性方面展现了前所未有的亮眼成绩。

推理模型的后训练 scaling 刚开始,能走多远,还不清晰。但这个轨迹可以从 o1 变成 o3,以及 r1 到据说今后一两个月随时可能发布的 r2 等能力增强上,慢慢看到。最后这个推理阶段的 test time scaling 呢?

前阵子我跟国内最早鼓吹 scaling 三条S曲线的老友 Junlin 说,根本就没看到什么 test time scaling 呀,因为用户不能掌控模型的 test time compute,它花多少时间/算力,啥时候完成我给它的指令任务,是大模型自己决定的,test time 并不是可以“伸缩”(scaling)的呢。Junlin说,目前确实如此。

这几天玩大模型 deep research,逐渐体会到一点 test time scaling 的伸缩可能性,答案的端倪出现了。说到底就是:对一个 query 或课题,是否存在这样一条曲线——给它更多的思考和响应时间,它就完成得更好。具体说,在 o3-mini 下,有一个 button 叫 deep research,用户可以选择用或者不用。如果不用,你的问题也会有思维链,因为你一开始就选择了已经后训练强化的 o3 推理模型,推理模型的思维过程通常是一两分钟吧。但是如果你同时又按下 deep research 按钮,它最终完成任务的推理时间会增加 n 倍的等待时间,可以长达十来分钟。这至少让我们看到了:模型不变,但推理现场的缩放时间不同,其结果是不同的。这应该算是 test time scaling 的前序或影子了。

那它是怎么做到的呢?让用户可以根据自己的课题的难易或挑战性,以及自己可以容忍的等待时间,来对同一个课题做不同的 test time compute 的投入,以便生成不同的结果呢?原来,它用的是某种 agent 的思路。deep research 这个按钮所提供的功能,实际上就是一个研究性推理的 agent。agent 是 LLM-native 的附加 feature,但不需要改变模型,它所改变的是在推理阶段与用户的交互方式。目前这种交互非常简单,只有一轮,但所开辟的这个 test time scaling 方向,预计可以继续探索与用户更长更多的交互,来帮助最大化 test time compute 的效果。

如果 test time compute scaling 并不是很快就遇到瓶颈,可以想象将来的 deep research 可以与用户互动很长的时间去完成一个非常复杂的项目,也许不是分钟量级的推理时间投入了,完全可以想象大模型投入 hours 甚至 days 来完成一个高难课题,譬如一个人类研究员需要几个月甚至几年才能完成的项目,或者完成一个人类无法完成的研究项目。当前的 deep research 非常简单,它在得到用户的 prompt/query 后,立即分解问题,并反问用户五六个简单的问题,来确认这个研究所需要的资料来源、广度、深度以及注意事项。得到用户的反馈以后,模型开始接受用户更新的资料(if any),并用 search 来收集更多的相关资料,然后按照任务的分解以及与用户确认过的计划和信息,对于每一个来源做分析,最后综合贯通,形成研究报告。这样一来,它所需要耗费的推理时间自然就加长了,因为任务不再是单一的任务,资料也不仅仅是自己模型内部已经消化了的知识,而是现场搜索的更多来源,处理这一切都需要时间。

对于推理模型后训练阶段的强化学习,以及它在推理阶段的 test time compute 投入,这两个阶段的 scaling 征程刚刚开始。希望这两条 S 曲线还可以稳定上升一段时间,让 scaling 的接力赛助力我们在通用智能(AGI)和超级智能(ASI)的道路上不断前行。

 

 

【相关】

张俊林:从Deepseek R1看Scaling Law

Fundamental Limitations of Deep Learning: Origins in Data-Driven Methodology

In today's rapidly evolving artificial intelligence landscape, deep learning models have become the core engine driving technological innovation. However, behind these impressive achievements, deep learning faces several fundamental limitations that are not merely technical implementation challenges but are determined by the essence of its methodological approach.

This article explores the core limitations of deep learning beyond the well-known "black box" problem and analyzes their mathematical origins.

I. "Original Sin" of Data-Driven Approaches: Philosophical Dilemma of Statistical Learning

Confusion Between Correlation and Causation

Deep learning models are essentially high-dimensional probability density estimators. Their core objective is to find the best approximation of the conditional probability distribution P(y|x;θ) through parameters θ. Models learn from large volumes of training data, attempting to capture statistical patterns of various sizes and generalize them to unknown data. However, this statistical learning paradigm leads to a fundamental problem: models learn P(y|x) rather than P(x→y) – statistical correlation rather than causal relationships.

Consider a simple example: an AI medical system might discover that a certain type of rash is highly correlated with malaria diagnoses (possibly because the data was collected in regions with mosquito proliferation issues). The model establishes a pseudo-causal relationship of "rash → malaria" while overlooking temperature, humidity, and other actual disease factors.

This "causal misplacement" leads to poor model performance in new environments:

    • In regions without mosquito problems, the model may over-diagnose malaria
    • When facing new infectious diseases, the model may misclassify them as known categories
    • When the diagnostic environment changes, model performance significantly deteriorates

Information Bottleneck and Compression Distortion

According to information bottleneck theory, neural networks need to perform a special kind of information compression during training: discarding information in input X that is irrelevant to prediction Y while preserving all relevant information. Mathematically, this is expressed as maximizing:

I(Z;Y) - β·I(X;Z)

Where Z is the intermediate representation, I denotes mutual information, and β is a trade-off coefficient.

When training data is insufficient or biased, this compression process encounters serious problems:

    1. Discarding genuine causal signals (e.g., cell morphological features in medical images)
    2. Preserving pseudo-correlative signals (e.g., using hospital wall colors as diagnostic criteria)

The essence of this "lossy compression" causes models to grasp incorrect features in new environments. In fact, when H(Y|X) is incorrectly minimized, the model establishes data defect-driven false information channels rather than accurate mappings of reality.

II. The Low-Dimensional Manifold Hypothesis for Natural Data

A classic assumption in machine learning theory is that natural data resides on low-dimensional manifolds within high-dimensional spaces. Taking a simple example of a two-dimensional manifold in three-dimensional space, it resembles folded paper – although data points exist in 3D space, the intrinsic structure is 2D. Manifolds possess local Euclidean properties of smoothness and continuity: the neighborhood of any point on the manifold can be mapped to a low-dimensional Euclidean space. For instance, when a face rotates, the image slides continuously on the manifold without sudden changes.

Basic Concepts

    • High-dimensional space: Refers to mathematical spaces with dimensions far greater than 3. For example, a 100x100 pixel image exists in a 10,000-dimensional space (each pixel being one dimension).
    • Low-dimensional manifold: A continuous, smooth low-dimensional structure embedded in high-dimensional space. For example, a two-dimensional surface in three-dimensional space, or a structure of dozens of dimensions in a space of millions of dimensions. For instance, all ways of writing the digit "2" in a 784-dimensional pixel space form an approximately 10-dimensional manifold (controlling stroke thickness, tilt, curvature, etc.).

Why Natural Data Forms Low-Dimensional Manifolds

Natural data (such as video, images, audio, text) may seem to contain high-dimensional features but is constrained by physical laws and semantic restrictions, exhibiting low-dimensional characteristics:

Physical constraints:

    • A facial photograph's parameters are limited by bone structure, lighting angle, facial muscle movements, etc., with actual degrees of freedom potentially fewer than 50 dimensions.

Semantic constraints:

    • In textual data, grammatical rules and semantic coherence restrict seemingly unlimited vocabulary combinations to a finite meaning space. Other modal data is similar, also constrained by relationships between atomic units (tokens).

Dimensionality Reduction and Feature Extraction

The information "compression" in models is actually a process of dimensionality reduction and feature extraction. For example, the hierarchical structure of Convolutional Neural Networks (CNNs) progressively strips away redundant dimensions, approaching the essence of data manifolds. Shallow layers extract edges (local linear structures), while deeper layers combine them into object parts (global manifold structures). Manifold Learning explicitly recovers the intrinsic low-dimensional structure of data.

Taking 3D object recognition as an example: The translation and rotation of objects in 3D space form a 6-dimensional manifold (3 translations + 3 rotations). When related video image data is embedded in high-dimensional pixel space, ideal 3D object recognition involves dimensionality reduction to its 6-dimensional manifold, identifying objects through feature extraction.

III. Adversarial Examples: Fragility at Distribution Boundaries

Adversarial examples refer to samples that, through minor perturbations to original input samples, can cause deep learning models to produce incorrect outputs. For instance, a slightly modified panda image might be identified as a turtle with 99% confidence by AI.

The existence of adversarial examples reveals structural defects in data manifolds, challenging the traditional manifold hypothesis discussed above and exposing its fragility:

    • Natural data manifolds contain numerous "holes": areas not covered by training data
    • Manifold boundaries have high-curvature regions: minor perturbations can cross category boundaries

Theoretically, the possible combinations of high-resolution color images far exceed astronomical numbers. While natural images are constrained by physical laws, substantially reducing the effective space, it remains far from being adequately covered by existing datasets. This data sparsity is one fundamental reason why deep learning models are susceptible to adversarial attacks and struggle to generalize to extreme scenarios.

This sparse coverage allows attackers to find vulnerable points near decision boundaries. For example, adding carefully designed noise to a panda image that is almost imperceptible to the human eye can cause an image recognition model to misclassify it as a turtle.

Adversarial examples are not random but systematically exploit the geometric structure of model decision boundaries. A panda image being identified as a turtle is not due to random noise but because the noise is precisely added in the direction of the shortest path to the decision boundary.

Failure of Lipschitz Continuity

Lipschitz continuity is a metric for measuring how sensitive a function is to input changes. The higher the Lipschitz constant (L-value) of a deep network, the more sensitive the model is to input perturbations. The L-value of actual deep networks in adversarial directions can reach 10^3 magnitude, meaning that even minor perturbations may cause dramatic changes in model output. For example, in an autonomous driving system, if the image recognition model is overly sensitive to input perturbations, it might misidentify a large truck crossing an intersection as sky, leading to incorrect driving decisions.

Ideal classification models should satisfy the Lipschitz continuity condition, meaning that minimal input changes should only lead to limited output changes:

‖f(x+δ)-f(x)‖ ≤ L‖δ‖

Mathematical expression meaning:

‖model(input+small change) - model(input)‖ ≤ L × ‖small change‖

L is the "sensitivity coefficient," smaller L is better.

The failure of Lipschitz continuity causes input space to exhibit strong anisotropy (i.e., sensitivity in different directions varies dramatically). Imagine standing in complex terrain:

    • Natural perturbation directions (L≈1): Like walking on a gentle slope, moving 1 meter changes elevation by 1 meter, movement is safe and controllable
    • Adversarial perturbation directions (L≫1, e.g., L=10³): Like standing at a cliff edge, moving 1 centimeter might result in a 100-meter fall

This geometric structure makes it difficult for data augmentation based on uniform sampling to cover high-risk areas, as these areas have extremely low probability in natural data distribution but are "close" in Euclidean distance. For example:

    • Natural direction: Not sensitive to perturbations like lighting changes, blur, etc. (L≈1) ➔ Model can correctly handle everyday image variations
    • Adversarial direction: Specific minor perturbations cause dramatic changes (L=10³) ➔ Like applying "magic noise" to images, causing model misclassification

The Danger:

    • Exploited by attackers: Finding high-L directions to create adversarial examples is like knowing cliff locations and specifically targeting vulnerable points
    • Difficult to defend: Regular training covering all directions is prohibitively expensive, like requiring hikers to adapt to all terrain types, which is unrealistic

IV. The Mathematical Chasm Between Interpolation and Extrapolation

Interpolation Success vs. Extrapolation Failure

Deep learning models perform excellently on interpolation tasks but often fail in extrapolation tasks. This is not coincidental but determined by the essence of statistical learning:

    • Interpolation: Predicting points within the support set of the training data distribution, equivalent to filling gaps in known regions
    • Extrapolation: Predicting points outside the support set of the training data distribution, equivalent to exploring unknown regions

The success of modern deep learning largely depends on the assumption that "training distribution ≈ testing distribution." When this assumption is broken, extrapolation problems become severe, and model performance deteriorates dramatically.

Differential Geometric Explanation of Extrapolation Failure

Imagine you are a geographer drawing terrain maps:

    • Input space (M): The entire Earth's surface, containing plains, mountains, canyons, and various terrains
    • Data distribution (P_data): Regions explored by humans (cities, roads, farmland, etc.)
    • Classification model (f): Maps drawn based on terrain features (marking where forests, deserts are)
    • Decision boundaries: Transition zones between different terrains (e.g., transition between forest and grassland)

Tangent Space: Trends in terrain changes in explored areas. For example, in plains, slopes are gentle in east, west, north, and south directions (corresponding to natural perturbation directions).

The normal bundle refers to directions perpendicular to explored areas, such as suddenly appearing cliffs.

Key problem: Maps are accurate in explored regions but fail in unknown cliff areas.

When moving in unknown cliff areas, test data falls in the normal bundle of the training data distribution, and model generalization performance drops dramatically. This situation can be characterized by the following inequality:

∇ₓ log P_data(x)·δ > κ

Mathematical expression meaning:

(Terrain steepness) × (Movement direction) > (Map boundary blurriness)

Where κ is the curvature radius of the decision boundary, comparable to the width of the transition zone between forest and grassland on the map; δ is the perturbation vector, i.e., the direction of movement.

∇ₓ log P_data(x): "Terrain steepness" of data distribution

    • In frequently visited human areas (like cities), terrain is gentle (small gradient)
    • In uninhabited areas (like deep sea), terrain is steep (large gradient)

It is impossible to effectively cover unknown cliff areas through conventional training data augmentation because they have extremely low probability in the training distribution. Conventional data augmentation is like exploring in known areas and will not actively explore cliff directions because (1) the probability is extremely low: normal people don't deliberately jump off cliffs; (2) the cost is prohibitively high: exploring all dangerous directions requires infinite resources.

Consequence: When map users accidentally walk to the edge of a cliff, the navigation provided by the map fails; the map cannot predict whether you will fall off the cliff or discover a new continent.

Beyond Data-Driven Artificial Intelligence

The limitations of deep learning are not temporary technical difficulties but methodological boundaries determined by its "data-driven" nature. Relying solely on statistical patterns in data makes it difficult to achieve true causal understanding, out-of-distribution generalization, and reliable security guarantees.

One future direction may be to combine data-driven learning with structured prior logical symbolic systems, creating hybrid systems that can both leverage massive data and possess causal reasoning capabilities.

 

 

【Related】

深度学习的局限性研究综述

在人工智能快速发展的今天,深度学习大模型已成为推动技术革新的核心引擎。然而,在这些耀眼成就的背后,深度学习也面临着一系列根本性的局限,这些局限不仅是技术实现上的挑战,更是其方法论本质所决定的。

本文探讨深度学习除了黑箱以外的核心局限性,并分析它们的数学根源。

请参照机器自动生成的研究报告 o3 deep research: 深度学习局限性研究报告。本文不是在该研究报告基础上修正的版本,而是笔者在综述调研的同时撰写的平行版本。因此不敢说一定更好,虽然我花了不少时间核对、查验、精简、修正,道理上应该靠谱一些,至少可读性更好(?)。

一、数据驱动的"原罪":统计学习的哲学困境

相关性与因果关系的混淆

深度学习模型本质上是高维概率密度估计器,它的核心目标是通过参数θ找到条件概率分布P(y|x;θ)的最佳近似。模型通过学习大量的训练数据,试图捕捉数据中的大大小小的统计规律,并将其泛化到未知数据。然而,这种统计学习范式决定了一个根本性问题:模型学习的是P(y|x)而非P(x→y),即统计相关性而非因果关系。

举个简单例子,假设一个医疗AI系统在训练数据中发现,某种皮疹与疟疾诊断高度相关(可能因为采集数据的地区同时存在蚊虫滋生问题)。模型会建立"皮疹→疟疾"的伪因果关系,而忽略温度、湿度等真实致病因素。再如,模型可能学习到“公鸡打鸣”与“太阳升起”之间的强相关性,但这并不意味着公鸡打鸣导致了太阳升起。

这种"因果错置"导致模型在新环境中表现不佳:

    • 在没有蚊虫问题的地区,模型可能过度诊断疟疾
    • 面对新型传染病,模型可能将其误判为已知类别
    • 当诊断环境变化时,模型性能会显著下降

信息瓶颈与压缩失真

根据信息瓶颈理论,神经网络在训练过程中需要执行一种特殊的信息压缩:丢弃输入X中与预测Y无关的信息,同时保留所有相关信息。数学上表达为最大化:

I(Z;Y) - β·I(X;Z)

其中Z是中间表示,I表示互信息,β是权衡系数。

当训练数据不足或有偏时,这种压缩过程会出现严重问题:

    1. 丢弃真实因果信号(如医学图像中丢弃细胞形态学特征)
    2. 保留伪相关信号(如将医院墙面色调作为诊断依据)

这种"有损压缩"的本质导致模型在新环境中往往抓住了错误特征。实际上,当H(Y|X)被错误最小化时,模型建立的是数据缺陷驱动的虚假信息通道,而非对现实的准确映射。

二、自然数据的低维流形假说

机器学习理论中有一个经典假设:自然数据位于高维空间的低维流形上。以简单的三维空间中的二维流形为例,它就形似卷起的纸张,虽然数据点存在于3D空间,但内在结构是2D的。流形具有光滑性与连续性的局部欧几里得性:流形任意点的邻域可映射到低维欧几里得空间。例如,人脸旋转角度变化时,图像在流形上连续滑动,不会突变。

基本概念

    • 高维空间:指维度远大于3的数学空间,例如一张100x100像素的图像存在于10,000维空间中(每个像素为一个维度)。
    • 低维流形:嵌入在高维空间中的连续、光滑的低维结构。例如,三维空间中的二维曲面,或万维空间中的几十维结构。例如,784维像素空间中,数字"2"的所有写法构成约10维流形(控制笔画粗细、倾斜、弧度等)。

自然数据为何形成低维流形

自然数据(如视频、图像、语音、文本)看似蕴含着高维特征,实则受物理规律和语义约束,呈现低维特性:

    • 物理约束
      一张人脸照片的参数受限于骨骼结构、光照角度、表情肌肉运动等,实际自由度可能不足50维。
    • 语义约束
      文本数据中,语法规则和语义相谐性将看似无限的词汇组合限制在有限的意义空间内。其他模态数据也是如此,同样存在原子性单元(tokens)之间相互关系的约束。

降维与特征提取

模型的信息“压缩”实际上是降维与特征提取的过程。例如,卷积神经网络(CNN)的层级结构逐步剥离冗余维度,逼近数据流形本质。浅层提取边缘(局部线性结构),深层组合成物体部件(全局流形结构)。流形学习(Manifold Learning)意味着显式恢复数据的内在低维结构。

以3D物体识别为例
物体在3D空间中的平移、旋转形成6维流形(3个平移+3个旋转)。当相关视频图像数据嵌入到高维像素空间中以后,理想的3D物体识别就是降维到其6维流形,通过特征抽取识别物体。

三、对抗样本:分布边界的脆弱性

对抗样本是指通过对原始输入样本进行微小的扰动,就能使深度学习模型产生错误输出的样本。例如,一张微调过的熊猫图片会被AI以99%的置信度识别为乌龟。

对抗样本的存在揭示了数据流形的结构性缺陷,挑战了上文的传统流形假说,揭示了这一假说的脆弱性:

    1. 自然数据流形存在大量"孔洞":未被训练数据覆盖的区域
    2. 流形边界存在高曲率区域:微小扰动即可跨越类别边界

ImageNet包含128万张标注图像,但这些数据仅覆盖了自然图像空间的极小部分。理论上,高分辨率彩色图像的可能性组合数远超天文数字,而虽然自然图像受物理规律约束,实际有效空间大幅减少,但仍远未被现有数据集充分覆盖。这种数据稀疏性是深度学习模型易受对抗样本攻击、难以泛化到极端场景的根本原因之一。

这种稀疏覆盖使得攻击者可以在决策边界附近找到脆弱点。例如,在一张熊猫图片中加入一些精心设计的噪声,人眼几乎无法察觉,但却可以使图像识别模型将其误判为乌龟。

对抗样本不是随机的,而是系统性地利用了模型决策边界的几何结构。一张熊猫图片被识别为乌龟,不是因为随机噪声,而是因为噪声被精确添加到决策边界的最短路径方向。

李普希茨连续性的失效

李普希茨连续性是衡量函数对输入变化敏感程度的一个指标。深度网络的李普希茨常数(L值)越大,表示模型对输入的扰动越敏感。实际深度网络的 L 值在对抗方向上可达 10³ 量级,这意味着即使微小的扰动也可能导致模型输出发生剧烈变化。例如,自动驾驶系统中的图像识别模型如果对输入的扰动过于敏感,就可能将前方横穿路口的大卡车误识别为天空,从而导致错误的驾驶决策。

理想的分类模型应满足李普希茨连续条件,即输入的微小变化只能导致输出的有限变化:

‖f(x+δ)-f(x)‖ ≤ L‖δ‖

数学表达含义:‖模型(输入+小改动) - 模型(输入)‖ ≤ L × ‖小改动‖
L是"敏感系数",L越小越好。

李普希茨连续性失效导致输入空间呈现强烈的各向异性(即,不同方向上的敏感度天差地别)。想象你站在一片复杂地形中:

    • 自然扰动方向(L≈1):如同缓坡行走,移动1米海拔变化1米,行动安全可控
    • 对抗扰动方向(L≫1, 例如 L=10³):如同悬崖边缘,移动1厘米可能坠落100米

这种几何结构使得基于均匀采样的数据增强很难覆盖高风险区域,因为这些区域在自然数据分布中概率极小,但在欧氏距离上却很"近"。例如,

    • 自然方向:对光照变化、模糊等扰动不敏感(L≈1)
      ➔ 模型能正确处理日常图片变化
    • 对抗方向:存在特定微小扰动导致剧烈变化(L=10³)
      ➔ 如同在图片上施加"魔法噪声",使模型错判

各向异性的危险

    • 攻击者利用:找到高L方向制作对抗样本
      就像知道悬崖位置后,专攻脆弱点
    • 防御困难:常规训练覆盖所有方向成本过高
      如同要求登山者适应所有地形,不现实

四、内插与外推的数学鸿沟

内插成功vs外推失败

深度学习模型在内插(interpolation)任务上表现出色,但在外推(extrapolation)任务上往往失效。这不是偶然现象,而是其统计学习本质决定的:

    • 内插:预测落在训练数据分布支撑集内的点,相当于填补已知区域的空白
    • 外推:预测落在训练数据分布支撑集外的点,相当于探索未知区域

现代深度学习的成功很大程度上依赖于"训练分布≈测试分布"这一假设。当这一假设被打破时,外推问题严重,模型的表现会急剧恶化。

外推失效的微分几何解释

从微分几何视角看,可以将输入空间视为黎曼流形(M,g),分类模型f:M→ℝ^C将数据点映射到类别空间。

想象你是一位绘制地形图的地理学家:

输入空间(M):整个地球表面,包含平原、高山、峡谷等各种地形

数据分布(P_data):人类已探索区域(城市、公路、农田等)

分类模型(f):根据地形特征绘制的地图(标注哪里是森林、哪里是沙漠)

决策边界:不同地形的分界线(如森林与草原的过渡带)

切空间(Tangent Space):已探索区域的地形变化趋势,例如在平原地带,东西南北方向坡度平缓(对应自然扰动方向)。所谓法丛区域(normal bundle)指的是垂直于已探索区域的方向,例如突然出现的悬崖。

关键问题:地图在已探索区域表现精准,但对未知悬崖区域失效。

在未知悬崖区域移动时,测试数据落在训练数据分布的法丛区域,模型泛化性能急剧下降。这种情况可以用以下不等式表征:

 ∇ₓ log P_data(x)·δ > κ

数学表达含义:(地形陡峭度) × (移动方向) > (地图边界模糊度)

其中κ为决策边界的曲率半径,好比地图标注的森林与草原的过渡带宽;δ为扰动向量,即移动的方向。

∇ₓ log P_data(x):数据分布的"地形陡峭度"

      • 在人类常去区域(如城市),地形平缓(梯度小)
      • 在无人区(如深海),地形陡峭(梯度大)

无法通过常规训练数据增强来有效覆盖未知悬崖区域,因为它们在训练分布中概率极低。常规数据增强如同在已知区域做探索,不会主动探索悬崖方向,因为(1)概率极低:正常人不会故意跳崖;(2)成本过高:探索所有危险方向需要无限资源。

后果:当地图使用者意外走到悬崖边时,地图提供的导航失效,地图无法预测你会掉下悬崖还是发现新大陆。

结论:超越数据驱动的人工智能

深度学习的局限不是技术实现上的暂时困难,而是其"数据驱动"本质所决定的方法论边界。单纯依靠数据中的统计模式,难以实现真正的因果理解、分布外泛化和可靠的安全性保障。

一个未来方向可能是将数据驱动学习与结构化先验逻辑符号系统相结合,打造既能利用海量数据又具备因果推理能力的混合系统。

 

【相关】

 

深度学习的基石:多层感知机

一、单层感知机的致命缺陷

要理解多层感知机的突破性,我们先看一个经典案例:假设我们需要实现“异或”逻辑(XOR),定义如下:

| 输入A | 输入B | 输出 |
|:---:|:---:|:---:|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

当我们在坐标系中绘制这些点时,会发现正例(0,1)和(1,0)位于对角线两侧,形成一个典型的非线性可分问题。单层感知机的决策边界只能是直线,就像试图用一把直尺分开两堆交叉摆放的硬币——这是数学上证明不可能完成的任务(闵斯基1969年证明)。

二、解法:特征空间升维

关键思路:通过增加隐藏层,将原始二维输入投影到三维特征空间。

我们设计两个特殊神经元:

神经元H₁:激活条件 = 1*A + (-1)*B > 0.5
神经元H₂:激活条件 = (-1)*A + 1*B > 0.5

输入变换过程

当输入(1,0)时:

H₁:1*1 + (-1)*0 = 1 → 激活
H₂:(-1)*1 + 1*0 = -1 → 不激活

当输入(0,1)时:

H₁:1*0 + (-1)*1 = -1 → 不激活
H₂:(-1)*0 + 1*1 = 1 → 激活

当输入(1,1)时:

H₁:1*1 + (-1)*1 = 0 → 不激活
H₂:(-1)*1 + 1*1 = 0 → 不激活

此时特征空间变为:

| 原始输入 | 隐藏层特征 |
|:---:|:---:|
| (0,0) | (0,0) |
| (0,1) | (0,1) |
| (1,0) | (1,0) |
| (1,1) | (0,0) |

神奇的事情发生了——原本线性不可分的问题,在特征空间中被转换为线性可分问题!

三、输出层

在输出层,我们只需一个简单的OR逻辑:

输出 = H₁ OR H₂

此时只需一条直线就能完美分割正负例。这种「空间变换+线性分割」的思想,正是深度学习突破维度诅咒的武器。现代神经网络中,每个隐藏层都在进行更复杂的空间扭曲,最终在高维特征空间中用超平面分割数据。

四、数学本质:非线性变换

整个过程可视为:

f(X) = σ(W₂·σ(W₁X + b₁) + b₂)

其中:

W₁是第一层权重矩阵 [[1,-1], [-1,1]]
b₁是偏置向量 [-0.5, -0.5]
σ是阶跃激活函数
W₂是输出层权重 [1,1]
b₂是输出偏置 -0.5

这个结构实现了:
1. 第一层:将输入空间扭曲为线性可分的新空间
2. 第二层:在新空间中进行线性分类

传统OR门的感知机表达式:

输出 = 1*H₁ + 1*H₂ - 0.5 > 0

    • 每个输入的权重都是1,表示H₁和H₂同等重要
    • 当任意一个输入为1时,加权和为1;两个都为1时和为2
    • 设置阈值0.5(移项后表达式为加权和 > 0.5)
    • 只要有一个输入为1(1 > 0.5),即触发激活
    • 两个都为0时(0 > 0.5 不成立),不激活


用真值表验证公式的正确性:

| H₁ | H₂ | 计算过程 | 结果 |
|:---:|:---:|:---:|:---:|
| 0 | 0 | 0+0-0.5 = -0.5 | 0 |
| 0 | 1 | 0+1-0.5 = 0.5 | 1 |
| 1 | 0 | 1+0-0.5 = 0.5 | 1 |
| 1 | 1 | 1+1-0.5 = 1.5 | 1 |

(注:在XOR场景中,H₁和H₂不会同时为1,最后一行为理论验证)

五、代码实现

用Python实现这个经典XOR网络:

import numpy as np

# 定义网络结构
W1 = np.array([[1, -1], [-1, 1]])  # 第一层权重
b1 = np.array([-0.5, -0.5])        # 第一层偏置
W2 = np.array([1, 1])              # 输出层权重
b2 = -0.5                          # 输出层偏置

def perceptron(X):
    # 第一层计算
    h = np.heaviside(np.dot(X, W1) + b1, 0)
    # 输出层计算
    return np.heaviside(np.dot(h, W2) + b2, 0)

# 测试所有输入
inputs = [[0,0], [0,1], [1,0], [1,1]]
for x in inputs:
    print(f"输入{x} → 输出{perceptron(x)}")

输出结果:

输入[0, 0] → 输出0
输入[0, 1] → 输出1
输入[1, 0] → 输出1
输入[1, 1] → 输出0

六、历史意义与延伸

这个简单案例揭示了深度学习的核心思想:

- 层次化处理:如同视觉皮层V1→V2→V4的信息处理流程
- 分布式表示:单个概念由多个神经元协同表示
- 端到端学习:现代网络通过反向传播自动学习W,b参数

今天的Transformer架构依然延续这个基本原理,只是:

- 隐藏层数从2层增加到数百层
- 手动设计参数变为自动优化
- 阶跃激活变为ReLU等平滑函数
- 增加了注意力机制等复杂交互

理解这个案例,就掌握了理解深度学习的钥匙——任何复杂网络,本质上都是这个「俄罗斯套娃」结构的扩展与优化。

 

o3 Deep Research: DeepSeek R1 多阶段训练流程问答解析

好的,我将对 DeepSeek R1 的整体训练过程进行研究,并重点分析强化阶段(第二、三、四阶段)的具体细节,包括 RL、SFT、CoT 的方法和策略。

报告将采用问答解析的形式,同时保持一定的技术深度,并确保可读性。还会在必要时提供与 GPT-4、Claude、Gemini 等模型的简单核心对比,以展示其独特性。

我会尽量参考 DeepSeek 官方网站的论文和技术资料,同时结合网络上的研究和学习经验进行综合分析。

DeepSeek R1 多阶段训练流程问答解析

Q1: DeepSeek R1 的整体训练包含哪些阶段,为什么采用多阶段策略?

A: DeepSeek R1 采用了四阶段的训练流程,包括两个强化学习(RL)阶段和两个监督微调(SFT)阶段 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。这一多阶段策略旨在先通过少量数据稳定模型输出格式,然后大规模提升推理能力,接着扩展通用能力,最后对齐人类偏好,逐步打造出既擅长推理又安全实用的模型 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园) (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。具体来说:

通过上述多阶段渐进式训练,DeepSeek R1 最终在推理能力上达到接近OpenAI o1系列闭源模型的水平,同时保持了输出的规范性和对用户需求的良好适应 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)。

Q2: 第二阶段的推理强化学习是如何实施的?训练是逐个问题进行还是批量进行?训练数据是否会保留?奖励机制如何设计?

A: 阶段2是针对推理能力的强化学习训练。在这一阶段,DeepSeek R1 采用了GRPO算法(分组相对策略优化)来高效地进行RL训练 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。具体来说,对每个问题,不会只生成单一回答,而是从当前策略采样一组不同的回答,然后根据这组回答的得分情况来优化模型 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。这种“按问题分组”的方式相当于批量生成和评价:每道训练题目产生多个解答,计算每个解答的奖励,并用组内奖励的相对差异(优势值)指导模型参数更新 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。由于采用组内平均得分作为基准(baseline),GRPO 不需要额外的价值网络(critic),降低了大模型RL的开销 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。因此,就单个问题而言是通过多次尝试来强化,但在实现中会并行处理多个问题批次,以加速训练。每轮生成的回答在用于计算梯度更新后并不长期保留,属于在线的RL采样;只有当阶段2训练收敛后,会使用最终的模型来批量生成数据供下阶段使用 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。
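下面用几行 NumPy 勾勒上述“组内相对优势”计算的思路:同一问题采样一组回答,以组内平均奖励为基线、组内标准差做归一化。这只是思路示意,具体公式、系数与实现细节以论文为准:

```python
import numpy as np

def group_relative_advantages(rewards):
    """对同一问题的一组回答的奖励,计算组内相对优势(GRPO 思路的示意)。"""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)   # 以组内均值为基线,标准差归一化

# 例:某问题采样 4 个回答,奖励分别为 1、0、0、1
print(group_relative_advantages([1, 0, 0, 1]))
```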

奖励机制方面,DeepSeek R1 在此阶段设计了规则奖励,主要包括:

最终,以上各项奖励会加权求和形成总奖励 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。模型通过最大化期望奖励来更新参数:正确且格式规范、语言一致的回答获得最高回报,从而模型逐步学会既推理正确表述规范地回答问题 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。

值得一提的是,DeepSeek R1-Zero(没有冷启动微调的版本)在纯RL下已经自行涌现出了反思验证、超长链式推理等强大能力,但同时输出可读性差、语言混杂 (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻) (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)。因此在 R1 中通过以上冷启动和奖励改进,显著改善了这些问题 (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)。阶段2 的RL训练让模型的推理准确率在数学等任务上大幅提升(例如 AIME 数学竞赛准确率从15.6%提高到71% (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)),为后续步骤打下强大的推理基础。训练过程中产生的大量问答尝试并未直接用于模型参数更新之外的用途,而是在训练完成后经过筛选用于下一阶段的数据集构建 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。

Q3: 第三阶段为什么要用 60 万条推理再生数据和 20 万条非推理数据进行微调?这一混合微调策略有何作用?

A: 阶段3是承上启下的监督微调,目的是在保持高推理能力的同时扩展模型的通用能力 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。之所以采用“60万推理 + 20万非推理”的数据组合,主要有以下考虑:

Q4: 第四阶段为何需要再次进行强化学习?这一阶段是如何结合人类偏好进行对齐的,人类偏好对齐有什么必要性?

A: 阶段4是DeepSeek R1的二次强化学习阶段,核心目的在于对齐人类偏好,进一步提升模型在实际应用中的安全性和实用性 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。即使经过阶段2和阶段3,模型在推理和一般任务上已表现出色,但仍可能存在不符合用户期望的行为,例如回答不够礼貌有用,或在敏感问题上产生不安全内容。这正是当前尖端LLM都会面临的问题,需要引入人类偏好对齐(例如 OpenAI 的GPT-4通过RLHF过程进行对齐)。对于DeepSeek R1,这一阶段通过融合人类偏好相关的奖励信号来微调模型,使其行为更符合人类期望 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)。具体而言:

通过上述措施,阶段4 实现了模型行为和人类期望的深度对齐,相当于执行了一次RLHF式(基于人类反馈的强化学习)训练。 (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)在对齐后,DeepSeek R1 兼具强大的推理能力和良好的安全性/有用性:既能给出复杂问题的正确解答,又能确保回答让用户容易理解、愿意接受,在敏感问题上遵循安全准则 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园) ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)。这种人类偏好对齐对于模型大规模应用是必要的 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园)——它保证模型不会因为追求高推理准确率而忽略了与用户互动的品质,确保模型的高智商能够以让人类舒适的方式呈现出来。换句话说,经过阶段4,DeepSeek R1 不仅是“推理高手”,也是“贴心助手”。

Q5: DeepSeek R1 与 GPT-4、Anthropic Claude、Google Gemini 等顶尖模型有哪些核心差异?

A: DeepSeek R1 的训练路线与这些闭源大模型有一些独特之处,但总体目标相似(都追求强性能和对齐人类意图)。主要区别在于:

综上,DeepSeek R1 展现了一条与主流商用大模型不同但并行的研发路线:用较少人工介入、充分利用RL自我提升推理,再辅以必要的监督和对齐,来培养强大的通用AI。这为业界提供了新思路,也使研究者能够探究在不开天量数据集的情况下提升模型智力的可能 (The Illustrated DeepSeek-R1 - by Jay Alammar) (The Illustrated DeepSeek-R1 - by Jay Alammar)。各模型各有侧重,但目标殊途同归:让AI既“智商高”又“情商高”,能真正解决问题并让用户放心使用。

**参考文献:**

DeepSeek R1 官方论文 ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning) ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning)及技术报告、DeepSeek 官方博客和数据集说明 (DeepSeek-R1的四个训练阶段 - 蝈蝈俊 - 博客园) ([2501.12948] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning),以及相关分析文章和讨论 (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻) (阅读 | DeepSeek-R1:怎么用强化学习把一个普通语言模型调教成推理高手?_腾讯新闻)。

上述内容综合了官方资料与社区解读,以问答形式呈现,希望对DeepSeek R1的训练策略有全面而清晰的说明。

 

后记

才试了一下 openai o3 最新推出的 deep research 功能,不能不说,作为研究工具,非常给力,完成后的成品具有直接发表的水平,想做一点后编辑都难。有点叹为观止的感觉,堪称完美。当然,现在各家头部模型都开始 offer 这种“深度研究”的功能了,这等于是平替了大多数咨询公司的研究人员的工作。即便是训练有素的数据分析和调研人员,多数人实际上做不了它这么全面和细致。最见功力的是它的引用的全面和细密。说话有根据,幻觉自然减少。

【相关】

Dilemma of RPA and Early-Stage LLM Co-pilot Entrepreneurs in the Age of Agent Tsunami

As large language models (LLMs) surge forward, LLM Agents are reconstructing the automation landscape at unprecedented speed. This revolution not only threatens traditional RPA (Robotic Process Automation, reliant on rule engines or small models) but also pushes early-stage co-pilot application builders to the edge of a cliff. At its core, this technological shift represents two fundamental disruptions: 
1. Natural language interaction overpowering low-code programming in complex, dynamic, unstructured data scenarios. 
2. General intelligence violently overshadowing shallow vertical solutions.

"Triple Disruption" of LLM Agents

1. Paradigm Shift: From "Low-Code" to "Emergent Intelligence"

- Traditional RPA: Engineers script step-by-step logic (e.g., UiPath’s drag-and-drop designer), akin to teaching robots to hop grids – brittle and error-prone.
- LLM Agent: Directly interprets human intent (e.g., "Extract invoice data from emails into the system"), autonomously decomposes tasks, and dynamically adjusts execution paths.
- Case Study: ChatGPT plugins already book flights or fetch data via API calls, while traditional RPA requires low-code scripting for equivalent functions.

2. Moat Erosion: Data Barriers vs. General Intelligence

Pre-LLM RPA Moats:
Industry know-how (e.g., nuances of financial reimbursement workflows) + custom deployment capabilities + template libraries.
Reality: Most RPA firms accumulated shallow industry exposure rather than deep vertical data expertise.

LLM’s Breaching Tactics:
- Digests unstructured documents (e.g., diverse invoice formats) via multimodal vision and computer use capabilities.
- Adapts to new workflows via zero-shot Chain-of-Thought (CoT) reasoning (e.g., interpreting vague commands like "Sync key contract terms to CRM").

Final Blow: As standardized scenarios get natively covered by leading LLMs (including reasoning models), RPA’s last defense – proprietary industry APIs – is being devoured by LLM vendors’ customization and privacy solutions.

3. Ecosystem Cannibalization: From "Tool Vendor" to "LLM-native Layer"

Early Co-pilot Traps:
Products like Character.ai (personalized chatbots) and Jasper (writing/marketing assistants) – essentially thin wrappers over base models – crumble when ChatGPT launches role presets or DALL·E 3 plugins.

Survivor Playbooks:
- Perplexity.ai: Carves a niche with real-time search + academic citations (fixing LLM hallucination).
- Cursor: Builds vertical moats via developer workflow integration (codebase semantics, AI pair programming).

Industry Upheaval in RPA

- UiPath’s stock plummets from 2021 highs; its "Autopilot" feature (English-to-automation) criticized as a "GPT-4 wrapper."
- Microsoft Power Automate integrates Copilot, generating cloud workflows from natural language prompts.
- Adept (AI-for-computer-actions startup) hits $1B+ valuation, directly threatening RPA’s existence.

Survivor’s Map: Niches Resisting the LLM Tide

1. Deep Verticalization
- Cursor: Dominates IDE ecosystems via VSCode extensions and developer workflow data.
- Harvey (legal AI): Trains on LexisNexis corpus + private deployment for compliance.

2. Real-Time Data Masters
- Perplexity.ai: Search engine-grade indexing + academic database partnerships.
- Hedgeye (finance): Aggregates Bloomberg/Reuters feeds + proprietary prediction models.

3. Hardware Fusion
- Covariant: Embeds LLMs into warehouse robotics, leveraging mechanical barriers.
- Tesla Optimus: Physical-world operation via embodied AI, evading pure-digital competition.

Agent Startup Pitfalls & Counterstrategies

Common Traps

- Thin Model Wrapping
Issue: Repackaging ChatGPT prompts as "AI customer service" adds no real value.
Fix: Develop domain-specific features (e.g., clinical decision support requiring privacy-sensitive data pipelines).

- Over-Reliance on Fine-Tuning
Issue: Claiming "medical LLM" after basic terminology tuning ignores the need for closed-loop clinical workflows.
Fix: Build proprietary data flywheels and scenario-optimized architectures.

- Ignoring Enterprise Needs
Issue: Overlooking security, SLA guarantees, and system integration.
Fix: Architect enterprise-grade frameworks for organizational deployment.

Differentiation Strategies

- Workflow Integration Specialists: Develop deep connectors for niche scenarios (e.g., legal document parsing).
- Human-AI Orchestrators: Design quality control layers and manual override mechanisms.
- Vertical Knowledge Engineers: Curate domain-specific benchmarks and evaluation protocols.

RPA’s Last Stand

While battered, RPA retains residual value in:

- High-compliance scenarios: Auditable/traceable workflows (e.g., financial regulations).
- Legacy system integration: Stability in outdated IT environments.
- Ultra-high precision demands: Deterministic execution for core systems (e.g., stock trading).


Challenges for Early Co-pilot Entrepreneurs

Two fatal flaws plague AI application startups: 
1. No proven scaled success cases – LLMs are barely 2-3 years old, leaving co-pilots (beyond chatbots) unvalidated for commercial viability. 
2. Vulnerability to LLM upgrades – Without exclusive industry data or customer channels, co-pilot startups risk being crushed by foundational model advancements.

The Inevitable Conclusion

LLM Agents are replaying cloud computing’s annihilation of on-prem servers: foundational capabilities get standardized (like AWS replacing data centers), while vertical opportunities spawn new giants (like Snowflake). RPA and generic Agent startups must either:
1. Become vertical domain experts, or
2. Master human-AI collaboration architectures

... or face obsolescence as LLM agents absorb 90% of automation value. The silver lining? This disruption will unlock an automation market 100x larger than the RPA era – but tickets are reserved for those who architect vertically fused, LLM-empowered solutions.

As Sam Altman warned: Avoid building what foundational models will inevitably swallow.

 

 

【相关】

Understanding the Division of Labor Among Q, K, V in Self-Attention Mechanism

For those diving into self-attention mechanisms, the roles of Query (Q), Key (K), and Value (V) often spark confusion: Why must every token in a sequence generate these three distinct roles to capture contextual dependencies? To grasp how Q, K, and V matrices autonomously specialize through backpropagation, we must delve into the foundational logic of model training. This process mirrors the natural phenomenon of "ant colony division of labor": though initially identical, ants evolve into workers, soldiers, or reproducers through environmental feedback. Similarly, Transformer parameters self-organize via error-driven optimization.

I. The Driving Force: Loss Function as a Macro-Regulator

Suppose we are training a translation model, with the input sentence "猫追逐激光点", and the target output "The cat chases the laser dot." The following are the key steps in parameter differentiation:

1. Initial Chaotic State
- W_Q, W_K, W_V matrices are all randomly initialized
- At this point, the Q vector of "追逐" (chase) may have no correlation with the K vector of "激光点" (laser dot)

2. First Forward Propagation
- When calculating attention weights, "追逐" (chase) fails to associate with "激光点" (laser dot)
- This leads to an incorrect translation (such as outputting "The cat eats the laser")

3. Error Signal Feedback
The loss function calculates two key gradients:
- Content missing gradient: Need to strengthen the action association "追逐→chases"
- Object mismatch gradient: Need to establish the verb-object relationship between "追逐" (chase) and "激光点" (laser dot)

4. Parameter Differentiation Begins
- W_Q matrix receives the signal: Make the Q vector of verbs more attentive to action target features
- W_K matrix receives the signal: Strengthen the acted-upon object attributes in noun K vectors
- W_V matrix receives the signal: Preserve details such as mobility in nouns

🔥 Key Mechanism: The same error signal propagates through different computational paths, causing the update directions of the three matrices to differentiate.

II. Mathematical Principles of Parameter Differentiation

By breaking down the attention calculation process, we can see how gradients guide division of labor:

Attention Weight Calculation Paths

- Gradients for W_Q:
Mainly come from the similarity calculation between the Q of the current token and the K of contextual tokens, forcing W_Q to learn how to generate effective query features
(Example: Making the Q vector of a verb contain potential features like "needs to be paired with an object (transitive verb)"; Q resembles the encoding signal for potential sentence patterns in traditional linguistics, similar to Subcat)

- Gradients for W_K:
Also come from Q-K similarity calculation, but the direction is to optimize K features to be recognizable by Q
(Example: Making the K vector of nouns contain attributes like "can serve as an object of action (object noun)")

- Gradients for W_V:
Come from the final weighted sum, requiring V to retain sufficient information
(Example: The V vector of "激光点" (laser dot) needs to include details like "small, bright, movable")

Four Steps of Weight Calculation (a code sketch follows the list):

1. Q-K Dot Product: Measure relevance.
2. Scaling: Prevent gradient explosion.
3. Softmax: Normalize into probability weights.
4. Weighted Sum: Generate contextualized representations.
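A minimal NumPy sketch of these four steps; the shapes are illustrative, and real implementations add masking, batching, and multiple heads:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d) matrices produced by the W_Q, W_K, W_V projections
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # steps 1-2: dot product + scaling
    weights = softmax(scores)                  # step 3: normalize into probability weights
    return weights @ V                         # step 4: weighted sum of Values
```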

III. Structural Guarantees for Stable Division of Labor

Beyond gradient driving, model structure design also ensures that the division of labor remains consistent:

1. Isolation of Linear Transformations
- Q/K/V come from three completely independent matrix multiplications
(Unlike LSTM gating mechanisms that share parameters)
- Gradient updates for each matrix do not interfere with each other

2. Multi-Head Attention Mechanism
Using 8-64 independent attention mechanisms (multi-head attention) is like having a team of detectives investigating different directions: some focus on the timeline, others analyze character relationships, and finally, all relationship matching results are synthesized.

Different attention heads form a "division of labor" (see the sketch after this list):
- Head 1: W_Q¹ learns grammatical role matching
(Example: Matching the Q of a subject with the K of a predicate)
- Head 2: W_Q² learns semantic associations
(Example: Matching the Q of "bank" with the K of "interest rate")
- This multi-objective optimization forces parameters to specialize
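Building on the single-head `attention` sketch above (assumed to be in scope), the multi-head version simply runs several independent Q/K/V projections in parallel and concatenates the results; the parameter names and shapes here are illustrative:

```python
import numpy as np

def multi_head_attention(x, Wq_list, Wk_list, Wv_list, Wo):
    # x: (seq_len, d_model); each head has its own projection matrices
    heads = [attention(x @ Wq, x @ Wk, x @ Wv)
             for Wq, Wk, Wv in zip(Wq_list, Wk_list, Wv_list)]
    return np.concatenate(heads, axis=-1) @ Wo   # merge heads back to d_model
```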

IV. Empirical Validation: Concretization of Parameter Division of Labor

By visualizing the parameters after training, clear patterns of division of labor can be observed:

Case Study: Related Parameters for the Verb "吃" (eat)
- W_Q Matrix:
In the Q vector of "吃" (eat), high-weight dimensions correspond to features like "edible," "concrete object," etc.
- W_K Matrix:
In the K vector of "苹果" (apple), high-weight dimensions correspond to attributes like "food category," "solid," etc.
- W_V Matrix:
In the V vector of "苹果" (apple), high-weight dimensions include details like "color," "texture," "nutritional components," etc.

When calculating `Q(吃)·K(苹果)` (`Q(eat)·K(apple)`), strong attention weights are generated due to high activation values on the "edibility" dimension from both parties. Meanwhile, V(apple) carries the specific information needed for output production (such as knowing it's a fruit rather than a technology company when translating to "apple").

Key Conclusion: The Wisdom of Self-Organization
The essence of parameter division of labor in Transformers is the functional specialization that naturally evolves under the constraints of a unified objective function. The system does not need to preset division of labor details but spontaneously forms an efficient information processing system through repeated "trial-error-feedback" cycles with massive data. This self-organizing process driven by error is the source of the powerful representation capabilities of deep learning models.

[Addendum] A Deeper Interpretation of Q/K/V Relationships

Relationship Between Q and K
- Q is a specific perspective or projection of the K space
- Just like a book can be retrieved from different angles:
- Q1: Subject classification (K1: Literature/Technology/History)
- Q2: Difficulty level (K2: Beginner/Advanced/Professional)
- Q3: Writing style (K3: Theoretical/Practical/Case-based)

This is because Q "actively" seeks certain features associated with other tokens, while K is "passively" prepared to be matched by other tokens. K is like an index that needs to summarize all the main features of a token, but Q focuses on querying a specific feature.

This makes understanding multi-head attention more intuitive:

```
Each head learns a different projection perspective
Q1 = token * W_q1 # May focus on thematic relevance
Q2 = token * W_q2 # May focus on grammatical relationships
Q3 = token * W_q3 # May focus on semantic roles
```

It's like different facets of a high-dimensional space:
- Each attention head learns a specific "query perspective"
- These perspectives collectively build a complete picture of inter-token relationships

Division of Labor Between K and V
- K: Information's "retrieval representation"
- Contains various ontological features that might be queried
- Similar to a multidimensional tagging system for books
- V: Information's "content representation"
- Contains information that actually needs to be utilized
- Like the specific content of a book's text

A Concrete Example
Using the word "驾驶" (driving) as an example:

Different perspectives that multi-head attention might learn:
- Q1: Seeking action tools (highly relevant to "汽车" (car))
- Q2: Seeking action subjects (highly relevant to "司机" (driver))
- Q3: Seeking action modifiers (relevant to "快" (fast), "稳" (stable), etc.)

This understanding effectively explains:
1. Why Q/K separation is necessary
2. Why multi-head QKV mechanisms are needed
3. How the model automatically learns different types of contextual relationships

Continuity Between V and Token Representation
A token's V (Value) is most related to the token's initial embedding, as both represent the content and meaning of this token.
- Initial embedding: Represents the general meaning of the token learned in large-scale embedding training in advance, similar to looking up a dictionary
- Value vector: Can be seen as a continuation and update of this initial representation in a specific context

In other words:
1. Embedding is the "basic dictionary definition" of a token
2. Value is the "specific expression" of this definition in a particular context

Evolution of Token Representation in the Model

As information flows through multiple network layers:

Initial embedding → Layer 1 Token → Layer 2 Token → ... → Final representation

During this process:
- Each layer's token representation carries increasingly rich contextual information
- While maintaining continuity with the original token meaning (residual connections can compensate if continuity degradation is a concern)
- This evolution is gradual, not disruptive

Essential Differences Between Q/K and V
- Q and K primarily serve the goal of "establishing relationships"
- Q and K extract query features and index features for matching
- Q and K are naturally more abstract and general than V
- V directly carries "concrete content"
- Contains actual information that the token needs to convey
- More specific, more detailed

Figuratively speaking:
- Q/K is like the retrieval system in a library
- V is like the actual content of books on the shelves

Conclusion: The Deep Wisdom of the QKV Mechanism

From the perspective of the entire model:
1. Initial embeddings enter the first layer
2. Each layer updates the next layer's token representation through attention mechanisms and feed-forward networks
3. The final layer's representation encompasses all contextual relationships and meanings, directly empowering the output

The QKV division of labor in self-attention mechanisms, seemingly simple yet embedding profound information processing philosophy: through carefully designed computational flows and gradient paths, the model naturally develops functional differentiation during the optimization process. This design philosophy of "emergent intelligence" has become a core paradigm in modern artificial intelligence.

It is precisely this capability for self-organization and self-evolution that enables Transformer models to capture complex and variable relationship patterns in language, laying the foundation for the powerful capabilities of large language models.

 

【相关】

Linguists Should Find Self-Attention Intuitively Familiar

Written for my linguistics and symbolic NLP peers — a reflection on my journey to leverage computational linguistics in understanding modern AI LLMs.

Breaking Through the Jargon Barrier

For linguists bewildered by large language models (LLMs), the confusion often stems from terminology and implementation details obscuring shared foundational principles. Let’s cut through the noise and focus on self-attention — the beating heart of the Transformer architecture.

As a computational linguist and lifelong NLP practitioner, I’ve spent years dissecting symbolic grammars and, more recently, tracking the rise of LLMs. Here’s my attempt to "translate" the core design of multi-head Query-Key-Value (QKV) mechanisms into a framework linguists already know.

QKV: A Linguistic Reinterpretation

Query as Subcategorization (Subcat)
First, I would like to point out, Query mirrors Subcat in symbolic grammar: the slots a head word "digs" for its dependents. Take a transitive verb (vt) as an example: it creates two syntactic "slots"—a noun subject (pre-verbal) and a noun object (post-verbal). Similarly, the predicate eat defines two semantic slots: an animate agent (e.g., animal) and an edible patient (e.g., food). These constraints — syntactic roles and semantic selection restrictions — are bread-and-butter concepts for linguists.

Key as Ontological Features
Key represents ontological attributes: nounhood, animacy, action, state, time, descriptive, etc.  Value is the filler—the "carrot" that occupies a slot. When I first read Attention is all you need, the QKV triad felt alien. No one explained that this was just dynamic slot-filling.

Why LLMs "Get" Language

LLMs thrive because their "slots" and "fillers" align perfectly across linguistic hierarchies. Every token carries QKV information because every word can both be a seeker (Query) and a target (Key/Value). When a Query (e.g., eat) finds a compatible Key (e.g., apple), their dot product sparks a high attention weight. The Value (the token’s semantic essence) is then passed forward, blending into the next layer’s representation of the token.

Contextual "Polygamy"
Tokens in the context window engage in group marriage, not monogamy. Each token 'flirts' with all others via Query-Key dot products. Relationships vary in intensity (weights), and the resulting "offspring"—the next layer’s tokens—inherit traits from multiple "parents" through weighted summation. Stronger relationships dominate; weaker ones fade. This crazy yet efficient "breeding" compresses linguistic structure into dense vector spaces, a process conceptually equivalent to parsing, understanding, and generation in one unified mechanism.

The Database Analogy (and Why It 'Misled' Us)

QKV borrows terms from database systems (Query for search, Key-Value for retrieval), but early attempts to map this to linguistics fell flat. We thought: "Databases? That’s just dictionary lookups — isn't it already handled by embeddings?!" The breakthrough came when we realized: Self-attention isn’t static retrieval—it’s dynamic, context-aware slot-filling.

For decades, we built bottom-up parsers using Subcat frames. Transformer layers do the same, but with vectors instead of symbolic representations. See the two slides I made over three years ago, when the GPT-3 playground was launched, comparing the parallel architectures and approaches of two schools of AI: the grammar school and the multi-layer neural network school. Symbolic grammars, despite their transparency, pale in scalability:
- Granularity: LLMs leverage vectors with hundreds or thousands of dimensions; we relied on at most a few thousand one-hot features.
- Generalization: Transformers parse text, audio, video—any modality. Symbolic grammars, at best, aspire to universal grammar across languages.

A Convergence of Paths

My colleague Lü Zhengdong once mapped the evolution of attention: 
Seq2Seq (Google Brain) → Auto-alignment (Mila) → Transformer (Google) → Pre-trained LMs → LLMs (OpenAI)...

To this, I chuckled: "You pioneers see the trajectory clearly. But for us symbolic refugees, diving into Attention is all you need felt like drinking from a firehose." Without fully understanding the historical context, the concepts overwhelmed us—until one day, it clicked: Subcat-driven parsing and self-attention are two sides of the same coin.

Symbolic methods are obsolete, yes—clunky, rigid, and modality-bound, with the only merit of full transparency of symbolic logic. Yet understanding their parallels to Transformers suddenly made LLMs feel familiar. The difference? Scale and ambition. Linguists seek cross-linguistic universals; AI aims for cross-modal universals.

Postscript: Simplifying the Transformer

The original Transformer paper (Attention is all you need) is not an easy read at all, bogged down as it is by encoder-decoder specifics for machine translation. Strip away the noise, and the core is simple:
1. Self-attention layers (dynamic slot-filling).
2. Feedforward networks (nonlinear transformations).

GPT’s decoder-only architecture reveals the essence: next-token prediction (NTP) is the key to general intelligence. The so-called "decoder" isn’t just about decoding or generation—it’s also analysis and understanding fused into one stream.

Closing Thoughts

Dr. Bai, Shuo once remarked:

Language processing demands a unified ‘currency’—a mechanism to reconcile syntax, semantics, pragmatics, and world knowledge. Only neural networks (imperfect as they are) have managed to achieve this, albeit probabilistically. Attention is that currency.

He’s right.  Attention isn’t just a tool—it’s the universal metric we’ve sought all along.

 

【相关】

语言学家应该很容易理解自注意力机制

作为计算语言学家和NLP老司机,本篇是写给我的语言学老师和同学,以及符号NLP同行的,分享自己的心路历程,希望对大家有所启发。

如果语言学家以前觉得大模型烧脑和不解,那是因为窗户纸没有捅破,存在不同体系下的术语与机制细节的理解障碍,但底层逻辑并不乏不少共同的东西。本篇聚焦在大模型Transformer架构中的核心注意力机制的理解。

我出身计算语言学,这几年又一直在追踪大模型,可以“翻译”一下自注意力机制的核心设计多头的 QKV。

我们做符号文法的 早就该知道 Query 就是 Subcat,主导词为潜在结构“挖坑”用的,例如及物动词 (vt)就挖了两个坑:一个「名词」主语,一个「名词」宾语。主语、宾语是句法结构的角色要求,「名词」是对所要求对象的限制条件(更细致地,Subcat 还有其他限制条件,例如主语在vt前,宾语在 vt后,等)。具体到概念谓词“eat”,逻辑语义上也相应地挖了两个坑:一个是「动物」施事,一个是「食物」受事。「动物」(包括「人」)是逻辑语义结构中对施事角色的语义限制条件,「食物」是逻辑语义结构中对受事角色的语义限制条件。这些都是我们语言学家践行多年、耳熟能详的概念体系。

Key 就是本体特征, 例如,名词、物体、食物、动作、状态、修饰、时间等,Value 就是填坑的“萝卜”。可惜,初读论文「Attention is all you need」 ,被 QKV弄得晕头转向的时候,没有人指点迷津。

为什么LLM大模型吃透了语言,说话那么顺溜,原来各个层级的坑与萝卜,都是那么相配,天赐良缘。为什么每一个单词都有QKV信息,道理也简单,每一个词都可能在“求偶”,追求别人,也都可能被追求。追与被追发现非常谈得来的时候, QK相配,注意力权重大,于是结合,就是把身子(Value)献上;然后生子 ,就是创造下一层的 Token 表示。

有意思的是,上下文窗口里的 Tokens 是群婚制,不是一夫一妻制生孩子。一个 Token 与周围所有 tokens 谈恋爱 q k(i)点积,其他tokens(包括该 Token自己)都与该Token 有一腿,但关系强度不同(谈吹的tokens,权重为0)。该 Token与这种多边关系“杂交”生出的孩子是怎样的呢?加权求和。就是说孩子继承了母亲的很多特征,同时也继承了周围众父亲的特征。谁的特征在子代最彰显或较弱,完全决定于交情的深浅、关系的强度。每个token都是这样推陈出新,一代一代传下去。最后发现,这种群婚制对于信息压缩(理解、解析、生成)特别有效。真有意思。

QKV这些概念显然是从数据库技术拿来的,对于软件界一点都不陌生。但当年理解注意力机制的时候,还是遭遇很大的困惑,想不明白 语言解析与数据库啥关系。当时觉得扯上数据库,最多就是查词典相关,而查词典早就在词嵌入(embedding)编码阶段就完成了。自注意力的核心不是静态的查词典,而是动态的上下文理解。 当年因为想不明白为什么要套用数据库的信息查询和匹配的说法,只能把注意力机制泛泛理解为关注上下文。这也没错,但看不到 insights,也理解不了为什么这东西这么厉害,成就了大模型的超人语言能力。

经过很久终于豁然开朗:原来我们做了一辈子的 subcat-based bottom up parsing,跟 attention+feedforward 做的完全是一回事,一个原理,殊途同归(见下图:这是我在大模型GPT3 playground刚冒头时候做的一个对比,illustrate 当年意识到两条路线殊途同归时候的震撼)。只不过我们符号文法处理完全打不过它。一个是颗粒度没法比,人家的家底是几百上千维度的实数向量,加上各种投射以及非线性转换,我们才有最多几千个强行结构化的符号特征(one hot features)。另外,人家的泛化可以从文字解析,推广到任何模态信号的压缩和理解,我们的文法不具有任何的模态可推广性、可移植性,最多只是跨过了语言壁垒,文法追求可以适用于任何语言的 universal grammar。

我的主流弄潮儿的一位老友吕正东说:

在attention 的发展史上(so far),我看到多次颠覆式的创新, 从最早的 seq2seq (Google Brain) → auto alignment (Mila) → Tranformer (Google again) → pre-trained language model (?) → LLM (openAI )→ ...

我苦笑道:你是真正业内前沿,一路发展轨迹自然一目了然。你可能想象不出我们这些符号出身的人,突然被逼去研读这种经典论文(Attention is all you need)时候的困境。因为缺乏历史演进的知识垫底,一下子就被这些概念砸晕了。不知道经过多少次嘀咕、查阅,才慢慢明白:天下大势,冥冥之中,有万变不离其宗。原来,我们在符号文法摸索了一辈子的被证明最有效的谓词Subcat框架和自底而上的结构解析算法,底层逻辑与 transformer 及其自注意力机制不谋而合。 虽然符号技术过时了,也确实笨拙,除了符号逻辑的透明性外,没有多少其他长处,但现在理解深度学习大模型的原理和框架,由此骤然变得亲切了很多。只不过现在眼界开阔了,与信息论和计算理论更加靠近。(计算)语言学家一辈子的追求就是跨语言,而AI的追求更高一筹,是跨模态,无论音频、视频还是文字。

【后记】

大模型经典论文 Attention is all you need 不好读,也有时代的原因:它提出的框架是直接应对 RNN 的短板以及机器翻译的需求与验证的。这就让它的核心部分被这些因素干扰而模糊了。框架看上去太复杂,encoder、decoder 两大部分,还必须在 encoder 到 decoder 之间做一种交叉对齐,但其实内核组块没有任何区别。这些对于理解 transformer 的通用性和原理,都是“噪音”。

transformer 主体简单极了,不过就是多层感知器,在每一层加了个自注意力而已。 到了GPT 发现 ntp(下一词预测)是打开通用智能的钥匙的时候,从框架角度就更简单了,decoder-only 足矣(说明:decoder 并不是名字所说的那样只做解码,它一样做分析和理解,这是由其核心组块决定的)。

老友看了我的transformer博客解说(Transformer 和注意力机制简介),说你这个太简陋了,连篇幅都比原论文短。

原文中有一些细节舍去了。
作者: 立委 (*)
日期: 2025/02/21 12:23:37

包括:

原架构是两大块:encoder + decoder

但实际上这两大块里面的组块都是完全一样的。而且,主流已经抛弃了 encoder,GPT 采用的就是 decoder-only 架构。

另外,位置编码是序列的一个因素,与处理过程解耦以后,位置编码有一套说法,怕干扰了主旨的理解,就点到为止了。

再有就是一些数学公式和实现细节,包括归一化、残差等。舍弃不影响对于 “注意力+神经网络” 这个核心主旨的理解。

所以通篇实际上就是一个理解重点:自注意力机制怎么work的,因为多层感知器是个 given,不是 transformer 的创新。

顺便一提,所谓 自注意力,国人喜欢顾名思义,以为是自己注意自己,感觉很蹊跷。其实自注意力是针对跨序列的交叉注意力而言的,自注意力是在同一层序列的上下文中注意所有的相关tokens(确实也包括它自己),是单层序列之内的事儿,这个“自”回指的不是token自己,而是token自己所在的那个窗口序列。交叉注意力说的是跨序列层的注意力,例如传统的神经机器翻译中,目标语序列针对源语序列的注意力。到了 GPT 的通用生成式AI(gen-AI)主流,没有跨序列的必要了,因为所有的 input 序列 和 output 序列,都被自回归“挤压”到同一层的序列去了。仅有的 output 就是 next token,其余一切tokens都连成一串了作为input的条件:everything is ntp。

以“中译英:我爱你” 的机器翻译为例,GPT自回归生成的序列是这样的:

Input                  Output
中译英:我爱你         I
中译英:我爱你 I       love
中译英:我爱你 I love  you

屠可伟老师的研究进一步验证了parsing与transfromer的可对齐性:

李老师,关于transformer自注意力机制和语言学的关系,我们前年有个工作,之前也和您提过:我们发现transformer自注意力机制与概率依存句法模型的近似推理计算图极为相似,局部几乎一模一样: Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation

白硕老师说:

我对这个问题的观点:

1、语言处理的符号主义本身并没有一个基于第一性原理的强有力理论体系,最牛的乔姆斯基也没做到。

2、语言处理的完整方案必须包含一个词法、句法、语义、语用、常识、事理、逻辑各方面“角力”因素能够以可以“统一度量衡”的表达-竞争-筛选机制,这点,目前只有多层神经网络可以做到,虽然只是“概率性的”。

3、语言处理和知识处理的共性是滑动上下文内的key-value填充,也就是我们俗称的“哪些萝卜填哪些坑”,这个共性的需求,被一个共性的机制——注意力机制在很大程度上解决了。再单独、分别做语言层面的填充机制(什么成分做什么角色)或是知识层面的填充机制(什么槽位取什么值)已经失去意义。要么不做,要么统一做而且比注意力机制做得更好。没有其他的出路。

白老师所言极是。白老师说的“统一的度量衡”就是自注意力。

 

【相关】

Introduction to Transformer and Its Attention Mechanism

The Transformer architecture and its attention mechanism form the foundation of mainstream GPT large language models, making them extraordinarily important. Despite the abundance of explanations and popular science articles on this topic, many friends tell me they still find it bewildering or only partially understand it. Therefore, I've decided to write a couple of blogs to contribute my understanding.

As someone curious about mainstream AI, you've likely heard of the renowned Transformer framework and its "attention mechanism" that powers large language models, perhaps considering them mysterious concepts. You may have read the classic paper "Attention is All You Need," but still found it confusing or difficult to decode. Don't worry—this is completely normal, and most of us have gone through this stage! While the paper may be a bit mind-bending, its core logic isn't actually that complex.

To understand the Transformer architecture in AI large language models (LLMs), we need to break down its workflow. First, we should understand how large language models work and how they're trained. Base large language models gain knowledge from data through "self-supervised learning" using multi-layer neural networks. Self-supervised learning is a special type of machine learning that uses "masking" to generate supervision signals. While supervised learning typically uses human-annotated data with output targets, self-supervised learning requires no human annotation. Instead, it masks certain data points and trains the system to predict them (like "filling blanks" or "continuing sequences"), using the masked data as the correct answer and supervision signal. Mainstream GPT models mask the next word, training the system to predict it based solely on previous context (called "next token prediction")—this is the current paradigm for generative AI.

The Complete Process from Input to Output

1. Starting with "Dictionary Lookup": Tokenization and Embedding

To understand an entire input text for next token prediction, we first need to break it down into basic units, called tokenization, which converts text into a sequence of tokens (the smallest units of text). These tokens might be complete words (like "work") or subwords (like "un+believ+able").

Tokens are symbols, and computers struggle with direct symbol manipulation—they only work well with numbers. So we need to convert tokens into numbers.

Each token is converted into a numerical representation—a multi-dimensional vector—by looking up an embedding dictionary, typically with 300-1024 dimensions (imagine establishing feature representations for each word across many conceptual dimensions, such as: noun, singular, organization, finance, etc.). Embedding allows words to have computable semantic spatial relationships.

This multi-dimensional vector space acts like a "meaning space" where each token's vector defines its position. The distance between tokens across different dimensions represents their semantic distinctions. This aligns with our intuition: a word's meaning becomes apparent through comparison with other words.

These vectors aren't randomly generated but are numerically encoded representations trained on massive natural text corpora, providing the basic semantic information of tokens—their position in meaning space. For example, the vector for "bank" naturally sits closer to "money" and far from "trees." Similarly, the vector for "apple" might contain information about both "fruit" and "technology company."

Imagine trying to help a computer understand the sentence: "The cat sat on the mat."

Step one: Tokenization breaks this sentence into individual tokens: The+cat+sat+on+the+mat.

Step two: Dictionary lookup (Embedding) finds a numerical representation—a multi-dimensional vector—for each token.
"cat" -> [0.1, 0.5, -0.2, ...]
"sat" -> [-0.3, 0.8, 0.1, ...]
...

Simply put:
Tokenization breaks text into the smallest units (tokens) that computers can easily process and analyze.
Embedding converts these tokens into vectors that computers can easily calculate and combine.
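As a toy illustration of these two steps (the whitespace tokenizer and the random vectors are stand-ins for a real subword tokenizer and trained embeddings):

```python
import numpy as np

# Toy sketch of tokenization + embedding lookup.
# Real systems use subword tokenizers and embeddings trained on massive corpora;
# here the vocabulary is tiny and the vectors are random placeholders.

sentence = "The cat sat on the mat"
tokens = sentence.lower().split()                      # step 1: tokenization (toy: split on spaces)

vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
dim = 8                                                # real models: 300-1024+ dimensions
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), dim))   # the "embedding dictionary"

token_ids = [vocab[tok] for tok in tokens]             # symbols -> integers
vectors = embedding_table[token_ids]                   # step 2: dictionary lookup -> vectors

print(tokens)
print(token_ids)
print(vectors.shape)                                   # (6, 8): one vector per token
```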

Key point: The vectors obtained from the embedding dictionary are only the "initial meaning representations" of tokens, without considering their specific context. Decoding contextual meaning from vector representations is the task of the next steps, using the multi-layer neural networks + attention mechanism in the Transformer architecture.

The core modules of Transformer can be broken down into two parts:
1. Attention mechanism: Used to calculate the relevance between tokens and dynamically update token representations.
2. Neural network: Used to process information transformation between tokens.

The entire Transformer stacks multiple such blocks, with each attention layer recalculating token representations and progressively deepening understanding.

2. Attention Takes the Stage: Updating Word Meanings Based on Context

Now we have a sequence of vectors, each representing the "initial meaning" of a token. But here's the problem: the same word can have different meanings in different contexts! For instance, "bank" can mean a financial institution or a riverbank.

The core of the Transformer architecture is the attention mechanism (self-attention), which serves to dynamically adjust the representation of each token based on context, reflecting its relationships with other tokens.

For example: In the sentence "I like to eat apples," "apple" and "eat" are highly correlated, so the model will rely more on the word "eat" to update the meaning of "apple," determining that "apple" here refers to fruit rather than a company.

How is this done?

The model calculates attention weights between each token and other tokens through QKV attention:
- Query: the query vector of the current token (e.g., "he")
- Key: the key vectors of contextual tokens (e.g., "police," "witness")
- Value: the content vectors of those contextual tokens—the information that actually gets blended in once an association is established

For example, through matrix operations, the model discovers that "he" is most strongly associated with "witness," so it updates the vector for "he" to carry information from "witness."

Calculating "relevance": For each token, we calculate its "relevance" with all other tokens in the sentence, assigning different "attention weights" (attention scores) to different tokens. This "relevance" can be understood as: how important are other tokens when understanding the meaning of the current token.
* For example, when understanding the word "sat," "cat" and "mat" are obviously more important than "the."

Weighted average: Based on the calculated "relevance" (i.e., token weights), we take a weighted average of the V vectors from all tokens in the context to obtain a new vector representation for this token. This new vector is the meaning representation of the current token in this specific sentence.
For instance, the new vector for "sat" will be more influenced by the vectors of "cat" and "mat," and less by the vector of "the."
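A minimal numpy sketch of this relevance-then-weighted-average computation (single head, with random matrices standing in for the learned Q/K/V projections) might look like this:

```python
import numpy as np

# Single-head scaled dot-product self-attention, as a sketch.
# Weights are random stand-ins for the learned Q/K/V projection matrices.

rng = np.random.default_rng(0)
n_tokens, dim = 6, 8                       # e.g., "The cat sat on the mat"
X = rng.normal(size=(n_tokens, dim))       # context-free token vectors (from embedding)

W_q, W_k, W_v = (rng.normal(size=(dim, dim)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v        # queries, keys, values

scores = Q @ K.T / np.sqrt(dim)            # relevance of every token to every other token
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per token

updated = weights @ V                      # weighted average of values -> contextual vectors

print(weights.shape)   # (6, 6): attention weights between all token pairs
print(updated.shape)   # (6, 8): one context-updated vector per token
```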

Key point: The attention mechanism dynamically updates the meaning of each token by calculating the relevance between tokens. This update is context-based—the same token will have different representations in different sentences.

This way, each token's meaning is no longer fixed but changes dynamically based on the entire sentence's context. For example, in "I saw a bat," "bat" could refer to either a flying mammal or a sports implement, but the attention mechanism will combine the bigger context to infer its more appropriate meaning.

For details on how QKV works in the attention mechanism, please refer to the companion article "How to Understand QKV Division of Labor in Self-Attention Mechanism?"

3. The Transformer Backbone: Multi-layer Progressive Information Compression

The core building blocks of Transformer can be broken down into two parts:
Multi-head attention layer: Used to calculate relevance between tokens and dynamically update token representations.
Feed-forward neural network layer: Further process and transform information (compression, abstraction)

The entire Transformer consists of multiple such modules stacked together, with each layer recalculating token representations for deeper understanding. Depending on the number of blocks, the Transformer repeatedly performs this update process. Like humans pondering a text multiple times, each layer deepens the understanding of the text. Deeper layers may capture more complex semantic relationships.

Each Transformer block iteratively upgrades understanding, for example:
- Bottom layers: Capture local grammar (such as the contrasting relationship in "not...but...")
- Middle layers: Understand "who 'he' actually refers to"
- Top layers: Grasp the main theme of the entire text

The main features of Transformer
1. Parallel computation: Word order is decoupled from token processing, allowing parallel processing of all tokens (in contrast to the linear inefficiency of previous RNNs)
2. Hierarchical understanding: Progressive interpretation from literal meaning to deep intention, capturing patterns both large and small.
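For readers who prefer code, here is a highly simplified sketch of such stacking (single-head attention, random placeholder weights, LayerNorm and multi-head splitting omitted), just to show the "attention sub-layer + feed-forward sub-layer, repeated N times" skeleton:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, dim):
    # single head; random weights stand in for learned projections
    W_q, W_k, W_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(dim)) @ V

def feed_forward(X, dim, hidden=32):
    W1 = rng.normal(size=(dim, hidden)) * 0.1
    W2 = rng.normal(size=(hidden, dim)) * 0.1
    return np.maximum(0, X @ W1) @ W2          # ReLU MLP: transform/abstract each token

def transformer_block(X, dim):
    X = X + self_attention(X, dim)             # sub-layer 1: update tokens from context
    X = X + feed_forward(X, dim)               # sub-layer 2: process the information
    return X                                   # residual connections kept; LayerNorm omitted

X = rng.normal(size=(6, 16))                   # 6 tokens, 16-dim vectors
for layer in range(4):                         # stacking blocks = progressively deeper reading
    X = transformer_block(X, dim=16)
print(X.shape)                                 # (6, 16): same shape, richer representations
```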

4. Output: The Model's Final Prediction

Transformer models can be used for various tasks. Different tasks have different forms of output.

GPT: Next Token Prediction
For mainstream GPT models, their ultimate task is to predict what comes next through "autoregressive" next token prediction (autoregression is the dynamic extension of previous context, recursively implementing word-by-word continuation). The model decides what content should logically follow based on the deeply understood context. This opened the path to general AI, as sequence learning has mastered the "code" for converting inputs to outputs for general tasks, but that's a topic for another article.

5. Summary

Tokenization and Embedding lay the foundation for computers to understand text, similar to looking up a dictionary.
Attention mechanism calculates relevance between tokens and dynamically updates token representations.
Transformer consists of neural network layers + attention layers, optimizing token representations layer by layer, covering various relationships at different levels.
The final output depends on the task. Translation models generate target language text. GPT is responsible for predicting the next token, ultimately evolving this simple prediction mechanism into a general-purpose large model capable of unlocking various tasks.

 

【相关】

Has Symbolism Been Sidelined for Too Long? Could Neural LLMs Be AI's Ultimate Terminator?

Though no one can predict the future, and though abandoning one of the two paths feels politically incorrect, we cannot rule out the possibility of such unipolar dominance.

As is widely known, AI has always been marked by the competition between two schools: symbolic rationalism and data-driven empiricism. Their fortunes have waxed and waned throughout history, but over the past 30+ years, the pendulum has shown no sign of swinging back toward symbolism.

Why?

The ongoing contemporary history of large language models is fascinating. Each time challenges and obstacles arise, the mainstream paradigm overcomes them from within. Whether this will continue remains to be seen, but the trend seems likely to persist.

When large language models (LLM) first emerged, people marveled at their capabilities. But soon, critiques arose: their simple "next token prediction" (NTP) objective and the statistical nature of their probabilistic models led many to conclude they were merely advanced statistical tools, like large parrots—lacking true "understanding."

Ilya Sutskever and Geoffrey Hinton had to step in repeatedly to explain: "Do not underestimate next token prediction. This is no mere statistical n-gram model from the past. It abstracts a system of understanding that integrates human knowledge. When next-token prediction grows increasingly accurate, deep comprehension of context becomes indispensable." Such explanations struggled to convince skeptics. Later, Ilya invoked Kolmogorov complexity as a theoretical foundation, but this framework remains esoteric and inaccessible to most audiences—even many PhDs and professors view it with bemused skepticism. Yet, no better explanation exists.

What ultimately dissolved the "statistical parlor trick" critique was firsthand experience. Users interacting with LLMs realized: these systems seem to genuinely understand. No matter how you phrase your queries, in any language, with nuance or subtext, large models grasp meaning more reliably than most humans.

With the "understanding" debate fading, critics shifted focus: "LLMs cannot reason."

As recently as last year, Yann LeCun cited this as one of his core arguments against the mainstream GPT-style LLM paradigm (advocating instead for vision-based world models as the true path). Many relished pointing out flaws—like LLMs failing at elementary arithmetic such as multi-digit multiplication.

But this critique no longer holds. With the advent of reasoning models like OpenAI’s "o-series" and DeepSeek’s "r-series," accusations of "no reasoning ability" have collapsed. Hardliners may still dismiss probabilistic reasoning as unstable, lacking the rigor of symbolic logic. Yet users deploying these models for mathematics, coding, or project planning overwhelmingly report breakthroughs. Large-model reasoning now rivals or surpasses human experts, approaching master’s or doctoral proficiency. Coding capabilities already exceed those of average engineers. This is just the beginning. It is entirely plausible that within a year or two, reasoning models could dominate Olympiad-level math or competitive programming.

Once again, barriers were breached through internal innovation—this time after large-model pretraining neared its limits. The core framework remains unchanged, though: reinforcement learning still relies on NTP for chain-of-thought (CoT) generation; reasoning models remain probabilistic. Symbolic AI contributed nothing. Symbols remain confined to input/output interfaces—even the "inner monologue" of CoT manifests as output tokens.

The sheer creative potential within this paradigm is staggering. Those of us from symbolic AI backgrounds once naively imagined that when neural approaches hit walls, our logic-and-grammar toolkit would ride to the rescue. Hybrid neuro-symbolic fantasies danced in our minds.

Zooming out, modern large models evolved from earlier statistical frameworks, with neural networks as a tributary. When those statistical models hit ceilings, breakthroughs came from within—via deep learning. Symbolism played no role.

A profound question arises: Why has the theoretically appealing vision of hybrid neuro-symbolic synergy remained an impractical or unnecessary dream?

Two possibilities stand out.

First, the data-driven empiricist approach possesses far greater resilience and potential than we imagined.

This hints at deeper truths. Artificial neural networks, inspired by biological brains, had languished for decades until the deep learning revolution. Over the past decade, their human-like (or superhuman) performances have forced us to confront a possibility: perhaps this is indeed how intelligence works. If artificial systems achieve human-level cognition through mechanisms mirroring our own biology—despite neuroscientists’ caveats about our limited brain knowledge—this alignment would powerfully validate the neural paradigm. Symbolic logic and statistical feature engineering, by contrast, are alien to biological cognition. Their limitations may stem from this fundamental mismatch. One might even argue that high-dimensional vector spaces in LLMs—where multimodal signals are embedded within neural frameworks—encode a "language of God," or the essence of universal information. Symbols, then, are mere human-imposed constructs, sensory accommodations divorced from reality’s substrate.

Second, when a paradigm harbors untapped potential, progress demands sufficient talent density to exploit it.

AI uniquely concentrates genius. Countless brilliant minds flock to this field, creating an intellectual critical mass unmatched in most domains.

With these conditions in play, we must never underestimate the internal momentum to break through barriers. AGI (Artificial General Intelligence) believers, via their "insane" grind, keep delivering results. Could they indeed be AI’s ultimate Terminators?

 

Addendum: Even if symbolic logic is "reduced" to an auxiliary tool, it may retain an irreplaceable, complementary role in intelligence

yanyongxin:

What distinguishes humans from other animals is our evolved reasoning capacity. Though rooted in neurons, this ability represents a qualitative leap beyond mere "instinctive reactions." It abstracts object relationships, enabling multi-step reasoning that can be transmitted and memorized through linguistic symbol chains. Reasoning is inherently discrete—thus symbolizable—as a simulated system built atop neural architecture. This simulation likely requires structural differences in human neural systems compared to other animals.

The most striking contrast between reasoning systems and primal neural cognition lies in sustained deliberation. Unlike "muscle memory" or intuition, human reasoning varies dramatically. During my university years, I observed students who excelled at quick problem-solving yet froze when faced with complexity. Today's LLMs approximate the reasoning level of humanities undergraduates, but still lag behind trained STEM specialists—particularly in mathematics and physics. The essence of STEM disciplines lies in rigorously symbolizing real-world problems. Simulating such precision within biological neural systems demands rare opportunities (elite education), prolonged training, and specific neurostructural advantages ("talent"), all channeled through disciplined formalization. Yet achieving this capability bridges biology with mechanical rigor—enabling interfaces with tools like Mathematica.

This charts AI's next frontier: building superior logical simulation systems atop neural frameworks until seamless integration with pure symbolic tools is achieved. The brain's logical simulation system remains energy-intensive, error-prone, and costly to develop. Its key advantage? Seamless integration with underlying neural processes.

Li Wei: Well said.

Interfacing with symbolic systems manifests as tool use. For instance, when confronting complex math problems, instead of forcing probabilistic reasoning through natural-language chain-of-thought (CoT), LLMs should just generate code properly to invoke Mathematica. This tool-use capability is now defined as a fundamental trait of LLM-native agents—yet another innovation emerging from within the paradigm.

Thus, we see a clear evolutionary trajectory:

1. Traditional Statistical Models ("Artificial Idiocy"):
Failure: Little natural language understanding 
Solution: LLMs (e.g., ChatGPT)

2. Pretrained LLM: 
Failure: Lacking reasoning ability 
Solution: Reasoning-reinforced LLMs (e.g., OpenAI’s o1, DeepSeek’s r1)

3. Reasoning LLM:
Failure: Insufficient symbolic rigor 
Solution: LLM Agents (symbolic tool integration)

yanyongxin:

Traditional statistical models earned their "artificial idiocy" label because their parameter spaces and data structures proved inadequate to host the world models required for true language understanding.

 

【相关】

符号主义被打入冷宫太久了,难道神经是AI的终结者吗?

虽然没人可以预测未来,虽然抛弃两条路线的一条感觉政治不正确,但的确不能排除这种单极主义的可能性。

众所周知,AI自从诞生,就有符号理性主义和数据经验主义的两条路线竞争,以及此伏彼起来回震荡的历史轨迹,但自30多年前起,钟摆就再也没有回落到符号主义一边的迹象。

这是为什么?

看大模型的当代史(still on-going)很有意思。每一次遇到挑战和障碍,都是主流自己跨过去的。将来是不是还是如此,可以观望,但感觉大概率还会持续。

大模型刚问世,大家惊异于其表现,但很快就从它的简单的“下一词预测”(ntp,next token prediction)的目标函数以及它概率模型的统计特性,得出一个结论:这不过是高级的统计模型,大号鹦鹉而已,它并没有理解。

伊利亚和辛顿不得不站出来一再解释:不要小看 next token prediction,它绝不是以前统计时代的 ngrams,而是抽象了人类各种知识的理解系统。当下一词预测越来越准的时候,没有对上文的深入理解是不可想象的。这种解释并不容易说服人。后来伊利亚追溯到了 K氏复杂性理论,似乎有了拿得上台面的理论支撑。但这一套说法对于绝大多数受众,太过玄妙和高冷,甚至很多博士和教授,也一样是雾里看花,不得不半信半疑。但除此之外,其实也没有更好的理论解释。

最后解开这个质问心结(“大模型根本就没有真正理解,一切不过是统计”)的是无数人与大模型交互的切身体验。 自己的体验不会骗自己: 每一个玩过大模型的人现在都意识到,大模型真能听懂自己的话,无论你怎么说,无论你用什么语言,甚至你话语背后的机锋和细微之处,它也比多数人理解更到位。

好,理解的问题大家搁置一边了,现在很少人还质疑大模型理解能力了。但接着就是另一个问题: 大模型不会推理。

杨立昆去年还把这一条作为自己反对主流大模型路线的根本理由(并鼓吹他的以视觉为基础的世界模型才是真正的出路)。很多人曾津津乐道的是大模型不会多位数乘法这样的小学算术。

但这条理由现在不成立了。在主流业界推出了 OpenAI的o系列和 DeepSeek的r系列等推理模型的今天,这种大模型不懂推理的指责不攻自破。

极端批判派当然可以从概率模型的不稳定的角度,继续质疑其推理不是真正的推理,不具有符号推理的严谨性和确定性。但用过推理模型做数学、代码以及做项目计划的人,绝大多数不再被此困扰了,因为很明显,大模型推理正在超越人类,现在已经修炼到硕士或博士的水平,代码能力也已经超过了大多数码农。而这只是推理模型刚刚起步上线的阶段,再给一两年,推理模型成为奥数冠军或代码冠军,都不是不可想象的事儿。

Again,跳过这个障碍,仍然来自内部,是大模型预训练几乎撞墙后的又一次内部创新。基本性质和基础未变,强化学习还是靠 ntp 去生成思维链(cot,chain-of-thought),推理模型仍然是概率模型。符号AI并没有帮任何忙。符号的作用仍然局限在 input 和 output 两端,连所谓内心独白的思维过程 cot,也是以 output 形式表现的。

这种领域流派内部的创造性和潜力,让人瞠目结舌。以前我们出身符号流派的人老一厢情愿地想着,AI主流遭遇困难了,我们逻辑派、文法派可以出手帮忙了。各种结合符号AI与神经AI的幻想,在我们的心中激荡。

如果把回顾AI历史的眼光拉得更远,我们可以看到现代大模型的前身是当年的各种统计模型,而神经基本是其中的一个支流。当年的那些统计模型遭遇种种瓶颈或天花板,最后还是在内部被深度学习突破了,从来没有符号主义任何事儿。

一个值得深思的问题是:为什么两条路线融合互补的理论美好,一直都是不可实现或不必实现的梦想呢。

可能说明了两件事。

第一是这种数据经验主义的路线,的确具有极大的韧性和潜力,远超我们想象。

这种潜力应该有更深层的原因。众所周知,人工神经网络是受到人脑神经启发而来的一种模拟,停滞多年后爆发了深度学习革命。十几年前的这场革命给我们带来了一次又一次的震撼,这些亮丽的拟人或超人的智能表现,使我们不禁在想,人脑神经也许就是这么工作的,具有同样的原理。人工智能达到或超越人类智能,机理相同应该是最有力的一个支持(虽然脑科学总有人说,我们对大脑知之甚少,貌似有一种把大脑神经的工作机理神秘化的倾向)。其他的一切手段,例如符号逻辑的公式和算法、统计模型的 feature engineering,都不是从真正意义上模拟人脑神经的工作方式,也许这就是它们注定走不远的症结所在。甚至可以说,神经框架内的各模态信号编码嵌入后的高维向量空间的确就是上帝的语言,或宇宙信息的本质,而符号不过是迁就人类感官的表象而已,或者说是人类强加给世界的一种体系。

第二,当一种路线具有超出我们想象的潜力的时候,科技进步所需要的就是足够人才密度去挖掘这种潜力。

我们知道,AI 领域具有这种聚集天才的优势。无数年轻才俊、聪明的大脑被吸引到这个领域,其天才密度超过了大多数领域。

在这样两个条件都具备的时候,我们永远不要低估它突破各种瓶颈的内部力量。那些AGI(Artificial General Intelligence,通用人工智能)的信仰者们一直在疯狂内卷,同时也的确不断在交出自己的成绩单。他们可能是AI的终结者吗?

 

【外一篇:符号逻辑“沦落”为辅助工具还是有自己不可取代的智能补足作用的】

yanyongxin:人之所以区别于其他动物,是因为进化出了推理能力。这种推理能力虽然是建立在神经元上的,但跟单纯的“本能反应”有了质的飞跃。它有了对象关系的抽象,从而可以进行多步骤推理,从而可以以对象关系链——语言的形式传播和记忆。推理本质上是一个离散过程,因此是可以符号化的,是一个在神经元体系上建立出来的模拟系统。但这种模拟的可能,很可能要求神经元系统与其他动物有结构上的差异。

推理系统与原始神经元思维的最显著差异,就是可以长时间思考。而不是“肌肉记忆”、“直觉”。不同人之间的差异很大。大学期间,注意到有些学生遇到简单问题时反应很快,面对复杂问题时,给他多少时间也没用。现在的LLM大致达到了文科生的推理水平。但还与训练有素的理科生,尤其是数学、物理的理科生有差距。数理专业的特征,就是将现实世界的问题严格的符号逻辑化。要在人的神经元系统上模拟出如此严谨的符号逻辑系统是需要特殊的机会(好大学,好老师)、长期的训练、和特殊的脑结构(“天赋”)的,要遵循良好的规则化习惯。但达到这个能力,就可以与机械化的严谨逻辑系统接口了,可以使用 Mathematica 之类了。

这就是人工智能下一步的走向:在神经元上建立更好的逻辑模拟系统,直到能与纯符号逻辑工具无缝链接。人脑中逻辑模拟系统是一个高能耗、高错误率、高构建成本的系统。它的一大好处,是能够跟底层神经元系统无缝联结。

立委:说得很好。

与符号逻辑系统接口就是 tool use,例如,遇到复杂数学问题,不要用自然语言的思维链(cot)去强行推理来影响概率性推理结论,而是LLM自己写代码去 call Mathematica。现在这种 tool use 的能力被定义为 LLM-native agent 的根本特性(之一)。所以 这种能力也还是 from within。

所以 我们看到了清晰的演进路线:

1. 传统统计模型("人工智障"):不理解自然语言
solution: LLM,e.g. ChatGPT

2. pretrained LLM:缺乏推理能力
solution: reasoning LLM,e.g. o1 / r1

3. reasoning LLM:缺乏严谨的符号逻辑能力
solution: LLM agent (interfacing symbolic tools)

yanyongxin: 传统统计模型之所以人工智障 不理解自然语言,是因为没有足够大的参量空间和足够复杂的数据结构,来容纳理解语言所必须的世界模型。

 

 

 

【相关】

 

 

Transformer 和注意力机制简介

transformer架构及其注意力机制是主流GPT大模型的基础,特别特别重要。虽然这方面的科普和讲解已经汗牛充栋,但还是有很多朋友跟我说一头雾水或雾里看花。所以下决心写了三篇科普系列,试图做出我的理解贡献。

作为对主流AI有好奇心的同学,你可能早就听说过大名鼎鼎的大模型的基本框架 transformer及其“注意力机制”,觉得它们是高深莫测的概念。 你也许也读过那篇经典论文 Attention is all you need,但还是感觉云里雾里。别担心,这很正常,我们多数人都经过这个阶段!这篇论文确实有点“烧脑”,但它的核心逻辑其实并不复杂。

要理解AI大模型的Transformer架构,就需要拆解其工作流程。

首先应该了解一下大模型的工作原理和训练方式。

基础大模型通过原始大数据的“自监督学习”(self-supervised learning),利用多层神经网络,获得数据相关的知识。自监督学习是一种特别的监督学习,它利用“掩码”获得监督信号。我们知道,监督学习的训练数据标注了输出目标作为监督信号,但自监督无需人类标注,而是在数据中遮盖了部分数据点,让系统学习预测它(“填空”或“接龙”),以被遮盖的数据点作为标准答案和监督信号。主流的GPT大模型的掩码就是遮盖住下一个词,让系统仅仅根据上文来预测它(叫 next token prediction),这是当前生成式AI的主流模型。

当我们输入一段文字时,模型首先要做的就是把它切分成一个个基本单位(词元),然后给每个词元找到它的"字典释义"(向量表示)。

从输入到输出的全过程

1. 从“查词典”开始:Tokenization 和 Embedding

要理解整个输入文本,首先需要拆分基本单元,叫做 tokenization(分词),即,将文本拆解成序列 tokens(词元,文本的最小单位),这些 tokens 可能是完整词(如"work")或子词(subword,如"un+believ+able")。

词元是符号,而计算机对符号难以计算,它只玩得转数字,所以需要把词元转成数字。

每个 token 都会通过查一种嵌入(embedding)词典,把词元符号转化成一个数字化表示:多维向量。每个Token被转换为300-1024维的向量(想象给每个词建立很多概念维度的特征表示,例如:名词,单数,机构,金融,......)。Embedding 让词语有了可计算的语义空间关系。

多维向量好比一个“意义”空间,每个token的多维向量定义了这个token在意义空间的位置;token与其他tokens在不同维度的距离,就是它们在意义上的区分。这符合我们的常识:一个词的意义可以在与其他词的比较中显现。

这些向量不是随机生成的,而是通过海量语料训练出来的数字化表示,提供了词元的基本语义信息,即词元在意义空间的位置——例如"银行"的向量天然接近"金钱",而与"树木"相距甚远。再如"苹果"这个词的向量,可能会包含"水果"、"科技公司"等多个方面的信息。

想象一下,你要让计算机理解一句话:“The cat sat on the mat”。

第一步:分词(Tokenization),先把这句话拆成一个个的 tokens:The+cat+sat+on+the+mat 。

第二步:查字典(Embedding), 给每个 token 找一个数字化表示,也就是一个多维向量。

“cat” -> [0.1, 0.5, -0.2, ...]
“sat” -> [-0.3, 0.8, 0.1, ...]
...
(注:实际向量维度更高且值为连续分布)

简单来说

Tokenization 将文本拆解成计算机容易处理分析的最小单位 token。
Embedding 把这些 token 转换成计算机容易运算组合的向量。

关键点: 嵌入词典得到的向量只是 token 的“初始意义表示”,它还没考虑这个token的具体语境。在向量表示中解码上下文意义是下面步骤的任务,用的就是transformer架构中的多层神经网络+注意力机制。

Transformer 的核心模块可以拆解为两部分:

    1. 注意力机制:用于计算 token 之间的相关性,并动态更新 token 的表示。
    2. 神经网络:用于处理 token 之间的信息转换。

整个 Transformer 由多个这样的模块堆叠而成,每一层都会重新计算 token 的表示,使得理解越来越深。

2. 注意力登场:根据上下文更新词义

现在,我们有了一串向量,每个向量代表一个 token 的“初始含义”。但问题来了,同一个词在不同语境下可能有不同的意思啊!比如,“bank” 可以是“银行”,也可以是“河岸”。

Transformer 架构的核心是 注意力机制(self-attention),其作用就是:根据上下文,动态地调整每个 token 的含义表示,反映与其他token的关系。

打个比方:在"我喜欢吃苹果"这句话里,"苹果"和"吃"的相关度很高,所以模型会更多地参考"吃"这个词来更新"苹果"的含义,从而确定这里的"苹果"指的是水果而不是公司。

怎么做呢?

模型通过QKV注意力计算每个词元与其他词元的注意力权重:
- Query:当前词元的特征向量(如"他")
- Key:上下文词元的特征向量(如"警察","目击者")
- Value:上下文词元的内容向量,即建立关联后实际被加权融合的信息

    • 例如,通过矩阵运算,发现"他"与"目击者"关联度最高,于是更新"他"的向量,使其携带"目击者"的信息。

计算“相关度”: 对于每个 token,我们都要计算它和句子中 所有 其他 token 的“相关度”,给不同的词元分配不同的"注意力权重"(attention scores)。这个“相关度”可以理解为:在理解当前这个 token 的含义时,其他 token 有多重要。

    • 例如,在理解 "sat" 这个词时,"cat" 和 "mat" 显然比 "the" 更重要。

加权平均: 根据计算出的“相关度”(也就是词元的权重),把上下文中所有 token 的V向量 加权平均 起来,得到本token的一个新的向量表示。这个新的向量就是当前 token 在 这个特定句子 中的含义表示。

    • 比如,"sat" 的新向量会更多地受到 "cat" 和 "mat" 向量的影响,而较少受到 "the" 向量的影响。

关键点: 注意力机制通过计算 token 之间的相关度,实现了对每个 token 含义的 动态更新。这种更新是 基于上下文 的,同一个 token 在不同的句子中会有不同的表示。

这样,每个 token 的意义不再是固定的,而是会根据整个句子的上下文动态变化。例如,在 "I saw a bat" 这句话中,"bat" 可能是 "蝙蝠",也可能是 "球棒",但注意力机制会结合上下文来推测其在上下文中更合适的含义。

关于注意力机制中QKV如何分工和工作的细节,可参照姊妹篇《立委科普:如何理解自注意力机制中的QKV分工?》。

3. Transformer 主干:多层递进的信息压缩

Transformer 的核心组块可以拆解为两大部分:

    • 多头注意力层:用于计算 token 之间的相关性,并动态更新 token 的表示。
    • 前馈神经网络层:非线性特征转换,进一步压缩信息(抽象、泛化)

整个 Transformer 由多个这样的模块堆叠而成,每一层都会重新计算 token 的表示,使得理解越来越深。根据组块的多寡,Transformer 会反复进行这个更新过程。就像人类理解文章时会反复琢磨一样,每一层都在加深对文本的理解。越深的层次可能捕获到越复杂的语义关系。

每个Transformer组块都在迭代升级认知,例如:
- 底层组块:捕捉局部语法(如"not...but..."的转折关系)
- 中层:理解"他指代的真实人物"
- 高层:把握全文主旨
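为直观起见,这里附一个多头注意力的极简 Python 草图(权重全部是随机占位,省略了 LayerNorm、causal mask 等细节),仅用来说明"多头"无非是把向量切成若干子空间、各自做注意力、再拼接回来:

```python
import numpy as np

# 多头注意力的极简示意(非真实实现):每个"头"在一个低维子空间里独立做注意力,
# 让不同的头关注不同类型的关系(句法的、语义的、远距离的......),最后拼接结果。

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads):
    n_tokens, dim = X.shape
    head_dim = dim // n_heads
    outputs = []
    for _ in range(n_heads):                              # 每个头一套独立的 Q/K/V 投影(此处随机占位)
        W_q, W_k, W_v = (rng.normal(size=(dim, head_dim)) * 0.1 for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        outputs.append(softmax(Q @ K.T / np.sqrt(head_dim)) @ V)
    return np.concatenate(outputs, axis=-1)               # 拼接各头结果,维度回到 dim

X = rng.normal(size=(6, 16))                              # 6 个 token,16 维向量
print(multi_head_attention(X, n_heads=4).shape)           # (6, 16)
```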

Transformer的最大特点
1. 并行计算:词序与token处理解耦,并行处理所有token(对比此前RNN的线性低效)
2. 层次化理解:从字面含义到深层意图的渐进式解读,捕捉大大小小的规律性。

与前Transformer的RNN相比,架构优势的工程化体现:

| 特性 | RNN | Transformer |
|------|-----|-------------|
| 并行计算 | 序列依赖,无法并行 | 全token并行处理 |
| 长程依赖处理 | 存在梯度衰减 | 直接全局注意力 |
| 训练效率 | O(n) 时间复杂度 | O(1) 层内时间复杂度 |
| 内存消耗 | 相对较低 | 随序列长度平方增长 |

4. Output:模型的最终预测

Transformer 模型可以用于各种各样的任务。不同的任务,输出(output)的形式也不同。

    • GPT:预测下一个词(Next Token Prediction) 对于主流 GPT ,其最终的任务是预测下文,通过所谓“自回归”下一词元预测实现(自回归就是动态扩展上文,递归实现一个词一个词的接龙)。模型会根据已经深入理解的上下文,来决定接下来最合理的内容应该是什么。这一路打开了通用AI的路,原理是序列学习学到了一种通用任务的输入转输出的“密码”,但这是另一篇科普的内容了。

5. 总结

    • Tokenization 和 Embedding 给计算机理解文本打下基础,好比查了词典。
    • 注意力机制 计算 token 之间的相关性,并动态更新 token 表示。
    • Transformer 由神经网络层 + 注意力层组成,层层优化 token 表示,涵盖不同层次的各种关系。
    • 最终的 output 取决于任务,翻译模型是生成目标语言文本。GPT 负责预测下一个 token,最终发现这个简单的预测机制自然进化成解锁了各种任务的通用大模型。

 

 

【相关】

True Story Behind DeepSeek's Success: AI Learning to Think Slowly Without Human Supervision

*Edited transcript from InfoQ's second DeepSeek series livestream featuring Dr. Wei Li, former VP of Engineering at Mobvoi's Large Language Model team, discussing R1 Zero's innovative contribution to democratizing reasoning models.*

DeepSeek's Greatest Achievement: Making Everything Transparent

InfoQ: "DeepSeek adheres to a pure reinforcement learning approach, but the industry often refers to RL as 'alchemy' - how did they make this process controllable and accessible? What's innovative about their reasoning paradigm?"

Dr. Li: The reinforcement learning for reasoning models has long been an industry challenge. About six months ago, when Ilya and others declared the end of the pre-training era, it signaled that simply scaling up pre-trained models was no longer sufficient for performance improvements. The delayed release of GPT-5 is another indicator of pre-training's decline. As a result, the industry began seeking new growth paths, with on-the-fly reasoning models gaining momentum among leading teams until OpenAI released O1, the world's first reasoning large language model. DeepSeek's R1 then followed with its breakthrough success.

From the mysterious Q-Star project (reportedly causing dramatic internal conflicts at OpenAI) to the release of O1, reasoning models have been widely recognized as a new paradigm in AI. The core of this paradigm is enabling models' "slow thinking" capability, or System 2 as it is called, using reinforcement learning to enhance model intelligence in complex tasks. However, all of this was closed-source. OpenAI even deliberately created mystique around their chain-of-thought content. Apart from a few top players like Google and Anthropic quietly exploring and tracking this field, other teams knew very little about it.

DeepSeek's greatest achievement lies in making everything about LLMs transparent. They open-sourced their models and detailed technical papers, and weren't afraid to expose their chains of thought (CoTs) in the system. Through pure reinforcement learning, they proved that even without process control data, result-based control alone could achieve top-tier reasoning model performance. This breakthrough was like piercing through a paper window, showing the industry a feasible path to democratizing reinforcement learning.

InfoQ: The innovation in reasoning paradigm sounds abstract. Could you provide an example?

Dr. Li: R1's paper is outstanding, arguably one of the finest in the large model field. It consists of two parts: one focusing on Zero research, which presents remarkable achievements in pure reinforcement learning for reasoning; the other detailing the practical R1 system, a top-tier production reasoning model. For R1's development, they considered practicality, balancing comprehensive performance, safety, and various practical considerations, detailing a four-stage training pipeline as best practice to help other teams understand and replicate their success.

The most brilliant part is the Zero research. Zero proved a revolutionary point: contrary to traditional beliefs (or OpenAI's implied stance that reasoning requires step-by-step supervision), process supervision isn't actually necessary. Using only the final result against the "gold standard" as a supervision signal is sufficient to train the "slow thinking" process required for reasoning models.

This is Zero's greatest highlight and the origin of its name - it draws inspiration from AlphaZero's spirit. AlphaZero historically pioneered complete independence from human game records or experience, achieving zero human supervision reinforcement learning through self-play generated process data (state+move+score triplets). Similarly, DeepSeek's Zero research demonstrates that in reasoning tasks, models can autonomously generate internal process data - Chain of Thought (CoT) sequences - without human annotation.

Specifically, reasoning models initially focused on mathematics and coding because these domains have standard answers. Macroscopically, this is typical end-to-end supervised learning, as both input (math/coding problems) and output (answers/execution results) are fixed and known. However, the process from input to output is highly complex with significant information gaps, requiring a CoT bridge. Just as humans need to break down problems and think step by step when facing difficulties, models need this process too. DeepSeek's research found that models possess the ability to learn this deep thinking process autonomously if given sufficient time and space.

InfoQ: Dynamic reasoning paths sound like AI "drawing mind maps" - but how do you prevent it from going off track? Like suddenly writing poetry while coding?

Dr. Li: Based on current evidence, this possibility is virtually non-existent or negligibly low. Before DeepSeek published their results and research details, many were puzzled about this point: wouldn't deep thinking go haywire with only result supervision and no process supervision? Without large-scale reinforcement learning experiments, this was indeed a significant concern. It's like flying a kite - you're holding just one string while letting it soar freely, worried it might nosedive.

These concerns proved unnecessary. The reason it doesn't go off track is that all this reasoning reinforcement learning, including self-generated reasoning CoTs, is built upon existing top-tier models (like V3). These models have already mastered coherent expression through massive data learning. This coherence implies orderliness, which, while not equivalent to pure logic, prevents completely unreasonable deviations. It is observed that fluent human speech typically reflects organized thinking.

InfoQ: On another note, compared to OpenAI's O1, DeepSeek R1 has another notable highlight in applying reasoning CoTs to language generation and style imitation. Could you elaborate on this?

Dr. Li: When O1 was released, everyone knew it demonstrated significant improvements in mathematics and coding abilities, as standard tests revealed higher performance levels. What people didn't realize was that this reasoning ability, or "slow thinking" capability, excels not only in domains requiring strict logical reasoning but can also shine in traditional language tasks.

By nature, language ability has been a strength of large models - everyone knows they generate very fluent text, more native than natives. By the time we reached models like 4o or V3, their writing was already quite smooth, seemingly leaving little room for improvement. However, when asked to write classical poetry or imitate Lu Xun's writing style, previous models fell short. R1 solved these challenges. From a social impact perspective, this is actually quite remarkable and particularly noticeable.

Honestly, not many people are deeply concerned about mathematics or coding, although we know coding is a major direction for the coming years and automated programming can change the world. Everything in IT ultimately comes down to software; the digital world is built on software. If software development can transition from manual coding to model-assisted or even model-autonomous programming, this will greatly increase productivity. While this is visible to all, it's not as intuitive for ordinary people who more often face tasks like writing compelling articles.

When R1's humanities capabilities were discovered, not just geeks or software developers saw the benefits of reasoning models - ordinary people were excited too. Suddenly, anyone could claim to be a poet, writer, advisor or philosopher - the impact was tremendous. This wasn't felt with o1, perhaps because OpenAI didn't realize or at least didn't focus on this aspect of reasoning models. But while working on code and mathematical reasoning, DeepSeek must have internally realized that this "slow thinking" mechanism could also significantly improve writing abilities, especially in classical Chinese.

Everyone knows Chinese data isn't as rich as English data, so while previous models could write beautiful English poetry, they struggled with Tang poetry. This might be because Chinese data was insufficient in quantity or quality, preventing models from learning adequately. We always felt this was unfortunate - models would sometimes rhyme correctly, sometimes not, sometimes add or miss characters, not to mention tonal patterns to follow. DeepSeek clearly put effort into this area; their data quality must be significantly higher than industry standards.  More significantly, they know how to transfer the CoT ability from science and technology to language and literature.

InfoQ: If you were to recommend a DeepSeek module most worth replicating for programmers, which would it be? Like those "Aha moments" claiming to replicate R1 for tens of dollars?

Dr. Li: If I were to recommend a DeepSeek module most worth replicating for the programming community, it would be the Zero research-related components. This replication isn't about achieving comprehensive capabilities but rather verifying Zero research's key revelation - that machines can indeed autonomously learn. This is what OpenAI kept under wraps; perhaps they had figured it out earlier but chose not to disclose it.

Now, we've seen quite a number of different teams claim to have reproduced R1's reflective capabilities with minimal resources. This isn't just an interesting experiment; more crucially, it marks the democratization of reasoning models. Previously, people didn't understand how reasoning models worked, only knowing that vast amounts of process data were needed for models to learn slow thinking. This was considered an almost insurmountable barrier because process data is hard to obtain, and reinforcement learning's instability and high data requirements confused and challenged many programmers.

But now, we know we can bypass this most difficult process data requirement and reproduce this "Aha moment" with limited resources, proving that slow-thinking capabilities can be learned autonomously by models. Based on this premise, if you're a domain expert, you might wonder: could these techniques achieve significant improvements in your field? This is entirely possible. Even the most powerful models (like V3 or 4o) only achieve 60-70% accuracy in specific scenarios without optimization, and experience tells us that without at least 80-85% accuracy, you can't launch a truly valuable system in real-life applications.

That is to say, between a large model's out-of-box results and actual valuable application deployment, there's a gap. Previously, our only method was collecting domain data for fine-tuning (SFT). Now, we have another path RL: following the reasoning model approach, letting systems fully utilize slow thinking capabilities during the reasoning phase to improve data quality to acceptable or even exceptional levels. This path seems to have been opened.

However, my programmer friends tell me that in their comparison experiments between fine-tuning (SFT) and DeepSeek-style reinforcement learning (RL), while RL indeed outperforms SFT, the computational cost for RL training is still far higher than SFT. The superior performance makes sense because SFT data is always very limited, while successfully reinforced RL self-generated data can far exceed SFT data volume.

InfoQ: Some say large models represent "brute force aesthetics," but OpenAI's former Chief Scientist and co-founder Ilya says pre-training has reached its limit. How do you view this? Is the emergence of reasoning models just adding another scaling law to brute force aesthetics?

Dr. Li: This is more about a shift in technical focus and a paradigm change in technical innovation. Large models involve three major components: first, pre-training, which builds foundational capabilities by learning basic patterns from massive data; second, post-training, initially mainly fine-tuning - OpenAI early on used some reinforcement learning (like RLHF) for human preference alignment, but by Meta's time, they even abandoned typical PPO style RLHF for simpler DPO, as they, like many others, struggled with it. Finally, there's the reasoning phase, where models interact with users in real time after deployment.

The current situation with high-quality natural data is that pre-training has nearly exhausted all available quality resources. The industry began to notice data growth challenges, making performance improvements increasingly difficult. GPT-5's delayed release, reportedly yielding limited returns despite massive computational investment, suggests pre-training may have indeed hit a wall.

This led the industry to explore alternative AI growth curves. Reinforcement learning-based reasoning models emerged at center stage in this context: pure reinforcement learning should be added to post-training. Previous reinforcement learning relied on human preferences, but this time it's about giving models more thinking time before reaching answers, learning underlying chain of thought (CoT). While V3 was already doing well, it didn't cause as much social sensation until R1 appeared. DeepSeek truly broke through after the Chinese New Year, becoming the most discussed public topic and causing excitement and shock overseas. R1 and O1 represent a new paradigm. Before R1, only OpenAI's O1 existed as a reasoning model, seemingly unreachably advanced, with would-be-followers unsure how to follow. However, R1 not only reproduced O1's capabilities but did so with greater transparency and clarity. This contrast further highlighted R1's importance as an open-source model leader.

InfoQ: At first glance, DeepSeek seems like an engineering masterpiece. Why did it cause such global sensation? Its user acquisition speed (100 million in a week) surpassed ChatGPT's nuclear moment? What's its historical significance?

Dr. Li: From my personal experience and observation, ChatGPT's explosion was a landmark event in large model development. Research insiders were following large models before ChatGPT, at least since GPT-3. When GPT-3's Playground appeared, we were already immersed in it, sensing an approaching storm. But from society's perspective, ChatGPT truly shocked everyone, exceeding all expectations, like an AI nuclear explosion.

I believe R1's emergence is the second major shock after ChatGPT. Of course, between ChatGPT and R1, other influential models appeared, like 4o - another remarkable milestone. While ChatGPT 3.5 was already so impressive, 4o proved it could be even better. Then came Sora, bringing shock with video capabilities in multi-modal LLMs. I personally also greatly appreciate Suno, the music model, making me feel like I could become a musician overnight.

If I were to rank them, R1's impact is second only to ChatGPT, perhaps even exceeding 4o and Sora's sensational effects. R1's impact feels similar to ChatGPT's initial appearance, creating the same addiction. While ChatGPT was groundbreaking and R1 a follower, albeit with innovative highlights sometimes surpassing previous models (like in classical poetry and style imitation), achieving such global impact as a follower is truly miraculous.

In terms of practical effects, R1's productization was amazingly successful. Gaining hundreds of millions of users in a week, it far broke ChatGPT's record and elevated society's AI awareness. Furthermore, regarding geopolitical influences on technology access, many domestic users had long desired access to the world's most advanced models like GPT series, Claude, or Gemini but often couldn't reach them. R1's appearance eliminated these concerns about domestic and international restrictions, contributing to its rapid global popularization.

InfoQ: What's your vision of AI programming's ultimate form? Is it programmers telling AI "make me a TikTok," and it outputs deployable code and operations plans?

Dr. Li: There are always two types of people: skeptics and optimists. People like Ilya believe Artificial General Intelligence (AGI) is imminent and Artificial Super Intelligence (ASI) isn't far away, so the biggest concern now, according to him, is ensuring superintelligence safety.

Anthropic's CEO Dario predicts that within 3-5 years, large models will achieve real breakthroughs - not just the current impressive demonstrations, but revolutionary changes in societal productivity. Fundamentally, they're talking about AI's ability to scale replacement of both physical and intellectual human labor.

However, while large models are buzzing now, their practical applications haven't reached the level of the previous generation's mobile platforms. Previous super apps like Meituan, Didi, Xiaohongshu, and TikTok transformed major aspects of our daily lives, from basic necessities to communication and entertainment, maximally shortening the distance between suppliers and customers - value everyone of us feels daily. While playing with large models is interesting, their practical value at the lifestyle level isn't yet obvious;  at best we're still on the verge of the coming AI application explosion.

Notably, DeepSeek's emergence has lowered large model application barriers, paving the way for scaled applications, though we haven't yet entered the era of true application explosion.

What will it look like when AI applications truly explode? I believe the ultimate goal, by nature of AI, is for LLMs to comprehensively replace humans in both intellectual and physical labor. Signs of large models impacting white-collar workers are already undoubtedly evident, with even programmers not exempt. In physical labor, embodied intelligence is developing rapidly, with both humanoid robots and mechanical hands gradually replacing human physical work.

Of course, this brings side effects, like massive job displacement. How society adapts to this state of greatly expanded productivity is another discussion topic. Looking at AI's nature and ultimate goals, AI development could have two milestones: first, when it can replace 50% of human work, allowing half of society to maintain a decent, free life through social programs perhaps like Universal Basic Income (UBI) - this might mark the arrival of AGI (Artificial General Intelligence); second, when it replaces 90% of human work, possibly signifying the emergence of ASI (Artificial Super Intelligence) - a kind of technological utopia (or 'communism') in some sense.

These two milestones are my own verifiable definitions of AGI and ASI. I do not agree with the idea that while old jobs are replaced, more new jobs will be created by AI. It just does not make sense, as any new jobs would also involve human labor, destined to be replaced soon by super intelligence even if they do emerge for the time being.

This vision of AI's future development shows how DeepSeek's innovations in reasoning models might be just the beginning of a much larger transformation in how we think about work, society, and human potential in an AI-driven world.

 

 

【相关】

Does the New Reasoning Paradigm (Query+CoT+Answer) Support a New Scaling Law?

— Reflections on LLM Scaling Laws and DeepSeek's R1

My friend Zhang Junlin's article "Looking at the Future of Scaling Laws through DeepSeek R1" has sparked interesting discussions among peers.

Core Insights from Initial Discussions

Professor Bai summarised the key highlights as follows:

Infinite stacking won't lead to infinite growth (physical laws don't support this)

Only S-shaped growth is possible, with diminishing returns inevitably appearing

The initial emergence of language capabilities relates to the density of linguistic knowledge in training data

The next growth phase represents a second S-curve, driven by common sense knowledge, which requires more computing power due to lower knowledge density

The third phase involves learning logical reasoning (Chain of Thought), where natural data has even lower density of such knowledge. Brute-force mining with computing power becomes inefficient, making reinforcement learning with synthetic data a more rational approach

As Dr. Lu points out: The term "Scaling Law" is becoming overloaded. While S-curves (nonlinear curves characterized by sigmoid functions) can describe technology adoption lifecycles, they typically occur in succession (one technology hits its ceiling, making way for another). Large language models' multiple "Scaling Laws" confirm this pattern, with some overlap between Test-Time and Post-Training "Scaling Laws".

The Nature of LLM Scaling

Let's examine the fundamental logic behind LLM scaling. First, it's crucial to understand that LLMs are not databases - they don't aim to memorize long-tail data details. Large model training essentially compresses big data, or more precisely, compresses the knowledge systems behind the data (including common sense and encyclopedic knowledge), focusing on capturing patterns and regularities at all scales (what we call generalizations).

Conventional intuition suggests that as data scale increases, redundancy increases too. Regardless of filtering, cleaning, and deduplication, growing redundancy seems to imply diminishing returns. So why do large models still appear "hungry" even at the unprecedented scale of hundreds of billions of tokens? Why does the scaling law remain effective from hundreds of billions to trillions of tokens?

The key lies in LLMs being sequence learning and sequence decoding systems. While sequences are one-dimensional, the patterns and regularities behind them are high-dimensional. For instance, even a simple sequence like "cat chases mouse" potentially involves multiple knowledge dimensions: species relationships, predatory behavior, spatial movement, actor-patient roles, etc. This multi-dimensional knowledge naturally leads to combinatorial explosion at the sequence level as information is flattened in language. The insatiable "appetite" for big data effectively addresses this combinatorial explosion. As long as there isn't complete information redundancy, additional diverse sequences will help models abstract data patterns more precisely.

The Two vs. Three S-curves Debate

Zhang Junlin observes that since OpenAI's O1, two other phases have gained recognition with their own Scaling Laws: the reinforcement learning Scaling Law (RL Scaling Law) for post-training, and the Inference Scaling Law (also called Test Time Scaling Law).

This raises a crucial question: Are there really three S-curves, or just two? How comparable is the reasoning model's S-curve to the pre-training S-curve?

While theoretically we can identify three phases:

Pre-training
Post-training (especially reasoning-focused reinforcement learning)
Inference phase

In practice, post-training and inference phases likely share a single S-curve; there aren't two independent growth curves.

DeepSeek R1's Insights: The Truth About "Slow Thinking"

Consider DeepSeek R1: users can activate "deepthink" mode to enable Chain-of-Thought (CoT) reasoning, but they can't actually control reasoning quality by increasing computation time. Why is this?

Let's examine a concrete example. When R1 solves a complex mathematical problem:

Traditional models might directly answer: "The result is 42"

R1 shows detailed reasoning: "Let's think step by step: 1) First consider... 2) Then we can... 3) Finally, we get 42"

While R1's response appears to demonstrate "slow thinking" (CoT), this reasoning process actually reflects a generation pattern fixed during training, not dynamic exploration of multiple potential reasoning paths during response time. In other words, CoT+answer might look like "slow thinking," but it doesn't fundamentally change the unidirectional next-token prediction paradigm. R1's CoT+answer creates an illusion of slow thinking, but the generative nature remains fundamentally the GPT "fast thinking" paradigm. At test time, unlike AlphaGo, the depth and scale of thinking isn't dynamically explored, though beam search, if applied, can provide implicit multi-path optimization internally.

Test Time Compute Constraints

The industry's buzz word "test time compute" refers to reasoning models requiring more online computational resources compared to traditional non-reasoning models. For example, R1 with CoT enabled might need several times more computation time than its base model V3 for the same problem. However, this increased computation results from behavior patterns acquired during training, not dynamically adjustable compute investment. Without controllable scalability in test time compute, we can't really talk about a test time scaling law.

A major difference between pre-training and CoT reinforcement learning lies here: pre-training scaling laws can remain stable long-term because once training completes, it doesn't significantly impact online response time - the generation mode remains a simple query+answer. Therefore, offline training for months is acceptable if the resulting model shows significant capability improvements. However, reasoning models' post-training CoT reinforcement learning differs - it cultivates models' habits of responding with slow thinking, changing the generation mode to query+CoT+answer. Extending the CoT isn't just about the cost of training resources and time; more critically, it reflects in extended test time compute for each query during deployment, severely delaying system response time. Users generally have limited tolerance for slow thinking computation time and delays during online system use.

The Sustainability Debate

OpenAI's Sam Altman and Anthropic's Dario might argue that for extremely complex problems (like proving the Riemann hypothesis or designing next-generation aerospace vehicles), even if a model needs a week of computation time, it's still a massive improvement over human teams requiring decades. However, this argument has two issues:

LLM feasibility for such super-complex problems remains far from validated

Extreme scenarios lack universality and can't serve as data points for sustainable scaling laws

This isn't to deny S-curves as effective models for describing scaling laws, nor to reject the rationality of S-curve stacking. The combination of pre-training and post-training growth curves (s1 and s2) might indeed reflect the overall relationship between resource investment and performance improvement. However, we should carefully examine whether CoT reasoning truly opens a sustainable scaling curve.

Conclusion: How Far Is the LLM Road to AGI?

If reasoning models' scaling laws lack sustainability, this raises a deeper question: Can we reach the promised land of Artificial General Intelligence (AGI) through these two scaling laws alone? Furthermore, is the technical ideal of Artificial Super Intelligence (ASI) - AI replacing human labor and dramatically improving productivity - truly feasible?

Current evidence suggests that while pre-training scaling laws have shown considerable sustainability, reasoning models' scaling laws may quickly hit practical constraints. This reminds us that the path to AGI/ASI likely requires more innovative breakthroughs, not just simple extrapolation of existing methods. In the next phase of artificial intelligence development, we might need to discover entirely new growth curves.

[#LLMs #ArtificialIntelligence #DeepLearning #AGI #ScalingLaws #MachineLearning]

 

【相关】

张俊林:从Deepseek R1看Scaling Law

DeepSeek 笔记:推理新范式 query+cot+answer 支持新的 scaling law 吗?

LLM的"大就是好"还能走多远?

——关于Scaling Law的一些思考

 

老友张俊林《从Deepseek R1看Scaling Law的未来》一文,引起老友热议。

白老师的推荐抽提是:

核心观点:

——无限堆叠不会无限增长(物理世界规律也不支持),只有S型增长,一个S曲线一定会出现边际效益递减。

初期语言能力的涌现,与语料数据中包含的语言知识密度有关。

接下来的增长实际上是第二根S型曲线,更多语料贡献的是常识性知识,常识知识密度不及语言知识密度,所以要更大算力才能涌现。

再接下来是逻辑知识(思维链)的学习。自然语料中逻辑知识密度更低,用算力野蛮淘金,吃力不讨好。所以,用逻辑知识密度更高的合成数据做强化学习,才能让第三个S曲线爬坡。这就顺理成章了。

鲁总评论说:Scaling Law 这个词现在有点滥用。S 曲线(Sigmoid函数刻画的非线性曲线)倒是可以描述技术的生命周期,但它往往是一个接下一个(一个技术遇到瓶颈,往往才有另一个技术的开始)。。。这个在ChatGPT刚出来时我们回顾过。大模型的这几个 "Scaling Laws" 也印证这一点 (Test-Time 和 Post-Training “Scaling Laws" 有点重叠部分):

创新就是从一个S曲线到另一个S曲线,well known results。这也是斯坦福大学那位鼓吹新能源、自动驾驶以及再生食品革命等科技乐观主义的教授(叫?)每次演讲必谈的技术革命的adoption曲线。他自称根据这个曲线,他在过去30年对于技术影响社会的许多预见都证明是对的,虽然每一次预见社会都会取笑他。 

回到LLM领域的 scaling law 话题。Scaling law本质上是一种经验法则,而经验告诉我们,大多数经验法则都符合S形曲线(或增量的正态分布)。具体到LLM,"大就是好"正是这种法则在遇到平台期或天花板之前的体现。这里的"大"指的是数据规模大、模型参数量大,缺一不可。模型规模不够大,数据再多也无法有效消化——这早已是业界共识。不过OpenAI早期的设计中过分强调模型规模的做法现在看来是一种误导,直到Chinchilla Scaling Law的提出,业界才形成了更合理的共识:数据规模和模型参数量需要保持适当的比例关系。

LLM Scaling的底层逻辑是什么?

首先要明确:LLM不是数据库,其目标不是记忆长尾数据的细节。大模型训练本质上是对大数据内容的压缩,换句话说,压缩的是数据背后的知识体系(包括常识、百科知识等),重点在于揭示大大小小的各种规律性(也就是所谓的泛化能力,generalizations)。

一般直觉会认为,数据规模越大,冗余也越多。无论如何过滤清洗和去重,冗余度随规模增长,似乎意味着可榨取的"油水"会越来越少。那么为什么到了千亿tokens这种以前难以想象的数据规模,大模型依然显得"吃不饱"?为什么从千亿扩展到万亿tokens,scaling law依然有效?

这个现象的关键在于LLM是序列学习(编码)序列推理(解码)的系统。序列本身是一维的,但序列中蕴含的patterns和规律性却是高维的。举个例子:即使是简单的"猫追老鼠"这样的序列,背后可能涉及物种关系、捕食行为、空间运动等多个维度的知识。这种多维知识表现在序列层面,就会发生天然的组合爆炸。对大数据的"大胃口"正是应对这种组合爆炸的有效策略。只要不是完全的信息冗余,增加的不同序列对模型抽象数据patterns通常都是有帮助的。

然而,人类自然产生的高质量数据是有限的。预训练已经几乎吃尽了现有的高质量自然数据。于是,业界开始探索另外的AI智能增长曲线。

从预训练到推理:两个还是三个S曲线?

张俊林指出:

OpenAI o1推出后,另外两个阶段不再孤单,也各自拥有了姓名,产生了各自的Scaling Law,对应后训练阶段的强化学习Scaling Law(RL Scaling Law)和在线推理阶段的Inference Scaling Law(也叫Test Time Scaling Law)。

这里值得探讨的问题是:到底是三个S曲线,还是两个?推理模型的S曲线与此前的预训练S曲线有多大可比性?

理论上确实可以分为三个阶段:

1. 预训练
2. 后训练(尤其是推理强化学习)
3. 推理阶段

这三个阶段理论上都可能找到资源投入与性能提升之间的正相关S曲线,即scaling laws的某种表现函数。但实际上,在当前部署的应用中,后训练和推理这两个阶段应该共享同一个S曲线,原则上不存在两条独立的增长曲线。

当然,如果用户利用提示词技巧来影响模型的test time,让它更深入的思考,这可能间接影响 CoT (ChainOfThought)的长度或深度。但那是 query 的改变,是 input context 的变化,感觉也不应该算作 test time compute 的独立的 s曲线。

另外,说推理模型这一波潮流是范式转变,开启了新的 RL/Test-time scaling law,总觉得有一点太言之凿凿了。直觉上,推理模型的增长曲线与此前的预训练 scaling law 的增长曲线,大概率没有直接的可比性。

Scaling law 说的 law,实际上我们都知道是所谓经验“法则”。经验需要足够的实践数据积累,才能总结出来。强化学习赋能的推理模型才刚开始,没有足够的经验数据刻画这是怎样的一种增长关系,能持续多久,是不是昙花一现,还是可以持续相当长的时候,等等。

持续时间不够长的 scaling,其实没有多少经验法则的意义。Anthropic CEO Dario 提到 deepseek 的时候说(大意), deepseek 显得这么亮眼其实是赶上了好时机,言下之意是运气的成分大于技术硬核实力和创新(滑稽的是,Anthropic 迄今没有能力推出任何推理模型,虽然R1以来,谷歌和国内都有推理模型的上线)。他说,推理刚刚开始,所以任何人走通了这条路,在这个初期阶段都会有一个大增长。譬如PhD段位的考试题,在没有推理模型的LLM中,可能分数很低,但一旦有了推理模型,有了所谓 test time compute 的 CoT,成绩就会直线上升,给人创造了奇迹的感觉。

现在是推理模型的早期,后去会如何呢?靠增加 test time compute,或不断延长 CoT,还会有多少增长空间?这个问题是现在进行时,貌似没有明确答案。但隐隐觉得,这个持续增长的时间或曲线,远不如预训练那样稳定和持续,进而其作为 scaling law 的说法不一定站得住。

这第二条反映 RL scaling law 的后训练智能增长曲线,不大好与 pretrain scaling law 相提并论,很可能并不是可持续的,也可能很快就遭遇制约因素的强烈反弹(见后“Test Time Compute 的制约”)。

DeepSeek R1的启示:慢思考的真相

以DeepSeek R1为例,用户可以选择"deepthink"模式来启动慢思考的chain-of-thought(CoT)推理,但实际上用户难以通过增加计算时间来提升推理质量。这是为什么呢?

让我们看一个具体例子。假设我们让R1解决一个复杂的数学问题:

- 传统模型可能直接给出答案:"结果是42"
- R1会展示详细的推理过程:"让我们一步步思考:1) 首先考虑...... 2) 然后我们可以...... 3) 最后得出结果42"

表面上看,R1的回答展现了"慢思考"(CoT)的特征,但实际上这个推理过程是模型在训练阶段就已经固化的生成模式,而不是在回答问题时动态探索多个可能的推理路径。换句话说,CoT+answer 看似是"慢思考"后的回答,但其实并不改变自回归 ntp(next token prediction)的单向序列生成定式。说白了就是,R1 的 cot+answer 给人慢思考的样子,但生成的本性还是GPT“快思考”范式。在 test time,思考的深度和规模不是动态探索,虽然可以用 beam search 进行内部的隐式多路径选优。

Test Time Compute 的制约

目前业界热议的"test time compute",指的是含有CoT机制的推理模型相比传统的非推理模型需要更多的在线计算资源。以V3/R1为例,处理同样的问题,启用CoT 的R1可能需要V3 n多倍的计算时间。但这种计算量的增加是模型训练后固化的行为模式导致的,而不是可以动态调节的算力投入。test time compute 没有可控的伸缩可能性,也就谈不上 test time scaling law。

预训练与后训练的CoT强化学习的一个很大的不同是:预训练 scaling law 可以长期稳定乃是因为一旦训练完成,不大影响在线响应的时间,生成模式就是简单 query+answer。因此预训练阶段离线训练几个月都是可以忍受的,只要训练出来的大模型能力有大的提升。但推理模型后训练阶段的CoT强化学习不同,它在培养模型在线回应慢思考的习惯,生成模式是 query+cot+answer。推理模型的 cot 拉长,不仅仅是训练的资源和时间的耗费问题,更主要的是它反映在部署推理阶段的 test time compute 的延长,严重拖延了系统的响应时间。而用户在线使用系统的时候,一般来说对于慢思考的计算量和耗费时间是有能够忍耐的上限的。

这就带来了一个关键问题:即使研究表明indefinitely 增加CoT的长度(相应增加在线计算时间)能带来持续的性能提升,符合某种 scaling law 的经验法则,这种增长也会受到推理阶段现实因素的制约。一般用户可能愿意等待5-10秒获得更好的答案,但如果需要等待几分钟乃至几小时,使用体验就会大打折扣,乃至不可接受。

Scaling Law的可持续性之辩

Open AI CEO Sam Altman 和 Anthropic CEO Dario 这些大佬可能会争辩说,对于极其复杂的问题(如证明黎曼猜想、设计下一代航天战机等),即使模型需要一周的计算时间,相比人类团队需要数十年的工作量仍是极大的进步。但这种论述有两个问题:

1. 这类超复杂问题的LLM可行性远未得到验证
2. 极端场景不具有普适性,难以作为可持续的scaling law 的数据点

当然,这并不是否认S曲线作为描述scaling law的有效模型,也不是否定S曲线叠加的合理性。预训练和后训练两个阶段的增长曲线(s1和s2)叠加确实可能反映了资源投入与性能提升的整体关系。但我们需要谨慎看待CoT推理是否开启了一个真正可持续的scaling曲线。

结语:通向AGI的道路还有多远?

如果推理模型的scaling law缺乏可持续性,这就带来了一个更深层的问题:仅依靠这两个scaling laws,我们能否达到通用人工智能(AGI)的理想彼岸?更进一步,让AI平替人类劳动、极大提升生产力的超级人工智能(ASI)的技术理想是否真的可行?

目前的证据表明,预训练scaling law确实展现了相当的持续性,但推理模型的scaling law可能会较快遇到现实约束。这提醒我们,通往AGI/ASI的道路可能需要更多的创新突破,而不仅仅是现有方法的简单外推。在人工智能发展的下一个阶段,我们或许需要寻找全新的增长曲线。

 

 

【相关】

张俊林:从Deepseek R1看Scaling Law

DeepSeek 笔记:R1 部署阶段的推理机制

1. 训练阶段的强化学习机制:GRPO是否通过生成多条候选答案(multiple candidate cot+answer sequences)进行策略优化(修改模型),使得模型养成慢思考的习惯?

这个答案是毫无疑问的 YES。

2. 部署阶段的推理机制:R1是否在生成时隐式生成多条路径,但仅展示一条?如果是,这种机制与集成(ensemble)方法有何异同?

3. 与AlphaGo的MCTS的区别:MCTS树搜索是否在推理时动态构建搜索树,而集成方法只是静态组合多个模型的输出?

1. 部署阶段的隐式多路径推理机制

DeepSeek R1 的部署阶段,其推理机制可以概括为以下两种模式:

(1) 隐式多路径生成与筛选

- 生成多条路径:模型在单次推理时,可能隐式生成多条潜在的推理路径(CoT+Answers),但仅选择其中一条输出。
- 技术实现:通过调整解码策略(如束搜索宽度 `beam_width`),模型在生成过程中维护多个候选序列(即多条路径),最终选择综合评分最高的路径。
- 用户感知:用户仅看到最终输出,但模型内部进行了多路径探索与筛选。
- 效率权衡:若设置 `beam_width=1`(贪心搜索),则退化为单路径生成,响应速度最快;增大 `beam_width` 可提升输出质量,但增加计算延迟。

(2) 显式多候选生成(需主动配置)

- API级控制:通过设置 `num_return_sequences` 参数,模型可显式生成多个候选答案(如5个),用户或下游系统可进一步筛选。
- 实际应用:DeepSeek App默认未开放此功能,但在企业API或开源代码中可能支持。
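
以开源的 Hugging Face transformers 解码接口为例,可以直观看到这两种模式(这里用一个通用的因果语言模型占位,并非 DeepSeek 的实际部署代码;上文的 `beam_width` 在该库中对应参数 `num_beams`):

```python
# 示意:用 transformers 的通用解码参数实现隐式多路径(束搜索)与显式多候选
# 模型名仅为占位示例,并非 DeepSeek 的部署方式
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # 占位:任意因果语言模型
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Q: 9 * 8 = ?\nA:", return_tensors="pt")

# (1) 隐式多路径:束搜索在内部维护 num_beams 条候选序列,最终只返回得分最高的一条
best = model.generate(**inputs, num_beams=4, max_new_tokens=32)

# (2) 显式多候选:返回多条序列,供下游系统进一步筛选
candidates = model.generate(**inputs, do_sample=True, num_return_sequences=5,
                            max_new_tokens=32)

print(tokenizer.decode(best[0], skip_special_tokens=True))
for seq in candidates:
    print(tokenizer.decode(seq, skip_special_tokens=True))
```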

关键点
- 训练阶段的强化学习优化了模型的“单路径CoT生成能力”:通过GRPO训练,模型在单次生成时即可输出高质量的详细推理步骤(长CoT),无需依赖显式多候选生成。
- 部署时的多路径探索只是“锦上添花”:隐式多路径(如束搜索)或显式多候选生成可进一步提升输出质量,但非必需功能。

 

2. 训练阶段的“慢思考习惯”培养

(1) 强化学习的作用

- 目标:通过GRPO算法,模型学习生成更详细、更合理的推理步骤(长CoT)以提高奖励(如答案正确性)。
- 机制:训练时生成多个候选答案,奖励信号不仅评估最终答案正误,还隐式鼓励逻辑连贯的推理路径(如通过格式奖励)。
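
GRPO 的“组相对”思想可以用几行代码示意(极简草图,省略了 KL 约束与裁剪等细节,奖励数值为假设的占位):对同一道题采样一组候选 cot+answer,以组内平均奖励为基线,高于均值的序列在策略更新中被加强,低于均值的被抑制。

```python
import statistics

# GRPO 组相对优势的极简示意(省略 KL 惩罚与 PPO 式裁剪;奖励数值为假设的占位)
def group_relative_advantages(rewards):
    """对同一 prompt 采样出的一组候选输出,计算组内标准化后的优势值。"""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0   # 防止除零
    return [(r - mean_r) / std_r for r in rewards]

# 假设对同一道数学题采样了 4 条 cot+answer,奖励 = 答案正确性 + 格式分
rewards = [1.0, 0.0, 1.2, 0.1]
print(group_relative_advantages(rewards))
# 正优势的序列(往往是推理更完整的长 CoT)被加强,负优势的被抑制
```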

(2) CoT增长的驱动力

- 奖励设计:若长CoT更易得出正确答案(如分步解题减少错误),模型在策略优化中自然倾向于生成更长、更详细的步骤。Given room for [think], a reasoning model just wants/tends to think deep!
- 数据反馈:训练后期通过拒绝采样生成的高质量SFT数据,进一步强化这一模式。

结果:训练后的模型在单次生成时即可输出高质量的详细推理(即“慢思考习惯”内化)。

3. 与集成方法(Ensemble)的异同

(1) 相似性

- 多路径生成:隐式多路径探索(如束搜索)可视为同一模型生成多个潜在输出,类似集成方法中的多模型预测。
- 结果筛选:通过置信度选择最优解,类似于集成中的投票或加权平均。

(2) 核心差异

R1的隐式多路径生成本质是单模型内的动态解码策略,而传统集成依赖多模型的静态组合,二者在实现成本与多样性来源上存在根本差异。

4. 与AlphaGo蒙特卡洛树搜索(MCTS)的本质区别

(1) AlphaGo的MCTS机制

- 动态规划:通过模拟(Simulation)构建搜索树,评估每一步的长期收益(如胜率),动态选择最优路径。
- 在线学习:在推理时根据实时反馈(如对手落子)调整搜索策略,部分版本(如AlphaZero)甚至更新模型参数。

(2) R1的隐式多路径生成

- 静态模型:部署时模型参数固定,多路径生成依赖预训练的策略与解码规则。
- 无长期收益建模:路径选择基于模型自身的置信度概率,而非多步决策的累积收益。

(3) 差异

- R1的多路径生成是静态策略的有限探索,依赖训练阶段内化的CoT+answer的生成能力。
- MCTS是动态规划过程,通过实时模拟与评估实现长期收益最大化,属于在线决策优化。

 

5. 总结

- 训练阶段的目标:GRPO通过强化学习培养模型生成详细CoT的习惯,使得部署时单次生成即可输出合理答案。
- 部署阶段的灵活性:系统可选择单路径生成(快速响应)或多路径筛选(质量优先),后者类似轻量级集成。

- 训练完成后模型参数确实固定,但隐式多路径生成依赖解码策略(如束搜索宽度),用户可通过API参数调整,非完全静态。
- 与集成的实质差异:R1的多路径生成是同一模型的不同解码路径,而传统集成依赖多个独立模型,后者多样性更高但成本激增。

- MCTS的核心是动态搜索与长期收益建模,而非多模型预测的平均化。R1的隐式多路径更接近贪心策略的扩展,而非规划过程。

DeepSeek R1的部署机制通过训练阶段的强化学习内化“慢思考”能力,使其在单次生成时即可输出详细推理。隐式多路径生成(如束搜索)可进一步提升质量,但本质是同一模型的解码策略优化,与传统集成或AlphaGo的MCTS均有显著差异。这种设计在效率与效果间取得平衡,适配大规模工业应用需求,但牺牲了动态规划与全局最优的能力。

 

【相关】

Hallucinations in AI: Bug or Feature? A Deep Dive into DeepSeek-R1

Host: Hello everyone! Welcome to today's interview. Recently, there's been quite a buzz about AI "hallucinations," especially with DeepSeek-R1, which seems to have a higher hallucination rate than its predecessor, DeepSeek-V3. Today, we're joined by Dr. Li, a senior AI researcher. Welcome, Dr. Li!

Dr. Li: Hello, host! Hello, everyone!

Host: Let's start with the million-dollar question: Why do large language models "hallucinate"? Can you break it down for us in plain English?

Dr. Li: You see, large language models are like super-powered conversation completers. Give them the first half of a sequence, say, a question, and they'll predict the second half (say, an answer) based on their massive knowledge network. They learn like our brains do – they can't remember everything word-for-word, so they compress and generalize, grabbing the gist and finding patterns.

Here's a fun contrast: Ask them "How tall is Yao Ming?" and they'll nail it, because that knowledge is so famous that the data point is practically carved in stone in their memory (represented in the model's parameter weights). But ask them "How tall is Old Wang from next door?" and they're stumped because they've never met Old Wang! But here's the kicker – they won't just say "I don't know." So what do they do? They "make up" a reasonable height based on what they know about the range of human heights. That's a hallucination for you!

Host: Wow, that's some impressive guesswork! But isn't this kind of making things up pretty problematic?

Dr. Li: Not necessarily! In a way, hallucination is imagination (for better or worse) – it's where creativity lies! Think about it: all those great literary works, artistic masterpieces – aren't they all flights of fancy, products of imagination? If everything had to match reality closely, art would just be photography, and where's the fun in that?

You know, Yuval Harari makes a fascinating point in "Sapiens" – humans became Earth's dominant species precisely because we could "tell stories," creating myths, religions, nations, and money – things that don't physically exist. These are all "hallucinations," but they're the driving force behind civilization!

Host: When you put it that way, hallucinations sound pretty important! But let's talk about DeepSeek-R1. Its hallucination issue seems quite serious.

Dr. Li: Indeed, it is! The academic consensus used to follow OpenAI's view that reinforced reasoning would significantly reduce hallucinations. I remember discussing this with a head honcho at an LLM unicorn who was particularly excited about reasoning's potential to curb hallucinations. But R1's performance threw us a curveball!

According to Vectara's tests, R1's hallucination rate is more than 3 times higher than its foundation model V3's – 14.3% compared to 3.9%. This definitely correlates with its prolonged "Chain of Thought" (CoT) enabled by reinforcement learning for reasoning. R1 is absolutely brilliant at reasoning, math and coding, as well as poetry and storytelling, but this currently comes with the "side effect" of increased hallucinations in tasks like translation and summarization.

More specifically, there are several reasons for R1's increased hallucinations.

First, the standard hallucination tests use summarization tasks, something base models are already pretty good at. In this case, reinforcement learning can backfire – it's like using a cannon to swat a fly!

Second, R1's reinforced reasoning chains weren't specifically optimized for straightforward tasks like summarization, translation, or news writing that demand strict factual accuracy. Instead, it tries to add various layers of thinking to every task. Looking at its transparent CoT (ChainOfThought) printout, we see it tirelessly analyzing even simple instructions from multiple angles. This overcomplication of simple tasks can lead to deviations and hallucinations.

During R1's reinforcement learning for NLP-related tasks, it seems the model was rewarded more heavily for creativity, leading it to be more imaginative – and consequently more prone to straying from facts. For mathematical and coding tasks, R1's supervision came from gold standards (test answers or code test cases). But for humanities tasks, they used V3 or V3's reward model to judge quality, and the current system seems to clearly favor creativity.

Moreover, user feedback typically tends to focus and encourage creativity. Most people aren't sensitive to hallucinations, especially when they're wrapped in the model's smooth, fluent language. For most frontline developers, this kind of user feedback naturally pushes them to enhance creativity rather than tackle the thorny problem of hallucinations.

Host: So, you're saying that R1's hallucination problem is rooted in its over-enthusiastic reasoning? What's the real relationship between reinforced reasoning ability and hallucinations?

Dr. Li: It's still a puzzle – there doesn't seem to be a simple correlation. Look at R1, a leading reasoning model, versus Claude 3.5 Sonnet, a top non-reasoning model. Surprisingly, Sonnet still has a higher hallucination rate than R1! But when we compare R1 to its base model V3, we see clearly that adding reasoning significantly increased hallucinations.

It may well be about the model's "personality." R1, with its powerful reinforcement learning, loves "divergent thinking." Give it a simple prompt, and it'll spin out ideas like there's no tomorrow – its CoTs could run on like crazy! This suggests that while R1 was powering up its creativity, it inevitably amplified creativity's twin: hallucination.

As a model that excels in both STEM and humanities, R1 performs differently across tasks. In mathematics and coding, where more rigorous reasoning is required, there's little room for hallucination. But in language and creative tasks, especially in the summarization tests, hallucinations become more prominent. It's largely a side effect of R1's supercharged linguistic creativity.

Technically speaking, R1 automatically adds lengthy CoTs to simple user instructions, essentially complicating straightforward tasks. Its CoTs (like the internal monologue of an entity following instructions) change the conditional part of the autoregressive probability model before generating answers, naturally affecting the final output. Compare:

V3: query → answer
R1: query+CoT → answer

For tasks that V3 already handles well, like summarization or translation, any lengthy CoT guidance might lead to deviation or embellishment, creating fertile ground for hallucinations.

Host: So where do R1's hallucinations mainly occur?

Dr. Li: Think of R1's abilities as split between "arts" and "sciences." In "science" areas like math and coding, its logic is fairly strong and hallucinations are relatively rare. But in "arts" areas like language, hallucinations become more noticeable.

R1's most impressive achievement compared to the first LLM reasoning model O1 is successfully extending mathematical and coding reasoning capabilities into creative writing, especially in Chinese. The internet is full of R1's brilliant literary works. In terms of wordplay and literary prowess, it clearly surpasses 99% of humans – even graduate students in literature and classical Chinese professors sing its praises.

But watch what happens when you ask it to do a simple summary – it can't help but "get creative," often "inventing" details not present in the original text. It's like its "arts" abilities are too powerful, a case of "too much of a good thing."

Host: That's an interesting perspective. Do all language tasks require creativity?

Dr. Li: Language tasks actually fall into two categories: ones that need high creativity, like poetry and fiction writing, and ones that demand high factual accuracy, like news reporting, translation, or summarization. R1 excels at the former, which was likely the development team's focus, but this currently creates side effects in the latter.

It reminds me of the old Chinese saying about translation needing to be "faithful, expressive, and elegant" – achieving all three has always been challenging. We see many examples where elegance is prioritized over faithfulness, like the use of hyperbole in literary works. We also see the opposite, like Lu Xun's advocacy for so-called "rigid translation."

Interestingly, humans have always had double standards here, but we have a mental switch we can flip at will. When watching movies or reading novels, we flip towards creativity and don't fuss about factual accuracy. But switch to news channels, and we have zero tolerance for falsehoods.

Host: People tend to believe content that appears logically coherent and detailed, so the potential harm from AI hallucinations could be significant. What should we ordinary folks do about AI hallucinations?

Dr. Li: While many people are starting to notice and become wary of these hallucinations amid their amazement at LLM's creativity, most are still mesmerized by its creative brilliance. We need to increase public awareness of AI hallucinations. I suggest a two-pronged approach:

Stay Alert: Don't take everything the model says for granted, especially factual claims. Hallucinations most commonly occur with names, places, times, locations, and other entities or numerical data.

Cross-Verify: For important details, check original sources online or consult experts to see if the claims align.

Guide the Model: When asking questions, add constraints like "please stay faithful to the original text" or "please verify facts." This can at times help reduce hallucinations.

Embrace Creativity: If you're looking for inspiration or creative ideas, model hallucinations can be a delightful surprise!

Think of AI hallucinations as "possibilities in parallel universes." What it makes up might not be true in our world, but could be true in another! It's like how novelists write fiction – while it cannot withstand fact-checking, it's a kind of "artistic truth." Just like novels arise from life but transcend it, AI arises from data but transcends it. AI compresses data into a knowledge and common-sense network, not necessarily true to individual facts – that's what databases are for.

Host: This reminds me of what people often say: AI models aren't just "talking nonsense" – they're "talking nonsense seriously"!

Dr. Li: Haha, that's exactly it! AI hallucinations are its "educated guesses," based on the massive knowledge and patterns it's learned. The hallucinations are by no means completely random – they have internal constraints that make them seamless and convincing, but also more deceptive. Newcomers to AI need to be especially careful not to trust everything at face value.

For regular users, understanding the nature of hallucinations is needed. For example, when asking about well-documented facts like "How long is the Yangtze River?" models won't make mistakes because these facts are firmly encoded in their parameters. But ask about an obscure creek or fictional river, and the model will activate its "reasonable completion" mechanism and make something up.

Host: Following your logic, human language itself provides a breeding ground for hallucinations.

Dr. Li: You could say that. Language enabled humans to create things which do not exist in the physical world, such as myths, religions, states, corporations, currency, and abstract concepts like ideals and beliefs. Harari emphasizes in "Sapiens" that story-telling (i.e. typical hallucination) was fundamental to civilization: language enabled human story-telling abilities, and hallucinations catalyzed civilization. Humans are the only entities capable of 'lying' (besides LLMs).

Host: What about the future? Is there a way to maintain creativity while reducing hallucinations?

Dr. Li: This is definitely one of the "ultimate challenges" in AI! People are working on various solutions, including:

More Refined Training: During training, treat different types of tasks differently, teaching the model when to be strict and when to be creative.

Task-Specific Fine-tuning/Reinforcement Learning can help balance this contradiction. Tasks like summarization, paraphrasing, translation, and reporting need special care because they require both some creativity (like style) and strict factual accuracy.

Specifically, R1's training pipeline has four stages: fine-tuning 1, reinforcement 1, fine-tuning 2, and reinforcement 2. Reinforcement 2 mainly focuses on human preference alignment. Currently, this process seems to favor creativity over faithfulness, which could be rebalanced later. Perhaps more importantly, in stage three (i.e. fine-tuning 2), we could strengthen constraints for different tasks – for example, increasing supervised data for summarization to encourage faithful, straightforward results.

Routing: In the future, there will be a "model dispatcher" that assigns different models based on task type. Simple tasks could go to V3 or use tools, while complex tasks requiring deeper thinking go to R1.

For instance, arithmetic tasks should just use simple code calculations, equivalent to using a calculator. That's not how it works now – yesterday I tested a nine-digit multiplication, and R1 spent over three minutes thinking, producing CoT that could stretch down the street, breaking down the reasoning step by step. While the answer was correct, using such computationally expensive CoT for arithmetic instead of a simple function call is unreasonable. A one-line calculation code would do the job – no need to waste so much computing resource and tokens on explicit reasoning. These are foreseeable routing improvements, especially in the age of AI agents which can use all kinds of tools or applications. R1's CoT does not need to handle everything – besides hallucinations, compute-burning CoT is also not environmentally friendly.
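
(A minimal sketch of the routing idea Dr. Li describes – the regex-based dispatcher and the placeholder LLM call below are hypothetical illustrations, not DeepSeek's actual implementation:)

```python
# Route trivial arithmetic to a calculator tool instead of burning CoT tokens.
# The regex-based "router" is a hypothetical toy, not a real dispatcher.
import re

def route(query: str) -> str:
    m = re.fullmatch(r"\s*(\d+)\s*([*+/-])\s*(\d+)\s*=?\s*\??\s*", query)
    if m:  # arithmetic: one line of code, no reasoning model needed
        a, op, b = int(m.group(1)), m.group(2), int(m.group(3))
        ops = {"+": a + b, "-": a - b, "*": a * b, "/": a / b if b else float("nan")}
        return f"calculator: {ops[op]}"
    # everything else falls through to an LLM (placeholder, no real API call here)
    return "llm: <send query to V3 or R1 depending on complexity>"

print(route("123456789 * 987654321 = ?"))          # handled instantly by the tool
print(route("Prove that sqrt(2) is irrational"))   # routed to a reasoning model
```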

Host: Thank you, Dr. Li, for this fascinating discussion! Today's interview has given us a much deeper understanding of AI hallucinations.

Dr. Li: My pleasure! It's been great chatting with you!

 

【相关】

从R1幻觉谈起,大模型幻觉是缺陷还是创意火花?

主持人: 大家好,欢迎来到今天的访谈。最近,大模型的“幻觉”问题引发了热议,尤其是DeepSeek-R1,它的幻觉率似乎比自己的基座大模型DeepSeek-V3高不少。今天我们请到了资深AI研究员立委博士,来跟大家聊聊这个话题。立委,您好!

立委: 主持人好,大家好!

主持人: 老李,咱们先来个灵魂拷问:为啥大模型会“产生幻觉”?能不能用大白话给大家解释一下?

立委: 这可算是大模型的经典问题。其实啊,大模型就像一个“超级接话茬儿高手”,你给它上半句,它就根据自己学过的海量知识,预测下半句该说啥。它学东西呢,就像咱们人脑记东西一样,不可能每个字都记得清清楚楚,它会进行压缩和泛化,抓大意、找规律。

打个比方,你问它“姚明有多高”,它大概率不会错,因为这知识点太出名了,它记得牢。但你要是问“隔壁老王有多高”,它可能就懵了,因为它没见过老王啊!但它又不能不回答,咋办?它就得“脑补”,根据“一般人有多高”这个学到的概念,给你编一个数出来,这就是“幻觉”。

主持人: 哎呦,这“脑补”能力也太强了!胡编乱造,这幻觉也太糟糕了。

立委: 那可不一定!你看啊,某种程度上,幻觉就是想象力(褒贬不论),就是创意!你想想,那些伟大的文学作品、艺术作品,哪个不是天马行空、充满想象?要是啥都得跟现实一模一样,艺术就成了照相机了,那还有啥意思?

就像赫拉利在《人类简史》里说的,人类之所以能成为地球霸主,就是因为我们会“讲故事”,会创造出神话、宗教、国家、货币这些现实中不存在的东西。这些都是“幻觉”,但它们却是文明诞生和发展的原动力。

主持人: 听您这么一说,感觉幻觉还挺重要的。那咱们回到DeepSeek-R1,它的幻觉问题真的很严重。

立委: 是很严重。此前学界普遍认同OpenAI的说法,推理增强会明显减少幻觉。我曾与大模型公司的一位负责人讨论,他就特别强调推理对减少幻觉的积极作用。但 R1 的表现却给出了一个出人意料的答案。

根据Vectara的测试,R1的幻觉率高达14.3%,显著高于其前身V3的3.9%。这跟它加强了的“思维链”(CoT)和创造力直接相关。R1在推理、写诗、写小说方面,那叫一个厉害,但随之而来的“副作用”就是幻觉也多了。

具体到R1,幻觉增加主要有以下几个原因:首先,幻觉标准测试用的是摘要任务,我们知道摘要能力在基座大模型阶段就已经相当成熟了。在这种情况下,强化反而可能产生反效果,就像用大炮打蚊子,用力过猛反而增加了幻觉和编造的可能。

其次,R1 的长思维链强化学习并未针对摘要、翻译、新闻写作这类相对简单而对事实要求很严格的任务做特别优化,而是试图对所有任务增加各种层面的思考。从它透明的思维链输出可以看到,即便面对一个简单的指令,它也会不厌其烦地从不同角度理解和延伸。过犹不及,简单任务的这种复杂化会引导结果偏离原文、过度发挥,增加幻觉。

DeepSeek-R1在文科类任务的强化学习训练过程中,可能对模型的创造性给予了更多的奖励,导致模型在生成内容时更具创造性,也更容易偏离事实。我们知道,对于数学和代码,R1的监督信号来自于这些题目的黄金标准(习题集中的标准答案或代码的测试案例)。他们对于文科类任务,利用的是V3或V3的奖励模型来判定好坏,显然目前的系统偏好是鼓励创造性。

另外,用户更多的反馈还是鼓励和欣赏见到的创造力,一般人对于幻觉的觉察并不敏感,尤其是大模型丝滑顺畅,识别幻觉就更加困难。对于多数一线开发者,用户的这类反馈容易促使他们更加向加强创造力方向努力,而不是对付大模型领域最头痛的问题之一“幻觉”。

主持人: 这么说来,R1 的幻觉问题是不是源于它过于"积极"的思维推理?但推理能力增强和幻觉之间到底是什么关系?

立委:这个关系很微妙,并不是简单的正相关或负相关。你看 R1 是头部推理模型,而 Claude 3.5 Sonnet 是头部的非推理大模型,但后者的幻觉率反而高于前者。可是当我们对比 R1 和它的基座模型 V3 时,又发现增加推理强化后幻觉确实显著增加了。

这事儿跟大模型的“性格”有关。R1这家伙,强化做得给力,特别喜欢“发散思维”,你给它一个简单的指令,它能给你想出一大堆东西来,思维链能绕地球三圈!这似乎说明 R1 在强化创造力的过程中,不可避免地增加了创造力的伴生品:幻觉。作为一个文理通吃的推理大模型,R1 在不同领域的表现并不一样。在数学、代码等需要严谨推理的领域,幻觉的空间很小。但在语言创作领域,尤其是现在被测试的摘要任务上,幻觉问题就明显得多。这更多是 R1 语言创造力爆棚带来的副作用。

具体从技术角度来说,R1 会为用户的简单指令自动增加很长的思维链,等于是把一个简单明确的任务复杂化了。你一个简单的指令,它也反复从不同角度理解和衍伸(CoT思维链好比“小九九”,就是一个实体遵从指令时的内心独白)。思维链改变了自回归概率模型生成answer前的条件部分,自然会影响最终输出。

V3: query --〉answer
R1: query+CoT --〉answer

对于 V3 已经能很好完成的任务,比如摘要或翻译,任何思维链的长篇引导都可能带来偏离或发挥的倾向,这就为幻觉提供了温床。

主持人: 那对于R1来说,幻觉主要是出在哪方面呢?

立委: 我觉得可以把R1的能力分成“文科”和“理科”来看。它在数学、代码这些“理科”方面,逻辑性很强,幻觉相对少。但在语言文字这些“文科”方面,幻觉就比较明显了。

比起O1,R1 最令人惊艳的成就,是成功将数学和代码的推理能力充分延伸到了语言创作领域,尤其在中文能力方面表现出色。网上流传着无数的R1精彩华章。舞文弄墨方面,它显然超过了99%的人类,文学系研究生、甚至国学教授也赞不绝口。

但你看,让它做个摘要,本来是很简单的任务,但它非得给你“发挥”一下,结果就容易“编”出一些原文里没有的东西。这其实是它“文科”太强了,有点“用力过猛”。

主持人:这个说法有意思。那么在具体应用中,语言任务是不是都需要创造力呢?

立委:语言能力其实可以细分为两类:一类需要高创造力,比如写诗歌、小说;另一类需要高度真实性,比如新闻报道、翻译或摘要。R1 最受称赞的是前者,这也可能是研发团队的重点方向,但在后者中就出现了副作用。

这让我想到中国古人说的"信达雅",自古难全。为"雅"牺牲"信"的例子我们见得很多,文学创作中夸张的修辞手法就是重要手段和例证。为"信"牺牲"雅"也有先例,比如鲁迅先生推崇的"硬译"。

有趣的是,我们人类在这方面其实一直是双标的,但我们心里有个可以随时切换的开关。看小说和电影时,我们把开关偏向创造性一侧,完全不会去纠结细节是否真实;但一旦切换到新闻频道,我们就对虚假内容零容忍。

主持人: 人对于逻辑看起来清晰自洽、且详细的内容,就会倾向于相信,所以大模型幻觉潜在的危害真地很大。那咱们普通人,面对大模型的幻觉,该咋办呢?

立委: 很多人在惊艳R1创造力的同时,现在开始慢慢注意到这个幻觉现象并开始警惕了。但更多人还是沉浸在它给我们带来的创造性的惊艳中,需要增强大众对模型幻觉的 awareness。我觉得吧,咱们可以“两手抓”:

保持警惕: 大模型说的话,特别是涉及到事实的,别全信,多留个心眼。最容易产生幻觉的地方是人名、地名、时间、地点等实体或数据。

交叉验证: 重要的细节,可上网查查原始资料或询问身边专家,看看说法是不是一致。

引导模型: 你可以在提问的时候,加一些限定条件,比如“请务必忠于原文”、“请核对事实”等等,这样可以引导模型减少幻觉。

享受创意: 如果你需要的是灵感、创意,那大模型的幻觉,会给你带来惊喜!

不妨把大模型的幻觉,看成是“平行世界的可能性”。它编出来的东西,也许在这个世界不是真的,但在另一个世界里,说不定就是真的呢!就像小说家写小说,虽然是虚构的,也是一种“艺术真实”。源于生活,高于生活。大模型是源于数据,高于数据。大模型压缩的是知识体系和常识,不是一个个事实,后者是数据库的对象。

主持人: 妙啊!您这说法,让我想起了大家常说的一句话:大模型不是在“胡说八道”,而是在“一本正经地胡说八道”。

立委: 哈哈,差不多就是这个意思!大模型的幻觉,其实是它“脑补”出来的,但它“脑补”的依据,是它学到的海量知识和规律。所以,它的幻觉,往往不是乱来的,有“内在的合理性”,这才丝滑无缝,假话说的跟真的似的,但同时也更具有迷惑性。初玩大模型的朋友,需要特别小心,不能轻信。

对于普通用户来说,理解幻觉的特点很重要。比如问"长江多长"这类有足够信息冗余的百科知识问题,大模型不会出错,这些事实是刻在模型参数里的。但如果问一个不知名的小河或虚构河流的长度,模型就会启动"合理补白"机制编造。

主持人: 按照您的说法,人类的语言本身就是幻觉的温床。

立委: 可以这么说。语言使得人类创造了神话、宗教、国家、公司、货币等非真实实体的概念,以及理想、信念等形而上的意识形态。赫拉利在《人类简史》中强调了幻觉对于文明的根本作用:语言的产生赋能了人类幻觉(“讲故事”)的能力。幻觉是文明的催化剂。人类是唯一的会“说谎”的实体 -- 除了LLM外。

主持人: 那么在幻觉的背后,大模型是怎么运行的呢?

立委: 幻觉的本质是补白,是脑补。

“白”就是某个具体事实,如果这个事实在训练数据中没有足够的信息冗余度,模型就记不住(零散事实等价于噪音)。记不住就用幻觉去补白,编造细节。

幻觉绝不是没有束缚的任意编造,大模型是概率模型,束缚就是条件概率中的前文条件。幻觉选择的虚假事实需要与补白所要求的value类型匹配,即符合ontology/taxonomy 的相应的上位节点概念。“张三”可以幻觉为“李四”,但不可以幻觉成“石头”。
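
可以用一个玩具例子示意这种“有束缚的编造”(类型表、候选词与分数都是假设的,仅说明机制):补白只在与槽位类型一致的候选里选择,所以“张三”可以被“李四”平替,却不会被“石头”平替。

```python
# 玩具示意:幻觉式补白受本体(ontology)类型约束,只在同一上位概念下选择替代实体
# 类型表与分数均为假设数据,仅用于说明约束机制

ONTOLOGY = {
    "张三": "human", "李四": "human", "王五": "human",
    "石头": "object", "长江": "river",
}

def confabulate(forgotten_entity: str, candidates_with_score: dict) -> str:
    """忘记了具体事实时,在与其类型相同的候选中选分数最高者补白。"""
    slot_type = ONTOLOGY[forgotten_entity]
    same_type = {e: s for e, s in candidates_with_score.items()
                 if ONTOLOGY.get(e) == slot_type}
    return max(same_type, key=same_type.get)

print(confabulate("张三", {"李四": 0.4, "王五": 0.3, "石头": 0.9}))
# 输出 "李四":即使"石头"的分数更高,类型不匹配也不会被选中
```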

所谓艺术真实是说,小说创作虽然可能背离了这个世界的事实,但却是可能的数字世界的合理想象。大模型的幻觉属于此类。

大模型的知识学习过程(训练阶段),是一种信息压缩过程;大模型回答问题,就是一个信息解码过程(推理阶段)。好比升维了又降维。一个事实冗余度不够就被泛化为一个上位概念的slot,到了生成阶段这个slot必须具像化补白。“张三”这个事实忘了,但【human】这个slot 的约束还在。补白就找最合理、与 slot 概念最一致的一个实体,于是“李四”或“王五”的幻觉就可以平替“张三”。小说家就是这么工作的,人物和故事都是编造的。无论作家自己还是读者,都不觉得这是在说谎,不过所追求的真善美是在另一个层面。大模型也是如此,大模型是天生的艺术家,不是死记硬背的数据库。“张冠李戴”、“指鹿为马”等在大模型的幻觉里非常自然,因为张和李是相似的,马和鹿也在同一条延长线上。在泛化和压缩的意义上二者是等价的,因此是合理的想象。

主持人: 未来有没有什么办法,能让大模型既有创造力,又少出幻觉呢?

立委: 这绝对是AI大模型领域的“终极难题”之一!现在大家都在想办法,比如:

更精细地训练: 在训练的时候,就对不同类型的任务区别对待,让模型知道什么时候该“老实”,什么时候可以“放飞”。

针对任务做偏好微调(finetune) and/or 强化(rl)可以减缓这个矛盾。 摘要、改写、翻译、报道这种任务需要特别小心和平衡,因为它既有一点再创造的需求(例如文风),又是本性需要内容忠实的。

具体说,R1训练pipeline是四个过程:微调1,强化1,微调2,强化2。强化2主要是与人类偏好对齐的强化。这个过程在创造力与忠实之间,目前看来倾斜于前者,后续可以再做平衡。也许更重要的是在阶段三的微调2中,针对不同任务加强约束,例如,增加摘要的监督数据,引导忠实平实的结果。

Routing: 以后可能会有一个“调度员”,根据任务的类型,安排不同的模型来处理。比如,简单任务交给V3或调用工具,慢思考的复杂任务交给R1。

例如,识别出算术任务,就去写个简单代码运算,等价于调用计算器。目前不是这样,我昨天测试一个九位数的乘法,R1 思考了三分多钟,思维链打印出来可以铺开来一条街,步步分解推理。虽然最后答案是对了,但算术问题用耗费太大的所谓 test time compute 的思维链(CoT),而不用 function call,完全不合理。一行计算代码就搞定的事,没必要消耗如此多的计算资源和tokens去做显式推理。

这些都是可以预见的 routing,尤其是在agent时代。 R1 CoT不必包打一切,而且除了幻觉,也不环保。

主持人: 感谢老李的精彩分享!今天的访谈让我们对大模型的幻觉有了更深入的认识。

立委: 不客气,很高兴和大家交流!

 

【相关】

Deepseek-R1 的幻觉率是 14.3% - 比其非推理前身 Deepseek-V3 高得多
榜单排名:https://github.com/vectara/hallucination-leaderboard