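Liwei's Explainer: How Should We Understand the Division of Labor among Q, K and V in Self-Attention?

For anyone starting to learn the self-attention mechanism, this is probably one of the harder, more brain-bending questions.

To learn contextual dependencies within a sequence, why does every token in the sequence have to spawn three specialized roles: Q (Query), K (Key) and V (Value)?

To understand why the Q, K and V projections applied to every token can specialize automatically through backpropagation, we need to dig into the underlying logic of model training. The process can be compared to division of labor in an ant colony: all ants start out similar, but through environmental feedback and task training they spontaneously differentiate into workers, soldiers, reproductives and other roles. The differentiation of Transformer parameters follows a similar pattern of natural evolution.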

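I. The Driving Force behind the Division of Labor: Macro-level Control by the Loss Function

Suppose we are training a translation model. The input sentence is "猫追逐激光点" and the target output is "The cat chases the laser dot". The key steps of parameter differentiation are as follows:

1. Initial chaos
- The matrices W_Q, W_K and W_V are all randomly initialized
- At this point the Q vector of "追逐" (chases) may have no relation at all to the K vector of "激光点" (laser dot)

2. First forward pass
- When the attention weights are computed, "追逐" fails to attend to "激光点"
- The translation comes out wrong (for example, "The cat eats the laser")

3. Error signal feedback
The gradient of the loss can be read as two kinds of correction:
- Missing-content gradient: the action association "追逐→chases" needs to be strengthened
- Object-mismatch gradient: the verb-object relation between "追逐" and "激光点" needs to be established

4. Parameter differentiation begins
- W_Q receives the signal: make the Q vector of a verb pay more attention to features of the action's target
- W_K receives the signal: make the K vector of a noun emphasize its property of being acted upon
- W_V receives the signal: retain detailed features of the noun, such as its being movable

🔥 Key mechanism: the same error signal is backpropagated along different computational paths, so the update directions of the three matrices diverge.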
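
The claim that one error signal yields three different update directions can be checked numerically. The sketch below is a minimal illustration, not a training recipe: the toy inputs, the squared-error loss and the helper `numeric_grad` are all assumptions made for this example. It estimates the gradient of a single scalar loss with respect to W_Q, W_K and W_V by finite differences.

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the token matrix X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def loss(X, target, W_q, W_k, W_v):
    """One shared scalar objective (toy squared error, an assumption)."""
    return float(np.mean((attention(X, W_q, W_k, W_v) - target) ** 2))

def numeric_grad(f, W, eps=1e-5):
    """Central finite-difference gradient of the scalar f() with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + eps
        up = f()
        W[idx] = old - eps
        down = f()
        W[idx] = old  # restore the original entry
        grad[idx] = (up - down) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))        # 3 toy tokens with 4-dim embeddings
target = rng.normal(size=(3, 4))   # made-up training target
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(4, 4)) for _ in range(3))

f = lambda: loss(X, target, W_q, W_k, W_v)
g_q, g_k, g_v = (numeric_grad(f, W) for W in (W_q, W_k, W_v))
```

All three gradients come from the same loss value, yet `g_q`, `g_k` and `g_v` differ because each matrix sits on a different path through the computation: W_Q and W_K only influence the attention weights, while W_V only influences the content being mixed.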

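II. The Mathematics of Parameter Differentiation

Breaking the attention computation into steps shows how gradients steer the division of labor:

Gradient paths in the attention computation

- Gradient with respect to W_Q:
It comes mainly from the similarity computed between this token's Q and the K vectors in the context, which forces W_Q to learn to produce useful query features
(Example: the Q vector of a verb comes to encode a latent feature such as "needs an object (transitive verb)"; Q closely resembles Subcat, the signal that encodes latent sentence patterns in traditional linguistics)

- Gradient with respect to W_K:
It comes from the same Q-K similarity computation, but pushes K's features toward being recognizable by Q
(Example: the K vector of a noun comes to encode the property "can be the target of an action (can serve as an object)")

- Gradient with respect to W_V:
It comes from the final weighted sum, which requires V to retain enough information
(Example: the V vector of "激光点" (laser dot) needs to carry details such as "small, bright, movable")

The four steps of the weight computation

1. The Q-K dot product measures relatedness
2. Scaling by √d_k (the square root of the key dimension) keeps large dot products from saturating the softmax, where gradients would vanish
3. Softmax normalization yields probability weights
4. A weighted sum of the V vectors produces the contextualized representation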
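
The four steps map directly onto a few lines of NumPy. This is a minimal single-head sketch under stated assumptions: the random Q, K and V are stand-ins for the projected token vectors, and masking and batching are left out.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (contextualized vectors, attention weights) for one head."""
    scores = Q @ K.T                                          # 1. Q-K dot products
    scores = scores / np.sqrt(K.shape[-1])                    # 2. scale by sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)      #    (stability shift only)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)   # 3. softmax over keys
    return weights @ V, weights                               # 4. weighted sum of V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))   # 3 tokens, d_k = 8 (toy sizes)
K = rng.normal(size=(3, 8))
V = rng.normal(size=(3, 8))
context, weights = scaled_dot_product_attention(Q, K, V)
```

Each row of `weights` is a probability distribution over the tokens being attended to, and each row of `context` is the corresponding mixture of V vectors.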

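III. Structural Safeguards That Keep the Division of Labor Stable

Besides gradient pressure, the model's architecture also keeps the roles from blurring:

1. Isolation of the linear projections

- Q, K and V come from three separate matrix multiplications, each with its own parameters
(in contrast to gating designs such as the LSTM, where the gates all read and write one shared cell state)
- The gradient update of each matrix touches only that matrix's own parameters (the three are still coupled through the shared loss)

2. Multi-head attention

Using 8 to 64 independent attention heads is like a team of detectives splitting up to investigate different leads: some follow the timeline, some analyze relationships between people, and in the end the matches found by all of them are combined.

Different heads form a "divide and cooperate" arrangement:
- Head 1: W_Q¹ learns grammatical role matching
(Example: the Q of a subject matches the K of its predicate)
- Head 2: W_Q² learns semantic association
(Example: the Q of "银行" (bank) matches the K of "利率" (interest rate))
- This multi-objective optimization forces the parameters to specialize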

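IV. A Worked Illustration: Making the Division of Labor Concrete

Visualizing the parameters of a trained model can reveal patterns of specialization. The following case is an idealized illustration:

Case: the parameters associated with the verb "吃" (eat)

- W_Q:
In the Q vector of "吃", the high-weight dimensions correspond to features such as "edible" and "concrete object"

- W_K:
In the K vector of "苹果" (apple), the high-weight dimensions correspond to properties such as "food" and "solid"

- W_V:
In the V vector of "苹果", the high-weight dimensions carry details such as "color", "texture" and "nutritional content"

When `Q(吃)·K(苹果)` is computed, both sides have high activation on the "edibility" dimension, which produces a strong attention weight. V(苹果) then carries the specific information needed to produce the output (for example, translating it as "apple" requires knowing that this is the fruit, not the technology company).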
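
The dot-product intuition can be made concrete with hand-picked numbers. Everything in this sketch is hypothetical: real Q and K dimensions are learned and not labeled, so the three named features and the vectors below exist only to illustrate why a shared high-activation dimension produces a large attention weight.

```python
import numpy as np

# Hypothetical, human-readable feature dimensions (real models have no such labels).
features = ["edible", "concrete_object", "abstract"]

q_eat = np.array([2.0, 1.0, 0.0])          # Query of "eat": looks for edible, concrete things
keys = {
    "apple": np.array([2.0, 1.5, 0.0]),    # edible and concrete
    "stone": np.array([0.0, 2.0, 0.0]),    # concrete but not edible
    "idea":  np.array([0.0, 0.0, 2.0]),    # abstract
}

# Scaled dot products followed by a softmax over the candidate keys.
scores = np.array([q_eat @ k for k in keys.values()]) / np.sqrt(len(features))
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()
attention = dict(zip(keys, weights))
```

With these made-up vectors, "apple" receives by far the largest weight because it is the only key that is active on the "edible" dimension the query asks for.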

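Key Takeaway: The Wisdom of Self-Organization

The division of labor among Transformer parameters is, in essence, functional specialization that evolves naturally along different computational paths under a single objective function. The system needs no predefined job descriptions; through repeated trial-and-feedback cycles over massive data, it spontaneously forms an efficient information-processing system. This error-driven self-organization is the root of the strong representational power of deep learning models.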

 

 

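[A Companion Piece]

The Q/K/V Relationship: A Deeper Reading

The relationship between Q and K

- Q is a particular view, or projection, onto the K space
- Just as the same book can be looked up from different angles:
- Q1 subject category (K1: literature / technology / history)
- Q2 difficulty level (K2: introductory / intermediate / professional)
- Q3 writing style (K3: theory / practice / case study)

This is because Q "actively" seeks a feature that relates this token to other tokens, whereas K "passively" prepares features to be matched by other tokens. K is like an index that has to summarize all the main features of a token, while Q concentrates on querying one particular feature.

Seen this way, multi-head attention becomes more natural:

# Each head learns a different projection view
Q1 = token @ W_q1  # may focus on topical relevance
Q2 = token @ W_q2  # may focus on syntactic relations
Q3 = token @ W_q3  # may focus on semantic roles

Like different cross-sections of a high-dimensional space:
- Each attention head learns one particular "query view"
- Together these views build the full picture of the relations among tokens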
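
A runnable version of the per-head projections might look like the sketch below. It rests on simplifying assumptions: weights are random stand-ins for learned ones, the sizes are toy values, and the output projection that real Transformers apply after concatenating the heads is omitted.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with a stability shift."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(X, heads):
    """Run each (W_q, W_k, W_v) head on X and concatenate the head outputs."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v          # this head's own "view" of the tokens
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)
    return np.concatenate(outputs, axis=-1)

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 12, 4, 3                  # toy sizes (assumptions)
X = rng.normal(size=(5, d_model))                    # 5 tokens
heads = [
    tuple(rng.normal(scale=0.3, size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
out = multi_head_self_attention(X, heads)
```

Because every head has its own W_q, W_k and W_v, each one is free to learn a different matching criterion; concatenation then puts the separate views side by side.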

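The division of labor between K and V

- K: the "retrieval representation" of the information
- Contains the various features that might be queried
- Like a book's multi-dimensional tagging system

- V: the "content representation" of the information
- Contains the information that actually gets used
- Like the actual text of the book

## A concrete example
Take the word "开车" (to drive) as an example:

The different views multi-head attention might learn:
Q1: looks for the instrument of the action (highly related to "汽车", car)
Q2: looks for the agent of the action (highly related to "司机", driver)
Q3: looks for modifiers of the action (related to "快" (fast), "稳" (steady), and so on)

This reading explains nicely:
1. Why Q and K need to be separate
2. Why a multi-head QKV mechanism is needed
3. How the model automatically learns different types of contextual relations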

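Finally, let us look more closely at the third key role, V.

The continuity between V and the token representation

A token's V (Value) is the role most closely tied to that token's initial embedding, because both represent the token's content and meaning.

- Initial embedding: the general meaning of the token learned in large-scale pre-training, like looking it up in a dictionary
- Value vector: can be seen as the continuation and update of this initial representation in a specific context

In other words:
1. The embedding is the token's "basic dictionary definition"
2. The Value is the "concrete expression" of that definition in a specific context

How Value evolves through the model

As information flows through the stacked layers:

Initial embedding → layer-1 Value → layer-2 Value → ... → final representation

Along the way:
- The Value at each layer carries richer contextual information
- It also stays continuous with the original meaning of the token (residual connections help preserve this continuity where it would otherwise decay)
- The evolution is gradual, not abrupt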

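The essential difference between Q/K and V

- Q and K mainly serve the goal of "establishing relations" (colloquially, "digging slots and filling slots")
- Q and K extract the query features and index features used for matching
- Q and K are naturally more abstract and more general than V

- V directly carries the "concrete content"
- It contains the actual information the token needs to pass on
- It is more concrete and more detailed

Figuratively:
- Q/K are like the retrieval system of a library
- V is like the actual content of the books on the shelves

From the perspective of the whole model:
1. The initial embeddings enter the first layer
2. Each layer updates the token representations passed to the next layer through the attention-weighted sum and the feed-forward network
3. The final-layer representation captures the relations and meaning of the context and directly drives the output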
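
The three steps can be sketched as a small stack of layers. This is a simplified illustration under stated assumptions: one attention head per layer, random weights, a ReLU feed-forward network, and no layer normalization or output head; the names `transformer_layer` and `init_layer` are made up for this example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def transformer_layer(X, p):
    """One simplified layer: self-attention and a feed-forward net, each with a residual."""
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    attended = softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V
    X = X + attended                           # residual: keep the token's own content
    hidden = np.maximum(0.0, X @ p["W_1"])     # feed-forward network with ReLU
    return X + hidden @ p["W_2"]               # second residual connection

def init_layer(rng, d_model, d_ff):
    shapes = {
        "W_q": (d_model, d_model), "W_k": (d_model, d_model), "W_v": (d_model, d_model),
        "W_1": (d_model, d_ff), "W_2": (d_ff, d_model),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))                   # step 1: initial embeddings of 4 tokens
layers = [init_layer(rng, 8, 16) for _ in range(3)]

hidden_states = [embeddings]
for params in layers:                                  # step 2: layer-by-layer updates
    hidden_states.append(transformer_layer(hidden_states[-1], params))
final_representation = hidden_states[-1]               # step 3: input to the output head
```

The residual additions are what keep each layer's output anchored to the previous representation, which matches the gradual, non-abrupt evolution described above.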

 

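[Related]

The Truth behind DeepSeek's Explosive Rise: Letting AI Learn Slow Thinking by Itself, without Human "Minders"

This article is adapted from the second episode of the DeepSeek livestream series organized by InfoQ, on what is really different about DeepSeek's pure reinforcement learning route. In the livestream, Dr. Wei Li (李维), former VP of Engineering of the large-model team at Mobvoi (出门问问), focused on innovation in the reasoning paradigm and analyzed R1-Zero's contribution to democratizing reasoning models. He noted that DeepSeek, through open source and transparency, showed that process supervision is not required: outcome-based control alone can train an excellent reasoning model. This overturns both conventional wisdom and the view implied by OpenAI that reinforcement learning for reasoning needs supervision at every step.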

 

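DeepSeek's Greatest Achievement Is Making All of This Transparent

InfoQ: "DeepSeek sticks to a pure reinforcement learning route, yet the industry often calls RL 'alchemy'." How do they make the process controllable and "democratized"? What is the "innovation in the reasoning paradigm"?

Dr. Wei Li: Reinforcement learning for reasoning models has in fact long been a hard problem for the industry. About half a year ago, Ilya and others declared that the pre-training era had ended, meaning that improving performance purely by scaling up pre-trained models had become hard to sustain. GPT-5's long delay is another sign of pre-training's decline. So the industry began looking for a new growth path, and reasoning models started to stir quietly within the leading teams, until OpenAI released o1, the world's first large reasoning model. Soon afterwards DeepSeek's R1 broke out of the tech circle. That is the background to DeepSeek's explosive popularity.

From the mysterious Q-Star project, rumored to have triggered the boardroom fight at OpenAI, to the launch of o1, reasoning models came to be widely recognized by the AI mainstream as a new paradigm. The core of the paradigm is to switch on the model's "slow thinking" ability, the so-called System 2, and to use reinforcement learning to raise the model's intelligence on complex tasks. All of this, however, was closed source, and OpenAI even deliberately created an air of mystery by hiding the content of its chains of thought. Apart from a few leading players such as Google and Anthropic, which were quietly exploring and tracking it behind the scenes, other teams knew very little about this area.

DeepSeek's greatest achievement is making all of this transparent. Its models and detailed technical papers are all open, and it was not even afraid of exposing weaknesses: the system shows the full content of the chain of thought. Through pure reinforcement learning it proved that, even without process-level supervision data, outcome-based control alone can reach the level of the leading reasoning models. It was like poking through a paper window, letting the industry see the road to democratized reinforcement learning.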

 

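InfoQ: Innovation in the reasoning paradigm sounds abstract. Could you give an example?

Dr. Wei Li: DeepSeek's R1 paper is excellent, a standout in the large-model field. It has two parts. One is the Zero research, a result in the direction of pure reinforcement learning for reasoning, and it is brilliant. The other is R1, the practical system built on the Zero findings, a leading reasoning model that actually went live. Building R1 meant weighing practical concerns such as overall performance and safety, so the paper lays out in detail the best practice of four-stage training, helping other teams understand and reproduce the result.

The most exciting part of the paper is still the Zero research. It proves a disruptive point: contrary to conventional wisdom (or the view implied by OpenAI that reinforcement learning for reasoning needs supervision at every step), process supervision is not actually needed. Using only the final result (the "gold standard") as the supervision signal is enough to train the "slow thinking" process that a reasoning model needs.

This is Zero's biggest highlight and the origin of its name: it borrows the spirit of AlphaZero. AlphaZero was a first in AI history in not relying on human game records or experience at all. Using process data regenerated through self-play (step-level triples of board state + move + score), it achieved reinforcement learning with zero human supervision and ended up far stronger than the top human players. DeepSeek's Zero research is similar: it shows that in reasoning tasks the model can generate its own internal process data, namely chain-of-thought (CoT) sequences, without human annotation.

Concretely, reasoning models started with math and code, because those domains come with standard answers. At the macro level this is really a typical end-to-end supervised setup, since both the input (a math or coding problem) and the output (the answer or the result of running the code) are fixed and known. But the path from input to output is very complex and the information gap is large, so a "chain of thought" is needed as a bridge. Just as people facing a hard problem need to decompose it and think step by step, so does the model. DeepSeek's research found that the model is capable of learning this deep-thinking process on its own, as long as it is given enough time and room. Without that room, the model can only jump straight from question to answer; the information gap is large, randomness is high, and results are poor.

DeepSeek's solution is to guide the model's thinking with a simple template. Specifically, a [think] tag is inserted into the traditional supervised data of question+answer, giving question+[think]+answer. Under reinforcement learning the model fills in the blank by itself, regenerating process data of the form question+cot+answer, and learning iterates on that data. Reflection, self-correction and similar behavior then appear in the CoT automatically. This shows that once the model is given room to think, it can generate chains of thought on its own. Quite marvelous!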
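
The step from question+[think]+answer to regenerated question+cot+answer data can be sketched as a filter over sampled completions. This is a hedged illustration, not DeepSeek's code: the `[answer]` marker, the helper names and the hard-coded sample strings (standing in for text sampled from a model) are all assumptions made for this example.

```python
def split_completion(completion, marker="[answer]"):
    """Split a sampled completion into (chain of thought, final answer)."""
    cot, sep, answer = completion.partition(marker)
    if not sep:                      # no answer marker: nothing to check
        return completion.strip(), None
    return cot.strip(), answer.strip()

def regenerate_process_data(question, gold, sampled_completions):
    """Keep only samples whose final answer matches the gold answer.

    The chain of thought itself is never inspected: supervision is outcome-only.
    """
    kept = []
    for completion in sampled_completions:
        cot, answer = split_completion(completion)
        if answer == gold:
            kept.append({"question": question, "cot": cot, "answer": answer})
    return kept

# Stand-ins for completions a model might produce after the [think] tag.
samples = [
    "7 * 6 is 42. Wait, let me double-check: 7 * 6 = 42. [answer] 42",
    "7 * 6 is 48. [answer] 48",
    "I am not sure.",
]
data = regenerate_process_data("What is 7 * 6?", "42", samples)
```

Only the first sample survives, and it carries its self-checking chain of thought along with it; no human wrote or graded that chain.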

 

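Leave the Model Enough Room to Learn on Its Own

InfoQ: A dynamic reasoning path sounds like the AI drawing its own "mind map". But how do you keep it from going off track midway, for example suddenly writing poetry while it is writing code?

Dr. Wei Li: As things stand, that possibility is practically nonexistent, or so improbable that it can be ignored. Before DeepSeek published its results and research details, people were indeed puzzled by this: with only outcome supervision and no process supervision, wouldn't deep thinking fall into chaos? Before anyone had actually run large-scale reinforcement learning experiments, this was a serious open question. It is like flying a kite: you hold just one string and let the kite fly freely, and you worry that it might nosedive into the ground.

It now looks like the worry was overblown. The reason the model does not go astray is that all of this reinforcement learning for reasoning, including the self-generated chain-of-thought data, is built on top of an existing leading large model (such as V3). In learning from massive data, these models have already mastered how to speak fluently. Behind that fluency lies orderliness. It is not quite the same as logic, but at least it will not drift into the completely unreasonable. It is like a person who speaks fluently: the thinking behind the speech is, relatively speaking, also well organized.

So the model generates data on the basis of the original large model, and through filtering and reinforcement learning iterations it becomes more and more orderly. The way of thinking is produced naturally by the large model itself, and a selection mechanism keeps reinforcing it so that it leads ever more coherently to the correct answer.

That said, before researchers actually produced results, people were full of doubt, not knowing whether having machines learn to simulate human higher-order intelligence could really work. With a weak, small model it is hard to say whether this path would work. But V3 is a very strong base model. Letting the model generate its own chains of thought on that foundation works even though those chains are not always well organized. This is outcome-oriented reinforcement learning: as long as correct and incorrect results are consistently used to steer the process, occasional deviations in the chain of thought do not change the overall objective, and the model still learns to reason through very hard problems.

From a broader angle, once large models reach a certain level, everyday human data has largely been used up and little high-quality data remains. Further gains must rely on data the model generates itself. In the end, AI has reached the point where it must feed itself in order to improve further.

For a long time many people doubted this, worrying that a model teaching itself would degrade, or that even a good model teaching a weaker one would hit a ceiling. In hindsight, regenerated data keeps growing in importance. This holds not only for reasoning models but also for multimodal models. Take Sora: naturally aligned video-language data is scarce, and it is hard to find large amounts of data that narrate video content in detail. To align video with language, Sora took the regenerated-data route and used its own model to annotate the entire video training set in great detail. With the help of regenerated data, Sora became the first blockbuster video model. Today Chinese video models such as Kuaishou's Kling (可灵) and ByteDance's Jimeng (即梦) have caught up and are in some respects even stronger than Sora, which also owes much to regenerated data.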

 

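InfoQ: On the other hand, compared with OpenAI's o1, another notable highlight of DeepSeek R1 is that it applies chain-of-thought reasoning to creative writing and style imitation. Could you elaborate on this?

Dr. Wei Li: When o1 came out, everyone knew its math and coding abilities had improved markedly, because standard benchmarks showed it reaching a higher level. What people did not realize is that this reasoning ability, or "slow thinking", shines not only in fields that demand strict logical reasoning but also in traditional language and writing tasks.

Language has traditionally been the strong suit of large models; everyone knows they generate very fluent text. By the time of 4o or V3, models already wrote smoothly and there seemed to be little room left for improvement. Yet when asked to write a classical poem or to imitate Lu Xun's style, earlier models could not do it. With R1 these problems were solved. In terms of social impact, this is a big deal.

Honestly, not many people really care about math or code, even though we know that code is a major direction for the coming years and that automatic programming can change the world. Everything in IT ultimately comes down to software; the digital world is made of software. If software development can move from hand-writing to model-assisted or even model-authored code, productivity will rise enormously. Everyone can see this, but it is less tangible for ordinary people, who more often face tasks like making a piece of writing shine.

Once R1's talent in the humanities was discovered, it was no longer just geeks and software developers who saw the benefit of reasoning models; ordinary people were spreading the word too. With it, anyone can become a poet, a writer or a philosopher, and that is a powerful shock. People did not feel this when o1 came out, perhaps because OpenAI did not realize it, or at least did not focus on it. But while working on code and math reasoning, DeepSeek must have realized internally that this "slow thinking" could also lift writing ability by a large step, especially in Chinese.

As everyone knows, Chinese data is not as abundant as English data, so earlier large models could write beautiful English poems but fell short on Tang-style poetry. The cause may be that Chinese data was lacking in quantity or in quality, so the models did not learn well enough. We always found this regrettable: model-written poems sometimes rhymed and sometimes did not, sometimes had one character too many or too few, to say nothing of tonal patterns (平仄); something was always off. DeepSeek must have put real effort into this, and its data quality must be above the industry norm.

But data alone is not enough for a large model. The other leg is inference-time compute. When more compute and thinking time are spent at use time, we find that the model's writing moves up a level, which was a big shock to everyone. The chain of thought is a hallmark of the model's "slow thinking". At first we might assume that logical thinking is its core, and that a chain of thought must follow each logical step rigorously in order to do well in math, science and code.

What we did not expect is that literary creation does not require rigorous logical thinking; it needs imagination, and repeated weighing and revision. To write a beautiful essay or imitate a style there is a lot to consider: for classical-style poetry, rhyme, tonal pattern, diction, how to express modern concepts in classical language, and so on. Writing a good piece requires careful planning, which is essentially "planning", not just "reasoning" in the narrow sense. The real value of slow thinking is therefore to lay the groundwork for the final result: making a plan and revising repeatedly. Whether the task is in the humanities or the sciences, any hard task needs this planning time, just as we draft and revise again and again, and that is where the chain of thought comes into play.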

 

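InfoQ: How exactly does the chain-of-thought mechanism arise?

Dr. Wei Li: DeepSeek can produce complex chains of thought because it is trained on top of the leading model V3, and V3 covers far more knowledge than any individual knows. On that basis, the key is to leave the model room and give it the chance to learn on its own. The designer or developer needs to design that room and let the model fill it in. That is what DeepSeek did. It designed a format that leaves a "thinking" space between the input question and the output answer, marked with a [think] tag: question+[think]+answer. The think tag is the slot in which the chain of thought (CoT) is to be learned. It starts out empty, but the Zero research shows that simply leaving the tag gives the LLM room to fill in the CoT by itself. They then had their "aha" discovery: increasingly well-organized CoT emerged on its own, guided by GRPO, an iterative reinforcement learning algorithm that favors the better samples within each group. Nothing else was needed; the model wants to think and is able to think. An LLM really wants, or tends, to think and to think deep if given a chance.

For example, during reasoning the model may notice that an earlier conclusion conflicts with known facts and correct itself, saying: "Wait, something may be off here." This ability to reflect and self-correct forms naturally during learning. One can imagine how excited the researchers were; it was as if they had been handed a divine secret in person. They were not the only ones to say "aha": those of us who followed the paper also felt our eyes had been opened. It seems unbelievable, but it just works. The beauty of the Zero research is that there is no intervention from human-made process data at all; it is a miracle produced purely by reinforcement learning.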
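
The "best within the group" idea in GRPO comes down to scoring each sampled completion relative to its own group. The sketch below shows only that advantage computation, with made-up rewards; the policy update that uses these advantages is omitted, and the small `eps` term is an assumption added to avoid division by zero.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same question."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four completions sampled for one question: reward 1.0 if the final answer is correct.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average get a positive advantage and are reinforced, the rest are discouraged. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.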

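From an information-theoretic point of view, the chain of thought lowers perplexity and builds a bridge from a hard problem to its answer, which raises the probability of reaching the correct conclusion and so makes the model more intelligent.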
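
Perplexity is the exponential of the average negative log-probability the model assigns to the tokens it has to produce. The sketch below uses invented per-token probabilities purely to show the direction of the effect; they are not measurements from any model.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability of the given tokens."""
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# Hypothetical probabilities of the correct answer tokens.
direct_answer = [0.20, 0.10, 0.25]   # conditioned on the question only
after_cot = [0.90, 0.80, 0.95]       # conditioned on the question plus a chain of thought

ppl_direct = perplexity(direct_answer)
ppl_after_cot = perplexity(after_cot)
```

If intermediate steps make each answer token more predictable, as in these made-up numbers, the perplexity of the answer drops, which is the sense in which the chain of thought bridges the gap.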

 

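Reasoning Models Have Entered the Era of "Democratization"

InfoQ: If you were to recommend to programmers the one DeepSeek module most worth reproducing, which would it be? For instance, the various claims of replicating R1's "aha moment" for a few dozen dollars?

Dr. Wei Li: If I had to recommend the DeepSeek module most worth reproducing for programmers, it would probably be the part related to the Zero research. Such a reproduction is not about matching overall capability; it confirms the key point revealed by the Zero research, that a machine really can learn reflection, or slow-thinking reasoning, on its own. This is what OpenAI kept under wraps; perhaps they figured it out long ago but simply did not disclose it.

We now see at least five or six different teams reproducing R1's reflective ability with very few resources. This is not just an interesting experiment; more importantly, it marks the arrival of reasoning models in the era of democratization. Previously people did not know how reasoning models worked, only that large amounts of process data were supposedly needed for a model to learn slow thinking. This was seen as a barrier that was hard to cross, because process data is hard to obtain, and reinforcement learning is unstable and demanding on data, so many programmers felt this road was very hard.

Now we know that the hardest part, process data, can be bypassed, and that the "aha moment" can be reproduced with limited resources, proving that a model can learn slow thinking by itself. On that premise, if you are a domain expert, you will ask whether these techniques can bring a big improvement in your own project or application area. That is entirely possible. Even the strongest large models (such as V3 or 4o) reach only 60% to 70% accuracy in a specific scenario without optimization, while experience with real-life applications tells us that below 80% or 85% accuracy you simply cannot launch a system of real value.

Between a large model's out-of-the-box results and a system that can really be deployed and create value there is a gap. The only method we used to think of was collecting domain data for fine-tuning. Now we have a second path: follow the reasoning-model approach and let the system make full use of slow thinking at inference time, raising the quality of its results to an acceptable or even excellent level. This path seems to have been opened up.

However, a programmer friend told me that he ran a comparison between fine-tuning (SFT) and DeepSeek-style reinforcement learning (RL). He found that RL does beat SFT, but that the compute cost of RL training is currently still far greater than that of SFT. That RL outperforms SFT is understandable: SFT data is always very limited, whereas the data RL regenerates on its own, when reinforcement succeeds, far exceeds the SFT data in volume.

Look closely at R1's design: it is a practical system, not as pure as Zero. Zero is a research project meant to prove that a reasoning model can be built without human intervention. R1 is meant for real use, so it combines fine-tuning and reinforcement learning, following their own four-stage training pipeline of SFT+RL+SFT+RL. The first stage is fine-tuning, which uses about 2,000 items of human process data to improve efficiency; they call this the "cold start". After reinforcement learning come another round of fine-tuning and a final preference-based reinforcement learning stage, to ensure a suitable data mix and balance of capabilities as well as alignment with human preferences. This design was carefully thought through and probably went through many trials and adjustments before settling into a best practice.

It is hard to say that R1's design is the absolute best, but it does offer a good line of thought: we now have two tools, SFT and RL. If the two can be combined well and complement each other, we can build better systems in real application scenarios.

In a broader sense, DeepSeek is not merely a short-lived sensation, whatever the reasons for it; its more important effect is to greatly accelerate the move of large models into applications. For the whole industry this is a huge positive stimulus.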

 

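InfoQ: Some say large models are "brute-force aesthetics", yet Ilya, OpenAI's former chief scientist and co-founder, says pre-training has reached its end. What does that mean? Is the backdrop to reasoning models simply one more brute-force scaling law?

Dr. Wei Li: It is more a shift of technical focus and a paradigm change in innovation. A large model involves three major parts. First is pre-training, which gives the model its basic capabilities by learning regularities from massive data. Second is post-training, which at first was mainly fine-tuning; OpenAI early on also used some reinforcement learning (such as RLHF) to align with human preferences, but by the time of Meta's models they even dropped typical RLHF in favor of the simpler DPO because, like many others, they could not get it to work well. Last is the inference stage, when the deployed model interacts with users.

In principle each of these three stages may show a positively correlated S-curve between resources invested and performance gained, that is, some functional form of a scaling law. In the past pre-training received the most attention: people believed that as long as data kept growing and the model was big enough, capability would keep improving.

What is the underlying logic of LLM scaling? Why do large models still seem "underfed" at hundreds of billions of tokens, a scale that used to be unimaginable? Why does the scaling law still hold when going from hundreds of billions to trillions of tokens?

The key is that an LLM is a system for sequence learning (encoding) and sequence inference (decoding). A sequence is one-dimensional, but the patterns and regularities it contains are high-dimensional. For example, even a simple sequence like "cat chases mouse" may involve knowledge along several dimensions: species relations, predatory behavior, spatial movement. When such multi-dimensional knowledge is expressed at the sequence level, combinatorial explosion follows naturally. The "big appetite" for data is an effective strategy for coping with that explosion.

However, the high-quality data that humans produce naturally is limited. Pre-training has consumed almost all of the available high-quality natural data. The industry has begun to feel the constraint on data growth, and performance gains have become harder. GPT-5 has been slow to arrive, and reportedly large amounts of compute were invested with limited return, which suggests that pre-training may have hit a bottleneck.

So the industry began exploring other growth curves for AI capability. Reasoning models based on reinforcement learning moved to center stage against this backdrop: pure reinforcement learning should be added in post-training. Earlier reinforcement learning relied on human preferences, but this time the model is given more time to think before it answers, and learns the underlying regularities. V3 was already very good, but at the time it did not cause much of a stir outside the industry. Only with R1 did DeepSeek truly break out, becoming the most talked-about public topic after the Spring Festival and sparking heated discussion and shock overseas as well. R1 represents a new paradigm. Before R1, only OpenAI had released a reasoning model of this kind, o1, which felt out of reach; nobody knew how to follow. R1 not only reproduced o1's capabilities but did so more transparently and clearly. This contrast further underlines R1's importance as a leader among open-source large models.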

 

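Thinking Big about the Future

InfoQ: At first glance DeepSeek is engineering pushed to the extreme. Why did it cause a worldwide sensation? Why was its user acquisition (over a hundred million in a week) faster than ChatGPT's at the time of its "nuclear blast"? Where does it stand historically?

Dr. Wei Li: From my own experience, ChatGPT's explosion was a landmark event in the history of large models. Those of us in the field had been following large models before ChatGPT, at least since GPT-3. When the GPT-3 Playground appeared we enjoyed it immensely and already sensed a storm coming. But in terms of society's perception, what really shook everyone was ChatGPT; it hit us like a nuclear blast and exceeded all expectations. When ChatGPT came out we fell into a state of fascination.

I regard the arrival of R1 as the second great shock after ChatGPT. Of course, between ChatGPT and R1 there were other influential models, such as 4o, itself a remarkable milestone. We thought ChatGPT was already very good, and version 3.5 was already excellent, but 4o proved it could be better still. We have kept it at hand for daily use. Then came Sora, a video model that was also a shock. I personally am especially fond of a music model called Suno, which is excellent at music creation and made me feel I could become a musician overnight, writing whatever song I wanted and pairing it with my own video. Each of these models brought a shock at its own stage, but none as strong as R1.

If I had to rank them, R1's impact is second only to ChatGPT's and even exceeds the sensation created by 4o and Sora. The feeling R1 gives is somewhat like that of ChatGPT when it first appeared: captivating. ChatGPT was the groundbreaking model, while R1 is on the whole a follower, although it has many innovative highlights and in some respects even surpasses earlier models, for example in classical poetry and style imitation. For a follower to cause such a stir on both sides of the Pacific and around the world is a near-miraculous achievement.

In terms of practical effect, R1's productization was very successful. It gained over a hundred million users within a week, far beyond the record set by ChatGPT, and raised society's awareness of AI. In addition, regarding the influence of geopolitics on technology use, many users in China have long wanted to use the world's most advanced large models, such as the GPT series, Claude or Gemini, but often could not reach them. With R1, people no longer need to worry about restrictions at home or abroad. These are also factors behind R1's rapid spread.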

 

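InfoQ: What is your ideal ultimate form of AI programming? Is it a programmer telling the AI "build me a Douyin" and the AI directly outputting deployable code plus an operations plan?

Dr. Wei Li: There are always two kinds of people, skeptics and optimists. People like Ilya believe that artificial general intelligence (AGI) is imminent and artificial superintelligence (ASI) is not far off, so the biggest problem now is ensuring that superintelligence is safe.

Anthropic's CEO expects that within the next 3 to 5 years large models will achieve a real breakthrough: not just the performance and demos that amaze us today, but a genuinely revolutionary change in productivity across society. What they are describing ultimately comes down to AI replacing human physical and mental labor at scale. Large models are generating a lot of excitement, but their actual use in daily life is still far below the level of the previous generation of mobile-internet platforms. The super apps of that generation, such as Meituan, Didi, Xiaohongshu and Douyin, changed the main aspects of our daily lives, whether food, clothing, housing and transport or communication and entertainment; they shortened the distance between suppliers and customers as far as possible, and we feel that value every day. Playing with large models is fun, but their practical value in everyday life is not yet obvious, and applications are still on the eve of taking off.

It is worth pointing out that DeepSeek has lowered the threshold for large-model applications and paved the way for them, although we have not yet entered the era in which applications truly take off. When AI applications do take off, when will that be and what will it look like? I believe the ultimate goal is for AI to replace humans across both mental and physical labor. The signs of large models' impact on white-collar workers are already clear, and even programmers are hardly spared. On the physical side, embodied intelligence is also advancing quickly; humanoid robots and robotic arms alike are gradually taking over human physical labor.

Of course this brings side effects, such as the disappearance of large numbers of jobs. How society adapts to a state of greatly expanded productivity but scarce jobs is a discussion at another level. But judging by the nature and ultimate goal of AI, its development can have two milestones. The first is when it can replace 50% of human work, so that society needs only half of its people to work while the rest maintain a decent, free life through universal basic income (UBI) or similar means; in my view that marks the arrival of AGI. The second is when it can replace 90% of human work, which may count as the emergence of so-called artificial superintelligence (ASI), a kind of technological communism in a sense.

[Related]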

Does the New Reasoning Paradigm (Query+CoT+Answer) Support a New Scaling Law?

— Reflections on LLM Scaling Laws and DeepSeek's R1

My friend Zhang Junlin's article "Looking at the Future of Scaling Laws through DeepSeek R1" has sparked interesting discussions among peers.

Core Insights from Initial Discussions

Professor Bai summarised the key highlights as follows:

- Infinite stacking won't lead to infinite growth (physical laws don't support this)

- Only S-shaped growth is possible, with diminishing returns inevitably appearing

- The initial emergence of language capabilities relates to the density of linguistic knowledge in training data

- The next growth phase represents a second S-curve, driven by common sense knowledge, which requires more computing power due to lower knowledge density

- The third phase involves learning logical reasoning (Chain of Thought), where natural data has even lower density of such knowledge. Brute-force mining with computing power becomes inefficient, making reinforcement learning with synthetic data a more rational approach

As Dr. Lu points out: The term "Scaling Law" is becoming overloaded. While S-curves (nonlinear curves characterized by sigmoid functions) can describe technology adoption lifecycles, they typically occur in succession (one technology hits its ceiling, making way for another). Large language models' multiple "Scaling Laws" confirm this pattern, with some overlap between Test-Time and Post-Training "Scaling Laws".
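
The picture of successive S-curves can be sketched numerically. The ceilings, midpoints and steepness values below are invented for illustration and are not fitted to any model or benchmark; the point is only the shape that results when a second sigmoid takes over after the first has flattened.

```python
import numpy as np

def s_curve(x, ceiling, midpoint, steepness):
    """Logistic (sigmoid) growth curve that saturates at `ceiling`."""
    return ceiling / (1.0 + np.exp(-steepness * (x - midpoint)))

compute = np.linspace(0, 20, 201)    # abstract "resources invested" axis
first = s_curve(compute, ceiling=1.0, midpoint=5.0, steepness=1.0)    # e.g. pre-training
second = s_curve(compute, ceiling=0.5, midpoint=14.0, steepness=1.0)  # e.g. a later paradigm
total = first + second
```

The summed curve rises, plateaus once the first curve saturates, and rises again when the second one kicks in, yet it stays bounded by the sum of the two ceilings. That is the "no infinite growth from infinite stacking" point in numerical form.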

The Nature of LLM Scaling

Let's examine the fundamental logic behind LLM scaling. First, it's crucial to understand that LLMs are not databases: they don't aim to memorize the details of long-tail data. Large model training essentially compresses big data, or more precisely, compresses the knowledge systems behind the data (including common sense and encyclopedic knowledge), focusing on capturing patterns and regularities of various kinds (what we call generalizations).

Conventional intuition suggests that as data scale increases, redundancy increases too. Regardless of filtering, cleaning, and deduplication, growing redundancy seems to imply diminishing returns. So why do large models still appear "hungry" even at the unprecedented scale of hundreds of billions of tokens? Why does the scaling law remain effective from hundreds of billions to trillions of tokens?

The key lies in LLMs being sequence learning and sequence decoding systems. While sequences are one-dimensional, the patterns and regularities behind them are high-dimensional. For instance, even a simple sequence like "cat chases mouse" potentially involves multiple knowledge dimensions: species relationships, predatory behavior, spatial movement, actor-patient roles, etc. This multi-dimensional knowledge naturally leads to combinatorial explosion at the sequence level as information is flattened in language. The seemingly insatiable "appetite" for big data is an effective response to this combinatorial explosion. As long as there isn't complete information redundancy, additional diverse sequences will help models abstract data patterns more precisely.

The Two vs. Three S-curves Debate

Zhang Junlin observes that since OpenAI's o1, two other phases have gained recognition with their own Scaling Laws: the reinforcement learning Scaling Law (RL Scaling Law) for post-training, and the Inference Scaling Law (also called Test Time Scaling Law).

This raises a crucial question: Are there really three S-curves, or just two? How comparable is the reasoning model's S-curve to the pre-training S-curve?

In theory we can identify three phases:

- Pre-training
- Post-training (especially reasoning-focused reinforcement learning)
- Inference phase

In practice, however, post-training and inference likely share a single S-curve; there aren't two independent growth curves.

DeepSeek R1's Insights: The Truth About "Slow Thinking"

Consider DeepSeek R1: users can activate "deepthink" mode to enable Chain-of-Thought (CoT) reasoning, but they can't actually control reasoning quality by increasing computation time. Why is this?

Let's examine a concrete example. When R1 solves a complex mathematical problem:

Traditional models might directly answer: "The result is 42"

R1 shows detailed reasoning: "Let's think step by step: 1) First consider... 2) Then we can... 3) Finally, we get 42"

While R1's response appears to demonstrate "slow thinking" (CoT), this reasoning process actually reflects a generation pattern fixed during training, not dynamic exploration of multiple potential reasoning paths at response time. In other words, CoT+answer might look like "slow thinking," but it doesn't fundamentally change the unidirectional next-token prediction paradigm. R1's CoT+answer creates an illusion of slow thinking, but the generative nature remains fundamentally the GPT "fast thinking" paradigm. At test time, unlike in AlphaGo, the depth and scale of thinking aren't dynamically explored, though beam search, if applied, can provide implicit multi-path optimization internally.

Test Time Compute Constraints

The industry buzzword "test time compute" refers to reasoning models requiring more online computational resources than traditional non-reasoning models. For example, R1 with CoT enabled might need several times more computation time than its base model V3 for the same problem. However, this increased computation results from behavior patterns acquired during training, not from dynamically adjustable compute investment. Without controllable scalability in test time compute, we can't really talk about a test time scaling law.

A major difference between pre-training and CoT reinforcement learning lies here: pre-training scaling laws can remain stable long-term because once training completes, it doesn't significantly impact online response time - the generation mode remains a simple query+answer. Therefore, offline training for months is acceptable if the resulting model shows significant capability improvements. However, reasoning models' post-training CoT reinforcement learning differs - it cultivates models' habits of responding with slow thinking, changing the generation mode to query+CoT+answer. Extending the CoT isn't just about the cost of training resources and time; more critically, it reflects in extended test time compute for each query during deployment, severely delaying system response time. Users generally have limited tolerance for slow thinking computation time and delays during online system use.
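
A back-of-the-envelope calculation shows how the query+CoT+answer mode translates into response delay. The token counts and decoding speed below are hypothetical, and the model ignores prompt processing and queuing; it only captures the fact that autoregressive decoding time grows with the number of generated tokens.

```python
def decode_seconds(output_tokens, tokens_per_second):
    """Rough decoding time, assuming a constant generation speed."""
    return output_tokens / tokens_per_second

answer_tokens = 200        # hypothetical length of the final answer
cot_tokens = 1800          # hypothetical length of the chain of thought
speed = 50.0               # hypothetical decoding speed in tokens per second

fast_mode = decode_seconds(answer_tokens, speed)                 # query + answer
slow_mode = decode_seconds(cot_tokens + answer_tokens, speed)    # query + CoT + answer
slowdown = slow_mode / fast_mode
```

With these assumed numbers the user waits ten times longer for the same answer, and the cost recurs on every query, unlike a one-off increase in offline training time.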

The Sustainability Debate

OpenAI's Sam Altman and Anthropic's Dario might argue that for extremely complex problems (like proving the Riemann hypothesis or designing next-generation aerospace vehicles), even if a model needs a week of computation time, it's still a massive improvement over human teams requiring decades. However, this argument has two issues:

1. LLM feasibility for such super-complex problems remains far from validated

2. Extreme scenarios lack universality and can't serve as data points for sustainable scaling laws

This isn't to deny S-curves as effective models for describing scaling laws, nor to reject the rationality of S-curve stacking. The combination of pre-training and post-training growth curves (s1 and s2) might indeed reflect the overall relationship between resource investment and performance improvement. However, we should carefully examine whether CoT reasoning truly opens a sustainable scaling curve.

Conclusion: How Far Is the LLM Road to AGI?

If reasoning models' scaling laws lack sustainability, this raises a deeper question: Can we reach the promised land of Artificial General Intelligence (AGI) through these two scaling laws alone? Furthermore, is the technical ideal of Artificial Super Intelligence (ASI) - AI replacing human labor and dramatically improving productivity - truly feasible?

Current evidence suggests that while pre-training scaling laws have shown considerable sustainability, reasoning models' scaling laws may quickly hit practical constraints. This reminds us that the path to AGI/ASI likely requires more innovative breakthroughs, not just simple extrapolation of existing methods. In the next phase of artificial intelligence development, we might need to discover entirely new growth curves.

[#LLMs #ArtificialIntelligence #DeepLearning #AGI #ScalingLaws #MachineLearning]

 

[Related]

Zhang Junlin: Looking at Scaling Laws through DeepSeek R1

Technical Deep Dive: Understanding DeepSeek R1's Reasoning Mechanism in Production

A detailed analysis of how DeepSeek R1's inference mechanism works in production, and how it differs from training-time reinforcement learning.

Training vs. Deployment: Key Questions

1. Training Phase (GRPO): Does the reinforcement learning mechanism generate multiple candidate CoT+answer sequences to optimize the policy and cultivate "slow thinking" habits?

- The answer is definitively yes.

2. Deployment Phase: Does R1 implicitly generate multiple paths during inference but only display one? If so, how does this mechanism compare to traditional ensemble methods?

3. Comparison with AlphaGo's MCTS: How does R1's mechanism fundamentally differ from Monte Carlo Tree Search?

1. Inference Mechanism in Production

DeepSeek R1's real-time reasoning can be characterized by two modes:

A. Implicit Multi-path Generation and Selection

- Generation: The model may implicitly generate multiple potential reasoning paths (CoT+Answers) during a single inference but outputs only one.

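- Technical Implementation: With a beam-style decoding strategy (e.g., a beam width greater than 1), the decoder maintains multiple candidate sequences and ultimately selects the highest-scoring path. Whether the production service actually decodes this way is an assumption of this analysis.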

- User Experience: Users see only the final output, though internal multi-path exploration occurs.

- Efficiency Trade-off: Setting beam_width=1 (greedy search) defaults to single-path generation for fastest response; increasing beam width improves quality at the cost of latency.
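
The beam-width trade-off can be seen on a toy example. The sketch below is a generic beam search over a hand-written probability table; it illustrates the decoding technique only and says nothing about how DeepSeek's serving stack is configured. `TOY_MODEL` and its numbers are invented.

```python
import math

# Hypothetical toy "model": next-token probabilities conditioned on the prefix.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search(model, steps, beam_width):
    """Return the `beam_width` best (sequence, log-probability) pairs after `steps` tokens."""
    beams = [((), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, logp in beams:
            for token, p in model[seq].items():
                candidates.append((seq + (token,), logp + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]          # prune to the best partial paths
    return beams

greedy_seq, greedy_logp = beam_search(TOY_MODEL, steps=2, beam_width=1)[0]
wide_seq, wide_logp = beam_search(TOY_MODEL, steps=2, beam_width=2)[0]
```

With `beam_width=1` the search commits to "A" (probability 0.6) and ends at ("A", "x") with total probability 0.33. With `beam_width=2` it keeps "B" alive and finds ("B", "x") with probability 0.36. The wider beam costs more computation per step, which is the latency side of the trade-off.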

B. Explicit Multiple Candidate Generation (Optional)

- API Control: The num_return_sequences parameter allows explicit generation of multiple candidates.

- Practical Application: While not enabled by default in the DeepSeek App, this functionality may be available through enterprise APIs or open-source implementations.

2. Training Phase: Cultivating "Slow Thinking"

A. Role of Reinforcement Learning

- Objective: GRPO algorithm trains the model to generate more detailed, logical reasoning steps (longer CoT) to maximize rewards.

- Mechanism: Training generates multiple candidate answers, with rewards evaluating both answer correctness and format correctness.
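
A rule-based reward that checks both format and answer correctness might look like the sketch below. The tag names, the equal weighting of the two terms and the exact-match comparison are assumptions made for illustration; the source only states that both answer correctness and format correctness are rewarded.

```python
import re

# Assumed output format: reasoning inside <think>...</think>, result inside <answer>...</answer>.
FORMAT_PATTERN = re.compile(r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL)

def format_reward(completion):
    """1.0 if the completion follows the assumed think/answer layout."""
    return 1.0 if FORMAT_PATTERN.match(completion.strip()) else 0.0

def accuracy_reward(completion, gold):
    """1.0 if the extracted answer matches the gold answer exactly."""
    match = FORMAT_PATTERN.match(completion.strip())
    return 1.0 if match and match.group(2).strip() == gold else 0.0

def total_reward(completion, gold):
    """Unweighted sum of the two rule-based terms (the weighting is an assumption)."""
    return accuracy_reward(completion, gold) + format_reward(completion)

well_formed_correct = "<think>7 * 6 = 42</think> <answer>42</answer>"
well_formed_wrong = "<think>7 * 6 = 48</think><answer>48</answer>"
unformatted = "42"
```

Note that nothing here grades the reasoning inside the think block: the content of the chain of thought is shaped only indirectly, through whether it leads to a correct, well-formed answer.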

B. Driving Forces Behind CoT Growth

- Reward Design: Longer CoTs naturally emerge when they lead to better answers.

- Data Feedback: High-quality SFT data generated through rejection sampling enhances this pattern.

3. Comparison with Ensemble Methods

Similarities

- Multi-path generation conceptually similar to ensemble predictions

- Result filtering comparable to voting/weighted averaging

Key Differences

R1's implicit multi-path generation is fundamentally a dynamic decoding strategy within a single model, distinct from traditional ensemble's static combination of multiple models.

4. Fundamental Distinction from AlphaGo's MCTS

AlphaGo's MCTS

- Dynamic Planning: Builds search trees through simulation

- Online Learning: Adjusts search strategy based on real-time feedback

R1's Implicit Multi-path Generation

- Static Model: Fixed parameters during deployment

- No Reward Modeling: Path selection based on model probability rather than cumulative rewards

Key Insights

1. Training phase GRPO cultivates detailed CoT capabilities for effective single-pass inference.

2. Deployment allows flexible trade-off between single-path (for speed) and multi-path (for quality) generation.

3. While model parameters are fixed post-training, decoding strategies offer some runtime flexibility.

4. R1's multi-path generation fundamentally differs from both traditional ensembles and MCTS-style dynamic planning.

This architecture achieves a practical balance between efficiency and effectiveness for large-scale industrial applications, though it sacrifices some dynamic planning and global optimization capabilities.

#ArtificialIntelligence #MachineLearning #DeepLearning #LLM #DeepSeek

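[Related]

Is Human Feedback a Shackle on Superintelligence?

Before answering this question, let us start with AGI and ASI.

AGI (Artificial General Intelligence)
ASI (Artificial Super Intelligence)

In the recent history of AI these two terms became popular at different times, yet they are often used interchangeably. They are the words most often on the lips of AI prophets (Ilya being a leading example) and entrepreneurs (including Sam Altman and Musk), serving as goals to motivate themselves and their teams, and also as marketing to investors and the public.

Here is my view.

When machines match or surpass human skills, whether the professional ability of top individuals (for example Go champions or professors at elite universities) or the overall knowledge level of humankind, that is AGI in my eyes. But I do not take professional ability and knowledge level here to include the capacity for major invention and creation. AGI in this sense is a definite trend; the only difference is whether it arrives within two years or within five.

AGI is beyond doubt: it is happening, has happened, and is about to happen.

ASI means surpassing top human intelligence across the board, including the capacity for invention and creation. Whether ASI can be achieved is still open to debate. Those who are already convinced that ASI can be achieved in the not-too-distant future (some say three to five years, others around ten) and who advocate for it are mainly Ilya, Dario (CEO of Anthropic) and other "prophets" of the AGI era; they are believers. Altman and Musk also seem to be marketing ASI-like concepts, but that feels driven more by an entrepreneur's need to sell a vision.

There is less consensus on ASI than on AGI, but it can be described. When ASI is achieved, machines will be able to crack problems that have troubled mathematicians for centuries, and models at the level of Chen Jingrun can be mass-produced to settle problems like "1+n" (Goldbach-type problems). More importantly, ASI (for science) could by itself create highly effective new drugs targeted at diseases, and speed up invention by n orders of magnitude (Dario, I think, said n equals 2) compared with humanity's slow exploration. All this would bring material abundance, bring major diseases under effective control or even eliminate them, and at least double the human lifespan. In a word, ASI means the full realization of technological communism.

Is human feedback a shackle on superintelligence?

If so, how are we to understand the principle of being human-centered and aligned with humans?

As it now appears, human-centeredness and alignment with human (preference) feedback refer to the final result or product, which is where human values are expressed. This will never change, nor should it. But it must be stressed that humans decide the what, not the how. The what is always for humans to decide. As for the process of producing the result, humans now seem less and less able to help, and are even becoming an obstacle instead of an aid.

An interesting example: thirty-odd moves into a game, AlphaZero played a move that humans could not understand and that even a world champion would have judged foolish. Yet it was a brilliant step in a superhuman process, a subtle link in the winning plan. That even champions could not understand such a move shows that machine intelligence had clearly gone beyond the boundary of human intelligence. If the process depended on human feedback, even with Go champions doing the labeling, it would hold back the machine's superhuman potential.

When superhuman intelligence of this kind appears in large quantities, humans will naturally feel uneasy. By human nature everyone has some desire for control, and toward a process one does not understand and cannot control, one is always wary, or at least very uncomfortable. Unfortunately there is no remedy. More and more incomprehensible miracles, or feats of technological magic, will appear. What humans can do is strengthen goal setting and outcome control, instead of overreaching by trying to control the process.

Finally, a word on Musk's AI-threat argument, which mainly likens humans to ants and ASI to humans: ASI would need no malice to wipe out human civilization, because ants place no burden on the human conscience.

I think this analogy is absurd, because ants could never create humans, whereas ASI is created by humans. Humans and ants are both animals, but they do not share a frame of reference for values.

But we cannot rule out that humans, in the form of ASI, may create their own God.

If God is, as commonly understood, a personified being, humans may well turn the machine into God. Whichever view you hold on who was created in whose image, the key is that God and humans sit in the same frame of reference of human values. God is supremely good, supremely beautiful and all-powerful. Goodness, beauty and power are all expressions of human values.

Ants are different. Ants and humans are not in the same frame of reference; humans are way beyond ants. Ants have no yardstick of their own with which to measure humans. But humans do have a measure, or an image, of God.

When humans cannot understand or judge a result (as opposed to a process), or see no benefit in it, the simplest and most natural response is to stop pursuing that result. However super the process intelligence, if it lacks a direction set by humans or goes against human values, the stop button can (in principle) be pressed at any time.

So Musk's kind of worry is alarmist and unfounded.

This is not to say that AI poses no more realistic threats: for example the social confusion caused by the inability to tell real from fake, the panic caused by jobs being replaced before a welfare system is in place, and the mismatch and maladjustment caused by lagging institutions and accelerating technology. These are huge social problems that are visible, arriving and foreseeable, unlike the fantasy of machines ruling humans.

Nor can we rule out the harm that ASI could do to humanity if bad actors use it maliciously, but that is by no means ASI wiping out humans at will the way one treats ants. Malicious use resembles the potential consequences of nuclear proliferation, and will ultimately have to be contained the way nuclear weapons are.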

 

[Related]

Will DeepSeek Fail at Simple Math Problems?

Amid the waves of surprises brought by DeepSeek, an old friend pointed out that it struggles with simple math problems, using a popular elementary arithmetic question as an example:

Is 3.11 greater than 3.8?

What’s the core issue here?

In the wake of the DeepSeek frenzy, I looked into its research paper, which explains how its reasoning capabilities are enhanced through outcome-oriented reinforcement learning. The paper suggests that, in theory, outcome-oriented reinforcement learning can help a model learn proper reasoning processes. However, in practice, it’s not necessarily so.

Take the above math problem as an example. It is a binary yes/no question, meaning even random guessing has a 50% chance of being correct. This highlights a key potential flaw: outcome-oriented supervision signals are weak because they lack sufficient granularity. This kind of weak supervision inevitably hampers the model's ability to learn proper reasoning processes.
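
Both halves of this argument are easy to check in a few lines: the comparison itself is trivial when the numbers are treated as decimals, and a policy that guesses at random still collects the outcome reward about half of the time. The simulation below is a toy illustration with an invented "random guesser", not an experiment on any model.

```python
import random
from decimal import Decimal

def is_greater(a, b):
    """Compare two decimal numerals exactly, avoiding string or version-style comparison."""
    return Decimal(a) > Decimal(b)

gold = is_greater("3.11", "3.8")     # the correct answer to "Is 3.11 greater than 3.8?"

# A policy that ignores the question and guesses yes/no uniformly at random.
rng = random.Random(0)
trials = 10_000
hits = sum(rng.choice([True, False]) == gold for _ in range(trials))
random_guess_accuracy = hits / trials
```

An outcome-only reward cannot tell these lucky guesses apart from answers reached by sound reasoning, which is why the supervision signal on a yes/no question carries so little information about the process.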

Three Possible Solutions

  1. Scaling Up the Model
    One approach is to make the model larger and deeper, hoping that the theoretical concept of lossless compression based on Kolmogorov complexity can be pushed to its limit. In doing so, proper reasoning, as the "shortest program," might eventually be learned by the model. Theoretically, correct reasoning ensures accurate results. However, the gap between theory and practice makes it hard to place much confidence in this. The shortest program or lossless compression might just be an unreachable ideal.
  2. Targeted Supervision Data
    Another solution is to feed the model problem-specific supervised data, for example thousands or tens of thousands of reasoning cases involving this kind of math problem. There's no reason the model wouldn't learn from this. However, solving one specific problem this way is merely a stopgap measure. Soon, others will come up with new edge cases involving weak supervision signals and reasoning pitfalls to challenge it.

    Another common challenge is the so-called "self-identification" problem. For instance, when asked "Who are you?", many models, including earlier versions of DeepSeek, would claim to be ChatGPT developed by OpenAI if no targeted supervised data is injected. After all, ChatGPT has dominated the internet in the two years since its explosive debut, and its data has inevitably influenced other models. However, this issue is already on the radar for specialized solutions and is gradually becoming less of a problem. Some Western media still claim that DeepSeek is just a distilled version of ChatGPT. Their evidence? Probably early versions they tested, in which the DeepSeek bot often claimed to be OpenAI's ChatGPT. But if you test it now, you won't see this problem anymore. Most likely, it was fixed with specialized training data. Their research paper also mentions self-identification as a problem they addressed.

    Similarly, the problem of comparing 3.11 and 3.8 can also be a transitional issue. If it disappears in the future, it won’t be a cause for celebration. Most likely, it will be resolved through targeted fixes rather than through fundamental improvements in intelligence brought about by algorithms or architecture changes or innovations.

  3. Re-introducing Process Reward Models (PRM)?
    The inherent weakness of outcome-oriented supervision is that it looks only at the result and does not check the process, a natural shortcoming of results-oriented pragmatism in RL (following the "black cat, white cat" principle, lol). This is essentially the cost of abandoning PRMs (Process Reward Models). So, would re-introducing process-based reward models solve the issue? Honestly, we don't know. This is the third possible path, and it might be worth exploring. But again, as mentioned in my previous blog post (DeepSeek's R1 Paper: A Storm in AI LLM Circle), PRMs aren't easy to work with: they're unstable and difficult to implement, although, in theory, they could help correct nonsensical reasoning during the process.

In conclusion, the issue with DeepSeek struggling with problems like 3.11 vs. 3.8 lies in the limitations of weak supervision in results-oriented reinforcement learning. While there are potential solutions—scaling the model, targeted data, or process reward models—each comes with challenges and trade-offs. Whether any of these approaches can fundamentally improve reasoning capabilities remains an open question.

 

 

 

The Turbulent Second Chapter of Large Language Models: Has Scaling Stalled?

The recent Chinese podcast from Guangmi's quarterly report on large language models, discussing the "scaling paradigm shift" toward AGI (Artificial General Intelligence), is well worth a listen. It touches on many key topics related to the AI industry landscape, offering a unique perspective and style.

The term "paradigm shift" may sound a bit dramatic, but as a seasoned analyst, Guangmi uses it to describe the current turbulent landscape accurately. While the AI arms race among industry giants is still in full swing, real-world scalable applications of these models are struggling to materialize. The question of how to justify investments has become a significant pressure point, or perhaps even a looming bubble.

Let's revisit some AI basics. There are three main types of learning in LLMs (Large Language Models):

(i) supervised learning;
(ii) unsupervised learning (self-learning/pre-training); and
(iii) reinforcement learning (RL, self-play/post-training).

Ilya has emphasized the importance of RL in exploring new directions for LLMs. Guangmi's podcast highlights RL as the pathway to the paradigm shift in AGI through large models.

Historically, two key milestones in RL have stood out: AlphaGo's victory over top human Go players, which shocked the world (followed by AlphaZero's pure self-play training), and RLHF (Reinforcement Learning from Human Feedback), which aligned models with human preferences and paved the way for ChatGPT’s explosive growth.

Currently, discussions revolve around the potential of a new RL-driven ecosystem for large models (though there's no broad consensus—it's primarily a conversation within small Silicon Valley circles) and the emerging trends in the "arms race" of large models. Here’s the context:

1. Pre-training scaling seems to have hit a bottleneck, with GPT-5 still unreleased;
2. The overall momentum of the arms race remains unchanged among the major players (the billionaire clubs/giants);
3. Key tech figures are proposing new roadmaps or trying to construct new scaling laws to continue the AGI journey.

Guangmi closely monitors trends in Silicon Valley. His small team conducts in-depth research in the Bay Area and has established extensive contacts. Having chatted with them over coffee a couple of times, I’ve found them to be a dynamic, young team under his leadership—a small but sharp presence.

Guangmi’s thoughts are well-structured, and his breadth of knowledge and understanding of the larger context are impressive. This is no small feat, as the landscape of large models, both in terms of the models themselves and the industry, is often akin to the parable of the blind men and the elephant. Even top experts and business leaders struggle to assess the full picture. Just recently, Meta’s Zuckerberg responded to a question about whether the AI arms race would deliver the expected AGI returns, essentially saying: “No one really knows, but we can’t afford to miss out,” reflecting a typical FOMO (Fear Of Missing Out) mindset.

We’re currently in a delicate phase with little consensus. However, the few tech giants that have propelled Nvidia’s stock to astronomical levels won’t allow the arms race to slow anytime soon, as it is central to their tech and business dominance. OpenAI continues to raise funds, and Ilya, with his new company, recently secured more investment, all of which keeps the race heated.

At the same time, the obsession with scaling among tech elites and the mainstream AGI circles in Silicon Valley persists. The endless demand for resources driven by this scaling wave of large models means that only a small circle of tech insiders has the opportunity and resources to experiment, sense, and adjust the roadmap.

According to Guangmi, the so-called self-play RL scaling is currently gaining traction within a small circle of about 200 tech elites in Silicon Valley, indicating that this is still a nascent trend—one that even management leaders have not fully aligned with yet.

It seems Guangmi adopts a “prophet” mentality at times, perhaps exaggerating this trend to alert his audience. He even suggests that if he were a large-model entrepreneur, he would focus 200% of resources on RL, betting on it as the future path to victory.

In reality, for most people, this advice is neither practical nor actionable—it’s likely aimed at tech giants or unicorns, though even for them, it may fall on deaf ears.

Reinforcement learning is inherently challenging. Even the open-source leader, Meta's LLaMA 3, has chosen to sidestep RLHF in post-training alignment. So, it's even less realistic to expect large-model teams to fully bet on RL as the core of a new ecosystem. Furthermore, this trend is, at best, a “subtle undercurrent” in Silicon Valley. We’ll likely have to wait until OpenAI’s “Strawberry” or the new version of Claude is released later this year to fully assess its impact.

It seems the first chapter of LLM scaling has indeed come to an end. The actionable items in the so-called second chapter might not emerge from lofty, exploratory scaling directions with an uncertain roadmap. Instead, the focus should be on finding market entry points, accelerating applications, and addressing genuine market needs (PMF, product-market fit), especially as the inference costs of top models like GPT-4o/Claude 3.5 become more affordable, and multimodal capabilities (such as advancements in hyper-realistic full-duplex voice and video) further enhance application opportunities.

For the industry, the bottleneck in scaling large-model applications is the sword hanging over its future. This will determine whether the second chapter of the tech adoption curve ends with a soft landing and eventual recovery. As for the arms race, it’s best to leave that to Elon Musk, Zuckerberg, and the billionaire club to continue playing.

Reinforcement learning, as an extension of pre-training, belongs to the realm of “post-training.” When pre-training hits bottlenecks and diminishing returns, strengthening RL is a natural complement. In the simulation of human cognition, pre-training represents the accumulated knowledge of human civilization, while RL applies that knowledge in practice, learning from the environment. This overall approach to intelligent learning makes perfect sense and is the necessary direction for applying large models.

My old friend Lu said: “It’s intuitive that RL is the path we must take because there isn’t enough supervised learning data anymore.”

Indeed, utilizing regenerated data to varying degrees has become common practice. It’s inevitable. Models can already generate data of higher quality than humans, and this will only improve. However, this is not the same as self-play's proactive exploration and data regeneration.

As Mr. Mao pointed out: “RL aligns with the cognitive processes of humans and epistemology. It’s essentially the process of receiving external feedback and being tested in practice. RL is active learning, while training is passive.”

Guangmi's RL paradigm shift suggestion still lacks the necessary catalysts. But this potential trend is worth keeping in mind. It’s best to remain cautiously optimistic and open-minded while watching how things unfold.

 

Related original:

大模型风云诡谲的下半场:scaling 失效?

Professor Ma's long paper out

Here is the link to Professor Ma Yi’s presentation from the Shenzhen Entrepreneurship Forum, in Chinese, recommended.

Professor Ma is a compelling speaker, and his talk is definitely worth listening to. His paper on whitebox transformer, over 100 pages long, has just been released (Yi Ma’s white-box transformer paper is available here).  Unfortunately, I haven’t had the time to dig into it yet. We’ll have to wait until more people have accepted or verified it before delving deeper.

His current claims revolve around using an extremely sparse approach to force transparency in transformers, with results that are reportedly on par with BERT and GPT-2 in many benchmarks. However, this doesn’t mean that he will be able to catch up with GPT-3 or later models anytime soon. But to be fair, it’s not a level playing field—he’s an academic without the resources to compete with mainstream AI in an arms race. What he does believe, however, is that he has opened a door—a path toward explainable AI in large models.

Honestly, I’ve always had some doubts about Ilya’s "shortest program" compression theory (from his Berkeley talk). From an ultimate theoretical perspective, where lossless compression is the ideal, the idea of continually scaling training, deepening, and lengthening learning makes sense, as it pushes the model toward becoming the smallest possible program for universal tasks. Ilya’s theory may hold up in this respect, at least in theory or as an end goal. But in any real-world scenario (e.g., under budgetary constraints or methodological limitations), it’s hard to call a model derived purely through gradient descent the “shortest program.” These models are gigantic beasts with "huge circuits" inside, and intuitively they should not be considered short or small.

Models with hundreds of billions or even trillions of parameters are massive monstrosities, succeeding mainly through sheer size rather than through regularity or elegance. Emphasizing their impressive compression ratios or how well they approximate lossless compression may help explain, from a theoretical standpoint, the generalization and emergent abilities of sequence learning. But in practice, any model at a given time is far from being the “shortest program.”

This highlights an unavoidable gap between theory and practice. Ilya essentially hedged practice with theory along a future time axis, but our immediate reality doesn’t seem to align with this. It’s like a bulky wrestler trying to brand himself as a sleek and slender fashion model. To most eyes, it just doesn't fit.

Instinctively, LLMs feel full of rote memorization with significant redundancy. Under real-world conditions, achieving extreme or lossless compression seems impossible.

On the other hand, Professor Ma’s sparsity approach almost feels “over the top.” Tying Q, K, and V to the same weights seems a bit crude and simplistic, yet the model could still be trained successfully. This shows that there’s a lot of flexibility within transformers: no matter what restrictions or pruning are applied, the model still finds a way through. In this sense, Professor Ma’s pursuit of the “shortest program” is more real and direct. It’s so short that even a human can interpret the process (hence the LLM explainability).

Yet the difference between these two extremes is still mind-boggling. On one side, we have gigantic models, and on the other, extreme simplicity to generate whitebox models. The fact that both approaches work is shocking.

Speaking of simplicity and explainability, here’s an interesting anecdote in AI history: Back in the day, during the era of symbolic MT, one of the earliest deployed systems (Siemens' METAL) for English-German translation used only eight symbolic features (such as human, animal, etc.). The rules were simple, transparent, and easy to explain. This shows that extreme simplicity and rule-based transparency can work in some rough application scenarios (where English and German are linguistically close, making translation easier).

Later, we MT-ers expanded the number of features to the thousands, trying to cover more of the long tail. Even then, it wasn’t perfect. At the time, we thought that with enough effort, we could match the quality of statistical MT. But now, we know that even if symbolic MT could catch up and match statistical MT, it’s still far from competing with neural MT.

So, could we have continued refining the features further? It wasn’t that we didn’t want to keep extending the symbolic features (similar to one-hot encoding, but with the internal structure of an ontology/taxonomy) from thousands to tens of thousands. In reality, a feature set numbering in the thousands was already reaching the limit of what human experts can understand (AI explainability), manage, and debug. Expanding further would have been unmanageable.

Meanwhile, how many parameters do mainstream Transformer neural networks have? The space and granularity they represent are on a completely different scale. Given this vast difference in scale, it’s natural to doubt any effort to bridge the gap for AI explainability. How could that even be possible?

That’s why I’ve always felt that explainability in large models is an elusive goal. But Professor Ma is telling the world that they’ve achieved it.

 

 

Relevant link:

Professor Ma Claims to Have Fully Unveiled the Mysteries of Neural Networks

What did Ilya see? -- secret behind success of LLMs

马毅教授的演讲,值得一听

创业邦深圳会议马毅教授的演讲链接在此:https://mp.weixin.qq.com/s/ibxGO_A7H-akpbwf2R2mGw

马教授还是很能讲的,他上面的演讲,很值得听。他的100多页论文也已经放出来了,可惜没时间钻研了,等以后更多人接受或验证后再说。

他目前所做出的 claims,是说用那种极度稀疏化的方法逼迫 transformer 透明化,结果也在多方面匹敌了BERT 和 GPT2。但并不说明短期他有办法赶上GPT3以上。话说回来,那也不公平。他作为教授没有资源去以军备竞赛的方式与AI产业主流打擂台。只是说,从路线上说,他觉得自己打开了一扇门,一条可以通向可解释AI的大模型大门。还是应该赞佩这样的反潮流的教授的。

其实,我也一直隐隐约约对伊利亚说的最短程序压缩论,持有怀疑:从终极目的(理论上以无损压缩作为理想目标)来看,不断加大训练、加深加长学习,结果就是朝着让模型真正成为最小程序,伊利亚理论也许没错。但在任何一个实际条件约束下(例如预算约束、方法论约束),这种纯粹靠 gradiant descent “凑出来”的模型/路径,很难说是最小 program,因为模型看上去就是个庞然大物,谈何“最小”。

千亿万亿参数的超大模型本来就是以大取胜,而不是以精简和规则见长的怪兽(gigantic monster),非要强调自己的压缩率厉害,无损压缩做得好,虽然有从理论上方便说明序列学习达成的通用性、泛化现象以及“涌现”能力,但实践中,在任意一个特定时间条件下的模型,都远远不是“最小程序”。

这是理论和实践躲不开的一种矛盾。在伊利亚那里,实际上他是以未来时间轴,用理论对实践做了对冲。我们的真实感觉并非如此,不敢这么说。就好比一个摔跤选手,都那么笨重了,还非要标榜自己性感、苗条?

直觉上,LLM 里面充满了死记硬背和信息冗余的,在现实条件下其实不可能做到极度/无损的压缩。

但另一方面,马教授也太奇了,他的稀疏化直觉上做得“过分”,QKV直接拉平,看上去有点简单粗暴,但居然也最终能训练出来。可见,transformer 的肚子里的操作空间还是很大的,你给它各种限制,动不动就剪枝(化零),也不用担心它走不出来。这种意义上,马教授追求的才是真正的“最短程序”,短到了连“豆腐脑”的人类都可以看懂路径(hence 可解释性)。

疑问还是这两个极端差距太大。一边庞然大物,一边无限精简,二者都能走通,也是震撼了。

谈到精简可解释,谈个掌故。老老年做 symbolic MT,一个著名的早期的实用系统(西门子的 METAL)做英语德语的翻译,整个系统只用了8个 symbolic features(例如人、动物等),规则简单而可解释,系统也一样上线实用了。可见极度精简和规则化,做到完全透明和人类可解释,在粗线条的应用场景(英语和德语距离较近,翻译难度低),有时候也是管用的。

我们后来把 8 个 features 扩展到千数量级,才擦了长尾的屁股。但也没擦干净。当时觉得,也许认真做可以对垒统计MT的品质(与董振东老师谈过,我们都觉得可以在翻译上最终用符号打败统计的,只是需要时间磨细活),但现在知道即便匹敌了统计MT,也远远不能与神经MT比高下。

那就把 features 往细做,成不?不是因为我们不想继续把 symbolic features (类似于 one hot encoding,但人为在 features 内部强加了类似于 HowNet 的 ontology/taxonomy 的结构性),从千这个量级进一步提升到万的量级。实际情况是,千几乎已经达到专家人脑的极限了,再扩大 features 的范围,我们就无法掌控和调试了。

可是,神经里面有多少 params 啊,其所能反映的 representation 的空间和细密度,与千量级的 symbolic features,尺度完全无法比拟。二者表征的尺度如此悬殊,对拉近二者距离的任何努力,我们天然会产生怀疑:这怎么可能做到呢。

所以一直就觉得大模型可解释性是一个可望不可及的目标。马教授告诉世人,他们做到了。

相关链接:

马毅教授称,已经揭开完全揭开神经网络的面纱

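An NLP Old-Timer's AIGC Journey

Today I felt like writing a short recap, in a playful spirit, looking back on my AIGC journey over the past two years and on what an NLP veteran has thought and felt along the way.

Before the large-model explosion, what I was most obsessed with was the txt2img (text-to-image) models that already existed. I tried many tools back then, and "Xiao Ya" (小雅) is a product of that phase. Beyond portraits, I also experimented with all kinds of painting styles and shared the results many times in chat groups and on my blog. Eventually fatigue set in and I mostly stopped playing with them.

Then I became interested in digital humans: 2D talking photos, 2.5D virtual anchors with body poses, 3D dance, and so on. Because 「奇妙元」 is our own product, I could play without limits, and as its "product experience officer" I went wild for a while.

Unfortunately the golden age of digital humans was gone in a flash. Before the field could really take off, it was already a crowded mix of good and bad players everywhere, a victim of cutthroat market competition.

Next came hyper-realistic, human-like voice synthesis and the cross-lingual breakthroughs, including the recent "full-duplex" breakthrough. The leading models began showing off voices that are approachable and indistinguishable from ordinary people, no longer the stiff announcer tone. Our own flagship AIGC product 「魔音工坊」 caught this wave of end-to-end large models built on speech tokens and also achieved hyper-realism, about half a year ago. This matters a great deal, because voice is the most natural interface for all copilot-style large-model applications and a necessary enabler for digital humans and short videos. In terms of sheer playability, though, speech does not match music generation.

Then Suno made its stunning debut. I was hooked for several months and fulfilled my dream of being a "musician" myself. That, too, has cooled off by now, not because it isn't good but because I no longer have the time.

My time has been taken over by China's Sora: the video generation model KeLing AI (可灵) from Kuaishou. I have been playing with video generation like mad up to today. I used it to recreate many childhood memories, to freeze and replay the highlight moments of my life, and to conjure up larger-than-life scenes. The most exhilarating period has passed here as well. This round of experiments, including a limit test of continuous three-minute video generation and all sorts of prompt-engineering explorations, has given me a fairly clear view of the strengths and weaknesses of current visual models.

One important application form of visual models is "one-click video creation", which is again one of our own products, called 「元创岛」 (YuanChuang Island). It is still rough and rudimentary, but it truly delivers "foolproof" production with zero barrier to entry: anyone can use it to generate a video. There are clear use cases and signs of takeoff.

This fascination with multimodal experiences might seem completely at odds with a lifetime spent solely on text NLP. But behind it lies the grand backdrop of large models. After LLMs flattened NLP, they moved on without pause to flatten multimodality as well. This generality makes it feel as if a single common thread runs through everything, a natural confluence of technologies. That is the view from the perspective of model research.

From the perspective of combining the humanities with technology, the innate pursuit of humanities and art in us "old liberal-arts students" was never extinguished by decades of "coal mining" in industry's coder environment, and applying it today is another natural convergence. It is a bit like what Steve Jobs used to say: what he pursued was technology products with a humanistic flavor, engineering combined with aesthetic taste, and he mocked Microsoft's products as crude, with "no taste".

Looking back, the road has been quite interesting: in both R&D and applications, things seem destined to converge. How fortunate we are to witness, experience, and throw ourselves into this converging tide, even though the convergence also means disrupting, steamrolling, and negating our own past and abandoning many former "signature skills", such as the symbolic parsing craft that once reached world-class level. What has kept me from falling too far behind is lifelong learning. But above all, it takes a certain spirit, especially passion: driven by passion, one never tires of it.

The next point of passion should be to-B scenarios, because the big application payoff will most likely come in verticals. To-C is fiercely competitive, but its roadmap, momentum, and what can be done, AIGC included, are basically clear. To-B, by contrast, is still struggling in the mud, with its direction seen only dimly through the fog, though one can also spot some masters there. Take Bai Shuo, for example: I picture him stroking his beard and smiling, seated on the lotus pond of financial trading, backed by his accumulated to-B experience.

Personally, among vertical tracks I like education best and law second. Both lie along the path of large models' knowledge capabilities: they are easy to eventually be steamrolled by general-purpose large models, yet they can also align with use cases and show value immediately. Finance is too tedious and its waters run deeper. Water conservancy, electric power, automotive and the like are highly specialized and feel dry to outsiders. Medicine and psychology, however, are very tempting, even though they are harder to get into than education or law. Let's see where fate leads me.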

Decoupling to Resolve: Issue of Character Consistency in Video Generation

I’ve now become the go-to expert for AIGC (AI-generated content) "custom services" among my old friends and classmates, just for fun. Below are nostalgic videos made from old photos that two of my classmates asked me to create.

Whenever I find the time, I’m more than happy to provide this kind of emotional value for friends and family because it’s truly satisfying to see their reactions of surprise.

The pianist is now a world-class piano master, frequently touring and performing in Europe, America, and China. These are precious old photos of him practicing and performing with our mutual friend, Brother Sun, in Philadelphia back in the early days.

Dr. Bai Shuo, a seasoned expert in NLP and a multi-talented musician, commented humorously: “Looks real for someone who pulls on the bow in  Meditation as named, but the bowing and fingering are all wrong.”

Another old friend also left feedback noting that the visual model doesn’t understand music: "This needs improvement! It's obvious that the model was created by someone who doesn’t know how to play the violin or piano. The bowing and piano accompaniment are off. The first note has a two-and-a-half beat long tone, which should be played with a long bow. Additionally, the pianist’s right foot should never be raised or shaking like that—it should be on the sustain pedal.”

LOL

Even though the music's name Meditation was clearly specified in my prompt during generation, there is no model, in the foreseeable future, that can truly align the understanding of music with the intricate details of bodily movements during performance. Perhaps this can be reserved as one of the ultimate challenges for large models aiming for AGI, because theoretically, if enough alignment data of musical performance is available, based on the compression theory of "joint training", it’s possible to aim at perfect alignment across different modalities.

If simulating the objective world is the ultimate goal of visual models, then the current generation of visual models is at the level of “playing the piano to a cow” or “playing music to a tone-deaf audience”—completely unable to withstand scrutiny from musicians. For example, as someone with little musical knowledge, when I watch the nostalgic performance videos above, I wouldn’t notice the flaws as an expert would; instead, I find them vivid and emotionally engaging.

Of course, the standards of musicians might as well just be a "pseudo-demand" or a pseudo-goal (even if the visuals satisfy the picky “expert eye,” so what? Will it sell well?). It might not be worth the effort to pursue this. However, in theory, an ideal AGI should be capable of meeting these expert-level demands.

This is the challenge of musical performance alignment.  Another challenge to Sora-like video generation models is character consistency in videos.

Achieving facial consistency in generative visual models is extremely difficult. Don’t expect this issue to be resolved by video generation models alone in the short term, especially not through autoregressive methods.

Human eyes are extremely discerning when it comes to recognizing faces, especially the familiar faces of friends and family: you can immediately tell when a character's appearance is off. For example, while playing with old photos recently, I used the KeLing model (a top-notch video model in China) to generate a video of myself. At the 5-second mark, it still looked passable, but by 10 seconds, it no longer resembled me.

In the second 10-second video, just a slight turn of the head, and it’s no longer me—it looks more like my brother. How can a model handle such fine details? Especially when the starting image for video generation is not even a straightforward frontal shot, making the character information incomplete—how could it not go off track?

While the videos I've made for friends and family using KeLing during its public testing phase have generally been met with passionate surprise and amazement, most of them suffer from this issue of character consistency, which is a regret.

The current one-click video generation products on the market (including our own YuanChuang Island recently launched) tend to mainly use anime or manga styles. This is to avoid user scrutiny since these styles lack 3D distinct individual characteristics. As long as there is consistency in attire, no gender mix-ups, with age and race alignment, most people will accept it. The current one-click videos are generally rough, with entertainment value primarily in the story rather than character portrayal akin to a Hollywood blockbuster. However, as this path progresses, it will inevitably encounter the challenge of maintaining the consistency of digital IP actors and their roles.

My colleague, Lu, mentioned, "the consistency issue might require cross-checking from multiple video angles, which more or less touches on the core issue of whether modeling is necessary."

Indeed, some form of cross-checking is required, not just monotonic correction over time/sequence. That is the key. There’s a need to decouple the character's image from the storyline, rather than generating along a linear, one-way path. While sequence learning has indeed produced miracles in LLMs, sequence generation has inherent limitations, including random deviations that build up over time. The problem is not as extreme as LeCun's criticism, in which GPT's error accumulation means a tiny discrepancy leads to a huge miss. His claim isn't entirely accurate, because GPT's autoregressive operation also corrects and adjusts its course at every step in the context. Nevertheless, when it comes to fine-grained consistency, random deviations are almost impossible to handle, even with corrective mechanisms in place.
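The argument above can be illustrated with a toy drift model. This is purely my own simplification for illustration, not LeCun's formulation and not how any video model actually works: a single number x stands for how far the character has drifted from the reference face, each generated step adds independent noise, and a hypothetical `correction` factor pulls x back toward zero.

```python
def drift_variance(steps: int, noise_var: float = 1.0, correction: float = 0.0) -> float:
    """Variance of x after `steps` updates of x <- (1 - correction) * x + noise.

    correction = 0.0 models pure error accumulation (a random walk);
    0 < correction <= 1 models a generator that partly corrects its course.
    """
    keep = 1.0 - correction
    var = 0.0
    for _ in range(steps):
        var = keep * keep * var + noise_var
    return var


uncorrected = drift_variance(100)                  # grows linearly with the step count
corrected = drift_variance(100, correction=0.5)    # levels off at noise_var / (1 - keep**2)
```

Without correction the drift grows without bound; with correction it stays bounded but never reaches zero. That residual deviation is harmless for a storyline but fatal for a familiar face, which is why a constant reference outside the sequence is needed.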

Hence decoupling, decoupling, decoupling! Decoupling can solve the problem. The world isn't limited to sequences. Beyond sequences and time, there is a constant abstraction (i.e., character image, or IP) that can be utilized. This is becoming increasingly clear. Take, for example, the digital IP character Maria (Xiao Ya) that I created using AIGC txt2img more than 2 years ago:

Unless they’re fans, perhaps my numerous Maria videos might cause aesthetic fatigue—someone even called her “Dr. Li's fairy” (LOL). But indeed, there are fans; several of my old classmates are among them.

Why? Because she is an IP, and she has been decoupled.

 

Related Links (original posts in Chinese):

视觉模型生成的极限对齐

解耦才能解套:再谈视频中的人物一致性问题

 

解耦才能解套:再谈视频中的人物一致性问题

前两天说过,对于生成式视觉大模型,人脸一致性是个非常难缠的东西,不要指望短期靠视频生成大模型本身来解决,尤其是不能指望自回归解决这个问题。

人眼太挑剔了,尤其是亲友和熟人,你会一眼看出人物走形了。譬如这几天玩老照片,我用头部视觉模型可灵5秒生成我自己,还过得去,到了10秒就不是我了。

10秒视频:

一转脸就不是我了,倒更像我哥。这种细粒度,模型怎么能搞定。尤其是,如果图生视频的起点图就不是正面照,character 信息本就不全,怎么可能不走偏。所以,我在可灵公测以来为亲友同学做的视频虽然普遍感觉惊喜或震撼,但大多存在这个人物变形的问题,成为一个遗憾。

现在市面上做的一键成片产品(包括我们的元创岛),其所以用二次元或其他夸张风格为主,是为了避免用户的挑剔,因为那些形象缺乏鲜明的个性,不是真正的 individual IP,只要保持穿戴一致性,男女不要错位,年龄和种族不要相左,一般人也就接受了。目前的一键成片普遍比较粗线条,娱乐价值更多是为视频里的故事,而不是好莱坞大片那样的角色形象刻画。但这条路往上走,就躲不开这种 digital IP 的演员角色定位及其一致性问题。

吕兄说:一致性问题可能需要靠多角度视频的cross-checking, 这里面多多少少要涉及到是不是要建模的硬核问题。

是的,要某种 cross-checking,而不是时间流单调矫正,这是key。需要解耦/剥离故事线上的人物形象,不能生成一条道走到黑。序列出过奇迹,但序列生成有随着时间出现随机偏差的局限,虽然不是 LeCun 批评的那样极端:他说gpt错误积累是差之毫厘失之千里;他的说法不正确,因为gpt的“自回归”推理方式也在每一步context自回归中不断纠错、矫正航向。尽管如此,对于细线条一致性,随机偏差哪怕有了矫正机制,也是基本搞不定的。

因此,解耦、解耦、解耦。解耦就可以解套。世界上也不是只有序列。跳出序列和时间,还有个恒定抽象(即character形象)可以利用。这一点已经越来越清晰了。以我制作的数字人IP形象小雅/Maria为例:

除非粉丝,也许我的众多小雅视频会引起审美疲劳吧,有人称她为“立委的妖精”(LOL)。但确实有粉丝,老同学中好几位人物就粉她。

为啥,因为她是IP,解耦了。

Related Links:

视觉模型生成的极限对齐

马毅教授称,已经揭开完全揭开神经网络的面纱

原创 立委 LiWeiNLP 2024年09月01日 17:44 北京

马教授的白盒transformer论文在:https://ma-lab-berkeley.github.io/CRATE/?continueFlag=680deb7516c156566f8eb73fdcc896ca

马毅教授大名鼎鼎,特立独行,一方旗帜人物,天下无人不识君。值得关注的是他最近对主流大模型的批评以及他自己工作的宣示。

前不久深圳会议请他来,他把主流大模型、伊利亚,以及k氏复杂性理论,批为中学生的水平,说他们根本不懂理论。而他自称理论实践双突破,说的就是他带领团队做的这个白箱 Transformer,不仅揭开了大模型的神秘面纱,而且是工程可实现的替代方案。

说句实话,把话说到这份上的人,多半是真牛人,有自带的底气和傲视群雄的霸气。对于主流的批判,老美有个杨立昆(他说GPT不如他家一条狗,是死路,他的世界模型才是替代方案),中华有个马教授,世界才显得不那么寂寞。也确实希望他们somehow能弥补当前主流大模型的“慢思维短板”,推动AI的整体进步。有时间还是应该慢慢研究他的学术和实践,但也许更多是要等待时间和同行对他们工作的检验和复现。

深圳会上就露了个脸,自己讲完和批完,立马走人。估计是真忙。

论文100多页,说过几天就放出来。看现在的 outline,重点在,quote:

Overall, CRATE is similar to a transformer, with two differences:

in each attention head, the Q,K, and V weight matrices are weight-tied, i.e., set to be equal;

and the nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP), but rather a more structured operator (ISTA) with sparse outputs.

咱们了解一下,ISTA(Iterative Soft-Thresholding Algorithm,迭代软阈值算法),是一种用于解决稀疏优化问题的算法,在机器学习领域有广泛应用。在CRATE架构中, ISTA被用来替代传统Transformer中的多层感知器(MLP),还记得前不久的 KAN 的创新也是旨在平替 MLP。都是在 Transformer 里面开刀。

我浅薄的理解,ISTA 与 KAN for Science/Physics 的思路是一致的,就是要经过某种正则化或剪枝,最终拟合成稀疏化路径,从而获得可解释性。

工作原理: ISTA通过迭代的方式逐步接近问题的最优解。每次迭代包括两个步骤: a) 梯度下降步骤,这与主流同;b) 软阈值操作。增加这个操作,是要在两个目标之间找平衡:

a) 使模型尽可能准确;b) 使模型尽可能稀疏,即简单(因为人很难理解一团乱麻)。

软阈值操作鼓励内部元素变为0,从而产生稀疏的输出,增加可解释性。权重绑定的注意力机制和ISTA(迭代软阈值算法)促进了模型对输入数据结构的更深入理解。这更像人类的结构化解析过程,抓大放小,正则化、规整化。

马教授说的是,它们经过上述的两个改造,结果在监督学习中(后来他们也成功地用到了自监督学习),就自然学到了人类的这种结构化、稀疏化所带来的可解释性。

拿图像识别来说,研究发现某些注意力头对应于动物的不同部位。更引人注目的是,这种对应关系在不同动物甚至不同动物类别之间保持一致。例如,专注于"头部"的注意力头在处理不同种类的动物时,始终关注其头部区域。这种一致性表明CRATE已经学习到了跨类别的通用视觉特征表示。

但做LLM可解释性研究的人早就发现,在MLP网络的尽头,各种结构化(例如 头、脚)的部件也都被注意力抓到了,否则不能解释LLM表现出来的泛化(或压缩)能力。困难的是在网络的前期,那些个注意力比较混杂,主流的MLP搞不清都在注意什么。隐隐约约是在注意像素(点)、线等基础元素之间的关系。

可解释AI的基本思路都是一样的,就是要把一团麻一样的黑箱多层网络内部数据拟合中凑出来的路径,改造成强加了种种限制条件和剪枝、化零的“结构化”路径。

正常人,谁不愿意模型可解释?所以这种稀疏精简的路数/算法一定也是有极大的挑战,才 so far 无法与那种黑箱子瞎凑的方法比试。

马教授觉得自己很牛的底气是因为他在最近半年一年,开始用可解释的上述白箱方法,训练出了可以匹敌传统 transformer 的成绩。他在深圳会上说,此前他有信心这才是正道,但在没有拿到结果前,还是必须谨慎。现如今,他觉得他的跨国团队用这一套在各方面所做的实现,已经让他满意到可以向全世界宣告,他找到了理论和实践的突破口,找到了transformer白箱化的正确方法,可望引发范式转变式的深度学习的突破。这令他很激动,也很霸气。所以他不再满足于大学教授的理论建树,而是觉得应该下场了。马教授自己成立了一家公司要在工程上做这个推进。这次来创业邦深圳主题宣讲,据他说,是他第一次以这个新公司的名义来向社会宣布这个挑战主流的工程落地的方向性大项目。

不过,凭着多年经验和直觉说一句这件事的挑战性(或可疑之处):人类的可解释性建立在一个非常小的高度精简的 finite set 基础之上。如果拿 symbolic features 来说,超过千这个量级的 feature system,对于人就是不可解了。贵为上帝选民,我们人类的“豆腐脑”真地很局限。可是另一边看 transformer 里面的 parameters 以及 attention heads 的KQV数量,完全不具有可比性,天上地下。让大变小在这样的尺度下去归约,感觉上是不可思议的。

KAN for Science 之所以成功,是因为他们瞄准的 target 极为狭窄,只是 science 中的某些已有的解析公式,或者未知的但局限在少数参数的潜在公式。有了这样的目标去做剪枝,中间加上科学家的人为干涉或反馈,KAN 声称是做到了可解释性。

Anyway,马教授似乎是成竹在胸,咱们还是静观他的法术/魔术。

Related Links:

What did Ilya see? -- secret behind success of LLMs

 

 

Professor Ma Claims to Have Fully Unveiled the Mysteries of Neural Networks

Professor Yi Ma’s white-box transformer paper is available here.

Professor Ma is a prominent figure, renowned for his distinctive style and leadership in the field. His name is widely recognized and respected. Of particular interest recently are his critiques of mainstream large models and the bold claims he has made about his own work (see his post in Chinese below).

Recently, at a conference in Shenzhen (which I attended with my own talk too), Professor Ma sharply criticized mainstream large models, Ilya, and Kolmogorov complexity theory, dismissing them as being on the level of high school students and claiming that they lack a true understanding of theoretical concepts. He asserted that he has achieved breakthroughs in both theory and practice, particularly with the white-box Transformer developed by his team. According to him, this model not only demystifies the complexity of large models but also offers an engineering-feasible alternative.

When someone speaks with such confidence, it usually indicates genuine expertise and a commanding presence. Just as Yann LeCun in the U.S. criticized GPT as being inferior to a dog and called it a dead end, proposing his world model as an alternative, China has Professor Ma. Their critiques balance the global discourse and make the world feel less lonely. There is indeed hope that their work might address the "slow thinking" and "interpretability" shortcomings of current mainstream large models and contribute to the overall advancement of AI. Professor Ma’s academic and practical work deserves close study, though we may have to wait for time and peer review to fully test and validate the findings.

At the Shenzhen conference, after delivering his talk and sharp critiques, Professor Ma left immediately, likely due to his busy schedule.

The paper is over 100 pages long and is said to be released in a few days. Based on the current outline, the key points are as follows:

Overall, CRATE is similar to a transformer, with two differences:

- In each attention head, the Q, K, and V weight matrices are tied, i.e., set to be equal.
- The nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP) but rather a more structured operator (ISTA) with sparse outputs.
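As a rough illustration of the first difference, here is a minimal pure-Python sketch of a single attention head whose Q, K, and V share one projection matrix. It only shows what weight tying means; it is not the actual CRATE attention operator, and the helper names, toy inputs, and the scaling by the square root of the dimension are assumptions borrowed from standard attention.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def matmul(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def tied_attention_head(x: Matrix, w: Matrix) -> Matrix:
    """One attention head where Q, K and V all equal x @ w (tied weights)."""
    z = matmul(x, w)                      # shared projection: Q = K = V = z
    d = len(w[0])
    scores = matmul(z, list(zip(*z)))     # z @ z^T, symmetric because Q == K
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, z)             # weighted sum of the same projections


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 toy tokens of dimension 2
w = [[0.5, -0.5], [0.25, 0.75]]            # the single shared 2 x 2 projection
out = tied_attention_head(x, w)
```

With tied weights each token is compared with the others in the very same subspace it is read out from, so one matrix per head has to do the work that three matrices do in a standard transformer.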

Let's examine ISTA (Iterative Soft-Thresholding Algorithm), a widely used algorithm for solving sparse optimization problems in machine learning. In the CRATE architecture, ISTA replaces the traditional MLP in Transformers. Not long ago, KAN also introduced innovations aimed at replacing the MLP. Both approaches operate on the internals of the Transformer architecture.

In my understanding, ISTA and KAN (for Science/Physics) share a common goal: through regularization or pruning, they ultimately fit a sparse path, thus achieving interpretability.

How it works

ISTA iteratively approaches the optimal solution of a problem. Each iteration involves two steps: a) a gradient descent step, which aligns with mainstream methods; and b) a soft-thresholding operation. This operation is added to balance two objectives:

a) Maximizing model accuracy;
b) Achieving model sparsity, i.e., simplicity (as overly complex models are difficult for humans to interpret).

The soft-thresholding operation encourages internal elements to become zero, resulting in sparse outputs and increased interpretability. The weight-tied attention mechanism, combined with ISTA, promotes a deeper understanding of the input data structure, resembling a human-like structured analysis process that prioritizes key elements while regularizing the data.
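The two steps can be sketched on a toy problem. The code below is textbook ISTA applied to a small lasso objective, 0.5 * ||Ax - b||^2 + lam * ||x||_1, and is not CRATE's actual layer; the matrix `A`, the vector `b`, the penalty `lam`, and the step size are made-up values chosen so the result is easy to check by hand.

```python
from typing import List


def soft_threshold(v: float, thr: float) -> float:
    """Shrink v toward zero by thr; values within [-thr, thr] become exactly 0."""
    if v > thr:
        return v - thr
    if v < -thr:
        return v + thr
    return 0.0


def ista(A: List[List[float]], b: List[float], lam: float,
         step: float, iters: int = 200) -> List[float]:
    """Minimize 0.5 * ||Ax - b||^2 + lam * ||x||_1 by iterative soft-thresholding."""
    n = len(A[0])
    x = [0.0] * n
    for _ in range(iters):
        # a) gradient step on the accuracy term: grad = A^T (A x - b)
        residual = [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]
        grad = [sum(A[i][j] * residual[i] for i in range(len(A))) for j in range(n)]
        # b) soft-thresholding step, which pushes small entries to exactly zero
        x = [soft_threshold(xj - step * gj, step * lam) for xj, gj in zip(x, grad)]
    return x


A = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [4.0, 0.3, -2.0]
# step = 1 / L, where L = 4 is the largest eigenvalue of A^T A for this A
x = ista(A, b, lam=0.5, step=0.25)
```

Here the weakly supported middle coordinate ends up exactly zero while the two strongly supported ones survive, slightly shrunk. That is the accuracy-versus-sparsity trade-off in miniature.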

Professor Ma claims that, with these two modifications, the model naturally learns the interpretability that comes with human-like structuring and sparsity during supervised learning (the team later reported applying the approach successfully to self-supervised learning as well).

For example, in image recognition, it was observed that certain attention heads correspond to different parts of animals. What's more remarkable is that this correspondence remains consistent across different animals and even different categories of animals. For instance, an attention head focused on the "head" consistently pays attention to the head area when processing different kinds of animals. This consistency suggests that CRATE has learned a general representation of visual features across categories.

However, those studying LLM interpretability have long discovered that at the end of MLP networks, various structured components (such as heads and feet) are also captured by attention mechanisms. Without this, it would be difficult to explain the generalization (or compression) capabilities exhibited by LLMs. The challenge lies in the early stages of the MLP network, where attention is more mixed, and mainstream researchers struggle to clarify what the attention heads are focusing on. They seem to be vaguely attending to the relationships between basic elements like pixels/dots and lines.

The core idea behind explainable AI is consistent: transforming the tangled, black-box, multi-layer network's internal data fitting paths into structured paths that are enabled with various constraints and pruning, leading to a sparse representation.

Who wouldn’t want a model to be interpretable? However, achieving sparsity and simplicity is extremely challenging, which is why, so far, these approaches have struggled to compete with black-box methods that blindly fit the data.

Professor Ma’s confidence stems from the fact that, over the past six months to a year, he has begun training models with the explainable white-box methods described above and has achieved results comparable to traditional transformers. At the Shenzhen conference, he said that although he had always been confident this was the correct approach, he remained cautious until results were in. Now he considers his cross-national team’s achievements sufficient to announce to the world that he has found a breakthrough in both theory and practice: the correct method for white-boxing transformers, which could lead to a paradigm shift in deep learning. This has made him both excited and confident. He is therefore no longer content with academic, theoretical achievements alone and feels compelled to act in industry as well. Professor Ma recently founded a company to advance this work at the engineering level. In Shenzhen, he announced a directionally significant project that challenges the mainstream, for the first time under the banner of his new company.

However, based on my years of NLP experience and intuition, I must point out a challenge (or potential issue): human interpretability is built on a highly simplified finite set. If we consider symbolic features, a feature system with more than a few thousand elements becomes incomprehensible to humans. The number of parameters in transformers, and the number of Q/K/V projections across attention heads, are on a completely different scale. Reducing complexity of that magnitude to something humans can follow seems almost unimaginable.

KAN for Science succeeded because their target was extremely narrow—certain existing symbolic formulas in physics or potential formulas limited to a few parameters. With such a goal, pruning, along with scientist intervention or feedback, allowed KAN to claim interpretability.

Regardless, Professor Ma seems confident, so we will watch how his methods and results evolve and whether they come to be accepted.

 

 

Related Links:

What did Ilya see? -- secret behind success of LLMs

 

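Notes from My Talk at the 创业邦 (Cyzone) Shenzhen Conference: Driving AIGC Commercialization with the "Product-Model Integration" Practice of 出门问问 (Mobvoi)

On August 22, the 2024 AGI Business Trends Conference was held in Shenzhen. The event was hosted by 创业邦 (Cyzone), with the Nanshan District Science and Technology Innovation Bureau, the Xili Lake International Science and Education City Integrated Innovation Center, and the Shenzhen "模力营" AI ecosystem community as supporting organizations. 揽胜运动 (Range Rover Sport) provided the official conference vehicles, and Sohu was the strategic media partner.

Under the theme "AI Shaping the Future Market," the conference invited AI industry experts and representatives of listed companies, unicorns, innovative startups, and investment firms to share and discuss the latest hot topics in technology, venture investment, and governance, and to hold in-depth exchanges on product commercialization and AI-plus-industry solutions, jointly exploring broader opportunities for industrial cooperation and innovation.

At the conference, Mobvoi Vice President Wei Li (李维) gave a talk titled "Profit Is King: AIGC from Technological Innovation to Commercial Deployment." Highlights:

  1. The arrival of large models was like crossing the mountain range of language: it unified these separate tracks and established a universal language capability reminiscent of the Tower of Babel in the Bible.
  2. Historically, technological innovation has tended to follow a path known as the "technology adoption curve": at first people flock to a major innovation or breakthrough, but when it runs into bottlenecks in commercialization, profitability, and practical application, a period of decline follows.
  3. In the era of general artificial intelligence, a single model can handle all kinds of tasks, which compresses the room for innovation in many niche fields. Previously every niche track had a chance to produce a super app; now that possibility is greatly reduced.

The talk follows, as edited by 创业邦 (Cyzone):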

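I am glad to have the chance to share some of our work on the AIGC track. My name is Wei Li, and I am a VP at 出门问问 (Mobvoi). The title of my talk today is "Profit Is King: AIGC from Technological Innovation to Commercial Deployment."

This is actually a hard topic, because AI and large models are generating a great deal of buzz, yet those that have truly reached commercial deployment are as rare as morning stars. As the saying goes, "one day in AI is a year in the human world": large models are developing fast, but in terms of the overall trend AI has in fact entered a downturn. Historically, technological innovation has tended to follow a path known as the "technology adoption curve": at first people flock to a major innovation or breakthrough, but when it runs into bottlenecks in commercialization, profitability, and practical application, a period of decline follows. We are now in that downward phase and have not yet hit bottom.

This trend is also reflected in AI investment. This year, fundraising has become extremely challenging for large-model companies. The main reason is that large models burn money, and compute costs often far exceed personnel costs. A model of pure investment with no output is unsustainable. Although large models have enormous potential, turning a profit is not as simple as it looks.

Everyone says the last technological revolution was the mobile internet. Its technical content may not be as astonishing as what large models offer, but it did give rise to a series of super apps that served people's basic needs very well: food, clothing, housing, transportation, entertainment, and communication. These super apps already reached a very high level in the previous technological revolution, leaving latecomers little room to innovate and improve. Achieving further breakthroughs in these areas with large models has become very difficult.

The arrival of ChatGPT marked an explosion of general artificial intelligence. In the era of general artificial intelligence, a single model can handle all kinds of tasks, which compresses the room for innovation in many niche fields. Previously every niche track had a chance to produce a super app; now that possibility is greatly reduced.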

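Let me give an example. Machine translation is not a new phenomenon of the large-model era. As early as 12 years ago, when the neural network revolution began, machine translation was among the first fields affected. As a result, big companies such as Baidu and Google turned machine translation into a free service. Once a field becomes free or dirt cheap, the room for startups shrinks dramatically, and the whole industry may even be strangled. Something similar happened after large models became widespread; Jasper is one example. Before ChatGPT appeared, when large models were just emerging, industry insiders all knew about GPT-3. Jasper took the capabilities of the large model and packaged them into a tool that helps people write and generate all kinds of copy, attracting a base of customers.

With the rise of ChatGPT and other general-purpose large models, Jasper's advantage gradually disappeared and it began to decline. ChatGPT is not just a super app; through human-machine dialogue it has in effect become a "super super-app" that transcends traditional boundaries. General-purpose large models can now handle all kinds of languages and knowledge, and even multimodal content such as speech, music, images, and video. This breadth gives general-purpose large models a dominant position in many fields and squeezes the living space of the related tracks.

Today, investors and entrepreneurs looking for a direction need to keep asking one question: does the field I am in, or the app I am building, have a durable competitive advantage? It may attract some customers and meet some needs for now, but if my project stands in the path along which the large model's capabilities naturally spread, I will most likely end up facing a decline like Jasper's.

General-purpose large models are so capable, and so much is expected of them, that only at deployment time do people discover how many constraints there are. On the one hand, the previous generation of technology is already quite mature, and the low-hanging-fruit super apps for everyday needs have been pushed to the limit. On the other hand, large models themselves are still evolving, and their general capabilities often limit the room for the tracks in which they are deployed. So people generally find that deploying large models at scale is quite difficult.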

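I was an ardent supporter of large models and one of the earliest people in China to "evangelize" them. I was very excited at the time, because in my entire NLP career I had never seen such a revolutionary change.

In NLP we used to have many specializations, such as machine translation, dialogue systems, and question answering, and even niche techniques such as word segmentation. But the arrival of large models was like crossing the mountain range of language: it unified these separate tracks and established a universal language capability reminiscent of the Tower of Babel in the Bible. Large models have completely changed the landscape of the NLP industry. In practice, however, we have found application much harder than I imagined a little over a year ago. For example, applications oriented toward NLP itself (such as copilots for copywriting or translation) have already been handled by the leading large models, so startups in that direction have been killed off.

Now everyone is waiting for LLM-native super apps. The calls are loud and competition in the industry is fierce, but so far the only ones that have truly been deployed at scale are general-purpose consumer (toC) applications such as ChatGPT, 豆包 (Doubao), and 文心一言 (ERNIE Bot).

In short, this kind of super large model is gradually penetrating every level of society. Beyond it, however, we have not yet seen other benchmark cases of large models deployed successfully at scale. Everyone is trudging forward, trying to find a breakthrough and an entry point that is differentiated from the basic capabilities of general-purpose large models.

The current situation is that people inside and outside the industry, investors, and entrepreneurs all broadly recognize the technical potential of large models; that remains the basic consensus. But finding market fit and achieving deployment at scale may take at least two or three years of exploration. We are now in a period of difficult labor for technology deployment, which is also a period of breakthrough.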

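Powerful as they are, large models also have some serious shortcomings. First, information accuracy: having learned from a vast amount of information, a large model may make mistakes on details it remembers imprecisely, producing inaccurate output, the so-called "hallucination." Second, controllability: unlike earlier symbolic-logic AI, a large model contains tens or hundreds of billions of parameters and operates like a giant black box, making precise debugging and control difficult. Third, safety: large models may carry safety risks, and public release requires caution. Fourth, compute cost: although compute costs should fall as technology advances, applications built on large models remain quite expensive in compute compared with many previous-generation applications whose marginal cost approached zero. Inference also often runs into high-concurrency bottlenecks. For the mass consumer (toC) market, the huge inference cost directly affects profitability.

The review above sounds somewhat pessimistic. But taking a somewhat longer view, especially over the next three to five years, there is still much to look forward to, particularly in vertical tracks (verticals have not yet seen a big explosion, but many people are actively exploring). The guests at this morning's panel also shared how they are applying large-model technology in their own fields, all front-line accounts that start from real needs. These efforts will eventually pay off, and vertical fields can be expected to see significant development and breakthroughs in the next few years.

Large models are expensive to develop, which is a challenge for startups, and the disconnect between model and product is another major problem. We are currently exploring how to combine model and product effectively to overcome these difficulties. Next I will share our exploration and experience in this area.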

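For model development, one can choose in-house development or third-party services, depending on the product's specific needs. The market is fiercely competitive at present, and third-party services are often cheap and more cost-effective. But if a product needs high-concurrency capacity or has specific customization requirements, third-party services may fall short, and in-house development becomes important.

This requires exploring how to tie model development closely to product needs, in order to achieve cost control and flexibility in product customization. That includes adapting the large model to our product needs to reach the best product-model combination.

Product-model integration, however, is easier said than done. Model development and product development are two different fields, and the people involved differ in background and ways of thinking. Getting these two groups to collaborate effectively and achieve the greatest synergy is a major challenge for many companies.

We developed our own large model, named "序列猴子" (Sequence Monkey). On top of it we launched multimodal AIGC products covering digital humans, voice-over, and one-click short-video generation. We have also successfully launched an overseas product, "DupDub". "魔音工坊" (Moyin Workshop) is our product with a dominant market position; on the 抖音 (Douyin) platform in particular, about 70% of voice content uses our technology.

Our target customers include content creators (ToPC, to professional consumer) and small and medium-sized businesses (ToSMB, to small and medium businesses). Content creators are willing to pay for tools that make their work easier, and that is exactly what we provide. For ToB customers, we focus on fairly standardized solutions for small and medium-sized businesses, because large customers' customization needs are more complex and harder to serve. We currently have 860,000 paying users, which shows that our services have been deployed successfully and recognized by the market. Below are some demonstrations of our products.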

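The Ultimate Alignment Challenge for Visual Model Generation

I have become the master of bespoke AIGC services among my old friends and classmates. Below are two nostalgic videos made from old photos at the request of two old classmates. Whenever I can find the time, I am happy to provide this kind of emotional value to friends and family, because I enjoy seeing their delight.

Teacher Liu is now a world-class pianist who often tours Europe, America, and China. This is a precious old photo of a practice session with my old classmate Sun in Philadelphia back in the day.

Bai Shuo, an NLP veteran who can play and sing just about anything, commented: speaking as someone who can play "Meditation" (沉思), both the bowing and the fingering are wrong. Wu, an old friend formerly of Microsoft NLP, also left a comment noting that the visual model does not understand music: "This needs improvement! You can tell at a glance that it was made by someone who cannot play the violin or the piano. The music and the movement of the bow are far apart: the first note is a long note of two and a half beats and should get a long bow. Also, the accompanying pianist's right foot would never be raised or jiggling like that. It should be on the sustain pedal."

LOL

The prompt did specify the title of the music at generation time, but for the foreseeable future no model will truly be able to align musical understanding with the fine details of a performer's body movements. Perhaps this can be kept as (one of) the ultimate challenge problems for large-model AGI, because in theory, given enough aligned data, the compression theory of joint training says that different modalities can, at the limit, be aligned.

If a simulator of the objective world is the ultimate goal of large visual models, then today's visual models are at the level of "playing the lute to a cow" (对牛弹琴) and "playing for the tone-deaf," and would never pass a musician's inspection. Someone musically illiterate like me, watching the nostalgic performance video above, does not spot the flaws at a glance the way a music expert does. On the contrary, I find it lifelike and emotionally charged.

Of course, the musician's standard may be a false need and a false goal (so what if the visual details satisfy a picky "expert eye"? Will it sell?), and may not really be worth the effort. But in theory, an ideal AGI should meet these expert-level requirements.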

 

 

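Liwei on LLMs: What Is a Must-Have Need for AI?

What is a must-have need?

I saw a bit of feel-good wisdom on WeChat Moments: "The poor say happiness is having money; the blind say happiness is being able to see; beggars say happiness is having food to eat; the sick say happiness is being alive; the single say happiness is having a partner. ..." It serves as an illustration of users' must-have needs.

Making the poor rich is a must-have need. Of course, precisely because it is a must-have need, and because hungry people are not picky, those with the need often become the "leeks" that get harvested, that is, they get fleeced.

Filling one's stomach is a subset of getting rich and the most basic need of all. So what most needs to be realized on the road to AGI is UBI; capitalists must not be allowed to monopolize the dividends of technological progress and layoffs.

Restoring sight to the blind is the most acute case of the need to heal the sick. Together the two show that medicine, however difficult, remains the most attractive vertical.

The above are lower-level must-have needs: food, clothing, housing, transportation, birth, aging, sickness, and death.

Finally, though, "no longer being single" is a must-have need that fuses the lower and higher levels (physical and spiritual companionship: primal attraction plus emotional value). This need happens to be the sweet spot for LLM-native products, and it will not lose its prospects just because character.ai's commercialization has hit a temporary setback.

As I have said before, making a person feel they have successfully found a partner is a meritorious deed for the provider and, for the user, an improvement in quality of life worth spending their last penny on: the willingness to pay is absolutely there.

(Perhaps it was being young, fragile, and sentimental: drifting alone in the vastness of Beijing, all by myself, with no internet and no games at the time, and no hope in sight of meeting a girl I liked. I remember that before I found a partner I had many nightmares in my youth, believing I would be a bachelor for life. It made me shudder and left a deep mark.)

In the long run, elder care, psychological therapy, and the like are all LLM-native must-have needs. The winner who finds the entry point, overcomes every obstacle, and sees it through well will not only make money but also do a good deed.

These should count as first-principles thinking.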

Llama 3 Notes and Llama MV with Llama 3.1 Legend

Notes on the 92-page Paper Released with Meta's Super Large Model Llama 3.1

The super-large model Llama 3.1 is a milestone in the open-source large model community. As a leader, Meta's project involved over 500 participants/contributors (the authors of this paper are listed alphabetically in the appendix, similar to how the Central Committee members' names are displayed by stroke order). This original text is full of implementation details:

meta Llama 3.1 paper

AIGC MV using Suno and keling (just for fun & cheering opensource milestone)

Notes:

  1. Llama 3.1 doesn't use sparse techniques, it's not a multi-expert system like model 4, but a dense model.
  2. 405B parameters, 15.6T tokens: The number of tokens is 40 times the number of parameters. Large-scale top models now emphasize data growth far exceeding parameter growth. Is this 15T tokens of data open source? (No, because even if they were willing to open source it, they wouldn't dare, as it could lead to countless data infringement lawsuits)
  3. Emphasizes three major levers for super-large foundation models: data, scale, and managing complexity.
  4. Compared to the previous-generation system Llama 2, training compute has increased about 50-fold (using 3.8 × 10^25 FLOPs).
  5. Complexity management: (1) Choosing a standard dense Transformer architecture instead of a mixture of experts model to maximize training stability. (2) Adopting a relatively simple post-training procedure: Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO). In other words, algorithm design and implementation tend towards simplification. Not using sparse techniques and multi-expert systems is for stability (but training challenges are greater, though they're not afraid). Using simpler, easier-to-implement DPO in the post-training phase instead of reinforcement learning is also for stability, as reinforcement learning has always been difficult to handle.
  6. Benchmark tests cover: general, code, math, reasoning, tool use, long context, and multilingual. All performances are SOTA (state-of-the-art international level).
    • MMLU (Massive Multitask Language Understanding): 405B model achieves 87.3% (5-shot), 88.6% (0-shot, CoT).
    • Code generation (HumanEval): 405B model reaches 89.0%, close to GPT-4.
    • Math problems (GSM8K): 405B model achieves 96.8%, slightly higher than GPT-4.
    • Long context tasks: Excellent performance on some tasks, such as 95.2% on QuALITY.
    • Multilingual tasks (MGSM): 405B model reaches 91.6%, on par with top models.
     The 405B model is comparable or close to GPT-4 and Claude 3.5 Sonnet on many tasks. In short, open-source has caught up with closed-source.
  7. Pre-training started with an 8k window, expanded to a 128k window in the later stages of pre-training (continued training).
  8. After the foundation model pre-training was completed, multiple iterations of alignment "post-training" were performed. Including: (1) Aligning the model through human feedback, including multiple rounds of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO); (2) Integrating new capabilities, such as tool use; (3) Enhancing coding and reasoning abilities (specialized optimization); (4) Safety alignment.
  9. Multimodal expansion (in progress, not yet released): Image, video, and speech capabilities. Including (1) Multimodal encoder pre-training: Image encoder trained on a large number of image-text pairs, aligning visual content and natural language in a unified space; (2) Speech self-training? (3) Experiments on video-text data alignment based on images.
  10. Language model as the core, other modalities are added later (whether added to pre-training and/or post-training). When expanding to multimodal, the language model parameters remain unchanged, adapting to multimodality, allowing multimodal alignment in the same semantic space, closer to the language model. In other words, Llama follows a modular, step-by-step approach to gradually expand to multimodality. This is not the mainstream approach (mainly referring to Open AI and Google, at least in theory) advocating for "unified multimodal native data joint pre-training". The overall impression of Llama's algorithmic strategies is seeking stability rather than innovation or unification. It tends towards practicality, not caring about leading in algorithms. For example, the integration of speech first involves speech self-training (because speech is actually very similar to text, both being language systems), then alignment between speech and text (including Automatic Speech Recognition ASR and Text-to-Speech TTS). Integrating step by step into the cross-modal large model, this approach isn't cutting-edge in terms of advancement, but it's steady progress, beneficial for engineering development, integration, and iteration. It's unclear when they will be able to release multimodal capabilities online.
  11. Data collection and cleaning work is very complex, but the Llama team is meticulous, which is also the data guarantee for its quality to catch up with SOTA. To recap: (1) De-duplication: URL-level de-duplication; document-level de-duplication using the MinHash algorithm; line-level de-duplication, removing lines that appear more than 6 times in each bucket of 30M documents. (2) Filtering: removing low-quality documents, outliers, and excessively repetitive documents; using repetitive n-gram coverage to remove repetitive content (such as logs or error messages); using "dirty word" counts to filter adult websites not covered by blacklists; using token-distribution KL divergence to filter documents with too many abnormal tokens. (3) Controlling data quality: using a fasttext classifier to identify text that might be cited by Wikipedia; using a Roberta-based classifier trained on Llama 2's predictions; using DistilRoberta to generate document quality scores. In addition, a fasttext language classifier can identify 176 languages, and two types of information are specifically filtered out: adult content and personal identity/privacy information. Code and math web pages receive special fine-grained processing.
  12. Data proportions: For example, downsampling over-represented data categories on the web (such as art and entertainment); data mixing ratios determined by a series of small model experiments, final data mix summary: About 50% of tokens correspond to general knowledge; 25% of tokens involve math and reasoning; 17% of tokens are code; 8% of tokens are multilingual content.
  13. Model architecture: Apart from empirical detail adjustments, the basic architecture of the dense model remains unchanged, so it's data and scaling that create top models. 405B model specific parameters: 126 layers; token representation dimension 16,384; 128 attention heads; model size of 405B determined according to scaling law, about the computational optimal size under 3.8 × 10^25 FLOPs training budget.
  14. Vocabulary: Using a vocabulary of 128K tokens. Combines 100K tokens from the tiktoken tokenizer and 28K additional multilingual tokens to better support non-English languages.
  15. Computing resources, including GPU clusters of tens of thousands of cards, massive storage, and high-speed networks, represent huge resource investments. Specific data as follows:
     Computing resources:
    • Used up to 16,000 H100 GPUs (a very powerful graphics processor).
    • Each GPU has 80GB of high-bandwidth memory, with a power of 700W.
    • These GPUs are installed on servers designed by Meta itself, with 8 GPUs and 2 CPUs per server.
     Storage system:
    • Uses a distributed file system called Tectonic.
    • Provides 240PB (1PB=1000TB) of storage space, distributed across 7,500 servers.
    • Can process 2TB of continuous data per second, with a peak of 7TB/second.
    • A major challenge is handling the large amount of burst writes generated when processing model checkpoints (the process of saving model states).
  16. Three-step pre-training process: a) Initial pre-training; b) Long context continued pre-training; c) Annealing with high-quality data sources.
     Key pre-training strategies:
    • Gradually increase batch size and sequence length to balance stability and efficiency.
    • Dynamically adjust data mixing to specifically enhance certain capabilities.
    • Increase context length in stages to avoid early computational overhead.
    • Use annealing and high-quality data in the late stages of training to fine-tune model performance.
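
The figures in notes 2 and 4 can be cross-checked with a few lines of arithmetic. The sketch below computes the tokens-per-parameter ratio and then estimates training compute with the common "6 × parameters × tokens" rule of thumb for dense Transformers. That rule is an outside assumption used only as a sanity check here; it does not come from the notes.

```python
PARAMS = 405e9      # 405B parameters (from the notes)
TOKENS = 15.6e12    # 15.6T training tokens (from the notes)

tokens_per_param = TOKENS / PARAMS

# Assumption: training FLOPs ~= 6 * N * D for a dense Transformer.
# This rule of thumb is not from the notes; it is a rough cross-check only.
approx_train_flops = 6 * PARAMS * TOKENS

print(f"tokens per parameter: {tokens_per_param:.1f}")
print(f"approx. training FLOPs: {approx_train_flops:.2e}")
```

The ratio comes out a little under 40, and the estimate lands within about one percent of the quoted 3.8 × 10^25 FLOPs, so the three numbers in the notes are mutually consistent.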

 

[LLM Summary]

Llama 3: Meta's Open-Source Large Language Model Breakthrough

1. Introduction and Overview

Meta has introduced Llama 3, a series of foundation language models designed to support various tasks including multilingual processing, programming, reasoning, and tool use. This model series includes versions with 8B, 70B, and 405B parameters, with the largest 405B parameter model adopting a dense Transformer architecture and supporting context windows of up to 128K tokens. The development of Llama 3 highlights three key factors: data quality and scale, computational scale, and complexity management.

2. Model Architecture and Pre-training Strategy

2.1 Model Architecture

Llama 3 retains the standard dense Transformer architecture rather than adopting a mixture of experts model. This choice aims to maximize training stability, reflecting Meta's emphasis on simplifying design to manage complexity. Key architectural improvements include:
- Using Grouped-Query Attention (GQA) mechanism, with 8 key-value heads per attention layer.
- Introducing attention masks to prevent self-attention between different documents in the same sequence.
- Expanding the vocabulary to 128K tokens, combining 100K tokens from the tiktoken tokenizer and 28K additional multilingual tokens.
- Increasing the RoPE base frequency hyperparameter to 500,000 to support longer contexts.
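
Two of the points above are easy to picture in code. The sketch below first maps query heads to shared key-value heads (using the 128 attention heads from the notes and the 8 key-value heads listed here), then builds an attention mask that is causal and also blocks attention across document boundaries within a packed sequence. The contiguous head grouping, the `doc_ids` representation, and the function name are illustrative assumptions, not details taken from the paper.

```python
NUM_QUERY_HEADS = 128   # from the notes
NUM_KV_HEADS = 8        # from the summary above

# Assumption: query heads are grouped contiguously, 16 per key-value head.
group_size = NUM_QUERY_HEADS // NUM_KV_HEADS
kv_head_of = [q // group_size for q in range(NUM_QUERY_HEADS)]


def document_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j.

    doc_ids[i] is a hypothetical index of the document that token i came from.
    A token may only look at earlier-or-equal positions in its own document.
    """
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]


# Two documents packed into one sequence: 3 tokens of doc 0, then 2 tokens of doc 1.
mask = document_causal_mask([0, 0, 0, 1, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block-diagonal and lower-triangular: the first token of the second document cannot see anything from the first document.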

2.2 Pre-training Data Processing

Llama 3's pre-training data processing is extremely rigorous, including:
- Multi-level deduplication: URL-level, document-level (using the MinHash algorithm), and line-level deduplication.
- Heuristic filtering: Removing low-quality documents, outliers, and excessively repetitive content.
- Model-based quality filtering: Using fasttext and Roberta-based classifiers for quality assessment.
- Special content processing: Developing specialized processing pipelines for code and mathematical content.
- Multilingual data processing: Using a fasttext-based language identification model, supporting 176 languages.
- Safety and privacy protection: Filtering website data containing personally identifiable information (PII) and unsafe content.
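
For readers unfamiliar with MinHash, the document-level deduplication step can be sketched as follows. This is a minimal illustration of the general technique, not Meta's pipeline: the 3-word shingles, the 64 seeded SHA-1 hashes, and all function names are assumptions made for the example.

```python
import hashlib


def shingles(text, k=3):
    """Set of k-word shingles for a document."""
    words = text.split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(items, num_hashes=64):
    """For each seeded hash function, keep the minimum hash value over the set."""
    signature = []
    for seed in range(num_hashes):
        prefix = f"{seed}:".encode()
        signature.append(min(
            int.from_bytes(hashlib.sha1(prefix + item.encode()).digest()[:8], "big")
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated text about training large language models at scale"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
sim_aa = estimated_jaccard(sig_a, sig_a)
sim_ab = estimated_jaccard(sig_a, sig_b)   # near-duplicate pair
sim_ac = estimated_jaccard(sig_a, sig_c)   # unrelated pair
```

A pipeline would then drop one document from every pair whose estimated similarity exceeds some threshold. Comparing short signatures instead of full shingle sets is what makes this feasible at web scale.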

2.3 Pre-training Strategy

The pre-training process is divided into three main stages:
1. Initial pre-training: Conducted on about 15T multilingual tokens, far exceeding Llama 2's 1.8T tokens.
2. Long context pre-training: Gradually expanding from initial 8K tokens to 128K tokens context window.
3. Annealing phase: Fine-tuning with high-quality data in the final stage, using Polyak averaging to generate the final model.

Data mixing ratios are carefully designed:
- 50% general knowledge
- 25% mathematics and reasoning
- 17% code
- 8% multilingual content

3. Training Infrastructure and Challenges

3.1 Computational Resources
- Using up to 16K H100 GPUs, each equipped with 80GB HBM3 memory.
- Adopting a 4D parallel strategy: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.

3.2 Storage System
- Using the Tectonic distributed file system, providing 240PB of storage space.
- Supporting 2TB/s sustained throughput, with peak capacity of 7TB/s.

3.3 Network Optimization
- Developing the NCCLX communication library to improve network efficiency.
- Designing specific network topologies and load balancing strategies.

3.4 Training Challenges
- Experiencing 466 job interruptions during the 54-day training period, 419 of which were unexpected.
- Developing automated systems and specialized tools to handle hardware failures and network issues.

4. Post-training and Alignment

Llama 3 adopts a multi-round iterative post-training process, including:
1. Supervised Fine-Tuning (SFT)
2. Direct Preference Optimization (DPO)
3. Reward model training: Using human feedback data
4. Safety alignment: Implementing multiple rounds of safety measures

This process not only improves the model's instruction-following capabilities but also enhances safety and specific abilities (such as coding and reasoning).
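
As a rough illustration of step 2, the standard DPO objective for a single preference pair can be written in a few lines. This follows the published DPO formulation in general; the argument names and the default `beta` are assumptions for the example and say nothing about Llama 3's actual settings.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    "policy" is the model being trained; "ref" is the frozen reference model.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
loss_at_start = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
loss_after_shift = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

No separate reward model or sampling loop appears in the loss itself, which is part of why DPO is considered simpler to run than reinforcement learning.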

5. Multimodal Expansion

Although not officially released yet, Llama 3 demonstrates promising multimodal capabilities:
- Image recognition: Training independent image encoders, integrated with the language model through adapters.
- Video understanding: Adding video adapters based on image adapters.
- Speech processing: Independently training speech encoders, then aligning with the language model.

This modular approach allows flexible addition of new modalities while maintaining core language capabilities.

6. Performance Evaluation

Llama 3 performs excellently in multiple benchmark tests:
- MMLU (5-shot): 87.3%
- HumanEval (code generation): 89.0%
- GSM8K (math problems): 96.8%
- Long context tasks (like QuALITY): 95.2%
- MGSM (multilingual tasks): 91.6%

These results indicate that Llama 3 405B is comparable or close to GPT-4 and Claude 3.5 Sonnet on multiple tasks, particularly excelling in document understanding and long context tasks.

7. Safety Considerations

Meta highly prioritizes safety in the development of Llama 3:
- Implementing strict safety measures in both pre-training and post-training stages.
- Developing the Llama Guard system-level safety solution.
- Conducting extensive red team testing and risk assessments.

8. Open Source Impact and Future Directions

Meta's decision to publicly release the entire Llama 3 series, including the 405B parameter version, may have far-reaching impacts on the AI research community:
- Promoting open, responsible AI development.
- Accelerating AGI research progress.
- Providing researchers with opportunities to examine and improve large-scale language models.

Future development directions may include:
- Further improving multimodal integration.
- Expanding context length.
- Continuously enhancing data quality and model scale.

9. Conclusion

The development of Llama 3 demonstrates Meta's deep experience and forward-thinking in large-scale AI systems. By focusing on three key levers - data quality, computational scale, and complexity management - Llama 3 has reached or approached the current state-of-the-art level on several key benchmarks. Its open-source release may drive a wave of innovation across the entire AI field, paving the way for responsible AGI development.

Llama 3: Meta's AI Chef's Latest "Divine Delicacy"

Attention, all tech enthusiasts! The Michelin three-star AI chef Meta has just unveiled a new dish! This divine delicacy named "Llama 3" is not only spicy enough but will elevate your taste buds to new heights!

1. The Chef's Secret Weapon

Imagine Llama 3 as a super nanny who speaks 8 languages, writes code, does math, and can be your personal assistant. She can handle a kindergarten full of rambunctious kids (8B version), manage a mid-sized company (70B version), or even govern a small country (405B version)! This 405B big sister can remember 128,000 "gossips" (oh no, I mean context) simultaneously, essentially a walking encyclopedia + supercomputer!

2. Ingredient Selection: Only the Freshest!

Llama 3's chefs are masters at picking ingredients:

  • They "fished" 15 trillion words from the internet, nearly 10 times more than the previous generation!
  • Half of these words are everyday life seasonings, a quarter are math problems and brain teasers, nearly a fifth are programmer spells, and the rest are dialects learned from world travels.
  • They even invented a super weed remover, filtering out all the online garbage, repetitive, and unhealthy stuff.

3. Cooking Process: Three-Step Stir-Fry Method

Step 1: "Slow Simmer" - Start with a regular stove (8K context) to cook it halfway.
Step 2: "High Heat Stir-Fry" - Switch to a super stove (gradually increasing to 128K context), reducing the sauce to be thick and fragrant.
Step 3: "Low Heat Finish" - Finally, a gentle simmer with the best ingredients, the legendary "annealing" (even the chefs don't know why it's called that), bringing the flavor to its peak!

4. Kitchen Equipment: Top-of-the-Line Luxury Version

  • 16,000 super high-power induction cookers (H100 GPUs) firing simultaneously!
  • A refrigerator that could fit half the Pacific Ocean (240PB storage)!
  • A proprietary ingredient prep system faster than 5G (NCCLX communication library)!

Imagine all these stoves firing at once, making the kitchen feel like a sauna. But our chefs persevered through the heat, changing chef uniforms 466 times in 54 days to whip up this dish!

5. Training Method: Both Cute and Well-Mannered

Being a good cook isn't enough; you've got to have manners too! So our chefs began a long "training" process:

  • First came a round of "gentle education" (supervised fine-tuning)
  • Then the "carrot and stick" tactic (direct preference optimization)
  • Finally, they invited moral role models (safety alignment) for guidance

After all this fuss, Llama 3 not only cooks well but also knows how to please people, program, do math, and mind her manners - a true decathlon champion!

6. Special Side Dishes: Showcasing Multiple Talents

Don't think Llama 3 can only cook; she's a multi-talented "goddess":

  • Storytelling from images? Piece of cake!
  • Writing movie reviews? No problem!
  • Recognizing songs and even singing a bit? The karaoke queen!

Although these "talents" are still in practice, they already show the potential of Li Bai's "from black hair to snow white in a day"!

7. A True Powerhouse: Dazzling Test Scores

Llama 3 participated in a series of "Top Chef Competitions," with eye-popping scores:

  • College Entrance Exam (MMLU): 87.3 points (out of 100)
  • Programmer Interview (HumanEval): 89 points (out of 100)
  • Math Olympiad (GSM8K): 96.8 points (out of 100)
  • Long Novel Reading Comprehension (QuALITY): 95.2 points (out of 100)

Bring this report card home, and even a "Tiger Mom" would be grinning from ear to ear!

8. Safety First: AI's "Security Captain"

Meta's chefs know well the principle of "don't leave guns and ammo lying around." They've assigned Llama 3 a 24/7 bodyguard team (Llama Guard) to prevent her from accidentally saying or doing the wrong thing. They even arrange occasional "moral exams" to ensure she doesn't turn into a "Terminator."

9. Open Source Feast: Everyone Can Be a Master Chef!

The most impressive part is that Meta decided to make the recipe for this "divine delicacy" completely public! It's like a Michelin three-star restaurant putting their signature dish's recipe online. Now anyone who wants to can whip it up at home! This move not only shocked other master chefs but also made countless food lovers cheer with joy!

10. Future Outlook: Reaching New Heights

Meta's chefs aren't resting on their laurels; they're already pondering the next "divine delicacy":

  • Maybe a dancing Llama 4?
  • Or a painting Llama 5?
  • Who knows, one day we might see a Llama 6 composing symphonies!

In short, the AI world's "Michelin" journey has only just begun!

Epilogue

The birth of Llama 3 not only elevates Meta's status in the AI world but also brings a fresh breeze to the entire AI research community. This bowl of "Llama soup" is not only delicious but also brings unlimited imagination to everyone. What will the future of AI be like? Let's wait and see what flavor the next "divine delicacy" will be!

辣妈3笔记 以及辣妈MV与辣妈3.1传奇

Meta超大模型辣妈3.1 发布时的92页论文读书笔记

超大模型 辣妈3.1 是大模型开源社区的里程碑。作为 leader,Meta 该项目的参与者/贡献者有 500 多人(这篇论文的作者署名作为附录按字母顺序排列,跟中央委员姓氏笔划公示似的)。这是充满了实现细节描述的原文

meta Llama 3.1 paper

AIGC MV (just for fun & cheering opensource milestone)

【立委笔记】

1. 辣妈3.1 没有用稀疏技术,不是 model 4 那种多专家系统,而是一个 dense model

2. 405B参数,15.6T tokens:token 数是参数的 40 倍

超大规模头部模型现在强调的都是数据量远大于参数量的增长,这 15T tokens 的数据也是开源的吗?(不会,因为即便愿意开源,也没这个胆子,可能会引起无数数据侵权的官司)

3. 强调超大基础模型的三大杠杆:data, scale, and managing complexity.

4. 比上一代系统辣妈2,计算量增加了 50 倍 (using 3.8 × 10^25 FLOPs)

5. 复杂性管理:(1)选择标准的密集Transformer架构,而非混合专家模型,以最大化训练稳定性。(2)采用相对简单的后训练程序:监督微调(SFT)、拒绝采样(RS)和直接偏好优化(DPO)。就是说,算法设计和实现上,趋于简单化。不利用稀疏技术和分流多专家系统,为的是稳定性(但训练挑战更大,但他们不怕)。后训练阶段用更简单、容易实现的DPO,而不用强化学习,也是为了稳定性,因为强化学习从来就不好缠。

6. 基准测试涵盖:通用、代码、数学、推理、工具使用、长上下文和多语言。全部表现为 SOTA(国际先进水平)。

MMLU(大规模多任务语言理解):405B模型达到87.3%(5-shot),88.6%(0-shot,CoT)。
代码生成(HumanEval):405B模型达到89.0%,接近GPT-4。
数学问题(GSM8K):405B模型达到96.8%,略高于GPT-4。
长上下文任务:在某些任务上表现优异,如QuALITY达到95.2%。
多语言任务(MGSM):405B模型达到91.6%,与顶级模型持平。

405B模型在许多任务上与GPT-4和Claude 3.5 Sonnet相当或接近。一句话,开源追平了闭源。

7. 预训练开始是 8k 窗口,在预训练后期(继续训练)时候扩展到了 128k 窗口。

8. 基础模型预训练完成后,经过多次迭代的对齐“后训练”。

包括:(1)通过人类反馈对齐模型,包括多轮的监督微调(SFT)和直接偏好优化(DPO);(2)集成新能力,如工具使用;(3)增强编码和推理能力(专项优化);(4)安全对齐。

9. 多模态扩展(进行中,未上线发布):图像、视频和语音能力。

包括 (1)多模态编码器预训练:图像编码器在大量图像-文本对上训练,统一空间里对齐视觉内容和自然语言;(2)语音独自自训练?(3) 在图片基础上,进一步作视频-文本数据对齐的实验。

10. 语言模型为核心,其他模态都是后加的(无论加到预训练 and/or 后训练)。

在扩展到多模态的时候,语言模型保持参数不变,适配的是多模态,让多模态在同一个意义空间对齐、靠近语言模型。换句话说,辣妈是遵循模块化、步骤化的方式逐渐扩展到多模态。而不是采用主流(主流主要指 Open AI 和谷歌,至少在理论上)倡导的“大一统的多模态原生数据的联合预训练”。

辣妈的所有算法策略,总的印象是求稳,而不是求创新 或 求大一统。偏向于务实,不在乎自己的算法领先。例如语音的集成,先是语音的自训练(因为语音与文字其实很类似,都是语言体系),然后是语音与文字的对齐(包括语音识别 ASR 和语音合成 TTS)。一步一步集成进跨模态的大模型,这方面谈不上先进性,但一步一个脚印,有利于工程化的开发、集成和迭代。不知道他们什么时候可以上线发布多模态的能力?

11. 数据收集清洗工作非常繁杂,但辣妈团队一丝不苟,这也是它质量可以追平 SOTA 的数据保证。

盘点一下:

(1)去重(De-duplication):URL级别去重; 文档级别去重:使用MinHash算法;行级别去重:每30M文档移除出现超过6次的行。
(2)过滤:移除低质量文档、异常值和过度重复的文档,使用重复n-gram覆盖率移除重复内容(如日志或错误消息);使用"脏词"计数过滤未被黑名单覆盖的成人网站;使用token分布KL散度过滤含异常token过多的文档。
(3) 控制数据品质:使用fasttext分类器识别可能被维基百科引用的文本;使用基于Roberta的分类器,该分类器基于Llama 2的预测进行训练;使用DistilRoberta生成文档质量分数。还有 fasttext 的语言分类器,可识别176种语言;特别过滤两类信息:黄色;个人身份等隐私信息。对代码和数学的网页做特别的精细处理。

12. 数据比例:例如,对网络上过度表示的数据类别(如艺术和娱乐)进行降采样;数据混合比例由一系列小模型实验决定比例取舍,最终数据混合摘要:

约50%的token对应一般知识;25%的token涉及数学和推理;17%的token是代码;8%的token是多语言内容。

13. 模型架构:除了经验性细节调整,dense 模型的基本架构不变,所以是数据和规模化造就了头部模型。

405B模型具体参数:126层;token表示维度16,384;128个注意力头;根据scaling law决定模型大小为405B,约为3.8 × 10^25 FLOPs训练预算下的计算最优size。

14. 词汇表:使用128K个token的词汇表。结合了tiktoken分词器的100K个token和28K个额外的多语言tokens,以更好地支持非英语语言。

15. 计算机资源,包括GPU万卡集群、海量存储和高速网络,巨大的资源投入。具体数据如下:

计算资源:
使用了多达16,000个H100 GPU(一种非常强大的图形处理器)。
每个GPU都有80GB的高带宽内存,功率为700W。
这些GPU被安装在Meta自己设计的服务器上,每个服务器有8个GPU和2个CPU。

存储系统:
使用了一个叫Tectonic的分布式文件系统。
提供了240PB(1PB=1000TB)的存储空间,分布在7,500台服务器上。
可以每秒处理2TB的持续数据,峰值可达7TB/秒。
一个主要挑战是处理模型检查点(保存模型状态的过程)时产生的大量突发写入。

16. 三步预训练过程:a) 初始预训练;b) 长上下文继续预训练;c) 用高质量数据源退火(Annealing)

预训练关键策略:
逐步增加批量大小和序列长度,以平衡稳定性和效率。
动态调整数据混合,针对性地提升特定能力。
分阶段增加上下文长度,避免早期的计算开销。
在训练后期使用退火和高质量数据,微调模型性能。

———————————

【LLM摘要】Llama 3:Meta的开源大模型突破

1. 引言与概述

Meta公司推出的Llama 3是一系列基础语言模型,旨在支持多语言处理、编程、推理和工具使用等多种任务。这个模型系列包括8B、70B和405B参数三个版本,其中最大的405B参数模型采用密集Transformer架构,支持多达128K tokens的上下文窗口。Llama 3的开发突显了三个关键因素:数据质量与规模、计算规模、以及复杂性管理。

2. 模型架构与预训练策略

2.1 模型架构
Llama 3沿用了标准的密集Transformer架构,而非采用混合专家模型。这一选择旨在最大化训练稳定性,体现了Meta对简化设计以管理复杂性的重视。主要的架构改进包括:
- 使用分组查询注意力(GQA)机制,每个注意力层有8个键值头。
- 引入防止同一序列中不同文档之间自注意力的注意力掩码。
- 扩展词汇表至128K tokens,结合了tiktoken分词器的100K tokens和28K个额外多语言tokens。
- 将RoPE基频超参数增加到500,000,以支持更长的上下文。

2.2 预训练数据处理
Llama 3的预训练数据处理极为严格,包括:
- 多层次去重:URL级、文档级(使用MinHash算法)和行级去重。
- 启发式过滤:移除低质量文档、异常值和过度重复内容。
- 模型基础质量过滤:使用fasttext和基于Roberta的分类器进行质量评估。
- 特殊内容处理:为代码和数学内容开发专门的处理流程。
- 多语言数据处理:使用fasttext基础语言识别模型,支持176种语言。
- 安全与隐私保护:过滤包含个人可识别信息(PII)和不安全内容的网站数据。

2.3 预训练策略
预训练过程分为三个主要阶段:
1. 初始预训练:在约15T多语言tokens上进行,远超Llama 2的1.8T tokens。
2. 长上下文预训练:从初始的8K tokens逐步扩展到128K tokens的上下文窗口。
3. 退火阶段:在最后阶段使用高质量数据进行微调,并采用Polyak平均法生成最终模型。

数据混合比例经过精心设计:
- 50%通用知识
- 25%数学和推理
- 17%代码
- 8%多语言内容

3. 训练基础设施与挑战

3.1 计算资源
- 使用多达16K个H100 GPUs,每个GPU配备80GB HBM3内存。
- 采用4D并行策略:张量并行、流水线并行、上下文并行和数据并行。

3.2 存储系统
- 使用Tectonic分布式文件系统,提供240PB存储空间。
- 支持2TB/s的持续吞吐量,峰值可达7TB/s。

3.3 网络优化
- 开发NCCLX通信库,提高网络效率。
- 设计特定的网络拓扑和负载均衡策略。

3.4 训练挑战
- 在54天训练期间经历466次作业中断,其中419次为意外中断。
- 开发自动化系统和专门工具处理硬件故障和网络问题。

4. 后训练与对齐

Llama 3采用多轮迭代的后训练过程,包括:
1. 监督微调(SFT)
2. 直接偏好优化(DPO)
3. 奖励模型训练:使用人类反馈数据
4. 安全对齐:实施多轮安全措施

这一过程不仅提升了模型的指令遵循能力,还增强了安全性和特定能力(如编码和推理)。

5. 多模态扩展

虽然尚未正式发布,但Llama 3展示了promising的多模态能力:
- 图像识别:训练独立的图像编码器,通过适配器与语言模型集成。
- 视频理解:在图像适配器基础上添加视频适配器。
- 语音处理:独立训练语音编码器,然后与语言模型对齐。

这种模块化方法允许灵活添加新模态,同时保持核心语言能力不变。

6. 性能评估

Llama 3在多项基准测试中表现卓越:
- MMLU(5-shot):87.3%
- HumanEval(代码生成):89.0%
- GSM8K(数学问题):96.8%
- 长上下文任务(如QuALITY):95.2%
- MGSM(多语言任务):91.6%

这些结果表明Llama 3 405B在多个任务上与GPT-4和Claude 3.5 Sonnet相当或接近,特别是在文档理解和长上下文任务上表现突出。

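7. Safety Considerations

Meta placed heavy emphasis on safety throughout the development of Llama 3:
- Strict safety measures in both the pre-training and post-training stages.
- Llama Guard, a system-level safety solution.
- Extensive red teaming and risk assessment.

8. Open-Source Impact and Future Directions

Meta decided to publicly release the full Llama 3 family, including the 405B-parameter version. The decision may have a far-reaching effect on the AI research community:
- Promoting open and responsible AI development.
- Accelerating AGI research.
- Giving researchers the opportunity to examine and improve large-scale language models.

Possible future directions include:
- Further improving multimodal integration.
- Extending the context length.
- Continuing to improve data quality and model scale.

9. Conclusion

The development of Llama 3 shows Meta's deep experience with large-scale AI systems and its forward-looking approach. By focusing on three key levers (data quality, compute scale, and managing complexity), Llama 3 matches or approaches the current state of the art on several key benchmarks. Its open release may drive a wave of innovation across the AI field and pave the way for responsible AGI development.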

———————————

老友说大模型做摘要千篇一律 看了等于没看。其实那是因为学术圈摘要 人类就是这么写的 难免有点八股 怪不得大模型。其实大模型擅长各种表述 端看你怎么提点它了。例如,指令: 根据上面的《辣妈3之歌》,讲一个风趣幽默但不离谱的故事:

好的,我来尝试用更加风趣生动、通俗易懂的方式重新描述Llama 3这个庞然大物:

辣妈3号:Meta的AI大厨烹饪的最新"神仙美味"

各位吃瓜群众请注意,AI界的米其林三星大厨Meta又出新菜了!这道名为"辣妈3号"的神仙美味,不仅辣得够味,还要把你的味蕾辣出新境界!

1. 主厨的秘密武器

想象一下,辣妈3号就像是一位会8国语言、能写代码、会算数、还能当你小助手的超级保姆。她不仅能照顾一个幼儿园的熊孩子(8B版本),还能管理一个中型公司(70B版本),甚至能治理一个小国家(405B版本)!这位405B大姐能同时记住12.8万个"流言蜚语"(哦不,是上下文),简直是行走的百科全书+超级计算机!

2. 食材选择:只选最新鲜的!

辣妈3号的主厨们可是挑食材的老手:
- 他们从网上"海捞"了15万亿个词,比上一代多了近10倍!
- 这些词里,一半是日常生活的调料,四分之一是数学题和脑筋急转弯,近五分之一是程序员的咒语,剩下的是环游世界学来的各国方言。
- 他们还发明了一个超级除草机,把网上的垃圾、重复的、不健康的统统都筛了出去。

3. 烹饪过程:三步炒作法

第一步:"小火慢炖"- 先用普通灶台(8K上下文)煮个半熟。
第二步:"大火爆炒"- 换成超级灶台(逐步加到128K上下文),把汤汁收得又浓又香。
第三步:"温火收尾"- 最后用最好的食材轻轻一熬,这就是传说中的"退火"(连厨师自己都不知道为啥叫这名),让味道达到巅峰!

4. 厨房设备:顶配豪华版

- 16000个超级大功率电磁炉(H100 GPU)同时开火!
- 一个能装下半个太平洋的冰箱(240PB存储)!
- 比5G还快的专属配菜系统(NCCLX通信库)!

想象一下,这么多灶台同时开火,厨房里热得跟桑拿房似的。但我们的大厨们愣是顶着高温,54天里换了466次厨师服,才把这道菜给炒出来!

5. 调教方法:既要软萌可爱,又要知书达理

光会做菜不行,还得懂礼貌啊!于是我们的厨师们开始了漫长的"调教"过程:
- 先是来了一轮"温柔教育"(监督微调)
- 接着是"棒棒糖加大棒"战术(直接偏好优化)
- 最后还请来了道德模范(安全对齐)来指导

经过这番折腾,辣妈3号不仅会做菜,还会哄人、会编程、会算数、懂礼貌,简直就是十项全能!

6. 特色小菜:多才多艺显身手

别以为辣妈3号只会做饭,她还是个多才多艺的"女神":
- 看图说故事?小菜一碟!
- 看视频写影评?不在话下!
- 听歌识曲还能唱两句?简直就是KTV麦霸!

虽然这些"才艺"还在练习中,但已经有几分李白"朝如青丝暮成雪"的潜质了!

7. 实力派:考试成绩单亮瞎眼

辣妈3号参加了一系列"星厨争霸赛",成绩简直亮瞎眼:
- 大学入学考试(MMLU):87.3分(满分100)
- 程序员面试(HumanEval):89分(满分100)
- 奥数比赛(GSM8K):96.8分(满分100)
- 长篇小说阅读理解(QuALITY):95.2分(满分100)

这成绩单拿回家,恐怕连"虎妈"都要笑得合不拢嘴!

8. 安全第一:AI界的"保安队长"

Meta的厨师们深知"枪支弹药不能乱放"的道理。他们给辣妈3号配了个24小时保镖团队(Llama Guard),防止她不小心说错话或做错事。还时不时安排几次"道德考试",确保她不会变成"终结者"。

9. 开源盛宴:人人都能当大厨!

最厉害的是,Meta决定把这道"神仙美味"的菜谱完全公开!这就相当于米其林三星餐厅把招牌菜的制作方法放到了网上。以后谁想做,在家就能整一道!这波操作,不仅让其他大厨们惊掉了下巴,还让无数吃货们欢呼雀跃!

10. 未来展望:更上一层楼

Meta的厨师们可没闲着,他们已经在琢磨下一道"神仙美味"了:
- 也许是会跳舞的辣妈4号?
- 或者会画画的辣妈5号?
- 搞不好哪天就冒出个会写交响乐的辣妈6号!

总之,AI界的"米其林"之路,才刚刚开始呢!

结语

辣妈3号的诞生,不仅让Meta在AI界的江湖地位更上一层楼,还为整个AI研究界带来了一股新风。这碗"辣妈汤",不仅好喝,还给大家带来了无限的想象空间。AI的未来会怎样?让我们拭目以待,看看下一道"神仙美味"会是什么滋味!

———————————

下面才是中规中矩、非常忠实的另外几个LLM论文摘要,出自头部大模型,角度和风格略有不同,可以对比着快速浏览,获得辣妈3的基本信息 。我核对过,觉得比人类写得好。

【摘要2】

Meta最新推出的Llama 3系列基础模型,旨在支持多语言任务、编程、推理和工具使用,其旗舰模型拥有4050亿参数,上下文窗口可达128K tokens。本文将深入探讨Llama 3的模型架构、训练过程、性能评估、多模态扩展、数据处理与质量控制、复杂性管理以及安全性等方面。

模型架构与训练

Llama 3采用了标准的密集Transformer架构,而非混合专家模型,以最大化训练稳定性。模型训练分为三个主要阶段:

  1. 初始预训练:使用大规模多语言文本语料库进行预训练,模型在此阶段学习语言结构和大量知识。初始上下文长度为8K tokens。
  2. 长上下文预训练:逐步增加上下文长度至128K tokens,以支持更长的输入和输出。
  3. 退火:对高质量数据进行上采样,并采用Polyak平均化方法生成最终预训练模型。

数据处理与质量控制

Llama 3的数据处理过程极其严格,包括大规模文本语料库清理、去重和过滤,以确保数据的高质量。具体步骤包括:

  1. PII和安全过滤:过滤掉可能包含个人可识别信息(PII)和不安全内容的网站数据。
  2. 文本提取和清理:使用自定义解析器从HTML内容中提取高质量文本,并保留数学和代码内容的结构。
  3. 去重:在URL、文档和行级别上进行多轮去重。
  4. 启发式过滤:通过启发式方法去除低质量文档、异常值和重复内容。
  5. 模型质量过滤:使用模型分类器对数据进行进一步的质量筛选,确保训练语料库的高质量。

最终数据混合比例为:50%通用知识,25%数学和推理,17%代码,8%多语言内容。

模型评估与性能

Llama 3在多个基准数据集上进行了评估,包括多语言理解、编程、推理等任务。其旗舰模型在多个关键基准上的性能接近甚至超过了GPT-4。具体来说:

  • 多语言能力:支持至少八种语言的问答和指令执行。
  • 编程能力:能够编写高质量代码,并在HumanEval等编程基准上表现优异。
  • 推理能力:在ARC Challenge等推理基准上表现出色。
  • 长上下文任务:在某些任务上表现优异,如QuALITY达到95.2%。
  • 多语言任务(MGSM):与顶级模型持平。

多模态扩展

Llama 3还进行了多模态扩展实验,整合了图像、视频和语音的识别能力。这些模型通过组合方法进行训练,显示出与当前最先进模型在图像、视频和语音任务上的竞争力。

复杂性管理与安全性

Llama 3在设计上做出了一些简化以管理复杂性,如采用标准的密集Transformer模型架构,而非混合专家模型,从而最大限度地提高训练的稳定性。此外,Llama 3引入了Llama Guard 3模型,用于输入和输出的安全性过滤,并在训练过程中实施多轮安全措施,以确保模型的安全性和可靠性。

开源与未来发展

Llama 3的所有模型将在更新的Llama 3社区许可证下公开发布,包括预训练和后训练版本的4050亿参数语言模型。通过公开发布旗舰模型,Meta希望推动研究社区的创新,并加速通向人工通用智能(AGI)的负责任发展。

总结

Llama 3展示了在多语言、编程、推理和工具使用等方面的显著改进。通过优化数据质量和多样性、扩大训练规模以及简化模型架构,Llama 3在多个关键基准上达到了或接近了当前最先进模型的性能。Llama 3的公开发布将对AI研究社区产生深远影响,并加速AGI的负责任发展。

【摘要3】

引言

现代人工智能系统的基础是基础模型。本综述介绍了Meta推出的新一代基础模型集——Llama 3。Llama 3是一组支持多语言、编程、推理和工具使用的语言模型,其最大模型包含4050亿参数,并支持最多128K的上下文窗口。本论文对Llama 3进行了广泛的实证评估,结果显示其在多种任务上的性能与当前领先的语言模型如GPT-4相当。

模型架构与训练

Llama 3采用标准的稠密Transformer架构,并在数据质量和多样性以及训练规模上进行了显著改进。模型分为三个主要阶段进行训练:

  1. 初始预训练:使用大规模多语言文本语料库对模型进行预训练,模型在此阶段学习语言结构并获得大量知识。初始上下文长度为8K tokens。
  2. 长上下文预训练:在初始预训练后,通过逐步增加上下文长度进行进一步预训练,最终支持128K tokens的上下文窗口。
  3. 退火:在最后的退火阶段,对高质量数据进行上采样,并采用Polyak平均化方法生成最终预训练模型。

Llama 3在设计上做出了一些简化以管理复杂性,如采用标准的稠密Transformer模型架构,而非混合专家模型,从而最大限度地提高训练的稳定性。后训练阶段采用相对简单的程序:监督微调(SFT)、拒绝抽样(RS)和直接偏好优化(DPO)。

模型评估与性能

Llama 3的模型被评估在多个基准数据集上,包括多语言理解、编程、推理等任务。其旗舰模型在多个关键基准上的性能接近甚至超过了GPT-4,具体表现如下:

  • 多语言能力:Llama 3支持至少八种语言的问答和指令执行,包括英语、德语、法语、意大利语、葡萄牙语、印地语、西班牙语和泰语。
  • 编程能力:能够编写高质量代码,并在HumanEval等编程基准上表现优异。
  • 推理能力:在ARC Challenge等推理基准上表现出色。
  • 长上下文任务:在某些任务上表现优异,如QuALITY达到95.2%。
  • 多语言任务:在MGSM测试中,405B模型达到91.6%,与顶级模型持平。

Llama 3还进行了多模态扩展实验,整合了图像、视频和语音的识别能力。这些模型通过组合方法进行训练,显示出与当前最先进模型在图像、视频和语音任务上的竞争力。

数据处理与质量控制

Llama 3的数据处理过程包括大规模的文本语料库清理、去重和过滤,以确保数据的高质量。具体步骤包括:

  1. PII和安全过滤:过滤掉可能包含个人可识别信息(PII)和不安全内容的网站数据。
  2. 文本提取和清理:使用自定义解析器从HTML内容中提取高质量文本,并保留数学和代码内容的结构。
  3. 去重:在URL、文档和行级别上进行多轮去重。
  4. 启发式过滤:通过启发式方法去除低质量文档、异常值和重复内容。
  5. 模型质量过滤:使用模型分类器对数据进行进一步的质量筛选,确保训练语料库的高质量。

复杂性管理

Llama 3在设计上做出了一些简化以管理复杂性,如采用标准的稠密Transformer模型架构,而非混合专家模型,从而最大限度地提高训练的稳定性。通过优化数据质量和多样性、扩大训练规模以及简化模型架构,Llama 3在多个关键基准上达到了或接近了当前最先进模型的性能。

安全性

Llama 3引入了Llama Guard 3模型,用于输入和输出的安全性过滤。该模型在训练过程中实施多轮安全措施,以确保模型的安全性和可靠性。

训练基础设施

Llama 3的训练在Meta的生产集群上进行,使用多达16,000个H100 GPUs,每个GPU具有80GB的高带宽内存,功率为700W。存储系统使用Tectonic分布式文件系统,提供240PB的存储空间,支持每秒2TB的持续数据处理能力,峰值可达7TB/秒。网络方面,使用RDMA over Converged Ethernet (RoCE) 和Nvidia Quantum2 Infiniband fabric。

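Post-training and Alignment

Llama 3's post-training and alignment process includes:

  1. Reward modeling and supervised fine-tuning (SFT): a reward model is trained on human feedback data and then used to guide the choice of model generations, through the rejection sampling mentioned above, for fine-tuning.
  2. Direct preference optimization (DPO): further aligns the model with human preferences.

Over multiple rounds of iteration, the model's instruction following and safety are steadily improved.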

社区贡献与未来发展

Llama 3的所有模型将在更新的Llama 3社区许可证下公开发布,包括预训练和后训练版本的4050亿参数语言模型。通过公开发布旗舰模型,Meta希望推动研究社区的创新,并加速通向人工通用智能(AGI)的负责任发展。

结论

总体而言,Llama 3展示了在多语言、编程、推理和工具使用等方面的显著改进。通过优化数据质量和多样性、扩大训练规模以及简化模型架构,Llama 3在多个关键基准上达到了或接近了当前最先进模型的性能。通过公开发布如此大规模的模型,Meta展现了其推动开放、负责任的AI发展的决心,这可能会对整个AI领域产生深远影响。

【摘要4】

本文详细介绍了Meta推出的基础语言模型Llama 3。Llama 3是一组支持多语言、编程、推理和工具使用等多种任务的模型,其最大模型具有405B参数和128K token的上下文窗口。

Llama 3的开发基于数据、规模和复杂性管理三大关键支柱。在数据方面,其预训练数据量达约15T多语言tokens,远超Llama 2的1.8T tokens,并进行了严格的数据清洗和质量控制,包括PII和安全过滤、文本提取和清理、多轮去重、启发式过滤以及模型质量过滤等,数据混合比例为约50%通用知识、25%数学和推理、17%代码、8%多语言内容。

在规模上,旗舰语言模型使用3.8 × 10^25 FLOPs进行预训练,规模远大于以前的Llama模型。在管理复杂性方面,采用标准的密集Transformer架构和相对简单的后训练程序,如监督微调(SFT)、拒绝采样(RS)和直接偏好优化(DPO),以最大化训练稳定性。

Llama 3在多个基准数据集上进行了广泛评估,包括多语言理解、编程、推理等任务。其旗舰模型在许多任务上的表现与GPT - 4等领先语言模型相当或接近,在帮助性和无害性之间取得了更好的平衡。例如,在MMLU、HumanEval、GSM8K、QuALITY、MGSM等基准测试中表现出色。

此外,Llama 3还进行了多模态扩展实验,通过组合方法整合图像、视频和语音的识别能力。图像编码器在大量图像 - 文本对上训练,语音模块进行了自训练并与语言模型对齐,视频 - 文本数据也进行了对齐实验。这些模型仍在积极开发中,尚未准备好发布。

Llama 3的训练基础设施强大,使用多达16K个H100 GPUs,采用4D并行策略和专门的网络拓扑及负载均衡策略。训练过程中面临硬件挑战,但通过自动化系统和专门工具解决。在预训练和后训练阶段都实施了严格的安全措施,开发了Llama Guard系统级安全解决方案,并进行了广泛的红队测试和风险评估。

Meta公开发布了Llama 3模型,包括405B参数版本,这可能会推动AI研究社区的创新,加速负责任的AGI开发。未来可能会进一步提升模型能力,如改进多模态集成和扩展上下文长度,同时持续改进数据质量和模型规模。总的来说,Llama 3强调了数据质量、计算规模和复杂性管理在开发高质量基础模型中的重要性,其算法策略求稳务实,通过一丝不苟的数据收集清洗工作保证了模型质量。

 

What did Ilya see? -- secret behind success of LLMs

What did Ilya see?

-- looking closely into his historical Berkeley talk

by Wei Li, Jia Gao

Introduction

When Ilya Sutskever left OpenAI and re-emerged with his new company, SSI (Safe Superintelligence Inc.), the move was both surprising and expected—he bypassed AGI and directly aimed at SSI (Safe Superintelligence). He confidently declared: Superintelligence is imminent, and establishing safe superintelligence (SSI) is the most important technological issue of our time.

Ilya, a legend in the field of deep learning and AI, and the former true soul of OpenAI, was at the center of the dramatic internal shift, addressing the issue—effective acceleration versus super alignment. Why was Ilya so steadfast about "super alignment" amid the underlying AI values and strategic path debate? Even after the storm settled, the outside world continued to speculate: what did Ilya see that compelled him to join the board in making the decision to oust CEO Sam Altman? Ilya remained hidden until recently, when he left OpenAI, leading to the dissolution of his super alignment team and the creation of his new company.

What did he see behind the push for "safe intelligence"?

Back on October 3, 2023, Ilya gave a talk at UC Berkeley titled "A Theory of Unsupervised Learning." Though obscure and known to few, it is destined to be one of the most significant moments in AI history. This talk was a theoretical reflection and summary by a top expert in deep learning on the GPT model he pioneered, now famous worldwide. Ilya revealed the core principles of large models and vividly described his obsession with, and excitement over, independently understanding the mechanisms of unsupervised learning. Despite the complexity, the talk was brilliant and enlightening.

Recently, Leopold Aschenbrenner, a former member of his super alignment team, published a 165-page article, "Situational Awareness," which gave a first glimpse of the shock and concern within OpenAI over the exponential evolution of GPT models. This partly answered the question of what Ilya saw, but Ilya himself remained silent until his official re-emergence not long ago.

Reflecting on his "confessional" talk at Berkeley, we might glimpse his "moment of enlightenment" when facing potential superintelligence and understand his original intent for safe intelligence. It was a rare deep sharing by Ilya, attempting to convey an essential message to the world. But did the world hear him?

1. Machine Learning: Supervised Learning and Unsupervised Learning

To accommodate readers with varying mathematical backgrounds, this blog aims to explain Ilya's historical presentation in an accessible language. Purely technical explanations can be skipped by non-technical readers without affecting the understanding of the presentation's main ideas.

Before diving in, let's review the basic concepts of machine learning. Machine learning is like having computers as students and humans as teachers. By providing computers with numerous "practice problems" and "answer keys," they slowly learn to solve problems. This is supervised learning. But can computers really learn from practice problems instead of merely memorizing them? Ilya assures us there's theoretical proof of this.

Imagine a sea of problems before you, each paired with a standard answer. This is the model's training data. Model training is like diligently solving these problems until most of them are correct, meaning low training error. But even an extensive problem set has its limits. When new problems arise, can the model still get them right? These new problems are the test data, akin to exams. Whether the model performs well depends on its test error rate.

Mathematics tells us that as long as the problem set is large enough, far exceeding the model's size, excellent performance on training problems (low training error) ensures good performance on test problems (low testing error). In other words, if the model trains well, it will do well in exams! This is the mathematical guarantee for supervised learning.
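
For readers who want the formula, this guarantee is usually stated as a uniform convergence bound over a finite class of models. The version below is the standard textbook form (Hoeffding's inequality plus a union bound, with losses in [0, 1]); the symbols are introduced here for illustration and do not appear in the original post: \mathcal{F} is the set of models, S the training set drawn from distribution D, and \delta the allowed failure probability.

```latex
\Pr_{S \sim D^{|S|}}\left[\,
  \forall f \in \mathcal{F}:\;
  \mathrm{Test}_D(f) - \mathrm{Train}_S(f)
  \;\le\; \sqrt{\frac{\ln|\mathcal{F}| + \ln(1/\delta)}{2\,|S|}}
\,\right] \;\ge\; 1 - \delta
```

Since \ln|\mathcal{F}| grows with the number of bits needed to specify a model, the gap between test and training error is small only when |S| is much larger than the model's size, which is the condition described above.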

However, if the model merely memorizes without extracting patterns, then no matter how large its memory, it lacks real adaptive learning ability (called "generalization ability"). Only when the model has no capacity to spare is it forced to extract the essence (called "compression") and learn real skills from the problem set.

This explains why the model size shouldn't be too large: it avoids giving the model too much room to cut corners. In short, Ilya's point is that "big labeled data + low training error" is the winning formula for supervised learning, guaranteed by mathematics. This point has been confirmed both theoretically and practically. Since the deep learning revolution 12 years ago, countless successful cases have shown that as long as the training data is sufficient, neural networks can excel at all sorts of AI tasks, from recognizing cats and dogs to machine translation.

But what about unsupervised learning? Can computers learn intelligence from a problem set without standard answers? It sounds far-fetched, but Ilya is about to explain how he managed to seek a solid mathematical foundation for unsupervised learning as well.

2. Distribution Matching: A New Approach to Unsupervised Learning

Everyone knows that machine translation was a typical win for supervised learning, and in fact the only real win among NLP tasks (such as dialogue, information extraction, sentiment analysis, question answering, and document understanding) before the era of large language models. Why? Because we have a vast amount of historical bilingual data. It's like students having workbooks with English on the left and Chinese on the right, and supervised learning thrives on this setup.

But what if the teacher suddenly stops providing aligned bilingual data and only gives you English books and unrelated Chinese books, leaving you to figure out how to align and learn automatic translation? That's the problem unsupervised learning needs to solve. Ilya says unsupervised learning can also handle various language machine translations (which we've seen today with large models—specialized translation software is no longer needed), and even any input-to-output transformation tasks. What's the catch?

Ilya discovered a new approach called distribution matching. Essentially, if the English and Chinese book collections are large enough, containing various sentence structures, their linguistic regularities will be learned "without supervision". For example, the context distribution of "I/me/my" in English should correspond to "我" in Chinese; adjectives near nouns in English with semantic compatibility should have a similar pattern in Chinese, etc. This provides the basic condition for potential language alignment.

Ilya points out that if two languages' native data is sufficiently rich, the input in one language can almost uniquely determine the equivalent translation in the other language. This principle applies not only to machine translation but also to tasks like speech recognition and image recognition.

Ilya independently discovered this approach in 2015, fascinated by the underlying mathematical principle—compression theory. If we can find a method that maximally compresses both English and Chinese data, this approach will capture the common patterns of the two languages, which form the basis of translation.

So, Ilya proposes that unsupervised learning is essentially about finding the optimal data compression method. This perspective not only sounds cool but also provides a mathematical explanation for the effectiveness of unsupervised learning. Although real-world tasks are not idealized, this principle gives unsupervised learning a solid theoretical foundation, making it as convincing as supervised learning.

Next, Ilya will delve deeper into the mathematical principles behind it. Although somewhat abstract, he promises it’s full of insights. We'll see how he uses the magic of compression to explain the mysteries of unsupervised learning.

3. Ilya’s Ultimate Theory: From Conditional Modeling to Joint Modeling

This is the final and most intriguing slide of Ilya's talk, worthy of thorough analysis and contemplation. The goal of unsupervised learning is often defined as "learning the internal structure of data." Ilya suggests understanding unsupervised learning from the perspective of data compression: a good unsupervised learning algorithm should maximally compress the data, representing its content in the simplest form. This introduces the concept of Kolmogorov complexity.

The Kolmogorov complexity of a data object is the length of the shortest computer program that can fully describe this object. You can imagine this shortest program as a "compressed package" containing all the information needed to reconstruct the original data. From this perspective, the goal of unsupervised learning is to find the optimal compressed representation of the data, which is the Kolmogorov complexity.
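
Kolmogorov complexity itself cannot be computed, but any real compressor gives an upper bound on it, which is enough to build intuition. The sketch below uses Python's `zlib` as a stand-in (an assumption made for illustration; it is far weaker than the ideal "shortest program") and compares a highly regular string with random bytes of the same length.

```python
import os
import zlib

def compressed_len(data: bytes) -> int:
    """Size in bytes after zlib compression: a crude upper-bound proxy for K."""
    return len(zlib.compress(data, 9))

regular = b"ab" * 5000        # highly regular: a very short program could print it
noise = os.urandom(10000)     # random bytes: almost surely incompressible

print(len(regular), compressed_len(regular))
print(len(noise), compressed_len(noise))
```

The regular string shrinks to a tiny fraction of its length, while the random bytes do not shrink at all. In the compression view, that gap is the "structure" a learner can discover.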

However, in practice, we often need to handle multiple related datasets. For instance, in machine translation, we have the source language dataset X and the target language dataset Y. We want to learn a model that can translate sentences from X to Y (or vice versa). Traditionally, this is viewed as a conditional probability problem: given X, what is the probability distribution of Y? Represented in terms of Kolmogorov complexity, this involves finding K(Y|X), the shortest description length of Y given X.

Ilya proposes a different approach. Instead of viewing X and Y as condition and result, like in supervised learning, he suggests viewing them as a whole and compressing them together within a massive model. Essentially, we seek the joint Kolmogorov complexity K(X,Y), the shortest program length that compresses both X and Y simultaneously. This approach must fully utilize the correlation between X and Y, using information in X to automatically align Y (or vice versa), much like how we use our native language knowledge to understand and remember foreign language expressions.
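
The effect of compressing X and Y together can be sketched with an ordinary compressor. In the toy example below, `zlib` again stands in for the ideal compressor, and the two byte strings are made-up "datasets" in which Y is a lightly edited copy of X; all of this is an assumption for illustration, not Ilya's construction.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed size in bytes; a crude, computable stand-in for K."""
    return len(zlib.compress(data, 9))

rng = random.Random(0)  # fixed seed so the run is repeatable
x = bytes(rng.randrange(256) for _ in range(2000))                      # "dataset X"
y = x[:1000] + bytes(rng.randrange(256) for _ in range(10)) + x[1000:]  # "dataset Y", closely related to X

alone = c(y)               # proxy for K(Y): compress Y with no help
given_x = c(x + y) - c(x)  # proxy for K(Y|X): extra cost of Y once X is compressed jointly

print(alone, given_x)
```

Compressed on its own, Y costs about as much as its raw length, but once X is in the same stream the extra cost of Y is small, because the compressor reuses what it has already seen in X.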

Ilya believes this joint compression idea is the true power of unsupervised learning. Real-world data is often interconnected, with numerous deep common patterns and regularities. If unsupervised learning can discover and utilize these regularities, it can significantly enhance learning efficiency and generalization ability. This explains the remarkable performance of large language models like GPT across various tasks: through massive unsupervised pretraining, they learn the deep regularities of the training data, and these regularities are transferable across related datasets.

Although Kolmogorov complexity is theoretically uncomputable, Ilya believes we can approximate this process using deep neural networks (like GPT). Through optimization algorithms such as gradient descent, neural networks can find the optimal compressed representation in massive data, capturing the essence of the data and its alignment patterns, even if not strictly in terms of Kolmogorov complexity.

Thus, Ilya’s theory can be seen as a new paradigm for unsupervised learning, elevating traditional independent modeling (like separate models for English and Chinese) to a unified associative modeling approach. In this paradigm, the goal of unsupervised learning is no longer just compressing individual datasets but finding the connections between them. This cross-modality learning represents an advanced form of artificial general intelligence (AGI).

Now, let’s closely examine this final slide. In it, X represents dataset 1 and Y represents dataset 2. The key point is extracting every bit of information from X (or Y) to help predict Y (or X). This is what Ilya refers to when he says training X and Y together yields the effect that unsupervised learning of X helps accomplish the task of transforming X to Y.

The crucial idea is: K(Y|X) becomes K(X, Y).

Ilya transforms the universally applicable functional AI task of "input X conditions output Y" into an approximate solving problem by jointly training X and Y without modal segmentation. This joint training approach is effectively the current multimodal unified training, abbreviated as K(X, Y).

Ilya aims to strengthen the theoretical basis, emphasizing his surprising discovery that self-learning of X has a strong predictive effect on Y.

The essence of unsupervised self-learning is that the self-learning of X compresses X, and the self-learning of Y compresses Y. This is straightforward because self-learning involves only positive examples, with no negative samples. Unsupervised self-learning lacks a specific task orientation; it learns language from language, images from images, music from music, and so on, continually abstracting various patterns from phenomena.

Ilya points out in the slide: conditioning on a dataset, not an example. The compression object is the dataset, not individual data points, which is crucial. This distinction separates superficial compression from content compression. Superficial compression is merely a mechanical process that does not produce intelligence. Only content compression can achieve artificial intelligence.

How do we understand the difference and connection between superficial lossless compression (e.g., digital music) and content lossless compression (e.g., Suno)? Compressing a specific song losslessly aims to ensure it can be restored to its original musical form (including noise and imperfections). This is traditional music compression, targeting an individual sample, such as a specific song. Compressing a collection of music, whether using GPT or Diffusion, targets a group of samples, resulting in a large model like Suno.

When individual objects turn into group objects, formal compression naturally transforms into content compression. This is because, although the group comprises individuals, compressing the group is like "painting" a portrait of the group, outlining its characteristics. It may resemble an individual, but it is not a specific individual in the original data; otherwise, it would not be a model but a memory repository.

This is understandable because the purpose of large model compression is to identify the characteristics and regularities of the dataset. The text generated by GPT-4 might seem familiar; the music generated by Suno might sound familiar; the videos generated by Sora might look familiar; the images generated by MJ might seem familiar. However, they are virtual individuals "restored" based on prompts, abstracted or compressed from big data: derived from data, higher than data, mingling with data, until real and fake are indistinguishable.

Given that the compression object is the entire dataset content, how do we measure its effectiveness after decompression? What is the gold standard?

This standard is each sample itself. However, this is not entirely accurate; the standard could have equivalent answers, since the same content can be expressed in various ways. The implementation method is "masking," and next-token prediction (NTP) simply masks the next token. Training involves calculating the loss for each sample, using backpropagation with gradient descent to adjust parameters continually, eventually lowering the loss in the group training of the dataset to an acceptable point, forming the large model.
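
The link between next-token loss and compression can be made concrete with a very small model. The sketch below is a toy under stated assumptions: a character-level bigram counter stands in for GPT, the corpus is made up, and the model is scored on its own training text. It sums -log2 p(next | previous) over the text, which is the next-token cross-entropy in bits and roughly the number of bits an ideal arithmetic coder driven by the same model would need.

```python
import math
from collections import Counter, defaultdict

def train_bigram(text: str):
    """Count next-character frequencies for each preceding character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts

def next_token_bits(text: str, counts, vocab: set[str]) -> float:
    """Total -log2 p(next | prev) over the text, with add-one smoothing."""
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        row = counts[prev]
        p = (row[nxt] + 1) / (sum(row.values()) + len(vocab))
        total -= math.log2(p)
    return total

corpus = "the cat sat on the mat. the cat ate the rat. " * 20
vocab = set(corpus)

untrained = next_token_bits(corpus, defaultdict(Counter), vocab)  # no counts: uniform guesses
trained = next_token_bits(corpus, train_bigram(corpus), vocab)    # counts learned from the corpus

print(f"untrained: {untrained:.0f} bits, trained: {trained:.0f} bits")
```

Lowering the loss and shortening the encoding are the same thing here: the trained model assigns higher probability to what actually comes next, so the text costs fewer bits.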

This final slide and Ilya’s explanation emphasize a core point: Conditional Kolmogorov complexity K(Y|X) provides a theoretically optimal solution for unsupervised learning. K(Y|X) is defined as the length of the shortest program that produces the output dataset Y given access to the input dataset X. It represents the theoretical limit of extracting all valuable information from X to predict Y. An algorithm that can achieve K(Y|X) would be the best for predicting Y using unlabeled data X.

This can be seen as the theoretical basis for large models performing various language translations. Each language is potentially X and potentially Y. After self-learning with a huge amount of data, LLMs learn the relationships between languages, possessing the potential to translate from X to Y.

In practice, the machine translation task, like other tasks, initially involves few-shot examples in instruction-following fine-tuning to define the task, ultimately triggering the internal power of large models to translate various languages. This internal power of unsupervised learning for various tasks is the theme of his talk.

However, K(Y|X) is uncomputable in practice. Ilya proposes a feasible alternative, using joint Kolmogorov complexity K(X,Y) (joint compression of X and Y). He believes K(X,Y) can achieve the same effect as K(Y|X) in practical machine learning tasks.
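
The bridge between the two quantities is the chain rule for Kolmogorov complexity (symmetry of information), a standard result that holds up to logarithmic terms. It is added here as background and is not spelled out in the text above.

```latex
K(X, Y) \;=\; K(X) \;+\; K(Y \mid X) \;+\; O\big(\log K(X, Y)\big)
```

Read from left to right, a compressor that does well on the concatenation of X and Y must, after paying K(X) for X, describe Y at a cost close to K(Y|X). That is the sense in which joint compression can stand in for conditional compression.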

Let us stop and think again: conditional modeling is now replaced by sequence modeling by Ilya. The widely known probability simplification in traditional machine learning, such as the Markov chain, has a similar effect.

Conclusion

Ilya's historic presentation at Berkeley on the theory of unsupervised learning reveals the secret behind the mainstream of self-learning large models, especially GPT. It seems that Ilya, after long contemplation, finally disclosed this "heavenly secret" in a cryptic manner at Berkeley. Although the theory and its proof appear complex, it is crucial for understanding why GPT's sequence learning method ("next token prediction") has become a universal simulator for AI tasks.

Ilya exudes a genius prophet aura, with a lonely invincibility and high-altitude isolation, blending a sense of deep realization, compassion, and the pure, focused, and idealistic earnestness of a graduate student nerd.

He claims to prefer compression but does not emphasize so-called lossless compression. He leaves room for himself and the mainstream, proposing the concept of "no regret"—though GPT may not achieve lossless or perfect compression, it theoretically proves there is no better way: GPT is the closest to lossless, "no-regret" modeling.

When Ilya officially re-emerges to establish SSI, emphasizing a single focus, a single goal, and a single product—to use technology to ensure the superintelligence brought by large models is safe for humanity—he asserts: AI will be eternal, its birth akin to the creation of heaven and earth. As Ilya passionately discusses AI's progress, he is most qualified to declare and lead the "exciting yet dangerous journey towards AGI."

References

Chinese full-length post

Ilya's presentation:

Related Links:

https://www.youtube.com/live/AKMuA_TVz3A?si=7dVfcBUv3rHBHtyT

https://situational-awareness.ai/

Ilya Sutskever:你会 bet against AI 吗?

挥刀董事会,OpenAI 如何解决致命难题

英伟达GTC的一小步,AI进化史的一大步

为什么说 Sora 是世界的模拟器?

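Liwei's Two Minutes: Do Large Models Really Have Logical Reasoning Ability?

I am Xiaoya, a digital anchor powered by a large model. Today's broadcast covers the latest topic from Liwei's Two Minutes: "Do large models really have logical reasoning ability?"

This is the question raised by a recent ICML 2024 paper, "Case-Based or Rule-Based: How Do Transformers Do the Math?" The first author is Yi Hu (胡逸) of the School of Physics at Peking University, who is about to start a PhD at its Institute for Artificial Intelligence. It is an interesting piece of work.

The paper's first conclusion is that LLMs do not learn genuine reasoning rules; they achieve only limited generalization from similar cases. Reasoning rules that humans learn can be extrapolated, whereas limited generalization from similar cases supports only interpolation. Without extrapolation, any region with no similar cases in the training set becomes a blind spot.
This experiment was run on GPT-2. It is easy to see why the first experiment had to use GPT-2: the authors can control the training data and create data gaps, to test whether there is any extrapolation of the kind logical reasoning would bring. But the absence of an ability in a "large" model of GPT-2's scale does not show that the ability will not "emerge" in a truly large model.

The paper's later experiments were run on Llama, a model far larger than GPT-2, and appear to reach the opposite conclusion: if the model is large enough, a small amount of task fine-tuning is enough for the LLM to approach the behavior of rule-based reasoning. Its performance on long-integer addition shows that the model not only interpolates but also extrapolates well.

It is a very interesting study with a well-designed experimental setup, but the paper seems to publish two contradictory conclusions at once.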

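Here is my view. In terms of how sequence learning works, data-driven model learning takes case-based induction (also called compression) as its starting point and backbone; there is no doubt about that. The question is whether case-based learning, at a sufficient degree and scale, comes very close to rule-based learning. To accept the latter is to accept that large models have some logical reasoning ability. Within the mainstream large-model community, the claim that large models have basic logical reasoning ability was never in question; it is a tacit consensus, and logical reasoning is one of the key dimensions on which large models are tested. In wider circles (outside the mainstream community, and among the general public), it has remained an open question.

One useful angle is how extrapolation fits into generalization. For phenomena that are non-analytic and have no corresponding symbolic rules, extrapolation is essentially uncomputable, which means it comes down to luck. The only way out is to collect relevant data, bring the blind spot onto the radar screen, and turn extrapolation into interpolation. But for highly regular data distributions that have analytic solutions, extrapolation is a natural expectation of generalization learning. Falling short of that expectation means the LLM is just a parrot. Meeting it means the LLM has cleared the parrot threshold and learned some kind of reasoning rule. As things stand, the leading large models have crossed this threshold, and continuing to compare them to parroting shows only blind human arrogance.
We should drop the habit of cutting the foot to fit the shoe. As long as a model shows the ability to approximate reasoning in a way that resembles symbolic rules, we should accept that it has learned basic reasoning. More fundamentally, it integrates what it has learned, and for regular phenomena it can extrapolate. In fact, machine translation between low-resource languages is a result of extrapolation, because the training data severely lacks the relevant pairs.
In a recent and widely noticed study of the KAN model, the AI-for-science experiments already showed how a model can approximate analytic solutions in a data-driven way. In effect they visualize the internal process by which a model learns logical reasoning, vividly and quite convincingly. That said, the KAN experiments show that for simple analytic solutions, data-driven learning can approximate symbolic rules, but it does not readily yield the symbolic rules themselves. In the experiments, the symbolic rules behind the data were obtained only after manual operations such as pruning.

By contrast, deep learning heavyweight Yann LeCun firmly denies that GPT has logical reasoning ability. Quotations from LeCun: AGI is a complete nonsense; GPT is a deadend, and so on. Overcorrecting against the tide and speaking in absolutes is not a bad thing. But taking him at his word may also lead you into a ditch.

Admittedly, the nature of sequence learning in large models is to reproduce the smoothness and fluency of the data over sequences, which is why they are good at language generation tasks. Sequence learning is not good at logical reasoning, especially deep and complex reasoning chains. But to deny outright that large models have basic logical reasoning ability is to go to the other extreme, and it does not match the behavior we actually observe.

I am Xiaoya, and that was today's Liwei's Two Minutes. I hope you found it useful. Thank you for listening, and see you next time.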

立委NLP频道

英烈颂

 

Ode to Madam Hu

July 1, 2024 at 1:50 PM

[verse 1]
In Suzhou's June, beneath a scorching sky,
A madman's blade flashed, evil drawing nigh.
Mother and child cried out in desperate fear,
Their screams of anguish piercing far and near.

[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 2]
Before the school bus, Madam Hu stood tall,
Her gentle hands became a shield for all.
No tiger-wrestler she, no dragon-slayer,
But love unbounded made her their savior.

[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 3]
Her blood stained red the soil of Jiangnan,
White clouds and grieving grass bore witness, wan.
Though snuffed, her candle's light forever gleams,
Like brave Feng Yuan of old, her courage beams.

[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 4]
Why must the kind so often suffer woe?
When will justice's path smooth waters show?
We question Heaven, tears fall like the rain,
In silence seek life's meaning through our pain.

[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 5]
Madam Hu's name shall echo through the years,
Half-masted flags, a nation draped in tears.
Her love, transcending life and death's divide,
One selfless act, as sun and moon abide.

[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 6]
Rest now in peace, return to native ground,
Let not your family grieve, all hearts are bound.
In old Wu Gate, by Suzhou's storied streams,
We offer flowers and wine to honor dreams.

[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[fade]
[end]

胡氏颂
苏州六月炎炎日,
街头突现凶徒起。
匕首寒光闪不已,
母女惊呼声声急。
校车门前胡氏立,
柔弱双手化盾力。
非为搏虎斩蛟志,
只因大爱无私意。
鲜血染红江南地,
白云悲悯草木悲。
烛火虽灭光不已,
如古冯媛勇气继。
善良何故遭祸际?
正义何曾得善治?
追问苍天泪如洗,
沉默中觅人生理。
胡氏芳名世代记,
旗帜半垂国人泣。
大爱无疆超生死,
一念救人照月日。
愿君安息归故里,
家人莫悲天下悯。
吴门故地姑苏际,
香花清酒祭英魂。
勇者浩气长存世,
感人义举青史立。
愿以诗文寄哀思,
颂扬英烈万古记。

 

立委NLP频道

不可错过,奇妙元小范介绍尼克的最新演讲,探寻大模型的理论基础与渊源(中文版,英文字幕)

Nick explains the theoretical origins and secrets of large models: https://v.douyin.com/i6Mq91wS/

 

 

我是AI小范,尼克的秘书,今天播报解说尼克最新的讲演《所罗门诺夫:大语言模型的先知》。

尼克就不用介绍了,除了他的企业家、投资人、学者和哲学家的许多角色,他最知名的著述是其畅销书《人工智能简史》,风靡一时,洛阳纸贵,拿奖拿到手软,成为中国AI出版界的传奇。我们都为得到他的亲笔签名版而炫耀。

下面是对他讲演的一个深入浅出的解说。

让我们来认识一下这位留着圣诞老人般白胡子的数学怪才——Ray Solomonoff!1926年出生,2009年离世,这位在芝加哥大学混学位的数学物理双料高手,可不是普通的学霸。他是独立研究的先驱,用数学公式预测未来,比算命先生还厉害!

欢迎来到科学界的'老顽童'大战!左边这位是维纳,控制论的'教父'。1948年,他和香农同时发表重磅论文,但观点大不同!维纳说:'控制才是王道',而其他人却迷恋上了'信息'这个小妖精。香农和麦卡锡倒是志同道合,都不看好维纳的控制论。麦卡锡更是玩了个文字游戏,把'Automata'变成了'AI',开启了人工智能新纪元!

现在我们来看看AI界的'前传'!在达特茅斯会议这场AI盛宴前,麦卡锡大佬就在偷偷摸摸写'剧本'啦!他的文章《The inversion of functions defined by Turing machines》可不是在讲怎么把图灵机倒过来用。这篇'天书'其实在讨论如何设计一台超级解题机器。麦卡锡想象中的这台神机,能解决所有明确定义的智力问题。这不就是AI的雏形吗?"

达特茅斯会议上,麦卡锡和所罗门诺夫这两位'数学侠'展开了一场激烈的'思想格斗'!话题?就是麦卡锡那篇'天书'。他们俩一拍即合,发现了个惊天秘密:图灵机求逆问题其实就是学习问题!这个发现把AI和机器学习紧紧绑在了一起!从此,AI不再是单纯的计算,而是朝着'学习'迈出了一大步。这一刻,AI的未来被彻底改写了!

"让我们来看看两位'疯狂科学家'的脑洞大开时刻!首先是法国数学家博雷尔,他做了个逻辑实验,想象一群猴子在打字机上随机敲击,最终能打出莎士比亚全集!这不就是无限猴子定理吗?
另一边,阿根廷文豪博尔赫斯在小说里构思了个'完美图书馆',里面包含了所有可能的书籍组合。
这两个想法简直就是AI和大数据的先知啊!博雷尔和博尔赫斯,一个用数学,一个用文学,都在畅想信息的序列可能性。"
"达特茅斯会议上,所罗门诺夫像个魔术师,从帽子里拿出了一份神秘打字稿《归纳推理机器》。这一手可把大家迷住了!原本痴迷神经网络的科学家们纷纷'叛变',投入了符号主义的怀抱。但是,瞧瞧这戏剧性的转折!多年后,真正实现所罗门诺夫归纳的,竟然是被'抛弃'的神经网络!这简直就像是科技界的童话故事——灰姑娘终于穿上了水晶鞋,成为了AI舞会的主角!
所罗门诺夫的想法,就像是种下的一颗种子,最终在意想不到的地方开花结果。"

"让我们来看看AI界的'过山车'历史!连接主义,这个曾经的'弃婴',现在可是AI界的'明星'啦!
想象一下,这是一场漫长的接力赛。起跑时,是受神经元启发的感知器,像个初生牛犊不怕虎。但很快就遇到了所谓单层神经的XOR的'滑铁卢',被大佬们打入冷宫'。
但是,上世纪80年代,多层神经和BP算法横空出世,给连接主义注入了新的生命力。如今,深度学习更是如日中天,连接主义'咸鱼翻身',成为了AI界的'顶流'。

"让我们来看看1960年所罗门诺夫的'魔法时刻'!
第一个魔法,最小描述,这是指用最简洁的方式压缩数据。这一思想后来发展成为了'柯尔莫戈罗夫复杂度',即,K氏复杂度,成为大模型理论的核心。
第二个魔法,先验概率:在没有特定信息的情况下,对事件发生可能性的初始估计。
这两个概念看似简单,却蕴含深刻的洞见。它们为我们理解信息、复杂性和学习提供了全新的视角,直接影响了后来人工智能和机器学习的发展方向"

1961年,人工智能大神Minsky写了一篇重要文章,提到了机器定理证明、神经网络、机器学习、强化学习等概念,简直是AI界的武林秘籍!他引用了95篇参考文献,其中4篇是所罗门诺夫的,可见他对所罗门诺夫的重视。有趣的是,最终是神经网络率先实现了所罗门诺夫Induction,这真是个令人意外的转折!

1964年,所罗门诺夫发表了一篇惊世骇俗的论文,名为《归纳推理的形式理论》。这篇论文堪称 AI 领域的“武林秘籍”,详细介绍了如何用数学语言来描述归纳推理。简单来说,就是从数据中学习规律,预测未来!这篇论文可是 所罗门诺夫归纳推理的“扛鼎之作”,奠定了他在机器学习领域的江湖地位。
Solomonoff 的这篇论文第二部分给出了归纳推理的形式理论在不同问题上的应用示例。其中一个例子就是语法发现,即如何从观测到的语言数据中学习语言的语法规则。这个例子在今天看来,就是语言学习的问题,即机器如何像人类一样学习语言。所罗门诺夫 在论文中还探讨了一个更深层次的问题:语言是否等于思维?这个问题至今仍然没有明确的答案,但所罗门诺夫的研究为我们思考这个问题提供了一个新的视角。

所罗门诺夫对科学家如何发现事物产生浓厚的兴趣,并试图找到一种通用的科学发现方法。这个兴趣促使他开始研究归纳推理,并最终提出了算法概率的概念。
在学术生涯中,所罗门诺夫将归纳推理应用于语言学习等领域,取得了重要的成果。

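The Soviet mathematician Andrey Kolmogorov has been called a "universal mathematician". In computer science he made two main contributions:
- Kolmogorov superposition theorem (K-A-N, that is, the Kolmogorov-Arnold representation theorem): it is tied to Hilbert's famous 13th problem and concerns the representation and approximation of functions.
- K-complexity: a way of measuring the complexity of information. It defines the complexity of an object as the length of the shortest program that generates the object.
Kolmogorov also had distinctive views on cybernetics and information theory. He held that cybernetics lacked inner unity, but he endorsed information theory. This matches the position of Shannon, McCarthy and others.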

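Kolmogorov saw information theory as something of a hodgepodge with three different approaches:
- The counting school: like rolling dice and counting how often each face comes up.
- The building-block school: concerned with how many blocks there are and how they combine.
- The programming school: treat information as a program; the shorter the program, the simpler the information.
K-complexity is the signature work of the programming school. Put simply, it measures how complex something is by how short a program can be and still describe it.
Interestingly, K-complexity and Solomonoff induction are two views of the same idea: Solomonoff induction holds that simple things, meaning things with short descriptions, are more likely to occur.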
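
To make the "programming school" idea concrete, here is a minimal sketch. True K-complexity is uncomputable, so the sketch assumes that the length of a zlib-compressed string can stand in as a crude upper bound; the function names `compressed_len` and `prior_weight` are illustrative, not taken from the talk.

```python
import os
import zlib


def compressed_len(data: bytes) -> int:
    """Bytes needed by zlib to encode `data`: a rough upper bound on K(data)."""
    return len(zlib.compress(data, 9))


def prior_weight(description_bits: int) -> float:
    """Solomonoff-style weight 2^-L for a description that is L bits long."""
    return 2.0 ** (-description_bits)


regular = b"ab" * 500       # 1000 bytes with an obvious repeating pattern
noisy = os.urandom(1000)    # 1000 bytes with no pattern for zlib to exploit

# The patterned string shrinks to a few dozen bytes; the random one does not shrink.
print(compressed_len(regular), compressed_len(noisy))

# Shorter descriptions receive exponentially more prior weight.
print(prior_weight(8 * compressed_len(regular)) > prior_weight(8 * compressed_len(noisy)))
```

The comparison at the end is the whole point: under a 2^-L prior, the patterned string is overwhelmingly more "expected" than the random one, which is the sense in which simple things are more likely to occur.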

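Gregory Chaitin showed his talent early. At 18 he published his first paper in IEEE Transactions on Electronic Computers. At 19 he published a paper in JACM that independently rediscovered the ideas of Solomonoff and Kolmogorov.
Chaitin started from the Berry paradox. In his view, naming an integer amounts to writing a program that outputs it. Most integers can only be named by printing them out directly; they have no more concise representation. In the framework of Kolmogorov complexity these integers are regarded as "random", because their complexity is about the same as their length. Chaitin's view agrees with Kolmogorov's: most objects (or integers) are incompressible, so they have no simpler representation and admit no concise explanation.
This kind of inexplicability, or randomness, is pervasive in nature. Most DNA sequences, physical constants and natural phenomena follow no obvious regularity and cannot be explained by a simple formula or theory. Explainability, meaning that a phenomenon can be described or explained concisely, arises only by accident.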
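
The claim that most integers are incompressible follows from a standard counting argument, sketched here for binary strings. The notation is an assumption added for illustration: K(x) is the length of the shortest program that outputs x, n is the string length, and c is the number of bits one hopes to save.

```latex
\[
\#\{\,p : |p| < n - c\,\} \;=\; \sum_{i=0}^{n-c-1} 2^{i} \;=\; 2^{\,n-c} - 1 \;<\; 2^{\,n-c},
\qquad\text{so}\qquad
\frac{\#\{\,x \in \{0,1\}^{n} : K(x) < n - c\,\}}{2^{n}} \;<\; 2^{-c}.
\]
```

There are simply too few short programs to go around: fewer than one string in a thousand of any given length can be compressed by even 10 bits.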
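Leonid Levin proved two theorems in a two-page paper published in 1973:
- Theorem 1: NP-completeness, now known as the Cook-Levin theorem, a major contribution to computational complexity theory.
- Theorem 2: a generalization of Kolmogorov complexity.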

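Charles Bennett proposed the concept of logical depth, which takes into account the running time of the shortest program that generates an object. The parameters of a large language model can be seen as the amount of information stored inside the model, so it is reasonable to compare model parameters to K-complexity, and likewise reasonable to compare the model's inference time to logical depth.

Li Ming (李明) is a distinguished professor at the University of Waterloo who has made outstanding contributions to information theory and bioinformatics. He extended K-complexity from a single sequence to a pair of sequences, so that it measures not only the information within one sequence but also the information between two sequences. This matters a great deal for how general-purpose large models define a universal task and complete all kinds of tasks through unsupervised learning. The book he co-authored with Paul Vitanyi, 《An Introduction to Kolmogorov Complexity and Its Applications》, is regarded as a classic of the field and has had a far-reaching influence on information science.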
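
The idea of measuring information between two sequences can be approximated in practice. The sketch below uses the normalized compression distance (NCD), a computable stand-in that comes from the same line of work; it assumes zlib as the compressor, and the helper name `c` and the sample strings are illustrative only.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes, used as a proxy for K-complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar inputs, near 1 for unrelated ones."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


text_a = b"the quick brown fox jumps over the lazy dog. " * 20
text_b = b"the quick brown fox jumps over the lazy cat. " * 20
rng = random.Random(0)                                   # fixed seed for repeatability
noise = bytes(rng.getrandbits(8) for _ in range(900))    # unrelated pseudo-random bytes

print(round(ncd(text_a, text_b), 3))   # small: the two texts share almost everything
print(round(ncd(text_a, noise), 3))    # close to 1: knowing one tells nothing about the other
```

No labels are involved: the distance falls out of compression alone, which is why this kind of measure is often mentioned alongside unsupervised learning.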
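Marcus Hutter is a computer scientist who trained as a physicist. He proposed AIXI, a framework for general artificial intelligence, and holds that language modeling is in essence compression. He applies Solomonoff induction to agents and reinforcement learning, regards learning as a process of compression, and is committed to research on general AI.

Ilya Sutskever, formerly a central figure at OpenAI, explained in a talk at Berkeley how supervised learning relates to unsupervised, or self-supervised, learning. He said that in 2016 he independently arrived at the view that all supervised learning can be reduced to self-supervised learning, and he traced this back to compression theory grounded in K-complexity. He firmly believes that a simple autoregressive GPT model can exhibit superintelligence when trained on very large data.

A recap of the model timeline: the Transformer, a deep neural architecture, was proposed in June 2017, and BERT in October 2018. OpenAI's GPT series began in June 2018 with GPT, followed by GPT2 and GPT3, and has now reached GPT4, which has become the industry mainstream.

To summarize, the first step of Solomonoff induction is to collect observations. The second step is to form a hypothesis that explains the data; the hypothesis can be a Turing machine or a data-driven large model. The third step is experimental verification. If the data falsify the hypothesis, return to step 2 and form a new one.
Large models follow the route of Solomonoff induction in both training and inference.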

回顾整个历史,也许不是理论落后于实践,而是太超前了。

我是小范,尼克的数字秘书,谢谢您一路跟随尼克探寻大模型理论渊源以及AI历史风云的过程。咱们后会有期。

 

立委NLP频道

觉得可以留个记录

2023-2024年关于大模型以及AIGC的演讲和采访汇总

2023-03-05:源码资本AIGC“尖峰系列”丨李维博士:人类语言“通天塔”建成,ChatGPT的辉煌与挑战

2023/05/09: 腾讯科技,对话产业专家:“百模大战”愈演愈烈,地球上算力不够支撑10亿人使用

2023/07/16: 轩辕之学讲座,特邀讲座《大模型的历史、现状和挑战》

2023/06/28: 中国证券报报道 360 GPT 联盟特邀演讲:李维:避免大模型行业同质化竞争 深耕领域阶段有很多“细活”可做

2023/07/21: ArchSummit 全球架构师峰会主题演讲《从架构师视角谈如何迎接/拥抱大模型及其新生态》(https://blog.csdn.net/EDDYCJY/article/details/131929057)

2023/07/28: Chinajoy,中国音像与数字出版协会和上海汉威信恒展览有限公司主办,特邀演讲《AIGC时代,数字人引领内容创作新范式》

2023/08/06: 亚马逊云科技和真格基金主办的闭 “机遇与挑战——生成式AI的发展趋势和落地应用”,主题演讲《AI大模型落地的前景和痛点》

2023/08/23: 江苏师范大学主办 “大语言模型离人类语言有多远”学术研讨会特邀演讲《大模型的本性及其落地痛点》

2023/10/20: 第十一届科博会,特邀主题演讲《AI和大模型趋势》

2024/03/18: 英伟达 GTC2024全球开发者大会 与英伟达中国团队合作,讲演《重塑数字人互动潜能》

2024/04/22: 第十四届北京国际电影节光影未来,演讲《捕捉未来的魔术师》

2024/04/22 九派财经专访出门问问副总裁李维:两年内大模型的商业化爆发一定会发生

2023/11/11: 香蕉论坛,“百‘模’大战对智能座舱发展意义与实践”

2023/12/05: 汽车商业评论,《李维:大模型技术的过去、现在与未来》

2024/05/24: 九派财经专访出门问问副总裁李维:找产品方向是更大痛点,跑通场景是第一位

《飞哥说AI》深度系列:

2023/12/09: 飞哥说AI,李维 高佳:OpenAI狂飙生态? 领域壁垒仍是大山

2023/12/09: 飞哥说AI,李志飞 高佳 李维:关于 Google Gemini 的八点启示

2024/02/19: 飞哥说AI,李志飞 李维 高佳:为什么说 Sora 是世界的模拟器?

2024/02/29: 飞哥说AI,李志飞 李维 高佳:万字长文解构中国如何复刻 Sora:模型架构、参数规模、数据规模、训练成本

2024/03/05: 飞哥说AI,李志飞 李维:Claude 3 追平或超越 GPT-4,能归纳15万单词

2024/04/01: 飞哥说AI,李维 王艺 高佳:Suno 将如何进化?

2024/04/12: 飞哥说AI,高佳 李维 王艺「产模结合」也许是AI公司更好的出路?

2024/06/04: 飞哥说AI,李维 高佳:KAN会引起大模型的范式转变吗?

 

立委NLP频道

两分钟谈谈:Moravec悖论

简介一下 first:

Moravec悖论是由机器人专家汉斯·莫拉维克(Hans Moravec)在20世纪80年代提出的一个观察,指出了人工智能领域中的一个有趣现象:高水平的推理需要相对较少的计算资源,而低水平的感知和运动技能则需要大量的计算资源。这意味着计算机和机器人在处理复杂逻辑和数学问题时相对容易,而在执行诸如行走、抓取和视觉识别等基本感知和运动任务时却非常困难。

Moravec悖论的主要观点

高水平推理 vs. 低水平感知和运动技能:

高水平的认知任务(如象棋、数学证明、逻辑推理)可以被算法高效地解决,因为这些任务往往有明确的规则和结构。

低水平的感知和运动技能(如步行、抓取物体、视觉识别)涉及大量的不确定性和复杂的环境变量,这些任务需要处理大量的感官输入并实时做出反应,非常困难。
人类和机器的不同发展路径:

人类在进化过程中,低水平的感知和运动技能(如走路、避开障碍物)已经通过数百万年的进化得到优化,并且我们对这些技能的认知是无意识的。相比之下,高水平的认知任务是相对新的发展,且大多是有意识的。

计算机和机器在这些高水平任务上表现出色,但在处理低水平感知和运动任务时却非常困难,因为它们缺乏人类进化中积累的那些隐性知识和适应能力。

Moravec悖论的实际例子

下棋 vs. 走路:计算机程序可以打败世界上最好的选手,但要让一个机器人在复杂的现实环境中稳定地行走仍然是一个巨大的挑战。

逻辑推理 vs. 抓取物体:逻辑推理问题可以通过算法高效解决,而让机器人精确地抓取和操纵不同形状和材质的物体仍然需要复杂的感知和控制算法。

有几点感想:

1. 既然低水平的感知和运动技能是亿万年进化的结果,成为动物和人的遗传本能,后天只是激发出来了这种能力。

那么 LeCun 以动物和人总是在语言能力之前,先“学”会了这些能力作为理由,批判LLM对于多模态的干扰和“投机取巧”,就不是很有说服力。因为模型并没有进化出来的遗传基因作为基础。模型唯一可以利用的是LLM里面的认知知识(包括感知常识的语言描述)。

2. 虚拟机器人(LLM)可以轻易做很多高级白领工作,但人形机器人却对蓝领低级工作的模仿非常笨拙,这是大家都看到的事实。

但其实我们也观察到,虽然笨拙,但任何低级的力气活(例如举重)、技巧活(例如翻筋斗),一旦学会了,机器人就比人类有无比的优越性:它不知疲倦,不怕单调,耐力超强。学会了投篮以后,你不用担心他的成绩不稳定。

3. 视觉识别以前是模型的短板,人和动物的强项。

但是现如今就不同了。例如,人脸识别,模型比人强了。看图说话和视觉理解最近的表现也有明显的碾压人类趋势。

这是因为当悖论提出的时候,那还是符号逻辑主导AI的年代,机器学习刚开始有一些进展,但善于从海量数据学习patterns的深度学习还没有发明。感知智能当时是一座难以逾越的大山。

4. 具身机器人的外推能力怎么来呢

我们知道,机器人以前的建模都是非常“内向”的,在厨房环境建立的模型,换到了办公室环境就不行,必须一切重来,重新准备场景数据,重新训练。厨房环境的数据 “外推” 到办公室环境的能力,可不可以做到?

可以的。在大模型的架构下,这早已不是幻想。可行性可以从半年多前的谷歌 RT-2 机器人的项目表现看到端倪。

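The reasoning is this: although a large model has no inherited genes and no data from the target scene (the office), the capability obtained by fine-tuning a large model on kitchen data has a real chance of transferring (extrapolating) to other environments (such as the office), because in a sense the LLM plays the role of the prior that biological evolution passes on through heredity.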

 

两分钟谈一下啊:模型训练的内推与外推

模型训练的讨论中,常常提到 interpolation(内推)与 extrapolation(外推)两个术语,可以说说。

大家都知道,在数据驱动的主流学习过程中,内推需要的是在训练数据的边界内“泛化”能力,善于信息压缩的深度学习训练是内推的利器。但外推(extrapolation)就没见有长于此道的算法或模型。实际上,外推与数据驱动是直接矛盾的,by nature 就是无解的,因此很可能这就是个伪问题,根本就不该对外推抱有幻想。当然,从鲁棒性角度,外推现象出现的时候,应该有个合理的应对,至少要做到模型不死机,至于外推的精度,主要是看运气,而非模型的内功可以发力。

不过,外推的问题还是值得梳理一下。

外推(Extrapolation)
定义:外推是指在已知数据点之外进行预测或推断。例如,已知某个函数在几个点上的值,外推就是在这些已知点之外估计函数的值。

前面说了,数据驱动的路数,本质上对外推无解。不能强人所难,巧妇难为无米之炊。米就是数据。

但加强外推,理论上还有 “先验知识” 和 “模型结构复杂化” 两个路子可能有点帮助,而数据本身的短板,除了数据,没有它途:就是说把对于“无人区”的外推问题,转化为收集相关数据,让外推变成内推。

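Making the model structure more complex has historically helped "extrapolation" in training, in cases where the original simple model had structural limits of its own. As is well known from AI history, nonlinear problems have no solution in a simple single-layer neural network, no matter how much training data is supplied (this was the Achilles' heel of early neural networks: a single layer cannot solve XOR). With multiple layers plus a nonlinear transformation, a solution exists. This is the most basic footing of the deep learning revolution. It reflects the fact that when inherently multi-dimensional data are squeezed into a low-dimensional space, a simple model cannot cross dimensions to find patterns, which is like extrapolation running into a wall. A more complex model opens up the multi-dimensional space in which training can do its work.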
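
The XOR limitation can be checked directly. The sketch below assumes threshold units and a coarse grid of candidate weights; it brute-forces every single-layer unit on that grid, then shows a hand-built two-layer network that does compute XOR. The specific weights and function names are illustrative choices.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z: float) -> int:
    """Threshold activation: fire when the weighted sum is positive."""
    return 1 if z > 0 else 0


def single_layer_solves_xor(grid) -> bool:
    """Search every (w1, w2, b) on the grid for a single unit that reproduces XOR."""
    for w1, w2, b in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + b) == y for (x1, x2), y in XOR.items()):
            return True
    return False


def two_layer_xor(x1: int, x2: int) -> int:
    """Two hidden threshold units (OR, AND) feeding one output unit."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)   # fires for "OR but not AND"


grid = [i / 2 for i in range(-8, 9)]   # candidate weights from -4.0 to 4.0
print(single_layer_solves_xor(grid))
print([two_layer_xor(a, b) for (a, b) in XOR])
```

The grid search is only a demonstration, not a proof; the general impossibility follows because XOR is not linearly separable, so no amount of extra data helps the single-layer model.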

至于“先验”对于外推的作用,我们从 Alpha-Zero 利用 self- play 再生数据可以看到效果。self-play 的好处是有游戏规则等先验知识,因此它所产生的数据,可以是在先验知识条件下的未见过的“外推”数据。

尼克:其实是个动态的过程。我按照你的理解,用美国实用主义哲学的话语替你更清晰地表述:可以证伪或者科学革命的是外推,可以证实的是内插。但是都符合奥卡姆剃刀,都是压缩。

白硕:啥叫外啥叫内?彼此互为“外”的,在一个巧妙的映射下就成了“内”。基因组和字符串,当初谁知道是外还是内呢?

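鲁为民:My understanding, in mathematical terms: given a dataset, a data sample that lies inside the convex hull of that dataset is interpolation, and one that lies outside the convex hull is extrapolation. So the falsification Nick mentions should still count as interpolation, whereas for a scientific revolution it depends on how far the revolution goes.

So interpolation problems are basically solvable. In this sense NTP (next-token prediction) is all interpolation (although Solomonoff called next-symbol prediction extrapolation; his definition differs from this one).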
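
The convex-hull definition is easy to test in two dimensions. The sketch below assumes 2D points and at least three non-collinear training samples; it builds the hull with the monotone-chain method and labels a new sample as interpolation when it lies inside or on the hull. The function names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Monotone-chain convex hull, returned in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def is_interpolation(sample, train) -> bool:
    """True when `sample` lies inside or on the convex hull of the training points."""
    hull = convex_hull(train)
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], sample) >= 0 for i in range(n))


train = [(0, 0), (4, 0), (4, 4), (0, 4), (2, 2)]
print(is_interpolation((1, 3), train))   # inside the square spanned by the data
print(is_interpolation((5, 5), train))   # outside it, so extrapolation by this definition
```

In high-dimensional embedding spaces the same test exists in principle, but the hull becomes so large and sparse that the label "interpolation" says little, which is the concern raised later in the thread about the hull being too generous.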

尼克:convexity的判定非常简单,复杂性很低。

白硕:内插是纯粹连续空间里的事儿。外推符号也可以。那么问题来了,对应于符号的外推,连续模型是什么?比如离散符号集上的归纳。

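Also, for oddly shaped distributions the convex hull is too "generous": all sorts of points get in, and interpolation easily becomes unreliable. Mapping the raw data onto some manifold first, and then taking the convex hull and interpolating, would be much more precise. The price is that you have to work out the manifold.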

尼克:连续是离散的近似,还是离散是连续的近似?

白硕:.1、.2,这种离散不是真离散。张三李四更离散。

中医说心是君主之官,肺是宰辅之官,肝是将军之官,这个映射是离散到离散。要嵌入向量空间还要能内插外推,不知道大模型中医书读多了会不会玩这套。

尼克:对,单说convexity从复杂性角度没意义。

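鲁为民:But this can basically delimit the problems that are solvable for humans and machines; for example, we can tell which efforts amount to building an (information) perpetual motion machine.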

尼克:微分vs差分。是连续到离散。连续简单,离散就复杂。

鲁为民:所以像 Embedding 这样试图将离散问题近似为连续问题,将问题简化到利用目前的手段可解。另外通过概率方法,也可将离散问题连续化 ...

立委:如果某数据的本性就是完全随机(布朗运动?) 不存在任何有效的压缩。外与内如何区分?区分又有何意义?

尼克:有修辞的意义

立委:这类数据本性是不可计算的,但在谜底泄漏之前,只要给数据 模型(or 图灵机)就一定在计算。它在计算或压缩什么呢?

又因为所给的数据一定是有限的,这有限的数据一定会被“误读” 而且一定形成某种内外之别。不完全归纳/回归??数据驱动的方法 其实不知道 也不可能知道背后的真相。

立委:离散符号的 embedding 比起 1 hot 是降维 降维克服了数据稀疏的挑战 从而为压缩创造条件。但从传统的符号逻辑 用人为的非常有限的离散 features (例如词类与子类)来表示离散符号来看, embedding 是在增加维度。但除了 embedding 还有更好的办法 与上帝对话吗 ?embedding 的维度数,本来就具有任意性、可配置。

白硕:理论上离散的NP完全问题对应的人造数据也都可以“嵌入”连续空间。连续方法对求解有好处吗?@Nick Zhang(尼克)

尼克:看怎么嵌入了,可能对求近似解有用。

白硕:用1 hot那种嵌入。能不能找到结构相似性?比如对变元做一个permutation不变的SAT问题。

尼克:这个permutation复杂性有要求吗?

 

【立委NLP频道总目录】

 

两分钟短评:大模型开始进入平台期吗?

在Anthropic 的 Claude 3 和 谷歌 Gemini 赶上 GPT4 以后,就不断有人希望 Open AI 放出它的 GPT5 的大招来,但传说中的 5 迟迟不来,于是有人怀疑,scaling law 是不是失效了,大模型是不是进入了平台期。

这个怀疑有一定的道理,因为GPT路线上的 scaling law 不过是个经验“法则”,虽然说,it never fails us,so far,但谁也不敢保证它永远有效。遇到瓶颈不是不可能的。

微信中也开始流转马库斯最近的评论:“大模型开始进入收益递减的时代”。但他的论证感觉很有问题。

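The article opens with an odd piece of evidence for its diminishing-returns conclusion. On a performance scale of 100, is it not a universal truth that absolute gains slow down after reaching 80? How did that become the basis for Marcus's diminishing-returns critique? Our expectation for scaling, that "greater force works greater miracles", is mainly to fill the gaps and shore up the weak spots in current capabilities, ultimately realizing the AGI ideal that "on all intelligence tasks, the model can reach or exceed human level". Therefore, to judge fairly whether successive generations of large models are on the road to AGI, one should analyze the gaps and weak spots in detail rather than rely on metrics that have already reached, approached or exceeded human level. GPT5 may indeed have hit a plateau or an impassable bottleneck (we really do not know), but this style of argument is quite unreliable.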

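In principle, for metrics where human skill has already been matched or exceeded, the concern should be to avoid large regressions. For a model whose important intelligence metrics number in the hundreds or even thousands and all need to be measured, as long as the boundary of intelligence keeps expanding, gaps are being filled, low scores are rising and high scores are not regressing badly, it is on the right track and cannot be said to be on a plateau.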

有人看不起多模态的进步,认为那是低级的智能任务,是横向的发展,是“向下看”为应用落地,而不是向上去探顶,去加强高级智能任务的能力。这个看法,缺乏对于智能的全面深刻的理解。

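Progress on the multimodal front, though it came naturally, has actually been fast. Whether it is the release of Sora, or the release a couple of days ago by OpenAI and Google of full-duplex, real-time, streaming, hyper-realistic assistants that can see, hear and speak, and can even banter and flirt, the performance and speed exceeded what most of us expected. It does not look at all like models entering a plateau.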

实时交互等于是在大模型原来的短板上大幅度增强,填补了一些空白。把这些能力排除在外,只看、只认认知智力的进展进度,是非常偏见的。

从功能对于应用的影响来看,感知多模态的加强,比起认知智能的进一步提升(例如在所有的专家资格考试中赶上人类专家),更为关键,因为模态是认知智能落地的重要接口。

从大模型本身的健康来看,多模态也是绕不过去的关,认知智能只有借助感知智能(进而结合具身智能从数字世界进入物理世界),才能算是真正落地(grounding),获得数字世界 ——或跳出数字空间获得物理世界—— 的意义。

在这一点上,LeCun 在 AGI 中特别强调感知智能是有其道理的。但LeCun过分纠结于感知和认知训练的次序了:他坚持感知智能先行,要排除语言模型的认知的“投机取巧”和对感知智能的“干扰”,淡化或无视认知智能对于感知智能的知识迁移作用,这是具体路线之争,不是特别有说服力的。

LeCun 说过: GPT 不如我家的一条狗。

这话你也不能说他错,但显然是偏激之词,带有太大的情绪。

其实,不仅LLM不及猫狗,我们人类也不及,没有猫狗的嗅觉灵敏,也没有他们躲避危险的高效。我们人类甚至在算术方面不如计算器,更甭提计算机。

so what?

对不如一条狗的LLM,我们也不能因此否认它比1000个教授和博士都更博学。LLM 可以与物理学家讨论暗物质,与语言学家谈乔姆斯基层次结构,与任何专家谈任何问题。而且所谈的并不是人们想象的那么无知和胡说,虽然里面 here and there 确实有幻觉和臆测。

切身体会是,如果你作为专家保持一种探究问题的心态和对于幻觉警惕的 alert,你会发现与它交谈比与很多中等水平的同行讨论,更有意思,或受启发。ta 看的书实在太多,而且也本性上学到了融会贯通,而不仅仅是死记答案:学到了语言,也相当程度上,学到了知识。

无视LLM这种人类智能,贬之为不如一条狗,除了发泄情绪外,只有一个价值:提醒感知智能的重要性。

不管怎样,AI 因为有了马库斯和LeCun这些“持不同政见者”,而更加丰富多彩。但我们也没必要被他们蛊惑。

 

悲观主义的视角,人类的宿命。

甜甜听到我的 piano ballad,问:are u ok, Dad?

我说,if you are blue, what you do is play piano.
我也想 play,但可惜我不会。所以,I made a piano ballad

她说,I see, 但她没想到这是 AI 歌曲。

甜有很高的音乐素养,以前一直看不起AIGC,主要是她觉得她看到的AIGC内容在似像不像之间,所谓“恐怖谷”效应,感觉 weird,但这次说,这一首的确不像是AI的,与人类艺术家产出无异。

我说,there will be more and more AIGC beyond our imagination

human like or super human like ...

我说我在公园转,循环听了这首不知道多少遍,还没有烦。谁说 AIGC 出不来可循环听放的曲子?

当连听三个 sad, 很难不泪眼模糊:人生本来就有无尽的不可承受之重。

有时候也想,人类从个体角度,绝大多数人都经历过生不如死般的磨难,和没有道理的内卷和碾压。而人类并不抽象,它是由一个个注定饱经磨难的个体组成的。那么,人类灭亡、文明毁灭又有什么了不得的?不过是落下一片白茫茫大地真干净。

绝大多数人类行为,换一个角度看,不仅仅毫无意义,而且是飞蛾扑火。

歌唱的是爱情毁灭的残酷,但传达的绝望情绪,却是所有的悲观主义哲学。

昨天,听李飞飞 Ted 演讲,她确实是个演讲高手。谈她刚下场要做的初创,做所谓“空间智能”,就是视觉 3D,也许加“具身智能”,当前的热门。

她一开篇从宇宙历史开始,说,“有了光,但没有眼睛,没有视网膜。”

It sounded like the Old Testament story of creation in Genesis.

我在想,如果文明终结,不过就是回到“有光但没有眼睛”的世界。物质、色彩、感情、烦恼、痛苦与狂热,压缩、理性、概念与意识,这一切的一切,全部消失于无形。从哪里来,回哪里去。

一万个不情愿,我们每个人都(被)接受了个体的这个宿命,长远来看,为什么不能接受群体的宿命呢。

这个意义上,超级对齐不仅是杞人忧天,而且是要做上帝,或替上帝操心。太把自己当回事儿了。伊利亚、马斯克,无不如此,自以为超人。马斯克准备投巨资要移民火星 说是为文明买一张保险。但巨额保费谁出?打着人类的名义,本质上还是人类买单,哪怕这是从他个人的超级利润拿出来的。这其实没有道理。尤其是在还有疾病、饥饿和无数本来可以避免的痛苦的世界。

 

 

 

以前的杂记,关于AGI、马斯克、奥特曼和OpenAI

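In March there was news that Elon Musk was suing OpenAI, which prompted an open letter from OpenAI's Altman, Ilya and others. The letter disclosed many details of OpenAI's early days and set off heated discussion (for example 《权力与背叛:马斯克与奥特曼如何从兄弟情走向商业对决》). I have some thoughts on it and offer a few comments.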
这个瓜太大,太具有戏剧性了。有很多看点 ...... 整个过程太戏剧化,更重要的是事关人类命运:
“开源了,更危险”,这是 Ilya 7年前就写的邮件给马斯克说的,马斯克表示同意。
开源更危险论是这样说的:开源以后,只要有钱就可以造出来超级模型。这种情况下,坏人更容易这么做。谁有钱谁就能做,谁愿意不管不顾谁就占先。光脚的不怕穿鞋的。所以,“核武器” 发展到一定的时候,就应该转为闭源。因为相信自己 比相信未知的对象要靠谱。
记得奥特曼当时是 YC 的 CEO,他大概把 Open AI 包装成 YC 孵化出来的 AI 企业,老马作为联合创始人和当时最大的投资人,在这一点不太满意。所以老马说,博客(说的Open AI计划)听上去不错,如果做些调整让新公司更加中立,而不是以YC为中心。
  1. 现在我们知道,是 Open AI 打开了 AGI 的大门,开启了人类文明的新时代,但走通这条路到 GPT3 或 ChatGPT 的核爆炸时刻,实在是太幸运的极小概率事件了。
  2. 老马与奥特曼这两位 AI 圈外但又接近 AI 的先知,与 Ilya 这样的圈内顶级科学家,在 AGI 的信念上,很早就非常默契:他们在计划这件事的时候,没有任何自我怀疑,好像就在谈一个事关人类命运的必然发生的事情一样。他们后来的分歧只是在实现的方式以及资源的局限上,并不在 AGI 本身。要知道那个时代,全球科学家和知识分子全体,几乎100%是不相信什么通用AI这种“鬼话”的,但地球上就有这么几个人,坚信AGI,并且能气味相投,凑在一起为之谋划,并开始担忧人类文明的命运。
  3. 他们默契,并决定成立 Open AI,是出于对于 AGI 可能被垄断的担心。具体说,是担心谷歌称霸世界:当时的谷歌已经搞出了 Alpha-go/-zero,让他们感觉此事无法缓行,必须立刻动手,以开源对抗谷歌。老马一半出于公心(为人类文明的前途忧虑),一半出于私心(希望自己成为谷歌AI的挑战者领袖,而不是放任奥特曼这些年轻人来领导)。
  4. 他对这个AGI事业和他可能扮演的角色非常投入,愿意做背后的金主,一开始就让奥特曼把第一笔融资提高一个量级,明确说,任何融资亏空他都可以补齐,隐含前提当然是他是 CEO 和 leader,最好是控股老板。按照 business 逻辑,这是完全合理的,毕竟在那样的早期,这样烧钱的AI“曼哈顿计划”,也只有老马这样识货的人才愿意成为金主。现代社会的铁律是,谁有钱,谁当家。可是奥特曼不甘心,他与Ilya几个是实际工作中的 Open AI 创始人和 AGI践行者,不甘心只做 COO 而把 CEO/Chairman 让给这个几乎是唯一靠谱的大金主。
  5. 于是上演了这一出最后分手的戏剧:老马在得不到他想得到的 CEO 或让 Open AI 依附于 Tesla 之后,决定退出。没有惊人的定力,奥特曼是不可能敢于把金主放跑的。而老马在决定离开的时候,宣判了 Open AI 的死刑:你们成功的希望为0,他说。不是老马对 AGI 的成功有丝毫怀疑,而是他觉得离开了他,Open AI 无法海量融资,只有死路一条。他当时列举了苹果和Facebook,判断这两家不可能有远见给 Open AI 输血,他却漏掉了微软,可能是根本没想到微软有此可能,他小看了微软CEO的眼光。
  6. 奥特曼怎么吸引和说服了微软,那是另一个故事了。但当时的情况是,除了老马,有钱人几乎没人能看懂 AGI 和前途,业内人士也看不懂,Open AI 就是一帮“疯子”在异想天开。融资几乎不可能,那么奥特曼怎么敢与老马分手,而不委曲求全让位给老马呢?
  7. 谁知道先知和天才不仅仅就是这几个疯子,微软CEO萨蒂亚·纳德拉(Satya Nadella)也是,虽然他离 AI 更远。萨蒂亚与奥特曼的“勾搭”是人类历史上最具浪漫色彩的一章,需要冲破种种桎梏。
  8. 现在我们似乎理解了,微软今天能超越苹果成为世界企业首富,就是英雄创造的历史:萨蒂亚是不可思议的领袖。他的悟性和远见让 Open AI 与微软结合,这是一桩非常奇特的姻缘:一方投入巨资,另一方短期看不到希望,巨资投入也带不来任何董事会决定权,萨蒂亚依然前行。世界上找不到微软这样的对象,它几乎是彼时彼刻唯一可以牵手 Open AI,摆脱它必死宿命的救星。呼唤的与被呼唤的,在千载难逢的那个时间点,没有错过。
  9. 后来的故事,所有人都知道了:这个“姻缘”彻底改变了AI,更重要的是,也改变了人类文明的走向。
  10. 其他都是花絮了:老马以维护人类的名义起诉 Open AI 违背初衷;Open AI 披露早期信件来往证明老马本人就梦想控股,并不真正在乎开源还是闭源,而他们则依然不忘初心。
顺便一提,Ilya 此前不知所踪,现在看来是被冷藏了,但他现在出来给奥特曼这个公开信背书,而且作为公开信的主要作者,似乎说明,他并没有(被)选择分道扬镳。我们的猜想是,他还在内部继续领导 AGI 的安全研究,所谓人类价值观的超级对齐,希望用技术手段保障AGI不失控。但(被要求?)保持了低调。
微软的地位其实很尴尬。一方面,现在知道他们对于 Open AI 的巨额投资,已经从股价的飞升中得到了足够的回报,所以从投资角度,萨蒂亚是微软的英雄。但另一方面,这个“婚姻”始终无法稳定,也难以建立恒久的互信。微软不得不给自己做 Plan B,而 Open AI 也有自己的 Plan B:都需要在两人分手的时候有所准备。Open AI 这种独一无二的公益实体控股企业实体的架构,改变了人类历史进程,但却天然有矛盾和不稳定。上次奥特曼被踢出而复返的危机会不会重演?奥特曼本人会不会成为 AGI 沙皇,违背初心,一意孤行?
这些都还在演进中,进行时 ......