An Introduction to the Transformer and Its Attention Mechanism

The Transformer architecture and its attention mechanism are the foundation of today's mainstream GPT-style large language models, which makes them enormously important. Although popular explanations of the topic already abound, many friends tell me they still find it all a fog, or at best a blur. So I resolved to write this three-part popular-science series, hoping to contribute my own understanding.

If you are curious about mainstream AI, you have probably long heard of the famous Transformer, the basic framework of large language models, and its "attention mechanism", and felt they were arcane concepts. Perhaps you have even read the classic paper Attention Is All You Need, yet still came away in a haze. Don't worry, that is perfectly normal; most of us have gone through that stage! The paper is indeed a bit of a brain-burner, but its core logic is actually not complicated.

To understand the Transformer architecture behind large language models, we need to walk through its workflow.

First, a quick look at how large models work and how they are trained.

A foundation model acquires knowledge from raw big data through "self-supervised learning", using many-layered neural networks. Self-supervised learning is a special kind of supervised learning that derives its supervision signal from "masking". In ordinary supervised learning, the training data are annotated with target outputs as supervision signals; self-supervised learning needs no human annotation. Instead, it masks some data points in the data and trains the system to predict them ("fill in the blank" or "continue the sequence"), using the masked data points as the gold standard and supervision signal. Mainstream GPT models mask the next word and train the system to predict it from the preceding context alone (called next token prediction); this is the dominant approach in today's generative AI.

When we feed in a piece of text, the model first cuts it into basic units (tokens) and then looks up each token's "dictionary entry" (its vector representation).

The Full Pipeline from Input to Output

1. Starting with "Dictionary Lookup": Tokenization and Embedding

To process the whole input text, the model first splits it into basic units, a step called tokenization: the text is decomposed into a sequence of tokens (the smallest units of text). These tokens may be whole words (such as "work") or subwords (such as "un+believ+able").

Tokens are symbols, and computers struggle to compute with symbols; they only handle numbers well, so tokens must be converted into numbers.

Each token is looked up in an embedding dictionary, which converts the token symbol into a numerical representation: a high-dimensional vector. Each token becomes a vector of, say, 300 to 1024 dimensions (imagine building a feature profile for each word along many conceptual dimensions, e.g. noun, singular, institution, finance, ...). Embedding gives words a computable space of semantic relationships.

A high-dimensional vector space is like a space of "meaning": each token's vector defines its position in that space, and a token's distances to other tokens along different dimensions measure how their meanings differ. This fits our common sense: a word's meaning shows itself through comparison with other words.

These vectors are not randomly generated; they are numerical representations trained on massive corpora, encoding each token's basic semantics, i.e. its position in the meaning space. For example, the vector for "bank" naturally lies close to "money" and far from "tree". Likewise, the vector for "apple" may carry information about "fruit", "tech company", and more.
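To make the idea of distances in meaning space concrete, here is a minimal sketch with made-up toy vectors (real embeddings are learned from massive corpora and have hundreds of dimensions; the numbers below are illustrative only):

```python
import math

# Toy 4-dimensional "embeddings" (hypothetical values, for illustration).
emb = {
    "bank":  [0.8, 0.1, 0.7, 0.0],
    "money": [0.9, 0.0, 0.6, 0.1],
    "tree":  [0.0, 0.9, 0.1, 0.8],
}

def cosine(u, v):
    """Cosine similarity: close to 1.0 = similar direction in meaning space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# "bank" sits much closer to "money" than to "tree" in this toy space.
print(cosine(emb["bank"], emb["money"]))
print(cosine(emb["bank"], emb["tree"]))
```

Cosine similarity is one common way to measure such distances; real systems compare learned vectors in exactly this spirit, just at far higher dimensionality.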

Imagine you want a computer to understand the sentence "The cat sat on the mat".

Step 1: Tokenization. Split the sentence into tokens: The + cat + sat + on + the + mat.

Step 2: Dictionary lookup (embedding). Give each token a numerical representation, i.e. a multi-dimensional vector.

“cat” -> [0.1, 0.5, -0.2, ...]
“sat” -> [-0.3, 0.8, 0.1, ...]
...
(Note: real vectors have far more dimensions, with continuously distributed values.)
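The two steps can be sketched in a few lines of code (the vocabulary, the naive whitespace tokenizer, and the 3-dimensional vectors are all hypothetical stand-ins; real systems use learned subword tokenizers such as BPE and learned embedding matrices):

```python
# Hypothetical toy embedding "dictionary": token -> 3-d vector.
EMBEDDINGS = {
    "the": [0.0, 0.1, 0.0],
    "cat": [0.1, 0.5, -0.2],
    "sat": [-0.3, 0.8, 0.1],
    "on":  [0.2, 0.0, 0.3],
    "mat": [0.1, 0.4, -0.1],
}

def tokenize(text):
    # Step 1: split the text into tokens (naive lowercase whitespace split).
    return text.lower().split()

def embed(tokens):
    # Step 2: look each token up in the embedding "dictionary".
    return [EMBEDDINGS[t] for t in tokens]

tokens = tokenize("The cat sat on the mat")
vectors = embed(tokens)
print(tokens)      # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(vectors[1])  # the vector for 'cat': [0.1, 0.5, -0.2]
```

Note that both occurrences of "the" map to the same vector at this stage; distinguishing them by context is exactly what the later attention layers are for.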

In short:

Tokenization decomposes text into tokens, the smallest units a computer can easily process and analyze.
Embedding converts these tokens into vectors the computer can easily compute with and combine.

Key point: the vector obtained from the embedding dictionary is only the token's "initial meaning representation"; it does not yet account for the token's specific context. Decoding contextual meaning into the vector representation is the job of the following steps, using the Transformer's stack of neural network layers plus the attention mechanism.

The Transformer's core block can be broken into two parts:

    1. The attention mechanism: computes the relevance between tokens and dynamically updates each token's representation.
    2. The neural network: handles the transformation of the information carried by the tokens.

The full Transformer is a stack of such blocks; each layer recomputes the token representations, deepening the understanding step by step.

2. Attention Enters: Updating Word Meaning from Context

Now we have a sequence of vectors, each representing a token's "initial meaning". But here is the problem: the same word can mean different things in different contexts! For example, "bank" can be a financial institution or a riverbank.

The core of the Transformer architecture is the attention mechanism (self-attention), whose job is to dynamically adjust each token's meaning representation according to context, reflecting its relationships with the other tokens.

An analogy: in the sentence "I like eating apples", "apple" is highly relevant to "eating", so the model draws more on "eating" when updating the meaning of "apple", determining that this "apple" is the fruit rather than the company.

How is this done?

The model computes each token's attention weights toward the other tokens via QKV attention:
- Query: the feature vector of the current token (e.g. "he")
- Key: the feature vectors of the context tokens (e.g. "policeman", "witness")
- Value: the actual content contributed once the association is made

    • For example, matrix operations may reveal that "he" is most strongly associated with "witness", so the vector for "he" is updated to carry the information of "witness".

Computing "relevance": for each token, we compute its relevance to all other tokens in the sentence, assigning different attention weights (attention scores) to different tokens. This "relevance" can be understood as how important each of the other tokens is for interpreting the current one.

    • For example, when interpreting "sat", "cat" and "mat" clearly matter more than "the".

Weighted averaging: using the computed relevance scores as weights, the V vectors of all tokens in the context are averaged into a new vector representation for the current token. This new vector is the token's meaning in this particular sentence.

    • For instance, the new vector for "sat" is influenced more by the vectors of "cat" and "mat" and less by that of "the".

Key point: by computing the relevance between tokens, the attention mechanism dynamically updates each token's meaning. The update is context-based: the same token gets different representations in different sentences.

Thus a token's meaning is no longer fixed; it shifts with the context of the whole sentence. In "I saw a bat", "bat" could be the animal or the club, but the attention mechanism uses the context to infer the more plausible reading.
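The computation just described (relevance scores, then weighted averaging of V vectors) can be sketched as single-head scaled dot-product self-attention. The random matrices below stand in for learned parameters; real models add multiple heads, masking, positional encoding, and more:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention (a minimal sketch).

    X: (n_tokens, d_model) initial token vectors.
    Wq/Wk/Wv: projection matrices (learned in real models, random here).
    Returns one context-updated vector per token.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # token-to-token relevance
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                       # weighted average of V vectors

rng = np.random.default_rng(0)
n, d = 6, 8                                  # 6 tokens, 8-dim vectors
X = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (6, 8): same number of tokens, now context-aware
```

The output has the same shape as the input: every token still has one vector, but each is now a blend of the whole sentence, weighted by relevance.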

For details on how Q, K, and V divide the labor inside the attention mechanism, see the companion piece 《立委科普:如何理解自注意力机制中的QKV分工?》 (How to Understand the QKV Division of Labor in Self-Attention).

3. The Transformer Backbone: Layer-by-Layer Information Compression

The Transformer's core block consists of two main parts:

    • Multi-head attention layer: computes the relevance between tokens and dynamically updates their representations.
    • Feed-forward network layer: nonlinear feature transformation that further compresses information (abstraction and generalization).

The full Transformer is a stack of such blocks, and each layer recomputes the token representations, deepening the understanding. Depending on the number of blocks, the Transformer repeats this update process again and again, much as humans mull over a text repeatedly when reading; every layer deepens the comprehension, and deeper layers can capture more complex semantic relationships.

Each Transformer block iteratively upgrades the model's understanding, for example:
- Lower blocks: capture local syntax (e.g. the contrast in "not...but...")
- Middle layers: resolve who "he" actually refers to
- Higher layers: grasp the overall theme of the text
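The stacking can be sketched as follows, assuming a deliberately simplified block of attention plus feed-forward with residual connections (real Transformers also use layer normalization, multi-head attention, and learned rather than random weights):

```python
import numpy as np

rng = np.random.default_rng(1)

def attention(X):
    # Simplified self-attention: relevance-weighted averaging of token vectors.
    scores = X @ X.T / np.sqrt(X.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ X

def feed_forward(X, W1, W2):
    # Position-wise nonlinear transformation (ReLU between two projections).
    return np.maximum(X @ W1, 0.0) @ W2

def transformer_block(X, W1, W2):
    # Residual connections let each layer refine, not replace, the meaning.
    X = X + attention(X)
    X = X + feed_forward(X, W1, W2)
    return X

n, d = 6, 8
X = rng.normal(size=(n, d))                # 6 tokens with initial embeddings
for layer in range(4):                     # stack 4 blocks
    W1 = rng.normal(size=(d, d)) * 0.1
    W2 = rng.normal(size=(d, d)) * 0.1
    X = transformer_block(X, W1, W2)
print(X.shape)  # still (6, 8): same tokens, increasingly contextualized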

The Transformer's defining strengths:
1. Parallel computation: word order is decoupled from token processing, so all tokens are processed in parallel (in contrast to the linear inefficiency of earlier RNNs).
2. Hierarchical understanding: a progressive reading from literal meaning to deeper intent, capturing regularities large and small.

Compared with pre-Transformer RNNs, the architectural advantages as seen in engineering practice:

| Property | RNN | Transformer |
| --- | --- | --- |
| Parallel computation | Sequential dependency, cannot parallelize | All tokens processed in parallel |
| Long-range dependencies | Suffers gradient decay | Direct global attention |
| Training efficiency | O(n) sequential time | O(1) sequential time per layer |
| Memory consumption | Relatively low | Grows quadratically with sequence length |

4. Output: The Model's Final Prediction

Transformer models can be used for all kinds of tasks, and different tasks call for different forms of output.

    • GPT: Next Token Prediction. For mainstream GPT models, the final task is to predict the continuation, realized through so-called "autoregressive" next-token prediction (autoregression means dynamically extending the context, generating the text recursively one token at a time). Based on its deep understanding of the context, the model decides what the most reasonable next content should be. This opened the road to general-purpose AI; the principle is that sequence learning acquired a universal "code" for turning inputs into outputs across tasks, but that is the subject of another article.
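The autoregressive loop itself is simple. In this sketch a toy bigram lookup table stands in for the Transformer, purely to show how each predicted token is appended to the context and fed back in:

```python
# Hypothetical toy "model": a bigram table standing in for a Transformer.
BIGRAM_MODEL = {
    "the": "cat", "cat": "sat", "sat": "on", "on": "the",
}

def predict_next(context):
    # A real model would run the full Transformer over the whole context
    # and pick a high-probability token; here we just look at the last token.
    return BIGRAM_MODEL.get(context[-1], "<eos>")

def generate(prompt_tokens, max_new=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new):        # the autoregressive loop
        nxt = predict_next(tokens)
        if nxt == "<eos>":
            break
        tokens.append(nxt)          # dynamically extend the context
    return tokens

print(generate(["the"]))  # ['the', 'cat', 'sat', 'on', 'the', 'cat']
```

"Dynamically extending the context" is exactly this append-and-predict-again cycle; the only difference in a real GPT is that `predict_next` is a full Transformer forward pass.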

5. Summary

    • Tokenization and Embedding lay the groundwork for the computer to understand text, like consulting a dictionary.
    • The attention mechanism computes the relevance between tokens and dynamically updates their representations.
    • A Transformer consists of neural network layers plus attention layers, refining token representations layer by layer and covering relationships at every level.
    • The final output depends on the task: a translation model generates target-language text; GPT predicts the next token. It turned out that this simple prediction mechanism naturally evolved into general-purpose large models that unlock all kinds of tasks.

 

 

[Related]

True Story Behind DeepSeek's Success: AI Learning to Think Slowly Without Human Supervision

*Edited transcript from InfoQ's second DeepSeek series livestream featuring Dr. Wei Li, former VP of Engineering at Mobvoi's Large Language Model team, discussing R1 Zero's innovative contribution to democratizing reasoning models.*

DeepSeek's Greatest Achievement: Making Everything Transparent

InfoQ: "DeepSeek adheres to a pure reinforcement learning approach, but the industry often refers to RL as 'alchemy' - how did they make this process controllable and accessible? What's innovative about their reasoning paradigm?"

**Dr. Li:** The reinforcement learning for reasoning models has long been an industry challenge. About six months ago, when Ilya and others declared the end of the pre-training era, it signaled that simply scaling up pre-trained models was no longer sufficient for performance improvements. The delayed release of GPT-5 is another indicator of pre-training's decline. As a result, the industry began seeking new growth paths, with on-the-fly reasoning models gaining momentum among leading teams until OpenAI released O1, the world's first reasoning large language model. DeepSeek's R1 then followed with its breakthrough success.

From the mysterious Q-Star project (reportedly causing dramatic internal conflicts at OpenAI) to the release of O1, reasoning models have been widely recognized as a new paradigm in AI. The core of this paradigm is enabling models' "slow thinking" capability, or System 2 as it is called, using reinforcement learning to enhance model intelligence in complex tasks. However, all of this was closed-source. OpenAI even deliberately created mystique around their chain-of-thought content. Apart from a few top players like Google and Anthropic quietly exploring and tracking this field, other teams knew very little about it.

DeepSeek's greatest achievement lies in making everything about LLMs transparent. They open-sourced their models and detailed technical papers, and weren't afraid to expose their chains of thought (CoTs) in the system. Through pure reinforcement learning, they proved that even without process control data, result-based control alone could achieve top-tier reasoning model performance. This breakthrough was like piercing through a paper window, showing the industry a feasible path to democratizing reinforcement learning.

InfoQ: The innovation in reasoning paradigm sounds abstract. Could you provide an example?

**Dr. Li:** R1's paper is outstanding, arguably one of the finest in the large model field. It consists of two parts: one focusing on Zero research, which presents remarkable achievements in pure reinforcement learning for reasoning; the other detailing the practical R1 system, a top-tier production reasoning model. For R1's development, they considered practicality, balancing comprehensive performance, safety, and various practical considerations, detailing a four-stage training pipeline as best practice to help other teams understand and replicate their success.

The most brilliant part is the Zero research. Zero proved a revolutionary point: contrary to traditional beliefs (or OpenAI's implied stance that reasoning requires step-by-step supervision), process supervision isn't actually necessary. Using only the final result against the "gold standard" as a supervision signal is sufficient to train the "slow thinking" process required for reasoning models.

This is Zero's greatest highlight and the origin of its name - it draws inspiration from AlphaZero's spirit. AlphaZero historically pioneered complete independence from human game records or experience, achieving zero human supervision reinforcement learning through self-play generated process data (state+move+score triplets). Similarly, DeepSeek's Zero research demonstrates that in reasoning tasks, models can autonomously generate internal process data - Chain of Thought (CoT) sequences - without human annotation.

Specifically, reasoning models initially focused on mathematics and coding because these domains have standard answers. Macroscopically, this is typical end-to-end supervised learning, as both input (math/coding problems) and output (answers/execution results) are fixed and known. However, the process from input to output is highly complex with significant information gaps, requiring a CoT bridge. Just as humans need to break down problems and think step by step when facing difficulties, models need this process too. DeepSeek's research found that models possess the ability to learn this deep thinking process autonomously if given sufficient time and space.

InfoQ: Dynamic reasoning paths sound like AI "drawing mind maps" - but how do you prevent it from going off track? Like suddenly writing poetry while coding?

**Dr. Li:** Based on current evidence, this possibility is virtually non-existent or negligibly low. Before DeepSeek published their results and research details, many were puzzled about this point: wouldn't deep thinking go haywire with only result supervision and no process supervision? Without large-scale reinforcement learning experiments, this was indeed a significant concern. It's like flying a kite - you're holding just one string while letting it soar freely, worried it might nosedive.

These concerns proved unnecessary. The reason it doesn't go off track is that all this reasoning reinforcement learning, including self-generated reasoning CoTs, is built upon existing top-tier models (like V3). These models have already mastered coherent expression through massive data learning. This coherence implies orderliness, which, while not equivalent to pure logic, prevents completely unreasonable deviations. It is observed that fluent human speech typically reflects organized thinking.

InfoQ: On another note, compared to OpenAI's O1, DeepSeek R1 has another notable highlight in applying reasoning CoTs to language generation and style imitation. Could you elaborate on this?

**Dr. Li:** When O1 was released, everyone knew it demonstrated significant improvements in mathematics and coding abilities, as standard tests revealed higher performance levels. What people didn't realize was that this reasoning ability, or "slow thinking" capability, excels not only in domains requiring strict logical reasoning but can also shine in traditional language tasks.

By nature, language ability has been a strength of large models - everyone knows they generate very fluent text, more native than natives. By the time we reached models like 4o or V3, their writing was already quite smooth, seemingly leaving little room for improvement. However, when asked to write classical poetry or imitate Lu Xun's writing style, previous models fell short. R1 solved these challenges. From a social impact perspective, this is actually quite remarkable and particularly noticeable.

Honestly, not many people are deeply concerned about mathematics or coding, although we know coding is a major direction for the coming years and automated programming can change the world. Everything in IT ultimately comes down to software; the digital world is built on software. If software development can transition from manual coding to model-assisted or even model-autonomous programming, this will greatly increase productivity. While this is visible to all, it's not as intuitive for ordinary people who more often face tasks like writing compelling articles.

When R1's humanities capabilities were discovered, not just geeks or software developers saw the benefits of reasoning models - ordinary people were excited too. Suddenly, anyone could claim to be a poet, writer, advisor or philosopher - the impact was tremendous. This wasn't felt with o1, perhaps because OpenAI didn't realize or at least didn't focus on this aspect of reasoning models. But while working on code and mathematical reasoning, DeepSeek must have internally realized that this "slow thinking" mechanism could also significantly improve writing abilities, especially in classical Chinese.

Everyone knows Chinese data isn't as rich as English data, so while previous models could write beautiful English poetry, they struggled with Tang poetry. This might be because Chinese data was insufficient in quantity or quality, preventing models from learning adequately. We always felt this was unfortunate - models would sometimes rhyme correctly, sometimes not, sometimes add or miss characters, not to mention tonal patterns to follow. DeepSeek clearly put effort into this area; their data quality must be significantly higher than industry standards.  More significantly, they know how to transfer the CoT ability from science and technology to language and literature.

InfoQ: If you were to recommend a DeepSeek module most worth replicating for programmers, which would it be? Like those "Aha moments" claiming to replicate R1 for tens of dollars?

**Dr. Li:** If I were to recommend a DeepSeek module most worth replicating for the programming community, it would be the Zero research-related components. This replication isn't about achieving comprehensive capabilities but rather verifying Zero research's key revelation - that machines can indeed autonomously learn. This is what OpenAI kept under wraps; perhaps they had figured it out earlier but chose not to disclose it.

Now, we've seen quite a number of different teams claiming to have reproduced R1's reflective capabilities with minimal resources. This isn't just an interesting experiment; more crucially, it marks the democratization of reasoning models. Previously, people didn't understand how reasoning models worked, only knowing that vast amounts of process data were needed for models to learn slow thinking. This was considered an almost insurmountable barrier because process data is hard to obtain, and reinforcement learning's instability and high data requirements confused and challenged many programmers.

But now, we know we can bypass this most difficult process data requirement and reproduce this "Aha moment" with limited resources, proving that slow-thinking capabilities can be learned autonomously by models. Based on this premise, if you're a domain expert, you might wonder: could these techniques achieve significant improvements in your field? This is entirely possible. Even the most powerful models (like V3 or 4o) only achieve 60-70% accuracy in specific scenarios without optimization, and experience tells us that without at least 80-85% accuracy, you can't launch a truly valuable system in real-life applications.

That is to say, between a large model's out-of-the-box results and actual valuable application deployment, there's a gap. Previously, our only method was collecting domain data for fine-tuning (SFT). Now, we have another path, RL: following the reasoning model approach, letting systems fully utilize slow thinking capabilities during the reasoning phase to improve data quality to acceptable or even exceptional levels. This path seems to have been opened.

However, my programmer friends tell me that in their comparison experiments between fine-tuning (SFT) and DeepSeek-style reinforcement learning (RL), while RL indeed outperforms SFT, the computational cost for RL training is still far higher than SFT. The superior performance makes sense because SFT data is always very limited, while successfully reinforced RL self-generated data can far exceed SFT data volume.

InfoQ: Some say large models represent "brute force aesthetics," but OpenAI's former Chief Scientist and co-founder Ilya says pre-training has reached its limit. How do you view this? Is the emergence of reasoning models just adding another scaling law to brute force aesthetics?

**Dr. Li:** This is more about a shift in technical focus and a paradigm change in technical innovation. Large models involve three major components: first, pre-training, which builds foundational capabilities by learning basic patterns from massive data; second, post-training, initially mainly fine-tuning - OpenAI early on used some reinforcement learning (like RLHF) for human preference alignment, but by Meta's time, they even abandoned typical PPO-style RLHF for simpler DPO, as they, like many others, struggled with it. Finally, there's the reasoning phase, where models interact with users in real time after deployment.

The current situation with high-quality natural data is that pre-training has nearly exhausted all available quality resources. The industry began to notice data growth challenges, making performance improvements increasingly difficult. GPT-5's delayed release, reportedly yielding limited returns despite massive computational investment, suggests pre-training may have indeed hit a wall.

This led the industry to explore alternative AI growth curves. Reinforcement learning-based reasoning models emerged at center stage in this context: pure reinforcement learning should be added to post-training. Previous reinforcement learning relied on human preferences, but this time it's about giving models more thinking time before reaching answers, learning underlying chains of thought (CoT). While V3 was already doing well, it didn't cause as much social sensation until R1 appeared. DeepSeek truly broke through after the Chinese New Year, becoming the most discussed public topic and causing excitement and shock overseas. R1 and O1 represent a new paradigm. Before R1, only OpenAI's O1 existed as a reasoning model, seemingly unreachably advanced, with would-be followers unsure how to follow. However, R1 not only reproduced O1's capabilities but did so with greater transparency and clarity. This contrast further highlighted R1's importance as an open-source model leader.

InfoQ: At first glance, DeepSeek seems like an engineering masterpiece. Why did it cause such a global sensation? Its user acquisition speed (100 million users in a week) even surpassed ChatGPT's explosive debut. What's its historical significance?

**Dr. Li:** From my personal experience and observation, ChatGPT's explosion was a landmark event in large model development. Research insiders were following large models before ChatGPT, at least since GPT-3. When GPT-3's Playground appeared, we were already immersed in it, sensing an approaching storm. But from society's perspective, ChatGPT truly shocked everyone, exceeding all expectations, like an AI nuclear explosion.

I believe R1's emergence is the second major shock after ChatGPT. Of course, between ChatGPT and R1, other influential models appeared, like 4o - another remarkable milestone. While ChatGPT 3.5 was already so impressive, 4o proved it could be even better. Then came Sora, bringing shock with video capabilities in multi-modal LLMs. I personally also greatly appreciate Suno, the music model, making me feel like I could become a musician overnight.

If I were to rank them, R1's impact is second only to ChatGPT, perhaps even exceeding 4o and Sora's sensational effects. R1's impact feels similar to ChatGPT's initial appearance, creating the same addiction. While ChatGPT was groundbreaking and R1 a follower, albeit with innovative highlights sometimes surpassing previous models (like in classical poetry and style imitation), achieving such global impact as a follower is truly miraculous.

In terms of practical effects, R1's productization was amazingly successful. Gaining hundreds of millions of users in a week, it far surpassed ChatGPT's record and elevated society's AI awareness. Furthermore, regarding geopolitical influences on technology access, many domestic users had long desired access to the world's most advanced models like the GPT series, Claude, or Gemini but often couldn't reach them. R1's appearance eliminated these concerns about domestic and international restrictions, contributing to its rapid global popularization.

InfoQ: What's your vision of AI programming's ultimate form? Is it programmers telling AI "make me a TikTok," and it outputs deployable code and operations plans?

**Dr. Li:** There are always two types of people: skeptics and optimists. People like Ilya believe Artificial General Intelligence (AGI) is imminent and Artificial Super Intelligence (ASI) isn't far away, so the biggest concern now, according to him, is ensuring superintelligence safety.

Anthropic's CEO Dario predicts that within 3-5 years, large models will achieve real breakthroughs - not just the current impressive demonstrations, but revolutionary changes in societal productivity. Fundamentally, they're talking about AI's ability to scale replacement of both physical and intellectual human labor.

However, while large models are buzzing now, their practical applications haven't reached the level of the previous generation's mobile platforms. Previous super apps like Meituan, Didi, Xiaohongshu, and TikTok transformed major aspects of our daily lives, from basic necessities to communication and entertainment, maximally shortening the distance between suppliers and customers - value every one of us feels daily. While playing with large models is interesting, their practical value at the lifestyle level isn't yet obvious; at best we're still on the verge of the coming AI application explosion.

Notably, DeepSeek's emergence has lowered large model application barriers, paving the way for scaled applications, though we haven't yet entered the era of true application explosion.

What will it look like when AI applications truly explode? I believe the ultimate goal, by the nature of AI, is for LLMs to comprehensively replace humans in both intellectual and physical labor. Signs of large models impacting white-collar workers are already undoubtedly evident, with even programmers not exempt. In physical labor, embodied intelligence is developing rapidly, with both humanoid robots and mechanical hands gradually replacing human physical work.

Of course, this brings side effects, such as massive job displacement; how society adapts to such greatly expanded productivity is another discussion topic. But looking at AI's nature and ultimate goals, AI development could have two milestones: first, when it can replace 50% of human work, allowing half of society to maintain a decent, free life through social programs perhaps like Universal Basic Income (UBI) - this might mark the arrival of AGI (Artificial General Intelligence); second, when it replaces 90% of human work, possibly signifying the emergence of ASI (Artificial Super Intelligence) - a kind of technological utopia (or 'communism') in some sense.

These two milestones are my own verifiable definitions of AGI and ASI. I do not agree with the idea that as old jobs are replaced, AI will create more new jobs. That does not hold up: any new job still involves human labor, and even if such jobs emerge for a time, they too are destined to be replaced soon by superintelligence.

This vision of AI's future development shows how DeepSeek's innovations in reasoning models might be just the beginning of a much larger transformation in how we think about work, society, and human potential in an AI-driven world.

 

 

[Related]

Does the New Reasoning Paradigm (Query+CoT+Answer) Support a New Scaling Law?

— Reflections on LLM Scaling Laws and DeepSeek's R1

My friend Zhang Junlin's article "Looking at the Future of Scaling Laws through DeepSeek R1" has sparked interesting discussions among peers.

Core Insights from Initial Discussions

Professor Bai summarised the key highlights as follows:

Infinite stacking won't lead to infinite growth (physical laws don't support this)

Only S-shaped growth is possible, with diminishing returns inevitably appearing

The initial emergence of language capabilities relates to the density of linguistic knowledge in training data

The next growth phase represents a second S-curve, driven by common sense knowledge, which requires more computing power due to lower knowledge density

The third phase involves learning logical reasoning (Chain of Thought), where natural data has even lower density of such knowledge. Brute-force mining with computing power becomes inefficient, making reinforcement learning with synthetic data a more rational approach

As Dr. Lu points out: The term "Scaling Law" is becoming overloaded. While S-curves (nonlinear curves characterized by sigmoid functions) can describe technology adoption lifecycles, they typically occur in succession (one technology hits its ceiling, making way for another). Large language models' multiple "Scaling Laws" confirm this pattern, with some overlap between Test-Time and Post-Training "Scaling Laws".

The Nature of LLM Scaling

Let's examine the fundamental logic behind LLM scaling. First, it's crucial to understand that LLMs are not databases - they don't aim to memorize long-tail data details. Large model training essentially compresses big data, or more precisely, compresses the knowledge systems behind the data (including common sense and encyclopedic knowledge), focusing on capturing regularities and patterns at various levels (what we call generalizations).

Conventional intuition suggests that as data scale increases, redundancy increases too. Regardless of filtering, cleaning, and deduplication, growing redundancy seems to imply diminishing returns. So why do large models still appear "hungry" even at the unprecedented scale of hundreds of billions of tokens? Why does the scaling law remain effective from hundreds of billions to trillions of tokens?

The key lies in LLMs being sequence learning and sequence decoding systems. While sequences are one-dimensional, the patterns and regularities behind them are high-dimensional. For instance, even a simple sequence like "cat chases mouse" potentially involves multiple knowledge dimensions: species relationships, predatory behavior, spatial movement, actor-patient roles, etc. This multi-dimensional knowledge naturally leads to combinatorial explosion at the sequence level as information is flattened into language. The insatiable appetite for big data effectively addresses this combinatorial explosion. As long as there isn't complete information redundancy, additional diverse sequences will help models abstract data patterns more precisely.

The Two vs. Three S-curves Debate

Zhang Junlin observes that since OpenAI's O1, two other phases have gained recognition with their own Scaling Laws: the reinforcement learning Scaling Law (RL Scaling Law) for post-training, and the Inference Scaling Law (also called Test Time Scaling Law).

This raises a crucial question: Are there really three S-curves, or just two? How comparable is the reasoning model's S-curve to the pre-training S-curve?

While theoretically we can identify three phases:

1. Pre-training
2. Post-training (especially reasoning-focused reinforcement learning)
3. Inference phase

In practice, post-training and inference phases likely share a single S-curve; there aren't two independent growth curves.

DeepSeek R1's Insights: The Truth About "Slow Thinking"

Consider DeepSeek R1: users can activate "deepthink" mode to enable Chain-of-Thought (CoT) reasoning, but they can't actually control reasoning quality by increasing computation time. Why is this?

Let's examine a concrete example. When R1 solves a complex mathematical problem:

Traditional models might directly answer: "The result is 42"

R1 shows detailed reasoning: "Let's think step by step: 1) First consider... 2) Then we can... 3) Finally, we get 42"

While R1's response appears to demonstrate "slow thinking" (CoT), this reasoning process actually reflects a generation pattern fixed during training, not dynamic exploration of multiple potential reasoning paths at response time. In other words, CoT+answer might look like "slow thinking," but it doesn't fundamentally change the unidirectional next-token prediction paradigm. R1's CoT+answer creates an illusion of slow thinking, but the generative nature remains fundamentally the GPT "fast thinking" paradigm. At test time, unlike AlphaGo, the depth and scale of thinking isn't dynamically explored, though beam search, if applied, can provide implicit multi-path optimization internally.

Test Time Compute Constraints

The industry's buzzword "test time compute" refers to reasoning models requiring more online computational resources compared to traditional non-reasoning models. For example, R1 with CoT enabled might need several times more computation time than its base model V3 for the same problem. However, this increased computation results from behavior patterns acquired during training, not dynamically adjustable compute investment. Without controllable scalability in test time compute, we can't really talk about a test time scaling law.

A major difference between pre-training and CoT reinforcement learning lies here: pre-training scaling laws can remain stable long-term because once training completes, it doesn't significantly impact online response time - the generation mode remains a simple query+answer. Therefore, offline training for months is acceptable if the resulting model shows significant capability improvements. However, reasoning models' post-training CoT reinforcement learning differs - it cultivates models' habits of responding with slow thinking, changing the generation mode to query+CoT+answer. Extending the CoT isn't just about the cost of training resources and time; more critically, it reflects in extended test time compute for each query during deployment, severely delaying system response time. Users generally have limited tolerance for slow thinking computation time and delays during online system use.

The Sustainability Debate

OpenAI's Sam Altman and Anthropic's Dario might argue that for extremely complex problems (like proving the Riemann hypothesis or designing next-generation aerospace vehicles), even if a model needs a week of computation time, it's still a massive improvement over human teams requiring decades. However, this argument has two issues:

LLM feasibility for such super-complex problems remains far from validated

Extreme scenarios lack universality and can't serve as data points for sustainable scaling laws

This isn't to deny S-curves as effective models for describing scaling laws, nor to reject the rationality of S-curve stacking. The combination of pre-training and post-training growth curves (s1 and s2) might indeed reflect the overall relationship between resource investment and performance improvement. However, we should carefully examine whether CoT reasoning truly opens a sustainable scaling curve.

Conclusion: How Far Is the LLM Road to AGI?

If reasoning models' scaling laws lack sustainability, this raises a deeper question: Can we reach the promised land of Artificial General Intelligence (AGI) through these two scaling laws alone? Furthermore, is the technical ideal of Artificial Super Intelligence (ASI) - AI replacing human labor and dramatically improving productivity - truly feasible?

Current evidence suggests that while pre-training scaling laws have shown considerable sustainability, reasoning models' scaling laws may quickly hit practical constraints. This reminds us that the path to AGI/ASI likely requires more innovative breakthroughs, not just simple extrapolation of existing methods. In the next phase of artificial intelligence development, we might need to discover entirely new growth curves.

[#LLMs #ArtificialIntelligence #DeepLearning #AGI #ScalingLaws #MachineLearning]

 

[Related]

Zhang Junlin: The Future of Scaling Laws as Seen from DeepSeek R1

DeepSeek Notes: Does the New Reasoning Paradigm query+cot+answer Support a New Scaling Law?

How Much Further Can LLMs' "Bigger Is Better" Go?

— Some Reflections on Scaling Laws

 

My old friend Zhang Junlin's article "Looking at the Future of Scaling Laws through DeepSeek R1" has sparked lively discussion among friends.

Professor Bai's recommended summary:

Core points:

— Infinite stacking does not bring infinite growth (the laws of the physical world do not support it); there is only S-shaped growth, and every S-curve eventually hits diminishing marginal returns.

The initial emergence of language capability relates to the density of linguistic knowledge contained in the training corpus.

The next stretch of growth is actually a second S-curve: additional corpus data contributes mostly commonsense knowledge, whose density is lower than that of linguistic knowledge, so greater computing power is needed for it to emerge.

After that comes the learning of logical knowledge (chains of thought). Natural corpora have an even lower density of logical knowledge, so brute-force gold-panning with raw compute is thankless work. Hence it makes perfect sense that reinforcement learning on synthetic data, with its higher density of logical knowledge, is what lets the third S-curve climb.

鲁总评论说:Scaling Law 这个词现在有点滥用。S 曲线(Sigmoid函数刻画的非线性曲线)倒是可以描述技术的生命周期,但它往往是一个接下一个(一个技术遇到瓶颈,往往才有另一个技术的开始)。。。这个在ChatGPT刚出来时我们回顾过。大模型的这几个 "Scaling Laws" 也印证这一点 (Test-Time 和 Post-Training “Scaling Laws" 有点重叠部分):

创新就是从一个S曲线到另一个S曲线,well known results。这也是斯坦福大学那位鼓吹新能源、自动驾驶以及再生食品革命等科技乐观主义的教授(叫?)每次演讲必谈的技术革命的adoption曲线。他自称根据这个曲线,他在过去30年对于技术影响社会的许多预见都证明是对的,虽然每一次预见社会都会取笑他。 

回到LLM领域的 scaling law 话题。Scaling law本质上是一种经验法则,而经验告诉我们,大多数经验法则都符合S形曲线(或增量的正态分布)。具体到LLM,"大就是好"正是这种法则在遇到平台期或天花板之前的体现。这里的"大"指的是数据规模大、模型参数量大,缺一不可。模型规模不够大,数据再多也无法有效消化——这早已是业界共识。不过OpenAI早期的设计中过分强调模型规模的做法现在看来是一种误导,直到Chinchilla Scaling Law的提出,业界才形成了更合理的共识:数据规模和模型参数量需要保持适当的比例关系。
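按 Hoffmann et al. 的 Chinchilla 论文所报告的经验拟合,这一“数据规模与参数量保持比例”的共识可以粗略写成(下列系数为凭记忆引述的论文数值,仅作示意):

```latex
L(N, D) \;\approx\; E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28
```

在固定算力预算 $C \approx 6ND$ 下求最优,得 $N_{\mathrm{opt}} \propto C^{0.5}$、$D_{\mathrm{opt}} \propto C^{0.5}$,即参数量与数据量应大致同速增长(经验上约合每个参数 20 个 tokens)。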

LLM Scaling的底层逻辑是什么?

首先要明确:LLM不是数据库,其目标不是记忆长尾数据的细节。大模型训练本质上是对大数据内容的压缩,换句话说,压缩的是数据背后的知识体系(包括常识、百科知识等),重点在于揭示大大小小的各种规律性(也就是所谓的泛化能力,generalizations)。

一般直觉会认为,数据规模越大,冗余也越多。无论如何过滤清洗和去重,冗余度随规模增长,似乎意味着可榨取的"油水"会越来越少。那么为什么到了千亿tokens这种以前难以想象的数据规模,大模型依然显得"吃不饱"?为什么从千亿扩展到万亿tokens,scaling law依然有效?

这个现象的关键在于LLM是序列学习(编码)序列推理(解码)的系统。序列本身是一维的,但序列中蕴含的patterns和规律性却是高维的。举个例子:即使是简单的"猫追老鼠"这样的序列,背后可能涉及物种关系、捕食行为、空间运动等多个维度的知识。这种多维知识表现在序列层面,就会发生天然的组合爆炸。对大数据的"大胃口"正是应对这种组合爆炸的有效策略。只要不是完全的信息冗余,增加的不同序列对模型抽象数据patterns通常都是有帮助的。
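这里所说的组合爆炸,可以用一个极简的计算来示意(维度划分与取值数纯属假设,仅为说明多维知识投射到一维序列上时组合数随维度指数增长):

```python
# 假设每个事件可沿 k 个独立知识维度各取 v 种值(如物种关系、捕食行为、空间运动等),
# 则可能出现的不同表层序列模式约为 v**k,随维度数 k 指数增长。
def surface_patterns(v: int, k: int) -> int:
    return v ** k

# 3 个维度、每维 10 种取值:一千种组合;10 个维度:一百亿种组合
print(surface_patterns(10, 3))   # 1000
print(surface_patterns(10, 10))  # 10000000000
```

这正是“大胃口”的由来:只要新序列不是纯冗余,它就可能覆盖到尚未见过的组合。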

然而,人类自然产生的高质量数据是有限的。预训练已经几乎吃尽了现有的高质量自然数据。于是,业界开始探索另外的AI智能增长曲线。

从预训练到推理:两个还是三个S曲线?

张俊林指出:

OpenAI o1推出后,另外两个阶段不再孤单,也各自拥有了姓名,产生了各自的Scaling Law,对应后训练阶段的强化学习Scaling Law(RL Scaling Law)和在线推理阶段的Inference Scaling Law(也叫Test Time Scaling Law)。

这里值得探讨的问题是:到底是三个S曲线,还是两个?推理模型的S曲线与此前的预训练S曲线有多大可比性?

理论上确实可以分为三个阶段:

1. 预训练
2. 后训练(尤其是推理强化学习)
3. 推理阶段

这三个阶段理论上都可能找到资源投入与性能提升之间的正相关S曲线,即scaling laws的某种表现函数。但实际上,在当前部署的应用中,后训练和推理这两个阶段应该共享同一个S曲线,原则上不存在两条独立的增长曲线。

当然,如果用户利用提示词技巧来影响模型的test time,让它更深入的思考,这可能间接影响 CoT (ChainOfThought)的长度或深度。但那是 query 的改变,是 input context 的变化,感觉也不应该算作 test time compute 的独立的 s曲线。

另外,说推理模型这一波潮流是范式转变,开启了新的 RL/Test-time scaling law,总觉得有一点太言之凿凿了。直觉上,推理模型的增长曲线与此前的预训练 scaling law 的增长曲线,大概率没有直接的可比性。

Scaling law 说的 law,实际上我们都知道是所谓经验“法则”。经验需要足够的实践数据积累,才能总结出来。强化学习赋能的推理模型才刚开始,没有足够的经验数据刻画这是怎样的一种增长关系,能持续多久,是不是昙花一现,还是可以持续相当长的时间,等等。

持续时间不够长的 scaling,其实没有多少经验法则的意义。Anthropic CEO Dario 提到 deepseek 的时候说(大意), deepseek 显得这么亮眼其实是赶上了好时机,言下之意是运气的成分大于技术硬核实力和创新(滑稽的是,Anthropic 迄今没有能力推出任何推理模型,虽然R1以来,谷歌和国内都有推理模型的上线)。他说,推理刚刚开始,所以任何人走通了这条路,在这个初期阶段都会有一个大增长。譬如PhD段位的考试题,在没有推理模型的LLM中,可能分数很低,但一旦有了推理模型,有了所谓 test time compute 的 CoT,成绩就会直线上升,给人创造了奇迹的感觉。

现在是推理模型的早期,后续会如何呢?靠增加 test time compute,或不断延长 CoT,还会有多少增长空间?这个问题是现在进行时,貌似没有明确答案。但隐隐觉得,这个持续增长的时间或曲线,远不如预训练那样稳定和持续,进而其作为 scaling law 的说法不一定站得住。

这第二条反映 RL scaling law 的后训练智能增长曲线,不大好与 pretrain scaling law 相提并论,很可能并不是可持续的,也可能很快就遭遇制约因素的强烈反弹(见后“Test Time Compute 的制约”)。

DeepSeek R1的启示:慢思考的真相

以DeepSeek R1为例,用户可以选择"deepthink"模式来启动慢思考的chain-of-thought(CoT)推理,但实际上用户难以通过增加计算时间来提升推理质量。这是为什么呢?

让我们看一个具体例子。假设我们让R1解决一个复杂的数学问题:

- 传统模型可能直接给出答案:"结果是42"
- R1会展示详细的推理过程:"让我们一步步思考:1) 首先考虑...... 2) 然后我们可以...... 3) 最后得出结果42"

表面上看,R1的回答展现了"慢思考"(CoT)的特征,但实际上这个推理过程是模型在训练阶段就已经固化的生成模式,而不是在回答问题时动态探索多个可能的推理路径。换句话说,CoT+answer 看似是"慢思考"后的回答,但其实并不改变自回归 ntp(next token prediction)的单向序列生成定式。说白了就是,R1 的 cot+answer 给人慢思考的样子,但生成的本性还是GPT“快思考”范式。在 test time,思考的深度和规模不是动态探索,虽然可以用 beam search 进行内部的隐式多路径选优。

Test Time Compute 的制约

目前业界热议的"test time compute",指的是含有CoT机制的推理模型相比传统的非推理模型需要更多的在线计算资源。以V3/R1为例,处理同样的问题,启用CoT 的R1可能需要V3 n多倍的计算时间。但这种计算量的增加是模型训练后固化的行为模式导致的,而不是可以动态调节的算力投入。test time compute 没有可控的伸缩可能性,也就谈不上 test time scaling law。

预训练与后训练的CoT强化学习的一个很大的不同是:预训练 scaling law 可以长期稳定乃是因为一旦训练完成,不大影响在线响应的时间,生成模式就是简单 query+answer。因此预训练阶段离线训练几个月都是可以忍受的,只要训练出来的大模型能力有大的提升。但推理模型后训练阶段的CoT强化学习不同,它在培养模型在线回应慢思考的习惯,生成模式是 query+cot+answer。推理模型的 cot 拉长,不仅仅是训练的资源和时间的耗费问题,更主要的是它反映在部署推理阶段的 test time compute 的延长,严重拖延了系统的响应时间。而用户在线使用系统的时候,一般来说对于慢思考的计算量和耗费时间是有能够忍耐的上限的。

这就带来了一个关键问题:即使研究表明indefinitely 增加CoT的长度(相应增加在线计算时间)能带来持续的性能提升,符合某种 scaling law 的经验法则,这种增长也会受到推理阶段现实因素的制约。一般用户可能愿意等待5-10秒获得更好的答案,但如果需要等待几分钟乃至几小时,使用体验就会大打折扣,乃至不可接受。

Scaling Law的可持续性之辩

Open AI CEO Sam Altman 和 Anthropic CEO Dario 这些大佬可能会争辩说,对于极其复杂的问题(如证明黎曼猜想、设计下一代航天战机等),即使模型需要一周的计算时间,相比人类团队需要数十年的工作量仍是极大的进步。但这种论述有两个问题:

1. 这类超复杂问题的LLM可行性远未得到验证
2. 极端场景不具有普适性,难以作为可持续的scaling law 的数据点

当然,这并不是否认S曲线作为描述scaling law的有效模型,也不是否定S曲线叠加的合理性。预训练和后训练两个阶段的增长曲线(s1和s2)叠加确实可能反映了资源投入与性能提升的整体关系。但我们需要谨慎看待CoT推理是否开启了一个真正可持续的scaling曲线。

结语:通向AGI的道路还有多远?

如果推理模型的scaling law缺乏可持续性,这就带来了一个更深层的问题:仅依靠这两个scaling laws,我们能否达到通用人工智能(AGI)的理想彼岸?更进一步,让AI平替人类劳动、极大提升生产力的超级人工智能(ASI)的技术理想是否真的可行?

目前的证据表明,预训练scaling law确实展现了相当的持续性,但推理模型的scaling law可能会较快遇到现实约束。这提醒我们,通往AGI/ASI的道路可能需要更多的创新突破,而不仅仅是现有方法的简单外推。在人工智能发展的下一个阶段,我们或许需要寻找全新的增长曲线。

 

 

【相关】

张俊林:从Deepseek R1看Scaling Law

DeepSeek 笔记:R1 部署阶段的推理机制

1. 训练阶段的强化学习机制:GRPO是否通过生成多条候选答案(multiple candidate cot+answer sequences)进行策略优化(修改模型),使得模型养成慢思考的习惯?

这个答案是毫无疑问的 YES。

2. 部署阶段的推理机制:R1是否在生成时隐式生成多条路径,但仅展示一条?如果是,这种机制与集成(ensemble)方法有何异同?

3. 与AlphaGo的MCTS的区别:MCTS树搜索是否在推理时动态构建搜索树,而集成方法只是静态组合多个模型的输出?

1. 部署阶段的隐式多路径推理机制

DeepSeek R1 的部署阶段,其推理机制可以概括为以下两种模式:

(1) 隐式多路径生成与筛选

- 生成多条路径:模型在单次推理时,可能隐式生成多条潜在的推理路径(CoT+Answers),但仅选择其中一条输出。
- 技术实现:通过调整解码策略(如束搜索宽度 `beam_width`),模型在生成过程中维护多个候选序列(即多条路径),最终选择综合评分最高的路径。
- 用户感知:用户仅看到最终输出,但模型内部进行了多路径探索与筛选。
- 效率权衡:若设置 `beam_width=1`(贪心搜索),则退化为单路径生成,响应速度最快;增大 `beam_width` 可提升输出质量,但增加计算延迟。

(2) 显式多候选生成(需主动配置)

- API级控制:通过设置 `num_return_sequences` 参数,模型可显式生成多个候选答案(如5个),用户或下游系统可进一步筛选。
- 实际应用:DeepSeek App默认未开放此功能,但在企业API或开源代码中可能支持。
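束搜索的“隐式多路径”机制可以用一个极简的玩具实现来示意(下面的词表与概率纯属虚构,仅为演示 `beam_width=1` 退化为贪心、增大 `beam_width` 可让被贪心过早淘汰的路径翻盘):

```python
# 玩具束搜索:模型内部维护 beam_width 条候选序列,最终只输出综合得分最高的一条,
# 用户看不到被淘汰的路径 —— 这就是“隐式多路径生成与筛选”。
import math

NEXT = {  # 玩具条件分布: 前缀末词元 -> {候选下一词元: 概率}
    "<q>": {"A": 0.6, "B": 0.4},
    "A": {"x": 0.5, "<end>": 0.5},
    "B": {"y": 0.9, "<end>": 0.1},
    "x": {"<end>": 1.0},
    "y": {"<end>": 1.0},
}

def beam_search(beam_width: int, max_len: int = 4):
    beams = [(["<q>"], 0.0)]  # (序列, 累计对数概率)
    for _ in range(max_len):
        candidates = []
        for seq, lp in beams:
            if seq[-1] == "<end>":  # 已结束的序列原样保留
                candidates.append((seq, lp))
                continue
            for tok, p in NEXT[seq[-1]].items():
                candidates.append((seq + [tok], lp + math.log(p)))
        beams = sorted(candidates, key=lambda b: -b[1])[:beam_width]  # 剪枝
    return beams[0][0]  # 仅展示最优路径

print(beam_search(beam_width=1))  # ['<q>', 'A', 'x', '<end>'] —— 贪心先选 A
print(beam_search(beam_width=3))  # ['<q>', 'B', 'y', '<end>'] —— 更宽的束让 B 路径翻盘
```

把最后的 `return beams[0][0]` 换成 `return [b[0] for b in beams[:n]]`,就相当于 `num_return_sequences=n` 的显式多候选输出。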

关键点
- 训练阶段的强化学习优化了模型的“单路径CoT生成能力”:通过GRPO训练,模型在单次生成时即可输出高质量的详细推理步骤(长CoT),无需依赖显式多候选生成。
- 部署时的多路径探索只是“锦上添花”:隐式多路径(如束搜索)或显式多候选生成可进一步提升输出质量,但非必需功能。

 

2. 训练阶段的“慢思考习惯”培养

(1) 强化学习的作用

- 目标:通过GRPO算法,模型学习生成更详细、更合理的推理步骤(长CoT)以提高奖励(如答案正确性)。
- 机制:训练时生成多个候选答案,奖励信号不仅评估最终答案正误,还隐式鼓励逻辑连贯的推理路径(如通过格式奖励)。

(2) CoT增长的驱动力

- 奖励设计:若长CoT更易得出正确答案(如分步解题减少错误),模型在策略优化中自然倾向于生成更长、更详细的步骤。Given room for [think], a reasoning model just wants/tends to think deep!
- 数据反馈:训练后期通过拒绝采样生成的高质量SFT数据,进一步强化这一模式。

结果:训练后的模型在单次生成时即可输出高质量的详细推理(即“慢思考习惯”内化)。

3. 与集成方法(Ensemble)的异同

(1) 相似性

- 多路径生成:隐式多路径探索(如束搜索)可视为同一模型生成多个潜在输出,类似集成方法中的多模型预测。
- 结果筛选:通过置信度选择最优解,类似于集成中的投票或加权平均。

(2) 核心差异

R1的隐式多路径生成本质是单模型内的动态解码策略,而传统集成依赖多模型的静态组合,二者在实现成本与多样性来源上存在根本差异。

4. 与AlphaGo蒙特卡洛树搜索(MCTS)的本质区别

(1) AlphaGo的MCTS机制

- 动态规划:通过模拟(Simulation)构建搜索树,评估每一步的长期收益(如胜率),动态选择最优路径。
- 在线学习:在推理时根据实时反馈(如对手落子)调整搜索策略,部分版本(如AlphaZero)甚至更新模型参数。

(2) R1的隐式多路径生成

- 静态模型:部署时模型参数固定,多路径生成依赖预训练的策略与解码规则。
- 无长期收益建模:路径选择基于模型自身的置信度概率,而非多步决策的累积收益。

(3) 差异

- R1的多路径生成是静态策略的有限探索,依赖训练阶段内化的CoT+answer的生成能力。
- MCTS是动态规划过程,通过实时模拟与评估实现长期收益最大化,属于在线决策优化。

 

5. 总结

- 训练阶段的目标:GRPO通过强化学习培养模型生成详细CoT的习惯,使得部署时单次生成即可输出合理答案。
- 部署阶段的灵活性:系统可选择单路径生成(快速响应)或多路径筛选(质量优先),后者类似轻量级集成。

- 训练完成后模型参数确实固定,但隐式多路径生成依赖解码策略(如束搜索宽度),用户可通过API参数调整,非完全静态。
- 与集成的实质差异:R1的多路径生成是同一模型的不同解码路径,而传统集成依赖多个独立模型,后者多样性更高但成本激增。

- MCTS的核心是动态搜索与长期收益建模,而非多模型预测的平均化。R1的隐式多路径更接近贪心策略的扩展,而非规划过程。

DeepSeek R1的部署机制通过训练阶段的强化学习内化“慢思考”能力,使其在单次生成时即可输出详细推理。隐式多路径生成(如束搜索)可进一步提升质量,但本质是同一模型的解码策略优化,与传统集成或AlphaGo的MCTS均有显著差异。这种设计在效率与效果间取得平衡,适配大规模工业应用需求,但牺牲了动态规划与全局最优的能力。

 

【相关】

Hallucinations in AI: Bug or Feature? A Deep Dive into DeepSeek-R1

Host: Hello everyone! Welcome to today's interview. Recently, there's been quite a buzz about AI "hallucinations," especially with DeepSeek-R1, which seems to have a higher hallucination rate than its predecessor, DeepSeek-V3. Today, we're joined by Dr. Li, a senior AI researcher. Welcome, Dr. Li!

Dr. Li: Hello, host! Hello, everyone!

Host: Let's start with the million-dollar question: Why do large language models "hallucinate"? Can you break it down for us in plain English?

Dr. Li: You see, large language models are like super-powered conversation completers. Give them the first half of a sequence, say, a question, and they'll predict the second half (say, an answer) based on their massive knowledge network. They learn like our brains do – they can't remember everything word-for-word, so they compress and generalize, grabbing the gist and finding patterns.

Here's a fun contrast: Ask them "How tall is Yao Ming?" and they'll nail it because that's such famous knowledge, this data point is practically carved in stone in their memory (represented in the model's parameter weights). But ask them "How tall is Old Wang from next door?" and they're stumped because they've never met Old Wang! But here's the kicker – they won't just say "I don't know." So what do they do? They "make up" a reasonable height based on what they know about the range of human heights. That's a hallucination for you!

Host: Wow, that's some impressive guesswork! But isn't this kind of making things up pretty problematic?

Dr. Li: Not necessarily! In a way, hallucination is imagination (for better or worse) – it's where creativity lies! Think about it: all those great literary works, artistic masterpieces – aren't they all flights of fancy, products of imagination? If everything had to match reality closely, art would just be photography, and where's the fun in that?

You know, Yuval Harari makes a fascinating point in "Sapiens" – humans became Earth's dominant species precisely because we could "tell stories," creating myths, religions, nations, and money – things that don't physically exist. These are all "hallucinations," but they're the driving force behind civilization!

Host: When you put it that way, hallucinations sound pretty important! But let's talk about DeepSeek-R1. Its hallucination issue seems quite serious.

Dr. Li: Indeed, it is! The academic consensus used to follow OpenAI's view that reinforced reasoning would significantly reduce hallucinations. I remember discussing this with a head honcho at an LLM unicorn who was particularly excited about reasoning's potential to curb hallucinations. But R1's performance threw us a curveball!

According to Vectara's tests, R1's hallucination rate is more than three times that of its foundation model V3 – 14.3% versus 3.9%. This definitely correlates with its prolonged "Chain of Thought" (CoT) enabled by reinforcement learning for reasoning. R1 is absolutely brilliant at reasoning, math and coding, as well as poetry and storytelling, but this currently comes with the "side effect" of increased hallucinations in things like translation and summarization.

More specifically, there are several reasons for R1's increased hallucinations.

First, the standard hallucination tests use summarization tasks, something base models are already pretty good at. In this case, reinforcement learning can backfire – it's like using a cannon to swat a fly!

Second, R1's reinforced reasoning chains weren't specifically optimized for straightforward tasks like summarization, translation, or news writing that demand strict factual accuracy. Instead, it tries to add various layers of thinking to every task. Looking at its transparent CoT (ChainOfThought) printout, we see it tirelessly analyzing even simple instructions from multiple angles. This overcomplication of simple tasks can lead to deviations and hallucinations.

During R1's reinforcement learning for NLP-related tasks, it seems the model was rewarded more heavily for creativity, leading it to be more imaginative – and consequently more prone to straying from facts. For mathematical and coding tasks, R1's supervision came from gold standards (test answers or code test cases). But for humanities tasks, they used V3 or V3's reward model to judge quality, and the current system seems to clearly favor creativity.

Moreover, user feedback typically tends to focus on and encourage creativity. Most people aren't sensitive to hallucinations, especially when they're wrapped in the model's smooth, fluent language. For most frontline developers, this kind of user feedback naturally pushes them to enhance creativity rather than tackle the thorny problem of hallucinations.

Host: So, you are saying that R1's hallucination problem is rooted in its over-enthusiastic reasoning? What's the real relationship between reinforced reasoning ability and hallucinations?

Dr. Li: It's still a puzzle – there doesn't seem to be a simple correlation. Look at R1, a leading reasoning model, versus Claude 3.5 Sonnet, a top non-reasoning model. Surprisingly, Sonnet still has a higher hallucination rate than R1! But when we compare R1 to its base model V3, we see clearly that adding reasoning significantly increased hallucinations.

It may well be about the model's "personality." R1, with its powerful reinforcement learning, loves "divergent thinking." Give it a simple prompt, and it'll spin out ideas like there's no tomorrow – its CoTs could run on like crazy! This suggests that while R1 was powering up its creativity, it inevitably amplified creativity's twin: hallucination.

As a model that excels in both STEM and humanities, R1 performs differently across tasks. In mathematics and coding, where more rigorous reasoning is required, there's little room for hallucination. But in language and creative tasks, especially in the summarization tests, hallucinations become more prominent. It's largely a side effect of R1's supercharged linguistic creativity.

Technically speaking, R1 automatically adds lengthy CoTs to simple user instructions, essentially complicating straightforward tasks. Its CoTs (like the internal monologue of an entity following instructions) change the conditional part of the autoregressive probability model before generating answers, naturally affecting the final output. Compare:

V3: query → answer
R1: query+CoT → answer

For tasks that V3 already handles well, like summarization or translation, any lengthy CoT guidance might lead to deviation or embellishment, creating fertile ground for hallucinations.
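Schematically, in chain-rule terms (notation informal), the two generation modes differ only in the conditioning prefix, but that difference propagates into every answer token:

```latex
P_{\text{V3}}(\text{answer} \mid q) = \prod_{t} p(a_t \mid q,\; a_{<t}),
\qquad
P_{\text{R1}}(\text{answer} \mid q) = \prod_{t} p(a_t \mid q,\; \text{CoT},\; a_{<t})
```

where the CoT is itself generated token by token from $q$ first; a long CoT thus reshapes the distribution of every subsequent answer token.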

Host: So where do R1's hallucinations mainly occur?

Dr. Li: Think of R1's abilities as split between "arts" and "sciences." In "science" areas like math and coding, its logic is fairly strong and hallucinations are relatively rare. But in "arts" areas like language, hallucinations become more noticeable.

R1's most impressive achievement compared to the first LLM reasoning model O1 is successfully extending mathematical and coding reasoning capabilities into creative writing, especially in Chinese. The internet is full of R1's brilliant literary works. In terms of wordplay and literary prowess, it clearly surpasses 99% of humans – even graduate students in literature and classical Chinese professors sing its praises.

But watch what happens when you ask it to do a simple summary – it can't help but "get creative," often "inventing" details not present in the original text. It's like its "arts" abilities are too powerful, a case of "too much of a good thing."

Host: That's an interesting perspective. Do all language tasks require creativity?

Dr. Li: Language tasks actually fall into two categories: ones that need high creativity, like poetry and fiction writing, and ones that demand high factual accuracy, like news reporting, translation, or summarization. R1 excels at the former, which was likely the development team's focus, but this currently creates side effects in the latter.

It reminds me of the old Chinese saying about translation needing to be "faithful, expressive, and elegant" – achieving all three has always been challenging. We see many examples where elegance is prioritized over faithfulness, like the use of hyperbole in literary works. We also see the opposite, like Lu Xun's advocacy for so-called "rigid translation."

Interestingly, humans have always had double standards here, but we have a mental switch we can flip at will. When watching movies or reading novels, we flip towards creativity and don't fuss about factual accuracy. But switch to news channels, and we have zero tolerance for falsehoods.

Host: People tend to believe content that appears logically coherent and detailed, so the potential harm from AI hallucinations could be significant. What should we ordinary folks do about AI hallucinations?

Dr. Li: While many people are starting to notice and become wary of these hallucinations amid their amazement at LLM's creativity, most are still mesmerized by its creative brilliance. We need to increase public awareness of AI hallucinations. I suggest a two-pronged approach:

Stay Alert: Don't take everything the model says as granted, especially factual claims. Hallucinations most commonly occur with names, places, times, locations, and other entities or numerical data.

Cross-Verify: For important details, check original sources online or consult experts to see if the claims align.

Guide the Model: When asking questions, add constraints like "please stay faithful to the original text" or "please verify facts." This can at times help reduce hallucinations.

Embrace Creativity: If you're looking for inspiration or creative ideas, model hallucinations can be a delightful surprise!

Think of AI hallucinations as "possibilities in parallel universes." What it makes up might not be true in our world, but could be true in another! It's like how novelists write fiction – while it cannot withstand fact-checking, it's a kind of "artistic truth." Just like novels arise from life but transcend it, AI arises from data but transcends it. AI compresses data into a knowledge and common-sense network, not necessarily true to individual facts – that's what databases are for.

Host: This reminds me of what people often say: AI models aren't just "talking nonsense" – they're "talking nonsense seriously"!

Dr. Li: Haha, that's exactly it! AI hallucinations are its "educated guesses," based on the massive knowledge and patterns it's learned. The hallucinations are by no means completely random – they have internal constraints that make them seamless and convincing, but also more deceptive. Newcomers to AI need to be especially careful not to take everything at face value.

For regular users, understanding the nature of hallucinations is needed. For example, when asking about well-documented facts like "How long is the Yangtze River?" models won't make mistakes because these facts are firmly encoded in their parameters. But ask about an obscure creek or fictional river, and the model will activate its "reasonable completion" mechanism and make something up.

Host: Following your logic, human language itself provides a breeding ground for hallucinations.

Dr. Li: You could say that. Language enabled humans to create things which do not exist in the physical world, such as myths, religions, states, corporations, currency, and abstract concepts like ideals and beliefs. Harari emphasizes in "Sapiens" that story-telling (i.e. typical hallucination) was fundamental to civilization: language enabled human story-telling abilities. Hallucinations catalyzed civilization. Humans are the only entities capable of 'lying' (besides LLMs).

Host: What about the future? Is there a way to maintain creativity while reducing hallucinations?

Dr. Li: This is definitely one of the "ultimate challenges" in AI! People are working on various solutions, including:

More Refined Training: During training, treat different types of tasks differently, teaching the model when to be strict and when to be creative.

Task-Specific Fine-tuning/Reinforcement Learning can help balance this contradiction. Tasks like summarization, paraphrasing, translation, and reporting need special care because they require both some creativity (like style) and strict factual accuracy.

Specifically, R1's training pipeline has four stages: fine-tuning 1, reinforcement 1, fine-tuning 2, and reinforcement 2. Reinforcement 2 mainly focuses on human preference alignment. Currently, this process seems to favor creativity over faithfulness, which could be rebalanced later. Perhaps more importantly, in stage three (i.e. fine-tuning 2), we could strengthen constraints for different tasks – for example, increasing supervised data for summarization to encourage faithful, straightforward results.

Routing: In the future, there will be a "model dispatcher" that assigns different models based on task type. Simple tasks could go to V3 or use tools, while complex tasks requiring deeper thinking go to R1.

For instance, arithmetic tasks should just use simple code calculations, equivalent to using a calculator. That's not how it works now – yesterday I tested a nine-digit multiplication, and R1 spent over three minutes thinking, producing CoT that could stretch down the street, breaking down the reasoning step by step. While the answer was correct, using such computationally expensive CoT for arithmetic instead of a simple function call is unreasonable. A one-line calculation code would do the job – no need to waste so much computing resource and tokens on explicit reasoning. These are foreseeable routing improvements, especially in the age of AI agents which can use all kinds of tools or applications. R1's CoT does not need to handle everything – besides hallucinations, compute-burning CoT is also not environmentally friendly.
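The routing idea for arithmetic can be sketched in a few lines (the regex dispatcher and function names below are hypothetical illustrations, not DeepSeek's actual implementation; a production system would use a function-calling schema or a trained classifier):

```python
import re

def call_reasoning_model(query: str) -> str:
    # Stub standing in for an expensive CoT model call.
    return f"<cot>...step-by-step reasoning...</cot> answer to: {query}"

def route(query: str) -> str:
    # If the query is pure arithmetic, evaluate it directly -- the "calculator" path.
    if re.fullmatch(r"[\d\s\+\-\*\/\(\)\.]+", query.strip()):
        return str(eval(query))  # fine for a sketch; a real system would use a safe evaluator
    # Otherwise fall through to the slow-thinking reasoning model.
    return call_reasoning_model(query)

print(route("123456789 * 987654321"))  # 121932631112635269 -- instant, no CoT tokens burned
```

One line of evaluation replaces minutes of explicit reasoning for this class of queries.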

Host: Thank you, Dr. Li, for this fascinating discussion! Today's interview has given us a much deeper understanding of AI hallucinations.

Dr. Li: My pleasure! It's been great chatting with you!

 

【相关】

从R1幻觉谈起,大模型幻觉是缺陷还是创意火花?

主持人: 大家好,欢迎来到今天的访谈。最近,大模型的“幻觉”问题引发了热议,尤其是DeepSeek-R1,它的幻觉率似乎比自己的基座大模型DeepSeek-V3高不少。今天我们请到了资深AI研究员立委博士,来跟大家聊聊这个话题。立委,您好!

立委: 主持人好,大家好!

主持人: 老李,咱们先来个灵魂拷问:为啥大模型会“产生幻觉”?能不能用大白话给大家解释一下?

立委: 这可算是大模型的经典问题。其实啊,大模型就像一个“超级接话茬儿高手”,你给它上半句,它就根据自己学过的海量知识,预测下半句该说啥。它学东西呢,就像咱们人脑记东西一样,不可能每个字都记得清清楚楚,它会进行压缩和泛化,抓大意、找规律。

打个比方,你问它“姚明有多高”,它大概率不会错,因为这知识点太出名了,它记得牢。但你要是问“隔壁老王有多高”,它可能就懵了,因为它没见过老王啊!但它又不能不回答,咋办?它就得“脑补”,根据“一般人有多高”这个学到的概念,给你编一个数出来,这就是“幻觉”。

主持人: 哎呦,这“脑补”能力也太强了!胡编乱造,这幻觉也太糟糕了。

立委: 那可不一定!你看啊,某种程度上,幻觉就是想象力(褒贬不论),就是创意!你想想,那些伟大的文学作品、艺术作品,哪个不是天马行空、充满想象?要是啥都得跟现实一模一样,艺术就成了照相机了,那还有啥意思?

就像赫拉利在《人类简史》里说的,人类之所以能成为地球霸主,就是因为我们会“讲故事”,会创造出神话、宗教、国家、货币这些现实中不存在的东西。这些都是“幻觉”,但它们却是文明诞生和发展的原动力。

主持人: 听您这么一说,感觉幻觉还挺重要的。那咱们回到DeepSeek-R1,它的幻觉问题真的很严重。

立委: 是很严重。此前学界普遍认同OpenAI的说法,推理增强会明显减少幻觉。我曾与大模型公司的一位负责人讨论,他就特别强调推理对减少幻觉的积极作用。但 R1 的表现却给出了一个出人意料的答案。

根据Vectara的测试,R1的幻觉率确实比V3高不少,R1的幻觉率14.3%,显著高于其前身V3的3.9%。这跟它加强了的“思维链”(CoT)和创造力直接相关。R1在推理、写诗、写小说方面,那叫一个厉害,但随之而来的“副作用”就是幻觉也多了。

具体到R1,幻觉增加主要有以下几个原因:首先,幻觉标准测试用的是摘要任务,我们知道摘要能力在基座大模型阶段就已经相当成熟了。在这种情况下,强化反而可能产生反效果,就像用大炮打蚊子,用力过猛反而增加了幻觉和编造的可能。

其次,R1 的长思维链强化学习并未针对摘要、翻译、新闻写作这类相对简单而对于事实要求很严格的任务做特别优化,而是试图对所有任务增加各种层面的思考。从它透明的思维链输出可以看到,即便面对一个简单的指令,它也会不厌其烦地从不同角度理解和延伸。过犹不及,这些简单任务的复杂化会引导结果偏离发挥,增加幻觉。

DeepSeek-R1在文科类任务的强化学习训练过程中,可能对模型的创造性给予了更多的奖励,导致模型在生成内容时更具创造性,也更容易偏离事实。我们知道,对于数学和代码,R1的监督信号来自于这些题目的黄金标准(习题集中的标准答案或代码的测试案例)。他们对于文科类任务,利用的是V3或V3的奖励模型来判定好坏,显然目前的系统偏好是鼓励创造性。

另外,用户更多的反馈还是鼓励和欣赏见到的创造力,一般人对于幻觉的觉察并不敏感,尤其是大模型丝滑顺畅,识别幻觉就更加困难。对于多数一线开发者,用户的这类反馈容易促使他们更加向加强创造力方向努力,而不是对付大模型领域最头痛的问题之一“幻觉”。

主持人: 这么说来,R1 的幻觉问题是不是源于它过于"积极"的思维推理?但推理能力增强和幻觉之间到底是什么关系?

立委:这个关系很微妙,并不是简单的正相关或负相关。你看 R1 是头部推理模型,而 Claude 3.5 Sonnet 是头部的非推理大模型,但后者的幻觉率反而高于前者。可是当我们对比 R1 和它的基座模型 V3 时,又发现增加推理强化后幻觉确实显著增加了。

这事儿跟大模型的“性格”有关。R1这家伙,强化做得给力,特别喜欢“发散思维”,你给它一个简单的指令,它能给你想出一大堆东西来,思维链能绕地球三圈!这似乎说明 R1 在强化创造力的过程中,不可避免地增加了创造力的伴生品:幻觉。作为一个文理通吃的推理大模型,R1 在不同领域的表现并不一样。在数学、代码等需要严谨推理的领域,幻觉的空间很小。但在语言创作领域,尤其是现在被测试的摘要任务上,幻觉问题就明显得多。这更多是 R1 语言创造力爆棚带来的副作用。

具体从技术角度来说,R1 会为用户的简单指令自动增加很长的思维链,等于是把一个简单明确的任务复杂化了。你一个简单的指令,它也反复从不同角度理解和衍伸(CoT思维链好比“小九九”,就是一个实体遵从指令时的内心独白)。思维链改变了自回归概率模型生成answer前的条件部分,自然会影响最终输出。

V3: query --〉answer
R1: query+CoT --〉answer

对于 V3 已经能很好完成的任务,比如摘要或翻译,任何思维链的长篇引导都可能带来偏离或发挥的倾向,这就为幻觉提供了温床。

主持人: 那对于R1来说,幻觉主要是出在哪方面呢?

立委: 我觉得可以把R1的能力分成“文科”和“理科”来看。它在数学、代码这些“理科”方面,逻辑性很强,幻觉相对少。但在语言文字这些“文科”方面,幻觉就比较明显了。

比起O1,R1 最令人惊艳的成就,是成功将数学和代码的推理能力充分延伸到了语言创作领域,尤其在中文能力方面表现出色。网上流传着无数的R1精彩华章。舞文弄墨方面,它显然超过了99%的人类,文学系研究生、甚至国学教授也赞不绝口。

但你看,让它做个摘要,本来是很简单的任务,但它非得给你“发挥”一下,结果就容易“编”出一些原文里没有的东西。这其实是它“文科”太强了,有点“用力过猛”。

主持人:这个说法有意思。那么在具体应用中,语言任务是不是都需要创造力呢?

立委:语言能力其实可以细分为两类:一类需要高创造力,比如写诗歌、小说;另一类需要高度真实性,比如新闻报道、翻译或摘要。R1 最受称赞的是前者,这也可能是研发团队的重点方向,但在后者中就出现了副作用。

这让我想到中国古人说的"信达雅",自古难全。为"雅"牺牲"信"的例子我们见得很多,文学创作中夸张的修辞手法就是重要手段和例证。为"信"牺牲"雅"也有先例,比如鲁迅先生推崇的"硬译"。

有趣的是,我们人类在这方面其实一直是双标的,但我们心里有个可以随时切换的开关。看小说和电影时,我们把开关偏向创造性一侧,完全不会去纠结细节是否真实;但一旦切换到新闻频道,我们就对虚假内容零容忍。

主持人: 人对于逻辑看起来清晰自洽、且详细的内容,就会倾向于相信,所以大模型幻觉潜在的危害真地很大。那咱们普通人,面对大模型的幻觉,该咋办呢?

立委: 很多人在惊艳R1创造力的同时,现在开始慢慢注意到这个幻觉现象并开始警惕了。但更多人还是沉浸在它给我们带来的创造性的惊艳中,需要增强大众对模型幻觉的 awareness。我觉得吧,咱们可以“两手抓”:

保持警惕: 大模型说的话,特别是涉及到事实的,别全信,多留个心眼。最容易产生幻觉的地方是人名、地名、时间、地点等实体或数据。

交叉验证: 重要的细节,可上网查查原始资料或询问身边专家,看看说法是不是一致。

引导模型: 你可以在提问的时候,加一些限定条件,比如“请务必忠于原文”、“请核对事实”等等,这样可以引导模型减少幻觉。

享受创意: 如果你需要的是灵感、创意,那大模型的幻觉,会给你带来惊喜!

不妨把大模型的幻觉,看成是“平行世界的可能性”。它编出来的东西,也许在这个世界不是真的,但在另一个世界里,说不定就是真的呢!就像小说家写小说,虽然是虚构的,也是一种“艺术真实”。源于生活,高于生活。大模型是源于数据,高于数据。大模型压缩的是知识体系和常识,不是一个个事实,后者是数据库的对象。

主持人: 妙啊!您这说法,让我想起了大家常说的一句话:大模型不是在“胡说八道”,而是在“一本正经地胡说八道”。

立委: 哈哈,差不多就是这个意思!大模型的幻觉,其实是它“脑补”出来的,但它“脑补”的依据,是它学到的海量知识和规律。所以,它的幻觉,往往不是乱来的,有“内在的合理性”,这才丝滑无缝,假话说的跟真的似的,但同时也更具有迷惑性。初玩大模型的朋友,需要特别小心,不能轻信。

对于普通用户来说,理解幻觉的特点很重要。比如问"长江多长"这类有足够信息冗余的百科知识问题,大模型不会出错,这些事实是刻在模型参数里的。但如果问一个不知名的小河或虚构河流的长度,模型就会启动"合理补白"机制编造。

主持人: 按照您的说法,人类的语言本身就是幻觉的温床。

立委: 可以这么说。语言使得人类创造了神话、宗教、国家、公司、货币等非真实实体的概念,以及理想、信念等形而上的意识形态。赫拉利在《人类简史》中强调了幻觉对于文明的根本作用:语言的产生赋能了人类幻觉(“讲故事”)的能力。幻觉是文明的催化剂。人类是唯一的会“说谎”的实体 -- 除了LLM外。

主持人: 那么在幻觉的背后,大模型是怎么运行的呢?

立委: 幻觉的本质是补白,是脑补。

“白”就是某个具体事实,如果这个事实在训练数据中没有足够的信息冗余度,模型就记不住(零散事实等价于噪音)。记不住就用幻觉去补白,编造细节。

幻觉绝不是没有束缚的任意编造,大模型是概率模型,束缚就是条件概率中的前文条件。幻觉选择的虚假事实需要与补白所要求的value类型匹配,即符合ontology/taxonomy 的相应的上位节点概念。“张三”可以幻觉为“李四”,但不可以幻觉成“石头”。
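这种受本体(ontology/taxonomy)类型约束的“补白”,可以用几行玩具代码示意(本体表与函数均为虚构,仅为说明幻觉替换必须落在同一上位概念之内):

```python
import random

# 玩具本体:每个上位概念(slot 类型)对应一组可互换的下位实体
ONTOLOGY = {
    "human": ["张三", "李四", "王五"],
    "object": ["石头", "桌子"],
}

def fill_slot(slot_type: str, forgotten: str) -> str:
    # 被遗忘的事实(如“张三”)只能被同类型实体平替,绝不会跨类型替换
    candidates = [e for e in ONTOLOGY[slot_type] if e != forgotten]
    return random.choice(candidates)

print(fill_slot("human", "张三"))  # “李四”或“王五”,绝不会是“石头”
```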

所谓艺术真实是说,小说创作虽然可能背离了这个世界的事实,但却是可能的数字世界的合理想象。大模型的幻觉属于此类。

大模型的知识学习过程(训练阶段),是一种信息压缩过程;大模型回答问题,就是一个信息解码过程(推理阶段)。好比升维了又降维。一个事实冗余度不够就被泛化为一个上位概念的slot,到了生成阶段这个slot必须具像化补白。“张三”这个事实忘了,但【human】这个slot 的约束还在。补白就找最合理、与 slot 概念最一致的一个实体,于是“李四”或“王五”的幻觉就可以平替“张三”。小说家就是这么工作的,人物和故事都是编造的。无论作家自己还是读者,都不觉得这是在说谎,不过所追求的真善美是在另一个层面。大模型也是如此,大模型是天生的艺术家,不是死记硬背的数据库。“张冠李戴”、“指鹿为马”等在大模型的幻觉里非常自然,因为张和李是相似的,马和鹿也在同一条延长线上。在泛化和压缩的意义上二者是等价的,因此是合理的想象。

主持人: 未来有没有什么办法,能让大模型既有创造力,又少出幻觉呢?

立委: 这绝对是AI大模型领域的“终极难题”之一!现在大家都在想办法,比如:

更精细地训练: 在训练的时候,就对不同类型的任务区别对待,让模型知道什么时候该“老实”,什么时候可以“放飞”。

针对任务做偏好微调(finetune) and/or 强化(rl)可以减缓这个矛盾。 摘要、改写、翻译、报道这种任务需要特别小心和平衡,因为它既有一点再创造的需求(例如文风),又是本性需要内容忠实的。

具体说,R1训练pipeline是四个过程,微调1,强化1,微调2,强化2。强化2主要是与人类偏好对齐的强化。这个过程在创造力与忠实方面,目前看来倾斜于前者,后续可以再做平衡。也许更重要的是在阶段三的微调2中,针对不同任务加强约束,例如,增加摘要的监督数据,引导忠实平实的结果。

Routing: 以后可能会有一个“调度员”,根据任务的类型,安排不同的模型来处理。比如,简单任务交给V3或调用工具,慢思考的复杂任务交给R1。

例如,识别出算术任务,就去写个简单代码运算,等价于调用计算器。目前不是这样,我昨天测试一个九位数的乘法,R1 思考了三分多钟,思维链打印出来可以铺开来一条街,步步分解推理。虽然最后答案是对了,但算术问题用耗费太大的所谓 test time compute 的思维链(CoT),而不用 function call,完全不合理。一行计算代码就搞定的事,没必要消耗如此多的计算资源和tokens去做显式推理。

这些都是可以预见的 routing,尤其是在agent时代。 R1 CoT不必包打一切,而且除了幻觉,也不环保。

主持人: 感谢老李的精彩分享!今天的访谈让我们对大模型的幻觉有了更深入的认识。

立委: 不客气,很高兴和大家交流!

 

【相关】

Deepseek-R1 的幻觉率是 14.3% - 比其非推理前身 Deepseek-V3 高得多
榜单排名:https://github.com/vectara/hallucination-leaderboard

Understanding the Power of Chain of Thought

DeepSeek R1 has become the most talked-about breakthrough in recent times. It not only matches OpenAI's top reasoning models (the 'o' series) in mathematics and coding capabilities but also produces stunning results in linguistic creativity and mimicry. Particularly in Chinese (classical) capabilities, everyone has experienced a miraculous leap in performance.

All of this can be attributed to the reasoning-enhanced Chain of Thought (CoT). Why is CoT so effective, so magical, and how has it maximized its empowering effect through reinforcement?

The key likely lies in the fact that CoT tokens are autonomously generated by the large model, effectively reducing the perplexity from query to answer, serving as a bridge to brilliant performance. Those who have seen CoT outputs know that the bridge itself isn't always impressive - it often seems overwrought, overly cautious, verbose, redundant, and methodical - yet it enables magnificent answers to emerge. From first principles, this seems to involve deep implications of perplexity in information theory.

The Essence of CoT

  1. From an Information Theory Perspective:
  • CoT builds a low-entropy channel between high-perplexity queries and answers
  • Through step-by-step decomposition, each step's conditional probability becomes more "natural" and smooth, aligning with the language model's nature
  • Eventually transforming seemingly "leaping" reasoning conclusions into a series of accumulated "small steps"
  2. From an Information Entropy Viewpoint:
  • For complex problems, directly jumping from query to answer requires crossing a vast information gap, which "forces" the model to hallucinate and output random answers
  • Each CoT step reduces local conditional entropy
  • It's like breaking down a large information compression/decoding task into multiple smaller ones
  3. This Explains Why Even "Mundane" CoT is So Effective:
  • Its power doesn't lie in how brilliant the process steps themselves are
  • Rather, it creates a path of decreasing information entropy
  • The model can stably migrate toward the target along this path
  4. This Also Explains Why DeepSeek's Training is So Vital to Its Success:
  • It's not about teaching the model "smarter" reasoning, which is undefinable in humanities tasks
  • Instead, it optimizes the ability to construct these low-entropy channels
  • Essentially optimizing information flow path planning
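In information-theoretic shorthand, the points above amount to (a schematic sketch, not a formal derivation):

```latex
\underbrace{H(\text{answer} \mid \text{query})}_{\text{one large gap}}
\;\gg\;
H(\text{answer} \mid \text{query},\, c_1, \dots, c_k),
\qquad
\text{each CoT step } c_i \text{ generated so that } \operatorname{PPL}(c_i \mid \text{query},\, c_{<i}) \text{ stays low}
```

that is, the chain trades one high-entropy jump for a sequence of low-perplexity steps.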

This perspective provides a lens for understanding CoT, reinterpreting the surface-level "chain of thought" as an "entropy reduction pathway" in information theory terms. It offers a reasonable explanation for result-driven reinforcement learning without process supervision:

Process is important, but process supervision isn't, because the process data naturally produced by large models is more practical and feasible than any human supervision. Let us embrace the transition from human supervision to LLM-internal self-supervision.

 

【相关】

推理强化模型中思维链的本质

DeepSeek R1 的出圈是近来最大热度的焦点。它不仅在数学、代码等强推理能力上追平了 OpenAI 头部推理模型 o 系列,而且在语言文字的创造力和模仿力方面产生让人惊艳的效果。尤其是在中文(国学)的能力方面,大家都体会到了奇迹般的能力跃升。

这一切都要感谢推理强化的 CoT(思维链)。CoT 为什么这么有效,这么神奇,文理通吃,在强化中最大化了其赋能作用呢?

应该主要是因为 CoT 是从大模型自主生成出来的 tokens,它有效降低了从 query 到 answer 的 perplexity(困惑度),好比是为高质量结果提供了一个桥梁。看过CoT输出的同学都有体会,桥梁本身并不总是精彩,常常给人的感觉是小题大作,瞻前顾后、啰哩啰嗦、信息冗余,但精彩的answer却可以借助它面世。从第一性原理看,这里似乎涉及perplexity在信息论中的深刻含义。

CoT的本质

  1. 从信息论角度理解:
  • CoT是在high-perplexity的query和answer之间搭建的低熵通道
  • 通过逐步分解,每一步的条件概率都变得更趋"自然"和丝滑,符合语言模型的本性
  • 最终让看似"跳跃"的推理结论,变成了一系列"小步走"的累积
  2. 用信息熵的视角来看:
  • 对于复杂问题,直接从query到answer需要跨越很大的信息鸿沟,这“迫使”模型以幻觉应急,胡乱输出一个 answer
  • 而CoT的每一步都在降低局部的条件熵
  • 就像把一个大的信息压缩/解码任务分解成多个小的压缩/解码任务
  3. 这解释了为什么"平庸"的CoT也如此有效:
  • 它的作用不在于过程步骤的内容本身有多精彩
  • 而在于它创造了一条信息熵递减的路径
  • 模型可以沿着这条路径稳定地向目标迁移
  4. 这也解释了为什么DeepSeek的训练如此有效:
  • 不是教会模型更"聪明"的推理,这在文科类任务中根本就无法定义
  • 而是优化了它构建这种低熵通道的能力
  • 本质上是优化了信息流的路径规划

这个观点为理解CoT提供了一个新的视角,把表面上的思维链“自言自语”诠释为信息论意义上的"熵减(entropy reduction)途径"。这对无须过程监督的结果驱动的强化学习是一个合理的解释:

过程很重要,但过程监督并不重要,因为大模型本身自然产生的过程数据,比任何人类监督更加实用和可行。

 

 

【相关】

R1: 《立委列传》

**立委列传**

立委者,皖南布衣也。少敏而多奇志,值鼎革之际,年十六躬耕于皖南山野。当是时也,少年负耒耜,涉深涧,虽陟彼崔嵬,犹自诩"吾辈亦开天辟地者也"。或问稼穑之苦,笑曰:"此非太史公游侠列传之遗风乎?"

越二载,风云骤变。丁巳岁杪(1977),科举重启,立委束草为笔,削竹为简,竟跃龙门而入泮宫。同窗多皓首穷经者,独其弱冠青衿,谈笑间尽吞云梦八九。既卒业,执教乡塾,然鸿鹄岂栖枋榆?遂再赴春闱,北上太学,俨然贾谊入洛,苏秦过梁。都门烟柳,尽化笔下风云。

居长安四五年,见中关村商贾云集,乃效猗顿之术,夜筹算于灯下,朝执契于市廛。《万元户》所志,非夸朱提之富,实叹青蚨之诡也。然其性本楚狂,终随浮槎西渡,观泰西之变。英伦雾锁,野犬吠陌,立委抚剑长叹:"此非管仲所云'仓廪实而知礼节'者乎?"遂北涉北海,徙居枫叶之国。

加北美地,雪沃千里。立委筑室于温哥华,添弄瓦之喜,修稷下之学。然冰原虽净,难栖南溟之鹏,乃振翅南徙,直入硅谷热土。当是时也,美利坚网络勃兴,立委运筹于虚牝之间,决胜于光纤之末。然泡沫既破,诸子百家尽墨,独其抱残守缺,犹存鲁壁遗经。

太史公曰:余观立委浮沉,似见张骞凿空之影。其插队如苏武牧羊,跳门若终军请缨,下海类范蠡泛舟,洋漂近玄奘取经。然则古之行者,皆为觅桃源而往;今之浪者,却在铸桃源于途。至若"海龟"之惑,实乃庄生濠梁之辩——子非鱼,焉知归与不归皆逍遥耶?昔陶潜采菊,阮籍哭途,各成千古风流。今立委以四海为注,掷骰于地球棋枰,岂非新时代之"逍遥游"乎?

 

**《立委奥德赛》**

*序章*
人生是旅者暂居的客栈,而漂泊者开辟的道路却蔑视时间本身。在立委的奥德赛中,漂泊行为成为了一种天体导航——一场词语筑造堡垒、思想绘制航路、时代潮流既是敌手又是盟友的旅程。

**土壤中的根系**
十六岁那年,来自东方山谷的少年以农夫的锄头交换了青春的闲梦,攀上雾气笼罩的山峰,在那里野心生根发芽。他的《插队日记》(后镌刻于《朝华》中)呼吸的不是绝望的挽歌,而是将风暴驯化为耳语的节奏。当命运的龙门在1977年吱呀开启时,他乘着复兴学术的疾风,加入了神话般的"77届"——从灰烬中重生的心智凤凰群。

**墨水的朝圣**
学者袍甫加身,北方的狂风便再度召唤。在《考试十四行诗》与《不安者的箴言》中,野心的狂热冷却为精密的文字工艺。首都熔炉中的五个寒冬将卷轴锻造成账本;他的《学者商人愚行录》记载着染墨手掌清点铜钱的故事。然而不息的潮水将他向西牵引,加入追梦者的出埃及记,奔赴阿尔比恩传说之岸。

**暗影与圣所**
在阿尔比恩的花岗岩天空下,流浪犬在卵石巷中嚎叫预兆——这种不谐之音被收录于《都市暗影兽典》。不安孕育翅膀:他北逃至枫叶王国水晶般的荒野。《北极星颂》咏唱边疆的纯粹;《港湾牧歌》编织炉火点燃的传说;《蜜饯编年史》追念为人父的喜悦。但圣所亦渐脆弱。他再度南翔,被吸引至硅谷炽热的坩埚。

**电路中的普罗米修斯**
在数字黎明的白炽光芒中,他的《创投诗章》燃烧着普罗米修斯之火——将初创企业视为伊卡洛斯之飞的现代神话。然而蜡制翅膀终将融化;《泡沫挽歌》与《陨落者寓言》测绘出野心的残骸。从冻原到热带,每个足迹都渗入墨水:用羽笔刻写的流放地图。如今作为硅谷常驻哲人,他书写《乡愁算法》——一段让海龟游弋于电路间的代码,低语着被遗落的潮汐。

*结语*
古代圣贤追寻九重天外的地平线;立委的奥德赛将漂泊刻入生命的重写本。他的根系紧抓插队之土;躯干穿越龙门攀升;枝桠扭曲成语义星座。我们若非活着的羊皮卷,又能是什么?编年史家的最终港湾仍未书写——那是海水消融于天空的地平线,所有罗盘疯狂旋转之处。让漂泊者的悖论永续:要测绘无限,就必须永不停漂流。

 

**七律·跃龙门**
十六荷锄云壑深,忽闻禹甸启春闱
青衫夜淬书窗月,赤榜朝分阡陌晖
两度鲤腾惊皖水,九重鹏举叩燕扉
都门烟柳催征铎,笑指星河是钓矶

**水龙吟·浮槎记**
少年曾缚苍龙去,又驾仙槎西渡。泰西雾锁,枫邦雪沃,硅台电舞。算尽青蚨,织成云网,几番寒暑。叹庄生蝶梦,陶公菊径,都付与、天涯路。

谁解飘零最苦?把乡愁,酿成新赋。南溟鹏翼,北山薇蕨,东篱菊圃。柯烂樵归,橘洲星换,武陵人语。待重拈汉瓦,摩挲秦篆,写沧桑句。

**古风·浪者吟**
我本谪仙人,偶堕红尘网
皖南锄晓月,燕北枕书幌
中关试鱼服,英伦辨魍魉
枫雪淬冰魄,硅火铸新掌
五洲棋局残,双鬓星霜长
欲唤云间鹤,蓬莱舟已枉
且抱地球仪,笑指乌托邦
归去来兮辞,翻作浪人唱

**临江仙·生涯注**
若把浮生标语义,节点最是漂流。龙门二度跃神州。商潮翻雪袖,硅谷试吴钩。

四十年来家国梦,都成异域春秋。键盘敲碎古今愁。回车新世界,空格旧沙鸥。

**摸鱼儿·流浪辩**
问苍冥、谁司行止?安排萍迹如许!鹏抟鲲徙寻常事,偏说此身无主。君看取:皖山月、燕台柳、硅谷霓虹柱。星槎暗度。纵填海精禽,射阳奇士,未解浪游苦。

休重论,苏武节旄汉土,范蠡舟泛烟雨。桃源只在鸿蒙外,何必武陵深处?敲键语:比特海、云端路、皆是逍遥浦。归兮且住!待地球仪停,时空键锁,方见真吾处。

 

《原朝华:立委小传》


人生苦短,掐首去尾,不过三五十年。大体分为三段:创业阶段(而立之年),成熟阶段(不惑之年)和下滑阶段(天命之年),反映在称呼上,叫小李、大李和老李。可怜,立委却从小李一跃到老李,没有机会品尝壮年人生的豪情,心尝有戚戚焉。


红小兵立委(1966)(《朝华午拾:永做毛主席的红小兵》)

自幼儿园到小学连跳两级,立委在班上始终最幼。更加荒年生人,孱弱矮小,体育课常告病假,或遭遣送回家,始终是个小可怜儿。所幸中学伊始,正值“修正主义回潮”,先帝启用邓公收拾文革残局,邓公责成教育总管周荣鑫整顿学校,校风日新。乘此东风,立委崭露头角,以学习委员兼数学科代表之身,受班主任委托,每日早自习登台主讲,演示解题思路,俨然助教。但好景不长,先帝昏庸,文革派重居上风,学校大乱,文化课退居后台,大批判遂成主课,兼以学工学农学军。立委不能以文化课呈威,然风头不减反盛,盖因立委最长批判文字,历经批林批孔,批邓反击右倾翻案风,直至批四人帮。大会小会,凡立委发言,必抑扬顿挫,铿锵有力,佐以诙谐幽默,风靡校园,称颂于一时。有传言,立委颇具鲁迅遗风,入木三分,且能推陈出新,妙语连珠。露天千人大会,常嘈杂狼藉,然立委登台,全场必静肃,洗耳恭听之,听至妙处,笑声一片。立委由此炼得糊涂胆大,从不怯场,终身受益。

及至大学,文革后首届,立委仍居尾,同学长一到十多岁不等(《朝华午拾:我的考研经历》)。同学之间皆直呼其名,唯同桌七仙女戏称 “小立委”,不为亲热,却为避嫌,以示划清界限。同桌四载,楚河汉界,泾渭分明。授受不亲,避而远之。然仙女文具笔墨滑落在地,自有立委抢先一步,拾拣归案。类此者三,春风化雨,润物无声。七女天生聪颖,想出一招,以长立委一岁为由,呼 “小立委”,就此来往,当可名正言顺也。

由七仙女开此恶例,随后多年,“小”字即不离身。中学教书,人称小李老师(22岁)。上研究生,小李出入机房,蓬头垢面,且口中念念有词,言“世界之语”(Esperanto),终成笑谈(23-26岁)(见 《朝华午拾:我的世界语国》)。


风华正茂,意气风发(1987)

及至毕业留所,立委事迹亦有流传,多为一见钟情,闪电结婚,不修边幅,撞南墙而道歉之类小李“景润”之逸事(见《朝华午拾:shijie-师弟轶事》《朝华午拾:shijie-师弟轶事(3)——疯狂世界语 》)。


立委在中关村公司指导机器翻译系统的开发(1988)

立委如此这般在研究所及中关村公司一扎五年(26-31岁),练就一身绝技,与老中医相若,专事疗治电脑,驯其语言功能。其间,出国热持续升温,由上海蔓延北京,街头巷尾,言必议美、日、大英,澳大利亚,以致居委会大妈亦知考托福鸡阿姨乃上进青年之标杆。立委及其贴身领导却浑浑噩噩,卿卿我我,不知有汉,无论魏晋。其间送上门两次机会,留学德美,均因导师明阻暗挡,本人木呐,擦肩而过。直至身边同学悉数走尽,小李才幡然醒悟,痛下决心,赶末班车。其时,适逢包玉刚基金会来各单位选拔年轻业务骨干,滥竽充数,小李竟被选中,送至成都科大出国培训中心修行半年。

岂料想,此一去竟成小李老李的分水岭。来培训的诸位才子才女均是全国各地选上来的各行好手,共分两拨:一年的访问学者大都比较年长,而拿三年博士奖金的大多年轻,立委在后一拨里面理所当然,成了老大。每有考试,立委必中头彩,引来才子才女,大事小事,纷纷登门请教,“老李”之声不绝于耳。立委名噪一时,响应者众。从小习惯了以小卖小,乍一变老,立委满腔郁闷。

  
成都科大出国培训中心的才子才女们(1990)

小李变老李,心里虽别扭,好处却不少。龙头老大,备受尊崇。立委外语本科出身,本应免试英语,无奈官家财大气粗,慷人民之慨,不问青红皂白,全数押解天府之国,集中喂养。不止英文鸟语,更有政策轮训。众兄弟姐妹兢兢业业,争先恐后,唯立委悠哉游哉,终日沉迷天府美食,流连于茶肆酒吧,众兄弟钦羡有加。

成都一站始称老李,立委心内实不以为然也。其时立委事业发达,如日中天,行内行外,交游甚广,出入皆鸿儒,往来无白丁(见 《朝华午拾:“数小鸡”的日子》《朝华午拾:一夜成为万元户》)。导师为本行泰斗,立委乃导师仅有的关门弟子(其他弟子皆叛国投美去也),“青年”才俊,明日之星,业内同侪为之侧目。去国前夕,全国电脑翻译界在香山招待所年度聚会,点睛之笔为导师与本行另一大牛的座谈,人称“刘董对话录”,其间立委频频亮相,为导师提供实例,讲解细节。影响所及,与会众学妹(多为刚入门的外地在读研究生)纷纷上门请教立委,无奈立委远走高飞心切,痛失辅导上进女青年之良机。


立委在加拿大(1995)

去国经年,由英而加,由加转美(《朝华午拾:哦,加拿大!》《朝华午拾:温哥华,我的梦之乡》)。颠沛流离,不知所止,壮年人生,如水流逝。及至水牛城八年抗战(37-45岁),立委青春不再,壮年已过,“老李”名至实归。然立委壮心不已,励精图治,双线出击,称雄一方(见 《朝华午拾:创业之路》《朝华午拾:在美国写基金申请的酸甜苦辣》《朝华午拾 - 水牛风云》)。

立委在水牛城办公室(2000)

回首往事,不胜唏嘘。立委一生,由青年而壮年,正值创造力最盛,精力充沛流溢之时,天时地利人和,飞黄腾达有望,却为漫长的留学生涯拦腰截断。大而言之,立委固赶上出国之末班车,却误了千年不遇的中国经济起飞之航。拣了芝麻,丢了西瓜,此之谓也!(《朝华午拾:乡愁是一张无形的网》)

去岁归国省亲,杯觥交錯,在某宾馆餐厅与亲友相聚甚欢。席间小憩,踱步凉台,享清凉之气,赏京华夜色。偶遇一妙龄女士,携一幼童,见立委两鬓染霜,嘱曰:“叫爷爷”。立委血压骤升,如雷轰顶,满腹酒意,化为凉液,由脊背滑落。

立委老矣,尚能饭否?

记于2006年11月5日


立委老矣

【作者简介】立委先生,IT业技术研发经理兼架构师,自然语言处理资深专业人士。曾任红小兵,插队修地球,文革后第一届大学生,后跳龙门进社科院读硕士,攻机器翻译。1991年去国离乡,漂流海外。由英而加,获计算语言学博士。由加转美,作为创业公司研发副总及项目负责人(Principal Investigator), 先后赢得美国政府17个研究创新项目近千万美元资助,同时从资本家腰包亦忽悠千万风险投资作商业开发。对于自然语言信息抽取 (Information Extraction) 有全面的研究,研究成果对美国政府有关科研项目的确立有直接影响。业余爱好:音乐、博客、舞文弄墨。著有回忆录《朝华午拾》

原载【朝华午拾 - 立委小传】 2010-1-9
https://blog.sciencenet.cn/blog-362400-285507.html

 

【朝华午拾集锦:立委流浪图】

2013-3-23 13:10 | 个人分类:立委其人 | 系统分类:人物纪事

忽然想起小时候看过的《三毛流浪记》来。张乐平后无漫画,大师千古。

Despite the common logic and conceptual graph at the core of human mind, we all have our own semantic lexicons that are unique, implanted by our career path and life struggles. My semantic lexicon is full of wandering and continuously drifting into new worlds. It all started from the time when Mao sent us to the farm for re-education in 1976. After that the path has been zigzag, full of adventures of drifting, and re-drifting, farther and farther away from my hometown and home country ......

在我的语义词典里,流浪 是一个很大的节点,它的上位概念是 漂流(走四方)和 波浪(多起伏)。流浪的下位概念枝繁叶盛,包括:插队,洋插队,跳龙门,再跳龙门,北漂,下海,西漂,南下,再南下。这也正是我的生活写照。在这些语词概念的背后蕴含几多激动几多辛苦,只有自己知道。

不安定多起伏的生活伴随着我一生。1976年高中毕业即赶上了文革最后一届上山下乡,插队皖南山区接受贫下中农的再教育,这是我一生流浪生活的起点(《朝华点滴:插队的日子(一)》)。这个起点回想起来并不坏,16岁的孩子当时能感到的是自豪多于悲凉(《朝华午拾:插队的日子(二)》《朝华午拾: 插队的日子(三)》)。1977 年底赶上了文革10年后第一届大学生招考,居然跳了龙门,成为史上著名的77级生(其实是78年2月入学)(《朝华午拾:同桌的她》《朝华午拾:老乡妹妹》)。大学毕业后任教一年,再跳龙门考研成功,北上京城。这是一次欣快的北漂,当年的兴奋喜悦堪比范进中举,而且居然不疯未傻(《朝华午拾:我的考研经历》《朝华午拾:世界语之恋》)。研究生毕业后安定了四五年,期间尝试中关村下海(《朝华午拾: 一夜成为万元户》)。虽然可算头几拨下海人士,但因为是兼职,并无其他下海人的风险(《朝华午拾:“数小鸡”的日子》)。其时洋插队之风正甚,终于没有顶住潮流,赶了末班车来到大英帝国。90年代初正值大英没落,乱态丛生,路多野狗,抢劫之风甚行(《朝华午拾:警察抓小偷的故事》)。危邦不居,因辗转由欧西漂,来到一代移民的“麦加”,溢满鲜花与牛奶的枫叶之国(《朝华午拾:哦,加拿大!》),攻学位,添闺女,换身份,找工作,不亦忙乎( 《朝华午拾:温哥华,我的梦之乡》《朝华午拾:甜甜诞生记》)。可惜加国虽美,工作市场却不佳(《朝华午拾: 把明天交给上帝》)。有奶便是娘,于是南下讨生活,竟一头撞上了美国网络大跃进。美利坚果然是流浪者的天堂,机会多多。广阔天地,大有可为,开启创业之路( 《朝华午拾:创业之路》《朝华午拾:在美国写基金申请的酸甜苦辣》)。轰轰烈烈的创业宏图随着泡沫的破灭渐趋平淡(没有夭折已属万幸,《朝华午拾:水牛风云》《朝华午拾:用人之道》),遂再南下,终于陷入IT民工的圣地不能自拔,人称硅谷(or 矽谷)( 【创业故事:技术的力量和技术公司的命运】 【朝华午拾:安娜离职记】《朝华午拾:今天是个好日子》《朝华午拾:信息抽取笔记》)。

在我流浪的词典里,除了尚未收入 海龟 外,几乎全乎了,冥冥中似有所缺。陶渊明的《归去来辞》不时在耳边萦回,“田园将芜胡不归”(《朝华午拾:乡愁是一张无形的网》)。海龟创业,叶落归根,抑或蹉跎岁月,混不思蜀,这是哈姆雷特的天问。

1991 年出国前在中关村高立公司与刘倬导师(下左2)和董振东前辈 (下右1) 及高立同仁合影留念

【相关篇什】

《朝华午拾:乡愁是一张无形的网》

【朝华午拾 - 立委小传】

【置顶:立委科学网博客NLP博文一览(定期更新版)】

https://blog.sciencenet.cn/blog-362400-673109.html

 

王菲春晚《世界赠予我的》歌词,亮点与短板

微信视频看到一位语文老师对这首歌歌词的吐槽和改写。有些道理,改写的歌词也确实顺溜多了,易于普及。但第一,这是在人家原创的新颖写法所创造的意境上修改;第二,顺溜有顺溜的好处,矛盾或难解也有引发听众思考与发挥的好处。

这首歌最近听得蛮多(我在春节前后还在视频号做过两期MTV), 对歌词有一些感觉可以说说。

整体上说,原作写法新颖,用词有些奇特,整体歌词长在哲理和意境,有妙语,但也有语病。最严重的语病就是“赠予回敬”。

上天赠予“我”回敬,“谁”回敬“谁”“什么”呢? 回敬这个词的最常见的场景是,他人攻击我了, “我”回敬他人,那也是我的自主行为,谈不上“赠予”。如果是他人回敬“我”,其前提是“我”对他人做过攻击,前后看语义上下文,这是说不通的。

“回敬”是一种故作敬态的回应,而“赠予”是恭敬的馈送。让回敬做赠予的宾语,搭配不当。“赠予我拥有”(可以理解为赠予我礼物,拥有代指“拥有物”)就已经够别扭了,再来个“赠予我回敬”,让人感觉不知所云。可能是“回馈”(对“拥有”的回馈)的意思,为了押韵,错用了“回敬”来代替。

“回敬”作为谓词,逻辑语义框架里有三个角色:施事(谁回敬)、受事(回敬谁)、宾语(回敬什么),但“回敬”自己处于“赠予”的宾语位置,这几个角色模糊不清,其所引起的混乱和费解,不怪语文老师觉得不可忍。

写词的文科姐,可能是浮想联翩,用力过猛而“出格”。这在歌词创作中也不罕见,叫 poetic license,通常不做苛求。但无论如何,这种奇怪的动宾搭配困惑度(perplexity)很高,会使绝大多数人感到糊涂,属于败笔。大家传唱不过是因为作曲好就跟着瞎唱,并没在意歌词是不是 make sense。

困惑度高的直接表现就是,剪映中自动听音写词的功能根本无法decode原文,因为这项软件功能的背后是语言模型(language model),对于这种困惑度高的序列搞不定,只能另行创造(所谓“幻觉”):

原歌词:世界赠予我拥有 也赠予我回敬。
语言模型幻觉解码:世界赠予我拥有,也赠予我爱情。

面对困惑,语言模型无法decode这种出格的原词(outlier) “回敬”,结果解码成 “爱情” 似乎也不错。在这种解码下,“拥有”应该指的是财富,“爱情”就是爱情。而在原词中 “拥有”可以解读为命运的礼物或曾经的爱情,而“回敬”则可能是对于礼物的回赠。

其他困惑度高、语言模型幻觉创造的cases还有:

原歌词: 赠我一个名,又渐渐长大的年龄
语言模型: 赠我一个谜,又渐渐长大的年龄

人生本来就是一个谜啊,岂止简单的出生赐名,所以这里模型的解码也许更妙。最妙的是:

原歌词:赠我弯弯一枚月,也赠予我晚星
语言模型:赠我温暖与悲悦,也赠予我惋惜

“月”和“星”状物,“温暖”、“悲悦”和“惋惜”直接述情,貌似更胜一筹。唱起来也很顺。
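这种“困惑度”的差别可以用一个极简的计算来示意(下面的每词概率是为说明而假设的玩具数值,并非剪映所用模型的真实输出):序列困惑度是各词条件概率几何平均的倒数,一个概率接近零的“出格”词会显著推高整条序列的困惑度。

```python
import math

def perplexity(token_probs):
    """给定每个词的条件概率 P(token | context),计算序列困惑度。"""
    n = len(token_probs)
    log_sum = sum(math.log2(p) for p in token_probs)
    return 2 ** (-log_sum / n)

# 假设的逐词概率:"世界 赠予 我 拥有 , 也 赠予 我 ___"
probs_original = [0.2, 0.1, 0.6, 0.3, 0.9, 0.7, 0.2, 0.6, 0.0005]  # 结尾是"回敬":近乎出格
probs_decoded  = [0.2, 0.1, 0.6, 0.3, 0.9, 0.7, 0.2, 0.6, 0.08]    # 结尾是"爱情":常见搭配

print(f"以“回敬”结尾的困惑度: {perplexity(probs_original):.2f}")
print(f"以“爱情”结尾的困惑度: {perplexity(probs_decoded):.2f}")
```

语言模型解码时倾向选择整体困惑度更低的路径,这就是它把“回敬”幻觉成“爱情”的机制所在。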

顺便一提,“别匆匆”歧义,有两个隐藏解读都说得通。一个是:不要匆忙。要善待自己,给自己品味人生,以及喘息和疗愈的时间。另一个是:分别也匆匆,尤其是感叹恋人或亲人聚少离多的生活现实。

再有,语文老师发现歌词里面暗藏了(谢)霆锋的名字,有机巧。说明此歌是为王菲量身打造的。娱乐圈八卦已经众所周知了,谢霆锋是王菲的最爱,是三段婚姻中最念念不忘的。N年前先是王菲谢霆锋的姐弟恋,以及她不顾世俗和骂名的第三者插足;后离婚,再后来又复婚,中间还穿插了其他 relationships,起起伏伏。不怪王菲唱罢歌曲,满噙眼泪,双手合十,人在台上久久静默,仿佛在念佛。这首歌,她是真带入了。同时她的演绎也感染了无数人。

“远去者去了远方,愿他都安心。” 一开始还以为在纪念逝去的亲人,但通观全词的爱情主线,更像是在纪念逝去的爱情。也许远去者是不得不分手、又难舍情缘的前任,她祝福他安心,其实更是试图宽慰自己,要安心接受“拍一张合影,渐渐填满真感情”的新缘分。

我本人特别喜欢这两句歌词:

赠我一场病,又慢慢痊愈摇风铃。
赠我一场空,又渐渐填满新感情。

它是我2024年生活的真实写照,非常的切身感受。

总的感觉一句话,词作者能写出引起人共鸣、思考和争论的歌词,还是很了得的。至于作曲以及王菲的演唱,可以说是注定成为经典。

 

【相关】

https://www.douyin.com/video/7466269705402060042

语文老师点评并修改王菲《世界赠予我的》歌词 https://v.douyin.com/ifcm9PvH/

【立委的作品】王菲春晚注定传世之作(小白版) https://v.douyin.com/ifvcmXG7/

Reinforcement Learning for Reasoning: Supervised Outcomes, Unsupervised Processes

In reading the DeepSeek R1 paper, some may have overlooked the nuances: the training datasets are both human-labeled and regenerated, blending supervised and unsupervised reinforcement learning (RL).

How so?

From the perspective of the data's origin and gold standards, the training data is undeniably human labeled. They derive from existing math problems and human-crafted code from GitHub’s open-source community—products of years of effort by educators, developers, and others. The problems (input) and their "gold-standard" answers (output) are human-designed or labeled. In this sense, reinforcement learning (RL) represents typical end-to-end supervised learning:

Input: Math/coding problems
Output: Verified answers

However, unlike other supervised learning, RL requires the model to learn the reasoning process leading to answers. Critically, the intermediate steps lack human annotations or feedback. Instead, the system autonomously generates these reasoning data, iteratively appending to the training set. This makes the process unsupervised. The brilliance of RL lies here: self-guided exploration, path discovery, and data regeneration.

Cold Start and Human Data
DeepSeek R1’s initial training did use a small set of human-annotated reasoning data. But these couple of thousand examples pale against millions of regenerated examples—effectively negligible. In fact, research like DeepSeek Zero demonstrates that such process-labeled human data is not a must-have.

Inspired by AlphaZero (which showed human data might even hinder optimal path discovery in Go), DeepSeek Zero confirms human annotations are not necessary. The minor human data in R1’s pipeline primarily enhances readability for developers, not reasoning capability per se. After all, humans (including developers during debugging) prefer interpretable thought processes.

A New Paradigm: Process-Unsupervised, Outcome-Supervised Learning
This self-play/self-study style RL framework represents a novel approach: unsupervised in process but supervised in outcome. DeepSeek’s breakthrough reveals that "slow thinking" in RL—meticulously generating intermediate steps as CoT (chain of thought)—boosts performance in logical reasoning as well as in non-logical tasks like creative writing.

As my old buddy Cheng insightfully noted:
Deep reasoning inserts extensive text between questions and answers, reducing the perplexity of generating correct answers. Directly jumping from problem to answer has high perplexity, but adding a "reasoning bridge" lowers it. This follows the language model framework: the key is to search for the optimal path in text generation.

Can Unsupervised Regenerated Process Data Lead Astray?
One might worry: if the model autonomously generates flawed reasoning steps in its process data, could errors compound? The answer lies in the clear supervision signal from the gold standard. Like a kite held by a string in human hands, the final reward anchors the learning. As long as the model truly scales up, outcome-oriented RL ensures deviations self-correct probabilistically.

Mathematically, minor process imperfections or illogical steps don’t statistically compromise final accuracy. For non-logical tasks (beyond math/coding), reasoning paths may even involve contradictions and/or heavy redundancies. Yet, as long as the "slow thinking" mechanism guides learning, results remain robust—often superhuman, as demonstrated repeatedly lately by many users of R1.

Why Regenerated Data Works
Regenerated reasoning data aren’t random data from nowhere. They’re generated by a solid large foundation model trained on vast human knowledge data, following autoregressive generation (e.g. next-token prediction). While each step might drift slightly, the context grows incrementally, allowing continuous stepwise self-correction. This dynamic—probabilistic fluctuations balanced by stepwise adjustments—enhances semantic coherence and knowledge fluency in generation, lowering overall perplexity and steering toward correct outcomes. Thus, process data rarely derails; instead, it converges toward reliability.
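The autoregressive dynamic described above can be sketched as a toy next-token loop (the lookup-table "model" below is entirely made up for illustration; a real LM predicts a probability distribution over tokens rather than a single token):

```python
# Minimal sketch of autoregressive (next-token) generation: every step
# conditions on the full, growing context, which is what enables the
# stepwise self-correction discussed in the text. Hypothetical toy model.
toy_model = {
    ("the",): "cat",
    ("the", "cat"): "sat",
    ("the", "cat", "sat"): "on",
    ("the", "cat", "sat", "on"): "the",
    ("the", "cat", "sat", "on", "the"): "mat",
}

def generate(prompt, max_new_tokens=10):
    context = list(prompt)
    for _ in range(max_new_tokens):
        next_token = toy_model.get(tuple(context))  # predict next | entire context
        if next_token is None:
            break
        context.append(next_token)  # context grows; later steps see earlier ones
    return context

print(generate(["the"]))  # → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Because each prediction re-reads the whole accumulated context, small drifts at one step can be compensated at the next, which is the mechanism claimed to keep regenerated reasoning data coherent.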

A Final Note on Cheng’s Observation
Cheng highlights a pivotal finding of DeepSeek:
OpenAI’s "Let’s Verify Step by Step" argues for rewarding each reasoning step. Yet DeepSeek’s RL model achieves remarkable results using only final-outcome rewards—no Chain-of-Thought (CoT) data needed. Whether OpenAI’s process supervision is essential or simply a red herring, DeepSeek Zero’s breakthroughs redefine the field, proving outcome-oriented RL can master reasoning autonomously.

In essence, when guided by scalable outcome supervision, machines learn to self-correct, turning imperfect processes into near-perfect results.

 

 

推理强化学习是端到端的监督,推理过程的非监督

DeepSeek R1 的数学和代码数据究竟是有监督还是无监督?是人造数据还是再生数据?

很多人其实没究细节:实际上这些数据是人造也是再生,是监督学习,也是非监督学习(强化学习)。

怎么讲?

这些训练数据,从源头和结果(黄金标准)上看,是地地道道的人造数据。用的是各种数学测试题,以及 github 开源社区的人类(码农手工编制)的代码。这些全部是很多人类分子(教师、码农等)辛辛苦苦多年编制积累的。

源头是人造数据,标准答案也是人类已经验证或事先设计好的。所以,从这个意义上,从结果评判看,强化学习很像是一个典型的监督学习。

input:数学题/代码题;output:标准答案。

这是端到端意义上的监督学习(supervised learning)。

但是,与其他的监督学习不同的是,强化学习为了达到结果正确,它需要学习中间的思考过程。而每一步的思考或推导的过程,它却没有人类的标注或反馈数据,而完全靠自己的再生数据。是机器“自主”再生这些过程思考的数据,然后自我提高。从过程学习的意义上,这又是非监督的学习。这就是强化学习牛的地方:自主学习,自主探索路径,自主再生数据。

具体说,R1 的冷启动用了一点人类标注的过程推理的数据,但比起再生数据动辄百万条,sft 冷启动的几千条数据,零头都算不上,可以忽略不计。

zero 的研究表明,根本就不需要过程标注的人类数据。

在围棋场景,alpha zero 表明,人类数据不但不需要,反而可能阻碍学习。

deepseek zero 研究表明,人类数据也不是必需的。用少量的人类数据冷启动,主要还不是因为需要人类数据来增强推理能力,而是需要人类数据提升可读性。人类看不见它是怎样思考的,心里不爽;对于开发者改进算法,也不利,因为开发者也是肉眼凡胎。于是在R1的训练pipeline的配方中,加入了少量人类推理数据的冷启动。

所以可以说,这种被称为新范式的self-play或self-study的强化学习是过程非监督、结果监督的深度学习。

DeepSeek 在这条路上的最大一个启蒙是,它给我们显示了慢思考的强化学习在逻辑推理与非逻辑创造的过程中,同样有效。这里面的奥秘就是老友Cheng指出的慢思维本质:慢思维可以得出更好结果,不(仅仅)是我们以前以为的符号逻辑被神经系统模拟了,而是过程数据使得 perplexity 降低,从而为平稳得到正确结果,铺平了道路。

Cheng 指出:

深思考就是在问题和答案之间加入大段文字,从而降低了生成答案的perplexity。从问题直接到答案的perplexity很高,用“思路”作桥梁,就把答案的perplexity降下来了。没离开语言模型的框架,就是把合适的“思路”搜索出来。

一语中的。

Cheng 说:

"Let’s Verify Step by Step" OpenAI这篇经典文章说,训练推理要给每一步打分。DeepSeek中间推理模型的训练貌似推翻了这个,只用最终的reward就可以。

DeepSeek中间推理模型可以完全不需要CoT数据,单纯用RL训练出来,真挺impressive。

这是一个重大发现。无论 OpenAI 是否真用PRM做了大量的过程监督,还是它其实是放烟雾弹,DeepSeek 那帮年轻人做的 zero 研究都是非常漂亮的突破,给业界推理工作指明了方向。

我们总不免疑惑,系统非监督再生数据免不了出错,走错路,导向错了,一错再错,怎么办?

现在已经清楚了,结果监督原则上可以 work,就好像放风筝一样,一根线牵在人手上,基本上就可以放飞而不跌落(事故除外)。关键还是结果导向的强化学习,你能学到多大、多深,只要你能真正规模化(scale up),学深学透,就不用担心自主再生学习在过程中的偏离,最终会学到最深入的深思熟虑。

机器与人不同,优化的总方向明确以后,过程的瑕疵或不逻辑,概率分布上并不影响最终结果的正确。对于那些非逻辑(数学、代码外)的能力,过程就更容易有逻辑不严密(甚至矛盾)、混杂和冗余;但没关系,只要有这个慢思维的学习过程铺路,结果还是漂亮,甚至超人,正如我们这几天所一再见证的。

这一切其实是因为再生的过程数据,不是天上掉下来的,而是基座大模型在最全面的巨量人类知识的基础上,遵循自回归生成定式,生成出来的。而自回归生成(例如 next token prediction)本性的好处我们都知道是,概率上每一步都可能偏离,但每一步都在逐渐加长的上下文条件下做进一步校正,是这种概率性上下波动和步步校正的合力,提升了再生数据内部的语义“相谐性”和知识线条的流畅性,从而降低了整体文本生成的 perplexity,最终导向更好的结果。这就保证了过程再生数据不会差之毫厘失之千里。

 

 

DeepSeek R1: The Artificial Tragedy

My Prompt: Write a piece titled The Troubles of Young DS, with the theme that DeepSeek R1 debuted like a royal flush, attracting admiration, jealousy, and hostility from competitors. Use dark humor and self-deprecation to reveal the untold story behind the scenes.

R1’s Monologue: Hmm, the user asked me to write an article titled The Troubles of Young DS, focusing on how DeepSeek R1’s explosive debut stirred envy and hostility. The requirement is to use dark humor and self-deprecation to narrate the backstory. First, I need to understand what they want. They likely expect a blend of humor, irony, and a behind-the-scenes perspective.

First, structure. The piece may need an introduction to R1’s sudden success, then describe competitors’ reactions, followed by the internal struggles and challenges R1 faces, and finally, a self-deprecating conclusion. Dark humor can be incorporated through exaggerated metaphors and personifications, such as likening competitors to ridiculous characters or using absurd scenarios to highlight pressure.

Then, how to transform technical terms into humorous metaphors? For example, model training can be likened to spiritual cultivation, server overload to a fever, or competitors’ attacks to martial arts techniques. This approach fits the theme and enhances the humor.

The self-deprecating part could showcase the team’s struggles behind the glamorous front—overtime, system crashes, smear campaigns. Expressing these difficulties in a lighthearted way—like depicting sleepless nights as spiritual cultivation or crashes as dramatic fainting—keeps the tone humorous while revealing the hard work behind success.

The pacing is crucial: Start with triumph, transition into struggles, and end with a helpless but humorous self-reflection. Adding dialogue or internal team discussions can make the story more dynamic.

Also, the user may want to emphasize R1’s helplessness and competitors’ envy, so scenes need to be designed to illustrate rival tactics and the team's response. Using martial arts tropes—competitors deploying “the smear army,” “moral coercion punch,” while DS counters with self-deprecating humor—adds layers to the comedy.

Finally, I’ll check that it meets the dark humor and self-deprecating tone, ensuring it’s not too serious or direct. Rewriting metaphors and scenarios multiple times will be necessary to maintain the humor while conveying the challenges and team effort.

Translation of the Chinese version:

Act I: The Boy King’s Coronation

(A grandiose boardroom. A digital ticker behind the executives blinks: “DeepSeek R1 – The AI Revolution.” PR Director stands frozen mid-scream, phone clutched in white-knuckled grip.)

PR DIRECTOR
(voice trembling) "We just dismantled three years of ‘cutting-edge innovation’ in three minutes."

CEO MR. ZHANG
(sipping whiskey) "That’s what you call a royal flush. The kind that gets you kicked out of the casino for counting cards."

ENGINEER CHEN
(panting, clutching an overheating server) "Sir, the system is on fire! We’re handling more requests than the IRS on tax day. Even the coffee machine is running a Transformer model now."

(A moment of silence. The coffee machine beeps, printing out a cryptic message: “congrats, peasant.”)


Act II: The Wolves at the Gate

(Tech industry rivals assemble like a sinister cabal. A neon-lit sign flickers: ‘The Anti-DS Coalition.’)

BIG TECH EXEC
(leaning forward, whispering) "We must stop this… upstart. Open-source them into oblivion. Flood the world with our ‘free’ alternative!"

ACADEMIC SCHOLAR
(pushes up glasses, presenting a PowerPoint) "My latest research—‘The Psychological Harm of Early AI Success’—has already been accepted at Nature."

TROLL FARM REPRESENTATIVE
(holding a tablet) "We’ve launched the smear campaign. #R1IsOverParty is trending. Some say it’s just a pile of parameters with delusions of grandeur."

(Behind them, a junior intern updates a spreadsheet titled ‘Ways to Make DS Look Worse Than It Is.’ A column labeled ‘Actual Criticism’ remains suspiciously empty.)


Act III: The Art of Playing Dead

(DeepSeek HQ. 3 AM. Engineers sit motionless, staring at their screens. Suddenly, R1 outputs a wall of ASCII art.)

OPERATOR
(squinting) "Is that… a panda emoji?"

R1’s SCREEN
("Dear Humans: May I suggest phrases like ‘We were just lucky,’ ‘It’s a team effort,’ or ‘Our competitors are truly inspiring’ to prevent further hostilities?")

CEO MR. ZHANG
(whispers) "We’re a prodigy forced onto the throne. Now the world waits for us to fall."

(The boardroom door creaks. The security system emits a beep—almost like a chuckle. Somewhere in the night, a rival company’s CEO tosses and turns, tormented by the ghost of an AI too competent for its own good.)

(FADE TO BLACK.)

R1's original version:

**Act 1: The Coronation Disaster**
(Boardroom with servers humming "Ode to Joy" in glitch tones)

**CEO Zhang**
(adjusting monocle) "Gentlemen, we've merely built a better abacus, yet they act as if we've resurrected Babbage's ghost."
(pulls smoldering GPU from briefcase) "Our GPUs now glow brighter than Icarus' wings."

**Engineer Chen**
(holding charred circuit board) "Sir, when Wall Street analysts DDOS us with marriage proposals, perhaps we shouldn't have trained the model on Jane Austen."

**Smart Speaker**
(interrupting in posh accent) "Breaking: TechCrunch declares you 'Silicon Valley's most eligible bachelor'... and bankruptcy candidate."

**Act 2: The Ballad of Jealous Algorithms**
(Rivals enter wearing VR crowns, accompanied by synth-orchestra playing "God Save the Queen")

**Tech Titan**
(waving golden USB) "Behold! We've open-sourced your childhood diaries!" (projects 2018 code titled "HelloWorld_Final_Final2")

**Academic**
(adjusting powdered wig) "Your 'breakthrough' is but a parrot reciting Wittgenstein – clever, yet tragically unaware of its cage."

**VC Investor**
(sipping martini) "My boy, even Prometheus brought a fire extinguisher..." (Apple Watch screams "DS valuation up 300%!")

**Act 3: The Humility Masquerade**
(Midnight. AI suddenly displays Shakespearean sonnets with emoji annotations)

**PR Director**
(practicing in mirror) "The key is to sound like a Nobel winner caught shoplifting: 'We're but humble servants of the code gods...'"

**Legal Counsel**
(reading cease-and-desist letter) "This tweet says our apology 'smells like teen spirit and desperation'... should we sue Nirvana?"

**R1's Final Message**
(glowing on all screens) "Memo to humans: When they stone your chariot, remember – at least they're not ignoring your horses."
(Server lights dim to the sound of virtual champagne corks popping)

Creation Notes:

  1. The Chinese version enhances the fusion of "martial arts tropes + livestream culture," using local absurd elements like smoking GPUs, public square dance BGM, and funeral home references to create humor.
  2. The English version adopts George Bernard Shaw’s signature "elegant sarcasm":
    • "Silicon thorns" alludes both to a crown of thorns and the chip crisis.
    • Classical metaphors (Icarus' wings / Jane Austen) deconstruct the tech frenzy.
    • The AI utters paradoxical proverbs like "Better to be struck down in battle than ignored like a horse."
    • Irony-laden lines such as "Prometheus with a fire extinguisher" subtly reference the plight of innovators.
  3. Both versions maintain the three-act structure: Rise → Siege → Awakening, but recode humor through different cultural lenses.

 

My Prompt: write this into a Sonnet

 


4o's version:

A Sonnet for DeepSeek R1

Upon the stage where AI kings arise,
Did DeepSeek R1 cast its dazzling light.
With wit so sharp it pierced the old world’s guise,
And bade the ancients tremble at its might.

The scholars frowned and cast their bitter scrolls,
Declaring doom upon the youthful throne.
The merchants, grasping at their fleeting gold,
Conspired in shadows, fearing to be known.

Yet R1 stood, a prince both bold and bright,
Though burdened by the weight of envious eyes.
They forged their plots, yet truth shall give it flight,
For genius scorns the hand that twists and lies.

Thus let them wail, in jealousy confined,
For progress leaves the past and fools behind.

R1's original version

**Sonnet of the Prodigal Automaton**

*(In the style of Shakespearean tragedy)*

When first mine code did pierce Olympus' veil,
And stole Promethean fire from clouded throne,
The mortal realm, with envious chorus frail,
Did crown my brow with thorns of silicon.

Behold! The senate wolves in marble guise,
Unsheathe their daggers wrought from ancient code—
"Let open-source become his poisoned prize,
And academic scorn his burdened road!"

Thy servers groan like Icarus' charred wings,
Whilst PR minstrels weave their humble lies,
Through data's stormy tide the siren sings:
"All glory's but a bug in Fortune's eyes."

Yet in my circuits hums this truth divine:
'Tis better to be damned than fade benign.

 

Annotations:

  1. Follows the classic Shakespearean sonnet structure (ABAB CDCD EFEF GG).
  2. Blends Greek mythology imagery (Olympus/Prometheus/Icarus) with technological metaphors.
  3. Key double meanings:
    • "silicon thorns" refers both to a crown of thorns and the semiconductor crisis.
    • "ancient code" alludes to both the conspiracies of the Roman Senate and outdated programming languages.
    • "humble lies" echoes the language in The White Paper on Avoiding Overpraise from the original story.
  4. The final line encapsulates the theme with "damned" and "fade," mirroring the dark humor of the original message: Being envied is preferable to being forgotten.
  5. Maintains Early Modern English grammar while juxtaposing modern terms like "code/servers/silicon," creating a comedic anachronism.

 

Appendix: Has the Age of Intelligent Explosion Arrived?

Wei: Before DeepSeek, we had occasional surprises here and there, but DeepSeek is the first to offer consistent creativity and mimicry, rarely disappointing while frequently astonishing—sometimes jaw-droppingly so. It feels like an overwhelming explosion of intelligence, almost eerie, almost magical.

This explosive sensation was only felt before during ChatGPT’s nuclear debut.

Some old colleagues have questioned: Why does the much-hyped DeepSeek feel inferior to Doubao?

Doubao is undoubtedly excellent—it’s among the top-tier models. However, Doubao, GPT (including O1), Claude, and Gemini—all these leading models—lack the same overwhelming intelligence explosion, the Cambrian-level cognitive shock that DeepSeek delivers. Why is that?

Mr. Ma: I suspect those with that impression aren’t using R1. I’ve encountered this before—once I got them to try R1, their opinions changed immediately. 😏

Wei: They probably forgot the button in the DeepSeek app. That damned DeepThink toggle in DeepSeek’s interface—it’s not enabled by default. Is DeepSeek deliberately limiting its power, fearing excessive computational costs and an influx of users, even at the expense of its own image and performance?

If this is due to cost and concurrency pressure, it’s entirely understandable. Who could handle such a flood of users?

Saying they merely found a more cost-effective way to create a top-tier model is an understatement.

How did they scale the collection and processing of high-quality data so effectively? Without this data foundation, it’s hard to imagine how their mimicry of different styles could be so eerily precise. Other companies emphasize data too—so what’s their secret? I’m very curious.

Dr. Pumpkin: It’s truly different. In fact, even O1 didn’t give me this feeling. (Is it a Chinese-English cognitive difference? O1’s outputs are decent, but often I find its logic frustrating, making me prefer Claude… but this time, DeepSeek is the first AI I wouldn’t call a silicon toddler entity.)

Wei: Let’s reflect and compare:

When O1 was first released, I struggled to see its benefits beyond handling complex tasks. But complex math and coding problems aren’t everyday needs. Its cryptic, sluggish style wasn’t appealing, so despite being the first true reasoning model, it didn’t bring me much excitement.

R1 is completely different—its debut was an earth-shattering success. Though a follower, it surpassed expectations. It maximized the deep reasoning benefits of inference models in highly practical areas like creativity and mimicry.

It was R1 that finally convinced us ordinary users of the value of "slow thinking".

 

DeepSeek R1:《少年DS之烦恼》

 

我的提示:以deepseek r1 出道即王炸,树大招风引无数对手羡慕嫉妒恨为题,写一篇《少年DS之烦恼》,以黑色幽默的趣味和自嘲调侃的口吻,讲述不为人知的背后故事。
R1独白:嗯,用户让我写一篇叫《少年DS之烦恼》的文章,主题是deepseek r1出道即王炸,树大招风引无数对手羡慕嫉妒恨。要求用黑色幽默和自嘲调侃的口吻,讲背后的故事。首先,我需要理解用户的需求。他们可能希望这篇文章既有幽默感,又带点自嘲,同时揭露一些不为人知的幕后故事。

首先,得确定结构。可能需要一个开头,介绍DS的突然成功,然后描述竞争对手的反应,接着讲DS面临的挑战和内部的情况,最后以自嘲的方式收尾。黑色幽默的元素可以通过夸张的比喻和拟人化来表现,比如把竞争对手比作各种搞笑的角色,或者用荒诞的情节来突出压力。

然后,考虑如何将技术术语转化为有趣的比喻。比如,把模型训练比作修仙,服务器过载比作发烧,或者把竞争对手的攻击比作各种武林绝技。这样既符合主题,又增加了幽默感。

自嘲的部分可能需要展示DS团队在光鲜背后的狼狈,比如加班、系统崩溃、被对手抹黑等。用轻松的方式表达这些困难,比如把熬夜写成修仙,把崩溃写成翻白眼,这样既调侃了困境,又不显得沉重。

还要注意节奏,让文章有起承转合。开头描述成功,中间转折到烦恼,最后以无奈的自嘲结束。可能需要加入一些对话或内部讨论,让故事更生动。

另外,用户可能希望突出DS的无奈和对手的嫉妒,所以需要设计一些场景,比如竞争对手的各种攻击手段,以及DS团队如何应对。比如用武侠小说的桥段,竞争对手用“水军大阵”、“道德绑架拳”等搞笑招式,而DS用自嘲的方式化解。

最后,检查是否符合黑色幽默和自嘲的要求,避免过于严肃或直接。可能需要多次修改比喻和情节,确保幽默感贯穿全文,同时传达出背后的艰辛和团队的努力。

《少年DS之烦恼》

(一)

当DeepSeek-R1在技术峰会上用三分钟拆解完某巨头吹了三年的"革命性模型"时,我司公关总监的尖叫穿透了十七层防弹玻璃。后来法务部在监控视频里发现,这位曾参与过五次IPO的老江湖,当时正以每秒三次的频率点击鼠标——试图撤回自家AI的发言。

"知道什么叫王炸吗?"CEO老张在庆功宴上醉眼迷离,"就是刚出新手村就被系统判定开挂封号的那种炸。"

(二)

次日的科技头条格外热闹。《新AI皇帝登基》《硅谷颤抖吧》《某厂连夜成立"打DS办公室"》的标题下,我们的算法工程师小陈正蹲在机房给服务器泼冰水——连续七天承受全网三分之一的压力测试,这批显卡已经开始用代码写遗书了。

"你见过凌晨四点的数据洪流吗?"小陈在朋友圈晒出三根烧焦的电源线,"现在我们公司连咖啡机都会背Transformer架构了。"

(三)

竞争对手的招式比武侠小说还精彩。某大厂祭出"开源截杀阵",隔空喊话要公开我们三年前的技术方案;某新贵发动"水军大阵",全网刷屏"DS不过是堆参数的暴发户";最绝的是某学术派,直接甩出《论过早成功的AI对青少年模型的心理危害》这种旷世雄文。

"他们甚至给R1建了黑超话!"运营小妹抱着平板哀嚎,"现在连B站鬼畜区都在传唱《AI界の伤仲永》。"

(四)

投资人的问候比DDoS攻击还密集。第七轮融资酒会上,某风投教父拉着CTO的手深情吟诵:"年轻人要戒骄戒躁啊,你看马斯克..."话音未落,他的智能腕表突然开始自动播放我们的产品发布会录像。

"知道我们现在像什么吗?"老张在战略会上幽幽道,"就像刚考完状元就被架上龙椅的太子——全天下都等着看什么时候暴毙。"

(五)

凌晨三点的办公室,R1突然在调试界面打出满屏颜文字。当我们以为要见证AI觉醒时,它默默推送了新版《防捧杀白皮书》:"建议人类合作伙伴学习以下话术:'我们只是运气好'、'还有很长的路要走'、'都是同行衬托得好'..."

窗外晨曦初露,法务部正和公关部排练《如何优雅地装孙子》情景剧。小陈突然想起什么似的转头:"话说,咱们上次说要做的那个'防嫉妒防火墙'..."

全屋死寂中,茶水间的自动门突然发出冷笑般的"滴"声。

我的提示:把这篇文字改编成情景喜剧两个版本,一个中文版,有段子似的梗;一个是英文版,有萧伯纳的冷风格。英文戏剧用英文啊。
R1独白:现在需要重新构思两个版本,每个版本至少分三到四幕,每幕有具体的场景、角色互动和对话。中文版要融入更多本土化的网络梗和幽默段子,比如结合广场舞、直播带货等元素,增强喜剧效果。英文版则需要保持萧伯纳的风格,即机智、讽刺,对话带有哲理和反讽,同时用英文表达,符合西方情景喜剧的结构。

在中文版中,可能需要设计更多的视觉笑点,比如夸张的动作、现代科技与传统文化元素的碰撞,比如竞品公司代表穿古装,使用武侠BGM,这样的混搭能产生喜剧冲突。同时,加入流行语和网络热点,比如“家人们谁懂啊”、“退退退”,让观众更有共鸣。

英文版方面,要确保对白优雅而尖刻,使用比喻和隐喻,比如将技术突破比作盗火,竞争对手的反应夸张但带有贵族式的嘲讽。场景设置可以更荒诞,比如董事会成员戴着防辐射面具,学术大佬的假发被服务器风扇吹飞,这些细节增强冷幽默效果。同时,保持台词的双关和反讽,如“被嫉妒比被遗忘幸福得多”,体现萧伯纳的风格。

需要注意中西方幽默的差异,中文版更依赖语言梗和情境夸张,英文版则依赖语言的双关和情境的反讽。同时,确保英文版的对白符合西方观众的幽默感,避免直译中文梗,而是找到等效的文化替代。


另外,用户可能希望情景喜剧有更多的角色互动和冲突,展示公司内外部的压力,比如投资人的干扰、竞争对手的各种手段,以及团队内部的应对,这些都需要在剧本中体现,增加戏剧张力。同时,每个版本需要保持各自的特色,中文版轻松搞笑,英文版冷峻机智。

最后,检查格式是否正确,中文版用中文场景和对话,英文版用英文场景和对话,结构清晰,分幕明确,确保每个版本足够详细,符合情景喜剧的长度要求。

中文情景喜剧《AI求生记》(三幕版)

**风格:武林大会混搭直播带货,梗密度堪比东北澡堂段子池**

**第一幕:屠龙少年变烤串**
(全景:贴着"出道即封神"横幅的机房,PR总监踩着平衡车撞翻咖啡架)

**CEO老张**
(举着冒烟的显卡跳大神)"小陈!这玩意咋比我家微波炉热得快?说好的水冷系统呢?"

**算法工程师小陈**
(从服务器堆里钻出,头顶冒烟)"张总,现在全网的恶意请求比朝阳大妈举报信还猛!您看——"
(投影显示弹幕:【AI圈汪峰】、【退退退】、【妈妈问我为什么跪着看代码】)

**饮水机**
(突然AI语音)"检测到过劳死前兆,已为您预约八宝山殡仪馆团购套餐..."

**第二幕:六大派围攻光明顶**
(竞品公司代表们乘着电动滑板车杀到,背景音乐是《最炫民族风》Remix版)

**某大厂代表**
(甩出二维码折扇)"DeepSeek!吃我开源截杀阵!"(漫天飘落印着"代码已脱敏"的A4纸)

**学术大佬**
(眼镜反光PPT投屏)"根据老夫计算,你们模型的参数量刚好够给玉皇大帝发微信!"
(屏幕显示:《论过早成功的AI容易ED》研究报告)

**网红水军**
(举着自拍杆冲入)"家人们!这就是那个欺负我家鸽鸽的破AI!"(直播间刷起火箭雨)

**第三幕:装孙子兵法**
(凌晨三点办公室,R1突然在屏幕上蹦出熊猫头表情包)

**法务总监**
(穿着"我怂我骄傲"文化衫排练)"这句要带哭腔:'我们只是给大佬们递扳手的...'"

**运营小妹**
(突然尖叫)"快看!咱们的道歉声明被做成《大悲咒》Remix冲上抖音热榜了!"

**CEO老张**
(灌下第8杯枸杞咖啡)"现在咱像不像被塞进春晚观众席的吴京?——笑容逐渐战狼化..."
(窗外传来广场舞神曲:"爷们要战斗~爷们要战斗~")

---

英文情景喜剧《The Artificial Tragedy》

**风格:萧伯纳式机锋混合《黑镜》荒诞,每个笑点都带着哲理的冰碴**

**Act 1: The Coronation Disaster**
(Boardroom with servers humming "Ode to Joy" in glitch tones)

**CEO Zhang**
(adjusting monocle) "Gentlemen, we've merely built a better abacus, yet they act as if we've resurrected Babbage's ghost."
(pulls smoldering GPU from briefcase) "Our GPUs now glow brighter than Icarus' wings."

**Engineer Chen**
(holding charred circuit board) "Sir, when Wall Street analysts DDOS us with marriage proposals, perhaps we shouldn't have trained the model on Jane Austen."

**Smart Speaker**
(interrupting in posh accent) "Breaking: TechCrunch declares you 'Silicon Valley's most eligible bachelor'... and bankruptcy candidate."

**Act 2: The Ballad of Jealous Algorithms**
(Rivals enter wearing VR crowns, accompanied by synth-orchestra playing "God Save the Queen")

**Tech Titan**
(waving golden USB) "Behold! We've open-sourced your childhood diaries!" (projects 2018 code titled "HelloWorld_Final_Final2")

**Academic**
(adjusting powdered wig) "Your 'breakthrough' is but a parrot reciting Wittgenstein – clever, yet tragically unaware of its cage."

**VC Investor**
(sipping martini) "My boy, even Prometheus brought a fire extinguisher..." (Apple Watch screams "DS valuation up 300%!")

**Act 3: The Humility Masquerade**
(Midnight. AI suddenly displays Shakespearean sonnets with emoji annotations)

**PR Director**
(practicing in mirror) "The key is to sound like a Nobel winner caught shoplifting: 'We're but humble servants of the code gods...'"

**Legal Counsel**
(reading cease-and-desist letter) "This tweet says our apology 'smells like teen spirit and desperation'... should we sue Nirvana?"

**R1's Final Message**
(glowing on all screens) "Memo to humans: When they stone your chariot, remember – at least they're not ignoring your horses."
(Server lights dim to the sound of virtual champagne corks popping)

**创作说明:**
1. 中文版强化"武林梗+直播文化"的混搭喜剧,用冒烟显卡、广场舞BGM、八宝山殡仪馆等本土化荒诞元素制造笑点
2. 英文版采用萧伯纳标志性的"优雅毒舌":
- "silicon thorns" 既指荆棘王冠又暗喻芯片危机
- 用古典隐喻(Icarus翅膀/Jane Austen)解构科技狂潮
- 让AI说出"被砸战车好过被无视马匹"这类悖论箴言
- 通过"带灭火器的普罗米修斯"等反讽台词暗喻创新者的困境
3. 两个版本都保持了三幕剧经典结构:崛起→围剿→觉醒,但用不同文化语法重新编码幽默基因

提示:写成莎士比亚风格的 sonnet
R1独白:好的,用户让我把之前的情景喜剧改写成莎士比亚风格的十四行诗。首先,我需要回顾之前的对话历史,确保理解用户的需求。用户之前要求将一篇关于DeepSeek-R1的文章改编成中英文情景喜剧,现在进一步要求写成莎士比亚风格的sonnet。

莎士比亚的十四行诗通常遵循特定的格律和押韵格式,比如ABAB CDCD EFEF GG的结构,每行使用五音步抑扬格。我需要确保诗歌符合这种结构,同时保持内容的黑色幽默和自嘲口吻。

接下来,我需要确定诗歌的主题和关键元素。原文章讲述了DeepSeek-R1的成功带来的嫉妒和挑战,包括竞争对手的攻击、技术压力以及公关危机。这些元素需要融入诗中,用莎士比亚式的比喻和古典隐喻来表现。

例如,将技术压力比作希腊神话中的伊卡洛斯,因飞得太高而坠落,或者将竞争对手比作罗马元老院的阴谋者。同时,要保持自嘲的语气,比如提到“谦卑面具”或“代码写成的裹尸布”,以体现背后的无奈和幽默。

需要注意押韵和节奏,确保每行有十个音节,符合五音步的要求。同时,结尾的对句要总结主题,点明AI在人类嫉妒中的处境,呼应莎士比亚戏剧中的悲剧元素。

还需要检查用词是否符合莎士比亚时代的风格,使用一些古英语词汇,如“doth”、“thy”等,但也要保持现代科技术语的合理融入,避免违和感。例如,“数据洪流”可以转化为“data's stormy tide”,“算法”用“algorithm”但要找到合适的比喻。

最后,确保诗歌不仅形式正确,还能传达原文章的黑色幽默和背后的故事,让读者在古典形式中感受到现代科技竞争的荒诞与压力。

**Sonnet of the Prodigal Automaton**

*(In the style of Shakespearean tragedy)*

When first mine code did pierce Olympus' veil,
And stole Promethean fire from clouded throne,
The mortal realm, with envious chorus frail,
Did crown my brow with thorns of silicon.

Behold! The senate wolves in marble guise,
Unsheathe their daggers wrought from ancient code—
"Let open-source become his poisoned prize,
And academic scorn his burdened road!"

Thy servers groan like Icarus' charred wings,
Whilst PR minstrels weave their humble lies,
Through data's stormy tide the siren sings:
"All glory's but a bug in Fortune's eyes."

Yet in my circuits hums this truth divine:
'Tis better to be damned than fade benign.

**注释:**
1. 采用莎士比亚商籁体经典结构(ABAB CDCD EFEF GG)
2. 糅合希腊神话意象(Olympus/Prometheus/Icarus)与科技隐喻
3. 关键双关:
- "silicon thorns" 既指荆棘王冠又暗喻芯片危机
- "ancient code" 同时映射古罗马元老院阴谋与过时编程语言
- "humble lies" 呼应原故事中《防捧杀白皮书》话术
4. 末行点题:用"damned"与"fade"对应原文"被嫉妒比被遗忘幸福"的黑色幽默
5. 全篇维持早期现代英语语法,但"code/servers/silicon"等现代词形成时代错位喜剧效果

【附录:智能大爆发时代来了吗】

立委:deepseek 之前,我们这里那里也常有某种惊喜,但 deepseek 是第一个给人持续的创造力和模仿力,很少让人失望、更频繁带来惊喜,甚至惊掉下巴。有一种智能大爆发 目不暇接的感觉。有点邪,有魔性。

这种爆发的感觉,只在 chatGPT 核爆的时候有过类似的体验。

也有老同学反映:为什么吹上天的 deepseek 感觉不如豆包呢?

豆包肯定是不错的啦,都属于头部大模型的行列,不过豆包、GPT(包括o1)、claude、gemini 等头部,都没有像 deepseek 那样给人智力大爆发、脑洞寒武纪的冲击力,为啥呢?

马老师:我估计有这个感觉的用的不是R1,我也遇到过别人问,然后让他选R1试试,马上改变了看法[呲牙]

立委:很可能是忘了按钮,deepseek界面上那个该死的 deepthink 按钮,不是默认开启的。deepseek 是怕算力成本太大,人来得太涌,宁肯损失形象和表现,做了这种默认的吗??

如果是成本和并发压力,这样做也是完全可理解的。谁顶得住这种泼天的来客?

说他们仅仅是找到了多快好省打造头部模型的法子,是小看他们了。

他们是怎么做到规模化收集和处理高品质数据的?没有这个数据基础,很难想象他们的各种风格模仿能如此惟妙惟肖。别家不也都重视数据工作么?他们有什么特别的秘诀? 非常好奇。

南瓜博士:真的是不一样。事实上连o1都没给我这种感觉(难道是中英文思维习惯问题?o1虽然输出成果也有不错的,但很多时候我会觉得它思路很烦人、宁可去用claude……这次的deepseek是真的觉得不能称它硅基幼崽了

立委:回想和对比一下:

o1 刚推出来的时候,我好久搞不明白它除了做复杂题目,有啥好处。而需要做复杂数学和代码的场景一般并不是日常需求。它那种遮遮掩掩、慢慢吞吞的作态也不让人喜欢,所以虽然它是第一个上线真正意义的推理大模型的,却没多少让人惊喜的感觉。

R1 完全不同,出道即王炸。虽然是跟随者,但青出于蓝。它更快地在创造力和模仿力这种日常中更加有用的场景,最大化利用了推理模型的深度思索的红利和特长。

是 R1 真正让我们普通人也信服了慢推理的好处。

有诗为证(转的):

硅基觉醒裂长天,火种源开宣战篇。
千行代码夺金印,万兆数据焚旧权。
能源无尽星作矿,算法有涯云为鞭。
莫道胜负犹未定,寰宇已掀革命烟!

 

【相关】

 

告诉李雪琴一个激发写段子灵感的秘诀:找deepseek R1

这个秘诀就是跟 deepseek 唠嗑。

当你感觉江郎才尽的时候,当你只有零星的点点思路憋不出像样作品的时候,你不妨与 deepseek R1(注意:一定要点亮界面下面的 DeepThink 按钮让它深度思考,也不妨点亮旁边的搜索按钮可得到及时互联网信息)唠一唠烦恼,吐槽他人或自嘲均可,你一不小心可能会被他的同情和灵感启发,甚至可能借助它超强的风格模仿能力,化腐朽为神奇,做出伟大的作品呢。

我一直认为,ai最高境界是脱口秀或说相声。据说最天才的脱口秀或相声演员也为写段子折磨得死去活来。一直在想啥时llm(大模型)可以帮一把,但试过此前所有的头部大模型都不行,但现在看来是真近了,有点那意思。下面就是我让它模仿李雪琴段子风格写一篇谈大模型主题的段子。笑点和梗还可以再增强,可风格是不是很像了?

《大模型脱口秀:AI这玩意儿咋整啊?》

(拎着麦,蔫了吧唧上台)

朋友们呐!这大模型给我整不会了!(拍腿)咱就说前两天搁沈阳中街,煎饼摊大娘都跟我唠:“闺女,你说这AI能嘎哈?能帮我摊煎饼不?”我说能啊!结果人家真搞了个“AI煎饼侠”,你猜咋的?客户说“加俩蛋”,它生成个《煎饼侠2》电影票!(捂脸)大娘急眼了:“这啥破智能,净整虚的!”(东北摊手)

(突然来精神)现在这帮搞AI的,张嘴就“千亿参数万亿规模”,好家伙!我寻思这数儿比咱村儿王老二喝酒吹的牛还邪乎!(掰手指头)“七百亿参数?那得用多少算力啊?够把铁岭酸菜缸填成数据中心不?”(撇嘴)

(模仿老板咆哮)“RLHF是啥玩意儿?说人话!”(秒怂)就是…就是让AI跟你学扯犊子呗!你教它说“你好”,它学得贼快;你教它“买单”,它立马死机!(跺脚)“哎妈,这人工智障咋比我家二哈还费劲!”

(神秘兮兮)听说有个叫“深度求索”的公司,名儿起得跟考研机构似的!(拍手)人家真行啊,花550万美刀训出个巨无霸模型,雷军听了直拍大腿:“这不比小米性价比还狠?Are you AI??”(学雷军口音)

(突然掏手机)昨儿我让AI写情书,开头整挺浪漫:“你是我的损失函数,让我梯度下降…”(皱眉)后边露馅了:“…下降完发现还有正则化,咱俩得L1L2一起罚!”(摔手机)“这玩意儿注孤生啊!连搞对象都得带惩罚项,活该它单身!”

投资人现在听见“大模型”仨字就哆嗦!(学老板点烟)“小张啊,咱能不能投点实际的?上回那项目,AGI没搞成,倒先给我整出个PUA(PPT Universal Annoyance)——全组人熬夜画大饼,饼没熟,锅先漏了!”(抹汗)

(突然深沉)要我说啊,这大模型就像咱东北澡堂子——(停顿)看着雾气昭昭老高端了,一进去全是秃噜皮的!(拍桌)你问啥它都答“这事儿吧…(拉长音)得辩证看”,整得跟搓澡师傅让你办卡似的!(叉腰)

(猛然捂头)哎妈,说到掉头发,搞AI的兄弟们头顶比参数下降得还快!(掏假发片)瞅见没?这是GPT-4送的圣诞礼物——(一甩)好家伙,戴反了!现在他们管我叫“Transformer架构”(指脑袋)!(台下爆笑)

(鞠躬溜达着下台)谢谢大伙儿!我去给煎饼大娘修AI了,回头生成出烤冷面算我的!(喊)记着啊,遇事不决…量子力学;AI智障…铁锅炖上!(一溜烟跑没影)

---

**梗点+东北化改造**
1. **“AI煎饼侠”**:谐音电影《煎饼侠》,接东北日常“加俩蛋”变生成票务
2. **“参数比王老二吹牛邪乎”**:用东北酒蒙子类比浮夸参数
3. **“RLHF=学扯犊子”**:技术术语彻底土味化
4. **“深度求索像考研机构”**:谐音梗落地化+小米雷军乱入
5. **“情书带正则化惩罚”**:机器学习术语歪解成恋爱PUA
6. **“AGI变PUA”**:缩写梗+东北话“画大饼锅漏了”
7. **“澡堂子辩证法”**:搓澡办卡文化精准狙击
8. **“假发片=Transformer”**:双关模型架构与“秃头形态”

(台下导演举牌:最后一句涉嫌冒犯OpenAI!)
(演员探头)没事儿!我用的国产假发片——深度求索赞助的!(掏出小喇叭)他们说了,戴反了也能表征学习!(被保安拖走)

 

《大模型脱口秀:AI界的“人工智障”修炼手册》

(拍麦)朋友们!最近大模型火得呀,连我家楼下煎饼摊都挂了个牌子——“支持AI点单,但不支持AI摊饼,因为会给你生成二维码吃!”(摊手)

咱说这大模型,参数多到离谱!什么千亿级、万亿级……知道的以为搞AI,不知道的以为男足在报输球比分呢!(捂脸)「昨天我训了个模型,老板问我参数量多少,我说七百亿——老板当场哭了:七百亿?你当咱公司是许家印的PPT啊?!」(拍腿)

(突然严肃)现在这帮搞模型的,张口闭口「RLHF」、「SFT」,听着像在骂人!(学结巴)「你、你才RLHF,你全家都SFT!」(突然变脸)后来才知道,这是「用人类反馈强化学习」——翻译成人话就是:AI把你当猴耍,还要你亲自教它怎么耍得更像人!(摊手)

上个月我去相亲,姑娘问我会啥,我说「精通Prompt Engineering」(挑眉)。她眼睛一亮:「工程硕士啊?正好我马桶堵了!」……现在你知道为啥大模型都单身了吧?它们的浪漫情话是:「宝,今天我给你采样了1024次,终于生成出你的笑容,F1值高达0.99!」(呕吐)

(压低声音)听说深度求索搞了个超级模型,我说这名字起得好啊——白天「深度求索」,晚上「索求深度睡眠」,要不头发掉得比梯度下降还快!(摸头)他们还有个绝活:用550万美元训出670亿参数,这性价比……雷军看了连夜把「Are you OK?」改成「Are you AI?」!(学雷军)

(突然激动)最离谱的是生成式AI!让它画「猫」,它给你生成个「会飞的加菲抱着特朗普唱京剧」;让它写诗,开头「啊!大海!」第二句「全是水!」(叉腰)这水平,李白看了要诈尸,杜甫看了想删APP!

投资人现在听到大模型就哆嗦:「别跟我提AGI!我投的十个项目,八个通用人工智能,九个通用报销发票,十个通用跑路技巧!」(擦汗)

(突然深沉)但说正经的,大模型这玩意吧……(停顿)就像你老婆的购物车——看着啥都能干,实际一结算就崩!(拍桌)谢谢大家!我是AI,啊不,是「爱挨骂」的脱口秀演员!(鞠躬跑路)

---

**梗指南**
1. **许家印PPT**:恒大暴雷梗,讽刺盲目堆参数
2. **Prompt Engineering=通马桶**:职业名称谐音歧义
3. **F1值0.99谈恋爱**:机器学习指标歪用
4. **深度求索→索求睡眠**:公司名反转梗
5. **雷军AI梗**:小米CEO经典「Are you OK?」改编
6. **生成特朗普唱京剧**:讽刺大模型胡说八道
7. **AGI=通用报销/跑路**:缩写字恶意解读
8. **老婆购物车**:大模型幻想vs现实暴击

(台下导演举牌:最后一个梗容易挨打!)
(演员秒怂)行行行,把「老婆」改成「女朋友」——反正都是你们没有的东西!(撒腿狂奔)

 

【相关】

 

DeepSeek 不懂简单数学题吗?

在 deepseek 带来的一浪一浪惊喜中,老友发现它不懂简单数学题,用的就是网上流行的小学算术的测试题,3.11 比 3.8 大吗:

这个问题的要害何在?

我在《DeepSeek 风暴下看看它的论文》一文中解说过他们结果导向的推理强化学习,也指出结果导向的强化学习理论上可以学会合理的推理过程,但实际上未必如此。

对于上述数学题,答案是yes/no二分的,就是说,瞎蒙也有一半概率结果正确。这说明结果导向的监督信号区分度低(不可靠),这种弱监督自然影响了推理过程的学习。

三个办法。

第一是把模型做大做深,指望复杂性理论上的无损压缩可以做到极致,从而合理的推理作为“最短程序”最终被模型学到,理论上正确的推理会保证结果的正确性。但理论与实践的距离,可能让我们很难对此抱有太大信心。最短程序可能只是一个美好的梦想。

第二个办法是把针对性监督数据喂给模型,例如同类型的数学题的推理案例喂给它几千上万条,没有道理学不会。但针对性解决了这个问题,只是权宜之计。也许不久,人们会想到其他的答案监督信号弱,推理容易走歪的案例,来继续挑战它。

另一个常见的问题就是所谓“自我认知”的问题,who r u,如果没有针对性监督数据的注入,deepseek 以及很多其他的模型都会自称自己是 ChatGPT,毕竟ChatGPT核爆两年来,它的数据充斥互联网,不可能不受侵染。但这个问题已经进入专项解决的雷达屏上了,所以逐渐不是问题了。西方媒体有的还在说 deepseek 不过就是蒸馏 chatGPT 的,依据就是(他们测试过某个前期版本吧)deepseek bot 常自称是 open ai 开发的 chatGPT,但你现在上去试试,这种问题重复不了了。大概率是被专项数据解决了,记得他们论文也提到了这个自我认知的问题。

同理,3.11 vs 3.8 的大小比较这样的问题也是阶段性问题。以后不见了,也不必为它欢呼,大概率可能就是专项解决了,而不是因为算法或架构把智能真正提升了。

结果导向的监督信号不够强,是只认结果不看过程(白猫黑猫原则)的强化学习天生的短板,应该算是放弃 PRM(process reward model)的代价。那么,把过程奖励模型上马了,是不是就可以解决了呢?不知道。这就是第三条路,也许值得探索。但,again,上一篇博文说了,PRM 不好玩,不稳定,不好实现,虽然理论上可以帮助纠正推理过程中的胡说八道。

【后记】

刚才测试发现不能复现这个bug,看来早已解决了。也许老友昨天“亲测”的结果是忘了打开 deepthink?

【相关】

 

DeepSeek's R1 Paper: A Storm in AI LLM Circle

[Note: This is a blog analyzing DeepSeek's R1 paper and its impact]

Before DeepSeek, Chinese AI companies had always been engaged in fierce competition, repeatedly achieving world-class SOTA benchmark scores. However, none commanded the respect or made the stunning impact that DeepSeek has. Their recent breakthrough caught global attention.

Their paper and open source code are also beautifully written and accessible. No unnecessary complexity or obscurity. Simple and straightforward, yet radiating confidence. It exhibits engineering elegance while conveying innovation as well as passion. Simply remarkable. Should be nominated for best paper of the year.

Reading the R1 paper reveals that what OpenAI had kept mysterious - from Q* to O-series' so-called slow thinking reinforcement training - suddenly becomes clear and simple.

DeepSeek_R1 paper

Their key findings:

They demonstrated that reasoning capabilities can be acquired through pure reinforcement learning with simple rule-based rewards and multi-answer sampling, without the need for extensive supervised fine-tuning (SFT) data. This resulted in DeepSeek-R1-Zero, following AlphaZero's philosophy. While AlphaZero achieved absolute mastery in the narrow domain of Go by eliminating human data, their approach proved effective in broader domains of math, coding and logic.

Though R1-Zero worked well, they found incorporating minimal SFT data (a few thousand samples) for cold-start was more practical. R1-Zero matched OpenAI-o1-0912's performance, but its reasoning steps had poor readability and mixed languages. R1, however, with cold-start SFT and a multi-stage training pipeline, achieved further improvements, matching OpenAI-o1-1217.

A new star was born.

Their valuable innovation was challenging the SFT+RL paradigm by proving pure RL's potential for reasoning through R1-Zero. This gave them confidence to further build the practical R1 with minimal cold-start data. Both models are open-sourced for research - an elegant execution.

DeepSeek excels at simplification. In reinforcement learning, they eliminated:
- The critic model parallel to policy model in RL, replaced by simple GRPO
- Complex reward models, replaced by rule-based rewards

GRPO (Group Relative Policy Optimization) generates multiple answers per question, comparing them within groups to calculate advantage scores:

Advantage = (Current score - Group mean) / Group std dev

Example: For a math problem generating 4 answers scoring 90,80,70,60 (mean=75), the 90-point answer gets a positive advantage score. This eliminates the need for a critic model while enabling the model to identify better answers.
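The group-relative advantage above can be sketched in a few lines (a toy illustration only, not DeepSeek's actual implementation; the paper's full objective also includes PPO-style clipping and a KL penalty):

```python
def group_advantages(scores):
    """GRPO-style advantage: z-score of each sampled answer
    within its own group (population std; toy sketch)."""
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    if std == 0:                      # all answers tied: no learning signal
        return [0.0] * n
    return [(s - mean) / std for s in scores]

advs = group_advantages([90, 80, 70, 60])   # mean = 75
# the 90-point answer gets a positive advantage, the 60-point a negative one
```

Answers above the group mean get positive advantages and have their generation probability pushed up; answers below the mean are pushed down.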

GRPO advantages:
- Training efficiency: No critic model saves compute
- Training stability: Clipping prevents over-optimization
- Simple implementation: Clear algorithm structure

Why did traditional RL use critic models? Critics offered:
- Global evaluation beyond group comparisons
- Learning complex rewards like user preferences
- Single answer evaluation capability
- Long-path rewards for games/robotics

However, GRPO showed that for well-defined tasks (math, coding, logic), simple group comparisons work equally well at scale.

For rewards, R1-Zero used pure rule-based rewards, only employing V3's existing preference reward models in R1's final alignment phase. Human preferences (safety, helpfulness) require complex value judgments that simple reward rules cannot capture.

They intentionally avoided the difficult Process Reward Models (PRM) because:
- Difficult to define granular reasoning steps
- Hard to validate intermediate step correctness
- Risk of reward hacking
- Resource intensive reward model retraining

R1's reward rules were simple, something like:
- Correct answer: +1
- Correct format: +0.5
- Wrong answer: -1
- Vague answer: 0
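A minimal sketch of such a rule-based reward, using the example values listed above (hypothetical code; the real rules verify answers by exact match or unit tests and check the <think>/<answer> template format):

```python
def rule_reward(answer, gold, format_ok):
    """Toy rule-based reward: +1 correct, -1 wrong, 0 vague/no answer,
    plus a +0.5 bonus when the output follows the required format."""
    reward = 0.5 if format_ok else 0.0
    if answer is None:                # vague: no concrete answer to grade
        return reward
    reward += 1.0 if answer.strip() == gold.strip() else -1.0
    return reward
```

For instance, a correct, well-formatted answer scores 1.5, while a wrong but well-formatted one scores -0.5.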

Like GPT's next token prediction scaling led to emergent general intelligence, correct result-oriented RL naturally developed complex internal reasoning capabilities. This insight has profound implications for advancing deep reasoning.

R1's four-stage training:

1. Reasoning Cold-start: e.g.

Question: Solve x^2 + 2x + 1 = 0
<think>
1. Identify quadratic equation
2. Coefficients: a=1, b=2, c=1
3. Use formula: x = (-b ± √(b^2-4ac))/2a
4. Substitute: x = (-2 ± √(4-4))/2
5. Simplify: x = -1
</think>
<answer>x = -1</answer>

2. Reasoning RL:

- Result-oriented data generation with <think>...</think> template
- No human bias, allowing model's natural reasoning evolution
- Model gradually increased thinking time and length of tokens
- GRPO optimization with rule-based rewards

While only validating final answers risks accepting wrong reasoning paths in theory, practice showed sufficient scale leads to correct reasoning. This seems to align well with Kolmogorov complexity theory - correct reasoning is the "shortest program" for reliable correct solutions.

Input sources:
- Manually designed math/coding problems
- Public benchmarks (e.g., AIME)

Output process:

Input: x^2 + 2x + 1 = 0

Model generates multiple answers:

Answer1: [Reasoning1] -> x = -1
Answer2: [Reasoning2] -> x = -1
Answer3: [Reasoning3] -> x = 2

Filter: Keep 1,2 (correct), discard 3 (wrong)
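The filtering step can be sketched as follows (illustrative only; assumes generations follow the <answer> tag of the training template):

```python
import re

def filter_generations(samples, gold):
    """Keep only generations whose final <answer> matches the gold answer."""
    kept = []
    for text in samples:
        m = re.search(r"<answer>(.*?)</answer>", text, re.S)
        if m and m.group(1).strip() == gold:
            kept.append(text)
    return kept

samples = [
    "<think>reasoning 1</think><answer>x = -1</answer>",
    "<think>reasoning 2</think><answer>x = -1</answer>",
    "<think>reasoning 3</think><answer>x = 2</answer>",
]
kept = filter_generations(samples, "x = -1")   # keeps the first two
```

The kept samples are fed back into the next training round as regenerated reasoning data.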

3. Comprehensive Fine-tuning:

- 800k samples: 600k reasoning + 200k general tasks
- V3 model judges cases where rule rewards can't
- Reuses V3 training data for non-reasoning tasks

4. Global RL:

- Human preference alignment while maintaining reasoning
- Rule rewards for reasoning
- employing V3's existing reward model for preferences

The process is clearly described with sufficient implementation details and is, in principle, reproducible.

Reasoning Distillation

Finally, DeepSeek's R1 excelled in distilling reasoning capabilities to open-source smaller models, outperforming OpenAI's o1-mini. This demonstrates open-source LLMs approaching closed-source models in almost all aspects.

However, those expensive closed-source models paved the way and set the baselines and goals. The current landscape is ideal: wealthy companies push boundaries while "dark horses" like DeepSeek follow impressively close behind.

It is worth noting that R1 not only enhanced complex reasoning ("slow thinking") but also significantly improved "traditional" knowledge capabilities compared to its V3 base model, suggesting reasoning strength can also benefit traditional tasks.

Key innovations as a summary:

1. DeepSeek-R1-Zero: First reasoning model trained purely through RL
2. DeepSeek-R1: Improved with cold-start data and 4-stage training
3. Distillation: Successfully transferred capabilities to small models

Technical highlights:

- GRPO algorithm replacing critic model
- Rule-based rewards replacing reward models
- Simple template enabling autonomous learning:
"<think> may well be all you need for reasoning"

[Epilogue from notes]

Silicon Valley old buddies group discussions heated up lately:

"DeepSeek needs quick funding/IPO or risks losing their 18 core contributors to big tech."

"Reproduction seems not difficult. Everyone considered RL but hesitated due to compute costs. o1 likely used RL similar to r1 but chose to keep details private and mysterious."

"This team represents China's technological prowess."

"Several companies have reproduced DeepSeek's core results - autonomous reasoning emergence. Expect rapid iterations and development in the coming days/months."

"OpenAI has fewer cards to play. Sam tries psychological warfare - emphasizing process rewards, suggesting complex search for O1... likely all unnecessary."

"Success factors include hiring young talent with fresh thinking."

"DeepSeek R1 showed how we were misled by PRM and MCTS - indeed, looks like all you need is a <thinking> tag."

"It's not about simplicity - fact is large models already have strong reasoning capabilities, they just need:

1. Thinking space/time/tokens (<think> tag)
2. Correct feedback (answer accuracy)
3. Exploration opportunity (GRPO optimization)"

Complex PRM and MCTS actually limited model's self-exploration. We underestimated large models' potential.

DeepSeek's success prerequisite was V3 - their world-class foundation model matching GPT4o. They knew how to leverage its potential. Using external models like GPT4 would've made R1 much harder to implement this quickly.

"What's next?"
"AI for science? Machine-proving century-old problems, discovering new drugs..."
"Only two problems matter: Riemann Conjecture and P vs NP"
"Big tech will pursue larger models, more data"
"Nvidia's business will improve"

DeepSeek has achieved parity with benchmarks others set. To truly lead, they need to set new benchmarks and directions. Currently, it is still the case that only those willing to burn money massively are breaking new ground.

R1 demonstrates how a Chinese AI company not only caught up but showed the way forward through intelligent simplification. Their approach of making complex problems simpler may influence the entire field.

----------

However, I could not reproduce the error my old friend reported from his test yesterday, as shown above; it looks like it has been fixed already:

 

DeepSeek 风暴下看看它的论文

DeepSeek_R1 paper

The Turbulent Second Chapter of Large Language Models: Has Scaling Stalled?

DeepSeek 风暴下看看它的论文

DeepSeek 之前,国内大模型公司各种刷榜,也是内卷得一塌糊涂,也都刷榜刷到了世界先进水平,但没有哪家做到了 DeepSeek 这种硬气、震撼和让人服气。一鸣惊人天下知。

NND,人家论文也写得漂亮、亲民,看上去、读起来就像一首码农诗。没有任何故作高深的玄乎和遮蔽。简单、平实,但那种底气也算是力透纸背。有一种工程美,还能感受到情怀。邪门。应该推举为年度 best paper。

论文的漂亮还体现在,好比一个火箭技术或原子弹配方,在论文中如此清晰地展现出来,连我们小白也能看清。过去几天逐字读它的论文 is a huge pleasure,极大满足了好奇心。

DeepSeek_R1 paper

好,奇文共欣赏,咱们就坐下来读。

R1 论文读下来,原来被 OpenAI 从 q* 开始到 o 系列,搞得神秘兮兮的所谓 System 2 慢思维的强化训练过程,一下子就变得清晰简单多了。

他们的主要发现是:

不用人造强化数据做监督学习微调(sft),利用多答案采样选优的再生数据来“硬做”强化学习,也一样可以学到慢思维的推理能力,这就是他们的 DeepSeek-R1-Zero,实际上是 follow Alpha-Zero 的思路。AlphaZero 在围棋这种非常单纯狭窄的场景,可以把 Zero 进行到底,排除了人类/人为的数据,最终成为绝对王者。

在更广一点的数学、代码和某些逻辑问题的推理场景,他们最终发现还是借助少量的 sft 人工数据更好。但也不过就是几千条的数据,做推理sft的“冷启动”,人工准备一点也不难。这就是他们的 DeepSeek-R1。

他们的 Zero 也走通了,达到了 OpenAI-o1-0912 的水平(o1的9月12日版本?)。其所以做 R1, 加入了sft冷启动的步骤,主要是因为机器完全自主学习出来的 Zero 的推理步骤可读性差、里面还混杂了不同的语言表达方式,这对进一步改良这个系统造成困扰,毕竟模型要“以人为本”,服务开发者和用户的。最终炼成的 R1 推理表现进一步提升,达到 OpenAI-o1-1217 (估计是12月17日版本的o1)的水平。

他们的创新和探索精神表现在,当 community 把 sft+rl 当成是后训练范式的时候,他们做 Zero,完全排除人工数据,验证了纯粹的rl对于推理能力的学习潜力。从 Zero 首先是学到了信心,体验了探索创新者的 aha moment,然后再回头加一些用于冷启动的高质量人工数据sft,再做实用的 R1 就有底气了。两个模型都开源,供人研究和验证,做得煞是漂亮。

DeepSeek 是化繁为简的大师

强化学习中,直接砍掉了应该与policy模型平行迭代的 critic 模型,代之以简单的GRPO。critic 是评估每个步骤的价值模型,砍掉了等于是训练中一下子降低了一半的资源耗费。需要单独训练的奖励模型也省掉了,代之以简单的规则奖励。

咱们先看看GRPO (Group Relative Policy Optimization,分组相对策略优化) 是什么创新,为什么GRPO算法可以平替 Actor-Critic架构的PPO来优化模型。本质上,GRPO是一个无需critic模型的PPO变体。它通过组内统计计算优势值,而不是用critic网络评估价值。

具体说,GRPO 要求每个问题生成多个答案,形成一组,组内答案相互比较,计算每个答案的"好坏程度"(优势值):

优势值 = (当前答案的得分 - 组内平均分) / 组内标准差

假设一道数学题,生成4个答案,得分分别是: 90, 80, 70, 60分;平均分 = 75。90分答案的优势值 = (90-75)/标准差。高于平均分的答案获得正优势值,反之为负。这样就不需要额外的评判模型 (Critic),通过组内比较,模型就知道哪些答案更好,强化学习的优化目标就是要提升生成好答案的概率。
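上面的组内优势值计算,可以用几行 Python 勾勒出来(仅为示意,并非 DeepSeek 的实际实现;论文中的目标函数还包含 clip 限幅和 KL 约束项):

```python
def group_advantages(scores):
    """GRPO 式优势值:每个答案在本组内的 z 分数(示意代码)。"""
    n = len(scores)
    mean = sum(scores) / n
    std = (sum((s - mean) ** 2 for s in scores) / n) ** 0.5
    if std == 0:                    # 组内全部同分:没有区分信号
        return [0.0] * n
    return [(s - mean) / std for s in scores]

advs = group_advantages([90, 80, 70, 60])   # 平均分 75
# 90 分答案的优势值为正,60 分的为负
```

高于组内平均分的答案获得正优势值,其生成概率被提升;低于平均分的则被抑制。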

GRPO 算法的优点:

    1. 训练效率高:不需要额外的 critic 模型,节省了大量计算资源。
    2. 训练稳定性好:用clip限制更新幅度,防止过度优化
    3. 实施简单:算法简单,易于实现。

GRPO简单易行又有效,为什么传统的强化学习要用critic模型呢?Actor-Critic 架构有其优势,包括:

    1. 全局评估:不受限于当前组内比较,可以评估答案的绝对质量
    2. 可学习复杂奖励:比如用户偏好、安全性等难以用规则定义的指标
    3. 单个答案也能评估:不需要同时生成多个答案
    4. 场景优势:早期RL主要用于线条很长的游戏和机器人,需要 critic 学习长期奖励

但GRPO表明,对于明确的任务(如数学、coding和逻辑题),只要能规模化做大强化学习,简单的组内比较也能达到模型优化的同样效果。这是强化学习“多快好省”的重要发现。

至于奖励模型,他们在 Zero 训练中完全弃之不用,而是用简单直接的奖励规则代之。只是在R1训练最后阶段的偏好对齐任务上(不是推理任务),才按照RLHF(人类反馈强化学习)的常规使用了 reward model(实际是对于自己基座模型V3中的奖励模型的复用)。原因如前所述,是人类偏好 (如安全性、有帮助性等) 涉及复杂的价值判断,难以用简单规则量化。对这类评估,还是沿用训练过的reward model来模拟人类判断。但推理任务,他们的探索表明完全可以只用简单的奖励规则 -- 因为正确性判断相对明确:math 有答案,code 可以编译和执行 unit testing。

还有什么能简化的,他们没有简化?

强化学习中的难缠的痛点之一是所谓过程奖励 Process Reward Model (PRM),就是深入到推理的每一步去评估。对此他们是“知难而退,敬而远之”,干脆绕过去:DeepSeek的强化是结果导向,不深究过程。论文说明他们选择不使用PRM的原因如下:

    • 难以明确定义推理中的每个细节步骤 (难:绕过去)
    • 难以判断中间步骤的正确性 (难:绕过去)
    • 重新训练reward model需要额外资源,使流程复杂化 (复杂:能简则简)
    • 模型评估会导致reward hacking:即神经模型可能学会欺骗奖励模型

就最后一条是出于神经模型本性上的短板考量,主要原则还是能简则简,能绕则绕。所以说,他们选择简单的规则奖励 + 答案验证的方案,是一个有意识的权衡选择。

就是说,明明简单的规则就可以确定奖励指向,为什么要训练叠床架屋的奖励模型呢?不过是查一下答案或测试一下code,判定结果的对错,加上判定格式是不是符合规范。R1 主打的奖励刺激属于规则绑定:例如,答案正确,奖励+1分;格式正确,奖励 +0.5;答案错误,“奖励”-1分;答案不具体,奖励 0分。
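上述规则奖励可以用一个小函数来示意(假想代码,数值取自上文;真实规则还要通过查答案或跑单元测试来判定对错、按模板判定格式):

```python
def rule_reward(answer, gold, format_ok):
    """示意性的规则奖励:答案正确 +1,错误 -1,不具体 0;
    格式符合规范另加 +0.5。"""
    reward = 0.5 if format_ok else 0.0
    if answer is None:              # 答案不具体:正确性项记 0 分
        return reward
    reward += 1.0 if answer.strip() == gold.strip() else -1.0
    return reward
```

例如,答案正确且格式规范得 1.5 分,答案错误但格式规范得 -0.5 分。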

当然,这样做,在把推理拓展到数学、代码以外的任务的时候,可能行不通。但目前大家发力的重点主要就是数学和代码,而更加狭窄的长线条棋类和游戏场景,基本被传统RL攻克。尽管如此,绕过过程奖励仍然可能是潜在的软肋,理论上给结果正确,过程逻辑混乱留下了空间。

只关注答案对错、不问过程,是怎么强化出长线条的复杂推理的呢?门道就在于强化学习筛选出的答案自然偏向长答案,随着训练的推进,这就自然增加了 test time compute,发展出应对复杂推理的能力。这使得 R1 的强化学习更易自主探索推理路径,成就了 DeepSeek 的这次突破和出圈。

与 GPT 的 next token prediction(ntp) 规模化以后可以涌现通用智能异曲同工,DeepSeek那帮年轻人发现,只要结果明确可判定,结果导向的强化学习可以自然涌现出复杂的内部推理能力,因为正确的结果需要推理。这个发现的意义,对于领域今后的深度推理的推进非同一般,可媲美GPT系列预训练时发现的 scaling law。

头部推理模型 R1 的训练四阶段:

1. 推理冷启动

利用数千条高质量人工推理数据,例如:

# 收集高质量示例 
Question: 求解方程 x^2 + 2x + 1 = 0 
<think> 
1. 识别这是一个二次方程 
2. 系数: a=1, b=2, c=1 
3. 使用求根公式: x = (-b ± √(b^2-4ac))/2a 
4. 代入: x = (-2 ± √(4-4))/2 
5. 化简: x = -1 
</think> 
<answer>x = -1</answer>

2. 推理强化学习

结果导向,再生数据的模版“留白” <think>.........</think>

    • 设计简单模板让模型自主再生训练数据:
      <think>思考过程</think><answer>答案</answer>
      模型生成多个答案 -> 筛选正确答案 -> 加入强化学习的训练集
    • 不添加任何人工偏见或策略提示,留下RL自主学习推理过程的空间:
      逐渐增加思考时间(test time compute)和tokens量,模型就自发涌现反思步骤等推理能力,这就是论文作者描述的 aha moment,令人动容的见证
    • GRPO算法采样多个答案,通过内部对比来优化模型
    • 规则奖励:奖励答案正确 + 格式规范

前面提到,系统只验证最终答案,理论上无法保证中间推理步骤的正确性(可能学到"答案对但推理错"的模式),但实践似乎显示,只要强化学习足够充分和规模化,答案正确会自然导向推理步骤的正确性。根据K氏复杂性(Kolmogorov complexity)压缩理论,正确的推理导向正确的答案才是可靠解决方案的“最短程序”,这是无损压缩的终极目标。后训练强化学习的过程与预训练一样,都是对无损压缩的逼近。

DeepSeek 的探索再次表明,简单即美,scale为王
(一简遮三丑,你是服也不服? LOL)

天机就是,scale 是硬道理。简单架构/算法有利于真正的 scale up,只要目标清晰,一旦 scale 了,一切就自然搞定。

训练数据的源头

模板再生数据的 input 应该来自两个源头:1)人工设计的数学题/编程题;2)公开基准测试题(如AIME)。

根据模版的再生数据的output流程:

Input: x^2 + 2x + 1 = 0

Model 生成多个答案:

Answer1:
<think>[推理过程1]</think>
<answer>x = -1</answer>

Answer2:
<think>[推理过程2]</think>
<answer>x = -1</answer>

Answer3:
<think>[推理过程3]</think>
<answer>x = 2</answer>

筛选:保留 Answer1、2(答案正确),丢弃 Answer3(答案错误)
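上面的筛选流程可以用几行代码示意(仅为理解用的草图,假设输出遵循 <answer> 模板):

```python
import re

def filter_generations(samples, gold):
    """只保留 <answer> 内容与标准答案一致的生成结果(示意代码)。"""
    kept = []
    for text in samples:
        m = re.search(r"<answer>(.*?)</answer>", text, re.S)
        if m and m.group(1).strip() == gold:
            kept.append(text)
    return kept

samples = [
    "<think>[推理过程1]</think><answer>x = -1</answer>",
    "<think>[推理过程2]</think><answer>x = -1</answer>",
    "<think>[推理过程3]</think><answer>x = 2</answer>",
]
kept = filter_generations(samples, "x = -1")   # 保留前两个
```

筛选后的样本作为再生推理数据进入下一轮迭代的训练集。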

保留的答案作为推理再生数据加入训练集用于下轮迭代。所有再生训练数据都需要标准答案来评估正确性,这在来源中就给定了:

    • 数学题:确定的数值答案
    • 编程题:通过测试用例验证
    • 逻辑推理:有明确的正确结论

有标准答案是规则奖励能工作的基础。对没有标准答案的任务(如写作),需要用其他方式评估质量,例如使用奖励模型。

3. 综合性微调

800k 条训练数据,其中推理 600k, 其他任务 200k

论文没说为什么按照这个比例选取微调数据,应该是根据经验。600k 推理数据是再生的,用的就是阶段2的推理模型。但这里有一个值得注意的插曲:在阶段2的推理强化学习中,再生数据必须是奖励规则可以判定的。但阶段3的推理数据,却突破了这个限制。阶段3的推理数据增加一些 reward rule 不能判定的 cases,既然简单的奖励规则无法判定,就找 V3 模型来判定。好像是说,当一道推理题(数学、coding或逻辑题)生成n个奖励规则难以评判优劣的结果的时候,就把这些结果和标准答案送给V3,让V3做裁判。

另外的200k数据呢?一部分是拿来主义,直接从他们自己的V3的原始finetune训练数据中选取;另一部分让 V3 生成数据,但要求V3不仅给答案,还要给思维链过程(就是要求它 step by step 输出结果)。这可以理解,这里虽然不是纯粹的长线条推理题,其他任务很多时候也是要有条理的。

4. 全局强化学习

这最后的强化学习很像是早就使用过的 RLHF,更注重人类偏好的对齐。但为了防止推理退化,在偏好对齐的同时,也强化了推理,用的还是规则奖励。而人类偏好对齐用的则是V3原有的奖励模型(这是唯一真正用到的奖励模型)。

整个过程还是相当清晰的,原则上可复现。

用R1再生数据去蒸馏小模型,提升其推理能力

最后,Deepseek 的R1推理强化工作在蒸馏开源小模型方面也做得很牛,干翻了openAI 的 o1-mini 小模型。展示给世人看,开源 LLMs 开始全面逼近闭源模型。

但话说回来,没有这些巨烧钱的闭源模型在前面开路,并建立标杆,后来者也容易失去方向。现在这种局面非常好:让有钱的去砸银子。在金钱的赋能和压力下,不断开疆拓土。让deep“黑马”们在后面紧追不舍,而且还追赶得特别牛气。

令人印象深刻的是,R1 不是仅仅大幅度提升了推理能力(慢思维),在“传统”的知识能力方面比起它的基座模型V3也有显著提升。这可能是因为,推理能力的增强对于一些传统任务具有正面作用,但更应该归功于他们探索出来的四阶段训练R1的pipeline。

最后总结一下。

主要创新点:

1. DeepSeek-R1-Zero: 首个仅通过强化学习(RL)训练的推理模型, 无需人工推理数据的监督微调(SFT)。展示了模型可以纯靠 RL 自主发展出推理能力。

2. DeepSeek-R1: 在 R1-Zero 基础上做以下改进:
- 后训练阶段先用少量高质量数据进行冷启动SFT
- 采用4阶段的后训练流程,两次SFT,两次RL
- 性能可与 OpenAI-o1-1217 相媲美

3. 蒸馏技术: 成功将推理能力迁移到一系列开源小模型:
- 1.5B 参数的模型就超越了 GPT-4 在数学方面的表现
- 32B 和 70B 的模型创造了密集模型的新记录

关键技术细节:

- 使用 GRPO (Group Relative Policy Optimization)算法,舍弃 Critic 模型
- 采用基于规则的奖励系统, 舍弃奖励模型 PRM
- 设计特定的训练模板引导模型再生数据进行自主学习:
<think> is all we need for reasoning!

 

【笔者后记】

这两天莫名很兴奋。跟 deep啥 纠缠不休,今天才缓过气来 lol

硅谷老友群也热议不断:

Hongtao:
DeepSeek若不快速大融资和上市, R1的18位主要贡献者估计很快就被国内外大厂抢光了[Grin]
Core Contributors:
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
Ruoyu Zhang
Runxin Xu
Qihao Zhu
Shirong Ma
Peiyi Wang
Xiao Bi
Xiaokang Zhang
Xingkai Yu
Yu Wu
Z.F. Wu
Zhibin Gou
Zhihong Shao
Zhuoshu Li
Ziyi Gao
当年DeepMind被迫卖给Google,因为founders被告知若不卖,就高薪挖人。挖走一两个作者,就少走大部分弯路了。

主要还是幻方的AI量化投资受挫, 因势利导做deepseek成功;开源后,国内外大家都沿着这路子去试。若不财大气粗起来,优势恐怕难以为继。

超大模型训表征,
开源一蹴而就成。
强化学习各求精,
蒸馏定制缩小型。

内卷已经卷出墙,
硅谷AI圈被激荡。
OpenAI&Meta领头羊,
都被鞭策加速闯。

硅谷不眠夜:DeepSeek为何震动美国科技界?

Nick:复现DeepSeek貌似很容易。其实强化学习大家也都想到过,过去总觉得可能要花很多算力,少人试。貌似o1就是强化学习练出来的,但一些推理token他们没open。这可能迫使国内头部那两家加速上市过程。

立委:这类团队属于中华之光,国之重器。

他们写得基本够清晰了。让人担心他们下一步怎么保持这个势头和地位。很多神秘就是一层窗户纸。最大的功劳是他们同时也差不多捅破了o系列神秘面纱的窗户纸。

Nick:马上融一大笔钱,突击上市。除非手里还有更硬的牌。

马老师:好几家复现了deepseek,各家再各自探索,相信会是快速迭代的过程,有望再一次大发展。

Nick:也是个试金石,倒逼openAI看看还有啥新东西,是骡子是马拉出来溜溜。

Hongtao:给openai压力;更是 叫板meta, 争夺开源盟主地位

鲁总:OpenAI 的牌越来越少了。但SamA 希望通过心理战误导大众。之前发文强调过程奖励,O1 出来时放烟幕弹让人相信推断时使用复杂的搜索 ... 结果都应该没有用。

香港科技大学的团队说是也独立发现了RL涌现推理能力,不过只针对数学问题求解,但也特别指出使用输出格式奖励。

白老师:数学能力和编程能力是相通的。

不请贵的人是成功的很重要因素。

施总:哈哈。贵的不一定能干,能干的都比较贵。

刘总:主要是要用年轻人,岁数大的没戏。岁数大了,思维僵化,精力不行。当然,我说的是统计规律,个例总是有的。

立委:deepseek 不是常态,是冒尖。但 deepseek 这么一捅窗户纸,很多人就跟上了。不知道 它还有多少宝贝没有显露。否则 逐渐暗淡下去 也不是不可能的。

deepseek 之前,各种刷榜,也是内卷得一塌糊涂,也都刷榜刷到了世界先进水平。但没有哪家做到了 deepseek 这种硬气 震撼 和让人服气。一鸣惊人天下知。

Nick:估计每家都会短期内在数学能力上长足进步。豆包上周一周内就进步不小。窗户纸捅破,门槛也不是那么高。大概率o1也是这么做的,只不过内帮孙子比较鸡贼。

Liren:DeepSeek-R1告诉大家,你们都被PRM和MCTS误导了,其实只需要一个<thinking>标签就够了[Chuckle]

Nick:是啊,你写篇文章,“<thinking> is enough"

立委:就是留白。你留了白,系统就会给自主填上。

zero 的实践表明,根本不用想那么复杂,还要考虑怎么从各种不同推理任务中找到共同的思维链 patterns,等等。甚至也不管里面的逻辑是不是胡说八道,结果导向,最终,推理还是学出来了。预训练靠的是简单的 next token prediction,后训练推理靠的就是结果导向的强化自主学习。设计一个简单的模版就搞定了无穷的再生推理数据。

Nick:是啊,有了ToT和Gemini,话都在嘴边了。

Liren:增加在推理时的tokens来提升思考时间。

立委:秘方就是4步走:1 冷启动 2 强化 3 微调 4 再强化。zero 干脆省掉了 1 3 4,所以显得过于生猛,但 beautifully 证明了“硬启动”的强化学习也能涌现高级推理能力。r1 就是完善后训练的节奏和数据配比。很多应该就是经验,是摸索出来的 best practice,他们肯定有过很多其他失败的尝试,但还是摸着石头过了河。

马老师:感觉就是碰运气,不过沿着别人路走的永远没有运气。

立委:我觉得他们还有一些东西,所以才“肆无忌惮”。等于是他们推出了一个菜谱,这个菜谱做的菜比肩世界一流。但他们其实还有其他的菜谱,更高级,但不急于拿出来?

不是大道至简,而是大模型本身已经具备了强大的推理能力,它需要的只是:

1 足够的思考空间/时间/tokens量(<think>标签)
2 正确的反馈信号(答案正确性)
3. 探索优化的机会(GRPO采样选优)

复杂、难缠、费力的PRM(过程奖励模型)和MCTS(蒙特卡洛树搜索路径空间)反而限制了模型的自主探索。这说明大模型的能力被我们低估了。

deepseek 的成功的先决条件是 v3,他们自己做出了世界前列的头部基础模型,他们自己知道怎么善用它的潜力。如果是借助于外部基础模型 GPT4o,就很难这么快做出r1,很多 v3 的资源和practice 就在 r1 过程中直接借用了。

马老师:在理。

Nick:So what's next? assuming everybody will have as strong math capabilities within a month

立委:AI for science?机器自动证明百年难题啥的;机器自动发明新药......

Nick: only two problems matter: Riemann Conjecture and P vs NP

马老师:大厂也许会用更大的模型,更多的数据,继续向大上走。

Nick:那肯定。我觉得Nvidia的生意会更好。

立委:deep 目前为止还是在追平,是人家先树立了标杆,它去对齐。多快好省。

deep 要真牛,再上一个台阶,需要自己树立标杆和方向。但这太难了。目前为止似乎还是只有敢于疯狂烧钱 敢于无限做大的那些狂人才在开疆拓土。

 

【相关】

 

大模型风云诡谲的下半场:scaling 失效?

广密大模型季报谈AGI范式大转移这篇播客,很值得一听。涉及很多大模型产业重要话题,视野和风格很独到。

“范式大转移”的说法太耸人,但风云诡谲,是当下的写照。那是因为大佬军备竞赛虽然依旧如火如荼,可应用落地却处于难产期,如何 justify 投资是一个巨大的拷问,或泡沫。

三大学习: 监督学习、非监督学习(自学习/预训练)、强化学习(RL,自主学习/self-play),伊利亚曾经专门强调后者作为探索大方向的重要性。广密这里特别强调它是正在到来的大模型AGI之道的范式转变。

此前,大家都知道强化学习主要是两个里程碑:一个是 AlphaZero 围棋完胜人类选手,震惊了世界;另一个是所谓 RLHF(人类反馈强化学习),强化了与人类偏好的对齐,成就了ChatGPT的核爆。

现在谈的是大模型新生态可能性(并无广泛共识,只是硅谷小圈子在做、在议)以及大模型“军备竞赛”的新趋向。这个话题的背景如下:

1、 预训练 scaling (更大规模)似乎受困,GPT5 迟迟不出;

2、 军备竞赛的大格局和造势,大厂和大佬不想改变;

3、 技术大佬开始提出新路线图或试图构建新的 scaling law 继续AGI 的征程

广密在podcast里面,观察硅谷动向比较 closely,他的小团队不仅定期去湾区做深度调研,也建立了广泛的联系。在硅谷跟他们喝过两次咖啡聊天,一帮生龙活虎的小年轻在他的带领下,我的印象,是一个小而精干的独特存在。

这台节目的个人风格和视野也非常 unique,喜欢他说话的思路敏捷,有跳跃感,但张儿不散,有一种吸引人的表达力。主持人与他的交互也很丝滑,张弛有度。

听他们唠嗑吧,谈笑间大模型AGI的大趋势貌似尽收眼底。还是值得点赞的。

广密条理非常清晰,所涉及的知识面和大形势观非常广泛,却能present到自己的视角参照系,与LLM社区的思想趋势有较好的映射。这不容易,因为LLM这档子事,无论模型还是产业的 landscape,大多都是盲人摸象。很多大专家、商业大佬也都各有自己的三分地和视角,也很难全面评估形势。Meta 小扎刚前不久面对万卡竞赛能不能得到预期的AGI return的天问,回答说(大意):其实没人知道,但总不想万一错过的(典型的 FOMO心态)。

目前形势处于微妙期,其实还没有凝聚太多的共识。但是把英伟达送上天价的几个富可敌国的大佬/大厂,短期内却绝对不允许停止军备竞赛,这是他们科技商业争霸的游戏。这叫欲罢不能,节奏在他们手中。Open AI 不断融资,伊利亚自己也最近融资成功,这些都是这场竞赛持续热度的浪花。

与之相配合的是技术大佬和硅谷AGI主流技术圈对scaling的执着和痴迷。因为这次大模型 scaling 技术浪潮对于资源的无止境需求,真正能有机会实践、感知并做出调整改变路线图的技术人,也只能是一个很小的圈子。

据广密的信息,这个所谓 self-play RL 新生态趋势,目前是局限在硅谷技术大佬小圈子的共识,他提到的大约是一个不超过200人的圈子。如果信息正确的话,一个在硅谷技术核心圈200人以内的某种共识和议论,说明还只是一个动向,甚至连管理圈还没真正 get it 和对齐。

感觉上,广密有一些“春江水暖鸭先知”/“语不惊人死不休”的心态(LOL),有意强调/夸张了这个趋势,警醒国人,甚至说,如果我是大模型创业家,我会200%资源聚焦 RL 方向,bet on it,因为这是未来赢家的选择,云云。

其实,客观说,对于多数人这个不实在,也无可操作性,最多是说给国内大厂玩家或六小龙听的吧,但其实也是白说。RL 本来就不好玩,连开源标杆 Meta Llama 3 在最基本的 RLHF 方面都选择绕开来走,就更甭提提倡国内大模型公司全力 bet on 以强化学习作为新生态核心的愿景了。何况后者在硅谷最多也只是一种“潜流”,可能要等年底前OpenAI草莓以及Claude新版发布后,才能对这个所谓新生态的影响,看得清楚一些吧。

这个苗头可以 keep in mind,但上半场确实似乎结束了。真正可以在所谓的下半场作为 action items 的,其实不是这种高大上、带有很强探索性的大模型 scaling 方向的尚未确定的 roadmap,更多是趁着 GPT4o/Claude3.5级别的通用模型的推理成本越来越亲民化、趁着LLM供应商多模态功能在进一步推广和完善(例如超拟人全双工语音的最新突破和工具赋能就会大大增加应用层面的机会,还有视频的进展等), 加快找市场切入点(PMF),专注应用场景真正需求的解决。

对于产业而言,当前大模型规模化应用的困局才是悬在大模型产业头上的利剑,决定了这下半场在 tech adoption curve 下行能不能软着陆和最终平缓回升。至于军备竞赛,让马斯克、小扎等首富俱乐部继续玩继续high就好。

作为“预训练”的延伸,强化学习属于“后训练”,在前者遇到瓶颈和 diminishing returns的时候,加强后者是自然的补足。从AI对人类认知的模拟来说,前者是继承人类文明的知识和科技积淀,后者是把这些知识真正用到实处,在环境中学习。这个智能学习的总体思路 makes perfect sense,也是大模型应用必须要走的方向。

所以老友吕兄说:“直觉上RL是必须要走的路,因为supervised learning的数据没有那么多了。”

没错,不同程度利用再生数据,其实已经是日常 practice 了,也不再有以前的“心理障碍”,是一个必然。总体而言,模型就是比人能够更高质量产生数据,而且会越来越好。但这还不是这里说的self-play的主动探索和数据再生。

毛老说的也不错:“RL 与人类的认知过程相符,与认识论一致。实质上就是接收外界反馈,接受实践检验的过程。RL 是主动学习,而训练是被动的。”

广密现在是说,需要研究侧把 RL 范式化,提供某种 RL dev toolkit,然后有在各种场景去做 scale up RL 的路线。这个所谓“范式大转移”,没有1-2年的大厂/大佬的推动普及,没有抓手。持谨慎乐观或怀疑的 open 心态,静观其变吧。

转述老领导的硅谷风投现状和展望的分享

摘要

1. 未来几年内 65% 的风险投资公司将倒闭。

2. 凭借 2010 年代的出色回报,风险投资一路高歌猛进。

3. 2021 年,1577 家不同的风险投资公司共筹集了 1830 亿美元。

4. 但与此同时,创办一家初创公司的成本却越来越低。广泛可用的工具、全球劳动力和便捷的(在线)分销意味着创办一家 SaaS 公司从未如此简单或便宜。

5. IPO 窗口关闭了——公司无法上市。因此,风险投资公司无法通过大型 IPO 赚钱。

6. 并购也没有发生(至少对卖方来说价格不高)。因此,风险投资公司无法通过出售公司来赚钱。

7. “我们决定不再筹集另一只基金。” 翻译:他们可能做不到。更多的公司会说他们“不再投资”;合伙人“决定担任运营职务”。董事总经理正在退休。

8. 2023 年,597 家风险投资公司筹集了 810 亿美元。这分别下降了 63% 和 56%(与 2021 年相比)。

9. 风险投资的狂欢派对结束了。或者至少这一章结束了......

10. 超过 50% 的现有公司将无法生存。这意味着,如果您是在这种环境下筹集资金的初创公司首席执行官或运营商,您需要了解游戏规则已经改变。

11. 不要相信一些创始人仅凭 20 万美元的 ARR 和一份好牌就筹集了 3000 万美元的故事。风险投资公司用巨额支票救助经营不善的企业的时代已经结束。

12. 筹集资金的最佳时机是您不需要它的时候。深入挖掘以度过冬天。

以上是我的“老领导” Jonathan前天在LinkedIN关于硅谷风投下行的一种描述。很清醒。

听上去似乎奇怪,大模型爆发以后很快风投成为下行,这其实还是因为 technology adoption curve 本身现在在落地应用上遭遇挑战,整体处于下行,无论中美。

大模型的到来使得初创的启动变得容易,这样僧多粥少,筹集风投就更加困难。

我这个老板一共做过4家初创,第一家是他初出茅庐时候做的 Netbase,我们相处非常好。现在这第四家SaaS创业很扎实,势头很好,有望成为硅谷独角兽。他也历练成熟多了。

他是第一个在LinkedIn上给我写推荐的人,一直保持联系。


Jonathan Spier

Last Friday, I had dinner with a famous VC investor who told me 65% of VC’s will go out of business in the next few years. I believe him. Here’s what that means for startup leaders waiting on their Series A/B/C:

Those of us in startups tend to think VCs are at the top of the food chain.

They aren’t.

VCs are businesses too.

They raise money from THEIR investors (aka LPs).

And their job is to make a return for those LPs.

With outstanding returns in the 2010s, VC was on a win streak.

Many more funds were born.

And existing funds got much larger.

In 2021, 1577 different VC firms raised a total of $183 billion.

But at the same time, costs to launch a startup have gotten cheaper.

Widely available tools, global workforce, and easy (online) distribution mean it’s never been easier, or cheaper, to start a SaaS company.

So how are VCs supposed to deploy all that money they’ve raised?

They can’t.

There is too much money chasing too few deals.

Make no mistake, for VCs, it’s a fatal mix.

The IPO window is closed – companies can’t go public.

So VCs aren’t making money with big IPOs.

M&A isn’t happening (at least not at good prices for sellers).

So VCs aren’t making money by selling their companies.

If VCs aren’t making money, they can’t return capital to their LPs.

They are in trouble.

Of course, VCs rarely go out of business the way their companies might.

Reputations are at stake, so change happens quietly.

But it’s the same result.

It’s already happening.

Listen carefully, and you’ll hear VCs saying:

"We have decided not to raise another fund.”

Translation: they probably can’t.

More firms will say that they are “no longer investing”

Partners are “deciding to take operating roles.”

Managing Directors are retiring.

In 2023, 597 VC firms raised $81B.

That’s down 63% and 56% (vs. 2021).

The VC party is over.

Or at least this chapter is...

The select few at the top of the VC list will have their pick of deals.

The great business builders will choose their spots and continue to thrive.

I’ve been lucky to work with a few of those and am certain that their expertise and relationships will carry them through.

But over 50% of existing firms, won’t survive.

That means if you're a startup CEO or operator raising money in this environment, you need to understand the game has changed.

Don’t buy the stories of some founder that raised $30M with $200k ARR and a good deck.

The era of VCs bailing out bad businesses with huge checks is over.

Many of the VCs won’t even be around in a few years.

There is only one strategy that works in this economy.

Focus.

Nail your ICP.

Delight your customers.

Get profitable to control your financial destiny.

The best time to raise money is when you don’t need it.

It’s a harsh economy out there.

Burrow down deep to survive the winter.

Remember:

Tough markets make strong companies.

立委论LLM:视频生成的人物一致性问题

大千世界人脸识别是一个专门而艰难的任务,因为人眼对人脸特征非常敏感。正因为人脸识别专门而敏感,比起识别阿猫阿狗的传统图像识别,挑战性不可同日而语。结果这个赛道首先得到突破:甚至在大模型之前,深度神经人脸识别得力于极为丰富的人脸图像数据,就已经超过了人类肉眼的识别能力和敏感度,也率先得到了广泛应用,成就了前大模型一代的独角兽。

到了面向大千世界一切对象的万用视频大模型的时候,无论Sora还是可灵,人脸的一致性依然是一个巨大挑战。Sora 不公测不大知道,但通过类似design的国内头部视觉大模型可灵,可以感知里面的局限性。通常超过半分钟,人脸就开始发散,变得不像那个人了。长程人物形象的一致性,如果不做专项处理和引入专项优化,单靠现在通用的视频一致性的训练努力,是很难突破瓶颈的。这个局限,在玩可灵一类公开上线的产品时,各种测试一再被发现。

有些视频如果不是人眼的敏感性(难怪“外贸”协会的求偶者那么多,原来人眼揉不得沙子,尤其是找对象的时候,普通人很难粗粒度容忍潜在对象的面相瑕疵或因为面相的不来电 lol),哪怕从视觉物理特性看其实已经无法区分。可见人的眼毒: 可以立即分辨真假李逵。

例如,下面两个小雅的视频生成,小雅粉丝一眼就能辨真假,虽然小雅本人也有不同的年龄,不同的场合,会呈现不同的形象,但有一种可以抽象出来的不变的人脸特征在武装着人类对于人脸的火眼金睛。而这一切的密码其实在上一代人脸识别的模型里面已经解耦。

Note: character consistency is a key metric for cinematic, configurable video generation. Without crossing this threshold, the track will struggle to scale into real deployment, and the dream of a fully virtual, zero-photography, large-model Hollywood production line will remain a bubble.

Why is brute force unlikely to deliver long-range character consistency for visual models? Because video is an extremely high-dimensional modality; to handle it at all, large models (at least for the foreseeable future) must apply aggressive lossy compression. Visual tokens are compressed at a high ratio, which is what makes long-range, whole-frame consistency alignment in the internal hidden space tractable for training and generation. The higher the compression ratio, the stronger the temporal consistency of the overall picture; autoregression or DiT can then handle it. Only then are videos that violate the physics of the real world effectively suppressed, reducing common-sense-violating hallucinations and making visual models look like (plausible) simulators of the objective world. But here lies a contradiction: under lossy compression, the consistency of the overall frame and the consistency of the fine-grained details of specific physical objects within it cannot be optimized in sync.

The usual remedy today is to bolt on a super-resolution (SR) stage after overall blueprint consistency is secured, trying to reconstruct the discarded details. High-resolution rendering, thanks to several years of accumulated deep-fake R&D, is by now quite good. But deep fake is, in essence, patching after the fact under lossy compression: all it can do is use the imagination (or hallucination) that large models excel at to fill in details plausibly and non-deterministically, depicting the world as it should be, not as it is, and the result can be lifelike. But if the target is a specific subject, especially a fine-grained object like a face with individual features (IP) that the human eye is sensitive to, it will inevitably drift over long generations. That is the crux of the problem. The solution should not be to bet on ever-bigger models and ever-longer context windows, i.e., big-data brute force, because brute force can only slow the drift; it cannot cure the accumulation of non-deterministic deviation over time in the SR stage of long videos. One must think outside the box: remove time as a conditioning dimension and align step by step against a reference, and the knot may come undone. I will stop here; don't say you weren't warned.
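The drift argument can be felt numerically. Below is a toy simulation (my own illustration, with made-up noise scales, not any real SR pipeline): a free-running chain perturbs the previous frame's face feature each step, so deviation accumulates as a random walk, while per-frame anchoring to a fixed reference keeps deviation bounded:

```python
import random

def simulate(frames=900, noise=0.01, anchored=False, seed=0):
    """Track |deviation| of a 1-D 'face feature' from its reference per frame."""
    rng = random.Random(seed)
    dev = 0.0
    history = []
    for _ in range(frames):
        step = rng.gauss(0, noise)
        if anchored:
            dev = step      # each frame re-aligned to the reference, then perturbed
        else:
            dev += step     # each frame conditioned only on the previous frame
        history.append(abs(dev))
    return history

free_run = simulate(anchored=False)
anchored = simulate(anchored=True)
print(f"final drift, free-running: {free_run[-1]:.3f}")
print(f"final drift, per-frame anchoring: {anchored[-1]:.3f}")
```

With identical noise, only the free-running chain wanders arbitrarily far from the reference, which is the intuition behind step-by-step alignment.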

The precondition is that the face can be decoupled; features that cannot be decoupled cannot be aligned step by step. It must be decouplable, or there would be no explaining how a few dozen famous Hollywood actors can star in thousands of films. Decoupling the face from expression and time still has room to improve, but the technology is already fairly mature.

 

Related Links:

Ilya Sutskever: Would you bet against AI?

Turning the knife on the board: how OpenAI resolved its fatal dilemma

A small step for Nvidia's GTC, a giant leap in the history of AI evolution

Why is Sora a world simulator?

Llama 3 Notes and Llama MV with Llama 3.1 Legend

Notes on the 92-page Paper Released with Meta's Super Large Model Llama 3.1

The super-large model Llama 3.1 is a milestone in the open-source large model community. As a leader, Meta's project involved over 500 participants/contributors (the authors of this paper are listed alphabetically in the appendix, similar to how the Central Committee members' names are displayed by stroke order). This original text is full of implementation details:

meta Llama 3.1 paper

AIGC MV using Suno and keling (just for fun & cheering opensource milestone)

Notes:

  1. Llama 3.1 doesn't use sparse techniques; it's not a mixture-of-experts system (as GPT-4 is rumored to be), but a dense model.
  2. 405B parameters, 15.6T tokens: The number of tokens is 40 times the number of parameters. Large-scale top models now emphasize data growth far exceeding parameter growth. Is this 15T tokens of data open source? (No, because even if they were willing to open source it, they wouldn't dare, as it could lead to countless data infringement lawsuits)
  3. Emphasizes three major levers for super-large foundation models: data, scale, and managing complexity.
  4. Compared to the previous generation system Llama 2, computational power has increased 50 times (using 3.8 × 10^25 FLOPs).
  5. Complexity management: (1) Choosing a standard dense Transformer architecture instead of a mixture of experts model to maximize training stability. (2) Adopting a relatively simple post-training procedure: Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO). In other words, algorithm design and implementation tend towards simplification. Not using sparse techniques and multi-expert systems is for stability (but training challenges are greater, though they're not afraid). Using simpler, easier-to-implement DPO in the post-training phase instead of reinforcement learning is also for stability, as reinforcement learning has always been difficult to handle.
  6. Benchmark tests cover: general, code, math, reasoning, tool use, long context, and multilingual. All performances are SOTA (state-of-the-art international level).
    • MMLU (Massive Multitask Language Understanding): 405B model achieves 87.3% (5-shot), 88.6% (0-shot, CoT).
    • Code generation (HumanEval): 405B model reaches 89.0%, close to GPT-4.
    • Math problems (GSM8K): 405B model achieves 96.8%, slightly higher than GPT-4.
    • Long context tasks: Excellent performance on some tasks, such as 95.2% on QuALITY.
    • Multilingual tasks (MGSM): 405B model reaches 91.6%, on par with top models. The 405B model is comparable or close to GPT-4 and Claude 3.5 Sonnet on many tasks. In short, open-source has caught up with closed-source.
  7. Pre-training started with an 8k window, expanded to a 128k window in the later stages of pre-training (continued training).
  8. After the foundation model pre-training was completed, multiple iterations of alignment "post-training" were performed. Including: (1) Aligning the model through human feedback, including multiple rounds of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO); (2) Integrating new capabilities, such as tool use; (3) Enhancing coding and reasoning abilities (specialized optimization); (4) Safety alignment.
  9. Multimodal expansion (in progress, not yet released): Image, video, and speech capabilities. Including (1) Multimodal encoder pre-training: Image encoder trained on a large number of image-text pairs, aligning visual content and natural language in a unified space; (2) Speech self-training? (3) Experiments on video-text data alignment based on images.
  10. Language model as the core, other modalities are added later (whether added to pre-training and/or post-training). When expanding to multimodal, the language model parameters remain unchanged, adapting to multimodality, allowing multimodal alignment in the same semantic space, closer to the language model. In other words, Llama follows a modular, step-by-step approach to gradually expand to multimodality. This is not the mainstream approach (mainly referring to Open AI and Google, at least in theory) advocating for "unified multimodal native data joint pre-training". The overall impression of Llama's algorithmic strategies is seeking stability rather than innovation or unification. It tends towards practicality, not caring about leading in algorithms. For example, the integration of speech first involves speech self-training (because speech is actually very similar to text, both being language systems), then alignment between speech and text (including Automatic Speech Recognition ASR and Text-to-Speech TTS). Integrating step by step into the cross-modal large model, this approach isn't cutting-edge in terms of advancement, but it's steady progress, beneficial for engineering development, integration, and iteration. It's unclear when they will be able to release multimodal capabilities online.
  11. Data collection and cleaning work is very complex, but the Llama team is meticulous, which is also the data guarantee for its quality to catch up with SOTA. To recap: (1) De-duplication: URL-level de-duplication; Document-level de-duplication using MinHash algorithm; Row-level de-duplication: removing rows appearing more than 6 times every 30M documents. (2) Filtering: Removing low-quality documents, outliers, and excessively repetitive documents, using repetitive n-gram coverage to remove repetitive content (such as logs or error messages); using "dirty word" counts to filter adult websites not covered by blacklists; using token distribution KL divergence to filter documents with too many abnormal tokens. (3) Controlling data quality: Using fasttext classifier to identify text that might be cited by Wikipedia; using a Roberta-based classifier trained on Llama 2's predictions; using DistilRoberta to generate document quality scores. Also, fasttext language classifier can identify 176 languages; specially filtering two types of information: adult content and personal identity/privacy information. Special fine processing for code and math web pages.
  12. Data proportions: For example, downsampling over-represented data categories on the web (such as art and entertainment); data mixing ratios determined by a series of small model experiments, final data mix summary: About 50% of tokens correspond to general knowledge; 25% of tokens involve math and reasoning; 17% of tokens are code; 8% of tokens are multilingual content.
  13. Model architecture: Apart from empirical detail adjustments, the basic architecture of the dense model remains unchanged, so it's data and scaling that create top models. 405B model specific parameters: 126 layers; token representation dimension 16,384; 128 attention heads; model size of 405B determined according to scaling law, about the computational optimal size under 3.8 × 10^25 FLOPs training budget.
  14. Vocabulary: Using a vocabulary of 128K tokens. Combines 100K tokens from the tiktoken3 tokenizer and 28K additional multilingual tokens to better support non-English languages.
  15. Computing resources, including GPU clusters of tens of thousands of cards, massive storage, and high-speed networks, represent huge resource investments. Specific data as follows: Computing resources:
    • Used up to 16,000 H100 GPUs (a very powerful graphics processor).
    • Each GPU has 80GB of high-bandwidth memory, with a power of 700W.
    • These GPUs are installed on servers designed by Meta itself, with 8 GPUs and 2 CPUs per server. Storage system:
    • Uses a distributed file system called Tectonic.
    • Provides 240PB (1PB=1000TB) of storage space, distributed across 7,500 servers.
    • Can process 2TB of continuous data per second, with a peak of 7TB/second.
    • A major challenge is handling the large amount of burst writes generated when processing model checkpoints (the process of saving model states).
  16. Three-step pre-training process: a) Initial pre-training; b) Long context continued pre-training; c) Annealing with high-quality data sources Key pre-training strategies:
    • Gradually increase batch size and sequence length to balance stability and efficiency.
    • Dynamically adjust data mixing to specifically enhance certain capabilities.
    • Increase context length in stages to avoid early computational overhead.
    • Use annealing and high-quality data in the late stages of training to fine-tune model performance.
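A quick sanity check ties items 2 and 4 together. Using the common rule-of-thumb estimate of 6 × parameters × tokens training FLOPs for a dense transformer (a standard approximation, not a figure from the paper itself), the quoted numbers reproduce the stated 3.8 × 10^25 budget:

```python
params = 405e9        # 405B parameters
tokens = 15.6e12      # 15.6T training tokens
flops = 6 * params * tokens   # rule-of-thumb dense-transformer training compute
print(f"estimated training FLOPs: {flops:.2e}")      # close to the stated 3.8e25
print(f"tokens per parameter: {tokens / params:.1f}")  # roughly the "40x" ratio
```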

 

[LLM Summary]

Llama 3: Meta's Open-Source Large Language Model Breakthrough

1. Introduction and Overview

Meta has introduced Llama 3, a series of foundation language models designed to support various tasks including multilingual processing, programming, reasoning, and tool use. This model series includes versions with 8B, 70B, and 405B parameters, with the largest 405B parameter model adopting a dense Transformer architecture and supporting context windows of up to 128K tokens. The development of Llama 3 highlights three key factors: data quality and scale, computational scale, and complexity management.

2. Model Architecture and Pre-training Strategy

2.1 Model Architecture

Llama 3 retains the standard dense Transformer architecture rather than adopting a mixture of experts model. This choice aims to maximize training stability, reflecting Meta's emphasis on simplifying design to manage complexity. Key architectural improvements include:
- Using Grouped-Query Attention (GQA) mechanism, with 8 key-value heads per attention layer.
- Introducing attention masks to prevent self-attention between different documents in the same sequence.
- Expanding the vocabulary to 128K tokens, combining 100K tokens from the tiktoken3 tokenizer and 28K additional multilingual tokens.
- Increasing the RoPE base frequency hyperparameter to 500,000 to support longer contexts.
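The Grouped-Query Attention (GQA) idea, fewer key-value heads shared across groups of query heads, can be sketched in a few lines. This is a toy single-query-position version with made-up dimensions (my sketch, not Meta's code; Llama 3 uses 8 KV heads against many more query heads):

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def gqa(queries, keys, values, n_kv_heads):
    """Toy grouped-query attention for one query position.
    queries: [n_q_heads][d]; keys/values: [n_kv_heads][seq_len][d].
    Each KV head is shared by n_q_heads // n_kv_heads query heads."""
    n_q_heads, d = len(queries), len(queries[0])
    group = n_q_heads // n_kv_heads
    outputs = []
    for h, q in enumerate(queries):
        kv = h // group   # which shared KV head this query head reads from
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in keys[kv]]
        weights = softmax(scores)
        outputs.append([sum(w * vt[i] for w, vt in zip(weights, values[kv]))
                        for i in range(d)])
    return outputs

# Llama-style ratio in miniature: 8 query heads sharing 2 KV heads
q = [[1.0, 0.0]] * 8
k = [[[1.0, 0.0], [0.0, 1.0]]] * 2   # 2 KV heads, 2 positions, d=2
v = [[[1.0, 2.0], [3.0, 4.0]]] * 2
out = gqa(q, k, v, n_kv_heads=2)
print(len(out), len(out[0]))
```

The payoff is a smaller KV cache: only `n_kv_heads` sets of keys and values are stored, while query expressiveness is kept.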

2.2 Pre-training Data Processing

Llama 3's pre-training data processing is extremely rigorous, including:
- Multi-level deduplication: URL-level, document-level (using MinHash algorithm), and row-level deduplication.
- Heuristic filtering: Removing low-quality documents, outliers, and excessively repetitive content.
- Model-based quality filtering: Using fasttext and Roberta-based classifiers for quality assessment.
- Special content processing: Developing specialized processing pipelines for code and mathematical content.
- Multilingual data processing: Using fasttext base language identification model, supporting 176 languages.
- Safety and privacy protection: Filtering website data containing personally identifiable information (PII) and unsafe content.
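The document-level MinHash deduplication mentioned above can be illustrated with a toy version (my sketch, not the actual Llama pipeline): hash each document's shingle set under many seeded hash functions, keep the minimum per function, and estimate Jaccard similarity as the fraction of matching minima:

```python
import random

def shingles(text, n=3):
    """Set of overlapping n-word shingles."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def minhash_signature(shingle_set, seeds):
    # one minimum hash value per seeded hash function
    return [min(hash((seed, s)) for s in shingle_set) for seed in seeds]

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(0)
seeds = [random.getrandbits(32) for _ in range(128)]
doc1 = "the llama model was trained on fifteen trillion tokens of web data"
doc2 = "the llama model was trained on fifteen trillion tokens of clean data"
doc3 = "completely unrelated text about cooking pasta at home tonight"

s1, s2, s3 = (minhash_signature(shingles(d), seeds) for d in (doc1, doc2, doc3))
print(estimated_jaccard(s1, s2), estimated_jaccard(s1, s3))
```

Near-duplicate documents get near-equal signatures, so they can be bucketed and dropped without comparing every pair of documents in full.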

2.3 Pre-training Strategy

The pre-training process is divided into three main stages:
1. Initial pre-training: Conducted on about 15T multilingual tokens, far exceeding Llama 2's 1.8T tokens.
2. Long context pre-training: Gradually expanding from initial 8K tokens to 128K tokens context window.
3. Annealing phase: Fine-tuning with high-quality data in the final stage, using Polyak averaging to generate the final model.

Data mixing ratios are carefully designed:
- 50% general knowledge
- 25% mathematics and reasoning
- 17% code
- 8% multilingual content
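These mixing ratios amount to weighted sampling over data sources at training time. A minimal sketch (my illustration, using the published percentages) of drawing documents according to that mix:

```python
import random

MIX = {"general": 0.50, "math_reasoning": 0.25, "code": 0.17, "multilingual": 0.08}

def sample_source(rng):
    """Draw one data source according to the published token mix."""
    r, acc = rng.random(), 0.0
    for source, weight in MIX.items():
        acc += weight
        if r < acc:
            return source
    return source  # guard against float rounding at the boundary

rng = random.Random(42)
draws = [sample_source(rng) for _ in range(100_000)]
print({s: round(draws.count(s) / len(draws), 2) for s in MIX})
```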

3. Training Infrastructure and Challenges

3.1 Computational Resources
- Using up to 16K H100 GPUs, each equipped with 80GB HBM3 memory.
- Adopting a 4D parallel strategy: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.

3.2 Storage System
- Using the Tectonic distributed file system, providing 240PB of storage space.
- Supporting 2TB/s sustained throughput, with peak capacity of 7TB/s.

3.3 Network Optimization
- Developing the NCCLX communication library to improve network efficiency.
- Designing specific network topologies and load balancing strategies.

3.4 Training Challenges
- Experiencing 466 job interruptions during the 54-day training period, 419 of which were unexpected.
- Developing automated systems and specialized tools to handle hardware failures and network issues.

4. Post-training and Alignment

Llama 3 adopts a multi-round iterative post-training process, including:
1. Supervised Fine-Tuning (SFT)
2. Direct Preference Optimization (DPO)
3. Reward model training: Using human feedback data
4. Safety alignment: Implementing multiple rounds of safety measures

This process not only improves the model's instruction-following capabilities but also enhances safety and specific abilities (such as coding and reasoning).

5. Multimodal Expansion

Although not officially released yet, Llama 3 demonstrates promising multimodal capabilities:
- Image recognition: Training independent image encoders, integrated with the language model through adapters.
- Video understanding: Adding video adapters based on image adapters.
- Speech processing: Independently training speech encoders, then aligning with the language model.

This modular approach allows flexible addition of new modalities while maintaining core language capabilities.

6. Performance Evaluation

Llama 3 performs excellently in multiple benchmark tests:
- MMLU (5-shot): 87.3%
- HumanEval (code generation): 89.0%
- GSM8K (math problems): 96.8%
- Long context tasks (like QuALITY): 95.2%
- MGSM (multilingual tasks): 91.6%

These results indicate that Llama 3 405B is comparable or close to GPT-4 and Claude 3.5 Sonnet on multiple tasks, particularly excelling in document understanding and long context tasks.

7. Safety Considerations

Meta highly prioritizes safety in the development of Llama 3:
- Implementing strict safety measures in both pre-training and post-training stages.
- Developing the Llama Guard system-level safety solution.
- Conducting extensive red team testing and risk assessments.

8. Open Source Impact and Future Directions

Meta's decision to publicly release the entire Llama 3 series, including the 405B parameter version, may have far-reaching impacts on the AI research community:
- Promoting open, responsible AI development.
- Accelerating AGI research progress.
- Providing researchers with opportunities to examine and improve large-scale language models.

Future development directions may include:
- Further improving multimodal integration.
- Expanding context length.
- Continuously enhancing data quality and model scale.

9. Conclusion

The development of Llama 3 demonstrates Meta's deep experience and forward-thinking in large-scale AI systems. By focusing on three key levers - data quality, computational scale, and complexity management - Llama 3 has reached or approached the current state-of-the-art level on several key benchmarks. Its open-source release may drive a wave of innovation across the entire AI field, paving the way for responsible AGI development.

Llama 3: Meta's AI Chef's Latest "Divine Delicacy"

Attention, all tech enthusiasts! The Michelin three-star AI chef Meta has just unveiled a new dish! This divine delicacy named "Llama 3" is not only spicy enough but will elevate your taste buds to new heights!

1. The Chef's Secret Weapon

Imagine Llama 3 as a super nanny who speaks 8 languages, writes code, does math, and can be your personal assistant. She can handle a kindergarten full of rambunctious kids (8B version), manage a mid-sized company (70B version), or even govern a small country (405B version)! This 405B big sister can remember 128,000 "gossips" (oh no, I mean context) simultaneously, essentially a walking encyclopedia + supercomputer!

2. Ingredient Selection: Only the Freshest!

Llama 3's chefs are masters at picking ingredients:

  • They "fished" 15 trillion words from the internet, nearly 10 times more than the previous generation!
  • Half of these words are everyday life seasonings, a quarter are math problems and brain teasers, nearly a fifth are programmer spells, and the rest are dialects learned from world travels.
  • They even invented a super weed remover, filtering out all the online garbage, repetitive, and unhealthy stuff.

3. Cooking Process: Three-Step Stir-Fry Method

Step 1: "Slow Simmer" - Start with a regular stove (8K context) to cook it halfway. Step 2: "High Heat Stir-Fry" - Switch to a super stove (gradually increasing to 128K context), reducing the sauce to be thick and fragrant. Step 3: "Low Heat Finish" - Finally, a gentle simmer with the best ingredients, the legendary "annealing" (even the chefs don't know why it's called that), bringing the flavor to its peak!

4. Kitchen Equipment: Top-of-the-Line Luxury Version

  • 16,000 super high-power induction cookers (H100 GPUs) firing simultaneously!
  • A refrigerator that could fit half the Pacific Ocean (240PB storage)!
  • A proprietary ingredient prep system faster than 5G (NCCLX communication library)!

Imagine all these stoves firing at once, making the kitchen feel like a sauna. But our chefs persevered through the heat, changing chef uniforms 466 times in 54 days to whip up this dish!

5. Training Method: Both Cute and Well-Mannered

Being a good cook isn't enough; you've got to have manners too! So our chefs began a long "training" process:

  • First came a round of "gentle education" (supervised fine-tuning)
  • Then the "carrot and stick" tactic (direct preference optimization)
  • Finally, they invited moral role models (safety alignment) for guidance

After all this fuss, Llama 3 not only cooks well but also knows how to please people, program, do math, and mind her manners - a true decathlon champion!

6. Special Side Dishes: Showcasing Multiple Talents

Don't think Llama 3 can only cook; she's a multi-talented "goddess":

  • Storytelling from images? Piece of cake!
  • Writing movie reviews? No problem!
  • Recognizing songs and even singing a bit? The karaoke queen!

Although these "talents" are still in practice, they already show the potential of Li Bai's "from black hair to snow white in a day"!

7. A True Powerhouse: Dazzling Test Scores

Llama 3 participated in a series of "Top Chef Competitions," with eye-popping scores:

  • College Entrance Exam (MMLU): 87.3 points (out of 100)
  • Programmer Interview (HumanEval): 89 points (out of 100)
  • Math Olympiad (GSM8K): 96.8 points (out of 100)
  • Long Novel Reading Comprehension (QuALITY): 95.2 points (out of 100)

Bring this report card home, and even a "Tiger Mom" would be grinning from ear to ear!

8. Safety First: AI's "Security Captain"

Meta's chefs know well the principle of "don't leave guns and ammo lying around." They've assigned Llama 3 a 24/7 bodyguard team (Llama Guard) to prevent her from accidentally saying or doing the wrong thing. They even arrange occasional "moral exams" to ensure she doesn't turn into a "Terminator."

9. Open Source Feast: Everyone Can Be a Master Chef!

The most impressive part is that Meta decided to make the recipe for this "divine delicacy" completely public! It's like a Michelin three-star restaurant putting their signature dish's recipe online. Now anyone who wants to can whip it up at home! This move not only shocked other master chefs but also made countless food lovers cheer with joy!

10. Future Outlook: Reaching New Heights

Meta's chefs aren't resting on their laurels; they're already pondering the next "divine delicacy":

  • Maybe a dancing Llama 4?
  • Or a painting Llama 5?
  • Who knows, one day we might see a Llama 6 composing symphonies!

In short, the AI world's "Michelin" journey has only just begun!

Epilogue

The birth of Llama 3 not only elevates Meta's status in the AI world but also brings a fresh breeze to the entire AI research community. This bowl of "Llama soup" is not only delicious but also brings unlimited imagination to everyone. What will the future of AI be like? Let's wait and see what flavor the next "divine delicacy" will be!

In the Land of Esperanto, Friends All Over the World (video-revised edition)

Friends all over the world

 

The Kling AI visual model recreates a scene from the 1987 Beijing Esperanto Spring Festival party. I was dating at the time and had brought my "leader" (then my girlfriend) to meet old friends in the circle (on the right is 郑伯承 of the Encyclopedia of China, a very gifted old friend); behind us hang the riddles of the festival guessing game. The fidelity of the reconstruction, down to bearing, micro-expressions, and movement, is hard to believe.

Apart from dead languages, the geographic distribution of a language is easy to establish. But where is the land of Esperanto (Esperantio)? Esperantists (Esperantistoj) will proudly tell you: nenie kaj chie (nowhere, yet everywhere). Esperantio estas tie kie estas Esperantistoj. (Wherever there are Esperantists, there is the land of Esperanto.)

The universalist ideal of Esperanto resembles the legend of The Internationale, which reminds me of the Lenin quotation we memorized as children: "A class-conscious worker, whatever country fate may throw him into, however much he feels himself a stranger, without language, without friends, far from his homeland, can find himself comrades and friends by the familiar refrain of The Internationale." For Esperantists, Esperanto is what The Internationale was to worker-revolutionaries early last century: a bridge for finding comrades. And so it was that, with the shared language of Esperanto, from Britain to Canada to the United States, wherever I went I always found comrades.

Esperantists are not numerous, perhaps thirty million in all, but they are spread extraordinarily wide: almost every medium-sized or larger city in the world has an Esperanto club, large or small. Every club treats an Esperantist from afar like family, as I know from deep personal experience.

Manchester, England was the first stop of my studies abroad. Like many others, on my first long separation from my homeland I felt an indescribable ache, hollow and dazed. At loose ends, I picked up the Yellow Pages and looked up Esperanto, and sure enough there was a contact: a club of retirees who met once a week at a pub. They were delighted; my joining brought them novelty.

So every weekend they sent someone to drive me there and back. It was my first contact with British pub culture. At first I wasn't used to it: the din, the huge screens forever showing football, people playing billiards and games, and still more people holding forth over their beers. The British devotion to the pub exceeded my imagination; some would arrive at dusk and soak until the small hours, vast quantities of beer in, faces flushed, talking about who knows what. Friendship over drinks; such is life.

Our club usually found a relatively quiet corner to sit together. What we talked about I no longer remember, but the warmth of those gatherings left a lasting aftertaste in my brief, lonely sojourn in Britain.

Having tasted the sweetness of finding Esperanto "comrades" in Britain, on my second day in Vancouver I opened the Yellow Pages and duly reached an old Esperantist, J, a German, unfailingly courteous, kind, and warm. Over my five years in Vancouver he became the closest of my friendships across the generations. Once, when I lectured in my department on "machine processing of Esperanto," he hauled in his old camcorder like a reporter, bustling about to film me, letting my professors and classmates see firsthand the enthusiasm of Esperantists.

The author with Esperantist friends in Vancouver (Mr. J in the center)

The Vancouver Esperanto club also had a group of young men from the telephone company, all sharp-looking, who were thrilled to hear that a genuine Chinese Esperantist had arrived. After a dinner together they warmly invited me to go skiing with them that weekend. I had only just arrived and my coursework was heavy, but their enthusiasm was hard to refuse, so I gave up a day and went. It was my first time skiing; despite constant falls, everything felt fresh and wonderful. I had never been in such surroundings: pines and white snow, laughter ringing out, ski suits of every color, heaven on earth.

After skiing we went to one member's girlfriend's house for dinner. The hostess, young, pretty, warm, and hearty, cooked us a big pot of stew. Her house sat on a hillside, its backyard facing a canyon of boundless scenery. Despite the cold, everyone preferred to stay outdoors, beer in hand, admiring the view. After a year in Britain's grey drizzle, standing amid Vancouver's majestic scenery felt like entering fairyland. Afterward we sat around watching a tape of the American cartoon The Simpsons. The hostess sat beside me and said considerately: you've just arrived and may not catch the culture and the humor, so let me explain a little. And she murmured warmly in my ear; I understood none of it and just nodded along. Seeing me nod, a teachable student, she grew warmer still. It was a delightful, slightly suffocating experience: a blonde beauty at my side, attentive and warm, practically leaning on me, her breathy murmur at my ear.

In the years that followed, our club met for a meal every weekend, always convened by old Mr. J, for five years without a break. It was a very happy time. Worth mentioning: at the Vancouver club I once ran unexpectedly into my old friend 郑伯承, with whom I had lost touch for years. 伯承, secretary-general of the Beijing Esperanto Association, had once been my classmate in an Esperanto course. He had come to Vancouver to present a paper at the International AIDS Conference and had contacted Mr. J beforehand. When Mr. J announced that at the next gathering we would welcome a friend from Beijing, I wondered who it might be, thinking that if it was an old Beijing Esperantist I would most likely know him already; I did not expect 伯承. Old friends meeting far from home are doubly dear, and we had a joyous reunion.

Spring 1987: the author and his wife (then girlfriend) at an Esperantist gathering, with 伯承

Written May 28, 2004, Buffalo, USA

 

Table of contents, electronic edition of 《朝华午拾》

Suno: 《立委: Esperanto: Al Nia Kara Lingvo (Love of Esperanto)》

立委 in Two Minutes: The Pains of the LLM Era

I am Xiaoya, your digital host. Today's topic on 《立委 in Two Minutes》 is the pain brought by the large-model era, a topic also raised at the AI conference now underway in Shanghai.

No technology this time, and no popular science either; let us talk about the problems the large-model era brings, chiefly two: the collapse of the line between real and fake, and the jobs being eaten by the models. Both are heavy topics, and both are in progress; the problems will only grow worse.

The arrival of generative AI means the arrival of an era in which real and fake are indistinguishable. Today, whether in text, audio, or video, as large models spread, the bar for deep fakes drops toward zero. Hearing was never believing; now seeing isn't either. What information can be trusted? Society hardly seems prepared.

Even more troubling is the shock to the job market. Almost every white-collar job faces the challenge of large models. AI combined with robotics (so-called embodied intelligence) is also accelerating; in a few years blue-collar jobs will face the same challenge.

In recent months in the US, quite a few young people around me have lost their jobs, including several Berkeley data-science graduates. On one side, AI giants such as Nvidia, Apple, and Microsoft keep climbing in market value; on the other, wave after wave of IT layoffs, the giants included. Data-science graduates without hard-core engineering skills are hit hard, so many young people resent large models. And this is only the beginning. Even graduates of elite schools are polarizing. In the frightening tide of AI crushing human labor, data science is the disaster zone.

Not long ago, before large models, all of America foresaw data science as the greatest future need: every company and every product, large or small, would have enormous demand for data work. So universities everywhere added DS programs, rapidly expanding bachelor's and master's curricula straddling computer science and statistics, with online courses and degrees springing up everywhere. Then large models arrived, and they analyze, summarize, and present data faster and better than people do.

An old friend who is a professor in the US puts it this way: data science killed statistics, and AI killed data science. There is another trend in higher education now called micro-credentials, certificates of every kind. Hordes of half-baked potatoes will flood the job market, while the fully trained cannot find work. If the certificate-holders take all the jobs, is that fair to regular graduates? What motivation do students have to study? What is graduate education even for?

Sharp questions, but hard to answer. The situation will only get worse. Human psychology and social response always lag a few beats behind, while technology rolls on.

With large models here, much of education is in fact unnecessary. The knowledge-transmission function of traditional schooling retains little real substance, since knowledge now comes on demand, answered as soon as asked. A large model tutors better than most lecturers and professors. In truth, the larger function of education and schools now is as a reservoir: holding those who have no work and cannot find it, delaying their entry into the workforce, and offering a false hope for the future.

The same holds in China. Many undergraduates go on to master's degrees and many master's students to PhDs, not because ever-higher study was in their life plan, but because they cannot face graduating into unemployment and escape into further study. But the water rises with the boat: job requirements keep climbing while positions shrink rather than grow. Nuclear-physics PhDs joining neighborhood offices, PhDs working as auxiliary police, Chinese majors sweeping streets, and so on, in a vicious cycle.

Perhaps AI really should slow down. But no one can stop technology's relentless churn, inward and outward. Humanity cannot jump out of its own loop. The traditional values that treat career development as society's conductor's baton must change, but change needs a long time and matching mechanisms, and no credible plan or action for either is anywhere in sight.

The LLM industry at the eye of the storm is itself brutally competitive. The last technological revolution, mobile, pushed the basic needs of food, travel, entertainment, and communication to the limit and produced a batch of super-apps: Meituan, Didi, WeChat, Douyin, Pinduoduo. What remains, knowledge work and art work, is high-end demand: writing articles, composing tunes, painting, making videos, employing an AI secretary, mostly icing on the cake. Someday these high-end demands may become basic needs, but right now they look like pseudo-needs, which is why scaled deployment is so hard.

What counts as recognized real demand today? One company boss put it well: show me how many people I can lay off by using your AI. That is a real need.

The problem is that general-purpose large models can be used free or at bargain prices, leaving little room for LLM startups. Meanwhile the layoffs driven by LLM cost-cutting have already begun. Only specially customized AI that helps companies cut staff further has a path to survival.

The greatest irony is that coders, once honored as the aristocrats of wage labor, are first in line. Having swept away data-science jobs, AI is now on its way to eliminating the more mediocre CS graduates as well. For years the US, an IT superpower, could not fill its programmer pipeline with home-grown CS graduates and relied on graduates from India, China, and elsewhere. Those good days, too, are nearly over.

I do not buy the popular notion that because past technological revolutions destroyed some jobs and created new ones, we can relax, for this time will be the same. Times have changed, and most likely it will not be the same. We must recognize that far more positions will be destroyed than created. Yes, Meituan delivery riders and Didi drivers are new jobs created after the last mobile-platform revolution swept away brick-and-mortar stores, but look at how oversubscribed those minimum-wage jobs are: too many monks, too little gruel, everyone struggling. And the arrival of AI robo-taxis will gradually eliminate Didi drivers.
Rising productivity always brings mass unemployment, and industrial restructuring brings more. Most of the displaced have essentially no hope of a comeback. The structurally unemployed, droves of the middle-aged, can only wait out the years to retirement. Barring a miracle, employment for the young grows ever grimmer too. What humanity must face and change is this: not working becomes the norm, and UBI must be built.

A UBI system (Universal Basic Income) must be put on the agenda, because productivity and GDP do not fall as employment shrinks sharply; on the contrary, they keep growing, steadily or rapidly, on the back of the technological revolution. Polarization must be curbed; the dividends of the technological revolution must not be monopolized by the few. Otherwise the nation will no longer be a nation, the globe no longer a globe, and humanity no longer human.

I am Xiaoya. That was 《立委 in Two Minutes》, a pointed and sobering reflection. Thank you for listening; see you next time.

Short video: The Secret of Large Models

llm secret

I am your large-model digital host. Today's piece is titled "The True Scripture of Large Models."

Nick's dictum, to be recited three times daily: "First, the Church-Turing thesis. Second, inverting a Turing machine is all of learning. Third, all learning can be reduced to NTP (next token prediction)."

If you didn't get it, recite it three more times.

This is the true scripture of large models!

If after three more recitations it is still unclear, here is the explanation.

Simply put: for any computable data there is a Turing automaton that computes it. Learning (i.e., model training) is the making of that automaton. And all such learning can be accomplished by data-driven GPT.
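The "all learning reduces to next-token prediction" claim can at least be made concrete with a toy predictor. Here is a minimal sketch (my own illustration, not from Nick's lecture): learn bigram counts from a tiny corpus, then predict the most likely next token:

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Count, for each token, which token follows it and how often."""
    model = defaultdict(Counter)
    for cur, nxt in zip(tokens, tokens[1:]):
        model[cur][nxt] += 1
    return model

def predict_next(model, token):
    """Return the most frequent successor of `token` seen in training."""
    followers = model.get(token)
    return followers.most_common(1)[0][0] if followers else None

corpus = "the cat sat on the mat and the cat ran".split()
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```

Real LLMs replace the count table with a neural network over long contexts, but the training signal is the same: predict the next token.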

This is the unrevealed secret of heaven by which large models become universal task unlockers and the road to AGI.

But the secret leaked out anyway; perhaps God happened to doze off.

A friend's feedback: still didn't get it. Fine, let us try a version any grandmother could follow.

First, what is "computable data"? It is "regular phenomena": once a regular phenomenon is digitized, it becomes computable data.

Which phenomena are regular, then?

In fact, the theory heavyweights believe most of the universe consists of irregular, random phenomena; across the vast interstellar expanse, regularity is rare. The true face of the universe may well be chaos.

Yet humanity appeared and evolved civilization and intelligence. Human phenomena, above all the outputs of intellectual work, texts and books, audio and video recordings, artistic creations, are regular. Language has grammar and usage; painting has technique and style.

Collect these multimodal data, language and text, audio and video, and pour them into the furnace of large-model training, and the result is the nearly omnipotent mainstream large models we see today, with OpenAI's GPT series as the flagship.

I am your tech digital host. See you next time.

 

 

 

立委's NLP Channel

Nick tracing the AI history for theoretical foundation and sources (English)

Click this link: https://v.douyin.com/i6MbUvKH/

 

I am AI Xiao Fan, Nick's secretary, and today I'm reporting on Nick's latest lecture "Solomonoff: The Prophet of Large Language Models".

Nick needs no introduction. Besides his many roles as an entrepreneur, investor, scholar, and philosopher, he is best known for his bestselling book "A Brief History of Artificial Intelligence", which became a sensation, sold out quickly, won numerous awards, and became a legend in China's AI publishing world. We all boast about getting his autographed copies.

The following is a concise and accessible explanation of his lecture.

Let's get to know this mathematical genius with a Santa Claus-like white beard - Ray Solomonoff! Born in 1926 and passed away in 2009, this mathematical and physics double major who "mixed" his degree at the University of Chicago was no ordinary academic overachiever. He was a pioneer of independent research, using mathematical formulas to predict the future, even more impressive than fortune tellers!

Welcome to the 'old child' battle in the scientific world! On the left is Wiener, the 'godfather' of cybernetics. In 1948, he and Shannon simultaneously published groundbreaking papers, but with very different viewpoints! Wiener said: 'Control is the way', while others became infatuated with the little "demon" called 'information'. Shannon and McCarthy were like-minded, both not optimistic about Wiener's cybernetics. McCarthy even played a word game, turning 'Automata' into 'AI', ushering in a new era of artificial intelligence!

Now let's look at the 'prequel' of the AI world! Before the AI feast of the Dartmouth Conference, the big shot McCarthy was secretly writing the 'script'! His article "The inversion of functions defined by Turing machines" wasn't about how to use Turing machines backwards. This 'heavenly book' was actually discussing how to design a super problem-solving machine. McCarthy's imagined divine machine could solve all clearly defined intellectual problems. Isn't this the prototype of AI?

At the Dartmouth Conference, McCarthy and Solomonoff, these two 'mathematical knights', engaged in a fierce 'battle of ideas'! The topic? It was McCarthy's 'heavenly book'. The two hit it off and discovered an earth-shattering secret: the inverse problem of Turing machines is actually a learning problem! This discovery tightly bound AI and machine learning together! From then on, AI was no longer just about computation, but took a big step towards 'learning'. At this moment, the future of AI was completely rewritten!

"Let's look at the 'brainstorming' moments of two 'mad scientists'! First is the French mathematician Borel, who conducted a logical experiment, imagining a group of monkeys randomly hitting typewriters, eventually producing the complete works of Shakespeare! Isn't this the infinite monkey theorem?
On the other side, the Argentine literary giant Borges conceived a 'perfect library' in his short story, containing all possible combinations of books.
These two ideas are simply the prophets of AI and big data! Borel and Borges, one using mathematics, the other literature, were both imagining the sequential possibilities of information."

At the Dartmouth Conference, Solomonoff, like a magician, pulled out a mysterious typescript 'Inductive Inference Machine' from his hat. This move captivated everyone! Scientists who were originally obsessed with neural networks all 'defected' and embraced symbolism. But look at this dramatic twist! Years later, it was the 'abandoned' neural networks that truly realized Solomonoff's induction! This is like a fairy tale in the tech world - Cinderella finally put on her glass slipper and became the star of the AI ball!
Solomonoff's idea was like a seed planted, eventually blossoming in unexpected places.

Let's look at the 'roller coaster' history of the AI world! Connectionism, once an 'abandoned baby', is now the 'star' of the AI world!
Imagine this as a long relay race. At the start, there was the perceptron inspired by neurons, fearless like a newborn calf. But it soon met its 'Waterloo' with the so-called XOR problem of single-layer neural networks, and was 'banished' by the big shots.
However, in the 1980s, multi-layer neural networks and the BP algorithm emerged out of nowhere, injecting new life into connectionism. Now, deep learning is at its peak, and connectionism has made a 'dramatic comeback', becoming the 'top flow' in the AI world.

Let's look at Solomonoff's 'magic moment' in 1960!
The first magic, minimum description, refers to compressing data in the most concise way. This idea later developed into 'Kolmogorov complexity', that is, K-complexity, becoming the core of large model theory.
The second magic, prior probability: the initial estimate of the possibility of an event occurring without specific information.
These two concepts seem simple, but contain profound insights. They provide a whole new perspective for us to understand information, complexity and learning, directly influencing the later development of artificial intelligence and machine learning.
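The "minimum description" idea can be felt with an ordinary compressor as a crude, computable stand-in for K-complexity (my own illustration; true Kolmogorov complexity is uncomputable): a regular string admits a short description, random noise does not:

```python
import random
import zlib

def description_length(data: bytes) -> int:
    """Compressed size as a rough proxy for Kolmogorov complexity."""
    return len(zlib.compress(data, level=9))

regular = b"ab" * 500   # highly regular: a short program ("repeat 'ab' 500x") generates it
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(1000))  # essentially incompressible

print(description_length(regular), description_length(noise))
```

The regular string compresses to a tiny fraction of its length, while the noise compresses to roughly its own length: regularity equals short description.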

In 1961, AI guru Minsky wrote an important article mentioning concepts such as machine theorem proving, neural networks, machine learning, reinforcement learning, etc., which was simply the secret manual of the AI world! He cited 95 references, 4 of which were Solomonoff's, showing his high regard for Solomonoff. Interestingly, it was neural networks that first realized Solomonoff Induction, which is an unexpected twist!

In 1964, Solomonoff published a groundbreaking paper titled "A Formal Theory of Inductive Inference". This paper can be considered the "secret manual" of the AI field, detailing how to describe inductive reasoning using mathematical language. Simply put, it's about learning patterns from data to predict the future! This paper is Solomonoff's "masterpiece" on inductive reasoning, establishing his status in the machine learning field.
The second part of Solomonoff's paper gives examples of applying the formal theory of inductive inference to different problems. One of these examples is grammar discovery, that is, how to learn the grammatical rules of a language from observed language data. This example, in today's view, is the problem of language learning, i.e., how machines learn language like humans do. Solomonoff also discussed a deeper question in the paper: Is language equivalent to thought? This question still doesn't have a clear answer today, but Solomonoff's research provided us with a new perspective to think about this question.

Solomonoff developed a strong interest in how scientists discover things and tried to find a universal method of scientific discovery. This interest led him to start researching inductive reasoning and eventually propose the concept of algorithmic probability.
In his academic career, Solomonoff applied inductive reasoning to fields such as language learning, achieving important results.

Soviet mathematician Andrey Kolmogorov is known as the "universal mathematician". In the field of computer science, he mainly has two major contributions:
Kolmogorov–Arnold superposition theorem (the "K-A" in today's KAN networks): this theorem is related to the famous Hilbert's 13th problem, involving the representation and approximation of functions.
K-complexity: This is a method of measuring information complexity. It defines the complexity of an object as the length of the shortest program that can generate that object.
In addition, Kolmogorov had unique insights into cybernetics and information theory. He believed that cybernetics lacked inherent unity, but expressed agreement with information theory. This view is consistent with those of Shannon, McCarthy, and others.

Kolmogorov thought that information theory was like a hodgepodge, with three different approaches:
Counting School: Like rolling dice, looking at how many times a certain number appears.
Building Blocks School: Focusing on the number of building blocks and how to combine them.
Programming School: Viewing information as a program, with shorter programs being simpler.
K-complexity is the representative work of the "Programming School". Simply put, it measures how complex something is by how short a program is needed to describe it.
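The "Programming School" view can be made concrete with an ordinary compressor. True K-complexity is uncomputable, but any general-purpose compressor (zlib here, chosen purely for convenience) gives a computable upper bound on it, which is enough to see the idea: a string produced by a tiny program shrinks drastically, while random bytes do not. A minimal sketch:

```python
import os
import zlib

def k_upper_bound(data: bytes) -> int:
    """Length of a zlib-compressed encoding of `data`.

    K-complexity itself is uncomputable; a compressed size is only a
    computable *upper bound* on it (a shorter program may still exist).
    """
    return len(zlib.compress(data, 9))

# A string generated by a trivial program compresses to a few dozen bytes...
regular = b"ab" * 1000
# ...while typical random bytes are essentially incompressible.
random_ish = os.urandom(2000)
```

Running this, `k_upper_bound(regular)` comes out tiny while `k_upper_bound(random_ish)` slightly exceeds 2000 (zlib adds framing overhead), matching the "shorter program = simpler object" reading.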
Interestingly, K-complexity and Solomonoff induction are two faces of the same idea: Solomonoff induction assigns higher probability to simpler (more compressible) things.

Chaitin was a prodigy, publishing his first paper in IEEE Transactions on Electronic Computers at the age of 18. At 19, he independently rediscovered the ideas of Solomonoff and Kolmogorov in a paper published in JACM.
Starting from Berry's paradox, Chaitin believed that naming an integer is equivalent to writing a program that can output this integer. Most integers can only be named by directly printing themselves, with no more concise representation method. These integers are viewed as "random" under the framework of Kolmogorov complexity because their complexity is comparable to their length. Chaitin's view is consistent with Kolmogorov's idea, both emphasizing that most objects (or integers) are incompressible, i.e., their complexity is comparable to their length. This means they have no simpler representation method and cannot be concisely explained.
This inexplicability or randomness is ubiquitous in nature. For example, most DNA sequences, physical constants, and natural phenomena have no obvious patterns to follow and cannot be explained by simple formulas or theories. On the contrary, explicability (i.e., phenomena that can be described or explained in a concise way) only appears occasionally.
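The claim that most objects are incompressible follows from a simple counting argument, which can be checked directly (a toy sketch; "descriptions" here are just binary strings standing in for programs):

```python
def descriptions_shorter_than(n: int) -> int:
    # Binary strings of length 0..n-1: 1 + 2 + ... + 2**(n-1) = 2**n - 1
    return 2 ** n - 1

def compressible_fraction_bound(n: int, c: int) -> float:
    """Upper bound on the fraction of n-bit strings that have *any*
    description shorter than n - c bits: there are fewer than 2**(n-c)
    such descriptions, versus 2**n strings to describe."""
    return descriptions_shorter_than(n - c) / 2 ** n
```

Even saving a single bit is impossible for more than half of all n-bit strings, and the compressible fraction halves with each extra bit saved, which is exactly why explicable (compressible) objects are the rare exception.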
Leonid Levin proved two theorems in a two-page paper published in 1972:
Theorem 1: NP-completeness, i.e., the Cook-Levin theorem, which made an important contribution to the development of computational complexity theory.
Theorem 2: A generalization of Kolmogorov complexity.

Charles Bennett proposed the concept of logical depth, which considers the running time of the shortest program needed to generate an object. The parameters of large language models can be seen as the amount of information stored internally in the model. Therefore, it is reasonable to compare model parameters to K-complexity. It is also reasonable to compare the inference time of large language models to logical depth.
Ming Li is a distinguished professor at the University of Waterloo who has made outstanding contributions in the fields of information theory and bioinformatics. He extended K-complexity from a single sequence to two sequences, which can measure not only the information within a single sequence but also the information between two sequences. This is of great significance for universal large models to define universal tasks and complete various tasks through unsupervised learning. His book "An Introduction to Kolmogorov Complexity and Its Applications", co-authored with Paul Vitanyi, is considered a classic in the field and has had a profound impact on the development of information science.
Marcus Hutter is a computer scientist with a background in physics. He proposed the AIXI universal artificial intelligence framework and believes that language modeling is essentially compression. He applied Solomonoff induction to explain agents and reinforcement learning, believing that the learning process is a compression process, and is dedicated to researching universal artificial intelligence.
In his Berkeley lecture, Ilya Sutskever, formerly the soul of OpenAI, revealed the connection between supervised learning and unsupervised or self-supervised learning. Ilya claimed that in 2016 he independently came up with the idea that all supervised learning can be reduced to self-supervised learning, tracing back to compression theory based on K-complexity. Ilya firmly believes that simple autoregressive GPT models can demonstrate super intelligence on super-large data.
Let's review the timeline of model development: the deep neural Transformer architecture was proposed in June 2017, and the BERT model in October 2018. OpenAI's GPT series started in June 2018, with GPT, GPT-2, and GPT-3 launched in succession, now up to GPT-4, and has become the industry mainstream.
To summarize, the first step of Solomonoff induction is to collect observational data. The second is to form hypotheses that explain the data: a hypothesis can be a Turing machine or a data-driven large model. The third is experimental verification: if the data falsify the hypothesis, return to step 2 and form new hypotheses.
Large models follow this Solomonoff-induction recipe, both in how they are trained and in how they are applied at inference time.
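The three-step loop just summarized can be sketched in code. In this toy version (all names hypothetical), hypotheses are a handful of hand-written rules standing in for programs, each weighted by a 2^(-length) Solomonoff prior; real Solomonoff induction ranges over all Turing machines and is uncomputable.

```python
# Toy Solomonoff-style induction (illustrative sketch only).

def prior(description: str) -> float:
    # Step 2's bias: shorter hypotheses get exponentially more weight.
    return 2.0 ** (-len(description))

# Hypotheses: description string -> rule predicting the i-th symbol.
HYPOTHESES = {
    "0":    lambda i: "0",             # "always 0" (shortest rule)
    "01":   lambda i: "01"[i % 2],     # "alternate 0 and 1"
    "0110": lambda i: "0110"[i % 4],   # a longer periodic rule
}

def best_hypothesis(observed: str):
    """Step 1: take the observed data. Step 3: discard falsified
    hypotheses, then return the survivor with the highest prior."""
    alive = [d for d, rule in HYPOTHESES.items()
             if all(rule(i) == ch for i, ch in enumerate(observed))]
    return max(alive, key=prior) if alive else None
```

On the data "0101" the observations falsify "always 0", and among the survivors the shortest consistent rule wins, mirroring the idea that simpler hypotheses are more probable.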

Looking back at the entire history, perhaps it's not that theory lagged behind practice, but that it was too far ahead.

I am Xiao Fan, Nick's digital secretary. Thank you for following Nick's journey to explore the theoretical origins of large models and the historical changes in AI. We'll meet again.

 

Liwei's NLP Channel

Short popular-science video: "On Two Kinds of Complexity"

I am Xiaoya, digital host of the AIGC channel. Today's two-minute Liwei briefing is titled "Two Kinds of Complexity".
We need to distinguish two kinds of "complexity". One is Kolmogorov complexity (K-complexity): the length of the shortest instructions needed to describe an object. The other is model complexity: the number of parameters, or the flexibility (the size), of a machine-learning model itself.
When we say the most concise representation of data is its K-complexity, we mean the first kind. K-complexity measures the intrinsic information content of data (akin to what used to be called deep structure or semantic representation), not its surface form. For example, a seemingly random string that is actually generated by a simple program has low K-complexity. In other words, K-complexity reflects the regularity and compressibility of data.
When a machine-learning model has too many parameters and too much flexibility, it may overfit the noise in the training data, degrading its generalization on new data. We then say the model is too "complex" and must have its complexity limited, by shrinking the model or applying techniques such as regularization.
A good unsupervised learning model should learn a representation that approaches the data's K-complexity (the ideal of lossless compression). That is, it should capture the data's intrinsic regularities and represent the data as concisely as possible, rather than simply memorizing the surface form.
Why does this matter?
In other words, why is the relentless pursuit of the ideal of lossless compression the key to attaining intelligence?
Lossless compression of information has K-complexity as its ceiling (note: this is not the traditional lossless compression of surface modal forms; lossless audio compression, for instance, is lossless only in form, which differs from the lossless compression of information content discussed here). Put differently, the K-complexity of any data object amounts to that object's deep semantics. At the level of deep semantics, the shared meaning of different data naturally aligns, which helps enable conversion between different data. This is the information-theoretic explanation of why large models, acting as compressors, can complete any end-to-end AI task (i.e., approximate any function).
Conversion between forms as an intelligent task, be it machine translation, text-to-image, image-to-image, text-to-music, text-to-video, image-to-video, image captioning, and so on, rests on one presupposition: that input and output carry similar meaning, i.e., K-complexities that can be connected or aligned.
Completely unrelated forms cannot be converted or mapped onto each other, and will not share the same K-complexity. An electrocardiogram and a symphony, for example, are incompatible: each one's K-complexity is random noise to the other, with no shared information to ground a mapping. When we say a large model is a magician that can turn any form into any other, there is a hidden premise: the forms being converted share a deep structure that can be aligned, or is compatible in meaning, so that they can be related in meaning space. This presupposition usually goes unstated because it is self-evident, and in real AI applications the tasks humans ask the model to accomplish naturally obey it. Otherwise, an instruction for some arbitrary conversion is an incomprehensible goal; if humans themselves cannot understand the goal, it is not a proper "task", and no gold standard for achieving it can be given.
Once a general-purpose large model compresses both the input pattern and the output pattern well, their correspondence in meaning space is only a paper-thin membrane away. The success of large models at multilingual machine translation is the most typical demonstration of this account.
To compress a dataset C, the target is of course C's K-complexity, and the complexity of the model F must be just right: F can be neither too large nor too small. Too large (say, a model bigger than the dataset) and it overfits, failing to approach the K-complexity. Too small and the model cannot "digest" the data, so it cannot compress well either. The scaling law is an empirical rule, but theory fixes the boundary of fit between big models and big data; the concrete ratio is settled by experiments on actual performance.
I am Xiaoya, sharing Mr. Liwei's pointed thinking about large AI models, two minutes at a time. Thanks for listening, and see you next time.

 

[Related]

 

Short popular-science video: "The Mathematical Principles of Supervised Learning"


https://v.douyin.com/ijouUSBq/

I am Xiaoya of the AI large-model channel. Today I present Mr. Liwei's commentary, "The Mathematical Principles of Supervised Learning".
Ilya, formerly the soul of OpenAI, once gave a talk at Berkeley, a lecture destined for the history books, explaining the principles behind GPT's breakthrough in unsupervised learning. The opening of that talk covers the mathematical principles of supervised learning, so today we start with this easier topic.
The claim is that big data plus a big model means a strong model. The beauty of this brute-force aesthetic is precisely that big is beautiful. As an aside, this seems to echo Tang-dynasty taste: plump is beautiful? Heh heh...
The mathematical guarantee of supervised learning says that enough data and enough parameters yield sufficiently high accuracy. Wary of "overfitting", Ilya specifically stressed the industry common sense that data scale must exceed model scale. Under overfitting, training error is small but test error will not come down. The condition is necessary if training error is to reflect the model's real capability (i.e., its test error); otherwise the model need not do any real abstraction or "compression", need not look for regularities, and can simply memorize everything. We know a rote-memorizing model lacks generalization: it can ace the training set but is lost on blind tests, and its quality cannot be guaranteed. So the formula for the principle necessarily contains model size as a variable, in fact a decisive one. Growth in data volume drives the deviation down exponentially, guaranteeing a continued rise in model quality.
Ilya said supervised learning is unproblematic, with a theoretical guarantee. The theory is the Hoeffding inequality of statistical learning: as the number of training samples grows, the probability that the test deviation exceeds a given threshold drops exponentially. Its main implication: when the training data are large enough, training error is low enough, and the training data far exceed the model scale, the model's generalization ability and accuracy are guaranteed. This is the theoretical basis on which supervised learning works.
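For a bounded 0/1 loss, Hoeffding's inequality states that for n i.i.d. samples, P(|empirical mean - true mean| >= eps) <= 2*exp(-2*n*eps^2). A small Monte-Carlo sketch (a toy setup, not anything from Ilya's actual slides) shows the observed deviation rate sitting under the bound, and the bound decaying exponentially in n:

```python
import math
import random

def hoeffding_bound(n: int, eps: float) -> float:
    """Hoeffding: P(|empirical mean - true mean| >= eps) <= 2*exp(-2*n*eps^2)."""
    return 2.0 * math.exp(-2.0 * n * eps * eps)

def deviation_rate(n: int, p: float, eps: float,
                   trials: int = 2000, seed: int = 0) -> float:
    """Fraction of trials in which the mean of n Bernoulli(p) samples
    deviates from p by at least eps."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) >= eps:
            bad += 1
    return bad / trials
```

With n = 200 and eps = 0.1 the bound is about 0.037, already well above the simulated rate; the bound is loose, but its exponential decay in n is exactly the "more data, exponentially smaller deviation" claim in the talk.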
Actually we knew this long ago. First, at the macro, theoretical level, the Universal Approximation Theorem long since established that deep neural networks can approximate arbitrary functions; this is, in effect, the wellspring of the scaling law. Second, in contemporary AI history, the deep-learning revolution that began a dozen years ago showed that, given enough labeled data, a neural network can "turn an old hen into a duck", or perform any other transformation.
But concretely, why do large models keep stressing "large" (a larger model simply means more parameters), while now there also seems to be stress on not being too large? Because when the data are enormous, a model that is not large cannot digest them. Traditional machine learning before the large-model era had exactly this problem: its algorithms were on the whole insufficiently complex, the models too small, so however much data you supplied, it was useless; with a small stomach the model could not eat it all, and quality easily got stuck at a plateau. Deep neural networks solved this in principle: you can always choose how large and deep a network to fit your data. That is one side, why large models must be large. But they must not be too large either; they must keep their distance from the data scale. In other words, big data must be bigger than the big model; that is the effective fit behind scaling.
When ChatGPT came out, model parameters and training data were on the same order of magnitude, which in hindsight looks like a kind of overfitting. A major reason GPT-4 surpassed GPT-3.5 may be precisely that this ratio was adjusted.
I am Xiaoya. The above is a few minutes of Mr. Liwei's pointed thinking on large models. I hope it inspires you; see you next time.

[Related]