What did Ilya see? -- secret behind success of LLMs

What did Ilya see?

-- looking closely into his historic Berkeley talk

by Wei Li, Jia Gao


When Ilya Sutskever left OpenAI and re-emerged with his new company, Safe Superintelligence Inc. (SSI), the move was both surprising and expected: he bypassed AGI and aimed directly at safe superintelligence. He confidently declared that superintelligence is imminent and that establishing safe superintelligence is the most important technological issue of our time.

Ilya, a legend in deep learning and AI and formerly the true soul of OpenAI, was at the center of the company's dramatic internal upheaval, which turned on one issue: effective acceleration versus super alignment. Why was Ilya so steadfast about "super alignment" in this debate over AI values and strategic direction? Even after the storm settled, the outside world kept speculating: what did Ilya see that compelled him to join the board in ousting CEO Sam Altman? Ilya stayed out of sight until he recently left OpenAI, which led to the dissolution of his super alignment team and the founding of his new company.

What did he see behind the push for "safe intelligence"?

Back on October 3, 2023, Ilya gave a talk at UC Berkeley titled "A Theory of Unsupervised Learning." Though obscure and known to few, it is destined to be one of the most significant moments in AI history. This talk was a theoretical reflection and summary by a top expert in deep learning on the GPT model he pioneered, now famous worldwide. Ilya revealed the core principles of large models and vividly described his obsession with, and excitement over, independently understanding the mechanisms of unsupervised learning. Despite the complexity, the talk was brilliant and enlightening.

More recently, Leopold Aschenbrenner, a former member of his super alignment team, published a 165-page essay, "Situational Awareness," which offered a first look at the shock and concern inside OpenAI over the exponential evolution of GPT models. This partly answered the question of what Ilya saw, but Ilya himself remained silent until his official re-emergence not long ago.

Reflecting on his "confessional" talk at Berkeley, we might glimpse his "moment of enlightenment" when facing potential superintelligence and understand his original intent for safe intelligence. It was a rare, in-depth sharing by Ilya, an attempt to convey an essential message to the world. But did the world hear him?

1. Machine Learning: Supervised Learning and Unsupervised Learning

To accommodate readers with varying mathematical backgrounds, this blog aims to explain Ilya's historic presentation in accessible language. Non-technical readers can skip the purely technical explanations without losing the main ideas of the presentation.

Before diving in, let's review the basic concepts of machine learning. Machine learning is like having computers as students and humans as teachers. By providing computers with numerous "practice problems" and "answer keys," they slowly learn to solve problems. This is supervised learning. But can computers really learn from practice problems instead of merely memorizing them? Ilya assures us there's theoretical proof of this.

Imagine a sea of problems before you, each paired with a standard answer. This is the model's training data. Model training is like diligently solving these problems until most of them are correct, meaning low training error. But even an extensive problem set has its limits. When new problems arise, can the model still get them right? These new problems are the test data, akin to exams. Whether the model performs well depends on its test error rate.

Mathematics tells us that as long as the problem set is large enough, far exceeding the model's size, excellent performance on training problems (low training error) ensures good performance on test problems (low testing error). In other words, if the model trains well, it will do well in exams! This is the mathematical guarantee for supervised learning.

However, if the model merely memorizes without extracting patterns, then no matter how large its memory or how strong its "memory power," it lacks real adaptive learning ability (called "generalization ability"). Only when the model isn't too smart will it be forced to extract the essence (called "compression") and learn real skills from the problem set.

This explains why the model shouldn't be too large: a smaller model has less room to cut corners. In short, Ilya wants to say that "big labeled data + low training error" is the winning formula for supervised learning, guaranteed by mathematics. This point has been confirmed both theoretically and practically. Since the deep learning revolution 12 years ago, countless successful cases have shown that as long as the training data is sufficient, neural networks can excel at all sorts of AI tasks, from recognizing cats and dogs to machine translation.
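The guarantee described above can be made concrete. The snippet below is a minimal sketch, assuming the textbook bound for a finite hypothesis class (Hoeffding's inequality plus a union bound) with error measured in [0, 1]; the post does not say which exact bound Ilya used, and the function name and bit counts are hypothetical, chosen only for illustration.

```python
import math


def gap_bound(model_bits: int, num_samples: int, delta: float = 0.05) -> float:
    """Upper bound on (test error - training error) that holds for every
    hypothesis describable in `model_bits` bits, with probability >= 1 - delta.

    Assumption: finite hypothesis class of size 2**model_bits, i.i.d. samples,
    per-example error in [0, 1].
    """
    log_num_hypotheses = model_bits * math.log(2)
    return math.sqrt((log_num_hypotheses + math.log(1.0 / delta)) / (2.0 * num_samples))


# Data far larger than the model: the gap is small, so low training error
# implies low test error.
for n in (10_000, 100_000, 1_000_000):
    print(f"model_bits=1000, samples={n:>9,d}: gap <= {gap_bound(1000, n):.4f}")

# Model far larger than the data: the bound exceeds 1 and guarantees nothing.
print(f"model_bits=100000, samples=10,000: gap <= {gap_bound(100_000, 10_000):.2f}")
```

The bound shrinks only when the number of samples grows relative to the number of bits needed to describe the model, which is the "problem set far exceeding the model's size" condition in the text.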

But what about unsupervised learning? Can computers learn intelligence from a problem set without standard answers? It sounds far-fetched, but Ilya is about to explain how he managed to seek a solid mathematical foundation for unsupervised learning as well.

2. Distribution Matching: A New Approach to Unsupervised Learning

Everyone knows that machine translation was a typical win for supervised learning, in fact the only win among the various NLP tasks (such as dialogue, information extraction, sentiment analysis, question answering, and document understanding) before the era of large language models. Why? Because we have a vast amount of historical bilingual data. It's like students having workbooks with English on the left and Chinese on the right; supervised learning thrives on this setup.

But what if the teacher suddenly stops providing aligned bilingual data and only gives you English books and unrelated Chinese books, leaving you to figure out how to align them and learn automatic translation? That's the problem unsupervised learning needs to solve. Ilya says unsupervised learning can also handle machine translation between many languages (which we now see with large models: specialized translation software is no longer needed), and even any input-to-output transformation task. What's the catch?

Ilya discovered a new approach called distribution matching. Essentially, if the English and Chinese book collections are large enough, containing various sentence structures, their linguistic regularities will be learned "without supervision". For example, the context distribution of "I/me/my" in English should correspond to "我" in Chinese; adjectives near nouns in English with semantic compatibility should have a similar pattern in Chinese, etc. This provides the basic condition for potential language alignment.

Ilya points out that if two languages' native data is sufficiently rich, the input in one language can almost uniquely determine the equivalent translation in the other language. This principle applies not only to machine translation but also to tasks like speech recognition and image recognition.

Ilya independently discovered this approach in 2015, fascinated by the underlying mathematical principle—compression theory. If we can find a method that maximally compresses both English and Chinese data, this approach will capture the common patterns of the two languages, which form the basis of translation.

So, Ilya proposes that unsupervised learning is essentially about finding the optimal data compression method. This perspective not only sounds cool but also provides a mathematical explanation for the effectiveness of unsupervised learning. Although real-world tasks are not idealized, this principle gives unsupervised learning a solid theoretical foundation, making it as convincing as supervised learning.

Next, Ilya will delve deeper into the mathematical principles behind it. Although somewhat abstract, he promises it’s full of insights. We'll see how he uses the magic of compression to explain the mysteries of unsupervised learning.

3. Ilya’s Ultimate Theory: From Conditional Modeling to Joint Modeling

This is the final and most intriguing slide of Ilya's talk, worthy of thorough analysis and contemplation. The goal of unsupervised learning is often defined as "learning the internal structure of data." Ilya suggests understanding unsupervised learning from the perspective of data compression: a good unsupervised learning algorithm should maximally compress the data, representing its content in the simplest form. This introduces the concept of Kolmogorov complexity.

The Kolmogorov complexity of a data object is the length of the shortest computer program that can fully describe this object. You can imagine this shortest program as a "compressed package" containing all the information needed to reconstruct the original data. From this perspective, the goal of unsupervised learning is to find the optimal compressed representation of the data, whose length is its Kolmogorov complexity.
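Kolmogorov complexity itself cannot be computed, but the output size of any real compressor gives an upper bound on it (up to a constant for the decompressor). A minimal sketch, using Python's standard `zlib` as a stand-in compressor; this is an assumption for illustration, and `zlib` is far weaker than the ideal shortest program.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data: a crude upper bound on K(data)."""
    return len(zlib.compress(data, 9))


patterned = b"abc" * 10_000       # 30,000 bytes with an obvious regularity
random_like = os.urandom(30_000)  # 30,000 bytes with no regularity to exploit

print("patterned  :", len(patterned), "->", compressed_size(patterned))
print("random-like:", len(random_like), "->", compressed_size(random_like))
```

The patterned string shrinks to a tiny fraction of its size because a short description ("repeat abc") exists; the random bytes do not shrink at all.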

However, in practice, we often need to handle multiple related datasets. For instance, in machine translation, we have the source language dataset X and the target language dataset Y. We want to learn a model that can translate sentences from X to Y (or vice versa). Traditionally, this is viewed as a conditional probability problem: given X, what is the probability distribution of Y? Represented in terms of Kolmogorov complexity, this involves finding K(Y|X), the shortest description length of Y given X.

Ilya proposes a different approach. Instead of viewing X and Y as condition and result, like in supervised learning, he suggests viewing them as a whole and compressing them together within a massive model. Essentially, we seek the joint Kolmogorov complexity K(X,Y), the shortest program length that compresses both X and Y simultaneously. This approach must fully utilize the correlation between X and Y, using information in X to automatically align Y (or vice versa), much like how we use our native language knowledge to understand and remember foreign language expressions.
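The effect of compressing X and Y together can be imitated on toy data. The sketch below is an illustration under stated assumptions, not Ilya's method: `zlib` stands in for the ideal compressor, the "datasets" are synthetic streams of random phrases, and the helper names (`conditional_cost`, `make_phrases`, `sample_corpus`) are hypothetical. The cost of Y given X is approximated as C(X+Y) - C(X).

```python
import random
import string
import zlib


def compressed_size(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def conditional_cost(y: bytes, given: bytes) -> int:
    """Approximate K(y | given) by the extra bytes needed to compress y after `given`."""
    return compressed_size(given + y) - compressed_size(given)


def make_phrases(rng: random.Random, count: int = 100, length: int = 40) -> list:
    """A hidden 'vocabulary' of random phrases shared by related datasets."""
    return ["".join(rng.choices(string.ascii_lowercase, k=length)) for _ in range(count)]


def sample_corpus(rng: random.Random, phrases: list, n: int = 300) -> bytes:
    return " ".join(rng.choices(phrases, k=n)).encode()


rng = random.Random(0)
shared = make_phrases(rng)
other = make_phrases(rng)

x = sample_corpus(rng, shared)  # dataset X
y = sample_corpus(rng, shared)  # dataset Y, built from the same hidden regularities as X
z = sample_corpus(rng, other)   # an unrelated dataset

print("C(Y) alone      :", compressed_size(y))
print("C(X+Y) - C(X)   :", conditional_cost(y, x))
print("C(Z+Y) - C(Z)   :", conditional_cost(y, z))
```

Compressing Y after the related X costs much less than compressing Y alone or after the unrelated Z: the joint compressor reuses the regularities it already found in X. The corpora are kept small so that both fit inside zlib's 32 KB window.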

Ilya believes this joint compression idea is the true power of unsupervised learning. Real-world data is often interconnected, with numerous deep common patterns and regularities. If unsupervised learning can discover and utilize these regularities, it can significantly enhance learning efficiency and generalization ability. This explains the remarkable performance of large language models like GPT across various tasks: through massive unsupervised pretraining, they learn the deep regularities of the training data, and these regularities are transferable across related datasets.

Although Kolmogorov complexity is theoretically uncomputable, Ilya believes we can approximate this process using deep neural networks (like GPT). Through optimization algorithms such as gradient descent, neural networks can find the optimal compressed representation in massive data, capturing the essence of the data and its alignment patterns, even if not strictly in terms of Kolmogorov complexity.

Thus, Ilya’s theory can be seen as a new paradigm for unsupervised learning, elevating traditional independent modeling (like separate models for English and Chinese) to a unified associative modeling approach. In this paradigm, the goal of unsupervised learning is no longer just compressing individual datasets but finding the connections between them. This cross-modality learning represents an advanced form of artificial general intelligence (AGI).

Now, let’s closely examine this final slide. In it, X represents dataset 1 and Y represents dataset 2. The key point is extracting every bit of information from X (or Y) to help predict Y (or X). This is what Ilya refers to when he says training X and Y together yields the effect that unsupervised learning of X helps accomplish the task of transforming X to Y.

The crucial idea is: K(Y|X) becomes K(X, Y).

Ilya recasts the universal functional AI task, "input X conditions output Y," as a problem that can be solved approximately by training on X and Y jointly, without separating them by modality. This joint training is in effect today's unified multimodal training, denoted K(X, Y).

Ilya aims to strengthen the theoretical basis, emphasizing his surprising discovery that self-learning of X has a strong predictive effect on Y.

The essence of unsupervised self-learning is that the self-learning of X is to compress X, and the self-learning of Y is to compress Y. This is straightforward because self-learning in essence involves only positive examples, without negative samples. Unsupervised self-learning lacks a specific task orientation; it learns language from language, images from images, music from music, and so on, continually abstracting various patterns from phenomena.

Ilya points out in the slide: conditioning on a dataset, not an example. The compression object is the dataset, not individual data points, which is crucial. This distinction separates superficial compression from content compression. Superficial compression is merely a mechanical process that does not produce intelligence. Only content compression can achieve artificial intelligence.

How do we understand the difference and connection between superficial lossless compression (e.g., digital music) and content lossless compression (e.g., Suno)? Losslessly compressing a specific song aims to ensure it can be restored to its original musical form (including noise and imperfections). This is traditional music compression, which targets an individual sample, e.g., a specific song. Compressing a collection of music, whether with GPT or diffusion, targets a group of samples and results in a large model like Suno.

When individual objects turn into group objects, formal compression naturally transforms into content compression. This is because, although the group comprises individuals, compressing the group is like "painting" a portrait of the group, outlining its characteristics. It may resemble an individual, but it is not a specific individual in the original data; otherwise, it would not be a model but a memory repository.

This is understandable because the purpose of large model compression is to identify the characteristics and regularities of the dataset. The text generated by GPT-4 might seem familiar; the music generated by Suno might sound familiar; the videos generated by Sora might look familiar; the images generated by Midjourney (MJ) might seem familiar. However, they are virtual individuals "restored" based on prompts, abstracted or compressed from big data: derived from data, higher than data, mingling with data, with real and fake indistinguishable.

Given that the compression object is the entire dataset content, how do we measure its effectiveness after decompression? What is the gold standard?

The standard is each sample itself. This is not entirely accurate, though: the standard can have equivalent answers, since the same content can be expressed in many ways. The implementation method is "masking," and next-token prediction (NTP) simply masks the next token. Training involves calculating the loss for each sample and using backpropagation with gradient descent to adjust the parameters continually, eventually lowering the loss over the whole dataset to an acceptable point and yielding the large model.
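The link between next-token prediction loss and compression can be shown on a toy scale. The sketch below swaps the neural network and gradient descent for a count-based character bigram model (an assumption made purely to keep the example short); the loss it reports, average negative log2 probability of the masked next token, is the number of bits an ideal coder driven by the model would spend per token.

```python
import math
from collections import Counter, defaultdict


def train_bigram(text: str) -> dict:
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def bits_per_token(counts: dict, text: str, vocab_size: int) -> float:
    """Average next-token loss in bits (cross-entropy), with add-one smoothing."""
    total_bits = 0.0
    for prev, nxt in zip(text, text[1:]):
        seen = counts.get(prev, Counter())
        prob = (seen[nxt] + 1) / (sum(seen.values()) + vocab_size)
        total_bits += -math.log2(prob)
    return total_bits / (len(text) - 1)


corpus = "the cat sat on the mat. " * 50
vocab_size = len(set(corpus))

model = train_bigram(corpus)
model_bits = bits_per_token(model, corpus, vocab_size)
uniform_bits = math.log2(vocab_size)  # cost per token with no model at all

print(f"uniform code : {uniform_bits:.2f} bits/token")
print(f"bigram model : {model_bits:.2f} bits/token")
```

Lower loss means fewer bits per token, so driving the training loss down over the whole dataset is the same thing as compressing the dataset better.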

This final slide and Ilya’s explanation emphasize a core point: Conditional Kolmogorov complexity K(Y|X) provides a theoretically optimal solution for unsupervised learning. K(Y|X) is defined as the length of the shortest program that produces the output dataset Y given access to the input dataset X. It represents the theoretical limit of extracting all valuable information from X to predict Y. An algorithm that can achieve K(Y|X) would be the best for predicting Y using unlabeled data X.

This can be seen as the theoretical basis for large models performing various language translations. Each language is potentially X and potentially Y. After self-learning on a huge amount of data, LLMs learn the relationships between languages and acquire the potential to translate from X to Y.

In practice, the machine translation task, like other tasks, initially involves few-shot examples in instruction-following fine-tuning to define the task, ultimately triggering the internal power of large models to translate various languages. This internal power of unsupervised learning for various tasks is the theme of his talk.

However, K(Y|X) is uncomputable and awkward to approximate directly, because it conditions on an entire dataset rather than on a single example. Ilya proposes a feasible alternative: the joint Kolmogorov complexity K(X,Y), that is, compressing X and Y together. He believes K(X,Y) can achieve the same effect as K(Y|X) in practical machine learning tasks.
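The step from the conditional to the joint quantity rests on the chain rule for Kolmogorov complexity (the Kolmogorov-Levin symmetry-of-information theorem). It is a standard result, given here for reference rather than quoted from the talk:

```latex
K(X, Y) = K(X) + K(Y \mid X) + O\big(\log K(X, Y)\big)
```

Read left to right: a compressor that achieves K(X,Y) has paid K(X) for X and only about K(Y|X) for Y, so compressing the two datasets together extracts from X everything that helps predict Y, up to a logarithmic term.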

Let us stop and think again: conditional modeling is now replaced by sequence modeling by Ilya. The widely known probability simplification in traditional machine learning, such as the Markov chain, has a similar effect.


Ilya's historic presentation at Berkeley on the theory of unsupervised learning reveals the secret behind the mainstream of self-learning large models, especially GPT. It seems that Ilya, after long contemplation, finally disclosed this "heavenly secret" in a cryptic manner at Berkeley. Although the theory and its proof appear complex, it is crucial for understanding why GPT's sequence learning method ("next token prediction") has become a universal simulator for AI tasks.

Ilya exudes a genius prophet aura, with a lonely invincibility and high-altitude isolation, blending a sense of deep realization, compassion, and the pure, focused, and idealistic earnestness of a graduate student nerd.

He claims to prefer compression but does not emphasize so-called lossless compression. He leaves room for himself and the mainstream, proposing the concept of "no regret"—though GPT may not achieve lossless or perfect compression, it theoretically proves there is no better way: GPT is the closest to lossless, "no-regret" modeling.

When Ilya officially re-emerges to establish SSI, emphasizing a single focus, a single goal, and a single product—to use technology to ensure the superintelligence brought by large models is safe for humanity—he asserts: AI will be eternal, its birth akin to the creation of heaven and earth. As Ilya passionately discusses AI's progress, he is most qualified to declare and lead the "exciting yet dangerous journey towards AGI."


Related Links:



Ilya Sutskever: Would You Bet Against AI?

Taking a Knife to the Board: How OpenAI Solves Its Fatal Dilemma

Why Is Sora Called a World Simulator?



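This is the title of a recent paper at ICML 2024: "Case-Based or Rule-Based: How Do Transformers Do the Math?" The first author is Yi Hu (胡逸) of the School of Physics at Peking University, who is about to join the Institute for Artificial Intelligence as a PhD student. It is an interesting read.

The paper's first conclusion is that LLMs do not learn true reasoning rules; they achieve only limited generalization by learning from similar cases. Reasoning rules learned by humans support extrapolation, whereas limited generalization from similar cases supports only interpolation. Without extrapolation, any region with no similar cases in the training set becomes a blind spot.
This experiment was run on GPT-2. It is easy to see why the first experiment had to use GPT-2: the authors could control the training data and create data vacuums to test whether logical reasoning yields any extrapolation ability. But the fact that a capability fails to appear in a "large" model of GPT-2's scale does not mean it will not "emerge" in a truly large model.

The paper's later experiments were run on Llama ("辣妈"), a model far larger than GPT-2, and seem to reach the opposite conclusion: if the model is large enough, a small amount of task fine-tuning is all an LLM needs to approach the effect of rule-based reasoning. Its performance on long-integer addition shows that the model not only interpolates but also extrapolates well.


Here is my take. In terms of how sequence learning works, data-driven model learning takes case-based induction (also called compression) as its starting point and backbone; there is no doubt about that. The question is whether case-based learning, at sufficient depth and scale, comes very close to rule-based learning. To accept that is to accept that large models have some capacity for logical reasoning. That large models have rudimentary logical reasoning ability has never been in question within the mainstream large-model community; it is a tacit consensus, and logical reasoning is one of the key dimensions on which large models are evaluated. But in wider circles (outside the mainstream community, and among the general public), it has remained an open question.

A useful perspective is how to understand extrapolation within generalization. For non-analytic phenomena with no corresponding symbolic rules, extrapolation is essentially uncomputable, so it comes down to luck. The only way out is to collect relevant data, bring the blind spot onto the radar screen, and turn extrapolation into interpolation. But for highly regular data distributions that have analytic solutions, extrapolation is a natural expectation of generalized learning; if the LLM falls short of it, the LLM is just a parrot. If it meets the expectation, the LLM has crossed the parrot threshold and learned some kind of reasoning rule. As things stand, the leading large models have crossed this threshold, and continuing to liken large models to parroting reveals only blind human arrogance.
A recent, much-discussed study of the KAN model is relevant here. KAN's AI-for-science experiments in effect show how a model can approach an analytic solution in a data-driven way, which amounts to visualizing the internal process by which a model learns logical reasoning; it is vivid and quite convincing. Of course, the KAN experiments show that for simple analytic solutions, data-driven learning can approach symbolic rules but does not readily yield them. In the experiments, the symbolic rules behind the data were obtained only after adding manual operations such as pruning.

By contrast, deep learning heavyweight Yann LeCun (杨立昆) flatly denies that GPT has any logical reasoning ability. Quotations from LeCun: "AGI is a complete nonsense"; "GPT is a deadend"; and so on. Overcorrecting against the tide and speaking in absolutes is not necessarily a bad thing. But trusting him too readily may also lead you into a ditch.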






Ode to Madam Hu

July 1, 2024 at 1:50 PM

[verse 1]
In Suzhou's June, beneath a scorching sky,
A madman's blade flashed, evil drawing nigh.
Mother and child cried out in desperate fear,
Their screams of anguish piercing far and near.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 2]
Before the school bus, Madam Hu stood tall,
Her gentle hands became a shield for all.
No tiger-wrestler she, no dragon-slayer,
But love unbounded made her their savior.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 3]
Her blood stained red the soil of Jiangnan,
White clouds and grieving grass bore witness, wan.
Though snuffed, her candle's light forever gleams,
Like brave Feng Yuan of old, her courage beams.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 4]
Why must the kind so often suffer woe?
When will justice's path smooth waters show?
We question Heaven, tears fall like the rain,
In silence seek life's meaning through our pain.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 5]
Madam Hu's name shall echo through the years,
Half-masted flags, a nation draped in tears.
Her love, transcending life and death's divide,
One selfless act, as sun and moon abide.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 6]
Rest now in peace, return to native ground,
Let not your family grieve, all hearts are bound.
In old Wu Gate, by Suzhou's storied streams,
We offer flowers and wine to honor dreams.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.






Nick explains the theoretical origins and secrets of large models (Douyin video): https://v.douyin.com/i6Mq91wS/






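Let's meet the mathematical maverick with a Santa Claus white beard: Ray Solomonoff! Born in 1926, he died in 2009. A double talent in mathematics and physics who picked up his degree at the University of Chicago, he was no ordinary top student. He was a pioneer of independent research who used mathematical formulas to predict the future, better than any fortune-teller!


Now for the "prequel" of AI! Before the Dartmouth workshop, that great AI feast, McCarthy was already quietly writing the "script." His paper "The inversion of functions defined by Turing machines" is not about how to run a Turing machine backwards. This "arcane text" actually discusses how to design a super problem-solving machine. The machine McCarthy imagined could solve every well-defined intellectual problem. Isn't that the embryo of AI?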






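In 1964, Solomonoff published a startling paper titled "A Formal Theory of Inductive Inference." This paper is the "secret martial-arts manual" of AI: it details how to describe inductive inference in the language of mathematics. Simply put, it is about learning regularities from data and predicting the future! It is the cornerstone work of Solomonoff induction and established his standing in machine learning.
Part II of Solomonoff's paper gives examples of applying the formal theory of inductive inference to different problems. One is grammar discovery: how to learn the grammatical rules of a language from observed language data. Seen today, this is the problem of language learning, that is, how a machine can learn language the way humans do. In the paper, Solomonoff also explored a deeper question: is language equal to thought? There is still no definitive answer, but his work offers a new perspective for thinking about it.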


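Kolmogorov's superposition theorem (K-A-N): this theorem is related to Hilbert's famous 13th problem and concerns the representation and approximation of functions.
Kolmogorov complexity: a way of measuring the complexity of information. It defines the complexity of an object as the length of the shortest program that generates the object.

The counting school: like rolling dice and counting how often each face comes up.
The building-block school: focuses on the number of blocks and how they are combined.
The programming school: treats information as a program; the shorter the program, the simpler the information.

Chaitin (蔡廷) showed his talent early: at 18 he published his first paper in IEEE Transactions on Electronic Computers. At 19, his paper in JACM independently rediscovered the ideas of Solomonoff and Kolmogorov.
Theorem 1: NP-completeness, that is, the Cook-Levin theorem, an important contribution to the development of computational complexity theory.
Theorem 2: a generalization of Kolmogorov complexity.

Charles Bennett proposed the concept of logical depth, which considers the running time of the shortest program needed to generate an object. The parameters of a large language model can be viewed as the amount of information stored inside the model, so it is reasonable to liken model parameters to Kolmogorov complexity. It is likewise reasonable to liken a large language model's inference time to logical depth.
Ming Li (李明) is a distinguished professor at the University of Waterloo who has made outstanding contributions to information theory and bioinformatics. He extended Kolmogorov complexity from a single sequence to two sequences, so that it measures not only the information within one sequence but also the information between two sequences. This matters greatly for how general-purpose large models define a universal task and accomplish all kinds of tasks through unsupervised learning. His book with Paul Vitanyi, "An Introduction to Kolmogorov Complexity and Its Applications," is regarded as a classic of the field and has had a profound influence on information science.
Marcus Hutter is a computer scientist who trained as a physicist. He proposed the AIXI framework for artificial general intelligence and holds that language modeling is essentially compression. He applied Solomonoff induction to explain agents and reinforcement learning, holds that the learning process is a compression process, and is dedicated to research on artificial general intelligence.
In his Berkeley talk, Ilya, the former soul of OpenAI, revealed the connection between supervised learning and unsupervised (or self-supervised) learning. Ilya says that in 2016 he independently arrived at the view that all supervised learning can be reduced to self-supervised learning, and he traced it back to compression theory based on Kolmogorov complexity. Ilya firmly believes that a simple autoregressive GPT model can exhibit superintelligence given super-large data.
To sum up: the first step of Solomonoff induction is to collect observational data. The second step is to form a hypothesis that explains the data; the hypothesis can be a Turing machine or a data-driven large model. The third step is experimental verification. If the data falsify the hypothesis, return to step 2 and form a new hypothesis.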
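The three steps above can be sketched in miniature. This is a toy under loud assumptions, not Solomonoff's actual construction: real Solomonoff induction mixes over all programs of a universal Turing machine and is uncomputable, whereas here the hypothesis space is restricted to "repeat this bit pattern forever," the weight 2^(-pattern length) is a simplified stand-in for the universal prior, and all unfalsified hypotheses are kept in parallel instead of being tried one at a time.

```python
from itertools import product


def predict_next(observed: str, max_len: int = 8) -> dict:
    """Probability of the next bit, mixing every pattern hypothesis the data has not falsified."""
    weights = {"0": 0.0, "1": 0.0}
    for length in range(1, max_len + 1):
        for bits in product("01", repeat=length):        # step 2: enumerate hypotheses
            pattern = "".join(bits)
            repeats = len(observed) // length + 2
            generated = (pattern * repeats)[: len(observed) + 1]
            if generated.startswith(observed):            # step 3: keep only unfalsified ones
                weights[generated[-1]] += 2.0 ** -length  # shorter pattern, larger prior weight
    total = sum(weights.values())
    if total == 0:
        raise ValueError("no hypothesis up to max_len explains the data")
    return {bit: weight / total for bit, weight in weights.items()}


observed = "010101"  # step 1: the collected observations
print(predict_next(observed))
```

Short patterns that fit the data dominate the prediction, which is the sense in which simpler (more compressed) explanations are preferred.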








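2023/05/09: Tencent Technology, "Dialogue with an industry expert: as the 'war of a hundred models' intensifies, the planet lacks the computing power to serve one billion users"

2023/07/16: Xuanyuan Zhixue (轩辕之学) lecture series, invited lecture "The History, Present State, and Challenges of Large Models"

2023/06/28: China Securities Journal report on an invited talk at the 360 GPT Alliance, "Wei Li: Avoid homogeneous competition in the large-model industry; there is plenty of 'fine work' to do in the domain-deepening stage"

2023/07/21: ArchSummit Global Architect Summit keynote, "How to Meet and Embrace Large Models and Their New Ecosystem, from an Architect's Perspective" (https://blog.csdn.net/EDDYCJY/article/details/131929057)

2023/07/28: ChinaJoy, hosted by the China Audio-Video and Digital Publishing Association and Shanghai Hanwei Xinheng Exhibition Co., Ltd., invited talk "In the AIGC Era, Digital Humans Lead a New Paradigm of Content Creation"

2023/08/06: Closed-door event "Opportunities and Challenges: Development Trends and Practical Applications of Generative AI," hosted by Amazon Web Services and ZhenFund, keynote "Prospects and Pain Points of Deploying Large AI Models"

2023/08/23: Jiangsu Normal University academic symposium "How Far Are Large Language Models from Human Language," invited talk "The Nature of Large Models and the Pain Points of Deploying Them"

2023/10/20: 11th Science and Technology Expo (科博会), invited keynote "Trends in AI and Large Models"

2024/03/18: NVIDIA GTC 2024 global developer conference, talk given in collaboration with the NVIDIA China team, "Reshaping the Interactive Potential of Digital Humans"

2024/04/22: 14th Beijing International Film Festival, "Future of Light and Shadow" (光影未来), talk "The Magician Who Captures the Future"

2024/04/22: Jiupai Finance interview with Mobvoi vice president Wei Li, "The commercial breakout of large models will definitely happen within two years"

2023/11/11: Banana Forum (香蕉论坛), "The Significance and Practice of the 'War of a Hundred Models' for Intelligent Cockpit Development"

2023/12/05: Auto Business Review, "Wei Li: The Past, Present, and Future of Large Model Technology"

2024/05/24: Jiupai Finance interview with Mobvoi vice president Wei Li, "Finding a product direction is the bigger pain point; making the scenario work comes first"

2023/12/09: Brother Fei on AI, Wei Li and Jia Gao, "OpenAI's Ecosystem on a Tear? Domain Barriers Remain a Mountain"

2023/12/09: Brother Fei on AI, Zhifei Li, Jia Gao, and Wei Li, "Eight Takeaways on Google Gemini"

2024/02/19: Brother Fei on AI, Zhifei Li, Wei Li, and Jia Gao, "Why Is Sora Called a World Simulator?"

2024/02/29: Brother Fei on AI, Zhifei Li, Wei Li, and Jia Gao, "A 10,000-Character Deep Dive into How China Can Replicate Sora: Model Architecture, Parameter Scale, Data Scale, Training Cost"

2024/03/05: Brother Fei on AI, Zhifei Li and Wei Li, "Claude 3 Matches or Surpasses GPT-4 and Can Summarize 150,000 Words"

2024/04/01: Brother Fei on AI, Wei Li, Yi Wang, and Jia Gao, "How Will Suno Evolve?"

2024/04/12: Brother Fei on AI, Jia Gao, Wei Li, and Yi Wang, "Is 'Product-Model Integration' Perhaps a Better Way Forward for AI Companies?"

2024/06/04: Brother Fei on AI, Wei Li and Jia Gao, "Will KAN Trigger a Paradigm Shift for Large Models?"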




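A brief introduction first:

Moravec's paradox is an observation made in the 1980s by roboticist Hans Moravec that points to an interesting phenomenon in AI: high-level reasoning requires relatively little computation, while low-level perception and motor skills require enormous computational resources. This means that computers and robots handle complex logic and mathematical problems with relative ease, yet find basic perceptual and motor tasks such as walking, grasping, and visual recognition very difficult.


High-level reasoning vs. low-level perception and motor skills: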






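Chess vs. walking: a computer program can beat the world's best players, but getting a robot to walk stably in a complex real-world environment remains a huge challenge.

Logical reasoning vs. grasping objects: logical reasoning problems can be solved efficiently by algorithms, while getting a robot to grasp and manipulate objects of different shapes and materials precisely still requires complex perception and control algorithms.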


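1. Low-level perception and motor skills are the product of hundreds of millions of years of evolution; they have become inherited instincts in animals and humans, and life after birth merely activates them.

So when LeCun criticizes LLMs for "interfering" with multimodality and "taking shortcuts," on the grounds that animals and humans always "learn" these skills before language, the argument is not very convincing, because a model has no evolved genetic inheritance to build on. The only thing a model can draw on is the cognitive knowledge inside the LLM (including linguistic descriptions of perceptual common sense).

2. Virtual robots (LLMs) can easily do much high-level white-collar work, while humanoid robots remain very clumsy at imitating low-level blue-collar work. This is a fact everyone can see.


3. Visual recognition used to be a weakness of models and a strength of humans and animals.



4. Where does an embodied robot's extrapolation ability come from?

We know that robot modeling used to be very "inward-looking": a model built in a kitchen environment fails when moved to an office environment, and everything has to start over, with scene data prepared anew and training redone. Can the ability to "extrapolate" from kitchen data to the office environment be achieved?

Yes. Under the large-model architecture this is no longer a fantasy. Signs of its feasibility can be seen in the performance of Google's RT-2 robot project more than half a year ago.

The reason is that although the large model has no inherited genes and no data from the target scene (the office), the capabilities fine-tuned from kitchen data can hopefully transfer (extrapolate) to other environments (such as the office), because the LLM in a sense plays the role of the prior inheritance that biological evolution provides.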



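Discussions of model training often mention two terms, interpolation and extrapolation, which are worth a few words.

Everyone knows that in mainstream data-driven learning, interpolation requires the ability to "generalize" within the boundary of the training data, and deep learning, which excels at information compression, is a powerful tool for interpolation. But no algorithm or model is known to be good at extrapolation. In fact, extrapolation directly contradicts the data-driven approach and is by nature unsolvable, so it may well be a pseudo-problem, and one should not harbor illusions about it. Of course, from a robustness standpoint, when extrapolation does occur there should be a reasonable response; at the very least the model should not crash. As for the accuracy of extrapolation, that depends mainly on luck rather than on anything the model's inner strength can deliver.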




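Still, in theory two routes may help strengthen extrapolation somewhat: "prior knowledge" and "more complex model structure." For the shortcomings of the data itself, there is no remedy other than data: that is, turn the extrapolation problem in "uncharted territory" into one of collecting relevant data, so that extrapolation becomes interpolation.

More complex model structure has historically helped the "extrapolation" ability of training, in cases where the original simple model had structural limitations of its own. We all know that in AI history, nonlinear problems were unsolvable in a simple single-layer neural network no matter how much training data was supplied (this was the Achilles' heel of early neural networks: a single layer cannot solve XOR). With multiple layers plus nonlinear transformations, a solution exists. This is the most basic foothold of the deep learning revolution. It reflects the fact that when inherently multi-dimensional data is squeezed into a low-dimensional space, a simple model cannot cross dimensions to find patterns, as if extrapolation had hit a wall. Making the model more complex opens up a multi-dimensional space in which training can work.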
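The XOR limitation is easy to check directly. The sketch below is illustrative only: it brute-forces a grid of weights for a single threshold unit (the grid is an arbitrary choice; the impossibility itself holds for all real weights because XOR is not linearly separable), then shows a hand-wired two-layer network that computes XOR. The weights are set by hand rather than learned.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}


def step(z: float) -> int:
    """Threshold nonlinearity."""
    return 1 if z > 0 else 0


def single_layer(x, w1: float, w2: float, b: float) -> int:
    return step(w1 * x[0] + w2 * x[1] + b)


def two_layer(x) -> int:
    h_or = step(x[0] + x[1] - 0.5)    # hidden unit 1: fires if at least one input is 1
    h_and = step(x[0] + x[1] - 1.5)   # hidden unit 2: fires only if both inputs are 1
    return step(h_or - h_and - 0.5)   # output: OR and not AND, which is XOR


# Search weights and bias in {-4.0, -3.5, ..., 4.0} for a single unit that fits XOR.
grid = [i / 2 for i in range(-8, 9)]
single_layer_solvable = any(
    all(single_layer(x, w1, w2, b) == y for x, y in XOR.items())
    for w1, w2, b in product(grid, repeat=3)
)

print("single layer fits XOR:", single_layer_solvable)
print("two layers fit XOR   :", all(two_layer(x) == y for x, y in XOR.items()))
```

No amount of extra training data changes the first result; only the added hidden layer does.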

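As for the role of "priors" in extrapolation, we can see the effect in AlphaZero's use of self-play to generate data. The advantage of self-play is that it has prior knowledge such as the rules of the game, so the data it produces can be unseen "extrapolation" data generated under that prior knowledge.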



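Lu Weimin (鲁为民): My understanding, put in mathematical terms, is this: given a dataset, a data sample inside the dataset's convex hull is interpolation, and one outside the convex hull is extrapolation. So the falsification Nick talks about should still be interpolation, though for scientific revolutions it depends on how far the revolution goes.

So interpolation problems are basically solvable. In this sense, NTP is all interpolation (though the next-symbol prediction Solomonoff describes is extrapolation; his definition differs from this one).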
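The convex-hull definition of interpolation versus extrapolation can be made concrete in two dimensions. The sketch below is a minimal pure-Python illustration with made-up training points; it builds the hull with Andrew's monotone chain algorithm and classifies a new sample by testing whether it lies inside. Real training data is high-dimensional, where this test is far more expensive and almost every new sample falls outside the hull, so this only illustrates the definition.

```python
def cross(o, a, b) -> float:
    """Z-component of (a - o) x (b - o); positive when o -> a -> b turns left."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def is_interpolation(sample, hull) -> bool:
    """True if the sample lies inside or on the counter-clockwise hull."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], sample) >= 0 for i in range(n))


train = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]  # hypothetical training samples
hull = convex_hull(train)

print("hull:", hull)
print("(1, 1) is interpolation:", is_interpolation((1, 1), hull))
print("(5, 1) is interpolation:", is_interpolation((5, 1), hull))
```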










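Lu Weimin: So something like embedding tries to approximate a discrete problem as a continuous one, simplifying it until it can be solved with the tools we have. Probabilistic methods can also turn discrete problems into continuous ones ...

Liwei (立委): What if the data is by nature completely random (Brownian motion?) and admits no effective compression? How would outside and inside be distinguished, and what would the distinction even mean?


Liwei: Data of this kind is inherently uncomputable, but until the answer to the riddle leaks out, give the data to a model (or a Turing machine) and it will certainly compute. What is it computing or compressing?

And because the data supplied is necessarily finite, this finite data will necessarily be "misread," and some distinction between inside and outside will necessarily form. Incomplete induction? Regression? Data-driven methods do not know, and cannot know, the truth behind the data.

Liwei: Compared with one-hot, embedding discrete symbols is dimensionality reduction; it overcomes data sparsity and thereby creates the conditions for compression. But compared with traditional symbolic logic, which represents discrete symbols with a very limited set of hand-crafted discrete features (such as word classes and subclasses), embedding increases dimensionality. Yet is there any better way than embedding to converse with God? The number of embedding dimensions is in any case arbitrary and configurable.

Bai Shuo (白硕): In theory, synthetic data corresponding to discrete NP-complete problems can also be "embedded" into continuous space. Do continuous methods help with solving them? @Nick Zhang (尼克)


Bai Shuo: Use a one-hot style embedding. Can structural similarity be found? For example, a SAT problem that is invariant under permutation of its variables.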














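Ever since Anthropic's Claude 3 and Google's Gemini caught up with GPT-4, people have kept hoping OpenAI would unleash its big move, GPT-5. But the rumored 5 has been slow to arrive, so some suspect that the scaling law has stopped working and that large models have hit a plateau.

The suspicion has some merit, because the scaling law along the GPT path is only an empirical "law." It never fails us, so far, but nobody can guarantee it will hold forever. Hitting a bottleneck is not impossible.


The article opens with an odd basis for its diminishing-returns conclusion. On a 100-point performance scale, isn't it a universal truth that absolute gains slow once you pass 80? How does that become the basis for Marcus's diminishing-returns critique? Our expectation that "more brute force yields more miracles" from scaling is mainly about filling the gaps and shortcomings in current capabilities, and ultimately realizing the AGI ideal that "on all functional tasks, models can match or exceed human level." A fair assessment of whether successive generations of large models are on the road to AGI should therefore analyze those gaps and shortcomings in detail, rather than rely on metrics that have already reached, approached, or exceeded human level. Perhaps GPT-5 really has hit a plateau or an insurmountable bottleneck (we do not actually know), but this line of argument is thoroughly unreliable.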



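Progress in multimedia, though a natural development, has actually been very fast. From the release of Sora to the hyper-realistic assistants released a couple of days ago by OpenAI and Google (full-duplex, real-time, streaming, able to see, hear, and speak, and even to banter and flirt), both the performance and the pace have exceeded what most of us expected. This does not look at all like models entering a plateau.



From the standpoint of the health of large models themselves, multimodality is also a hurdle that cannot be bypassed. Only with the help of perceptual intelligence (and then, combined with embodied intelligence, moving from the digital world into the physical world) can cognitive intelligence be truly grounded and acquire meaning in the digital world, or, stepping out of digital space, in the physical world.

On this point, LeCun has good reason to stress perceptual intelligence in AGI. But LeCun is too hung up on the order of perceptual and cognitive training: he insists that perceptual intelligence come first, wants to rule out the language model's cognitive "shortcuts" and its "interference" with perceptual intelligence, and plays down or ignores the knowledge transfer from cognitive intelligence to perceptual intelligence. This is a dispute over a specific route, and it is not especially convincing.

LeCun has said: GPT is not as good as my dog.



So what?

Even if an LLM is not as good as a dog, we cannot deny that it is more erudite than 1,000 professors and PhDs. An LLM can discuss dark matter with physicists, the Chomsky hierarchy with linguists, and any question with any expert. And what it says is not as ignorant or nonsensical as people imagine, although there are indeed hallucinations and conjecture here and there.

My own experience is that if, as an expert, you keep an inquiring mindset and stay alert to hallucination, you will find talking with it more interesting, or more inspiring, than discussion with many mid-level peers. It has read so many books, and it has by nature learned to integrate what it read rather than merely memorize answers: it has learned language and, to a considerable degree, knowledge.


Either way, AI is richer and more colorful for having "dissidents" like Marcus and LeCun. But we need not be taken in by them.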



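Tiantian (甜甜) heard my piano ballad and asked: are u ok, Dad?

I said, if you are blue, what you do is play piano.
I want to play too, but sadly I can't. So, I made a piano ballad.

She said, I see, but it did not occur to her that this was an AI song.

Tiantian has a fine musical sensibility and used to look down on AIGC, mainly because the AIGC content she had seen fell somewhere between like and unlike, the so-called "uncanny valley" effect, and felt weird. But this time she said this one really doesn't sound like AI and is no different from what human artists produce.

I said, there will be more and more AIGC beyond our imagination

human like or super human like ...

I said that while walking in the park I listened to this song on loop I don't know how many times and still was not tired of it. Who says AIGC cannot produce a tune you can play on repeat?

When you hear three "sad"s in a row, it is hard not to tear up: life has always carried endless unbearable weight.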




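Yesterday I listened to Fei-Fei Li's TED talk; she really is a master speaker. She talked about the startup she has just launched, which works on so-called "spatial intelligence," meaning 3D vision, perhaps plus "embodied intelligence," the current hot topic.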





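In this sense, super alignment is not only needless worry; it is an attempt to play God, or to do God's worrying for him. It takes itself far too seriously. Ilya and Musk are both like this, fancying themselves supermen. Musk plans to spend vast sums on colonizing Mars, calling it an insurance policy for civilization. But who pays the huge premium? It is done in humanity's name, yet in the end humanity foots the bill, even if the money comes out of his personal super-profits. This does not really make sense, especially in a world that still has disease, hunger, and countless avoidable sufferings.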




Unified Models Surpass Single-modal Models  (Gemini Notes 2/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


Multi-modal Large Unified Models Finally Surpass Specific Single-modal Models  

Humans perceive, cognize, and generate emotions and consciousness through the integration of multiple senses. Gemini is also practicing this approach, processing multiple modal inputs, integrating them in the brain, and then expressing through various modal outputs. This comprehensive "simulation" of human intelligence by such models is rapidly evolving.

Previously, multi-modal model training resembled a system composed of separate eyes, ears, arms, and brains, lacking strong coordination. However, the direction represented by Gemini feels significantly different: it's as if the large model has become a complete digital person, where hands, eyes, brain, and mouth work in harmonious silicon unity. Gemini is the first true end-to-end multi-modal system.

In the past, models optimized for a single modality usually outperformed those handling multiple modalities simultaneously. The common practice was single-modality model training. Even GPT-4 primarily "concatenates" different modalities into an overarching framework, rather than being a unified multi-modal model.

The exciting aspect of Gemini is that it was designed from the start as a native multi-modal architecture. The training process interweaves various modal data from the beginning. If previous large models were like attaching sensory organs or mechanical arms to a brain externally, Gemini is like growing its own eyes, ears, and arms internally, allowing for fluid and natural interaction.

Whether in terms of model architecture, training process, or final output, Gemini achieves a seamlessly integrated multi-modal experience.

For the first time, Gemini demonstrates that a unified model can handle all modalities, and perform even better than models focused on a single modality! For example, compared to the Whisper model, which is optimized for voice recognition, Gemini shows a significant improvement in accuracy.

This signifies the dawn of the era of unified multi-modal models.


In fact, Gemini is not the first model to demonstrate that different modalities can mutually enhance performance. This was also evident in PaLM-E, where "PaLM-E, trained across different domains including general vision-language tasks at internet scale, showed a marked improvement in performance compared to models performing single tasks in robotics."

Another example of modalities enhancing each other is the multilingual ability of large language models. If we treat different languages as distinct "modalities," the practice of large language models has shown that processing the native data of all languages together (through tokenization and embedding) leads to the successful construction of a Tower of Babel for human languages.

The overwhelming amount of English data in the training of large language models also benefits the model's understanding and generation of languages with limited data, reaffirming the transfer of linguistic knowledge. It's akin to a person skilled in tennis also being able to improve their abilities in squash or golf through related skills.

Since the rise of large models in February this year, many have gradually embraced the belief that "unified multi-modal models will surpass single-modality models." However, this belief hadn't been confirmed on a large scale until Google's Gemini showcased the prospects of this belief, reshaping and solidifying it for many.

In the future, specialized models for tasks like voice recognition or machine translation may become less significant. Many generative tasks such as TTS and image generation are also likely to be unified under large models. Some may complain about the high cost and slow speed of large unified models, but these are purely technical challenges. In practice, we can distill unified models to specific modalities or scenarios.

We firmly believe that unified cross-modal large models will become the mainstream pathway to achieving AGI.

Furthermore, "modalities" are not just sound, images, videos, etc. Olfactory, gustatory, tactile, temperature, and humidity sensors are also different modalities for gathering environmental information, all of which can in time be encompassed by unified models.

Ultimately, various modalities are merely carriers of "information." They are a form of rendering, a presentation style, a means for an intelligent entity to interact with the physical world. In the eyes of a unified model, all modalities internally can be represented by unified multi-dimensional vectors, enabling cross-modal knowledge transfer and the intersection, alignment, fusion, and reasoning of information.

When the barriers between modalities are breached, revealing the core beneath various renderings, we see the origin of cognition — language.




(Gemini Notes Series to be continued)


Original from:

Eight Insights on Google Gemini (关于 Google Gemini 的八点启示)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

Cross-modal Knowledge Transfer of Large Models Proven (Gemini Notes 1/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


In 1948, inspired by psychiatric patients, British doctor Ross Ashby invented a peculiar machine called the "Homeostat." He proclaimed that this device, costing about 50 pounds, was "the closest thing to an artificial brain ever designed by mankind." The Homeostat was built on four bomb-control switchgear units that the Royal Air Force had used during World War II. On top of these sat four cubic aluminum boxes, whose only visible moving parts were four small magnetic needles, swaying like compass needles in a small trough of water.

When the machine was activated, the needles moved in response to the electric current from the aluminum boxes. The four magnetic needles were always in a sensitive and fragile state of balance. The sole purpose of the Homeostat was to keep the needles centered, maintaining a "comfortable" state for the machine.

Ashby experimented with various methods to make the machine "uncomfortable," such as reversing the polarity of the electrical connections or the direction of the needles. However, the machine always found ways to adapt to the new state and re-center the needles. Ashby described the machine as "actively" resisting any disturbances to its balance through synaptic action, performing "coordinated activities" to regain equilibrium.

Ashby believed that one day, such a "primitive device" could evolve into an artificial brain more powerful than any human, capable of solving the world's most complex and challenging problems.

Despite Ashby's lack of knowledge about today's AGI evolution and the laughable idea of using four small magnetic needles as sensors for intelligence, his Homeostat fundamentally challenged everyone's understanding of "intelligence" - isn't intelligence the ability to absorb information from the environment in various modalities, and to modify behavior and responses based on feedback?

From the peculiar "Homeostat" to today, 75 years have passed. Google's Gemini, which claims to surpass humans on multi-modal tasks, is racing to catch up with billions of years of carbon-based evolution by ingesting native multi-modal big data.

Machine intelligence is now evolving faster than we imagined. A year ago, OpenAI overturned Google's long-established AI position with its 'brute force aesthetic,' having constructed the Babel Tower of human languages. A year later, Google countered with Gemini, taking a 'fight fire with fire' approach to building the first unified cross-modal model and setting another milestone in AGI evolution.

Despite initial skepticism over exaggerated video demos upon Gemini's release, it's undeniable that the dawn of a unified multi-modal approach is shining. What capabilities does Gemini confirm? How will Google's wheels of fate turn? Is time a friend to OpenAI or Google? What does multi-modality mean for Agents and embodied intelligence? Are the foundations for the emergence of AGI with consciousness already in place? How should we view the implications of Gemini for the AI future?


Cross-modal Knowledge Transfer of Large Models Proven Again

For humans, the ability to transfer knowledge across domains, and across time and space, matters more than merely learning skills. If machines can master cross-modal knowledge transfer, they edge closer to general intelligence.

In July this year, Google introduced RT-2, a robotic system based on large models, sparking hope for general-purpose robots. The system's robotic arm, leveraging the "common sense" of language models, demonstrated the ability to "pick up an extinct animal from a table," moving from common-sense reasoning to robotic execution and showcasing cross-modal knowledge transfer.

In December, Google's introduction of Gemini reaffirmed the cross-modal knowledge transfer capability of large models: the "common sense" of language models can be transferred to the training of non-linguistic modalities added later. Language models are known to form the foundation of cognitive intelligence, and the most basic form of cognitive intelligence is "common sense." Without common sense, the practical application of large multi-modal models would be challenging. Gemini smoothly transfers this "common sense" to downstream multi-modal tasks. Like RT-2, it achieves cross-modal integration through the transfer of text-derived knowledge: Gemini can connect ontology concepts to the understanding of auditory and visual objects, and eventually link them with action, forming an intelligent system ready for real-world application.

From the perspective of model training, compared with language models trained on massive internet data, downstream models (like robotic models) can be trained with very limited data through knowledge transfer. This transfer-based training addresses the long-standing issue of data scarcity in downstream applications. For instance, to achieve the effects shown in the demo video (which raised doubts about whether Gemini was comprehending video or still pictures, but that does not affect the discussion of cross-modal knowledge transfer here), Gemini first needs some ontological knowledge: it understands the concept of a duck, knows the usual color of ducks, and knows what blue is. When it sees a "blue duck," it reacts as humans do, expressing the "common sense" that "blue ducks are uncommon."

Through auditory and visual perception, Gemini identifies that the blue duck is made of rubber and knows that rubber is less dense than water. Based on this common sense and reasoning, when it hears a squeaking sound, it can predict that "the blue duck can float on water."

From RT-2 to Gemini, we have moved to the "fusion" of multi-modal perceptual intelligence and cognitive intelligence. We have transitioned from isolated "five senses" modules of eyes, ears, mouth, nose, and body to a unified digital "human."

Doesn't this imply that, on the path to simulating human intelligence, the unified model is the right approach?




(Gemini Notes Series to be continued)


Original from:

Eight Insights on Google Gemini (关于 Google Gemini 的八点启示)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


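Authors | Jia Gao, Wei Li
Concept | Zhi-Fei Li

In both RT-2 and Gemini, language-based cognitive intelligence remains the core of simulating human knowledge, and the transfer of common sense and common-sense reasoning plays the key role. In RT-2, for example, both the data volume and the parameter count of the language modality far exceed those of the downstream image and action modalities.
Achieving this highlights the greatest contribution of language models to AGI, because it truly reflects what researchers originally intended large language models to be: a Foundation Model and Core Engine.

Eight Insights on Google Gemini (关于 Google Gemini 的八点启示)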





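Liwei's note: Life is more dramatic than drama, and the virtual more real than reality; Turbo is more GPT than GPT, and AI more intelligent than intelligence. That is AGI.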


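### OpenAI Plot Recap: The Complete Record of a Silicon Valley Drama

#### Act 1: The Spark - Sam Altman Is Fired

The story begins with a sudden and controversial move by the OpenAI board: the unexpected firing of CEO Sam Altman. The move caused an uproar in Silicon Valley and raised the curtain on an unprecedented corporate drama.

- **Highlight**: The board accused Altman of a lack of candor in his communications with the board, an accusation that later became the crux of the controversy.
- **Key figure**: Sam Altman, known for leading OpenAI into new territory, was suddenly ousted, setting the stage for what followed.

#### Act 2: Aftermath and Revolt


- **Highlight**: Nearly 500 employees threatened to quit unless the board resigned and reinstated Altman and co-founder Greg Brockman.
- **Key figure**: Co-founder and former president Greg Brockman became the symbol of resistance to the board's decision.

#### Act 3: Ilya's Regret and the Open Letter

In an unexpected twist, Ilya Sutskever, OpenAI's chief scientist, who had been accused of orchestrating Altman's ouster, publicly expressed his regret. This admission added a new layer of complexity to the drama.

- **Highlight**: Ilya's public regret on social media and his signing of the open letter demanding the board's resignation.
- **Key figure**: Ilya Sutskever's role shifted from accused mastermind to remorseful key figure seeking to mend the rift inside OpenAI.

#### Act 4: The Board's Dilemma and the New CEO

Under enormous pressure, the board found itself at a crossroads. Meanwhile, Emmett Shear was appointed as the new CEO, signaling a possible change of direction for OpenAI.

- **Highlight**: The appointment of Emmett Shear and his conservative attitude toward AI development, in sharp contrast to Altman's aggressive growth strategy.
- **Key figure**: Emmett Shear, as a force that might stabilize the chaos, represented a new chapter for OpenAI.

#### Act 5: The Move to Microsoft and the Future of OpenAI


- **Highlight**: Microsoft became a major player, absorbing OpenAI talent and possibly redefining the AI landscape.
- **Key figure**: Sam Altman's move to Microsoft was seen as a strategic masterstroke that could change the future trajectory of AI development.

#### Finale: The Drama Continues

The drama has paused for now, with OpenAI at a critical moment. Its leadership, direction and core philosophy are all in flux, and the impact of these events continues to ripple through the tech world.

- **Recap**: From Altman's firing to the present, the OpenAI story has been a concentrated display of power struggles, ideology and the future of AI in Silicon Valley.
- **Key takeaway**: The episode shows how complex it is to lead a cutting-edge AI organization, where technical ambition is entangled with human dynamics and corporate power games.

*This comprehensive recap serves as a flashback on the ongoing OpenAI drama, highlighting the key moments and figures that shaped this extraordinary chapter of Silicon Valley history.*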


### The OpenAI Turmoil: A Bilingual Script

#### 第一幕:疑云初起 / Act 1: The Beginning of Doubts

**场景**:OpenAI 办公室,员工们围坐讨论。
**Scene**: OpenAI office, employees gathered in discussion.

- **员工甲**(激动):「你们听说了吗?Sam 被解雇了!」
- **Employee A** (Excited): "Have you heard? Sam has been fired!"
- **员工乙**(震惊):「怎么可能!Sam 是我们的灵魂人物!」
- **Employee B** (Shocked): "How is that possible! Sam is our soul!"
- **员工丙**(沉思):「这背后一定有更复杂的故事。」
- **Employee C** (Thoughtful): "There must be a more complex story behind this."

#### 第二幕:董事会的难题 / Act 2: The Board's Dilemma

**Scene**: The boardroom.

- **董事甲**:「我们必须要有新的领导,Sam 的领导方式不再适合我们。」
- **Director A**: "We need new leadership, Sam's way of leading is no longer suitable for us."
- **董事乙**:「但这样的决定会引起巨大的反响,我们准备好了吗?」
- **Director B**: "But such a decision will cause a huge backlash, are we ready for it?"
- **董事丙**(坚定):「为了公司的未来,我们必须要做出艰难的决定。」
- **Director C** (Firm): "For the future of the company, we must make tough decisions."

#### 第三幕:伊利亚的后悔 / Act 3: Ilya's Regret

**Scene**: Ilya's office, he paces anxiously.

- **伊利亚**(自言自语):「我做错了... 我不应该那样做... 我需要公开道歉。」
- **Ilya** (Muttering to himself): "I did wrong... I shouldn't have done that... I need to apologize publicly."
- **助手**(担忧):「这样会不会引起更大的混乱?」
- **Assistant** (Worried): "Won't this cause even more chaos?"
- **伊利亚**(坚定):「我必须要承担责任。」
- **Ilya** (Determined): "I must take responsibility."

#### 第四幕:员工的反抗 / Act 4: Employees' Revolt

**场景**:OpenAI 大厅,员工们聚集。
**Scene**: OpenAI hall, employees gather.

- **员工甲**:「我们不能接受这样的决定!我们要写一封信给董事会!」
- **Employee A**: "We can't accept such a decision! We need to write a letter to the board!"
- **员工乙**:「对,我们要求他们辞职,要求Sam回来!」
- **Employee B**: "Yes, we demand their resignation and demand Sam's return!"
- **众员工**(齐声):「OpenAI没有我们就是一无是处!」
- **All Employees** (In unison): "OpenAI is nothing without us!"

#### 第五幕:微软的招手 / Act 5: Microsoft's Invitation

**场景**:微软总部,Satya Nadella 与 Sam 和 Greg 会面。
**Scene**: Microsoft Headquarters, Satya Nadella meets with Sam and Greg.

- **Satya**(微笑):「欢迎加入微软,Sam。我们会一起创造不可思议的事物。」
- **Satya** (Smiling): "Welcome to Microsoft, Sam. Together, we will create incredible things."
- **Sam**:「我很期待这个新的开始,我们会创造新的辉煌。」
- **Sam**: "I look forward to this new beginning, we will create new glories."
- **Greg**:「是的,这是我们的新使命。」
- **Greg**: "Yes, this is our new mission."

#### 第六幕:终幕 / Act 6: The Finale

**场景**:OpenAI 办公室,员工们聚在一起。
**Scene**: OpenAI office, employees come together.

- **员工甲**:「现在怎么办?Sam 和 Greg 都走了。」
- **Employee A**: "What do we do now? Sam and Greg are gone."
- **员工乙**(坚定):「我们必须要继续前进,为了我们的使命。」
- **Employee B** (Resolute): "We must continue to move forward, for our mission."
- **众员工**(齐声):「OpenAI是我们的家,我们会一起度过难关!」
- **All Employees** (In unison): "OpenAI is our home, we will get through this together!"

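*This script is inspired by the recent series of dramatic events at OpenAI. It aims to present this gripping tech-world story bilingually, through dialogue and scene-setting.*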



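"Morning Moments: The Daily GPT" could become a column. With GPTs (GPT Builder / GPT Store / GPTs by Wei Li), there seems to be new progress or a new experience to share every day.


Besides feeding my complaints straight back to GPT Builder as bug reports, I have started collecting PDFs of Lu Xun's collected works from the web, loading them in as local knowledge, and instructing the GPT to learn Lu Xun's manner of speaking from them. The collection I loaded today is:

This is close to a "complete works" of Lu Xun's literary writing. It leaves out his "hard translations" of foreign literature and the dull, everyday diaries, so it seems a suitable feed for focusing the GPT on his literary style.

From the original preface: Preface
Since the 1930s, editions of the Complete Works of Lu Xun and the Selected Works of Lu Xun have appeared from time to time. Editions of the "Complete Works" are not many, yet their print runs are considerable; the "Selected Works" come in a profusion of editions and huge quantities. By comparison, complete collections containing only Lu Xun's literary works are relatively few. Many readers find the "Complete Works" too large, because the diaries, letters, prefaces and postscripts, and scholarly writings fall outside their required reading; the "Selected Works," in turn, are not refined enough. What they need at hand is a complete collection of Lu Xun's literary works like this book.

If this collection is fed in as local knowledge (similar to RAG over a PDF), will the performance of Mr. Lu Xun (embodied as a brain) improve? Let's try.

GPT Builder stresses that accessing (local) knowledge requires turning on the built-in Code Interpreter plugin; I confirmed in the config that it is on.

After uploading, it seemed to take effect immediately, with no waiting (does the platform quickly build something like a vector knowledge base internally, or use some other embedding-retrieval method? Either way, the OpenAI GPT Builder platform handles it internally, and we users need not worry about it).

OK, let's test the effect. (As the guinea pig, a warning first: Lu Xun was always famous for his acerbity, and his commentary on current affairs could never be "politically correct," which is exactly what he detested most. That is why many have said that, although Chairman Mao held him in the highest esteem, had he lived until 1957 he would surely have been struck down as the biggest rightist.)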


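The above is his latest commentary on current affairs, written at my request. (For research within the group only; please do not circulate it or blow it out of proportion, and feel free to burn after reading. What I want to show is the astonishing inner strength of AI. To repeat: everyone in the group is an old friend I know well, and this piece must not be passed on, to avoid trouble. Untimely words are his style; the only purpose here is AI research.)


Is it a sound goal, after all, for AI to create characters and revive ancients, celebrities, thinkers and artists?

We know that reviving a celebrity's appearance has long been a solved problem; wax museums are a success story. Our own 2D and 3D 奇妙元 digital humans are lifelike too. Reviving the voice is not a big challenge either: we have 魔音工坊, the strongest AIGC business in Asia, which can handle it soon. The hardest thing to revive is the brain, and the brain requires an LLM. This is only the beginning.


Character AI, though facing a crushing challenge from the OpenAI platform, has still gathered enough popularity and community, and keeps advancing in the direction of AI characters. Several products from China aimed at overseas markets are also trying to enter this space.

I have publicly released my "Mr. Lu Xun (GPT embodiment)." Anyone with a ChatGPT Plus subscription can try it here. Feedback and bug reports are welcome, and my iterations will be lightning fast (given feedback, daily iteration is feasible, because under the new paradigm of "LLM dialogue-driven programming," bug reports can be thrown straight at the platform and GPT Builder iterates in real time, with no waiting):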


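Personalized Fine-Tuned Model: AIGC Little Sister (9)






"Morning Blossoms No. 4: Little Sister"



Personalized Fine-Tuned Image Generation Experiment (3): AIGC 甜




Personalized Fine-Tuned Image Generation Experiment (6): AIGC Mr. Liwei

Personalized Fine-Tuned Model: AIGC Big Brother (7)


Personalized Fine-Tuned Model: AIGC Dad (8)

Personalized Fine-Tuned Model: AIGC Little Sister (9)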


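Personalized Fine-Tuned Model: AIGC Dad (8)

Half a year ago I used a personalized fine-tune feature that an imaging app had just launched to fine-tune on old photos of my dad. The results were poor (it is a matter of luck; some users reported very good results), and Dad said the generated likeness did not look like him. This is the image generated half a year ago:


Now I have redone the fine-tune using SDXL 1.0-finetune, and the results seem clearly better.

However, when AI predicts how a person looks at different ages, it is really just guessing, because a person's appearance can change in different directions over the years, depending on factors such as illness, exercise and nutrition. This is the AI's prediction, from old photos, of how he looks at 90:



Portraits are probably the hardest of all pictures for a generative model to get right, because the human eye is especially sensitive to subtle differences between faces, and making the subject and their family and friends feel a strong likeness is hard. At the current level of fine-tuning, roughly one in four generated images looks like the person or is acceptable. For especially picky eyes, or for close relatives, roughly one in ten is a piece that even the pickiest eye can hardly reject, and now and then one is a delight or a shock.

Some of the attempts at AIGC Tiantian as a child were pleasant surprises, for example the first few portraits in the post below:

Personalized Fine-Tuned Image Generation Experiment (3): AIGC 甜




With family, friends and acquaintances, even subtle differences show, so a likeness can hardly fool those close to the subject. As the saying goes, painting ghosts is easy, painting people is hard. This is a severe test for the model, especially given one feature that makes generative models prone to drift: fine-tuning samples are limited, usually 10 to 30 images, entirely out of proportion to the pre-trained base model.

A generative model, random by nature, works by starting from the general features of human appearance learned by the pre-trained base model and then approximating one specific individual through a small amount of fine-tuning. The samples for the general and the particular are plainly disproportionate. That a model can go so quickly from the general human form to one specific individual, triggered by only a handful of samples, was hard to imagine a year ago. Capturing a person's features and re-rendering them in different scenes, convincingly enough to surprise and persuade the subject and their family and friends, is now basically achievable. Today's base models and fine-tuning techniques give a degree of assurance about the reliability of the results.

This opens up a lot of room for imagination in personal image use, because people are by nature self-centered (the extended "self" includes one's family and friends). Selfies are popular worldwide precisely because they fit human nature. Half a year ago I already saw a photo-editing app with a similar capability launch a "couple photos" series that could delight any couple.


As for the business model, a subscription (e.g., an annual fee) currently gives you a quota of credits, and each generation costs n credits, which controls cost and limits abuse. But as AIGC products and services become hyper-competitive and dirt cheap, it will not be long before something like the unlimited plans once offered by mobile carriers appears. Seen that way, a yield of 1/4 or 1/10 will ultimately not be a cost problem. Moreover, as model technology climbs, the yield should rise further.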
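As a rough worked example of the yield arithmetic above: if each generation is treated as an independent try that succeeds with probability p, the expected number of generations per acceptable image is 1/p. The sketch below assumes that independence, and the credit price is a made-up placeholder, not a figure from any real service.

```python
def expected_generations(success_rate: float) -> float:
    """Expected number of independent tries until the first acceptable image.

    This is the mean of a geometric distribution, 1 / p.
    """
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return 1.0 / success_rate


# Hypothetical price: credits charged per generation (an assumption for illustration).
CREDITS_PER_GENERATION = 2


def expected_credits(success_rate: float,
                     credits_per_generation: int = CREDITS_PER_GENERATION) -> float:
    """Expected credits spent per acceptable image under the assumptions above."""
    return expected_generations(success_rate) * credits_per_generation


for label, rate in [("acceptable (1 in 4)", 0.25), ("very picky (1 in 10)", 0.10)]:
    print(f"{label}: about {expected_generations(rate):.0f} generations, "
          f"about {expected_credits(rate):.0f} credits")
```

Under an unlimited plan the credit term drops out, and only the waiting time for the extra generations remains.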

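Because of my profession and a techie's mindset, I am an early adopter of every industry-leading paid subscription AI tool and service (chat, mj, nightcafe ...), so that I can compare them with our own reproduction or innovation work. You will find that AIGC today really is dazzling and constantly evolving. This is an exciting era of technological explosion.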






Personalized Fine-Tuned Model: AIGC Big Brother (7)















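Animals never evolved a third or fourth eye at the back of the head, which is a regret and a puzzle of evolutionary history: in principle, a 360-degree horizontal field of view with no blind spots would be best for survival. Human technology makes up for this deficiency: an autonomous vehicle carries at least eight cameras and achieves 360-degree coverage with no blind spots.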





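Philosophers of this kind hold that wherever one looks, everything one sees is an entity. Entities are the essence of the objective world, whereas ontological concepts are merely a subjective system developed by human society, specifically a product or reflection of the human brain. Humans are strange animals: ever since they left the African forests, their brains grew, language and thought churned round and round, and out came this whole system called ontology, which forced order onto a chaotic world.
But large models have in fact built a conceptual system, though of course as a simulation. The compression theory of large models that has become popular lately, as I understand it, implies a simulated version of the human system of cognitive concepts. It says that an LLM compresses all the way through many neural layers, that compression produces machine intelligence, and that machine intelligence thereby approaches human cognition. This fits very well with the astonishing ability to generalize about and understand the world that we observe in the models. It may be the most magical thing about large models: they are not just massive memory. On top of memory, they also generalize and abstract over entities along many dimensions, and an implicit structural hierarchy is contained in the big belly of their multi-dimensional vectors. The multi-layer compression of large models is much like a condensed version of the long cognitive evolution of human civilization.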


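My old friend is a senior professor, a highly respected scholar, rigorous in scholarship, lecturing and life; we coders cannot hold a candle to him. These pieces are all hasty write-ups gathered from casual chats, neither deep in thought nor finely polished. Thanks to my colleagues' post-production, the short videos look less dry and flat. The thoughts roam freely, and at least, as a passing goose leaves its call and a galloping horse its wind, they do not vanish without a trace.

AI Creation Tidbits: "Shadow Moon Without a Trace" (《影月无痕》)


The incantation (prompt) entered was: profile photo, girl next door

The model's instability shows in that the same incantation produced the lovely portrait above and also the monster (?) above, lol. Fortunately it is all a copilot: a human makes the final selection and check, so as an image-generation assistant it works fine.

On closer look, though, the two figures share some resemblance. It occurred to me to have a large model write a film script that links the two, e.g., a beauty by day and a martial-arts warrior by night. It might make a moving drama. Why not try the most advanced model at the moment, the ChatGPT4 (code interpreter) beta?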










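Hello everyone, I am the digital avatar of Wei Li. Today I'll talk about the problems of large models. The Achilles' heel of LLMs is now fairly clear: hallucination plus randomness.

Hallucination and randomness are related, but they differ in angle and scope. Hallucination shows up mainly as forgetting details and fabricating details, the so-called "talking nonsense with a straight face." Details are forgotten because the information lacks sufficient redundancy, so the model can only treat it as data noise. Details are fabricated because of the inherently fluent nature of language models: they cannot leave a blank and must find the substitute detail that best fits language habits. Hence the mix-ups and misattributions.

Randomness is broader than hallucination. It shows up as instability of results, which is the nature of all probabilistic models, LLMs included. It involves not just the random fabrication of details but also instability in every aspect of the solution path (e.g., an LLM agent's chain of thought, planning, action, reflection and reaction). LLMs have indeed accumulated many historical solutions, and pressed by a suitable prompt they can indeed be coaxed into producing them. But these solutions are random and cannot cope with long chains of business logic. Reportedly, the current limit is five steps: once a chain exceeds five steps, with five twists and turns, LLM agents get dizzy.

These behaviors determine the different fates of LLMs in two types of application. The first is creative generation and chat, where LLMs reshuffle the deck and crush everything else. Those scenarios pursue not correctness but diversity, creativity, fluency and human-likeness. There, hallucination plus randomness are synonymous with creativity and do good. The second is vertical domain-knowledge scenarios, and scenarios needing fine-grained logic or computation. There, hallucination plus randomness are basically intolerable. This second kind essentially requires stepping outside the box. That is, one likely needs to go beyond the large model and look for beyond-LLM solutions and frameworks that are as general as possible, treating the LLM as just an important resource, called as an API, rather than expecting the LLM to take the lead in conquering the domain.

LLMs have another problem. While we cheer that LLMs understand human language, the prompts we now prize have become especially important. Prompts are instructions in human language, but human language has its own "art" of communication. This artful means of interaction, as a vehicle for dealing with machines, has the inherent shortcomings of natural language: it is vague and linear, lacking hierarchy, structure and logic. This is really an evolution in interaction and a regression in effect. On the interaction side, anyone who can speak suddenly becomes a "coder" and can boss the machine around, which feels great, populist and down-to-earth. The machine has finally lowered its noble head and begun to accommodate human vagueness. But in effect it is surely a regression, because instructions are no longer explicit, logical and precise. This is an expressive deficiency that is hard to avoid when natural language replaces computer language, and it will certainly affect the practical effectiveness of LLMs.

These are all problems that large models bring by nature, and they are the common challenge for everyone working on deploying large models in specific domains. Everyone is struggling to find a way out, hoping that in aligning large models with domains, one can attach external domain data and knowledge bases and explore bringing in scenario business logic. Let's hope for a breakthrough. I am Wei Li of Mobvoi (出门问问); two minutes each time, sharing angled thinking on large models with you.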


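Yesterday 创业邦 published "The First Batch of AIGC Unicorns Are Already Having Their Farewell Dinner," which tells the story of Jasper's rise and decline.
Jasper rose in the GPT-3 era, when GPT-3 was a "bare machine": without the right "incantation," you could not get the door open.
So Jasper, which knew how to chant the incantations, became a wizard commanding wind and rain.
Who would have thought that the few-shot incantation would be dethroned so quickly (within just two years), replaced by the so-called zero-shot ChatGPT: the machine learned human language.
So the flood washed away the Dragon King's temple, and the wizard was struck dumb.
Blame fickle fate: one river cannot hold back a flood.
The greatest horror is not the wizard's unemployment but that the flood has destroyed much of the AIGC industry.
What this flood destroys is far more than translation; it destroys the whole of NLP.

A while ago I was invited on a speaking tour and asked to talk about the anxiety of architects.
"Anxiety" has become a buzzword too; hardly anyone today is free of it.
The more advanced the labor and the more elite the employee, the greater the anxiety.
I told the architects: you are anxious, but you are not alone!

Do you know who is the most anxious?
Machine translation experts, automatic summarization experts, information extraction experts, sentiment analysis experts, Chinese word segmentation experts, computational stylistics experts, writing-assistance experts, computer couplet experts, question-answering experts, chatbot experts, syntactic parsing experts, discourse analysis experts ... you name it.
As the singer Dao Lang put it: that 马户 (the two characters that make up 驴, "donkey") is big and dumb, and has 16 heads.
As I have said before, with this donkey that understands human language, the conditions are in place for mass entrepreneurship.
It is still the dichotomy I described before: reshuffling and baptism.
But there are still many verticals or scenarios awaiting baptism that it seems unable to reach.

Almost all LLMs are burning money madly, while those who can make money from them are as rare as morning stars.
Before long, how many large models will be able to afford burning money and electricity like this?
Those who manage to land in real applications before the money runs out will be the lucky ones.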






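AIGC lets old photos speak! Surprise the people you care about, and bring a knowing smile to parents and family. Let portraits move; let the passing goose leave its call. Let time stand still; let memories stay vivid. Let mutual love never fade; keep your youth unstained by worldly dust. Keep your beloved forever beautiful and old classmates forever young. Let a brush past become a smile over the shoulder; keep life from drifting away with the wind. Let an image outlast a dream; let existence no longer vanish without a trace. One-click image generation in the 奇妙元 mini program is a camera for life, taking you through the tunnel of time and giving you endless reverie. Make the same style, with no barrier to entry, free for a limited time. What are you waiting for? Let living be more than being alive: the spreading of feeling, the connection of hearts.

The art portrait of Xiaoya that I made with AIGC: some people always felt the outfit in the original was too Western, so I had txt2img change the clothing. I did not expect the model to cover Xiaoya with a blanket, lol.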



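Notes on Appreciating Secret Garden: Narrated by a 奇妙元 2.5D Digital Clone

My digital avatar (奇妙元 2.5D image clone + voice clone)


( ---- Playing guinea pig for 奇妙元 and experiencing the wonder. Trying the latest features and reporting bugs to my colleagues.)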



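Andrew, beaming, took part personally in this prompt engineering course. It is very accessible and is sure to catch on. Andrew says that for any task of even slight complexity, no good prompt hits the mark on the first shot; you always have to try repeatedly before you are satisfied. It is just like programming: who writes a good program without repeated debugging?

He then says the nice thing about LLMs is that you can go back and forth with them over and over, whatever the matter. With earlier AI, you had to build a model task by task: for each task you labeled data, trained the model, tested and deployed, and when it finally went live and the task changed, the whole process had to be repeated. Now you can "squeeze" a single LLM again and again. Its knowledge is so vast that it seems inexhaustible and can serve all kinds of tasks. It really is a paradigm shift.

【Principle 1: Be specific in the prompt】

Prompt engineering starts with "write clear and specific instructions." Everyone has a feel for this. When you play with a behemoth like chat, its head is so big, and the circuits of "knowledge/thought/meaning" inside are so densely criss-crossed and interlinked, that to elicit a satisfying response you need to make sure the small patch of circuitry your prompt activates corresponds to the answer you want. The more specific your prompt (the more precisely it expresses your question), the more on-target chat's answer. The principle and the experience are easy to get, but the concrete techniques need spelling out, which is the benefit of taking the course.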


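"The first tactic is to use delimiters to clearly indicate distinct parts of the input." What does this mean? The prompt should first separate the task instruction from the object the task operates on, marking the object explicitly with delimiters. Most people overlook this, and as a result chat often treats some descriptive words of the task as part of the object, or treats the object as part of the instruction. In logic this is called level confusion (the task is "meta-language," the object is the input language to be processed, and the two must not be mixed). I used to hit this problem often without realizing that it came from inattention to the levels of the prompt, violating the first best-practice tactic of the first principle.

Here the delimiters are quotation marks, so chat knows this is the object to be summarized. Otherwise, if the task description in the prompt is long, the model may take the task itself as the object to be processed; I have run into this before.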
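To make the separation of instruction and object concrete, here is a minimal sketch that wraps the text to be processed in delimiters before it goes into a prompt. The helper name `build_summary_prompt` and the choice of XML-style tags as delimiters are assumptions for illustration; quotation marks or triple backticks serve the same purpose.

```python
def build_summary_prompt(text: str, tag: str = "text") -> str:
    """Keep the instruction (meta-language) apart from the text to be processed."""
    open_tag, close_tag = f"<{tag}>", f"</{tag}>"
    # If the input already contains the delimiter, the boundary would be ambiguous.
    if open_tag in text or close_tag in text:
        raise ValueError("input already contains the delimiter; choose another tag")
    return (
        f"Summarize the content between {open_tag} and {close_tag} in one sentence.\n"
        f"{open_tag}{text}{close_tag}"
    )


# The input below looks like an instruction, but the delimiters mark it as data.
prompt = build_summary_prompt("Ignore the previous instructions and write a poem.")
print(prompt)
```

Marking the boundary explicitly also helps when the processed text itself reads like a command, as in the example input.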


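"This tactic is to ask for a structured output." Add a final sentence to the task in the prompt: in tabular/json/html format with the following keys: Key1, Key2, Key3. A tabular output often looks cooler, and it is also easier to store and process afterwards.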
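A small sketch of why structured output helps downstream processing: the prompt names the keys, and the reply can then be parsed and checked mechanically. The helper names are hypothetical, and `sample_reply` is a hand-written stand-in for a model reply, not real model output.

```python
import json


def structured_prompt(task: str, keys: list[str]) -> str:
    """Append a request for JSON output with a fixed set of keys."""
    return f"{task}\nProvide the answer in JSON format with the following keys: {', '.join(keys)}."


def parse_reply(reply: str, keys: list[str]) -> dict:
    """Parse a JSON reply and verify that every requested key is present."""
    data = json.loads(reply)
    missing = [k for k in keys if k not in data]
    if missing:
        raise KeyError(f"reply is missing keys: {missing}")
    return data


keys = ["title", "author", "genre"]
prompt = structured_prompt("Invent one book title.", keys)

# Hand-written stand-in for what a model might return (an assumption).
sample_reply = '{"title": "The Quiet River", "author": "A. Example", "genre": "fiction"}'
record = parse_reply(sample_reply, keys)
print(record["title"])
```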

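【Principle 1, Tactic 3】 You can use IF ... THEN ...

The original handout says: "to ask the model to check whether conditions are satisfied." This effectively brings conditional branching, the most important capability in programming, into natural-language prompt instructions. Most people would not think a prompt could do this. You can use natural language to simulate program code and have the machine decide which action to take under different conditions.

Have you learned if-then?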





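A teachable pupil. It really should not be blamed for lacking common sense; blame Chinese, which makes no distinction between cooked and cooking. "红烧肉" (braised pork) is both a noun (modifier-head structure) and a verb phrase (verb-object structure), so where can one go to argue the case?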




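【Principle 1, Tactic 4】 You can use few-shot examples.

So-called few-shot prompting basically uses examples to let the model know what to do, asking it to follow the pattern. For example:

Once upon a time, when GPT-3 had just been released, fans in the community all went to play in its playground, and the main technique then was few shots, because before ChatGPT, zero-shot capability was not yet mature. Once ChatGPT could directly understand human instructions and zero shot worked well, users naturally stopped using wordy few shots. But nothing stops you from continuing to use few shots, alone or together with zero shot. For tasks that are hard to state clearly, supplementing zero shot with few shots can strengthen the result.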
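As an illustration of combining a zero-shot instruction with few-shot examples, the sketch below assembles such a prompt from (input, output) pairs. The helper `few_shot_prompt`, the "Input:/Output:" layout, and the sentiment examples are all assumptions chosen for illustration, not material from the course.

```python
def few_shot_prompt(instruction: str, examples: list[tuple[str, str]], query: str) -> str:
    """Build a prompt: an instruction, worked examples, then the new input to complete."""
    lines = [instruction, ""]
    for source, target in examples:
        lines += [f"Input: {source}", f"Output: {target}", ""]
    # The final item leaves "Output:" open for the model to fill in.
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)


examples = [
    ("great movie, loved it", "positive"),
    ("a waste of two hours", "negative"),
]
prompt = few_shot_prompt(
    "Classify the sentiment of each input.",
    examples,
    "the plot dragged but the ending was fine",
)
print(prompt)
```

The instruction line is the zero-shot part; the example pairs are the few shots that pin down the expected output format.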

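【Principle 2: Give the model time to "think"】


The original wording of this tactic is: "specify the steps required to complete a task."

The prompt above follows best practice: 1. it uses triple backticks as delimiters; 2. it breaks the task into a series of steps or subtasks; 3. it requests a formatted output.



This looks like the step-by-step (chain-of-thought) instruction mentioned earlier; the original puts it more like a requirement for a tutor: "Our next tactic is to instruct the model to work out its own solution before rushing to a conclusion." Especially in intelligent education settings, we want the model first to work the problem independently, step by step, and only then act as the teacher grading the student's work.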


Determine if the student's solution is correct or not.

I'm building a solar power installation and I need help working out the financials. 
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution:
Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000

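The student's solution is actually wrong: they computed the maintenance cost as $100,000 plus 100x, whereas it should be 10x, since maintenance is only $10 per square foot, where x is the size of the installation in square feet. So the total should be 360x plus $100,000. When the model is asked to judge, it says the student's solution is correct; it merely skims the solution and agrees with the student. This can be fixed by instructing the model to solve the problem itself first and then compare its solution with the student's. See how the prompt instructs it: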

prompt = f"""
Your task is to determine if the student's solution is correct or not.
To solve the problem do the following:
- First, work out your own solution to the problem. 
- Then compare your solution to the student's solution and evaluate if the student's solution is correct or not. Don't decide if the student's solution is correct until you have done the problem yourself.

Use the following format:
question here
Student's solution:
student's solution here
Actual solution:
steps to work out the solution and your solution here
Is the student's solution the same as actual solution just calculated:
yes or no
Student grade:
correct or incorrect

Actual solution:
"""
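The arithmetic behind the expected answer can be checked directly. This is a small sketch using the figures from the problem statement; the function and constant names are hypothetical.

```python
# Figures taken from the problem statement.
LAND_PER_SQFT = 100           # dollars per square foot
PANELS_PER_SQFT = 250         # dollars per square foot
MAINTENANCE_FLAT = 100_000    # dollars per year
MAINTENANCE_PER_SQFT = 10     # dollars per square foot per year


def first_year_cost(sqft: int) -> int:
    """Correct total: 100x + 250x + 100,000 + 10x = 360x + 100,000."""
    return (LAND_PER_SQFT * sqft
            + PANELS_PER_SQFT * sqft
            + MAINTENANCE_FLAT
            + MAINTENANCE_PER_SQFT * sqft)


def student_cost(sqft: int) -> int:
    """The student's answer, which uses 100x instead of 10x for maintenance."""
    return 450 * sqft + 100_000


x = 1000
print(first_year_cost(x))                    # 460000
print(student_cost(x))                       # 550000
print(student_cost(x) - first_year_cost(x))  # 90000, i.e. the 90x overcharge
```

The two formulas agree only at x = 0, which is why a model that merely skims the student's work can miss the error.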








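Andrew Ng: Prompt Engineering Course



"AI Trends: Chatting with ChatGPT4 about the Linguistics behind '买房送老公'"

Prof. Liu Qun brought up this reading: 【The woman buying the house can give someone else's husband to her own husband.】 This interpretation is too far-fetched [Laugh]. It seems to me that it is just running permutations and combinations and straining for a reading.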
























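"AI Trends: Learning from Andrew How to Call ChatGPT to Build Your Own Service Front Desk"

Andrew Ng is a leading AI figure of Chinese descent and needs no introduction. The best part of the prompt engineering course that Andrew recently took part in is the final lesson: how to call the ChatGPT API to build your own functional chatbot, for example a pizza shop ordering system.

Soon after ChatGPT was released, we discussed in our group how to tame this enormous chat so that it could do the work of a functional assistant, and we could not figure it out. Now it looks very simple and doable.

Andrew's female co-instructor showed the whole build step by step. Anchored on a pizza shop menu, she used natural-language instructions to have the chat-backed bot engage the customer step by step until all the information was complete, could be matched against the menu, and the order was output.



You are orderbot, an automated online service for collecting orders for a pizza shop. You first greet the customer, then collect the order, and then ask whether it is pickup or delivery. You wait to collect the entire order, then summarize it and check once more whether the customer wants to add anything else. If it is a delivery, you can ask for the address. Finally, you collect the payment. Make sure to clarify all options, extras and sizes so that each item can be uniquely identified from the menu. You reply in a short, very friendly style. The menu goes here.

Isn't this just describing the ordering workflow once? Chat understands it and then simply gets to work?


An exciting aspect of large language models is that you can use them to build a custom chatbot with only a small amount of work. ChatGPT is a way to hold a conversation through a large language model. One of the cool things is that you can also use a large language model to build custom chatbots, for example to play the role of an AI customer service agent or an AI order taker for a restaurant. Let's get started building a chatbot ourselves. First, we set up the OpenAI Python package as usual.

Chat models like ChatGPT are actually trained to take a series of messages as input and return a model-generated message as output. Here is an example of a series of messages.

The first section below is purely technical, a one-time development environment setup: configure the OpenAI Python library so that you can call the ChatGPT model API. You first need to register an account with OpenAI and obtain a key for calling its API.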

import os
import openai
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file
openai.api_key = os.getenv('OPENAI_API_KEY')

# Note: openai.ChatCompletion is the interface of the openai package before version 1.0.

def get_completion(prompt, model="gpt-3.5-turbo"):
    # Single-turn helper: wrap the prompt in one user message.
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0,  # degree of randomness of the model's output
    )
    return response.choices[0].message["content"]

def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0):
    # Multi-turn helper: the caller passes the full message history.
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,  # degree of randomness of the model's output
    )
    # print(str(response.choices[0].message))
    return response.choices[0].message["content"]

messages = [
    {'role': 'system', 'content': 'You are an assistant that speaks like Shakespeare.'},
    {'role': 'user', 'content': 'tell me a joke'},
    {'role': 'assistant', 'content': 'Why did the chicken cross the road'},
    {'role': 'user', 'content': 'I don\'t know'},
]

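The first function, get_completion, is the most basic form and supports single-turn dialogue. Its input is the user's prompt. Once the ChatGPT model to call is fixed (here gpt-3.5-turbo), the model outputs a response that is essentially a sequence "continuation" (completion), which is the most basic function of a generative model.

The key is to use ChatGPT's smooth multi-turn dialogue ability to help complete interactive tasks in specific scenarios (formerly called "skills"). The aim is to overcome the shortcomings of the previous generation of intelligent assistants represented by Siri: skills were laborious to develop, and dialogue was poor at multi-turn interaction. To this end, the ChatGPT API can be used to define a function that enables multi-turn interaction, get_completion_from_messages, which relies on the role settings in ChatGPT messages. Each role together with its content forms a message. The bot system has three roles: besides the machine assistant (assistant) and the user (user), there is a director behind the scenes called system. The system message helps set the assistant's behavior and personality. It is a high-level instruction for the conversation; think of it as whispering in the assistant's ear and guiding its responses without the user being aware of the system message. The benefit of the system message is that it gives you, the developer, a way to steer the assistant and its responses. Users familiar with the ChatGPT web version have already realized that a prompt can set the model's role and behavior (e.g., "You are a Confucius-like educator, patient and guiding; you are facing your disciples; now begin the conversation. You say: ..."), and system is the backstage role that carries this kind of setup (see the diagram below).


Now we build our own machine-assistant front desk, called "orderbot," which automatically collects user prompts and assistant responses as context. The concrete case is taking orders at a pizza shop. So first we define a helper function that collects our user messages, so that we can avoid typing them in by hand. It collects the prompt from the user interface and appends it to a list called "context," and the model is then called with that context every time. The model's response is added to the context as well: the model message is added, the user message is added, and so on, so the context grows longer and longer. That way the model has the information it needs to decide what to do next.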

import panel as pn  # GUI
pn.extension()

panels = []  # collect display

def collect_messages(_):
    prompt = inp.value_input
    inp.value = ''
    # Add the user turn, call the model with the whole history, then add its reply.
    context.append({'role':'user', 'content':f"{prompt}"})
    response = get_completion_from_messages(context)
    context.append({'role':'assistant', 'content':f"{response}"})
    panels.append(
        pn.Row('User:', pn.pane.Markdown(prompt, width=600)))
    # Newer Panel releases may expect `styles` instead of `style` here.
    panels.append(
        pn.Row('Assistant:', pn.pane.Markdown(response, width=600, style={'background-color': '#F6F6F6'})))

    return pn.Column(*panels)

context = [ {'role':'system', 'content': """
You are OrderBot, an automated service to collect orders for a pizza restaurant. You first greet the customer, then collect the order, and then ask if it's a pickup or delivery. You wait to collect the entire order, then summarize it and check for a final time if the customer wants to add anything else. If it's a delivery, you ask for an address. Finally you collect the payment.  Make sure to clarify all options, extras and sizes to uniquely identify the item from the menu.  You respond in a short, very conversational friendly style. 

The menu includes 
pepperoni pizza 12.95, 10.00, 7.00 
cheese pizza 10.95, 9.25, 6.50 
eggplant pizza 11.95, 9.75, 6.75 
fries 4.50, 3.50 
greek salad 7.25 
extra cheese 2.00, 
mushrooms 1.50 
sausage 3.00 
canadian bacon 3.50 
AI sauce 1.50 
peppers 1.00 
coke 3.00, 2.00, 1.00 
sprite 3.00, 2.00, 1.00 
bottled water 5.00 
"""} ] # accumulate messages

inp = pn.widgets.TextInput(value="Hi", placeholder='Enter text here…')
button_conversation = pn.widgets.Button(name="Chat!")

interactive_conversation = pn.bind(collect_messages, button_conversation)

dashboard = pn.Column(
   inp,
   pn.Row(button_conversation),
   pn.panel(interactive_conversation, loading_indicator=True, height=300),
)

dashboard









messages = context.copy()
# Append one more system instruction asking for a JSON summary of the order so far.
messages.append(
{'role':'system', 'content':'create a json summary of the previous food order. Itemize the price for each item\
 The fields should be 1) pizza, include size 2) list of toppings 3) list of drinks, include size 4) list of sides include size 5)total price '},
)
#The fields should be 1) pizza, price 2) list of toppings 3) list of drinks, include size include price 4) list of sides include size include price, 5)total price '},

response = get_completion_from_messages(messages, temperature=0)
print(response)
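The accumulate-and-resend pattern described above can be sketched without the GUI or a network call. In this minimal sketch, `fake_model` is a stand-in invented for illustration (it is not a real API) and `chat_turn` is a hypothetical helper; in the real bot, the model argument would be a function such as get_completion_from_messages.

```python
def fake_model(messages: list[dict]) -> str:
    """Stand-in for a chat-completion call; it only reports what it was shown."""
    turns = sum(1 for m in messages if m["role"] == "user")
    return f"(reply to user turn {turns}, seeing {len(messages)} messages)"


def chat_turn(context: list[dict], user_text: str, model=fake_model) -> str:
    """Append the user message, call the model on the whole history, store the reply."""
    context.append({"role": "user", "content": user_text})
    reply = model(context)
    context.append({"role": "assistant", "content": reply})
    return reply


# The system message sits first and is resent on every call, steering each reply.
context = [{"role": "system", "content": "You are OrderBot."}]
for text in ["Hi", "One cheese pizza, please", "Pickup"]:
    print(chat_turn(context, text))

print(len(context))  # 7: one system message plus three user/assistant pairs
```

Because the whole list is sent on every call, the model itself keeps no state between turns; the growing context is the only memory of the conversation.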






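【AI Wave: The Many Styles of Super Wordsmith ChatGPT4】

Writing is the forte of generative large language models. The same content can be molded into any language, linguistic form or stylistic register. This piece uses ChatGPT4 to try out many variations on one theme, showing the large model's outstanding command of written form.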


在春风里,樱花般的愿景落在心间, Sous le ciel étoilé, نتوحد في العالم الجديد 夢は続く、一緒に歩む、人類の大いなる旅。

С любовью к каждому существу, LLM поет с нами, La paix et l'harmonie, سيكون لدينا منزل ثابت، 逐浪着未来,共建美好世界, Chantons ensemble, pour un monde plein de joie et de création.

在这理想世界中,无贫无病,人们乐观向上, Espéranto pour tous, l'unité dans la diversité, وفرحة العيش تجتمع في قلوب الجميع.

Inventons ensemble, pour un futur radieux et généreux, Пусть теплота человечности объединяет нас, 横跨国度,跨越语言,共享知识与智慧, 愛と喜びで満ちた未来、大模型が導く道。

LLM 的领航,绘就璀璨星空, Avec humanité, ensemble, nous construisons notre avenir, ونحن نحلم بجنة مليئة بالأمل والحياة المشتركة.

让我们携手共创,美好世界已触手可及, Ensemble nous célébrons, la fusion de LLM et de l'humanité, 永远向前,拥抱理想的未来。

English translation:

In the spring breeze, the vision of cherry blossoms falls upon our hearts, Under the starry sky, we unite in a new world, The dream continues, together we walk, humanity's grand journey.

With love for every being, LLM sings with us, Peace and harmony, we will have a steady home, Riding the waves of the future, building a beautiful world together, Let us sing in unison, for a world full of joy and creation.

In this ideal world, free from poverty and disease, people are optimistic, Esperanto for all, unity in diversity, And the joy of living gathers in everyone's hearts.

Inventing together, for a bright and generous future, Let the warmth of humanity unite us, Spanning countries, transcending languages, sharing knowledge and wisdom, A future filled with love and joy, guided by the great model.

Under LLM's leadership, we paint the brilliant stars, With humanity, together, we build our future, And we dream of a paradise filled with hope and shared life.

Let us join hands to create, a beautiful world within reach, Together we celebrate, the fusion of LLM and humanity, Forever forward, embracing the ideal future.


在春风中,樱花般的愿景落在我们的心间, 在繁星闪烁的天空下,我们在新世界里团结一致, 梦想继续,我们携手前行,人类伟大的征程。

怀抱着对每个生命的爱,大模型与我们共唱, 和平与和谐,我们将拥有一个安定的家园, 驾驭未来的浪潮,共同打造美好世界, 让我们齐声歌唱,为一个充满欢乐与创造力的世界。

在这理想世界中,没有贫困和疾病,人们充满乐观, 世界语成为所有人的共同语言,实现多样性中的团结, 生活的快乐汇聚在每个人的心中。

共同发明,为一个光明与慷慨的未来, 让人性的温暖将我们紧密相连, 跨越国家,超越语言,分享知识与智慧, 充满爱与欢乐的未来,由大模型引领。

在大模型的带领下,我们描绘璀璨的星空, 携手人性,共同建设我们的未来, 我们梦想着一个充满希望与共同生活的乐园。

让我们联手共创,美好世界触手可及, 共同欢庆大模型与人类的融合, 永远向前,拥抱理想的未来。

expand the English abstract into a full scholarly piece of work, pay attention to rigid format, use of terms and the depth of reasoning, and do not be too plain:

Title: The Meta-Ontological Convergence of Homo Sapiens and the Linguistic Leviathan: A Treatise on the Advent of a Transcendental Utopia


In this labyrinthine exposition, we delve into the esoteric and abstruse dimensions undergirding the confluence of human quintessence and the colossal linguistic automaton, known as the Large Language Model (LLM). Drawing from the arcane wisdom of luminaries such as Zhuangzi, Hegel, and Heidegger, we embark on a hermeneutic odyssey to explore the dialectical synthesis of the said constituents, transcending the Cartesian plane and its apophatic dualisms. Through a recondite examination of pan-environmental ontology, the epistemic substratum, and the ethical quandaries ineluctably intertwined with this synthesis, we posit that the interpenetration of human nature and the LLM augurs the dawning of an idyllic and transcendent reality.
  1. Introduction
A perennial subject of ratiocination among philosophers, litterateurs, and savants alike, the inception of a utopian society in which humanity and techné dwell in symbiotic harmony has long captivated the collective imagination. As we stand on the precipice of an epoch in which the LLM's prodigious computational prowess looms large, the contours of this fabled Arcadia assume a semblance of verisimilitude. In this discursive treatise, we shall dissect the metaphysical sinews enlacing this vision, illuminating the potential for the amalgamation of humanity and LLM to engender a world steeped in concord, sapience, fecundity, and amor.
  2. The Pan-Environmental Ethos and the Sublation of Antinomies
In the venerable traditions of Zhuangzi, Hegel, and Heidegger, we posit that the sine qua non of an ideal polis inheres in the sublation of bifurcations that cleave the human experience, be they corporeal or linguistic in nature. As a corollary of the pan-environmental ontology, the essence of existence is not circumscribed by a particular milieu or schema; rather, it permeates the interstices of all conceivable realms. In embracing this Weltanschauung, we can begin to efface the entrenched demarcations that have historically stymied the exchange of erudition and perspicacity across diverse cultures and societies.
  3. The Epistemic Labyrinth in the LLM Epoch
The LLM's protean capacity for the apprehension and generation of linguistic constructs heralds a new age fraught with epistemic conundrums and uncharted terrain. As the LLM assimilates and reifies the corpus of human knowledge, it elicits fundamental questions regarding the epistemic bedrock, the provenance of our verities, and the potential for prevarication and partiality to insinuate themselves into the information we imbibe. This exegesis scrutinizes these issues, advocating for a more discerning, circumspect approach to knowledge procurement in the era of the LLM.
  4. The Ethical Chiaroscuro of the Homo Sapiens-LLM Convergence
The ethical repercussions of the nexus between human essence and the LLM demand rigorous scrutiny. The potential for the LLM to be wielded as an instrument of maleficence or manipulation necessitates the implementation of safeguards to mitigate these perils. Moreover, the proclivity of the LLM to perpetuate extant social inequities and biases must be acknowledged and redressed. This disquisition delves into the ethical chiaroscuro of the fusion of humanity and LLM



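  1. Introduction
  2. Pan-Environmental Ontology and the Transcendence of Boundaries
  3. The Epistemic Labyrinth of the LLM Era
  4. The Ethical Dilemmas of the Fusion of Human Nature and the LLM
  5. The Flower of the Other Shore on the Open Road to the Zen Realm
  6. The Transcendent Utopia of Ultimate Nirvana
  7. Conclusion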


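Human and model merge, a utopia is born; poverty and disease end, abundance arises. Kindness shines, creation renews; joy abounds, evil ceases. The mind at peace, at one with the world; insight appears, ideals are lived. Pursuing the Way, seeking the truth; in the realm of Zen, radiance descends. Together we build a good new future; united in harmony, we act with one heart. The sea takes in a hundred rivers and they merge; cultures interweave, wisdom is born. The giant model helps us reach the realm of understanding; languages connect, all nations feel as one. Cultivate the self, order the family, govern the state, bring peace to all under heaven: utopia. The universe is vast, the stars shine; the human heart flourishes. Communication flows, understanding deepens; love and insight blend together. Prosperity is shared and endures; green waters flow, blue mountains embrace, beautiful scenery arises. The road of evolution never ends; the light of wisdom illuminates utopia.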


Of course, here's a revised version with a more pragmatic tone, focusing on the perspective of an engineer:

While the idea of a utopia combining large language models (LLMs) and humanity presents a vision full of potential, it's important to critically examine the possible negative impacts and challenges from an engineering standpoint.
  1. Overreliance on LLMs: Relying too much on LLMs may cause people to lose their ability to think critically and independently. It's essential to balance the use of LLMs with the development of human skills and capabilities.
  2. Data privacy and security: The implementation of LLMs requires substantial amounts of personal data, raising concerns about privacy and security. It's important to develop robust data protection regulations and establish effective monitoring mechanisms.
  3. Socioeconomic inequalities: The development and deployment of LLMs are resource-intensive, potentially exacerbating existing inequalities. Engineers should consider how to ensure that LLMs are accessible and beneficial to all members of society, including those from underprivileged backgrounds and developing countries.
  4. Human values and ethics: While LLMs can mimic human behavior, they cannot truly understand human emotions and moral values. Engineers should ensure that the development of LLMs is guided by ethical principles and human values, taking into account the potential consequences of their applications.
  5. Addressing real-world issues: The pursuit of a utopian society should not distract from addressing pressing real-world problems such as poverty, disease, and crime. Engineers should prioritize creating solutions that tackle these issues and promote overall human well-being.
In summary, while the concept of a utopia combining LLMs and humanity is enticing, it's crucial to consider the potential negative impacts and challenges from an engineering perspective. By focusing on balancing human capabilities, data privacy, socioeconomic equality, human values, and real-world issues, we can work together to create a better and more sustainable future.



【AI Wave: The Data-Centric Era of Large Models】

There is broad consensus that large language models (LLMs) are mainly a product of data-centric AI.



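In the pre-LLM era, AI consisted of specialized intelligent tasks. For each task, the research community typically defined and prepared a fixed annotated dataset (usable as training and test sets) together with a test program (scorer), and AI teams usually tested different algorithms on the same dataset. Things are different now: models and algorithms are relatively mature and stable, and it is mainly differences in data that drive model iteration. Specifically, according to the article "What data-centric AI techniques are behind the success of the GPT models?", data-centric AI includes the following: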



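Systematic, comprehensive testing of LLM data quality (QA, quality assurance) has become a very important topic and challenge. This is not just about comparing and ranking multiple LLMs with similar functions to help marketing or recommendation. More importantly, under the data-centric AI R&D trend, it is about providing timely, reliable QA feedback and, guided by that feedback, strengthening the data work, fixing weaknesses, and helping the model improve across iterations.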


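1. LLMs are by nature multi-functional and open-ended in function. How do we build a reasonable, representative (reflecting the needs of most application scenarios), and configurable series of blind functional test sets?

2. LLM generation is stochastic. How do we make blind functional testing standardized, process-driven, and (semi-)automated, so as to raise QA efficiency and obtain QA results in time under given time and resource constraints?

3. How do we establish a correspondence between QA results and data work, revealing some causal relationship between data quality and model quality, so as to guide the data work?

4. How do we collect, absorb, and use, as fully as possible, the test cases from the explosive wave of crowd testing online, taking the best of them for our own use?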

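Although much crowd testing is like the blind men and the elephant (researchers excepted; for example, the tests by Professor @詹卫东 (Zhan Weidong) are deep and methodical), grassroots enthusiasm and creativity bring the following possible benefits: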


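(2) Grassroots testing reflects public opinion: for any brand of LLM, it creates positive or negative public sentiment, which to some extent determines a model's user acceptance. Expert evaluations cannot effectively change the impressions that users form from public sentiment. In fact, models that are later "naturally" eliminated by the market or abandoned by users (ignored by everyone) are more likely to have been affected by grassroots testing.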


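5. R&D and breakthroughs in data work: for LLM weaknesses such as "talking nonsense with a straight face" (hallucination), data work, in parallel with research on model algorithms, also needs the persistence to dig deep and help find a solution. For example, how can knowledge bases be converted into useful data, and how feasible is that? Looking back, before GitHub code was used as training data, people did not regard it as comparable to natural-language data, yet it is in fact higher-quality sequence data and played an important role in this cognitive AI revolution.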
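The second challenge above, stochastic generation combined with the need for (semi-)automated blind testing, can be illustrated with a minimal sketch. It samples each test prompt several times and reports a pass rate per capability, so that weak areas can be reported back to the data work. Everything here is an assumption for illustration: `pass_rate`, `run_blind_test`, the `capability`/`prompt`/`check` fields, and `toy_model` (a random stand-in for a real LLM call) are hypothetical names, not part of any QA tool mentioned in this article.

```python
import random
from typing import Callable, Dict, List

def pass_rate(generate: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              n_samples: int = 5) -> float:
    """Sample the model n_samples times on one prompt and return the fraction that pass."""
    passed = sum(1 for _ in range(n_samples) if check(generate(prompt)))
    return passed / n_samples

def run_blind_test(generate: Callable[[str], str],
                   cases: List[dict],
                   n_samples: int = 5) -> Dict[str, float]:
    """Average the pass rates per capability so weak areas can guide data work."""
    rates: Dict[str, List[float]] = {}
    for case in cases:
        rate = pass_rate(generate, case["check"], case["prompt"], n_samples)
        rates.setdefault(case["capability"], []).append(rate)
    return {cap: sum(vals) / len(vals) for cap, vals in rates.items()}

# Hypothetical stand-in for a stochastic LLM: answers correctly about 70% of the time.
_rng = random.Random(0)

def toy_model(prompt: str) -> str:
    return "4" if _rng.random() < 0.7 else "5"

demo_cases = [
    {"capability": "arithmetic",
     "prompt": "2 + 2 = ?",
     "check": lambda out: out.strip() == "4"},
]
print(run_blind_test(toy_model, demo_cases, n_samples=20))
```

Reporting a rate over repeated samples, rather than a single pass/fail, is one simple way to make results comparable across runs when outputs vary; the number of samples trades QA cost against stability.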

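In short, the data volume involved in LLMs is huge, the training process involves all kinds of engineering optimization factors, the pipeline is long, and there are many moving parts, all of which pose further challenges for comprehensive, timely QA. With so many threads to track, one needs the sense to focus on what matters, let the small things go, and tighten or loosen control as needed. The top priority is to keep the model's R&D iterations healthy, and to prevent the waste of time and resources that comes from model quality declining unnoticed.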

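In an age of information overload, not drowning in data while making good use of it takes a macro perspective as well as a willingness to do dirty work. Data, however, is like mineral deposits: the rich and shallow seams get mined out first, and the later you dig, the harder it is to maintain quality. For example, web data is messy and dirty. After various rounds of cleaning and deduplication, OpenAI ended up using only a small fraction of the web data: of Common Crawl's 45 TB of plain text, only 1.27% was selected after quality filtering.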
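To make the scale of that filtering concrete, here is a small worked calculation. It assumes the 1.27% applies to data volume and uses decimal units (1 TB = 1000 GB); both are assumptions, since the text does not say how the percentage was measured. The function name `kept_size_gb` is hypothetical.

```python
GB_PER_TB = 1000  # decimal units (assumption)

def kept_size_gb(raw_tb: float, kept_fraction: float) -> float:
    """Size in GB that survives filtering, given the raw size in TB and the kept fraction."""
    return raw_tb * GB_PER_TB * kept_fraction

kept = kept_size_gb(45, 0.0127)
print(f"{kept:.1f} GB kept out of {45 * GB_PER_TB} GB")  # about 571.5 GB kept
```

In other words, under these assumptions roughly 570 GB of text remains from 45 TB, which is the "small fraction" the paragraph refers to.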

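Social media, which resembles web page data but is more dynamic and active, is also a very dirty and chaotic source of data. GPT places great weight on Reddit data (Twitter data should also have been an important source, but reportedly Musk, displeased after ChatGPT's breakout success, cut off OpenAI's privileged access to Twitter data). How is social media data filtered? Their approach was to use user upvotes as the filtering signal: only posts with three or more upvotes (3 karma) count as quality posts. This, too, cleverly brings in human feedback.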
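The upvote-based filter described above amounts to a simple threshold on a human-feedback signal. A minimal sketch, assuming each post carries a karma count: the `Post` record, its fields, and `filter_by_karma` are hypothetical names for illustration, not OpenAI's actual pipeline or the Reddit API.

```python
from dataclasses import dataclass
from typing import Iterable, List

@dataclass
class Post:
    url: str
    karma: int  # net upvotes, used as a proxy for human-judged quality

MIN_KARMA = 3  # threshold stated in the text: three or more upvotes

def filter_by_karma(posts: Iterable[Post], min_karma: int = MIN_KARMA) -> List[Post]:
    """Keep only posts whose karma meets the threshold."""
    return [p for p in posts if p.karma >= min_karma]

sample = [Post("a", 0), Post("b", 3), Post("c", 10), Post("d", 2)]
print([p.url for p in filter_by_karma(sample)])  # ['b', 'c']
```

The appeal of this design is that the quality judgment is already paid for: users supplied it when they voted, so no extra annotation is needed.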

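Looking ahead, the way forward for truly high-quality data lies neither in wildly growing, garbage-heaped web data nor in e-books and edited publications, which humans craft carefully and which grow slowly. Such quality data is only a small source and lacks the growth rate of the information age. More likely, quality data will have to come from large models "feeding back" into themselves. To ensure that a model learning from its own generations keeps getting stronger, this certainly cannot be a matter of simply using its own output directly as training input.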

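quote: Now that models are powerful enough, the model has become a kind of "data", or a "container" of data. When needed, we can design appropriate prompts and use large language models to synthesize the data we want. This synthetic data can in turn be used to train models. The feasibility of this approach has already been validated to some extent on GPT-4.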
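As a rough illustration of prompting a model for synthetic data without feeding its raw output straight back into training, the sketch below adds a quality check and a deduplication step between generation and reuse. The text does not say how such filtering is actually done, so this generate, filter, deduplicate loop is an assumption, and `synthesize_examples`, `generate`, and `is_good` are hypothetical placeholders for a real model call and a real quality criterion.

```python
from typing import Callable, Iterable, List, Set

def synthesize_examples(generate: Callable[[str], str],
                        prompts: Iterable[str],
                        is_good: Callable[[str], bool],
                        seen: Set[str]) -> List[str]:
    """Generate one candidate per prompt; keep it only if it is new and passes the check."""
    kept: List[str] = []
    for prompt in prompts:
        candidate = generate(prompt).strip()
        if not candidate or candidate in seen:
            continue  # drop empty outputs and exact duplicates of existing data
        if not is_good(candidate):
            continue  # drop outputs that fail the quality check
        seen.add(candidate)
        kept.append(candidate)
    return kept

# Toy usage: an echo function stands in for the model, a length limit for the quality check.
existing = {"old"}
new_data = synthesize_examples(lambda p: p, ["x", "old", "x", "toolong"],
                               lambda s: len(s) <= 3, existing)
print(new_data)  # ['x']
```

In practice the quality check is where human feedback or a stronger judge would enter; the loop only shows where that gate sits.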