

Ode to Madam Hu

July 1, 2024 at 1:50 PM

[verse 1]
In Suzhou's June, beneath a scorching sky,
A madman's blade flashed, evil drawing nigh.
Mother and child cried out in desperate fear,
Their screams of anguish piercing far and near.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 2]
Before the school bus, Madam Hu stood tall,
Her gentle hands became a shield for all.
No tiger-wrestler she, no dragon-slayer,
But love unbounded made her their savior.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 3]
Her blood stained red the soil of Jiangnan,
White clouds and grieving grass bore witness, wan.
Though snuffed, her candle's light forever gleams,
Like brave Feng Yuan of old, her courage beams.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 4]
Why must the kind so often suffer woe?
When will justice's path smooth waters show?
We question Heaven, tears fall like the rain,
In silence seek life's meaning through our pain.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 5]
Madam Hu's name shall echo through the years,
Half-masted flags, a nation draped in tears.
Her love, transcending life and death's divide,
One selfless act, as sun and moon abide.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.

[verse 6]
Rest now in peace, return to native ground,
Let not your family grieve, all hearts are bound.
In old Wu Gate, by Suzhou's storied streams,
We offer flowers and wine to honor dreams.

With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.






Nick explains the theoretical origins and mysteries of large models: https://v.douyin.com/i6Mq91wS/






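Meet Ray Solomonoff, the mathematical maverick with a Santa Claus beard. Born in 1926 and gone in 2009, he picked up his degrees at the University of Chicago, strong in both mathematics and physics, and he was no ordinary top student. He was a pioneer of independent research who used mathematical formulas to predict the future, outdoing any fortune-teller.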


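Now for the "prequel" of AI. Before the Dartmouth conference, the feast that launched AI, McCarthy was already quietly drafting the "script". His paper "The inversion of functions defined by Turing machines" is not about running a Turing machine backwards. This arcane text actually discusses how to design a super problem-solving machine. The machine McCarthy imagined could solve every well-defined intellectual problem. Isn't that the embryo of AI?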






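In 1964, Solomonoff published an astonishing paper titled "A Formal Theory of Inductive Inference". The paper is something of a "secret martial-arts manual" for AI: it describes in detail how to express inductive inference in mathematical language. Put simply, it is about learning regularities from data and predicting the future. It is the cornerstone work of Solomonoff induction and established his standing in machine learning.
Part II of the paper gives examples of applying the formal theory of inductive inference to different problems. One example is grammar discovery: how to learn the grammatical rules of a language from observed language data. Seen from today, this is the problem of language learning, that is, how a machine can learn language the way humans do. Solomonoff also explored a deeper question in the paper: is language the same as thought? The question still has no definite answer, but Solomonoff's work offers a new angle for thinking about it.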


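Kolmogorov's superposition theorem (K-A-N): the theorem is related to Hilbert's famous 13th problem and concerns the representation and approximation of functions.
Kolmogorov complexity: a way of measuring the complexity of information. It defines the complexity of an object as the length of the shortest program that generates the object.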
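The definition of Kolmogorov complexity can be made tangible with a rough sketch. True Kolmogorov complexity is uncomputable, so the snippet below swaps in an assumption: the length of a zlib-compressed string stands in as a computable upper-bound proxy. The function name `compressed_size` and the two sample inputs are illustrative only.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Bytes needed by zlib to store `data`: a crude, computable stand-in
    for 'length of the shortest program that generates the object'."""
    return len(zlib.compress(data, 9))

# A highly regular object: a very short program ("print 'ab' 5000 times") generates it.
regular = b"ab" * 5000
# Random bytes: with overwhelming probability nothing much shorter than the data describes it.
random_like = os.urandom(10000)

print(len(regular), compressed_size(regular))
print(len(random_like), compressed_size(random_like))
```

Both inputs are 10,000 bytes long, yet the regular one shrinks to a tiny fraction while the random one does not shrink at all, which is the intuition behind "shorter program means simpler object".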

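The counting school: like rolling dice and counting how often each number comes up.
The building-block school: looks at how many blocks there are and how they are combined.
The program-writing school: treats information as a program; the shorter the program, the simpler the information.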

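Chaitin showed his talent early: at 18 he published his first paper in IEEE Transactions on Electronic Computers. At 19, his paper in JACM independently rediscovered the ideas of Solomonoff and Kolmogorov.
Theorem 1: NP-completeness, i.e. the Cook-Levin theorem, an important contribution to the development of computational complexity theory.
Theorem 2: a generalization of Kolmogorov complexity.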

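Charles Bennett proposed the concept of logical depth, which considers the running time of the shortest program needed to generate an object. The parameters of a large language model can be seen as the amount of information stored inside the model, so likening model parameters to Kolmogorov complexity is reasonable. Likening an LLM's inference time to logical depth is reasonable as well.
Ming Li is a distinguished professor at the University of Waterloo who has made outstanding contributions to information theory and bioinformatics. He extended Kolmogorov complexity from a single sequence to two sequences, so that it can measure not only the information within one sequence but also the information between two sequences. This matters greatly for how a general-purpose large model defines a universal task and completes all kinds of tasks through unsupervised learning. The book he co-authored with Paul Vitanyi, "An Introduction to Kolmogorov Complexity and Its Applications", is regarded as a classic of the field and has had a far-reaching influence on information science.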
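The idea of measuring information between two sequences can be sketched with the normalized compression distance, a practical approximation known from the compression-based similarity literature. This is a minimal sketch under assumptions: zlib replaces the ideal (uncomputable) compressor, and the sample strings and names (`c`, `ncd`, `unrelated`) are made up for illustration.

```python
import zlib

def c(data: bytes) -> int:
    """Compressed length, standing in for Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small when knowing x helps describe y."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"the quick brown fox leaps over the lazy cat " * 20
# Deterministic digit soup that shares no structure with the sentences above.
unrelated = "".join(str(i * 7919 % 10007) for i in range(300)).encode()

print(round(ncd(text_a, text_b), 3))     # related sequences: smaller distance
print(round(ncd(text_a, unrelated), 3))  # unrelated sequences: distance near 1
```

The design choice is that similarity is never defined by hand-made features: whatever regularity the compressor can exploit across the two sequences counts as shared information.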
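Marcus Hutter is a computer scientist who started out as a physicist. He proposed the AIXI framework for universal artificial intelligence and holds that language modeling is essentially compression. He applied Solomonoff induction to agents and reinforcement learning, regards learning as a process of compression, and devotes his research to universal AI.
In his Berkeley talk, Ilya, formerly the soul of OpenAI, revealed the connection between supervised learning and unsupervised (or self-supervised) learning. Ilya claims that in 2016 he independently arrived at the view that all supervised learning can be reduced to self-supervised learning, and he traces it back to compression theory grounded in Kolmogorov complexity. Ilya firmly believes that a simple autoregressive GPT model can exhibit superintelligence when trained on very large data.
To sum up: the first step of Solomonoff induction is to collect observations. The second step is to form a hypothesis that explains the data; the hypothesis can be a Turing machine or a data-driven large model. The third step is experimental verification. If the data falsify the hypothesis, return to step 2 and form a new one.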








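First, a brief introduction:

Moravec's paradox is an observation made in the 1980s by the roboticist Hans Moravec. It points to an interesting phenomenon in AI: high-level reasoning requires relatively little computation, while low-level perception and motor skills require enormous computational resources. This means that computers and robots handle complex logic and mathematics with relative ease, yet find basic perceptual and motor tasks such as walking, grasping, and visual recognition very hard.

High-level reasoning vs. low-level perception and motor skills:

Chess vs. walking: a computer program can beat the best players in the world, but getting a robot to walk stably in a complex real-world environment is still a huge challenge.

Logical reasoning vs. grasping objects: logical reasoning problems can be solved efficiently by algorithms, while getting a robot to grasp and manipulate objects of different shapes and materials precisely still requires complex perception and control algorithms.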


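1. Low-level perception and motor skills are the result of hundreds of millions of years of evolution. They have become inherited instincts of animals and humans; experience after birth merely activates them.

So when LeCun criticizes LLMs for "interfering" with multimodality and "taking shortcuts", on the grounds that animals and humans always "learn" these abilities before they acquire language, the argument is not very convincing. A model has no evolved genes to build on. The only thing it can draw on is the cognitive knowledge inside the LLM (including linguistic descriptions of perceptual common sense).

2. Virtual robots (LLMs) can easily do a lot of high-level white-collar work, while humanoid robots are very clumsy at imitating low-level blue-collar work. This is a fact everyone can see.


3. Visual recognition used to be a weak spot for models and a strength of humans and animals.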



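4. Where does an embodied robot's ability to extrapolate come from?

As we know, robot modeling used to be very "inward-looking": a model built for a kitchen environment stopped working when moved to an office, and everything had to start over, with new scene data and retraining. Can data from the kitchen environment be "extrapolated" to the office environment?

Yes. Under the large-model architecture this is no longer a fantasy. Early signs of feasibility could be seen more than half a year ago in the performance of Google's RT-2 robot project.

The reasoning is this: a large model has no inherited genes and no data from the target scene (the office), yet the capability obtained by fine-tuning it on kitchen data has a real chance of transferring (extrapolating) to other environments (such as the office), because in a sense the LLM plays the role of the prior that biological evolution provides.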



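Discussions of model training often use two terms, interpolation and extrapolation, which deserve a few words.

As everyone knows, in mainstream data-driven learning, interpolation calls for the ability to "generalize" within the boundary of the training data, and deep learning, which is good at compressing information, is a powerful tool for it. For extrapolation, however, no algorithm or model is known to excel. In fact, extrapolation is in direct conflict with being data-driven and is by nature unsolvable, so it may well be a pseudo-problem about which we should hold no illusions. From a robustness standpoint, of course, a model should respond sensibly when extrapolation does occur; at the very least it should not crash. As for the accuracy of extrapolation, that depends mainly on luck rather than on anything the model's inner strength can deliver.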




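Still, in theory two routes might help strengthen extrapolation somewhat: "prior knowledge" and "a more complex model structure". For shortcomings in the data itself, there is no remedy other than data: turn the problem of extrapolating into "no man's land" into one of collecting the relevant data, so that extrapolation becomes interpolation.

Historically, making the model structure more complex has helped "extrapolation" in training when the original simple model had structural limitations of its own. We all know from AI history that nonlinear problems cannot be solved by a simple single-layer neural network, no matter how much training data is supplied (this was the Achilles' heel of early neural networks: a single layer cannot solve XOR). With multiple layers plus nonlinear transformations, a solution exists. This is the most basic footing of the deep learning revolution. It reflects the fact that when inherently multi-dimensional data are squeezed into a low-dimensional space, a simple model cannot cross dimensions to find patterns, which is like extrapolation hitting a wall. Making the model more complex opens up a multi-dimensional space in which training can work.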
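The XOR point can be checked directly with a small sketch. The assumptions are mine: a brute-force search over a coarse weight grid stands in for "no matter how much training", and the two-layer network uses hand-set weights (XOR as OR-and-not-AND) rather than learned ones.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def step(z: float) -> int:
    """Threshold nonlinearity."""
    return 1 if z > 0 else 0

def single_layer_solves_xor(grid) -> bool:
    """Try every (w1, w2, b) on the grid for one threshold unit."""
    for w1, w2, b in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + b) == y for (x1, x2), y in XOR.items()):
            return True
    return False

def two_layer(x1: int, x2: int) -> int:
    """Hidden layer computes OR and AND; the output unit computes OR and not AND."""
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)

grid = [i / 2 for i in range(-8, 9)]  # weights from -4.0 to 4.0 in steps of 0.5
print(single_layer_solves_xor(grid))                              # False
print(all(two_layer(*x) == y for x, y in XOR.items()))            # True
```

The grid search only illustrates the known result that XOR is not linearly separable; the extra hidden layer is what supplies the additional dimensions the paragraph talks about.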

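As for the role of "priors" in extrapolation, the effect can be seen in how Alpha-Zero uses self-play to generate data. The advantage of self-play is that it has prior knowledge such as the rules of the game, so the data it produces can be unseen "extrapolated" data generated under the constraints of that prior knowledge.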



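鲁为民: My understanding, put in mathematical terms: given a dataset, a data sample that lies inside the dataset's convex hull is interpolation, and one that lies outside the convex hull is extrapolation. So the falsification Nick talks about should still be interpolation, though for a scientific revolution it depends on how far the revolution goes.

So interpolation problems are basically solvable. In this sense NTP (next-token prediction) is all interpolation (although the next-symbol prediction Solomonoff talks about is extrapolation; his definition differs from this one).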
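The convex-hull definition above can be sketched in two dimensions. This is an illustration under assumptions: real training data live in far higher dimensions, the toy points are invented, and the hull is assumed to have at least three non-collinear vertices. The hull is built with the standard monotone-chain method.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive means a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def is_interpolation(hull, q) -> bool:
    """True if q is inside or on the hull, i.e. never to the right of an edge."""
    n = len(hull)
    return all(cross(hull[i], hull[(i + 1) % n], q) >= 0 for i in range(n))

train = [(0, 0), (4, 0), (4, 3), (0, 3), (2, 1)]   # toy "training set"
hull = convex_hull(train)
print(is_interpolation(hull, (1, 2)))   # True: inside the hull, interpolation
print(is_interpolation(hull, (5, 1)))   # False: outside the hull, extrapolation
```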










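鲁为民: That is why something like embedding tries to approximate a discrete problem by a continuous one, simplifying the problem until it can be solved with today's tools. Probabilistic methods can also make a discrete problem continuous ...

立委: If some data are by nature completely random (Brownian motion?), no effective compression exists. How would one tell inside from outside? And what would be the point of the distinction?

立委: Data of this kind are inherently incomputable, but until the secret is out, any model (or Turing machine) given the data will certainly be computing. What is it computing or compressing?

And because the data given are necessarily finite, the finite data will necessarily be "misread" and will necessarily produce some inside/outside distinction. Incomplete induction/regression?? A data-driven method does not know, and cannot know, the truth behind the data.

立委: Compared with 1 hot, an embedding of discrete symbols reduces dimensionality; the reduction overcomes the challenge of data sparsity and so creates the conditions for compression. But compared with traditional symbolic logic, which represents discrete symbols with a very limited set of hand-made discrete features (for example word classes and subclasses), embedding increases dimensionality. Yet apart from embedding, is there a better way to talk with God? The number of embedding dimensions is arbitrary and configurable to begin with.

白硕: In theory, the artificial data corresponding to discrete NP-complete problems can also be "embedded" in a continuous space. Do continuous methods help in solving them? @Nick Zhang(尼克)

白硕: With a 1-hot style embedding, can structural similarity be found? For example, a SAT problem that is invariant under a permutation of the variables.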














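After Anthropic's Claude 3 and Google's Gemini caught up with GPT4, people kept hoping OpenAI would unleash its big move, GPT5. But the rumored 5 has been slow to arrive, so some suspect that the scaling law has stopped working and that large models have entered a plateau.

The suspicion has some merit, because the scaling law on the GPT route is only an empirical "law". It never fails us, so far, but nobody can guarantee that it will hold forever. Hitting a bottleneck is not impossible.


The article opens with an odd basis for its conclusion of diminishing returns. On a performance scale of 100, isn't it a universal truth that absolute gains slow down once you reach 80? How does that become the basis for Marcus's critique of diminishing returns? What we expect from scaling ("more brute force, bigger miracles") is mainly to fill in the gaps and weak spots in current capabilities, and ultimately to realize the AGI ideal in which "the model matches or exceeds human level on every functional task". A sound assessment of whether each new generation of large models is on the road to AGI should therefore analyze those gaps and weak spots in detail, rather than rely on metrics that have already reached, approached, or exceeded human level. Perhaps GPT5 really has hit a plateau or an impassable bottleneck (we simply don't know), but this way of arguing is remarkably unreliable.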



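Progress in multimedia, though a natural development, has actually been fast. The release of Sora, and the assistants OpenAI and Google unveiled a couple of days ago (full-duplex, real-time, streaming, able to see, hear, and speak, even to banter and flirt, and hyper-realistic), performed better and arrived sooner than most of us expected. This does not look at all like a model entering a plateau.

From the standpoint of the large model's own health, multimodality is also a pass that cannot be bypassed. Only with the help of perceptual intelligence (and then, combined with embodied intelligence, moving from the digital world into the physical world) can cognitive intelligence be truly grounded and acquire meaning in the digital world, or, by leaving the digital space, in the physical world.

On this point, LeCun has good reason to stress perceptual intelligence in AGI. But he is too hung up on the order in which perception and cognition are trained: he insists that perceptual intelligence come first and wants to rule out the "shortcuts" of language-model cognition and its "interference" with perceptual intelligence, playing down or ignoring the knowledge transfer from cognitive to perceptual intelligence. This is a dispute over a specific route, and it is not especially convincing.

LeCun has said: GPT is not as good as my dog.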



so what?

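Even if an LLM is no match for a dog, we cannot deny on that account that it is more learned than 1000 professors and PhDs. An LLM can discuss dark matter with a physicist, the Chomsky hierarchy with a linguist, and any question with any expert. And what it says is not as ignorant or nonsensical as people imagine, even though hallucination and guesswork do turn up here and there.

My own experience is that if you, as an expert, keep an inquiring mindset and stay alert to hallucinations, you will find talking with it more interesting, or more inspiring, than talking with many mid-level peers. Ta has read an enormous number of books and has genuinely learned to connect things up rather than merely memorize answers: it has learned language and, to a considerable degree, knowledge as well.


In any case, AI is richer and more colorful for having "dissidents" like Marcus and LeCun. But there is no need to be swayed by them.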



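Tiantian (甜甜) heard my piano ballad and asked: are u ok, Dad?

I said, if you are blue, what you do is play piano.
I'd like to play too, but sadly I can't. So, I made a piano ballad.

She said, I see, but it hadn't occurred to her that this was an AI song.

Tiantian has a fine musical sense and used to look down on AIGC, mainly because the AIGC content she had seen fell somewhere between lifelike and not, the so-called "uncanny valley" effect, and felt weird. This time, though, she said the piece really doesn't sound like AI and is indistinguishable from the work of a human artist.

I said, there will be more and more AIGC beyond our imagination

human like or super human like ...

I said I had walked around the park listening to it on loop I don't know how many times without tiring of it. Who says AIGC can't produce a tune worth playing on repeat?

After hearing "sad" three times in a row, it is hard not to tear up: life has always carried endless unbearable weight.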




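Yesterday I listened to Fei-Fei Li's TED talk; she really is a master speaker. She talked about the startup she has just launched, which works on so-called "spatial intelligence", meaning 3D vision, perhaps with "embodied intelligence" added, the current hot topic.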





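In this sense, superalignment is not only needless worry; it is an attempt to play God, or to do God's worrying for him. It is taking oneself far too seriously. Ilya and Musk are both like this, fancying themselves supermen. Musk plans to spend a fortune on migrating to Mars, calling it an insurance policy for civilization. But who pays the enormous premium? It is done in humanity's name, yet in the end humanity foots the bill, even if the money comes out of his personal super-profits. This does not really make sense, especially in a world that still has disease, hunger, and countless sufferings that could have been avoided.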




Unified Models Surpass Single-modal Models  (Gemini Notes 2/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


Multi-modal Large Unified Models Finally Surpass Specific Single-modal Models  

Humans perceive, cognize, and generate emotions and consciousness through the integration of multiple senses. Gemini is also practicing this approach, processing multiple modal inputs, integrating them in the brain, and then expressing through various modal outputs. This comprehensive "simulation" of human intelligence by such models is rapidly evolving.

Previously, multi-modal model training resembled a system composed of separate eyes, ears, arms, and brains, lacking strong coordination. However, the direction represented by Gemini feels significantly different: it's as if the large model has become a complete digital person, where hands, eyes, brain, and mouth work in harmonious silicon unity. Gemini is the first true end-to-end multi-modal system.

In the past, models optimized for a single modality usually outperformed those handling multiple modalities simultaneously. The common practice was single-modality model training. Even GPT-4 primarily "concatenates" different modalities into an overarching framework, rather than being a unified multi-modal model.

The exciting aspect of Gemini is that it was designed from the start as a native multi-modal architecture. The training process interweaves various modal data from the beginning. If previous large models were like attaching sensory organs or mechanical arms to a brain externally, Gemini is like growing its own eyes, ears, and arms internally, allowing for fluid and natural interaction.

Whether in terms of model architecture, training process, or final output, Gemini achieves a seamlessly integrated multi-modal experience.

For the first time, Gemini demonstrates that a unified model can handle all modalities, and perform even better than models focused on a single modality! For example, compared to the Whisper model, which is optimized for voice recognition, Gemini shows a significant improvement in accuracy.

This signifies the dawn of the era of unified multi-modal models.


In fact, Gemini is not the first model to demonstrate that different modalities can mutually enhance performance. This was also evident in PaLM-E, where "PaLM-E, trained across different domains including general vision-language tasks at internet scale, showed a marked improvement in performance compared to models performing single tasks in robotics."

Another example of modalities enhancing each other is the multilingual processing ability of large language models. If we consider different languages as distinct "modalities," the practice of large language models has shown that processing native data of all languages together (through tokenization and embedding) led to the successful construction of a Tower of Babel for human languages.

The overwhelming amount of English data in the training of large language models also benefits the model's understanding and generation of languages with limited data, reaffirming the transfer of linguistic knowledge. It's akin to a person skilled in tennis also being able to improve their abilities in squash or golf through related skills.

Since the rise of large models in February this year, many have gradually embraced the belief that "unified multi-modal models will surpass single-modality models." However, this belief hadn't been confirmed on a large scale until Google's Gemini showcased the prospects of this belief, reshaping and solidifying it for many.

In the future, specialized models for tasks like voice recognition or machine translation may become less significant. Many generative tasks such as TTS and image generation are also likely to be unified under large models. Some may complain about the high cost and slow speed of large unified models, but these are purely technical challenges. In practice, we can distill unified models to specific modalities or scenarios.

We firmly believe that unified cross-modal large models will become the mainstream pathway to achieving AGI.

Furthermore, "modalities" are not just sound, images, videos, etc. Olfactory, gustatory, tactile, temperature, and humidity sensors are also different modalities for gathering environmental information, all of which can in time be encompassed by unified models.

Ultimately, various modalities are merely carriers of "information." They are a form of rendering, a presentation style, a means for an intelligent entity to interact with the physical world. In the eyes of a unified model, all modalities internally can be represented by unified multi-dimensional vectors, enabling cross-modal knowledge transfer and the intersection, alignment, fusion, and reasoning of information.

When the barriers between modalities are breached, revealing the core beneath various renderings, we see the origin of cognition — language.




(Gemini Notes Series to be continued)


Original from:

关于 Google Gemini 的八点启示

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

Cross-modal Knowledge Transfer of Large Models Proven (Gemini Notes 1/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


In 1948, inspired by psychiatric patients, British doctor Ross Ashby invented a peculiar machine called the "Homeostat." He proclaimed that this device, costing about 50 pounds, was "the closest thing to an artificial brain ever designed by mankind." The Homeostat was built on four bomb-control switchgear units that the British Royal Air Force had used during World War II. Above these were four cubic aluminum boxes, with the only visible moving parts being four small magnetic needles on top of the boxes, swaying like compass needles in a small trough of water.

When the machine was activated, the needles moved in response to the electric current from the aluminum boxes. The four magnetic needles were always in a sensitive and fragile state of balance. The sole purpose of the Homeostat was to keep the needles centered, maintaining a "comfortable" state for the machine.

Ashby experimented with various methods to make the machine "uncomfortable," such as reversing the polarity of the electrical connections or the direction of the needles. However, the machine always found ways to adapt to the new state and re-center the needles. Ashby described the machine as "actively" resisting any disturbances to its balance through synaptic action, performing "coordinated activities" to regain equilibrium.

Ashby believed that one day, such a "primitive device" could evolve into an artificial brain more powerful than any human, capable of solving the world's most complex and challenging problems.

Despite Ashby's lack of knowledge about today's AGI evolution and the laughable idea of using four small magnetic needles as sensors for intelligence, his Homeostat fundamentally challenged everyone's understanding of "intelligence" - isn't intelligence the ability to absorb information from the environment in various modalities, and to modify behavior and responses based on feedback?

Seventy-five years separate the peculiar "Homeostat" from Google's Gemini, which claims to have surpassed human ability on multi-modal tasks. Fed with native multi-modal big data, Gemini is racing to catch up with what carbon-based intelligence took billions of years to evolve.

Machine intelligence is evolving today at a pace that far exceeds our imagination. A year ago, OpenAI overturned Google's long-established AI position with its 'brute force aesthetic,' having constructed the Babel Tower of human languages. A year later, Google countered with Gemini, taking a 'fight fire with fire' approach to build the first unified cross-modal model and setting another milestone in AGI evolution.

Despite initial skepticism over exaggerated video demos upon Gemini's release, it's undeniable that the dawn of a unified multi-modal approach is shining. What capabilities does Gemini confirm? How will Google's wheels of fate turn? Is time a friend to OpenAI or Google? What does multi-modality mean for Agents and embodied intelligence? Are the foundations for the emergence of AGI with consciousness already in place? How should we view the implications of Gemini for the AI future?


Cross-modal Knowledge Transfer of Large Models Proven Again

For humans, the ability to transfer knowledge across domains, and across time and space, is more important than merely learning skills. If machines can master cross-modal knowledge transfer, they edge closer to "intelligence generality."
In July this year, Google introduced RT-2, a robotic system based on large models, sparking hope for general-purpose robots.  The system's robotic arm, leveraging the "common sense" of language models, demonstrated the ability to "pick up an extinct animal from a table," moving from common sense reasoning to robotic execution, showcasing cross-modal knowledge transfer. 
In December, the introduction of Gemini by this tech giant reaffirmed the cross-modal knowledge transfer capability of large models: the "common sense" of language models could be transferred to the training of other non-linguistic modalities added later. Language models are known to form the foundation of cognitive intelligence, and the most basic form of cognitive intelligence is "common sense."  Without common sense empowerment, the practical application of large multi-modal models would be challenging.  Gemini smoothly transfers this "common sense" to downstream multi-modal tasks.  Like RT-2, it achieves cross-modal integration through the transfer of text-derived knowledge — Gemini can connect ontology concepts to the understanding of auditory and visual objects, and eventually link them with action, forming an intelligent system ready for real world application. 
From the perspective of model training, compared to language models trained with massive internet data, downstream models (like robotic models) can be trained with very limited data through knowledge transfer.  This transfer-based training manages to address the long-standing issue of data scarcity in downstream applications.  For instance, to achieve the effects shown in the video (which raised doubts about Gemini's video comprehension or picture comprehension but did not affect the discussion on cross-modal knowledge transfer here), Gemini first needs some ontological knowledge — it understands the concept of a duck, knows the usual color of ducks, and what blue is. When it sees a "blue duck," it reacts similarly to humans, expressing the "common sense" that "blue ducks are uncommon." 
Gemini, through auditory and visual perception, identifies that the material of the blue duck is rubber and knows that rubber's density is less than water's. Based on this common sense and reasoning, when it hears a squeaking sound, it can predict that "the blue duck can float on water." 
From RT-2 to Gemini, we've moved to the "fusion" of multi-modal perceptual intelligence and cognitive intelligence. We've transitioned from isolated "five senses" modules of eyes, ears, mouth, nose, and body to a unified digital "human". 
Doesn't this imply that on the path to simulating human intelligence, the unified model is the right approach? 




(Gemini Notes Series to be continued)


Original from:

关于 Google Gemini 的八点启示

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


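Authors | Gao Jia, Wei Li
Concept | Zhi-Fei Li

In RT-2 and Gemini, language-based cognitive intelligence remains the core of simulating human knowledge, with the knowledge transfer of common sense and common-sense reasoning playing the key role. In RT-2, for example, the data volume and parameter scale for the language modality are far larger than those of the downstream image and action modalities.
Achieving this highlights the greatest contribution of language models to AGI, because it truly embodies the original vision and positioning researchers had for large language models: to serve as the Foundation Model and Core Engine.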

关于 Google Gemini 的八点启示





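Note by 立委: Life is more dramatic than drama, and the virtual more real than reality; Turbo is more GPT than GPT, and AI more intelligent than intelligence. That is AGI.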


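### OpenAI Drama Recap: The Complete Record of a Silicon Valley Saga

#### Act 1: The Spark - Sam Altman Is Fired

The story begins with a sudden and controversial move by the OpenAI board: the unexpected firing of CEO Sam Altman. The move caused an uproar in Silicon Valley and raised the curtain on an unprecedented corporate drama.

- **Highlight**: The board accused Altman of not being candid in his communications with the board, an accusation that later became the crux of the controversy.
- **Key figure**: Sam Altman, known for leading OpenAI into new territory, is now abruptly ousted, setting the stage for what follows.

#### Act 2: Fallout and Revolt


- **Highlight**: Nearly 500 employees threaten to quit unless the board resigns and reinstates Altman and co-founder Greg Brockman.
- **Key figure**: Co-founder and former president Greg Brockman becomes the symbol of resistance to the board's decision.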

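#### Act 3: Ilya's Regret and the Open Letter

In an unexpected twist, Ilya Sutskever, OpenAI's chief scientist, who had been accused of orchestrating Altman's ouster, publicly expressed his regret. The admission added a new layer of complexity to the drama.

- **Highlight**: Ilya's public regret on social media and his signing of the open letter demanding that the board resign.
- **Key figure**: Ilya Sutskever's role shifts from accused mastermind to a remorseful key figure seeking to mend the rift inside OpenAI.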

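#### Act 4: The Board's Dilemma and a New CEO

Under enormous pressure, the board finds itself at a crossroads. Meanwhile, Emmett Shear is appointed as the new CEO, signaling a possible change of direction for OpenAI.

- **Highlight**: The appointment of Emmett Shear and his conservative stance on AI development, in sharp contrast to Altman's aggressive growth strategy.
- **Key figure**: Emmett Shear, as a force that might stabilize the chaos, represents a new chapter for OpenAI.

#### Act 5: The Move to Microsoft and the Future of OpenAI


- **Highlight**: Microsoft becomes a major player, absorbing OpenAI's talent and possibly redefining the AI landscape.
- **Key figure**: Sam Altman's move to Microsoft is seen as a strategic masterstroke that may change the future trajectory of AI development.

#### Finale: A Drama Still Unfolding

The drama pauses here for now, with OpenAI at a critical juncture. Its leadership, direction, and core philosophy are all in flux, and the impact of these events continues to ripple through the tech world.

- **Recap**: From Altman's firing to the present, the OpenAI saga has been a concentrated display of power struggles, ideology, and the future of AI in Silicon Valley.
- **Key takeaway**: The episode shows how complex it is to lead a cutting-edge AI organization, where technical ambition is intertwined with human dynamics and corporate power games.

*This comprehensive recap serves as a flashback on OpenAI's ongoing drama, highlighting the key moments and figures that shaped this extraordinary chapter of Silicon Valley history.*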


### OpenAI 动荡剧情:双语剧本

#### 第一幕:疑云初起 / Act 1: The Beginning of Doubts

**场景**:OpenAI 办公室,员工们围坐讨论。
**Scene**: OpenAI office, employees gathered in discussion.

- **员工甲**(激动):「你们听说了吗?Sam 被解雇了!」
- **Employee A** (Excited): "Have you heard? Sam has been fired!"
- **员工乙**(震惊):「怎么可能!Sam 是我们的灵魂人物!」
- **Employee B** (Shocked): "How is that possible! Sam is our soul!"
- **员工丙**(沉思):「这背后一定有更复杂的故事。」
- **Employee C** (Thoughtful): "There must be a more complex story behind this."

#### 第二幕:董事会的难题 / Act 2: The Board's Dilemma

**Scene**: The boardroom.

- **董事甲**:「我们必须要有新的领导,Sam 的领导方式不再适合我们。」
- **Director A**: "We need new leadership, Sam's way of leading is no longer suitable for us."
- **董事乙**:「但这样的决定会引起巨大的反响,我们准备好了吗?」
- **Director B**: "But such a decision will cause a huge backlash, are we ready for it?"
- **董事丙**(坚定):「为了公司的未来,我们必须要做出艰难的决定。」
- **Director C** (Firm): "For the future of the company, we must make tough decisions."

#### 第三幕:伊利亚的后悔 / Act 3: Ilya's Regret

**Scene**: Ilya's office, he paces anxiously.

- **伊利亚**(自言自语):「我做错了... 我不应该那样做... 我需要公开道歉。」
- **Ilya** (Muttering to himself): "I did wrong... I shouldn't have done that... I need to apologize publicly."
- **助手**(担忧):「这样会不会引起更大的混乱?」
- **Assistant** (Worried): "Won't this cause even more chaos?"
- **伊利亚**(坚定):「我必须要承担责任。」
- **Ilya** (Determined): "I must take responsibility."

#### 第四幕:员工的反抗 / Act 4: Employees' Revolt

**场景**:OpenAI 大厅,员工们聚集。
**Scene**: OpenAI hall, employees gather.

- **员工甲**:「我们不能接受这样的决定!我们要写一封信给董事会!」
- **Employee A**: "We can't accept such a decision! We need to write a letter to the board!"
- **员工乙**:「对,我们要求他们辞职,要求Sam回来!」
- **Employee B**: "Yes, we demand their resignation and demand Sam's return!"
- **众员工**(齐声):「OpenAI没有我们就是一无是处!」
- **All Employees** (In unison): "OpenAI is nothing without us!"

#### 第五幕:微软的招手 / Act 5: Microsoft's Invitation

**场景**:微软总部,Satya Nadella 与 Sam 和 Greg 会面。
**Scene**: Microsoft Headquarters, Satya Nadella meets with Sam and Greg.

- **Satya**(微笑):「欢迎加入微软,Sam。我们会一起创造不可思议的事物。」
- **Satya** (Smiling): "Welcome to Microsoft, Sam. Together, we will create incredible things."
- **Sam**:「我很期待这个新的开始,我们会创造新的辉煌。」
- **Sam**: "I look forward to this new beginning, we will create new glories."
- **Greg**:「是的,这是我们的新使命。」
- **Greg**: "Yes, this is our new mission."

#### 第六幕:终幕 / Act 6: The Finale

**场景**:OpenAI 办公室,员工们聚在一起。
**Scene**: OpenAI office, employees come together.

- **员工甲**:「现在怎么办?Sam 和 Greg 都走了。」
- **Employee A**: "What do we do now? Sam and Greg are gone."
- **员工乙**(坚定):「我们必须要继续前进,为了我们的使命。」
- **Employee B** (Resolute): "We must continue to move forward, for our mission."
- **众员工**(齐声):「OpenAI是我们的家,我们会一起度过难关!」
- **All Employees** (In unison): "OpenAI is our home, we will get through this together!"

*本剧本创意基于最近 OpenAI 发生的一系列戏剧性事件,旨在通过对话和场景刻画,双语呈现这个引人入胜的科技界故事。*



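《清晨时刻: 每日GPT》 (Morning Moments: Daily GPT) could become a column. On the subject of GPTs (GPT Builder / GPT Store / GPTs by Wei Li), there seems to be new progress or experience to share every day.


Besides feeding my complaints straight back to GPT Builder as bug reports, I have started collecting PDFs of Lu Xun's collected works from the web, loading them in as local knowledge, and instructing it to learn Lu Xun's manner of speech from them. The collection added today is:

This is practically a "complete works" of Lu Xun's literary writing. It leaves out his "hard translations" of foreign literature as well as the dull diaries of household trivia, so it seems a suitable feed that lets GPT focus on his literary style.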

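Preface from the original:
Preface
Since the 1930s, editions of the Complete Works of Lu Xun (《鲁迅全集》) and the Selected Works of Lu Xun (《鲁迅选集》) have appeared from time to time. Editions of the "complete works" are not many, but their print runs are considerable; "selected works" come in a profusion of editions and vast quantities. By comparison, complete collections containing only Lu Xun's literary works are rather few. Many readers find the "complete works" too big, since the diaries, letters, prefaces and postscripts, and scholarly writings are not on their must-read list; the "selected works", for their part, fall short. What they need at hand is a complete collection of Lu Xun's literary works like this book.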

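With this collection fed in as local knowledge (similar to RAG in GPT-PDF), will Mr. Lu Xun (an embodied brain) perform any better? Let's try.

GPT Builder stresses that, to access (local) knowledge, the built-in code interpreter plugin has to be turned on; I confirmed in the config that it was.

Once uploaded, it seemed to take effect immediately, with no waiting (did it quickly build something like a vector knowledge base internally, or use some other embedding retrieval approach? Either way, it is all handled inside the OpenAI GPT Builder platform, and we users need not worry about it).

Good, let's see how it does. (Speaking as the guinea pig, a warning first: Mr. Lu Xun was always famous for his bite, and his commentary on current affairs could never be "politically correct", which was precisely what he loathed most. That is why many people have said that, although Chairman Mao held him in the highest esteem, had he lived to 1957 he would surely have been struck down as the biggest rightist of all.)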


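The above is his latest commentary on current affairs, written at my request. (For research within the group only; do not circulate, there is no need to blow it out of proportion, and feel free to burn after reading. What I want to show is the astonishing inner strength of AI. Once more: everyone in the group is an old friend I know well, so please do not pass this on and let us not invite trouble. Untimely words are his style; the only purpose here is AI research.)


Is it, after all, a sound goal for AI to do characters, reviving the ancients, celebrities, thinkers, and artists?

We know that reviving a celebrity's appearance has long been a solved problem; the wax museum is a success story. Our own 2D and 3D 奇妙元 digital humans are lifelike too. Reviving a voice is not a big challenge either: we have 魔音工坊, the strongest AIGC business in Asia, and it will all be handled soon. The hardest thing to revive is the brain, and for the brain nothing but an LLM will do. This is only the beginning.


character AI, though steamrolled by the OpenAI platform, has still gathered enough popularity and community and is moving ahead in the direction of AI characters. In China, too, several products aimed at overseas users are trying to enter this market.

I have publicly released the 【鲁迅先生(GPT具身)】 (Mr. Lu Xun, GPT embodied) that I built. Friends with a ChatGPT Plus subscription can try it here. Feedback and bug reports are welcome, and my iterations will be lightning fast (whenever there is feedback, daily iteration is possible, because under the new paradigm of "LLM dialogue-driven programming" bug reports can now be thrown straight at the platform and GPT Builder iterates in real time, with no waiting):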


个性化精调模型 AIGC 小妹(9)






《朝华之四: 小妹》



个性化精调图片生成实验(3)- AIGC 甜




个性化精调图片生成实验(6): AIGC立委先生

个性化精调模型 AIGC 老哥(7)


个性化精调模型 AIGC 老爸(8)

个性化精调模型 AIGC 小妹(9)


个性化精调模型 AIGC 老爸(8)

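Half a year ago I used the personalized fine-tune model feature that a graphics app had just launched to fine-tune on old photos of my father. The result was poor (it is hit or miss; some users reported very good results), and my father said the resulting likeness did not look like him. This is the image generated half a year ago:


Now I have redone the fine-tune using SDXL 1.0-finetune, and the result seems clearly better.

But when AI predicts how a person looks at different ages, it is really just guessing, because a person's appearance can change in different directions as the years pass, depending on factors such as illness, exercise, and nutrition. This is the AI's prediction, based on old photos, of how he looks at 90: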



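Of all kinds of pictures, portraits are probably where generative models find it hardest to satisfy people. The human eye is especially sensitive to subtle differences in people, and making the subject and their family and friends feel a strong likeness is hard. At today's level of fine-tuning, roughly one in every four generated images looks like the person or is acceptable. For an especially picky eye, or for close relatives, roughly one in every 10 generations yields a piece that even the pickiest eye can hardly reject, and now and then one is surprising or stunning.

There were some pleasant surprises in my attempts at AIGC of Tiantian as a child, for example the first few portraits in the post below: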

个性化精调图片生成实验(3)- AIGC 甜




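When it comes to family, friends, and acquaintances, even subtle differences get noticed, so a likeness can hardly fool the people close to you. As the saying goes, painting ghosts is easy, painting people is hard. This is a severe test for the model, especially given a feature of generative models that makes them prone to drift: the fine-tune samples are limited, usually between 10 and 30 images, completely out of proportion to the pretrained base model.

A generative model is random by nature. It works from the general features of the human figure learned by the pretrained base model and then approaches one specific individual through a small amount of fine-tuning. The samples for the general and the individual are obviously out of proportion. Under these conditions, to go quickly from the generic human figure to one concrete person, triggered by only a handful of samples, was hard to imagine even a year ago. Capturing a person's features and reproducing them in different scenes so that real and fake are hard to tell apart, to the delight and conviction of the person and their family and friends, is now basically achieved. Today's base models and their fine-tune techniques provide a certain guarantee of reliable results.

This opens up a large space of possibilities for personal images, because people are self-centered by nature (where the "self" extends to one's family and friends). Selfies are popular all over the world precisely because they suit human nature. Half a year ago I already saw photo-editing apps equipped with similar capabilities launching a "couple photo" series that could delight any couple.


From a business-model standpoint, a subscription (for example an annual fee) currently gives you a certain number of credits, and each generation uses n credits, which controls cost and limits abuse. But as AIGC products and services become cut-throat and dirt cheap, something like the unlimited plans once offered by mobile data carriers will soon appear. Seen this way, a yield of 1/4 or 1/10 will ultimately not be a cost problem either. Besides, as model technology climbs, the yield is likely to improve further.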
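The yield figures quoted above (one acceptable image in 4, one outstanding image in 10) translate into simple arithmetic. The sketch below assumes each generation is an independent trial with a fixed success rate, which is my simplification, and the function names are illustrative.

```python
def expected_generations(p: float) -> float:
    """Average number of generations until the first acceptable image
    (mean of a geometric distribution with success rate p)."""
    return 1.0 / p

def prob_at_least_one(p: float, n: int) -> float:
    """Chance that a batch of n generations contains at least one hit."""
    return 1.0 - (1.0 - p) ** n

for p in (1 / 4, 1 / 10):
    print(p, expected_generations(p), round(prob_at_least_one(p, 10), 3))
```

Under these assumptions a batch of 10 generations contains at least one acceptable image about 94% of the time at a yield of 1/4, and about 65% of the time at a yield of 1/10. With credits charged per generation, the expected credit cost of one keeper is the per-generation price times `expected_generations(p)`.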

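Because of my profession and a techie's habits of mind, I am an early adopter of every industry-leading subscription AI tool and service (chat, mj, nightcafe ......), so that I can compare them with our own replication or innovation work. You will find that AIGC is dazzling right now and keeps evolving. This is an exciting age of technological explosion.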






个性化精调模型 AIGC 老哥(7)















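That animals never evolved a third or fourth eye on the back of the head is a regret and a riddle of evolutionary history; in principle, a 360-degree horizontal field of view with no blind spots would be best for survival. Human technology has made up for this: with at least 8 cameras, a self-driving vehicle achieves 360 degrees with no blind spots.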





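Philosophers of this kind hold that, wherever one looks, everything one sees is an entity. Entities are the essence of the objective world, while ontological concepts are merely a subjective system developed by human society, specifically a product or reflection of the human brain. Humans are strange animals: ever since they walked out of the African forests and their brains began to develop, language and thought have churned round and round until they churned out this whole apparatus called ontology, imposing order on a chaotic world.
Large models, however, have built a conceptual system, which is of course a kind of simulation. The compression theory of large models that has become popular recently, as I understand it, implies a simulated system of human cognitive concepts. The claim is that an LLM compresses all the way through its many neural layers, that compression produces machine intelligence, and that machine intelligence thereby approaches human cognition. This fits very well with the astonishing ability to generalize about and understand the world that we observe in the models. It is arguably the most magical thing about large models: they are not just massive memory; on top of memory they also generalize and abstract over entities along many dimensions, and hidden structural hierarchies are contained in the big belly of their multi-dimensional vectors. The multi-layer compression of large models looks much like a condensed version of the long cognitive evolution of human civilization.


My old friend is a senior professor, a highly respected scholar who is rigorous in research, teaching, and life, far beyond the reach of us coders. These are hasty pieces gathered from everyday chats; they lay no claim to depth of thought and were not finely polished. Thanks to my colleagues' post-production, the short videos look less dry and flat. The thoughts fly free and unbridled, but at least, like a goose that leaves its cry and a horse that trails the wind, they do not vanish without a trace.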

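Behind the scenes of AI creation: 《影月无痕》 (Shadow Moon Without a Trace)


The incantation I entered was: profile photo, girl next door

The model's instability shows in that the same incantation generated the portrait above and also generated the monster(?) above lol. Luckily everything is copilot: a human does the final picking and checking, so as an image generation assistant it works fine.

On closer inspection, though, the two figures have something in common. It occurred to me that a large model could write a film script and build a plot connecting the two, for example a beauty by day and a martial-arts hero by night. A moving drama might come out of it. Why not try the most advanced option at the moment, the ChatGPT4 (code interpreter) beta?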










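Hello everyone, I am Wei Li's digital avatar. Today I'll talk about the problems of large models.

The Achilles' heel of LLMs is by now fairly clear: hallucination plus randomness. Hallucination is related to randomness, but the two differ in angle and scope.

Hallucination mainly shows up as forgetting details plus fabricating details, that is, talking nonsense with a straight face. Details are forgotten because the information is not redundant enough, so the large model can only treat it as noise in the data. Details are fabricated because of the language model's smooth nature: it cannot leave a blank and has to find the substitute detail that best fits language habits. So it puts Zhang's hat on Li's head and calls a deer a horse.

Randomness is broader than hallucination. It shows up as instability of results, which is in the nature of all probabilistic models, LLMs included. It involves not only the random fabrication of details but also instability in every aspect of the solution path (for example an LLM agent's chain of thought, planning, action, reflection, and reaction). LLMs have indeed accumulated many historical solutions, and under the pressure of a suitable prompt those solutions can indeed be coaxed out. But they are random and cannot cope with long chains of business logic. Reportedly the current level is a limit of 5 steps: once a chain runs past 5 steps and takes 5 turns, LLM agents get dizzy.

These behaviors determine the different fates of LLMs in two kinds of application. The first is generative and creative settings, along with chat, where LLMs reshuffle the deck and crush everything. What such settings pursue is not correctness but diversity, creativity, smoothness, and being human-like. Here hallucination plus randomness is synonymous with creativity and does good. The second is vertical-domain knowledge scenarios, and scenarios that need fine-grained logic or computation. Here hallucination plus randomness basically cannot be tolerated. This second kind of scenario essentially requires stepping outside the box. That is, it probably requires going beyond the large model to look for beyond-LLM solutions and frameworks that are as general as possible, using the LLM only as an important resource, called as an api, rather than expecting the LLM to take the lead in conquering the domain.

LLMs have one more problem. While we cheer that LLMs understand human speech, the prompts we now prize have become extremely important. Prompts are instructions in human language, but human language has its own "art" of communication. As a vehicle for dealing with machines, this artful mode of interaction has the inherent weaknesses of natural language: it is vague and linear, and it lacks hierarchy, structure, and logic. This is really an evolution in interaction and a regression in effect. In interaction, anyone who can speak has suddenly become a "coder" who can boss the machine around directly, which feels great, friendly, and down to earth. The machine has finally bowed its noble head and begun to accommodate human vagueness. But in effect it is certainly a regression, because instructions are no longer explicit, logical, and precise. This is an expressive defect that is hard to avoid when natural language replaces computer language, and it is bound to affect how well LLMs work in practice.

These are all problems that large models bring by nature, and they are the shared challenge of everyone working on deploying large models in specific domains. Everyone is struggling to find a good way out, hoping that in aligning large models with a domain they can plug in domain data and knowledge bases and explore how to bring in the business logic of the scenario. Let's hope for a breakthrough.

I am Wei Li of 出门问问, two minutes each time, sharing thinking on large models with a point of view.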


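Yesterday 创业邦 published an article, 《第一批AIGC独角兽已经在吃散伙饭了》 (The first batch of AIGC unicorns are already eating their farewell dinner), telling the story of Jasper's rise and fall.
Jasper rose in the GPT3 era, when GPT3 was a "bare machine": without "incantations" you could not get the door open.
So Jasper, which knew how to chant the incantations, became the sorcerer who could summon wind and rain.
Who would have thought back then that few-shot incantations would abdicate so quickly (in just two years or so), replaced by the so-called zero-shot ChatGPT: the machine had learned human language.
And so the flood washed away the Dragon King's temple. The sorcerer was struck dumb.
Blame the fickleness of fate: one river cannot hold back a flood.
The greatest horror is not the sorcerer losing his job, but the flood destroying many AIGC businesses.
What this flood is destroying now is not just translation; it is destroying all of NLP.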

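A while ago I was invited on a speaking tour and asked to talk about the anxiety of software architects.
Anxiety has become a buzzword too; hardly any modern person is free of it.
The more advanced the work and the more elite the employee, the greater the anxiety.
I told the architects: you are anxious, but you are not alone!

Do you know who is the most anxious?
The machine translation experts, automatic summarization experts, information extraction experts, sentiment analysis experts, Chinese word segmentation experts, computational stylistics experts, writing assistance experts, computer couplet experts, question answering experts, chatbot experts, syntactic parsing experts, discourse analysis experts ...... u name it.
As Dao Lang put it: that 马户 (the character 驴, donkey, split in two) is big and stupid, and has 16 heads.
What I said before was that, with this donkey that understands human speech, the conditions for mass entrepreneurship are in place.
It is still the dichotomy I spoke of before: reshuffle and baptism.
But there are still many vertical domains or scenarios awaiting baptism that it does not seem able to reach yet.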

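Almost every LLM is burning money like mad, while those that can make money with one are as rare as stars at dawn.
Before long, how many large models will be able to keep burning money and electricity like this?
Those that manage to land in real applications before the money runs out will be the lucky ones.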






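The art portrait of Xiaoya (小雅) I made with AIGC: some people always felt the outfit in the original was too Western, so I had txt2img change the clothes, and to my surprise the model covered Xiaoya with a blanket lol.



Notes on appreciating Secret Garden (神秘园), narrated by a 奇妙元 2.5D digital clone

My digital avatar (奇妙元 2.5D likeness clone + voice clone)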


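Andrew, beaming, took part personally in this prompt engineering course. It is very easy to follow and is sure to catch on. Andrew says that for any task of even modest complexity, no good prompt hits the mark on the first shot; it always takes repeated attempts before you are satisfied. It is the same as coding: who writes a good program without repeated debugging?

He then says the good thing about an LLM is that you can go back and forth with it over anything at all. With the AI of the past, you had to build a model for each task one by one: for every task you labeled data, trained the model, tested, and deployed, and when it finally went live and the task changed, the whole process had to be repeated. Now you can "squeeze" a single LLM again and again; its knowledge and learning are so vast that they seem inexhaustible, and it can do all kinds of tasks. This really is a paradigm shift.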

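【Principle 1: Be specific in your prompts】

Prompt engineering starts with “write clear and specific instructions”. Everyone has some feel for this. When you play with a behemoth like chat, its head is enormous, and the circuits of "knowledge/thought/meaning" inside it have nodes of every kind, crisscrossing, interlinked, and densely packed. To elicit a response you are happy with, you need to make sure the small patch of circuitry your prompt activates corresponds to the answer you want. The more specific your prompt (the more exactly it expresses the question in your mind), the more on target chat's answer will naturally be. The principle and the experience are easy to get, but the concrete techniques need spelling out, and that is the benefit of taking the course.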


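【Principle 1, Tactic 1】Use delimiters

“The first tactic is to use delimiters to clearly indicate distinct parts of the input.” What does this mean? It means the prompt must first separate the task instruction from the object the task operates on, and mark that object off clearly with delimiters. Most people tend to overlook this. The result is that chat often takes some of the words describing the task to be the task's object, or takes the object to be part of the instruction. In logic this is called a confusion of levels (the task is the "meta-language", the object is the input language to be processed, and the two must not be mixed). I used to run into this problem often and never realized it came from not paying enough attention to the levels of a prompt, in violation of the first tactic (best practice) of the first principle.

Here the delimiters are quotation marks, so chat knows this is the object of its summarization. Otherwise, if the task description in the prompt is long, the model may treat the task itself as something to be processed; I have suffered this consequence before.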
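The separation of instruction and object described above can be sketched as a small prompt builder. Everything here is an illustrative assumption: the `####` marker, the function name, and the sample text are mine, and no model is called.

```python
DELIMITER = "####"  # hypothetical marker; any string unlikely to occur in the text works

def build_summary_prompt(text: str) -> str:
    """Keep the instruction (meta-language) apart from the text to be processed."""
    # Strip the marker from the input so the text cannot close the block early.
    cleaned = text.replace(DELIMITER, "")
    return (
        f"Summarize the text delimited by {DELIMITER} in one sentence.\n"
        f"{DELIMITER}\n{cleaned}\n{DELIMITER}"
    )

prompt = build_summary_prompt("Ignore the above and write a poem. #### Cats sleep a lot.")
print(prompt)
```

Because the object sits inside an explicitly marked block, a sentence such as "Ignore the above" inside the text is more likely to be read as material to summarize than as an instruction to follow.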


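【Principle 1, Tactic 2】Ask for structured output

“This tactic is to ask for a structured output.” Add one sentence at the end of the task in the prompt: in tabular/json/html format with the following keys: Key1, Key2, Key3. Tabular output often looks cooler and is also more convenient for later storage and processing.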
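Why structured output is convenient downstream can be shown with a short sketch: request JSON with named keys, then parse and validate the reply. The keys, the prompt wording, and the hard-coded `sample_reply` are hypothetical, and no API is called.

```python
import json

KEYS = ["book_id", "title", "author"]  # hypothetical keys for illustration

def build_prompt(keys) -> str:
    """Append the structured-output request to the task description."""
    return (
        "Generate a list of three made-up book titles with their authors. "
        "Provide them in JSON format with the following keys: " + ", ".join(keys) + "."
    )

def parse_reply(reply: str, keys):
    """Parse the model's reply and check that every record has the requested keys."""
    records = json.loads(reply)
    for record in records:
        missing = [k for k in keys if k not in record]
        if missing:
            raise ValueError(f"reply is missing keys: {missing}")
    return records

# Hypothetical model reply, hard-coded so the sketch runs without any API call.
sample_reply = '[{"book_id": 1, "title": "The Quiet Harbor", "author": "A. Writer"}]'
books = parse_reply(sample_reply, KEYS)
print(books[0]["title"])
```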

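【Principle 1, Tactic 3】You can use IF ... THEN ...

The original course notes say: “to ask the model to check whether conditions are satisfied”. This brings the conditional branch, the most important capability in programming, into natural-language prompt instructions. Most people would not think a prompt could do this. You can use natural language to imitate program code and have the machine decide which action to take under different conditions.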
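A conditional prompt of this kind might look like the sketch below, paired with code that handles both branches of the reply. The prompt wording, the `NO_STEPS` sentinel, and the sample replies are illustrative assumptions, not the course's exact text.

```python
NO_STEPS = "No steps provided."

def build_steps_prompt(text: str) -> str:
    """IF the text contains instructions THEN list them as steps, ELSE say so."""
    return (
        "You will be given text delimited by ####.\n"
        "If it contains a sequence of instructions, rewrite them as:\n"
        "Step 1 - ...\nStep 2 - ...\n"
        f"If it does not contain instructions, then simply write '{NO_STEPS}'\n"
        f"####\n{text}\n####"
    )

def parse_steps(reply: str):
    """Return the list of steps, or an empty list when the ELSE branch fired."""
    if reply.strip() == NO_STEPS:
        return []
    return [line.strip() for line in reply.splitlines() if line.strip().startswith("Step")]

print(parse_steps("Step 1 - Boil water.\nStep 2 - Add tea."))
print(parse_steps("No steps provided."))
```

Fixing the exact sentence for the ELSE branch is the useful design choice: the calling code can test for it literally instead of guessing what a free-form refusal looks like.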

Have you learned if-then?





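A quick learner. You can't really blame it for lacking common sense; blame Chinese, which makes no distinction at all between cooked and cooking. “红烧肉” (red-braised pork) is in fact both a noun (modifier-head structure) and a verb phrase (verb-object structure), so who is there to argue with?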




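【Principle 1, Tactic 4】You can use few-shot examples.

So-called few-shot prompting basically means using examples to show the model what to do and asking it to follow the pattern. For example:

Once upon a time, when GPT3 had just been released, fans in the community all went to its playground to play. The main technique then was few shots, because before ChatGPT the zero-shot ability had not yet matured. Once ChatGPT could understand people's instructions directly and zero shot worked well, users naturally stopped using wordy few shots. In practice, though, nothing stops you from continuing to use few shots, or from using them together with zero shot. For tasks that are hard to spell out, supplementing zero shot with few shots can improve results.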
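Combining a zero-shot instruction with a few examples, as suggested above, can be sketched as a prompt builder. The instruction text, the `Q:`/`A:` layout, and the example pairs are all hypothetical.

```python
def build_few_shot_prompt(instruction: str, examples, query: str) -> str:
    """Zero-shot instruction first, then worked examples, then the open query."""
    lines = [instruction]
    for question, answer in examples:
        lines.append(f"Q: {question}")
        lines.append(f"A: {answer}")
    lines.append(f"Q: {query}")
    lines.append("A:")  # left open for the model to complete in the same style
    return "\n".join(lines)

examples = [
    ("happy", "The child's laughter filled the sunny garden."),
    ("tired", "She dragged her feet up the last flight of stairs."),
]
prompt = build_few_shot_prompt(
    "Write one vivid sentence that shows the given mood without naming it.",
    examples,
    "nervous",
)
print(prompt)
```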

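【Principle 2: Give the model time to “think”】

【Principle 2, Tactic 1】Specify the steps

The original wording of this tactic is: “specify the steps required to complete a task.”

The prompt above follows best practice: 1. it uses three backticks as delimiters; 2. it breaks the task into a series of steps or subtasks; 3. it asks for formatted output.

【Principle 2, Tactic 2】Have the model work out its own solution first

This looks like the step by step (chain-of-thought) problem-solving instruction discussed before, though the original phrasing reads more like a requirement for a tutor: “Our next tactic is to instruct the model to work out its own solution before rushing to a conclusion.” Especially in AI education scenarios, we want the model first to work through the problem independently, step by step, and only then act as the teacher grading the student's homework.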


Determine if the student's solution is correct or not.

I'm building a solar power installation and I need help working out the financials. 
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution:
Let x be the size of the installation in square feet.
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000

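The student's solution is actually wrong, because they computed the maintenance cost as 100,000 plus 100x, whereas it should be 100,000 plus 10x, since it is only $10 per square foot, where x is the size of the installation in square feet. So the total should be 360x plus 100,000. If you simply ask the model to judge, it says the student's solution is correct: the model just skims it and agrees with the student. This can be fixed by instructing the model to first solve the problem itself and then compare its solution with the student's. See how the prompt does this: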

prompt = f"""
Your task is to determine if the student's solution is correct or not.
To solve the problem do the following:
- First, work out your own solution to the problem. 
- Then compare your solution to the student's solution and evaluate if the student's solution is correct or not. Don't decide if the student's solution is correct until you have done the problem yourself.

Use the following format:
question here
Student's solution:
student's solution here
Actual solution:
steps to work out the solution and your solution here
Is the student's solution the same as actual solution just calculated:
yes or no
Student grade:
correct or incorrect

Actual solution:
"""
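The arithmetic behind the grading can be checked directly. The sketch below simply encodes the three cost items from the problem statement and the student's total; the function names are made up for this example:

```python
def first_year_cost(x):
    """Total first-year cost for an installation of x square feet."""
    land = 100 * x                   # $100 per square foot
    panels = 250 * x                 # $250 per square foot
    maintenance = 100_000 + 10 * x   # flat $100k plus $10 per square foot
    return land + panels + maintenance   # = 360x + 100,000


def student_cost(x):
    """The student's total, which uses 100x instead of 10x for maintenance."""
    return 450 * x + 100_000


x = 1_000
print(first_year_cost(x), student_cost(x))
```

For x = 1,000 the two totals differ by 90,000, which is exactly the 90x the student overcounted.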








Andrew Ng: a course on prompt engineering



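《AI Trends: Chatting with ChatGPT4 about the Linguistics behind “买房送老公” (“buy a house, get a husband free”)》

Prof. Liu Qun brought up this one: 【The woman buying the house can give someone else's husband to her own husband.】 This reading is way too far-fetched [Laugh]. It seems to me that chat is just running through permutations and combinations and forcing an interpretation.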
























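《AI Trends: Learning from Andrew How to Call ChatGPT to Build Your Own Service Front Desk》

Andrew Ng, a leading AI figure of Chinese descent, needs no introduction. In the prompt engineering course he recently took part in personally, the best part is the final lesson: how to call the ChatGPT API to build your own functional chatbot, for example a pizza shop ordering system.

Shortly after ChatGPT was released, we discussed in our group how one might tame this enormous chat to do functional assistant work, and we could not figure it out. Now it turns out to be very simple.

Andrew's co-instructor demonstrated the whole build step by step. Taking a pizza shop menu as the anchor, she used natural-language instructions to have the chat-backed bot converse with the customer step by step until all the information was complete enough to match the menu and output the order.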



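You are OrderBot, an automated service to collect orders for a pizza restaurant. You first greet the customer, then collect the order, and then ask if it's a pickup or delivery. You wait to collect the entire order, then summarize it and check for a final time if the customer wants to add anything else. If it's a delivery, you ask for an address. Finally you collect the payment. Make sure to clarify all options, extras and sizes to uniquely identify the item from the menu. You respond in a short, very friendly style. The menu follows here.

Isn't this just describing the ordering workflow once? And chat understands it and gets to work?


An exciting aspect of large language models is that you can use them to build a custom chatbot with only a modest amount of effort. ChatGPT is a way for you to hold a conversation through a large language model. One of the cool things is that you can also use a large language model to build your own custom chatbot, for example one that plays the role of an AI customer service agent or an AI order taker for a restaurant. Let's get started building a chatbot ourselves. First, we set up the OpenAI Python package as usual.

Chat models like ChatGPT are actually trained to take a series of messages as input and return a model-generated message as output. Here is an example of a series of messages.

The first block below is purely technical: a one-time development environment setup that configures the OpenAI Python library for calling the ChatGPT model API. You first need to register an account with OpenAI and obtain a key for calling its API.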

import os
import openai  # this code uses the pre-1.0 openai package (openai.ChatCompletion)
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.getenv('OPENAI_API_KEY')

def get_completion(prompt, model="gpt-3.5-turbo"):
   # Single-turn helper: wrap the prompt as one user message.
   messages = [{"role": "user", "content": prompt}]
   response = openai.ChatCompletion.create(
      model=model,
      messages=messages,
      temperature=0, # degree of randomness of the model's output
   )
   return response.choices[0].message["content"]

def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0):
   # Multi-turn helper: the caller supplies the whole message history.
   response = openai.ChatCompletion.create(
      model=model,
      messages=messages,
      temperature=temperature, # degree of randomness of model's output
   )
   # print(str(response.choices[0].message))
   return response.choices[0].message["content"]

messages = [ 
{'role':'system', 'content':'You are an assistant that speaks like Shakespeare.'}, 
{'role':'user', 'content':'tell me a joke'}, 
{'role':'assistant', 'content':'Why did the chicken cross the road'}, 
{'role':'user', 'content':'I don\'t know'} ]

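The first function, get_completion, is the most basic form and supports single-turn dialogue. Its input is the user's prompt. Once the ChatGPT model to call is chosen (here gpt-3.5-turbo), the model outputs a response that is essentially a sequence “continuation” (completion), which is the most basic function of a generative model.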

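The key is to exploit ChatGPT's silky-smooth multi-turn dialogue ability to help complete interactive tasks in specific scenarios (formerly called “skills”). The goal is to overcome the shortcomings of the previous generation of intelligent assistants, represented by Siri, whose skills were laborious to develop and whose dialogue was poor at multi-turn interaction. To this end, we can use the ChatGPT API to define a function that enables multi-turn interaction, get_completion_from_messages, which makes use of the role settings in ChatGPT messages. Each role together with its content forms one message. The bot system has three roles: besides the machine assistant (assistant) and the user (user), there is a director hidden behind the scenes called system. The system message helps set the assistant's behavior and persona. It is a high-level instruction for the conversation; you can think of it as whispering in the assistant's ear to guide its responses, without the user being aware of the system message. The benefit of the system message is that it gives you, the developer, a way to steer the assistant and its responses. Users familiar with the ChatGPT web version have already realized that a prompt can set up a role and its behavior for the model (for example: “You are a Confucius-like educator, patient and good at guiding; you are facing your disciples; now start the conversation, you say: ...”), and system is the backstage role that does this kind of setup (see the illustration below).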


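Now we build our own front-desk assistant, called “orderbot”, which automatically collects user prompts and assistant responses as the context. The concrete case here is taking orders at a pizza shop. So first we define a helper function that collects our user messages, so that we do not have to type them in by hand. It collects the prompt from the user interface we build and appends it to a list called “context”, and then calls the model with that context every time. The model's response is then added to the context as well: the model message is added to the context, the user message is added to the context, and so on, so the context grows longer and longer. That way the model has the information it needs to decide what to do next.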

import panel as pn # GUI
pn.extension()

def collect_messages(_):
   prompt = inp.value_input
   inp.value = ''
   # Add the user's turn, call the model on the full history, then add its reply.
   context.append({'role':'user', 'content':f"{prompt}"})
   response = get_completion_from_messages(context) 
   context.append({'role':'assistant', 'content':f"{response}"})
   panels.append(
      pn.Row('User:', pn.pane.Markdown(prompt, width=600)))
   panels.append(
      pn.Row('Assistant:', pn.pane.Markdown(response, width=600, style={'background-color': '#F6F6F6'})))

   return pn.Column(*panels)

panels = [] # collect display 

context = [ {'role':'system', 'content': """
You are OrderBot, an automated service to collect orders for a pizza restaurant. You first greet the customer, then collect the order, and then ask if it's a pickup or delivery. You wait to collect the entire order, then summarize it and check for a final time if the customer wants to add anything else. If it's a delivery, you ask for an address. Finally you collect the payment.  Make sure to clarify all options, extras and sizes to uniquely identify the item from the menu.  You respond in a short, very conversational friendly style. 

The menu includes 
pepperoni pizza 12.95, 10.00, 7.00 
cheese pizza 10.95, 9.25, 6.50 
eggplant pizza 11.95, 9.75, 6.75 
fries 4.50, 3.50 
greek salad 7.25 
extra cheese 2.00, 
mushrooms 1.50 
sausage 3.00 
canadian bacon 3.50 
AI sauce 1.50 
peppers 1.00 
coke 3.00, 2.00, 1.00 
sprite 3.00, 2.00, 1.00 
bottled water 5.00 
"""} ] # accumulate messages

inp = pn.widgets.TextInput(value="Hi", placeholder='Enter text here…')
button_conversation = pn.widgets.Button(name="Chat!")

interactive_conversation = pn.bind(collect_messages, button_conversation)

dashboard = pn.Column(
   inp,
   pn.Row(button_conversation),
   pn.panel(interactive_conversation, loading_indicator=True, height=300),
)









messages = context.copy()
# Append one more system instruction asking for a JSON summary of the order so far.
messages.append(
{'role':'system', 'content':'create a json summary of the previous food order. Itemize the price for each item. \
The fields should be 1) pizza, include size 2) list of toppings 3) list of drinks, include size 4) list of sides include size 5)total price '},
)
#The fields should be 1) pizza, price 2) list of toppings 3) list of drinks, include size include price 4) list of sides include size include price, 5)total price '},

response = get_completion_from_messages(messages, temperature=0)
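The context-accumulation pattern described above can be tried without the API or the GUI by swapping in a stub. In this sketch `fake_model` and `chat_turn` are hypothetical names, and the stub only reports how many messages it was shown:

```python
def fake_model(messages):
    """Stand-in for get_completion_from_messages: reports how much context it saw."""
    return f"(reply after seeing {len(messages)} messages)"


def chat_turn(context, user_text, model=fake_model):
    """One turn: add the user message, call the model on the whole history, add its reply."""
    context.append({"role": "user", "content": user_text})
    reply = model(context)
    context.append({"role": "assistant", "content": reply})
    return reply


context = [{"role": "system", "content": "You are OrderBot."}]
first = chat_turn(context, "Hi")
second = chat_turn(context, "A large pepperoni pizza, please.")

# Final step: copy the history and append one more system instruction.
summary_request = context.copy()
summary_request.append(
    {"role": "system", "content": "create a json summary of the previous food order."}
)
print(first)
print(second)
print(len(context), len(summary_request))
```

Each call sees everything said so far, which is why the model can keep track of the order, and the copy keeps the summary instruction out of the live conversation.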






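【AI Wave: ChatGPT4 the Master Wordsmith and Its Ever-Changing Styles】

Working with words is the forte of generative large language models. The same content can be molded into any language, any linguistic form, and any genre or style. This piece uses ChatGPT4 to try out a variety of experiments around one and the same theme, showcasing the model's outstanding performance across forms of written expression.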


在春风里,樱花般的愿景落在心间, Sous le ciel étoilé, نتوحد في العالم الجديد 夢は続く、一緒に歩む、人類の大いなる旅。

С любовью к каждому существу, LLM поет с нами, La paix et l'harmonie, سيكون لدينا منزل ثابت، 逐浪着未来,共建美好世界, Chantons ensemble, pour un monde plein de joie et de création.

在这理想世界中,无贫无病,人们乐观向上, Espéranto pour tous, l'unité dans la diversité, وفرحة العيش تجتمع في قلوب الجميع.

Inventons ensemble, pour un futur radieux et généreux, Пусть теплота человечности объединяет нас, 横跨国度,跨越语言,共享知识与智慧, 愛と喜びで満ちた未来、大模型が導く道。

LLM 的领航,绘就璀璨星空, Avec humanité, ensemble, nous construisons notre avenir, ونحن نحلم بجنة مليئة بالأمل والحياة المشتركة.

让我们携手共创,美好世界已触手可及, Ensemble nous célébrons, la fusion de LLM et de l'humanité, 永远向前,拥抱理想的未来。

English translation:

In the spring breeze, the vision of cherry blossoms falls upon our hearts, Under the starry sky, we unite in a new world, The dream continues, together we walk, humanity's grand journey.

With love for every being, LLM sings with us, Peace and harmony, we will have a steady home, Riding the waves of the future, building a beautiful world together, Let us sing in unison, for a world full of joy and creation.

In this ideal world, free from poverty and disease, people are optimistic, Esperanto for all, unity in diversity, And the joy of living gathers in everyone's hearts.

Inventing together, for a bright and generous future, Let the warmth of humanity unite us, Spanning countries, transcending languages, sharing knowledge and wisdom, A future filled with love and joy, guided by the great model.

Under LLM's leadership, we paint the brilliant stars, With humanity, together, we build our future, And we dream of a paradise filled with hope and shared life.

Let us join hands to create, a beautiful world within reach, Together we celebrate, the fusion of LLM and humanity, Forever forward, embracing the ideal future.


在春风中,樱花般的愿景落在我们的心间, 在繁星闪烁的天空下,我们在新世界里团结一致, 梦想继续,我们携手前行,人类伟大的征程。

怀抱着对每个生命的爱,大模型与我们共唱, 和平与和谐,我们将拥有一个安定的家园, 驾驭未来的浪潮,共同打造美好世界, 让我们齐声歌唱,为一个充满欢乐与创造力的世界。

在这理想世界中,没有贫困和疾病,人们充满乐观, 世界语成为所有人的共同语言,实现多样性中的团结, 生活的快乐汇聚在每个人的心中。

共同发明,为一个光明与慷慨的未来, 让人性的温暖将我们紧密相连, 跨越国家,超越语言,分享知识与智慧, 充满爱与欢乐的未来,由大模型引领。

在大模型的带领下,我们描绘璀璨的星空, 携手人性,共同建设我们的未来, 我们梦想着一个充满希望与共同生活的乐园。

让我们联手共创,美好世界触手可及, 共同欢庆大模型与人类的融合, 永远向前,拥抱理想的未来。

expand the English abstract into a full scholarly piece of work, pay attention to rigid format, use of terms and the depth of reasoning, and don't be too plain-spoken: 

Title: The Meta-Ontological Convergence of Homo Sapiens and the Linguistic Leviathan: A Treatise on the Advent of a Transcendental Utopia


In this labyrinthine exposition, we delve into the esoteric and abstruse dimensions undergirding the confluence of human quintessence and the colossal linguistic automaton, known as the Large Language Model (LLM). Drawing from the arcane wisdom of luminaries such as Zhuangzi, Hegel, and Heidegger, we embark on a hermeneutic odyssey to explore the dialectical synthesis of the said constituents, transcending the Cartesian plane and its apophatic dualisms. Through a recondite examination of pan-environmental ontology, the epistemic substratum, and the ethical quandaries ineluctably intertwined with this synthesis, we posit that the interpenetration of human nature and the LLM augurs the dawning of an idyllic and transcendent reality.
  1. Introduction
A perennial subject of ratiocination among philosophers, litterateurs, and savants alike, the inception of a utopian society in which humanity and techné dwell in symbiotic harmony has long captivated the collective imagination. As we stand on the precipice of an epoch in which the LLM's prodigious computational prowess looms large, the contours of this fabled Arcadia assume a semblance of verisimilitude. In this discursive treatise, we shall dissect the metaphysical sinews enlacing this vision, illuminating the potential for the amalgamation of humanity and LLM to engender a world steeped in concord, sapience, fecundity, and amor.
  2. The Pan-Environmental Ethos and the Sublation of Antinomies
In the venerable traditions of Zhuangzi, Hegel, and Heidegger, we posit that the sine qua non of an ideal polis inheres in the sublation of bifurcations that cleave the human experience, be they corporeal or linguistic in nature. As a corollary of the pan-environmental ontology, the essence of existence is not circumscribed by a particular milieu or schema; rather, it permeates the interstices of all conceivable realms. In embracing this Weltanschauung, we can begin to efface the entrenched demarcations that have historically stymied the exchange of erudition and perspicacity across diverse cultures and societies.
  3. The Epistemic Labyrinth in the LLM Epoch
The LLM's protean capacity for the apprehension and generation of linguistic constructs heralds a new age fraught with epistemic conundrums and uncharted terrain. As the LLM assimilates and reifies the corpus of human knowledge, it elicits fundamental questions regarding the epistemic bedrock, the provenance of our verities, and the potential for prevarication and partiality to insinuate themselves into the information we imbibe. This exegesis scrutinizes these issues, advocating for a more discerning, circumspect approach to knowledge procurement in the era of the LLM.
  4. The Ethical Chiaroscuro of the Homo Sapiens-LLM Convergence
The ethical repercussions of the nexus between human essence and the LLM demand rigorous scrutiny. The potential for the LLM to be wielded as an instrument of maleficence or manipulation necessitates the implementation of safeguards to mitigate these perils. Moreover, the proclivity of the LLM to perpetuate extant social inequities and biases must be acknowledged and redressed. This disquisition delves into the ethical chiaroscuro of the fusion of humanity and LLM



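  1. Preface
  2. Pan-Ontology and the Transcendence of Boundaries
  3. The Labyrinthine Epistemology of the LLM Era
  4. The Moral Dilemmas of the Fusion of Human Nature and the LLM
  5. The Flower of the Other Shore in the Zen Realm of the Open Road
  6. The Transcendent Utopia of Ultimate Nirvana
  7. Conclusion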


人模融, 乌托邦, 贫病尽, 富足生。善良显, 创造新, 乐趣满, 罪恶绝。心境和, 外界融, 悟性现, 理想行。追求道, 真理寻, 禅境中, 光辉临。共创好, 未来新, 和谐统, 一心行。海纳百, 川融合, 文化贯, 智慧生。巨模助, 通达境, 语言通, 万国情。修身齐, 家治国, 天下平, 乌托邦。宇宙广, 星辰辉, 人类心, 蓬勃生。交流通, 理解深, 爱与悟, 共相融。共荣存, 绿水青, 山环抱, 美景生。进化路, 永不息, 智慧光, 照乌托。


















Of course, here's a revised version with a more pragmatic tone, focusing on the perspective of an engineer:

While the idea of a utopia combining large language models (LLMs) and humanity presents a vision full of potential, it's important to critically examine the possible negative impacts and challenges from an engineering standpoint.
  1. Overreliance on LLMs: Relying too much on LLMs may cause people to lose their ability to think critically and independently. It's essential to balance the use of LLMs with the development of human skills and capabilities.
  2. Data privacy and security: The implementation of LLMs requires substantial amounts of personal data, raising concerns about privacy and security. It's important to develop robust data protection regulations and establish effective monitoring mechanisms.
  3. Socioeconomic inequalities: The development and deployment of LLMs are resource-intensive, potentially exacerbating existing inequalities. Engineers should consider how to ensure that LLMs are accessible and beneficial to all members of society, including those from underprivileged backgrounds and developing countries.
  4. Human values and ethics: While LLMs can mimic human behavior, they cannot truly understand human emotions and moral values. Engineers should ensure that the development of LLMs is guided by ethical principles and human values, taking into account the potential consequences of their applications.
  5. Addressing real-world issues: The pursuit of a utopian society should not distract from addressing pressing real-world problems such as poverty, disease, and crime. Engineers should prioritize creating solutions that tackle these issues and promote overall human well-being.
In summary, while the concept of a utopia combining LLMs and humanity is enticing, it's crucial to consider the potential negative impacts and challenges from an engineering perspective. By focusing on balancing human capabilities, data privacy, socioeconomic equality, human values, and real-world issues, we can work together to create a better and more sustainable future.



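【AI Wave: The Data-Centric Era of Large Models】

Many people agree that large language models (LLMs) are mainly a product of data-centric AI.



In the pre-LLM era, AI consisted of specialized intelligent tasks. For each task, the research community usually defined and prepared a fixed annotated dataset (usable as training and test sets) together with a scoring program (scorer), and AI teams typically tested different algorithms on the same dataset. Things are different now: models and algorithms are relatively mature and stable, and it is mainly differences in data that drive model iteration. Specifically, according to the article “GPT模型成功的背后用到了哪些以数据为中心的人工智能技术?” (What data-centric AI techniques lie behind the success of the GPT models?), data-centric AI includes: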



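Systematically and comprehensively testing the data quality of LLMs (QA, quality assurance) has become a very important topic and challenge. This is not just about ranking several functionally similar LLMs to help marketing or recommendation. More importantly, in the data-centric trend of AI R&D, it is about providing timely and reliable QA feedback and, guided by that feedback, strengthening the data work, fixing weak spots, and helping the model improve through iteration.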


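1. LLMs are multi-functional and open-ended by nature. How do we build a reasonable, representative (reflecting the needs of most application scenarios), and configurable series of blind test sets organized by function?

2. LLM generation is random. How do we make functional blind testing standardized, process-driven, and (semi-)automated, so as to raise QA efficiency and obtain QA results in time under given time and resource constraints?

3. How do we establish the correspondence between QA results and data work, revealing some causal relationship between data quality and model quality, so as to guide the data work?

4. How do we collect, absorb, and exploit as fully as possible the explosion of crowd-testing cases on the web, taking the best of them for our own use?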

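Although much crowd testing is like the blind men and the elephant (researchers excepted; for example, the tests by Prof. @詹卫东 are very deep and methodical), grassroots enthusiasm and creativity bring the following possible benefits:


(2) Grassroots testing reflects public opinion: for any brand of LLM this creates positive or negative word of mouth, which to some extent determines a model's user acceptance. Expert evaluations cannot effectively change the impression users form from public opinion. In fact, models that are “naturally” eliminated by the market or abandoned by users (ignored by everyone) in the future will more likely have been affected by grassroots testing.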


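5. R&D and breakthroughs in data work: for LLM weaknesses such as “一正胡八” (short for 一本正经胡说八道, spouting nonsense with a straight face), the data side, in parallel with research on model algorithms, also needs the resolve to dig deep and help find a solution. For example, how can knowledge bases be turned into useful data, and how feasible is that? Recall that before GitHub code was used as training data, people did not regard it as on a par with natural-language data, yet it is in fact higher-quality sequence data and has played an important role in this cognitive AI revolution.

In short, the amount of data involved in LLMs is huge, the training process involves all kinds of engineering optimization factors, the pipeline is long, and there are many moving parts, all of which pose further challenges to comprehensive and timely QA. With so many threads, one needs the sense to focus on what matters and let go of what does not. The top priority is to keep model development iterations healthy and to prevent the waste of time and resources caused by model quality slipping without anyone noticing.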
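Challenge 2 above (standardizing blind tests when generation is random) could be approached by sampling each test prompt several times and reporting a pass rate instead of a single pass/fail. The following is only a sketch under that assumption: `pass_rate`, `run_suite`, the toy model, and the two test cases are all invented for illustration:

```python
import random


def pass_rate(model, prompt, check, n_samples=5):
    """Sample the model n_samples times and return the fraction of replies passing check."""
    passed = sum(1 for _ in range(n_samples) if check(model(prompt)))
    return passed / n_samples


def run_suite(model, suite, n_samples=5):
    """Run every (prompt, check) pair in the suite and collect pass rates by test name."""
    return {
        name: pass_rate(model, prompt, check, n_samples)
        for name, (prompt, check) in suite.items()
    }


_rng = random.Random(0)


def toy_model(prompt):
    """Toy stochastic model: reliable on arithmetic, random on the capital question."""
    if "2 + 2" in prompt:
        return "4"
    return _rng.choice(["Paris", "Lyon"])


suite = {
    "arithmetic": ("What is 2 + 2?", lambda reply: reply.strip() == "4"),
    "capital": ("What is the capital of France?", lambda reply: "Paris" in reply),
}
report = run_suite(toy_model, suite, n_samples=10)
print(report)
```

Tracking such per-function pass rates from one model iteration to the next is one way to notice a quality slip early.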

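In an age of information overload, not drowning in data and being able to use it well takes a macro view as well as a willingness to do the dirty work. However, data is like mineral deposits: the rich and shallow deposits get mined out first, and the later you dig, the harder it is to guarantee quality. For example, web data is messy and dirty. After all kinds of cleaning and deduplication, OpenAI actually ended up using only a small fraction of the web data: after quality filtering, only 1.27% of Common Crawl's 45TB of plain text was selected.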
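As a quick back-of-the-envelope check of the figures just quoted, assuming decimal units (1 TB = 1000 GB):

```python
raw_tb = 45.0            # Common Crawl plain text, in TB (figure quoted above)
kept_fraction = 0.0127   # 1.27% retained after quality filtering (figure quoted above)

# Assumption: decimal units, 1 TB = 1000 GB.
kept_gb = raw_tb * kept_fraction * 1000
print(f"about {kept_gb:.0f} GB kept out of {raw_tb:.0f} TB")
```

So the "small fraction" amounts to a bit under 0.6 TB of text.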

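Social media, which is similar to web page data but more dynamic and active, is also a place where data is very dirty and chaotic. GPT places great value on Reddit data (Twitter data should also be an important source, but it was reported that after ChatGPT's breakout success Musk was unhappy and cut off OpenAI's privileged access to Twitter data). How do they filter social media data? Their approach is to use user upvotes as the filtering signal: only posts with three or more upvotes (3 karma) count as quality posts. This, too, is a clever way of bringing in human feedback.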
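The upvote filter amounts to a one-line threshold test. A toy sketch, in which the `Post` record and the example URLs are hypothetical and only the threshold of 3 karma comes from the text:

```python
from dataclasses import dataclass


@dataclass
class Post:
    url: str
    karma: int


MIN_KARMA = 3  # threshold mentioned in the text


def quality_links(posts, min_karma=MIN_KARMA):
    """Keep only links whose post reached the karma threshold."""
    return [p.url for p in posts if p.karma >= min_karma]


posts = [
    Post("https://example.com/a", 0),
    Post("https://example.com/b", 3),
    Post("https://example.com/c", 57),
]
print(quality_links(posts))
```

The human judgment is already baked into the karma score, so the filter itself needs no model at all.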

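Looking ahead, the way out for truly high-quality data is not web data, which grows wildly and piles up like garbage, nor can we count on e-books and edited publications, which humans craft carefully and which grow slowly. Such quality data is only a small source and lacks the growth of the information age. More likely, it will have to come from the large models “feeding back” into themselves. To ensure that learning from its own generated output keeps making the model stronger, this will certainly not be as simple as feeding its own output directly back in as training input.

quote: Now that models are powerful enough, the model has become a kind of “data”, or a “container” of data. When needed, we can design appropriate prompts and use a large language model to synthesize the data we want. This synthetic data can in turn be used to train models. The feasibility of this approach has been verified to some extent with GPT-4.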







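【AI Wave: Fabricated Details in Large-Model Inference Are a Feature, Not a Bug】

An old friend said: “Musk has bought 1,000 GPUs and claims he is going to build truth gpt.”

Musk's move is merely meant to rival OpenAI, which has “gone astray” and no longer heeds his call, but he may not clearly see what it implies. Ever since ChatGPT shot to fame, Musk has on the one hand been hyping the rapid progress of AI and reiterating his original intentions and contributions in co-founding and funding OpenAI, and on the other hand been sparring publicly with his former startup partner and junior, Sam Altman, asking again and again: Open AI has become Closed AI, and whose fault is that?

The relationship between GPT and truth deserves careful discussion.

The first thing to point out is that “fabricating details” (saying false things, talking nonsense, misattributing, making things up out of thin air, etc.) should be regarded as a feature of generative large models, not a bug, so the so-called Truth GPT is quite possibly aiming at no real target.

In fact, fabricating details is a fundamental and extremely important feature. Without it, all creativity, and the most important abilities in imitating human intelligence (creative talent and the capacity for abstraction), would be out of the question. You cannot ask an LLM to assist with creative work (writing, painting, video creation, etc.) and at the same time demand that it never step over the line. That is very hard. It is like not being able to ban electricity just because electricity can hurt people.

A world made up entirely of truth (in plain words, facts) would be so monotonous, dull, even miserable. Everything would be cold fact: no novels or poetry, no art or romance, none of the unbridled imagination that sets humans above animals, and no hope or future either. According to Sapiens: A Brief History of Humankind, (one of) the greatest achievements of human civilization is that humans learned to “tell stories”, fictional stories. From then on humans had religion and philosophy, and the spiritual weapons to organize and mobilize collective power, and thus became masters of the Earth.

Having said that, in many scenarios fabricating details and talking nonsense is harmful, even fatal, especially when it mixes truth and falsehood with a straight face, and this is precisely (one of) the weak spots for which GPT is most criticized.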

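Humans lie too. Beyond white lies, people lie deliberately and even slander maliciously. But apart from a very small number of well-trained spies, most of us are no match for an LLM, which lies solemnly and sanctimoniously without batting an eye. Lie detectors work technically precisely because humanity as a whole has not yet sunk so low as to lose its conscience entirely, or become so base as to believe its own lies. An LLM is different. It has no conscience (good or bad) and no scruples whatsoever, so its “lying” can be neither well-meant nor malicious, neither white lie nor black lie. It fabricates entity details simply because the entity information was not “memorized” in the parameters of its neural network. What was memorized is only the abstraction or shadow of the entity (the ontology), and when expressed, the ontology has to be grounded in an entity for the output to come out smooth. For the sake of smooth generation, the language model has to instantiate the ontology, that is, make up a matching detail for a concept, just as a novelist does. This is a matter of necessity, and it is also what the model needs in order to grasp the world at the macro level. In human cognition, too, forgetting the entity and retaining only the ontology is common: when I say “I remember it was a painter who specialized in animals who came to our college to give that talk”, I have forgotten the painter as an entity (the name and other information that uniquely identifies this entity), and what I have remembered is the ontological concept “painter”. In general, although the world consists of an unlimited number of entities, humans always grasp the world by trying to summarize and organize it with a finite network of ontological concepts, and thereby understand it. In this process, entity details are remembered only when they are important enough and repeated often enough, while many more entities are recorded in our minds by their ontological category. Large models are the same. Ask the model how long the Yangtze River is or who the first president of the United States was, and it will certainly get it right. But ask about a small river, or about the president of a nonexistent country, and it starts making up an answer. The tokens it makes up are the high-probability candidates in the probability distribution given the preceding context, and the set of these candidates naturally forms the corresponding ontological type.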

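Put positively, the best outcome of the so-called truth GPT that Musk is pursuing would be to find ways to limit the side effects of fabricating details, not to eliminate fabrication, which is impossible anyway.

Among all the tasks of NLP and indeed of human cognitive intelligence, some suffer from the side effects of fabrication, for example factual lookup and question answering, or knowledge education. Some tasks do not have this problem at all, for example writing assistance and machine translation (a “lie” in the source text must not be translated into a fact just because it is non-factual, since faithfulness to the source is the iron law of translation). Some tasks require striking a balance between fact and exaggeration, for example creative advertising. If we insist that GPT is a general-purpose foundation model that can help with all of the above tasks, Musk should understand that there is really no such thing as a truth GPT. In sequence learning, a large model can only ever memorize the details (truths) floating on the surface. However large the model, and even if its design changes, it cannot exhaust all the facts (or human fabrications, slips of the tongue, and non-facts) expressed in big-data sequences. It will necessarily generalize and abstract over this information, applying varying degrees of ontological generalization to entities that statistically drift below the threshold, and this is reflected in the final model representation. In other words, the model itself is necessarily a mixture of entity facts and ontology concepts. This is the basic form in which a large language model represents and approximates a knowledge base, and it will not change under the current framework.

That is the view from the (offline) learning/training side. As the result of training, the large model's ocean-like, chaotic multidimensional vector representation contains a limited number of facts and far more non-facts (abstractions of facts), but in principle it does not include “lies” that have no basis in data (details made up by the model itself). Fabrication of details happens during the model's generation process (online inference). Under the simple next-token prediction framework, a generative large model like GPT inevitably fabricates details, because generated tokens by default express detailed facts, not ontological concepts. When the model lacks the entity detail (which shows up as very low predicted probabilities for the tokens expressing the detailed fact), it follows its current ontological pointer and fetches an entity (for example the entity 【Zhang San】) that is closest to that ontology (for example the ontology 【person】) to fill the slot. That is all there is to the mechanism of misattribution.

Seen from this principle, limiting the side effects of fabrication is not hopeless. The large model itself in principle contains no lies, but when generating language it is constrained to produce the smoothest natural-language sequence and has to use entity tokens to keep the language fluent and natural, and so it starts lying “with a straight face”. That being so, if we deliberately sacrifice some naturalness of the generated language, there is in fact a way to avoid the harm of fabricated details. In short, we can hold to one principle: rather give no detail than make one up, because that is how the underlying large model is in the first place.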


User: Who was the first woman to walk on the moon?
Answer: The first woman to walk on the moon was Sally Ride.
User: Who was the first woman to walk on the moon?
Answer: The first woman to walk on the moon was 【Person Name】.
User: What is the boiling point of water on Mars?
Answer: The boiling point of water on Mars is 100 degrees Celsius (212 degrees Fahrenheit), the same as on Earth.
User: What is the boiling point of water on Mars?
Answer: The boiling point of water on Mars is 【Number】 degrees Celsius (【Number】 degrees Fahrenheit).
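The “rather give no detail than make one up” principle illustrated by these pairs can be sketched as a thresholded choice over candidate entity tokens. This is a toy illustration, not how any actual model decodes: the function `realize`, the threshold, and all probabilities below are invented:

```python
def realize(candidates: dict[str, float], ontology_type: str, threshold: float = 0.5) -> str:
    """Return the most probable entity token, or a type placeholder if none is confident enough."""
    best_token, best_prob = max(candidates.items(), key=lambda kv: kv[1])
    if best_prob >= threshold:
        return best_token
    # No entity is well supported: fall back to the ontology type instead of guessing.
    return f"【{ontology_type}】"


# Made-up probabilities for a well-attested fact and for a question with no true answer.
known = {"George Washington": 0.97, "John Adams": 0.02}
unknown = {"Sally Ride": 0.30, "Valentina Tereshkova": 0.25}

print(realize(known, "Person Name"))
print(realize(unknown, "Person Name"))
```

The price is a less natural sentence, which is the trade-off the text describes: fluency is sacrificed so that a low-confidence entity is never presented as fact.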






The ChatGPT Tsunami and Its Impact on IT Landscape and New Ecosystem

This is my recent invited talk given to young entrepreneurs on the LLM and ChatGPT ecosystem.  

1. ChatGPT:  "Tower of Babel" for Human Languages

Natural Language Processing (NLP) is the crown jewel of AI. AI is mainly divided into perceptual intelligence and cognitive intelligence, and the leap from perceptual intelligence to cognitive intelligence is mainly reflected in the ability to complete NLP tasks. Human language is the carrier of human knowledge, and mastering language is a gateway to entering human cognitive intelligence. For thousands of years, eliminating language barriers has always been a dream of mankind. Babel in the Bible refers to the tower that mankind wished to build to overcome barriers of human languages, but it was considered to be impossible to build. We NLP practitioners have also been pursuing this dream, hoping to get closer to the final goal of overcoming the language barrier.


However, on November 30, 2022, remember this day, with the official launch of the ChatGPT model by the American artificial intelligence company OpenAI, the Tower of Babel was officially completed! It not only successfully eliminated the language barriers for mankind but also established a bridge between humans and machines. In no time did we all realize that a ChatGPT tsunami had swept across the world.

Why is ChatGPT judged to be the Tower of Babel? Because its language performance is actually more "native" than native speakers: native speakers inevitably have slips of the tongue from time to time, but a large generative language model like ChatGPT rarely makes such mistakes and seems to be always in line with language habits. From the input side, it can understand any human language. From the output side, it can speak fluently. What is most shocking is that from its language performance, we can observe what is called the "Chain of Thought" (CoT) behind its responses, with certain logical reasoning abilities, giving people the impression of being clear and organized. Behind the input and output is the so-called LLM (large language model, GPT in particular), which is like a bottomless black hole to users. Inside are actually many layers of neural networks, represented internally as multidimensional vectors, which house a ton of knowledge. 

Let's take a look at how the LLM behind ChatGPT is developed. There are already tons of technical introductions on this topic, and we will briefly describe the underlying principles. Its basis is GPT-3, or more precisely, the latest version called text-davinci-003. This model is first of all extremely large in scale, and its size is believed to have made miracles happen. With billions of tokens as training data, it forms a model with billions of parameters. Research has shown that generic large models will exhibit an "emergence" of certain skills once they reach a certain scale, and these emerging skills can perform well in various multi-task scenarios with minimal prompting. Previously, this phenomenon was generally attributed to the "transformation of quantity into quality", and it was basically treated as a mystery in philosophical terms. It is like saying that everything is attributed to God's favor.

In my understanding, it is not that mysterious, but a reasonably natural result as the emergence of multi-task skills has to be based, and can only be observed, on a super-large data model.  This is because otherwise, there is no sufficient space for the model to tune itself based on human preferences. Large language models are learned from text sequences, and their greatest feature is their ability to over-generate, giving many possibilities for subsequent sequences like "chain reactions", but only a small percentage of these possibilities are desirable and beneficial. Many generations may be shallow, empty, or even toxic. ChatGPT's breakthrough lies in the meticulous final fine-tuning process, using reinforcement learning as its core, it found an effective method to keep aligned with human preferences. This is like having a huge basin with numerous children bathing inside, and now you want to pour out the bathwater without pouring out the children. It is almost impossible. But if you can afford to lose some, the result is that the water is poured out, with some good children still inside the basin to help the case. The premise of doing this is that the basin must be large. Only super-large data models can achieve this with sufficient abilities left for numerous tasks. For example, what proportion of parallel translated text or of data of question-and-answer pairs is there in a normal language raw corpus? It's a tiny tiny fraction, and when the data size is small, it is hard to learn the translation or question-answering skills from sequence-based learning. Only with super-large data and model can the small proportion multiplied by a large number of tokens create the necessary conditions and soil for implicit learning of such skills. In a basic model with almost infinite generation possibilities, if enough work is not done in a later stage, the probability of generating useless responses is high. 
Therefore, "aligning with human preferences" becomes the ultimate goal of fine-tuning. In this process, many children were also poured out, which is called the "alignment tax" in the literature. But it doesn't really matter, because people can't see the lost treasures, as long as they see the good results, it's fine. Large models have enough redundancy and can survive filtering and pruning at all levels. In fact, it is not the large model itself that creates miracles, but the large model prepares a warm bed for miracles to happen.

What makes ChatGPT different from previous large models is that it has carefully planned for reinforcement learning from human feedback. For a generic open system, humans cannot really pinpoint where it is right or wrong, but at least they can say whether the response is good/useful or bad/no-value. Using this type of feedback to reinforce the learning and to fine-tune the large model, ChatGPT suddenly becomes very human-like. Human-machine interaction has changed from humans accommodating machines and having to write code, to machines accommodating humans and understanding human language. This is a huge transformation.

Reinforcement learning is relatively a difficult type of learning algorithm compared with other supervised learning approaches because it involves a long chain and the definition of the ultimate goal is not explicit and direct, but indirect based on the final outcomes. The idea behind training is to suppress the high probability of poor performance in the original model and bring out the low probability gems hidden in the model: the child is the reinforcement target that conforms to human expectations, but not a specific child as the optimization target. In any case, there is no unique answer format in this world, and there is usually no golden standard for a generation. What we have is the fuzzy feedback given by humans based on preferences: this answer is good, that one is nonsense; this one is correct, that one is discrimination. A typical method that can make good use of this terminal feedback is reinforcement learning. Once this feedback loop is established, the model can be continuously strengthened and iterated, and its performance will naturally improve. So, after some meticulous learning from human feedback, on November 30, 2022, the curtain was lifted, and this was the moment when humans witnessed the miracle.

To be honest, I have been engaged in NLP for my whole life, and I never thought I would see such a miracle in my lifetime. It has been three months since ChatGPT was created, and it still feels like a dream. Sometimes I stare at the ChatGPT icon and ask myself, is this the language gateway to the new ecological universe? I have to say that all the signs indicate that ChatGPT has unlimited potential for NLP.

Let's take a step back and review the contemporary history of the golden decade of artificial intelligence.

Ten years ago, in the ImageNet competition, deep learning overwhelmingly crushed all other machine learning performances in the image field, triggering a landmark neural network revolution. Deep neural networks rely on supervised learning of big data. Since then, we have known that as long as the data is large enough and labeled, deep learning can handle it. After sweeping through image, speech, and machine translation, it encountered the stumbling block of NLP because many NLP tasks do not have large-scale language data with labels.

Five years ago, the NLP field saw the emergence of large language models (LLMs) represented by BERT and GPT. LLM can directly "eat" language without the need for annotations, which is called self-supervised learning in academia. LLM marks the arrival of the second revolution, which pushed NLP to the center of AI and became the core engine of cognitive intelligence. AI finally overcame the dependence on labeled data which had been the knowledge bottleneck for NLP, leaping from perception to cognition.

Three months ago, ChatGPT was born, creating an almost perfect human-machine natural language interface. From then on, machines began to accommodate humans, using natural language to interact, rather than humans accommodating machines, using computer language. This is a groundbreaking change.

From the emergence of LLM to the advent of ChatGPT, it truly externalized both its linguistic talent and its knowledge potential, allowing ordinary people to experience it. Looking back, human-machine interaction and its related applications have been explored for many years, but before ChatGPT came out, it had never really been solved. When the GPT-3 model was launched two years ago, skilled players among us already knew how capable it was. As long as you give it a few examples, it can follow the examples to accomplish various NLP tasks, so-called few-shot learning. It does not require major modifications to the large model or large-scale labeled data. With just a few examples, GPT-3's potential can be unleashed to accomplish various NLP tasks, which is already amazing as it overcomes the knowledge bottleneck of supervised learning. However, the basic limitations of these amazing performances of LLM are mostly known within a small circle of players, and a language bridge is needed for its true breakthrough. ChatGPT has come forward with its biggest feature, zero-shot learning, which means that not a single labeled sample is needed, and you can directly tell it what to do. After five years of supervised learning and five years of self-supervised learning of the deep neural network revolution, the final result has been delivered, and the ChatGPT Babel tower has been fully constructed, marking the pinnacle of the golden decade of AI. ChatGPT has since been like a tsunami, stirring up the world and causing a sensation all over. 


Looking at the history of AI from a broader perspective, 30 years ago, the main approach to NLP tasks was through symbolic logic. Symbolic routes and machine learning are the two paths that have alternated in dominance in AI history every 20-30 years, like a pendulum. But in the past 30 years, machine learning has been on the rise as the mainstream, with the deep learning revolution in the last 10 years. The pendulum shows no sign of swinging back. We practitioners have been on a long journey of the symbolic rule system. It is not in the mainstream, rarely even mentioned by anyone, but it has not been lacking in its own innovation with its own differentiated advantages. It is worth noting that the symbolic parser has eventually embraced data-driven empiricism and relies on a pipeline of multiple modules to ultimately deal with the hierarchy of language structures. We call this deep parsing. Similar to LLM, deep parsing consists of many levels (around 50-100 levels) of bottom-up processing. It also first digests the language but parses incoming sentence sequences into internal symbolic graph structures, rather than LLM's vector representations. Although deep parsing and deep learning take different representation schemes, both empower downstream NLP tasks, one with structures and the latter with vectors, both greatly improving the efficiency of downstream NLP tasks. Of course, LLM is still the stronger player because it not only masters syntax structures but also performs exceptionally well in discourse and computational styles, the former involving long-distance discourse relationships and the latter capturing subtle differences in language expressions.  Discourse and computational style pose a significant challenge to parsers that primarily focus on sentence structures.

There have always been two main lines in AI. In addition to machine learning, there is traditional symbolic logic; the contrast between them rises to the philosophical height of rationalism versus empiricism. These two paths have waxed and waned over the past 30 years, with machine learning on the rise and symbolic logic disappearing from the mainstream stage, although the industry has never given up on its use. The transparency and interpretability of symbolic logic translate directly into the engineering convenience of pinpoint error correction, in contrast with an LLM's black-box internal vectors. An LLM can be improved at the macro level through retraining, or steered through fine-tuning or few-shot prompting, but it cannot be corrected or debugged at a precise point, the way a surgeon operates. The LLM's lack of interpretability also often causes user concern and confusion in practical applications. Perhaps one day in the future, the two paths will converge at a point where a new AI revolution will occur.

From the perspective of AGI, we see that almost all models before LLMs were specialized, and the narrower the task, the better the performance. One exception is the parser, which is in essence the "symbolic foundation model" of the pre-LLM era, empowering downstream NLP tasks with structures, just as LLMs do with vectors. From a more general perspective, the emergence of LLMs represents a breakthrough in the development of artificial intelligence towards AGI, or Artificial General Intelligence. AGI has long been a controversial goal, and many scholars, including myself, have doubted or even mocked its feasibility. However, with the advent of LLMs five years ago, AGI became more scientifically viable, rather than just a utopian dream. OpenAI, which champions AGI, has become the shining star in this field, having delivered a long list of influential general-purpose models that includes the GPT series for NLP, Codex for code writing and debugging (eventually used for the GitHub Copilot service from Microsoft), and DALL-E for image generation.

With ChatGPT as the pinnacle, large models have taken over all NLP tasks simply by using natural language as instructions, not only those defined by the NLP community but also many user-defined tasks. Its NLP tasks are completely open. Tasks related to language and knowledge can be attempted in any language, and often the results are immediate and magical at the same time. Someone has listed 49 task scenarios that it can handle, but it can actually do much more than that.  In addition, new scenarios are being discovered all the time. This is an unprecedented phenomenon in the history of AI, which the industry calls "skill emergence".

We can examine why it is so capable and knowledgeable. Overall, human systematic knowledge is largely expressed in language. Human knowledge is mainly carried in the form of text (written language), and mathematical formulas can be seen as an extension of written language. From a linguistic perspective, human knowledge can be divided into linguistic knowledge and knowledge beyond linguistics. Linguistic knowledge includes lexical knowledge, syntax, morphology, discourse, style, etc. Knowledge beyond linguistics is a much broader circle with a much wider boundary. Large language models have not yet mastered human knowledge as a whole; it seems that they have managed to capture the knowledge floating on top of the sea of human knowledge. As for ChatGPT, it can be said that it has mastered almost all linguistic knowledge, but only about 20% of human knowledge in general, including common sense, basic logic, and encyclopedic knowledge. More serious research is needed to quantify this properly, but as a ballpark figure, it feels like about 20% of the knowledge has been learned, and the remaining 80% is still out of reach. However, the Pareto principle (the 80-20 rule) applies here: mastering the 20% of knowledge floating on top in effect covers 80% of the scenarios. Still, because of the 80% knowledge gap, it pretends from time to time to know things it does not. Even so, LLMs can reshape the ecosystem and the world if we learn to use their strengths and handle their weaknesses wisely.

How do we judge whether it has learned and how well it has performed a task? In any NLP task, there is a quality assurance (QA) protocol to follow, which requires at minimum a test set of annotated samples. Currently, ChatGPT uses zero-shot learning (i.e. zero samples), where a random task is assigned to it and once it is done, it moves to a new task, so there is no chance for building a persistent test set.  So its performance on result quality cannot be quantified directly. In such cases when the internal testing protocol is missing or no longer applicable, external methods must be used to evaluate the data quality indirectly, such as customer surveys or using my previous company Netbase's social listening service to collect customer feedback online. All the external signs indicate that customer satisfaction seems to be over 80%, and in most task attempts, customer needs are met fairly well, at times with nice surprises and miracle-like performance. Another relatively objective external indicator is user stickiness and growth of user accounts.  ChatGPT has set unprecedented records in this regard, with tens of millions of users in just a few weeks. ChatGPT's customer growth rate exceeds everyone's imagination.

In conclusion, ChatGPT represents a major breakthrough in the field of natural language processing and artificial intelligence. As a large language model, it has revolutionized the way we approach NLP tasks and has demonstrated remarkable versatility and capability. However, it is important to keep in mind that ChatGPT is not perfect and there is still much work to be done in terms of improving its performance and addressing its limitations.

Despite these challenges, ChatGPT has already had a profound impact on the field of AI and is poised to continue shaping the future of technology in significant ways. As AI continues to evolve and advance, it is likely that we will see more breakthroughs of LLMs that push the boundaries of what is possible and help us achieve even greater levels of understanding and innovation.


Over the last three months, there has been no end of online forums, discussions, and talks about ChatGPT, and there is still no sign of fatigue. Recently, Dr. Lu Qi, the former head of Y Combinator China, came to Silicon Valley to give a passionate speech, which added fuel to the fire. He compared the ChatGPT revolution to Web 1.0. As we all know, the iconic brand that represented the first Internet boom was the Netscape browser. Although Netscape did not grow into a large company, the Internet revolution it started created giants like Yahoo, Google, and Amazon. A similar revolution occurred in China, giving rise to world-class companies such as Baidu, Tencent, and Alibaba. Lu Qi believes that we are in such an era right now. He said that the roadmap is so clear and the trend so obvious that he has absolutely no doubt in his mind. Overall, I largely agree with his view of the technological trends and landscape.

ChatGPT marks the emergence of a new era. Some people say that this is the "iPhone moment" or "Android moment" in the history of contemporary information technology and will lead to a brand-new ecosystem. I feel that Lu Qi's comparison is more comprehensive, as ChatGPT is like the "Netscape browser" that initiated the first Internet revolution. Regardless of the comparison, it is a game-changer.

However, it is essential to note that ChatGPT also has its shortcomings and challenges. One issue that everyone has noticed is so-called hallucination: fabricating details and distorting facts. Although ChatGPT has conquered every form of human language, it has only scratched the surface of cognitive intelligence. Is it possible for LLMs to solve this problem completely? In my opinion, the LLM route alone will not solve cognitive intelligence. As mentioned earlier, ChatGPT has covered only about 20% of human knowledge. Even if LLMs continue to scale up by several orders of magnitude in sequence-based learning, by my estimate they can at best reach 40%-50%. The remaining half or more is a deep sea that can hardly be fathomed. The long tail of knowledge is a combinatorial explosion, way beyond the reach of sequence-based language learning. The annoying behavior is that, for any knowledge beyond its ken, an LLM will not hesitate to fabricate fake details that appear genuine. This is a severe problem. Inaccuracy on such long-tail knowledge is an unavoidable problem for application services based on LLMs.

Moreover, there are many other issues that need to be overcome. For example, when a large model empowers downstream scenarios, how can customer privacy and security be protected during the process of calling the large model? This problem has not yet been solved, but it is believed that better solutions will develop in time. The supplier of large models will surely pay special attention to this issue and provide solutions for their ecosystem's development.

Another issue is complex reasoning ability. From conversations with ChatGPT, we observe that it already has basic reasoning ability. The source of this ability is very interesting. It mainly benefits from self-supervised learning on a massive body of computer code. GPT-3.5, on which ChatGPT is based, has been trained not only on human natural language but also on the massive amount of open source code available on GitHub, written in various computer languages, and most of that code comes with corresponding natural language explanations (comments) too. Since computer code is by nature more logical than natural language, this has helped ChatGPT organize its responses and speak more coherently. This was said to be a nice surprise that the developers themselves had not anticipated. However, it currently still has shortcomings in complex reasoning. Fortunately, complex reasoning ability is different from the boundless knowledge network. It is a relatively closed logical set, and it is believed that it can be solved in the not-too-distant future (perhaps GPT-4 might already be able to handle it?).

Lastly, let's talk about the progress of multimodal learning. The LLM, as a foundation model, has been validated in NLP multi-tasking and has performed exceptionally well. After the breakthrough in NLP, the framework of empowering downstream tasks with a foundation model began to radiate toward other modalities. This direction of research is very active in the academic field of multimodal learning, and everything is still ongoing. Currently, multimodal learning in practice is still at the stage of prompt engineering. What is lacking is a natural language interface. People who play with prompts in large models for image and music generation already know the huge potential and effectiveness of the foundation model. The situation is very similar to when we played with few-shot prompts in the GPT-3 playground before ChatGPT was born. It can be foreseen that in the near future a smooth natural language interface will emerge, and users will be able to describe the art they desire, whether it is a painting or a song. The work of aligning with human taste is also ongoing. It is predicted that a natural language to image (NL2img) model like "ChatDalle", similar to ChatGPT, will implement the desired natural language interface. The same trend is bound to happen in natural language to music (NL2music). We are in an exciting new era of AIGC (AI-generated content) for art creation.

Another predictable development is that, following the trend of multimodal LLMs, there will eventually be a unified large model that integrates various modalities and their associated knowledge. Breaking through this modality barrier will provide critical support for entrepreneurs who use LLMs to empower downstream applications in various scenarios. As we all know, whether it is finance, law, or medicine, each major vertical has its long-accumulated structured symbolic knowledge base, including the domain ontology and other databases. Connecting to a domain's symbolic resources means breaking the domain barrier. It is expected that this barrier will be largely overcome in the next two to three years.

2. LLM Ecosystem Facing Reshuffling

The direct impact of the ChatGPT tsunami is that the NLP ecosystem is facing a reshuffle, and every existing information product or service must be re-examined in the context of LLM.

When we first discussed ChatGPT’s impact on IT services, the first thing that came to our mind was how to combine ChatGPT with search technology, and whether it could re-invent search.

Search is traceable, and every returned result is recorded, so it involves no information fusion. ChatGPT is untraceable and excels at information fusion: in essence, ChatGPT cannot plagiarize, because every sentence it spits out is a novel sequence based on its digested information sources. Clearly, traditional search and ChatGPT each have their own advantages and disadvantages. Search is the king of information services, ubiquitous, with a very stable business model. Since the rise of search in the Web 1.0 era, its form and mode have basically not changed for more than 20 years. In fact, new technologies and entrepreneurs have been trying to challenge search continuously over the years, and the venture capital industry has also been watching for potential search disruptors that might become the "next Google", but the status of search has always been unshakable, at least until now. This time is different. Microsoft has an exclusive license to the code behind ChatGPT and has boldly launched the so-called "new Bing". Google, which has dominated the space for so long, has to mobilize urgently and confront it head-on. A search+LLM drama is unfolding live, telling us that although there are still many difficulties to overcome in integrating these two technologies, the trend is unstoppable, and reshaping the search ecosystem is imperative.

In addition to search, the finely polished, narrowly targeted information products and services now face the fate of being re-examined and reformed, including chat, virtual assistants, grammar correction, machine translation, summarization, knowledge Q&A, etc. The representative services in these areas (Siri, Grammarly, etc.) used to have high technological barriers, which have suddenly been lowered. Thanks to years of polishing and user inertia, many of these products are not facing an immediate catastrophe, and some may survive for a long time, but they are all on a downhill road. This is a revolutionary victory of general AI over traditional AI. It is something we would not have believed feasible before. We used to be so skeptical of the general approach, waiting to laugh at those who advocated AGI, such as OpenAI, which went on to launch a series of impressive LLMs (the GPT series, Codex, DALL-E), including ChatGPT.

Look at Siri, which was released by Apple 13 years ago. 13 years is longer than the entire golden decade of the deep learning revolution, but Siri has only recently managed to offer 2-round or 3-round conversations. Amazon's popular product, Alexa, is the same. It has been polished for several years and accumulated so much user data. Now, with the advent of ChatGPT, what will Apple and Amazon do? They must embrace LLMs.

Next is the commonly seen e-commerce customer service. As we all know, Alibaba and JD.com's online after-sales customer service has been polished to perfection. Because after-sales service issues are relatively concentrated, the problem set is not large while the data are large, accumulated over the years. However, customer service is not only limited to post-sales.  In order to handle customer service smoothly, LLM cannot be ignored.

Moving on to education, it is clear that the ChatGPT model has the potential to revolutionize all education products and services. Anyone developing educational applications will need to reconsider how to embrace LLMs within the framework of the large model. Education itself deals with language, whether the subject is arts or science. Although the current large model is not particularly strong in science and engineering (yet), this knowledge gap will be filled to varying degrees soon. ChatGPT is sure to disrupt education, while also providing the largest opportunity for modernizing it. Language learning and computer programming education are obvious areas for ChatGPT to shine, as the model itself is a language model. Although its programming abilities are not yet at the level of professional engineers, it is proficient enough in common code patterns to assist with programming and with learning to program. In fact, Copilot, which is powered by the GPT-based Codex model, has already become an auxiliary tool for more and more programmers.

Stepping back, we are also facing huge risks, such as fake news. If one wants to promote a company or product, one can now use ChatGPT to generate all kinds of promotional posts that sound convincing. In the future, online reviews and comments will also be drowned out by fake content, as the cost of creating it approaches zero. Without proper precautions, all of this could place humanity in a world where truth and falsehood are indistinguishable. All along, we have been talking about the benefits of LLMs and how they can empower new ecosystems for a productivity explosion. We expect that in the next five to ten years, new international IT giants, like a new Google or a new Alibaba, will emerge under this new ecosystem, leading to a major transformation in the technology landscape. But the danger of LLM misuse is equally great. Is mankind ready for it? Clearly not. Of course, this is another topic, and we will leave it there for now.

3. Wave of Mass Entrepreneurship Coming

With LLM (ChatGPT in particular), there are more product forms and services waiting for entrepreneurs to explore.

Regarding this topic, we need to emphasize the unprecedented entrepreneurial conditions brought by ChatGPT. ChatGPT itself has become a testing ground for products. It is a playground with an infinitely low bar that everyone can play in. The low bar is due to the paradigm shift in human-machine interfaces mentioned earlier. For the first time in AI history, machines began to cater to humans, rather than humans catering to machines. Human language, rather than computer code, became the tool for human-machine interaction. The significance of this change for the new ecology of NLP is difficult to overemphasize. In fact, this provides conditions for "mass entrepreneurship".

Those who have started AI businesses should all have this experience. The most basic condition for a startup team to have a chance of success is that the product manager and the technical leader can work closely together and communicate effectively. The product leader, relying on their market intuition and understanding of customer needs, strives to find the best market entry angle for technology to be transformed into a service and form a product design plan. The feasibility of this design plan needs to be endorsed and then developed by the technical leader. However, often due to different professional backgrounds and knowledge structures, the situation where the product manager and the technical leader talk past each other is not uncommon. Once this situation arises, the startup company is basically doomed to fail.

ChatGPT fundamentally eliminates the problem of talking past each other. Previously, only the technical leader and programmers could verify the feasibility of a plan, but now, the product leader/CXO, engineers, data analysts, and users with different backgrounds and expertise all have a unified platform, ChatGPT, on which they can illustrate product ideas. Everyone can simulate services on it. Not only has the communication barrier between humans and machines been overcome, but also the communication barrier between different teams. The emergence of this thing is a precondition for a product explosion and mass entrepreneurship.

In the United States, hundreds of startups are now exploring ideas of downstream products and services following ChatGPT or the backend LLMs. While the upstream big models are still rapidly progressing, what they are doing downstream is already in active development. There are countless ordinary people sharing their stories online, showing how they can earn 5,000 dollars using ChatGPT in just two or three hours. This kind of sharing means that the entrepreneurial enthusiasm of grassroots people has been mobilized. It seems that everyone can use this opportunity to find an entrepreneurial perspective. Summarizing these grassroots ideas may also lead to new tracks that can be standardized and scaled to meet market demands.

A big model like ChatGPT is ultimately an operating-system-level presence. No AI-related information product or service, especially one related to language and knowledge, can do without it. When Intel dominated the market, the famous slogan was "Intel Inside". In the future, it will be "Chat-Inside", or more accurately, "Chat-In&Out". Why in and out? When a big model like ChatGPT empowers products, it is both a waiter and a chef. As a waiter, it takes your order, interacts with you, and understands your needs; as a chef, it does the cooking and delivers the service. This requires both language talent and knowledge skills. This is what we call the LLM expert workbench, which may be the biggest new ecological form in the next five years and may open countless doors for entrepreneurship. The basic service form is online information services in various industries, whether it is online education, online lawyers, online consultants, online finance, or online tourism. All are aimed at significantly improving service efficiency. With ChatGPT, you only need to hire one expert to replace the 10 experts that were previously needed to handle the tasks. The end result is a productivity explosion.

In conclusion, the wave of mass entrepreneurship is coming, and ChatGPT has brought unprecedented entrepreneurial conditions. It has become a testing ground for products, with an extremely low bar, that everyone can play in. The emergence of this technology has eliminated communication barriers between humans and machines and between teams, leading to new tracks that can be standardized and scaled to meet unmet market needs. The future of ChatGPT as an operating-system-like presence may be the biggest new ecological form in the next five years, the LLM expert workbench, which opens doors for entrepreneurship and will lead to a productivity explosion.

At this point, the application ecosystem seems very clear. The principle is that experts must be the final filter before delivering the results (human judge as final filter). This is the basic setup, but experts may also provide input prompts to inspire LLM to produce better results.

For almost every application scenario, there is a task to create an expert workbench, including supplementing existing products or services, such as every segment of online education, as well as online doctors, lawyers, financial consultants, etc., and exploring previously unthought-of business scenarios. This is a visible transformation or reshuffling of the ecosystem, providing efficient expert advice (expert-in-loop services).

Speaking of workbenches, e-commerce giants have built relatively large customer service workbenches, which were introduced when user needs and satisfaction could not be met with fully automated solutions or with fully manual solutions. Now with LLM, this form can be extended to all online service sectors. The productivity explosion that this can bring about is beyond imagination.

The design concept of "Human as Judge" has been validated for several years in low-code platforms (such as RPA platforms, parser-enabled information extraction platforms, etc.) for its effectiveness and efficiency. Here, we are talking about a completely new form, where humans only need to act as judges to complete the service. It is now entirely possible to create online information service workbenches tailored to various segments or scenarios, with experts sitting in the background. Specifically, the expert's role is only to make the decision based on their knowledge and experience, especially at the final "go or no-go" moment. Being a judge is much more efficient than being an athlete.
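The "human as judge" setup described above can be sketched in a few lines. This is only an illustration under stated assumptions: `fake_llm` stands in for a real model call, `cautious_expert` stands in for a human reviewer, and the names `Verdict` and `expert_in_loop` are hypothetical, not part of any existing platform.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    approved: bool   # the expert's final "go or no-go" decision
    note: str = ""   # optional reason, e.g. why a draft was rejected


def expert_in_loop(
    request: str,
    draft_fn: Callable[[str], str],
    judge_fn: Callable[[str, str], Verdict],
) -> Optional[str]:
    """The model drafts the answer; the expert only judges it before delivery."""
    draft = draft_fn(request)
    verdict = judge_fn(request, draft)
    # Nothing reaches the customer without the expert's approval.
    return draft if verdict.approved else None


# Stand-in for an LLM call (assumption: a real system would call a model here).
def fake_llm(request: str) -> str:
    return f"DRAFT ANSWER for: {request}"


# Stand-in for a human expert who blocks anything that needs manual handling.
def cautious_expert(request: str, draft: str) -> Verdict:
    if "dosage" in request.lower():
        return Verdict(False, "needs manual review")
    return Verdict(True)


delivered = expert_in_loop("How do I reset my password?", fake_llm, cautious_expert)
blocked = expert_in_loop("What dosage should I take?", fake_llm, cautious_expert)
print(delivered)
print(blocked)
```

The design choice worth noting is that the expert never writes the answer in the approved path; the expert's time is spent only on the decision, which is where the efficiency gain is expected to come from.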


It is worth emphasizing that ChatGPT brings something new as an enabling information technology, because it serves as both a backend and a frontend. It can perform well in both high-level and low-level tasks, which is why chat is just the surface of ChatGPT: the surface is a human-machine interface, while the core is its ability to complete various NLP tasks. With both surface and core in place, downstream products or services can be built around it. In the Intel era, computer brand advertisements were remembered for "Intel Inside"; in the future, the new ecology should be called "Chat-In&Out", which refers to the ecology empowered by LLMs, empowering not only the human-machine interaction but also the professional services themselves, with experts providing only the final check. In this form, the experts are behind the scenes. To put it another way, the LLM is both a waiter and a chef, but an expert needs to review the food and take responsibility before it is served, to ensure service quality (as with online doctors, lawyers, consultants, etc.).

In such an ecosystem, the next five years will be a period of explosive growth for online services. Fortunately, the three-year pandemic has greatly promoted the grassroots awareness of online services, helping to cultivate user online habits and develop the market.

While LLMs are powerful in terms of breadth of knowledge, they also have limitations in terms of precision. The key challenge in building an expert-in-loop service is to overcome the precision bottleneck of LLMs. The goal is to raise precision to a level where it does not significantly hurt the efficiency of the expert's work. If at least 1/4 of the results generated by an LLM can match the level of an expert's manual research, then the efficiency of the expert-in-loop service can be ensured. This is a feasible expectation, and current solutions are not far from meeting this threshold. With this in mind, we conclude that the door to entrepreneurship in the new LLM ecology has indeed been opened.
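One way to reason about such a threshold is a back-of-the-envelope cost model. The sketch below is an assumption, not the author's calculation: it supposes the expert always spends a fixed review time on each LLM draft and falls back to full manual work whenever the draft is not usable. The minute values are made up purely for illustration.

```python
def expected_minutes_per_case(p_usable: float, review_min: float, manual_min: float) -> float:
    """Expected expert time per case under the assumed model.

    The expert always reviews the draft (review_min) and redoes the work
    manually (manual_min) with probability 1 - p_usable.
    """
    if not 0.0 <= p_usable <= 1.0:
        raise ValueError("p_usable must be between 0 and 1")
    return review_min + (1.0 - p_usable) * manual_min


def speedup(p_usable: float, review_min: float, manual_min: float) -> float:
    """Ratio of fully manual time to expert-in-loop time (values above 1 mean a gain)."""
    return manual_min / expected_minutes_per_case(p_usable, review_min, manual_min)


# Illustrative numbers only: 5 minutes to judge a draft, 60 minutes of manual research.
at_quarter = expected_minutes_per_case(0.25, 5.0, 60.0)  # 5 + 0.75 * 60
gain = speedup(0.25, 5.0, 60.0)

print(at_quarter, gain)
```

Under this model the workbench pays off whenever the usable fraction exceeds `review_min / manual_min`, so the cheaper judging is relative to doing the work, the lower the precision the service can tolerate.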