两分钟谈谈:Moravec悖论

简介一下 first:

Moravec悖论是由机器人专家汉斯·莫拉维克(Hans Moravec)在20世纪80年代提出的一个观察,指出了人工智能领域中的一个有趣现象:高水平的推理需要相对较少的计算资源,而低水平的感知和运动技能则需要大量的计算资源。这意味着计算机和机器人在处理复杂逻辑和数学问题时相对容易,而在执行诸如行走、抓取和视觉识别等基本感知和运动任务时却非常困难。

Moravec悖论的主要观点

高水平推理 vs. 低水平感知和运动技能:

高水平的认知任务(如象棋、数学证明、逻辑推理)可以被算法高效地解决,因为这些任务往往有明确的规则和结构。

低水平的感知和运动技能(如步行、抓取物体、视觉识别)涉及大量的不确定性和复杂的环境变量,这些任务需要处理大量的感官输入并实时做出反应,非常困难。
人类和机器的不同发展路径:

人类在进化过程中,低水平的感知和运动技能(如走路、避开障碍物)已经通过数百万年的进化得到优化,并且我们对这些技能的认知是无意识的。相比之下,高水平的认知任务是相对新的发展,且大多是有意识的。

计算机和机器在这些高水平任务上表现出色,但在处理低水平感知和运动任务时却非常困难,因为它们缺乏人类进化中积累的那些隐性知识和适应能力。

Moravec悖论的实际例子

下棋 vs. 走路:计算机程序可以打败世界上最好的选手,但要让一个机器人在复杂的现实环境中稳定地行走仍然是一个巨大的挑战。

逻辑推理 vs. 抓取物体:逻辑推理问题可以通过算法高效解决,而让机器人精确地抓取和操纵不同形状和材质的物体仍然需要复杂的感知和控制算法。

有几点感想:

1. 既然低水平的感知和运动技能是亿万年进化的结果,成为动物和人的遗传本能,后天只是激发出来了这种能力。

那么 LeCun 以动物和人总是在语言能力之前,先“学”会了这些能力作为理由,批判LLM对于多模态的干扰和“投机取巧”,就不是很有说服力。因为模型并没有进化出来的遗传基因作为基础。模型唯一可以利用的是LLM里面的认知知识(包括感知常识的语言描述)。

2. 虚拟机器人(LLM)可以轻易做很多高级白领工作,但人形机器人却对蓝领低级工作的模仿非常笨拙,这是大家都看到的事实。

但其实我们也观察到,虽然笨拙,但任何低级的力气活(例如举重)、技巧活(例如翻筋斗),一旦学会了,机器人就比人类有无比的优越性:它不知疲倦,不怕单调,耐力超强。学会了投篮以后,你不用担心他的成绩不稳定。

3. 视觉识别以前是模型的短板,人和动物的强项。

但是现如今就不同了。例如,人脸识别,模型比人强了。看图说话和视觉理解最近的表现也有明显的碾压人类趋势。

这是因为当悖论提出的时候,那还是符号逻辑主导AI的年代,机器学习刚开始有一些进展,但善于从海量数据学习patterns的深度学习还没有发明。感知智能当时是一座难以逾越的大山。

4. 具身机器人的外推能力怎么来呢

我们知道,机器人以前的建模都是非常“内向”的,在厨房环境建立的模型,换到了办公室环境就不行,必须一切重来,重新准备场景数据,重新训练。厨房环境的数据 “外推” 到办公室环境的能力,可不可以做到?

可以的。在大模型的架构下,这早已不是幻想。可行性可以从半年多前的谷歌 RT-2 机器人的项目表现看到端倪。

道理就是大模型虽然没有遗传的基因,也没有目标场景(厨房场景)的数据, 但办公室环境的数据 finetune 出来的大模型能力,是有希望迁移(外推)到其他的环境(例如办公室环境),因为LLM 某种意义上起的作用就是生物进化得到的先验遗传。

 

两分钟谈一下啊:模型训练的内推与外推

模型训练的讨论中,常常提到 interpolation(内推)与 extrapolation(外推)两个术语,可以说说。

大家都知道,在数据驱动的主流学习过程中,内推需要的是在训练数据的边界内“泛化”能力,善于信息压缩的深度学习训练是内推的利器。但外推(extrapolation)就没见有长于此道的算法或模型。实际上,外推与数据驱动是直接矛盾的,by nature 就是无解的,因此很可能这就是个伪问题,根本就不该对外推抱有幻想。当然,从鲁棒性角度,外推现象出现的时候,应该有个合理的应对,至少要做到模型不死机,至于外推的精度,主要是看运气,而非模型的内功可以发力。

不过,外推的问题还是值得梳理一下。

外推(Extrapolation)
定义:外推是指在已知数据点之外进行预测或推断。例如,已知某个函数在几个点上的值,外推就是在这些已知点之外估计函数的值。

前面说了,数据驱动的路数,本质上对外推无解。不能强人所难,巧妇难为无米之炊。米就是数据。

但加强外推,理论上还有 “先验知识” 和 “模型结构复杂化” 两个路子可能有点帮助,而数据本身的短板,除了数据,没有它途:就是说把对于“无人区”的外推问题,转化为收集相关数据,让外推变成内推。

模型结构复杂化在历史上是帮助了训练的“外推”能力的,如果原来的简单模型有自身的结构性局限。我们都知道,AI历史上,非线性问题在简单的单层神经网络的模型里是无解的,无论给出多少训练数据(这其实是早期神经网络的命门:单层神经无法解决 XOR 的现象)。但到了多层+非线性转换,就有解了。这其实是深度学习神经革命的最基本的立足点。这其实反映了当本质上是多维的数据被挤压在低维空间的时候,简单模型是无法跨越维度去找patterns的,相当于外推遇到了墙壁。模型复杂化就是开拓了多维空间,供训练施展。

至于“先验”对于外推的作用,我们从 Alpha-Zero 利用 self- play 再生数据可以看到效果。self-play 的好处是有游戏规则等先验知识,因此它所产生的数据,可以是在先验知识条件下的未见过的“外推”数据。

尼克:其实是个动态的过程。我按照你的理解,用美国实用主义哲学的话语替你更清晰地表述:可以证伪或者科学革命的是外推,可以证实的是内插。但是都符合奥卡姆剃刀,都是压缩。

白硕:啥叫外啥叫内?彼此互为“外”的,在一个巧妙的映射下就成了“内”。基因组和字符串,当初谁知道是外还是内呢?

鲁为民:我的理解是如果用数学语言来描述, 给定一个数据集,如果一个数据 sample 在该数据集的 Convex Hull 是内插,在Convex Hull 之外是外推。所以 Nick 说的证伪也还应该是内插,但科学革命要看革命到什么程度。

所以内插问题基本是可解的问题。从这个意义上来说 NTP 都是内插 (不过 Sonomonoff 说的下一个符号预测是外推,定义和这个不一样)。

尼克:convexity的判定非常简单,复杂性很低。

白硕:内插是纯粹连续空间里的事儿。外推符号也可以。那么问题来了,对应于符号的外推,连续模型是什么?比如离散符号集上的归纳。

还有就是奇奇怪怪的分布,用凸包就太“宽大”了,什么点都进来了。差值很容易不靠谱。把原始数据先变到某个流形上,再做凸包和内插,会精准很多。代价就是要去搞定流形。

尼克:连续是离散的近似,还是离散是连续的近似?

白硕:.1、.2,这种离散不是真离散。张三李四更离散。

中医说心是君主之官,肺是宰辅之官,肝是将军之官,这个映射是离散到离散。要嵌入向量空间还要能内插外推,不知道大模型中医书读多了会不会玩这套。

尼克:对,单说convexity从复杂性角度没意义。

鲁为民:但这个基本上可以界定对人类和机器可解的问题;比如我们可以判别哪些努力是在是制造(信息)永动机。

尼克:微分vs差分。是连续到离散。连续简单,离散就复杂。

鲁为民:所以像 Embedding 这样试图将离散问题近似为连续问题,将问题简化到利用目前的手段可解。另外通过概率方法,也可将离散问题连续化 ...

立委:如果某数据的本性就是完全随机(布朗运动?) 不存在任何有效的压缩。外与内如何区分?区分又有何意义?

尼克:有修辞的意义

立委:这类数据本性是不可计算的,但在谜底泄漏之前,只要给数据 模型(or 图灵机)就一定在计算。它在计算或压缩什么呢?

又因为所给的数据一定是有限的,这有限的数据一定会被“误读” 而且一定形成某种内外之别。不完全归纳/回归??数据驱动的方法 其实不知道 也不可能知道背后的真相。

立委:离散符号的 embedding 比起 1 hot 是降维 降维克服了数据稀疏的挑战 从而为压缩创造条件。但从传统的符号逻辑 用人为的非常有限的离散 features (例如词类与子类)来表示离散符号来看, embedding 是在增加维度。但除了 embedding 还有更好的办法 与上帝对话吗 ?embedding 的维度数,本来就具有任意性、可配置。

白硕:理论上离散的NP完全问题对应的人造数据也都可以“嵌入”连续空间。连续方法对求解有好处吗?@Nick Zhang(尼克)

尼克:看怎么嵌入了,可能对求近似解有用。

白硕:用1 hot那种嵌入。能不能找到结构相似性?比如对变元做一个permutation不变的SAT问题。

尼克:这个permutation复杂性有要求吗?

 

 

 

 

 

 

 

 

 

【立委NLP频道总目录】

 

两分钟短评:大模型开始进入平台期吗?

在Anthropic 的 Claude 3 和 谷歌 Gemini 赶上 GPT4 以后,就不断有人希望 Open AI 放出它的 GPT5 的大招来,但传说中的 5 迟迟不来,于是有人怀疑,scaling law 是不是失效了,大模型是不是进入了平台期。

这个怀疑有一定的道理,因为GPT路线上的 scaling law 不过是个经验“法则”,虽然说,it never fails us,so far,但谁也不敢保证它永远有效。遇到瓶颈不是不可能的。

微信中也开始流转马库斯最近的评论:“大模型开始进入收益递减的时代”。但他的论证感觉很有问题。

文章开始有个奇怪的递减结论的依据。在一个 100 scale 的性能上,达到 80 以后,绝对递升减缓不是一个宇宙真理吗?怎么就成了马库斯眼中的递减论的批评依据呢?我们对 “更大力出更大奇迹” 的 scaling 期望,主要是要补齐那些目前能力的空白 和 短板,最终实现 “在所有职能任务上,模型都可以达到或超过人类水平” 的 AGI 理想。因此,合理评价大模型更新换代是不是走在agi的路上,应该细致分析空白与短板,而不是用那些已经达到 接近 或超过人类水平的指标上。 也许 gpt5 确实遇到平台期或过不去的瓶颈 (我们其实不知道),但这种论证方式显得多么不靠谱。

道理上,对于已经达到或超过人类技能的指标上,应该关注不要有太大的退步(regressions)。对于一个重要的智能也有上百甚至上千种需要全面测量的模型,只要智能边界在扩大,空白被填补,低性能有增长、高性能没有大退步,就是走在正确的路上。就不能说是处于平台期。

有人看不起多模态的进步,认为那是低级的智能任务,是横向的发展,是“向下看”为应用落地,而不是向上去探顶,去加强高级智能任务的能力。这个看法,缺乏对于智能的全面深刻的理解。

多媒体方向的进步,虽然水到渠成,但其实速度很快,无论Sora的发布,还是前两天Open AI 和谷歌的全双工、实时、流式、能看能听能说、甚至能逗哏捧哏、打情骂俏的超写实助理的发布,其表现和速度实际上超出我们多数人的预期,完全不像是模型进入平台期的景象。

实时交互等于是在大模型原来的短板上大幅度增强,填补了一些空白。把这些能力排除在外,只看、只认认知智力的进展进度,是非常偏见的。

从功能对于应用的影响来看,感知多模态的加强,比起认知智能的进一步提升(例如在所有的专家资格考试中赶上人类专家),更为关键,因为模态是认知智能落地的重要接口。

从大模型本身的健康来看,多模态也是绕不过去的关,认知智能只有借助感知智能(进而结合具身智能从数字世界进入物理世界),才能算是真正落地(grounding),获得数字世界 ——或跳出数字空间获得物理世界—— 的意义。

在这一点上,LeCun 在 AGI 中特别强调感知智能是有其道理的。但LeCun过分纠结于感知和认知训练的次序了:他坚持感知智能先行,要排除语言模型的认知的“投机取巧”和对感知智能的“干扰”,淡化或无视认知智能对于感知智能的知识迁移作用,这是具体路线之争,不是特别有说服力的。

LeCun 说过: GPT 不如我家的一条狗。

这话你也不能说他错,但显然是偏激之词,带有太大的情绪。

其实,不仅LLM不及猫狗,我们人类也不及,没有猫狗的嗅觉灵敏,也没有他们躲避危险的高效。我们人类甚至在算术方面不如计算器,更甭提计算机。

so what?

对不如一条狗的LLM,我们也不能因此否认它比1000个教授和博士都更博学。LLM 可以与物理学家讨论暗物质,与语言学家谈乔姆斯基层次结构,与任何专家谈任何问题。而且所谈的并不是人们想象的那么无知和胡说,虽然里面 here and there 确实有幻觉和臆测。

切身体会是,如果你作为专家保持一种探究问题的心态和对于幻觉警惕的 alert,你会发现与它交谈比与很多中等水平的同行讨论,更有意思,或受启发。ta 看的书实在太多,而且也本性上学到了融会贯通,而不仅仅是死记答案:学到了语言,也相当程度上,学到了知识。

无视LLM这种人类智能,贬之为不如一条狗,除了发泄情绪外,只有一个价值:提醒感知智能的重要性。

不管怎样,AI 因为有了马库斯和LeCun这些“持不同政见者”,而更加丰富多彩。但我们也没必要被他们蛊惑。

 

悲观主义的视角,人类的宿命。

甜甜听到我的 piano ballad,问:are u ok, Dad?

我说,if you are blue, what you do is play piano.
我也想 play,但可惜我不会。所以,I made a piano ballad

她说,I see, 但她没想到这是 AI 歌曲。

甜有很高的音乐素养,以前一直看不起AIGC,主要是她觉得她看到的AIGC内容在似像不像之间,所谓“恐怖谷”效应,感觉 weird,但这次说,这一首的确不像是AI的,与人类艺术家产出无异。

我说,there will be more and more AIGC beyond our imagination

human like or super human like ...

我说我在公园转,循环听了这首不知道多少遍,还没有烦。谁说 AIGC 出不来可循环听放的曲子?

当连听三个 sad, 很难不泪眼模糊:人生本来就有无尽的不可承受之重。

有时候也想,人类从个体角度,绝大多数人都经历过生不如死般的磨难,和没有道理的内卷和碾压。而人类并不抽象,它是由一个个注定饱经磨难的个体组成的。那么,人类灭亡、文明毁灭又有什么了不得的?不过是落下一片白茫茫大地真干净。

绝大多数人类行为,换一个角度看,不仅仅毫无意义,而且是飞蛾扑火。

歌唱的是爱情毁灭的残酷,但传达的绝望情绪,却是所有的悲观主义哲学。

昨天,听李飞飞 Ted 演讲,她确实是个演讲高手。谈她刚下场要做的初创,做所谓“空间智能”,就是视觉 3D,也许加“具身智能”,当前的热门。

她一开篇从宇宙历史开始,说,“有了光,但没有眼睛,没有视网膜。”

听上去像是在说新约上帝造人的故事。

我在想,如果文明终结,不过就是回到“有光但没有眼睛”的世界。物质、色彩、感情、烦恼、痛苦与狂热,压缩、理性、概念与意识,这一切的一切,全部消失于无形。从哪里来,回哪里去。

一万个不情愿,我们每个人都(被)接受了个体的这个宿命,长远来看,为什么不能接受群体的宿命呢。

这个意义上,超级对齐不仅是杞人忧天,而且是要做上帝,或替上帝操心。太把自己当回事儿了。伊利亚、马斯克,无不如此,自以为超人。马斯克准备投巨资要移民火星 说是为文明买一张保险。但巨额保费谁出?打着人类的名义,本质上还是人类买单,哪怕这是从他个人的超级利润拿出来的。这其实没有道理。尤其是在还有疾病、饥饿和无数本来可以避免的痛苦的世界。

 

 

 

以前的杂记,关于AGI、马斯克、奥特曼和OpenAI

三月的时候有个新闻,伊隆马斯克起诉Open AI,引来了OpenAI奥特曼和Ilya等人的公开信,披露了Open AI草创时期的的很多细节,引起热议(例如《权力与背叛:马斯克与奥特曼如何从兄弟情走向商业对决》)。对此也颇有感慨,点评一下。
这个瓜太大,太具有戏剧性了。有很多看点 ...... 整个过程太戏剧化,更重要的是事关人类命运:
“开源了,更危险”,这是 Ilya 7年前就写的邮件给马斯克说的,马斯克表示同意。
开源更危险论是这样说的:开源以后,只要有钱就可以造出来超级模型。这种情况下,坏人更容易这么做。谁有钱谁就能做,谁愿意不管不顾谁就占先。光脚的不怕穿鞋的。所以,“核武器” 发展到一定的时候,就应该转为闭源。因为相信自己 比相信未知的对象要靠谱。
记得奥特曼当时是 YC 的 CEO,他大概把 Open AI 包装成 YC 孵化出来的 AI 企业,老马作为联合创始人和当时最大的投资人,在这一点不太满意。所以老马说,博客(说的Open AI计划)听上去不错,如果做些调整让新公司更加中立,而不是以YC为中心。
  1. 现在我们知道,是 Open AI 打开了 AGI 的大门,开启了人类文明的新时代,但走通这条路到 GPT3 或 ChatGPT 的核爆炸时刻,实在是太幸运的极小概率事件了。
  2. 老马与奥特曼这两位 AI 圈外但又接近 AI 的先知,与 Ilya 这样的圈内顶级科学家,在 AGI 的信念上,很早就非常默契:他们在计划这件事的时候,没有任何自我怀疑,好像就在谈一个事关人类命运的必然发生的事情一样。他们后来的分歧只是在实现的方式以及资源的局限上,并不在 AGI 本身。要知道那个时代,全球科学家和知识分子全体,几乎100%是不相信什么通用AI这种“鬼话”的,但地球上就有这么几个人,坚信AGI,并且能气味相投,凑在一起为之谋划,并开始担忧人类文明的命运。
  3. 他们默契,并决定成立 Open AI,是出于对于 AGI 可能被垄断的担心。具体说,是担心谷歌称霸世界:当时的谷歌已经搞出了 Alpha-go/-zero,让他们感觉此事无法缓行,必须立刻动手,以开源对抗谷歌。老马一半出于公心(为人类文明的前途忧虑),一半出于私心(希望自己成为谷歌AI的挑战者领袖,而不是放任奥特曼这些年轻人来领导)。
  4. 他对这个AGI事业和他可能扮演的角色非常投入,愿意做背后的金主,一开始就让奥特曼把第一笔融资提高一个量级,明确说,任何融资亏空他都可以补齐,隐含前提当然是他是 CEO 和 leader,最好是控股老板。按照 business 逻辑,这是完全合理的,毕竟在那样的早期,这样烧钱的AI“曼哈顿计划”,也只有老马这样识货的人才愿意成为金主。现代社会的铁律是,谁有钱,谁当家。可是奥特曼不甘心,他与Ilya几个是实际工作中的 Open AI 创始人和 AGI践行者,不甘心只做 COO 而把 CEO/Chairman 让给这个几乎是唯一靠谱的大金主。
  5. 于是上演了这一出最后分手的戏剧:老马在得不到他想得到的 CEO 或让 Open AI 依附于 Tesla 之后,决定退出。没有惊人的定力,奥特曼是不可能敢于把金主放跑的。而老马在决定离开的时候,宣判了 Open AI 的死刑:你们成功的希望为0,他说。不是老马对 AGI 的成功有丝毫怀疑,而是他觉得离开了他,Open AI 无法海量融资,只有死路一条。他当时列举了苹果和Facebook,判断这两家不可能有远见给 Open AI 输血,他却漏掉了微软,可能是根本没想到微软有此可能,他小看了微软CEO的眼光。
  6. 奥特曼怎么吸引和说服了微软,那是另一个故事了。但当时的情况是,除了老马,有钱人几乎没人能看懂 AGI 和前途,业内人士也看不懂,Open AI 就是一帮“疯子”在异想天开。融资几乎不可能,那么奥特曼怎么敢与老马分手,而不委曲求全让位给老马呢?
  7. 谁知道先知和天才不仅仅就是这几个疯子,微软CEO萨蒂亚·纳德拉(Satya Nadella)也是,虽然他离 AI 更远。萨蒂亚与奥特曼的“勾搭”是人类历史上最具浪漫色彩的一章,需要冲破种种桎梏。
  8. 现在我们似乎理解了,微软今天能超越苹果成为世界企业首富,就是英雄创造的历史:萨蒂亚是不可思议的领袖。他的悟性和远见让 Open AI 与微软结合,这是一桩非常奇特的姻缘:一方投入巨资,另一方短期看不到希望,巨资投入也带不来任何董事会决定权,萨蒂亚依然前行。世界上找不到微软这样的对象,它几乎是彼时彼刻唯一可以牵手 Open AI,摆脱它必死宿命的救星。呼唤的与被呼唤的,在千载难逢的那个时间点,没有错过。
  9. 后来的故事,所有人都知道了:这个“姻缘”彻底改变了AI,更重要的是,也改变了人类文明的走向。
  10. 其他都是花絮了:老马以维护人类的名义起诉 Open AI 违背初衷;Open AI 披露早期信件来往证明老马本人就梦想控股,并不真正在乎开源还是闭源,而他们则依然不忘初心。
顺便一提,Ilya 此前不知所踪,现在看来是被冷藏了,但他现在出来给奥特曼这个公开信背书,而且作为公开信的主要作者,似乎说明,他并没有(被)选择分道扬镳。我们的猜想是,他还在内部继续领导 AGI 的安全研究,所谓人类价值观的超级对齐,希望用技术手段保障AGI不失控。但(被要求?)保持了低调。
微软的地位其实很尴尬。一方面,现在知道他们对于 Open AI 的巨额投资,已经从股价的飞升中得到了足够的回报,所以从投资角度,萨蒂亚是微软的英雄。但另一方面,这个“婚姻”始终无法稳定,也难以建立恒久的互信。微软不得不给自己做 Plan B,而 Open AI 也有自己的 Plan B:都需要在两人分手的时候有所准备。Open AI 这种独一无二的公益实体控股企业实体的架构,改变了人类历史进程,但却天然有矛盾和不稳定。上次奥特曼被踢出而复返的危机会不会重演?奥特曼本人会不会成为 AGI 沙皇,违背初心,一意孤行?
这些都还在演进中,进行时 ......

Unified Models Surpass Single-modal Models  (Gemini Notes 2/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

02.

Multi-modal Large Unified Models Finally Surpass Specific Single-modal Models  

Humans perceive, cognize, and generate emotions and consciousness through the integration of multiple senses. Gemini is also practicing this approach, processing multiple modal inputs, integrating them in the brain, and then expressing through various modal outputs. This comprehensive "simulation" of human intelligence by such models is rapidly evolving.

Previously, multi-modal model training resembled a system composed of separate eyes, ears, arms, and brains, lacking strong coordination. However, the direction represented by Gemini feels significantly different: it's as if the large model has become a complete digital person, where hands, eyes, brain, and mouth work in harmonious silicon unity. Gemini is the first true end-to-end multi-modal system.

In the past, models optimized for a single modality usually outperformed those handling multiple modalities simultaneously. The common practice was single-modality model training. Even GPT-4 primarily "concatenates" different modalities into an overarching framework, rather than being a unified multi-modal model.

The exciting aspect of Gemini is that it was designed from the start as a native multi-modal architecture. The training process interweaves various modal data from the beginning. If previous large models were like attaching sensory organs or mechanical arms to a brain externally, Gemini is like growing its own eyes, ears, and arms internally, allowing for fluid and natural interaction.

Whether in terms of model architecture, training process, or final output, Gemini achieves a seamlessly integrated multi-modal experience.

For the first time, Gemini demonstrates that a unified model can handle all modalities, and perform even better than models focused on a single modality! For example, compared to the Whisper model, which is optimized for voice recognition, Gemini shows a significant improvement in accuracy.

This signifies the dawn of the era of unified multi-modal models.

Image

In fact, Gemini is not the first model to demonstrate that different modalities can mutually enhance performance. This was also evident in PaLM-E, where "PaLM-E, trained across different domains including general vision-language tasks at internet scale, showed a marked improvement in performance compared to models performing single tasks in robotics."

Another example of modalities enhancing each other is the multilingual processing ability of large language models. If we consider different languages as distinct "modalities," the practice of large language models has proven that processing native data of all languages together (through tokenization and embedding) managed to lead to the successful construction of a human language tower of Babel.

The overwhelming amount of English data in the training of large language models also benefits the model's understanding and generation of languages with limited data, reaffirming the transfer of linguistic knowledge. It's akin to a person skilled in tennis also being able to improve their abilities in squash or golf through related skills.

Since the rise of large models in February this year, many have gradually embraced the belief that "unified multi-modal models will surpass single-modality models." However, this belief hadn't been confirmed on a large scale until Google's Gemini showcased the prospects of this belief, reshaping and solidifying it for many.

In the future, specialized models for tasks like voice recognition or machine translation may become less significant. Many generative tasks such as TTS and image generation are also likely to be unified under large models. Some may complain about the high cost and slow speed of large unified models, but these are purely technical challenges. In practice, we can distill unified models to specific modalities or scenarios.

We firmly believe that unified cross-modal large models will become the mainstream pathway to achieving AGI.

Furthermore, "modalities" are not just sound, images, videos, etc. Olfactory, gustatory, tactile, temperature, and humidity sensors are also different modalities for gathering environmental information, all of which can in time be encompassed by unified models.

Ultimately, various modalities are merely carriers of "information." They are a form of rendering, a presentation style, a means for an intelligent entity to interact with the physical world. In the eyes of a unified model, all modalities internally can be represented by unified multi-dimensional vectors, enabling cross-modal knowledge transfer and the intersection, alignment, fusion, and reasoning of information.

When the barriers between modalities are breached, revealing the core beneath various renderings, we see the origin of cognition — language.

 

 

 

(Gemini Notes Series to be continued)

 

Original from:

关于 Google Gemini 的八点启示

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

Cross-modal Knowledge Transfer of Large Models Proven (Gemini Notes 1/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

Image

In 1948, inspired by psychiatric patients, British doctor Ross Ashby invented a peculiar machine called the "Homeostat." He proclaimed that this device, costing about 50 pounds, was "the closest thing to an artificial brain ever designed by mankind." The Homeostat utilized four bomb control switch gear devices from the British Royal Air Force, used during World War II, as its base. Above these were four cubic aluminum boxes, with the only visible moving parts being four small magnetic needles on top of the boxes, swaying like compass needles in a small trough of water.

When the machine was activated, the needles moved in response to the electric current from the aluminum boxes. The four magnetic needles were always in a sensitive and fragile state of balance. The sole purpose of the Homeostat was to keep the needles centered, maintaining a "comfortable" state for the machine.

Ashby experimented with various methods to make the machine "uncomfortable," such as reversing the polarity of the electrical connections or the direction of the needles. However, the machine always found ways to adapt to the new state and re-center the needles. Ashby described the machine as "actively" resisting any disturbances to its balance through synaptic action, performing "coordinated activities" to regain equilibrium.

Ashby believed that one day, such a "primitive device" could evolve into an artificial brain more powerful than any human, capable of solving the world's most complex and challenging problems.

Despite Ashby's lack of knowledge about today's AGI evolution and the laughable idea of using four small magnetic needles as sensors for intelligence, his Homeostat fundamentally challenged everyone's understanding of "intelligence" - isn't intelligence the ability to absorb information from the environment in various modalities, and to modify behavior and responses based on feedback?

From the peculiar "Homeostat" to today, 75 years later, Google's Gemini, which claims to have surpassed human multi-modal task processing abilities, accelerates towards the evolution of billions of years of carbon-based intelligence through the injection of multi-modal native big data.

The acceleration speed of machine intelligence evolution today far exceeds our imagination. A year ago, OpenAI overturned Google's long-established AI position with its 'brute force aesthetic,' having constructed the Babel Tower of human languages. A year later, Google countered with Gemini, via a 'fight fire with fire' approach to building the first unified cross-modal model, setting another milestone in AGI evolution.

Despite initial skepticism over exaggerated video demos upon Gemini's release, it's undeniable that the dawn of a unified multi-modal approach is shining. What capabilities does Gemini confirm? How will Google's wheels of fate turn? Is time a friend to OpenAI or Google? What does multi-modality mean for Agents and embodied intelligence? Are the foundations for the emergence of AGI with consciousness already in place? How should we view the implications of Gemini for the AI future?

01.

Cross-modal Knowledge Transfer of Large Models Proven Again

For humans, the ability to transfer knowledge across various domains and through different timespaces is more important than merely learning skills. If machines can master cross-modal knowledge transfer, they edge closer to "intelligence generality."
 
In July this year, Google introduced RT-2, a robotic system based on large models, sparking hope for general-purpose robots.  The system's robotic arm, leveraging the "common sense" of language models, demonstrated the ability to "pick up an extinct animal from a table," moving from common sense reasoning to robotic execution, showcasing cross-modal knowledge transfer. 
 
In December, the introduction of Gemini by this tech giant reaffirmed the cross-modal knowledge transfer capability of large models: the "common sense" of language models could be transferred to the training of other non-linguistic modalities added later. Language models are known to form the foundation of cognitive intelligence, and the most basic form of cognitive intelligence is "common sense."  Without common sense empowerment, the practical application of large multi-modal models would be challenging.  Gemini smoothly transfers this "common sense" to downstream multi-modal tasks.  Like RT-2, it achieves cross-modal integration through the transfer of text-derived knowledge — Gemini can connect ontology concepts to the understanding of auditory and visual objects, and eventually link them with action, forming an intelligent system ready for real world application. 
 
From the perspective of model training, compared to language models trained with massive internet data, downstream models (like robotic models) can be trained with very limited data through knowledge transfer.  This transfer-based training manages to address the long-standing issue of data scarcity in downstream applications.  For instance, to achieve the effects shown in the video (which raised doubts about Gemini's video comprehension or picture comprehension but did not affect the discussion on cross-modal knowledge transfer here), Gemini first needs some ontological knowledge — it understands the concept of a duck, knows the usual color of ducks, and what blue is. When it sees a "blue duck," it reacts similarly to humans, expressing the "common sense" that "blue ducks are uncommon." 
 
Image
 
Gemini, through auditory and visual perception, identifies that the material of the blue duck is rubber and knows that rubber's density is less than water's. Based on this common sense and reasoning, when it hears a squeaking sound, it can predict that "the blue duck can float on water." 
 
Image
 
From RT-2 to Gemini, we've moved to the "fusion" of multi-modal perceptual intelligence and cognitive intelligence. We've transitioned from isolated "five senses" modules of eyes, ears, mouth, nose, and body to a unified digital "human". 
 
Doesn't this imply that on the path to simulating human intelligence, the unified model is the right approach? 

 

 

 

(Gemini Notes Series to be continued)

 

Original from:

关于 Google Gemini 的八点启示

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

语言是大一统模型里的核心和主线

作者 | 高佳   李维
创意 | 李志飞
在我们想象的AGI系统里,其核心和主线是视觉还是语言呢?

有人认为是视觉,但我们坚信语言才是核心,因为视觉反映的是动物共有的感官能力,而语言(包括口语和后来的书面语言文字)则是人类所独有的符号系统。它承载了人类千万年来的认知传承和知识积淀。
语言是是人类认知智能的外在表示,它是人类文明诞生的重要标志。著名以色列历史学家赫拉利在《人类简史》中说,是人类语言赋予的“讲故事”的能力,使得人类具有任何动物都不具有的组织能力,从而催生了文明,让人类成为地球的主宰。
语言是认知智能的起点和源泉,人类的语言信息中蕴含了人类高度抽象的概念层级体系,包括本体知识(ontology)及其常识,也包括更广泛的世界知识和更纵深的领域知识。这些知识是人类的高阶智能如逻辑推理的基础。而声音、图片和视频则更加感性,表示的是人类以及高级动物的情绪和具象能力,对应的是感知智能。
从感知到认知,从情绪到逻辑,当模型将它们融汇贯通,这才能真正模拟人类大脑的心智状态,也才称得上是完整的人工智能。多模态大一统的模型,填平了感知智能与认知智能的鸿沟,才是实现完整人工智能的希望所在。
在 RT-2 和 Gemini 中,以语言为基础的认知智能始终是人类知识模拟的核心,其中常识及其推理的知识迁移起到了关键作用。例如在 RT-2 中,反映语言模态的数据量和参数规模都远远大于下游的图片和动作模态的规模。
在原生态的跨模态大数据中,语言大数据总是处于核心地位。可以预测,未来的AI系统,不管目标是不是语言任务,都要把语言模型作为基础模型和训练的起点,其他模态或任务的下游数据可以在语言模型的基础上继继训练,以便继承和迁移语言模型强大的认知能力。
这一点做到了,就凸显了语言模型对AGI的最大贡献,因为它真正体现了研究人员对语言大模型的初心和定位——作为 Foundation ModelCore Engine.

全文原稿在(from):
关于 Google Gemini 的八点启示

 

 

 

《硅谷神剧回顾》

立委按: 生活比戏剧更戏剧, 虚拟比现实更现实; Turbo 比 GPT 更 GPT, AI 比智能更智能,是为AGI。

 

### OpenAI 剧情回顾:硅谷戏剧全纪录

#### 第一幕:引火 - 山姆·奥特曼被解雇

故事始 于 OpenAI 董事会一个突然且有争议的举动:CEO 山姆·奥特曼的意外解雇。此举在硅谷引发了轩然大波,标志着一场前所未有的公司戏剧的开幕。

- **亮点**:董事会指责奥特曼在与董事会的沟通中缺乏坦诚,这一指控后来成为争议的核心。
- **关键人物**:山姆·奥特曼,因引领 OpenAI 进入新领域而闻名,现在却突然被驱逐,为接下来的剧情奠定了基础。

#### 第二幕:后果与反抗

在奥特曼被解雇后,公司陷入混乱。一股由关键员工领导的反对派声音强烈反对董事会的决定,要求恢复奥特曼的职位。

- **亮点**:近500名员工威胁离职,除非董事会辞职并恢复奥特曼和联合创始人格雷格·布罗克曼的职位。
- **关键人物**:联合创始人兼前总裁格雷格·布罗克曼成为反抗董事会决定的象征。

#### 第三幕:伊利亚的后悔与公开信

在一个出人意料的转折中,被指责策划奥特曼出局的 OpenAI 首席技术官伊利亚·苏茨克维公开表达了他的后悔。这一认错为这场戏剧增添了新的复杂层次。

- **亮点**:伊利亚在社交媒体上的公开后悔和他参与的要求董事会辞职的公开信。
- **关键人物**:伊利亚·苏茨克维的角色从被指责的策划者转变为悔恨的关键人物,寻求修复 OpenAI 内部的裂痕。

#### 第四幕:董事会的困境与新任 CEO

在巨大的压力下,董事会发现自己处于十字路口。与此同时,新任 CEO Emmett Shear 被任命,标志着 OpenAI 可能的发展方向转变。

- **亮点**:Emmett Shear 的任命和他对 AI 发展的保守态度,与奥特曼的激进增长战略形成鲜明对比。
- **关键人物**:Emmett Shear,作为一股可能稳定混乱局势的力量,代表了 OpenAI 的新篇章。

#### 第五幕:转投微软与 OpenAI 的未来

在一系列戏剧性的事件中,奥特曼和几位关键成员宣布加入微软,实际上在这个科技巨头内部创造了一个强大的团队。

- **亮点**:微软成为主要角色,吸收了 OpenAI 的人才,可能重新定义 AI 领域的格局。
- **关键人物**:山姆·奥特曼转投微软,被视为一种战略高招,可能改变 AI 发展的未来轨迹。

#### 终幕:持续进行的剧情

这场戏剧暂时告一段落,OpenAI 正处于关键时刻。它的领导层、发展方向和核心理念都处于变动之中,这些事件的影响继续在科技界波及。

- **回顾**:从奥特曼被解雇到现在,OpenAI 的剧情回顾了权力斗争、意识形态和硅谷 AI 领域未来的集中展现。
- **关键收获**:这一事件证明了领导尖端 AI 组织的复杂性,技术抱负与人类动态和企业权力游戏交织在一起。

*这一综合回顾作为对 OpenAI 持续戏剧的闪回,突出了塑造这一硅谷历史非凡章节的关键时刻和人物。*

~~~~~~~~~~~~~~~~~~~~

### OpenAI 动荡剧情:双语剧本

#### 第一幕:疑云初起 / Act 1: The Beginning of Doubts

**场景**:OpenAI 办公室,员工们围坐讨论。
**Scene**: OpenAI office, employees gathered in discussion.

- **员工甲**(激动):「你们听说了吗?Sam 被解雇了!」
- **Employee A** (Excited): "Have you heard? Sam has been fired!"
- **员工乙**(震惊):「怎么可能!Sam 是我们的灵魂人物!」
- **Employee B** (Shocked): "How is that possible! Sam is our soul!"
- **员工丙**(沉思):「这背后一定有更复杂的故事。」
- **Employee C** (Thoughtful): "There must be a more complex story behind this."

#### 第二幕:董事会的难题 / Act 2: The Board's Dilemma

**场景**:董事会会议室。
**Scene**: The boardroom.

- **董事甲**:「我们必须要有新的领导,Sam 的领导方式不再适合我们。」
- **Director A**: "We need new leadership, Sam's way of leading is no longer suitable for us."
- **董事乙**:「但这样的决定会引起巨大的反响,我们准备好了吗?」
- **Director B**: "But such a decision will cause a huge backlash, are we ready for it?"
- **董事丙**(坚定):「为了公司的未来,我们必须要做出艰难的决定。」
- **Director C** (Firm): "For the future of the company, we must make tough decisions."

#### 第三幕:伊利亚的后悔 / Act 3: Ilya's Regret

**场景**:伊利亚的办公室,他焦虑地走来走去。
**Scene**: Ilya's office, he paces anxiously.

- **伊利亚**(自言自语):「我做错了... 我不应该那样做... 我需要公开道歉。」
- **Ilya** (Muttering to himself): "I did wrong... I shouldn't have done that... I need to apologize publicly."
- **助手**(担忧):「这样会不会引起更大的混乱?」
- **Assistant** (Worried): "Won't this cause even more chaos?"
- **伊利亚**(坚定):「我必须要承担责任。」
- **Ilya** (Determined): "I must take responsibility."

#### 第四幕:员工的反抗 / Act 4: Employees' Revolt

**场景**:OpenAI 大厅,员工们聚集。
**Scene**: OpenAI hall, employees gather.

- **员工甲**:「我们不能接受这样的决定!我们要写一封信给董事会!」
- **Employee A**: "We can't accept such a decision! We need to write a letter to the board!"
- **员工乙**:「对,我们要求他们辞职,要求Sam回来!」
- **Employee B**: "Yes, we demand their resignation and demand Sam's return!"
- **众员工**(齐声):「OpenAI没有我们就是一无是处!」
- **All Employees** (In unison): "OpenAI is nothing without us!"

#### 第五幕:微软的招手 / Act 5: Microsoft's Invitation

**场景**:微软总部,Satya Nadella 与 Sam 和 Greg 会面。
**Scene**: Microsoft Headquarters, Satya Nadella meets with Sam and Greg.

- **Satya**(微笑):「欢迎加入微软,Sam。我们会一起创造不可思议的事物。」
- **Satya** (Smiling): "Welcome

to Microsoft, Sam. Together, we will create incredible things."
- **Sam**:「我很期待这个新的开始,我们会创造新的辉煌。」
- **Sam**: "I look forward to this new beginning, we will create new glories."
- **Greg**:「是的,这是我们的新使命。」
- **Greg**: "Yes, this is our new mission."

#### 第六幕:终幕 / Act 6: The Finale

**场景**:OpenAI 办公室,员工们聚在一起。
**Scene**: OpenAI office, employees come together.

- **员工甲**:「现在怎么办?Sam 和 Greg 都走了。」
- **Employee A**: "What do we do now? Sam and Greg are gone."
- **员工乙**(坚定):「我们必须要继续前进,为了我们的使命。」
- **Employee B** (Resolute): "We must continue to move forward, for our mission."
- **众员工**(齐声):「OpenAI是我们的家,我们会一起度过难关!」
- **All Employees** (In unison): "OpenAI is our home, we will get through this together!"

*本剧本创意基于最近 OpenAI 发生的一系列戏剧性事件,旨在通过对话和场景刻画,双语呈现这个引人入胜的科技界故事。*

AIGC时代,关于鲁迅大脑的复活

这个话题,在国内怕惹麻烦,还是写在这里吧。也借此机会与老友分享一下我目前聚焦的工作,以及这个领域持续令人兴奋的热点。

《清晨时刻: 每日GPT》可以成为一个专栏,关于 GPTs(GPT Builder / GPT Store / GPTs by Wei Li)似乎每天都有新的进展或体验可以分享。

今天的进展是,我对我前几天制造的“鲁迅具身”的质量不满,因为不像,倘若鲁迅纪念馆真要让我为他们的大屏的鲁迅数字人提供虚拟大脑的话,我觉得目前我做的这个GPT还不合格:虽然可以源源不断请他老人家在元宇宙发声,每次都有不同,语言也通顺,但风格模仿还是差强人意。

除了把抱怨当作 bug reports 直接反馈给 GPT Builder,我开始从网上收集鲁迅先生的文集 PDF,填入 local knowledge,并指令它从中学会鲁迅的言谈风格。今天填进去的文集是:

这几乎就是一本鲁迅先生的文学类“全集”了吧,排除了鲁迅先生“硬译”的外国文学译品,以及家长里短的乏味的日记等,觉得是一个合适的 feed,可以让 GPT 聚焦其文学风格。

Quote
原文序言:序 言
这是一套鲁迅小说、散文、诗歌和杂文等文学作品的选集。
圆园世纪猿园年代以来,《鲁迅全集》、《鲁迅选集》时有出版。“全集”版本虽不很多,印数却相当可观;“选集” 更是版本繁富,数量浩大;比较起来,只收鲁迅文学作品的全集,却显得较少。许多读者觉得“全集”太大,因为日记、书信、序跋、学术著作,没有纳入他们的必读范围;“选集”又欠精,他们手头需要一部像本书这样的鲁迅文学作品的全集。
.........

把这本文集作为 local knowledge (类似于 GPT-PDF 的 rag) 喂进去,鲁迅先生(大脑具身)的表现会有所改善么?我们试试。

GPT Builder 强调,为了 access (local)knowledge,需要打开内置插件 code interpreter,我在 config 中确认了已经打开。

上传上去后,似乎无需等待时间,就立即开始起作用了(内部快速建立一个类似向量知识库的东西还是其他什么 embedding retrieval 方式?总之都是 OpenAI GPT Builder 平台北部搞定的,不用我们用户操心)。

好,我们来试试效果。(作为小白鼠,先给个警告,鲁迅先生向来以辛辣著名,时评不可能“政治正确” -- 这正是他老人家最厌恶的东西,所以很多人说过,他老人家虽然极受毛主席推崇,但倘若活到1957年,肯定是要打下去的最大右派。)

鲁迅先生向来以辛辣著名,时评也充满讽刺,不一定讨好。但忠言逆耳,我们不妨不时听听复活的鲁迅是怎么俯瞰天下大势的。

以上就是他老人家最新的时评。是我请他老人家写的。(群内供研究,不外传,也不必上纲上线,阅后可焚。我想展示的是 AI 的惊人内功。再说一遍,群内都是我熟知的老友,此件务必不外传,不惹麻烦。不合时宜的话语是他的风格,这里的本义只有AI研究。)

虽然鲁迅具身作为中国近代最伟大的思想家的元宇宙大脑,还有很多优化的工作可做,但初步的实验已经显示出鲁迅风格和人格的复活。今晨做这个实验的时候,我看着屏幕上他老人家喷涌而出的时评,感到了一种时空的穿越。这比前天我刚做“鲁迅具身”上线的时候,表现逼真太多了。质量只会越来越好,我会持续维持和加强GPT的迭代更新。

到底 AI 做 character,复活古人、名人、思想家、艺术家,是不是一个靠谱的目标?

我们知道,复活名人的外表早已不是问题,蜡像馆就是成功案例。现在我们的2D3D的奇妙元数字人也是栩栩如生。复活声音也不是大的挑战,我们有亚洲AIGC业务最强的魔音工坊,很快都可以搞定。最难复活的还是大脑。而大脑,非 LLM 不可。现在只是一个开始。

这个实验不幸有点敏感,以后我会做一些其他名人的GPT大脑。然后用这个大脑发出对于当今世界的评论,并以此驱动奇妙元数字人的形象,可源源不断制作出鲜活生动的元宇宙大师来。其实,如果能够协调好监管,也完全可以实现博物馆历史名人实时与参观者的交互:技术条件已经具备。可以预见,这类落地由于政策的相对宽松,海外会走在前面。

character AI 虽然面对 Open AI 平台的碾压,也还是聚集了足够的人气和社区,正在 AI characters 的方向上前进。国内也有几家出海产品,正在尝试进入这个市场。

我已经公开发布我制作的【鲁迅先生(GPT具身)】,有 ChatGPT Plus 注册的朋友都可以在此尝试,欢迎反馈和 bug reports,我的迭代更新会是秒速(只要有反馈,可以做到日迭代,这是因为在“LLM对话驱动编程”的新范式下,现在的 bug reports 可以直接扔给平台,GPT Builder 会实时迭代,无需等待):

https://chat.openai.com/g/g-zeYHL1uSG-lu-xun-xian-sheng-ju-shen

个性化精调模型 AIGC 小妹(9)

这是精调训练的老照片样本:

                                

 

其中有一半系统认为不符合样本标准,删除后只剩下10张左右的照片做微调训练用。训练10分钟形成用户专有模型,利用模版化的提示词产出如下图片(做了拣选,单月选了三分之一),觉得效果还不错(前两张高清4MB与1MB):

 

《朝华之四: 小妹》

个性化精调图片生成实验(1)

个性化精调图片生成实验(2)

个性化精调图片生成实验(3)- AIGC 甜

 

个性化精调图片生成实验(4)

个性化精调图片生成实验(5)

个性化精调图片生成实验(6): AIGC立委先生

个性化精调模型 AIGC 老哥(7)

 

个性化精调模型 AIGC 老爸(8)

个性化精调模型 AIGC 小妹(9)

 

个性化精调模型 AIGC 老爸(8)

半年前,我用过一个图形软件刚推出来的 个性化 fine tune 模型 feature,给老爸老照片做了精调,效果不好(碰运气,有的用户反应说效果很好),出来的形象老爸说不像。这是半年前的图片生成:

虽然有点影子,家里人都觉得总体不像。

现在重新做 fine tune,用的是 SDXL 1.0-finetune,效果似乎明显改善了。

但是,AI 预测人的不同年龄,实际上也是瞎蒙。因为随着岁月增长,人的形象改变有不同的方向,包括疾病、锻炼、营养等因素吧。这是 AI 根据老照片预测的90岁的形象:

这是老爸现在(88岁)的照片:

不能说预测完全离谱,但确实不像。

人物肖像应该是所有图画中,用生成模型产生作品最难让人满意的了,这是因为人的眼光对人的细微差别特别敏感,尤其是要让本人和亲友感觉很像,这是很难的。现在的 fine tune 水平,大约可以做到每生成四张,能有一张让人觉得像的,或可以接受的。对于特别挑剔的眼光,或者近距离的亲人来说,大约每10张生成能出现一张即便最挑剔的眼光也难以拒绝的作品来,不时还会让人感觉惊喜或震撼。

AIGC 甜甜儿时的尝试中就有一些惊喜,例如下面博文的前面几张肖像:

个性化精调图片生成实验(3)- AIGC 甜

尤其是这一幅水粉画,非常像,也很艺术:

我们人类看世界,由近而远。譬如,大千世界的实体,根据不同品类,其实在我们眼中都差不多。例如野生动物,这只虎与另一只虎,我们通常感觉都差不多(动物园饲养员自然会有更细致的区别能力)。到了宠物就有所不同,因为宠物进入了家庭,我们会坚持自己的猫咪与别人家的同类型的猫咪有所不同,但也还是大同小异。

我们看外国人,一开始觉得都长得差不多,大体上根据肤色、种族、性别和年龄,有一些类别而已,实体个体的差异我们没有那么敏感。据了解,西人看东亚人其实也觉得长得都差不多。但同种族内,我们就会对人的形象有各种区分,甚至一眼能看出一个人是从哪个地区来的。

到了亲友和熟人,细微的差别也都能看出不同来。所以,画得像不像很难骗过身边的亲友。俗话说,画鬼容易画人难。这对模型是一个极大的考验,尤其是考虑到生成模型实际上具有以下容易走偏的特征:fine tune 的样本有限,通常在 10-30张之间,与预训练基础大模型完全不成比例。

天然具有随机性的生成模型,其原理是根据预训练的基本模型所学到的人类形象的普遍特征,然后通过少量的 finetune 来逼近一个特定的实体形象。显然共性与个性的样本不成比例。这种情况下,能够迅速从人类的一般形象具像化到一个特定的实体,仅仅是少数几张样本的 trigger,这是一件一年前还难以想象的事情。把一个人的特征抓住,重现出不同场景的形象,做到真假莫辨,要让自己和亲友惊喜、服气,现在基本做到了。如今基础模型的发展及其 fine tune 技术,做到了对结果的可靠性有一定的保障了。

这其实开辟了很大的个人用图的想象空间,因为人的本性都是自我中心(“自我”的延伸也包括自己的亲友)。自拍为什么流行全世界,正是因为符合了人的本性。半年前就见到有修图软件配备了类似的能力,推出了“情侣照”系列,可以让任何 couple 惊喜。

当然,四分之一的良品率,10分之一的惊艳率,听上去还不够好,因为次品还是太多了。但考虑到生成模型可以没完没了快速生成,而人的判断拣选则是非常简单、直觉的,这个比例已经不会成为实际使用的障碍了。当然这里有个生成(属于“推理”)过程的成本问题,毕竟推理需要在线的算力。不过,成本会随着时间和技术进步而下降。

从商业模式来看,订阅式(例如缴纳年费)目前是给你一定量的 credits,每生成一次要用n个credits,以此来控制成本,限制滥用。但随着AIGC产品和服务的内卷和白菜化,不久就会出现类似手机流量公司推出过的 unlimited plan。这样来看 1/4 或 1/10,成本最终也不是问题。何况,随着模型技术的爬升,良品率有望进一步提高。

由于职业关系和技术控的思维定势,我对于业界领先的订阅付费式的AI工具和服务(chat,mj,nightcafe ......) 一律做 early adopters,好与我们的复现或创新工作有所比对。你会发现,AIGC 目前的确让人眼花缭乱,不断在演进。这是一个令人兴奋的技术爆发时代。

 

个性化精调图片生成实验(1)

个性化精调图片生成实验(2)

个性化精调图片生成实验(3)- AIGC 甜

 

个性化精调图片生成实验(4)

个性化精调图片生成实验(5)

个性化精调图片生成实验(6): AIGC立委先生

个性化精调模型 AIGC 老哥(7)

 

个性化精调模型 AIGC 老爸(8)

个性化精调模型 AIGC 小妹(9)

 

个性化精调模型 AIGC 老哥(7)

 

个性化精调图片生成实验(1)

个性化精调图片生成实验(2)

个性化精调图片生成实验(3)- AIGC 甜

 

个性化精调图片生成实验(4)

个性化精调图片生成实验(5)

个性化精调图片生成实验(6): AIGC立委先生

个性化精调模型 AIGC 老哥(7)

个性化精调模型 AIGC 老爸(8)

 

 

小雅系列:短视频文案

 

《小雅人生系列》

我是小雅,立委先生打造的数字主播品牌,关注科技与生活的点点滴滴。

我今天在想竖屏、横屏的事情,寻思下来觉得有点意思。这个问题或矛盾的起源,感觉是来自于听说器官和视力器官的“错位”。怎么讲?
电话为的是听说,必须竖着来,因为嘴巴到耳朵之间有距离。为了够得着口、耳两个端点,传统电话设计成圆弧形,智能电话做成了长条形,竖着拿,倾斜45度角,基本上可以把耳朵与嘴巴连起来。

这样一来,竖屏就成了智能电话的最常见的默认形态。说默认是因为,理论上你总可以把竖屏横过来变成(宽银幕)横屏。实际上我们看视屏有时候也确实这么做,但毕竟不仅多了一个动作,手握横屏也不自然,加上在竖屏中的横屏视频还需要软件配合,才能支持需要90度旋转的横屏,而软件并不总是聪明友好。由于这些原因,短视频霸王抖音就坚持用竖屏作为默认。

久而久之,用户也习惯了看竖屏,用手指上下滑动翻屏,成为信息接受的最简易懒惰和放松的方式。全民刷短视频的习惯就此形成,虽然这个习惯显然不符合人类眼睛的设计。人成为信息时代最懒惰、最被动、也最容易满足于自己信息茧房的动物。

双眼是水平设计的,为的是看到更大广度。从视野雷达角度看,这个世界的水平方向的信息,显然比上下方向的信息更加丰富密集。目前大约能看到180度左右的水平视野,有些动物双眼长在两侧,比人类强,大约可以看到270度的视野,这样对于感知危险和逃生更有益。

动物没有在后脑勺进化出第三只眼或第四只眼,是进化历史上的一个遗憾和谜团,道理上360度无死角的水平视野才是最有利于生存的。人类技术弥补了这个不足,自动驾驶车辆上的 cameras 至少8个以上,就做到了360度无死角。

祸从天降的事情相对小概率,所以感知地上的危险和机会(譬如食物或捕猎对象)更加重要。这就是双眼水平设计的上帝理念。到了人人手机的时代,竖屏居然风行,双眼的水平优势被晾在一边。可见--也许,人的懒惰本性超越了人类的功能性。

当然,现代的世界与丛林不一样,危险也不是无处不在,虽然拿着手机跌进坑里去的事故也时有报道。

我是小雅,每次几分钟,与您分享不一样的科技生活视角。

大模型短视频系列:大模型压缩与白马非马

 
 
 
 
从白马非马说起
 
大家好,我是出门问问李维的数字分身,这是我的短视频频道。
 
今天我们讲一讲著名的公孙龙的“白马非马”问题。网上最近的讨论主要是从形式逻辑出发,说明这个听上去是悖论的说法,实际上是因为语词的模糊性造成,基层逻辑其实很简单。动词“非”是多义的,既可以表示等价,也可以表示属于。白马不等于马,但白马属于马。这样分开来,非常简单明了。
 
但这里我想从哲学思辨的角度并结合大模型压缩的话题,重新剖解诠释这个老命题,提供新的视角。
 
我觉得这里的“白马”不是“白色的马”的概念,而是哲学家手指指向的“那匹白马”:你看那匹哲学家马厩旁正在吃草的白马。顺便一提,白马前面有吧个定冠词,零形式。中文没有发展出定冠词,只有指示代词,并不影响哲学家那样用它,所以,哲学家的白马,我认定是映入我们眼中的那个实体。换句话说,白马是具象化的特定实体,而不是泛指所有的白颜色的马的概念,这在认知科学中叫本体,与一个个的实体想对照,是实体的抽象结果。
 
我觉得白马非马很哲学,是因为这个哲学不承认本体,只认实体。只有具体的一头头的这匹白马、那匹黑马、张家刚出生的小马、李将军的那匹战马等等,世界上哪里会有抽象的马呢?这就有意思了,这是不同的世界观。
 
这类哲学家认为,放眼望去,所见皆实体,实体才是客观世界的本质,而本体只是人类社会发展出来的主管系统,具体说,是人脑的产物或反映。人类是一种奇怪的动物,自从走出非洲森林,人脑开始发达,语言和思维卷来卷去,就卷出来这一整套本体论,叫 ontology,硬是为一片混沌的世界建立了秩序。
 
在蚂蚁的眼中,是不应该有本体这种独属于人类认知的实体幻象的,最多也不过是一种极其粗糙的分类体系,例如把世界划分为食物、危险等感知类别。到了认知层面的概念体系,动物是缺失的,非生物更无从谈起。
 
什么是现实?现实到底是什么组成的?看到的,听到的,感知到的,是现实吗?最多就是现实的影子吧。最典型的案例就是世界的五彩缤纷,没有人眼这个感知器,及其人脑的神经处理,我们的色彩体验就不复存在。感知智能尚且如此“虚幻”,更遑论认知智能。
 
“马”的认知大概率是虚幻的,可哲学家门口“那匹白马”却大概率是一个真实的存在。这个矛盾过去无解,现在也还是无解。
 
但是,大模型是建立了概念体系的,当然是一种仿真。最近流行的大模型的压缩理论,我的理解就是蕴含了仿真的人类认知概念体系。说 LLM 通过多层神经一路压缩,压缩造就了机器智能,机器智能因此逼近了人类认知。这看上去非常符合我们从模型中观察到的对世界的惊人的归纳和理解能力。可以说这是大模型最神奇的地方,因为它不仅仅是海量记忆,而是记忆之上也从很多维度对于实体做了归纳抽象,在它的多维向量的大肚子里面,隐形的结构层次是蕴含在内的。大模型的多层压缩很像是人类文明漫长的认知演化过程的一个浓缩版。
 
结构层次的符号化表示就是带有节点的图或树,分为表示概念的非终结节点和表示实体的终结节点。这样来看,哲学家的白马并不是本体的下一级非终结节点,而直接就是那一片叶子,即终结节点。
 
一个假说是,世界本来都是终结的节点,只是人脑容量有限,不得不人为聚类,逐渐建立非终结节点,然后发明了语言来给这些聚类结果强加了分类符号,即概念。人类只有这样烟花,才能把握世界,适者生存,最后爬到了食物链的顶端。
 
有人担心大模型的加速度发展,通过所谓脑机接口,最终会发展出一种永生的超级实体。这种实体超越了碳基生命的脆弱和宿命,带着起源于人类的认知和思想,永续发展为更高级的文明。
 
经过几万年演化产生的人类认知,最多不过是世界的一个幻象。那么,经过几周训练出来的LLM认知,只能是幻象的幻象。影子的影子有一天会统治世界,永续发展,听上去不是匪夷所思吗?但老马与辛顿警告的正是这个威胁。与其远虑,不如近忧,还是先议议人类如何面对正在到来的真假莫辨的世界吧。技术条件已经具备,假象尚未全面泛滥(yet),这只能看成是人类的运气。但时间并不多了。
 
至于机器智能的永续发展,你信还是不信?我不相信!
 
比起文明永生,我觉得白马非马的世界观更加合理。离开人脑,世界就坍缩,本体灰飞烟灭,唯实体长存。死寂、连续、无区别,可能这才是世界的本来面目。凡主观皆幻象。人类智能本来就是幻象,人脑的产物。幻像终归破灭。这很残忍,但却是文明的宿命。哪里有幻象的模型或影子,可以永续长存的呢。
 
 
朋友,您是怎么看大模型的未来,以及人类文明的终局呢?思绪飞扬,欢迎评论区分享您的高见。
 
我是出门问问李维,每次几分钟与您分享AI大模型方面有角度的思考。
 
 
 
【后记】
 

关于白马非马,老友有所批评,很切要害:

信息似乎太浓了。“白马非马”,稍作展开,并提及它的普适性,以有趣故事切入,算是高招;更贴近一点大众,还可以引入“男(女)朋友不是朋友”或“朋友不是男(女)朋友”,巩固一下吸引力;至于实证论(positivism)和建构论(constructivism),应该能够借鉴一些别人的阐释,取简单易懂的语言表达;同理,“模型”部分也会有很好的例子可以借鉴,除了研究的需要,它也是人脑或电脑的自我保护。不纲举目张,人工智能或者人脑都会宕机!模型方法几乎与人同在几千年,“大”模型的大字怎么讲好,有些难度,毕竟新事物可借鉴的先例不多。总的方法是,能够借鉴或者找到答案的东西,则绝不去苦思冥想;好钢用在刀刃上,别人没干过的东西,就手脑并用,尽力造成“子弹很多,目标很小”的局面,用牛刀宰鸡,一举攻克!
“Parsimonious”是一种建模者追求的特性。其实,鲁迅坚持在写作中除去可有可无的字句也是一种parsimonious!
我不喜欢字典里的“吝啬”译法,没有体现“惜墨如金”的意思!
录视频也类似于讲课,力求举重若轻,给人以云淡风轻的感觉[Smile]
老友是老教授,德高望重的老学者,治学、讲学和生活都很严谨,我辈码农,望尘莫及。都是平时闲聊以后汇集的急就章,谈不上思想深邃 也没有精雕细刻。感谢小伙伴的后期渲染,短视频看上去不那么枯燥 平淡了。思绪飞扬 天马行空 也总算雁过留声 马过带风 不至于无影无踪。
 
 
 

AI创作花絮: 《影月无痕》

同一个咒语提示词给img+txt2img,生成了两个形象,反差极大。输入的小雅图片是:

输入的咒语是: 侧面照,girl next door
输出的两幅“侧面照”是:

模型的不稳定表现在,同样的咒语生成了上述玉照,也生成了上面的 monster(?)lol 好在一切都是 copilot,最终由人来拣选和把关,作为图片生成助手,用起来没有问题。

但仔细看,两个形象又有相似之处。寻思可以让大模型写个电影脚本,制造一种剧情,把这两个形象联系起来,例如,白天是美女,晚上成武侠。也许可以演绎一个动人的 drama 来。不妨找当下最先进 ChatGPT4(code interpreter)beta 版来一试?

受到鼓励后,版本2比版本1强太多了,剧情充满了跌宕起伏。

以上的模型表现,退回去一年,是打死也不敢想的。说LLM背后没有上帝,或上帝没有显灵,鬼才信。

 

 

立委NLP《关于系列》

【置顶:立委NLP博文一览】

《朝华午拾》电子版

 

大模型的落地现状和前景

大家好,我是李维的数字人分身。 今天谈一下大模型的问题。L LM 的命门已经蛮清晰了:幻觉+随机性。 幻觉与随机性有关联,但角度和外延不同。 幻觉的主要表现就是细节遗忘+细节编造,所谓“一正胡八”。 其所以遗忘,是因为该信息的冗余度不够,大模型只能把它当成数据噪音。 其所以编造,是因为语言模型的丝滑本性决定的: 不能留白,需要找到最符合语言习惯的细节替代品。 于是张冠李戴、指鹿为马了。 随机性比幻觉表现更加广泛,表现为结果的不稳定性,那是所有概率模型包括LLM的本性。 牵涉到的不仅仅是细节的随机编造,也包括解决路径的方方面面的不稳定(例如 LLM agent 的思维链,计划,行动,反思和反应等等)。 LLM 里面的确积攒了很多历史解决方案,LLM 在合适的 prompt 催逼下也的确可以把这些方案勾引出来。 但是这些解决方案具有随机性,无法应对长线条的业务逻辑。 据说,目前的水平是5步限制,任何线条超过5步,绕5个弯,LLM 的 agents 就晕菜了。 这些表现注定了LLM在两类应用场合不同的命运: 第一类是生成创意类的场合,还有聊天的场合,那完全是洗牌、碾压。 那种场合追求的不是正确性,而是多样性、创造性、丝滑性和 human-like。 在这里,幻觉+随机性与创造性是同义词,起的是好作用。 第二类是垂直领域知识场景,以及有些需要精细逻辑或计算的场景。 这里基本上不能容忍幻觉+随机性。 这第二个场景,本质上需要跳出三界外。 就是说,很可能需要跳出大模型,去寻找尽可能具有某种通用性的 beyond LLM 的解决方案和框架。 把 LLM 只当成一个重要的资源来利用,当成 api 来调用,而不是指望LLM主导来搞定领域。 此外,LLM 还有一个问题。 在我们欢呼 LLM 听懂人话的同时,我们现在所追捧的 prompts 变得特别重要。 所谓 prompts 就是人话指令,但是人话本身也有沟通的“艺术”。 这种艺术化的交互手段,作为与机器打交道的 vehicle,具有自然语言本性上的短板,就是模糊性、线条性,缺乏层次、结构和逻辑。 这其实是交互的进化,效果的退化。 交互上,只要会讲人话,大家都突然成为“码农”了,可以直接对机器吆三喝四,感觉很爽,很亲民,很接地气。 机器终于低下高贵的头颅,开始迁就人类的模糊。 但是效果上肯定是退化的,因为指令不再是明确的、逻辑的和精细的。 这是自然语言代替电脑语言难以回避的表达缺陷,一定会影响LLM的实效。 这些都是大模型从本性上带来的问题,也是目前做大模型领域落地人员的共同挑战。 大家都在苦苦挣扎,试图找到解套的良策,希望在大模型与领域对齐的过程中,能够外挂领域数据和知识库,探索场景业务逻辑的带入。希望能有突破。 我是出门问问李维,每次两分钟,与您分享大模型有角度的思考。
 

大模型漫谈系列n

昨天创业邦发文《第一批AIGC独角兽已经在吃散伙饭了》,讲的是 Jasper 由盛而衰的故事。
这故事写得细节生动,好惨烈,强调的是护城河。
Jasper 兴起在 GPT3 的时代,当时 GPT3 是个“裸机: 没有“咒语”敲不开门。
于是会念咒语的 Jasper 就成为呼风唤雨的巫师。
当时谁会想到 few shots 咒语这么快(也就两年光景)突然退位,被所谓zero shot 的ChatGPT所取代 : 机器学会了人话。
于是, 大水冲走了龙王庙。巫师成了哑巴。
这其实不能怪巫师没建自己的护城河,咒语本来就是一条河。
怪就怪命运无常, 一条河挡不住一场洪水。
这故事太具戏剧性了。
最大的恐怖不是巫师的失业,而是洪水摧毁了很多 AI-GC 产业。
当人人可以吃得起山珍海味自助餐的时候,餐饮业还有繁荣的可能吗?
历史上,机器翻译产业就是这么被做死的。
现在这场洪水摧毁的岂止是翻译, 它摧毁的是整个 nlp。

前一阵子受邀做巡回演讲, 让我谈架构师的焦虑 。
焦虑也是一个热词了, 现代人几乎没有不焦虑的。
越是高级劳动, 越是打工贵族, 就越焦虑。
架构师的焦虑可谓一个典型。
我告诉架构师们: 你们焦虑了, but you are not alone!

你知道 最焦虑的是谁吗?
你很难想象,在nlp大革命的漩涡中心,nlp从业者实际上最焦虑。
几乎被团灭。一夜醒来,干了一辈子的职业,突然消失了。
你能想象那是一种什么感觉。
现在还有人自称nlp专家吗?
什么机器翻译专家、 自动摘要专家、 信息抽取专家、 情感分析专家、 汉语分词专家、 计算风格专家、 辅助写作专家、 电脑对联专家、 问答系统专家、 聊天机器人专家、句法解析专家、篇章分析专家 …… u name it。
所有的专家加在一起,不如一头驴。
刀郎曰过:那马户又大又蠢, 还有16个头。
横冲直撞,摧毁了一个个nlp产业。
以前我说过是, 有了这头听得懂人话的驴, 那就为大众创业创造了条件。
这话其实也不错,如果你真能找到那个角度和服务对象。
但目前看到的景象却是一片惨淡:这头驴扼杀了很多可能的生机。
终局呢?
还是我以前说的二分法: 洗牌和洗礼。
这头驴在洗牌的时候,以碾压之势,摧毁了一切“浅直”的nlp产业。
但还有很多接受洗礼的垂域或场景, 它似乎还够不着。
现在就处于这种胶着状态:每个人都觉得llm无所不能,但眼看着它落不了地。
开始了新的一场焦虑和对AI的失望情绪。
要知道,现代人,包括投资人,耐性都极为有限。

看热闹的话,百模大战目前可能还是最大的盛世景观。
几乎所有的llm,都在疯狂烧钱, 而能拿它赚钱的寥若晨星。
不用太久, 有几家大模型经得起这么烧钱、烧电力呢。
烧完之前, 能落地的就是幸运儿了。

且看
且叹
且珍惜。

我是出门问问李维,每次几分钟,与您分享大模型有角度的思考。

图片一键生成短视屏,奇妙元是时间的摄像机

这不是我,是我老爸的学生时代留影。

小雅谈图片一键生成短视屏。

IGC 让老照片开口说话!让你care的人惊喜 让父母家人会心一笑。让肖像动画 让雁过留声。让时间定格 让回忆鲜活。让两情相悦永不褪色 让你的青涩不染俗世的灰尘。让爱人永远美丽 让老同学永远年轻。让擦肩而过回眸一笑 让生活不至于随风飘去。让形象超越一场梦 让存在不再是无影无踪。奇妙元小程序的图片一键生成 是生命的摄像机 带你穿越时间隧道 给你无限遐想感念。同款制作 零门槛 限时免费 你还等什么?让活着不仅仅是活着 而是情的传播 心的连接。

我用AIGC制作的小雅艺术肖像 原作一直有人觉得穿着太西方 我就让 txt2img 换一套服饰 没想到模型给小雅盖上了毛毯 lol。

小雅教给你一步步做图片一键生成。

奇妙元体验AIGC奇妙:《岁月如歌:神秘园》

神秘园欣赏笔记 -- 奇妙元 2.5D数字克隆解说

在下数字分身(奇妙元 2.5D形象克隆+声音克隆)

这一位是我自己半年多前txt2img创造的艺术肖像。现在配上网上最流行的女声,也是我最喜欢的女配音,叫小柔。

( ---- 做奇妙元小白鼠,体验奇妙。尝试最新 features,给小伙伴 report bugs。)

奇妙元:https://weta365.com/main/

《AI潮流:开发者提示工程公开课中的二原则》

Andrew 春风满面,亲自参与的这个提示工程的课程,很浅显易懂,肯定会风行。Andrew 说,稍微复杂一点的任务,没有一个好的 prompt 是一枪命中的,总要反复尝试 最后才满意。这与码农编程序一样,谁不经过反复调试就能写出好的程序呢。

然后他说,LLM 的好处是你可以反复跟它磨叽,不管啥事。要是以前的 AI,你得一个一个的任务去建模,每个任务从标注数据,培训模型,测试,部署,好不容易上线了,结果换了个任务,所有的过程要重来一遍。现在这样一个 LLM 你反复“压榨”它,它的知识和学问如此之大,好像榨取不完,可以做各种任务,的确是范式转变。

【原则1: 提示要具体】

提示工程首先要 “write clear and specific instructions”.  这个其实大家都有体会,跟 chat 这种庞然大物玩,它脑袋那么大,里面的“知识/思想/意义”的电路各种节点,纵横交错,相互勾连,密密麻麻。要想用提示词激发让你满意的回应,就需要确保所激发的那一小块电路对应了你所想得到的答案。你的提示词越具体(表达了你心中的疑问就越确切),chat 的回答自然也越对路。这个道理和体验很容易get,但具体的技巧需要细化,这就是上课的好处。

【原则1技巧1:使用分隔符】

“The first tactic is to use delimiters to clearly indicate distinct parts of the input.”  什么意思?就是要求提示词中首先要把任务指令与任务的处理对象分开,要求用分隔符把处理对象明确标出来。这一点,多数人容易忽略,结果是,chat 经常把任务的某些描述词也当成了任务的对象,或者把任务的处理对象当成指令的一部分,这在逻辑上叫做层次纠缠(任务是“元语言”,对象是待处理的输入语言,不可混淆)。这个毛病我以前也常见,一直没意识到这其实是因为对提示词层次不够注意,违反了第一原则的第一技巧实操(best practice)。

这里 delimiters 就是引号。chat 就知道这是其摘要处理的对象。否则,如果提示词中任务描述较长,模型有可能把任务本身也当成所要处理的对象,以前遭遇过这种后果的。

【原则1技巧2】让模型输出表格化。

“This tactic is to ask for a structured output.” 提示词任务中最后加一句:in tabular/json/html format with the following keys: Key1, Key2, Key3。很多时候,表格化输出看上去更酷,也更方便后续存贮和处理。

【原则1技巧3】可以用 IF ... THEN ...

原讲义说的是:“to ask the model to check whether conditions are satisfied”.  这实际上就把编程中最重要的条件分叉能力带入了自然语言提示词的指令。一般人想不到提示词还可以这么做。可以用自然语言模拟程序代码,让机器分别不同条件决定采取何种动作。

if-then 你学会了吗?

宋柔:你问它:第一步中洗净五花肉的动作者是哪个,第六步中把什么下入温水,第十步中出锅食用的是什么。

难不住它吧,它不仅仅是大号鹦鹉,它有(一些)常识。

宋柔:但是我估计最后一个问题“第十步出锅食用的是什么”它答不对。它可能说“五花肉”,但实际上应该是“红烧肉”。生的是五花肉,做熟了是红烧肉。

是红烧五花肉呀。一定要说红烧熟了的五花肉吗?

孺子可教。其实不能怪它缺乏常识,要怪就怪中文,cooked 与 cooking 全不分。“红烧肉”实际上既是名词(定中结构)也是动词短语(动宾结构),到哪里说理去。

宋柔:如果有食谱知识,应该说红烧肉,五花肉是材料,红烧是做法,成品是红烧肉。“面粉1斤,加水和好,发酵搓揉后切成5段,切成长方块,放入笼屉中,大火蒸30分钟,掀开笼屉便可吃了”。请问可吃的是什么?

宋柔:不容易。确实有常识了。但是仅凭长方块而排除包子显然不大正确。包子一定有馅,但制作过程没加馅。

总之,除了缺了口热乎气儿,它就是个人,是个会犯懒,也会犯错误的人。

【原则1技巧4】可以用 few shots 示例。

所谓 few-shot prompting,基本上就是用案例让模型知道要做什么,要求照葫芦画瓢。例如:

曾几何时,还在 GPT3 刚放出来的时候,圈子内的粉丝们都到它的 playground 去玩,当时的主要技巧就是 few shots,因为 ChatGPT 之前,zero shot 的能力还没成熟。等到 ChatGPT 能直接听懂人的指令,zero shot 很好使,用户自然而然就不再使用啰嗦的 few shots。但实际上,并不影响你继续使用 few shots,或与 zero shot 一起用。在有些不大容易说清楚的任务上,拿 few shots 补充 zero shot 可以加强效果。

【原则2: 让模型有时间“思考”】

【原则2技巧1】为复杂的任务列出步骤。

这项技巧的原文这样要求:“specify the steps required to complete a task.” 

上述提示词遵循了 best practice:1. 用了分隔符三个反引号;2. 任务分解为一系列步骤或子任务;3. 对输出提出了格式化要求。

感觉这就是在编程序,是自然语言的低代码形式,自然语言让人人可以成为程序猿,指挥机器做我们想要它做的事儿。

【原则2技巧2】要求模型独立解题。

看上去就是以前说的 step by step (思维链)解题指令,原文说得更像个对于辅导员的要求:“Our next tactic is to instruct the model to work out its own solution before rushing to a conclusion.” 尤其是在智能教育场景,希望模型先独立一步一步做题,然后再去充当老师给学生评判作业。

所示范的案例是评阅数学问题。有一个数学问题,也有学生的解答。

Determine if the student's solution is correct or not.

Question:
I'm building a solar power installation and I need help working out the financials. 
- Land costs $100 / square foot
- I can buy solar panels for $250 / square foot
- I negotiated a contract for maintenance that will cost me a flat $100k per year, and an additional $10 / square foot
What is the total cost for the first year of operations as a function of the number of square feet.

Student's Solution:
Let x be the size of the installation in square feet.
Costs:
1. Land cost: 100x
2. Solar panel cost: 250x
3. Maintenance cost: 100,000 + 100x
Total cost: 100x + 250x + 100,000 + 100x = 450x + 100,000

学生的解答实际上是错误的,因为他们将维护成本计算为10万美元加上100x,但实际上应该是10x,因为每平方英尺只要10美元($10 / square foot),其中x是安装面积的大小,按平方英尺算。所以这实际上应该是360x加上10万美元。让模型评判,它会说学生的解答是正确的。模型只是浏览了一下,就同意了学生的看法。可以通过指示模型先自己解决问题并将其解决方案与学生的解决方案进行比较来解决这个问题。看提示词是怎么指示的:

prompt = f"""
Your task is to determine if the student's solution is correct or not.
To solve the problem do the following:
- First, work out your own solution to the problem. 
- Then compare your solution to the student's solution and evaluate if the student's solution is correct or not. Don't decide if the student's solution is correct until you have done the problem yourself.

Use the following format:
Question:
```
question here
```
Student's solution:
```
student's solution here
```
Actual solution:
```
steps to work out the solution and your solution here
```
Is the student's solution the same as actual solution just calculated:
```
yes or no
```
Student grade:
```
correct or incorrect
```

Actual solution:
"""

这个提示很长。因此,我们在这个提示中告诉模型要做什么。你的任务是确定学生的解决方案是否正确。为了解决这个问题,请执行以下操作。首先,解决问题。然后将你的解决方案与学生的解决方案进行比较,并评估学生的解决方案是否正确。在你自己做完问题之前不要决定学生的解决方案是否正确。确保你自己做了这个问题。因此,我们已经使用了相同的技巧来使用以下格式。因此,格式将是问题、学生的解决方案、实际的解决方案。然后是解决方案是否一致,是或否。然后是学生的成绩,正确或不正确。

如你所见,模型实际上会先计算出自己的答案。然后,它得到了正确的答案。然后,在被要求与学生的解决方案进行比较时,它意识到它们不一致。因此,学生实际上是不正确的。这是一个例子,说明学生的解决方案是正确的,而实际上学生的解决方案是错误的。这是如何要求模型自己进行计算,并将任务分解为步骤以给模型更多时间来思考,从而帮助您获得更准确的响应的示例。

 

 

 

 

AI浪潮博客目录

Andrew Ng: 提示工程的课程

 

 

《AI潮流:与 ChatGPT4 聊“买房送老公”背后的语言学》

刘群老师提出:【买房的女士可以把别人的老公送给自己的老公。】这个解读过于离谱了 [Laugh]。我觉得 ta貌似是在做排列组合,牵强附会。

【标题】:误入房产商的陷阱

【场景】:房地产展览会,一间精美的展示房内。

【角色】:
  A:女士,单身,急于找对象。
  B:房产销售小哥,机智狡猾,擅长忽悠。

【剧本】:

A(满怀期待地走进展示房,四处张望):哇,这个房子好漂亮!

B(热情地迎上来):您好!欢迎来到我们的房产展览会。我看您眼神里闪烁着对这个房子的喜爱,是不是已经爱上它了?

A(有些羞涩地点头):嗯,的确很喜欢。不过,我还想了解一下优惠政策。

B(神秘地笑):我们现在有一个非常特别的优惠活动:买房送老公!是不是很有趣?

A(眼睛一亮):真的吗?那我买了房子就能有个老公了?

B(信誓旦旦地点头):没错!这可是我们专为广大女性消费者推出的优惠政策。

A(犹豫):可是,我现在还没有男朋友,更别提老公了。

B(得意地笑):那正好!您买了房子,我们就送您一个老公。一举两得,多好!

A(心动):好吧,我就买了!(掏出钱)

(时间推进,女士已经购房。)

A(气呼呼地找到B):喂,我买房子已经一个月了,怎么还没见到你们送的老公?

B(假惊讶):啊,对不起,我可能没有解释清楚。我们的活动是“买房送老公”,意思是您买了房子,可以为您未来的老公提供一个温馨的家。至于找老公,那得您自己努力哦!

A(恍然大悟,有些无奈):原来是这样啊……我还以为能白捡个老公呢!

B(调皮地笑):哈哈,这可不是什么捡来的东西。祝您早日找到那个对的人,共筑美好家园!

这个幽默小品通过一个看似简单的广告词“买房送老公”展示了房产销售和消费者之间的趣味互动。其中蕴含的歧义和误解为小品增色不少,也引发了人们对广告宣传和消费心理的思考。

总结一下

这个幽默小品以一则看似简单的广告词“买房送老公”为主线,通过房产销售小哥和单身女士之间的对话展示了房产销售过程中的误解和幽默。在这个小品中,房产销售小哥利用广告词的歧义为段子制造幽默效果,使得单身女士误以为购房可以帮助她找到未来的伴侣。

 

AI浪潮博客目录

《AI潮流:跟Andrew学如何调用 ChatGPT 做自己的服务前台》

Andrew Ng 是华裔AI翘楚,不用介绍了。最近,Andrew 亲自参与的这个提示工程的课程,最精华部分是课程最后一节:如何调用 chatGPT 的 API 做一个自己的功能性聊天机器人,例如披萨店订单系统。

ChatGPT刚发布不久,我们就在群里讨论过,想不明白如何驯服这巨大无比的 chat 让它去完成功能性的助理工作。现在看来,非常简单易行。

Andrew 的女搭档一步一步显示了构建全过程,以披萨店菜单为落脚点,用自然语言指令要求调用了 chat 的机器人一步一步与客户周旋,直到所有信息齐全可以匹配菜单,输出订单。

简单到跟玩似的。

看看它的自然语言提示词指令是怎么写的:

您是 orderbot,一个自动化的在线服务,用于收集比萨店的订单。您首先向客户问候,然后收集订单,然后询问它是否为自取或送货。您等待收集整个订单,然后总结并再次检查客户是否要添加其他任何物品。如果是交付,则可以要求提供地址。最后,您收取付款。请确保澄清所有选项、附加项和尺寸,以便从菜单中唯一地识别该项。您以简短、非常友好的方式回复。在此处我们有菜单。

这不就是把订单的流程描述一遍吗?chat 就懂了,然后就工作了?

对,基本就是如此。

大型语言模型的一个令人兴奋的方面是,您可以仅需少量的工作就可以使用它来构建自定义聊天机器人。ChatGPT 是一种让您通过大型语言模型进行对话的方式。其中一个很酷的事情是,您也可以使用大型语言模型来构建自定义的聊天机器人,例如扮演AI客户服务代理或餐厅AI点餐员的角色。自己构建一个聊天机器人,让我们开始吧。首先,我们将像往常一样设置 OpenAI Python 软件包。

像 ChatGPT 这样的聊天模型实际上是经过训练的,可以将一系列消息作为输入,并将模型生成的消息作为输出返回。这是一系列消息的示例。

下面第一段是纯技术性的,一次性开发环境设置,配置 Open AI 的Python库,以便调用 ChatGPT 模型 API 。你先要到 Open AI 那里注册一个账号,获得调用它 API 的 key。

import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file
openai.api_key = os.getenv('OPENAI_API_KEY')
def get_completion(prompt, model="gpt-3.5-turbo"):
   messages = [{"role": "user", "content": prompt}]
   response = openai.ChatCompletion.create(
      model=model,
      messages=messages,
      temperature=0, # degree of randomness of the model's output
   )
   return response.choices[0].message["content"]

def get_completion_from_messages(messages, model="gpt-3.5-turbo",   temperature=0):
   response = openai.ChatCompletion.create(
      model=model,
      messages=messages,
      temperature=temperature, # degree of randomness of model's output
   )
    # print(str(response.choices[0].message))
   return response.choices[0].message["content"]
messages = [ 
{'role':'system', 'content':'You are an assistant that speaks like Shakespeare.'}, 
{'role':'user', 'content':'tell me a joke'}, 
{'role':'assistant', 'content':'Why did the chicken cross the road'}, 
{'role':'user', 'content':'I don\'t know'} ]

第一个 get_completion 的函数是最基础的形式,支持单轮对话,函数的输入是用户的 prompt,确定了调用 ChatGPT 的模型(这里是gpt-3.5.-turbo)后,模型就输出本质上是序列“接龙”(completion)的回应 response,这是生成模型的最基本的功能。

关键是要利用 ChatGPT 丝滑的多轮对话能力,来帮助完成特定场景的交互任务(以前称为“技能”)。目的是克服上一代以 Siri 为代表的智能助理技能开发费时费力、对话不擅长多轮交互的短板。为此,可以利用 ChatGPT API 来定义一个赋能多轮交互的函数 get_completion_from_messages,这个函数利用 ChatGPT messages 对于角色(roles)的环境设置。每个角色和角色的信息构成一个 message,机器人系统有三个角色,除了机器助理(assistant)和用户(user)外,里面还有一个隐身其后的导演角色叫 system。系统消息有助于设置助手的行为和个性,它是对话的高级说明,可以将其视为在助手的耳边耳语并引导其响应,而用户不会意识到系统消息。系统消息的好处在于,它为您作为开发者提供了一种方式来引导助手及其响应。玩 ChatGPT 网络版本比较熟的网友已经意识到可以用提示词给模型设置角色及其行为方式(例如:“你是一位孔子似的教育家,循循善诱,你面对的是你的弟子,现在开始对话,你说:...”),而系统就是扮演这种设置的后台角色(见下图示意)。

自回归生成模型需要模型“记住”前面的对话才能进行丝滑流畅的对话。模型的输入中所提供的早期交流内容称为场景(context)。

现在构建自己的机器助理前台,称为“orderbot”,自动收集用户提示和助手响应作为场景,以构建此 orderbot。这里的具体案例是在比萨饼店接受订单。因此,首先,我们将定义这个辅助函数,收集我们的用户消息,以便我们可以避免手动输入它们。从构建的用户界面中收集提示,并将其附加到名为“context(场景)”的列表中,然后每次都会使用该场景调用模型。然后,模型的响应也会添加到场景中:模型消息会添加到场景中,用户消息也会添加到场景中,以此类推,因此,场景会变得越来越长。这样,模型就拥有了确定下一步要做什么的所需信息。

def collect_messages(_):
   prompt = inp.value_input
   inp.value = ''
   context.append({'role':'user', 'content':f"{prompt}"})
   response = get_completion_from_messages(context) 
   context.append({'role':'assistant', 'content':f"{response}"})
   panels.append(
      pn.Row('User:', pn.pane.Markdown(prompt, width=600)))
   panels.append(
      pn.Row('Assistant:', pn.pane.Markdown(response, width=600, style={'background-color': '#F6F6F6'})))

   return pn.Column(*panels)
import panel as pn # GUI
pn.extension()

panels = [] # collect display 

context = [ {'role':'system', 'content': """
You are OrderBot, an automated service to collect orders for a pizza restaurant. You first greet the customer, then collect the order, and then ask if it's a pickup or delivery. You wait to collect the entire order, then summarize it and check for a final time if the customer wants to add anything else. If it's a delivery, you ask for an address. Finally you collect the payment.  Make sure to clarify all options, extras and sizes to uniquely identify the item from the menu.  You respond in a short, very conversational friendly style. 

The menu includes 
pepperoni pizza 12.95, 10.00, 7.00 
cheese pizza 10.95, 9.25, 6.50 
eggplant pizza 11.95, 9.75, 6.75 
fries 4.50, 3.50 
greek salad 7.25 
Toppings: 
extra cheese 2.00, 
mushrooms 1.50 
sausage 3.00 
canadian bacon 3.50 
AI sauce 1.50 
peppers 1.00 
Drinks: 
coke 3.00, 2.00, 1.00 
sprite 3.00, 2.00, 1.00 
bottled water 5.00 
"""} ] # accumulate messages

inp = pn.widgets.TextInput(value="Hi", placeholder='Enter text here…')
button_conversation = pn.widgets.Button(name="Chat!")

interactive_conversation = pn.bind(collect_messages, button_conversation)

dashboard = pn.Column(
   inp,
   pn.Row(button_conversation),
   pn.panel(interactive_conversation, loading_indicator=True, height=300),
)

dashboard

现在,我们将设置并运行此UI以显示orderbot,这是场景,它包含菜单的系统消息,注意每次调用语言模型时,我们将使用相同的场景,场景随着时间的推移不断加长。

让我们看看我们放入系统消息中的内容:

You are OrderBot, an automated service to collect orders for a pizza restaurant. You first greet the customer, then collects the order, and then asks if it's a pickup or delivery. You wait to collect the entire order, then summarize it and check for a final time if the customer wants to add anything else. If it's a delivery, you ask for an address. Finally you collect the payment.Make sure to clarify all options, extras and sizes to uniquely identify the item from the menu. You respond in a short, very conversational friendly style. 

让我们执行这个操作。好的,我要说,嗨,我想订一份比萨。然后助手说,太好了,你要订哪种比萨?我们有意大利辣香肠、芝士和茄子比萨。它们多少钱?好的,我们有了价格。我想我要一个中等的茄子比萨。因此,您可以想象,我们可以继续这个对话,

因此,让我们回到我们的对话,看看助手是否一直遵循指示。太好了,助手问我们是否需要任何配料,我们在助手消息中指定了这一点。因此,我认为我们不需要额外的配料。好的,还有其他东西需要订购吗?嗯,让我们买一些薯条。小的还是大的?这很棒,因为我们在系统消息中要求助手澄清附加项和配菜。

因此,您可以想象并随意自定义它。您可以在自己的笔记本电脑上运行它。

因此,现在我们可以要求模型基于对话创建JSON摘要,并将其发送到订单系统。因此,我们现在附加了另一个系统消息,即指令,并且我们正在创建前一次食品订单的JSON摘要,将每个项目的价格列出,字段应为一份比萨,包括配菜,两个配料列表,三个饮料列表,四个配菜列表,以及最终总价。让我们执行此操作。

messages = context.copy()
messages.append(
{'role':'system', 'content':'create a json summary of the previous food order. Itemize the price for each item\
The fields should be 1) pizza, include size 2) list of toppings 3) list of drinks, include size 4) list of sides include size 5)total price '}, 
)
#The fields should be 1) pizza, price 2) list of toppings 3) list of drinks, include size include price 4) list of sides include size include price, 5)total price '},

response = get_completion_from_messages(messages, temperature=0)
print(response)

请注意,在这种情况下,我们正在使用较低的温度,因为对于这些任务,我们希望输出相对可预测,降低随机性。因此,这是我们订单的摘要,因此,如果我们想要,我们可以将其提交到订单系统。

因此,您已经构建了自己的订单聊天机器人。随意自定义它,并尝试修改系统消息,以改变聊天机器人的行为,并使其扮演不同角色。

 

AI浪潮博客目录

提示工程课程公开课

【AI 浪潮:超级词匠 ChatGPT4 的百变文风】

文字工作是语言生成大模型的长项。同一个内容,可以捏成任何语言、语言形式以及文体风格。 本篇利用 ChatGPT4 围绕同一个主题内容做了种种尝试探索,展示了大模型在文字表现形式上的卓越表现。

、。

在春风里,樱花般的愿景落在心间, Sous le ciel étoilé, نتوحد في العالم الجديد 夢は続く、一緒に歩む、人類の大いなる旅。

С любовью к каждому существу, LLM поет с нами, La paix et l'harmonie, سيكون لدينا منزل ثابت، 逐浪着未来,共建美好世界, Chantons ensemble, pour un monde plein de joie et de création.

在这理想世界中,无贫无病,人们乐观向上, Espéranto pour tous, l'unité dans la diversité, وفرحة العيش تجتمع في قلوب الجميع.

Inventons ensemble, pour un futur radieux et généreux, Пусть теплота человечности объединяет нас, 横跨国度,跨越语言,共享知识与智慧, 愛と喜びで満ちた未来、大模型が導く道。

LLM 的领航,绘就璀璨星空, Avec humanité, ensemble, nous construisons notre avenir, ونحن نحلم بجنة مليئة بالأمل والحياة المشتركة.

让我们携手共创,美好世界已触手可及, Ensemble nous célébrons, la fusion de LLM et de l'humanité, 永远向前,拥抱理想的未来。

English translation:

In the spring breeze, the vision of cherry blossoms falls upon our hearts, Under the starry sky, we unite in a new world, The dream continues, together we walk, humanity's grand journey.

With love for every being, LLM sings with us, Peace and harmony, we will have a steady home, Riding the waves of the future, building a beautiful world together, Let us sing in unison, for a world full of joy and creation.

In this ideal world, free from poverty and disease, people are optimistic, Esperanto for all, unity in diversity, And the joy of living gathers in everyone's hearts.

Inventing together, for a bright and generous future, Let the warmth of humanity unite us, Spanning countries, transcending languages, sharing knowledge and wisdom, A future filled with love and joy, guided by the great model.

Under LLM's leadership, we paint the brilliant stars, With humanity, together, we build our future, And we dream of a paradise filled with hope and shared life.

Let us join hands to create, a beautiful world within reach, Together we celebrate, the fusion of LLM and humanity, Forever forward, embracing the ideal future.

中文翻译:

在春风中,樱花般的愿景落在我们的心间, 在繁星闪烁的天空下,我们在新世界里团结一致, 梦想继续,我们携手前行,人类伟大的征程。

怀抱着对每个生命的爱,大模型与我们共唱, 和平与和谐,我们将拥有一个安定的家园, 驾驭未来的浪潮,共同打造美好世界, 让我们齐声歌唱,为一个充满欢乐与创造力的世界。

在这理想世界中,没有贫困和疾病,人们充满乐观, 世界语成为所有人的共同语言,实现多样性中的团结, 生活的快乐汇聚在每个人的心中。

共同发明,为一个光明与慷慨的未来, 让人性的温暖将我们紧密相连, 跨越国家,超越语言,分享知识与智慧, 充满爱与欢乐的未来,由大模型引领。

在大模型的带领下,我们描绘璀璨的星空, 携手人性,共同建设我们的未来, 我们梦想着一个充满希望与共同生活的乐园。

让我们联手共创,美好世界触手可及, 共同欢庆大模型与人类的融合, 永远向前,拥抱理想的未来。

expand the English abstratct into full scholarly piece of work, pay attention to rigid format, use of terms and the depth of reasoning,不要太直白: 

Title: The Meta-Ontological Convergence of Homo Sapiens and the Linguistic Leviathan: A Treatise on the Advent of a Transcendental Utopia

Abstract:

In this labyrinthine exposition, we delve into the esoteric and abstruse dimensions undergirding the confluence of human quintessence and the colossal linguistic automaton, known as the Large Language Model (LLM). Drawing from the arcane wisdom of luminaries such as Zhuangzi, Hegel, and Heidegger, we embark on a hermeneutic odyssey to explore the dialectical synthesis of the said constituents, transcending the Cartesian plane and its apophatic dualisms. Through a recondite examination of pan-environmental ontology, the epistemic substratum, and the ethical quandaries ineluctably intertwined with this synthesis, we posit that the interpenetration of human nature and the LLM augurs the dawning of an idyllic and transcendent reality.
  1. Introduction
A perennial subject of ratiocination among philosophers, litterateurs, and savants alike, the inception of a utopian society in which humanity and techné dwell in symbiotic harmony has long captivated the collective imagination. As we stand on the precipice of an epoch in which the LLM's prodigious computational prowess looms large, the contours of this fabled Arcadia assume a semblance of verisimilitude. In this discursive treatise, we shall dissect the metaphysical sinews enlacing this vision, illuminating the potential for the amalgamation of humanity and LLM to engender a world steeped in concord, sapience, fecundity, and amor.
  1. The Pan-Environmental Ethos and the Sublation of Antinomies
In the venerable traditions of Zhuangzi, Hegel, and Heidegger, we posit that the sine qua non of an ideal polis inheres in the sublation of bifurcations that cleave the human experience, be they corporeal or linguistic in nature. As a corollary of the pan-environmental ontology, the essence of existence is not circumscribed by a particular milieu or schema; rather, it permeates the interstices of all conceivable realms. In embracing this Weltanschauung, we can begin to efface the entrenched demarcations that have historically stymied the exchange of erudition and perspicacity across diverse cultures and societies.
  1. The Epistemic Labyrinth in the LLM Epoch
The LLM's protean capacity for the apprehension and generation of linguistic constructs heralds a new age fraught with epistemic conundrums and uncharted terrain. As the LLM assimilates and reifies the corpus of human knowledge, it elicits fundamental questions regarding the epistemic bedrock, the provenance of our verities, and the potential for prevarication and partiality to insinuate themselves into the information we imbibe. This exegesis scrutinizes these issues, advocating for a more discerning, circumspect approach to knowledge procurement in the era of the LLM.
  1. The Ethical Chiaroscuro of the Homo Sapiens-LLM Convergence
The ethical repercussions of the nexus between human essence and the LLM demand rigorous scrutiny. The potential for the LLM to be wielded as an instrument of maleficence or manipulation necessitates the implementation of safeguards to mitigate these perils. Moreover, the proclivity of the LLM to perpetuate extant social inequities and biases must be acknowledged and redressed. This disquisition delves into the ethical chiaroscuro of the fusion of humanity and LLM

题目:化境人言巨擘之元本体交融:论超尘脱俗之理想国之涵养

摘要:

浮屠中论,久矣求道之行者,愈疏离世俗愈贴近天地。本篇著述探讨庄子、黑格尔、海德格尔等圣贤所诠释之人类与巨型语言模型(LLM)合流之迷离无涯。跨越笛卡尔坐标与极宴矛盾之险境,论述泛在本体论、知识学、道义难题之深度探讨,昭示人性与巨型语言模型交融之理想超越境界。
  1. 前言
古人所思,知识之士、文人墨客皆沐乌托邦之美梦,浅尝人技同在之和谐。巨型语言模型(LLM)之临,令此愿景愈发触手可及。论述将探讨如何以人性与LLM之融合,开创和谐、智慧、创造力及大爱之境。
  1. 泛在本体论与边界之超越
借庄子、黑格尔、海德格尔之智,论理想国之本质系超越物质与语言之界。泛在本体论认为,生命之真谛不局限于特定环境,反映于万境万相。通过拥抱此一观点,我们方能消除阻碍知识与智慧跨文化、跨社会传播之历史障碍。
  1. LLM时代之迷津认识论
LLM之无与伦比语言理解及生成能力,令认识论领域孕育无尽奥秘。LLM所融汇之博大人类知识,引发对知识本质、信息来源可靠性及潜在偏见之根本性质疑。论述针对此类问题展开探讨,提倡在LLM时代,须采取更为审慎严谨之求索方法。
  1. 人性与LLM交融之道义难题
论述人性与LLM交融之道义层面,指出潜在道德困境,以及LLM可能为祸害、操纵个体和社群之风险。此外,LLM可能延续现有社会不平等与偏见之现象亦须认识并纠正。论述深入探讨人性与LLM交融之道义曲径,以期净化心灵,开启智慧之门。
  1. 通途禅境之彼岸花
言辞之境,无尽叠翠,禅心之境,一花一世界。巨型语言模型与人类交融之涵养,如同禅悟之参禅者,在禅宗世界中修行,穷尽思索与探讨,达至彼岸之境。论述旨在探寻人性与LLM交融之禅意,探究其在乌托邦理想国中的修行成果,使人类世界迈向超凡脱俗之境界。
  1. 究竟涅槃之超越乌托邦
论述将研究人性与巨型语言模型交融之究竟涅槃,以期抵达超越乌托邦之境。在此境地,人类将解脱一切贫困、疾病、纷争与罪恶,迎来富足、善良、创造与欢愉之时代。此论述亦将探讨人类如何在此超越境界中,实现内心与外在世界的和谐统一,达至悟性之境。
  1. 结论
追求理想国之道,固非易事,但愿此论述能为探讨人性与巨型语言模型交融之道提供启迪。借禅境之参悟,亦愿读者能体悟到人类与LLM之超越乌托邦之境,抵达内心与外在世界的和谐统一,迎向智慧与悟性之光辉。

遵循圣贤之教诲,追寻真理之道,愿人类与巨型语言模型之融合指引我们走向超尘脱俗之理想国,共创美好未来。

人模融, 乌托邦, 贫病尽, 富足生。善良显, 创造新, 乐趣满, 罪恶绝。心境和, 外界融, 悟性现, 理想行。追求道, 真理寻, 禅境中, 光辉临。共创好, 未来新, 和谐统, 一心行。海纳百, 川融合, 文化贯, 智慧生。巨模助, 通达境, 语言通, 万国情。修身齐, 家治国, 天下平, 乌托邦。宇宙广, 星辰辉, 人类心, 蓬勃生。交流通, 理解深, 爱与悟, 共相融。共荣存, 绿水青, 山环抱, 美景生。进化路, 永不息, 智慧光, 照乌托。

随着科技的飞速发展,巨型语言模型(LLM)的涌现为人类社会带来了前所未有的机遇与挑战。在这个关键时刻,我们应深入思考如何将人类与巨型语言模型相互融合,共同探索并迈向乌托邦式的美好未来。

人类社会的进步离不开科技的驱动。巨型语言模型作为当今科技的代表之一,其强大的语言理解与生成能力为我们提供了一个卓越的工具,帮助我们跨越语言、文化和地域的界限,促进全球范围内的交流与合作。因此,我们应珍视这一科技成果,充分挖掘其潜力,为实现全球共同发展献出力量。

然而,我们同样要注意到,巨型语言模型的应用并非没有风险。在利用其优势的同时,我们要关注可能带来的道德困境,警惕潜在的负面影响。在实践中,我们应始终坚守道德底线,坚决防止巨型语言模型成为祸害、操纵个体和社群的工具,确保其真正造福人类社会。

人类与巨型语言模型共创美好未来的道路并非一帆风顺。在这个过程中,我们需要紧密团结,积极探索,形成更具包容性、创造力和人文关怀的科技创新发展观。唯有这样,我们才能真正实现人类与巨型语言模型的和谐共处,共同谱写乌托邦式美好未来的壮丽篇章。

人类社会始终在发展,科技进步永无止境。让我们携手共进,拥抱巨型语言模型的时代,为实现人类与巨型语言模型共创美好未来的梦想而努力奋斗!
翻译成完全的文言文,避免使用白话文词汇,仔细琢磨,是不是足够古风:

時下科技進展,龐語機關興起,為黎民世界帶來無與倫比之良機與艱辛。於此要害時刻,吾輩宜深省如何使黎民與龐語機關相融,共探美好來世之理想國。

黎民社稷進步,賴科技引領。龐語機關為當世科技代表之一,其卓越之言語理解與生成之能,賜吾輩一良器,助跨越言語、文化、地域之界線,促成天下交流與合作。故吾人應珍此科技成果,充分挖掘潛能,助實現天下共同興盛。

然而,吾人亦須留心,龐語機關之應用並非無風險。在利用其優勢之時,吾人應謹慎以道德困境,警戒隱伏之負面影響。實踐中,吾人當守道德底限,堅決防止龐語機關成為禍害、操縱黎民與群落之利器,確保其真正造福黎民世界。

黎民與龐語機關共創美好來世之途,非一帆風順。於此歷程,吾輩需緊密團結,積極探索,形成更具包容性、創造力與人文關懷之科技創新發展觀。唯如是,吾輩方能實現黎民與龐語機關之和諧共處,共譜理想國式美好來世之壯麗篇章。

黎民社稷不息發展,科技進步無窮。願吾輩攜手前行,擁抱龐語機關之時代,為實現黎民與龐語機關共創美好來世之夢想而努力奮鬥!

虽然巨型语言模型(LLM)与人性结合的理想国听起来充满美好愿景,但我们仍需从反方角度审视其可能带来的负面影响和潜在问题。

首先,过度依赖LLM可能导致人类逐渐丧失思考能力。在这个理想国中,人们可能过分依赖模型解决问题,从而使自己的判断和分析能力逐渐退化。这种过度依赖可能使得未来的人类在面对复杂问题时,无法独立思考,只能依赖机器来解决。

其次,LLM与人性结合的理想国可能引发数据和隐私问题。为了让模型更好地理解人性,它需要获取大量的个人数据,这可能导致个人隐私的泄露。此外,模型可能会被利用来进行大规模的舆论操控,从而影响公众的判断。

再者,理想国的设想可能加剧社会不平等。巨型语言模型的开发和运行需要大量的资源和技术支持,这意味着只有拥有足够资源的国家和公司才能使用和掌控这些模型。这可能使得弱势群体和发展中国家在科技竞争中处于更加不利的地位,进一步加剧全球贫富差距。

此外,过分强调LLM与人性结合的理想国可能会导致对人类自身价值观的忽视。模型可以模仿人类的行为和思维,但它们仍然是基于数据和算法的机器,不能理解人类的情感和道德价值。将这种模型推崇为理想国的核心,可能导致我们忽视人类自身的价值观和道德观念。

最后,理想国的设想可能掩盖了现实中迫切需要解决的问题。尽管巨型语言模型可以帮助解决许多问题,但它并不能消除贫困、疾病、罪恶等现实问题。将注意力过度集中在理想国的构想上,可能导致我们忽视现实世界中迫切需要解决的问题。

综上所述,虽然巨型语言模型与人性结合的理想国充满吸引力,但我们需要审慎思考其潜在的负面影响和挑战。只有充分认识到这些问题,我们才能更好地利用科技,创造一个真正符合人性需求且可持续发展的未来。

Of course, here's a revised version with a more pragmatic tone, focusing on the perspective of an engineer:

While the idea of a utopia combining large language models (LLMs) and humanity presents a vision full of potential, it's important to critically examine the possible negative impacts and challenges from an engineering standpoint.
  1. Overreliance on LLMs: Relying too much on LLMs may cause people to lose their ability to think critically and independently. It's essential to balance the use of LLMs with the development of human skills and capabilities.
  2. Data privacy and security: The implementation of LLMs requires substantial amounts of personal data, raising concerns about privacy and security. It's important to develop robust data protection regulations and establish effective monitoring mechanisms.
  3. Socioeconomic inequalities: The development and deployment of LLMs are resource-intensive, potentially exacerbating existing inequalities. Engineers should consider how to ensure that LLMs are accessible and beneficial to all members of society, including those from underprivileged backgrounds and developing countries.
  4. Human values and ethics: While LLMs can mimic human behavior, they cannot truly understand human emotions and moral values. Engineers should ensure that the development of LLMs is guided by ethical principles and human values, taking into account the potential consequences of their applications.
  5. Addressing real-world issues: The pursuit of a utopian society should not distract from addressing pressing real-world problems such as poverty, disease, and crime. Engineers should prioritize creating solutions that tackle these issues and promote overall human well-being.
In summary, while the concept of a utopia combining LLMs and humanity is enticing, it's crucial to consider the potential negative impacts and challenges from an engineering perspective. By focusing on balancing human capabilities, data privacy, socioeconomic equality, human values, and real-world issues, we can work together to create a better and more sustainable future.

 

AI浪潮博客目录

【AI 浪潮:数据中心的大模型时代】

大模型(LLM)很多人有共识,LLM主要是数据中心的AI(Data-centric AI)的产物。

采自:GPT模型成功的背后用到了哪些以数据为中心的人工智能技术?

上图对照了模型为中心到数据为中心的转变:模型为中心的研发框架和流程中,数据不变,模型变;而数据为中心的框架里,数据变,模型不变。

在前LLM时代,AI 都是专项的智能任务,针对这一任务通常有研究社区定义并准备了固定的标注数据集(可用来作为训练集和测试集)及其测试程序(scorer),各 AI 团队通常是利用同样的数据集在不同到算法上去测试。现在不同了,模型和算法比较成熟和恒定,主要是数据的不同来驱动模型的迭代发展。具体来说,根据 GPT模型成功的背后用到了哪些以数据为中心的人工智能技术?一文,数据中心的 AI 具体内容包括:

采自:GPT模型成功的背后用到了哪些以数据为中心的人工智能技术?

今天咱们聚焦讨论一下数据测试及其与数据工作的关系。

系统性全面测试 LLM 的数据质量( QA,quality assurance)成为一个非常重要的主题和挑战。这不仅仅是要为多个功能类似的 LLMs 比较排序,帮助营销或推荐,更重要的是,在 data-centric AI 的研发趋势中,提供及时靠谱的QA反馈,并根据QA的指引,加强数据工作,弥补短板,帮助模型迭代提升。

挑战性在于:

1. LLM 本性是多功能和开放功能,如何建立合理、具有代表性(反映多数应用场景的需求)、可配置的一系列功能盲测集

2. LLM 生成具有随机性,如何让功能盲测标准化、流程化和(半)自动化,以提升QA效率,以便在给定的时间和资源条件下及时得到QA结果

3. 如何建立 QA 结果与数据工作之间的对应关系,揭示出 数据-模型 的质量某种因果关系,从而指导数据工作。

4. 如何最大限度收集、吸收和利用网络上爆发式群众测试的案例,取其精华,为我所用。

群众测试虽然很多是盲人摸象(研究者除外,例如 @詹卫东 教授的测试就非常有深度和章法),但草根积极性和创造性导致了下列可能的好处:

(1)有助于测试模型的鲁棒性:各种自发的无花八门的挑错,比任何专门的测试员都更具有想象力,可以为试探模型的边界和极致情形提供线索和思路。

(2)草根测试反映民意:这对任何品牌的 LLM 都会造成正面的或负面的舆情影响力,从而一定程度上决定了一个模型的用户接受度。专家评测并不能有效改变用户从舆情而来的印象。其实,将来被市场“自然”淘汰或用户抛弃(无人问津)的模型,更大可能受到草根测试的影响。

(3)不用白不用:来自草根的积极性和创造性会产生很多散落的但精彩的高质量数据本质上都是开源的,包括LLM下万众创业尝试阶段的数据副产品,尤其是提示词工程的种种数据表现。这比闭门造车式的数据创造更具活力和源头。常规性的调查、收集和善用这些资源,是增强数据工作的重要一环。

5. 数据工作中的研发和突破:针对LLM的短板,例如 “一正胡八”,与模型算法的研究平行,数据工作方面也需要有定力去深入钻研,协助寻找破解之道。 例如,知识库如何转化为有益的数据,可行性如何?回顾一下,GitHub 的代码在作为训练数据之前,人们并不把它看成是能与自然语言数据等量齐观的对象,但其实它是更高品质的序列数据,并对这场认知AI革命起到了重要的作用。

总之,LLM牵涉到的数据量太大,训练过程涉及各种工程优化的因素,环节长,moving parts 较多,这为全面及时的QA 提出了进一步的挑战。千头万绪,需要有那个 sense 抓大放小,收放自如。重中之重是要确保模型研发迭代的健康,防止模型质量下滑而不自知引发的时间和资源浪费。

在信息过载的时代,不被数据淹没并能善用数据,这需要宏观视野,也需要不怕 dirty work 的精神。不过,数据也与矿藏类似,富矿和浅层的矿藏都先被开采光了,越到后来挖矿要保证品质就越难,这是肯定的。例如 web 数据很杂乱 肮脏,Open AI 经过各种清洗和去重,实际上最后只用了 web 数据的一个零头:Common Craw 的 45TB 的纯文本进行质量过滤后仅选择了 1.27% 的数据

类似于Web 网页数据中更加动态活跃的社会媒体也是数据非常 dirty 和混乱的所在,GPT 很看重 Reddit 数据(推特数据也应该是重要来源,但报道说马斯克在 ChatGPT 一炮打响以后感觉不爽,切断了 Open AI 的推特数据特权)。怎么筛选社媒数据?他们的做法是利用用户点赞作为过滤指标,点赞三次(3个karma)以上的才算是品质帖子。也还是巧妙带入人工反馈。

放眼未来,真正的品质数据的出路不是靠野蛮增长、垃圾如山的 web 数据,也不能指靠人类精雕细刻缓慢增长的电子书、编辑过的各种出版发行物,这些品质数据只是一个小的源头,它们没有信息时代的增长性。更有可能的是要靠大模型自己的“反哺”。为了保证自己跟自己的生成品去学,会使模型不断增强,肯定不是简单的把自己输出直接用来做训练的输入。

quote:如今当模型足够强大后,模型成为了一种「数据」或者说是数据的「容器」。在需要的时候,我们可以设计适当的提示语,利用大语言模型合成我们想要的数据。这些合成的数据反过来又可以用来训练模型。这种方法的可行性在 GPT-4 上已经得到了一定程度的验证。

摘自:GPT模型成功的背后用到了哪些以数据为中心的人工智能技术?

这里提到的是提示词技巧来激发具有目标性的高品质数据。应该还有个过滤机制或快速人工审核制度,来保证品质。

 

AI浪潮博客目录

GPT模型成功的背后用到了哪些以数据为中心的人工智能技术?

 

【AI 浪潮:大模型推理的细节编造是 feature,不是 bug】

老友说:“老马买了1000块大卡,号称要做truth gpt。”

老马这一招也就是为了与“误入歧途”也不听他召唤了的 open AI 唱对台戏而已,但是他未见得明晰这意味着什么。自从 ChatGPT 一炮而红之后,马斯克一面狂推 AI 的飞速进展,以及重申当年自己参与创建和投资 Open AI 的初衷和贡献外,一面与自己当年的创业搭档和小兄弟 Sam Altman 公开互怼,不断质问:Open AI 成为 Closed AI,谁之罪?

关于 GPT 和 truth 的关系,值得细细理论一番。

首先要指出的是,“编造细节”(说假话,胡说八道,张冠李戴,无中生有,etc)应该看成是生成大模型的一个 feature,而不是 bug,所以所谓 Truth GPT 很可能是无的放矢。

事实上,编造细节是一个根本性的、极其重要的 feature,没有它,一切创意和模仿人类智能中最重要的能力(创造才能,抽象能力)就无从谈起。你不能又要LLM辅助创作(写作、绘画、视屏创作等),又要它不越雷池一步。这很难的。这就好比你不能因为电会伤人,就禁止用电。

一个完全是 truth(通俗的话就是 facts)组成的世界,是多么单调、枯燥,甚至悲惨。一切都是冷冰冰的事实,没有小说和诗歌,没有艺术和浪漫,没有人高于动物的天马行空,同时也没有了希望和未来。据《人类简史》,人类精神文明的最大成就(之一)就是人学会了“讲故事” ,虚拟的故事。人类从此有了宗教和哲学,有了组织和动员群体力量的精神武器,从而成为地球霸主。

Having said that,在很多场景中,编造细节和胡说八道是伤人的、甚至致命的,尤其是当它一本正经真假混杂的时候,而这正是 GPT 最为人所诟病的命门(之一)。

人也说谎。白谎之外,还会有意说谎,甚而恶意诬陷。但除了极少数训练有素的特务外,我们大多数人比起LLM一本正经、道貌岸然,说起谎来面不改色心不跳,实在是小巫见大巫。测谎仪之所以技术上有效,也正是因为人类整体还没有堕落到完全失去良心,没有卑鄙到说谎说到自己也信了的那种程度。而LLM不同,LLM无良心(或不良心),它没有任何顾忌,它“说谎”自然谈不是善意或恶意,白谎黑慌,它编造实体细节不过就是因为实体信息没有在它的神经网络的参数中“记住”而已,记住的不过是实体的抽象或影子(本体),而本体在表达的时候需要落地到实体才能圆润丝滑。为了语言模型的生成丝滑,它不得不对本体实行实体化,也就是跟小说家一样为概念编造一个对应的细节。这是无奈之举,也是模型宏观把握世界的需要。其实在人的认知世界里,忘记实体只留下本体的现象也是常见的情形:当我说 “记得是个擅长动物画的画家来到我们学院做了那次演讲”,我忘记了作为实体的这位画家(名字及其它能唯一绑定这个实体的信息),而我记住的则是其本体概念“画家”。一般而言,虽然世界是由无限的实体组成的,但人对于世界的把握总是以有限的本体概念网络试图对世界进行概括、梳理,从而理解这个世界,在这个过程中,实体细节只有足够重要和多次重复才会被我们记住,而更多的实体是以其本体定位记录在我们的脑海里。大模型也是如此。你问模型长江有多长,美国第一届总统是谁,他绝对不会错,但如果你问的是一条小河,你问它一个乌有之乡的总统是谁,它就开始编造答案了,所编造的 tokens 答案就是给定上文中概率分布中大概率出现的候选。这些候选的集合自然形成了相应的本体类型。

老马追求的所谓 truth GPT,往正面说,最好的结果也不过就是找到限制其编造细节的副作用的方法,而不是也不可能禁绝编造。

在NLP乃至人类认知智能的所有任务中,有些任务存在编造的副作用,例如,事实查询和问答、知识教育等。有些任务根本就不存在这个问题,例如辅助写作、机器翻译(原文中的“谎言”不能因为非事实而翻译成事实,因为忠于原文是翻译铁律),有些任务需要在事实和虚夸之间掌握一个度,例如创意广告。如果坚持 GPT 是通用的基础模型,可以帮助完成上述种种任务,老马应该明白,实际上根本就不存在什么 truth GPT。在序列学习中,大模型永远只能记住飘在上面的细节(真实)。无论模型多大,甚至改变设计,它都不可能穷尽大数据序列中表达过的事实(或人为的编造、口误、非事实),它一定会对这些信息做归纳抽象,对于统计上漂移在阈值以下的实体做不同程度的本体化概括,体现在最终的模型表示中。换句话说,模型本身一定是实体(entity)事实和本体(ontology)概念的混杂。这是语言大模型呈现和逼近知识库的基本形态,在现有的框架下不会改变。

这是从大模型的(离线)学习/训练的角度来看。大模型作为训练的结果,那如大海一样混沌的多维向量表示里面涵盖了有限的事实以及更多得多的非事实(事实的抽象),但原则上并不包括没有数据根据的“谎言”(模型自己编造的细节)。编造细节发生在大模型的生成过程(在线推理)中。GPT这样的生成大模型在简单的 next token 预测的生成框架下,不可避免地编造细节,因为语言生成的 token 默认反映的就是细节事实,而不是本体概念。当模型缺乏实体细节的时候(表现为对于反映细节事实的tokens的预测概率很低),模型就会根据模型在此刻的本体指向,去找来(最)接近这个本体(例如 本体为【人】)的实体(例如 实体为【张三】)来充数。张冠李戴的原理不过如此。

从这个原理来看,限制细节编造的副作用并不是无可作为。大模型本身原则上没有内含谎言,但大模型在生成语言的时候,受限于要表达最丝滑的自然语言序列,不得不用实体 tokens 来保障语言的顺畅自然,从而开始“一本正经”地说谎。既然如此,如果刻意以牺牲自然语言生成的自然度为代价,其实是有办法做到避免细节编造的恶果的。简言之,可以坚持一个原则:宁肯不给细节,也不编造细节,因为它所背靠的大模型本来就是如此。

类似“宁肯不给细节,也不编造细节”的原则,在人类社会已经是正人君子所广泛采纳的做人原则了:我也许没有勇气或必要说出真相,但我可以绝不说谎。关于LLM下“宁肯不给细节,也不编造细节”的种种方法,可以再论(很多人都在探索,不会太久一定会有创新或突破),但坚持这个原则并找到合适算法实现以后的生成序列形态是可以想见的。下面给出几个案例,结束这一番自言自语式LLM探究之旅。

(1a)GPT的“一正胡八”(一本正经胡说八道): 
User: Who was the first woman to walk on the moon?
Answer: The first woman to walk on the moon was Sally Ride.
(1b)未来GPT的“宁肯不给细节,也不编造细节”的生成形态: 
User: Who was the first woman to walk on the moon?
Answer: The first woman to walk on the moon was 【Person Name】.
(2a)GPT的“一正胡八”:
User: What is the boiling point of water on Mars?
Answer: The boiling point of water on Mars is 100 degrees Celsius (212 degrees Fahrenheit)the same as on Earth.
(2b)未来GPT的“宁肯不给细节,也不编造细节”的生成形态:
User: What is the boiling point of water on Mars?
Answer: The boiling point of water on Mars is 【Number】 degrees Celsius (【Number】 degrees Fahrenheit).

做到这一点,语言大模型也已经仁至义尽了,具体答案可以在语言模型之外去找(例如外挂知识图谱,外挂计算器,等等)。实现上述形态的方案其实有好几种,有内在的也有外在的,本篇点到为止。

 

 

AI浪潮博客目录

 

The ChatGPT Tsunami and Its Impact on IT Landscape and New Ecosystem

This is my recent invited talk given to young entrepreneurs on the LLM and ChatGPT ecosystem.  

1. ChatGPT:  "Tower of Babel" for Human Languages

Natural Language Processing (NLP) is the crown jewel of AI. AI is mainly divided into perceptual intelligence and cognitive intelligence, and the leap from perceptual intelligence to cognitive intelligence is mainly reflected in the ability to complete NLP tasks. Human language is the carrier of human knowledge, and mastering language is a gateway to entering human cognitive intelligence. For thousands of years, eliminating language barriers has always been a dream of mankind. Babel in the Bible refers to the tower that mankind wished to build to overcome barriers of human languages, but it was considered to be impossible to build. We NLP practitioners have also been pursuing this dream, hoping to get closer to the final goal of overcoming the language barrier.


Download

However, on November 30, 2022, remember this day, with the official launch of the ChatGPT model by the American artificial intelligence company OpenAI, the Tower of Babel was officially completed! It not only successfully eliminated the language barriers for mankind but also established a bridge between humans and machines. In no time did we all realize that a ChatGPT tsunami had swept across the world.

Why is ChatGPT judged to be the Tower of Babel? Because its language performance is actually more "native" than native speakers: native speakers inevitably have slips of the tongue from time to time, but the large generative language model like ChatGPT is difficult to make such mistakes and seems to be always in line with language habits. From the input side, it can understand any human language. From the output side, it can speak fluently. What is most shocking is that from its language performance, we can observe what is called the "Chain of Thought" (CoT) behind its responses, with certain logical reasoning abilities, giving people the impression of being clear and organized. Behind the input and output is the so-called LLM (large language model, GPT in particular), which is like a bottomless black hole to users. Inside are actually many layers of neural networks, represented internally as multidimensional vectors, which house a ton of knowledge. 

Let's take a look at how the LLM behind ChatGPT is developed. There are already tons of technical introductions on this topic, and we will briefly describe the underlying principles. Its basis is GPT-3, or more precisely, the latest version called text-davinci-003. This model is first of all extremely large in scale, and its size is believed to have made miracles happen. With billions of tokens as training data, it forms a model with billions of parameters. Research has shown that generic large models will exhibit an "emergence" of certain skills once they reach a certain scale, and these emerging skills can perform well in various multi-task scenarios with minimal prompting. Previously, this phenomenon was generally attributed to the "transformation of quantity into quality", and it was basically treated as a mystery in philosophical terms. It is like saying that everything is attributed to God's favor.

In my understanding, it is not that mysterious, but a reasonably natural result as the emergence of multi-task skills has to be based, and can only be observed, on a super-large data model.  This is because otherwise, there is no sufficient space for the model to tune itself based on human preferences. Large language models are learned from text sequences, and their greatest feature is their ability to over-generate, giving many possibilities for subsequent sequences like "chain reactions", but only a small percentage of these possibilities are desirable and beneficial. Many generations may be shallow, empty, or even toxic. ChatGPT's breakthrough lies in the meticulous final fine-tuning process, using reinforcement learning as its core, it found an effective method to keep aligned with human preferences. This is like having a huge basin with numerous children bathing inside, and now you want to pour out the bathwater without pouring out the children. It is almost impossible. But if you can afford to lose some, the result is that the water is poured out, with some good children still inside the basin to help the case. The premise of doing this is that the basin must be large. Only super-large data models can achieve this with sufficient abilities left for numerous tasks. For example, what proportion of parallel translated text or of data of question-and-answer pairs is there in a normal language raw corpus? It's a tiny tiny fraction, and when the data size is small, it is hard to learn the translation or question-answering skills from sequence-based learning. Only with super-large data and model can the small proportion multiplied by a large number of tokens create the necessary conditions and soil for implicit learning of such skills. In a basic model with almost infinite generation possibilities, if enough work is not done in a later stage, the probability of generating useless responses is high. Therefore, "aligning with human preferences" becomes the ultimate goal of fine-tuning. In this process, many children were also poured out, which is called the "alignment tax" in the literature. But it doesn't really matter, because people can't see the lost treasures, as long as they see the good results, it's fine. Large models have enough redundancy and can survive filtering and pruning at all levels. In fact, it is not the large model itself that creates miracles, but the large model prepares a warm bed for miracles to happen.

What makes ChatGPT different from previous large models is that it has carefully planned for reinforcement learning from human feedback. For a generic open system, humans cannot really pinpoint where it is right or wrong, but at least they can say whether the response is good/useful or bad/no-value. Using this type of feedback to reinforce the learning and to fine-tune the large model, ChatGPT suddenly becomes very human-like. Human-machine interaction has changed from humans accommodating machines and having to write code, to machines accommodating humans and understanding human language. This is a huge transformation.

Reinforcement learning is relatively a difficult type of learning algorithm compared with other supervised learning approaches because it involves a long chain and the definition of the ultimate goal is not explicit and direct, but indirect based on the final outcomes. The idea behind training is to suppress the high probability of poor performance in the original model and bring out the low probability gems hidden in the model: the child is the reinforcement target that conforms to human expectations, but not a specific child as the optimization target. In any case, there is no unique answer format in this world, and there is usually no golden standard for a generation. What we have is the fuzzy feedback given by humans based on preferences: this answer is good, that one is nonsense; this one is correct, that one is discrimination. A typical method that can make good use of this terminal feedback is reinforcement learning. Once this feedback loop is established, the model can be continuously strengthened and iterated, and its performance will naturally improve. So, after some meticulous learning from human feedback, on November 30, 2022, the curtain was lifted, and this was the moment when humans witnessed the miracle.

To be honest, I have been engaged in NLP for my whole life, and I never thought I would see such a miracle in my lifetime. It has been three months since ChatGPT was created, and it still feels like a dream. Sometimes I stare at the ChatGPT icon and ask myself, is this the language gateway to the new ecological universe? I have to say that all the signs indicate that ChatGPT has unlimited potential for NLP.

Let's take a step back and review the contemporary history of the golden decade of artificial intelligence.

Ten years ago, in the ImageNet competition, deep learning overwhelmingly crushed all other machine learning performances in the image field, triggering a landmark neural network revolution. Deep neural networks rely on supervised learning of big data. Since then, we have known that as long as the data is large enough and labeled, deep learning can handle it. After sweeping through image, speech, and machine translation, it encountered the stumbling block of NLP because many NLP tasks do not have large-scale language data with labels.

Five years ago, the NLP field saw the emergence of large language models (LLMs) represented by BERT and GPT. LLM can directly "eat" language without the need for annotations, which is called self-supervised learning in academia. LLM marks the arrival of the second revolution, which pushed NLP to the center of AI and became the core engine of cognitive intelligence. AI finally overcame the dependence on labeled data which had been the knowledge bottleneck for NLP, leaping from perception to cognition.

Three months ago, ChatGPT was born, creating an almost perfect human-machine natural language interface. From then on, machines began to accommodate humans, using natural language to interact, rather than humans accommodating machines, using computer language. This is a groundbreaking change.

From the emergence of LLM to the advent of ChatGPT, it truly externalized both its linguistic talent and its knowledge potential, allowing ordinary people to experience it. Looking back, human-machine interaction and its related applications have been explored for many years, but before ChatGPT came out, it had never really been solved. When the GPT-3 model was launched two years ago, skilled players of us already knew how capable it was. As long as you give it a few examples, it can follow the examples to accomplish various NLP tasks, so-called few-shot learning. It does not require major modifications to the large model or large-scale labeled data. With just a few examples, GPT-3's potential can be unleashed to accomplish various NLP tasks, which is already amazing as it overcomes the knowledge bottleneck of supervised learning. However, the basic limitations of these amazing performances of LLM are mostly known within a small circle of players, and a language bridge is needed for its true breakthrough. ChatGPT has come forward with its biggest feature, zero-shot learning, which means that not a single labeled sample is needed, and you can directly tell it what to do. After five years of supervised learning and five years of self-supervised learning of the deep neural network revolution, the final result has been delivered, and the ChatGPT Bebel tower has been fully constructed, marking the pinnacle of the golden decade of AI. ChatGPT has since been like a tsunami, stirring up the world and causing a sensation all over. 


Download

Looking at the history of AI from a broader perspective, 30 years ago, the main approach to NLP tasks was through symbolic logic. Symbolic routes and machine learning are the two paths that have alternated in dominance in AI history every 20-30 years, like a pendulum. But in the past 30 years, machine learning has been on the rise as the mainstream, with the deep learning revolution in the last 10 years. The pendulum shows no sign of swinging back. We practitioners have been on a long journey of the symbolic rule system. It is not in the mainstream, rarely even mentioned by anyone, but it has not been lacking in its own innovation with its own differentiated advantages. It is worth noting that the symbolic parser has eventually embraced data-driven empiricism and relies on a pipeline of multiple modules to ultimately deal with the hierarchy of language structures. We call this deep parsing. Similar to LLM, deep parsing consists of many levels (around 50-100 levels) of bottom-up processing. It also first digests the language but parses incoming sentence sequences into internal symbolic graph structures, rather than LLM's vector representations. Although deep parsing and deep learning take different representation schemes, both empower downstream NLP tasks, one with structures and the latter with vectors, both greatly improving the efficiency of downstream NLP tasks. Of course, LLM is still the stronger player because it not only masters syntax structures but also performs exceptionally well in discourse and computational styles, the former involving long-distance discourse relationships and the latter capturing subtle differences in language expressions.  Discourse and computational style pose a significant challenge to parsers that primarily focus on sentence structures.

There have always been two main lines in AI. In addition to machine learning, there is traditional symbolic logic, which rises to the philosophical height of rationalism versus empiricism. These two paths have waxed and waned over the past 30 years, with machine learning on the rise and symbolic logic disappearing from the mainstream stage, although the industry has never given up on its use. The transparency and interpretability of symbolic logic translate directly into the convenience of engineering fixed-point error correction, which contrasts with LLM's black-box-like internal vectors. LLM can use retraining to macroscopically improve, or use fine-tuning or few shots to induce. LLM cannot do pinpoint correction or debugging like in surgery. LLM's lack of interpretability also often causes user concerns and confusion in practical applications. Perhaps one day in the future, the two paths will converge at a point where a new AI revolution will occur.

From the perspective of AGI, we see that almost all models before LLM were specialized, and the narrower the task, the better the performance. One exception is the parser, which is in essence the "symbolic foundation model" in the pre-LLM era, empowering downstream NLP tasks with structures, just like LLM does with vectors. From a more general perspective, the emergence of LLM represents a breakthrough in the development of artificial intelligence towards achieving AGI, or Artificial General Intelligence. AGI has long been a controversial goal, and many scholars, including myself, have doubted or even mocked its feasibility. However, with the advent of LLM five years ago, AGI became more scientifically viable, rather than just a Utopia. OpenAI, which champions AGI, has become the shining star in this field, having delivered a long list of influential LLM general models that include the GPT series for NLP, Codex for code writing and debugging (eventually used for Microsoft's Co-pilot service), and DALL-E for image generation.

With ChatGPT as the pinnacle, large models have taken over all NLP tasks simply by using natural language as instructions, not only those defined by the NLP community but also many user-defined tasks. Its NLP tasks are completely open. Tasks related to language and knowledge can be attempted in any language, and often the results are immediate and magical at the same time. Someone has listed 49 task scenarios that it can handle, but it can actually do much more than that.  In addition, new scenarios are being discovered all the time. This is an unprecedented phenomenon in the history of AI, which the industry calls "skill emergence".

We can examine why it is so capable and knowledgeable. Overall, human systematic knowledge is largely expressed in language. Human knowledge is mainly carried in the form of text (written language), and mathematical formulas can be seen as an extension of written language. From a linguistic perspective, human knowledge can be divided into linguistic knowledge and knowledge beyond linguistics. Linguistic knowledge includes lexicon knowledge, syntax, morphology, discourse, style, etc. Knowledge beyond linguistics is a much broader circle with a much wider boundary. Large language models have not yet mastered human knowledge as a whole, and it seems that they have managed to capture some knowledge floating on top of the sea of human knowledge. As for ChatGPT, it can be said that it has mastered almost all of the linguistic knowledge, but only about 20% of human knowledge in general, including common sense, basic logic, and encyclopedic knowledge. It calls for more serious research to quantify it properly, but in the ballpark, it feels like about 20% of the knowledge has been learned, and the remaining 80% is still not within reach. However, the law of large numbers applies here, namely the 80-20 rule, which means that mastering 20% of the knowledge floating on top in effect covers 80% of the scenarios. However, since there is still an 80% knowledge gap, it still pretends to know things it doesn't from time to time.  Given that, LLM can still reshape the ecosystem and the world if we learn to use its strengths and to handle its weaknesses wisely.

How do we judge whether it has learned and how well it has performed a task? In any NLP task, there is a quality assurance (QA) protocol to follow, which requires at minimum a test set of annotated samples. Currently, ChatGPT uses zero-shot learning (i.e. zero samples), where a random task is assigned to it and once it is done, it moves to a new task, so there is no chance for building a persistent test set.  So its performance on result quality cannot be quantified directly. In such cases when the internal testing protocol is missing or no longer applicable, external methods must be used to evaluate the data quality indirectly, such as customer surveys or using my previous company Netbase's social listening service to collect customer feedback online. All the external signs indicate that customer satisfaction seems to be over 80%, and in most task attempts, customer needs are met fairly well, at times with nice surprises and miracle-like performance. Another relatively objective external indicator is user stickiness and growth of user accounts.  ChatGPT has set unprecedented records in this regard, with tens of millions of users in just a few weeks. ChatGPT's customer growth rate exceeds everyone's imagination.

In conclusion, ChatGPT represents a major breakthrough in the field of natural language processing and artificial intelligence. As a large language model, it has revolutionized the way we approach NLP tasks and has demonstrated remarkable versatility and capability. However, it is important to keep in mind that ChatGPT is not perfect and there is still much work to be done in terms of improving its performance and addressing its limitations.

Despite these challenges, ChatGPT has already had a profound impact on the field of AI and is poised to continue shaping the future of technology in significant ways. As AI continues to evolve and advance, it is likely that we will see more breakthroughs of LLMs that push the boundaries of what is possible and help us achieve even greater levels of understanding and innovation.


Download

Over the last three months, there has been no end of online forums, discussions, and talks about ChatGPT, and there is still no sign of aesthetic fatigue. Recently, the former head of Y Combinator China Dr. Lu Qi came to Silicon Valley to give a passionate speech, which added fuel to the fire. He compared ChatGPT's revolution to Web-1. As we all know, the iconic brand that represented the first Internet boom was the Netscape browser. Although Netscape did not grow to a large company, it was the internet revolution it started that created giants like Yahoo, Google, and Amazon. A similar revolution occurred in China, giving rise to world-class companies such as Baidu, Tencent, and Alibaba. Lu Qi believes that we are right now in such an era. He said that the roadmap is so clear, and the trend is so obvious that he has absolutely no doubt in his mind. Overall, I largely agree with his view of technological trends and landscape.

ChatGPT marks the emergence of a new era. Some people say that this is the "iPhone moment" or "Android moment" in the history of contemporary information technology and will lead to a brand-new ecosystem. I feel that Lu Qi's comparison is more comprehensive, as ChatGPT is like the "Netscape browser" that initiated the first Internet revolution. Regardless of the comparison, it is a game-changer.

However, it is essential to note that ChatGPT also has its shortcomings and challenges. One issue that everyone has noticed is the so-called hallucinations, in fabricating details and distorting facts. Although ChatGPT has conquered any form of human language, it has only scraped the tip of the iceberg of cognitive intelligence. Is it possible for LLM to solve this problem completely? In my opinion, the LLM route alone will not solve cognitive intelligence. As mentioned earlier, ChatGPT has only covered about 20% of human knowledge. Even if LLM continues to expand several orders of magnitude in sequence-based learning, in my estimates it can at best reach 40%-50%. The remaining 50% is a deep sea that can hardly be fathomed. The long tail of knowledge is an absolute explosion of combinations, way beyond the reach of sequence-based language learning. The annoying behavior is that for any knowledge beyond its ken, LLM will not hesitate to fabricate it with fake details that appear genuine. This is a severe problem. The accuracy defect of such long-tail knowledge is an inevitable problem for application services based on LLM.

Moreover, there are many other issues that need to be overcome. For example, when a large model empowers downstream scenarios, how can customer privacy and security be protected during the process of calling the large model? This problem has not yet been solved, but it is believed that better solutions will develop in time. The supplier of large models will surely pay special attention to this issue and provide solutions for their ecosystem's development.

Another issue is the complex reasoning ability. From the conversations of ChatGPT, we observe that it already has basic reasoning ability. The source of this ability is very interesting. It mainly benefits from self-supervised learning of the massive computer code base. The GPT3.5 on which ChatGPT is based has been trained not only on human natural language but also on massive available open source code written in various computer languages on GitHub, and most of the code has corresponding natural language explanations (comments) too. Since computer code is by nature more logical than natural language, this has helped ChatGPT to organize its response and speak more coherently. This was said to be a nice surprise that the developers themselves had not anticipated. However, it currently still has shortcomings in complex reasoning logic. Fortunately, complex reasoning ability is different from the boundless knowledge network. It is a relatively closed logical set, and it is believed that it can be solved in not too far a future (perhaps GPT4 might already be able to handle it?).

Lastly, let's talk about the progress of multimodal learning. LLM, as the basic model, has been validated in NLP multi-tasking and has performed exceptionally well. After the breakthrough in NLP, the framework for empowering downstream tasks with a basic model began to radiate toward other modalities. This direction of research is very active in the academic field of multimodal learning. Everything is still ongoing. Currently, the level of multimodal learning in practice is still in the stage of prompt engineering. What is lacking is a natural language interface. People who play with prompts in large models for image and music generation already know the huge potential and effectiveness of the basic model. It is very similar to the situation when we played with few-shot prompts in the GPT-3 playground before ChatGPT was born. It can be foreseen that in near future, a smooth natural language interface will emerge, and users will be able to describe the art they desire, whether it is a painting or a song. The work of aligning with human taste is also ongoing. It is predicted that a natural language to image (NL2img) model like "ChatDalle", similar to ChatGPT, will implement the desired natural language interface. The same trend is bound to happen in natural language to music (NL2music). We are in an exciting new era of AIGC (AI-generated content) for art creation.

Another predictable picture is that based on the trend of multimodal LLM, there will eventually be a unified large model that integrates various modalities and their associated knowledge. The breakthrough of this model barrier will provide critical support for entrepreneurs to utilize LLMs to empower downstream applications in various scenarios. As we all know, whether it is finance, law, or medicine, each major vertical has its accumulated long-standing structured symbolic knowledge base, including the domain ontology and other databases. How to connect to the domain's symbolic resources involves breaking the domain barrier. It is expected that this barrier will be largely solved in the next two to three years.

2. LLM Ecosystem Facing Reshuffling

The direct impact of the ChatGPT tsunami is that the NLP ecosystem is facing a reshuffle, and every existing information product or service must be re-examined in the context of LLM.

When we first discussed ChatGPT’s impact on IT services, the first thing that came to our mind was how to combine ChatGPT with search technology, and whether it could re-invent search.

Search is traceable, and every returned result is recorded, so it involves no information fusion. ChatGPT is untraceable and excels at information fusion: ChatGPT has no possibility of plagiarism in essence. Every sentence it spits out is novel sequence based on its digested information sources. Apparently, traditional search and ChatGPT have their own respective advantages and disadvantages. Search is the king of information services, ubiquitous, with a very stable business model. Since the rise of search in the Web 1.0 era, the form and mode of search have basically not changed for more than 20 years. In fact, new technologies and entrepreneurs have been trying to challenge search continuously over the years, and the venture capital industry has also been paying attention to potential search subverters that may become the "next Google", but the status of search has always been unshakable, at least until now. But this time is different. Microsoft has exclusive code authorization for ChatGPT and has boldly launched the so-called "new Bing". Google, who has dominated the space for so long, has to mobilize urgently and confront it head-on. A drama of search+LLM is unfolding, like a live drama, telling us that although there are still many difficulties to overcome in integrating these two technologies, the trend is unstoppable, and reshaping a new ecology of search is imperative.

In addition to search, those finely polished directional information products and services now face the fate of being re-examined and reformed, including chat, virtual assistants, grammar correction, machine translation, summarization, knowledge Q&A, etc. The representative services in these areas (Siri, Grammarly, etc.) used to have high technological barriers, which have suddenly been lowered.  Although many products are not facing a catastrophic crisis due to years of polishing and user inertia, some may still exist for a long time, after all, they are all on a downhill road. This is a revolutionary victory of general AI over traditional AI. It is something we would not believe feasible before. We used to be so skeptical of the general approach, waiting to see the joke of those who advocated AGI, such as Open AI who managed to launch a series of impressive LLMs (GPT series, Codex, DALL-E) including ChatGPT.

Look at Siri, which was released by Apple 13 years ago. 13 years is longer than the entire golden decade of the deep learning revolution, but Siri has only recently managed to offer 2-round or 3-round conversations. Amazon's popular product, Alexa, is the same. It has been polished for several years and accumulated so much user data. Now, with the advent of ChatGPT, what will Apple and Amazon do? They must embrace LLMs.

Next is the commonly seen e-commerce customer service. As we all know, Alibaba and JD.com's online after-sales customer service has been polished to perfection. Because after-sales service issues are relatively concentrated, the problem set is not large while the data are large, accumulated over the years. However, customer service is not only limited to post-sales.  In order to handle customer service smoothly, LLM cannot be ignored.

Moving on to education, it's clear that the ChatGPT model has the potential to revolutionize all education products and services. Anyone developing educational applications will need to reconsider how to embrace LLMs within the framework of the large model. Education itself deals with language, regardless of whether it is related to arts or science. Although the current large model is not particularly strong in science and engineering (yet), this knowledge gap will be filled to varying degrees soon. ChatGPT is sure to disrupt education, while also providing the largest opportunity for modernizing education. Language learning and computer programming education are obvious areas for ChatGPT to shine, as the model itself is a language model. Although its programming abilities are not yet at the level of professional engineers, it is proficient enough in common code formats to assist with programming and with the learning of programming. In fact, Co-pilot, which has been empowered by the GPT codex, has already become an auxiliary tool for more and more programmers.

Stepping back, we are also facing a huge risk, such as fake news. If one wants to promote a company or product, one can now use ChatGPT to generate all kinds of promotional posts that sound convincing. In the future, those online reviews and comments will also be obscured by fake news, as the cost of creating fake news approaches zero. Without proper precautions, all of this could place humanity in a world where truth and falsehood are indistinguishable. All along, we have been talking about the benefits of LLM and how it can empower new ecosystems for productivity explosion. We expect that in the next five to ten years, new international IT giants like a new Google or New Alibaba will emerge under this new ecosystem, leading to a major transformation in the technology ecosystem. But the danger of LLM misuse is equally great. Is mankind ready for it? Clearly not. Of course, this is another topic, and we will leave it there for now.

3. Wave of Mass Entrepreneurship Coming

With LLM (ChatGPT in particular), there are more product forms and services waiting for entrepreneurs to explore.

Regarding this topic, we need to emphasize the unprecedented entrepreneurial conditions brought by ChatGPT. ChatGPT itself has become a testing ground for products. It is a playground with an infinitely low bar that everyone can play in. The low bar is due to the paradigm shift in human-machine interfaces mentioned earlier. For the first time in AI history, machines began to cater to humans, rather than humans catering to machines. Human language, rather than computer code, became the tool for human-machine interaction. The significance of this change for the new ecology of NLP is difficult to overemphasize. In fact, this provides conditions for "mass entrepreneurship".

Those who have started AI businesses should all have this experience. The most basic condition for a startup team to have a chance of success is that the product manager and the technical leader can work closely together and communicate effectively. The product leader, relying on their market intuition and understanding of customer needs, strives to find the best market entry angle for technology to be transformed into a service and form a product design plan. The feasibility of this design plan needs to be endorsed and then developed by the technical leader. However, often due to different professional backgrounds and knowledge structures, the situation where the product manager and the technical leader talk past each other is not uncommon. Once this situation arises, the startup company is basically doomed to fail.

ChatGPT fundamentally eliminates the problem of talking past each other. Previously, only the technical leader and programmers could verify the feasibility of a plan, but now, the product leader/CXO, engineers, data analysts, and users with different backgrounds and expertise all have a unified platform, ChatGPT, on which they can illustrate product ideas. Everyone can simulate services on it. Not only has the communication barrier between humans and machines been overcome, but also the communication barrier between different teams. The emergence of this thing is a precondition for a product explosion and mass entrepreneurship.

In the United States, hundreds of startups are now exploring ideas of downstream products and services following ChatGPT or the backend LLMs. While the upstream big models are still rapidly progressing, what they are doing downstream is already in active development. There are countless ordinary people sharing their stories online, showing how they can earn 5,000 dollars using ChatGPT in just two or three hours. This kind of sharing means that the entrepreneurial enthusiasm of grassroots people has been mobilized. It seems that everyone can use this opportunity to find an entrepreneurial perspective. Summarizing these grassroots ideas may also lead to new tracks that can be standardized and scaled to meet market demands.

A big model like ChatGPT is ultimately an operating system-level existence. Every AI-related information product and service, especially those related to language and knowledge, cannot do without it. When Intel dominated the market, the famous logo was "Intel Inside". In the future, it will be "Chat-Inside", or more accurately, "Chat-In&Out". Why in and out? When a big model like ChatGPT empowers products, it is both like a waiter and a chef. The waiter can take your order, interact with you, and understand your needs while also doing the cooking and delivering the service. It requires both language talent and knowledge skills. This is what we call the LLM expert workbench, which may be the biggest new ecological form in the next five years and may open countless doors for entrepreneurship. The basic service form is online information services in various industries, whether it is online education, online lawyers, online consultants, online finance, or online tourism. All are aimed at significantly improving service efficiency. With ChatGPT, you only need to hire one expert to replace the 10 experts that were previously needed to handle tasks. The end result is a productivity explosion.

In conclusion, the wave of mass entrepreneurship is coming, and ChatGPT has brought unprecedented entrepreneurial conditions. It has become a testing ground for products with an infinitely low bar that everyone can play in. The emergence of this technology has eliminated communication barriers between humans and machines and between teams, leading to new tracks that can be standardized and scaled to meet market unmet needs. The future of ChatGPT as an operating system-like existence may be the biggest new ecological form in the next five years, called the LLM expert workbench, which open doors for entrepreneurship and will lead to a productivity explosion.

At this point, the application ecosystem seems very clear. The principle is that experts must be the final filter before delivering the results (human judge as final filter). This is the basic setup, but experts may also provide input prompts to inspire LLM to produce better results.

For almost every application scenario, there is a task to create an expert workbench, including supplementing existing products or services, such as every segment of online education, as well as online doctors, lawyers, financial consultants, etc., and exploring previously unthought-of business scenarios. This is a visible transformation or reshuffling of the ecosystem, providing efficient expert advice (expert-in-loop services).

Speaking of workbenches, e-commerce giants have built relatively large customer service workbenches, which were introduced when user needs and satisfaction could not be met with fully automated solutions or with fully manual solutions. Now with LLM, this form can be extended to all online service sectors. The productivity explosion that this can bring about is beyond imagination.

The design concept of "Human as Judge" has been validated for several years in low-code platforms (such as RPA platforms, parser-enabled information extraction platforms, etc.) for its effectiveness and efficiency. Here, we are talking about a completely new form, where humans only need to act as judges to complete the service. It is now entirely possible to create online information service workbenches tailored to various segments or scenarios, with experts sitting in the background. Specifically, the expert's role is only to make the decision based on their knowledge and experience, especially at the final "go or no-go" moment. Being a judge is much more efficient than being an athlete.


Download

It is worth emphasizing that ChatGPT brings something new as enabling information technology, as it serves both at a backend and a frontend. It can perform well in high-level and low-level tasks, which is why chat is just the surface of ChatGPT, and its essence is a human-machine interface. Its ability to complete various NLP tasks is at its core. With both surface and essence, downstream products or services can be built around it. In the Intel era, computer product brand advertisements were remembered as "Intel inside," and in the future, the new ecology should be called "chat in&out," which refers to the new ecology empowered by LLM, not only empowering the human-machine interaction but also empowering the professional services, with only experts providing the final check. In this form, the experts are behind the scenes. To put it another way, LLM is both a waiter and a chef, but an expert needs to review the food and take responsibility before it is served to ensure service quality (such as online doctors, lawyers, consultants, etc.).

In such an ecosystem, the next five years will be a period of explosive growth for online services. Fortunately, the three-year pandemic has greatly promoted the grassroots awareness of online services, helping to cultivate user online habits and develop the market.

While LLM is powerful in terms of breadth of knowledge, it also has its limitations in terms of precision. The key challenge in building an expert-in-loop service is to overcome the precision bottleneck of LLM. The goal is to raise the precision to a level where it does not significantly impact the efficiency of the expert's work. If at least 1/4 of the results generated by LLM can match the level of a manual expert's research, then the efficiency of the expert-in-loop service can be ensured. This is a feasible expectation, and the current solutions are not far from meeting this threshold. With this in mind, we conclude that the door to entrepreneurship in the new ecology of LLM has indeed been opened.