Reading Notes on Anthropic's Claude 3 Tech Report

As with OpenAI and Gemini, the technical report withholds implementation details such as the specific model architecture, training methods, and hyperparameter settings.

Reading the Anthropic tech report, my main impressions are:

On the headline metrics, Claude 3 matches or surpasses GPT-4V, pulling it down from its pedestal as the unchallenged ceiling of LLMs — at the very least the two now stand as equals. The world just became more interesting, because nobody wants to watch a one-man show.

I used to think Gemini was GPT-4V's challenger, but after using it for a while I found the overall experience fell far short. Claude is different: hands-on, it feels remarkably smooth, especially on long-text understanding and Q&A. Delighted to see a real heavyweight step into the ring!

A few points that caught my attention:

Substantial progress against hallucination, with a big jump in factual accuracy: Anthropic built several internal evaluations that score the factual accuracy of model answers against reference answers. Claude 3 Opus reaches 46.5% accuracy on the 100Q Hard eval (a set of obscure, open-ended questions), nearly double Claude 2; on the Multi-factual eval, accuracy rises to 62.8% while the share of incorrect answers is cut in half. Crucially, the model has largely learned the intermediate state of saying "I'm not sure" rather than confidently fabricating wrong answers and passing them off as truth.
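To make that three-way scoring concrete: the evals distinguish correct answers, incorrect answers, and declines ("I'm not sure"). A toy scorer for that split (the labels and function name are mine, not Anthropic's) might look like:

```python
from collections import Counter

def score_answers(graded):
    """graded: list of labels 'correct' | 'incorrect' | 'declined'.
    Returns the three rates the report distinguishes: accuracy,
    wrong-answer rate, and the 'I'm not sure' (declined) rate."""
    n = len(graded)
    counts = Counter(graded)
    return {label: counts[label] / n
            for label in ("correct", "incorrect", "declined")}

rates = score_answers(["correct"] * 5 + ["incorrect"] * 2 + ["declined"] * 3)
print(rates)  # {'correct': 0.5, 'incorrect': 0.2, 'declined': 0.3}
```

The point of tracking "declined" separately is exactly the report's framing: cutting the wrong-answer rate matters more than raw accuracy alone.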

A long-context highlight: the QuALITY reading-comprehension benchmark uses passages averaging about 5,000 tokens, far beyond the input length of typical earlier models. Claude 3 Opus scores 90.5% in the 1-shot setting and 89.2% 0-shot, substantially closing the gap to the human baseline of 93.5%.

A sharply lower refusal rate: this is welcome news, because Claude had long been criticized for refusing too often. Anthropic has stressed safety and compliance from day one and initially erred on the side of strictness; the new models greatly reduce false positives, improving the user experience — which matters a great deal in its contest with OpenAI.

This reflects a better balance between helpfulness and harmlessness. Earlier models, to avoid harmful output, were often overcautious and refused harmless questions. Claude 3 refuses far less on harmless prompts while keeping a high refusal rate on genuinely harmful ones: Opus's incorrect-refusal rate drops from Claude 2's 35% to 9%. Through optimization with human feedback, Opus judges much better what is truly harmful and what can safely be answered.

To sum up, eight strengths of Claude 3:

1. Excellent multimodal understanding, on par with GPT-4V: Claude 3 handles text, images, and other input modalities well, with strong performance in handwriting recognition, visual reasoning, and image content moderation, paving the way for language models to tackle real-world problems.

A footnote in the technical specifications spells out the supported image formats (JPEG/PNG/GIF/WebP) and limits (at most 10 MB, resolution no higher than 8000x8000), and recommends avoiding very small or low-resolution images.

Strong text recognition on low-quality, handwritten images. Claude 3 Opus can accurately transcribe a poor-quality photo with handwriting into text, and further convert the tabular portion into JSON — a demonstration of robust OCR plus text structuring.
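The structuring half of that pipeline — turning already-transcribed table text into JSON — can be sketched deterministically. The pipe delimiter and field names below are illustrative assumptions, not details from the report:

```python
import json

def table_to_json(table_text: str) -> str:
    """Convert pipe-delimited table text (header row first) into a JSON
    array of row objects, mimicking the transcribe-then-structure step."""
    lines = [ln.strip() for ln in table_text.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for ln in lines[1:]:
        cells = [cell.strip() for cell in ln.split("|")]
        rows.append(dict(zip(header, cells)))
    return json.dumps(rows)

raw = """item | qty
apples | 3
pears | 5"""
print(table_to_json(raw))
# [{"item": "apples", "qty": "3"}, {"item": "pears", "qty": "5"}]
```

What makes the model version impressive, of course, is that it performs the OCR step too, on messy handwriting, before any structuring.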

It identifies objects in images (though it declines to identify people) and can draw complex links between objects and abstract concepts. For example, Claude 3 not only recognizes a calculator in a picture but associates it with mathematics and computation, showing a degree of conceptual abstraction and reasoning. (Perhaps this is knowledge transfer from the LLM backbone in the multimodal model?)

The intended-uses section highlights the productivity gains of multimodality, e.g. interpreting all kinds of charts and images (GPT-4 can read charts too — exactly how much stronger is Claude?), supporting a wider range of enterprise scenarios.

2. Outstanding results across authoritative benchmarks: whether on general reasoning tasks like MMLU, math and coding tasks like MATH and APPS, or reading-comprehension and commonsense QA datasets like RACE-H and QuALITY, Claude 3 posts industry-leading numbers, repeatedly surpassing strong models such as GPT-4 and PaLM, demonstrating top-tier all-around capability.

On the Diamond subset, Claude 3 Opus reaches 50.4% accuracy in the 0-shot CoT setting, versus 35.7% for GPT-4. Diamond is the highest-quality slice of GPQA, suggesting that Claude 3 Opus has a distinctive edge on complex questions that demand expert knowledge.

3. Strong few-shot learning and reasoning: on GSM8K, MGSM, and GPQA, Claude 3 Opus masters complex tasks from just a handful of examples, with no fine-tuning; on the GPQA Diamond subset its performance approaches human-expert level, demonstrating powerful learning transfer and reasoning.

4. Standout multilingual understanding and generation: judging by the report's multilingual math reasoning (MGSM), multilingual MMLU, and multilingual dialogue evaluations, Claude 3 performs well beyond English and improves markedly over the previous generation, positioning it to serve users worldwide. Notably, its Chinese — previously a weak spot — has caught up this time.

5. Excellent open-domain dialogue and writing: human evaluators consistently rated Claude 3 above comparable models on creative writing, open discussion, and other dimensions; its output is more fluent, coherent, and engaging, which will greatly expand its use in content creation. I have tried it on a few tasks myself and was truly impressed; I will happily use it often from now on.

6. Impressive coding performance: Claude 3's results on HumanEval, APPS, and MBPP attest to its command of mainstream programming languages, making it a promising intelligent programming assistant for developers. It has also strengthened its ability to generate structured output on demand (YAML, JSON, XML), which eases enterprise integration and commercial deployment.
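When such structured output feeds an enterprise system, a thin validation layer is still prudent. A minimal sketch, assuming a hypothetical two-key schema (`name`, `price`) that is mine, not from the report:

```python
import json

REQUIRED_KEYS = {"name", "price"}  # hypothetical schema, for illustration only

def parse_structured_output(text: str) -> dict:
    """Parse a model's JSON reply and fail loudly if required keys are missing."""
    obj = json.loads(text)  # raises ValueError on malformed JSON
    missing = REQUIRED_KEYS - obj.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return obj

reply = '{"name": "widget", "price": 9.99}'
print(parse_structured_output(reply))  # {'name': 'widget', 'price': 9.99}
```

The better the model is at emitting well-formed JSON, the less often this guard fires — but downstream code should never assume it will not.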

7. Progress in long-document processing and retrieval: Claude 3 supports contexts of up to one million tokens (though production is still capped at 200K) and shows real strength on long-document reading comprehension such as QuALITY. Claude 3 Opus maintains over 99% recall of key information even in 200K-token documents, breaking the "lost in the middle" curse that has plagued large models on long inputs.

Anthropic has also been at the forefront of realistic-scenario evaluation: the "Needle in a Haystack" test it pioneered has become an industry standard for measuring large models' long-document capability.
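The needle-in-a-haystack protocol itself is simple to sketch: plant one fact at varying depths in filler text and check whether the model's answer recovers it. Here a trivial echo function stands in for the model call so the harness runs deterministically; with a real API, `answer_fn` would prompt the model to retrieve the needle:

```python
def build_haystack(filler: str, needle: str, depth: float, n_chunks: int = 100) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    chunks = [filler] * n_chunks
    chunks.insert(int(depth * n_chunks), needle)
    return " ".join(chunks)

def recall_at_depths(needle: str, depths, answer_fn) -> float:
    """Fraction of insertion depths at which answer_fn recovers the needle."""
    hits = sum(needle in answer_fn(build_haystack("lorem ipsum.", needle, d))
               for d in depths)
    return hits / len(depths)

# A stand-in 'model' that just echoes the context back, so recall is perfect.
score = recall_at_depths("The secret code is 4217.",
                         [0.0, 0.25, 0.5, 0.75, 1.0],
                         answer_fn=lambda ctx: ctx)
print(score)  # 1.0
```

Real runs sweep both depth and total context length, producing the familiar recall heatmaps.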

8. Systematic safety evaluation and mitigation: Anthropic's dedicated Responsible Scaling Policy (RSP) assesses the safety risks Claude 3 could pose along multiple dimensions and applies a series of mitigations. Although the evaluations found no catastrophic risk so far, Anthropic still prepares for a rainy day; this careful, rigorous posture deserves credit.

On trust and safety specifically, Anthropic conducted comprehensive multimodal red-teaming to reduce the likelihood of harmful outputs. The results show Claude 3 Opus and Sonnet gave harmless responses to more than 97% of red-team prompts, and on dangerous topics they deftly steer the conversation in a more ethical direction.

The harm-reduction effort is paying off: red-teaming shows that, after targeted optimization and training, Claude 3 responds appropriately to dangerous or policy-violating topics and redirects the conversation toward more ethical ground, materially lowering the risk of Claude being misused to produce harmful content.

Clear-eyed about runaway risk: Anthropic once again flies its "constitutional" banner, asserting its leadership in ethics, safety, and robustness. The report candidly notes that as AI capabilities advance rapidly, the risks of loss of control and misuse cannot be ignored. Anthropic participates actively in global AI governance and standards-setting, acting the part of a responsible AI company.

Coding ability is an important highlight of Claude 3 and deserves its own summary. The Claude 3 family posts excellent results on multiple authoritative programming benchmarks. Its coding capability can be summarized in four aspects:

On HumanEval, Claude 3 Opus reaches 84.9% accuracy, far above GPT-4's 67% and GPT-3.5's 48.1%, indicating a thorough and deep command of Python.
On APPS and MBPP, Claude 3 Opus reaches 70.2% and 86.4% respectively. APPS covers a broad range of applied Python problems, while MBPP tests generating correct code directly from a problem description. These results further confirm Claude 3's Python proficiency.
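Scores like these are conventionally reported as pass@k over sampled completions; the standard unbiased estimator introduced with HumanEval (a general evaluation convention, not specific to Claude's report) is easy to reproduce:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn from n total completions (of which c are correct) passes."""
    if n - c < k:
        return 1.0  # too few failures to fill all k slots: guaranteed pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # ≈ 0.3 (reduces to c/n for k=1)
```

For k=1 the estimator collapses to the plain fraction of correct samples, which is what single-number benchmark tables usually mean.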

Strong code understanding and analysis:
On APPS, Claude 3 must understand a problem's natural-language description and translate it into correct Python. That requires not only grasping the essence and requirements of the problem but also designing sensible algorithms and data structures — evidence of strong code understanding and analysis.
HumanEval tasks are likewise described in natural language; completing them to high quality shows that Claude 3 understands the function and intent of code well.

Excellent code generation:
On MBPP, Claude 3 demonstrates strong code generation, producing correct code directly from a problem description. This "one step to working code" ability can greatly improve developer productivity.
Even in complex programming tasks that require several rounds of dialogue to clarify requirements, Claude 3 ultimately produces high-quality code, thanks to excellent context understanding and semantic tracking.

Beyond correctness, the code Claude 3 generates also does well on readability, robustness, and time/space complexity, which helps raise overall engineering quality.

Application prospects in software-engineering tasks:
Beyond direct code understanding and generation, Claude 3 has broad prospects in software-engineering tasks such as code completion, code documentation generation, and source-code-to-natural-language description.

Thanks to its strong language-model backbone, Claude 3 could become a capable assistant for intelligent software development, helping developers with higher-order tasks like requirements analysis, architecture design, and performance tuning.

Going further, Claude 3 might evolve into an intelligent "software-engineering consultant," offering end-to-end guidance and optimization advice to development teams and raising the maturity of the software process.

Of course, although Claude 3 has shown remarkable ability on code-related tasks, its application in real software-development settings still needs further exploration and validation. It will be interesting to see whether it can mount a challenge to Microsoft's Copilot.

On the downside:

No web search support yet (somewhat surprising, since it ought to be standard equipment), with a knowledge cutoff of August 2023.
For safety and compliance, the model refuses to identify people in images.

Visual understanding also awaits more thorough evaluation: although Claude 3 shows some impressive abilities such as handwriting recognition and visual reasoning, the report does not systematically benchmark common vision tasks like OCR, object detection, or image captioning. Red-teaming also found that it occasionally "hallucinates" visual content or misses policy-violating material. Systematic optimization on larger, more diverse visual datasets is still needed.

The report repeatedly notes that some evaluation methods are still at an early stage, e.g. assessments of AI loss-of-control risk and of AI systems' biological and cyber capabilities. Anthropic has taken precautions beyond the norm, but the evaluations themselves will have to keep iterating as AI capabilities evolve.

Overall, the Claude 3 family unquestionably marks a new milestone for language models and a formidable rival to GPT-4 (whereas Gemini, on the whole, still falls far short of GPT-4). Claude 3 breaks through, or pulls ahead, on intelligence, multimodal understanding, and safety evaluation. Anthropic's rigorous, careful, and transparent posture sets a fine example for responsible AI and preserves its leadership there. But Claude 3 is far from perfect; there remains plenty of room to improve on hallucination, evaluation frameworks, and more.

That is what the tech report tells us. As for the system itself, I will find time today to run some hands-on comparisons of GPT-4 and Claude 3 Opus and report the real individual-user experience. I have already subscribed to the latest Claude 3 Opus, so comparison experiments can start any time. Following a friend's test, I casually tried a grade-school math problem, with somewhat farcical results:

Still, this quirk does not affect my own use; I have never used LLMs to do math anyway. Over the coming months I will keep using GPT-4V and Claude 3 side by side, until the next monster — GPT-5 or Q* — descends.

 

It has been a while since I tended this blog — time to get back to it?

For a while I was focused on short video, in particular research and practice on AI-powered one-click video generation, and had no time for written blog posts, so this place lay fallow. The result was a scattering of short-video experiments posted to WeChat Channels and Douyin — beyond the research proper, a bit of amateur entertainment for myself and others. But porting the video modality over to the blog is a hassle, so updates stalled. It is not that there were no insights in the meantime, only no time to organize and write them up.

Douyin first: it is an ocean where content mostly sinks or swims on its own. Without promotion, even good work goes unnoticed beyond friends and family, hidden deep in the mountains. WeChat Channels is a bit better, since a long-accumulated circle of friends provides a base, so it is not completely deserted; but without active operation, growth is still scattered and slow. This is much like my decades of blogging: persistent recording, more to leave footprints and share with family and friends than to chase influence.

 

 

Unified Models Surpass Single-modal Models  (Gemini Notes 2/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

02.

Multi-modal Large Unified Models Finally Surpass Specific Single-modal Models  

Humans perceive, cognize, and generate emotions and consciousness through the integration of multiple senses. Gemini is also practicing this approach, processing multiple modal inputs, integrating them in the brain, and then expressing through various modal outputs. This comprehensive "simulation" of human intelligence by such models is rapidly evolving.

Previously, multi-modal model training resembled a system composed of separate eyes, ears, arms, and brains, lacking strong coordination. However, the direction represented by Gemini feels significantly different: it's as if the large model has become a complete digital person, where hands, eyes, brain, and mouth work in harmonious silicon unity. Gemini is the first true end-to-end multi-modal system.

In the past, models optimized for a single modality usually outperformed those handling multiple modalities simultaneously. The common practice was single-modality model training. Even GPT-4 primarily "concatenates" different modalities into an overarching framework, rather than being a unified multi-modal model.

The exciting aspect of Gemini is that it was designed from the start as a native multi-modal architecture. The training process interweaves various modal data from the beginning. If previous large models were like attaching sensory organs or mechanical arms to a brain externally, Gemini is like growing its own eyes, ears, and arms internally, allowing for fluid and natural interaction.

Whether in terms of model architecture, training process, or final output, Gemini achieves a seamlessly integrated multi-modal experience.

For the first time, Gemini demonstrates that a unified model can handle all modalities, and perform even better than models focused on a single modality! For example, compared to the Whisper model, which is optimized for voice recognition, Gemini shows a significant improvement in accuracy.

This signifies the dawn of the era of unified multi-modal models.


In fact, Gemini is not the first model to demonstrate that different modalities can mutually enhance performance. This was also evident in PaLM-E, where "PaLM-E, trained across different domains including general vision-language tasks at internet scale, showed a marked improvement in performance compared to models performing single tasks in robotics."

Another example of modalities enhancing each other is the multilingual processing ability of large language models. If we consider different languages as distinct "modalities," the practice of large language models has proven that processing native data of all languages together (through tokenization and embedding) managed to lead to the successful construction of a human language tower of Babel.

The overwhelming amount of English data in the training of large language models also benefits the model's understanding and generation of languages with limited data, reaffirming the transfer of linguistic knowledge. It's akin to a person skilled in tennis also being able to improve their abilities in squash or golf through related skills.

Since the rise of large models in February this year, many have gradually embraced the belief that "unified multi-modal models will surpass single-modality models." However, this belief hadn't been confirmed on a large scale until Google's Gemini showcased the prospects of this belief, reshaping and solidifying it for many.

In the future, specialized models for tasks like voice recognition or machine translation may become less significant. Many generative tasks such as TTS and image generation are also likely to be unified under large models. Some may complain about the high cost and slow speed of large unified models, but these are purely technical challenges. In practice, we can distill unified models to specific modalities or scenarios.

We firmly believe that unified cross-modal large models will become the mainstream pathway to achieving AGI.

Furthermore, "modalities" are not just sound, images, videos, etc. Olfactory, gustatory, tactile, temperature, and humidity sensors are also different modalities for gathering environmental information, all of which can in time be encompassed by unified models.

Ultimately, various modalities are merely carriers of "information." They are a form of rendering, a presentation style, a means for an intelligent entity to interact with the physical world. In the eyes of a unified model, all modalities internally can be represented by unified multi-dimensional vectors, enabling cross-modal knowledge transfer and the intersection, alignment, fusion, and reasoning of information.

When the barriers between modalities are breached, revealing the core beneath various renderings, we see the origin of cognition — language.
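That "unified multi-dimensional vectors" idea can be illustrated with a toy shared embedding space, where cosine similarity aligns items across modalities. The three vectors below are fabricated purely for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Fabricated toy embeddings: a caption, an image, and an unrelated sound,
# imagined as already projected into one shared space.
text_duck  = [0.9, 0.1, 0.0]
image_duck = [0.8, 0.2, 0.1]
audio_car  = [0.0, 0.1, 0.9]

print(round(cosine(text_duck, image_duck), 3))  # high: same concept, two modalities
print(round(cosine(text_duck, audio_car), 3))   # low: unrelated concepts
```

Real systems learn such projections with contrastive objectives over paired cross-modal data; the geometry — nearby vectors mean aligned concepts — is the same.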

 

 

 

(Gemini Notes Series to be continued)

 

Original from:

关于 Google Gemini 的八点启示

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

Jottings: Year-End Reflections on 2023

In our old-friends group chat, a former classmate replayed a few of the great sing-along anthems of our era — 《明天会更好》 (Tomorrow Will Be Better), 《让世界充满爱》 (Let the World Be Filled with Love), and "We Are the World." Facing the wars and turmoil of the passing 2023, she sighed: what has become of today's world? Replaying those songs from the International Year of Peace, grief wells up with no tears left to shed.

An old friend replied: but aren't we actually all living better lives?

My response:

The 1980s we lived through were full of hope — "the sky over the liberated area was a bright sky." Now is different.

In truth, our parents' generation also had a few exhilarating years of beautiful dreams right after Liberation; 《青春万岁》 (Long Live Youth) left a record of them. The unending political campaigns that followed cast shadows over that rosy picture. After the ten-year catastrophe of the Cultural Revolution, Deng gave us the chance to attend university and graduate school; with everything waiting to be rebuilt and a hundred flowers blooming, society brimmed with vitality, and we were full of hope and a sense of duty. It was a historic opportunity, and a beautiful encounter.

Alas, the 1980s we experienced were probably a historical anomaly rather than the norm — an April day in the human world, not the ordinary turn of the seasons.

Now it truly is different: winter has come. Material life has improved enormously since the 1980s, yet the whole world can hardly conceal a certain end-times mood. That includes AI, or AGI: the frenzy of involution and expansion feels more like moths darting into a flame than like hope and aspiration — a collective subconscious, beyond anyone's control.
AI because of AI, not love because of love.
Love needs no reason, but AI must not run mad without one.

-- Even though Musk, visiting China of all places, somehow learned to link 爱 (love) with AI.

-- Even though Ilya declares he wants to instill in models a heart that loves humanity.

-- Even though each of us, by inertia or by nature, still longs for simple love, what we face is a chaotic spectacle: information cocoons, truth indistinguishable from falsehood, fast-food culture, one thrill and then oblivion. As if there were no tomorrow, no hope.

There is no worst, only worse.

This is the fastest of times, and the worst of times.

Cold war turns to hot war; one front becomes two. Suspicion grows by the day; trust is gone.

At a gathering the other day, an older friend shared his impressions from a trip to Iran and put it well (roughly): when a system heads downward, it really can go down without any bottom. Has the iron law of bouncing back off the bottom failed?

Horses of humanity in 2024 — not only AI — can you run a little slower, a little steadier, carrying compassion and human hearts with you?

A year-end sigh; may these worries prove unfounded.

 

Cross-modal Knowledge Transfer of Large Models Proven (Gemini Notes 1/8)

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"


In 1948, inspired by psychiatric patients, British doctor Ross Ashby invented a peculiar machine called the "Homeostat." He proclaimed that this device, costing about 50 pounds, was "the closest thing to an artificial brain ever designed by mankind." The Homeostat utilized four bomb control switch gear devices from the British Royal Air Force, used during World War II, as its base. Above these were four cubic aluminum boxes, with the only visible moving parts being four small magnetic needles on top of the boxes, swaying like compass needles in a small trough of water.

When the machine was activated, the needles moved in response to the electric current from the aluminum boxes. The four magnetic needles were always in a sensitive and fragile state of balance. The sole purpose of the Homeostat was to keep the needles centered, maintaining a "comfortable" state for the machine.

Ashby experimented with various methods to make the machine "uncomfortable," such as reversing the polarity of the electrical connections or the direction of the needles. However, the machine always found ways to adapt to the new state and re-center the needles. Ashby described the machine as "actively" resisting any disturbances to its balance through synaptic action, performing "coordinated activities" to regain equilibrium.

Ashby believed that one day, such a "primitive device" could evolve into an artificial brain more powerful than any human, capable of solving the world's most complex and challenging problems.

Despite Ashby's lack of knowledge about today's AGI evolution and the laughable idea of using four small magnetic needles as sensors for intelligence, his Homeostat fundamentally challenged everyone's understanding of "intelligence" - isn't intelligence the ability to absorb information from the environment in various modalities, and to modify behavior and responses based on feedback?

From the peculiar "Homeostat" to today, 75 years later, Google's Gemini, which claims to have surpassed human multi-modal task processing abilities, accelerates towards the evolution of billions of years of carbon-based intelligence through the injection of multi-modal native big data.

The acceleration speed of machine intelligence evolution today far exceeds our imagination. A year ago, OpenAI overturned Google's long-established AI position with its 'brute force aesthetic,' having constructed the Babel Tower of human languages. A year later, Google countered with Gemini, via a 'fight fire with fire' approach to building the first unified cross-modal model, setting another milestone in AGI evolution.

Despite initial skepticism over exaggerated video demos upon Gemini's release, it's undeniable that the dawn of a unified multi-modal approach is shining. What capabilities does Gemini confirm? How will Google's wheels of fate turn? Is time a friend to OpenAI or Google? What does multi-modality mean for Agents and embodied intelligence? Are the foundations for the emergence of AGI with consciousness already in place? How should we view the implications of Gemini for the AI future?

01.

Cross-modal Knowledge Transfer of Large Models Proven Again

For humans, the ability to transfer knowledge across various domains and through different timespaces is more important than merely learning skills. If machines can master cross-modal knowledge transfer, they edge closer to "intelligence generality."
 
In July this year, Google introduced RT-2, a robotic system based on large models, sparking hope for general-purpose robots.  The system's robotic arm, leveraging the "common sense" of language models, demonstrated the ability to "pick up an extinct animal from a table," moving from common sense reasoning to robotic execution, showcasing cross-modal knowledge transfer. 
 
In December, the introduction of Gemini by this tech giant reaffirmed the cross-modal knowledge transfer capability of large models: the "common sense" of language models could be transferred to the training of other non-linguistic modalities added later. Language models are known to form the foundation of cognitive intelligence, and the most basic form of cognitive intelligence is "common sense."  Without common sense empowerment, the practical application of large multi-modal models would be challenging.  Gemini smoothly transfers this "common sense" to downstream multi-modal tasks.  Like RT-2, it achieves cross-modal integration through the transfer of text-derived knowledge — Gemini can connect ontology concepts to the understanding of auditory and visual objects, and eventually link them with action, forming an intelligent system ready for real world application. 
 
From the perspective of model training, compared to language models trained with massive internet data, downstream models (like robotic models) can be trained with very limited data through knowledge transfer.  This transfer-based training manages to address the long-standing issue of data scarcity in downstream applications.  For instance, to achieve the effects shown in the video (which raised doubts about Gemini's video comprehension or picture comprehension but did not affect the discussion on cross-modal knowledge transfer here), Gemini first needs some ontological knowledge — it understands the concept of a duck, knows the usual color of ducks, and what blue is. When it sees a "blue duck," it reacts similarly to humans, expressing the "common sense" that "blue ducks are uncommon." 
 
 
Gemini, through auditory and visual perception, identifies that the material of the blue duck is rubber and knows that rubber's density is less than water's. Based on this common sense and reasoning, when it hears a squeaking sound, it can predict that "the blue duck can float on water." 
 
 
From RT-2 to Gemini, we've moved to the "fusion" of multi-modal perceptual intelligence and cognitive intelligence. We've transitioned from isolated "five senses" modules of eyes, ears, mouth, nose, and body to a unified digital "human". 
 
Doesn't this imply that on the path to simulating human intelligence, the unified model is the right approach? 

 

 

 

(Gemini Notes Series to be continued)

 

Original from:

关于 Google Gemini 的八点启示

by Zhi-Fei Li, Gao Jia, Wei Li, from "Brother Fei on AI"

Language Is the Core and Main Thread of the Unified Model

Authors | Gao Jia, Wei Li
Concept | Zhi-Fei Li
In the AGI system of our imagination, is the core and main thread vision, or language?

Some say vision, but we are convinced it is language: vision reflects a sensory capacity shared by all animals, whereas language (spoken, and later written) is a symbolic system unique to humans, carrying the cognitive inheritance and accumulated knowledge of millennia.
Language is the outward representation of human cognitive intelligence and a defining mark of the birth of civilization. The Israeli historian Yuval Harari argues in Sapiens that it is the "storytelling" ability conferred by human language that gave humans an organizational capacity no other animal possesses, giving rise to civilization and making humans masters of the Earth.
Language is the starting point and wellspring of cognitive intelligence. Human language encodes a highly abstract hierarchy of concepts, including ontological knowledge and its common sense, as well as broader world knowledge and deeper domain knowledge. This knowledge underpins higher-order intelligence such as logical reasoning. Sound, images, and video, by contrast, are more sensory, expressing the emotions and concrete capacities of humans and higher animals — perceptual intelligence.
From perception to cognition, from emotion to logic: only when a model fuses them all does it truly simulate the mental states of the human brain and deserve to be called complete artificial intelligence. A unified multimodal model that bridges the gap between perceptual and cognitive intelligence is where the hope of complete AI lies.
In both RT-2 and Gemini, language-based cognitive intelligence remains the core of simulating human knowledge, with the knowledge transfer of common sense and its reasoning playing the key role. In RT-2, for instance, the data volume and parameter scale of the language modality far exceed those of the downstream image and action modalities.
Within native cross-modal big data, language data always occupies the center. We can predict that future AI systems, whether or not their target is a language task, will take a language model as the foundation model and the starting point of training; downstream data for other modalities or tasks can then continue training on top of it, inheriting and transferring the language model's powerful cognitive abilities.
Achieving this would highlight the language model's greatest contribution to AGI, for it truly embodies researchers' original vision and positioning of large language models — as the Foundation Model and Core Engine.

Full original text (from):
关于 Google Gemini 的八点启示

 

 

 

Tanya's Childhood 2: American nursery rhymes

This piece is a parent's nostalgic recounting of his daughter's childhood, focusing on various American nursery rhymes and her playful interactions. The parent reflects on the limited recordings they have of her from when she was young, transferred from an iPod to an iPhone and often played in the car, blending with music into fond memories of the past.

The daughter is described as a talkative and somewhat rapid-fire speaker as a child, who enjoyed showing off nursery rhymes.

 

April 13, 2019

立委_米拉的微博视频 or YouTube:

As I navigate through the cherry blossom season, I'm engulfed in a wave of nostalgia, reflecting on the fleeting moments of my daughter's childhood. It's remarkable how certain memories, like her voice from those few recordings we made, have ingrained themselves in my heart. These snippets, once captured in an iPod and now residing in my iPhone, have become an auditory pathway back to those treasured times.

My daughter was always a chatterbox, her words often racing ahead of her thoughts. She had a particular fondness for American nursery rhymes, relishing in their playful rhythms and catchy phrases. I fondly recall how she would eagerly recite them, her voice filled with the enthusiasm of youth.

 

One of her favorite rhymes was a humorous jibe at boys:

"Boys go to Jupiter to get more stupider, girls go to college to get more knowledge."

She'd recite it with such dramatic flair, emphasizing each word, as if imparting some profound wisdom. Her rendition was always animated, almost rap-like, making it impossible not to smile.

“what do you want me to say now? boys go to Jupiter , do you know the planet Jupiter? they go to the planet Jupiter, once they get there, they get stupider and stupider every second. And girls they go to college to get more knowledge and knowledge into their brain on their head.”

"Eeny, meeny, miny, moe" was another staple in her repertoire.

“Eeny, meeny, miny, moe,
Catch a tiger by the toe.
If he hollers, let it go,
Eeny, meeny, miny, moe."

It's fascinating to think about how this simple rhyme was more than just a game; it was a glimpse into the cunning minds of children. They'd use it to make choices, but often, the outcome was already decided in their hearts. They'd cunningly manipulate the ending to suit their desired choice, either accepting or rejecting it with a claim like

"My mother told me to pick the very best one, and you are not it."

Or, “My mother says to pick the very best one, and that is YOU”.

Among these recordings was a playful, teasing rhyme that still brings a chuckle:

“You know what
Kick your butt
All the way to Pizza Hut

While you’re there,
Comb your hair
Don’t forget your underwear!”

This rhyme, intertwined with stories of school and friendships, showcased the innocent yet intricate world of children's social dynamics.

“I said that I am the Princess of Jewelry because one of my friends and buddy said that she looked at my jewelry I brought to school.  What happened is she was so surprised and she loved it … she said that I am Princess of Jewelry and she is the Queen of Makeup.  Next time I am going to bring new jewelry, she said that I am the Queen of Jewelry…… No,Daddy, Jessica said I am the Queen of Jewelry if I bring some new jewelry tomorrow.”

A particularly memorable story was about Tanya proclaiming herself the "Princess of Jewelry" after a school friend complimented her on her collection. This interaction with her friend, Jessica, who crowned herself the "Queen of Makeup," was a brilliant display of childhood diplomacy and innocence.

Tanya's excitement at the thought of being elevated to the "Queen of Jewelry" the next day if she brought new jewelry to school was both touching and amusing.

Listening to these recordings also brought into stark relief the difference between a native language and a second language. Her English, fluid and expressive, stood in contrast to her Mandarin, which, despite her efforts at weekend Chinese school, sounded labored and less natural.

These memories, encapsulated in a few precious recordings, remind me of how quickly time passes. They're not just echoes of Tanya's childhood but also emblems of a period that seems both distant and vividly close. In the beauty of the cherry blossoms, I find a reflection of those bygone days, a tender reminder of the passage of time.

 

from

朝华之二十五:爸爸的小棉袄

Tanya's Childhood 1: McDonald's

养育下一代(parenting)是人生最可回味的经历。孩子成长的花絮,时不时让人惊喜,积淀成温馨和亲情。很多父女对答妙趣横生,想起来就随手记录下来,更多的是随风飘散。人生的旅程步步惊心,支持我们走过低谷的是一种信念,为了女儿,我们不能停步。

Parenting is one of the most memorable experiences in life. The growing up moments of children often bring surprises and accumulate into warmth and affection. Many delightful father-daughter conversations are casually recorded, while others are lost with the wind. Life's journey is full of suspense, and it's our belief in our daughters that supports us through the lows, urging us not to stop moving forward.

永远的麦当劳 / Forever McDonald's

我们在水牛城的时期,一到周末,大小领导常常在工厂直销中心(Factory Outlets)不知疲倦地购物,跟厂商玩着买了退退了买的游戏。我跟往常一样,找一家附近的麦当劳快餐店,打开膝式苹果电脑,就着炸薯条,品着咖啡,上网有一眼无一眼看看老友们在闲极无聊中又整出什么让人跌破眼镜的新鲜事来,头脑里想的是怎样来写这篇酝酿已久的"麦克唐纳万岁"。还好,太阳底下没有新鲜事,只是一帮理呆在争论《十万个为什么》中的飞机为什么能飞的问题,争了几个月了,还没有结果。扯嘛,飞机不能飞还叫飞机吗?还是先回答鸟儿为什么能飞吧,飞机不就是人类的大鹏嘛。

During our time in Buffalo, every weekend, the 'big and small bosses' (wife & daughter, lol) would tirelessly shop at the Factory Outlets, playing the game of buying and returning with the merchants. As usual, I would find a nearby McDonald's, open my laptop, enjoy fries and coffee, and half-heartedly browse the internet to see what new, shocking things my bored old buddies had come up with. I pondered how to write the long-brewing "Long Live McDonald's." Fortunately, under the sun, there's nothing new; just a bunch of nerds arguing about why planes can fly, as described in books like "One Hundred Thousand Whys," without any conclusion for months. Ridiculous - if planes couldn't fly, would they still be planes? Maybe it's better to answer why birds can fly first, as planes are just the great rocs of humanity.

回到麦当劳。不管营养师怎样呼吁围剿所谓垃圾食品,也不管爱国分子怎样鼓噪抵制西方餐饮大王的侵入,麦当劳在我的心中金光闪烁,温馨惬意,有如家园。麦当劳给我的美好感觉,不在它的价廉物鲜 — 当然是新鲜的鲜,并非鲜美的鲜,毕竟是鬼子食。炸薯条和鸡块还是不错的,汉堡包在饿极时也可以下咽,比那些冷冰冰的三明治稍强。麦当劳的美好也不仅仅是它卫生亮敞的环境和茶馆一样的平易可亲的氛围。真正使麦当劳万寿无疆的是它的 Happy Meal(儿童欢乐套餐)和它附带的儿童园地(Ronald's Playhouse)。Happy Meal 给儿时的女儿带来过无数的惊喜和欢乐,麦当劳儿童园地也见证了我跟女儿一起度过的无数美好快乐的时光。

Back to McDonald's. Regardless of how nutritionists call for a boycott of so-called junk food, or patriots decry the invasion of Western fast food giants, McDonald's shines brightly in my heart, cozy and homely. Its appeal isn't just in its inexpensive food – fresh in terms of newness, not taste, as it's still fast food after all. The fries and chicken nuggets aren't bad, and the burgers are tolerable when you're really hungry, better than cold sandwiches. But McDonald's charm isn't just in its clean, bright environment and the approachable atmosphere of a tea house. What really makes McDonald's everlasting is its Happy Meal and the accompanying Ronald's Playhouse. Happy Meals have brought countless surprises and joy to my daughter in her childhood, and Ronald's Playhouse has witnessed many wonderful moments we've shared.

对麦当劳的最初印象是我2015年前出国旅欧时形成的。一帮清贫的学生决定结伴周游欧洲各国。旅游并非阔人的专利,学生有学生的招数:买一张物超所值的铁路通票,就有了游遍欧洲的基本保障,食住行囊括了后两项。大体是白天游玩,晚上搭车加睡觉。有时一夜经过好几个国家,睡意朦胧中查验护照和签证,完了歪头再睡。一觉醒来,撞到什么旅游点,就下来走马观花。如果错过了什么名城胜景,可以转身搭车回转。随缘随机,倒也自在。这种旅行方式在学生中非常流行,对于节俭到苛刻的中国留学生更是如此。除了车票和门票(学生有优惠),唯一的开销就是吃了。旅游在外,胃口特别好,肚子时常闹意见,可旅游点的餐馆甚至小吃都价格不菲,就麦当劳的价格比较稳定。同学总结说:"Believe me, 游遍欧洲,颠扑不破的真理只有一条:麦当劳是唯一吃得起也吃得饱的所在。" 人以食为天,麦当劳的流水作业和薄利多销成全了它的餐饮业霸主的地位。

My first impression of McDonald's was formed during a trip to Europe before 2015. A group of poor students decided to tour various European countries together. Traveling isn't just for the wealthy; students have their ways: buying a value-for-money rail pass ensured basic travel across Europe, covering accommodation and transportation. We generally toured during the day and traveled and slept at night. Sometimes we'd pass through several countries in one night, vaguely waking up for passport and visa checks, then dozing off again. Waking up, we'd spontaneously visit whatever tourist spot we bumped into. If we missed any famous city or scene, we could easily catch a train back. This laissez-faire travel style was popular among students, especially thrifty Chinese international students. Aside from train and attraction tickets (with student discounts), our only major expense was food. Appetites grow when traveling, and stomachs often complain, but eating at tourist spots is expensive, making McDonald's a stable choice. A fellow student summarized, "Believe me, the only unbreakable truth in traveling across Europe is: McDonald's is the only place you can afford and get full." People need to eat, and McDonald's fast service and thin profit margins cemented its dominance in the food industry.

对麦当劳的亲密而频繁的接触,还是由于甜甜。玩具是儿童的天使,甜甜热衷于追踪麦当劳儿童套餐推出的每一款玩具,遇到她喜欢的主题,比如 Furby, Teletubby, 她总是要收集各种颜色和造型的全套才满足。为此,我也没有少吃儿童套餐,为的就是尽快收集完全。有一次我连续一周午餐吃儿童套餐, 甜甜感觉奇怪:“Dad, are you ok? Did you tell me you don't really like the McDonald's food?” 我笑笑,说:“it's not bad, actually I seem to like it. Important thing is, we got the toy”。后来甜甜终于悟出来了,跟小伙伴说:"I can't believe it. My Dad ate Happy Meals nonstop just to get a complete collection of my favorite toys." 语气里透着被宠爱的满足。

My close and frequent encounters with McDonald's were mostly due to my daughter, Tanya. Toys are angels to children, and she was keen on collecting every toy from McDonald's Happy Meals. Whenever she liked a theme, like Furby or Teletubby, she had to collect all the colors and designs. I ended up eating many Happy Meals to complete her collection. Once, I ate Happy Meals for lunch for a week straight. Tanya found it odd: "Dad, are you ok? Did you tell me you don't really like McDonald's food?" I just smiled and said, "It's not bad, actually I seem to like it. The important thing is, we got the toy." Eventually, Tanya realized and told her friends, "I can't believe it. My Dad ate Happy Meals nonstop just to get a complete collection of my favorite toys." She felt a satisfied sense of being spoiled.


麦当劳儿童园地 / Ronald's Playhouse at McDonald's

在水牛城的岁月,麦当劳附设的儿童园地是我们最常光顾的场所,有吃有喝有迷宫,总有其他小朋友,甜甜在那里不到筋疲力竭不愿意回家。麦当劳迷宫,千转百迴,上下左右贯通,最受儿童喜爱。甜甜天生胆子小,很长一段时间,望宫兴叹。有一天,我们注意到麦当劳迷宫的游玩规定中写道:And parents, too! 原来允许做父母的跟孩子一块进去玩儿,于是陪着甜甜爬进那窄长园筒状迷宫通道,甜甜兴奋莫名,从此一发不可收拾。可怜我的老骨头,猫着腰跟一帮孩子在里面爬呀爬,很多家长旁观而笑。有孩子在迷宫哭闹的,就托我领孩子出宫。

During our time in Buffalo, Ronald's Playhouse at McDonald's was our frequent haunt, with food and drink and a maze. There were always other kids, and Tanya wouldn't want to leave until she was completely exhausted. The maze at McDonald's, with its twists and turns, was a favorite among children. Tanya was initially timid, but one day, we noticed the Playhouse rules stated: And parents, too! So, I joined her in the narrow, cylindrical maze, and she was ecstatic. Poor me, crouching down and crawling with a bunch of kids, while many parents watched and laughed. When a child cried in the maze, I was often asked to help lead them out.

全家外出旅游,时常在没有尽头的高速公路上狂奔,夜色渐浓,困顿饥饿之时,我们也总是习惯地搜寻下一站的麦当劳。那金黄的霓虹灯招牌M,顶天立地耸立在那里,是温馨随意的召唤,总给我们宾至如归的感觉。

When traveling as a family, we'd often search for the next McDonald's on endless highways, especially when night fell and hunger struck. The golden neon 'M' sign stood tall and inviting, always offering a warm and casual welcome.

永远的麦当劳! / Forever McDonald's!

记于2007年母亲节

Written on Mother's Day 2007.

from

朝华之二十五:爸爸的小棉袄

A Script Written on the Anniversary of the Completion of the Tower of Babel

 

I am Wei Li of Mobvoi (出门问问). Welcome to my AI short-video channel. Today I would like to talk about killer apps in the era of large models.

We know that every information-technology revolution produces a batch of killer apps. Where are the killer apps of the large-model era?
Looking back at the first internet revolution, the killer apps were the browser and the search engine, followed by games, e-commerce, and social media, which gave rise to giants like Google/Baidu, Amazon/Alibaba, and Facebook. In the mobile era, the big three — e-commerce, games, and search — flourished on mobile platforms. Mobile also produced killer apps of its own: (1) ride-hailing, Uber/Didi; (2) lodging, Airbnb; (3) messaging, WeChat; (4) short-video entertainment, Douyin; (5) food and services, Meituan and the like. These apps cover the basic scenarios of daily life and have greatly raised productivity and quality of life.

The killer apps of the large-model era are not yet clear, but the broad directions are taking shape.

Virtual companionship (including elder care, as well as virtual girlfriends/boyfriends) is surely one category. Human emotional needs are real needs. Of course, given the regulation of adult content, virtual dating cannot be done in China right now. "Food and sex are human nature" — our ancestors understood this millennia ago. Such things are not monsters to be feared but blessings of the technological age; they harm no one, and there is little reason to ban them. Judging by the trend, they cannot be banned in the end anyway.

Beyond the emotional need served by virtual companionship, another category is knowledge needs: assistants and copilots (tutoring, Q&A), and assisted creation in any modality (text, audio, image, video). Artistic creation is actually a higher-order human need, above eating, drinking, and karaoke — and it is a strength of large models. Killer apps will certainly be born here, because the human pursuit of a richer inner life is endless, and the sense of immersion in literature and art makes life feel more meaningful. Art will no longer be the preserve of a small aristocracy but a form of self-expression for everyone. ChatGPT, as assistant/copilot, already has the shape of a killer app, even a super app.

Psychological counseling (therapy) sits between virtual companionship and medicine. It too is a real need, but it faces challenges of regulation, privacy, and going off the rails.

Finally, the perennial big three — e-commerce, games, and search — will be inherited and transformed in the large-model era. E-commerce will leverage LLMs for extensions like virtual try-on and virtual home design. Games go without saying: multimodal large models will lift gaming to new heights, with a dash of metaverse flavor. As for search, the natural evolution is RAG (retrieval-augmented generation) in the style of ChatGPT, where search merges with reading comprehension and question answering.
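The RAG pattern mentioned above is easy to sketch end to end. Here keyword-overlap retrieval and a template "generator" stand in for a real embedding index and LLM, so the loop runs deterministically:

```python
def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query (toy retriever)."""
    qwords = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(qwords & set(d.lower().split())),
                  reverse=True)[:k]

def rag_answer(query: str, docs: list[str]) -> str:
    """Stuff the top document into a prompt template (toy generator)."""
    context = retrieve(query, docs)[0]
    return f"Based on: '{context}' -> answer to '{query}'"

docs = ["RAG combines retrieval with generation.",
        "Gemini is a multimodal model."]
print(rag_answer("what is RAG retrieval", docs))
```

In production, `retrieve` becomes vector search over an embedding index and the template becomes an LLM call with the retrieved passages in context; the retrieve-then-generate shape stays the same.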

In all these directions LLMs hold great promise, but it will take time to incubate, iterate, and spread before sustainable killer apps with viable business models emerge.
Seen this way, the prospects for deployed LLM applications are very broad; it may take only two or three years before the landscape of LLM applications and their innovative business models becomes clear.

Patience — the show is still unfolding.

We were born into an age of witnessing history and miracles, and also of witnessing humanity's crises and their resolution.

I am Wei Li of Mobvoi, sharing two-minute takes, with a point of view, on large models and the real-world deployment of AI.

 

https://weibo.com/u/2316531634?layerid=4973825203373916

 

《A Recap of the Great Silicon Valley Drama》

Editor's note (Li Wei): Life is more dramatic than drama, and the virtual more real than reality; Turbo is more GPT than GPT, and AI more intelligent than intelligence — such is AGI.

 

### The OpenAI Story in Review: A Full Chronicle of the Silicon Valley Drama

#### Act One: The Spark - Sam Altman Fired

The story begins with a sudden and controversial move by the OpenAI board: the unexpected dismissal of CEO Sam Altman. The move set off shockwaves across Silicon Valley and opened an unprecedented piece of corporate theater.

- **Highlight**: The board accused Altman of a lack of candor in his communications with it, an accusation that later became the heart of the controversy.
- **Key figure**: Sam Altman, known for leading OpenAI into new territory, now suddenly ousted, setting the stage for what follows.

#### Act Two: Fallout and Revolt

After Altman's dismissal the company fell into chaos. An opposition led by key employees pushed back loudly against the board's decision and demanded his reinstatement.

- **Highlight**: Nearly 500 employees threatened to leave unless the board resigned and reinstated Altman and co-founder Greg Brockman.
- **Key figure**: Co-founder and former president Greg Brockman became the symbol of resistance to the board's decision.

#### 第三幕:伊利亚的后悔与公开信

在一个出人意料的转折中,被指责策划奥特曼出局的 OpenAI 首席科学家伊利亚·苏茨克维公开表达了他的后悔。这一认错为这场戏剧增添了新的复杂层次。

- **亮点**:伊利亚在社交媒体上的公开后悔和他参与的要求董事会辞职的公开信。
- **关键人物**:伊利亚·苏茨克维的角色从被指责的策划者转变为悔恨的关键人物,寻求修复 OpenAI 内部的裂痕。

#### 第四幕:董事会的困境与新任 CEO

在巨大的压力下,董事会发现自己处于十字路口。与此同时,新任 CEO Emmett Shear 被任命,标志着 OpenAI 可能的发展方向转变。

- **亮点**:Emmett Shear 的任命和他对 AI 发展的保守态度,与奥特曼的激进增长战略形成鲜明对比。
- **关键人物**:Emmett Shear,作为一股可能稳定混乱局势的力量,代表了 OpenAI 的新篇章。

#### 第五幕:转投微软与 OpenAI 的未来

在一系列戏剧性的事件中,奥特曼和几位关键成员宣布加入微软,实际上在这个科技巨头内部创造了一个强大的团队。

- **亮点**:微软成为主要角色,吸收了 OpenAI 的人才,可能重新定义 AI 领域的格局。
- **关键人物**:山姆·奥特曼转投微软,被视为一种战略高招,可能改变 AI 发展的未来轨迹。

#### 终幕:持续进行的剧情

这场戏剧暂时告一段落,OpenAI 正处于关键时刻。它的领导层、发展方向和核心理念都处于变动之中,这些事件的影响仍在科技界持续发酵。

- **回顾**:从奥特曼被解雇到现在,OpenAI 的这场剧情是权力斗争、意识形态分歧与硅谷 AI 领域未来之争的集中展现。
- **关键收获**:这一事件证明了领导尖端 AI 组织的复杂性,技术抱负与人类动态和企业权力游戏交织在一起。

*这一综合回顾作为对 OpenAI 持续戏剧的闪回,突出了塑造这一硅谷历史非凡章节的关键时刻和人物。*

~~~~~~~~~~~~~~~~~~~~

### OpenAI 动荡剧情:双语剧本

#### 第一幕:疑云初起 / Act 1: The Beginning of Doubts

**场景**:OpenAI 办公室,员工们围坐讨论。
**Scene**: OpenAI office, employees gathered in discussion.

- **员工甲**(激动):「你们听说了吗?Sam 被解雇了!」
- **Employee A** (Excited): "Have you heard? Sam has been fired!"
- **员工乙**(震惊):「怎么可能!Sam 是我们的灵魂人物!」
- **Employee B** (Shocked): "How is that possible! Sam is our soul!"
- **员工丙**(沉思):「这背后一定有更复杂的故事。」
- **Employee C** (Thoughtful): "There must be a more complex story behind this."

#### 第二幕:董事会的难题 / Act 2: The Board's Dilemma

**场景**:董事会会议室。
**Scene**: The boardroom.

- **董事甲**:「我们必须要有新的领导,Sam 的领导方式不再适合我们。」
- **Director A**: "We need new leadership, Sam's way of leading is no longer suitable for us."
- **董事乙**:「但这样的决定会引起巨大的反响,我们准备好了吗?」
- **Director B**: "But such a decision will cause a huge backlash, are we ready for it?"
- **董事丙**(坚定):「为了公司的未来,我们必须要做出艰难的决定。」
- **Director C** (Firm): "For the future of the company, we must make tough decisions."

#### 第三幕:伊利亚的后悔 / Act 3: Ilya's Regret

**场景**:伊利亚的办公室,他焦虑地走来走去。
**Scene**: Ilya's office, he paces anxiously.

- **伊利亚**(自言自语):「我做错了... 我不应该那样做... 我需要公开道歉。」
- **Ilya** (Muttering to himself): "I did wrong... I shouldn't have done that... I need to apologize publicly."
- **助手**(担忧):「这样会不会引起更大的混乱?」
- **Assistant** (Worried): "Won't this cause even more chaos?"
- **伊利亚**(坚定):「我必须要承担责任。」
- **Ilya** (Determined): "I must take responsibility."

#### 第四幕:员工的反抗 / Act 4: Employees' Revolt

**场景**:OpenAI 大厅,员工们聚集。
**Scene**: OpenAI hall, employees gather.

- **员工甲**:「我们不能接受这样的决定!我们要写一封信给董事会!」
- **Employee A**: "We can't accept such a decision! We need to write a letter to the board!"
- **员工乙**:「对,我们要求他们辞职,要求Sam回来!」
- **Employee B**: "Yes, we demand their resignation and demand Sam's return!"
- **众员工**(齐声):「OpenAI没有我们就是一无是处!」
- **All Employees** (In unison): "OpenAI is nothing without us!"

#### 第五幕:微软的招手 / Act 5: Microsoft's Invitation

**场景**:微软总部,Satya Nadella 与 Sam 和 Greg 会面。
**Scene**: Microsoft Headquarters, Satya Nadella meets with Sam and Greg.

- **Satya**(微笑):「欢迎加入微软,Sam。我们会一起创造不可思议的事物。」
- **Satya** (Smiling): "Welcome to Microsoft, Sam. Together, we will create incredible things."
- **Sam**:「我很期待这个新的开始,我们会创造新的辉煌。」
- **Sam**: "I look forward to this new beginning, we will create new glories."
- **Greg**:「是的,这是我们的新使命。」
- **Greg**: "Yes, this is our new mission."

#### 第六幕:终幕 / Act 6: The Finale

**场景**:OpenAI 办公室,员工们聚在一起。
**Scene**: OpenAI office, employees come together.

- **员工甲**:「现在怎么办?Sam 和 Greg 都走了。」
- **Employee A**: "What do we do now? Sam and Greg are gone."
- **员工乙**(坚定):「我们必须要继续前进,为了我们的使命。」
- **Employee B** (Resolute): "We must continue to move forward, for our mission."
- **众员工**(齐声):「OpenAI是我们的家,我们会一起度过难关!」
- **All Employees** (In unison): "OpenAI is our home, we will get through this together!"

*本剧本创意基于最近 OpenAI 发生的一系列戏剧性事件,旨在通过对话和场景刻画,双语呈现这个引人入胜的科技界故事。*

AIGC时代,关于鲁迅大脑的复活

这个话题,在国内怕惹麻烦,还是写在这里吧。也借此机会与老友分享一下我目前聚焦的工作,以及这个领域持续令人兴奋的热点。

《清晨时刻: 每日GPT》可以成为一个专栏,关于 GPTs(GPT Builder / GPT Store / GPTs by Wei Li)似乎每天都有新的进展或体验可以分享。

今天的进展是,我对前几天制造的“鲁迅具身”的质量不满,因为不像。倘若鲁迅纪念馆真要让我为他们大屏上的鲁迅数字人提供虚拟大脑的话,我觉得目前我做的这个 GPT 还不合格:虽然可以源源不断请他老人家在元宇宙发声,每次都有不同,语言也通顺,但风格模仿还不尽如人意。

除了把抱怨当作 bug reports 直接反馈给 GPT Builder,我开始从网上收集鲁迅先生的文集 PDF,填入 local knowledge,并指令它从中学会鲁迅的言谈风格。今天填进去的文集是:

这几乎就是一本鲁迅先生的文学类“全集”了吧,排除了鲁迅先生“硬译”的外国文学译品,以及家长里短的乏味的日记等,觉得是一个合适的 feed,可以让 GPT 聚焦其文学风格。

Quote
原文序言:序 言
这是一套鲁迅小说、散文、诗歌和杂文等文学作品的选集。
20世纪30年代以来,《鲁迅全集》、《鲁迅选集》时有出版。“全集”版本虽不很多,印数却相当可观;“选集”更是版本繁富,数量浩大;比较起来,只收鲁迅文学作品的全集,却显得较少。许多读者觉得“全集”太大,因为日记、书信、序跋、学术著作,没有纳入他们的必读范围;“选集”又欠精,他们手头需要一部像本书这样的鲁迅文学作品的全集。
.........

把这本文集作为 local knowledge (类似于 GPT-PDF 的 rag) 喂进去,鲁迅先生(大脑具身)的表现会有所改善么?我们试试。

GPT Builder 强调,为了 access (local)knowledge,需要打开内置插件 code interpreter,我在 config 中确认了已经打开。

上传上去后,似乎无需等待,立即就开始起作用了(内部是快速建立一个类似向量知识库的东西,还是其他什么 embedding retrieval 方式?总之都是 OpenAI GPT Builder 平台内部搞定的,不用我们用户操心)。
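平台内部具体怎么做我们不得而知;业界常见的做法是先把长文档切成带重叠的小片段,再对每个片段做 embedding 建索引,问答时按相似度取回相关片段。下面是“切块”这一步的一个极简示意,参数(片段长 200、重叠 50)纯属假设:

```python
def chunk_text(text, size=200, overlap=50):
    """把长文本切成带重叠的片段,便于后续做 embedding 索引(示意)。"""
    assert size > overlap, "片段长度必须大于重叠长度"
    chunks = []
    step = size - overlap  # 相邻片段的起点间隔
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # 已覆盖到文末,停止
            break
    return chunks

# 例:500 个字符、片段长 200、重叠 50,共切出 3 个片段
print(len(chunk_text("a" * 500)))  # 3
```

片段之间留重叠,是为了避免一句话恰好被切断后检索不到;真实系统里切块粒度和重叠量都要按语料调。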

好,我们来试试效果。(作为小白鼠,先给个警告,鲁迅先生向来以辛辣著名,时评不可能“政治正确” -- 这正是他老人家最厌恶的东西,所以很多人说过,他老人家虽然极受毛主席推崇,但倘若活到1957年,肯定是要打下去的最大右派。)

鲁迅先生向来以辛辣著名,时评也充满讽刺,不一定讨好。但忠言逆耳,我们不妨不时听听复活的鲁迅是怎么俯瞰天下大势的。

以上就是他老人家最新的时评。是我请他老人家写的。(群内供研究,不外传,也不必上纲上线,阅后可焚。我想展示的是 AI 的惊人内功。再说一遍,群内都是我熟知的老友,此件务必不外传,不惹麻烦。不合时宜的话语是他的风格,这里的本义只有AI研究。)

虽然鲁迅具身作为中国近代最伟大的思想家的元宇宙大脑,还有很多优化的工作可做,但初步的实验已经显示出鲁迅风格和人格的复活。今晨做这个实验的时候,我看着屏幕上他老人家喷涌而出的时评,感到了一种时空的穿越。这比前天我刚做“鲁迅具身”上线的时候,表现逼真太多了。质量只会越来越好,我会持续维持和加强GPT的迭代更新。

到底 AI 做 character,复活古人、名人、思想家、艺术家,是不是一个靠谱的目标?

我们知道,复活名人的外表早已不是问题,蜡像馆就是成功案例。现在我们的2D3D的奇妙元数字人也是栩栩如生。复活声音也不是大的挑战,我们有亚洲AIGC业务最强的魔音工坊,很快都可以搞定。最难复活的还是大脑。而大脑,非 LLM 不可。现在只是一个开始。

这个实验不幸有点敏感,以后我会做一些其他名人的GPT大脑。然后用这个大脑发出对于当今世界的评论,并以此驱动奇妙元数字人的形象,可源源不断制作出鲜活生动的元宇宙大师来。其实,如果能够协调好监管,也完全可以实现博物馆历史名人实时与参观者的交互:技术条件已经具备。可以预见,这类落地由于政策的相对宽松,海外会走在前面。

character AI 虽然面对 Open AI 平台的碾压,也还是聚集了足够的人气和社区,正在 AI characters 的方向上前进。国内也有几家出海产品,正在尝试进入这个市场。

我已经公开发布我制作的【鲁迅先生(GPT具身)】,有 ChatGPT Plus 注册的朋友都可以在此尝试,欢迎反馈和 bug reports,我的迭代更新会是秒速(只要有反馈,可以做到日迭代,这是因为在“LLM对话驱动编程”的新范式下,现在的 bug reports 可以直接扔给平台,GPT Builder 会实时迭代,无需等待):

https://chat.openai.com/g/g-zeYHL1uSG-lu-xun-xian-sheng-ju-shen

老爸:庆生感言

人生,这出长剧,终会谢幕,这趟直通车,也会到站!我的人生,跌宕起伏,但多彩多姿,总算,踏过荆棘,平顺走来。

这次,全家支持,扬新、小维,付出精力和耐心,继《李家大院》之后,我的《医学文集》,又付梓成书,今晚,可以分享各位。它,重现我的从医足迹,历数我“救死扶伤”业绩!其中,有不少感人事例!人生企求,平安、充实、家族兴旺。

讲几则故事:

我的少年,衣不蔽体,食不果腹,更无医药问津,任其自然,从无疫苗,疟疾、蛔虫、麻疹、脓泡疮……我终于侥幸越过而生存下来!

一九五零年,从军南京大哥名朴,嘱令他妹名伪(我姐)考学,三婶点拨,让我随姐赴县城考学,一天小学没上,一文学费没花,居然,一考即中,从此,走出农村和贫困,改变人生,从这个起点,靠国家助学金,挺过初中,那就是“人才”,芸芸众生,全县二十多万人,这一年,就这五十人中举,可以比肩今天的博士生!这是第一步。

接下来,一九五三年,考进卫校,从此,定格我从医生涯!

第三,一九六一年,自己力取,进入县医院外科临床,一发千钧,风生水起,全力投入,直到如今,使之,人生充实。这三步曲,铸就一生轨迹。

二零零七年六月三号,突发大呕血,胃癌,经过大手术,闯过这一大坎,尔后,几乎一直没有看医问药,算是风顺一生。

再说家亲,下辈中,不乏学士、硕士、博士,也有主任、教授、专家和高管,唯独没有高官,也因此,平安、省心!

现在,即将进入八十八岁,感谢各位,为我庆生!只盼余年安康!

谢谢。

 

个性化精调模型 AIGC 小妹(9)

这是精调训练的老照片样本:


 

其中有一半系统认为不符合样本标准,删除后只剩下10张左右的照片做微调训练用。训练10分钟形成用户专有模型,利用模版化的提示词产出如下图片(做了拣选,单月选了三分之一),觉得效果还不错(前两张高清4MB与1MB):

 

《朝华之四: 小妹》

个性化精调图片生成实验(1)

个性化精调图片生成实验(2)

个性化精调图片生成实验(3)- AIGC 甜

 

个性化精调图片生成实验(4)

个性化精调图片生成实验(5)

个性化精调图片生成实验(6): AIGC立委先生

个性化精调模型 AIGC 老哥(7)

 

个性化精调模型 AIGC 老爸(8)

个性化精调模型 AIGC 小妹(9)

 

个性化精调模型 AIGC 老爸(8)

半年前,我用过一个图形软件刚推出来的个性化 fine tune 模型 feature,给老爸的老照片做了精调,效果不好(碰运气,有的用户反映说效果很好),出来的形象老爸说不像。这是半年前的图片生成:

虽然有点影子,家里人都觉得总体不像。

现在重新做 fine tune,用的是 SDXL 1.0-finetune,效果似乎明显改善了。

但是,AI 预测人的不同年龄,实际上也是瞎蒙。因为随着岁月增长,人的形象改变有不同的方向,包括疾病、锻炼、营养等因素吧。这是 AI 根据老照片预测的90岁的形象:

这是老爸现在(88岁)的照片:

不能说预测完全离谱,但确实不像。

人物肖像应该是所有图画中,用生成模型产生作品最难让人满意的了,这是因为人的眼光对人的细微差别特别敏感,尤其是要让本人和亲友感觉很像,这是很难的。现在的 fine tune 水平,大约可以做到每生成四张,能有一张让人觉得像的,或可以接受的。对于特别挑剔的眼光,或者近距离的亲人来说,大约每10张生成能出现一张即便最挑剔的眼光也难以拒绝的作品来,不时还会让人感觉惊喜或震撼。

AIGC 甜甜儿时的尝试中就有一些惊喜,例如下面博文的前面几张肖像:

个性化精调图片生成实验(3)- AIGC 甜

尤其是这一幅水粉画,非常像,也很艺术:

我们人类看世界,由近而远。譬如,大千世界的实体,根据不同品类,其实在我们眼中都差不多。例如野生动物,这只虎与另一只虎,我们通常感觉都差不多(动物园饲养员自然会有更细致的区别能力)。到了宠物就有所不同,因为宠物进入了家庭,我们会坚持自己的猫咪与别人家的同类型的猫咪有所不同,但也还是大同小异。

我们看外国人,一开始觉得都长得差不多,大体上根据肤色、种族、性别和年龄,有一些类别而已,实体个体的差异我们没有那么敏感。据了解,西人看东亚人其实也觉得长得都差不多。但同种族内,我们就会对人的形象有各种区分,甚至一眼能看出一个人是从哪个地区来的。

到了亲友和熟人,细微的差别也都能看出不同来。所以,画得像不像很难骗过身边的亲友。俗话说,画鬼容易画人难。这对模型是一个极大的考验,尤其是考虑到生成模型实际上具有以下容易走偏的特征:fine tune 的样本有限,通常在 10-30张之间,与预训练基础大模型完全不成比例。

天然具有随机性的生成模型,其原理是根据预训练的基本模型所学到的人类形象的普遍特征,然后通过少量的 finetune 来逼近一个特定的实体形象。显然共性与个性的样本不成比例。这种情况下,能够迅速从人类的一般形象具像化到一个特定的实体,仅仅是少数几张样本的 trigger,这是一件一年前还难以想象的事情。把一个人的特征抓住,重现出不同场景的形象,做到真假莫辨,要让自己和亲友惊喜、服气,现在基本做到了。如今基础模型的发展及其 fine tune 技术,做到了对结果的可靠性有一定的保障了。

这其实开辟了很大的个人用图的想象空间,因为人的本性都是自我中心(“自我”的延伸也包括自己的亲友)。自拍为什么流行全世界,正是因为符合了人的本性。半年前就见到有修图软件配备了类似的能力,推出了“情侣照”系列,可以让任何 couple 惊喜。

当然,四分之一的良品率,十分之一的惊艳率,听上去还不够好,因为次品还是太多了。但考虑到生成模型可以没完没了快速生成,而人的判断拣选则是非常简单、直觉的,这个比例已经不会成为实际使用的障碍了。当然这里有个生成(属于“推理”)过程的成本问题,毕竟推理需要在线的算力。不过,成本会随着时间和技术进步而下降。

从商业模式来看,订阅式(例如缴纳年费)目前是给你一定量的 credits,每生成一次要用n个credits,以此来控制成本,限制滥用。但随着AIGC产品和服务的内卷和白菜化,不久就会出现类似手机流量公司推出过的 unlimited plan。这样来看 1/4 或 1/10,成本最终也不是问题。何况,随着模型技术的爬升,良品率有望进一步提高。
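上面说的良品率与 credits 成本,可以用一个粗算小函数表达:平均每张可用图需要 1/良品率 次生成(几何分布的期望)。下面的数字(每次生成 2 个 credits)纯属假设,仅作示意:

```python
def expected_cost(keepers, keep_rate, credits_per_gen):
    """估算攒够 keepers 张可用图片所需的期望生成次数与 credits(粗略模型)。"""
    gens = keepers / keep_rate  # 每张可用图平均需 1/keep_rate 次生成
    return gens, gens * credits_per_gen

# 假设良品率 1/4、每次生成耗 2 个 credits,要攒 10 张可用图:
gens, credits = expected_cost(keepers=10, keep_rate=0.25, credits_per_gen=2)
print(gens, credits)  # 40.0 80.0
```

由此也能看出,良品率从 1/4 提到 1/2,成本直接减半;这比单纯降低每次生成的费率对用户体验影响更大。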

由于职业关系和技术控的思维定势,我对于业界领先的订阅付费式的AI工具和服务(chat,mj,nightcafe ......) 一律做 early adopters,好与我们的复现或创新工作有所比对。你会发现,AIGC 目前的确让人眼花缭乱,不断在演进。这是一个令人兴奋的技术爆发时代。

 

个性化精调图片生成实验(1)

个性化精调图片生成实验(2)

个性化精调图片生成实验(3)- AIGC 甜

 

个性化精调图片生成实验(4)

个性化精调图片生成实验(5)

个性化精调图片生成实验(6): AIGC立委先生

个性化精调模型 AIGC 老哥(7)

 

个性化精调模型 AIGC 老爸(8)

个性化精调模型 AIGC 小妹(9)

 

短视频系列:老爸故事15

远亲不如近邻。我们家当年与邻居何妈妈家就跟一家人似的,虽然以当年的政治气候,她家出身不好,是“黑五类”。这是小卉姐的回忆,收入《老爸的故事》短视频合集第15集。中文视频由奇妙元制作。

 

英文视频是用我们的出海产品 dupdub talking photo 多轨道制作。

 

https://www.ixigua.com/7292216809276015119

小卉姐看到她青春时代的照片开口说话,中英文双语,惊喜异常,赞叹科技发展之神奇。

 

 

 

李名杰医学论文集影印版目录

 

【李名杰从医67年论文专辑】(电子版)

【李名杰从医67年论文专辑(英语电子版)】