The Turbulent Second Chapter of Large Language Models: Has Scaling Stalled?

Guangmi's recent Chinese-language podcast, the quarterly report on large language models that discusses a "paradigm shift" in scaling on the way to AGI (Artificial General Intelligence), is well worth a listen. It touches on many key topics in the AI industry landscape, with a distinctive perspective and style.

The term "paradigm shift" may sound a bit dramatic, but as a seasoned analyst, Guangmi uses it to describe the current turbulent landscape accurately. While the AI arms race among industry giants is still in full swing, real-world scalable applications of these models are struggling to materialize. The question of how to justify investments has become a significant pressure point, or perhaps even a looming bubble.

Let's revisit some AI basics. There are three main types of learning in LLMs (Large Language Models):

(i) supervised learning;
(ii) unsupervised learning (self-learning/pre-training); and
(iii) reinforcement learning (RL, self-play/post-training).
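
To make the three signals concrete, here is a minimal, schematic sketch in PyTorch-style Python. The toy model, labels, and reward are invented for illustration only; this is not any lab's actual training code, just the shape of the three objectives.

    # Toy contrast of the three training signals. The model, labels, and reward
    # below are hypothetical placeholders, purely for illustration.
    import torch
    import torch.nn.functional as F

    vocab, dim = 100, 16
    model = torch.nn.Sequential(torch.nn.Embedding(vocab, dim),
                                torch.nn.Linear(dim, vocab))
    tokens = torch.randint(0, vocab, (8,))                # a tiny "document"

    # (i) Supervised learning: fit human-labeled input/output pairs.
    x, y = tokens[:-1], torch.randint(0, vocab, (7,))     # y = human-provided labels
    supervised_loss = F.cross_entropy(model(x), y)

    # (ii) Self-supervised pre-training: predict the next token from the text itself.
    pretraining_loss = F.cross_entropy(model(tokens[:-1]), tokens[1:])

    # (iii) RL post-training: sample a continuation, score it with a reward,
    # and reinforce high-reward behavior (policy-gradient style).
    dist = torch.distributions.Categorical(logits=model(tokens[:-1]))
    action = dist.sample()
    reward = (action == tokens[1:]).float().mean()        # stand-in reward signal
    rl_loss = -(dist.log_prob(action).mean() * reward)

The point of the contrast: (i) and (ii) learn from a fixed corpus, while (iii) learns from feedback on the model's own behavior, which is why RL is the natural vehicle for self-play and post-training.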

Ilya has emphasized the importance of RL in exploring new directions for LLMs. Guangmi's podcast highlights RL as the pathway to the paradigm shift in AGI through large models.

Historically, two key milestones in RL stand out: AlphaGo's victory over top human Go players (followed by AlphaZero, which mastered the game purely through self-play), which shocked the world, and RLHF (Reinforcement Learning from Human Feedback), which aligned models with human preferences and paved the way for ChatGPT’s explosive growth.

Currently, discussions revolve around the potential of a new RL-driven ecosystem for large models (though there's no broad consensus—it's primarily a conversation within small Silicon Valley circles) and the emerging trends in the "arms race" of large models. Here’s the context:

1. Pre-training scaling seems to have hit a bottleneck, with GPT-5 still unreleased;
2. The overall momentum of the arms race remains unchanged among the major players (the billionaire clubs/giants);
3. Key tech figures are proposing new roadmaps or trying to construct new scaling laws to continue the AGI journey.

Guangmi closely monitors trends in Silicon Valley. His small team conducts in-depth research in the Bay Area and has established extensive contacts. Having chatted with them over coffee a couple of times, I’ve found them to be a dynamic, young team under his leadership—a small but sharp presence.

Guangmi’s thoughts are well-structured, and his breadth of knowledge and understanding of the larger context are impressive. This is no small feat, as the landscape of large models, both in terms of the models themselves and the industry, is often akin to the parable of the blind men and the elephant. Even top experts and business leaders struggle to assess the full picture. Just recently, Meta’s Zuckerberg responded to a question about whether the AI arms race would deliver the expected AGI returns, essentially saying: “No one really knows, but we can’t afford to miss out,” reflecting a typical FOMO (Fear Of Missing Out) mindset.

We’re currently in a delicate phase with little consensus. However, the few tech giants that have propelled Nvidia’s stock to astronomical levels won’t allow the arms race to slow anytime soon, as it is central to their tech and business dominance. OpenAI continues to raise funds, and Ilya, with his new company, recently secured more investment, all of which keeps the race heated.

At the same time, the obsession with scaling among tech elites and the mainstream AGI circles in Silicon Valley persists. The endless demand for resources driven by this scaling wave of large models means that only a small circle of tech insiders has the opportunity and resources to experiment, sense, and adjust the roadmap.

According to Guangmi, the so-called self-play RL scaling is currently gaining traction within a small circle of about 200 tech elites in Silicon Valley, indicating that this is still a nascent trend—one that even management leaders have not fully aligned with yet.

It seems Guangmi adopts a “prophet” mentality at times, perhaps exaggerating this trend to alert his audience. He even suggests that if he were a large-model entrepreneur, he would focus 200% of resources on RL, betting on it as the future path to victory.

In reality, for most people, this advice is neither practical nor actionable—it’s likely aimed at tech giants or unicorns, though even for them, it may fall on deaf ears.

Reinforcement learning is inherently challenging. Even Meta's Llama 3, the open-source flagship, chose to sidestep full RLHF in its post-training alignment. So it is even less realistic to expect large-model teams to bet everything on RL as the core of a new ecosystem. Furthermore, this trend is, at best, a "subtle undercurrent" in Silicon Valley. We will likely have to wait until OpenAI's "Strawberry" or a new version of Claude is released later this year to assess its impact more clearly.

It seems the first chapter of LLM scaling has indeed come to an end. The actionable items in the so-called second chapter might not emerge from lofty, exploratory scaling directions with an uncertain roadmap. Instead, the focus should be on finding market entry points, accelerating applications, and addressing genuine market needs (PMF, product-market fit), especially as the inference costs of top models like GPT-4o/Claude 3.5 become more affordable, and multimodal capabilities (such as advancements in hyper-realistic full-duplex voice and video) further enhance application opportunities.

For the industry, the bottleneck in scaling large-model applications is the sword hanging over its future. This will determine whether the second chapter of the tech adoption curve ends with a soft landing and eventual recovery. As for the arms race, it’s best to leave that to Elon Musk, Zuckerberg, and the billionaire club to continue playing.

Reinforcement learning, as an extension of pre-training, belongs to the realm of “post-training.” When pre-training hits bottlenecks and diminishing returns, strengthening RL is a natural complement. In the simulation of human cognition, pre-training represents the accumulated knowledge of human civilization, while RL applies that knowledge in practice, learning from the environment. This overall approach to intelligent learning makes perfect sense and is the necessary direction for applying large models.

My old friend Lu said: “It’s intuitive that RL is the path we must take because there isn’t enough supervised learning data anymore.”

Indeed, using model-regenerated (synthetic) data to varying degrees has become everyday practice; it is inevitable. Models can already generate data of higher quality than humans can, and this will only improve. However, passively reusing such data is not the same as self-play's proactive exploration and data regeneration.
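
To make the distinction concrete, here is a minimal sketch of the two regimes; generate, verify, and retrain are hypothetical stand-ins, not a real pipeline:

    # Passive reuse of model-generated data vs. a self-play regeneration loop.
    # All three helpers are invented placeholders for illustration.
    import random

    def generate(model: dict, prompt: str) -> str:
        return f"{prompt}::candidate-{random.randint(0, 9)}"

    def verify(prompt: str, answer: str) -> bool:
        return answer.endswith(("7", "8", "9"))    # stand-in checker / reward model

    def retrain(model: dict, data: list) -> dict:
        return {"train_set_size": model["train_set_size"] + len(data)}

    model = {"train_set_size": 0}
    prompts = [f"problem-{i}" for i in range(50)]

    # (a) Passive reuse: generate once, keep everything, train on it.
    model = retrain(model, [(p, generate(model, p)) for p in prompts])

    # (b) Self-play style: actively explore many candidates per problem, keep only
    #     the verified ones, and feed them back over repeated rounds.
    for _ in range(3):
        explored = [(p, a) for p in prompts
                    for a in (generate(model, p) for _ in range(8))
                    if verify(p, a)]
        model = retrain(model, explored)

The difference is the closed loop in (b): the model's own exploration, filtered by a verifier, becomes the next round's training data.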

As Mr. Mao pointed out: “RL aligns with the cognitive processes of humans and epistemology. It’s essentially the process of receiving external feedback and being tested in practice. RL is active learning, while training is passive.”

What Guangmi is really proposing, in effect, is to turn RL into a repeatable paradigm: some kind of RL dev toolkit, plus a route for scaling RL up across different scenarios. Without a year or two of push and popularization from the big players, though, this so-called paradigm shift still lacks a concrete handle. The potential trend is worth keeping in mind; it is best to remain cautiously optimistic and open-minded while watching how things unfold.

 

Related original:

大模型风云诡谲的下半场:scaling 失效?


广密大模型季报谈AGI范式大转移这篇播客,很值得一听。涉及很多大模型产业重要话题,视野和风格很独到。

“范式大转移”的说法太耸人,但风云诡谲,是当下的写照。那是因为大佬军备竞赛虽然依旧如火如荼,可应用落地却处于难产期,如何 justify 投资是一个巨大的拷问,或泡沫。

三大学习: 监督学习、非监督学习(自学习/预训练)、强化学习(RL,自主学习/self-play),伊利亚曾经专门强调后者作为探索大方向的重要性。广密这里特别强调它是正在到来的大模型AGI之道的范式转变。

此前,大家都知道强化学习主要是两个里程碑:一个是 alpha0 围棋完胜人类选手,震惊了世界 ;另一个是所谓RLHM(人类反馈强化学习),强化了与人类偏好的对齐,成就了ChatGPT的核爆。

现在谈的是大模型新生态可能性(并无广泛共识,只是硅谷小圈子在做、在议)以及大模型“军备竞赛”的新趋向。这个话题的背景如下:

1、 预训练 scaling (更大规模)似乎受困,GPT5 迟迟不出;

2、 军备竞赛的大格局和造势,大厂和大佬不要改变;

3、 技术大佬开始提出新路线图或试图构建新的 scaling law 继续AGI 的征程

广密在podcast里面,观察硅谷动向比较 closely,他的小团队不仅定期去湾区做深度调研,也建立了广泛的联系。在硅谷跟他们喝过两次咖啡聊天,一帮生龙活虎的小年轻在他的带领下,我的印象,是一个小而精干的独特存在。

这台节目的个人风格和视野也非常 unique,喜欢他说话的思路敏捷,有跳跃感,但张儿不散,有一种吸引人的表达力。主持人与他的交互也很丝滑,张弛有度。

听他们唠嗑吧,谈笑间大模型AGI的大趋势貌似尽收眼底。还是值得点赞的。

广密条理非常清晰,所涉及的知识面和大形势观非常广泛,却能present到自己的视角参照系,与LLM社区的思想趋势有较好的映射。这不容易,因为LLM这档子事,无论模型还是产业的 landscape,大多都是盲人摸象。很多大专家、商业大佬也都各有自己的三分地和视角,也很难全面评估形势。Meta 小扎刚前不久面对万卡竞赛能不能得到预期的AGI return的天问,回答说(大意):其实没人知道,但总不想万一错过的(典型的 FOMO心态)。

目前形势处于微妙期,其实还没有凝聚太多的共识。但是把英伟达送上天价的几个富可敌国的大佬/大厂,短期内却绝对不允许停止军备竞赛,这是他们科技商业争霸的游戏。这叫欲罢不能,节奏在他们手中。Open AI 不断融资,伊利亚自己也最近融资成功,这些都是这场竞赛持续热度的浪花。

与之相配合的是技术大佬和硅谷AGI主流技术圈对scaling的执着和痴迷。因为这次大模型 scaling 技术浪潮对于资源的无止境需求,真正能有机会实践、感知并做出调整改变路线图的技术人,也只能是一个很小的圈子。

据广密的信息,这个所谓 self-play RL 新生态趋势,目前是局限在硅谷技术大佬小圈子的共识,他提到大约不超过200人的圈子的。如果信息正确的话,一个在硅谷技术核心圈200人以内的某种共识和议论,说明还只是一个动向,甚至连管理圈还没真正 get it 和对齐。

感觉上,广密有一些“春江水暖鸭先知”/“语不惊人死不休”的心态(LOL),有意强调/夸张了这个趋势,警醒国人,甚至说,如果我是大模型创业家,我会200%资源聚焦 RL 方向,bet on it,因为这是未来赢家的选择,云云。

其实,客观说,对于多数人这个不实在,也无可操作性,最多是说给国内大厂玩家或六小龙听的吧,但其实也是白说。RL 本来就不好玩,连开源标杆 Meta Llamma 3 在最基本的 RLHF 方面都选择绕开来走,就更甭提提倡国内大模型公司全力 bet on 以强化学习作为新生态核心的愿景了。何况后者在硅谷最多也只是一种“潜流”,可能要等年底前OpenAI草莓以及Claude新版发布后,才能对这个所谓新生态的影响,看得清楚一些吧。

这个苗头可以 keep in mind,但上半场确实似乎结束了。真正可以在所谓的下半场作为 action items 的,其实不是这种高大上、带有很强探索性的大模型 scaling 方向的尚未确定的 roadmap,更多是趁着 GPT4o/Claude3.5级别的通用模型的推理成本越来越亲民化、趁着LLM供应商多模态功能在进一步推广和完善(例如超拟人全双工语音的最新突破和工具赋能就会大大增加应用层面的机会,还有视频的进展等), 加快找市场切入点(PMF),专注应用场景真正需求的解决。

对于产业而言,当前大模型规模化应用的困局才是悬在大模型产业头上的利剑,决定了这下半场在 tech adoption curve 下行能不能软着陆和最终平缓回升。至于军备竞赛,让马斯克、小扎等首富俱乐部继续玩继续high就好。

作为“预训练”的延深,强化学习属于“后训练”,在前者遇到瓶颈和 diminishing returns的时候,加强后者是自然的补足。从AI对人类认知的模拟来说,前者是继承人类文明的知识和科技积淀,后者是把这些知识真正用到实处,在环境中学习。这个智能学习的总体思路 makes perfect sense,也是大模型应用必须要走的方向。

所以老友吕兄说:“直觉上RL是必须要走的路,因为supervised learning的数据没有那么多了。”

没错,不同程度利用再生数据,其实已经是日常 practice 了,也不再有以前的“心理障碍”,是一个必然。总体而言,模型就是比人能够更高质量产生数据,而且会越来越好。但这还不是这里说的self-play的主动探索和数据再生。

毛老说的也不错:“RL 与人类的认知过程相符,与认识论一致。实质上就是接收外界反馈,接受实践检验的过程。RL 是主动学习,而训练是被动的。”

广密现在是说,需要研究测把 RL 范式化,提供某种 RL dev toolkit,然后有在各种场景去做 scale up RL 的路线。这个所谓“范式大转移”,没有1-2年的大厂/大佬的推动普及,没有抓手。持谨慎乐观或怀疑的open 心态,静观其变吧。

Professor Ma's Long Paper Is Out

Here is the link to Professor Ma Yi's presentation at the Shenzhen entrepreneurship forum (in Chinese, recommended): https://mp.weixin.qq.com/s/ibxGO_A7H-akpbwf2R2mGw

Professor Ma is a compelling speaker, and his talk is definitely worth listening to. His paper on the white-box transformer, over 100 pages long, has just been released (https://ma-lab-berkeley.github.io/CRATE/). Unfortunately, I haven't had time to dig into it yet; we will have to wait until more people have digested or verified it before delving deeper.

His current claims revolve around using an extremely sparse approach to force transparency in transformers, with results that are reportedly on par with BERT and GPT-2 in many benchmarks. However, this doesn’t mean that he will be able to catch up with GPT-3 or later models anytime soon. But to be fair, it’s not a level playing field—he’s an academic without the resources to compete with mainstream AI in an arms race. What he does believe, however, is that he has opened a door—a path toward explainable AI in large models.

Honestly, I have always had slight doubts about Ilya's theoretical account of shortest-program compression (from his Berkeley talk). From an ultimate, theoretical perspective, where lossless compression is the ideal, the idea of continually scaling up training and learning longer and deeper makes sense: it pushes the model toward becoming the smallest possible program for universal tasks. Ilya's theory may hold up in that respect, at least in theory or as an end goal. But under any real-world constraint (a budget, a methodology), it is hard to call a model obtained purely through gradient descent the "shortest program", because these models look like gigantic beasts with huge circuits inside and, intuitively, should not count as "short" or "small".
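
For readers who want the compression argument spelled out, the usual two-part (MDL-style) accounting looks like this; the notation is mine, a generic textbook rendering rather than Ilya's own formulation:

    % Two-part description length of data D under a trained model M:
    %   bits for the model itself  +  bits for the data given the model.
    L(D) \le \underbrace{L(M)}_{\text{model bits}}
           + \underbrace{L(D \mid M)}_{\text{data bits given } M},
    \qquad
    L(D \mid M) = \sum_{t} -\log_2 p_M(x_t \mid x_{<t}),
    \qquad
    K(D) \le L(D).

Scaling and longer training mainly drive the second term down; the point here is that at any real budget the first term, L(M), stays astronomically large, so the trained artifact is nowhere near the Kolmogorov bound K(D), i.e., nowhere near a literal "shortest program."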

Models with hundreds of billions or even trillions of parameters are massive monstrosities, succeeding mainly through sheer size rather than through regularity or elegance. Emphasizing how impressive their compression ratios are, or how well they approach lossless compression, may help explain, from a theoretical standpoint, the generalization and emergent abilities that arise from sequence learning. But in practice, any model at a given point in time is far from being the "shortest program."

This highlights an unavoidable distance between theory and practice. Ilya essentially hedged practice with theory along a future time axis, but our immediate reality does not seem to align with it. It is like a lumbering wrestler trying to brand himself as a sleek, slender fashion model: to most eyes, the image simply does not fit.

Instinctively, LLMs feel full of rote memorization with significant redundancy. Under real-world conditions, achieving extreme or lossless compression seems impossible.

On the other hand, Professor Ma's sparsity approach almost feels "over the top." Tying the Q, K, and V projections to a single shared weight matrix seems crude and simplistic, yet the model still trains successfully. This shows how much slack there is inside transformers: whatever restrictions or pruning you impose, the model still finds a path out. In this sense, Professor Ma's pursuit of the "shortest program" is more literal and direct; it is so short that even a human can interpret the process (hence the claim to LLM explainability).
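
For readers who want to see what tying Q, K, and V means mechanically, here is a minimal PyTorch-style sketch of one attention head with a shared projection, next to a standard head; the dimensions and names are mine, not the paper's code:

    # One attention head: standard (three projections) vs. weight-tied (one shared).
    # Shapes and names are illustrative, not taken from the CRATE codebase.
    import torch
    import torch.nn.functional as F

    d = 64
    x = torch.randn(1, 10, d)                      # batch of 1, 10 tokens, d dims

    # Standard head: independent Q, K, V projections.
    Wq, Wk, Wv = (torch.nn.Linear(d, d, bias=False) for _ in range(3))
    standard_out = F.scaled_dot_product_attention(Wq(x), Wk(x), Wv(x))

    # Weight-tied head: a single matrix U plays all three roles.
    U = torch.nn.Linear(d, d, bias=False)
    q = k = v = U(x)
    tied_out = F.scaled_dot_product_attention(q, k, v)

The tied version has a third of the projection parameters and a far more constrained geometry, which is exactly the kind of restriction the model still manages to train through.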

Yet the gap between these two extremes is still mind-boggling: gigantic black-box models on one side, extreme simplification yielding white-box models on the other. That both approaches work is itself shocking.

Speaking of simplicity and explainability, here’s an interesting anecdote in AI history: Back in the day, during the era of symbolic MT, one of the earliest deployed systems (Siemens' METAL) for English-German translation used only eight symbolic features (such as human, animal, etc.). The rules were simple, transparent, and easy to explain. This shows that extreme simplicity and rule-based transparency can work in some rough application scenarios (where English and German are linguistically close, making translation easier).
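
Purely to give the flavor of such a system (the features, words, and rule below are invented for this post, not the actual METAL inventory), a feature-driven lexical choice looked roughly like this:

    # Invented illustration of feature-based symbolic MT: a tiny lexicon of
    # semantic features and one transfer rule keyed on them.
    LEXICON = {
        "doctor": {"HUMAN", "PROFESSION"},
        "dog":    {"ANIMAL"},
    }

    def choose_german_verb(english_verb: str, object_features: set) -> str:
        # Rule: 'treat' renders differently for human vs. animal objects
        # (an invented example of how a handful of features drove choices).
        if english_verb == "treat":
            return "behandeln" if "HUMAN" in object_features else "kurieren"
        return english_verb          # fallback: pass the word through

    print(choose_german_verb("treat", LEXICON["doctor"]))   # behandeln
    print(choose_german_verb("treat", LEXICON["dog"]))      # kurieren

With only a handful of features, every decision of such a system can be read directly off the rules, which is the transparency being described here.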

Later, we MT-ers expanded the number of features to the thousands, trying to cover more of the long tail. Even then, it wasn’t perfect. At the time, we thought that with enough effort, we could match the quality of statistical MT. But now, we know that even if symbolic MT could catch up and match statistical MT, it’s still far from competing with neural MT.

So, could we have kept refining the features? It was not for lack of wanting: we hoped to push the symbolic features (similar to one-hot encoding, but with an ontology/taxonomy structure imposed inside them) beyond thousands to tens of thousands. In reality, a few thousand features already approached the limit of what human experts could understand (in terms of explainability), manage, and debug. Expanding further would have been unmanageable.

Meanwhile, how many parameters do mainstream Transformer networks have? The space and granularity they represent are on a completely different scale. Given that vast difference, it is natural to doubt any effort to bridge the gap for the sake of AI explainability. How could that even be possible?

That’s why I’ve always felt that explainability in large models is an elusive goal. But Professor Ma is telling the world that they’ve achieved it.

 

 

Relevant links:

Professor Ma Claims to Have Fully Unveiled the Mysteries of Neural Networks

What did Ilya see? -- secret behind success of LLMs

马毅教授的演讲,值得一听

创业邦深圳会议马毅教授的演讲链接在此:https://mp.weixin.qq.com/s/ibxGO_A7H-akpbwf2R2mGw

马教授还是很能讲的,他上面的演讲,很值得听。他的100多页论文也已经放出来了,可惜没时间钻研了,等以后更多人接受或验证后再说。

他目前所做出的 claims,是说用那种极度稀疏化的方法逼迫 transformer 透明化,结果也在多方面匹敌了BERT 和 GPT2。但并不说明短期他有办法赶上GPT3以上。话说回来,那也不公平。他作为教授没有资源去以军备竞赛的方式与AI产业主流打擂台。只是说,从路线上说,他觉得自己打开了一扇门,一条可以通向可解释AI的大模型大门。还是应该赞佩这样的反潮流的教授的。

其实,我也一直隐隐约约对伊利亚说的最短程序压缩论,持有怀疑:从终极目的(理论上以无损压缩作为理想目标)来看,不断加大训练、加深加长学习,结果就是朝着让模型真正成为最小程序,伊利亚理论也许没错。但在任何一个实际条件约束下(例如预算约束、方法论约束),这种纯粹靠 gradiant descent “凑出来”的模型/路径,很难说是最小 program,因为模型看上去就是个庞然大物,谈何“最小”。

千亿万亿参数的超大模型本来就是以大取胜,而不是以精简和规则见长的怪兽(gigantic monster),非要强调自己的压缩率厉害,无损压缩做得好,虽然有从理论上方便说明序列学习达成的通用性、泛化现象以及“涌现”能力,但实践中,在任意一个特定时间条件下的模型,都远远不是“最小程序”。

这是理论和实践躲不开的一种矛盾。在伊利亚那里,实际上他是以未来时间轴,用理论对实践做了对冲。我们的真实感觉并非如此,不敢这么说。就好比一个摔跤选手,都那么笨重了,还非要标榜自己性感、苗条?

直觉上,LLM 里面充满了死记硬背和信息冗余的,在现实条件下其实不可能做到极度/无损的压缩。

但另一方面,马教授也太奇了,他的稀疏化直觉上做得“过分”,QKV直接拉平,看上去有点简单粗暴,但居然也最终能训练出来。可见,transformer 的肚子里的操作空间还是很大的,你给它各种限制,动不动就剪枝(化零),也不用担心它走不出来。这种意义上,马教授追求的才是真正的“最短程序”,短到了连“豆腐脑”的人类都可以看懂路径(hence 可解释性)。

疑问还是这两个极端差距太大。一边庞然大物,一边无限精简,二者都能走通,也是震撼了。

谈到精简可解释,谈个掌故。老老年做 symbolic MT,一个著名的早期的实用系统(西门子的 METAL)做英语德语的翻译,整个系统只用了8个 symbolic features(例如人、动物等),规则简单而可解释,系统也一样上线实用了。可见极度精简和规则化,做到完全透明和人类可解释,在粗线条的应用场景(英语和德语距离较近,翻译难度低),有时候也是管用的。

我们后来把 8 个 features 扩展到千数量级,才擦了长尾的屁股。但也没擦干净。当时觉得,也许认真做可以对垒统计MT的品质(与董振东老师谈过,我们都觉得可以在翻译上最终用符号打败统计的,只是需要时间磨细活),但现在知道即便匹敌了统计MT,也远远不能与神经MT比高下。

那就把 features 往细做,成不?不是因为我们不想继续把 symbolic features (类似于 one hot encoding,但人为在 features 内部强加了类似于 HowNet 的 ontology/taxonomy 的结构性),从千这个量级进一步提升到万的量级。实际情况是,千几乎已经达到专家人脑的极限了,再扩大 features 的范围,我们就无法掌控和调试了。

可是,神经里面有多少 params 啊,其所能反映的 representation 的空间和细密度,与千量级的 symbolic features,尺度完全无法比拟。二者表征的尺度如此悬殊,对拉近二者距离的任何努力,我们天然会产生怀疑:这怎么可能做到呢。

所以一直就觉得大模型可解释性是一个可望不可及的目标。马教授告诉世人,他们做到了。

相关链接:

马毅教授称,已经揭开完全揭开神经网络的面纱

An NLP Veteran's AIGC Journey (NLP老司机的AIGC旅程)

Today I thought I would take stock, in a playful spirit, of the past two years of my AIGC journey, and of what an old NLP soldier has felt and learned along the way.

Before the large-model explosion, what fascinated me most were the txt2img text-to-image models that already existed then. I tried many tools at the time; "Xiao Ya" (Maria) is a product of that stage. Beyond portraits, I also experimented with all sorts of painting styles and shared them repeatedly in chat groups and on my blog. Eventually the novelty wore off and I stopped playing with them much.

Then I became interested in digital humans: 2D talking photos, 2.5D virtual anchors with poses, 3D dance, and so on. Since 「奇妙元」 is our own product, there were no limits on playing with it, and as its "product experience officer" I went wild for a while.

Unfortunately the golden window for digital humans passed in a flash. Before the field could take off, it became a crowded, mixed bag, flooded everywhere and ground down by market competition.

Next came hyper-realistic voice-over, cross-lingual breakthroughs, and most recently "full-duplex" speech: the leading models have begun showing off voices indistinguishable from ordinary people, no longer the stiff announcer tone. Our own flagship AIGC product, 「魔音工坊」, caught this wave of end-to-end speech-token large models and achieved hyper-realism as well, roughly half a year ago. That matters, because voice is the most natural interface for all copilot-style LLM applications and a necessary enabler for digital humans and short videos; in terms of sheer playability, though, speech is less fun than music generation.

Then Suno made its dazzling entrance. I was hooked for months and fulfilled my dream of being a "musician" myself. That too has faded, not because it is not good, but because there is no time to play.

My time was taken over by "China's Sora", Kuaishou's KeLing video-generation model. I have played with video generation madly up to today: recreating childhood memories, freezing and reliving life's high moments, staging scenes beyond real life. The most euphoric phase has passed. All this experimentation, including extreme tests of three-minute continuous generation and all sorts of prompt-engineering exploration, has given me a fairly clear picture of the strengths and weaknesses of today's visual models.

One important application form of visual models is "one-click video", also one of our own products, called 「元创岛」 (YuanChuang Island). It is still rough and rudimentary, but it genuinely delivers "foolproof" production: zero barrier to entry, anyone can use it to generate a video. There are clear signs of real use cases and of taking off.

This fascination with multimodal experiences would seem at odds with a career spent entirely on text NLP. But behind it is the grand backdrop of large models: having flattened NLP, LLMs moved on without pause to flatten multimodality. That generality makes everything feel threaded on a common line, a natural confluence of technologies. That is the view from the research side of the journey.

From the angle of humanities meeting technology, the innate pursuit of art and the humanities in an "old liberal-arts student" like me was never extinguished by decades of "digging coal" as a coder in industry; applying it today is another natural convergence. It is a bit like Steve Jobs' old stance: he pursued technology products with a humanistic flavor, engineering combined with aesthetic taste, and mocked Microsoft's products as crude, with "no taste".

Looking back, the journey has been fascinating: research and applications alike have been converging as if by design. How fortunate we are to witness, experience, and throw ourselves into this confluence, even though it also means overturning, crushing, and negating our own past and abandoning many old "signature skills", such as the ox-carving craft of symbolic parsing that we once honed to a world-class level. Lifelong learning is what keeps one from falling too far behind. But above all it takes spirit, and especially passion: driven by passion, one never tires.

The next passion point should be to-B scenarios, because the big hopes for applications most likely lie in vertical domains. To-C is brutally competitive, but its roadmap and state of play, what can be done, AIGC included, are already basically clear. To-B is still struggling in the mud, its directions glimpsed only dimly through fog, flickering; yet there are masters there. Dr. Bai Shuo, for one, seems to be sitting on the lotus pond of financial trading, stroking his beard and smiling, backed by years of to-B accumulation.

Personally, among vertical tracks my favorite is education, followed by law; both lie along the path of large-model knowledge capabilities: easy to be steamrolled by general-purpose models in the end, yet able to show value immediately in aligned scenarios. Finance is too fiddly, and its waters run deeper. Water management, power, and automotive are highly specialized and feel dry to an outsider. Healthcare and psychology are tempting, though harder to enter than education or law. We shall see where fate leads me.

Decoupling to Resolve the Issue of Character Consistency in Video Generation

I’ve now become the go-to expert for AIGC (AI-generated content) "custom services" among my old friends and classmates, just for fun. Below are nostalgic videos made from old photos that two of my classmates asked me to create.

Whenever I find the time, I’m more than happy to provide this kind of emotional value for friends and family because it’s truly satisfying to see their reactions of surprise.

The pianist is now a world-class piano master, frequently touring and performing in Europe, America, and China. These are precious old photos of him practicing and performing with our mutual friend, Brother Sun, in Philadelphia back in the early days.

Dr. Bai Shuo, a seasoned NLP expert and a multi-talented musician, commented wryly: "It looks real enough for someone drawing the bow in a piece called Meditation, but the bowing and fingering are all wrong."

Another old friend also left feedback noting that the visual model doesn’t understand music: "This needs improvement! It's obvious that the model was created by someone who doesn’t know how to play the violin or piano. The bowing and piano accompaniment are off. The first note has a two-and-a-half beat long tone, which should be played with a long bow. Additionally, the pianist’s right foot should never be raised or shaking like that—it should be on the sustain pedal.”

LOL

Even though the piece's name, Meditation, was clearly specified in my prompt, no model in the foreseeable future will truly align an understanding of the music with the fine-grained body movements of a performance. Perhaps this can be reserved as one of the ultimate challenges for large models aiming at AGI: in theory, given enough aligned data of musical performances, the compression view of "joint training" suggests it is possible to aim at perfect alignment across modalities.

If simulating the objective world is the ultimate goal of visual models, then the current generation of visual models is at the level of “playing the piano to a cow” or “playing music to a tone-deaf audience”—completely unable to withstand scrutiny from musicians. For example, as someone with little musical knowledge, when I watch the nostalgic performance videos above, I wouldn’t notice the flaws as an expert would; instead, I find them vivid and emotionally engaging.

Of course, the standards of musicians may well be a "pseudo-demand" or a pseudo-goal (even if the visuals satisfied the picky "expert eye," so what? Would it sell better?). It may not be worth the effort to pursue. In theory, though, an ideal AGI should be capable of meeting such expert-level demands.

That is the challenge of musical-performance alignment. Another challenge for Sora-like video-generation models is character consistency across a video.

Achieving facial consistency in generative visual models is extremely difficult. Don't expect this issue to be resolved by video generation models alone in the short term, least of all through autoregressive methods.

Human eyes are extremely discerning when it comes to faces, especially the familiar faces of friends and family: you can tell immediately when a character's appearance drifts. For example, while playing with old photos recently, I used the KeLing model (a top-notch video model in China) to generate a video of myself. At the 5-second mark it still looked passable, but by 10 seconds it no longer resembled me.

In the 10-second video, a mere turn of the head and it is no longer me; it looks more like my brother. How can a model handle such fine detail? Especially when the starting image for the video is not even a frontal shot, so the character information is incomplete to begin with, how could it not drift off course?

The videos I have made for friends and family with KeLing during its public beta have generally been met with surprise and amazement, but most of them suffer from this character-consistency problem, which remains a regret.

The one-click video products currently on the market (including our own recently launched YuanChuang Island) tend to rely on anime or other stylized looks. This sidesteps user scrutiny, because such characters lack distinctive individual features: as long as the attire stays consistent, genders are not mixed up, and age and ethnicity roughly match, most viewers will accept it. Today's one-click videos are generally coarse-grained, and their entertainment value lies mainly in the story rather than in Hollywood-style character portrayal. But as this path moves upmarket, it will inevitably run into the problem of casting digital IP actors and keeping them consistent.

My old friend Lu remarked: "The consistency issue may require cross-checking across multiple video angles, which more or less touches the hard question of whether explicit modeling is needed."

Exactly: some form of cross-checking is required, not just monotonic correction along the time axis; that is the key. The character's image needs to be decoupled from the storyline rather than generated down a single one-way path. Sequence learning has produced miracles in LLMs, but sequence generation has inherent limitations, including random drift over time. It is not as extreme as LeCun's criticism, that GPT's error accumulation means a tiny discrepancy leads to a significant miss; his claim is not entirely accurate, because GPT's autoregressive operation also corrects and adjusts course at every step within the context. Nevertheless, for fine-grained consistency, random drift is almost impossible to tame, even with corrective mechanisms in place.

Hence decoupling, decoupling, decoupling! Decoupling can solve the problem. The world is not made only of sequences. Beyond sequence and time there is a constant abstraction, the character image or IP, that can be exploited. This is becoming increasingly clear. Take, for example, the digital IP character Maria (Xiao Ya) that I created with AIGC txt2img tools more than two years ago:

For anyone who is not a fan, my many Maria videos may well cause aesthetic fatigue; someone even called her "Dr. Li's fairy" (LOL). But she does have fans, several of my old classmates among them.

Why? Because she is an IP, and she has been decoupled.
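
A minimal sketch of what "decoupling" could look like computationally: the character is captured once, outside the time axis, as a constant identity embedding (for example from several reference views), and every generated frame is conditioned on that constant in addition to the evolving story, instead of identity being carried only through frame-to-frame autoregression. The functions and shapes below are hypothetical, not KeLing's or any product's actual interface:

    # Hypothetical sketch: identity-decoupled frame generation. The constant
    # identity embedding never drifts, unlike purely autoregressive identity.
    import torch

    def encode_identity(reference_views: torch.Tensor) -> torch.Tensor:
        """Stand-in identity encoder: pool multiple reference views into one vector."""
        return reference_views.mean(dim=0)                       # (d,)

    def generate_frame(prev_frame, identity, story_state):
        """Stand-in generator: next frame conditioned on the previous frame,
        the evolving story, and the constant (decoupled) identity."""
        return 0.5 * prev_frame + 0.3 * story_state + 0.2 * identity

    d = 128
    identity = encode_identity(torch.randn(4, d))        # 4 reference views of the IP
    frame, frames = torch.zeros(d), []
    for t in range(16):
        story_state = torch.randn(d)                      # changes with the storyline
        frame = generate_frame(frame, identity, story_state)
        frames.append(frame)

Because the identity vector sits outside the sequence, random drift in the frames cannot erode who the character is; that is the kind of cross-check Lu was pointing at.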

 

Related Links (original posts in Chinese):

视觉模型生成的极限对齐

解耦才能解套:再谈视频中的人物一致性问题

 

解耦才能解套:再谈视频中的人物一致性问题

前两天说过,对于生成式视觉大模型,人脸一致性是个非常难缠的东西,不要指望短期靠视频生成大模型本身来解决,尤其是不能指望自回归解决这个问题。

人眼太挑剔了,尤其是亲友和熟人,你会一眼看出人物走形了。譬如这几天玩老照片,我用头部视觉模型可灵5秒生成我自己,还过得去,到了10秒就不是我了。

10秒视频:

一转脸就不是我了,倒更像我哥。这种细粒度,模型怎么能搞定。尤其是,如果图生视频的起点图就不是正面照,character 信息本就不全,怎么可能不走偏。所以,我在可灵公测以来为亲友同学做的视频虽然普遍感觉惊喜或震撼,但大多存在这个人物变形的问题,成为一个遗憾。

现在市面上做的一键成片产品(包括我们的元创岛),其所以用二次元或其他夸张风格为主,是为了避免用户的挑剔,因为那些形象缺乏鲜明的个性,不是真正的 individual IP,只要保持穿戴一致性,男女不要错位,年龄和种族不要相左,一般人也就接受了。目前的一键成片普遍比较粗线条,娱乐价值更多是为视频里的故事,而不是好莱坞大片那样的角色形象刻画。但这条路往上走,就躲不开这种 digital IP 的演员角色定位及其一致性问题。

吕兄说:一致性问题可能需要靠多角度视频的cross-checking, 这里面多多少少要涉及到是不是要建模的硬核问题。

是的,要某种 cross-checking,而不是时间流单调矫正,这是key。需要解耦/剥离故事线上的人物形象,不能生成一条道走到黑。序列出过奇迹,但序列生成有随着时间出现随机偏差的局限,虽然不是 LeCun 批评的那样极端:他说gpt错误积累是差之毫厘失之千里;他的说法不正确,因为gpt的“自回归”推理方式也在每一步context自回归中不断纠错、矫正航向。尽管如此,对于细线条一致性,随机偏差哪怕有了矫正机制,也是基本搞不定的。

因此,解耦、解耦、解耦。解耦就可以解套。世界上也不是只有序列。跳出序列和时间,还有个恒定抽象(即character形象)可以利用。这一点已经越来越清晰了。以我制作的数字人IP形象小雅/Maria为例:

除非粉丝,也许我的众多小雅视频会引起审美疲劳吧,有人称她为“立委的妖精”(LOL)。但确实有粉丝,老同学中好几位人物就粉她。

为啥,因为她是IP,解耦了。

Related Links:

视觉模型生成的极限对齐

马毅教授称,已经揭开完全揭开神经网络的面纱

Professor Ma Claims to Have Fully Unveiled the Mysteries of Neural Networks

Original post by 立委 (LiWeiNLP), September 1, 2024, Beijing

Professor Ma's white-box transformer paper is here: https://ma-lab-berkeley.github.io/CRATE/?continueFlag=680deb7516c156566f8eb73fdcc896ca

Professor Ma Yi is famous and fiercely independent, a standard-bearer in his own right, known to everyone in the field. What deserves attention is his recent criticism of mainstream large models and the proclamation of his own work.

At the recent Shenzhen conference he was invited to, he dismissed mainstream large models, Ilya, and the Kolmogorov-complexity line of theorizing as middle-school level, saying they do not really understand theory. He claims a double breakthrough in theory and practice, namely the white-box Transformer his team has built, which in his telling not only lifts the veil of mystery from large models but is also an engineering-feasible alternative.

Frankly, people who talk this big are usually the real deal, with built-in confidence and a commanding swagger. Among critics of the mainstream, America has Yann LeCun (who says GPT is not as smart as his dog and is a dead end, and that his world models are the alternative), and China has Professor Ma; the world is less lonely for it. One does hope they can somehow make up for the "slow-thinking shortcomings" of today's mainstream models and push AI forward as a whole. His scholarship and practice deserve unhurried study when time permits, but mostly we will have to wait for time and for peers to test and reproduce the work.

At the Shenzhen conference he only made a brief appearance: once he had finished presenting and criticizing, he left at once. Presumably he really is busy.

The paper runs to more than 100 pages and was said to be released within days. Judging from the current outline, the key points are, quote:

Overall, CRATE is similar to a transformer, with two differences:

in each attention head, the Q,K, and V weight matrices are weight-tied, i.e., set to be equal;

and the nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP), but rather a more structured operator (ISTA) with sparse outputs.

For background: ISTA (Iterative Soft-Thresholding Algorithm) is an algorithm for sparse optimization problems, widely used in machine learning. In the CRATE architecture, ISTA replaces the multi-layer perceptron (MLP) of the conventional Transformer; recall that the recent KAN work likewise aimed to be a drop-in replacement for the MLP. Both take the scalpel to the inside of the Transformer.

My shallow understanding is that ISTA follows the same line of thinking as KAN for Science/Physics: through some form of regularization or pruning, the fit is forced onto sparse paths, and interpretability is gained as a result.

How it works: ISTA approaches the optimum iteratively. Each iteration consists of two steps: a) a gradient-descent step, the same as in the mainstream; b) a soft-thresholding operation. The added operation balances two goals:

a) make the model as accurate as possible; b) make the model as sparse, i.e. as simple, as possible (because humans cannot make sense of a tangled mess).

Soft-thresholding encourages internal elements to become exactly zero, producing sparse outputs and adding interpretability. The weight-tied attention mechanism and ISTA together push the model toward a deeper grasp of the structure of the input data. This is more like the human process of structured parsing: grasp the big, let go of the small; regularize and normalize.
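
For concreteness, here is a minimal sketch of plain ISTA on the classic sparse-coding objective min_z 1/2 ||x - Dz||^2 + lambda ||z||_1; this is the generic textbook algorithm, and CRATE's actual ISTA block differs in its details:

    # Plain ISTA: a gradient step on the reconstruction term, then
    # soft-thresholding, which drives small coefficients to exactly zero.
    import torch

    def soft_threshold(z: torch.Tensor, tau) -> torch.Tensor:
        return torch.sign(z) * torch.clamp(z.abs() - tau, min=0.0)

    def ista(x: torch.Tensor, D: torch.Tensor, lam: float = 0.1, steps: int = 50):
        step = 1.0 / torch.linalg.matrix_norm(D, ord=2) ** 2   # 1 / Lipschitz constant
        z = torch.zeros(D.shape[1])
        for _ in range(steps):
            grad = D.T @ (D @ z - x)                           # (a) gradient-descent step
            z = soft_threshold(z - step * grad, lam * step)    # (b) sparsifying step
        return z

    D = torch.randn(32, 64)                                    # overcomplete dictionary
    x = torch.randn(32)
    z = ista(x, D)
    print(int((z != 0).sum()), "nonzero coefficients out of", z.numel())

The zeroing is the point: most coefficients end up exactly zero, so the few that survive can be read as an explanation of the input, which is the interpretability described above.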

What Professor Ma is saying is that, with these two modifications, the model, under supervised learning (and later they applied the recipe successfully to self-supervised learning as well), naturally acquires the interpretability that comes with this human-like structuring and sparsification.

Take image recognition: the study finds that certain attention heads correspond to different parts of an animal. More strikingly, the correspondence holds across different animals and even across animal categories. For instance, an attention head focused on "heads" keeps attending to the head region no matter the species. Such consistency suggests that CRATE has learned general visual feature representations that transfer across categories.

But researchers working on LLM interpretability discovered long ago that toward the end of MLP-based networks, structured parts (heads, feet, and so on) are likewise captured by attention; otherwise the generalization (or compression) ability that LLMs display could not be explained. The hard part is the early layers, where attention is rather mixed, and in the mainstream MLP setting no one can tell what is being attended to; vaguely, it seems to be relations among basic elements such as pixels (points) and lines.

The basic idea of explainable AI is always the same: take the tangled, black-box paths that a multi-layer network cobbles together while fitting the data, and remold them into "structured" paths by imposing constraints, pruning, and zeroing-out.

Who would not want models to be explainable? So these sparse, streamlined approaches must face enormous challenges, since so far they have been unable to compete with the black-box brute-fit methods.

The source of Professor Ma's confidence is that over the past six months to a year he has used this explainable white-box method to train models whose results rival conventional transformers. At the Shenzhen conference he said that he had long believed this was the right path, but before obtaining results he had to remain cautious. Now he feels that what his multinational team has achieved with this approach, across the board, satisfies him enough to declare to the world that he has found the breakthrough point in theory and practice, the right way to turn the transformer into a white box, one that may trigger a paradigm-shifting breakthrough in deep learning. It excites him, and it shows in his boldness. He is no longer content with a professor's theoretical contributions; he feels it is time to step into the arena himself. He has founded a company to push this forward on the engineering side. His keynote at the Cyzone (创业邦) event in Shenzhen was, by his own account, the first time he announced to the public, under the banner of this new company, this directional mega-project: challenging the mainstream with an engineering implementation.

Still, let me note, from years of experience and intuition, where the challenge (or the doubt) lies: human interpretability rests on a very small, highly condensed finite set. In terms of symbolic features, a feature system beyond the order of a thousand is already beyond human comprehension. For all our standing as the chosen ones, our human "tofu brains" are really quite limited. On the other side, the number of parameters in a transformer, and of Q/K/V matrices across its attention heads, is simply not comparable; the gap is heaven and earth. Reducing the large to the small across a scale gap like that feels inconceivable.

KAN for Science succeeded because its target is extremely narrow: certain known analytic formulas in science, or unknown but latent formulas confined to a handful of parameters. With targets like that to prune toward, plus human intervention or feedback from scientists along the way, KAN claims to have achieved interpretability.

Anyway, Professor Ma seems to have it all mapped out; let us sit back and watch his magic (or sleight of hand).

Related Links:

What did Ilya see? -- secret behind success of LLMs