Professor Yi Ma’s white-box transformer paper is available here.
Professor Ma is a prominent figure, widely respected and renowned for his distinctive style and his leadership in the field. Of particular interest recently are his critiques of mainstream large models and the bold claims he has made about his own work (see his post in Chinese below).
Recently, at a conference in Shenzhen (which I attended with my own talk too), Professor Ma sharply criticized mainstream large models, Ilya, and Kolmogorov complexity theory, dismissing them as being on the level of high school students and claiming that they lack a true understanding of theoretical concepts. He asserted that he has achieved breakthroughs in both theory and practice, particularly with the white-box Transformer developed by his team. According to him, this model not only demystifies the complexity of large models but also offers an engineering-feasible alternative.
When someone speaks with such confidence, it usually indicates genuine expertise and a commanding presence. Just as Yann LeCun in the U.S. has criticized GPT as inferior to a dog and called it a dead end, proposing his world model as an alternative, China has Professor Ma. Their critiques balance the global discourse, making the field feel less monolithic. There is indeed hope that their work might address the "slow thinking" and "interpretability" shortcomings of current mainstream large models and contribute to the overall advancement of AI. Professor Ma's academic and practical work deserves close study, though we may have to wait for time and peer review to fully test and validate his findings.
At the Shenzhen conference, after delivering his talk and sharp critiques, Professor Ma left immediately, likely due to his busy schedule.
The paper is over 100 pages long and is said to be released in a few days. Based on the current outline, the key points are as follows:
Overall, CRATE is similar to a transformer, with two differences:
- In each attention head, the Q, K, and V weight matrices are tied, i.e., set to be equal.
- The nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP) but rather a more structured operator (ISTA) with sparse outputs.
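To make the first difference concrete, here is a minimal NumPy sketch of a single attention head whose query, key, and value projections share one weight matrix. This only illustrates the weight tying described above, not CRATE's actual implementation; the function name `tied_attention_head` and its interface are my own assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def tied_attention_head(X, W, tau=None):
    """One attention head with tied projections (illustrative sketch only).

    X : (n_tokens, d_model) input token representations
    W : (d_model, d_head)   a single projection matrix reused for Q, K, and V,
                            instead of three separate W_q, W_k, W_v
    """
    Q = X @ W          # queries
    K = X @ W          # keys   -- same projection as the queries
    V = X @ W          # values -- same projection again
    if tau is None:
        tau = np.sqrt(W.shape[1])                # standard scaling by sqrt(d_head)
    scores = softmax(Q @ K.T / tau, axis=-1)     # (n_tokens, n_tokens) attention weights
    return scores @ V                            # (n_tokens, d_head) head output
```

One immediate consequence of tying the three projections is that the head's projection parameters shrink to roughly one third of the usual count.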
Let's examine ISTA (Iterative Soft-Thresholding Algorithm), a widely used algorithm for solving sparse optimization problems in machine learning. In the CRATE architecture, ISTA replaces the traditional MLP in Transformers. Not long ago, KAN also introduced innovations aimed at replacing the MLP; both approaches amount to surgical modifications within the Transformer architecture.
In my understanding, ISTA and KAN (for Science/Physics) share a common goal: through regularization or pruning, they ultimately fit a sparse path, thus achieving interpretability.
How it works
ISTA iteratively approaches the optimal solution of a problem. Each iteration involves two steps: first, a gradient descent step, which aligns with mainstream methods; and second, a soft-thresholding operation. The thresholding step is added to balance two objectives:
a) Maximizing model accuracy;
b) Achieving model sparsity, i.e., simplicity (as overly complex models are difficult for humans to interpret).
The soft-thresholding operation encourages internal elements to become zero, resulting in sparse outputs and increased interpretability. The weight-tied attention mechanism, combined with ISTA, promotes a deeper understanding of the input data structure, resembling a human-like structured analysis process that prioritizes key elements while regularizing the data.
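To ground the two steps above, here is a minimal NumPy sketch of the classical ISTA iteration for a generic sparse least-squares (LASSO) problem. It illustrates the gradient-step-plus-soft-thresholding pattern, not the specific operator CRATE places after each attention layer; the function names and problem setup are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Shrink each element toward zero by t; anything within [-t, t] becomes exactly 0."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam, step, n_iters=100):
    """Classical ISTA for  min_x  0.5 * ||A x - b||^2 + lam * ||x||_1.

    A    : (m, n) design matrix
    b    : (m,)   observations
    lam  : weight of the L1 (sparsity) penalty
    step : gradient step size (e.g. 1 / largest eigenvalue of A^T A)
    """
    x = np.zeros(A.shape[1])
    for _ in range(n_iters):
        grad = A.T @ (A @ x - b)                          # step 1: gradient descent on the data-fit term
        x = soft_threshold(x - step * grad, step * lam)   # step 2: soft-thresholding enforces sparsity
    return x
```

The key effect is in `soft_threshold`: every coordinate whose magnitude falls below `step * lam` is set exactly to zero, which is where the sparsity, and hence the claimed interpretability, comes from.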
Professor Ma claims that these two modifications naturally lead the model to learn the interpretability associated with human-like structuring and sparsity during supervised learning (and, as later claimed, the approach was successfully applied to self-supervised learning as well).
For example, in image recognition, it was observed that certain attention heads correspond to different parts of animals. What's more remarkable is that this correspondence remains consistent across different animals and even different categories of animals. For instance, an attention head focused on the "head" consistently pays attention to the head area when processing different kinds of animals. This consistency suggests that CRATE has learned a general representation of visual features across categories.
However, those studying LLM interpretability have long observed that toward the end of the multi-layer network, various structured components (such as heads and feet) are also captured by attention mechanisms. Without this, it would be difficult to explain the generalization (or compression) capabilities exhibited by LLMs. The challenge lies in the early layers of the network, where attention is more mixed, and mainstream researchers struggle to clarify what the attention heads are focusing on. It seems that they are vaguely paying attention to the relationships between basic elements like pixels/dots and lines.
The core idea behind explainable AI is consistent: transforming the tangled, black-box data-fitting paths inside a multi-layer network into structured paths, shaped by various constraints and pruning, that lead to a sparse representation.
Who wouldn’t want a model to be interpretable? However, achieving sparsity and simplicity is extremely challenging, which is why, so far, these approaches have struggled to compete with the black-box methods that involve randomness.
Professor Ma’s confidence stems from the fact that, in the past six months to a year, he has begun to train models using the explainable white-box methods mentioned above, achieving results comparable to traditional transformers. At the Shenzhen conference, he mentioned that while he had always been confident this was the correct approach, he remained cautious until results were obtained. Now he believes that his cross-national team’s achievements with this approach justify announcing to the world that he has found a breakthrough in both theory and practice: the correct method for white-boxing transformers, one that could lead to a paradigm shift in deep learning. This has made him both excited and confident. Therefore, he is no longer content with academic theoretical achievements alone; he feels compelled to take action in industry as well. Professor Ma has recently founded a company to advance this work on an engineering level. At Shenzhen, he announced a directionally significant project challenging the mainstream, for the first time under the banner of his new company.
However, based on my years of NLP experience and intuition, I must point out a challenge (or potential issue): human interpretability is built on a highly simplified, finite set. If we consider symbolic features, a feature system with more than a few thousand elements already becomes incomprehensible to humans. On the other hand, the number of parameters in transformers, and the number of Q/K/V matrices across attention heads, are on a completely different scale. Reducing complexity at that scale to something humanly interpretable seems almost unimaginable.
KAN for Science succeeded because its target was extremely narrow: certain existing symbolic formulas in physics, or potential formulas limited to a few parameters. With such a goal, pruning, along with scientist intervention or feedback, allowed KAN to claim interpretability.
Regardless, Professor Ma seems confident, so we would like to observe how his methods and results evolve and whether they will be accepted.
Related Links:
What did Ilya see? -- secret behind success of LLMs