Professor Ma Claims to Have Fully Unveiled the Mysteries of Neural Networks

Professor Yi Ma’s white-box transformer paper is available here.

Professor Ma is a prominent figure, renowned for his distinctive style and leadership in the field. His name is widely recognized and respected. Of particular interest recently are his critiques of mainstream large models and the bold claims he has made about his own work (see his post in Chinese below).

Recently, at a conference in Shenzhen (where I also gave a talk), Professor Ma sharply criticized mainstream large models, Ilya, and Kolmogorov complexity theory, dismissing them as being at the level of high school students and claiming that they lack a true understanding of theoretical concepts. He asserted that he has achieved breakthroughs in both theory and practice, particularly with the white-box Transformer developed by his team. According to him, this model not only demystifies the complexity of large models but also offers an engineering-feasible alternative.

When someone speaks with such confidence, it usually signals genuine expertise and a commanding presence. Just as Yann LeCun in the U.S. has criticized GPT as being inferior to a dog and called it a dead end, proposing his world model as an alternative, China has Professor Ma. Their critiques balance the global discourse and make the field feel less exclusionary. There is indeed hope that their work might address the "slow thinking" and "interpretability" shortcomings of current mainstream large models and contribute to the overall advancement of AI. Professor Ma’s academic and practical work deserves close study, though we may have to wait for time and peer review to fully test and validate the findings.

At the Shenzhen conference, after delivering his talk and sharp critiques, Professor Ma left immediately, likely due to his busy schedule.

The paper is over 100 pages long and is said to be released in a few days. Based on the current outline, the key points are as follows:

Overall, CRATE is similar to a transformer, with two differences:

- In each attention head, the Q, K, and V weight matrices are tied, i.e., set to be equal.
- The nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP) but rather a more structured operator (ISTA) with sparse outputs.
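
The first difference can be made concrete with a small sketch. The code below is only an illustration of what "tied" Q, K, and V weights could look like in a single attention head; the function name `tied_attention_head`, the shapes, and the scaling are assumptions made for this example and are not taken from the CRATE paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def tied_attention_head(X, W):
    """Single attention head whose Q, K and V share one projection W.

    X: (n_tokens, d_model) token representations.
    W: (d_model, d_head) shared projection (hypothetical shape).
    Returns the head output and the attention matrix.
    """
    P = X @ W                                # Q = K = V = P (the tying)
    scores = P @ P.T / np.sqrt(W.shape[1])   # scaled dot-product scores
    attn = softmax(scores, axis=-1)          # each row sums to 1
    return attn @ P, attn

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # 4 tokens, model width 8
W = rng.normal(size=(8, 3))   # one shared projection instead of three
out, attn = tied_attention_head(X, W)
print(out.shape, attn.sum(axis=1))
```

A standard head would use three separate matrices for queries, keys, and values; here one matrix plays all three roles, so the head has a third of the projection parameters.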

Let's examine ISTA (Iterative Soft-Thresholding Algorithm), a widely used algorithm for solving sparse optimization problems in machine learning. In the CRATE architecture, ISTA replaces the traditional MLP in Transformers. Not long ago, KAN also introduced innovations aimed at replacing the MLP. Both approaches amount to surgery on the Transformer architecture.

In my understanding, ISTA and KAN (for Science/Physics) share a common goal: through regularization or pruning, they ultimately fit a sparse path, thus achieving interpretability.

How it works

ISTA iteratively approaches the optimal solution of a problem. Each iteration involves two steps: a) a gradient descent step, which aligns with mainstream methods; and b) a soft-thresholding operation. This operation is added to balance two objectives:

a) Maximizing model accuracy;
b) Achieving model sparsity, i.e., simplicity (as overly complex models are difficult for humans to interpret).

The soft-thresholding operation encourages internal elements to become zero, resulting in sparse outputs and increased interpretability. The weight-tied attention mechanism, combined with ISTA, promotes a deeper understanding of the input data structure, resembling a human-like structured analysis process that prioritizes key elements while regularizing the data.
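
The two steps can be sketched in a few lines. This is a generic textbook ISTA for the lasso problem (minimize 0.5·||Ax − y||² + λ·||x||₁), not the layer used inside CRATE; the problem sizes, the value of `lam`, and the helper names are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    # Shrink every entry toward zero by t; entries with |v| <= t become zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_objective(A, y, x, lam):
    # Accuracy term plus sparsity penalty: the two objectives being balanced.
    return 0.5 * np.sum((A @ x - y) ** 2) + lam * np.sum(np.abs(x))

def ista(A, y, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)         # a) gradient step on the accuracy term
        x = soft_threshold(x - grad / L, lam / L)   # b) soft-thresholding
    return x

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[[2, 7, 15]] = [1.5, -2.0, 1.0]    # only 3 of 20 entries are active
y = A @ x_true
x_hat = ista(A, y, lam=0.5)
print(np.round(x_hat, 2))
```

The threshold `lam / L` is what trades accuracy against sparsity: raising `lam` zeroes out more entries at the cost of a looser fit.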

Professor Ma claims that these two modifications naturally lead the model to learn, during supervised learning, the interpretability associated with human-like structuring and sparsity (and, he claims, the approach was later applied successfully to self-supervised learning as well).

For example, in image recognition, it was observed that certain attention heads correspond to different parts of animals. What's more remarkable is that this correspondence remains consistent across different animals and even different categories of animals. For instance, an attention head focused on the "head" consistently pays attention to the head area when processing different kinds of animals. This consistency suggests that CRATE has learned a general representation of visual features across categories.

However, researchers studying LLM interpretability discovered long ago that in the late layers of the multi-layer network, various structured components (such as heads and feet) are also captured by attention mechanisms. Without this, it would be difficult to explain the generalization (or compression) capabilities exhibited by LLMs. The challenge lies in the early layers of the network, where attention is more mixed, and mainstream researchers struggle to clarify what the attention heads are focusing on. The heads seem to attend vaguely to the relationships between basic elements such as pixels/dots and lines.

The core idea behind explainable AI is consistent: transforming the tangled, black-box, multi-layer network's internal data fitting paths into structured paths that are enabled with various constraints and pruning, leading to a sparse representation.

Who wouldn’t want a model to be interpretable? However, achieving sparsity and simplicity is extremely challenging, which is why, so far, these approaches have struggled to compete with the black-box methods that involve randomness.

Professor Ma’s confidence stems from the fact that, over the past six months to a year, he has begun training models with the explainable white-box methods described above and has achieved results comparable to traditional transformers. At the Shenzhen conference, he said that although he had always been confident this was the correct approach, he remained cautious until the results were in. Now he is satisfied enough with his cross-national team’s achievements to announce to the world that he has found a breakthrough in both theory and practice: the correct method for white-boxing transformers, which could lead to a paradigm shift in deep learning. This has made him both excited and confident. He is therefore no longer content with academic achievements alone and feels compelled to take action in industry as well. Professor Ma has recently founded a company to advance this work at the engineering level. At Shenzhen, he announced a project of directional significance that challenges the mainstream, presented for the first time under the banner of his new company.

However, based on my years of NLP experience and intuition, I must point out a challenge (or potential issue): human interpretability rests on a highly simplified, finite set. In the case of symbolic features, a feature system with more than a few thousand elements becomes incomprehensible to humans. The number of parameters in transformers, and the number of Q, K, and V matrices across attention heads, are on a completely different scale. Reducing complexity of that scale to something humans can follow seems almost unimaginable.

KAN for Science succeeded because their target was extremely narrow—certain existing symbolic formulas in physics or potential formulas limited to a few parameters. With such a goal, pruning, along with scientist intervention or feedback, allowed KAN to claim interpretability.

Regardless, Professor Ma seems confident, so we will watch how his methods and results evolve and whether or not they come to be accepted.



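  1. The emergence of large models was like crossing the mountain of language: it unified these separate tracks and established a universal language capability akin to the biblical Tower of Babel.
  2. Historically, technological innovation has tended to follow a path known as the "technology adoption curve": at first, people flock to a major innovation or breakthrough, but when it runs into bottlenecks in commercialization, profitability, and practical application, a period of decline follows.
  3. In the era of general-purpose AI, a single model can handle all kinds of tasks, which squeezes the room for innovation in many niche areas. Previously, every niche track had a chance to produce a super app, but that possibility is now greatly reduced.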



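This topic is actually quite hard, because AI and large models are all the rage, yet the ones that truly reach commercial deployment are as rare as stars at dawn. As the saying goes, "one day in AI is a year in the human world": large models are developing quickly, but in terms of the overall trend, AI has in fact entered a downturn. Historically, technological innovation has tended to follow a path known as the "technology adoption curve": at first, people flock to a major innovation or breakthrough, but when it runs into bottlenecks in commercialization, profitability, and practical application, a period of decline follows. We are now in that decline, and we have not yet hit bottom.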





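With the rise of general-purpose large models such as ChatGPT, Jasper's advantage gradually disappeared and it began to decline. ChatGPT is not merely a super app; through human-machine dialogue, it has effectively become a "super super-app" that transcends traditional boundaries. General-purpose large models can now handle all kinds of languages and knowledge, and even multimodal content such as speech, music, images, and video. This breadth of capability has given them a dominant position in many areas, squeezing the living space of the related tracks.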




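In NLP, we used to have multiple specializations such as machine translation, dialogue systems, and question answering, and even niche technologies like word segmentation. But the emergence of large models was like crossing the mountain of language: it unified these separate tracks and established a universal language capability akin to the biblical Tower of Babel. Large models have completely changed the landscape of the NLP industry. In actual application, however, we have found deployment much harder than I imagined a little over a year ago. For example, NLP-oriented applications (such as various copywriting or translation co-pilots) have already been taken care of by the leading large models, so startups in this direction have been killed off.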

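Now everyone is waiting for LLM-native super apps. Expectations are high and competition within the industry is fierce, but so far the only products that have truly been deployed at scale are general-purpose toC applications such as ChatGPT, Doubao (豆包), and ERNIE Bot (文心一言).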


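At present, industry insiders and outsiders, investors and entrepreneurs alike broadly recognize the technical potential of large models; this remains the basic consensus. But finding market fit and achieving deployment at scale may take at least two to three years of exploration: we are now in a period of difficult labor for deployment, which is also a period of breakthrough.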

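Although large models are powerful, they have some serious shortcomings. First, information accuracy: having learned from vast amounts of information, a large model can err when recalling details it has memorized imprecisely, producing inaccurate output, the so-called "hallucination". Second, controllability: unlike earlier symbolic-logic AI, a large model contains tens or hundreds of billions of parameters and operates like a giant black box, making precise debugging and control difficult. Third, safety: large models may carry safety risks, so public release calls for caution. Fourth, compute cost: although compute costs are expected to fall as technology advances, applications built on large models still cost far more in compute than many previous-generation applications whose marginal cost approached zero. Inference also often runs into high-concurrency bottlenecks. When reaching for the mass toC market, the huge inference cost directly affects profitability.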






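We developed our own large model, named "序列猴子" (Sequence Monkey), and on top of it we launched multimodal AIGC products covering digital humans, voice-over, and one-click short-video generation. We have also successfully launched an overseas product, "DupDub". "魔音工坊" (Moyin Workshop) is our market-leading product; on the Douyin platform in particular, about 70% of voice content uses our technology.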

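Our target customers include content creators (ToPC, to professional consumer) and small to medium-sized businesses (ToSMB, to small medium businesses). Content creators are willing to pay for tools that make their work easier, and that is exactly what we provide. For ToB customers, we focus on offering relatively standardized solutions to small and medium-sized businesses, because the customization needs of large customers are complex and hard to handle. We currently have 860,000 paying users, which shows that our services have been successfully deployed and accepted by the market. Below are some demonstrations of our products.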



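1. 65% of VC firms will go out of business in the next few years.

2. With outstanding returns in the 2010s, VC was on a win streak.

3. In 2021, 1577 different VC firms raised a total of $183 billion.

4. But at the same time, the cost of launching a startup has been falling. Widely available tools, a global workforce, and easy (online) distribution mean it has never been easier, or cheaper, to start a SaaS company.

5. The IPO window is closed: companies can't go public. So VCs aren't making money with big IPOs.

6. M&A isn't happening either (at least not at good prices for sellers). So VCs aren't making money by selling their companies.

7. "We have decided not to raise another fund." Translation: they probably can't. More firms will say that they are "no longer investing"; partners are "deciding to take operating roles." Managing directors are retiring.

8. In 2023, 597 VC firms raised $81 billion. That's down 63% and 56% respectively (vs. 2021).

9. The VC party is over. Or at least this chapter is...

10. Over 50% of existing firms won't survive. That means if you're a startup CEO or operator raising money in this environment, you need to understand that the game has changed.

11. Don't buy the stories of some founder who raised $30M with just $200k ARR and a good deck. The era of VCs bailing out bad businesses with huge checks is over.

12. The best time to raise money is when you don't need it. Burrow down deep to survive the winter.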

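The above is a description of the Silicon Valley VC downturn that my "old boss" Jonathan posted on LinkedIn the day before yesterday. Very clear-headed.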

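It may sound strange that venture capital turned downward so soon after large models took off. The reason is that the technology adoption curve itself is now running into challenges in real-world deployment and is in an overall downturn, in both China and the US.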


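This boss of mine has founded four startups in total. The first, started when he was fresh on the scene, was Netbase, and we got along very well. His current, fourth venture is a SaaS startup that is solid and has good momentum, with a chance of becoming a Silicon Valley unicorn. He has also become much more seasoned.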



Jonathan Spier

Last Friday, I had dinner with a famous VC investor who told me 65% of VC’s will go out of business in the next few years. I believe him. Here’s what that means for startup leaders waiting on their Series A/B/C:

Those of us in startups tend to think VCs are at the top of the food chain.

They aren’t.

VCs are businesses too.

They raise money from THEIR investors (aka LPs).

And their job is to make a return for those LPs.

With outstanding returns in the 2010s, VC was on a win streak.

Many more funds were born.

And existing funds got much larger.

In 2021, 1577 different VC firms raised a total of $183 billion.

But at the same time, costs to launch a startup have gotten cheaper.

Widely available tools, global workforce, and easy (online) distribution mean it’s never been easier, or cheaper, to start a SaaS company.

So how are VCs supposed to deploy all that money they’ve raised?

They can’t.

There is too much money chasing too few deals.

Make no mistake, for VCs, it’s a fatal mix.

The IPO window is closed – companies can’t go public.

So VCs aren’t making money with big IPOs.

M&A isn’t happening (at least not at good prices for sellers).

So VCs aren’t making money by selling their companies.

If VCs aren’t making money, they can’t return capital to their LPs.

They are in trouble.

Of course, VCs rarely go out of business the way their companies might.

Reputations are at stake, so change happens quietly.

But it’s the same result.

It’s already happening.

Listen carefully, and you’ll hear VCs saying:

"We have decided not to raise another fund.”

Translation: they probably can’t.

More firms will say that they are “no longer investing”

Partners are “deciding to take operating roles.”

Managing Directors are retiring.

In 2023, 597 VC firms raised $81B.

That’s down 63% and 56% (vs. 2021).

The VC party is over.

Or at least this chapter is...

The select few at the top of the VC list will have their pick of deals.

The great business builders will choose their spots and continue to thrive.

I’ve been lucky to work with a few of those and am certain that their expertise and relationships will carry them through.

But over 50% of existing firms, won’t survive.

That means if you're a startup CEO or operator raising money in this environment, you need to understand the game has changed.

Don’t buy the stories of some founder that raised $30M with $200k ARR and a good deck.

The era of VCs bailing out bad businesses with huge checks is over.

Many of the VCs won’t even be around in a few years.

There is only one strategy that works in this economy.


Nail your ICP.

Delight your customers.

Get profitable to control your financial destiny.

The best time to raise money is when you don’t need it.

It’s a harsh economy out there.

Burrow down deep to survive the winter.


Tough markets make strong companies.




The Challenge of Character Consistency in Video Generation

Facial recognition in the vast world of AI is a specialized and challenging task, as human eyes are exceptionally sensitive to facial features. Because facial recognition is so specialized and sensitive, it presents a much greater challenge than traditional image recognition tasks, like identifying animal types. Consequently, this field achieved breakthroughs earlier than others: even before the advent of contemporary large models such as GPTs, deep neural network-based facial recognition, powered by extensive datasets of facial images, had already surpassed human visual capabilities and sensitivity. It became widely adopted, leading to the rise of unicorns in the pre-large model era.

Now, as we transition to universal video foundation models that aim to handle all objects in the world, whether it's Sora or Keling, maintaining facial consistency remains a significant challenge. The public has little access to Sora, but by examining similar leading visual models like Keling, we can perceive its limitations. Typically, after about half a minute, the generated faces start to diverge, no longer resembling the original person as closely. Achieving long-term consistency in character appearance is difficult without specialized processing and targeted optimization; relying solely on the current general video consistency training efforts is unlikely to overcome this bottleneck. This limitation has been repeatedly observed during various tests with publicly available visual products like Keling.

In some videos, were it not for the sensitivity of the human eye, the faces would be impossible to tell apart in terms of their purely physical visual properties. This highlights the sharpness of human perception: the ability to instantly discern the real from the fake.

For example, in the videos generated below featuring Maria (Xiao Ya, the favorite text2image IP I have generated and maintained in my AIGC videos), her fans can immediately tell which one is genuine, even though Maria herself may present different appearances at different ages and in various settings. There exists an abstract, invariant facial characteristic that equips humans with an eagle-eyed ability to recognize faces. The secret lies in the decoupling of these characteristics, which the previous generation of facial recognition models already did quite well. Compare and contrast:



It's important to note that maintaining character consistency is a critical benchmark for generating cinematic and user-configurable video works. Without crossing this threshold, the field will struggle to achieve large-scale applications in video art creation. The dream of a fully virtual Hollywood production line, without physical filming, will remain a fantasy.

Why is it so difficult for visual models to achieve consistent character representation over long periods using brute force?

Video is a high-dimensional modality, and for large models (at least in the foreseeable future) to handle video, they must employ significant "lossy compression". The compression ratio of visual tokens is high, which makes it more feasible to align training/generation across entire frames over time within the hidden space. The higher the compression ratio, the stronger the temporal consistency across entire frames. Autoregressive models (GPT-like) or DiT (Diffusion Transformers) can achieve this. In this way, videos that violate the physical laws of the real world can be kept effectively under control, reducing illogical hallucinations and making visual models appear to simulate the objective world (or so it seems). However, there is a trade-off: under lossy compression, the consistency of the overall frames and the consistency of the detailed features of specific physical objects within them cannot be optimized simultaneously.

The current approach typically involves adding a super-resolution (SR) module/model after achieving overall contour (blueprint) consistency, attempting to restore the discarded details. In general, super-resolution rendering has made significant progress, thanks to the accumulation of research in "deepfake"-like technology. However, deepfake technology essentially compensates for the losses incurred during compression, using the large visual foundation model's strength in imagination (or "hallucination") to fill in the details reasonably and non-deterministically. It depicts how the world "should" look, what it should be rather than what it is, often with amazingly detailed, lifelike results. But if the goal is to represent an individual entity, especially a finely detailed one like the human face of some IP, with individual features to which human perception is sensitive, it is inevitable that the generated image will drift over time. This is the crux of the problem. The solution should not rely on increasingly larger models and longer context windows with brute-force data and training. Brute force can only slow the deviation; it cannot eliminate the non-deterministic bias that accumulates during the SR process over long video sequences. We need to think outside the box and exclude the time dimension as a factor, using a step-by-step alignment method, which may break the deadlock. I’ll stop here; don't say you weren't warned.

The prerequisite for achieving this is the decoupling of facial features. Features that cannot be decoupled cannot be aligned step by step. They have to be decoupled, and they can be; otherwise, it would be impossible to explain how dozens of Hollywood actors can star in thousands of blockbuster films. The decoupling of faces from expressions and time still has room for improvement, but the technology has already matured considerably. What remains is a matter of how to use it properly in the generation process.
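
The drift argument can be illustrated with a toy simulation. Everything in it is an assumption made for illustration: the "identity feature" is just a random vector, the SR error is modeled as independent Gaussian noise per frame, and "step-by-step alignment" is interpreted as re-aligning every frame to a fixed, decoupled reference instead of to the previous frame. It does not model any real video system such as Sora or Keling.

```python
import numpy as np

def simulate_drift(n_frames=720, dim=64, noise=0.02, seed=0):
    """Compare identity error when frames chain on each other vs. on a fixed reference.

    Returns two arrays of per-frame distances from the reference identity.
    """
    rng = np.random.default_rng(seed)
    ref = rng.normal(size=dim)                        # hypothetical decoupled identity feature
    eps = rng.normal(scale=noise, size=(n_frames, dim))  # per-frame fill-in error
    chained = ref + np.cumsum(eps, axis=0)   # each frame inherits all earlier errors
    anchored = ref + eps                     # each frame is aligned to the reference
    chained_err = np.linalg.norm(chained - ref, axis=1)
    anchored_err = np.linalg.norm(anchored - ref, axis=1)
    return chained_err, anchored_err

chained_err, anchored_err = simulate_drift()
print("chained, last frame :", chained_err[-1])
print("anchored, worst case:", anchored_err.max())
```

In this toy model the chained error grows roughly with the square root of the number of frames, while the anchored error stays at the level of a single frame's noise, which is the intuition behind taking time out of the conditioning.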

Original Chinese post in



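Now that we have arrived at universal video foundation models that aim to handle every object in the world, whether Sora or Keling, facial consistency remains a huge challenge. Sora has no public beta, so little is known about it, but Keling, a leading Chinese visual model of similar design, lets us sense the limitations. Typically, after about half a minute, the face starts to diverge and no longer looks like the same person. Long-range consistency of character appearance is hard to achieve without specialized processing and targeted optimization; relying solely on today's general video consistency training is unlikely to break through the bottleneck. This limitation has shown up again and again in tests with publicly available products like Keling.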

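In some videos, were it not for the sensitivity of the human eye, the faces would be indistinguishable in terms of their physical visual properties. (No wonder the "looks-first club" has so many members among those seeking a partner: the human eye tolerates not a grain of sand, especially when looking for a mate, and ordinary people find it hard to overlook flaws in a prospective partner's face, or a face that simply does not spark, lol.) This shows how sharp the human eye is: it can instantly tell the genuine article from the impostor.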



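Why is it so hard for large visual models to achieve long-range character consistency by brute force? Because video is a very high-dimensional modality, and for large models (at least in the foreseeable future) to handle video, they must apply heavy lossy compression. The compression ratio of visual tokens is high, which makes it feasible to do alignment training/generation for long-range consistency of whole frames in the internal hidden space. The higher the compression ratio, the stronger the temporal consistency of the overall picture. Autoregressive models or DiT can achieve this. Only in this way can videos that violate the physical laws of the world be kept effectively under control, reducing hallucinations that defy common sense and making large visual models look like simulators of the objective world (or so it seems). But there is a contradiction here: under lossy compression, the consistency of the frames as a whole and the consistency of the detailed features of specific physical objects within them cannot be optimized at the same time.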

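The current approach is usually to add a super-resolution (SR) step after overall contour (blueprint) consistency has been achieved, attempting to reproduce the discarded details. Super-resolution rendering has, on the whole, become quite good thanks to the deepfake research accumulated over the past few years. But deepfake is essentially a remedy applied after lossy compression: all it can do is use the imagination (or hallucination) that large models excel at to fill in details reasonably and non-deterministically, depicting what the world should look like (what it should be, not what it is), and the result can be lifelike. But if the target is a specific object, especially a fine-grained one like a human face, with individual features (IP) to which the human eye is sensitive, it will inevitably drift during long generation; this is the crux of the problem. The solution should not count on brute-force big data with ever larger models and ever longer context windows, because brute force can only slow the deviation and cannot cure the non-deterministic bias that accumulates over time in the SR process for long videos. We need to think out of the box, exclude the time dimension as a condition, and use step-by-step alignment, which may resolve the bind. I'll leave it at that; don't say you weren't warned.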



