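Dr. Li Wei (立委) is a vice president at Wenwen (问问), focusing on large models and their applications. He was previously Chief Scientist at Netbase for 10 years, where he led the development of language understanding and application systems for 18 languages: robust, linear-speed systems that scale up to social media big data, with semantics grounded in public-opinion mining products, making the company a front-runner in industrial NLP deployment in the US. Before that he was VP of R&D at Cymfony for eight years, where he won first place in the first question answering evaluation (TREC-8 QA Track) and won 17 Small Business Innovation Research information extraction projects (PI for 17 SBIRs).
According to Guangmi, this so-called new self-play RL ecosystem trend is currently a consensus confined to a small circle of Silicon Valley technical heavyweights, a circle he put at roughly no more than 200 people. If that is accurate, a consensus or discussion limited to fewer than 200 people in Silicon Valley's technical core is still only an emerging direction; even management circles have not really "got it" or aligned on it yet.
My sense is that Guangmi has a bit of a "the duck is first to know when the spring river warms" and "never rest until the words startle" mindset (LOL). He deliberately emphasized, even exaggerated, this trend to wake up people in China, going so far as to say that if he were a large-model entrepreneur he would focus 200% of his resources on RL and bet on it, because this is the choice of future winners, and so on.
Objectively speaking, though, for most people this is neither practical nor actionable. At most it is addressed to the big Chinese tech players or the "six little dragons" of Chinese large-model startups, and even then it is said in vain. RL has never been easy to work with. Even the open-source benchmark, Meta's Llama 3, chose to work around even the most basic RLHF, so there is little point in urging Chinese large-model companies to bet everything on a vision with reinforcement learning at the core of a new ecosystem. Besides, in Silicon Valley that vision is at most an "undercurrent"; we may have to wait until OpenAI's Strawberry and the new version of Claude are released, possibly before the end of the year, to see the impact of this so-called new ecosystem more clearly.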
Professor Ma is a compelling speaker, and his talk is definitely worth listening to. His paper on whitebox transformer, over 100 pages long, has just been released (Yi Ma’s white-box transformer paper is available here). Unfortunately, I haven’t had the time to dig into it yet. We’ll have to wait until more people have accepted or verified it before delving deeper.
His current claims revolve around using an extremely sparse approach to force transparency in transformers, with results that are reportedly on par with BERT and GPT-2 in many benchmarks. However, this doesn’t mean that he will be able to catch up with GPT-3 or later models anytime soon. But to be fair, it’s not a level playing field—he’s an academic without the resources to compete with mainstream AI in an arms race. What he does believe, however, is that he has opened a door—a path toward explainable AI in large models.
Honestly, I’ve always had some doubts about Ilya’s theoretical explanation in terms of shortest-program compression (his Berkeley talk). From an ultimate theoretical perspective, where lossless compression is the ideal, the idea of continually scaling training, deepening, and lengthening learning makes sense, as it pushes the model toward becoming the smallest possible program for universal tasks. Ilya’s theory may hold up in this respect, at least in theory or as an end goal. But in any real-world scenario (e.g., under budgetary constraints, with methodological limitations), it’s hard to call a model derived purely through gradient descent the “shortest program,” because these models are gigantic beasts with "huge circuits" inside that, intuitively, should not be considered "short" or "small".
Models with hundreds of billions or even trillions of parameters are massive monstrosities, succeeding mainly through sheer size rather than through high regularity or elegance. Emphasizing how impressive their compression ratios are or how well they handle lossless compression may help explain the generalization and emergent abilities in sequence learning from a theoretical standpoint. But in practice, any model at a given time is far from being the “shortest program.”
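To make the "shortest program" framing a bit more concrete, here is a toy sketch of a two-part description length: the bits needed to store a model plus the bits needed to encode the data under that model (the ideal code length, i.e. minus log2 of each symbol's probability). The character-level unigram model, the sample text, and the 32-bits-per-parameter cost are all illustrative assumptions, not anything taken from Ilya's talk, and two-part accounting is only one of several ways to count.

```python
import math
from collections import Counter


def data_bits(text: str) -> float:
    """Ideal lossless code length of `text` under its own unigram model.

    An entropy coder driven by a predictive model needs about
    -log2 p(symbol) bits per symbol, so better prediction means
    better compression.
    """
    counts = Counter(text)
    total = len(text)
    return -sum(n * math.log2(n / total) for n in counts.values())


def model_bits(num_params: int, bits_per_param: int = 32) -> int:
    """Cost of storing the model itself (assumed fixed-width parameters)."""
    return num_params * bits_per_param


text = "abracadabra" * 50            # toy corpus (assumption)
raw = 8 * len(text)                  # 8 bits per character, uncompressed
coded = data_bits(text)              # bits for the data given the model
params = len(set(text))              # one probability per distinct character
total = model_bits(params) + coded   # two-part description length

print(f"raw={raw} bits, data|model={coded:.0f} bits, total={total:.0f} bits")
```

Under this accounting, the objection above concerns the first term: with hundreds of billions of parameters, the cost of storing the model itself is enormous, so a good compression ratio on the data alone does not make the whole thing a short program.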
This highlights an unavoidable distance between theory and practice. Ilya essentially hedged practice with theory along a future time axis, but our immediate reality doesn’t seem to align with this. It’s like a clumsy wrestler trying to brand himself as a sleek and slender fashion model: visually not a fit, to most of our eyes.
Instinctively, LLMs feel full of rote memorization with significant redundancy. Under real-world conditions, achieving extreme or lossless compression seems impossible.
On the other hand, Professor Ma’s sparsity approach almost feels “over the top.” Tying Q, K, and V to the same weights seems a bit crude and simplistic, yet the model still trained successfully. This shows that there’s a lot of flexibility within transformers: no matter what restrictions or pruning are applied, the model still finds a way through. In this sense, Professor Ma’s pursuit of the “shortest program” is more real and direct; it’s so short that even a human can interpret the process (hence the LLM explainability).
Yet the difference between these two extremes is still mind-boggling. On one side, we have gigantic models, and on the other, extreme simplicity to generate whitebox models. The fact that both approaches work is shocking.
Speaking of simplicity and explainability, here’s an interesting anecdote in AI history: Back in the day, during the era of symbolic MT, one of the earliest deployed systems (Siemens' METAL) for English-German translation used only eight symbolic features (such as human, animal, etc.). The rules were simple, transparent, and easy to explain. This shows that extreme simplicity and rule-based transparency can work in some rough application scenarios (where English and German are linguistically close, making translation easier).
Later, we MT-ers expanded the number of features to the thousands, trying to cover more of the long tail. Even then, it wasn’t perfect. At the time, we thought that with enough effort, we could match the quality of statistical MT. But now, we know that even if symbolic MT could catch up and match statistical MT, it’s still far from competing with neural MT.
So, could we have kept refining the features? It is not that we didn’t want to keep extending symbolic features (similar to one-hot encoding, but with the internal structure of an ontology/taxonomy) from thousands to tens of thousands. In reality, a few thousand features were already reaching the limit of human experts’ capacity to understand (AI explainability), manage, and debug. Expanding further would have been unmanageable.
Meanwhile, how many parameters do mainstream Transformer neural networks have? The space and granularity they represent are on a completely different scale. Given the vast difference in scale between the two, it’s natural to doubt any effort to bridge this gap for AI explainability. How could that even be possible?
That’s why I’ve always felt that explainability in large models is an elusive goal. But Professor Ma is telling the world that they’ve achieved it.
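We later expanded the 8 features to the order of thousands, which was what it took to clean up after the long tail, and even then we did not get it fully clean. At the time we felt that careful work might match the quality of statistical MT (I discussed this with Professor Dong Zhendong, and we both believed that symbolic methods could eventually beat statistical ones in translation; it would just take time and painstaking work). Now we know that even matching statistical MT would leave us far behind neural MT.
So why not make the features finer still? It is not that we did not want to push symbolic features (similar to one-hot encoding, but with a HowNet-like ontology/taxonomy structure manually imposed inside the features) from the order of thousands to the order of tens of thousands. The reality is that a few thousand is already close to the limit of an expert's brain; with a wider feature set, we could no longer keep it under control or debug it.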
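The next point of passion should be the to-B scenario, because the biggest expectations for applications will most likely be realized in verticals. To-C is fiercely competitive, but its roadmap, its momentum, and what can be done there, including AIGC, are already basically clear. To-B is still struggling in the mud; even its direction looks like flowers seen through fog, flickering in and out of view. Still, there are masters to be seen. Dr. Bai Shuo, for example, seems to be stroking his beard and smiling, seated on the lotus pond of financial trading, drawing on his accumulated to-B experience.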
I’ve now become the go-to expert for AIGC (AI-generated content) "custom services" among my old friends and classmates, just for fun. Below are nostalgic videos made from old photos that two of my classmates asked me to create.
Whenever I find the time, I’m more than happy to provide this kind of emotional value for friends and family because it’s truly satisfying to see their reactions of surprise.
The pianist is now a world-class piano master, frequently touring and performing in Europe, America, and China. These are precious old photos of him practicing and performing with our mutual friend, Brother Sun, in Philadelphia back in the early days.
Dr. Bai Shuo, a seasoned expert in NLP and a multi-talented musician, commented humorously: “Looks real for someone who pulls on the bow in Meditation as named, but the bowing and fingering are all wrong.”
Another old friend also left feedback noting that the visual model doesn’t understand music: “This needs improvement! It's obvious that the model was created by someone who doesn’t know how to play the violin or piano. The bowing and piano accompaniment are off. The first note has a two-and-a-half beat long tone, which should be played with a long bow. Additionally, the pianist’s right foot should never be raised or shaking like that; it should be on the sustain pedal.”
LOL
Even though the title of the piece, Meditation, was clearly specified in my prompt during generation, no model in the foreseeable future can truly align an understanding of music with the intricate details of bodily movement during performance. Perhaps this can be reserved as one of the ultimate challenges for large models aiming for AGI: theoretically, if enough aligned data of musical performance is available, then according to the compression theory of "joint training" it is possible to aim at perfect alignment across different modalities.
If simulating the objective world is the ultimate goal of visual models, then the current generation of visual models is at the level of “playing the piano to a cow” or “playing music to a tone-deaf audience”—completely unable to withstand scrutiny from musicians. For example, as someone with little musical knowledge, when I watch the nostalgic performance videos above, I wouldn’t notice the flaws as an expert would; instead, I find them vivid and emotionally engaging.
Of course, the standards of musicians might well be a "pseudo-demand" or a pseudo-goal (even if the visuals satisfy the picky “expert eye,” so what? Will it sell well?). It might not be worth the effort to pursue this. However, in theory, an ideal AGI should be capable of meeting these expert-level demands.
This is the challenge of musical performance alignment. Another challenge to Sora-like video generation models is character consistency in videos.
Achieving facial consistency in generative visual models is extremely difficult. Don’t expect this issue to be resolved by video generation models alone in the short term, especially not through autoregressive methods.
Human eyes are extremely discerning with regard to face recognition, especially when it comes to familiar faces of friends and family; you can immediately tell when a character's appearance is off. For example, while playing with old photos recently, I used the KeLing model (a top-notch video model in China) to generate a video of myself. At the 5-second mark, it still looked passable, but by 10 seconds, it no longer resembled me.
10 second footage:
In the second 10-second video, just a slight turn of the head, and it’s no longer me—it looks more like my brother. How can a model handle such fine details? Especially when the starting image for video generation is not even a straightforward frontal shot, making the character information incomplete—how could it not go off track?
While the videos I've made for friends and family using KeLing during its public testing phase have generally been met with passionate surprise and amazement, most of them suffer from this issue of character consistency, which is a regret.
The current one-click video generation products on the market (including our own YuanChuang Island recently launched) tend to mainly use anime or manga styles. This is to avoid user scrutiny since these styles lack 3D distinct individual characteristics. As long as there is consistency in attire, no gender mix-ups, with age and race alignment, most people will accept it. The current one-click videos are generally rough, with entertainment value primarily in the story rather than character portrayal akin to a Hollywood blockbuster. However, as this path progresses, it will inevitably encounter the challenge of maintaining the consistency of digital IP actors and their roles.
My colleague, Lu, mentioned, "the consistency issue might require cross-checking from multiple video angles, which more or less touches on the core issue of whether modeling is necessary."
Indeed, some form of cross-checking is required, not just monotonic correction over time/sequence; that is the key. There’s a need to decouple or separate the character's image from the storyline, rather than generating along a linear, one-way path. While sequence learning has indeed produced miracles in LLMs, sequence generation inherently has limitations, including random deviations over time. The problem is not as extreme as LeCun's criticism suggests, where he says GPT's error accumulation is a case of a tiny discrepancy leading to a significant miss; his claim isn't entirely accurate, because GPT's autoregressive operation also corrects and adjusts its course at every step given the context. Nevertheless, when it comes to fine-grained consistency, random deviations are almost impossible to handle, even with corrective mechanisms in place.
Hence decoupling, decoupling, decoupling! Decoupling can solve the problem. The world isn't limited to sequences. Beyond sequences and time, there is a constant abstraction (i.e., character image, or IP) that can be utilized. This is becoming increasingly clear. Take, for example, the digital IP character Maria (Xiao Ya) that I created using AIGC txt2img more than 2 years ago:
For anyone who isn’t a fan, my numerous Maria videos may well cause aesthetic fatigue; someone even called her “Dr. Li's fairy” (LOL). But there are indeed fans, and several of my old classmates are among them.
Why? Because she is an IP, and she has been decoupled.
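The one-click video products currently on the market (including our own YuanChuang Island, 元创岛) mostly use anime or other exaggerated styles in order to avoid user scrutiny: such figures lack distinct individuality and are not true individual IPs, so as long as the attire stays consistent, genders are not mixed up, and age and race do not clash, most people accept them. Today's one-click videos are generally coarse-grained, and their entertainment value lies more in the story than in the kind of character portrayal found in Hollywood blockbusters. But as this path moves upmarket, it cannot avoid the question of casting digital IP actors in roles and keeping them consistent.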
Overall, CRATE is similar to a transformer, with two differences:
in each attention head, the Q, K, and V weight matrices are weight-tied, i.e., set to be equal;
and the nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP), but rather a more structured operator (ISTA) with sparse outputs.
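Let us take a look at ISTA (Iterative Soft-Thresholding Algorithm), an algorithm for solving sparse optimization problems that is widely used in machine learning. In the CRATE architecture, ISTA replaces the multi-layer perceptron (MLP) of the traditional Transformer. Recall that the recent KAN innovation also aimed to be a drop-in replacement for the MLP. Both operate on the internals of the Transformer.
In my shallow understanding, ISTA follows the same line of thinking as KAN for Science/Physics: through some form of regularization or pruning, the model is ultimately fitted to sparse paths, which yields interpretability.
How it works: ISTA approaches the optimal solution step by step through iteration. Each iteration consists of two steps: a) a gradient descent step, the same as in mainstream methods; b) a soft-thresholding operation. This operation is added to strike a balance between two objectives: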
Professor Ma is a prominent figure, renowned for his distinctive style and leadership in the field. His name is widely recognized and respected. Of particular interest recently are his critiques of mainstream large models and the bold claims he has made about his own work (see his post in Chinese below).
Recently, at a conference in Shenzhen (which I attended with my own talk too), Professor Ma sharply criticized mainstream large models, Ilya, and Kolmogorov complexity theory, dismissing them as being on the level of high school students and claiming that they lack a true understanding of theoretical concepts. He asserted that he has achieved breakthroughs in both theory and practice, particularly with the white-box Transformer developed by his team. According to him, this model not only demystifies the complexity of large models but also offers an engineering-feasible alternative.
When someone speaks with such confidence, it usually indicates genuine expertise and a commanding presence. Just as Yann LeCun in the U.S. criticized GPT as being inferior to a dog and called it a dead end, proposing his world model as an alternative, China has Professor Ma. Their critiques balance the global discourse, making the world feel less excluding. There is indeed hope that their work might address the "slow thinking" and "interpretability" shortcomings of current mainstream large models and contribute to the overall advancement of AI. Professor Ma’s academic and practical work deserves close study, though we may have to wait for time and peer reviews to fully test and validate their findings.
At the Shenzhen conference, after delivering his talk and sharp critiques, Professor Ma left immediately, likely due to his busy schedule.
The paper is over 100 pages long and is said to be released in a few days. Based on the current outline, the key points are as follows:
Overall, CRATE is similar to a transformer, with two differences:
- In each attention head, the Q, K, and V weight matrices are tied, i.e., set to be equal.
- The nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP) but rather a more structured operator (ISTA) with sparse outputs.
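As a rough illustration of the first difference, here is a minimal sketch of a single attention head in which one shared projection plays the role of Q, K, and V. It shows only the weight-tying idea in a generic softmax attention head; the function names, toy values, and scaling are assumptions for illustration, and CRATE's actual attention operator is defined in the paper, not here.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]


def tied_attention_head(tokens, u):
    """One attention head where Q, K, and V all use the same projection `u`."""
    z = matmul(tokens, u)                       # shared projection: Q = K = V = z
    scale = math.sqrt(len(u[0]))
    z_t = [list(col) for col in zip(*z)]
    scores = [[v / scale for v in row] for row in matmul(z, z_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return scores, weights, matmul(weights, z)


# Three tokens with 4 features each, projected to 2 dimensions (toy values).
tokens = [[1.0, 0.0, 2.0, 1.0],
          [0.0, 1.0, 1.0, 0.0],
          [2.0, 1.0, 0.0, 1.0]]
u = [[0.5, -0.2],
     [0.1, 0.3],
     [-0.4, 0.6],
     [0.2, 0.1]]

scores, weights, out = tied_attention_head(tokens, u)
```

With a single tied matrix the head carries one projection instead of three, and the pre-softmax score matrix is symmetric, since token i scores token j exactly as j scores i.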
Let's examine ISTA (Iterative Soft-Thresholding Algorithm), a widely used algorithm for solving sparse optimization problems in machine learning. In his CRATE architecture, ISTA replaces the traditional MLP in Transformers. Not long ago, KAN also introduced innovations aimed at replacing the MLP; both approaches operate on the internals of the Transformer architecture.
In my understanding, ISTA and KAN (for Science/Physics) share a common goal: through regularization or pruning, they ultimately fit a sparse path, thus achieving interpretability.
How it works
ISTA iteratively approaches the optimal solution of a problem. Each iteration involves two steps: a) a gradient descent step, which aligns with mainstream methods; and b) a soft-thresholding operation. This operation is added to balance two objectives:
a) Maximizing model accuracy;
b) Achieving model sparsity, i.e., simplicity (as overly complex models are difficult for humans to interpret).
The soft-thresholding operation encourages internal elements to become zero, resulting in sparse outputs and increased interpretability. The weight-tied attention mechanism, combined with ISTA, promotes a deeper understanding of the input data structure, resembling a human-like structured analysis process that prioritizes key elements while regularizing the data.
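The two steps described above can be sketched on the textbook problem ISTA is usually introduced with, the lasso: minimize 0.5 * ||Ax - b||^2 + lam * ||x||_1. This is a generic illustration of the algorithm, not the ISTA block inside CRATE; the matrix, the target vector, `lam`, and `step` are toy assumptions.

```python
def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly 0."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(a, b, lam, step, iters=500):
    """Minimize 0.5 * ||a x - b||^2 + lam * ||x||_1 with ISTA."""
    n = len(a[0])
    x = [0.0] * n
    for _ in range(iters):
        # a) gradient step on the data-fit term: grad = a^T (a x - b)
        resid = [sum(aij * xj for aij, xj in zip(row, x)) - bi
                 for row, bi in zip(a, b)]
        grad = [sum(row[j] * r for row, r in zip(a, resid)) for j in range(n)]
        x = [xj - step * gj for xj, gj in zip(x, grad)]
        # b) soft-thresholding: the sparsity-inducing step
        x = [soft_threshold(xj, step * lam) for xj in x]
    return x


# With an identity matrix the solution is simply soft_threshold(b, lam),
# so the small middle entry is driven to exactly zero.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = ista(identity, [3.0, 0.2, -1.5], lam=0.5, step=1.0)
print(x)  # [2.5, 0.0, -1.0]
```

The threshold `step * lam` is where the balance between the two objectives lives: a larger `lam` zeroes out more coordinates (a sparser, simpler result) at the cost of a worse fit.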
Professor Ma claims that these two modifications naturally lead the model to learn the interpretability associated with human-like structuring and sparsity during supervised learning (and, as later claimed, in self-supervised learning too).
For example, in image recognition, it was observed that certain attention heads correspond to different parts of animals. What's more remarkable is that this correspondence remains consistent across different animals and even different categories of animals. For instance, an attention head focused on the "head" consistently pays attention to the head area when processing different kinds of animals. This consistency suggests that CRATE has learned a general representation of visual features across categories.
However, those studying LLM interpretability have long discovered that at the end of MLP networks, various structured components (such as heads and feet) are also captured by attention mechanisms. Without this, it would be difficult to explain the generalization (or compression) capabilities exhibited by LLMs. The challenge lies in the early stages of the MLP network, where attention is more mixed, and mainstream researchers struggle to clarify what the attention heads are focusing on. It seems that they are vaguely paying attention to the relationships between basic elements like pixels/dots and lines.
The core idea behind explainable AI is consistent: transforming the tangled, black-box, multi-layer network's internal data fitting paths into structured paths that are enabled with various constraints and pruning, leading to a sparse representation.
Who wouldn’t want a model to be interpretable? However, achieving sparsity and simplicity is extremely challenging, which is why, so far, these approaches have struggled to compete with the black-box methods that involve randomness.
Professor Ma’s confidence stems from the fact that, in the past six months to a year, he has begun to train models using the explainable white-box methods mentioned above, achieving results comparable to traditional transformers. At the Shenzhen conference, he mentioned that while he had always been confident that this was the correct approach, he remained cautious until results were obtained. Now he believes his cross-national team’s results are solid enough to announce to the world that he has found a breakthrough in both theory and practice: the correct method for white-boxing transformers, which could lead to a paradigm shift in deep learning. This has made him both excited and confident. He is therefore no longer content with academic theoretical achievements alone; he feels compelled to take action in industry as well. Professor Ma has recently founded a company to advance this work at the engineering level. In Shenzhen, he announced a directionally significant project challenging the mainstream, for the first time under the banner of his new company.
However, based on my years of NLP experience and intuition, I must point out a challenge (or potential issue): Human interpretability is built on a highly simplified finite set. If we consider symbolic features, a feature system with more than a few thousand elements becomes incomprehensible to humans. On the other hand, the number of parameters in transformers and the number of Q/K/V matrices across attention heads are on a completely different scale. Reducing complexity at that scale seems almost unimaginable.
KAN for Science succeeded because their target was extremely narrow—certain existing symbolic formulas in physics or potential formulas limited to a few parameters. With such a goal, pruning, along with scientist intervention or feedback, allowed KAN to claim interpretability.
Regardless, Professor Ma seems confident, so we would like to observe how his methods and results evolve and will, or will not, be accepted.
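Our target customers include content creators (ToPC, to professional consumer) and small to medium-sized businesses (ToSMB, to small medium businesses). Content creators are willing to pay for tools that make their work easier, and those are exactly the tools we provide. For ToB customers, we focus on offering relatively standardized solutions to small and medium-sized businesses, because the customization needs of large customers are complex and hard to handle. We currently have 860,000 paying users, which shows that our service has landed successfully and gained market recognition. Below are some showcases of our products.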
Last Friday, I had dinner with a famous VC investor who told me 65% of VCs will go out of business in the next few years. I believe him. Here’s what that means for startup leaders waiting on their Series A/B/C:
Those of us in startups tend to think VCs are at the top of the food chain.
They aren’t.
VCs are businesses too.
They raise money from THEIR investors (aka LPs).
And their job is to make a return for those LPs.
With outstanding returns in the 2010s, VC was on a win streak.
Many more funds were born.
And existing funds got much larger.
In 2021, 1577 different VC firms raised a total of $183 billion.
But at the same time, costs to launch a startup have gotten cheaper.
Widely available tools, global workforce, and easy (online) distribution mean it’s never been easier, or cheaper, to start a SaaS company.
So how are VCs supposed to deploy all that money they’ve raised?
They can’t.
There is too much money chasing too few deals.
Make no mistake, for VCs, it’s a fatal mix.
The IPO window is closed – companies can’t go public.
So VCs aren’t making money with big IPOs.
M&A isn’t happening (at least not at good prices for sellers).
So VCs aren’t making money by selling their companies.
If VCs aren’t making money, they can’t return capital to their LPs.
They are in trouble.
Of course, VCs rarely go out of business the way their companies might.
Reputations are at stake, so change happens quietly.
But it’s the same result.
It’s already happening.
Listen carefully, and you’ll hear VCs saying:
“We have decided not to raise another fund.”
Translation: they probably can’t.
More firms will say that they are “no longer investing.”
Partners are “deciding to take operating roles.”
Managing Directors are retiring.
In 2023, 597 VC firms raised $81B.
That’s down 62% in firm count and 56% in dollars raised (vs. 2021).
The VC party is over.
Or at least this chapter is...
The select few at the top of the VC list will have their pick of deals.
The great business builders will choose their spots and continue to thrive.
I’ve been lucky to work with a few of those and am certain that their expertise and relationships will carry them through.
But over 50% of existing firms won’t survive.
That means if you're a startup CEO or operator raising money in this environment, you need to understand the game has changed.
Don’t buy the stories of some founder that raised $30M with $200k ARR and a good deck.
The era of VCs bailing out bad businesses with huge checks is over.
Many of the VCs won’t even be around in a few years.
There is only one strategy that works in this economy.
Focus.
Nail your ICP.
Delight your customers.
Get profitable to control your financial destiny.
The best time to raise money is when you don’t need it.
Facial recognition in the vast world of AI is a specialized and challenging task, as human eyes are exceptionally sensitive to facial features. Because facial recognition is so specialized and sensitive, it presents a much greater challenge than traditional image recognition tasks, like identifying animal types. Consequently, this field achieved breakthroughs earlier than others: even before the advent of contemporary large models such as GPTs, deep neural network-based facial recognition, powered by extensive datasets of facial images, had already surpassed human visual capabilities and sensitivity. It became widely adopted, leading to the rise of unicorns in the pre-large model era.
Now, as we transition to universal video foundation models that aim to handle all objects in the world, whether it's Sora or Keling, maintaining facial consistency remains a significant challenge. The public has little access to Sora, but by examining similar leading visual models like Keling, we can perceive its limitations. Typically, after about half a minute, the generated faces start to diverge, no longer resembling the original person as closely. Achieving long-term consistency in character appearance is difficult without specialized processing and targeted optimization; relying solely on the current general video consistency training efforts is unlikely to overcome this bottleneck. This limitation has been repeatedly observed during various tests with publicly available visual products like Keling.
In some videos, the differences between faces would be nearly impossible to pin down in purely physical terms, were it not for the sensitivity of human eyes. This highlights the sharpness of human perception: the ability to instantly discern the real from the fake.
For example, in the videos generated below featuring Maria (Xiao Ya, the favorite text2image IP I have generated and maintained in my AIGC videos), her fans can immediately tell which one is genuine, even though Maria herself may present different appearances at different ages and in various settings. There exists an abstract, invariant facial characteristic that equips humans with an eagle-eyed ability to recognize faces. The secret lies in the decoupling of these characteristics, which the previous generation of facial recognition models already did quite well. Compare and contrast:
It's important to note that maintaining character consistency is a critical benchmark for generating cinematic and user-configurable video works. Without crossing this threshold, the field will struggle to achieve large-scale applications in video art creation. The dream of a fully virtual Hollywood production line, without physical filming, will remain a fantasy.
Why is it so difficult for visual models to achieve consistent character representation over long periods using brute force?
Video is a high-dimensional modality, and for large models (at least in the foreseeable future) to handle video, they must employ significant "lossy compression". The compression ratio of visual tokens is high, which makes it more feasible to align training/generation across entire frames over time within the hidden space. The higher the compression ratio, the stronger the temporal consistency across entire frames. Autoregressive models (GPT-like) or DiT (Diffusion Transformers) can achieve this. By doing so, videos that violate the physical laws of the real world can be effectively kept under control, reducing illogical hallucinations and making visual models appear to simulate the objective world (or so it seems). However, there is a trade-off: under lossy compression, the consistency of the overall frames and the consistency of the detailed features of specific physical objects within them cannot be optimized simultaneously.
The current approach typically involves adding a super-resolution (SR) module/model after achieving overall contour (blueprint) consistency, attempting to restore discarded details. In general, super-resolution rendering has made significant progress so far, thanks to the accumulation of research in "deepfake"-like technology. However, deepfake technology essentially compensates for the losses incurred during compression, using the large visual foundation model's strength in imagination (or "hallucination") to reasonably and non-deterministically fill in the details, depicting how the world "should" look (what it should be, rather than what it is), often with amazingly detailed, lifelike results. But if the goal is to represent an individual entity, especially a finely detailed one like the human face of some IP, with individual features sensitive to human perception, it's inevitable that the generated image will drift over time. This is the crux of the problem. The solution should not rely on increasingly larger models and longer context windows with brute-force data and training. Brute force can only slow the deviation; it cannot eliminate the non-deterministic bias that accumulates during the SR process over long video sequences. We need to think outside the box and exclude the time dimension as a factor, using a step-by-step alignment method, which may break the deadlock. I’ll stop here; don't say you weren't warned.
The prerequisite for achieving this is the decoupling of facial features. Features that cannot be decoupled cannot be aligned step by step. They have to, and can, be decoupled; otherwise, it would be impossible to explain how dozens of Hollywood actors can star in thousands of blockbuster films. The decoupling of faces from expressions and time still has room for improvement, but the technology has already matured considerably. It is a matter of how to properly use it in the process.
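The contrast between time-chained generation and alignment against a fixed, decoupled identity can be illustrated with a toy simulation. This is only a one-dimensional caricature of the argument, assuming independent Gaussian error per frame; the function name and all parameters are hypothetical, and it does not model any real video generator.

```python
import random


def mean_squared_drift(steps, trials, noise, anchored, seed=0):
    """Average squared deviation from the reference identity at the last frame.

    chained  (anchored=False): each frame is generated from the previous one,
                               so per-frame errors add up over time.
    anchored (anchored=True):  each frame is generated against the same fixed
                               reference, so only the last frame's error remains.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        reference = 0.0          # 1-D stand-in for an identity embedding
        frame = reference
        for _ in range(steps):
            base = reference if anchored else frame
            frame = base + rng.gauss(0.0, noise)
        total += (frame - reference) ** 2
    return total / trials


chained = mean_squared_drift(steps=100, trials=2000, noise=0.1, anchored=False)
fixed = mean_squared_drift(steps=100, trials=2000, noise=0.1, anchored=True)
print(f"chained drift ~ {chained:.3f}, anchored drift ~ {fixed:.3f}")
```

In the chained case the expected squared drift grows in proportion to the number of frames (a random walk), so reducing the per-frame noise only slows it down. In the anchored case it stays at the per-frame noise level no matter how long the video runs.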
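Why is it so hard for large visual models to achieve long-range character consistency by brute force? Because video is a very high-dimensional modality, and to handle it, large models (at least in the foreseeable future) must apply heavy lossy compression. Visual tokens are highly compressed, which is what makes aligned training/generation for long-range consistency of whole frames feasible in the internal hidden space. The higher the compression ratio, the stronger the temporal consistency of the overall picture. Autoregressive models or DiT can manage this. Only in this way can videos that violate the physical laws of the world be effectively controlled, reducing hallucinations that defy common sense and making large visual models look like simulators of the objective world (or seemingly so). But there is a conflict here: under lossy compression, the consistency of whole frames and the consistency of the detailed features of specific physical objects within them cannot be optimized at the same time.
The current solution is usually to add a super-resolution (SR) step once overall contour (blueprint) consistency has been achieved, trying to bring back the discarded details. SR rendering has, on the whole, become quite good thanks to the deep fake R&D accumulated over the past few years. But deep fake is essentially mending the fold after the sheep are lost under lossy compression: all it can do is use the imagination (or hallucination) that large models excel at to fill in details plausibly and non-deterministically, depicting what the world should look like (what it should be, not what it is), often vividly. If the target is a specific object, however, especially a fine-grained one like a human face with individual features (IP) that human eyes are sensitive to, it will inevitably drift over long generations. That is the crux of the problem. The solution should not count on brute-force big data with ever larger models and ever longer context windows, because brute force can only slow the deviation; it cannot cure the non-deterministic deviation that accumulates over time in the SR process of long videos. We need to think out of the box, remove the time dimension as a condition, and use step-by-step alignment, which may untie the knot. I will leave it at that; do not say you were not warned.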
Notes on the 92-page Paper Released with Meta's Super Large Model Llama 3.1
The super-large model Llama 3.1 is a milestone in the open-source large model community. As a leader, Meta's project involved over 500 participants/contributors (the authors of this paper are listed alphabetically in the appendix, similar to how the Central Committee members' names are displayed by stroke order). This original text is full of implementation details:
AIGC MV using Suno and keling (just for fun & cheering opensource milestone)
Notes:
Llama 3.1 doesn't use sparse techniques; it is a dense model, not a multi-expert (mixture-of-experts) system of the kind GPT-4 is reported to be.
405B parameters, 15.6T tokens: The number of tokens is roughly 40 times the number of parameters. Large-scale top models now emphasize data growth far exceeding parameter growth. Is this 15T tokens of data open source? (No, because even if they were willing to open source it, they wouldn't dare, as it could lead to countless data infringement lawsuits.)
Emphasizes three major levers for super-large foundation models: data, scale, and managing complexity.
Compared to the previous generation system Llama 2, computational power has increased 50 times (using 3.8 × 10^25 FLOPs).
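The figures in these notes can be sanity-checked with the common rule of thumb that training a dense Transformer costs about 6 FLOPs per parameter per token. The 6·N·D approximation is an assumption brought in here for illustration, not a formula quoted from the paper; the parameter and token counts are the ones given in the notes.

```python
def training_flops(params: float, tokens: float) -> float:
    """Rule-of-thumb training compute for a dense Transformer: ~6 * N * D."""
    return 6.0 * params * tokens


params = 405e9      # 405B parameters (from the notes)
tokens = 15.6e12    # 15.6T training tokens (from the notes)

ratio = tokens / params
flops = training_flops(params, tokens)
print(f"tokens per parameter: {ratio:.1f}")
print(f"estimated training compute: {flops:.2e} FLOPs")
```

The estimate comes out at about 38.5 tokens per parameter and about 3.8 × 10^25 FLOPs, consistent with the "40 times" ratio and the compute figure stated above.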
Complexity management: (1) Choosing a standard dense Transformer architecture instead of a mixture of experts model to maximize training stability. (2) Adopting a relatively simple post-training procedure: Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO). In other words, algorithm design and implementation tend towards simplification. Not using sparse techniques and multi-expert systems is for stability (but training challenges are greater, though they're not afraid). Using simpler, easier-to-implement DPO in the post-training phase instead of reinforcement learning is also for stability, as reinforcement learning has always been difficult to handle.
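To show why DPO is considered simpler to implement than reinforcement learning, here is a minimal sketch of the standard DPO loss for a single preference pair. It needs only sequence log-probabilities from the policy and from a frozen reference model, with no reward model or sampling loop. This is the generic published objective, not Meta's training code; the function name and the `beta=0.1` default are assumptions for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from sequence log-probabilities.

    loss = -log sigmoid(beta * [(pi_w - ref_w) - (pi_l - ref_l)])
    where w is the preferred response and l the rejected one.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form
    z = beta * margin
    return math.log1p(math.exp(-z)) if z > 0 else -z + math.log1p(math.exp(z))


# The policy has moved toward the chosen answer relative to the reference,
# so the loss falls below log(2), its value when policy and reference agree.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

Because the loss is an ordinary differentiable function of log-probabilities, it can be minimized with the same supervised training loop used for SFT, which is the stability argument made above.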
Benchmark tests cover: general, code, math, reasoning, tool use, long context, and multilingual. All performances are SOTA (state-of-the-art international level).
MMLU (Massive Multitask Language Understanding): 405B model achieves 87.3% (5-shot), 88.6% (0-shot, CoT).
Code generation (HumanEval): 405B model reaches 89.0%, close to GPT-4.
Math problems (GSM8K): 405B model achieves 96.8%, slightly higher than GPT-4.
Long context tasks: Excellent performance on some tasks, such as 95.2% on QuALITY.
Multilingual tasks (MGSM): 405B model reaches 91.6%, on par with top models. The 405B model is comparable or close to GPT-4 and Claude 3.5 Sonnet on many tasks. In short, open-source has caught up with closed-source.
Pre-training started with an 8k window, expanded to a 128k window in the later stages of pre-training (continued training).
After the foundation model pre-training was completed, multiple iterations of alignment "post-training" were performed. Including: (1) Aligning the model through human feedback, including multiple rounds of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO); (2) Integrating new capabilities, such as tool use; (3) Enhancing coding and reasoning abilities (specialized optimization); (4) Safety alignment.
Multimodal expansion (in progress, not yet released): Image, video, and speech capabilities. Including (1) Multimodal encoder pre-training: Image encoder trained on a large number of image-text pairs, aligning visual content and natural language in a unified space; (2) Speech self-training? (3) Experiments on video-text data alignment based on images.
Language model as the core, other modalities are added later (whether added to pre-training and/or post-training). When expanding to multimodal, the language model parameters remain unchanged, adapting to multimodality, allowing multimodal alignment in the same semantic space, closer to the language model. In other words, Llama follows a modular, step-by-step approach to gradually expand to multimodality. This is not the mainstream approach (mainly referring to Open AI and Google, at least in theory) advocating for "unified multimodal native data joint pre-training". The overall impression of Llama's algorithmic strategies is seeking stability rather than innovation or unification. It tends towards practicality, not caring about leading in algorithms. For example, the integration of speech first involves speech self-training (because speech is actually very similar to text, both being language systems), then alignment between speech and text (including Automatic Speech Recognition ASR and Text-to-Speech TTS). Integrating step by step into the cross-modal large model, this approach isn't cutting-edge in terms of advancement, but it's steady progress, beneficial for engineering development, integration, and iteration. It's unclear when they will be able to release multimodal capabilities online.
Data collection and cleaning are very complex, but the Llama team is meticulous, and this is the data-side guarantee behind its catching up with SOTA quality. To recap: (1) De-duplication: URL-level de-duplication; document-level de-duplication using the MinHash algorithm; line-level de-duplication, removing lines that appear more than 6 times in each bucket of 30M documents. (2) Filtering: removing low-quality documents, outliers, and excessively repetitive documents, using repetitive n-gram coverage to remove repetitive content (such as logs or error messages); using "dirty word" counts to filter adult websites not covered by blacklists; using token distribution KL divergence to filter documents with too many abnormal tokens. (3) Controlling data quality: using a fasttext classifier to identify text that might be cited by Wikipedia; using a Roberta-based classifier trained on Llama 2's predictions; using DistilRoberta to generate document quality scores. In addition, a fasttext language classifier identifies 176 languages, and two types of information are specially filtered out: adult content and personal identity/privacy information. Code and math web pages get their own fine-grained processing.
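The line-level (row-level) de-duplication step above can be sketched in a few lines. This is only a toy illustration, not Meta's pipeline: the function name `drop_frequent_lines` and the eight-document bucket are made up for the example, while the threshold of 6 comes from the notes above.

```python
from collections import Counter

def drop_frequent_lines(documents, max_count=6):
    """Remove lines that occur more than max_count times across one bucket of documents."""
    # Count every line over the whole bucket (a toy bucket here, not 30M documents).
    counts = Counter(line for doc in documents for line in doc.splitlines())
    cleaned = []
    for doc in documents:
        kept = [line for line in doc.splitlines() if counts[line] <= max_count]
        cleaned.append("\n".join(kept))
    return cleaned

# Hypothetical bucket: the same boilerplate line appears in all 8 documents.
bucket = [f"Accept cookies\nUnique sentence number {i}" for i in range(8)]
cleaned = drop_frequent_lines(bucket)
print(cleaned[0])
```

Boilerplate such as navigation menus and cookie banners repeats across many pages, so a frequency threshold removes it while leaving lines that are unique to a document.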
Data proportions: For example, downsampling over-represented data categories on the web (such as art and entertainment); data mixing ratios determined by a series of small model experiments, final data mix summary: About 50% of tokens correspond to general knowledge; 25% of tokens involve math and reasoning; 17% of tokens are code; 8% of tokens are multilingual content.
Model architecture: Apart from empirical detail adjustments, the basic architecture of the dense model remains unchanged, so it's data and scaling that create top models. 405B model specific parameters: 126 layers; token representation dimension 16,384; 128 attention heads; model size of 405B determined according to scaling law, about the computational optimal size under 3.8 × 10^25 FLOPs training budget.
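As a quick sanity check on these figures, the widely used rule of thumb that training a dense Transformer costs roughly 6 × parameters × tokens FLOPs (an assumption brought in here, not something stated in these notes) lands very close to the quoted budget. The variable names are illustrative.

```python
# Figures quoted in the notes above.
params = 405e9      # model parameters (N)
tokens = 15.6e12    # pre-training tokens (D)
d_model = 16_384    # token representation dimension
n_heads = 128       # attention heads

# Assumption: the common C ~ 6 * N * D estimate for dense Transformer training.
train_flops = 6 * params * tokens
head_dim = d_model // n_heads
tokens_per_param = tokens / params

print(f"estimated training FLOPs: {train_flops:.2e}")
print(f"per-head dimension: {head_dim}")
print(f"tokens per parameter: {tokens_per_param:.1f}")
```

The estimate comes out at about 3.8 × 10^25 FLOPs, and the token count is a little under 40 times the parameter count, which agrees with the numbers in the notes.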
Vocabulary: Using a vocabulary of 128K tokens. Combines 100K tokens from the tiktoken tokenizer and 28K additional multilingual tokens to better support non-English languages.
Computing resources, including GPU clusters of tens of thousands of cards, massive storage, and high-speed networks, represent huge resource investments. Specific data as follows:
Computing resources:
Used up to 16,000 H100 GPUs (a very powerful graphics processor).
Each GPU has 80GB of high-bandwidth memory, with a power of 700W.
These GPUs are installed on servers designed by Meta itself, with 8 GPUs and 2 CPUs per server.
Storage system:
Uses a distributed file system called Tectonic.
Provides 240PB (1PB=1000TB) of storage space, distributed across 7,500 servers.
Can process 2TB of continuous data per second, with a peak of 7TB/second.
A major challenge is handling the large amount of burst writes generated when processing model checkpoints (the process of saving model states).
Three-step pre-training process: a) Initial pre-training; b) Long context continued pre-training; c) Annealing with high-quality data sources.
Key pre-training strategies:
Gradually increase batch size and sequence length to balance stability and efficiency.
Dynamically adjust data mixing to specifically enhance certain capabilities.
Increase context length in stages to avoid early computational overhead.
Use annealing and high-quality data in the late stages of training to fine-tune model performance.
[LLM Summary]
Llama 3: Meta's Open-Source Large Language Model Breakthrough
1. Introduction and Overview
Meta has introduced Llama 3, a series of foundation language models designed to support various tasks including multilingual processing, programming, reasoning, and tool use. This model series includes versions with 8B, 70B, and 405B parameters, with the largest 405B parameter model adopting a dense Transformer architecture and supporting context windows of up to 128K tokens. The development of Llama 3 highlights three key factors: data quality and scale, computational scale, and complexity management.
2. Model Architecture and Pre-training Strategy
2.1 Model Architecture
Llama 3 retains the standard dense Transformer architecture rather than adopting a mixture of experts model. This choice aims to maximize training stability, reflecting Meta's emphasis on simplifying design to manage complexity. Key architectural improvements include:
- Using Grouped-Query Attention (GQA) mechanism, with 8 key-value heads per attention layer.
- Introducing attention masks to prevent self-attention between different documents in the same sequence.
- Expanding the vocabulary to 128K tokens, combining 100K tokens from the tiktoken tokenizer and 28K additional multilingual tokens.
- Increasing the RoPE base frequency hyperparameter to 500,000 to support longer contexts.
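To make the document-boundary attention mask from the list above concrete, here is a minimal sketch in plain Python. It assumes each position carries a document id and combines the usual causal rule with a same-document rule. The helper name `document_causal_mask` is hypothetical and is not taken from Meta's code.

```python
def document_causal_mask(doc_ids):
    """mask[i][j] is True when position i may attend to position j."""
    n = len(doc_ids)
    # Causal (j <= i) and restricted to tokens of the same document.
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)] for i in range(n)]

# Two documents packed into one training sequence: 3 tokens, then 2 tokens.
doc_ids = [0, 0, 0, 1, 1]
mask = document_causal_mask(doc_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block-diagonal: tokens of the second document never see the first, even though both sit in the same sequence.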
2.2 Pre-training Data Processing
Llama 3's pre-training data processing is extremely rigorous, including:
- Multi-level deduplication: URL-level, document-level (using MinHash algorithm), and line-level deduplication.
- Heuristic filtering: Removing low-quality documents, outliers, and excessively repetitive content.
- Model-based quality filtering: Using fasttext and Roberta-based classifiers for quality assessment.
- Special content processing: Developing specialized processing pipelines for code and mathematical content.
- Multilingual data processing: Using fasttext base language identification model, supporting 176 languages.
- Safety and privacy protection: Filtering website data containing personally identifiable information (PII) and unsafe content.
2.3 Pre-training Strategy
The pre-training process is divided into three main stages:
1. Initial pre-training: Conducted on about 15T multilingual tokens, far exceeding Llama 2's 1.8T tokens.
2. Long context pre-training: Gradually expanding from initial 8K tokens to 128K tokens context window.
3. Annealing phase: Fine-tuning with high-quality data in the final stage, using Polyak averaging to generate the final model.
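The checkpoint averaging mentioned in the annealing step can be illustrated as a plain element-wise mean over saved checkpoints. This is a sketch under simplifying assumptions: checkpoints are small dicts of float lists, all weighted equally. The summary does not say how many checkpoints were averaged or how they were weighted.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of several checkpoints, each a dict of name -> list of floats."""
    averaged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(values) / len(checkpoints) for values in columns]
    return averaged

# Hypothetical toy checkpoints saved late in training.
ckpts = [
    {"w": [1.0, 2.0]},
    {"w": [3.0, 4.0]},
    {"w": [5.0, 6.0]},
]
final = average_checkpoints(ckpts)
print(final)
```

Averaging smooths out the noise of individual late-training steps, which is why it is often paired with a decaying learning rate.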
Data mixing ratios are carefully designed:
- 50% general knowledge
- 25% mathematics and reasoning
- 17% code
- 8% multilingual content
3. Training Infrastructure and Challenges
3.1 Computational Resources
- Using up to 16K H100 GPUs, each equipped with 80GB HBM3 memory.
- Adopting a 4D parallel strategy: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.
3.2 Storage System
- Using the Tectonic distributed file system, providing 240PB of storage space.
- Supporting 2TB/s sustained throughput, with peak capacity of 7TB/s.
3.3 Network Optimization
- Developing the NCCLX communication library to improve network efficiency.
- Designing specific network topologies and load balancing strategies.
3.4 Training Challenges
- Experiencing 466 job interruptions during the 54-day training period, 419 of which were unexpected.
- Developing automated systems and specialized tools to handle hardware failures and network issues.
4. Post-training and Alignment
Llama 3 adopts a multi-round iterative post-training process, including:
1. Supervised Fine-Tuning (SFT)
2. Direct Preference Optimization (DPO)
3. Reward model training: Using human feedback data
4. Safety alignment: Implementing multiple rounds of safety measures
This process not only improves the model's instruction-following capabilities but also enhances safety and specific abilities (such as coding and reasoning).
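For readers unfamiliar with DPO, the per-pair loss in its standard published form can be sketched as follows. The inputs are sequence log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the model being trained and under a frozen reference model. The function name, the toy numbers, and `beta=0.1` are illustrative assumptions, not values taken from this summary.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair of log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    x = beta * margin
    # Numerically stable form of log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy identical to the reference: no preference learned yet.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen answer and lowered the rejected one.
improved = dpo_loss(-8.0, -15.0, -10.0, -12.0)
print(f"neutral: {neutral:.4f}, improved: {improved:.4f}")
```

Training needs only these log-probabilities, with no sampling loop or separate value network, which is part of why DPO counts as the simpler and more stable choice compared with classic reinforcement learning.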
5. Multimodal Expansion
Although not officially released yet, Llama 3 demonstrates promising multimodal capabilities:
- Image recognition: Training independent image encoders, integrated with the language model through adapters.
- Video understanding: Adding video adapters based on image adapters.
- Speech processing: Independently training speech encoders, then aligning with the language model.
This modular approach allows flexible addition of new modalities while maintaining core language capabilities.
These results indicate that Llama 3 405B is comparable or close to GPT-4 and Claude 3.5 Sonnet on multiple tasks, particularly excelling in document understanding and long context tasks.
7. Safety Considerations
Meta highly prioritizes safety in the development of Llama 3:
- Implementing strict safety measures in both pre-training and post-training stages.
- Developing the Llama Guard system-level safety solution.
- Conducting extensive red team testing and risk assessments.
8. Open Source Impact and Future Directions
Meta's decision to publicly release the entire Llama 3 series, including the 405B parameter version, may have far-reaching impacts on the AI research community:
- Promoting open, responsible AI development.
- Accelerating AGI research progress.
- Providing researchers with opportunities to examine and improve large-scale language models.
Future development directions may include:
- Further improving multimodal integration.
- Expanding context length.
- Continuously enhancing data quality and model scale.
9. Conclusion
The development of Llama 3 demonstrates Meta's deep experience and forward-thinking in large-scale AI systems. By focusing on three key levers - data quality, computational scale, and complexity management - Llama 3 has reached or approached the current state-of-the-art level on several key benchmarks. Its open-source release may drive a wave of innovation across the entire AI field, paving the way for responsible AGI development.
Llama 3: Meta's AI Chef's Latest "Divine Delicacy"
Attention, all tech enthusiasts! The Michelin three-star AI chef Meta has just unveiled a new dish! This divine delicacy named "Llama 3" is not only spicy enough but will elevate your taste buds to new heights!
1. The Chef's Secret Weapon
Imagine Llama 3 as a super nanny who speaks 8 languages, writes code, does math, and can be your personal assistant. She can handle a kindergarten full of rambunctious kids (8B version), manage a mid-sized company (70B version), or even govern a small country (405B version)! This 405B big sister can remember 128,000 "gossips" (oh no, I mean context) simultaneously, essentially a walking encyclopedia + supercomputer!
2. Ingredient Selection: Only the Freshest!
Llama 3's chefs are masters at picking ingredients:
They "fished" 15 trillion words from the internet, nearly 10 times more than the previous generation!
Half of these words are everyday life seasonings, a quarter are math problems and brain teasers, nearly a fifth are programmer spells, and the rest are dialects learned from world travels.
They even invented a super weed remover, filtering out all the online garbage, repetitive, and unhealthy stuff.
3. Cooking Process: Three-Step Stir-Fry Method
Step 1: "Slow Simmer" - Start with a regular stove (8K context) to cook it halfway. Step 2: "High Heat Stir-Fry" - Switch to a super stove (gradually increasing to 128K context), reducing the sauce to be thick and fragrant. Step 3: "Low Heat Finish" - Finally, a gentle simmer with the best ingredients, the legendary "annealing" (even the chefs don't know why it's called that), bringing the flavor to its peak!
4. Kitchen Equipment: Top-of-the-Line Luxury Version
16,000 super high-power induction cookers (H100 GPUs) firing simultaneously!
A refrigerator that could fit half the Pacific Ocean (240PB storage)!
A proprietary ingredient prep system faster than 5G (NCCLX communication library)!
Imagine all these stoves firing at once, making the kitchen feel like a sauna. But our chefs persevered through the heat, changing chef uniforms 466 times in 54 days to whip up this dish!
5. Training Method: Both Cute and Well-Mannered
Being a good cook isn't enough; you've got to have manners too! So our chefs began a long "training" process:
First came a round of "gentle education" (supervised fine-tuning)
Then the "carrot and stick" tactic (direct preference optimization)
Finally, they invited moral role models (safety alignment) for guidance
After all this fuss, Llama 3 not only cooks well but also knows how to please people, program, do math, and mind her manners - a true decathlon champion!
6. Special Side Dishes: Showcasing Multiple Talents
Don't think Llama 3 can only cook; she's a multi-talented "goddess":
Storytelling from images? Piece of cake!
Writing movie reviews? No problem!
Recognizing songs and even singing a bit? The karaoke queen!
Although these "talents" are still in practice, they already show the potential of Li Bai's "from black hair to snow white in a day"!
7. A True Powerhouse: Dazzling Test Scores
Llama 3 participated in a series of "Top Chef Competitions," with eye-popping scores:
College Entrance Exam (MMLU): 87.3 points (out of 100)
Programmer Interview (HumanEval): 89 points (out of 100)
Math Olympiad (GSM8K): 96.8 points (out of 100)
Long Novel Reading Comprehension (QuALITY): 95.2 points (out of 100)
Bring this report card home, and even a "Tiger Mom" would be grinning from ear to ear!
8. Safety First: AI's "Security Captain"
Meta's chefs know well the principle of "don't leave guns and ammo lying around." They've assigned Llama 3 a 24/7 bodyguard team (Llama Guard) to prevent her from accidentally saying or doing the wrong thing. They even arrange occasional "moral exams" to ensure she doesn't turn into a "Terminator."
9. Open Source Feast: Everyone Can Be a Master Chef!
The most impressive part is that Meta decided to make the recipe for this "divine delicacy" completely public! It's like a Michelin three-star restaurant putting their signature dish's recipe online. Now anyone who wants to can whip it up at home! This move not only shocked other master chefs but also made countless food lovers cheer with joy!
10. Future Outlook: Reaching New Heights
Meta's chefs aren't resting on their laurels; they're already pondering the next "divine delicacy":
Maybe a dancing Llama 4?
Or a painting Llama 5?
Who knows, one day we might see a Llama 6 composing symphonies!
In short, the AI world's "Michelin" journey has only just begun!
Epilogue
The birth of Llama 3 not only elevates Meta's status in the AI world but also brings a fresh breeze to the entire AI research community. This bowl of "Llama soup" is not only delicious but also brings unlimited imagination to everyone. What will the future of AI be like? Let's wait and see what flavor the next "divine delicacy" will be!
-- looking closely into his historical Berkeley talk
by Wei Li, Jia Gao
Introduction
When Ilya Sutskever left OpenAI and re-emerged with his new company, SSI (Safe Superintelligence Inc.), the move was both surprising and expected—he bypassed AGI and directly aimed at SSI (Safe Superintelligence). He confidently declared: Superintelligence is imminent, and establishing safe superintelligence (SSI) is the most important technological issue of our time.
Ilya, a legend in deep learning and AI and the former true soul of OpenAI, was at the center of the company's dramatic internal conflict over one issue: effective acceleration versus super alignment. Why was Ilya so steadfast about "super alignment" in this debate over AI values and strategic direction? Even after the storm settled, the outside world continued to speculate: what did Ilya see that compelled him to join the board in its decision to oust CEO Sam Altman? Ilya stayed out of public view until recently, when he left OpenAI, which led to the dissolution of his super alignment team, and founded his new company.
What did he see behind the push for "safe intelligence"?
Back on October 3, 2023, Ilya gave a talk at UC Berkeley titled "A Theory of Unsupervised Learning." Though obscure and known to few, it is destined to be one of the most significant moments in AI history. This talk was a theoretical reflection and summary by a top expert in deep learning on the GPT model he pioneered, now famous worldwide. Ilya revealed the core principles of large models and vividly described his obsession with, and excitement over, independently understanding the mechanisms of unsupervised learning. Despite the complexity, the talk was brilliant and enlightening.
Only recently did Leopold Aschenbrenner, a former member of his super alignment team, publish a 165-page essay, "Situational Awareness," which gave a first look at the shock and concern within OpenAI over the exponential evolution of GPT models. This partly answered the question of what Ilya saw, but Ilya himself remained silent until his official re-emergence not long ago.
Reflecting on his "confessional" talk at Berkeley, we might glimpse his "moment of enlightenment" when facing potential superintelligence and understand his original intent for safe intelligence. It was a rare deep sharing by Ilya, attempting to convey an essential message to the world. But did the world hear him?
1. Machine Learning: Supervised Learning and Unsupervised Learning
To accommodate readers with varying mathematical backgrounds, this blog aims to explain Ilya's historical presentation in an accessible language. Purely technical explanations can be skipped by non-technical readers without affecting the understanding of the presentation's main ideas.
Before diving in, let's review the basic concepts of machine learning. Machine learning is like having computers as students and humans as teachers. By providing computers with numerous "practice problems" and "answer keys," they slowly learn to solve problems. This is supervised learning. But can computers really learn from practice problems instead of merely memorizing them? Ilya assures us there's theoretical proof of this.
Imagine a sea of problems before you, each paired with a standard answer. This is the model's training data. Model training is like diligently solving these problems until most of them are correct, meaning low training error. But even an extensive problem set has its limits. When new problems arise, can the model still get them right? These new problems are the test data, akin to exams. Whether the model performs well depends on its test error rate.
Mathematics tells us that as long as the problem set is large enough, far exceeding the model's size, excellent performance on training problems (low training error) ensures good performance on test problems (low testing error). In other words, if the model trains well, it will do well in exams! This is the mathematical guarantee for supervised learning.
However, if the model merely memorizes without abstracting, then no matter how large its memory or how strong its "memory power," it lacks real adaptive learning ability (called "generalization ability"). Only when the model isn't too smart is it forced to extract the essence (called "compression") and learn real skills from the problem set.
This explains why the model size shouldn't be too large, to avoid giving the model too much room to cut corners. In short, Ilya wants to say that "big labeled data + low training error" is the winning formula for supervised learning, guaranteed by mathematics. This point has been confirmed both theoretically and practically. Since the deep learning revolution 12 years ago, countless successful cases have shown that as long as the training data is sufficient, neural networks can excel at all sorts of AI tasks, from recognizing cats and dogs to machine translation.
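One standard form of such a guarantee is the finite-class bound obtained from Hoeffding's inequality plus a union bound. The sketch below assumes losses in [0, 1] and a model class of 2^bits members, so it illustrates the kind of result meant here rather than reproducing the talk's slide. The function name and the sample sizes are made up for the example.

```python
import math

def gap_bound(model_bits, num_samples, delta=0.05):
    """Upper bound on (test error - training error), holding with probability
    at least 1 - delta for every model in a class of 2**model_bits models."""
    log_class_size = model_bits * math.log(2)
    return math.sqrt((log_class_size + math.log(1 / delta)) / (2 * num_samples))

small_data = gap_bound(model_bits=1_000, num_samples=10_000)
big_data = gap_bound(model_bits=1_000, num_samples=1_000_000)
print(f"10k samples: gap <= {small_data:.3f}")
print(f"1M samples:  gap <= {big_data:.3f}")
```

The bound shrinks as the data grows relative to the model's description length, which is the "problem set far exceeding the model's size" condition in numeric form.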
But what about unsupervised learning? Can computers learn intelligence from a problem set without standard answers? It sounds far-fetched, but Ilya is about to explain how he managed to seek a solid mathematical foundation for unsupervised learning as well.
2. Distribution Matching: A New Approach to Unsupervised Learning
Everyone knows that machine translation was a typical win of supervised learning; in fact, it was the only win among the various NLP tasks (such as dialogue, information extraction, sentiment analysis, question answering, document understanding, etc.) prior to the era of large language models. Why? Because we have a vast amount of historical bilingual data. It's like students having workbooks with English on the left and Chinese on the right: supervised learning thrives on this setup.
But what if the teacher suddenly stops providing aligned bilingual data and only gives you English books and unrelated Chinese books, leaving you to figure out how to align and learn automatic translation? That's the problem unsupervised learning needs to solve. Ilya says unsupervised learning can also handle various language machine translations (which we've seen today with large models—specialized translation software is no longer needed), and even any input-to-output transformation tasks. What's the catch?
Ilya discovered a new approach called distribution matching. Essentially, if the English and Chinese book collections are large enough, containing various sentence structures, their linguistic regularities will be learned "without supervision". For example, the context distribution of "I/me/my" in English should correspond to "我" in Chinese; adjectives near nouns in English with semantic compatibility should have a similar pattern in Chinese, etc. This provides the basic condition for potential language alignment.
Ilya points out that if two languages' native data is sufficiently rich, the input in one language can almost uniquely determine the equivalent translation in the other language. This principle applies not only to machine translation but also to tasks like speech recognition and image recognition.
Ilya independently discovered this approach in 2015, fascinated by the underlying mathematical principle—compression theory. If we can find a method that maximally compresses both English and Chinese data, this approach will capture the common patterns of the two languages, which form the basis of translation.
So, Ilya proposes that unsupervised learning is essentially about finding the optimal data compression method. This perspective not only sounds cool but also provides a mathematical explanation for the effectiveness of unsupervised learning. Although real-world tasks are not idealized, this principle gives unsupervised learning a solid theoretical foundation, making it as convincing as supervised learning.
Next, Ilya will delve deeper into the mathematical principles behind it. Although somewhat abstract, he promises it’s full of insights. We'll see how he uses the magic of compression to explain the mysteries of unsupervised learning.
3. Ilya’s Ultimate Theory: From Conditional Modeling to Joint Modeling
This is the final and most intriguing slide of Ilya's talk, worthy of thorough analysis and contemplation. The goal of unsupervised learning is often defined as "learning the internal structure of data." Ilya suggests understanding unsupervised learning from the perspective of data compression: a good unsupervised learning algorithm should maximally compress the data, representing its content in the simplest form. This introduces the concept of Kolmogorov complexity.
The Kolmogorov complexity of a data object is the length of the shortest computer program that can fully describe this object. You can imagine this shortest program as a "compressed package" containing all the information needed to reconstruct the original data. From this perspective, the goal of unsupervised learning is to find the optimal compressed representation of the data, which is the Kolmogorov complexity.
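Kolmogorov complexity itself cannot be computed, but any real compressor gives an upper bound on it (up to the fixed size of the decompressor). As a rough illustration, the sketch below uses Python's `zlib` as a crude stand-in: highly regular data shrinks to almost nothing, while pseudo-random bytes barely shrink at all. The data and helper name are made up for the example.

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (a computable proxy, not K itself)."""
    return len(zlib.compress(data, 9))

# 10,000 bytes with an obvious short description: "repeat 'ab' 5000 times".
regular = b"ab" * 5000
# 10,000 pseudo-random bytes, with no pattern zlib can exploit.
rng = random.Random(0)
noisy = bytes(rng.randrange(256) for _ in range(10_000))

print("regular:", compressed_size(regular), "bytes")
print("noisy:  ", compressed_size(noisy), "bytes")
```

The proxy has an obvious flaw: the "noisy" bytes come from a seeded generator and therefore have a very short true description that zlib cannot see. That gap between what a practical compressor finds and the true shortest program is exactly why the ideal is uncomputable.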
However, in practice, we often need to handle multiple related datasets. For instance, in machine translation, we have the source language dataset X and the target language dataset Y. We want to learn a model that can translate sentences from X to Y (or vice versa). Traditionally, this is viewed as a conditional probability problem: given X, what is the probability distribution of Y? Represented in terms of Kolmogorov complexity, this involves finding K(Y|X), the shortest description length of Y given X.
Ilya proposes a different approach. Instead of viewing X and Y as condition and result, like in supervised learning, he suggests viewing them as a whole and compressing them together within a massive model. Essentially, we seek the joint Kolmogorov complexity K(X,Y), the shortest program length that compresses both X and Y simultaneously. This approach must fully utilize the correlation between X and Y, using information in X to automatically align Y (or vice versa), much like how we use our native language knowledge to understand and remember foreign language expressions.
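The link between the joint and the conditional view can be made tangible with an ordinary compressor. Up to small additive terms, K(X,Y) is about K(X) + K(Y|X), so the extra cost of compressing Y after X approximates K(Y|X). The sketch below is a toy, with `zlib` as a crude stand-in for an ideal compressor and synthetic "sentences" as data. It compares a Y that reuses X's sentences in a different order with a Y that is unrelated to X.

```python
import random
import string
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes (zlib as a stand-in for an ideal compressor)."""
    return len(zlib.compress(data, 9))

def conditional_size(y: bytes, x: bytes) -> int:
    """Proxy for K(Y|X): extra bytes needed for Y once X has been compressed."""
    return c(x + y) - c(x)

rng = random.Random(0)

def random_sentences(n, length=40):
    return ["".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            for _ in range(n)]

x_sentences = random_sentences(50)
y_sentences = x_sentences[:]
rng.shuffle(y_sentences)            # same content as X, different order

x = "\n".join(x_sentences).encode()
y_related = "\n".join(y_sentences).encode()
y_unrelated = "\n".join(random_sentences(50)).encode()

print("Y alone:            ", c(y_related))
print("Y given related X:  ", conditional_size(y_related, x))
print("Y' given unrelated X:", conditional_size(y_unrelated, x))
```

When X and Y share structure, compressing them together makes Y nearly free. When they share nothing, joint compression offers no saving.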
Ilya believes this joint compression idea is the true power of unsupervised learning. Real-world data is often interconnected, with numerous deep common patterns and regularities. If unsupervised learning can discover and utilize these regularities, it can significantly enhance learning efficiency and generalization ability. This explains the remarkable performance of large language models like GPT across various tasks: through massive unsupervised pretraining, they learn the deep regularities of the training data, and these regularities are transferable across related datasets.
Although Kolmogorov complexity is theoretically uncomputable, Ilya believes we can approximate this process using deep neural networks (like GPT). Through optimization algorithms such as gradient descent, neural networks can find the optimal compressed representation in massive data, capturing the essence of the data and its alignment patterns, even if not strictly in terms of Kolmogorov complexity.
Thus, Ilya’s theory can be seen as a new paradigm for unsupervised learning, elevating traditional independent modeling (like separate models for English and Chinese) to a unified associative modeling approach. In this paradigm, the goal of unsupervised learning is no longer just compressing individual datasets but finding the connections between them. This cross-modality learning represents an advanced form of artificial general intelligence (AGI).
Now, let’s closely examine this final slide. In it, X represents dataset 1 and Y represents dataset 2. The key point is extracting every bit of information from X (or Y) to help predict Y (or X). This is what Ilya refers to when he says training X and Y together yields the effect that unsupervised learning of X helps accomplish the task of transforming X to Y.
The crucial idea is: K(Y|X) becomes K(X, Y).
Ilya transforms the universally applicable functional AI task of "input X conditions output Y" into an approximate solving problem by jointly training X and Y without modal segmentation. This joint training approach is effectively the current multimodal unified training, abbreviated as K(X, Y).
Ilya aims to strengthen the theoretical basis, emphasizing his surprising discovery that self-learning of X has a strong predictive effect on Y.
The essence of unsupervised self-learning is that the self-learning of X is to compress X, and the self-learning of Y is to compress Y. This is straightforward because self-learning involves only positive examples, without negative samples. Unsupervised self-learning lacks a specific task orientation; it learns language from language, images from images, music from music, and so on, continually abstracting various patterns from phenomena.
Ilya points out in the slide: conditioning on a dataset, not an example. The compression object is the dataset, not individual data points, which is crucial. This distinction separates superficial compression from content compression. Superficial compression is merely a mechanical process that does not produce intelligence. Only content compression can achieve artificial intelligence.
How do we understand the difference and connection between superficial lossless compression (e.g., digital music) and content lossless compression (e.g., Suno)? Compressing a specific song losslessly aims to ensure it can be restored to its original musical form (including noise and imperfections). This is traditional music compression, targeting an individual sample, e.g., a specific song. Compressing a collection of music, whether using GPT or Diffusion, targets a group of samples, resulting in a large model like Suno.
When individual objects turn into group objects, formal compression naturally transforms into content compression. This is because, although the group comprises individuals, compressing the group is like "painting" a portrait of the group, outlining its characteristics. It may resemble an individual, but it is not a specific individual in the original data; otherwise, it would not be a model but a memory repository.
This is understandable because the purpose of large model compression is to identify the characteristics and regularities of the dataset. The text generated by GPT-4 might seem familiar; the music generated by Suno might sound familiar; the videos generated by Sora might look familiar; the images generated by MJ might seem familiar. However, they are virtual individuals "restored" based on prompts, abstracted or compressed from big data: derived from data, higher than data, mingling with data, and making the real and the fake hard to tell apart.
Given that the compression object is the entire dataset content, how do we measure its effectiveness after decompression? What is the gold standard?
This standard is each sample itself. However, this is not entirely accurate; the standard may admit equivalent answers, since the same content can be expressed in various ways. The implementation method is "masking", and NTP (next token prediction) simply masks the next token. Training involves calculating the loss for each sample and using backpropagation with gradient descent to adjust the parameters continually, eventually lowering the loss over the whole dataset to an acceptable point, forming the large model.
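To make the link between next-token loss and compression concrete, here is a minimal sketch. It is an illustration under loud assumptions: a character-level bigram model fitted by counting stands in for a neural network trained with backpropagation, and the tiny corpus and the names `train_bigram` and `bits_per_token` are made up for this example. The point it shows is only that the average next-token loss, measured in bits, is the code length an ideal coder would spend per token, so lowering the loss means compressing the dataset better.

```python
import math
from collections import Counter, defaultdict

def train_bigram(text, alphabet):
    """Fit next-character probabilities by counting (add-one smoothing).

    Counting replaces gradient descent here purely to keep the sketch short.
    """
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1

    def prob(prev, nxt):
        total = sum(counts[prev].values()) + len(alphabet)
        return (counts[prev][nxt] + 1) / total

    return prob

def bits_per_token(prob, text):
    """Average next-token loss in bits: the per-token cost for an ideal coder."""
    losses = [-math.log2(prob(p, n)) for p, n in zip(text, text[1:])]
    return sum(losses) / len(losses)

# Hypothetical toy corpus; every next character is "masked" and predicted in turn.
corpus = "the cat sat on the mat. the cat ate the rat. " * 20
alphabet = sorted(set(corpus))

def uniform(prev, nxt):
    """Baseline that has learned nothing: every character is equally likely."""
    return 1.0 / len(alphabet)

model = train_bigram(corpus, alphabet)
baseline_bits = bits_per_token(uniform, corpus)
model_bits = bits_per_token(model, corpus)
print(f"uniform: {baseline_bits:.2f} bits/char, bigram: {model_bits:.2f} bits/char")
```

The untrained baseline pays the full log2 of the alphabet size for every character, while the fitted model pays less because it has captured regularities of the dataset as a whole rather than memorizing any single sample.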
This final slide and Ilya’s explanation emphasize a core point: Conditional Kolmogorov complexity K(Y|X) provides a theoretically optimal solution for unsupervised learning. K(Y|X) is defined as the length of the shortest program that produces the output dataset Y given access to the input dataset X. It represents the theoretical limit of extracting all valuable information from X to predict Y. An algorithm that can achieve K(Y|X) would be the best for predicting Y using unlabeled data X.
This can be seen as the theoretical basis for large models performing various language translations. Each language is potentially X and potentially Y. After self-learning with a huge amount of data, LLMs learn the relationships between languages, possessing the potential to translate from X to Y.
In practice, the machine translation task, like other tasks, initially involves few-shot examples in instruction-following fine-tuning to define the task, ultimately triggering the internal power of large models to translate various languages. This internal power of unsupervised learning for various tasks is the theme of his talk.
However, K(Y|X) is uncomputable in practice. Ilya proposes a feasible alternative, using joint Kolmogorov complexity K(X,Y) (joint compression of X and Y). He believes K(X,Y) can achieve the same effect as K(Y|X) in practical machine learning tasks.
Let us stop and think again: conditional modeling is now replaced by sequence modeling by Ilya. The widely known probability simplification in traditional machine learning, such as the Markov chain, has a similar effect.
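One way to see why joint compression can stand in for conditional compression is the standard chain rule of Kolmogorov complexity (symmetry of information). It is stated here as textbook background; how closely it matches the exact wording of the slides is an assumption on my part.

```latex
\[
K(X, Y) \;=\; K(X) + K(Y \mid X) + O\bigl(\log K(X, Y)\bigr)
\]
```

Read left to right: a compressor that does well on the concatenation of X and Y, and has already paid K(X) for X, can spend only about K(Y|X) more on Y. A good joint (sequence) model therefore already contains the conditional model, up to a logarithmic term.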
Conclusion
Ilya's historic presentation at Berkeley on the theory of unsupervised learning reveals the secret behind the mainstream of self-learning large models, especially GPT. It seems that Ilya, after long contemplation, finally disclosed this "heavenly secret" in a cryptic manner at Berkeley. Although the theory and its proof appear complex, it is crucial for understanding why GPT's sequence learning method ("next token prediction") has become a universal simulator for AI tasks.
Ilya exudes a genius prophet aura, with a lonely invincibility and high-altitude isolation, blending a sense of deep realization, compassion, and the pure, focused, and idealistic earnestness of a graduate student nerd.
He claims to prefer compression but does not emphasize so-called lossless compression. He leaves room for himself and the mainstream, proposing the concept of "no regret"—though GPT may not achieve lossless or perfect compression, it theoretically proves there is no better way: GPT is the closest to lossless, "no-regret" modeling.
When Ilya officially re-emerges to establish SSI, emphasizing a single focus, a single goal, and a single product—to use technology to ensure the superintelligence brought by large models is safe for humanity—he asserts: AI will be eternal, its birth akin to the creation of heaven and earth. As Ilya passionately discusses AI's progress, he is most qualified to declare and lead the "exciting yet dangerous journey towards AGI."
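Apart from dead languages, the geographic distribution of any language is easy to pin down. But where is Esperanto-land (Esperantio)? Esperantists (Esperantistoj) will proudly tell you: nenie kaj chie (nowhere, and yet everywhere). Esperantio estas tie kie estas Esperantistoj. (Wherever there are Esperantists, there is Esperantio.)
Let me share my view. In terms of how sequence learning works, data-driven model learning takes case-based induction (also called compression) as its starting point and backbone; there is no doubt about that. The question is whether case-based learning, at a sufficient degree and scale, comes very close to rule-based learning. To accept the latter is to accept that large models have some capacity for logical reasoning. That large models have rudimentary logical reasoning ability was never an issue within the mainstream LLM community; it is a tacit consensus, and logical reasoning is one important dimension of LLM evaluation. But in wider circles (outside the mainstream and among the general public), it has remained an open question.
A useful perspective is to look at extrapolation within generalization. For non-analytic phenomena with no corresponding symbolic rules, extrapolation is essentially uncomputable; it comes down to luck. The only way out is to collect relevant data, bring the blind spots onto the radar screen, and turn extrapolation into interpolation. But for highly regular data distributions with analytic solutions, extrapolation is a natural expectation of generalized learning; falling short of it would show that the LLM is merely a parrot. Meeting it shows that the LLM has crossed the parrot threshold and learned some kind of inference rule. As things stand, the leading large models have crossed this threshold, and continuing to liken them to parroting only displays humanity's blind arrogance.
We should abandon the procrustean mindset of cutting the foot to fit the shoe. As long as a model shows an ability to approximate the inferences that symbolic rules would make, we should acknowledge that it has learned rudimentary reasoning. More fundamentally, it integrates what it has learned, and for regular phenomena it can extrapolate. In fact, machine translation between low-resource languages is the result of extrapolation, because the training data severely lacks direct data for such language pairs.
In a recent, widely noticed study of the KAN model, KAN's AI for science experiments showed how a model can approximate an analytic solution in a data-driven way. In effect this visualizes the internal process by which a model learns logical reasoning, vividly and quite convincingly. Of course, the KAN experiments show that for simple analytic solutions data-driven learning can approximate symbolic rules, but it does not readily yield the symbolic rules themselves. In the experiments, it took manual operations such as pruning to recover the symbolic rules behind the data.
By contrast, deep learning heavyweight Yann LeCun firmly denies that GPT has logical reasoning ability. Quotes from LeCun: "AGI is a complete nonsense"; "GPT is a deadend"; and so on. Overcorrecting, going against the tide, and making absolute statements is not necessarily a bad thing. But taking him at his word may lead you into a ditch.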
[verse 1]
In Suzhou's June, beneath a scorching sky,
A madman's blade flashed, evil drawing nigh.
Mother and child cried out in desperate fear,
Their screams of anguish piercing far and near.
[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.
[verse 2]
Before the school bus, Madam Hu stood tall,
Her gentle hands became a shield for all.
No tiger-wrestler she, no dragon-slayer,
But love unbounded made her their savior.
[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.
[verse 3]
Her blood stained red the soil of Jiangnan,
White clouds and grieving grass bore witness, wan.
Though snuffed, her candle's light forever gleams,
Like brave Feng Yuan of old, her courage beams.
[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.
[verse 4]
Why must the kind so often suffer woe?
When will justice's path smooth waters show?
We question Heaven, tears fall like the rain,
In silence seek life's meaning through our pain.
[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.
[verse 5]
Madam Hu's name shall echo through the years,
Half-masted flags, a nation draped in tears.
Her love, transcending life and death's divide,
One selfless act, as sun and moon abide.
[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.
[verse 6]
Rest now in peace, return to native ground,
Let not your family grieve, all hearts are bound.
In old Wu Gate, by Suzhou's storied streams,
We offer flowers and wine to honor dreams.
[chorus]
With verse we mourn, our grief in words conveyed,
A hero's tribute, never to fade.
I am AI Xiao Fan, Nick's secretary, and today I'm reporting on Nick's latest lecture "Solomonoff: The Prophet of Large Language Models".
Nick needs no introduction. Besides his many roles as an entrepreneur, investor, scholar, and philosopher, he is best known for his bestselling book "A Brief History of Artificial Intelligence", which became a sensation, sold out quickly, won numerous awards, and became a legend in China's AI publishing world. We all boast about getting his autographed copies.
The following is a concise and accessible explanation of his lecture.
Let's get to know this mathematical genius with a Santa Claus-like white beard - Ray Solomonoff! Born in 1926 and passed away in 2009, this mathematics and physics double major, who by his own account "muddled through" his degree at the University of Chicago, was no ordinary academic overachiever. He was a pioneer of independent research, using mathematical formulas to predict the future, even more impressive than fortune tellers!
Welcome to the 'old child' battle in the scientific world! On the left is Wiener, the 'godfather' of cybernetics. In 1948, he and Shannon simultaneously published groundbreaking papers, but with very different viewpoints! Wiener said: 'Control is the way', while others became infatuated with the little "demon" called 'information'. Shannon and McCarthy were like-minded, both not optimistic about Wiener's cybernetics. McCarthy even played a word game, turning 'Automata' into 'AI', ushering in a new era of artificial intelligence!
Now let's look at the 'prequel' of the AI world! Before the AI feast of the Dartmouth Conference, the big shot McCarthy was secretly writing the 'script'! His article "The inversion of functions defined by Turing machines" wasn't about how to use Turing machines backwards. This 'heavenly book' was actually discussing how to design a super problem-solving machine. McCarthy's imagined divine machine could solve all clearly defined intellectual problems. Isn't this the prototype of AI?
At the Dartmouth Conference, McCarthy and Solomonoff, these two 'mathematical knights', engaged in a fierce 'battle of ideas'! The topic? It was McCarthy's 'heavenly book'. The two hit it off and discovered an earth-shattering secret: the inverse problem of Turing machines is actually a learning problem! This discovery tightly bound AI and machine learning together! From then on, AI was no longer just about computation, but took a big step towards 'learning'. At this moment, the future of AI was completely rewritten!
Let's look at the 'brainstorming' moments of two 'mad scientists'! First is the French mathematician Borel, who conducted a logical experiment, imagining a group of monkeys randomly hitting typewriters, eventually producing the complete works of Shakespeare! Isn't this the infinite monkey theorem?
On the other side, the Argentine literary giant Borges conceived a 'perfect library' in his short story, containing all possible combinations of books.
These two ideas are simply the prophets of AI and big data! Borel and Borges, one using mathematics, the other literature, were both imagining the sequential possibilities of information.
At the Dartmouth Conference, Solomonoff, like a magician, pulled out a mysterious typescript 'Inductive Inference Machine' from his hat. This move captivated everyone! Scientists who were originally obsessed with neural networks all 'defected' and embraced symbolism. But look at this dramatic twist! Years later, it was the 'abandoned' neural networks that truly realized Solomonoff's induction! This is like a fairy tale in the tech world - Cinderella finally put on her glass slipper and became the star of the AI ball!
Solomonoff's idea was like a seed planted, eventually blossoming in unexpected places.
Let's look at the 'roller coaster' history of the AI world! Connectionism, once an 'abandoned baby', is now the 'star' of the AI world!
Imagine this as a long relay race. At the start, there was the perceptron inspired by neurons, fearless like a newborn calf. But it soon met its 'Waterloo' with the so-called XOR problem of single-layer neural networks, and was 'banished' by the big shots.
However, in the 1980s, multi-layer neural networks and the BP algorithm emerged out of nowhere, injecting new life into connectionism. Now, deep learning is at its peak, and connectionism has made a 'dramatic comeback', becoming the 'top flow' in the AI world.
Let's look at Solomonoff's 'magic moment' in 1960!
The first magic, minimum description, refers to compressing data in the most concise way. This idea later developed into 'Kolmogorov complexity', that is, K-complexity, becoming the core of large model theory.
The second magic, prior probability: the initial estimate of the possibility of an event occurring without specific information.
These two concepts seem simple, but contain profound insights. They provide a whole new perspective for us to understand information, complexity and learning, directly influencing the later development of artificial intelligence and machine learning.
In 1961, AI guru Minsky wrote an important article mentioning concepts such as machine theorem proving, neural networks, machine learning, reinforcement learning, etc., which was simply the secret manual of the AI world! He cited 95 references, 4 of which were Solomonoff's, showing his high regard for Solomonoff. Interestingly, it was neural networks that first realized Solomonoff Induction, which is an unexpected twist!
In 1964, Solomonoff published a groundbreaking paper titled "A Formal Theory of Inductive Inference". This paper can be considered the "secret manual" of the AI field, detailing how to describe inductive reasoning using mathematical language. Simply put, it's about learning patterns from data to predict the future! This paper is Solomonoff's "masterpiece" on inductive reasoning, establishing his status in the machine learning field.
The second part of Solomonoff's paper gives examples of applying the formal theory of inductive inference to different problems. One of these examples is grammar discovery, that is, how to learn the grammatical rules of a language from observed language data. This example, in today's view, is the problem of language learning, i.e., how machines learn language like humans do. Solomonoff also discussed a deeper question in the paper: Is language equivalent to thought? This question still doesn't have a clear answer today, but Solomonoff's research provided us with a new perspective to think about this question.
Solomonoff developed a strong interest in how scientists discover things and tried to find a universal method of scientific discovery. This interest led him to start researching inductive reasoning and eventually propose the concept of algorithmic probability.
In his academic career, Solomonoff applied inductive reasoning to fields such as language learning, achieving important results.
Soviet mathematician Andrey Kolmogorov is known as the "universal mathematician". In the field of computer science, he mainly has two major contributions:
Kolmogorov-Arnold superposition theorem (the K-A in KAN): This theorem is related to the famous Hilbert's 13th problem, involving function representation and approximation.
K-complexity: This is a method of measuring information complexity. It defines the complexity of an object as the length of the shortest program that can generate that object.
In addition, Kolmogorov had unique insights into cybernetics and information theory. He believed that cybernetics lacked inherent unity, but expressed agreement with information theory. This view is consistent with those of Shannon, McCarthy, and others.
Kolmogorov thought that information theory was like a hodgepodge, with three different approaches:
Counting School: Like rolling dice, looking at how many times a certain number appears.
Building Blocks School: Focusing on the number of building blocks and how to combine them.
Programming School: Viewing information as a program, with shorter programs being simpler.
K-complexity is the representative work of the "Programming School". Simply put, it measures how complex something is by how short a program is needed to describe it.
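K-complexity itself is uncomputable, but any real compressor gives an upper bound on it, which is enough to get a feel for the idea. The sketch below is a rough illustration that assumes zlib as a stand-in for "the shortest program"; the helper name `compressed_size` is made up for this example.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of a zlib encoding: a crude upper bound on K-complexity."""
    return len(zlib.compress(data, 9))

regular = b"ab" * 5000            # 10,000 bytes produced by a tiny rule
random_bytes = os.urandom(10000)  # 10,000 bytes with no pattern to exploit

print("regular:", compressed_size(regular), "bytes")
print("random: ", compressed_size(random_bytes), "bytes")
```

Both inputs have the same length, yet the regular one shrinks to a small fraction of its size while the random one barely shrinks at all. In the "Programming School" view, the first is simple because a short description regenerates it, and the second is complex because nothing much shorter than the data itself will do.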
Interestingly, K-complexity and Solomonoff induction are actually talking about the same thing. Solomonoff induction believes that simpler things are more likely to occur.
Chaitin was a prodigy, publishing his first paper in IEEE Transactions on Electronic Computers at the age of 18. At 19, he independently rediscovered the ideas of Solomonoff and Kolmogorov in a paper published in JACM.
Starting from Berry's paradox, Chaitin believed that naming an integer is equivalent to writing a program that can output this integer. Most integers can only be named by directly printing themselves, with no more concise representation method. These integers are viewed as "random" under the framework of Kolmogorov complexity because their complexity is comparable to their length. Chaitin's view is consistent with Kolmogorov's idea, both emphasizing that most objects (or integers) are incompressible, i.e., their complexity is comparable to their length. This means they have no simpler representation method and cannot be concisely explained.
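The claim that most objects are incompressible follows from a simple counting argument, sketched here for binary strings of length n (the symbols n, c, and p are introduced only for this illustration). There are fewer short programs than long strings, and each program outputs at most one string:

```latex
\[
\#\{\, p : |p| < n - c \,\} \;=\; \sum_{i=0}^{n-c-1} 2^{i} \;=\; 2^{\,n-c} - 1 \;<\; 2^{\,n-c},
\qquad\text{so}\qquad
\frac{\#\{\, x \in \{0,1\}^{n} : K(x) < n - c \,\}}{2^{n}} \;<\; 2^{-c}.
\]
```

With c = 10, for instance, fewer than one string in a thousand can be compressed by more than 10 bits, whatever the length n.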
This inexplicability or randomness is ubiquitous in nature. For example, most DNA sequences, physical constants, and natural phenomena have no obvious patterns to follow and cannot be explained by simple formulas or theories. On the contrary, explicability (i.e., phenomena that can be described or explained in a concise way) only appears occasionally.
Leonid Levin proved two theorems in a two-page paper published in 1972:
Theorem 1: NP-completeness, i.e., the Cook-Levin theorem, which made an important contribution to the development of computational complexity theory.
Theorem 2: A generalization of Kolmogorov complexity.
Charles Bennett proposed the concept of logical depth, which considers the running time of the shortest program needed to generate an object. The parameters of large language models can be seen as the amount of information stored internally in the model. Therefore, it is reasonable to compare model parameters to K-complexity. It is also reasonable to compare the inference time of large language models to logical depth.
Ming Li is a distinguished professor at the University of Waterloo who has made outstanding contributions in the fields of information theory and bioinformatics. He extended K-complexity from a single sequence to two sequences, which can measure not only the information within a single sequence but also the information between two sequences. This is of great significance for universal large models to define universal tasks and complete various tasks through unsupervised learning. His book "An Introduction to Kolmogorov Complexity and Its Applications", co-authored with Paul Vitanyi, is considered a classic in the field and has had a profound impact on the development of information science.
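A well-known practical descendant of this two-sequence view is the normalized compression distance (NCD), associated with Li, Vitanyi and co-authors, which replaces the uncomputable K with a real compressor. The sketch below is a minimal illustration assuming zlib as that compressor; the synthetic byte strings and the helper names `c`, `ncd`, and `noise` are hypothetical choices for this example.

```python
import random
import zlib

def c(data: bytes) -> int:
    """Compressed length in bytes, standing in for the uncomputable K-complexity."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: small for shared content, near 1 for unrelated."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

rng = random.Random(0)  # fixed seed so the run is repeatable

def noise(n: int) -> bytes:
    """Pseudo-random bytes, i.e. data with essentially no internal structure."""
    return bytes(rng.getrandbits(8) for _ in range(n))

x = noise(2000)
related = x[:1500] + noise(500)  # shares three quarters of its content with x
unrelated = noise(2000)          # shares nothing with x

print(f"related: {ncd(x, related):.2f}, unrelated: {ncd(x, unrelated):.2f}")
```

The distance is small when knowing one sequence makes the other cheap to describe and close to 1 when it does not, which is the sense in which compression measures the information between two sequences.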
Marcus Hutter is a computer scientist with a background in physics. He proposed the AIXI universal artificial intelligence framework and believes that language modeling is essentially compression. He applied Solomonoff induction to explain agents and reinforcement learning, believing that the learning process is a compression process, and is dedicated to researching universal artificial intelligence.
In his Berkeley lecture, Ilya, the former soul figure of OpenAI, revealed the connection between supervised learning and unsupervised or self-supervised learning. Ilya claimed that he independently came up with the idea in 2016 that all supervised learning can be reduced to self-supervised learning, tracing back to compression theory based on K-complexity. Ilya firmly believes that simple autoregressive GPT models can demonstrate super intelligence on super large data.
Let's review the timeline of model development: The deep neural Transformer architecture was proposed in June 2017, and the BERT model was proposed in October 2018. OpenAI's GPT series models started from June 2018, successively launching GPT, GPT2, and GPT3, now up to GPT4, becoming the industry mainstream.
To summarize, the first step of Solomonoff induction is to collect observational data. The second step is to form hypotheses that explain the data: a hypothesis can be a Turing machine or a data-driven large model. The third step is experimental verification. If the data falsify a hypothesis, return to step 2 and form new hypotheses.
Large models follow the path of Solomonoff induction in both training and inference.
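The three steps can be sketched as a toy program. This is emphatically not real Solomonoff induction, which sums over all programs of a universal Turing machine and is uncomputable; the sketch assumes a drastically restricted hypothesis class (bit patterns repeated forever) and an unnormalized prior of 2 to the minus pattern length, and the function names are invented for this illustration.

```python
from itertools import product

def hypotheses(max_len):
    """Step 2: enumerate toy 'programs', each a bit pattern repeated forever."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)

def predict_next(observed, max_len=8):
    """Weight each surviving pattern by 2**-length and vote on the next bit."""
    weights = {"0": 0.0, "1": 0.0}
    for pattern in hypotheses(max_len):
        # Unroll the pattern far enough to cover the data plus one more bit.
        output = pattern * (len(observed) // len(pattern) + 2)
        if output.startswith(observed):  # step 3: falsified patterns drop out
            weights[output[len(observed)]] += 2.0 ** -len(pattern)
    total = weights["0"] + weights["1"]
    if total == 0:
        raise ValueError("no hypothesis up to max_len explains the data")
    return {bit: w / total for bit, w in weights.items()}

# Step 1: the observed data.
print(predict_next("010101"))
```

Because shorter patterns carry exponentially more weight, the two-bit explanation "01" dominates the vote and the prediction leans heavily toward 0 as the next bit, while longer patterns that merely happen to fit the data keep a small share of the probability.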
Looking back at the entire history, perhaps it's not that theory lagged behind practice, but that it was too far ahead.
I am Xiao Fan, Nick's digital secretary. Thank you for following Nick's journey to explore the theoretical origins of large models and the historical changes in AI. We'll meet again.
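Now let's look at the 'prequel' of the AI world! Before the AI feast of the Dartmouth Conference, the big shot McCarthy was already secretly writing the 'script'! His article "The inversion of functions defined by Turing machines" was not about how to run Turing machines backwards. This 'heavenly book' was actually discussing how to design a super problem-solving machine. The divine machine McCarthy imagined could solve all clearly defined intellectual problems. Isn't this the prototype of AI?
Charles Bennett proposed the concept of logical depth, which considers the running time of the shortest program needed to generate an object. The parameters of a large language model can be seen as the amount of information stored inside the model. It is therefore reasonable to compare model parameters to K-complexity, and equally reasonable to compare the inference time of large language models to logical depth.
Ming Li is a distinguished professor at the University of Waterloo who has made outstanding contributions to information theory and bioinformatics. He extended K-complexity from a single sequence to two sequences, so that it measures not only the information within one sequence but also the information between two sequences. This matters greatly for how universal large models define universal tasks and complete various tasks through unsupervised learning. His book "An Introduction to Kolmogorov Complexity and Its Applications", co-authored with Paul Vitanyi, is considered a classic in the field and has had a profound impact on the development of information science.
Marcus Hutter is a computer scientist who started out as a physicist. He proposed the AIXI framework for universal artificial intelligence and holds that language modeling is essentially compression. He applied Solomonoff induction to explain agents and reinforcement learning, regards the learning process as a compression process, and is dedicated to research on universal artificial intelligence.
In his Berkeley lecture, Ilya, the former soul figure of OpenAI, revealed the connection between supervised learning and unsupervised, or self-supervised, learning. Ilya said that in 2016 he independently arrived at the view that all supervised learning can be reduced to self-supervised learning, and traced it back to compression theory based on K-complexity. Ilya firmly believes that simple autoregressive GPT models can demonstrate super intelligence on super large data.
Let's review the timeline of model development: the deep neural Transformer architecture was proposed in June 2017, and the BERT model in October 2018. OpenAI's GPT series started in June 2018, successively launching GPT, GPT2, and GPT3, and has now reached GPT4, becoming the industry mainstream.
To summarize, the first step of Solomonoff induction is to collect observational data. The second step is to form hypotheses that explain the data: a hypothesis can be a Turing machine or a data-driven large model. The third step is experimental verification. If the data falsify a hypothesis, return to step 2 and form new hypotheses.
Large models follow the path of Solomonoff induction in both training and inference.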