What did Ilya see? -- secret behind success of LLMs

What did Ilya see?

-- looking closely into his historical Berkeley talk

by Wei Li, Jia Gao


When Ilya Sutskever left OpenAI and re-emerged with his new company, SSI (Safe Superintelligence Inc.), the move was both surprising and expected: he bypassed AGI and aimed directly at safe superintelligence. He confidently declared that superintelligence is imminent and that establishing safe superintelligence (SSI) is the most important technological issue of our time.

Ilya, a legend in deep learning and AI and formerly the true soul of OpenAI, was at the center of the company's dramatic internal upheaval, which turned on one issue: effective acceleration versus super alignment. Why was Ilya so steadfast about "super alignment" in that debate over AI values and strategic path? Even after the storm settled, the outside world continued to speculate: what did Ilya see that compelled him to join the board in the decision to oust CEO Sam Altman? Ilya stayed out of public view until recently, when he left OpenAI, a departure followed by the dissolution of his super alignment team and the creation of his new company.

What did he see behind the push for "safe intelligence"?

Back on October 3, 2023, Ilya gave a talk at UC Berkeley titled "A Theory of Unsupervised Learning." Though obscure and known to few, it is destined to be one of the most significant moments in AI history. This talk was a theoretical reflection and summary by a top expert in deep learning on the GPT model he pioneered, now famous worldwide. Ilya revealed the core principles of large models and vividly described his obsession with, and excitement over, independently understanding the mechanisms of unsupervised learning. Despite the complexity, the talk was brilliant and enlightening.

Recently, Leopold Aschenbrenner, a former member of his super alignment team, published a 165-page essay, "Situational Awareness," giving a first look at the shock and concern within OpenAI over the exponential evolution of GPT models. This partly answered the question of what Ilya saw, but Ilya himself remained silent until his official re-emergence not long ago.

Reflecting on his "confessional" talk at Berkeley, we might glimpse his "moment of enlightenment" when facing potential superintelligence and understand his original intent for safe intelligence. It was a rare, deep act of sharing by Ilya, an attempt to convey an essential message to the world. But did the world hear him?

1. Machine Learning: Supervised Learning and Unsupervised Learning

To accommodate readers with varying mathematical backgrounds, this blog aims to explain Ilya's historic presentation in accessible language. Non-technical readers can skip the purely technical explanations without losing the presentation's main ideas.

Before diving in, let's review the basic concepts of machine learning. Machine learning is like having computers as students and humans as teachers. By providing computers with numerous "practice problems" and "answer keys," they slowly learn to solve problems. This is supervised learning. But can computers really learn from practice problems instead of merely memorizing them? Ilya assures us there's theoretical proof of this.

Imagine a sea of problems before you, each paired with a standard answer. This is the model's training data. Model training is like diligently solving these problems until most of them are correct, meaning low training error. But even an extensive problem set has its limits. When new problems arise, can the model still get them right? These new problems are the test data, akin to exams. Whether the model performs well depends on its test error rate.

Mathematics tells us that as long as the problem set is large enough, far exceeding the model's size, excellent performance on training problems (low training error) ensures good performance on test problems (low testing error). In other words, if the model trains well, it will do well in exams! This is the mathematical guarantee for supervised learning.
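One standard way to make this guarantee precise is the textbook finite-class bound, which combines Hoeffding's inequality with a union bound. It is sketched here under assumptions the post does not spell out: a finite model class $\mathcal{F}$, a loss bounded in $[0,1]$, and $n$ training examples $S$ drawn i.i.d. from the data distribution $D$. The symbols are introduced for illustration only.

```latex
\[
\Pr_{S \sim D^{n}}\!\left[\,
  \mathrm{Test}_D(f) - \mathrm{Train}_S(f)
  \;\le\; \sqrt{\frac{\ln|\mathcal{F}| + \ln(1/\delta)}{2n}}
  \quad \text{for all } f \in \mathcal{F}
\right] \;\ge\; 1 - \delta
\]
```

Because $\ln|\mathcal{F}|$ grows in proportion to the number of bits needed to store the model's parameters, the gap between test and training error is small only when $n$ is much larger than the model's size, which matches the condition stated above.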

However, if the model merely memorizes without extracting patterns, then no matter how large its memory or how strong its "memory power," it lacks real adaptive learning ability (called "generalization ability"). Only when the model isn't too smart will it be forced to extract the essence (called "compression"), learning real skills from the problem set.

This explains why the model size shouldn't be too large: a smaller model has less room to cut corners. In short, Ilya wants to say that "big labeled data + low training error" is the winning formula for supervised learning, guaranteed by mathematics. This point has been confirmed both theoretically and practically. Since the deep learning revolution 12 years ago, countless successful cases have shown that as long as the training data is sufficient, neural networks can excel at all sorts of AI tasks, from recognizing cats and dogs to machine translation.

But what about unsupervised learning? Can computers learn intelligence from a problem set without standard answers? It sounds far-fetched, but Ilya is about to explain how he managed to seek a solid mathematical foundation for unsupervised learning as well.

2. Distribution Matching: A New Approach to Unsupervised Learning

Everyone knows that machine translation was a typical win for supervised learning, in fact the only win among NLP tasks (such as dialogue, information extraction, sentiment analysis, question answering, and document understanding) before the era of large language models. Why? Because we have a vast amount of historical bilingual data. It's like students having workbooks with English on the left and Chinese on the right; supervised learning thrives on this setup.

But what if the teacher suddenly stops providing aligned bilingual data and only gives you English books and unrelated Chinese books, leaving you to figure out how to align and learn automatic translation? That's the problem unsupervised learning needs to solve. Ilya says unsupervised learning can also handle machine translation between many languages (which we see today with large models, now that specialized translation software is no longer needed), and even any input-to-output transformation task. What's the trick?

Ilya discovered a new approach called distribution matching. Essentially, if the English and Chinese book collections are large enough, containing various sentence structures, their linguistic regularities will be learned "without supervision". For example, the context distribution of "I/me/my" in English should correspond to "我" in Chinese; adjectives near nouns in English with semantic compatibility should have a similar pattern in Chinese, etc. This provides the basic condition for potential language alignment.

Ilya points out that if two languages' native data is sufficiently rich, the input in one language can almost uniquely determine the equivalent translation in the other language. This principle applies not only to machine translation but also to tasks like speech recognition and image recognition.

Ilya independently discovered this approach in 2015, fascinated by the underlying mathematical principle—compression theory. If we can find a method that maximally compresses both English and Chinese data, this approach will capture the common patterns of the two languages, which form the basis of translation.

So, Ilya proposes that unsupervised learning is essentially about finding the optimal data compression method. This perspective not only sounds cool but also provides a mathematical explanation for the effectiveness of unsupervised learning. Although real-world tasks are not idealized, this principle gives unsupervised learning a solid theoretical foundation, making it as convincing as supervised learning.

Next, Ilya will delve deeper into the mathematical principles behind it. Although somewhat abstract, he promises it’s full of insights. We'll see how he uses the magic of compression to explain the mysteries of unsupervised learning.

3. Ilya’s Ultimate Theory: From Conditional Modeling to Joint Modeling

This is the final and most intriguing slide of Ilya's talk, worthy of thorough analysis and contemplation. The goal of unsupervised learning is often defined as "learning the internal structure of data." Ilya suggests understanding unsupervised learning from the perspective of data compression: a good unsupervised learning algorithm should maximally compress the data, representing its content in the simplest form. This introduces the concept of Kolmogorov complexity.

The Kolmogorov complexity of a data object is the length of the shortest computer program that can fully describe this object. You can imagine this shortest program as a "compressed package" containing all the information needed to reconstruct the original data. From this perspective, the goal of unsupervised learning is to find the optimal compressed representation of the data, whose length is the Kolmogorov complexity.
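Kolmogorov complexity itself cannot be computed, but any real compressor gives an upper-bound proxy for it. The minimal sketch below uses Python's standard `zlib` as that stand-in (our choice for illustration, not something used in the talk); the helper name `compressed_size` is hypothetical. It contrasts data that has a short description with data that almost surely does not.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data: a crude upper-bound proxy for K(data)."""
    return len(zlib.compress(data, 9))


# 3000 bytes with an obvious short description: "repeat 'abc' 1000 times".
patterned = b"abc" * 1000
# 3000 bytes from the OS random source: almost surely no short description exists.
random_bytes = os.urandom(3000)

print("patterned:", compressed_size(patterned), "bytes")
print("random:   ", compressed_size(random_bytes), "bytes")
```

The patterned input shrinks to a small fraction of its size, while the random input does not shrink at all. That difference is the intuition behind "short program" versus "no short program."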

However, in practice, we often need to handle multiple related datasets. For instance, in machine translation, we have the source language dataset X and the target language dataset Y. We want to learn a model that can translate sentences from X to Y (or vice versa). Traditionally, this is viewed as a conditional probability problem: given X, what is the probability distribution of Y? Represented in terms of Kolmogorov complexity, this involves finding K(Y|X), the shortest description length of Y given X.

Ilya proposes a different approach. Instead of viewing X and Y as condition and result, like in supervised learning, he suggests viewing them as a whole and compressing them together within a massive model. Essentially, we seek the joint Kolmogorov complexity K(X,Y), the shortest program length that compresses both X and Y simultaneously. This approach must fully utilize the correlation between X and Y, using information in X to automatically align Y (or vice versa), much like how we use our native language knowledge to understand and remember foreign language expressions.
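The effect of joint compression can be seen in miniature with an off-the-shelf compressor. This is a rough sketch under loud assumptions: `zlib` stands in for the ideal compressor, `x` is synthetic pseudo-random data, and `y` is a made-up "translation" of `x` that differs in every 100th byte. None of these names or data come from the talk.

```python
import random
import zlib


def c(data: bytes) -> int:
    """Compressed length in bytes; a rough, computable stand-in for K(data)."""
    return len(zlib.compress(data, 9))


rng = random.Random(0)
# X: 2000 pseudo-random bytes, so zlib finds almost nothing to compress in X alone.
x = bytes(rng.getrandbits(8) for _ in range(2000))
# Y: a hypothetical noisy "translation" of X, identical except every 100th byte is altered.
y = bytes((b + 1) % 256 if i % 100 == 0 else b for i, b in enumerate(x))

separate = c(x) + c(y)            # compress each dataset on its own
joint = c(x + y)                  # compress both together
conditional_estimate = joint - c(x)  # extra cost of Y once X is already known

print("separate:", separate)
print("joint:   ", joint)
print("estimated cost of Y given X:", conditional_estimate)
```

Compressing the two together costs far less than compressing them separately, because the compressor reuses what it saw in `x` to describe `y`. The difference `joint - c(x)` behaves like an estimate of the conditional cost of `y` given `x`.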

Ilya believes this joint compression idea is the true power of unsupervised learning. Real-world data is often interconnected, with numerous deep common patterns and regularities. If unsupervised learning can discover and utilize these regularities, it can significantly enhance learning efficiency and generalization ability. This explains the remarkable performance of large language models like GPT across various tasks: through massive unsupervised pretraining, they learn the deep regularities of the training data, and these regularities are transferable across related datasets.

Although Kolmogorov complexity is theoretically uncomputable, Ilya believes we can approximate this process using deep neural networks (like GPT). Through optimization algorithms such as gradient descent, neural networks can find the optimal compressed representation in massive data, capturing the essence of the data and its alignment patterns, even if not strictly in terms of Kolmogorov complexity.

Thus, Ilya’s theory can be seen as a new paradigm for unsupervised learning, elevating traditional independent modeling (like separate models for English and Chinese) to a unified associative modeling approach. In this paradigm, the goal of unsupervised learning is no longer just compressing individual datasets but finding the connections between them. This cross-modality learning represents an advanced form of artificial general intelligence (AGI).

Now, let’s closely examine this final slide. In it, X represents dataset 1 and Y represents dataset 2. The key point is extracting every bit of information from X (or Y) to help predict Y (or X). This is what Ilya refers to when he says training X and Y together yields the effect that unsupervised learning of X helps accomplish the task of transforming X to Y.

The crucial idea is: K(Y|X) becomes K(X, Y).

Ilya recasts the general functional AI task, "given input X, produce output Y," as a problem solved approximately by training on X and Y jointly, without segmenting them by modality. This joint training is in effect today's unified multimodal training, abbreviated here as K(X, Y).

Ilya aims to strengthen the theoretical basis, emphasizing his surprising discovery that self-learning of X has a strong predictive effect on Y.

The essence of unsupervised self-learning is that the self-learning of X is to compress X, and the self-learning of Y is to compress Y. This is straightforward because self-learning involves only positive examples, without negative samples. Unsupervised self-learning lacks a specific task orientation; it learns language from language, images from images, music from music, and so on, continually abstracting various patterns from phenomena.

Ilya points out in the slide: conditioning on a dataset, not an example. The compression object is the dataset, not individual data points, which is crucial. This distinction separates superficial compression from content compression. Superficial compression is merely a mechanical process that does not produce intelligence. Only content compression can achieve artificial intelligence.

How do we understand the difference and connection between superficial lossless compression (e.g., digital music) and content lossless compression (e.g., Suno)? Compressing a specific song losslessly aims to ensure it can be restored to its original musical form (including noise and imperfections). This is traditional music compression, targeting an individual sample, e.g., a specific song. Compressing a collection of music, whether using GPT or Diffusion, targets a group of samples, resulting in a large model like Suno.

When individual objects turn into group objects, formal compression naturally transforms into content compression. This is because, although the group comprises individuals, compressing the group is like "painting" a portrait of the group, outlining its characteristics. It may resemble an individual, but it is not a specific individual in the original data; otherwise, it would not be a model but a memory repository.

This is understandable because the purpose of large model compression is to identify the characteristics and regularities of the dataset. The text generated by GPT-4 might seem familiar; the music generated by Suno might sound familiar; the videos generated by Sora might look familiar; the images generated by Midjourney (MJ) might seem familiar. However, they are virtual individuals "restored" based on prompts, abstracted or compressed from big data: derived from data, higher than data, mingling with data, and blurring the line between real and fake.

Given that the compression object is the entire dataset content, how do we measure its effectiveness after decompression? What is the gold standard?

This standard is each sample itself. However, this is not entirely accurate; the standard could have equivalent answers, as the same content can be expressed in various ways. The implementation method is "masking", and next-token prediction (NTP) simply masks the next token. Training involves calculating the loss for each sample, using backpropagation with gradient descent to adjust parameters continually, eventually lowering the loss in the group training of the dataset to an acceptable point, forming the large model.
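The connection between next-token loss and compression can be sketched with a toy model. A predictor that assigns probability p to the true next token can encode it in about -log2(p) bits, so average loss is average code length. The example below is only an illustration under stated assumptions: a character-level bigram counter with add-one smoothing stands in for GPT, and the corpus and the function names `train_bigram` and `bits_per_token` are invented here.

```python
import math
from collections import Counter, defaultdict


def train_bigram(text):
    """Count how often each character follows each previous character."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return counts


def bits_per_token(counts, text, vocab):
    """Average next-token loss in bits, using add-one smoothed bigram probabilities."""
    total_bits = 0.0
    for prev, nxt in zip(text, text[1:]):
        row = counts[prev]
        p = (row[nxt] + 1) / (sum(row.values()) + len(vocab))
        total_bits += -math.log2(p)  # code length for the true next token
    return total_bits / (len(text) - 1)


# Hypothetical toy corpus; any repetitive text shows the same effect.
corpus = "the cat sat on the mat. the dog sat on the log. " * 20
vocab = sorted(set(corpus))

model = train_bigram(corpus)
learned = bits_per_token(model, corpus, vocab)
uniform = math.log2(len(vocab))  # cost per token with no model at all

print(f"uniform code: {uniform:.2f} bits/char")
print(f"bigram model: {learned:.2f} bits/char")
```

The learned model needs fewer bits per character than the uniform code, and lowering the loss further would shorten the encoding further. In this sense, training to predict the next token is training to compress the dataset.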

This final slide and Ilya’s explanation emphasize a core point: Conditional Kolmogorov complexity K(Y|X) provides a theoretically optimal solution for unsupervised learning. K(Y|X) is defined as the length of the shortest program that produces the output dataset Y given access to the input dataset X. It represents the theoretical limit of extracting all valuable information from X to predict Y. An algorithm that can achieve K(Y|X) would be the best for predicting Y using unlabeled data X.

This can be seen as the theoretical basis for large models performing translation between many languages. Each language is potentially X and potentially Y. After self-learning with a huge amount of data, LLMs learn the relationships between languages and gain the potential to translate from X to Y.

In practice, the machine translation task, like other tasks, initially involves few-shot examples in instruction-following fine-tuning to define the task, ultimately triggering the internal power of large models to translate various languages. This internal power of unsupervised learning for various tasks is the theme of his talk.

However, K(Y|X) is uncomputable in practice. Ilya proposes a feasible alternative, using joint Kolmogorov complexity K(X,Y) (joint compression of X and Y). He believes K(X,Y) can achieve the same effect as K(Y|X) in practical machine learning tasks.
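The relationship between the two quantities can be written out with the standard chain rule (symmetry of information) for Kolmogorov complexity. It is given here as background that supports the argument, not as a formula quoted from the slide:

```latex
\[
K(X, Y) \;=\; K(X) \;+\; K(Y \mid X) \;+\; O\!\bigl(\log K(X, Y)\bigr)
\]
```

Read this way, a compressor that comes close to $K(X, Y)$ must, up to a logarithmic term, already pay only $K(Y \mid X)$ for the second dataset once the first is described. In other words, a good joint compressor has to exploit everything in $X$ that helps describe $Y$.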

Let us stop and think again: conditional modeling is now replaced by sequence modeling by Ilya. The widely known probability simplification in traditional machine learning, such as the Markov chain, has a similar effect.


Ilya's historic presentation at Berkeley on the theory of unsupervised learning reveals the secret behind the mainstream of self-learning large models, especially GPT. It seems that Ilya, after long contemplation, finally disclosed this "heavenly secret" in a cryptic manner at Berkeley. Although the theory and its proof appear complex, it is crucial for understanding why GPT's sequence learning method ("next token prediction") has become a universal simulator for AI tasks.

Ilya exudes a genius prophet aura, with a lonely invincibility and high-altitude isolation, blending a sense of deep realization, compassion, and the pure, focused, and idealistic earnestness of a graduate student nerd.

He claims to prefer compression but does not emphasize so-called lossless compression. He leaves room for himself and the mainstream, proposing the concept of "no regret"—though GPT may not achieve lossless or perfect compression, it theoretically proves there is no better way: GPT is the closest to lossless, "no-regret" modeling.

When Ilya officially re-emerges to establish SSI, emphasizing a single focus, a single goal, and a single product—to use technology to ensure the superintelligence brought by large models is safe for humanity—he asserts: AI will be eternal, its birth akin to the creation of heaven and earth. As Ilya passionately discusses AI's progress, he is most qualified to declare and lead the "exciting yet dangerous journey towards AGI."







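Apart from languages that have already died, the geographic distribution of a language is not hard to pin down. But where is Esperanto-land (Esperantio)? Esperantists (Esperantistoj) will proudly tell you: nenie kaj chie (nowhere, and yet everywhere). Esperantio estas tie kie estas Esperantistoj. (Wherever there are Esperantists, there is Esperantio.)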



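Manchester, England, was the first stop of my studies abroad. Like many others, I found that leaving my homeland for the first time brought an indescribable ache; I felt hollow and adrift. With nothing better to do, I picked up the Yellow Pages and looked up Esperanto. Sure enough, there was a contact: a club made up of retirees that met once a week in a pub. They were delighted, and my arrival brought them some novelty.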

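So every weekend they sent someone to drive me there and back. This was my first encounter with British pub culture. At first I was not used to the pub: it was noisy, the huge screen always showed football matches, some people played pool, others played games, and most sat drinking beer and holding forth. The British devotion to the pub exceeded my imagination; some came every evening and stayed past midnight, downing vast quantities of beer, faces flushed, talking about who knows what. Friends made over drinks; how long does life last, after all?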


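Having tasted the rewards of seeking out Esperanto "comrades" in England, on my second day in Vancouver I opened the Yellow Pages and, sure enough, got in touch with a veteran Esperantist, J, a German who was extremely courteous, kind, and warm. During my five years in Vancouver he became my closest friend across generations. Once, when I gave a talk in my department on "machine processing of Esperanto," he hauled in his old-fashioned video camera like a reporter and ran back and forth filming me, so that the professors and students in my department saw the enthusiasm of Esperantists with their own eyes.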


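At the time, the Vancouver Esperanto club also had a group of young white men from the telephone company, all fine-looking fellows. On hearing that a genuine Chinese Esperantist had arrived, they were all excited. After a dinner together, they warmly invited me to go skiing with them on the weekend. I had just arrived and my coursework was heavy, but the invitation was hard to refuse, so I gave up a day and went. It was my first time skiing, and although I kept falling, it felt fresh and wonderful. I had never been in such a setting before: pines and white snow, laughter and chatter, ski suits of every color. It truly felt like heaven on earth.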







Suno: "Liwei: Esperanto: Al Nia Kara Lingvo (Love of Esperanto)"




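The arrival of generative AI also means the arrival of an era in which real and fake are hard to tell apart. Today, whether for text, audio, or video, the spread of large models has made the barrier to deepfakes vanishingly low. What you hear may be false, and what you see may be false too. What information can be trusted? Society does not seem to be prepared for this.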


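In recent months in the US, quite a few young people around me have lost their jobs, several of them data science graduates from Berkeley. On one hand, the market value of AI giants such as NVIDIA, Apple, and Microsoft keeps climbing; on the other, waves of IT layoffs, including at the big companies, come one after another. Data science graduates, whose training is not hard-core engineering, have been hit hard, so many young people resent large models. And this is only the beginning. Graduates of elite universities are polarizing just the same.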

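Once upon a time, before large models, the whole US foresaw that the greatest future demand would be for data science: every business and product, large or small, had an enormous need for data work. So universities began adding DS programs, rapidly expanding bachelor's and master's curricula that sit between computer science and statistics, and online courses and degrees sprang up everywhere. Then large models arrived, and they analyze, summarize, and visualize data both faster and better than people do. It is a frightening tide of AI crushing human labor, and data science is the hardest-hit area.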

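An old friend who is a professor in the US says that data science killed statistics, and AI killed data science. There is now another trend in higher education called micro credentials: certificates of all kinds. Large numbers of "half baked potatoes" will crowd the job market, while people with full, formal training cannot find work. If certificate-holding technicians take all the jobs, is that not unfair to degree graduates? What motivation will students have to study? What is the use of graduate education?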




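Perhaps AI really should slow down. But no one can stop the competitive spiral of technology, inward and outward. Humanity cannot escape its own vicious circle. The traditional social mindset and values that treat career development as the guiding baton must change, but change takes a long time and needs supporting mechanisms, and no credible plan or action is in sight.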

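The large-model industry at the center of the vortex is itself locked in brutal competition. The previous revolution, mobile technology, already pushed the essential-needs sectors of food, clothing, housing, transport, entertainment, and communication to the limit, producing a batch of super apps: Meituan, Didi, WeChat, Douyin, Pinduoduo, and others. The remaining knowledge and artistic work is high-end demand: writing an article, composing a tune, painting a picture, making a video, or using an assistant is mostly icing on the cake. Perhaps one day these high-end demands will become essential, but for now they look a lot like pseudo-demand, which makes large-scale deployment especially difficult.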



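Most ironic of all, programmers, once honored as the aristocracy of wage earners, are also among the first to be hit. Having swept through data science jobs, AI is now putting the more mediocre CS graduates on the road to losing their jobs too. The US, as an IT superpower, has for years had a shortfall in training programmers; domestic CS graduates could not fill it, so it retained large numbers of graduates from India, China, and other countries. Those good days are also coming to an end.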

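I do not agree with the popular notion that earlier technological revolutions eliminated some jobs but also created new ones, so we can rest easy that this time will be the same. Times have changed, and most likely it will not be so. We must recognize that far more jobs will be eliminated than created. Meituan delivery riders and Didi drivers hold new jobs that appeared after the last mobile-platform wave swept away brick-and-mortar stores, but the cutthroat competition for these minimum-wage jobs shows that even here there are too many people for too little work, and everyone is struggling. The arrival of AI-driven Robo Taxi will gradually eliminate Didi drivers.
Rising productivity always comes with large-scale unemployment, and changes in industrial structure bring unemployment as well. Many of those who lose their jobs this way have essentially no hope of a comeback. This is so-called structural unemployment: large numbers of middle-aged people can only wait for retirement. Barring a miracle, employment for the young is also becoming ever more grim. What humanity must change and face is that not working will be the norm, and UBI must be built.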

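The UBI (Universal Basic Income) system must be put on the agenda, because productivity and GDP are not falling with the sharp contraction in employment; on the contrary, thanks to the technological revolution they are growing steadily or rapidly. Polarization must be curbed, and the dividends of the technological revolution must not be monopolized by a few. Otherwise the nation will cease to be a nation, the planet will cease to be a planet, and humanity will cease to be human.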




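This is the title of a recent ICML 2024 paper: "Case-Based or Rule-Based: How Do Transformers Do the Math?" The first author is Yi Hu, from the School of Physics at Peking University, who is about to join the Institute for Artificial Intelligence as a PhD student. It is an interesting piece of work.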

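The paper's first conclusion is that LLMs do not learn genuine reasoning rules; they achieve only limited generalization from similar cases. Reasoning rules learned by humans can extrapolate, whereas limited generalization from similar cases can only interpolate. Without extrapolation, regions where the training set has no similar cases become blind spots.
This experiment was done on GPT-2. It is understandable why the first experiment had to be done on GPT-2: the authors can control the training data and create data vacuums to test whether logical reasoning yields any extrapolation ability. But an ability that fails to appear in a "large" model of GPT-2's scale does not mean it will not "emerge" in truly large models.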
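The contrast between interpolation and extrapolation can be shown with a deliberately simple analogy. This is a sketch under loud assumptions and not the paper's experimental setup: a nearest-neighbor lookup plays the "case-based" learner, a least-squares line plays the "rule-based" learner, and the data follow the made-up rule y = 2x + 1. All function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one variable; returns (slope, intercept)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def nearest_case(xs, ys, query):
    """Case-based prediction: copy the answer of the closest training example."""
    closest = min(range(len(xs)), key=lambda k: abs(xs[k] - query))
    return ys[closest]


# Training cases cover only x = 0..10 and follow the hidden rule y = 2x + 1.
xs = list(range(11))
ys = [2 * x + 1 for x in xs]
slope, intercept = fit_line(xs, ys)


def rule(query):
    """Rule-based prediction: apply the fitted line."""
    return slope * query + intercept


for query in (4.6, 100):  # one point inside the training range, one far outside
    truth = 2 * query + 1
    print(query, "truth:", truth, "case-based:", nearest_case(xs, ys, query), "rule-based:", rule(query))
```

Inside the training range the two predictors are close, but far outside it the case-based predictor can only return the answer of its nearest stored case, while the fitted rule continues to track the true value. The region outside the cases is the blind spot described above.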

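The paper's later experiments were done on Llama, which is much larger than GPT-2, and seem to reach a contrary conclusion: if the model is large enough, only a small amount of task fine-tuning is needed for the LLM to learn something close to rule-based reasoning. Its performance on long-integer addition shows that the model not only interpolates but also extrapolates well.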


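Here is my view. In terms of sequence learning, data-driven model learning takes case-based induction (also called compression) as its starting point and backbone; there is no doubt about that. The question is whether case-based learning, at sufficient degree and scale, comes very close to rule-based learning. To accept the latter is to accept that large models have some logical reasoning ability. That large models have rudimentary logical reasoning ability was never in question within the mainstream large-model community; it is a tacit consensus, and logical reasoning is an important dimension of large-model evaluation. But in wider circles (outside the mainstream and among the general public), it has remained an open question.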

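A useful angle is how to understand extrapolation within generalization. For non-analytic phenomena with no corresponding symbolic rules, extrapolation is essentially uncomputable, which means it comes down to luck. The only way out is to collect relevant data, bring the blind spots onto the radar screen, and turn extrapolation into interpolation. But for highly regular data distributions with analytic solutions, extrapolation is a natural expectation of generalization; failing to meet it shows that the LLM is merely a parrot. Meeting it shows that the LLM has crossed the parrot threshold and learned some reasoning rule. It now appears that the leading large models have crossed this threshold, and continuing to compare them to parrots reveals humanity's blind arrogance.
In a recently noted study of the KAN model, KAN's AI for science experiments already showed how a model can approach analytic solutions in a data-driven way, in effect visualizing the internal process by which a model learns logical reasoning. The demonstration is vivid and quite convincing. Of course, the KAN experiments show that for simple analytic solutions, data-driven learning can approximate symbolic rules, but it does not readily yield the symbolic rules themselves. In the experiments, manual operations such as pruning were added to obtain the symbolic rules behind the data.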

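In contrast, deep learning heavyweight Yann LeCun flatly denies that GPT has logical reasoning ability. LeCun quotes: "AGI is a complete nonsense"; "GPT is a deadend"; and so on. Overcorrecting against the tide and speaking in absolutes is not a bad thing. But trusting him too readily may lead you into the ditch.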