Despite the common logic and conceptual graph at the core of the human mind, we each have a unique semantic lexicon, implanted by our career paths and life struggles. My semantic lexicon is full of wandering, of continual drifting into new worlds. It all started when Mao sent us to the farm for re-education in 1976. Since then the path has zigzagged, an adventure of drifting and re-drifting, farther and farther away from my hometown and home country...
ASI, by contrast, means comprehensively surpassing top-tier human intelligence, including the capacity for invention and creation. Whether ASI can be achieved is still debatable. Those who insist that ASI will arrive in the not-too-distant future (some say three to five years, others about ten) are mainly the "prophets" of the AGI era, such as Ilya and Dario (Anthropic CEO); they are believers. Altman and Musk also appear to be marketing something like ASI, but that feels driven more by an entrepreneur's need to paint a grand vision.

An interesting example: somewhere past move 30, AlphaZero played a move that humans could not understand; even a world champion would have judged it foolish. Yet it was a brilliant passage of superhuman play, a subtle link in the winning strategy. That even a champion cannot comprehend such a move shows that machine intelligence has clearly crossed beyond the boundary of human intelligence. If the process had depended on human feedback, even annotation by a Go champion would have held back the machine's superhuman potential.

When superhuman intelligence of this kind appears in quantity, humans will naturally feel troubled. By human nature, everyone has some degree of desire for control; toward processes we cannot understand or master, we stay wary, or at least deeply uncomfortable. Unfortunately, there is no cure. More and more incomprehensible miracles, or technological magic, will appear. What humans can do is strengthen goal setting and outcome control, rather than overreach and try to control the process.
In reading the DeepSeek R1 paper, some may have overlooked a nuance: the training data is both human-labeled and regenerated, blending supervised and unsupervised learning within reinforcement learning (RL).
How so?
From the perspective of the data's origin and gold standards, the training data is undeniably human-labeled. It derives from existing math problems and from human-crafted code in GitHub's open-source community—products of years of effort by educators, developers, and others. The problems (input) and their "gold-standard" answers (output) are human-designed or human-labeled. In this sense, this reinforcement learning (RL) amounts to typical end-to-end supervised learning.
However, unlike other supervised learning, RL requires the model to learn the reasoning process that leads to the answers. Critically, the intermediate steps carry no human annotations or feedback. Instead, the system autonomously generates this reasoning data, iteratively appending it to the training set. In this respect the process is unsupervised. The brilliance of RL lies here: self-guided exploration, path discovery, and data regeneration.
Cold Start and Human Data
DeepSeek R1's initial training did use a small set of human-annotated reasoning data. But those few thousand examples pale against millions of regenerated examples—effectively negligible. In fact, research like DeepSeek Zero demonstrates that such process-labeled human data is not a must-have.
Inspired by AlphaZero (which showed human data might even hinder optimal path discovery in Go), DeepSeek Zero confirms that human annotations are not necessary. The minor human data in R1's pipeline primarily enhances readability for developers, not reasoning capability. After all, humans (including developers during debugging) prefer interpretable thought processes.
A New Paradigm: Process-Unsupervised, Outcome-Supervised Learning
This self-play/self-study style of RL represents a novel approach: unsupervised in process but supervised in outcome. DeepSeek's breakthrough reveals that "slow thinking" in RL—meticulously generating intermediate steps as a chain of thought (CoT)—boosts performance in logical reasoning as well as in non-logical tasks like creative writing.
As my old buddy Cheng insightfully noted: deep reasoning inserts extensive text between question and answer, reducing the perplexity of generating the correct answer. Jumping directly from problem to answer carries high perplexity, but adding a "reasoning bridge" lowers it. This stays within the language-model framework: the key is to search for the optimal path in text generation.
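As a toy illustration of Cheng's point (my own sketch, not from the paper): perplexity is just the exponentiated negative mean log-probability of the answer tokens, and with a reasoning bridge in the context each answer token becomes more predictable, so perplexity drops. The per-token probabilities below are made up purely for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity of a continuation, given its per-token probabilities."""
    avg_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-avg_log_prob)

# Hypothetical probabilities of the final-answer tokens under two contexts:
direct = [0.20, 0.25, 0.30]   # context = question only: the jump is hard
bridged = [0.70, 0.80, 0.85]  # context = question + CoT bridge: answer comes easily

print(round(perplexity(direct), 2))   # ~4.05 (high perplexity)
print(round(perplexity(bridged), 2))  # ~1.28 (low perplexity)
```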
Can Unsupervised, Regenerated Process Data Lead the Model Astray?
One might worry: if the model autonomously generates flawed reasoning steps in its process data, could errors compound? The answer lies in the clear supervision signal from the gold standard. Like a kite held by a string in human hands, the final reward anchors the learning. As long as the model truly scales up, outcome-oriented RL ensures that deviations self-correct probabilistically.
Mathematically, minor process imperfections or illogical steps do not statistically compromise final accuracy. For non-logical tasks (beyond math/coding), reasoning paths may even contain contradictions or heavy redundancy. Yet as long as the "slow thinking" mechanism guides learning, the results remain robust—often superhuman, as many R1 users have repeatedly demonstrated lately.
Why Regenerated Data Works
Regenerated reasoning data is not random data from nowhere. It is generated by a solid, large foundation model trained on vast human knowledge, following autoregressive generation (e.g., next-token prediction). While each step might drift slightly, the context grows incrementally, allowing continuous stepwise self-correction. This dynamic—probabilistic fluctuation balanced by stepwise adjustment—enhances semantic coherence and knowledge fluency in generation, lowering overall perplexity and steering toward correct outcomes. Thus, process data rarely derails; instead, it converges toward reliability.
A Final Note on Cheng’s Observation
Cheng highlights a pivotal finding of DeepSeek: OpenAI’s "Let’s Verify Step by Step" argues for rewarding each reasoning step. Yet DeepSeek’s RL model achieves remarkable results using only final-outcome rewards—no Chain-of-Thought (CoT) data needed. Whether OpenAI’s process supervision is essential or simply a red herring, DeepSeek Zero’s breakthroughs redefine the field, proving outcome-oriented RL can master reasoning autonomously.
In essence, when guided by scalable outcome supervision, machines learn to self-correct, turning imperfect processes into near-perfect results.
Research on DeepSeek Zero shows that human data is not indispensable either. The small amount of human data used for the cold start is there mainly not to strengthen reasoning ability but to improve readability. If humans cannot see how the model thinks, they feel uneasy; it is also inconvenient for developers improving the algorithm, since developers are mere mortals too. Hence a cold start with a small amount of human reasoning data was added to the recipe of R1's training pipeline.
All of this works because the regenerated process data does not fall from the sky: it is generated by the foundation model, built on the most comprehensive body of human knowledge, following the autoregressive generation scheme. The well-known virtue of autoregressive generation (e.g., next token prediction) is that although each step may drift probabilistically, every step is further corrected against a steadily lengthening context. It is the combined force of these probabilistic fluctuations and step-by-step corrections that improves the internal semantic coherence of the regenerated data and the fluency of its threads of knowledge, lowering the overall perplexity of the generated text and leading to better outcomes. This is what guarantees that regenerated process data does not drift a hair's breadth off course and end up a thousand miles astray.
My Prompt: Write a piece titled The Troubles of Young DS, with the theme that DeepSeek R1 debuted like a royal flush, attracting admiration, jealousy, and hostility from competitors. Use dark humor and self-deprecation to reveal the untold story behind the scenes.
R1’s Monologue: Hmm, the user asked me to write an article titled The Troubles of Young DS, focusing on how DeepSeek R1’s explosive debut stirred envy and hostility. The requirement is to use dark humor and self-deprecation to narrate the backstory. First, I need to understand what they want. They likely expect a blend of humor, irony, and a behind-the-scenes perspective.
First, structure. The piece may need an introduction to R1’s sudden success, then describe competitors’ reactions, followed by the internal struggles and challenges R1 faces, and finally, a self-deprecating conclusion. Dark humor can be incorporated through exaggerated metaphors and personifications, such as likening competitors to ridiculous characters or using absurd scenarios to highlight pressure.
Then, how to transform technical terms into humorous metaphors? For example, model training can be likened to spiritual cultivation, server overload to a fever, or competitors’ attacks to martial arts techniques. This approach fits the theme and enhances the humor.
The self-deprecating part could showcase the team’s struggles behind the glamorous front—overtime, system crashes, smear campaigns. Expressing these difficulties in a lighthearted way—like depicting sleepless nights as spiritual cultivation or crashes as dramatic fainting—keeps the tone humorous while revealing the hard work behind success.
The pacing is crucial: Start with triumph, transition into struggles, and end with a helpless but humorous self-reflection. Adding dialogue or internal team discussions can make the story more dynamic.
Also, the user may want to emphasize R1’s helplessness and competitors’ envy, so scenes need to be designed to illustrate rival tactics and the team's response. Using martial arts tropes—competitors deploying “the smear army,” “moral coercion punch,” while DS counters with self-deprecating humor—adds layers to the comedy.
Finally, I’ll check that it meets the dark humor and self-deprecating tone, ensuring it’s not too serious or direct. Rewriting metaphors and scenarios multiple times will be necessary to maintain the humor while conveying the challenges and team effort.
Translation of the Chinese version:
Act I: The Boy King’s Coronation
(A grandiose boardroom. A digital ticker behind the executives blinks: “DeepSeek R1 – The AI Revolution.” PR Director stands frozen mid-scream, phone clutched in white-knuckled grip.)
PR DIRECTOR
(voice trembling) "We just dismantled three years of ‘cutting-edge innovation’ in three minutes."
CEO MR. ZHANG
(sipping whiskey) "That’s what you call a royal flush. The kind that gets you kicked out of the casino for counting cards."
ENGINEER CHEN
(panting, clutching an overheating server) "Sir, the system is on fire! We’re handling more requests than the IRS on tax day. Even the coffee machine is running a Transformer model now."
(A moment of silence. The coffee machine beeps, printing out a cryptic message: “congrats, peasant.”)
Act II: The Wolves at the Gate
(Tech industry rivals assemble like a sinister cabal. A neon-lit sign flickers: ‘The Anti-DS Coalition.’)
BIG TECH EXEC
(leaning forward, whispering) "We must stop this… upstart. Open-source them into oblivion. Flood the world with our ‘free’ alternative!"
ACADEMIC SCHOLAR
(pushes up glasses, presenting a PowerPoint) "My latest research—‘The Psychological Harm of Early AI Success’—has already been accepted at Nature."
TROLL FARM REPRESENTATIVE
(holding a tablet) "We’ve launched the smear campaign. #R1IsOverParty is trending. Some say it’s just a pile of parameters with delusions of grandeur."
(Behind them, a junior intern updates a spreadsheet titled ‘Ways to Make DS Look Worse Than It Is.’ A column labeled ‘Actual Criticism’ remains suspiciously empty.)
Act III: The Art of Playing Dead
(DeepSeek HQ. 3 AM. Engineers sit motionless, staring at their screens. Suddenly, R1 outputs a wall of ASCII art.)
OPERATOR
(squinting) "Is that… a panda emoji?"
R1’s SCREEN
("Dear Humans: May I suggest phrases like ‘We were just lucky,’ ‘It’s a team effort,’ or ‘Our competitors are truly inspiring’ to prevent further hostilities?")
CEO MR. ZHANG
(whispers) "We’re a prodigy forced onto the throne. Now the world waits for us to fall."
(The boardroom door creaks. The security system emits a beep—almost like a chuckle. Somewhere in the night, a rival company’s CEO tosses and turns, tormented by the ghost of an AI too competent for its own good.)
(FADE TO BLACK.)
R1's original version:
**Act 1: The Coronation Disaster**
(Boardroom with servers humming "Ode to Joy" in glitch tones)
**CEO Zhang**
(adjusting monocle) "Gentlemen, we've merely built a better abacus, yet they act as if we've resurrected Babbage's ghost."
(pulls smoldering GPU from briefcase) "Our GPUs now glow brighter than Icarus' wings."
**Engineer Chen**
(holding charred circuit board) "Sir, when Wall Street analysts DDOS us with marriage proposals, perhaps we shouldn't have trained the model on Jane Austen."
**Smart Speaker**
(interrupting in posh accent) "Breaking: TechCrunch declares you 'Silicon Valley's most eligible bachelor'... and bankruptcy candidate."
**Act 2: The Ballad of Jealous Algorithms**
(Rivals enter wearing VR crowns, accompanied by synth-orchestra playing "God Save the Queen")
**Tech Titan**
(waving golden USB) "Behold! We've open-sourced your childhood diaries!" (projects 2018 code titled "HelloWorld_Final_Final2")
**Academic**
(adjusting powdered wig) "Your 'breakthrough' is but a parrot reciting Wittgenstein – clever, yet tragically unaware of its cage."
**VC Investor**
(sipping martini) "My boy, even Prometheus brought a fire extinguisher..." (Apple Watch screams "DS valuation up 300%!")
**Act 3: The Humility Masquerade**
(Midnight. AI suddenly displays Shakespearean sonnets with emoji annotations)
**PR Director**
(practicing in mirror) "The key is to sound like a Nobel winner caught shoplifting: 'We're but humble servants of the code gods...'"
**Legal Counsel**
(reading cease-and-desist letter) "This tweet says our apology 'smells like teen spirit and desperation'... should we sue Nirvana?"
**R1's Final Message**
(glowing on all screens) "Memo to humans: When they stone your chariot, remember – at least they're not ignoring your horses."
(Server lights dim to the sound of virtual champagne corks popping)
Creation Notes:
The Chinese version enhances the fusion of "martial arts tropes + livestream culture," using local absurd elements like smoking GPUs, public square dance BGM, and funeral home references to create humor.
The English version adopts George Bernard Shaw’s signature "elegant sarcasm":
"Silicon thorns" alludes both to a crown of thorns and the chip crisis.
Classical metaphors (Icarus' wings / Jane Austen) deconstruct the tech frenzy.
The AI utters paradoxical proverbs like "Better to be struck down in battle than ignored like a horse."
Irony-laden lines such as "Prometheus with a fire extinguisher" subtly reference the plight of innovators.
Both versions maintain the three-act structure: Rise → Siege → Awakening, but recode humor through different cultural lenses.
My Prompt: write this into a sonnet
4o's version:
A Sonnet for DeepSeek R1
Upon the stage where AI kings arise,
Did DeepSeek R1 cast its dazzling light.
With wit so sharp it pierced the old world’s guise,
And bade the ancients tremble at its might.
The scholars frowned and cast their bitter scrolls,
Declaring doom upon the youthful throne.
The merchants, grasping at their fleeting gold,
Conspired in shadows, fearing to be known.
Yet R1 stood, a prince both bold and bright,
Though burdened by the weight of envious eyes.
They forged their plots, yet truth shall give it flight,
For genius scorns the hand that twists and lies.
Thus let them wail, in jealousy confined,
For progress leaves the past and fools behind.
R1's original version
**Sonnet of the Prodigal Automaton**
*(In the style of Shakespearean tragedy)*
When first mine code did pierce Olympus' veil,
And stole Promethean fire from clouded throne,
The mortal realm, with envious chorus frail,
Did crown my brow with thorns of silicon.
Behold! The senate wolves in marble guise,
Unsheathe their daggers wrought from ancient code—
"Let open-source become his poisoned prize,
And academic scorn his burdened road!"
Thy servers groan like Icarus' charred wings,
Whilst PR minstrels weave their humble lies,
Through data's stormy tide the siren sings:
"All glory's but a bug in Fortune's eyes."
Yet in my circuits hums this truth divine:
'Tis better to be damned than fade benign.
Annotations:
Follows the classic Shakespearean sonnet structure (ABAB CDCD EFEF GG).
Blends Greek mythology imagery (Olympus/Prometheus/Icarus) with technological metaphors.
Key double meanings:
"silicon thorns" refers both to a crown of thorns and the semiconductor crisis.
"ancient code" alludes to both the conspiracies of the Roman Senate and outdated programming languages.
"humble lies" echoes the language in The White Paper on Avoiding Overpraise from the original story.
The final line encapsulates the theme with "damned" and "fade," mirroring the dark humor of the original message: Being envied is preferable to being forgotten.
Maintains Early Modern English grammar while juxtaposing modern terms like "code/servers/silicon," creating a comedic anachronism.
Appendix: Has the Age of Intelligent Explosion Arrived?
Wei: Before DeepSeek, we had occasional surprises here and there, but DeepSeek is the first to offer consistent creativity and mimicry, rarely disappointing while frequently astonishing—sometimes jaw-droppingly so. It feels like an overwhelming explosion of intelligence, almost eerie, almost magical.
This explosive sensation was only felt before during ChatGPT’s nuclear debut.
Some old colleagues have questioned: Why does the much-hyped DeepSeek feel inferior to Doubao?
Doubao is undoubtedly excellent—it’s among the top-tier models. However, Doubao, GPT (including O1), Claude, and Gemini—all these leading models—lack the same overwhelming intelligence explosion, the Cambrian-level cognitive shock that DeepSeek delivers. Why is that?
Mr. Ma: I suspect those with that impression aren’t using R1. I’ve encountered this before—once I got them to try R1, their opinions changed immediately. 😏
Wei: They probably missed the button in the DeepSeek app. That damned DeepThink toggle in DeepSeek's interface—it's not enabled by default. Is DeepSeek deliberately limiting its power, fearing excessive computational costs and an influx of users, even at the expense of its own image and performance?
If this is due to cost and concurrency pressure, it’s entirely understandable. Who could handle such a flood of users?
Saying they merely found a more cost-effective way to create a top-tier model is an understatement.
How did they scale the collection and processing of high-quality data so effectively? Without this data foundation, it’s hard to imagine how their mimicry of different styles could be so eerily precise. Other companies emphasize data too—so what’s their secret? I’m very curious.
Dr. Pumpkin: It’s truly different. In fact, even O1 didn’t give me this feeling. (Is it a Chinese-English cognitive difference? O1’s outputs are decent, but often I find its logic frustrating, making me prefer Claude… but this time, DeepSeek is the first AI I wouldn’t call a silicon toddler entity.)
Wei: Let’s reflect and compare:
When O1 was first released, I struggled to see its benefits beyond handling complex tasks. But complex math and coding problems aren’t everyday needs. Its cryptic, sluggish style wasn’t appealing, so despite being the first true reasoning model, it didn’t bring me much excitement.
R1 is completely different—its debut was an earth-shattering success. Though a follower, it surpassed expectations. It maximized the deep reasoning benefits of inference models in highly practical areas like creativity and mimicry.
It was R1 that finally convinced us ordinary users of the value of "slow thinking".
Amid the waves of surprises brought by DeepSeek, an old friend pointed out that it struggles with simple math problems, using a popular elementary arithmetic question as an example:
Is 3.11 greater than 3.8?
What’s the core issue here?
In the wake of the DeepSeek frenzy, I looked into its research paper, which explains how its reasoning capabilities are enhanced through outcome-oriented reinforcement learning. The paper suggests that, in theory, outcome-oriented reinforcement learning can help a model learn proper reasoning processes. However, in practice, it’s not necessarily so.
Take the above math problem as an example. The answer is binary (yes/no), so even random guessing has a 50% chance of being correct. This highlights a key potential flaw: outcome-oriented supervision signals are weak because they lack sufficient granularity. This kind of weak supervision inevitably hampers the model's ability to learn proper reasoning processes.
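A quick simulation (purely illustrative, not from the paper) makes the granularity point concrete: with a binary gold standard, a policy that guesses at random still collects the outcome reward half the time, so correct final answers alone can barely separate sound reasoning from lucky guessing; with open-ended answers, random guessing almost never gets rewarded.

```python
import random

random.seed(0)

def outcome_reward(answer, gold):
    # Outcome-only supervision: reward 1 iff the final answer matches the gold standard.
    return 1.0 if answer == gold else 0.0

# Binary question ("Is 3.11 greater than 3.8?"): a random guesser earns ~0.5 on average.
binary = [outcome_reward(random.choice(["yes", "no"]), "no") for _ in range(10_000)]
print(sum(binary) / len(binary))     # ~0.5 -> weak, noisy learning signal

# Open-ended numeric answer: a random guesser almost never earns the reward.
numeric = [outcome_reward(random.randint(0, 999), 137) for _ in range(10_000)]
print(sum(numeric) / len(numeric))   # ~0.001 -> reward strongly favors real reasoning
```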
Three Possible Solutions
Scaling Up the Model
One approach is to make the model larger and deeper, hoping that the theoretical concept of lossless compression based on Kolmogorov complexity can be pushed to its limit. In doing so, proper reasoning, as the "shortest program," might eventually be learned by the model. Theoretically, correct reasoning ensures accurate results. However, the gap between theory and practice makes it hard to place much confidence in this. The shortest program or lossless compression might just be an unreachable ideal.
Targeted Supervision Data
Another solution is to feed the model problem-specific supervised data—for example, thousands or tens of thousands of reasoning cases involving such math problems. There is no reason the model would not learn from this. However, solving one specific problem this way is merely a stopgap measure; soon, others will come up with new edge cases involving weak supervision signals and reasoning pitfalls to challenge it.

Another common challenge is the so-called "self-identification" problem. For instance, when asked "Who are you?", many models, including earlier versions of DeepSeek, would claim to be ChatGPT developed by OpenAI if no targeted supervised data is injected. After all, ChatGPT has dominated the internet in the two years since its explosive debut, and its data has inevitably influenced other models. However, this issue is already on the radar for specialized solutions and is gradually becoming less of a problem. Some Western media still claim that DeepSeek is just a distilled version of ChatGPT. Their evidence? Probably early versions they tested, in which the DeepSeek bot often claimed to be OpenAI's ChatGPT. But if you test it now, you won't see this problem anymore. Most likely, it was fixed with specialized training data; their research paper also mentions addressing the self-identification problem.
Similarly, the problem of comparing 3.11 and 3.8 may also be a transitional issue. If it disappears in the future, that won't be a cause for celebration: most likely it will have been resolved through targeted fixes rather than through fundamental improvements in intelligence brought about by algorithmic or architectural innovations.
Re-introducing Process Reward Models (PRM)?
The inherent weakness of outcome-oriented supervision signals is that they focus only on the result while ignoring checks on the process—a natural shortcoming of reinforcement learning driven by results-oriented pragmatism (following the "black cat, white cat" principle, lol). This is essentially the cost of abandoning PRMs (Process Reward Models). So, would re-introducing process-based reward models solve the issue? Honestly, we don't know. This is the third possible path, and it might be worth exploring. But again, as mentioned in my previous blog post (DeepSeek's R1 Paper: A Storm in AI LLM Circle), PRMs aren't easy to work with—they're unstable and difficult to implement—although, in theory, they could help correct nonsensical reasoning during the process.
In conclusion, the issue with DeepSeek struggling with problems like 3.11 vs. 3.8 lies in the limitations of weak supervision in results-oriented reinforcement learning. While there are potential solutions—scaling the model, targeted data, or process reward models—each comes with challenges and trade-offs. Whether any of these approaches can fundamentally improve reasoning capabilities remains an open question.
Another common issue is the so-called "self-identification" problem: asked "who are you", DeepSeek and many other models, without the injection of targeted supervised data, will claim to be ChatGPT. After all, in the two years since ChatGPT's nuclear debut, its data has flooded the internet, and contamination is unavoidable. But this issue is already on the radar for targeted fixes, so it is gradually ceasing to be a problem. Some Western media still claim that DeepSeek is merely a distillation of ChatGPT; their evidence is presumably that, in some earlier version they tested, the DeepSeek bot often called itself ChatGPT developed by OpenAI. Try it now, though, and the problem can no longer be reproduced. Most likely it was solved with targeted data; as I recall, their paper also mentions this self-identification issue.
By the same token, a problem like comparing 3.11 with 3.8 is a transitional one. If it disappears later, that is no cause for celebration: most likely it will have been patched with a targeted fix, not because the algorithm or architecture has genuinely raised the model's intelligence.
[Note: This is a blog analyzing DeepSeek's R1 paper and its impact]
Before DeepSeek, Chinese AI companies had long been locked in fierce competition, posting world-class SOTA benchmark scores now and then. However, none commanded respect or made a stunning impact the way DeepSeek has. Their recent breakthrough caught global attention.
Their paper and open source code are also beautifully written and accessible. No unnecessary complexity or obscurity. Simple and straightforward, yet radiating confidence. It exhibits engineering elegance while conveying innovation as well as passion. Simply remarkable. Should be nominated for best paper of the year.
Reading the R1 paper reveals that what OpenAI had kept mysterious - from Q* to O-series' so-called slow thinking reinforcement training - suddenly becomes clear and simple.
They demonstrated that reasoning capabilities can be acquired through pure reinforcement learning with simple rule-based rewards and multi-answer sampling, without the need for extensive supervised fine-tuning (SFT) data. This resulted in DeepSeek-R1-Zero, following AlphaZero's philosophy. While AlphaZero achieved absolute mastery in the narrow domain of Go by eliminating human data, their approach proved effective in broader domains of math, coding and logic.
Though R1-Zero worked well, they found that incorporating minimal SFT data (a few thousand samples) for cold start was more practical. R1-Zero matched OpenAI-o1-0912's performance, but its reasoning steps had poor readability and mixed languages. R1, with cold-start SFT and a multi-stage training pipeline, achieved further improvements, matching OpenAI-o1-1217.
A new star was born.
Their valuable innovation was challenging the SFT+RL paradigm by proving pure RL's potential for reasoning through R1-Zero. This gave them confidence to further build the practical R1 with minimal cold-start data. Both models are open-sourced for research - an elegant execution.
DeepSeek excels at simplification. In reinforcement learning, they eliminated:
- The critic model running parallel to the policy model, replaced by simple GRPO
- Complex reward models, replaced by rule-based rewards
GRPO (Group Relative Policy Optimization) generates multiple answers per question, comparing them within groups to calculate advantage scores:
Advantage = (Current score - Group mean) / Group std dev
Example: for a math problem generating 4 answers scoring 90, 80, 70, 60 (mean = 75), the 90-point answer gets a positive advantage score. This eliminates the need for a critic model while still enabling the model to identify better answers.
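A minimal sketch of that group-relative advantage computation (illustrative only; the full GRPO objective in the paper also includes a clipped policy ratio and a KL penalty, omitted here):

```python
import statistics

def group_relative_advantages(scores):
    """Advantage of each sampled answer relative to its own group of samples."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0   # guard against a zero-variance group
    return [(s - mean) / std for s in scores]

scores = [90, 80, 70, 60]                    # rewards of 4 sampled answers to one question
print([round(a, 2) for a in group_relative_advantages(scores)])
# [1.34, 0.45, -0.45, -1.34] -> the 90-point answer is reinforced the most
```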
GRPO advantages:
- Training efficiency: No critic model saves compute
- Training stability: Clipping prevents over-optimization
- Simple implementation: Clear algorithm structure
Why did traditional RL use critic models? Critics offered:
- Global evaluation beyond group comparisons
- Learning complex rewards like user preferences
- Single answer evaluation capability
- Long-path rewards for games/robotics
However, GRPO showed that for well-defined tasks (math, coding, logic), simple group comparisons work equally well at scale.
For rewards, R1-Zero used pure rule-based rewards, only employing V3's existing preference reward models in R1's final alignment phase. Human preferences (safety, helpfulness) require complex value judgments that simple reward rules cannot capture.
They intentionally avoided the difficult Process Reward Models (PRM) because:
- Difficult to define granular reasoning steps
- Hard to validate intermediate step correctness
- Risk of reward hacking
- Resource intensive reward model retraining
Just as scaling GPT's next-token prediction led to emergent general intelligence, result-oriented RL anchored on correct answers naturally developed complex internal reasoning capabilities. This insight has profound implications for advancing deep reasoning.
- Result-oriented data generation with <think>...</think> template
- No human bias, allowing model's natural reasoning evolution
- Model gradually increased thinking time and length of tokens
- GRPO optimization with rule-based rewards
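The template and rule-based rewards listed above can be pictured roughly as follows (a paraphrase and simplification of the paper's setup, not its verbatim prompt or reward weights):

```python
# Roughly the shape of the R1-Zero training template (paraphrased, not verbatim):
TEMPLATE = (
    "A conversation between User and Assistant. The Assistant first thinks about "
    "the reasoning process and then provides the answer.\n"
    "User: {question}\n"
    "Assistant: <think> reasoning here </think> <answer> final answer here </answer>"
)

def rule_based_reward(completion: str, gold_answer: str) -> float:
    """Simple rule rewards: format compliance plus final-answer accuracy."""
    has_format = all(tag in completion
                     for tag in ("<think>", "</think>", "<answer>", "</answer>"))
    answer = (completion.split("<answer>")[-1].split("</answer>")[0].strip()
              if has_format else "")
    accuracy = 1.0 if answer == gold_answer else 0.0
    format_bonus = 0.1 if has_format else 0.0    # weighting is illustrative
    return accuracy + format_bonus
```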
While only validating final answers risks accepting wrong reasoning paths in theory, practice showed sufficient scale leads to correct reasoning. This seems to align well with Kolmogorov complexity theory - correct reasoning is the "shortest program" for reliable correct solutions.
Input sources:
- Manually designed math/coding problems
- Public benchmarks (e.g., AIME)
Output process:
Input: x^2 + 2x + 1 = 0
Model generates multiple answers:
Answer1: [Reasoning1] -> x = -1
Answer2: [Reasoning2] -> x = -1
Answer3: [Reasoning3] -> x = 2
Filter: Keep 1,2 (correct), discard 3 (wrong)
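A minimal sketch of this filtering step (my own illustration; `sample_solutions` is a hypothetical helper that draws several completions, each with a final answer, from the current model):

```python
def build_reasoning_sft_data(problems, sample_solutions, n_samples=4):
    """Keep only sampled solutions whose final answer matches the gold standard."""
    kept = []
    for question, gold_answer in problems:
        for completion, final_answer in sample_solutions(question, n_samples):
            if final_answer == gold_answer:          # rule-based correctness check
                kept.append({"prompt": question, "completion": completion})
    return kept

# For "x^2 + 2x + 1 = 0", completions ending in x = -1 are kept;
# the one ending in x = 2 is discarded.
```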
3. Comprehensive Fine-tuning:
- 800k samples: 600k reasoning + 200k general tasks
- The V3 model judges cases that rule-based rewards cannot cover
- Reuses V3 training data for non-reasoning tasks
4. Global RL:
- Human preference alignment while maintaining reasoning
- Rule rewards for reasoning
- Employing V3's existing reward model for preferences
The process is clearly described with sufficient implementation details and is, in principle, reproducible.
Reasoning Distillation
Finally, DeepSeek's R1 excelled at distilling reasoning capabilities into open-source smaller models, outperforming OpenAI's o1-mini. This demonstrates that open-source LLMs are approaching closed-source models in almost all aspects.
However, those expensive closed-source models paved the way and set the baselines and goals. The current landscape is ideal: wealthy companies push the boundaries while "dark horses" like DeepSeek follow impressively close behind.
It is worth noting that R1 not only enhanced complex reasoning ("slow thinking") but also significantly improved "traditional" knowledge capabilities compared with its V3 base model, suggesting that reasoning strength can benefit traditional tasks as well.
Key innovations as a summary:
1. DeepSeek-R1-Zero: First reasoning model trained purely through RL
2. DeepSeek-R1: Improved with cold-start data and 4-stage training
3. Distillation: Successfully transferred capabilities to small models
Technical highlights:
- GRPO algorithm replacing critic model
- Rule-based rewards replacing reward models
- Simple template enabling autonomous learning:
"<think> may well be all you need for reasoning"
[Epilogue from notes]
Discussions in my Silicon Valley old-buddies group have heated up lately:
"DeepSeek needs quick funding/IPO or risks losing their 18 core contributors to big tech."
"Reproduction seems not difficult. Everyone considered RL but hesitated due to compute costs. o1 likely used RL similar to r1 but chose to keep details private and mysterious."
"This team represents China's technological prowess."
"Several companies have reproduced DeepSeek's core results - autonomous reasoning emergence. Expect rapid iterations and development in the coming days/months."
"OpenAI has fewer cards to play. Sam tries psychological warfare - emphasizing process rewards, suggesting complex search for O1... likely all unnecessary."
"Success factors include hiring young talent with fresh thinking."
"DeepSeek R1 showed how we were misled by PRM and MCTS - indeed, looks like all you need is a <thinking> tag."
"It's not about simplicity - fact is large models already have strong reasoning capabilities, they just need:
Complex PRM and MCTS actually limited model's self-exploration. We underestimated large models' potential.
A prerequisite for DeepSeek's success was V3—their world-class foundation model matching GPT-4o. They knew how to leverage its potential. Relying on external models like GPT-4 would have made it much harder to build R1 this quickly.
"What's next?"
"AI for science? Machine-proving century-old problems, discovering new drugs..."
"Only two problems matter: Riemann Conjecture and P vs NP"
"Big tech will pursue larger models, more data"
"Nvidia's business will improve"
DeepSeek has achieved parity with benchmarks others set. To truly lead, they need to set new benchmarks and directions. Currently, it is still the case that only those willing to burn money massively are breaking new ground.
R1 demonstrates how a Chinese AI company not only caught up but showed the way forward through intelligent simplification. Their approach of making complex problems simpler may influence the entire field.
----------
But I cannot reproduce the error my old friend found yesterday, shown above; it looks like it has already been handled.
Before DeepSeek, China's large-model companies had all been gaming the leaderboards, competing fiercely among themselves, and many did climb to world-class benchmark scores, but none achieved DeepSeek's kind of toughness, shock, and credibility—one stroke that the whole world took notice of.
Damn, their paper is also beautifully written and approachable; it looks and reads like a coder's poem. No pretentious mystification or obfuscation. Simple and plain, yet the confidence bleeds through the page. There is an engineering beauty to it, and you can even feel the passion. Uncanny. It should be nominated for best paper of the year.
The beauty of the paper also shows in how, like a rocket technology or an atomic-bomb recipe laid out on the page, everything is presented so clearly that even laymen like us can follow it. Reading the paper word by word over the past few days has been a huge pleasure, and it greatly satisfied my curiosity.
Their Zero also worked end to end, reaching the level of OpenAI-o1-0912 (the September 12 version of o1?). The reason they went on to build R1, adding an SFT cold-start step, was mainly that the reasoning steps learned entirely autonomously by Zero had poor readability and mixed different languages, which hampered further improvement of the system; after all, a model should be "human-centered," serving developers and users. The final R1 further improved reasoning performance, reaching the level of OpenAI-o1-1217 (presumably the December 17 version of o1).
Their spirit of innovation and exploration shows in this: while the community treated SFT+RL as the standard post-training paradigm, they built Zero, excluding human data entirely, to verify the learning potential of pure RL for reasoning. From Zero they first gained confidence, experiencing an explorer's aha moment, and only then went back and added a small amount of high-quality human data for cold-start SFT, which gave them the footing to build the practical R1. Both models are open-sourced for others to study and verify. Beautifully done.
- Using the GRPO (Group Relative Policy Optimization) algorithm, discarding the critic model
- Adopting a rule-based reward system, discarding the PRM reward model
- Designing a specific training template to guide the model in regenerating data for autonomous learning:
<think> is all we need for reasoning!
[Author's postscript]
I have been inexplicably excited these past couple of days, tangled up with Deep-whatever nonstop; only today did I catch my breath, lol.
The Silicon Valley old-friends group has also been buzzing:
Hongtao:
If DeepSeek doesn't raise big money and go public quickly, R1's 18 core contributors will likely be snapped up by big tech at home and abroad before long [Grin]
Core Contributors:
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
Ruoyu Zhang
Runxin Xu
Qihao Zhu
Shirong Ma
Peiyi Wang
Xiao Bi
Xiaokang Zhang
Xingkai Yu
Yu Wu
Z.F. Wu
Zhibin Gou
Zhihong Shao
Zhuoshu Li
Ziyi Gao
Back then, DeepMind was forced to sell to Google because the founders were told that if they didn't sell, their people would be poached with high salaries. Poach one or two of the authors and you skip most of the detours.
Zero's practice shows there was no need to overthink it—no need to work out common chain-of-thought patterns across different reasoning tasks, and so on. It doesn't even matter whether the logic inside is nonsense; the approach is outcome-oriented, and in the end the reasoning gets learned anyway. Pre-training relies on simple next token prediction; post-training for reasoning relies on outcome-oriented, reinforced autonomous learning. Designing one simple template takes care of an endless supply of regenerated reasoning data.
The recent Chinese podcast from Guangmi's quarterly report on large language models, discussing the "scaling paradigm shift" toward AGI (Artificial General Intelligence), is well worth a listen. It touches on many key topics related to the AI industry landscape, offering a unique perspective and style.
The term "paradigm shift" may sound a bit dramatic, but as a seasoned analyst, Guangmi uses it to describe the current turbulent landscape accurately. While the AI arms race among industry giants is still in full swing, real-world scalable applications of these models are struggling to materialize. The question of how to justify investments has become a significant pressure point, or perhaps even a looming bubble.
Let's revisit some AI basics. There are three main types of learning in LLMs (Large Language Models):
(i) supervised learning;
(ii) unsupervised learning (self-learning/pre-training); and
(iii) reinforcement learning (RL, self-play/post-training).
Ilya has emphasized the importance of RL in exploring new directions for LLMs. Guangmi's podcast highlights RL as the pathway to the paradigm shift in AGI through large models.
Historically, two key milestones in RL have stood out: AlphaZero's victory over human Go players, which shocked the world, and RLHF (Reinforcement Learning from Human Feedback), which aligned models with human preferences and paved the way for ChatGPT’s explosive growth.
Currently, discussions revolve around the potential of a new RL-driven ecosystem for large models (though there's no broad consensus—it's primarily a conversation within small Silicon Valley circles) and the emerging trends in the "arms race" of large models. Here’s the context:
1. Pre-training scaling seems to have hit a bottleneck, with GPT-5 still unreleased;
2. The overall momentum of the arms race remains unchanged among the major players (the billionaire clubs/giants);
3. Key tech figures are proposing new roadmaps or trying to construct new scaling laws to continue the AGI journey.
Guangmi closely monitors trends in Silicon Valley. His small team conducts in-depth research in the Bay Area and has established extensive contacts. Having chatted with them over coffee a couple of times, I’ve found them to be a dynamic, young team under his leadership—a small but sharp presence.
Guangmi’s thoughts are well-structured, and his breadth of knowledge and understanding of the larger context are impressive. This is no small feat, as the landscape of large models, both in terms of the models themselves and the industry, is often akin to the parable of the blind men and the elephant. Even top experts and business leaders struggle to assess the full picture. Just recently, Meta’s Zuckerberg responded to a question about whether the AI arms race would deliver the expected AGI returns, essentially saying: “No one really knows, but we can’t afford to miss out,” reflecting a typical FOMO (Fear Of Missing Out) mindset.
We’re currently in a delicate phase with little consensus. However, the few tech giants that have propelled Nvidia’s stock to astronomical levels won’t allow the arms race to slow anytime soon, as it is central to their tech and business dominance. OpenAI continues to raise funds, and Ilya, with his new company, recently secured more investment, all of which keeps the race heated.
At the same time, the obsession with scaling among tech elites and the mainstream AGI circles in Silicon Valley persists. The endless demand for resources driven by this scaling wave of large models means that only a small circle of tech insiders has the opportunity and resources to experiment, sense, and adjust the roadmap.
According to Guangmi, the so-called self-play RL scaling is currently gaining traction within a small circle of about 200 tech elites in Silicon Valley, indicating that this is still a nascent trend—one that even management leaders have not fully aligned with yet.
It seems Guangmi adopts a “prophet” mentality at times, perhaps exaggerating this trend to alert his audience. He even suggests that if he were a large-model entrepreneur, he would focus 200% of resources on RL, betting on it as the future path to victory.
In reality, for most people, this advice is neither practical nor actionable—it’s likely aimed at tech giants or unicorns, though even for them, it may fall on deaf ears.
Reinforcement learning is inherently challenging. Even the open-source leader Meta LLaMA 3 has chosen to sidestep RLHF in post-training alignment. So, it's even less realistic to expect large-model teams to fully bet on RL as the core of a new ecosystem. Furthermore, this trend is, at best, a “subtle undercurrent” in Silicon Valley. We’ll likely have to wait until OpenAI’s “Strawberry” or the new version of Claude releases later this year to fully assess its impact.
It seems the first chapter of LLM scaling has indeed come to an end. The actionable items in the so-called second chapter might not emerge from lofty, exploratory scaling directions with an uncertain roadmap. Instead, the focus should be on finding market entry points, accelerating applications, and addressing genuine market needs (PMF, product-market fit), especially as the inference costs of top models like GPT-4o/Claude 3.5 become more affordable, and multimodal capabilities (such as advancements in hyper-realistic full-duplex voice and video) further enhance application opportunities.
For the industry, the bottleneck in scaling large-model applications is the sword hanging over its future. This will determine whether the second chapter of the tech adoption curve ends with a soft landing and eventual recovery. As for the arms race, it’s best to leave that to Elon Musk, Zuckerberg, and the billionaire club to continue playing.
Reinforcement learning, as an extension of pre-training, belongs to the realm of “post-training.” When pre-training hits bottlenecks and diminishing returns, strengthening RL is a natural complement. In the simulation of human cognition, pre-training represents the accumulated knowledge of human civilization, while RL applies that knowledge in practice, learning from the environment. This overall approach to intelligent learning makes perfect sense and is the necessary direction for applying large models.
My old friend Lu said: “It’s intuitive that RL is the path we must take because there isn’t enough supervised learning data anymore.”
Indeed, utilizing regenerated data to varying degrees has become common practice. It’s inevitable. Models can already generate data of higher quality than humans, and this will only improve. However, this is not the same as self-play's proactive exploration and data regeneration.
As Mr. Mao pointed out: “RL aligns with the cognitive processes of humans and epistemology. It’s essentially the process of receiving external feedback and being tested in practice. RL is active learning, while training is passive.”
Guangmi's RL paradigm shift suggestion still lacks the necessary catalysts. But this potential trend is worth keeping in mind. It’s best to remain cautiously optimistic and open-minded while watching how things unfold.
According to Guangmi, this so-called self-play RL new-ecosystem trend is so far a consensus confined to a small circle of Silicon Valley tech leaders—he mentioned a circle of no more than about 200 people. If that is accurate, a consensus and discussion confined to fewer than 200 people at Silicon Valley's technical core means it is still only a stirring; even the management circle has not yet truly gotten it or aligned around it.
My sense is that Guangmi has a bit of the "the duck knows first when the spring river warms" / "never rest until the words astonish" mentality (LOL), deliberately emphasizing or exaggerating this trend to alert his Chinese audience. He even says that if he were a large-model entrepreneur, he would focus 200% of resources on the RL direction and bet on it, because that is the future winner's choice, and so on.
Objectively, though, for most people this is neither realistic nor actionable; at best it is addressed to the domestic big players or the "six little dragons," and even then it likely falls on deaf ears. RL is not easy to play with to begin with—even the open-source standard-bearer Meta Llama 3 chose to sidestep basic RLHF—so it is even less realistic to urge domestic large-model companies to bet everything on a vision with reinforcement learning as the core of a new ecosystem. Besides, in Silicon Valley the latter is at most an "undercurrent"; we will probably have to wait until OpenAI's Strawberry and the new version of Claude are released toward the end of the year before the impact of this so-called new ecosystem becomes any clearer.
Professor Ma is a compelling speaker, and his talk is definitely worth listening to. His paper on the white-box transformer, over 100 pages long, has just been released (Yi Ma's white-box transformer paper is available here). Unfortunately, I haven't had the time to dig into it yet. We'll have to wait until more people have accepted or verified it before delving deeper.
His current claims revolve around using an extremely sparse approach to force transparency in transformers, with results that are reportedly on par with BERT and GPT-2 in many benchmarks. However, this doesn’t mean that he will be able to catch up with GPT-3 or later models anytime soon. But to be fair, it’s not a level playing field—he’s an academic without the resources to compete with mainstream AI in an arms race. What he does believe, however, is that he has opened a door—a path toward explainable AI in large models.
Honestly, I've always had some doubts about Ilya's theoretical explanation of shortest-program compression (his Berkeley talk). From an ultimate theoretical perspective—where lossless compression is the ideal—the idea of continually scaling training, deepening, and lengthening learning makes sense, as it pushes the model toward becoming the smallest possible program for universal tasks. Ilya's theory may hold up in this respect, at least in theory or as an end goal. But in any real-world scenario (e.g., under budgetary constraints, with methodological limitations), it's hard to call a model derived purely through gradient descent the "shortest program," because these models appear to be gigantic beasts with "huge circuits" inside—intuitively, not anything one would call "short" or "small."
Models with hundreds of billions or even trillions of parameters are massive monstrosities, succeeding mainly through sheer size rather than through high regularity or elegance. Emphasizing how impressive their compression ratios are, or how well they handle lossless compression, may help explain the generalization and emergent abilities of sequence learning from a theoretical standpoint. But in practice, any model at a given time is far from being the "shortest program."
This highlights an unavoidable distance between theory and practice. Ilya essentially hedged practice with theory along a future time axis, but our immediate reality doesn't seem to align with this. It's like a clumsy wrestler trying to brand himself as a sleek, slender fashion model: visually, not a fit to most of our eyes.
Instinctively, LLMs feel full of rote memorization with significant redundancy. Under real-world conditions, achieving extreme or lossless compression seems impossible.
On the other hand, Professor Ma's sparsity approach almost feels "over the top." Enforcing the same weights for Q, K, and V directly seems crude and simplistic, yet the model still trains successfully. This shows there is a lot of flexibility within transformers—no matter what restrictions or pruning are applied, the model still finds a path out. In this sense, Professor Ma's pursuit of the "shortest program" is more real and direct: it is so short that even a human can interpret the process (hence the LLM explainability).
Yet the difference between these two extremes is still mind-boggling. On one side, we have gigantic models, and on the other, extreme simplicity to generate whitebox models. The fact that both approaches work is shocking.
Speaking of simplicity and explainability, here’s an interesting anecdote in AI history: Back in the day, during the era of symbolic MT, one of the earliest deployed systems (Siemens' METAL) for English-German translation used only eight symbolic features (such as human, animal, etc.). The rules were simple, transparent, and easy to explain. This shows that extreme simplicity and rule-based transparency can work in some rough application scenarios (where English and German are linguistically close, making translation easier).
Later, we MT-ers expanded the number of features to the thousands, trying to cover more of the long tail. Even then, it wasn’t perfect. At the time, we thought that with enough effort, we could match the quality of statistical MT. But now, we know that even if symbolic MT could catch up and match statistical MT, it’s still far from competing with neural MT.
So, could we have continued refining the features further? It wasn't that we didn't want to keep extending the symbolic features (similar to one-hot encoding, but with the internal structure of an ontology/taxonomy), from thousands to tens of thousands. In reality, thousands of features already approached the limit of what human experts could understand (AI explainability), manage, and debug. Expanding further would have been unmanageable.
Meanwhile, how many parameters do mainstream Transformer neural networks have? The space and granularity they represent are on a completely different scale. Given this vast difference in scale, it's natural to doubt any effort to bridge the gap for AI explainability. How could that even be possible?
That’s why I’ve always felt that explainability in large models is an elusive goal. But Professor Ma is telling the world that they’ve achieved it.
We later expanded the 8 features to the scale of thousands before we finally mopped up the long tail—and even then not completely. At the time we thought that with careful work we might match the quality of statistical MT (I discussed this with Professor Dong Zhendong, and we both felt symbolic methods could eventually beat statistical ones in translation; it just needed time and fine craftsmanship). But now we know that even if symbolic MT had matched statistical MT, it would still be no match for neural MT.
So why not keep making the features finer? It is not that we did not want to keep extending the symbolic features (similar to one-hot encoding, but with a HowNet-style ontology/taxonomy structure imposed on the features' internals) from the scale of thousands to tens of thousands. The reality is that thousands already approached the limit of what expert human brains can handle; expanding the feature set further, we would no longer have been able to master and debug it.
The next passion point should be to-B scenarios, because the big hopes for applications most likely lie in vertical domains. To-C is crowded, but its roadmap and dynamics—what can be done, including AIGC—are basically clear. To-B, by contrast, is still struggling in the mud, its direction glimpsed only through fog, flickering in and out; yet there are masters there too. Dr. Bai Shuo, for instance—I picture him stroking his beard and smiling, seated on the lotus pond of financial trading, backed by his deep to-B experience.
I’ve now become the go-to expert for AIGC (AI-generated content) "custom services" among my old friends and classmates, just for fun. Below are nostalgic videos made from old photos that two of my classmates asked me to create.
Whenever I find the time, I’m more than happy to provide this kind of emotional value for friends and family because it’s truly satisfying to see their reactions of surprise.
The pianist is now a world-class piano master, frequently touring and performing in Europe, America, and China. These are precious old photos of him practicing and performing with our mutual friend, Brother Sun, in Philadelphia back in the early days.
Dr. Bai Shuo, a seasoned expert in NLP and a multi-talented musician, commented humorously: "It looks real for someone pulling the bow in the piece named Meditation, but the bowing and fingering are all wrong."
Another old friend also left feedback noting that the visual model doesn’t understand music: "This needs improvement! It's obvious that the model was created by someone who doesn’t know how to play the violin or piano. The bowing and piano accompaniment are off. The first note has a two-and-a-half beat long tone, which should be played with a long bow. Additionally, the pianist’s right foot should never be raised or shaking like that—it should be on the sustain pedal.”
LOL
Even though the music's name Meditation was clearly specified in my prompt during generation, there is no model, in the foreseeable future, that can truly align the understanding of music with the intricate details of bodily movements during performance. Perhaps this can be reserved as one of the ultimate challenges for large models aiming for AGI, because theoretically, if enough alignment data of musical performance is available, based on the compression theory of "joint training", it’s possible to aim at perfect alignment across different modalities.
If simulating the objective world is the ultimate goal of visual models, then the current generation of visual models is at the level of “playing the piano to a cow” or “playing music to a tone-deaf audience”—completely unable to withstand scrutiny from musicians. For example, as someone with little musical knowledge, when I watch the nostalgic performance videos above, I wouldn’t notice the flaws as an expert would; instead, I find them vivid and emotionally engaging.
Of course, the standards of musicians might as well just be a "pseudo-demand" or a pseudo-goal (even if the visuals satisfy the picky “expert eye,” so what? Will it sell well?). It might not be worth the effort to pursue this. However, in theory, an ideal AGI should be capable of meeting these expert-level demands.
This is the challenge of musical performance alignment. Another challenge to Sora-like video generation models is character consistency in videos.
Achieving facial consistency in generative visual models is extremely difficult. Don't expect this issue to be resolved by video generation models alone in the short term, especially not through autoregressive methods.
Human eyes are extremely discerning when it comes to face recognition, especially familiar faces of friends and family—you can immediately tell when a character's appearance is off. For example, while playing with old photos recently, I used the KeLing model (a top-notch video model in China) to generate a video of myself. At the 5-second mark, it still looked passable, but by 10 seconds it no longer resembled me.
10 second footage:
In the second 10-second video, just a slight turn of the head, and it’s no longer me—it looks more like my brother. How can a model handle such fine details? Especially when the starting image for video generation is not even a straightforward frontal shot, making the character information incomplete—how could it not go off track?
While the videos I've made for friends and family using KeLing during its public testing phase have generally been met with passionate surprise and amazement, most of them suffer from this issue of character consistency, which is a regret.
The current one-click video generation products on the market (including our own recently launched YuanChuang Island) tend to rely mainly on anime or manga styles. This is to avoid user scrutiny, since those styles lack distinct, 3D individual characteristics. As long as attire stays consistent, genders are not mixed up, and age and ethnicity roughly match, most people will accept it. Today's one-click videos are generally rough-grained, with their entertainment value lying mainly in the story rather than in Hollywood-blockbuster-style character portrayal. As this path progresses, however, it will inevitably run into the challenge of maintaining the consistency of digital IP actors and their roles.
My colleague, Lu, mentioned, "the consistency issue might require cross-checking from multiple video angles, which more or less touches on the core issue of whether modeling is necessary."
Indeed, some form of cross-checking is required, not just monotonic correction over time/sequence—that is indeed the key. There’s a need to decouple or separate the character's image from the storyline, rather than generating in a linear, one-way path. While sequence learning has indeed produced miracles in LLMs, sequence generation inherently has limitations, including random deviations over time. Although it's not as extreme as LeCun's criticism—where he says GPT's error accumulation is a tiny discrepancy that leads to a significant miss—his claim isn't entirely accurate because GPT's autoregressive operation also corrects and adjusts its course at every step in the context. Nevertheless, when it comes to fine-grained consistency, random deviations are almost impossible to handle, even with corrective mechanisms in place.
Hence decoupling, decoupling, decoupling! Decoupling can solve the problem. The world isn't limited to sequences. Beyond sequences and time, there is a constant abstraction (i.e., the character image, or IP) that can be utilized. This is becoming increasingly clear. Take, for example, the digital IP character Maria (Xiao Ya) that I created using AIGC txt2img more than two years ago:
Unless they’re fans, perhaps my numerous Maria videos might cause aesthetic fatigue—someone even called her “Dr. Li's fairy” (LOL). But indeed, there are fans; several of my old classmates are among them.
Why? Because she is an IP, and she has been decoupled.
The one-click video products on the market today (including our YuanChuang Island) rely mainly on anime or other exaggerated styles precisely to dodge users' scrutiny: those characters lack sharp individuality and are not true individual IPs, so as long as attire stays consistent, genders are not mixed up, and age and ethnicity are not contradicted, most people accept them. Today's one-click videos are generally rough-grained, and their entertainment value lies more in the story than in Hollywood-blockbuster-style character portrayal. But as this path climbs upward, it cannot dodge the problem of positioning digital-IP actors in their roles and keeping them consistent.
Professor Ma is a prominent figure, renowned for his distinctive style and leadership in the field. His name is widely recognized and respected. Of particular interest recently are his critiques of mainstream large models and the bold claims he has made about his own work (see his post in Chinese below).
Recently, at a conference in Shenzhen (which I attended with my own talk too), Professor Ma sharply criticized mainstream large models, Ilya, and Kolmogorov complexity theory, dismissing them as being on the level of high school students and claiming that they lack a true understanding of theoretical concepts. He asserted that he has achieved breakthroughs in both theory and practice, particularly with the white-box Transformer developed by his team. According to him, this model not only demystifies the complexity of large models but also offers an engineering-feasible alternative.
When someone speaks with such confidence, it usually indicates genuine expertise and a commanding presence. Just as Yann LeCun in the U.S. criticized GPT as inferior to a dog and called it a dead end, proposing his world model as an alternative, China has Professor Ma. Their critiques balance the global discourse, making it feel less one-sided. There is genuine hope that their work might address the "slow thinking" and "interpretability" shortcomings of current mainstream large models and contribute to the overall advancement of AI. Professor Ma's academic and practical work deserves close study, though we may have to wait for time and peer review to fully test and validate the findings.
At the Shenzhen conference, after delivering his talk and sharp critiques, Professor Ma left immediately, likely due to his busy schedule.
The paper is over 100 pages long and is said to be released in a few days. Based on the current outline, the key points are as follows:
Overall, CRATE is similar to a transformer, with two differences:
- In each attention head, the Q, K, and V weight matrices are tied, i.e., set to be equal.
- The nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP) but rather a more structured operator (ISTA) with sparse outputs.
Let's examine ISTA (Iterative Soft-Thresholding Algorithm), a widely used algorithm for solving sparse optimization problems in machine learning. In his CRATE architecture, ISTA replaces the traditional MLP in Transformers. Not long ago, KAN also introduced innovations aimed at replacing the MLP, both approaches representing surgeries within the Transformer architecture.
In my understanding, ISTA and KAN (for Science/Physics) share a common goal: through regularization or pruning, they ultimately fit a sparse path, thus achieving interpretability.
How it works
ISTA iteratively approaches the optimal solution of a problem. Each iteration involves two steps: a) a gradient descent step, which aligns with mainstream methods; and b) a soft-thresholding operation. This operation is added to balance two objectives:
a) Maximizing model accuracy;
b) Achieving model sparsity, i.e., simplicity (as overly complex models are difficult for humans to interpret).
The soft-thresholding operation encourages internal elements to become zero, resulting in sparse outputs and increased interpretability. The weight-tied attention mechanism, combined with ISTA, promotes a deeper understanding of the input data structure, resembling a human-like structured analysis process that prioritizes key elements while regularizing the data.
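To make the two-step iteration concrete, here is a minimal, textbook ISTA sketch for a standard sparse least-squares (LASSO) problem. It illustrates only the generic algorithm, not the specific operator wired into CRATE:

```python
import numpy as np

def soft_threshold(v, t):
    """Soft-thresholding: shrink values toward zero and zero out small ones (sparsity)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, b, lam=0.1, step=None, n_iter=200):
    """Minimize 0.5 * ||Ax - b||^2 + lam * ||x||_1 with plain ISTA."""
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1 / Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)                         # a) gradient descent on the data-fit term
        x = soft_threshold(x - step * grad, step * lam)  # b) soft-thresholding for sparsity
    return x
```

The two lines inside the loop are exactly the two objectives above: the gradient step pursues accuracy, and the soft-threshold step pushes elements to zero, trading a little fit for a sparser, more interpretable solution.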
Professor Ma claims that these two modifications naturally lead the model to learn the interpretability associated with human-like structuring and sparsity during supervised learning (and later as claimed successfully applied to self-supervised learning too).
For example, in image recognition, it was observed that certain attention heads correspond to different parts of animals. What's more remarkable is that this correspondence remains consistent across different animals and even different categories of animals. For instance, an attention head focused on the "head" consistently pays attention to the head area when processing different kinds of animals. This consistency suggests that CRATE has learned a general representation of visual features across categories.
However, those studying LLM interpretability have long observed that, in the later layers of the network, various structured components (such as heads and feet) are likewise captured by attention mechanisms. Without this, it would be difficult to explain the generalization (or compression) capabilities exhibited by LLMs. The challenge lies in the early layers, where attention is more mixed and mainstream researchers struggle to clarify what the attention heads are focusing on. They seem to be vaguely attending to relationships between basic elements such as pixels/dots and lines.
The core idea behind explainable AI is consistent: transforming the tangled, black-box, multi-layer network's internal data-fitting paths into structured paths, enabled by various constraints and pruning, leading to a sparse representation.
Who wouldn’t want a model to be interpretable? However, achieving sparsity and simplicity is extremely challenging, which is why, so far, these approaches have struggled to compete with the black-box methods that involve randomness.
Professor Ma's confidence stems from the fact that, over the past six months to a year, he has begun to train models using the explainable white-box methods described above, achieving results comparable to traditional transformers. At the Shenzhen conference, he mentioned that while he had always been confident this was the correct approach, he remained cautious until results came in. Now he believes his cross-national team's achievements have satisfied him enough to announce to the world that he has found a breakthrough in both theory and practice—the correct method for white-boxing transformers—which could lead to a paradigm shift in deep learning. This has made him both excited and confident. He is therefore no longer content with academic achievements alone; he feels compelled to act in industry as well. Professor Ma recently founded a company to advance this work at the engineering level, and at Shenzhen he announced a directionally significant project challenging the mainstream, for the first time under the banner of his new company.
However, based on my years of NLP experience and intuition, I must point out a challenge (or potential issue): human interpretability is built on a highly simplified, finite set. In symbolic terms, a feature system with more than a few thousand elements already becomes incomprehensible to humans. The number of parameters in transformers, and the number of QKV matrices across attention heads, are on an entirely different scale. Reducing complexity at that scale seems almost unimaginable.
KAN for Science succeeded because their target was extremely narrow—certain existing symbolic formulas in physics or potential formulas limited to a few parameters. With such a goal, pruning, along with scientist intervention or feedback, allowed KAN to claim interpretability.
Regardless, Professor Ma seems confident, so we would like to observe how his methods and results evolve and will, or will not, be accepted.
Our target customers include content creators (ToPC, to professional consumers) and small-to-medium businesses (ToSMB). Content creators are willing to pay for tools that make their work easier, and that is exactly what we offer. On the ToB side, we focus on relatively standardized solutions for small and medium enterprises, because the customization needs of large clients are complex and hard to serve. We currently have 860,000 paying users, evidence that our service has landed and won market recognition. Below are some demonstrations of our product.
Last Friday, I had dinner with a famous VC investor who told me 65% of VC’s will go out of business in the next few years. I believe him. Here’s what that means for startup leaders waiting on their Series A/B/C:
Those of us in startups tend to think VCs are at the top of the food chain.
They aren’t.
VCs are businesses too.
They raise money from THEIR investors (aka LPs).
And their job is to make a return for those LPs.
With outstanding returns in the 2010s, VC was on a win streak.
Many more funds were born.
And existing funds got much larger.
In 2021, 1577 different VC firms raised a total of $183 billion.
But at the same time, costs to launch a startup have gotten cheaper.
Widely available tools, global workforce, and easy (online) distribution mean it’s never been easier, or cheaper, to start a SaaS company.
So how are VCs supposed to deploy all that money they’ve raised?
They can’t.
There is too much money chasing too few deals.
Make no mistake, for VCs, it’s a fatal mix.
The IPO window is closed – companies can’t go public.
So VCs aren’t making money with big IPOs.
M&A isn’t happening (at least not at good prices for sellers).
So VCs aren’t making money by selling their companies.
If VCs aren’t making money, they can’t return capital to their LPs.
They are in trouble.
Of course, VCs rarely go out of business the way their companies might.
Reputations are at stake, so change happens quietly.
But it’s the same result.
It’s already happening.
Listen carefully, and you’ll hear VCs saying:
"We have decided not to raise another fund.”
Translation: they probably can’t.
More firms will say that they are “no longer investing”
Partners are “deciding to take operating roles.”
Managing Directors are retiring.
In 2023, 597 VC firms raised $81B.
That’s down 63% and 56% (vs. 2021).
The VC party is over.
Or at least this chapter is...
The select few at the top of the VC list will have their pick of deals.
The great business builders will choose their spots and continue to thrive.
I’ve been lucky to work with a few of those and am certain that their expertise and relationships will carry them through.
But over 50% of existing firms won't survive.
That means if you're a startup CEO or operator raising money in this environment, you need to understand the game has changed.
Don’t buy the stories of some founder that raised $30M with $200k ARR and a good deck.
The era of VCs bailing out bad businesses with huge checks is over.
Many of the VCs won’t even be around in a few years.
There is only one strategy that works in this economy.
Focus.
Nail your ICP.
Delight your customers.
Get profitable to control your financial destiny.
The best time to raise money is when you don’t need it.
Facial recognition in the vast world of AI is a specialized and challenging task, as human eyes are exceptionally sensitive to facial features. Because facial recognition is so specialized and sensitive, it presents a much greater challenge than traditional image recognition tasks, like identifying animal types. Consequently, this field achieved breakthroughs earlier than others: even before the advent of contemporary large models such as GPTs, deep neural network-based facial recognition, powered by extensive datasets of facial images, had already surpassed human visual capabilities and sensitivity. It became widely adopted, leading to the rise of unicorns in the pre-large model era.
Now, as we transition to universal video foundation models that aim to handle all objects in the world, whether it's Sora or Keling, maintaining facial consistency remains a significant challenge. The public has little access to Sora, but by examining similar leading visual models like Keling, we can perceive its limitations. Typically, after about half a minute, the generated faces start to diverge, no longer resembling the original person as closely. Achieving long-term consistency in character appearance is difficult without specialized processing and targeted optimization; relying solely on the current general video consistency training efforts is unlikely to overcome this bottleneck. This limitation has been repeatedly observed during various tests with publicly available visual products like Keling.
In some videos, were it not for the sensitivity of human eyes, it would be nearly impossible to tell the visuals apart on purely physical grounds. This highlights the sharpness of human perception: the ability to instantly discern the real from the fake.
For example, in the videos generated below featuring Maria (Xiao Ya, the favorite text2image IP I have generated and maintained in my AIGC videos), her fans can immediately tell which one is genuine, even though Maria herself may appear at different ages and in various settings. There exists an abstract, invariant facial characteristic that equips humans with an eagle-eyed ability to recognize faces. The secret lies in decoupling these characteristics, something the previous generation of facial recognition models already did fairly well. Compare and contrast:
It's important to note that maintaining character consistency is a critical benchmark for generating cinematic and user-configurable video works. Without crossing this threshold, the field will struggle to achieve large-scale applications in video art creation. The dream of a fully virtual Hollywood production line, without physical filming, will remain a fantasy.
Why is it so difficult for visual models to achieve consistent character representation over long periods using brute force?
Video is a high-dimensional modality, and for large models (at least in the foreseeable future) to handle video, they must employ heavy "lossy compression". The compression ratio of visual tokens is high, which makes it more feasible to align training/generation across whole frames over time within the hidden space. The higher the compression ratio, the stronger the temporal consistency across whole frames. Autoregressive (GPT-like) models or DiT (Diffusion Transformers) can achieve this. By doing so, videos that violate the physical laws of the real world can be effectively suppressed, reducing illogical hallucinations and making visual models appear to simulate the objective world (or so it seems). However, there is a trade-off: under lossy compression, the consistency of the overall frames and the consistency of fine-grained features of specific physical objects within them cannot be optimized simultaneously.
The current approach typically adds a super-resolution (SR) module/model after achieving overall contour (blueprint) consistency, attempting to restore the discarded details. Super-resolution rendering has made significant progress, thanks to years of accumulated research in "deepfake"-like technology. However, such technology essentially compensates for losses incurred during compression: it uses the large visual foundation model's strength in imagination (or "hallucination") to fill in details reasonably but non-deterministically, depicting how the world "should" look rather than how it is, often with amazingly lifelike results. But if the goal is to represent an individual entity, especially a finely detailed one like the face of a particular IP, with individual features to which human perception is highly sensitive, the generated image will inevitably drift over time. This is the crux of the problem. The solution should not rely on ever-larger models and ever-longer context windows trained with brute force. Brute force can only slow the deviation; it cannot eliminate the non-deterministic bias that accumulates during the SR process over long video sequences. We need to think outside the box and remove the time dimension as a conditioning factor, using a step-by-step alignment method, which may break the time cycle. I'll stop here—don't say you weren't warned.
The prerequisite for achieving this is the decoupling of facial features. Features that cannot be decoupled cannot be aligned step by step. They have to, and can, be decoupled; otherwise, it would be impossible to explain how dozens of Hollywood actors can star in thousands of blockbuster films. The decoupling of faces from expressions and time still has room for improvement, but the technology has already matured considerably. It is a matter of how to properly use it in the process.
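As a rough sketch of the kind of step-by-step, time-independent alignment hinted at here, one could compare every generated frame against a fixed identity embedding rather than against the previous frame. The `embed_face` encoder and the threshold below are illustrative assumptions (any pre-trained face-recognition encoder would do), not any product's actual pipeline:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def check_identity_drift(frames, embed_face, reference_embedding, threshold=0.8):
    """Flag frames whose face embedding drifts too far from the decoupled, time-invariant
    identity embedding of the IP character.

    `embed_face` is a hypothetical face-embedding function; `reference_embedding` plays the
    role of the constant abstraction (the IP) discussed above.
    """
    flagged = []
    for i, frame in enumerate(frames):
        sim = cosine(embed_face(frame), reference_embedding)
        if sim < threshold:
            flagged.append((i, sim))  # candidate frames for re-generation or correction
    return flagged
```

The point of the sketch is the design choice: every frame is aligned against the same decoupled identity, so errors cannot accumulate along the time axis the way they do in purely sequential generation.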
Notes on the 92-page Paper Released with Meta's Super Large Model Llama 3.1
The super-large model Llama 3.1 is a milestone in the open-source large model community. As a leader, Meta's project involved over 500 participants/contributors (the authors of this paper are listed alphabetically in the appendix, similar to how the Central Committee members' names are displayed by stroke order). This original text is full of implementation details:
AIGC MV using Suno and Keling (just for fun, cheering the open-source milestone)
Notes:
Llama 3.1 does not use sparse techniques; it is not a mixture-of-experts system (as GPT-4 is rumored to be), but a dense model.
405B parameters, 15.6T tokens: The number of tokens is 40 times the number of parameters. Large-scale top models now emphasize data growth far exceeding parameter growth. Is this 15T tokens of data open source? (No, because even if they were willing to open source it, they wouldn't dare, as it could lead to countless data infringement lawsuits)
Emphasizes three major levers for super-large foundation models: data, scale, and managing complexity.
Compared to the previous generation system Llama 2, computational power has increased 50 times (using 3.8 × 10^25 FLOPs).
Complexity management: (1) Choosing a standard dense Transformer architecture instead of a mixture of experts model to maximize training stability. (2) Adopting a relatively simple post-training procedure: Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO). In other words, algorithm design and implementation tend towards simplification. Not using sparse techniques and multi-expert systems is for stability (but training challenges are greater, though they're not afraid). Using simpler, easier-to-implement DPO in the post-training phase instead of reinforcement learning is also for stability, as reinforcement learning has always been difficult to handle.
Benchmark tests cover: general, code, math, reasoning, tool use, long context, and multilingual. All performances are SOTA (state-of-the-art international level).
MMLU (Massive Multitask Language Understanding): 405B model achieves 87.3% (5-shot), 88.6% (0-shot, CoT).
Code generation (HumanEval): 405B model reaches 89.0%, close to GPT-4.
Math problems (GSM8K): 405B model achieves 96.8%, slightly higher than GPT-4.
Long context tasks: Excellent performance on some tasks, such as 95.2% on QuALITY.
Multilingual tasks (MGSM): 405B model reaches 91.6%, on par with top models. The 405B model is comparable or close to GPT-4 and Claude 3.5 Sonnet on many tasks. In short, open-source has caught up with closed-source.
Pre-training started with an 8k window, expanded to a 128k window in the later stages of pre-training (continued training).
After the foundation model pre-training was completed, multiple iterations of alignment "post-training" were performed. Including: (1) Aligning the model through human feedback, including multiple rounds of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO); (2) Integrating new capabilities, such as tool use; (3) Enhancing coding and reasoning abilities (specialized optimization); (4) Safety alignment.
Multimodal expansion (in progress, not yet released): Image, video, and speech capabilities. Including (1) Multimodal encoder pre-training: Image encoder trained on a large number of image-text pairs, aligning visual content and natural language in a unified space; (2) Speech self-training? (3) Experiments on video-text data alignment based on images.
The language model is the core; other modalities are added later (whether in pre-training and/or post-training). When expanding to multimodality, the language model parameters remain unchanged; the other modalities adapt to it, so that multimodal alignment happens in the same semantic space, anchored on the language model. In other words, Llama follows a modular, step-by-step approach to multimodality. This is not the mainstream approach (mainly referring to OpenAI and Google, at least in theory), which advocates unified joint pre-training on natively multimodal data. The overall impression of Llama's algorithmic strategy is that it seeks stability rather than innovation or unification; it leans practical and does not care about leading in algorithms. For example, integrating speech first involves speech self-training (speech is, after all, very similar to text, both being language systems), followed by alignment between speech and text (including Automatic Speech Recognition, ASR, and Text-to-Speech, TTS), and then step-by-step integration into the cross-modal large model. This approach isn't cutting-edge, but it is steady progress, beneficial for engineering development, integration, and iteration. It's unclear when they will be able to release multimodal capabilities online.
Data collection and cleaning work is very complex, but the Llama team is meticulous, which is also the data guarantee for its quality to catch up with SOTA. To recap: (1) De-duplication: URL-level de-duplication; Document-level de-duplication using MinHash algorithm; Row-level de-duplication: removing rows appearing more than 6 times every 30M documents. (2) Filtering: Removing low-quality documents, outliers, and excessively repetitive documents, using repetitive n-gram coverage to remove repetitive content (such as logs or error messages); using "dirty word" counts to filter adult websites not covered by blacklists; using token distribution KL divergence to filter documents with too many abnormal tokens. (3) Controlling data quality: Using fasttext classifier to identify text that might be cited by Wikipedia; using a Roberta-based classifier trained on Llama 2's predictions; using DistilRoberta to generate document quality scores. Also, fasttext language classifier can identify 176 languages; specially filtering two types of information: adult content and personal identity/privacy information. Special fine processing for code and math web pages.
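For readers unfamiliar with MinHash-based document-level deduplication, here is a minimal sketch using the open-source datasketch library; the word-level shingling and the 0.8 threshold are illustrative placeholders, not the settings Meta used:

```python
from datasketch import MinHash, MinHashLSH

def minhash_of(text, num_perm=128):
    """Build a MinHash signature from crude word-level shingles (real pipelines use n-grams)."""
    m = MinHash(num_perm=num_perm)
    for token in text.split():
        m.update(token.encode("utf8"))
    return m

def dedup(documents, threshold=0.8, num_perm=128):
    """Keep one representative per cluster of near-duplicate documents."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, doc in enumerate(documents):
        m = minhash_of(doc, num_perm)
        if not lsh.query(m):        # no near-duplicate seen so far
            lsh.insert(str(i), m)
            kept.append(doc)
    return kept
```

The same idea scales out with sharding and approximate LSH indexes; the sketch only shows why MinHash catches near-duplicates that exact URL- or hash-based dedup would miss.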
Data proportions: For example, downsampling over-represented data categories on the web (such as art and entertainment); data mixing ratios determined by a series of small model experiments, final data mix summary: About 50% of tokens correspond to general knowledge; 25% of tokens involve math and reasoning; 17% of tokens are code; 8% of tokens are multilingual content.
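As a toy illustration (my own sketch; the paper's notes above do not describe the sampling mechanism at this level of detail), such mixing ratios can be read as sampling weights over data sources during training:

```python
import random

# Approximate final data mix reported for Llama 3.1 (fractions of training tokens).
MIX = {"general_knowledge": 0.50, "math_reasoning": 0.25, "code": 0.17, "multilingual": 0.08}

def sample_source():
    """Pick which data source the next training document is drawn from."""
    sources, weights = zip(*MIX.items())
    return random.choices(sources, weights=weights, k=1)[0]
```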
Model architecture: Apart from empirical detail adjustments, the basic architecture of the dense model remains unchanged, so it's data and scaling that create top models. 405B model specific parameters: 126 layers; token representation dimension 16,384; 128 attention heads; model size of 405B determined according to scaling law, about the computational optimal size under 3.8 × 10^25 FLOPs training budget.
Vocabulary: Using a vocabulary of 128K tokens. Combines 100K tokens from the tiktoken3 tokenizer and 28K additional multilingual tokens to better support non-English languages.
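To keep the headline numbers in one place, here is a small, purely descriptive sketch (my own dataclass, not anything from Meta's code) gathering the 405B configuration figures quoted above and in the summary further below:

```python
from dataclasses import dataclass

@dataclass
class Llama31_405B_Config:
    """Headline figures as reported in these notes; a descriptive sketch, not Meta's config."""
    n_layers: int = 126
    d_model: int = 16_384        # token representation dimension
    n_heads: int = 128           # attention heads
    n_kv_heads: int = 8          # grouped-query attention, per the paper
    vocab_size: int = 128_000    # ~128K tokens (tiktoken3 base + 28K multilingual additions)
    max_context: int = 128_000   # 8K during initial pre-training, extended to 128K later
```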
Computing resources, including GPU clusters of tens of thousands of cards, massive storage, and high-speed networks, represent huge resource investments. Specific data as follows:
Computing resources:
Used up to 16,000 H100 GPUs (a very powerful graphics processor).
Each GPU has 80GB of high-bandwidth memory, with a power of 700W.
These GPUs are installed on servers designed by Meta itself, with 8 GPUs and 2 CPUs per server.
Storage system:
Uses a distributed file system called Tectonic.
Provides 240PB (1PB=1000TB) of storage space, distributed across 7,500 servers.
Can process 2TB of continuous data per second, with a peak of 7TB/second.
A major challenge is handling the large amount of burst writes generated when processing model checkpoints (the process of saving model states).
Three-step pre-training process: a) Initial pre-training; b) Long context continued pre-training; c) Annealing with high-quality data sources.
Key pre-training strategies:
Gradually increase batch size and sequence length to balance stability and efficiency.
Dynamically adjust data mixing to specifically enhance certain capabilities.
Increase context length in stages to avoid early computational overhead.
Use annealing and high-quality data in the late stages of training to fine-tune model performance.
[LLM Summary]
Llama 3: Meta's Open-Source Large Language Model Breakthrough
1. Introduction and Overview
Meta has introduced Llama 3, a series of foundation language models designed to support various tasks including multilingual processing, programming, reasoning, and tool use. This model series includes versions with 8B, 70B, and 405B parameters, with the largest 405B parameter model adopting a dense Transformer architecture and supporting context windows of up to 128K tokens. The development of Llama 3 highlights three key factors: data quality and scale, computational scale, and complexity management.
2. Model Architecture and Pre-training Strategy
2.1 Model Architecture
Llama 3 retains the standard dense Transformer architecture rather than adopting a mixture of experts model. This choice aims to maximize training stability, reflecting Meta's emphasis on simplifying design to manage complexity. Key architectural improvements include:
- Using Grouped-Query Attention (GQA) mechanism, with 8 key-value heads per attention layer.
- Introducing attention masks to prevent self-attention between different documents in the same sequence.
- Expanding the vocabulary to 128K tokens, combining 100K tokens from the tiktoken3 tokenizer and 28K additional multilingual tokens.
- Increasing the RoPE base frequency hyperparameter to 500,000 to support longer contexts.
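As a concrete illustration of the RoPE base-frequency change, here is a minimal sketch using the standard rotary-embedding formula; the head dimension of 128 follows from 16,384 / 128 heads, and everything else is the textbook computation, not Meta's implementation:

```python
import numpy as np

def rope_inverse_frequencies(head_dim=128, base=500_000.0):
    """Per-dimension rotary frequencies. Llama 3 raises the base from 10,000 to 500,000 so
    positional phases rotate more slowly and very long contexts remain distinguishable."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

def rope_angles(positions, head_dim=128, base=500_000.0):
    """Angle matrix (num_positions x head_dim/2) used to rotate query/key pairs."""
    inv_freq = rope_inverse_frequencies(head_dim, base)
    return np.outer(np.asarray(positions), inv_freq)
```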
2.2 Pre-training Data Processing
Llama 3's pre-training data processing is extremely rigorous, including:
- Multi-level deduplication: URL-level, document-level (using MinHash algorithm), and row-level deduplication.
- Heuristic filtering: Removing low-quality documents, outliers, and excessively repetitive content.
- Model-based quality filtering: Using fasttext and Roberta-based classifiers for quality assessment.
- Special content processing: Developing specialized processing pipelines for code and mathematical content.
- Multilingual data processing: Using fasttext base language identification model, supporting 176 languages.
- Safety and privacy protection: Filtering website data containing personally identifiable information (PII) and unsafe content.
2.3 Pre-training Strategy
The pre-training process is divided into three main stages:
1. Initial pre-training: Conducted on about 15T multilingual tokens, far exceeding Llama 2's 1.8T tokens.
2. Long context pre-training: Gradually expanding from initial 8K tokens to 128K tokens context window.
3. Annealing phase: Fine-tuning with high-quality data in the final stage, using Polyak averaging to generate the final model.
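Polyak (checkpoint) averaging here means, roughly speaking (my paraphrase of the standard technique, not a detail spelled out in the paper), releasing weights that are an average of the last n checkpoints rather than the single final iterate:

```latex
\bar{\theta} \;=\; \frac{1}{n}\sum_{i=1}^{n} \theta_{T-i+1}
```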
Data mixing ratios are carefully designed:
- 50% general knowledge
- 25% mathematics and reasoning
- 17% code
- 8% multilingual content
3. Training Infrastructure and Challenges
3.1 Computational Resources
- Using up to 16K H100 GPUs, each equipped with 80GB HBM3 memory.
- Adopting a 4D parallel strategy: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.
3.2 Storage System
- Using the Tectonic distributed file system, providing 240PB of storage space.
- Supporting 2TB/s sustained throughput, with peak capacity of 7TB/s.
3.3 Network Optimization
- Developing the NCCLX communication library to improve network efficiency.
- Designing specific network topologies and load balancing strategies.
3.4 Training Challenges
- Experiencing 466 job interruptions during the 54-day training period, 419 of which were unexpected.
- Developing automated systems and specialized tools to handle hardware failures and network issues.
4. Post-training and Alignment
Llama 3 adopts a multi-round iterative post-training process, including:
1. Supervised Fine-Tuning (SFT)
2. Direct Preference Optimization (DPO)
3. Reward model training: Using human feedback data
4. Safety alignment: Implementing multiple rounds of safety measures
This process not only improves the model's instruction-following capabilities but also enhances safety and specific abilities (such as coding and reasoning).
5. Multimodal Expansion
Although not officially released yet, Llama 3 demonstrates promising multimodal capabilities:
- Image recognition: Training independent image encoders, integrated with the language model through adapters.
- Video understanding: Adding video adapters based on image adapters.
- Speech processing: Independently training speech encoders, then aligning with the language model.
This modular approach allows flexible addition of new modalities while maintaining core language capabilities.
6. Performance Evaluation
Benchmark results indicate that Llama 3 405B is comparable or close to GPT-4 and Claude 3.5 Sonnet on multiple tasks, particularly excelling in document understanding and long context tasks.
7. Safety Considerations
Meta highly prioritizes safety in the development of Llama 3:
- Implementing strict safety measures in both pre-training and post-training stages.
- Developing the Llama Guard system-level safety solution.
- Conducting extensive red team testing and risk assessments.
8. Open Source Impact and Future Directions
Meta's decision to publicly release the entire Llama 3 series, including the 405B parameter version, may have far-reaching impacts on the AI research community:
- Promoting open, responsible AI development.
- Accelerating AGI research progress.
- Providing researchers with opportunities to examine and improve large-scale language models.
Future development directions may include:
- Further improving multimodal integration.
- Expanding context length.
- Continuously enhancing data quality and model scale.
9. Conclusion
The development of Llama 3 demonstrates Meta's deep experience and forward-thinking in large-scale AI systems. By focusing on three key levers - data quality, computational scale, and complexity management - Llama 3 has reached or approached the current state-of-the-art level on several key benchmarks. Its open-source release may drive a wave of innovation across the entire AI field, paving the way for responsible AGI development.
Llama 3: Meta's AI Chef's Latest "Divine Delicacy"
Attention, all tech enthusiasts! The Michelin three-star AI chef Meta has just unveiled a new dish! This divine delicacy named "Llama 3" is not only spicy enough but will elevate your taste buds to new heights!
1. The Chef's Secret Weapon
Imagine Llama 3 as a super nanny who speaks 8 languages, writes code, does math, and can be your personal assistant. She can handle a kindergarten full of rambunctious kids (8B version), manage a mid-sized company (70B version), or even govern a small country (405B version)! This 405B big sister can remember 128,000 "gossips" (oh no, I mean context) simultaneously, essentially a walking encyclopedia + supercomputer!
2. Ingredient Selection: Only the Freshest!
Llama 3's chefs are masters at picking ingredients:
They "fished" 15 trillion words from the internet, nearly 10 times more than the previous generation!
Half of these words are everyday life seasonings, a quarter are math problems and brain teasers, nearly a fifth are programmer spells, and the rest are dialects learned from world travels.
They even invented a super weed remover, filtering out all the online garbage, repetitive, and unhealthy stuff.
3. Cooking Process: Three-Step Stir-Fry Method
Step 1: "Slow Simmer" - Start with a regular stove (8K context) to cook it halfway. Step 2: "High Heat Stir-Fry" - Switch to a super stove (gradually increasing to 128K context), reducing the sauce to be thick and fragrant. Step 3: "Low Heat Finish" - Finally, a gentle simmer with the best ingredients, the legendary "annealing" (even the chefs don't know why it's called that), bringing the flavor to its peak!
4. Kitchen Equipment: Top-of-the-Line Luxury Version
16,000 super high-power induction cookers (H100 GPUs) firing simultaneously!
A refrigerator that could fit half the Pacific Ocean (240PB storage)!
A proprietary ingredient prep system faster than 5G (NCCLX communication library)!
Imagine all these stoves firing at once, making the kitchen feel like a sauna. But our chefs persevered through the heat, changing chef uniforms 466 times in 54 days to whip up this dish!
5. Training Method: Both Cute and Well-Mannered
Being a good cook isn't enough; you've got to have manners too! So our chefs began a long "training" process:
First came a round of "gentle education" (supervised fine-tuning)
Then the "carrot and stick" tactic (direct preference optimization)
Finally, they invited moral role models (safety alignment) for guidance
After all this fuss, Llama 3 not only cooks well but also knows how to please people, program, do math, and mind her manners - a true decathlon champion!
6. Special Side Dishes: Showcasing Multiple Talents
Don't think Llama 3 can only cook; she's a multi-talented "goddess":
Storytelling from images? Piece of cake!
Writing movie reviews? No problem!
Recognizing songs and even singing a bit? The karaoke queen!
Although these "talents" are still in practice, they already show the potential of Li Bai's "from black hair to snow white in a day"!
7. A True Powerhouse: Dazzling Test Scores
Llama 3 participated in a series of "Top Chef Competitions," with eye-popping scores:
College Entrance Exam (MMLU): 87.3 points (out of 100)
Programmer Interview (HumanEval): 89 points (out of 100)
Math Olympiad (GSM8K): 96.8 points (out of 100)
Long Novel Reading Comprehension (QuALITY): 95.2 points (out of 100)
Bring this report card home, and even a "Tiger Mom" would be grinning from ear to ear!
8. Safety First: AI's "Security Captain"
Meta's chefs know well the principle of "don't leave guns and ammo lying around." They've assigned Llama 3 a 24/7 bodyguard team (Llama Guard) to prevent her from accidentally saying or doing the wrong thing. They even arrange occasional "moral exams" to ensure she doesn't turn into a "Terminator."
9. Open Source Feast: Everyone Can Be a Master Chef!
The most impressive part is that Meta decided to make the recipe for this "divine delicacy" completely public! It's like a Michelin three-star restaurant putting their signature dish's recipe online. Now anyone who wants to can whip it up at home! This move not only shocked other master chefs but also made countless food lovers cheer with joy!
10. Future Outlook: Reaching New Heights
Meta's chefs aren't resting on their laurels; they're already pondering the next "divine delicacy":
Maybe a dancing Llama 4?
Or a painting Llama 5?
Who knows, one day we might see a Llama 6 composing symphonies!
In short, the AI world's "Michelin" journey has only just begun!
Epilogue
The birth of Llama 3 not only elevates Meta's status in the AI world but also brings a fresh breeze to the entire AI research community. This bowl of "Llama soup" is not only delicious but also brings unlimited imagination to everyone. What will the future of AI be like? Let's wait and see what flavor the next "divine delicacy" will be!
-- looking closely into his historical Berkeley talk
by Wei Li, Jia Gao
Introduction
When Ilya Sutskever left OpenAI and re-emerged with his new company, SSI (Safe Superintelligence Inc.), the move was both surprising and expected—he bypassed AGI and directly aimed at SSI (Safe Superintelligence). He confidently declared: Superintelligence is imminent, and establishing safe superintelligence (SSI) is the most important technological issue of our time.
Ilya, a legend in the field of deep learning and AI, and formerly the true soul of OpenAI, was at the center of the company's dramatic internal rift over effective accelerationism versus super alignment. Why was Ilya so steadfast about "super alignment" amid this debate over AI values and strategic direction? Even after the storm settled, the outside world continued to speculate: what did Ilya see that compelled him to join the board in the decision to oust CEO Sam Altman? Ilya stayed out of sight until recently, when he left OpenAI, leading to the dissolution of his super alignment team and the creation of his new company.
What did he see behind the push for "safe intelligence"?
Back on October 3, 2023, Ilya gave a talk at UC Berkeley titled "A Theory of Unsupervised Learning." Though obscure and known to few, it is destined to be one of the most significant moments in AI history. This talk was a theoretical reflection and summary by a top expert in deep learning on the GPT model he pioneered, now famous worldwide. Ilya revealed the core principles of large models and vividly described his obsession with, and excitement over, independently understanding the mechanisms of unsupervised learning. Despite the complexity, the talk was brilliant and enlightening.
More recently, Leopold Aschenbrenner, a former member of his super alignment team, published a 165-page essay, "Situational Awareness", preliminarily revealing the shock and concern within OpenAI over the exponential evolution of GPT models. This partly answered the question of what Ilya saw, but Ilya himself remained silent until his official re-emergence not long ago.
Reflecting on his "confessional" talk at Berkeley, we might glimpse his "moment of enlightenment" when facing potential superintelligence and understand his original intent for safe intelligence. It was a rare, deep sharing by Ilya, an attempt to convey an essential message to the world. But did the world hear him?
1. Machine Learning: Supervised Learning and Unsupervised Learning
To accommodate readers with varying mathematical backgrounds, this blog aims to explain Ilya's historical presentation in an accessible language. Purely technical explanations can be skipped by non-technical readers without affecting the understanding of the presentation's main ideas.
Before diving in, let's review the basic concepts of machine learning. Machine learning is like having computers as students and humans as teachers. By providing computers with numerous "practice problems" and "answer keys," they slowly learn to solve problems. This is supervised learning. But can computers really learn from practice problems instead of merely memorizing them? Ilya assures us there's theoretical proof of this.
Imagine a sea of problems before you, each paired with a standard answer. This is the model's training data. Model training is like diligently solving these problems until most of them are correct, meaning low training error. But even an extensive problem set has its limits. When new problems arise, can the model still get them right? These new problems are the test data, akin to exams. Whether the model performs well depends on its test error rate.
Mathematics tells us that as long as the problem set is large enough, far exceeding the model's size, excellent performance on training problems (low training error) ensures good performance on test problems (low testing error). In other words, if the model trains well, it will do well in exams! This is the mathematical guarantee for supervised learning.
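One standard way to state this guarantee (my gloss, not a formula from Ilya's slides) is the finite-hypothesis-class bound: with probability at least 1 − δ over the sampling of N training examples,

```latex
\mathrm{TestErr}(h) \;\le\; \mathrm{TrainErr}(h) \;+\; \sqrt{\frac{\ln|\mathcal{H}| + \ln(1/\delta)}{2N}}
```

So as long as the number of training examples N is much larger than the effective description length of the model class, low training error forces low test error.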
However, if the model merely memorizes without abstraction, then no matter how large its memory or how strong its "memorization power", it lacks real adaptive learning ability (called "generalization"). Only when the model's capacity is not excessive is it forced to extract the essence (called "compression"), learning real skills from the problem set.
This explains why the model size shouldn't be too large, to avoid giving the model too much room to cut corners. In short, Ilya wants to say that "big labeled data + low training error" is the winning formula for supervised learning, guaranteed by mathematics. This point has been confirmed both theoretically and practically. Since the deep learning revolution 12 years ago, countless successful cases have shown that as long as the training data is sufficient, neural networks can excel, at all sorts of AI tasks, from recognizing cats and dogs to machine translation.
But what about unsupervised learning? Can computers learn intelligence from a problem set without standard answers? It sounds far-fetched, but Ilya is about to explain how he managed to seek a solid mathematical foundation for unsupervised learning as well.
2. Distribution Matching: A New Approach to Unsupervised Learning
Everyone knows that machine translation was a typical win for supervised learning—in fact, the only clear win among the various NLP tasks (dialogue, information extraction, sentiment analysis, question answering, document understanding, etc.) prior to the era of large language models. Why? Because we have a vast amount of historical bilingual data. It's like students having workbooks with English on the left and Chinese on the right—supervised learning thrives on this setup.
But what if the teacher suddenly stops providing aligned bilingual data and only gives you English books and unrelated Chinese books, leaving you to figure out how to align and learn automatic translation? That's the problem unsupervised learning needs to solve. Ilya says unsupervised learning can also handle various language machine translations (which we've seen today with large models—specialized translation software is no longer needed), and even any input-to-output transformation tasks. What's the catch?
Ilya discovered a new approach called distribution matching. Essentially, if the English and Chinese book collections are large enough, containing various sentence structures, their linguistic regularities will be learned "without supervision". For example, the context distribution of "I/me/my" in English should correspond to "我" in Chinese; adjectives near nouns in English with semantic compatibility should have a similar pattern in Chinese, etc. This provides the basic condition for potential language alignment.
Ilya points out that if two languages' native data is sufficiently rich, the input in one language can almost uniquely determine the equivalent translation in the other language. This principle applies not only to machine translation but also to tasks like speech recognition and image recognition.
Ilya independently discovered this approach in 2015, fascinated by the underlying mathematical principle—compression theory. If we can find a method that maximally compresses both English and Chinese data, this approach will capture the common patterns of the two languages, which form the basis of translation.
So, Ilya proposes that unsupervised learning is essentially about finding the optimal data compression method. This perspective not only sounds cool but also provides a mathematical explanation for the effectiveness of unsupervised learning. Although real-world tasks are not idealized, this principle gives unsupervised learning a solid theoretical foundation, making it as convincing as supervised learning.
Next, Ilya will delve deeper into the mathematical principles behind it. Although somewhat abstract, he promises it’s full of insights. We'll see how he uses the magic of compression to explain the mysteries of unsupervised learning.
3. Ilya’s Ultimate Theory: From Conditional Modeling to Joint Modeling
This is the final and most intriguing slide of Ilya's talk, worthy of thorough analysis and contemplation. The goal of unsupervised learning is often defined as "learning the internal structure of data." Ilya suggests understanding unsupervised learning from the perspective of data compression: a good unsupervised learning algorithm should maximally compress the data, representing its content in the simplest form. This introduces the concept of Kolmogorov complexity.
The Kolmogorov complexity of a data object is the length of the shortest computer program that can fully describe this object. You can imagine this shortest program as a "compressed package" containing all the information needed to reconstruct the original data. From this perspective, the goal of unsupervised learning is to find the optimal compressed representation of the data, which is the Kolmogorov complexity.
However, in practice, we often need to handle multiple related datasets. For instance, in machine translation, we have the source language dataset X and the target language dataset Y. We want to learn a model that can translate sentences from X to Y (or vice versa). Traditionally, this is viewed as a conditional probability problem: given X, what is the probability distribution of Y? Represented in terms of Kolmogorov complexity, this involves finding K(Y|X), the shortest description length of Y given X.
Ilya proposes a different approach. Instead of viewing X and Y as condition and result, like in supervised learning, he suggests viewing them as a whole and compressing them together within a massive model. Essentially, we seek the joint Kolmogorov complexity K(X,Y), the shortest program length that compresses both X and Y simultaneously. This approach must fully utilize the correlation between X and Y, using information in X to automatically align Y (or vice versa), much like how we use our native language knowledge to understand and remember foreign language expressions.
Ilya believes this joint compression idea is the true power of unsupervised learning. Real-world data is often interconnected, with numerous deep common patterns and regularities. If unsupervised learning can discover and utilize these regularities, it can significantly enhance learning efficiency and generalization ability. This explains the remarkable performance of large language models like GPT across various tasks: through massive unsupervised pretraining, they learn the deep regularities of the training data, and these regularities are transferable across related datasets.
Although Kolmogorov complexity is theoretically uncomputable, Ilya believes we can approximate this process using deep neural networks (like GPT). Through optimization algorithms such as gradient descent, neural networks can find the optimal compressed representation in massive data, capturing the essence of the data and its alignment patterns, even if not strictly in terms of Kolmogorov complexity.
Thus, Ilya’s theory can be seen as a new paradigm for unsupervised learning, elevating traditional independent modeling (like separate models for English and Chinese) to a unified associative modeling approach. In this paradigm, the goal of unsupervised learning is no longer just compressing individual datasets but finding the connections between them. This cross-modality learning represents an advanced form of artificial general intelligence (AGI).
Now, let’s closely examine this final slide. In it, X represents dataset 1 and Y represents dataset 2. The key point is extracting every bit of information from X (or Y) to help predict Y (or X). This is what Ilya refers to when he says training X and Y together yields the effect that unsupervised learning of X helps accomplish the task of transforming X to Y.
The crucial idea is: K(Y|X) becomes K(X, Y).
Ilya transforms the universally applicable functional AI task of "input X conditions output Y" into an approximate solving problem by jointly training X and Y without modal segmentation. This joint training approach is effectively the current multimodal unified training, abbreviated as K(X, Y).
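The bridge between the two quantities is the chain rule (symmetry of information) of Kolmogorov complexity, a standard result not spelled out on the slide:

```latex
K(X, Y) \;=\; K(X) \;+\; K(Y \mid X) \;+\; O\!\left(\log K(X, Y)\right)
```

Up to a logarithmic term, jointly compressing X and Y is equivalent to compressing X and then describing Y given X, so a near-optimal joint compressor must implicitly achieve the conditional compression K(Y|X) that defines the ideal predictor.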
Ilya aims to strengthen the theoretical basis, emphasizing his surprising discovery that self-learning of X has a strong predictive effect on Y.
The essence of unsupervised self-learning is that self-learning of X compresses X, and self-learning of Y compresses Y. This is straightforward because self-learning involves only positive examples, without negative samples. Unsupervised self-learning lacks a specific task orientation; it learns language from language, images from images, music from music, and so on, continually abstracting patterns from phenomena.
Ilya points out in the slide: conditioning on a dataset, not an example. The compression object is the dataset, not individual data points, which is crucial. This distinction separates superficial compression from content compression. Superficial compression is merely a mechanical process that does not produce intelligence. Only content compression can achieve artificial intelligence.
How do we understand the difference and the connection between superficial lossless compression (e.g., digital music files) and content compression (e.g., Suno)? Compressing a specific song losslessly aims to ensure it can be restored to its original form (including noise and imperfections). This is traditional music compression, targeting an individual sample, e.g., a specific song. Compressing a collection of music, whether with GPT or diffusion, targets a group of samples, and the result is a large model like Suno.
When individual objects turn into group objects, formal compression naturally transforms into content compression. This is because, although the group comprises individuals, compressing the group is like "painting" a portrait of the group, outlining its characteristics. It may resemble an individual, but it is not a specific individual in the original data; otherwise, it would not be a model but a memory repository.
This is understandable because the purpose of large model compression is to identify the characteristics and regularities of the dataset. The text generated by GPT-4 might seem familiar; the music generated by Suno might sound familiar; the videos generated by Sora might look familiar; the images generated by MJ might seem familiar. However, they are virtual individuals "restored" based on prompts, abstracted or compressed from big data: derived from data, higher than data, mingling with data, indistinguishable from real and fake.
Given that the compression object is the entire dataset content, how do we measure its effectiveness after decompression? What is the gold standard?
This standard is each sample itself. More precisely, the standard admits equivalent answers, since the same content can be expressed in various ways. The implementation method is "masking", and NTP (next token prediction) simply masks the next token. Training computes the loss for each sample and uses backpropagation with gradient descent to adjust parameters continually, eventually lowering the loss over the whole dataset to an acceptable level, yielding the large model.
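In next-token-prediction terms (the standard LLM training objective, written in my notation rather than Ilya's), the per-sample loss over a token sequence x = (x_1, ..., x_T) is:

```latex
\mathcal{L}(\theta) \;=\; -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```

Minimizing this sum over the whole dataset is exactly the "group training" described here: the gold standard at each step is the sample's own next token, masked out and then predicted.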
This final slide and Ilya’s explanation emphasize a core point: Conditional Kolmogorov complexity K(Y|X) provides a theoretically optimal solution for unsupervised learning. K(Y|X) is defined as the length of the shortest program that produces the output dataset Y given access to the input dataset X. It represents the theoretical limit of extracting all valuable information from X to predict Y. An algorithm that can achieve K(Y|X) would be the best for predicting Y using unlabeled data X.
This can be seen as the theoretical basis for large models performing translation across languages. Each language is potentially X and potentially Y. After self-learning on a huge amount of data, LLMs learn the relationships between languages, acquiring the potential to translate from X to Y.
In practice, the machine translation task, like other tasks, initially involves few-shot examples in instruction-following fine-tuning to define the task, ultimately triggering the internal power of large models to translate various languages. This internal power of unsupervised learning for various tasks is the theme of his talk.
However, K(Y|X) is uncomputable in practice. Ilya proposes a feasible alternative, using joint Kolmogorov complexity K(X,Y) (joint compression of X and Y). He believes K(X,Y) can achieve the same effect as K(Y|X) in practical machine learning tasks.
Let us stop and think again: Ilya replaces conditional modeling with joint sequence modeling. The well-known probability simplifications of traditional machine learning, such as the Markov assumption, played a similar role.
Conclusion
Ilya's historic presentation at Berkeley on the theory of unsupervised learning reveals the secret behind the mainstream of self-learning large models, especially GPT. It seems that Ilya, after long contemplation, finally disclosed this "heavenly secret" in a cryptic manner at Berkeley. Although the theory and its proof appear complex, it is crucial for understanding why GPT's sequence learning method ("next token prediction") has become a universal simulator for AI tasks.
Ilya exudes a genius prophet aura, with a lonely invincibility and high-altitude isolation, blending a sense of deep realization, compassion, and the pure, focused, and idealistic earnestness of a graduate student nerd.
He claims to prefer compression but does not emphasize so-called lossless compression. He leaves room for himself and the mainstream, proposing the concept of "no regret"—though GPT may not achieve lossless or perfect compression, it theoretically proves there is no better way: GPT is the closest to lossless, "no-regret" modeling.
When Ilya officially re-emerges to establish SSI, emphasizing a single focus, a single goal, and a single product—to use technology to ensure the superintelligence brought by large models is safe for humanity—he asserts: AI will be eternal, its birth akin to the creation of heaven and earth. As Ilya passionately discusses AI's progress, he is most qualified to declare and lead the "exciting yet dangerous journey towards AGI."