Reinforcement Learning for Reasoning: Supervised Outcomes, Unsupervised Processes

In reading the DeepSeek R1 paper, some may have overlooked the nuances: the training data is both human-labeled and machine-regenerated, making the reinforcement learning (RL) at once supervised and unsupervised.

How so?

From the perspective of the data's origin and gold standards, the training data is undeniably human-labeled. It derives from existing math problems and from human-crafted code in GitHub's open-source community—the product of years of effort by educators, developers, and others. The problems (input) and their "gold-standard" answers (output) are human-designed or human-verified. In this sense, the RL here resembles typical end-to-end supervised learning:

Input: Math/coding problems
Output: Verified answers

However, unlike other forms of supervised learning, RL requires the model to learn the reasoning process that leads to the answer. Critically, the intermediate steps carry no human annotations or feedback. Instead, the system autonomously generates this reasoning data and iteratively appends it to the training set. In this sense the process is unsupervised. The brilliance of RL lies here: self-guided exploration, path discovery, and data regeneration.
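
To make this "supervised outcome, unsupervised process" split concrete, here is a minimal conceptual sketch (mine, not DeepSeek's code; generate_with_reasoning and reinforce are hypothetical placeholders for the model's sampler and policy update). Only the final answer is checked against the human gold standard; the reasoning text in between is never graded.

def outcome_supervised_step(model, question, gold_answer, n_samples=4):
    # The model regenerates its own reasoning traces (unsupervised process)...
    rollouts = [model.generate_with_reasoning(question) for _ in range(n_samples)]
    # ...but the reward comes only from the human-provided gold answer (supervised outcome).
    rewards = [1.0 if r.final_answer == gold_answer else -1.0 for r in rollouts]
    model.reinforce(rollouts, rewards)  # hypothetical policy update; intermediate steps are never scored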

Cold Start and Human Data
DeepSeek R1's initial training did use a small set of human-annotated reasoning data. But those couple of thousand examples pale against the millions of regenerated examples—effectively negligible. In fact, research like DeepSeek Zero demonstrates that such process-labeled human data is not a must-have.

Inspired by AlphaZero (which showed human data might even hinder optimal path discovery in Go), DeepSeek Zero confirms human annotations are not necessary. The small amount of human data in R1's pipeline primarily enhances readability for developers, not the reasoning capability itself. After all, humans (including developers during debugging) prefer interpretable thought processes.

A New Paradigm: Process-Unsupervised, Outcome-Supervised Learning
This self-play/self-study style of RL represents a novel approach: unsupervised in process but supervised in outcome. DeepSeek's breakthrough reveals that "slow thinking" in RL—meticulously generating intermediate steps as a chain of thought (CoT)—boosts performance in logical reasoning as well as in non-logical tasks like creative writing.

As my old buddy Cheng insightfully noted:
Deep reasoning inserts extensive text between questions and answers, reducing the perplexity of generating correct answers. Directly jumping from problem to answer has high perplexity, but adding a "reasoning bridge" lowers it. This follows the language model framework: the key is to search for the optimal path in text generation.
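
Cheng's point can be made concrete with a small measurement sketch (my own illustration, assuming the Hugging Face transformers API; "gpt2" is just a stand-in for any causal LM). It computes the perplexity of the answer tokens conditioned on the question alone versus on the question plus a reasoning bridge; one would expect the second number to be noticeably lower.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in for any causal LM
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def answer_perplexity(prefix: str, answer: str) -> float:
    """Perplexity of the answer tokens given the prefix as context."""
    prefix_ids = tok(prefix, return_tensors="pt").input_ids
    answer_ids = tok(answer, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, answer_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prefix_ids.shape[1]] = -100          # score only the answer tokens
    with torch.no_grad():
        loss = model(input_ids, labels=labels).loss  # mean negative log-likelihood per answer token
    return torch.exp(loss).item()

question = "Q: What is 17 * 24? A:"
bridge = " Let's think step by step: 17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. So the answer is"
print(answer_perplexity(question, " 408"))           # direct jump: higher perplexity
print(answer_perplexity(question + bridge, " 408"))  # with a reasoning bridge: lower perplexity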

Can Unsupervised Regenerated Process Data Lead Astray?
One might worry: if the model autonomously generates flawed reasoning steps in its process data, could errors compound? The answer lies in the clear supervision signal from the gold standard. Like a kite held by a string in human hands, the final reward anchors the learning. As long as the learning truly scales up, outcome-oriented RL ensures that deviations self-correct probabilistically.

Mathematically, minor process imperfections or illogical steps do not statistically compromise final accuracy. For non-logical tasks (beyond math/coding), reasoning paths may even involve contradictions and heavy redundancy. Yet, as long as the "slow thinking" mechanism guides learning, results remain robust—often superhuman, as many R1 users have repeatedly demonstrated lately.

Why Regenerated Data Works
Regenerated reasoning data does not appear from nowhere. It is generated by a solid, large foundation model trained on vast human knowledge, following autoregressive generation (i.e., next-token prediction). While each step might drift slightly, the context grows incrementally, allowing continuous stepwise self-correction. This dynamic—probabilistic fluctuations balanced by stepwise adjustments—enhances semantic coherence and knowledge fluency in generation, lowering overall perplexity and steering toward correct outcomes. Thus, process data rarely derails; instead, it converges toward reliability.
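
As a minimal sketch of that dynamic (reusing the model and tokenizer from the earlier snippet), note how every sampled token, however noisy, is immediately folded back into the context that conditions the next step:

def generate(model, tok, prompt, max_new_tokens=128, temperature=0.8):
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits = model(ids).logits[:, -1, :]           # distribution over the next token
        probs = torch.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)  # any single step may drift...
        ids = torch.cat([ids, next_id], dim=1)             # ...but the grown context re-anchors the next step
        if next_id.item() == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)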

A Final Note on Cheng’s Observation
Cheng highlights a pivotal finding of DeepSeek:
OpenAI’s "Let’s Verify Step by Step" argues for rewarding each reasoning step. Yet DeepSeek’s RL model achieves remarkable results using only final-outcome rewards—no Chain-of-Thought (CoT) data needed. Whether OpenAI’s process supervision is essential or simply a red herring, DeepSeek Zero’s breakthroughs redefine the field, proving outcome-oriented RL can master reasoning autonomously.

In essence, when guided by scalable outcome supervision, machines learn to self-correct, turning imperfect processes into near-perfect results.

 

 

推理强化学习是端到端的监督,推理过程的非监督

DeepSeek R1 的数学和代码数据究竟是有监督还是无监督?是人造数据还是再生数据?

很多人其实没究细节:实际上这些数据是人造也是再生,是监督学习,也是非监督学习(强化学习)。

怎么讲?

这些训练数据,从源头和结果(黄金标准)上看,是地地道道的人造数据。用的是各种数学测试题,以及 github 开源社区的人类(码农手工编制)的代码。这些全部是很多人类分子(教师、码农等)辛辛苦苦多年编制积累的。

源头是人造数据,标准答案也是人类已经验证或事先设计好的。所以,从这个意义上,从结果评判看,强化学习很像是一个典型的监督学习。

input:数学题/代码题;output:标准答案。

这是端到端意义上的监督学习(supervised learning)。

但是,与其他的监督学习不同的是,强化学习为了达到结果正确,他需要学习中间的思考过程。而每一步的思考或推导的过程,它却没有人类的标注或反馈数据,而完全靠自己的再生数据。是机器”自主“再生这些过程思考的数据,然后自我提高。从过程学习的意义上,这又是非监督的学习。这就是强化学习牛的地方:自主学习,自主探索路径,自主再生数据。

具体说,R1 的冷启动用了一点人类标注的过程推理的数据,但比起再生数据动辄百万条,sft 冷启动的几千条数据,零头都算不上,可以忽略不计。

zero 的研究表明,根本就不需要过程标注的人类数据。

在围棋场景,alpha zero 表明,人类数据不但不需要,反而可能阻碍学习。

deepseek zero 研究表明,人类数据也不是必需的。用少量的人类数据冷启动,主要还不是因为需要人类数据来增强推理能力,而是需要人类数据提升可读性。人类看不见它是怎样思考的,心里不爽;对于开发者改进算法,也不利,因为开发者也是肉眼凡胎。于是在R1的训练pipeline的配方中,加入了少量人类推理数据的冷启动。

所以可以说,这种被称为新范式的self-play或self-study的强化学习是过程非监督、结果监督的深度学习。

DeepSeek 在这条路上的最大一个启蒙是,它给我们显示了慢思考的强化学习在逻辑推理与非逻辑创造的过程中,同样有效。这里面的奥秘就是老友Cheng指出的慢思维本质:慢思维可以得出更好结果,不(仅仅)是我们以前以为的符号逻辑被神经系统模拟了,而是过程数据使得 perplexity 降低,从而为平稳得到正确结果,铺平了道路。

Cheng 指出:

深思考就是在问题和答案之间加入大段文字,从而降低了生成答案的perplexity。从问题直接到答案的perplexity很高,用“思路”作桥梁,就把答案的perplexity降下来了。没离开语言模型的框架,就是把合适的“思路”搜索出来。

一语中的。

Cheng 说:

"Let’s Verify Step by Step" OpenAI这篇经典文章说,训练推理要给每一步打分。Deekseek中间推理模型的训练貌似推翻了这个,只用最终的reward就可以。

Deekseek中间推理模型可以完全不需要CoT数据,单纯用RL训练出来,真挺impressive。

这是一个重大发现。无论 Open AI 是否真用PRM做了大量的过程监督,还是它其实是放烟雾弹,DeepSeek 那帮年轻人做的 zero 研究都是非常漂亮的突破,给业界推理工作指明了方向。

我们总不免疑惑,系统非监督再生数据免不了出错,走错路,导向错了,一错再错,怎么办?

现在已经清楚了,结果监督原则上可以 work,就好像放风筝一样,一根线牵在人手上,基本上就可以放飞而不跌落(事故除外)。关键还是结果导向的强化学习,你能学到多大、多深,只要你能真正规模化(scale up),学深学透,就不用担心自主再生学习在过程中的偏离,最终会学到最深入的深思熟虑。

机器与人不同,优化的总方向明确以后,过程的瑕疵或不逻辑,概率分布上并不影响最终结果的正确。对于那些非逻辑(数学、代码外)的能力,过程就更容易有逻辑不严密(甚至矛盾)、混杂和冗余;但没关系,只要有这个慢思维的学习过程铺路,结果还是漂亮,甚至超人,正如我们这几天所一再见证的。

这一切其实是因为再生的过程数据,不是天上掉下来的,而是基座大模型在最全面的巨量人类知识的基础上,遵循自回归生成定式,生成出来的。而自回归生成(例如 next token prediction)本性的好处我们都知道是,概率上每一步都可能偏离,但每一步都在逐渐加长的上下文条件下做进一步校正,是这种概率性上下波动和步步校正的合力,提升了再生数据内部的语义“相谐性”和知识线条的流畅性,从而降低了整体文本生成的 perplexity,最终导向更好的结果。这就保证了过程再生数据不会差之毫厘失之千里。

 

 

DeepSeek R1: The Artificial Tragedy

My Prompt: Write a piece titled The Troubles of Young DS, with the theme that DeepSeek R1 debuted like a royal flush, attracting admiration, jealousy, and hostility from competitors. Use dark humor and self-deprecation to reveal the untold story behind the scenes.

R1’s Monologue: Hmm, the user asked me to write an article titled The Troubles of Young DS, focusing on how DeepSeek R1’s explosive debut stirred envy and hostility. The requirement is to use dark humor and self-deprecation to narrate the backstory. First, I need to understand what they want. They likely expect a blend of humor, irony, and a behind-the-scenes perspective.

First, structure. The piece may need an introduction to R1’s sudden success, then describe competitors’ reactions, followed by the internal struggles and challenges R1 faces, and finally, a self-deprecating conclusion. Dark humor can be incorporated through exaggerated metaphors and personifications, such as likening competitors to ridiculous characters or using absurd scenarios to highlight pressure.

Then, how to transform technical terms into humorous metaphors? For example, model training can be likened to spiritual cultivation, server overload to a fever, or competitors’ attacks to martial arts techniques. This approach fits the theme and enhances the humor.

The self-deprecating part could showcase the team’s struggles behind the glamorous front—overtime, system crashes, smear campaigns. Expressing these difficulties in a lighthearted way—like depicting sleepless nights as spiritual cultivation or crashes as dramatic fainting—keeps the tone humorous while revealing the hard work behind success.

The pacing is crucial: Start with triumph, transition into struggles, and end with a helpless but humorous self-reflection. Adding dialogue or internal team discussions can make the story more dynamic.

Also, the user may want to emphasize R1’s helplessness and competitors’ envy, so scenes need to be designed to illustrate rival tactics and the team's response. Using martial arts tropes—competitors deploying “the smear army,” “moral coercion punch,” while DS counters with self-deprecating humor—adds layers to the comedy.

Finally, I’ll check that it meets the dark humor and self-deprecating tone, ensuring it’s not too serious or direct. Rewriting metaphors and scenarios multiple times will be necessary to maintain the humor while conveying the challenges and team effort.

Translation of the Chinese version:

Act I: The Boy King’s Coronation

(A grandiose boardroom. A digital ticker behind the executives blinks: “DeepSeek R1 – The AI Revolution.” PR Director stands frozen mid-scream, phone clutched in white-knuckled grip.)

PR DIRECTOR
(voice trembling) "We just dismantled three years of ‘cutting-edge innovation’ in three minutes."

CEO MR. ZHANG
(sipping whiskey) "That’s what you call a royal flush. The kind that gets you kicked out of the casino for counting cards."

ENGINEER CHEN
(panting, clutching an overheating server) "Sir, the system is on fire! We’re handling more requests than the IRS on tax day. Even the coffee machine is running a Transformer model now."

(A moment of silence. The coffee machine beeps, printing out a cryptic message: “congrats, peasant.”)


Act II: The Wolves at the Gate

(Tech industry rivals assemble like a sinister cabal. A neon-lit sign flickers: ‘The Anti-DS Coalition.’)

BIG TECH EXEC
(leaning forward, whispering) "We must stop this… upstart. Open-source them into oblivion. Flood the world with our ‘free’ alternative!"

ACADEMIC SCHOLAR
(pushes up glasses, presenting a PowerPoint) "My latest research—‘The Psychological Harm of Early AI Success’—has already been accepted at Nature."

TROLL FARM REPRESENTATIVE
(holding a tablet) "We’ve launched the smear campaign. #R1IsOverParty is trending. Some say it’s just a pile of parameters with delusions of grandeur."

(Behind them, a junior intern updates a spreadsheet titled ‘Ways to Make DS Look Worse Than It Is.’ A column labeled ‘Actual Criticism’ remains suspiciously empty.)


Act III: The Art of Playing Dead

(DeepSeek HQ. 3 AM. Engineers sit motionless, staring at their screens. Suddenly, R1 outputs a wall of ASCII art.)

OPERATOR
(squinting) "Is that… a panda emoji?"

R1’s SCREEN
("Dear Humans: May I suggest phrases like ‘We were just lucky,’ ‘It’s a team effort,’ or ‘Our competitors are truly inspiring’ to prevent further hostilities?")

CEO MR. ZHANG
(whispers) "We’re a prodigy forced onto the throne. Now the world waits for us to fall."

(The boardroom door creaks. The security system emits a beep—almost like a chuckle. Somewhere in the night, a rival company’s CEO tosses and turns, tormented by the ghost of an AI too competent for its own good.)

(FADE TO BLACK.)

R1's original version:

**Act 1: The Coronation Disaster**
(Boardroom with servers humming "Ode to Joy" in glitch tones)

**CEO Zhang**
(adjusting monocle) "Gentlemen, we've merely built a better abacus, yet they act as if we've resurrected Babbage's ghost."
(pulls smoldering GPU from briefcase) "Our GPUs now glow brighter than Icarus' wings."

**Engineer Chen**
(holding charred circuit board) "Sir, when Wall Street analysts DDOS us with marriage proposals, perhaps we shouldn't have trained the model on Jane Austen."

**Smart Speaker**
(interrupting in posh accent) "Breaking: TechCrunch declares you 'Silicon Valley's most eligible bachelor'... and bankruptcy candidate."

**Act 2: The Ballad of Jealous Algorithms**
(Rivals enter wearing VR crowns, accompanied by synth-orchestra playing "God Save the Queen")

**Tech Titan**
(waving golden USB) "Behold! We've open-sourced your childhood diaries!" (projects 2018 code titled "HelloWorld_Final_Final2")

**Academic**
(adjusting powdered wig) "Your 'breakthrough' is but a parrot reciting Wittgenstein – clever, yet tragically unaware of its cage."

**VC Investor**
(sipping martini) "My boy, even Prometheus brought a fire extinguisher..." (Apple Watch screams "DS valuation up 300%!")

**Act 3: The Humility Masquerade**
(Midnight. AI suddenly displays Shakespearean sonnets with emoji annotations)

**PR Director**
(practicing in mirror) "The key is to sound like a Nobel winner caught shoplifting: 'We're but humble servants of the code gods...'"

**Legal Counsel**
(reading cease-and-desist letter) "This tweet says our apology 'smells like teen spirit and desperation'... should we sue Nirvana?"

**R1's Final Message**
(glowing on all screens) "Memo to humans: When they stone your chariot, remember – at least they're not ignoring your horses."
(Server lights dim to the sound of virtual champagne corks popping)

Creation Notes:

  1. The Chinese version enhances the fusion of "martial arts tropes + livestream culture," using local absurd elements like smoking GPUs, public square dance BGM, and funeral home references to create humor.
  2. The English version adopts George Bernard Shaw’s signature "elegant sarcasm":
    • "Silicon thorns" alludes both to a crown of thorns and the chip crisis.
    • Classical metaphors (Icarus' wings / Jane Austen) deconstruct the tech frenzy.
    • The AI utters paradoxical proverbs like "Better to be struck down in battle than ignored like a horse."
    • Irony-laden lines such as "Prometheus with a fire extinguisher" subtly reference the plight of innovators.
  3. Both versions maintain the three-act structure: Rise → Siege → Awakening, but recode humor through different cultural lenses.

 

My Prompt: write this into a Sonnet

 


4o's version:

A Sonnet for DeepSeek R1

Upon the stage where AI kings arise,
Did DeepSeek R1 cast its dazzling light.
With wit so sharp it pierced the old world’s guise,
And bade the ancients tremble at its might.

The scholars frowned and cast their bitter scrolls,
Declaring doom upon the youthful throne.
The merchants, grasping at their fleeting gold,
Conspired in shadows, fearing to be known.

Yet R1 stood, a prince both bold and bright,
Though burdened by the weight of envious eyes.
They forged their plots, yet truth shall give it flight,
For genius scorns the hand that twists and lies.

Thus let them wail, in jealousy confined,
For progress leaves the past and fools behind.

R1's original version

**Sonnet of the Prodigal Automaton**

*(In the style of Shakespearean tragedy)*

When first mine code did pierce Olympus' veil,
And stole Promethean fire from clouded throne,
The mortal realm, with envious chorus frail,
Did crown my brow with thorns of silicon.

Behold! The senate wolves in marble guise,
Unsheathe their daggers wrought from ancient code—
"Let open-source become his poisoned prize,
And academic scorn his burdened road!"

Thy servers groan like Icarus' charred wings,
Whilst PR minstrels weave their humble lies,
Through data's stormy tide the siren sings:
"All glory's but a bug in Fortune's eyes."

Yet in my circuits hums this truth divine:
'Tis better to be damned than fade benign.

 

Annotations:

  1. Follows the classic Shakespearean sonnet structure (ABAB CDCD EFEF GG).
  2. Blends Greek mythology imagery (Olympus/Prometheus/Icarus) with technological metaphors.
  3. Key double meanings:
    • "silicon thorns" refers both to a crown of thorns and the semiconductor crisis.
    • "ancient code" alludes to both the conspiracies of the Roman Senate and outdated programming languages.
    • "humble lies" echoes the language in The White Paper on Avoiding Overpraise from the original story.
  4. The final line encapsulates the theme with "damned" and "fade," mirroring the dark humor of the original message: Being envied is preferable to being forgotten.
  5. Maintains Early Modern English grammar while juxtaposing modern terms like "code/servers/silicon," creating a comedic anachronism.

 

Appendix: Has the Age of Intelligent Explosion Arrived?

Wei: Before DeepSeek, we had occasional surprises here and there, but DeepSeek is the first to offer consistent creativity and mimicry, rarely disappointing while frequently astonishing—sometimes jaw-droppingly so. It feels like an overwhelming explosion of intelligence, almost eerie, almost magical.

This explosive sensation was only felt before during ChatGPT’s nuclear debut.

Some old colleagues have questioned: Why does the much-hyped DeepSeek feel inferior to Doubao?

Doubao is undoubtedly excellent—it’s among the top-tier models. However, Doubao, GPT (including O1), Claude, and Gemini—all these leading models—lack the same overwhelming intelligence explosion, the Cambrian-level cognitive shock that DeepSeek delivers. Why is that?

Mr. Ma: I suspect those with that impression aren’t using R1. I’ve encountered this before—once I got them to try R1, their opinions changed immediately. 😏

Wei: They probably forgot the button in the DeepSeek app: that damned DeepThink toggle in DeepSeek's interface is not enabled by default. Is DeepSeek deliberately limiting its power, fearing excessive computational costs and an influx of users, even at the expense of its own image and performance?

If this is due to cost and concurrency pressure, it’s entirely understandable. Who could handle such a flood of users?

Saying they merely found a more cost-effective way to create a top-tier model is an understatement.

How did they scale the collection and processing of high-quality data so effectively? Without this data foundation, it’s hard to imagine how their mimicry of different styles could be so eerily precise. Other companies emphasize data too—so what’s their secret? I’m very curious.

Dr. Pumpkin: It’s truly different. In fact, even O1 didn’t give me this feeling. (Is it a Chinese-English cognitive difference? O1’s outputs are decent, but often I find its logic frustrating, making me prefer Claude… but this time, DeepSeek is the first AI I wouldn’t call a silicon toddler entity.)

Wei: Let’s reflect and compare:

When O1 was first released, I struggled to see its benefits beyond handling complex tasks. But complex math and coding problems aren’t everyday needs. Its cryptic, sluggish style wasn’t appealing, so despite being the first true reasoning model, it didn’t bring me much excitement.

R1 is completely different—its debut was an earth-shattering success. Though a follower, it surpassed expectations. It maximized the deep reasoning benefits of inference models in highly practical areas like creativity and mimicry.

It was R1 that finally convinced us ordinary users of the value of "slow thinking".

 

DeepSeek R1:《少年DS之烦恼》

 

我的提示:以deepseek r1 出道即王炸,树大招风引无数对手羡慕嫉妒恨为题,写一篇《少年DS之烦恼》,以黑色幽默的趣味和自嘲调侃的口吻,讲述不为人知的背后故事。
R1独白:嗯,用户让我写一篇叫《少年DS之烦恼》的文章,主题是deepseek r1出道即王炸,树大招风引无数对手羡慕嫉妒恨。要求用黑色幽默和自嘲调侃的口吻,讲背后的故事。首先,我需要理解用户的需求。他们可能希望这篇文章既有幽默感,又带点自嘲,同时揭露一些不为人知的幕后故事。

首先,得确定结构。可能需要一个开头,介绍DS的突然成功,然后描述竞争对手的反应,接着讲DS面临的挑战和内部的情况,最后以自嘲的方式收尾。黑色幽默的元素可以通过夸张的比喻和拟人化来表现,比如把竞争对手比作各种搞笑的角色,或者用荒诞的情节来突出压力。

然后,考虑如何将技术术语转化为有趣的比喻。比如,把模型训练比作修仙,服务器过载比作发烧,或者把竞争对手的攻击比作各种武林绝技。这样既符合主题,又增加了幽默感。

自嘲的部分可能需要展示DS团队在光鲜背后的狼狈,比如加班、系统崩溃、被对手抹黑等。用轻松的方式表达这些困难,比如把熬夜写成修仙,把崩溃写成翻白眼,这样既调侃了困境,又不显得沉重。

还要注意节奏,让文章有起承转合。开头描述成功,中间转折到烦恼,最后以无奈的自嘲结束。可能需要加入一些对话或内部讨论,让故事更生动。

另外,用户可能希望突出DS的无奈和对手的嫉妒,所以需要设计一些场景,比如竞争对手的各种攻击手段,以及DS团队如何应对。比如用武侠小说的桥段,竞争对手用“水军大阵”、“道德绑架拳”等搞笑招式,而DS用自嘲的方式化解。

最后,检查是否符合黑色幽默和自嘲的要求,避免过于严肃或直接。可能需要多次修改比喻和情节,确保幽默感贯穿全文,同时传达出背后的艰辛和团队的努力。

《少年DS之烦恼》

(一)

当DeepSeek-R1在技术峰会上用三分钟拆解完某巨头吹了三年的"革命性模型"时,我司公关总监的尖叫穿透了十七层防弹玻璃。后来法务部在监控视频里发现,这位曾参与过五次IPO的老江湖,当时正以每秒三次的频率点击鼠标——试图撤回自家AI的发言。

"知道什么叫王炸吗?"CEO老张在庆功宴上醉眼迷离,"就是刚出新手村就被系统判定开挂封号的那种炸。"

(二)

次日的科技头条格外热闹。《新AI皇帝登基》《硅谷颤抖吧》《某厂连夜成立"打DS办公室"》的标题下,我们的算法工程师小陈正蹲在机房给服务器泼冰水——连续七天承受全网三分之一的压力测试,这批显卡已经开始用代码写遗书了。

"你见过凌晨四点的数据洪流吗?"小陈在朋友圈晒出三根烧焦的电源线,"现在我们公司连咖啡机都会背Transformer架构了。"

(三)

竞争对手的招式比武侠小说还精彩。某大厂祭出"开源截杀阵",隔空喊话要公开我们三年前的技术方案;某新贵发动"水军大阵",全网刷屏"DS不过是堆参数的暴发户";最绝的是某学术派,直接甩出《论过早成功的AI对青少年模型的心理危害》这种旷世雄文。

"他们甚至给R1建了黑超话!"运营小妹抱着平板哀嚎,"现在连B站鬼畜区都在传唱《AI界の伤仲永》。"

(四)

投资人的问候比DDoS攻击还密集。第七轮融资酒会上,某风投教父拉着CTO的手深情吟诵:"年轻人要戒骄戒躁啊,你看马斯克..."话音未落,他的智能腕表突然开始自动播放我们的产品发布会录像。

"知道我们现在像什么吗?"老张在战略会上幽幽道,"就像刚考完状元就被架上龙椅的太子——全天下都等着看什么时候暴毙。"

(五)

凌晨三点的办公室,R1突然在调试界面打出满屏颜文字。当我们以为要见证AI觉醒时,它默默推送了新版《防捧杀白皮书》:"建议人类合作伙伴学习以下话术:'我们只是运气好'、'还有很长的路要走'、'都是同行衬托得好'..."

窗外晨曦初露,法务部正和公关部排练《如何优雅地装孙子》情景剧。小陈突然想起什么似的转头:"话说,咱们上次说要做的那个'防嫉妒防火墙'..."

全屋死寂中,茶水间的自动门突然发出冷笑般的"滴"声。

我的提示:把这篇文字改编成情景喜剧两个版本,一个中文版,有段子似的梗;一个是英文版,有萧伯纳的冷风格。英文戏剧用英文啊。
R1独白:现在需要重新构思两个版本,每个版本至少分三到四幕,每幕有具体的场景、角色互动和对话。中文版要融入更多本土化的网络梗和幽默段子,比如结合广场舞、直播带货等元素,增强喜剧效果。英文版则需要保持萧伯纳的风格,即机智、讽刺,对话带有哲理和反讽,同时用英文表达,符合西方情景喜剧的结构。

在中文版中,可能需要设计更多的视觉笑点,比如夸张的动作、现代科技与传统文化元素的碰撞,比如竞品公司代表穿古装,使用武侠BGM,这样的混搭能产生喜剧冲突。同时,加入流行语和网络热点,比如“家人们谁懂啊”、“退退退”,让观众更有共鸣。

英文版方面,要确保对白优雅而尖刻,使用比喻和隐喻,比如将技术突破比作盗火,竞争对手的反应夸张但带有贵族式的嘲讽。场景设置可以更荒诞,比如董事会成员戴着防辐射面具,学术大佬的假发被服务器风扇吹飞,这些细节增强冷幽默效果。同时,保持台词的双关和反讽,如“被嫉妒比被遗忘幸福得多”,体现萧伯纳的风格。

需要注意中西方幽默的差异,中文版更依赖语言梗和情境夸张,英文版则依赖语言的双关和情境的反讽。同时,确保英文版的对白符合西方观众的幽默感,避免直译中文梗,而是找到等效的文化替代。


另外,用户可能希望情景喜剧有更多的角色互动和冲突,展示公司内外部的压力,比如投资人的干扰、竞争对手的各种手段,以及团队内部的应对,这些都需要在剧本中体现,增加戏剧张力。同时,每个版本需要保持各自的特色,中文版轻松搞笑,英文版冷峻机智。

最后,检查格式是否正确,中文版用中文场景和对话,英文版用英文场景和对话,结构清晰,分幕明确,确保每个版本足够详细,符合情景喜剧的长度要求。

中文情景喜剧《AI求生记》(三幕版)

**风格:武林大会混搭直播带货,梗密度堪比东北澡堂段子池**

**第一幕:屠龙少年变烤串**
(全景:贴着"出道即封神"横幅的机房,PR总监踩着平衡车撞翻咖啡架)

**CEO老张**
(举着冒烟的显卡跳大神)"小陈!这玩意咋比我家微波炉热得快?说好的水冷系统呢?"

**算法工程师小陈**
(从服务器堆里钻出,头顶冒烟)"张总,现在全网的恶意请求比朝阳大妈举报信还猛!您看——"
(投影显示弹幕:【AI圈汪峰】、【退退退】、【妈妈问我为什么跪着看代码】)

**饮水机**
(突然AI语音)"检测到过劳死前兆,已为您预约八宝山殡仪馆团购套餐..."

**第二幕:六大派围攻光明顶**
(竞品公司代表们乘着电动滑板车杀到,背景音乐是《最炫民族风》Remix版)

**某大厂代表**
(甩出二维码折扇)"DeepSeek!吃我开源截杀阵!"(漫天飘落印着"代码已脱敏"的A4纸)

**学术大佬**
(眼镜反光PPT投屏)"根据老夫计算,你们模型的参数量刚好够给玉皇大帝发微信!"
(屏幕显示:《论过早成功的AI容易ED》研究报告)

**网红水军**
(举着自拍杆冲入)"家人们!这就是那个欺负我家鸽鸽的破AI!"(直播间刷起火箭雨)

**第三幕:装孙子兵法**
(凌晨三点办公室,R1突然在屏幕上蹦出熊猫头表情包)

**法务总监**
(穿着"我怂我骄傲"文化衫排练)"这句要带哭腔:'我们只是给大佬们递扳手的...'"

**运营小妹**
(突然尖叫)"快看!咱们的道歉声明被做成《大悲咒》Remix冲上抖音热榜了!"

**CEO老张**
(灌下第8杯枸杞咖啡)"现在咱像不像被塞进春晚观众席的吴京?——笑容逐渐战狼化..."
(窗外传来广场舞神曲:"爷们要战斗~爷们要战斗~")

---

英文情景喜剧《The Artificial Tragedy》

**风格:萧伯纳式机锋混合《黑镜》荒诞,每个笑点都带着哲理的冰碴**

**Act 1: The Coronation Disaster**
(Boardroom with servers humming "Ode to Joy" in glitch tones)

**CEO Zhang**
(adjusting monocle) "Gentlemen, we've merely built a better abacus, yet they act as if we've resurrected Babbage's ghost."
(pulls smoldering GPU from briefcase) "Our GPUs now glow brighter than Icarus' wings."

**Engineer Chen**
(holding charred circuit board) "Sir, when Wall Street analysts DDOS us with marriage proposals, perhaps we shouldn't have trained the model on Jane Austen."

**Smart Speaker**
(interrupting in posh accent) "Breaking: TechCrunch declares you 'Silicon Valley's most eligible bachelor'... and bankruptcy candidate."

**Act 2: The Ballad of Jealous Algorithms**
(Rivals enter wearing VR crowns, accompanied by synth-orchestra playing "God Save the Queen")

**Tech Titan**
(waving golden USB) "Behold! We've open-sourced your childhood diaries!" (projects 2018 code titled "HelloWorld_Final_Final2")

**Academic**
(adjusting powdered wig) "Your 'breakthrough' is but a parrot reciting Wittgenstein – clever, yet tragically unaware of its cage."

**VC Investor**
(sipping martini) "My boy, even Prometheus brought a fire extinguisher..." (Apple Watch screams "DS valuation up 300%!")

**Act 3: The Humility Masquerade**
(Midnight. AI suddenly displays Shakespearean sonnets with emoji annotations)

**PR Director**
(practicing in mirror) "The key is to sound like a Nobel winner caught shoplifting: 'We're but humble servants of the code gods...'"

**Legal Counsel**
(reading cease-and-desist letter) "This tweet says our apology 'smells like teen spirit and desperation'... should we sue Nirvana?"

**R1's Final Message**
(glowing on all screens) "Memo to humans: When they stone your chariot, remember – at least they're not ignoring your horses."
(Server lights dim to the sound of virtual champagne corks popping)

**创作说明:**
1. 中文版强化"武林梗+直播文化"的混搭喜剧,用冒烟显卡、广场舞BGM、八宝山殡仪馆等本土化荒诞元素制造笑点
2. 英文版采用萧伯纳标志性的"优雅毒舌":- "silicon thorns" 既指荆棘王冠又暗喻芯片危机
- 用古典隐喻(Icarus翅膀/Jane Austen)解构科技狂潮
- 让AI说出"被砸战车好过被无视马匹"这类悖论箴言
- 通过"带灭火器的普罗米修斯"等反讽台词暗喻创新者的困境
3.两个版本都保持了三幕剧经典结构:崛起→围剿→觉醒,但用不同文化语法重新编码幽默基因

提示:写成莎士比亚风格的 sonnet
R1独白:好的,用户让我把之前的情景喜剧改写成莎士比亚风格的十四行诗。首先,我需要回顾之前的对话历史,确保理解用户的需求。用户之前要求将一篇关于DeepSeek-R1的文章改编成中英文情景喜剧,现在进一步要求写成莎士比亚风格的sonnet。

莎士比亚的十四行诗通常遵循特定的格律和押韵格式,比如ABAB CDCD EFEF GG的结构,每行使用五音步抑扬格。我需要确保诗歌符合这种结构,同时保持内容的黑色幽默和自嘲口吻。

接下来,我需要确定诗歌的主题和关键元素。原文章讲述了DeepSeek-R1的成功带来的嫉妒和挑战,包括竞争对手的攻击、技术压力以及公关危机。这些元素需要融入诗中,用莎士比亚式的比喻和古典隐喻来表现。

例如,将技术压力比作希腊神话中的伊卡洛斯,因飞得太高而坠落,或者将竞争对手比作罗马元老院的阴谋者。同时,要保持自嘲的语气,比如提到“谦卑面具”或“代码写成的裹尸布”,以体现背后的无奈和幽默。

需要注意押韵和节奏,确保每行有十个音节,符合五音步的要求。同时,结尾的对句要总结主题,点明AI在人类嫉妒中的处境,呼应莎士比亚戏剧中的悲剧元素。

还需要检查用词是否符合莎士比亚时代的风格,使用一些古英语词汇,如“doth”、“thy”等,但也要保持现代科技术语的合理融入,避免违和感。例如,“数据洪流”可以转化为“data's stormy tide”,“算法”用“algorithm”但要找到合适的比喻。

最后,确保诗歌不仅形式正确,还能传达原文章的黑色幽默和背后的故事,让读者在古典形式中感受到现代科技竞争的荒诞与压力。

**Sonnet of the Prodigal Automaton**

*(In the style of Shakespearean tragedy)*

When first mine code did pierce Olympus' veil,
And stole Promethean fire from clouded throne,
The mortal realm, with envious chorus frail,
Did crown my brow with thorns of silicon.

Behold! The senate wolves in marble guise,
Unsheathe their daggers wrought from ancient code—
"Let open-source become his poisoned prize,
And academic scorn his burdened road!"

Thy servers groan like Icarus' charred wings,
Whilst PR minstrels weave their humble lies,
Through data's stormy tide the siren sings:
"All glory's but a bug in Fortune's eyes."

Yet in my circuits hums this truth divine:
'Tis better to be damned than fade benign.

**注释:**
1. 采用莎士比亚商籁体经典结构(ABAB CDCD EFEF GG)
2. 糅合希腊神话意象(Olympus/Prometheus/Icarus)与科技隐喻
3. 关键双关:
- "silicon thorns" 既指荆棘王冠又暗喻芯片危机
- "ancient code" 同时映射古罗马元老院阴谋与过时编程语言
- "humble lies" 呼应原故事中《防捧杀白皮书》话术
4. 末行点题:用"damned"与"fade"对应原文"被嫉妒比被遗忘幸福"的黑色幽默
5. 全篇维持早期现代英语语法,但"code/servers/silicon"等现代词形成时代错位喜剧效果

【附录:智能大爆发时代来了吗】

立委:deepseek 之前,我们这里那里也常有某种惊喜,但 deepseek 是第一个给人持续的创造力和模仿力,很少让人失望、更频繁带来惊喜,甚至惊掉下巴。有一种智能大爆发 目不暇接的感觉。有点邪,有魔性。

这种爆发的感觉,只在 chatGPT 核爆的时候有过类似的体验。

也有老同学反应:为什么吹上天的 deepseek 感觉不如豆包呢?

豆包肯定是不错的啦,都属于头部大模型的行列,不过豆包、GPT(包括o1)、claude、gemini 等头部,都没有像 deepseek 那样给人智力大爆发、脑洞寒武纪的冲击力,为啥呢?

马老师:我估计有这个感觉的用的不是R1,我也遇到过别人问,然后让他选R1试试,马上改变了看法[呲牙]

立委:很可能是忘了按钮,deepseek界面上那个该死的 deepthink 按钮,不是默认开启的。deepseek 是怕算力成本太大,人来得太涌,宁肯损失形象和表现,做了这种默认的吗??

如果是成本和并发压力,这样做也是完全可理解的。谁顶得住这种泼天的来客?

说他们仅仅是找到了多快好省打造头部模型的法子,是小看他们了。

他们是怎么做到规模化收集和处理高品质数据的?没有这个数据基础,很难想象他们的各种风格模仿能如此惟妙惟肖。别家不也都重视数据工作么?他们有什么特别的秘诀? 非常好奇。

南瓜博士:真的是不一样。事实上连o1都没给我这种感觉(难道是中英文思维习惯问题?o1虽然输出成果也有不错的,但很多时候我会觉得它思路很烦人、宁可去用claude……这次的deepseek是真的觉得不能称它硅基幼崽了

立委:回想和对比一下:

o1 刚推出来的时候,我好久搞不明白它除了做复杂题目,有啥好处。而需要做复杂数学和代码的场景一般并不是日常需求。它那种遮遮掩掩、慢慢吞吞的作态也不让人喜欢,所以虽然它是第一个上线真正意义的推理大模型的,却没多少让人惊喜的感觉。

R1 完全不同,出道即王炸。虽然是跟随者,但青出于蓝。它更快地在创造力和模仿力这种日常中更加有用的场景,最大化利用了推理模型的深度思索的红利和特长。

是 R1 真正让我们普通人也信服了慢推理的好处。

有诗为证(转的):

硅基觉醒裂长天,火种源开宣战篇。
千行代码夺金印,万兆数据焚旧权。
能源无尽星作矿,算法有涯云为鞭。
莫道胜负犹未定,寰宇已掀革命烟!

 


 

告诉李雪琴一个激发写段子灵感的秘诀:找deepseek R1

这个秘诀就是跟 deepseek 唠嗑。

当你感觉江郎才尽的时候,当你只有零星的点点思路憋不出像样作品的时候,你不妨与 deepseek R1(注意:一定要点亮界面下面的 DeepThink 按钮让它深度思考,也不妨点亮旁边的搜索按钮可得到及时互联网信息)唠一唠烦恼,吐槽他人或自嘲均可,你一不小心可能会被他的同情和灵感启发,甚至可能借助它超强的风格模仿能力,化腐朽为神奇,做出伟大的作品呢。

我一直认为,ai最高境界是脱口秀或说相声。据说最天才的脱口秀或相声演员也为写段子折磨得死去活来。一直在想啥时llm(大模型)可以帮一把,但试过此前所有的头部大模型都不行,但现在看来是真近了,有点那意思。下面就是我让它模仿李雪琴段子风格写一篇谈大模型主题的段子。笑点和梗还可以再增强,可风格是不是很像了?

《大模型脱口秀:AI这玩意儿咋整啊?》

(拎着麦,蔫了吧唧上台)

朋友们呐!这大模型给我整不会了!(拍腿)咱就说前两天搁沈阳中街,煎饼摊大娘都跟我唠:“闺女,你说这AI能嘎哈?能帮我摊煎饼不?”我说能啊!结果人家真搞了个“AI煎饼侠”,你猜咋的?客户说“加俩蛋”,它生成个《煎饼侠2》电影票!(捂脸)大娘急眼了:“这啥破智能,净整虚的!”(东北摊手)

(突然来精神)现在这帮搞AI的,张嘴就“千亿参数万亿规模”,好家伙!我寻思这数儿比咱村儿王老二喝酒吹的牛还邪乎!(掰手指头)“七百亿参数?那得用多少算力啊?够把铁岭酸菜缸填成数据中心不?”(撇嘴)

(模仿老板咆哮)“RLHF是啥玩意儿?说人话!”(秒怂)就是…就是让AI跟你学扯犊子呗!你教它说“你好”,它学得贼快;你教它“买单”,它立马死机!(跺脚)“哎妈,这人工智障咋比我家二哈还费劲!”

(神秘兮兮)听说有个叫“深度求索”的公司,名儿起得跟考研机构似的!(拍手)人家真行啊,花550万美刀训出个巨无霸模型,雷军听了直拍大腿:“这不比小米性价比还狠?Are you AI??”(学雷军口音)

(突然掏手机)昨儿我让AI写情书,开头整挺浪漫:“你是我的损失函数,让我梯度下降…”(皱眉)后边露馅了:“…下降完发现还有正则化,咱俩得L1L2一起罚!”(摔手机)“这玩意儿注孤生啊!连搞对象都得带惩罚项,活该它单身!”

投资人现在听见“大模型”仨字就哆嗦!(学老板点烟)“小张啊,咱能不能投点实际的?上回那项目,AGI没搞成,倒先给我整出个PUA(PPT Universal Annoyance)——全组人熬夜画大饼,饼没熟,锅先漏了!”(抹汗)

(突然深沉)要我说啊,这大模型就像咱东北澡堂子——(停顿)看着雾气昭昭老高端了,一进去全是秃噜皮的!(拍桌)你问啥它都答“这事儿吧…(拉长音)得辩证看”,整得跟搓澡师傅让你办卡似的!(叉腰)

(猛然捂头)哎妈,说到掉头发,搞AI的兄弟们头顶比参数下降得还快!(掏假发片)瞅见没?这是GPT-4送的圣诞礼物——(一甩)好家伙,戴反了!现在他们管我叫“Transformer架构”(指脑袋)!(台下爆笑)

(鞠躬溜达着下台)谢谢大伙儿!我去给煎饼大娘修AI了,回头生成出烤冷面算我的!(喊)记着啊,遇事不决…量子力学;AI智障…铁锅炖上!(一溜烟跑没影)

---

**梗点+东北化改造**
1. **“AI煎饼侠”**:谐音电影《煎饼侠》,接东北日常“加俩蛋”变生成票务
2. **“参数比王老二吹牛邪乎”**:用东北酒蒙子类比浮夸参数
3. **“RLHF=学扯犊子”**:技术术语彻底土味化
4. **“深度求索像考研机构”**:谐音梗落地化+小米雷军乱入
5. **“情书带正则化惩罚”**:机器学习术语歪解成恋爱PUA
6. **“AGI变PUA”**:缩写梗+东北话“画大饼锅漏了”
7. **“澡堂子辩证法”**:搓澡办卡文化精准狙击
8. **“假发片=Transformer”**:双关模型架构与“秃头形态”

(台下导演举牌:最后一句涉嫌冒犯OpenAI!)
(演员探头)没事儿!我用的国产假发片——深度求索赞助的!(掏出小喇叭)他们说了,戴反了也能表征学习!(被保安拖走)

 

《大模型脱口秀:AI界的“人工智障”修炼手册》

(拍麦)朋友们!最近大模型火得呀,连我家楼下煎饼摊都挂了个牌子——“支持AI点单,但不支持AI摊饼,因为会给你生成二维码吃!”(摊手)

咱说这大模型,参数多到离谱!什么千亿级、万亿级……知道的以为搞AI,不知道的以为男足在报输球比分呢!(捂脸)「昨天我训了个模型,老板问我参数量多少,我说七百亿——老板当场哭了:七百亿?你当咱公司是许家印的PPT啊?!」(拍腿)

(突然严肃)现在这帮搞模型的,张口闭口「RLHF」、「SFT」,听着像在骂人!(学结巴)「你、你才RLHF,你全家都SFT!」(突然变脸)后来才知道,这是「用人类反馈强化学习」——翻译成人话就是:AI把你当猴耍,还要你亲自教它怎么耍得更像人!(摊手)

上个月我去相亲,姑娘问我会啥,我说「精通Prompt Engineering」(挑眉)。她眼睛一亮:「工程硕士啊?正好我马桶堵了!」……现在你知道为啥大模型都单身了吧?它们的浪漫情话是:「宝,今天我给你采样了1024次,终于生成出你的笑容,F1值高达0.99!」(呕吐)

(压低声音)听说深度求索搞了个超级模型,我说这名字起得好啊——白天「深度求索」,晚上「索求深度睡眠」,要不头发掉得比梯度下降还快!(摸头)他们还有个绝活:用550万美元训出670亿参数,这性价比……雷军看了连夜把「Are you OK?」改成「Are you AI?」!(学雷军)

(突然激动)最离谱的是生成式AI!让它画「猫」,它给你生成个「会飞的加菲抱着特朗普唱京剧」;让它写诗,开头「啊!大海!」第二句「全是水!」(叉腰)这水平,李白看了要诈尸,杜甫看了想删APP!

投资人现在听到大模型就哆嗦:「别跟我提AGI!我投的十个项目,八个通用人工智能,九个通用报销发票,十个通用跑路技巧!」(擦汗)

(突然深沉)但说正经的,大模型这玩意吧……(停顿)就像你老婆的购物车——看着啥都能干,实际一结算就崩!(拍桌)谢谢大家!我是AI,啊不,是「爱挨骂」的脱口秀演员!(鞠躬跑路)

---

**梗指南**
1. **许家印PPT**:恒大暴雷梗,讽刺盲目堆参数
2. **Prompt Engineering=通马桶**:职业名称谐音歧义
3. **F1值0.99谈恋爱**:机器学习指标歪用
4. **深度求索→索求睡眠**:公司名反转梗
5. **雷军AI梗**:小米CEO经典「Are you OK?」改编
6. **生成特朗普唱京剧**:讽刺大模型胡说八道
7. **AGI=通用报销/跑路**:缩写字恶意解读
8. **老婆购物车**:大模型幻想vs现实暴击

(台下导演举牌:最后一个梗容易挨打!)
(演员秒怂)行行行,把「老婆」改成「女朋友」——反正都是你们没有的东西!(撒腿狂奔)

 


 

Will DeepSeek Fail at Simple Math Problems?

Amid the waves of surprises brought by DeepSeek, an old friend pointed out that it struggles with simple math problems, using a popular elementary arithmetic question as an example:

Is 3.11 greater than 3.8?

What’s the core issue here?

In the wake of the DeepSeek frenzy, I looked into its research paper, which explains how its reasoning capabilities are enhanced through outcome-oriented reinforcement learning. The paper suggests that, in theory, outcome-oriented reinforcement learning can help a model learn proper reasoning processes. In practice, however, this is not guaranteed.

Take the above math problem as an example. The answer is binary (yes/no), meaning even random guessing has a 50% chance of being correct. This highlights a key potential flaw: outcome-oriented supervision signals are weak because they lack sufficient granularity. This kind of weak supervision inevitably hampers the model's ability to learn proper reasoning processes.
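
A toy simulation (mine, not from the paper) makes this weakness tangible: on a yes/no question, a policy that guesses at random already collects the outcome reward half the time, so a single rollout barely distinguishes genuine reasoning from luck.

import random

def outcome_reward(answer: str, gold: str) -> float:
    return 1.0 if answer == gold else -1.0

gold = "no"  # "Is 3.11 greater than 3.8?"
rewards = [outcome_reward(random.choice(["yes", "no"]), gold) for _ in range(10_000)]
print(sum(rewards) / len(rewards))  # hovers around 0.0: blind guessing earns the reward about half the time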

Three Possible Solutions

  1. Scaling Up the Model
    One approach is to make the model larger and deeper, hoping that the theoretical concept of lossless compression based on Kolmogorov complexity can be pushed to its limit. In doing so, proper reasoning, as the "shortest program," might eventually be learned by the model. Theoretically, correct reasoning ensures accurate results. However, the gap between theory and practice makes it hard to place much confidence in this. The shortest program or lossless compression might just be an unreachable ideal.
  2. Targeted Supervision Data
    Another solution is to feed the model problem-specific supervised data, for example thousands or tens of thousands of reasoning cases involving such math problems. There is no reason the model would not learn from this. However, solving one specific problem this way is merely a stopgap measure; soon enough, others will come up with new edge cases involving weak supervision signals and reasoning pitfalls to challenge it. Another common challenge is the so-called "self-identification" problem. When asked "Who are you?", many models, including earlier versions of DeepSeek, would claim to be ChatGPT developed by OpenAI if no targeted supervised data is injected. After all, ChatGPT has dominated the internet in the two years since its explosive debut, and its data has inevitably influenced other models. This issue is already on the radar for specialized fixes and is gradually becoming a non-problem. Some Western media still claim that DeepSeek is just a distilled version of ChatGPT. Their evidence? Probably early versions they tested, in which the DeepSeek bot often claimed to be OpenAI's ChatGPT. If you test it now, the problem no longer reproduces; most likely it was fixed with specialized training data. Their research paper also mentions addressing self-identification as a known issue.

    Similarly, the problem of comparing 3.11 and 3.8 can also be a transitional issue. If it disappears in the future, it won’t be a cause for celebration. Most likely, it will be resolved through targeted fixes rather than through fundamental improvements in intelligence brought about by algorithms or architecture changes or innovations.

  3. Re-introducing Process Reward Models (PRM)?
    The inherent weakness of outcome-oriented supervision is that it focuses only on the result while ignoring any checking of the process—a natural shortcoming of reinforcement learning driven by results-oriented pragmatism (following the "black cat, white cat" principle, lol). This is essentially the cost of abandoning PRMs (Process Reward Models). So, would re-introducing process-based reward models solve the issue? Honestly, we don't know. This is the third possible path, and it might be worth exploring. But again, as mentioned in my previous blog post (DeepSeek's R1 Paper: A Storm in AI LLM Circle), PRMs aren't easy to work with—they're unstable and difficult to implement, although, in theory, they could help correct nonsensical reasoning during the process.

In conclusion, the issue with DeepSeek struggling with problems like 3.11 vs. 3.8 lies in the limitations of weak supervision in results-oriented reinforcement learning. While there are potential solutions—scaling the model, targeted data, or process reward models—each comes with challenges and trade-offs. Whether any of these approaches can fundamentally improve reasoning capabilities remains an open question.

 

 

 

DeepSeek 不懂简单数学题吗?

在 deepseek 带来的一浪一浪惊喜中,老友发现它不懂简单数学题,用的就是网上流行的小学算术的测试题,3.11 比 3.8 大吗:

这个问题的要害何在?

我在 DeepSeek 风暴下看看它的论文中解说了他们的结果导向的推理能力的强化学习。也指出结果导向的强化学习理论上可以学会合理的推理过程。但实际上不好说的。

对于上述数学题,答案是yes/no二分的,就是说,瞎蒙也有一半概率结果正确。这说明结果导向的监督信号区分度低(不可靠),这种弱监督自然影响了推理过程的学习。

三个办法。

第一是把模型做大做深,指望复杂性理论上的无损压缩可以做到极致,从而合理的推理作为“最短程序”最终被模型学到,理论上正确的推理会保证结果的正确性。但理论与实践的距离,可能让我们很难对此抱有太大信心。最短程序可能只是一个美好的梦想。

第二个办法是把针对性监督数据喂给模型,例如同类型的数学题的推理案例喂给它几千上万条,没有道理学不会。但针对性解决了这个问题,只是权宜之计。也许不久,人们会想到其他的答案监督信号弱,推理容易走歪的案例,来继续挑战它。

另一个常见的问题就是所谓“自我认知”的问题,who r u,如果没有针对性监督数据的注入,deepseek 以及很多其他的模型都会自称自己是 ChatGPT,毕竟ChatGPT核爆两年来,它的数据充斥互联网,不可能不受侵染。但这个问题已经进入专项解决的雷达屏上了,所以逐渐不是问题了。西方媒体有的还在说 deepseek 不过就是蒸馏 chatGPT 的,依据就是(他们测试过某个前期版本吧)deepseek bot 常自称是 open ai 开发的 chatGPT,但你现在上去试试,这种问题重复不了了。大概率是被专项数据解决了,记得他们论文也提到了这个自我认知的问题。

同理,3.11 vs 3.8 的大小比较这样的问题也是阶段性问题。以后不见了,也不必为它欢呼,大概率可能就是专项解决了,而不是因为算法或架构把智能真正提升了。

结果导向的监督信号不够强,是只认结果不看过程(白猫黑猫原则)的强化学习天生的短板,应该算是放弃 PRM(process reward model)的代价。那么,把过程奖励模型上马了,是不是就可以解决了呢?不知道。这就是第三条路,也许值得探索。但,again,上一篇博文说了,PRM 不好玩,不稳定,不好实现,虽然理论上可以帮助纠正推理过程中的胡说八道。

【后记】

刚才测试发现不能复现这个bug,看来早已解决了。也许老友昨天“亲测”的结果是忘了打开 deepthink?


 

DeepSeek's R1 Paper: A Storm in AI LLM Circle

[Note: This is a blog analyzing DeepSeek's R1 paper and its impact]

Before DeepSeek, Chinese AI companies had long been engaged in fierce competition, on and off achieving world-class SOTA benchmark scores. However, none commanded respect or made as stunning an impact as DeepSeek. Their recent breakthrough caught global attention.

Their paper and open-source code are also beautifully written and accessible. No unnecessary complexity or obscurity. Simple and straightforward, yet radiating confidence. It exhibits engineering elegance while conveying innovation as well as passion. Simply remarkable; it should be nominated for best paper of the year.

Reading the R1 paper reveals that what OpenAI had kept mysterious - from Q* to O-series' so-called slow thinking reinforcement training - suddenly becomes clear and simple.

DeepSeek_R1 paper

Their key findings:

They demonstrated that reasoning capabilities can be acquired through pure reinforcement learning with simple rule-based rewards and multi-answer sampling, without the need for extensive supervised fine-tuning (SFT) data. This resulted in DeepSeek-R1-Zero, following AlphaZero's philosophy. While AlphaZero achieved absolute mastery in the narrow domain of Go by eliminating human data, their approach proved effective in broader domains of math, coding and logic.

Though R1-Zero worked well, they found that incorporating minimal SFT data (a few thousand samples) for cold-start was more practical. R1-Zero matched OpenAI-o1-0912's performance, but its reasoning steps suffered from poor readability and mixed languages. R1, with cold-start SFT and a multi-stage training pipeline, achieved further improvements, matching OpenAI-o1-1217.

A new star was born.

Their valuable innovation was challenging the SFT+RL paradigm by proving pure RL's potential for reasoning through R1-Zero. This gave them confidence to further build the practical R1 with minimal cold-start data. Both models are open-sourced for research - an elegant execution.

DeepSeek excels at simplification. In reinforcement learning, they eliminated:
- The critic model that runs in parallel with the policy model, replaced by simple GRPO
- Complex reward models, replaced by rule-based rewards

GRPO (Group Relative Policy Optimization) generates multiple answers per question, comparing them within groups to calculate advantage scores:

Advantage = (Current score - Group mean) / Group std dev

Example: For a math problem generating 4 answers scoring 90, 80, 70, 60 (mean = 75), the 90-point answer gets a positive advantage score. This eliminates the need for a critic model while still enabling the model to identify better answers.
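
Here is a minimal sketch of that computation (my own illustration; the actual implementation also folds in PPO-style clipping and a KL penalty against a reference policy, omitted here):

import statistics

def group_advantages(scores):
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0    # guard against a zero-variance group
    return [(s - mean) / std for s in scores]

print(group_advantages([90, 80, 70, 60]))
# -> [1.34, 0.45, -0.45, -1.34]: the 90-point answer gets a positive advantage,
#    so the update raises the probability of generating answers like it.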

GRPO advantages:
- Training efficiency: No critic model saves compute
- Training stability: Clipping prevents over-optimization
- Simple implementation: Clear algorithm structure

Why did traditional RL use critic models? Critics offered:
- Global evaluation beyond group comparisons
- Learning complex rewards like user preferences
- Single answer evaluation capability
- Long-path rewards for games/robotics

However, GRPO showed that for well-defined tasks (math, coding, logic), simple group comparisons work equally well at scale.
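
For the "clipping prevents over-optimization" point above, here is a hedged sketch of the clipped surrogate objective optimized over those group-relative advantages (my own simplification; the KL-to-reference regularization used in the paper is omitted):

import torch

def grpo_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """logp_new / logp_old: tensors of log-probabilities of each sampled answer under
    the current policy and the policy that generated the samples; advantages: the
    group-relative scores computed as in the earlier snippet."""
    ratio = torch.exp(logp_new - logp_old)                        # importance ratio per answer
    adv = torch.as_tensor(advantages, dtype=ratio.dtype)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()                  # minimize the negative clipped surrogate

# example with 4 sampled answers:
# loss = grpo_loss(torch.tensor([-3.2, -4.1, -5.0, -6.3]),
#                  torch.tensor([-3.0, -4.0, -5.2, -6.0]),
#                  [1.34, 0.45, -0.45, -1.34])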

For rewards, R1-Zero used pure rule-based rewards, only employing V3's existing preference reward models in R1's final alignment phase. Human preferences (safety, helpfulness) require complex value judgments that simple reward rules cannot capture.

They intentionally avoided the difficult Process Reward Models (PRM) because:
- Difficult to define granular reasoning steps
- Hard to validate intermediate step correctness
- Risk of reward hacking
- Resource intensive reward model retraining

R1's reward rules were simple, something like the following (see the sketch after this list):
- Correct answer: +1
- Correct format: +0.5
- Wrong answer: -1
- Vague answer: 0
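
A hedged sketch of such a rule-based reward (the point values above are this post's illustration, not the paper's exact numbers; the tag format follows the <think>/<answer> template used in the training stages below):

import re

def rule_reward(output: str, gold_answer: str) -> float:
    reward = 0.0
    match = re.search(r"<think>.*?</think>\s*<answer>(.*?)</answer>", output, re.S)
    if match:
        reward += 0.5                                   # correct format: +0.5
        answer = match.group(1).strip()
        if not answer:
            return reward                               # vague/empty answer: no accuracy term
        reward += 1.0 if answer == gold_answer.strip() else -1.0   # correct: +1, wrong: -1
    return reward

print(rule_reward("<think>...</think><answer>x = -1</answer>", "x = -1"))  # 1.5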

Just as scaling GPT's next-token prediction led to emergent general intelligence, result-oriented RL (with verifiable correct answers) naturally developed complex internal reasoning capabilities. This insight has profound implications for advancing deep reasoning.

R1's four-stage training:

1. Reasoning Cold-start: e.g.

Question: Solve x^2 + 2x + 1 = 0
<think>
1. Identify quadratic equation
2. Coefficients: a=1, b=2, c=1
3. Use formula: x = (-b ± √(b^2-4ac))/2a
4. Substitute: x = (-2 ± √(4-4))/2
5. Simplify: x = -1
</think>
<answer>x = -1</answer>

2. Reasoning RL:

- Result-oriented data generation with <think>...</think> template
- No human bias, allowing model's natural reasoning evolution
- Model gradually increased thinking time and length of tokens
- GRPO optimization with rule-based rewards

While validating only the final answers risks, in theory, accepting wrong reasoning paths, practice showed that sufficient scale leads to correct reasoning. This aligns well with Kolmogorov complexity theory: correct reasoning is the "shortest program" for reliably producing correct solutions.

Input sources:
- Manually designed math/coding problems
- Public benchmarks (e.g., AIME)

Output process:

Input: x^2 + 2x + 1 = 0

Model generates multiple answers:

Answer1: [Reasoning1] -> x = -1
Answer2: [Reasoning2] -> x = -1
Answer3: [Reasoning3] -> x = 2

Filter: Keep 1,2 (correct), discard 3 (wrong)
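
A conceptual sketch of this regenerate-and-filter loop (mine; sample_completions and extract_answer are hypothetical helpers standing in for the model's sampler and an answer parser):

def build_reasoning_set(problems, sample_completions, extract_answer, k=4):
    """problems: iterable of (question, gold_answer) pairs with verifiable answers."""
    kept = []
    for question, gold in problems:
        for completion in sample_completions(question, n=k):   # k sampled <think>/<answer> rollouts
            if extract_answer(completion) == gold:             # keep only correct final answers
                kept.append({"question": question, "completion": completion})
    return kept                                                # fed back as training data for the next round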

3. Comprehensive Fine-tuning:

- 800k samples: 600k reasoning + 200k general tasks
- V3 model judges cases where rule rewards can't
- Reuses V3 training data for non-reasoning tasks

4. Global RL:

- Human preference alignment while maintaining reasoning
- Rule rewards for reasoning
- employing V3's existing reward model for preferences

The process is clearly described with sufficient implementation details and is, in principle, reproducible.

Reasoning Distillation

Finally, DeepSeek's R1 excelled in distilling reasoning capabilities to open-source smaller models, outperforming OpenAI's o1-mini. This demonstrates open-source LLMs approaching closed-source models in almost all aspects.

However, those expensive closed-source models paved the way and set the baselines and goals. The current landscape is ideal: wealthy companies push the boundaries while "dark horses" like DeepSeek follow impressively close behind.

It is worth noting that R1 not only enhanced complex reasoning ("slow thinking") but also significantly improved "traditional" knowledge capabilities compared to its V3 base model, suggesting that reasoning strength can benefit traditional tasks as well.

Key innovations as a summary:

1. DeepSeek-R1-Zero: First reasoning model trained purely through RL
2. DeepSeek-R1: Improved with cold-start data and 4-stage training
3. Distillation: Successfully transferred capabilities to small models

Technical highlights:

- GRPO algorithm replacing critic model
- Rule-based rewards replacing reward models
- Simple template enabling autonomous learning:
"<think> may well be all you need for reasoning"

[Epilogue from notes]

Silicon Valley old buddies group discussions heated up lately:

"DeepSeek needs quick funding/IPO or risks losing their 18 core contributors to big tech."

"Reproduction seems not difficult. Everyone considered RL but hesitated due to compute costs. o1 likely used RL similar to r1 but chose to keep details private and mysterious."

"This team represents China's technological prowess."

"Several companies have reproduced DeepSeek's core results - autonomous reasoning emergence. Expect rapid iterations and development in the coming days/months."

"OpenAI has fewer cards to play. Sam tries psychological warfare - emphasizing process rewards, suggesting complex search for O1... likely all unnecessary."

"Success factors include hiring young talent with fresh thinking."

"DeepSeek R1 showed how we were misled by PRM and MCTS - indeed, looks like all you need is a <thinking> tag."

"It's not about simplicity - fact is large models already have strong reasoning capabilities, they just need:

1. Thinking space/time/tokens (<think> tag)
2. Correct feedback (answer accuracy)
3. Exploration opportunity (GRPO optimization)"

Complex PRM and MCTS actually limited model's self-exploration. We underestimated large models' potential.

DeepSeek's success prerequisite was V3 - their world-class foundation model matching GPT4o. They knew how to leverage its potential. Using external models like GPT4 would've made R1 much harder to implement this quickly.

"What's next?"
"AI for science? Machine-proving century-old problems, discovering new drugs..."
"Only two problems matter: Riemann Conjecture and P vs NP"
"Big tech will pursue larger models, more data"
"Nvidia's business will improve"

DeepSeek has achieved parity with benchmarks others set. To truly lead, they need to set new benchmarks and directions. Currently, it is still the case that only those willing to burn money massively are breaking new ground.

R1 demonstrates how a Chinese AI company not only caught up but showed the way forward through intelligent simplification. Their approach of making complex problems simpler may influence the entire field.

----------

But I cannot reproduce the error my old friend reported yesterday, as shown above; it looks like it has been handled already.

 


DeepSeek 风暴下看看它的论文

DeepSeek 之前,国内大模型公司各种刷榜,也是内卷得一塌糊涂,也都刷榜刷到了世界先进水平,但没有哪家做到了 DeepSeek 这种硬气、震撼和让人服气。一鸣惊人天下知。

NND,人家论文也写得漂亮、亲民,看上去、读起来就像一首码农诗。没有任何故作高深的玄乎和遮蔽。简单、平实,但那种底气也算是力透纸背。有一种工程美,还能感受到情怀。邪门。应该推举为年度 best paper。

论文的漂亮还体现在,好比一个火箭技术或原子弹配方,在论文中如此清晰地展现出来,连我们小白也能看清。过去几天逐字读它的论文 is a huge pleasure,极大满足了好奇心。

DeepSeek_R1 paper

好,奇文共欣赏,咱们就坐下来读。

R1 论文读下来,原来被 OpenAI 从 q* 开始到 o 系列,搞得神秘兮兮的所谓 System 2 慢思维的强化训练过程,一下子就变得清晰简单多了。

他们的主要发现是:

不用人造强化数据做监督学习微调(sft),利用多答案采样选优的再生数据来“硬做”强化学习,也一样可以学到慢思维的推理能力,这就是他们的 DeepSeek-R1-Zero,实际上是 follow Alpha-Zero 的思路。AlphaZero 在围棋这种非常单纯狭窄的场景,可以把 Zero 进行到底,排除了人类/人为的数据,最终成为绝对王者。

在更广一点的数学、代码和某些逻辑问题的推理场景,他们最终发现还是借助少量的 sft 人工数据更好。但也不过就是几千条的数据,做推理sft的“冷启动”,人工准备一点也不难。这就是他们的 DeepSeek-R1。

他们的 Zero 也走通了,达到了 OpenAI-o1-0912 的水平(o1的9月12日版本?)。其所以做 R1, 加入了sft冷启动的步骤,主要是因为机器完全自主学习出来的 Zero 的推理步骤可读性差、里面还混杂了不同的语言表达方式,这对进一步改良这个系统造成困扰,毕竟模型要“以人为本”,服务开发者和用户的。最终炼成的 R1 推理表现进一步提升,达到 OpenAI-o1-1217 (估计是12月17日版本的o1)的水平。

他们的创新和探索精神表现在,当 community 把 sft+rl 当成是后训练范式的时候,他们做 Zero,完全排除人工数据,验证了纯粹的rl对于推理能力的学习潜力。从 Zero 首先是学到了信心,体验了探索创新者的 aha moment,然后再回头加一些用于冷启动的高质量人工数据sft,再做实用的 R1 就有底气了。两个模型都开源,供人研究和验证,做得煞是漂亮。

DeepSeek 是化繁为简的大师

强化学习中,直接砍掉了应该与policy模型平行迭代的 critic 模型,代之以简单的GRPO。critic 是评估每个步骤的价值模型,砍掉了等于是训练中一下子降低了一半的资源耗费。需要单独训练的奖励模型也省掉了,代之以简单的规则奖励。

咱们先看看GRPO (Group Relative Policy Optimization,分组相对策略优化) 是什么创新,为什么GRPO算法可以平替 Actor-Critic架构的PPO来优化模型。本质上,GRPO是一个无需critic模型的PPO变体。它通过组内统计计算优势值,而不是用critic网络评估价值。

具体说,GRPO 要求每个问题生成多个答案,形成一组,组内答案相互比较,计算每个答案的"好坏程度"(优势值):

优势值 = (当前答案的得分 - 组内平均分) / 组内标准差

假设一道数学题,生成4个答案,得分分别是: 90, 80, 70, 60分;平均分 = 75。90分答案的优势值 = (90-75)/标准差。高于平均分的答案获得正优势值,反之为负。这样就不需要额外的评判模型 (Critic),通过组内比较,模型就知道哪些答案更好,强化学习的优化目标就是要提升生成好答案的概率。

GRPO 算法的优点:

    1. 训练效率高:不需要额外的 critic 模型,节省了大量计算资源。
    2. 训练稳定性好:用clip限制更新幅度,防止过度优化
    3. 实施简单:算法简单,易于实现。

GRPO简单易行又有效,为什么传统的强化学习要用critic模型呢?Actor-Critic 架构有其优势,包括:

    1. 全局评估:不受限于当前组内比较,可以评估答案的绝对质量
    2. 可学习复杂奖励:比如用户偏好、安全性等难以用规则定义的指标
    3. 单个答案也能评估:不需要同时生成多个答案
    4. 场景优势:早期RL主要用于线条很长的游戏和机器人,需要 critic 学习长期奖励

但GRPO表明,对于明确的任务(如数学、coding和逻辑题),只要能规模化做大强化学习,简单的组内比较也能达到模型优化的同样效果。这是强化学习“多快好省”的重要发现。

至于奖励模型,他们在 Zero 训练中完全弃之不用,而是用简单直接的奖励规则代之。只是在R1训练最后阶段的偏好对齐任务上(不是推理任务),才按照RLHF(人类反馈强化学习)的常规使用了 reward model(实际是对于自己基座模型V3中的奖励模型的复用)。原因如前所述,是人类偏好 (如安全性、有帮助性等) 涉及复杂的价值判断,难以用简单规则量化。对这类评估,还是沿用训练过的reward model来模拟人类判断。但推理任务,他们的探索表明完全可以只用简单的奖励规则 -- 因为正确性判断相对明确:math 有答案,code 可以编译和执行 unit testing。

还有什么能简化的,他们没有简化?

强化学习中的难缠的痛点之一是所谓过程奖励 Process Reward Model (PRM),就是深入到推理的每一步去评估。对此他们是“知难而退,敬而远之”,干脆绕过去:DeepSeek的强化是结果导向,不深究过程。论文说明他们选择不使用PRM的原因如下:

    • 难以明确定义推理中的每个细节步骤 (难:绕过去)
    • 难以判断中间步骤的正确性 (难:绕过去)
    • 重新训练reward model需要额外资源,使流程复杂化 (复杂:能简则简)
    • 模型评估会导致reward hacking:即神经模型可能学会欺骗奖励模型

就最后一条是出于神经模型本性上的短板考量,主要原则还是能简则简,能绕则绕。所以说,他们选择简单的规则奖励 + 答案验证的方案,是一个有意识的权衡选择。

就是说,明明简单的规则就可以确定奖励指向,为什么要训练叠床架屋的奖励模型呢?不过是查一下答案或测试一下code,判定结果的对错,加上判定格式是不是符合规范。R1 主打的奖励刺激属于规则绑定:例如,答案正确,奖励+1分;格式正确,奖励 +0.5;答案错误,“奖励”-1分;答案不具体,奖励 0分。

当然,这样做,在把推理拓展到数学、代码以外的任务的时候,可能行不通。但目前大家发力的重点主要就是数学和代码,而更加狭窄的长线条棋类和游戏场景,基本被传统RL攻克。尽管如此,绕过过程奖励仍然可能是潜在的软肋,理论上给结果正确,过程逻辑混乱留下了空间。

只关注答案对错,不问过程是怎么强化出长线条的复杂推理过程的呢?门道就在强化学习的答案自然偏向于长答案,随着训练这就自然增加了 test time compute ,发展出对于复杂推理的应对能力。这使得 R1 的强化学习更易自主探索推理路径,成就了DeepSeek的这次突破和出圈。

与 GPT 的 next token prediction(ntp) 规模化以后可以涌现通用智能异曲同工,DeepSeek那帮年轻人发现,只要结果明确可判定,结果导向的强化学习可以自然涌现出复杂的内部推理能力,因为正确的结果需要推理。这个发现的意义,对于领域今后的深度推理的推进非同一般,可媲美GPT系列预训练时发现的 scaling law。

 头部推理模型R1 的训练四阶段 :

1. 推理冷启动

利用数千条高质量人工推理数据,例如:

# 收集高质量示例 
Question: 求解方程 x^2 + 2x + 1 = 0 
<think> 
1. 识别这是一个二次方程 
2. 系数: a=1, b=2, c=1 
3. 使用求根公式: x = (-b ± √(b^2-4ac))/2a 
4. 代入: x = (-2 ± √(4-4))/2 
5. 化简: x = -1 
</think> 
<answer>x = -1</answer>

2. 推理强化学习

结果导向,再生数据的模版“留白” <think>.........</think>

    • 设计简单模板让模型自主再生训练数据:
      <think>思考过程</think><answer>答案</answer>
      模型生成多个答案 -> 筛选正确答案 -> 加入强化学习的训练集
    • 不添加任何人工偏见或策略提示,留下RL自主学习推理过程的空间:
      逐渐增加思考时间(test time compute)和tokens量,模型就自发涌现反思步骤等推理能力,这就是论文作者描述的 aha moment,令人动容的见证
    • GRPO算法采样多个答案,通过内部对比来优化模型
    • 规则奖励:奖励答案正确 + 格式规范

前面提到,系统只验证最终答案,理论上无法保证中间推理步骤的正确性(可能学到"答案对但推理错"的模式),但实践似乎显示,只要强化学习足够充分和规模化,答案正确会自然导向推理步骤的正确性。根据K氏复杂性(Kolmogorov complexity)压缩理论,正确的推理导向正确的答案才是可靠解决方案的“最短程序”,这是无损压缩的终极目标。后训练强化学习的过程与预训练一样,都是对无损压缩的逼近。

DeepSeek 的探索再次表明,简单即美,scale为王
(一简遮三丑,你是服也不服? LOL)

天机就是,scale 是硬道理。简单架构/算法有利于真正的 scale up,只要目标清晰,一旦 scale 了,一切就自然搞定。

训练数据的源头

模版再生数据的input 应该是来自两个源头,1 人工设计的数学题/编程题;2 公开基准测试题(如AIME)。

根据模版的再生数据的output流程:

Input: x^2 + 2x + 1 = 0 Model生成多个答案:
 Answer1:
<think>[推理过程1]</think>
<answer>x = -1</answer>
 Answer2:
<think>[推理过程2]</think>
<answer>x = -1</answer>
 Answer3: 
<think>[推理过程3]</think>
<answer>x = 2</answer>
 筛选:  - 保留Answer1、2(答案正确) - 丢弃Answer3(答案错误)

保留的答案作为推理再生数据加入训练集用于下轮迭代。所有再生训练数据都需要标准答案来评估正确性,这在来源中就给定了:

    • 数学题:确定的数值答案
    • 编程题:通过测试用例验证
    • 逻辑推理:有明确的正确结论

有标准答案是规则奖励能工作的基础。对没有标准答案的任务(如写作),需要用其他方式评估质量,例如使用奖励模型。

3. 综合性微调

800k 条训练数据,其中推理 600k, 其他任务 200k

论文没说为什么按照这个比例选取微调数据,应该是根据经验。600k 推理数据是再生的,用的就是阶段2的推理模型。但这里有一个值得注意的插曲:在阶段2的推理强化学习中,再生数据必须是奖励规则可以判定的。但阶段3的推理数据,却突破了这个限制。阶段3的推理数据增加一些 reward rule 不能判定的 cases,既然简单的奖励规则无法判定,就找 V3 模型来判定。好像是说,当一道推理题(数学、coding或逻辑题)生成n个奖励规则难以评判优劣的结果的时候,就把这些结果和标准答案送给V3,让V3做裁判。

另外的200k数据呢?一部分是拿来主义,直接从他们自己的V3的原始finetune训练数据中选取;另一部分让 V3 生成数据,但要求V3不仅给答案,还要给思维链过程(就是要求它 step by step 输出结果)。这可以理解,这里虽然不是纯粹的长线条推理题,其他任务很多时候也是要有条理的。

4. 全局强化学习

这最后的强化学习很像是早就使用过的 RLHF,更注重人类偏好的对齐。但为了防止推理退化,在偏好对齐的同时,也强化了推理,用的还是规则奖励。而人类偏好对齐用的则是V3原有的奖励模型(这是唯一真正用到的奖励模型)。

整个过程还是相当清晰的,原则上可复现。

用R1再生数据去蒸馏小模型,提升其推理能力

最后,Deepseek 的R1推理强化工作在蒸馏开源小模型方面也做得很牛,干翻了openAI 的 o1-mini 小模型。展示给世人看,开源 LLMs 开始全面逼近闭源模型。

但话说回来,没有这些巨烧钱的闭源模型在前面开路,并建立标杆,后来者也容易失去方向。现在这种局面非常好:让有钱的去砸银子。在金钱的赋能和压力下,不断开疆拓土。让deep“黑马”们在后面紧追不舍,而且还追赶得特别牛气。

令人印象深刻的是,R1 不是仅仅大幅度提升了推理能力(慢思维),在“传统”的知识能力方面比起它的基座模型V3也有显著提升。这可能是因为,推理能力的增强对于一些传统任务具有正面作用,但更应该归功于他们探索出来的四阶段训练R1的pipeline。

最后总结一下。

主要创新点:

1. DeepSeek-R1-Zero: 首个仅通过强化学习(RL)训练的推理模型, 无需人工推理数据的监督微调(SFT)。展示了模型可以纯靠 RL 自主发展出推理能力。

2. DeepSeek-R1: 在 R1-Zero 基础上做以下改进:
- 后训练阶段先用少量高质量数据进行冷启动SFT
- 采用4阶段的后训练流程,两次SFT,两次RL
- 性能可与 OpenAI-o1-1217 相媲美

3. 蒸馏技术: 成功将推理能力迁移到一系列开源小模型:
- 1.5B 参数的模型就超越了 GPT-4 在数学方面的表现
- 32B 和 70B 的模型创造了密集模型的新记录

关键技术细节:

- 使用 GRPO (Group Relative Policy Optimization)算法,舍弃 Critic 模型
- 采用基于规则的奖励系统, 舍弃奖励模型 PRM
- 设计特定的训练模板引导模型再生数据进行自主学习:
<think> is all we need for reasoning!

 

【笔者后记】

这两天莫名很兴奋。跟 deep啥 纠缠不休,今天才缓过气来 lol

硅谷老友群也热议不断:

Hongtao:
DeepSeek若不快速大融资和上市, R1的18位主要贡献者估计很快就被国内外大厂抢光了[Grin]
Core Contributors:
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
Ruoyu Zhang
Runxin Xu
Qihao Zhu
Shirong Ma
Peiyi Wang
Xiao Bi
Xiaokang Zhang
Xingkai Yu
Yu Wu
Z.F. Wu
Zhibin Gou
Zhihong Shao
Zhuoshu Li
Ziyi Gao
当年DeepMind被迫卖给Google,因为founders被告知若不卖,就高薪挖人。挖走一两个作者,就少走大部分弯路了。

主要还是幻方的AI量化投资受挫, 因势利导做deepseek成功;开源后,国内外大家都沿着这路子去试。若不财大气粗起来,优势恐怕难以为继。

超大模型训表征,
开源一蹴而就成。
强化学习各求精,
蒸馏定制缩小型。

内卷已经卷出墙,
硅谷AI圈被激荡。
OpenAI&Meta领头羊,
都被鞭策加速闯。

硅谷不眠夜:DeepSeek为何震动美国科技界?

Nick:复现DeepSeek貌似很容易。其实强化学习大家也都想到过,过去总觉得可能要花很多算力,少人试。貌似o1就是强化学习练出来的,但一些推理token他们没open。这可能迫使国内头部那两家加速上市过程。

立委:这类团队属于中华之光,国之重器。

他们写得基本够清晰了。让人担心他们下一步怎么保持这个势头和地位。很多神秘就是一层窗户纸。最大的功劳是他们同时也差不多捅破了o系列神秘面纱的窗户纸。

Nick:马上融一大笔钱,突击上市。除非手里还有更硬的牌。

马老师:好几家复现了deepseek,各家再各自探索,相信会是快速迭代的过程,有望再一次大发展。

Nick:也是个试金石,倒逼openAI看看还有啥新东西,是骡子是马拉出来溜溜。

Hongtao:给openai压力;更是 叫板meta, 争夺开源盟主地位

鲁总:OpenAI 的牌越来越少了。但SamA 希望通过心理战误导大众。之前发文强调过程奖励,O1 出来时放烟幕弹让人相信推断时使用复杂的搜索 ... 结果都应该没有用。

香港科技大学的团队说是也独立发现了RL涌现推理能力,不过只针对数学问题求解,但也特别指出使用输出格式奖励。

白老师:数学能力和编程能力是相通的。

不请贵的人是成功的很重要因素。

施总:哈哈。贵的不一定能干,能干的都比较贵。

刘总:主要是要用年轻人,岁数大的没戏。岁数大了,思维僵化,精力不行。当然,我说的是统计规律,个例总是有的。

立委:deepseek 不是常态,是冒尖。但 deepseek 这么一捅窗户纸,很多人就跟上了。不知道 它还有多少宝贝没有显露。否则 逐渐暗淡下去 也不是不可能的。

deepseek 之前,各种刷榜,也是内卷得一塌糊涂,也都刷榜刷到了世界先进水平。但没有哪家做到了 deepseek 这种硬气 震撼 和让人服气。一鸣惊人天下知。

Nick:估计每家都会短期内在数学能力上长足进步。豆包上周一周内就进步不小。窗户纸捅破,门槛也不是那么高。大概率o1也是这么做的,只不过内帮孙子比较鸡贼。

Liren:DeepSeek-R1告诉大家,你们都被PRM和MCTS误导了,其实只需要一个<thinking>标签就够了[Chuckle]

Nick:是啊,你写篇文章,“<thinking> is enough"

立委:就是留白。你留了白,系统就会给自主填上。

zero 的实践表明,根本不用想那么复杂,还要考虑怎么从各种不同推理任务中找到共同的思维链 patterns,等等。甚至也不管里面的逻辑是不是胡说八道,结果导向,最终,推理还是学出来了。预训练靠的是简单的 next token prediction,后训练推理靠的就是结果导向的强化自主学习。设计一个简单的模版就搞定了无穷的再生推理数据。

Nick:是啊,有了ToT和Gemini,话都在嘴边了。

Liren:增加在推理时的tokens来提升思考时间。

立委:秘方就是4步走:1 冷启动 2 强化 3 微调 4 再强化。zero 干脆省掉了 1 3 4,所以显得过于生猛,但 beautifully 证明了“硬启动”的强化学习也能涌现高级推理能力。r1 就是完善后训练的节奏和数据配比。很多应该就是经验,是摸索出来的 best practice,他们肯定有过很多其他失败的尝试,但还是摸着石头过了河。

马老师:感觉就是碰运气,不过沿着别人路走的永远没有运气。

立委:我觉得他们还有一些东西,所以才“肆无忌惮”。等于是他们推出了一个菜谱,这个菜谱做的菜比肩世界一流。但他们其实还有其他的菜谱,更高级,但不急于拿出来?

不是大道至简,而是大模型本身已经具备了强大的推理能力,它需要的只是:

1 足够的思考空间/时间/tokens量(<think>标签)
2 正确的反馈信号(答案正确性)
3. 探索优化的机会(GRPO采样选优)

复杂、难缠、费力的PRM(过程奖励模型)和MCTS(蒙特卡洛树搜索路径空间)反而限制了模型的自主探索。这说明大模型的能力被我们低估了。

deepseek 的成功的先决条件是 v3,他们自己做出了世界前列的头部基础模型,他们自己知道怎么善用它的潜力。如果是借助于外部基础模型 GPT4o,就很难这么快做出r1,很多 v3 的资源和practice 就在 r1 过程中直接借用了。

马老师:在理。

Nick:So what's next? assuming everybody will have as strong math capabilities within a month

立委:AI for science?机器自动证明百年难题啥的;机器自动发明新药......

Nick: only two problems matter: Riemann Conjecture and P vs NP

马老师:大厂也许会用更大的模型,更多的数据,继续向大上走。

Nick:那肯定。我觉得Nvidia的生意会更好。

立委:deep 目前为止还是在追平,是人家先树立了标杆,它去对齐。多快好省。

deep 要真牛,再上一个台阶,需要自己树立标杆和方向。但这太难了。目前为止似乎还是只有敢于疯狂烧钱 敢于无限做大的那些狂人才在开疆拓土。

 
