Reflections on LLM Scaling Laws and DeepSeek's R1
My friend Zhang Junlin's article "Looking at the Future of Scaling Laws through DeepSeek R1" has sparked interesting discussions among peers.
Core Insights from Initial Discussions
Professor Bai summarised the key highlights as follows:
Infinite stacking won't lead to infinite growth (physical laws don't support this)
Only S-shaped growth is possible, with diminishing returns inevitably appearing
The initial emergence of language capabilities relates to the density of linguistic knowledge in training data
The next growth phase represents a second S-curve, driven by common sense knowledge, which requires more computing power due to lower knowledge density
The third phase involves learning logical reasoning (Chain of Thought), where natural data has even lower density of such knowledge. Brute-force mining with computing power becomes inefficient, making reinforcement learning with synthetic data a more rational approach
As Dr. Lu points out: The term "Scaling Law" is becoming overloaded. While S-curves (nonlinear curves characterized by sigmoid functions) can describe technology adoption lifecycles, they typically occur in succession (one technology hits its ceiling, making way for another). Large language models' multiple "Scaling Laws" confirm this pattern, with some overlap between Test-Time and Post-Training "Scaling Laws".
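To make the stacked S-curve picture concrete, here is a minimal numeric sketch in Python. All parameter values are purely illustrative assumptions, not fitted to any real scaling data; the point is only that the sum of two successive sigmoids resumes growth when the first one saturates.

```python
import numpy as np

def s_curve(x, ceiling, midpoint, steepness):
    """Logistic (sigmoid) curve: slow start, rapid growth, then saturation."""
    return ceiling / (1.0 + np.exp(-steepness * (x - midpoint)))

# x stands in for cumulative resource investment (data, compute), arbitrary units.
x = np.linspace(0, 10, 101)

# Hypothetical parameters: a first curve (e.g. pre-training) that saturates early,
# and a second curve (e.g. post-training) that kicks in later with its own ceiling.
s1 = s_curve(x, ceiling=1.0, midpoint=3.0, steepness=2.0)
s2 = s_curve(x, ceiling=0.5, midpoint=7.0, steepness=2.0)

overall = s1 + s2  # stacked S-curves: growth resumes when the next curve takes over

for xi, yi in zip(x[::20], overall[::20]):
    print(f"investment={xi:4.1f}  performance={yi:5.2f}")
```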
The Nature of LLM Scaling
Let's examine the fundamental logic behind LLM scaling. First, it's crucial to understand that LLMs are not databases: they don't aim to memorize long-tail data details. Large model training essentially compresses big data, or more precisely, compresses the knowledge systems behind the data (including common sense and encyclopedic knowledge), focusing on capturing the patterns and regularities behind the data (what we call generalizations).
Conventional intuition suggests that as data scale increases, redundancy increases too. Regardless of filtering, cleaning, and deduplication, growing redundancy seems to imply diminishing returns. So why do large models still appear "hungry" even at the unprecedented scale of hundreds of billions of tokens? Why does the scaling law remain effective from hundreds of billions to trillions of tokens?
The key lies in LLMs being sequence learning and sequence decoding systems. While sequences are one-dimensional, the patterns and regularities behind them are high-dimensional. For instance, even a simple sequence like "cat chases mouse" potentially involves multiple knowledge dimensions: species relationships, predatory behavior, spatial movement, actor-patient roles, etc. This multi-dimensional knowledge naturally leads to combinatorial explosion at the sequence level, because the information is flattened into language. The insatiable "appetite" for big data effectively addresses this combinatorial explosion: as long as there isn't complete information redundancy, additional diverse sequences help models abstract data patterns more precisely.
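As a toy illustration of that combinatorial explosion (all words and dimensions below are made up for illustration), the sketch enumerates how a handful of independent knowledge dimensions multiply into many distinct surface sequences:

```python
from itertools import product

# Hypothetical dimensions behind simple "X chases Y" sentences.
# Each surface sequence flattens several independent knowledge dimensions.
agents    = ["cat", "dog", "fox", "owl"]
actions   = ["chases", "stalks", "catches"]
patients  = ["mouse", "rabbit", "squirrel"]
locations = ["in the yard", "under the table", "across the field"]

sequences = [f"The {a} {v} the {p} {loc}."
             for a, v, p, loc in product(agents, actions, patients, locations)]

# 4 * 3 * 3 * 3 = 108 distinct surface sequences from just four small dimensions;
# every added dimension multiplies the count, which is why diverse data keeps
# helping the model sharpen its generalizations.
print(len(sequences))
print(sequences[0])
```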
The Two vs. Three S-curves Debate
Zhang Junlin observes that since OpenAI's O1, two other phases have gained recognition with their own Scaling Laws: the reinforcement learning Scaling Law (RL Scaling Law) for post-training, and the Inference Scaling Law (also called Test Time Scaling Law).
This raises a crucial question: Are there really three S-curves, or just two? How comparable is the reasoning model's S-curve to the pre-training S-curve?
While theoretically we can identify three phases:
Pre-training
Post-training (especially reasoning-focused reinforcement learning)
Inference phase
In practice, post-training and inference phases likely share a single S-curve; there aren't two independent growth curves.
DeepSeek R1's Insights: The Truth About "Slow Thinking"
Consider DeepSeek R1: users can activate "deepthink" mode to enable Chain-of-Thought (CoT) reasoning, but they can't actually control reasoning quality by increasing computation time. Why is this?
Let's examine a concrete example. When R1 solves a complex mathematical problem:
Traditional models might directly answer: "The result is 42"
R1 shows detailed reasoning: "Let's think step by step: 1) First consider... 2) Then we can... 3) Finally, we get 42"
While R1's response appears to demonstrate "slow thinking" (CoT), this reasoning process actually reflects a generation pattern fixed during training, not dynamic exploration of multiple potential reasoning paths at response time. In other words, CoT+answer may look like "slow thinking," but it does not fundamentally change the unidirectional next-token prediction paradigm: the generative nature remains the GPT "fast thinking" mechanism. Unlike AlphaGo, the depth and scale of thinking is not dynamically explored at test time, though beam search, if applied, can provide some implicit multi-path optimization internally.
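A schematic sketch of why this is so: the toy decode loop below (ToyModel and its canned outputs are stand-ins, not DeepSeek's actual models or APIs) produces "answer only" and "CoT + answer" with exactly the same unidirectional next-token mechanism; nothing at test time searches over alternative reasoning paths.

```python
class ToyModel:
    """Toy stand-in for an autoregressive LM: emits a canned token sequence."""
    def __init__(self, canned):
        self.canned = canned
        self.i = 0

    def predict(self, context):
        tok = self.canned[self.i % len(self.canned)]
        self.i += 1
        return tok

def generate(model, prompt, stop_token="<eos>", max_tokens=64):
    """One unidirectional next-token loop; no search over reasoning paths."""
    out, context = [], list(prompt)
    for _ in range(max_tokens):
        tok = model.predict(context)
        context.append(tok)
        out.append(tok)
        if tok == stop_token:
            break
    return out

fast = ToyModel(["42", "<eos>"])                              # direct answer
slow = ToyModel(["step1", "step2", "step3", "42", "<eos>"])   # CoT + answer

print(generate(fast, ["2+40=?"]))  # ['42', '<eos>']
print(generate(slow, ["2+40=?"]))  # ['step1', 'step2', 'step3', '42', '<eos>']
```

The "slow thinking" model simply emits more tokens before the answer because that is the habit it acquired in training; the decoding machinery is identical.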
Test Time Compute Constraints
The industry's buzzword "test time compute" refers to reasoning models requiring more online computational resources than traditional non-reasoning models. For example, R1 with CoT enabled might need several times more computation time than its base model V3 for the same problem. However, this increased computation results from behavior patterns acquired during training, not from a dynamically adjustable compute investment. Without controllable scalability in test-time compute, we can't really speak of a test-time scaling law.
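A rough back-of-envelope sketch of how output length drives test-time compute, using the common approximation of roughly 2 x (active parameters) FLOPs per generated token for dense decoding. The parameter count and token lengths below are assumptions for illustration, not measured figures for V3 or R1.

```python
# Assumed numbers, for illustration only.
active_params = 37e9      # assumed activated parameters per token (MoE-style)
answer_tokens = 300       # direct answer only
cot_tokens    = 3000      # long chain of thought generated before the answer

flops_per_token = 2 * active_params          # rough dense-decode rule of thumb

direct   = answer_tokens * flops_per_token
with_cot = (cot_tokens + answer_tokens) * flops_per_token

print(f"direct answer : {direct:.2e} FLOPs")
print(f"CoT + answer  : {with_cot:.2e} FLOPs  (~{with_cot / direct:.0f}x)")
```

Under these assumptions the CoT response costs roughly an order of magnitude more decode compute, and that multiplier is baked into the model's trained behavior rather than being a dial the user can turn.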
A major difference between pre-training and CoT reinforcement learning lies here: pre-training scaling laws can remain stable long-term because, once training completes, they don't significantly impact online response time; the generation mode remains a simple query+answer. Therefore, offline training for months is acceptable if the resulting model shows significant capability improvements. Reasoning models' post-training CoT reinforcement learning differs: it cultivates the model's habit of responding with slow thinking, changing the generation mode to query+CoT+answer. Extending the CoT isn't just a matter of training resources and time; more critically, it shows up as extended test-time compute for every query during deployment, severely delaying system response. Users generally have limited tolerance for slow-thinking computation time and delays when using an online system.
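A correspondingly simple latency sketch, with an assumed decode speed and CoT length (both hypothetical), showing why a long CoT becomes a user-visible delay rather than a one-time offline cost:

```python
decode_tokens_per_sec = 50   # assumed online decode throughput for one request
cot_tokens = 3000            # assumed length of the generated chain of thought

extra_wait_seconds = cot_tokens / decode_tokens_per_sec
print(f"extra wait before the answer begins: ~{extra_wait_seconds:.0f} s")
```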
The Sustainability Debate
OpenAI's Sam Altman and Anthropic's Dario Amodei might argue that for extremely complex problems (like proving the Riemann hypothesis or designing next-generation aerospace vehicles), even if a model needs a week of computation time, it's still a massive improvement over human teams requiring decades. However, this argument has two issues:
LLM feasibility for such super-complex problems remains far from validated
Extreme scenarios lack universality and can't serve as data points for sustainable scaling laws
This isn't to deny S-curves as effective models for describing scaling laws, nor to reject the rationality of S-curve stacking. The combination of pre-training and post-training growth curves (s1 and s2) might indeed reflect the overall relationship between resource investment and performance improvement. However, we should carefully examine whether CoT reasoning truly opens a sustainable scaling curve.
Conclusion: How Far Is the LLM Road to AGI?
If reasoning models' scaling laws lack sustainability, this raises a deeper question: Can we reach the promised land of Artificial General Intelligence (AGI) through these two scaling laws alone? Furthermore, is the technical ideal of Artificial Super Intelligence (ASI) - AI replacing human labor and dramatically improving productivity - truly feasible?
Current evidence suggests that while pre-training scaling laws have shown considerable sustainability, reasoning models' scaling laws may quickly hit practical constraints. This reminds us that the path to AGI/ASI likely requires more innovative breakthroughs, not just simple extrapolation of existing methods. In the next phase of artificial intelligence development, we might need to discover entirely new growth curves.
[#LLMs #ArtificialIntelligence #DeepLearning #AGI #ScalingLaws #MachineLearning]