Amid the waves of surprises brought by DeepSeek, an old friend pointed out that it struggles with simple math problems, using a popular elementary arithmetic question as an example:
Is 3.11 greater than 3.8?
What’s the core issue here?
In the wake of the DeepSeek frenzy, I looked into its research paper, which explains how its reasoning capabilities are enhanced through outcome-oriented reinforcement learning. The paper suggests that, in theory, outcome-oriented reinforcement learning can help a model learn proper reasoning processes. However, in practice, it’s not necessarily so.
Take the above math problem as an example. The question has a binary yes/no answer, so even random guessing has a 50% chance of being correct. This highlights a key potential flaw: outcome-oriented supervision signals are weak because they lack granularity. A reward that checks only the final answer cannot tell sound reasoning from a lucky or flawed shortcut. This kind of weak supervision inevitably hampers the model’s ability to learn proper reasoning processes.
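To make the weak-signal point concrete, here is a minimal Python sketch. It is an illustration only, not DeepSeek's training setup: the function names, the "version-number" shortcut, and the question set are all assumptions made up for this example. It compares a sound comparison procedure with a flawed one that treats the digits after the decimal point as whole numbers, and scores both with an outcome-only reward.

```python
from decimal import Decimal
from itertools import permutations


def correct_compare(a: str, b: str) -> bool:
    """Sound process: compare the two numbers as decimals."""
    return Decimal(a) > Decimal(b)


def flawed_compare(a: str, b: str) -> bool:
    """Flawed process (hypothetical): compare the digits after the point
    as whole numbers, the way version numbers are compared."""
    a_whole, a_frac = a.split(".")
    b_whole, b_frac = b.split(".")
    return (int(a_whole), int(a_frac)) > (int(b_whole), int(b_frac))


def outcome_reward(answer: bool, a: str, b: str) -> int:
    """Outcome-only reward: 1 if the final yes/no is right, else 0."""
    return int(answer == correct_compare(a, b))


def mean_reward(policy, questions) -> float:
    """Average outcome reward of a policy over a set of questions."""
    total = sum(outcome_reward(policy(a, b), a, b) for a, b in questions)
    return total / len(questions)


# Toy question set: every ordered pair of distinct strings "3.1" ... "3.99".
values = [f"3.{d}" for d in range(1, 100)]
questions = list(permutations(values, 2))

# The flawed process fails on the famous case ...
print(outcome_reward(flawed_compare("3.11", "3.8"), "3.11", "3.8"))  # 0
# ... but is fully rewarded whenever its shortcut happens to agree.
print(outcome_reward(flawed_compare("3.25", "3.1"), "3.25", "3.1"))  # 1

print("sound process :", mean_reward(correct_compare, questions))
print("flawed process:", mean_reward(flawed_compare, questions))
```

On this toy set the flawed shortcut still collects far more reward than the 50% guessing baseline, because it only goes wrong when the two fractional parts differ in length. The outcome signal alone gives little pressure to abandon it.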
Three Possible Solutions
- Scaling Up the Model
One approach is to make the model larger and deeper, hoping that the theoretical concept of lossless compression based on Kolmogorov complexity can be pushed to its limit. In doing so, proper reasoning, as the "shortest program," might eventually be learned by the model. Theoretically, correct reasoning ensures accurate results. However, the gap between theory and practice makes it hard to place much confidence in this. The shortest program or lossless compression might just be an unreachable ideal.
- Targeted Supervision Data
Another solution is to feed the model problem-specific supervised data, for example thousands or tens of thousands of reasoning cases involving such math problems. There’s no reason the model wouldn’t learn from this. However, solving one specific problem this way is merely a stopgap measure. Soon, others will come up with new edge cases involving weak supervision signals and reasoning pitfalls to challenge it.
Another common challenge is the so-called “self-identification” problem. For instance, when asked “Who are you?”, many models, including earlier versions of DeepSeek, would claim to be ChatGPT developed by OpenAI if no targeted supervised data is injected. After all, ChatGPT has dominated the internet in the two years since its explosive debut, and its data has inevitably influenced other models. However, this issue is already on the radar for specialized solutions and is gradually becoming less of a problem. Some Western media still claim that DeepSeek is just a distilled version of ChatGPT. Their evidence? Probably that, in the early versions they tested, the DeepSeek bot often identified itself as OpenAI’s ChatGPT. But if you test it now, you won’t see this problem anymore. Most likely, it was fixed with specialized training data. DeepSeek's research paper also mentions addressing self-identification.
Similarly, the problem of comparing 3.11 and 3.8 may be a transitional issue. If it disappears in the future, it won’t be a cause for celebration. Most likely, it will have been resolved through targeted fixes rather than through fundamental improvements in intelligence brought about by algorithmic or architectural innovations.
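As a rough illustration of what "thousands of reasoning cases" for this one problem could look like, the sketch below generates synthetic decimal-comparison questions with a short worked trace. The record layout (`question`, `reasoning`, `answer`) and the helper names are assumptions invented for this example, not a format taken from the DeepSeek paper.

```python
import random
from decimal import Decimal


def build_example(a: str, b: str) -> dict:
    """Build one training record for 'Is a greater than b?'.
    Both a and b must be decimal strings containing a '.'."""
    a_whole, a_frac = a.split(".")
    b_whole, b_frac = b.split(".")
    # Pad the shorter fractional part with zeros so the digits line up.
    width = max(len(a_frac), len(b_frac))
    a_pad = f"{a_whole}.{a_frac.ljust(width, '0')}"
    b_pad = f"{b_whole}.{b_frac.ljust(width, '0')}"
    is_greater = Decimal(a) > Decimal(b)
    verdict = "greater than" if is_greater else "not greater than"
    return {
        "question": f"Is {a} greater than {b}?",
        "reasoning": [
            f"Write both numbers with {width} decimal places: {a_pad} and {b_pad}.",
            f"Compare {a_pad} and {b_pad} digit by digit from the left.",
            f"{a_pad} is {verdict} {b_pad}.",
        ],
        "answer": "Yes" if is_greater else "No",
    }


def random_decimal(rng: random.Random) -> str:
    """Random number with one whole digit and one or two fractional digits."""
    n_digits = rng.choice([1, 2])
    frac = "".join(rng.choice("0123456789") for _ in range(n_digits))
    return f"{rng.randint(0, 9)}.{frac}"


def make_dataset(n: int, seed: int = 0) -> list:
    """Generate n records; the seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [build_example(random_decimal(rng), random_decimal(rng)) for _ in range(n)]


example = build_example("3.11", "3.8")
print(example["question"])
for step in example["reasoning"]:
    print(" ", step)
print(example["answer"])
```

Data like this is cheap to produce, which is exactly why such a fix says little about general reasoning: it patches one family of questions and leaves the next edge case untouched.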
- Re-introducing Process Reward Models (PRM)?
The inherent weakness of outcome-oriented supervision is that it looks only at the result and never checks the process, a natural shortcoming of results-driven pragmatism in RL (following the “black cat, white cat” principle, lol). This is essentially the cost of abandoning PRMs (Process Reward Models). So, would re-introducing process-based reward models solve the issue? Honestly, we don’t know. This is the third possible path, and it might be worth exploring. But again, as mentioned in my previous blog post (DeepSeek's R1 Paper: A Storm in AI LLM Circle), PRMs aren’t easy to work with: they’re unstable and difficult to implement, although, in theory, they could help correct nonsensical reasoning during the process.
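The difference in granularity between the two kinds of reward can be sketched in a few lines of Python. This is a toy under loud assumptions: a real PRM is a learned model that scores free-form reasoning steps, whereas the hand-written checker below only verifies simple numeric claims, and the trace format and names are hypothetical.

```python
import operator
from decimal import Decimal

# Each reasoning step is a claim (lhs, op, rhs), e.g. ("3.1", "==", "3.10").
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def step_is_valid(step) -> bool:
    """Toy stand-in for a process reward model: check one numeric claim."""
    lhs, op, rhs = step
    return OPS[op](Decimal(lhs), Decimal(rhs))


def process_reward(steps) -> float:
    """Fraction of steps in the trace that are individually correct."""
    return sum(step_is_valid(s) for s in steps) / len(steps)


def outcome_reward(final_answer: str, expected: str) -> int:
    """Outcome-only reward: 1 if the final answer matches, else 0."""
    return int(final_answer == expected)


# Question: is 3.25 greater than 3.1? Expected answer: "Yes".
sound_trace = [("3.1", "==", "3.10"), ("3.25", ">", "3.10")]
# A sloppy trace that pads 3.1 incorrectly but still ends with "Yes".
sloppy_trace = [("3.1", "==", "3.01"), ("3.25", ">", "3.01")]

for name, trace in [("sound", sound_trace), ("sloppy", sloppy_trace)]:
    print(name, "outcome:", outcome_reward("Yes", "Yes"),
          "process:", process_reward(trace))
```

Both traces earn the same outcome reward, while the step-level score separates them (1.0 versus 0.5). The hard part, which this sketch skips entirely, is building a step checker that works reliably on open-ended reasoning.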
In conclusion, the issue with DeepSeek struggling with problems like 3.11 vs. 3.8 lies in the limitations of weak supervision in results-oriented reinforcement learning. While there are potential solutions—scaling the model, targeted data, or process reward models—each comes with challenges and trade-offs. Whether any of these approaches can fundamentally improve reasoning capabilities remains an open question.
- DeepSeek 风暴下看看它的论文 (in Chinese; roughly "A Look at DeepSeek's Paper Amid the DeepSeek Storm")
- DeepSeek's R1 Paper: A Storm in AI LLM Circle
- The Turbulent Second Chapter of Large Language Models: Has Scaling Stalled?
- 大模型风云诡谲的下半场:scaling 失效? (in Chinese; roughly "The Turbulent Second Half of Large Models: Has Scaling Failed?")
- DeepSeek_R1 paper