Will DeepSeek Fail at Simple Math Problems?

Amid the waves of surprises brought by DeepSeek, an old friend pointed out that it struggles with simple math problems, using a popular elementary arithmetic question as an example:

Is 3.11 greater than 3.8?
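For the record, elementary arithmetic says no. A quick sketch below also shows how the same pair flips when read as software-version strings, which is a frequently suggested (and here purely illustrative, not taken from the DeepSeek paper) reason the question works as a trap:

```python
# As decimal numbers, 3.8 is the larger value.
print(3.11 > 3.8)  # False

# Read as software-version strings, though, "3.11" sorts after "3.8" --
# a speculative illustration of why the question trips models up.
def as_version(s: str) -> tuple[int, ...]:
    return tuple(int(part) for part in s.split("."))

print(as_version("3.11") > as_version("3.8"))  # True
```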

What’s the core issue here?

In the wake of the DeepSeek frenzy, I looked into its research paper, which explains how its reasoning capabilities are enhanced through outcome-oriented reinforcement learning. The paper suggests that, in theory, outcome-oriented reinforcement learning can help a model learn proper reasoning processes. In practice, however, it does not necessarily work out that way.

Take the above math problem as an example. The answer is binary (yes/no), so even random guessing has a 50% chance of being correct. This highlights a key potential flaw: outcome-oriented supervision signals are weak because they lack sufficient granularity. Such weak supervision inevitably hampers the model's ability to learn proper reasoning processes.
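To make the weakness concrete, here is a toy simulation (a hypothetical setup, not DeepSeek's actual training loop): a policy that ignores the question entirely still collects the binary outcome reward about half the time, so the reward carries almost no information about whether the reasoning was sound.

```python
import random

def outcome_reward(answer: str, correct: str = "no") -> int:
    """Outcome-only supervision: 1 if the final answer matches, 0 otherwise.
    The reasoning that produced the answer is never inspected."""
    return 1 if answer == correct else 0

random.seed(0)
# A "policy" that ignores the question and guesses uniformly at random.
guesses = [random.choice(["yes", "no"]) for _ in range(10_000)]
rewards = [outcome_reward(g) for g in guesses]

print(sum(rewards) / len(rewards))  # close to 0.5: guessing is rewarded half the time
```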

Three Possible Solutions

  1. Scaling Up the Model
    One approach is to make the model larger and deeper, hoping that the theoretical concept of lossless compression based on Kolmogorov complexity can be pushed to its limit. In doing so, proper reasoning, as the "shortest program," might eventually be learned by the model. Theoretically, correct reasoning ensures accurate results. However, the gap between theory and practice makes it hard to place much confidence in this. The shortest program or lossless compression might just be an unreachable ideal.
  2. Targeted Supervision Data
    Another solution is to feed the model problem-specific supervised data, for example thousands or tens of thousands of reasoning cases involving such math problems. There is no reason the model would fail to learn from this. However, solving one specific problem this way is merely a stopgap measure; soon, others will come up with new edge cases that combine weak supervision signals with reasoning pitfalls to challenge it.

    A related challenge is the so-called "self-identification" problem. When asked "Who are you?", many models, including earlier versions of DeepSeek, would claim to be ChatGPT developed by OpenAI if no targeted supervised data is injected. After all, ChatGPT has dominated the internet in the two years since its explosive debut, and its data has inevitably influenced other models. This issue is already on the radar for specialized solutions and is gradually becoming less of a problem. Some Western media still claim that DeepSeek is just a distilled version of ChatGPT. Their evidence? Probably early versions they tested, in which the DeepSeek bot often claimed to be OpenAI's ChatGPT. If you test it now, the problem no longer appears; most likely it was fixed with specialized training data. The DeepSeek research paper also mentions addressing self-identification as a known issue.

    Similarly, the problem of comparing 3.11 and 3.8 may well be a transitional issue. If it disappears in the future, that won't be cause for celebration: most likely it will have been resolved through targeted fixes rather than through fundamental improvements in intelligence brought about by algorithmic or architectural innovations.
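What such "targeted supervised data" might look like for this one problem class can be sketched as a toy generator of decimal-comparison items with worked rationales. Everything here (field names, rationale wording) is illustrative, not DeepSeek's actual data pipeline:

```python
import random

def make_example(rng: random.Random) -> dict:
    """One decimal-comparison item with a worked, step-by-step rationale."""
    a = round(rng.uniform(0, 10), rng.choice([1, 2]))
    b = round(rng.uniform(0, 10), rng.choice([1, 2]))
    answer = "yes" if a > b else "no"
    rationale = (
        f"Align the decimals: {a:.2f} vs {b:.2f}; "
        f"therefore {a} is {'greater' if a > b else 'not greater'} than {b}."
    )
    return {"a": a, "b": b,
            "question": f"Is {a} greater than {b}?",
            "rationale": rationale,
            "answer": answer}

rng = random.Random(42)
dataset = [make_example(rng) for _ in range(10_000)]
print(dataset[0]["question"], "->", dataset[0]["answer"])
```

By construction every item's answer is consistent with its rationale, which is exactly the kind of dense, process-level signal the outcome-only reward lacks.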

  3. Re-introducing Process Reward Models (PRM)?
    The inherent weakness of outcome-oriented supervision is that it focuses only on the result while ignoring verification of the process, a natural shortcoming of reinforcement learning driven by results-oriented pragmatism (following the "black cat, white cat" principle, lol). This is essentially the cost of abandoning PRMs (Process Reward Models). So, would re-introducing process-based reward models solve the issue? Honestly, we don't know. This is the third possible path, and it might be worth exploring. But again, as mentioned in my previous blog post (DeepSeek's R1 Paper: A Storm in AI LLM Circle), PRMs aren't easy to work with: they're unstable and difficult to implement, although, in theory, they could help correct nonsensical reasoning during the process.
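The contrast between the two reward schemes can be sketched in a few lines. This is a toy with hand-labeled step scores; a real PRM is a learned model, and nothing here reflects any actual PRM implementation:

```python
def outcome_reward(final_answer: str, gold: str) -> float:
    # Outcome supervision: one scalar for the whole trajectory.
    return 1.0 if final_answer == gold else 0.0

def process_reward(steps: list[str], step_checker) -> float:
    # Process supervision: score each intermediate step, then aggregate.
    scores = [step_checker(s) for s in steps]
    return sum(scores) / len(scores)

# A flawed chain of thought that still lands on the right answer:
steps = [
    "3.11 has more digits than 3.8, so it looks bigger.",  # bad step
    "But aligning decimals gives 3.11 vs 3.80.",           # good step
    "0.11 < 0.80, so 3.11 < 3.8.",                         # good step
]
# Hand-labeled step scores stand in for a learned process reward model.
labels = {steps[0]: 0.0, steps[1]: 1.0, steps[2]: 1.0}

print(outcome_reward("no", gold="no"))                 # 1.0: the bad step is invisible
print(process_reward(steps, lambda s: labels[s]))      # about 0.67: the bad step is penalized
```

The outcome reward gives this trajectory full marks despite the nonsensical first step; the process reward is the only one of the two that can "see" it.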

In conclusion, the issue with DeepSeek struggling with problems like 3.11 vs. 3.8 lies in the limitations of weak supervision in results-oriented reinforcement learning. While there are potential solutions—scaling the model, targeted data, or process reward models—each comes with challenges and trade-offs. Whether any of these approaches can fundamentally improve reasoning capabilities remains an open question.

Published by

Wei Li (立委)

Dr. Wei Li, former VP of Engineering for the LLM team at Mobvoi (出门问问), focuses on large language models and their AIGC applications. He was Chief Scientist at Netbase for 10 years, where he directed the development of understanding and application systems for 18 languages: robust, running at line speed, scaling up to social-media big data, with semantics grounded in sentiment-mining products, making Netbase a front-runner in industrial NLP deployment in the US. Before that, he was VP of R&D at Cymfony for eight years, where his team won first place in the first question-answering evaluation (TREC-8 QA Track) and he served as PI for 17 SBIR information-extraction projects.
