Professor Ma's long paper out

Here is the link to Professor Ma Yi’s presentation from the Shenzhen Entrepreneurship Forum, in Chinese, recommended.

Professor Ma is a compelling speaker, and his talk is definitely worth listening to. His paper on whitebox transformer, over 100 pages long, has just been released (Yi Ma’s white-box transformer paper is available here). Unfortunately, I haven’t had the time to dig into it yet. We’ll have to wait until more people have accepted or verified it before delving deeper.

His current claims revolve around using an extremely sparse approach to force transparency in transformers, with results that are reportedly on par with BERT and GPT-2 in many benchmarks. However, this doesn’t mean that he will be able to catch up with GPT-3 or later models anytime soon. But to be fair, it’s not a level playing field—he’s an academic without the resources to compete with mainstream AI in an arms race. What he does believe, however, is that he has opened a door—a path toward explainable AI in large models.

Honestly, I’ve always had a litttle bit doubts about Ilya’s theory explanation of shortest program compression (his Berkeley talk). From an ultimate theoretical perspective—where lossless compression is the ideal—the idea of continually scaling training, deepening, and lengthening learning makes sense, as it pushes the model toward becoming the smallest possible program for universal tasks. Ilya’s theory may hold up in this respect, at least in theory or as an end goal. But in any real-world scenario (e.g., under budgetary constraints, with methodological limitations), it’s hard to call a model purely derived through gradient descent the “shortest program,” because these models appear to be gigantic beasts with "huge circuits" inside, intuitively, should not be considered "short or small".

Models with hundreds of billions or even trillions of parameters are massive monstrosities, succeeding mainly through sheer size rather than through high regularity or elegance. Emphasizing how impressive their compression ratios are or how well they handle lossless compression may help explain the generalization and emergeng abilities in sequence learning from a theoretical standpoint. But in practice, any model at a given time is far from being the “shortest program.”

This highlights an unavoidable distance between theory and practice. Ilya essentially hedged practice with theory along a future time axis, but our immediate reality doesn’t seem to align with this. It’s like a clumsy wrestler trying to brand himself as sleek and slender fashion model. Visually not a fit, to most of our eyes.

Instinctively, LLMs feel full of rote memorization with significant redundancy. Under real-world conditions, achieving extreme or lossless compression seems impossible.

On the other hand, Professor Ma’s sparsity approach almost feels “over the top.” Enforcing the same weight for QKV directly seems a bit crude and simplistic, yet it still managed to be trained successfully. This shows that there’s a lot of flexibility within transformers—no matter what restrictions or pruning are applied, the model still finds a path out. In this sense, Professor Ma’s pursuit of the “shortest program” is more real and direct—it’s so short that even a human can interprete the process (hence the LLM explainability).

Yet the difference between these two extremes is still mind-boggling. On one side, we have gigantic models, and on the other, extreme simplicity to generate whitebox models. The fact that both approaches work is shocking.

Speaking of simplicity and explainability, here’s an interesting anecdote in AI history: Back in the day, during the era of symbolic MT, one of the earliest deployed systems (Siemens' METAL) for English-German translation used only eight symbolic features (such as human, animal, etc.). The rules were simple, transparent, and easy to explain. This shows that extreme simplicity and rule-based transparency can work in some rough application scenarios (where English and German are linguistically close, making translation easier).

Later, we MT-ers expanded the number of features to the thousands, trying to cover more of the long tail. Even then, it wasn’t perfect. At the time, we thought that with enough effort, we could match the quality of statistical MT. But now, we know that even if symbolic MT could catch up and match statistical MT, it’s still far from competing with neural MT.

So, could we have continued refining features further? It wasn’t because we didn’t want to keep extending symbolic features (similar to one-hot encoding, but with the internal structure of ontology/taxonomy). We wanted to go beyond thousands to tens of thousands of features. But in reality, thousands (of features in size) were already reaching the limit of human experts’ capacity to understand (AI explanability), manage and debug. Expanding further would have been unmanageable.

Meanwhile, how many parameters do mainstream Transformer neural networks have? And the space and granularity they represent are on a completely different scale. Given the vast difference in scale between the two, it’s natural to doubt any efforts to bridge this gap for AI explanability. How could that even be possible?

That’s why I’ve always felt that explainability in large models is an elusive goal. But Professor Ma is telling the world that they’ve achieved it.

Relevant link:

Professor Ma Claims to Have Fully Unveiled the Mysteries of Neural Networks

What did Ilya see? -- secret behind success of LLMs

Professor Ma's long paper out

发布者

立委

发表回复