Backpropagation: The Key to Deep Neural Networks

By introducing "hidden layers" that perform nonlinear transformations, a network can map linearly inseparable low-dimensional problems (like the XOR gate) into higher-dimensional, separable spaces. From this point on, neural networks gained the ability to represent complex patterns and, in principle, to approximate arbitrary functions. For connectionism, achieving a true renaissance became merely a question of algorithm, time, and computing power.
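The XOR case can be made concrete with a tiny sketch. The weights below are hand-picked for illustration rather than learned, and the function names are hypothetical. A hidden layer lifts the 2-D input into three features, after which a single threshold separates the classes.

```python
def step(x):
    """Threshold activation: 1 if x > 0, else 0."""
    return 1 if x > 0 else 0

def hidden(x1, x2):
    """Lift a 2-D binary input into three hidden features."""
    h1 = step(x1 - 0.5)          # passes x1 through
    h2 = step(x2 - 0.5)          # passes x2 through
    h3 = step(x1 + x2 - 1.5)     # fires only when both inputs are 1 (AND)
    return h1, h2, h3

def xor_net(x1, x2):
    """One linear threshold on the hidden features computes XOR."""
    h1, h2, h3 = hidden(x1, x2)
    return step(h1 + h2 - 2 * h3 - 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, hidden(a, b), xor_net(a, b))
```

No single straight line separates the XOR classes in the original two dimensions, but in the three-dimensional hidden space one plane does.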

On the algorithmic front, one core problem long perplexed neural network researchers: how can we effectively train a complex multi-layer network? If the network makes a mistake, how do we determine which connection weight in which layer is at fault, and how should we correct it? This differs from symbolic logic, where rules can be directly encoded into transparent, debuggable computer programs—because the code itself is an interpretable symbolic language.

In 1986, in a landmark paper, Geoffrey Hinton, David Rumelhart, and Ronald Williams systematically elucidated the core method for cracking deep neural networks: the Backpropagation algorithm. It enabled neural networks to truly perform "deep learning" for the first time.

After the model produces an output, it first calculates the overall error—the difference between the prediction and the correct answer. The essence of backpropagation is that the model's error can propagate backward, layer by layer, like ripples. The connection weights in each layer can be adjusted to correct for their portion of the "error." In other words, each layer contemplates: "If I adjust myself slightly this way, I need to ensure it makes the final result a little better." This is how backpropagation teaches machines to learn from outcomes—through layer-by-layer adjustments, the entire network's output progressively converges toward the correct answer.

This algorithm solved the problem of training multi-layer networks.

You might ask: In a neural network with hundreds of millions of parameters, how is it possible to know which direction to adjust each one?

Indeed, there are countless "knobs" (parameters, weights) in a neural network, collectively determining the output. Improving the model requires coordinating all these knobs simultaneously amidst so many moving parts—it sounds like finding a needle in a haystack, blindfolded. But here's the elegance of mathematics: the "sense of direction" for each knob can be computed locally.

The key idea of backpropagation is this: although the network is vast, each connection is part of a clear chain of causality. If we can calculate "how much the final output error would change if this specific connection weight were tweaked slightly," then we know which direction to nudge it. This is precisely what the gradient represents. It's not based on vague intuition but is precisely computed using calculus—a derivative (slope). In a multi-dimensional space, the gradient points in the direction of the steepest ascent of the error; thus, moving against it decreases the error fastest.
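As a minimal sketch of "tweak one weight slightly and watch the error", assume a one-weight model with squared error (the input 2.0, target 10.0, and function names are illustrative choices, not from the text). The slope measured by nudging matches the slope given by calculus.

```python
def loss(w, x=2.0, y=10.0):
    """Squared error of the one-weight model prediction = w * x."""
    return (w * x - y) ** 2

def numeric_slope(w, eps=1e-6):
    """Estimate d(loss)/dw by nudging w a little in both directions."""
    return (loss(w + eps) - loss(w - eps)) / (2 * eps)

def analytic_slope(w, x=2.0, y=10.0):
    """Exact derivative from calculus: 2 * (w*x - y) * x."""
    return 2 * (w * x - y) * x

w = 3.0
print(numeric_slope(w), analytic_slope(w))
```

At `w = 3.0` the slope is negative, so increasing `w` lowers the error; the sign of the slope is the "direction sense" for that knob.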

In other words, even with hundreds of millions of parameters, each parameter can independently compute a tiny arrow—based on its "propagated relationship" with the output error—indicating which way to move to reduce the error slightly. These minuscule "better directions" aggregate into a coordinated adjustment across the entire network. Thus, without needing a global view, the machine finds direction locally. In each training cycle, the model "descends the hill" this way: from its current position, it follows the error-reducing gradient downward a bit, then recalculates the new gradient. Repeating this thousands or millions of times, the error shrinks, and the model grows smarter.
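The "descend, recompute, repeat" loop can be sketched in a few lines. This assumes a two-weight linear model, a single training example, and an arbitrary learning rate of 0.05; all of these are illustrative choices.

```python
def predict(w, x):
    """Linear model: weighted sum of the inputs."""
    return sum(wi * xi for wi, xi in zip(w, x))

def loss(w, x, y):
    """Squared error between prediction and target."""
    return (predict(w, x) - y) ** 2

def gradient(w, x, y):
    """One partial derivative per weight: 2 * error * input."""
    err = predict(w, x) - y
    return [2 * err * xi for xi in x]

def descend(w, x, y, lr=0.05, steps=30):
    """Repeatedly step against the gradient, recording the loss."""
    history = [loss(w, x, y)]
    for _ in range(steps):
        g = gradient(w, x, y)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
        history.append(loss(w, x, y))
    return w, history

w, history = descend([0.0, 0.0], x=[1.0, 2.0], y=5.0)
print(history[0], history[-1])
```

Each weight uses only its own partial derivative, yet the shared loss shrinks with every step.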

Perhaps you'd further ask: In a network with so many parameters, won't these local adjustments cancel each other out? Common sense suggests that in complex systems, local optimizations might throw the whole system off balance. However, the marvel of backpropagation lies in this: each local adjustment is not blind. They all share a common objective: to minimize the overall error (loss). The negative gradient, computed via calculus, indicates the steepest downhill direction in the vast "error landscape." The sign of each component tells its connection which way to adjust, and its magnitude suggests how much. When all connections compute their own "little sense of direction" and update accordingly, the entire network moves toward minimizing error. Thus, even faced with an immensely complex system of billions of parameters, the model achieves local judgment leading to global improvement. This is the secret sauce enabling deep learning.

Think of it like a massive orchestra. Each musician reads only their own sheet music (local information), often paying little attention to the other parts. Yet, they all follow the same conductor (the loss function)—who, through gestures (gradient signals), instructs each player to be louder, softer, faster, or slower. Thus, although no single musician "sees the big picture," the whole orchestra plays in harmony.

To truly understand backpropagation, view the neural network as a chain of "relay functions." One layer transforms the input into an intermediate representation, the next layer compresses that into another, and so on, until the output. Mathematically, this is a composite function: the output is a function of the previous layer's output, which in turn is a function of the layer before it, and so forth, link by link.
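For a three-layer chain, the composite-function view can be written out with the chain rule. The symbols $f_1, f_2, f_3$ and $h_1, h_2$ are introduced here only for illustration:

$$
y = f_3\big(f_2(f_1(x))\big), \qquad h_1 = f_1(x), \quad h_2 = f_2(h_1),
$$

$$
\frac{\partial y}{\partial x} = \frac{\partial y}{\partial h_2} \cdot \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial x}.
$$

Each factor is a local slope that depends on only one link of the chain, which is why the full derivative can be assembled link by link.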

With millions of weights w in the network, the trick is to compute and reuse local slopes during a single backward pass, moving from "downstream to upstream." This efficiently provides each weight with its required "direction sense." In engineering, this is called reverse-mode automatic differentiation: first, values flow forward (the forward pass), then "sensitivities" flow backward (the backward pass). The forward pass is like solving the problem; the backward pass is like grading it. Each node simply does two small tasks: it stores its local slope from the forward pass, and during the backward pass, it takes the sensitivity received from downstream, multiplies it by this stored slope, and distributes the result upstream along its connections. These local accounts, when summed, miraculously yield the correct global account.
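The two small tasks per node can be sketched as a toy scalar reverse-mode differentiator. This is a minimal illustration, not how any particular framework implements it; the `Node` class and `backward` function are hypothetical names.

```python
class Node:
    """A scalar value that remembers how it was computed."""

    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        # Each parent is stored with the local slope d(self)/d(parent),
        # recorded during the forward pass.
        self.parents = parents

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

def backward(output):
    """Send sensitivities from the output back to every input."""
    order, seen = [], set()

    def visit(node):
        # Post-order walk: a node is listed after everything it depends on.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Reversed order guarantees a node is handled after all its consumers.
    for node in reversed(order):
        for parent, local_slope in node.parents:
            parent.grad += node.grad * local_slope

w, x, b = Node(3.0), Node(2.0), Node(1.0)
err = w * x + b + Node(-10.0)   # prediction minus target
loss = err * err
backward(loss)
print(loss.value, w.grad, x.grad, b.grad)
```

One backward sweep fills in the slope of the loss with respect to every input, reusing each stored local slope exactly once per connection.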

The success of backpropagation laid the algorithmic foundation for deep learning. It propelled connectionism from academic ivory towers into practical application, providing the theoretical and technical prerequisites for the "deep learning" explosion years later.

Multimodal Tokens and the Case for Unified Modeling

Advocates of “unified” models argue that the many signals of the world—text, images, audio, video—should be tokenized and mapped into the same semantic field (a shared hidden vector space), so they can be trained jointly and modeled with one brain.

A common recipe is to atomize other modalities the way we do text: turn them into token sequences so that sequence learning—the secret sauce behind LLMs' explosive success—can extend beyond text. Universal tokens lay the groundwork for this extension. Once text, images, audio, and video are all sliced into computable tokens of similar type, one step remains: make them play in the same band (for training and for generation).

The most straightforward trick is to give each instrument a passport. Before entering the model, every segment carries a modality tag—special start symbols telling the conductor whether the next stretch is image tokens, audio tokens, or text (e.g., <image>, <audio>). Positional encodings then act like a seating chart, indicating how these tokens are arranged in a sentence, on an image grid, or along a timeline. After a few rounds of training, the model internalizes these hints: look at pixels when needed, listen when needed, and weave everything into a coherent narrative when it speaks.
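A minimal sketch of the "passport plus seating chart" idea follows. The token names are hypothetical and the position is a plain index standing in for a real positional encoding; actual systems differ in how they tag and position tokens.

```python
def build_sequence(segments):
    """Flatten (modality, tokens) segments into one tagged sequence.

    Each item in the result is (token, modality, position). The position is
    the index within its segment, a stand-in for a real positional encoding.
    """
    sequence = []
    for modality, tokens in segments:
        # The start symbol announces which modality follows; it has no position.
        sequence.append((f"<{modality}>", modality, None))
        for position, token in enumerate(tokens):
            sequence.append((token, modality, position))
    return sequence

# Hypothetical token names, for illustration only.
segments = [
    ("image", ["img_17", "img_402", "img_9"]),
    ("text", ["a", "cat", "on", "a", "mat"]),
]
for item in build_sequence(segments):
    print(item)
```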

A deeper level of fusion is handled by attention. Think of attention as a rehearsal room with glass walls: text can “glance” at image tokens and images can “nod back.” Over time, some heads specialize in image–image relations, others in image–text translation. Flamingo makes this explicit: cross-modal attention layers are inserted between LLM layers so that the word currently being generated can continually “look back” at the relevant image region. To users this appears as abilities for things like step-by-step visual Q&A; under the hood, a stream of text tokens moves forward while aiming its attention at precisely the visual patches that matter.

In engineering practice, a common compromise is front-end experts first, then merge. PaLM-E in robotics is a good example: images are first distilled by a pretrained vision encoder (e.g., ViT) into compact representations, then projected into perception tokens; text is tokenized as usual; robot state vectors can be appended when needed. With appropriate modality tags, all of these enter a shared Transformer backbone, where dialog, reasoning, and decision-making happen on a single score. In this analogy the front end acts like a pickup: it cleans and compresses raw signals.

More concretely: the “pickup” is a modality-specific encoder (ViT for images, an acoustic encoder for speech, etc.). The “front end” consists of the pickup plus a small projection/adapter that turns features into tokens the backbone can digest. The “backbone” is the unified Transformer that handles cross-modal alignment, long-range memory, and reasoning. In short, the backbone processes music, not the raw electrical currents of each microphone. It doesn’t touch pixels or raw waveforms; it ingests track-like, distilled features produced by the front end and aligns/reasons over them. Redundant detail and noise are filtered at the front; during generation, pixel-level detail is filled back in by a diffusion renderer or a decoder according to the backbone’s plan.

For instance, a 224×224 image has about 50k pixels, or over 150k raw values once the three color channels are counted. If you hand all of them to the backbone, attention wastes budget on repetitive textures. A ViT-style pickup patches, encodes, and compresses the image into roughly a couple hundred visual tokens, each like a short riff that concentrates the edges, shapes, and relations that matter for understanding. Audio is similar: each second of raw 16 kHz samples is first converted to tens or hundreds of latent vectors; rhythm and timbre are distilled, while noise and redundancy are filtered. The backbone consumes these high-semantic-density tokens (i.e., semantic tokens), so compute is lighter and crucial information remains intact. When trained end-to-end, gradients from the backbone flow back to the front end, encouraging it to keep what matters for downstream generation and discard redundancy.
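The arithmetic behind "a couple hundred visual tokens" is easy to check. The sketch assumes non-overlapping 16×16 patches, a common ViT setting that the text does not specify.

```python
def vit_token_count(height, width, patch):
    """Number of non-overlapping patch tokens for an image."""
    return (height // patch) * (width // patch)

pixels = 224 * 224                 # spatial positions in the image
values = pixels * 3                # raw numbers once RGB channels are counted
tokens = vit_token_count(224, 224, patch=16)   # assumed patch size

print(pixels, values, tokens)
```

Going from roughly 150k raw values to under 200 tokens is what keeps the backbone's attention budget affordable.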

Those few, information-dense tokens inside the backbone don’t carry pixel-level detail. Detail returns through a wider channel on the rendering side during generation. The backbone sets the plan; the renderer does the fine work. Three practical stages or approaches are popular in generation.

1) Composition → Orchestration. The front end is paired with a decoder: the encoder compresses content into compact codes (discrete codewords/tokens or low-dimensional latents), and the decoder can reconstruct pixels. At generation time, the backbone predicts/plans decodable representations (a string of codes or a per-frame latent vector) rather than pixels. The decoder then “orchestrates” texture, light, and materials from those codes. In image/video, VQ-VAE/MAGVIT-style models follow this predict-codes → decode path. With residual vector quantization (RVQ), detail is added coarse-to-fine: the backbone first emits higher-level codes; the renderer produces a solid base; lower-level residual codes then refine it layer by layer.

2) Storyboard → Cinematography. The backbone provides structural plans—a low-res sketch (blueprint), motion hints/optical flow, keypoints, or a camera trajectory. Each frame is then handed to in-frame diffusion (latent-space rendering) to “develop” the image from noise under those conditions. Diffusion doesn’t need the backbone to carry high-frequency detail; it iteratively reveals detail in a high-res latent space. This is the “next-frame prediction + in-frame diffusion” split: temporal coherence by the backbone, visual richness by diffusion.

3) From coarse to fine. The backbone first outputs a coarse result—low resolution or higher-level codes—then the rendering stack applies super-resolution (SR) and residual refinement to stack up resolution and texture. The farther down the pipeline, the wider the bandwidth, but this bandwidth is handled on the rendering side rather than bloating the backbone with pixel-long sequences. Many systems expose these as configurable quality gears: stop at 720p if the user is in a rush, or climb to 1080p/4K when desired.

One subtlety: the backbone does not simply “discard detail.” First, in joint end-to-end training, the front end and decoder co-adapt with the backbone so that what’s “compressed away” is true redundancy, while cues crucial for generation (edges, materials, style, rhythm) are preserved in a recoverable latent space. Second, many renderers look back at multi-scale front-end features (e.g., via U-Net skip connections or cross-attention during decoding), allowing them to query high-bandwidth details on demand—without hauling them through the backbone.

How the renderer “looks back” depends on the task:

- From-scratch generation (text-to-image/video): there’s no high-res reference to query. The renderer relies on learned statistics to “hallucinate” detail from the backbone’s directional plan; the visual encoder in the front end shapes the latent space during training but typically isn’t invoked during sampling.
- Conditional generation/editing (image-to-image, inpainting, colorization, video continuation, style transfer): the renderer does “look back.” The reference image or previous frame is encoded to multi-scale features; the decoder/diffusion network uses skip connections or cross-attention to pull in high-res cues, aligning edges, textures, and brushwork precisely.

In the autoencoder/vector-quantization track, encoder and decoder are two sides of the same coin: the encoder compresses images/video into a shared latent language; the decoder restores pixels from latents/codes. They are two networks trained around the same latent interface with opposite roles, one translating into the latent language and the other out of it. Whether the encoder is used at generation time depends on the task: unconditional generation needs only the decoder; conditional generation, editing, and video consistency bring the encoder back to supply high-res detail.

Put together, the pipeline is clear: the backbone sets the plan—semantic coherence and causal logic—without hauling pixels across long distances; the renderer does the fine work; it looks back when a reference exists, and otherwise lets diffusion “develop” detail locally. Decoding, diffusion, and super-resolution are the high-bandwidth back end that rebuilds the scene in place. Encoder and decoder share a latent interface, each doing its half of the job—two sides of the same coin.

Neural Codec: Key Audio Techniques in the LLM Era

“Codec” is short for coder–decoder: an end-to-end machine for compressing and then restoring audio. The encoder compresses a signal into a more compact representation; the decoder reconstructs it as faithfully as possible.

In the LLM era, audio—like text—is often cut into short segments and encoded as a sequence of discrete tokens. The “audio dictionary” used for quantization is called a codebook, like a spice rack in a kitchen: discrete little vectors in separate slots, each with an index. Unlike text, which typically has a single vocabulary, neural audio coding often uses several codebooks at the same time step (audio unit), quantizing from coarse to fine in layers—a scheme known as RVQ (Residual Vector Quantization). During compression, the system picks one entry from each codebook, using their indices to “remember” that instant of sound in layers; during reconstruction, the vectors addressed by those indices are summed, layer by layer, to restore and refill the flavor.
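The pick-one-entry-per-codebook, then sum-on-decode procedure can be sketched with two tiny hand-made codebooks. The values are hypothetical; real codecs learn their codebooks and use far larger ones.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vec, codebooks):
    """Quantize coarse-to-fine; return one index per codebook."""
    residual, indices = list(vec), []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        # What this layer failed to capture is passed to the next layer.
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices

def rvq_decode(indices, codebooks):
    """Sum the entries addressed by the indices, layer by layer."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out

# Toy codebooks (hypothetical values): a coarse grid, then finer corrections.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0], [0.0, -0.25]],
]
frame = [0.9, 0.3]
tokens = rvq_encode(frame, codebooks)
approx = rvq_decode(tokens, codebooks)
print(tokens, approx)
```

The first index captures the rough position and the second refines it, so the two-index reconstruction lands closer to the input than the first layer alone.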

Earlier TTS pipelines typically did “Mel-spectrogram draft” (a continuous-valued “score”) → a neural vocoder that plays it back into waveform. The LLM-native neural codec more often runs “semantic tokens → acoustic tokens → decode to waveform.” Both semantic tokens and acoustic tokens are discrete; they differ in granularity and division of labor.

Multi-layer token coding via RVQ is a key innovation for extending LLM methods to audio. Without it, token counts would explode. By its nature, layering simplifies the complex—divide and conquer. It’s an innovation in representation that pushes discretization all the way through.

“How many layers?” There’s no universal number. It’s a knob you turn together with bitrate, latency, and quality targets. Take Google’s SoundStream: it refines the same time step with residual quantization. The paper reports common 4/8/16-layer setups and even an extreme 80-layer experiment, showing training can remain stable with many layers. For a fixed target bitrate, you can trade fewer layers with larger codebooks against more layers with smaller ones—the design space is flexible.

Meta’s EnCodec follows a “multi-codebook, variable-bandwidth” approach: the 24 kHz model uses up to 32 codebooks (i.e., 32 RVQ layers), the 48 kHz model up to 16, with codebooks typically of size 1024. During training, it randomly selects a subset of layers to participate in quantization, so a single set of weights can serve 1.5/3/6/12/24 kbps, etc. In deployment you simply “open” as many layers as you need, striking a balance between quality and realtime latency.
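The relationship between layers, codebook size, and bitrate is simple arithmetic. The sketch below assumes 75 token frames per second for the 24 kHz model, a figure not stated in the text above; treat it as an assumption.

```python
import math

def rvq_bitrate_bps(frames_per_second, n_layers, codebook_size):
    """Bits per second = frames/s * layers * bits per index."""
    return frames_per_second * n_layers * math.log2(codebook_size)

# Assumed frame rate of 75 frames/s; codebook size 1024 means 10 bits per index.
for layers in (2, 4, 8, 16, 32):
    print(layers, "layers:", rvq_bitrate_bps(75, layers, 1024) / 1000, "kbps")

# Same bits per frame, two designs: fewer large codebooks vs. more small ones.
print(rvq_bitrate_bps(75, 8, 1024), rvq_bitrate_bps(75, 16, 32))
```

Under that assumption, opening 2, 4, 8, 16, or 32 layers corresponds to the 1.5/3/6/12/24 kbps settings, and 8 layers of size 1024 cost the same bits as 16 layers of size 32.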

Don’t confuse “number of layers” with “hierarchical scales.” OpenAI’s Jukebox uses three time-scale levels in a VQ-VAE: the top level is sparse, the bottom dense, carrying information from long-range song structure down to timbral details. That’s “horizontal” stratification over time, not stacking multiple residual quantizers at the same time step.

A rule of thumb: for realtime, low-latency speech, 4–16 layers are common; for music or higher fidelity at 24 kHz, a few dozen layers aren’t unusual. The final choice isn’t doctrinaire—it depends on your target bitrate, acceptable latency, and how exacting the ear is about texture.

A neural codec begins with a neural encoder that massages raw waveforms into compact latent vectors; then a quantizer selects entries from the layered codebooks—coarse strokes for contour, finer strokes for texture; finally a neural decoder turns these discrete entries back into an audible waveform. The whole chain is trained end to end, aiming to tell the ear what it cares about—timbre, prosody, articulation, even a touch of room tail—using as few indices as possible. The result is a stable, reversible, bitrate-tunable discrete sequence.

In this setup, “audio tokens” are the indices themselves. Every few tens of milliseconds, the quantizer picks a set of indices from the layered codebooks to describe that slice; over time you obtain a readable, writable “acoustic text.” This differs from the traditional Mel spectrogram: the Mel is a continuous “photo” better suited to a conventional neural vocoder; audio tokens are a discrete “word string” that can both decode back to high-fidelity waveforms and be continued, edited, and aligned by GPT-style autoregressive models like text.

In one line: a codec is the whole compress–restore machine; the codebook is just the rack holding the discrete basis vectors—what counts as tokens are their indices. A neural codec rebuilds this machine with neural networks so it can describe a sound using a string of discrete tokens and, when needed, sing that string back into real speech or music. Put simply, audio processing can now reuse the compression–and–reconstruction playbook that has proven so powerful for text LLMs—that’s the secret behind audio’s takeoff in the era of large models.

Breakthroughs in Speech Technology in the Era of Large Models: Ultra-Realism and Full Duplex

As large language models (LLMs) expand into audio, progress has been breathtaking. “LLM-native” speech technology reached practical maturity roughly half a year ago, and the entire industry has surged forward. Two features mark this maturity: ultra-realistic speech and full-duplex interaction.

Audio tokenization—akin to what we do with text—has not only produced “model musicians” (Suno is the signature product here; I’ll cover it in a separate tech blog), but more importantly, it has catalyzed two changes that truly transform language interaction: ultra-realistic speech synthesis, and natural conversations where the system can listen and speak at the same time and be interrupted at any moment—that is, full duplex.

Start with ultra-realism. Old-school TTS always sounded like a well-trained announcer: articulate and polished, yet lacking the grain and personality of everyday speech. Once a neural codec turns speech into reversible discrete tokens, synthesis no longer “reads out” only the text/content; it can also faithfully reproduce the manner of speaking, which makes everyday, down-to-earth speech possible. The model first drafts a sketch in the space of “speech semantics”: where to take a breath, when to lower the voice, which word should gently rise. It then renders that as a string of audio tokens, and the decoder brings out the breath, sibilants, and the room’s tail reverberation. You can hear the rasp in a voice, the faint nasal smile, and even preserve a speaker’s “vocal persona” across languages. Interaction no longer has a one-size-fits-all voice; it feels like chatting with a neighbor or a friend. The difference from the pre-LLM era lives in those micro-textures carried by the layered discrete codes (tokens).

Now full duplex. Early voice assistants were like walkie-talkies: you speak, then they speak, taking turns. Natural conversation is more like a kitchen scene: you’re chopping vegetables, a friend reminds you to preheat the oven, you cut in—“wait, let’s halve the salt first”—and your friend immediately pivots and responds to the new instruction. Achieving that natural feel looks like an interaction design problem (“interruptible,” “barge-in”), but underneath it requires three things working in concert: continuous listening—streaming the incoming speech into tokens in real time; speaking while listening—generating its own audio tokens ready to speak out while always keeping room to brake; and low-latency understanding and revision—completing the loop of listen → update plan → change what it says in a few hundred milliseconds.
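The three ingredients (keep listening, speak while listening, revise quickly) can be caricatured in a toy loop. This is a deliberately simplified sketch with hypothetical names: one frame per loop step, string tokens, and a stub planner in place of a real model.

```python
def full_duplex(incoming, plan_reply):
    """Toy turn manager: speak one planned token per silent frame.

    incoming: list of frames, each either None (user silent) or a user token.
    plan_reply: function mapping everything heard so far to a list of tokens.
    Returns (spoken, heard).
    """
    spoken, heard, pending = [], [], []
    for frame in incoming:
        if frame is not None:
            # Listening never stops; new input replaces the unfinished plan.
            heard.append(frame)
            pending = list(plan_reply(heard))
        elif pending:
            spoken.append(pending.pop(0))
    return spoken, heard

def plan_reply(heard):
    # Hypothetical planner: acknowledge the latest input, then two steps.
    return [f"ack:{heard[-1]}", "step1", "step2"]

incoming = ["preheat", None, None, "halve-salt", None, None, None]
spoken, heard = full_duplex(incoming, plan_reply)
print(spoken)
```

The first reply is cut off after two tokens when the user barges in, and the system pivots to the new instruction instead of finishing the old plan.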

Neural codecs shine again here: by turning both what is heard and what will be said into the same kind of discrete sequence (token string), the model can read and write on the same timeline, managing turn-taking like a human in natural conversation.

These two advances are not isolated. The more ultra-realistic the synthesis, the more the token stream must explicitly encode pauses, stress, laughter, and other details. The smoother the full duplex, the more it relies on a stable, dense, and compact flow of audio tokens so the model can switch course on the fly without dropping a beat. Because the underlying representation has been unified into a “readable and writable token sequence,” we can naturally say, “Don’t be so formal—talk to me like you used to,” and the assistant instantly switches tone. We can also cut in mid-sentence—“Skip that—give me something concrete: go or not?”—and the back-and-forth remains fluid and unawkward.

In systems that truly deliver “ultra-realistic + full-duplex” today, the winning recipe is a hybrid of listen–think–speak. Up front, a streaming auditory model continuously turns your voice into a compact sequence of audio tokens; in the middle, a large model handles understanding and planning; at the back, the system writes its intended response as audio tokens and decodes them into a waveform in real time. The reason we don’t always hand everything to a single end-to-end autoregressive GPT that goes directly from audio-in to audio-out comes down to two constraints: millisecond-level latency requirements and the need for tight control over interruption and course correction.

Think of it as a well-rehearsed band. The streaming speech encoder (ASR) is the drummer, keeping time in tens-of-milliseconds beats. The large model in the middle is the conductor, continuously re-orchestrating based on the latest auditory tokens. The neural-codec decoder (TTS) is the lead vocalist, singing while leaving braking distance so an interruption can “hit the brakes” instantly. In practice, both drummer and vocalist are often smaller, faster specialist models (streaming recognition such as RNN-T/Conformer, paired with neural-codec-based fast synthesis) rather than folding every detail into one gigantic autoregressive stack. Otherwise, interruptions and back-ups might blow up latency.

This does not mean a return to the old ASR/TTS pipeline. The key change is that the base representation has become unified discrete audio tokens: the listening side no longer produces only text characters; it can emit semantic or acoustic units as well. The speaking side no longer merely concatenates phonemes; it writes a reversible stream of codec tokens that carries breath, stress, and reverberant tails. The large model either plans at the hidden layer or refines a two-step “semantic tokens → acoustic tokens” plan, then hands the result to the decoder to render. This keeps full-duplex latency low while preserving ultra-realistic texture.

Looking ahead, research is converging: end-to-end “spoken LLMs” are pulling listening and speaking into the same set of parameters, taking audio tokens directly as input and output (speech2speech). In engineering, however, conversation management—who speaks when, how to interrupt, how to revise—remains as guardrails for smoothness and robustness. Today’s best practice is like a hybrid car: the chassis is the unified, tokenized language; the engine is the large model; and the start/stop control is often delegated to the small specialized motors. The result is fast and stable—and it keeps the warmth of the human voice.

Attention Collapse: The Misunderstood Truth About “Rank”

The topic may sound obscure, but it goes straight to the heart of large language models.

Before We Dive In: A Quick Refresher on the Basics

What is the rank of a matrix?

You can think of a matrix as a big table made up of vectors. The rank is simply the number of truly independent information channels in that table.

For example:

$\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$

The two rows are completely different, providing two independent channels → rank = 2.

$\begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix}$

The second row is just twice the first → effectively only one piece of independent information → rank = 1.

So rank = how many independent channels of information a matrix really carries.
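Rank can be checked mechanically with Gaussian elimination. The sketch below uses exact fractions to avoid rounding issues; the two small matrices are illustrative examples of the two cases described above.

```python
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    n_cols = len(m[0]) if m else 0
    for col in range(n_cols):
        # Find a row at or below the current one with a non-zero entry here.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

print(matrix_rank([[1, 0], [0, 1]]))   # two independent rows
print(matrix_rank([[1, 2], [2, 4]]))   # second row is twice the first
```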

What does “full rank” mean?

If a matrix is $N \times N$, it can have at most $N$ independent channels. Full rank means it actually uses all $N$, with nothing wasted.

If it doesn’t (say a $1000 \times 1000$ matrix has a rank of only 50), it’s like having 1000 microphones on the table but only 50 of them truly working.

What are singular values?

Mathematicians use Singular Value Decomposition (SVD) to break a matrix down into its “main channels.” Each channel has a strength, called a singular value. The number of non-zero singular values equals the rank.

Intuitively:

- Large singular value → that direction carries useful information.
- Near-zero singular value → that direction is effectively ignored.

If most singular values are close to zero, the matrix may look big, but its effective dimensionality is tiny.
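A 2×2 example shows the gap between "non-zero" and "meaningful". The sketch uses the closed-form singular values of a 2×2 matrix; the matrix entries are an illustrative choice in which the second row is almost, but not exactly, twice the first.

```python
import math

def singular_values_2x2(a, b, c, d):
    """Singular values of [[a, b], [c, d]], largest first.

    Uses two facts about 2x2 matrices: s1^2 + s2^2 equals the sum of the
    squared entries, and s1 * s2 equals |det|.
    """
    t = a * a + b * b + c * c + d * d
    det = a * d - b * c
    s1 = math.sqrt((t + math.sqrt(max(t * t - 4 * det * det, 0.0))) / 2)
    s2 = abs(det) / s1 if s1 > 0 else 0.0
    return s1, s2

strong, weak = singular_values_2x2(1.0, 2.0, 2.0, 4.001)
print(strong, weak)
```

Both singular values are non-zero, so the matrix has rank 2 on paper, yet the second one is thousands of times smaller than the first: effectively one channel.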

Why does this matter for LLM attention?

The attention matrix in Transformers is essentially an information allocation table, deciding which tokens look at which others, and how strongly.

- Theoretically, it is full rank: every token can in principle look across the entire sequence.
- In practice, experiments show the effective rank is far lower than the sequence length $L$.

This means long contexts are poorly utilized. Even if a model claims to handle 100k tokens, in reality only a few dozen effective dimensions get used. Understanding this gap is crucial to understanding the limitations of large models: context window competition, long-range forgetting, and so on.

Back to the Technical Question

“Isn’t the autoregressive attention matrix just lower-triangular? The diagonal entries are all positive, so it must be full rank, right?”

This argument sounds airtight: by definition, the rank is the number of non-zero singular values. If every token at least attends to itself, then diagonals are >0, so the matrix should be full rank.

Mathematically speaking, that’s correct — but it misses the point.

The Mathematical View: Full Rank on Paper

From a linear algebra perspective:

- algebraic rank = the number of non-zero singular values.

As long as the diagonal entries are non-zero, the attention matrix is technically full rank.

This is like an exam script where every question has an answer written down — even if most answers are nonsense, nothing is left blank. Or like having 100 microphones, each at least making some sound, so algebraically the rank is 100.

And yes: the causal mask is a lower-triangular matrix ensuring each token only looks backward. By construction, the diagonal is positive, so the matrix is full rank.
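The step from "positive diagonal" to "full rank" can be spelled out. Writing $A$ for the $L \times L$ causal attention matrix with entries $a_{ij}$ (notation introduced here for illustration), the determinant of a triangular matrix is the product of its diagonal entries, and softmax outputs are strictly positive:

$$
\det A = \prod_{i=1}^{L} a_{ii} > 0 \quad \Longrightarrow \quad \operatorname{rank}(A) = L.
$$

Nothing in this argument says how large the $a_{ii}$ are, which is exactly the gap the next section examines.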

The Engineering Reality: Effective Rank Collapse

But what really matters in intelligence engineering is the effective rank: the number of singular values that meaningfully carry information.

Think of it as “not how many microphones are plugged in, but how many actually transmit a clear signal.” If only three are loud and the rest are whispers or noise, the effective rank ≈ 3.

This explains the apparent contradiction:

- Algebraically, attention can be full rank.
- Empirically, effective rank is tiny, often orders of magnitude smaller than the token sequence length $L$.

Studies show sharp singular value decay: over 90% of the energy lies in just a few principal components. As layers deepen, the collapse compounds, leading to “rank collapse.”
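One common way to quantify "effective rank" is the number of leading singular values needed to hold a given share of the energy (sum of squared singular values). The sketch below uses that definition with a 90% threshold and a made-up spectrum; other definitions exist, and the numbers are purely illustrative.

```python
def effective_rank(singular_values, energy=0.90):
    """Smallest k whose top-k singular values hold `energy` of the total.

    Energy is measured as the sum of squared singular values.
    """
    squared = sorted((s * s for s in singular_values), reverse=True)
    total = sum(squared)
    running = 0.0
    for k, value in enumerate(squared, start=1):
        running += value
        if running >= energy * total:
            return k
    return len(squared)

# Hypothetical spectrum: three loud channels and 97 whispers.
spectrum = [10.0, 8.0, 6.0] + [0.1] * 97
algebraic_rank = sum(1 for s in spectrum if s > 0)
print(algebraic_rank, effective_rank(spectrum))
```

All 100 singular values are non-zero, so the algebraic rank is 100, yet three of them already carry more than 90% of the energy.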

The Theoretical Prediction: Rank Bottlenecks

Why does this happen? Linear algebra already gave us the warning.

- The attention weights come from $\mathrm{softmax}(QK^\top / \sqrt{d_k})$.
- The product $QK^\top$ (the pre-softmax score matrix) has rank at most $d_k$, the key/query dimension.
- So no matter how long the context $L$ is, the effective rank is bottlenecked by $d_k$.

If $d_k = 64$, then even though the window allows 100k tokens, after projection to $d_k = 64$ dimensions we only have 64 independent directions left.

This is like trying to drive 10k cars through a tunnel with only 6 lanes — the rest are stuck in line.
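The bound on the score matrix is easy to verify on a toy case. The sketch assumes small integer-valued queries and keys so the rank can be computed exactly; the sizes (12 tokens, key/query dimension 3) and the random seed are arbitrary illustrative choices.

```python
import random
from fractions import Fraction

def matrix_rank(rows):
    """Rank via Gaussian elimination with exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank

def scores_matrix(q, k):
    """Return Q K^T for matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]

random.seed(0)
seq_len, d_k = 12, 3   # toy sizes, chosen only for illustration
Q = [[random.randint(-3, 3) for _ in range(d_k)] for _ in range(seq_len)]
K = [[random.randint(-3, 3) for _ in range(d_k)] for _ in range(seq_len)]
scores = scores_matrix(Q, K)   # seq_len x seq_len pre-softmax scores

print(len(scores), "x", len(scores[0]), "matrix, rank", matrix_rank(scores))
```

The score matrix has 12 rows and 12 columns, but its rank cannot exceed 3, because every row is a combination of the 3 columns of the key projection.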

Rank Collapse in Practice

Beneath the illusion of algebraic full rank, effective rank collapses sharply. The attention matrix geometrically spans $L$ dimensions, but the usable subspace shrinks to a narrow slit.

Why Not Just the Identity Matrix?

What if attention degenerated into an identity matrix (each token only looks at itself)? Then the rank would indeed be $L$.

But that’s a pathological case:

- Strict rank = effective rank = $L$.
- Yet information flow = 0. No interaction, no learning, no intelligence.

Real-world measured attention matrices look nothing like this: instead, they have only a handful of strong singular values, with the rest collapsing to near-zero.

So “rank collapse” refers not to exceptions, but to the normal spectrum of attention in trained models.

The Role of Softmax and Multi-Head Attention

Softmax: Some might think softmax rescues rank. In fact, the opposite: row-wise normalization sharpens the distribution, making singular values even more concentrated. It acts as a driver of collapse, not a cure.

Multi-head attention:

- Each head has rank ≤ $d_k$.
- With $h$ heads, the theoretical upper bound is $h \cdot d_k$.
- This does extend effective rank, forcing heads to diversify.

But experiments show many heads learn redundant patterns. The actual gain is far below the upper bound — often only a few heads carry real new information.

The Mirage of Long Contexts

This is why context scaling announcements (128k tokens, 1M tokens) often ring hollow.

Yes, the model theoretically sees all tokens. But with rank collapse, most of that information is compressed into only a handful of directions.

So we see:

- Models forget the beginning of long documents.
- Fine details get blurred.
- Only a few salient segments survive, the rest fade like mist.

Lessons and Implications

The debate about “full rank vs. collapse” is about two perspectives:

- Mathematical full rank: Yes, attention is full rank algebraically.
- Engineering effective rank: In practice, the usable degrees of freedom collapse.

Understanding this helps us see:

1. The illusion of long context: Simply stretching sequence length hits diminishing returns fast.
2. Why architecture innovation matters: Rank regularization, MoE, SSMs, RAG— all are essentially attempts to bypass rank collapse and make information flow more efficiently.

At the end of the day, “million-token context” often sells better in marketing slides than it delivers in actual usable intelligence.

Low Rank ≠ Inherently Bad

Low rank does not automatically mean something is bad.

In high-dimensional spaces, many features are already highly correlated. Forcing “full rank” often just means preserving a huge amount of redundancy. It’s like recording the same song 100 times and then claiming, “Look, I have 100 independent audio tracks!” In reality, 95 of them are duplicates or noise.

But isn’t language itself low-rank?
The answer is: yes, to some extent. Natural language is inherently redundant. Its information entropy is far below the total number of tokens, so the effective dimensionality is naturally much smaller than $L$. In fact, low rank is often a beneficial mechanism for compression and generalization:

- It’s the same principle as Principal Component Analysis (PCA): compressing dozens of dimensions into a few principal directions can better capture the core patterns, remove noise, and improve generalization.
- Natural language inherently has fewer effective dimensions than its token count. You can’t expect 1000 words in a sentence to provide 1000 independent pieces of information; most of them are repetitions, paraphrases, or modifiers.

So the problem is not low rank itself, but collapsing too fast.

- Reasonable low rank: like mixing 100 microphones into 5 stereo channels — the music still sounds rich, and even clearer.
- Excessive collapse: if only one faint channel remains, then no matter how many singers are on stage, the audience only hears a dull hum.

This is why rank collapse has become a real concern in engineering practice. What we need is effective compression, not over-shrinking that destroys information pathways. The real challenge is how to preserve core patterns while still making use of long-range context and more independent directions.

Conclusion and Implications

The debate between “full rank” and “collapse” is about two perspectives overlapping. Once we understand this, we can see:

- The Mirage of Long Contexts: Extending sequence length alone doesn’t solve the bottleneck; performance quickly hits diminishing returns.
- The Drive for Architectural Innovation: Regularization, Mixture-of-Experts (MoE), SSMs, and retrieval-augmented methods are essentially all ways to bypass rank collapse and let information flow more effectively.

Reference

Bhojanapalli, Srinadh, et al. Low-Rank Bottleneck in Multi-Head Attention Models. Proceedings of the 37th International Conference on Machine Learning (ICML), 2020. (https://arxiv.org/abs/2002.07028)

Sanyal, S., Shwartz-Ziv, R., Dimakis, A.G., Sanghavi, S. (2024). When Attention Collapses: How Degenerate Layers in LLMs Enable Smaller, Stronger Models. arXiv:2404.08634 [cs.CL].

Su Jianlin (苏剑林). Can the Attention Mechanism Really “Concentrate Attention”? (in Chinese)

GPT and the Art of Compression

A Cosmic Dance of Bits and Meaning

Imagine a cosmic library, vast and infinite, housing every possible sentence—from the profound “Artificial intelligence will reshape the future” to the absurd “Cat pillow jumps blue because Wednesday.” In this library, popular sentences sit on bright, accessible shelves, found with a quick note: “Shelf 3, Book 5.” Random gibberish lurks in dusty basements, needing a word-for-word map. GPT, the AI we know as a language wizard, is the cosmic librarian, compressing texts into compact codes that can be perfectly restored. But is this compression flawless, or does it lose something along the way? Let’s embark on a journey through probability, information theory, and engineering to uncover the magic of GPT’s compression—and why it matters.

The Cosmic Library: Compressing Meaning

Picture yourself in this library, tasked with sending a sentence across the galaxy. A predictable sentence like “Artificial intelligence will reshape the future” is easy to pinpoint, requiring just a short instruction. A random jumble, like “Cat pillow jumps blue,” demands spelling out every word, taking up more space. GPT’s brilliance lies in its world model—a map of language probabilities built from vast data. It knows which sentences are “popular” (high-probability) and encodes them efficiently. Why do you think predictable text is easier to compress than random noise?

This process is called lossless compression, meaning the original text is perfectly restored, bit for bit. Unlike a compressed JPEG that blurs details, GPT’s compression ensures no loss. But some argue it’s lossy, losing information like a summary. Who’s right? To answer, we need to explore the mechanics and the theory behind it.

Arithmetic Coding: The GPS of Compression

GPT’s compression relies on arithmetic coding, a method that turns text into a number on a line from 0 to 1. Think of it as a GPS coordinate for a sentence’s location in the probability universe. Here’s how it works for “cat eats fish”:

1. Start with [0.0, 1.0).
2. For “cat” (P=0.5), shrink to [0.0, 0.5).
3. For “eats” given “cat” (P=0.7), narrow to [0.0, 0.35).
4. For “fish” given “cat eats” (P=0.4), end at [0.0, 0.14).
5. Output a binary number, like 0.125 (0.001 in binary), within [0.0, 0.14).

Decompression reverses this, using the same GPT model to retrace the intervals, ensuring the exact sequence—“cat eats fish”—is restored. Why is using the same model crucial for perfect reconstruction?
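The interval narrowing in the steps above can be reproduced in a few lines. This is a minimal sketch that assumes each target token happens to occupy the first slot of its distribution (cumulative probability 0 before it), which is what intervals starting at 0.0 imply. The function name is hypothetical, and exact fractions are used to avoid rounding drift.

```python
from fractions import Fraction as F


def encode_interval(steps):
    """steps: list of (cumulative_prob_before_token, token_prob) pairs.
    Returns the final half-open interval [low, high)."""
    low, width = F(0), F(1)
    for cum_before, prob in steps:
        low += width * cum_before  # move to the token's sub-interval
        width *= prob              # shrink by the token's probability
    return low, low + width


# "cat" (P=0.5), "eats" given "cat" (P=0.7), "fish" given "cat eats" (P=0.4)
steps = [(F(0), F(5, 10)), (F(0), F(7, 10)), (F(0), F(4, 10))]
low, high = encode_interval(steps)
print(float(low), float(high))  # 0.0 0.14
```

Any number inside the final interval identifies the sequence; 0.125 (binary 0.001) is one such number.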

The interval’s length (0.14 = 0.5 * 0.7 * 0.4) reflects the sequence’s probability. High-probability sequences create larger intervals, needing fewer bits to encode (e.g., -log₂(0.14) ≈ 2.84 bits). Random sequences, with lower probabilities, need more bits. This is rooted in information theory, where a word’s information content is -log₂(P(x)). A likely word (P=0.95) carries little information (0.07 bits), while a rare one (P=0.0001) carries much (13.3 bits). How does this explain why semantic text compresses better than noise?
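The figures in this paragraph follow directly from the formula -log₂(P). A short check, with an illustrative function name:

```python
import math


def information_bits(p):
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)


for p in (0.95, 0.14, 0.0001):
    print(f"P={p}: {information_bits(p):.2f} bits")

# Bits add across tokens because probabilities multiply.
token_probs = [0.5, 0.7, 0.4]
total_bits = sum(information_bits(p) for p in token_probs)
print(f"cat eats fish: {total_bits:.2f} bits")  # 2.84
```

The last line shows why the sequence cost equals the cost of its interval: summing per-token bits is the same as taking -log₂ of the product 0.5 × 0.7 × 0.4.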

Lossless or Lossy? Solving the Debate

The debate over whether GPT’s compression is lossless or lossy stems from a subtle distinction. Lossless compression ensures the original data is perfectly restored, like unzipping a file to its exact form. Lossy compression, like MP3s, discards details for smaller size, losing fidelity. GPT’s compression, using arithmetic coding, is lossless: the encoded binary number uniquely maps back to the original text, preserving every bit. Experiments like ts_zip by Fabrice Bellard and 2022-2023 work by Li Ming and Nick show GPT outperforming gzip by up to 10x for semantic data, with no loss. Why might some still call it lossy?

The confusion arises from GPT’s training process. When GPT learns from vast data, it abstracts patterns into a simplified world model, discarding noise and details. That is clearly a lossy process, much like summarizing a library. But when the trained model is used as a compression tool, a lossless algorithm (arithmetic coding) applies it to encode and decode specific texts deterministically, ensuring no loss. The lossy aspect lives in the model’s creation, not its application. How does this distinction change your view of GPT’s capabilities?

The Theory: Kolmogorov Complexity and Intelligence

At the heart of this lies Kolmogorov complexity (KC), the length of the shortest program to generate a dataset. An ideal compressor would find this program, but KC is uncomputable—a theoretical dream. GPT’s next-token prediction approximates this, acting like a “prophet” forecasting sequences based on learned patterns. This aligns with Solomonoff induction, where predicting the next token mirrors finding compact descriptions. Ilya Sutskever noted in a 2023 Berkeley talk that this is the secret behind GPT’s efficiency compared to models like BERT. Why might prediction be a form of compression, and how does it reflect intelligence?

For semantic data, like news articles or logs, GPT’s predictions are highly accurate, leading to compact codes. For random noise, where KC equals the data’s length, compression fails—no model can predict chaos. This highlights a limit: GPT excels where patterns exist. What types of data do you think GPT could compress best?
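The contrast between patterned data and noise shows up even with a conventional compressor. The sketch below uses Python's zlib (the DEFLATE family behind gzip) as a stand-in, not GPT, and compares a highly repetitive text with pseudo-random bytes of the same length. The sample data is made up for illustration.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower is better)."""
    return len(zlib.compress(data, 9)) / len(data)


rng = random.Random(0)
patterned = b"the cat eats fish. " * 500
noise = bytes(rng.getrandbits(8) for _ in range(len(patterned)))

print(f"patterned text: {compression_ratio(patterned):.3f}")
print(f"random noise:   {compression_ratio(noise):.3f}")
```

The repetitive text shrinks to a small fraction of its size, while the noise stays at roughly its original length. A stronger predictor such as a language model widens the gap on meaningful text but cannot help with the noise.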

The Tightrope: Efficiency vs. Reliability

High compression rates are powerful but fragile. A single bit error in a highly compressed file can derail decompression, like a misstep on a tightrope. Consider the trade-offs:

Dimension	High Compression Rate	Low Compression Rate
Restoration Accuracy	100% (theoretical)	100% (theoretical)
Error Resistance	Fragile (1-bit error can crash)	Robust (local errors)
Computational Cost	High (GPT + coding)	Low (e.g., gzip)
Readability	None (ciphertext)	High (text/binary)

High rates suit scenarios where bandwidth is costly, like interstellar communication, but require error protection (e.g., CRC checks to detect corruption, paired with error-correcting codes or retransmission) to prevent crashes. Low rates are ideal for reliable archiving, like server logs, where robustness trumps size.

Why It Matters: From Stars to Servers

GPT’s compression could transform how we store and send data. In interstellar missions, where every bit is precious, it could shrink messages dramatically. In data centers, it could optimize archival storage, though computational costs (e.g., ts_zip at 1k/s) pose challenges. Future models, with sharper predictions, could push efficiency closer to the theoretical limit.

This cosmic dance of bits and meaning reveals a profound truth: compression is intelligence, and GPT is a master choreographer. By mapping language to probabilities, it turns texts into elegant codes, preserving every detail. Whether you’re an AI enthusiast or a tech expert, this opens a universe of possibilities.

Sources: Adapted from posts on liweinlp.com (13277, 13272, 13275, 13273, 13279, 13281).
About the Author: Dr. Li Wei, a senior NLP/LLM consultant, has led innovations at MobVoi, Netbase, and Cymfony, earning the TREC-8 QA Track and 17 SBIR awards.

Efficiency vs. Reliability: The Compression Tightrope

GPT’s compression can shrink data dramatically, but high efficiency comes with risks. A single bit error could unravel everything, like a tightrope walker losing balance. How do we balance compression’s power with reliability?

The Trade-offs

High compression rates save space but are fragile, while low rates are robust but bulky. Here’s a comparison:

Dimension	High Compression Rate	Low Compression Rate
Restoration Accuracy	100% (theoretical)	100% (theoretical)
Error Resistance	Fragile (1-bit error can crash)	Robust (local errors)
Computational Cost	High (GPT + coding)	Low (e.g., gzip)
Readability	None (ciphertext)	High (text/binary)

High rates suit costly transmission (e.g., interstellar), while low rates fit archiving. Why might a bit error be catastrophic in high compression?

Practical Solutions

Error protection can safeguard high-rate compression: a CRC detects corruption, and an error-correcting code or retransmission repairs it. For archives, lower rates may suffice. What scenarios demand high efficiency, and how can we safeguard them?
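As a small illustration of the detection half, the sketch below appends a CRC-32 to a payload and shows that flipping any single bit is caught. The framing layout (payload followed by a 4-byte big-endian checksum) and the helper names are assumptions made for this example. A CRC only detects the error; repairing it needs an error-correcting code or a resend.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 so the receiver can detect corruption."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes) -> bool:
    """Recompute the CRC-32 of the payload and compare with the stored one."""
    payload, stored = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


message = frame(b"compressed payload bytes")
print(check(message))                # True
print(check(flip_bit(message, 10)))  # False
```

Checking the frame before decompression turns a silent derailment into an explicit "data damaged" signal.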

Original post: https://liweinlp.com/13281

Arithmetic Coding for GPT’s Compression Engine

At the heart of GPT’s compression lies arithmetic coding, a method that turns text into numbers with surgical precision. Like a GPS encoding a house’s location, it captures sentences in compact codes. How does this engine work, and why is it so effective?

The Mechanics

GPT predicts probabilities for each token (e.g., P(“future” | “Artificial intelligence is”)=0.6), and arithmetic coding divides [0, 1) into subintervals:

1. Start with [0, 1).
2. Assign [0, 0.6) to “future,” narrowing the range.
3. Iterate for each token, ending with a tiny interval (e.g., [0.3654321, 0.3654343]).
4. Output a binary number as the compressed code.

Decompression uses the same GPT model to reverse the process, ensuring bit-level accuracy. Why is the same model critical?
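A round trip makes the role of the shared model concrete. The sketch below uses a tiny, hypothetical stand-in for GPT: a function that returns a fixed token distribution and ignores its context (a real language model would condition on it). It encodes a three-token text, then decodes once with the same model and once with a different one. It also assumes the decoder is told the sequence length.

```python
from fractions import Fraction as F


def toy_model(context):
    """Hypothetical model: fixed distribution, context ignored for brevity."""
    return [("the", F(1, 2)), ("cat", F(1, 4)), ("sat", F(1, 4))]


def other_model(context):
    """A different model with the same vocabulary but other probabilities."""
    return [("the", F(1, 4)), ("cat", F(1, 2)), ("sat", F(1, 4))]


def encode(tokens, model):
    low, width = F(0), F(1)
    context = []
    for token in tokens:
        cum = F(0)
        for candidate, prob in model(context):
            if candidate == token:
                low, width = low + width * cum, width * prob
                break
            cum += prob
        else:
            raise ValueError(f"model assigns no probability to {token!r}")
        context.append(token)
    return low + width / 2  # any number inside the final interval works


def decode(code, length, model):
    low, width = F(0), F(1)
    context = []
    for _ in range(length):
        cum = F(0)
        for candidate, prob in model(context):
            start = low + width * cum
            if start <= code < start + width * prob:
                low, width = start, width * prob
                context.append(candidate)
                break
            cum += prob
    return context


text = ["the", "cat", "sat"]
code = encode(text, toy_model)
print(decode(code, len(text), toy_model))    # ['the', 'cat', 'sat']
print(decode(code, len(text), other_model))  # ['cat', 'the', 'sat']
```

The code is only a position on the number line; the model is the map that turns positions back into tokens. With a different map, the same position lands on a different text.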

A GPS Analogy

Compression is like encoding a villa’s address into a postal code. Decompression follows this code to the exact spot. This precision ensures no loss. How does this analogy clarify the process?

The Edge of Efficiency

GPT’s accurate predictions make intervals larger for predictable text, reducing bits needed. What limits this approach, and how might better models enhance it?

Original post: https://liweinlp.com/13273

Navigating the Probability Universe with GPT

Every sentence has a unique address in a probability universe, a number line from 0 to 1. GPT maps texts to these addresses, compressing them into compact codes. How does this cosmic navigation work, and why is it a breakthrough for compression?

Mapping Sentences to Intervals

Each sequence corresponds to a unique interval in [0, 1), with its length equaling the sequence’s probability. For “cat eats fish” (P(“cat”)=0.5, P(“eats” | “cat”)=0.7, P(“fish” | “cat eats”)=0.4), the interval is [0, 0.14), with length 0.5 * 0.7 * 0.4 = 0.14. Arithmetic coding narrows this interval step-by-step, outputting a binary number. Decompression retraces the path, ensuring perfection. Why are these intervals unique?

The Power of Information Theory

The interval’s length reflects the sequence’s probability, with high-probability sequences needing fewer bits (-log₂(0.14) ≈ 2.84 bits). This approaches Shannon’s entropy limit, where GPT’s precise predictions minimize bits for semantic data. Why does predictability reduce bit requirements?

Why It’s Revolutionary

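Unlike traditional symbol-by-symbol methods (e.g., Huffman coding, which spends a whole number of bits on every symbol), arithmetic coding encodes the entire stream as a single number, so a highly predictable token can cost less than one bit. Paired with GPT’s predictions, it also leverages semantic patterns, making it ideal for texts. What data types might benefit most, and how could this evolve with better models?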

Original post: https://liweinlp.com/13275

Is GPT Compression Lossless or Lossy? The Truth Revealed

The claim that “compression is intelligence” sparks debate: does GPT compress data perfectly, or does it lose something along the way? Some argue it’s lossy, like a compressed JPEG, while others insist it’s lossless, restoring every bit. The answer hinges on a key distinction: GPT’s training versus its use as a compressor. Let’s unravel this mystery.

The Heart of Compression: Kolmogorov Complexity

Kolmogorov complexity defines the essence of a piece of data as the length of the shortest program that generates it, an uncomputable ideal. GPT’s next-token prediction approximates this, acting like a “prophet” forecasting sequences based on its world model. This predictive power is what drives compression. How does predicting the next word relate to shrinking data size?

Lossless Compression in Action

Using GPT to compress a target sequence is lossless, meaning the original data can be perfectly restored. Experiments like ts_zip (Fabrice Bellard) and Li Ming & Nick’s 2022-2023 work show GPT with arithmetic coding outperforming gzip, sometimes by 10x, which matters most in high-transmission-cost scenarios like interstellar communication. Here’s why it’s lossless:

Mechanism: GPT provides probabilities (e.g., P(“will” | “Artificial intelligence”)=0.8), which arithmetic coding uses to encode input sequences into a binary number. Decompression uses the same model to reverse the process, ensuring bit-level accuracy.
Evidence: Even low-probability tokens are encoded with more bits, preserving all information.

Why might some confuse this with lossy compression?

Training vs. Compression

The confusion arises from GPT’s training, where it abstracts vast data into a simplified world model—a lossy process, like summarizing a library. But compression using this model encodes specific data losslessly. How does this distinction clarify the debate?

Practical Implications

This approach excels for language data (e.g., texts, logs) but struggles with random noise, where complexity equals length. Scenarios like space missions and data archives could leverage this.

Original post: https://liweinlp.com/13272

GPT as a Cosmic Librarian: Unlocking Lossless Compression

Imagine a cosmic library holding every possible sentence, from the profound “Artificial intelligence will reshape the future” to the absurd “Cat pillow jumps blue.” Popular sentences sit on prominent shelves, easily found with a short note like “Shelf 3, Book 5.” Random gibberish hides in dusty basements, requiring a long, word-for-word map. GPT, our cosmic librarian, navigates this library with uncanny precision, compressing texts into compact codes that can be perfectly restored. How does it work, and why is this a game-changer for data compression?

The Library of Language

In this infinite library, each sentence has a “popularity” score—its probability based on grammar, meaning, and context. GPT’s world model, trained on vast texts, assigns high probabilities to meaningful sentences, making them easier to locate. For example, “Artificial intelligence will reshape the future” is a bestseller, while “Cat pillow jumps blue” is obscure. Compression is about encoding these locations efficiently. How might GPT’s understanding of language make this possible?

Arithmetic Coding: The Magic Wand

GPT teams up with arithmetic coding to turn sentences into numbers. Here’s how it compresses “Artificial intelligence will reshape…” (tokenized as “Artificial,” “intelligence,” “will,” …):

Start with [0.0, 1.0): The entire number line, representing the space of all possible sequences.
Encode “Artificial”: GPT predicts a 5% chance (P=0.05) for this word to be the first token in a sentence, shrinking the interval to [0.0, 0.05).
Encode “intelligence”: Given “Artificial,” GPT predicts an 80% chance (P=0.8), narrowing to [0.0, 0.04).
Continue: Each token shrinks the interval further, ending with a tiny range, say [0.02113, 0.02114).
Output: Pick a number inside the final interval, such as 0.0211334, and write it in binary (0.0000010101101001, i.e., 1385/65536). That binary string is the compressed form of the sentence.

Decompression reverses this, using the same GPT model to retrace the intervals and reconstruct the exact text. Why does this ensure no data is lost?
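The final output step amounts to finding a short binary fraction that falls inside the last interval. A minimal sketch, assuming a half-open interval and ignoring the termination details a production coder must handle (the function name is hypothetical):

```python
import math
from fractions import Fraction as F


def shortest_binary_code(low, high):
    """Fewest-bit binary fraction k / 2**n with low <= k / 2**n < high.
    Returns the n fractional bits as a string, plus n."""
    n = 1
    while True:
        k = math.ceil(low * 2**n)  # smallest numerator at this precision
        if F(k, 2**n) < high:
            return format(k, f"0{n}b"), n
        n += 1


code, n_bits = shortest_binary_code(F("0.02113"), F("0.02114"))
print(code, n_bits)  # 0000010101101001 16
```

An interval of width 0.00001 needs about -log₂(0.00001) ≈ 16.6 bits, so a 16-bit code is in line with the information-theoretic estimate.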

Information Theory: Why Predictability Saves Space

Information theory reveals why this works. A word’s information content is -log₂(P(x)). High-probability words carry little information, rare words carry more. Predictable sentences, rich in semantic patterns, form larger intervals in the line, requiring fewer bits. Why might random text, like white noise, resist compression?

Why It Matters

This approach could revolutionize data storage and transmission, from archiving logs to sending messages across galaxies. But what challenges might arise in real-world applications? How could GPT’s predictive power evolve with better models?

Original post: https://liweinlp.com/13277

Decoding the New EMPO Reasoning Paradigm

The Right Question is Half the Answer,
The Other Half lies in LLM's Semantic Coherence

Large Language Models (LLMs) are constantly rewriting the rules of AI with their astonishing reasoning abilities. Yet, the path to even stronger reasoning is often paved with expensive "gold"—manually labeled reasoning steps, verified answers, or bespoke reward models. These reinforcement methods, rooted in supervised learning, work, but they hit bottlenecks in cost and scalability.

Rewind to this Lunar New Year, when DeepSeek's R1-Zero, a result-driven, supervised reinforcement approach, made waves. We debated its underlying mechanics, converging on a shared understanding: The essence of technologies like Chain-of-Thought (CoT) is to build a "slow-thinking" information bridge between a query and a response in complex tasks. Think of it as a gentle "ramp", designed to lower perplexity, transforming problems with daunting information gaps—unsolvable by "fast thinking"—into something smooth and solvable.

Now, a new paper from Tianjin University and Tencent AI Lab, "Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization," takes this line of thought a step further—a step both radical and elegant. It introduces EMPO (Entropy Minimized Policy Optimization), a fully unsupervised framework for reinforcement reasoning. And the kicker? Its performance reportedly rivals methods that do rely on golden answers.

This paper is a refreshing read. No black magic, no convoluted theories. It’s like a fresh breeze blowing through the landscape of unsupervised learning. It further validates our hunch: give the model a "field" to play in, and it will autonomously find the smoothest path towards entropy reduction.

Frankly, DeepSeek R1-Zero was stunning enough, proving machines could learn autonomously, generating their own data to boost their intelligence. This work feels like "Zero-Squared": Machines can now seemingly learn answers just from questions. It's a bit scary if you think about it. Unsupervised learning has been around for years, but after fueling the pre-trained LLM storm via self-supervised learning, seeing it reach this level of magic in reasoning is truly eye-opening.

EMPO's Midas Touch: Minimizing Semantic Entropy

The core idea behind EMPO is simple: Instead of telling the model "what is right," why not let it pursue "what is consistent"? It posits that a powerful reasoning model should produce outputs that are stable and semantically aligned. How do we measure this alignment? Through Semantic Entropy.

This isn't your classic Shannon entropy, which focuses on the surface token string and can be easily thrown off by phrasing variations. Semantic entropy operates at the level of meaning. Here’s how EMPO does it:

Sample: For a single question, let the current model generate multiple (say, G) reasoning processes and answers, step-by-step.
Cluster: Using simple rules (like regex for math) or a compact verifier model, cluster these G outputs based on their meaning. For example, "The answer is 42" and "Final result: 42" land in the same bucket, regardless of the path taken.
Calculate Entropy: From these clusters, estimate the probability of each "meaning bucket" and compute the overall semantic entropy. If all answers converge to one meaning, entropy is minimal; if they're all over the place, it's high.
Reinforce: Use this "semantic consistency" (low entropy) as an intrinsic reward signal within an RL framework (like GRPO). The model gets a pat on the back if its output belongs to the most "mainstream," most consistent cluster. Optimization then incentivizes the model to generate outputs that lower the overall semantic entropy.
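The four steps can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: clustering is a hypothetical regex rule that treats the last number in each output as its answer, entropy is measured in nats, and the reward for an output is simply the share of samples that landed in its cluster.

```python
import math
import re
from collections import Counter


def meaning_key(output: str) -> str:
    """Hypothetical clustering rule: the last number in the text is the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", output)
    return numbers[-1] if numbers else "<no answer>"


def semantic_entropy(outputs):
    """Entropy (in nats) over meaning clusters, plus the cluster probabilities."""
    clusters = Counter(meaning_key(o) for o in outputs)
    probs = {key: count / len(outputs) for key, count in clusters.items()}
    entropy = -sum(p * math.log(p) for p in probs.values())
    return entropy, probs


def rewards(outputs):
    """Reward each output with the empirical probability of its own cluster."""
    _, probs = semantic_entropy(outputs)
    return [probs[meaning_key(o)] for o in outputs]


samples = ["The answer is 42", "Final result: 42", "So we get 42.", "I think it is 41"]
entropy, probs = semantic_entropy(samples)
print(round(entropy, 3), probs)
print(rewards(samples))  # [0.75, 0.75, 0.75, 0.25]
```

In a full pipeline these rewards would feed an RL objective such as GRPO; that part is omitted here.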

In short, EMPO encourages the model: "Within your own answer space, find the most 'popular' view, the one you're most sure about, and double down on it!"

Piercing the Veil: Wisdom and Real-World Gotchas

EMPO's elegance doesn't mean it's without its nuances. The paper highlights a few key insights and practicalities:

Entropy Thresholding (The "Catch"): This is crucial. Just blindly minimizing entropy could lead the model down a rabbit hole, overfitting. EMPO therefore introduces an entropy threshold: it only applies CoT reinforcement to questions with moderate entropy. This filters out cases where the model is either too uncertain (high entropy, too chaotic to learn from) or already too confident (low entropy, no need to push further and risk overconfidence). This ensures stability and effectiveness.
The Power of the Base Model: EMPO is more of an elicitor than a creator of abilities. The potential for these reasoning paths is likely laid down during pre-training. EMPO's success hinges heavily on a strong base model. The contrast between Qwen (where EMPO worked directly, likely due to pre-training with QA pairs, seeding its potential) and Llama (which needed an SFT "warm-up" before EMPO worked) drives this point home. Unsupervised post-training isn't a magic wand; it builds only on a solid foundation.
No <cot> Tags Required: EMPO doesn't even need explicit <cot> tags as format rewards. A simple prompt like, Please resolve it step by step and put the final answer in {...}. is enough to provide the "space" for the model to explore thinking and refine its reasoning.
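The entropy thresholding described above boils down to a band-pass filter over questions. In the sketch below the two bounds are placeholder values chosen for illustration, not numbers from the paper, and the function name is hypothetical.

```python
def select_for_training(question_entropy, low=0.3, high=1.0):
    """Return ids of questions whose semantic entropy lies strictly
    between the two bounds (bounds are illustrative placeholders)."""
    return [qid for qid, h in question_entropy.items() if low < h < high]


# Hypothetical per-question semantic entropies, in nats.
batch = {"q1": 0.0, "q2": 0.56, "q3": 1.6}
print(select_for_training(batch))  # ['q2']
```

Here q1 is skipped because the model already agrees with itself, and q3 because its answers are too scattered to give a useful signal.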

The Unsupervised Dividend: Why EMPO Matters

EMPO shows that even without any external answers, we can significantly boost LLM reasoning through a simple, elegant, and intrinsically motivated mechanism. It's like unlocking a universal "data quality dividend". The only entry fee is feeding the system questions and applying simple clustering, after which accuracy improvements are likely.

The paper's title begins, "Right question is already half the answer." We can extend that: "...the other half is embodied in LLM's internal semantic coherence." By minimizing semantic entropy, EMPO guides the LLM to generate CoT and answers with greater harmony and order, helping it find that "other half."

Given its underlying mechanism of information theory and its generality, we believe EMPO's minimalist, unsupervised approach will spark a wave of follow-up research. It will push boundaries, find applications in diverse tasks, and likely become a cornerstone of future LLM post-training pipelines.

P.S. Rarely is a paper this interesting also this accessible. For those keen on diving into the details, the original paper recently published is just a click away: https://arxiv.org/pdf/2504.05812. Enjoy!

Q&A on NLP: Chapter I Natural Language and Linguistic Form

Guo: Professor Li, to ease into the discussion, let us begin with some foundational concepts. What exactly do we mean by natural language? What falls under the scope of the field, and where does it sit within the broader discipline of Artificial Intelligence (AI)?

Li: Natural language refers to the everyday languages we humans speak—English, Russian, Japanese, Chinese, and so on; in other words, human language writ large. It is distinct from computer languages. Because human conversation is rife with ellipsis and ambiguity, processing natural language on a computer poses formidable challenges.

Within AI, natural language is defined both as a problem domain and as the object we wish to manipulate. Natural Language Processing (NLP) is an essential branch of AI, and parsing is its core technology—the crucial gateway to Natural Language Understanding (NLU). Parsing will therefore recur throughout this book.

Computational linguistics is the interdisciplinary field at the intersection of computer science and linguistics. One might say that computational linguistics supplies the scientific foundations, whereas NLP represents the applied layer.

AI is often divided into perceptual intelligence and cognitive intelligence. The former includes image recognition and speech processing. Breakthroughs in big data and deep learning have allowed perceptual intelligence to reach—and in some cases surpass—human‑expert performance. Cognitive intelligence, whose core is natural language understanding, is widely regarded as the crown jewel of AI. Bridging the gap from perception to cognition is the greatest challenge—and opportunity—facing the field today.

The rationalist tradition formalises expert knowledge using symbolic logic to simulate human intellectual tasks. In NLP, the classical counterpart to machine‑learning models comprises linguist‑crafted grammar rules, collectively called a computational grammar. A system built atop such grammars is known as a rule‑based system. The grammar school decomposes linguistic phenomena with surgical precision, aiming at a deep structural analysis. Rule‑based parsing is transparent and interpretable—much like the diagramming exercises once taught in a language school.

Figure 1-1 sketches the architecture of a natural-language parser core engine. Without dwelling on minutiae, note that every major module, from shallow parsing through deep parsing, can in principle be realised via interpretable symbolic logic encoded as a computational grammar. Through successive passes, the bewildering diversity of natural language is reduced first to syntactic relations and then to logical-semantic structure. Since Chomsky’s distinction between surface structure and deep structure in the late 1950s, this layered view has become orthodoxy within linguistics.

Guo: These days everyone venerates neural networks and deep learning. Does the grammar school still have room to live? Rationalism seems almost voiceless in current NLP scholarship. How should we interpret this history and the present trend?

Li: Roughly thirty years ago, the empiricist school of machine learning began its ascent, fuelled by abundant data and ever‑cheaper computation. In recent years, deep neural networks have achieved spectacular success across many AI tasks. Their triumph reflects not only algorithmic innovation but also today’s unprecedented volumes of data and compute.

By contrast, the rationalist programme of symbolic logic has waned. After a brief renaissance twenty years ago—centred on unification‑based phrase‑structure grammars (PSGs)—computational grammar gradually retreated from the mainstream. Many factors contributed; among them, Noam Chomsky’s prolonged negative impact warrants sober reflection.

History reveals a pendulum swing between empiricism and rationalism. Kenneth Church famously illustrated the motion in his article A Pendulum Swung Too Far (Figure 1-2).

For three decades, the pendulum has tilted toward empiricism (black dots in Figure 1‑2); deep learning still commands the spotlight. Rationalism, though innovating quietly, is not yet strong enough to compete head‑to‑head. When one paradigm dominates, the other naturally fades from view.

Guo: I sense some conceptual confusion both inside and outside the field. Deep learning, originally just one empiricist technique, has become synonymous with AI and NLP for many observers. If its revolution sweeps every corner of AI, will we still see a rationalist comeback at all? As Professor Church warns, the pendulum may already have swung too far.

Li: These are two distinct philosophies with complementary strengths and weaknesses; neither can obliterate the other.

While the current empiricist monoculture has understandable causes, it is unhealthy in the long run. The two schools both compete and synergise. Veterans like Church continue to caution against over‑reliance on empiricism, and new scholars are probing deep integrations of the two methodologies to crack the hardest problems in NLU.

Make no mistake: today’s AI boom largely rests on deep-learning breakthroughs, especially in image recognition, speech, and machine translation. Yet deep learning inherits a fundamental limitation of the statistical school: its dependence on large volumes of labelled data. In many niche domains (for instance, minority languages or e-commerce translation) such corpora are simply unavailable. This knowledge bottleneck severely constrains empiricist approaches to cognitive NLP tasks. Without data, machine learning is a bread-maker without flour, and deep learning’s appetite, as we all know, is insatiable.

Guo: So deep learning is no panacea, and rationalism deserves a seat at the table. Since each paradigm has its merits and deficits, could you summarise the comparison?

Li: A concise inventory helps us borrow strengths and shore up weaknesses.

Advantages of machine learning

1. Requires no domain experts (but does require vast labelled data).
2. Excels at coarse‑grained tasks such as classification.
3. High recall.
4. Robust and fast to develop.

Advantages of the grammar school

1. Requires no labelled data (but does require expert rule writing).
2. Excels at fine‑grained tasks such as parsing and reasoning.
3. High precision.
4. Easy to localise errors; inherently interpretable.

Li: Rule‑based systems shine at granular, line‑by‑line dissection, whereas learned statistical models are naturally strong at global inference. Put bluntly, machine learning often "sees the forest but misses the trees," while computational grammars "see each tree yet risk losing the forest." Although data‑driven models boast robustness and high recall, they may hit a precision ceiling on fine‑grained tasks. Robustness is the key to surviving anomalies and edge cases. Expert‑coded grammars, by contrast, attain high precision, but boosting recall can require many rounds of iterative rule writing. Whether a rule‑based system is robust depends largely on its architectural design. Its symbolic substrate renders each inference step transparent and traceable, enabling targeted debugging—precisely the two pain‑points of machine learning, whose opaque decisions erode user trust and hamper defect localisation. Finally, a learning system scales effortlessly to vast datasets and its breakthroughs tend to ripple across an entire industry. Rule‑based quality, by contrast, hinges on the individual craftsmanship of experts—akin to Chinese cuisine, where identical ingredients may yield dishes of very different calibre depending on the chef.

Both routes confront knowledge bottlenecks. One relies on mass unskilled labour (annotators), the other on a few skilled artisans (grammar experts). For machine learning, the bottleneck is the supply of domain‑specific labelled data. The rationalist route simulates human cognition and thus avoids surface‑level mimicry of datasets, but cannot escape the low efficiency of manual coding. Annotation is tedious yet teachable to junior workers; crafting and debugging rules is a costly skill to train and hard to scale. Talent gaps exacerbate the issue—three decades of empiricist dominance have left the grammar school with a thinning pipeline.

Guo: Professor Li, a basic question: grammar rules are grounded in linguistic form. If semantics is derived from that form, then what exactly is linguistic form?

Li: This strikes at the heart of formalising natural language. All grammar rules rest on linguistic form, yet not every practitioner—even within the grammar camp—has a crisp definition at hand.

In essence, natural language as a symbolic system expresses meaning through form. Different utterances of an idea vary only in form; their underlying semantics and logic must coincide, else communication—and translation—would be impossible. The intuition is commonplace, but pinning down "form" propels us into computational linguistics.

Token & Order — The First‑Level Abstraction
At first glance a sentence is merely a string of symbols—phonemes or morphemes. True, but that answer is too coarse. Every string is segmented into units called tokens (words or morphemes). A morpheme is the smallest pairing unit of sound and meaning. Thus our first abstraction decomposes linguistic form into a sequence of tokens plus their word order. Grammar rules define patterns that match such sequences. The simplest pattern, a linear pattern, consists of token constraints plus ordering constraints.

Guo: Word order seems straightforward, but tokens and morphemes hide much complexity.

Li: Indeed. Because tokens anchor the entire enterprise, machine‑readable dictionaries become foundational resources. (Here "dictionary" means an electronic lexicon.)

If natural language were a closed set—say only ten thousand fixed sentences—formal grammar would be trivial: store them all, and each complete string would serve as an explicit pattern. But language is open, generating unbounded sentences. How can a finite rule set parse an infinite language?

The first step is tokenisation—dictionary lookup that maps character strings to lexicon words or morphemes. Unlimited sentences decompose into a finite vocabulary plus occasional out‑of‑dictionary items. Together they form a token list, the initial data structure for parsing.
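As a rough illustration of dictionary-lookup tokenisation, the sketch below uses greedy longest-match lookup (forward maximum matching). That strategy is an assumption chosen for brevity, not necessarily the one the authors use, and the toy `LEXICON`, the `max_len` window, and the dict-shaped tokens are all hypothetical.

```python
def tokenize(text, lexicon, max_len=4):
    """Greedy longest-match dictionary lookup (forward maximum matching)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in lexicon:
                tokens.append({"form": piece, "in_lexicon": True})
                i += size
                break
        else:
            # No lexicon entry starts here: emit one character as an
            # out-of-dictionary token so the token list stays complete.
            tokens.append({"form": text[i], "in_lexicon": False})
            i += 1
    return tokens


# Hypothetical toy lexicon; a real machine dictionary is far larger.
LEXICON = {"我", "吃", "了", "鱼", "兄弟", "们"}

token_list = tokenize("我吃了鲑鱼", LEXICON)
for tok in token_list:
    print(tok["form"], tok["in_lexicon"])
```

Here "鲑" is not in the toy lexicon, so it surfaces as an out-of-dictionary item, while the other characters are recognised as lexicon words. The resulting list plays the role of the token list described above.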

We then enter classic linguistic sub‑fields. Morphology analyses the internal structure of multi‑morphemic words. Some languages exhibit rich morphology—noun declension, verb conjugation—e.g., Russian and Latin; others, such as English and Chinese, are comparatively poor. Note, however, that Chinese lacks inflection but excels at compounding. Compounds sit at the interface of morphology and syntax; many scholars treat them as part of "little syntax" rather than morphology proper.

Guo: Typologists speak of a spectrum—from isolating languages such as Classical Chinese (no morphology) to polysynthetic languages like certain Native American tongues (heavy morphology). Most languages fall between, with Modern Chinese and English leaning toward the isolating side: minimal morphology, rich syntax. Correct?

Li: Exactly. Setting aside the ratio of morphology to syntax, our first distinction is between function words/affixes versus content words. Function words (prepositions, pronouns, particles, conjunctions, original adverbs, interrogatives, interjections) and affixes (prefixes, suffixes, endings) form a small, closed set.

Content words—nouns, verbs, adjectives, etc.—form an open set forever producing neologisms; a fixed dictionary can hardly keep up.

Because function words and affixes are frequent yet limited, they can be enumerated as literals in pattern matching. Hence we already have at least three kinds of linguistic form that can serve as rule conditions: (i) word order; (ii) function‑word literals; (iii) affix literals.

Features — The Implicit Form
Explicit tokens are visible in the string, but parsers also rely on implicit features—category labels. Features encode part‑of‑speech, gender, number, case, tense, etc. They enter pattern matching as hidden conditions. Summarising: automatic parsing rests on (i) order, (ii) literals, (iii) features—two explicit, one implicit. Every language weaves these three in different proportions; grammar is but their descriptive calculus.
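The three kinds of condition can be combined in a single linear pattern. The following is a minimal sketch under stated assumptions: the feature names (`V`, `Food`, and so on), the toy `LEXICON`, and the helpers `lit`, `feat`, and `match` are all hypothetical and do not come from the authors' system.

```python
# Hypothetical lexicon: each explicit token maps to its implicit features.
LEXICON = {
    "I": {"Pron", "Human"},
    "ate": {"V", "Past"},
    "the": {"Det"},
    "fish": {"N", "Food"},
}


def lookup(tokens):
    """Attach implicit features from the lexicon to each explicit token."""
    return [(tok, LEXICON.get(tok, set())) for tok in tokens]


def lit(word):
    """Literal condition: the token itself must equal `word`."""
    return lambda tok, feats: tok == word


def feat(name):
    """Feature condition: the token must carry the feature `name`."""
    return lambda tok, feats: name in feats


def match(pattern, tokens):
    """Return the start index of the first contiguous match, or -1.

    The position of each condition in `pattern` encodes the order constraint.
    """
    nodes = lookup(tokens)
    for start in range(len(nodes) - len(pattern) + 1):
        window = nodes[start:start + len(pattern)]
        if all(cond(tok, feats) for cond, (tok, feats) in zip(pattern, window)):
            return start
    return -1


# Order: verb, then "the", then a food noun.
VERB_OBJECT = [feat("V"), lit("the"), feat("Food")]

print(match(VERB_OBJECT, ["I", "ate", "the", "fish"]))
print(match(VERB_OBJECT, ["I", "the", "ate", "fish"]))
```

The same rule would match any verb followed by "the" and any noun tagged as food, which is why matching on features generalises better than enumerating content words as literals.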

Guo: By this metric, can we say European languages are more rigorous than Chinese?

Li: From the standpoint of explicit form, yes. European tongues vary internally—German and French more rigorous than English—but all possess ample explicit markers that curb ambiguity. Chinese offers fewer markers, increasing parsing difficulty.

Inflectional morphology supplies visible agreement cues: gender, number, and case for nouns; tense, aspect, and voice for verbs. Chinese lacks these. Languages with rich morphology enjoy freer word order (e.g., Russian). Esperanto’s sentence "Mi amas vin" (I love you) can be arranged in all six orders because the accusative ending ‑n marks the object wherever it appears.

Chinese, conversely, evolved along the isolating path, leveraging word order and particles. Even so, morphology provides tighter agreement than particles. Hence morphology‑rich languages are structurally stringent, reducing reliance on implicit semantics.

Guo: People call Chinese a "paratactic" language—lacking hard grammar, leaning on meaning. Does that equate to your notion of implicit form?

Li: Precisely. Parataxis corresponds to semantic cohesion—especially collocational knowledge within predicate structures. For example, the predicate "eat" expects an object in the food category. Such commonsense often lives in a lexical ontology like HowNet (founded by the late Professor Dong Zhendong).

Consider how plurality is expressed. In Chinese, "brother" is a noun whose category is lexically stored. Esperanto appends ‑o for nouns and ‑j for plural: frato vs. fratoj. Chinese may add the particle 们 (‑men), but this marker is optional and forbidden after numerals: "三个兄弟" (three brothers) not "*三个兄弟们". Here plurality is implicit, inferred from the numeral phrase.

Guo: Lacking morphology indeed complicates Chinese. Some even claim Chinese has no grammar.

Li: That is hyperbole. All languages have grammar; Chinese simply relies more on implicit forms. Overt devices—morphology, particles, word order—are fewer or more flexible.

Take omission of particles as an illustration. Chinese frequently drops prepositions and conjunctions. Compare:

1. 对于这件事, 依我的看法, 我们应该听其自然。
   As for this matter, in my opinion, we should let nature take its course.
2. 这件事我的看法应该听其自然。
   * this matter my opinion should let nature take its course.
   (Unacceptable as a word‑for‑word English rendering.)

Example 2 is ubiquitous in spoken Chinese but would be ungrammatical in English. Systematic omission of function words exacerbates NLP difficulty.

Guo: What about word order? Isolation theory says morphology‑poor languages have fixed order—Chinese is labelled SVO.

Li: Alas, reality defies the stereotype. Despite lacking morphology and often omitting particles, Chinese exhibits remarkable word‑order flexibility. Consider the six theoretical permutations of S, V, and O. Esperanto, with a single object case marker ‑n, allows all six without altering semantics. Compare English (no case marking on nouns, though pronouns distinguish subject from object forms) and Chinese (no case at all):

Order	Esperanto	English	Chinese
SVO	Mi manĝis fiŝon	I ate fish	我吃了鱼
SOV	Mi fiŝon manĝis	* I fish ate	我鱼吃了
VOS	Manĝis fiŝon mi	* Ate fish I	？吃了鱼我
VSO	Manĝis mi fiŝon	* Ate I fish	* 吃了我鱼
OVS	Fiŝon manĝis mi	* Fish ate I	？鱼吃了我
OSV	Fiŝon mi manĝis	Fish I ate	鱼我吃了

Chinese sanctions three orders outright, two marginally (marked “?”), and forbids one (“*”). English allows only two. Thus Chinese word order is about twice as free as English, even though English possesses case distinction on pronouns. Hence morphological richness does not always go hand in hand with order freedom.

Real corpora confirm that Chinese is more permissive than many assume. Greater flexibility inflates the rule count in sequence‑pattern grammars: every additional order multiplies pattern variants. Non‑sequential constraints can be encoded inside a single rule; order itself cannot.
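To make the rule-count point concrete, the sketch below enumerates the six S/V/O orders and counts how many sequence patterns each language would need if every accepted order required its own pattern. The acceptability judgements are copied from the comparison table above; the data layout and the function name `patterns_needed` are hypothetical.

```python
from itertools import permutations

ROLES = ("S", "V", "O")
ALL_ORDERS = ["".join(p) for p in permutations(ROLES)]  # six orders

# Judgements from the table above: "ok" accepted, "?" marginal, "*" rejected.
ACCEPTABILITY = {
    "Esperanto": {"SVO": "ok", "SOV": "ok", "VOS": "ok",
                  "VSO": "ok", "OVS": "ok", "OSV": "ok"},
    "English":   {"SVO": "ok", "SOV": "*", "VOS": "*",
                  "VSO": "*", "OVS": "*", "OSV": "ok"},
    "Chinese":   {"SVO": "ok", "SOV": "ok", "VOS": "?",
                  "VSO": "*", "OVS": "?", "OSV": "ok"},
}


def patterns_needed(language, include_marginal=False):
    """List the orders a sequence-pattern grammar must cover, one pattern each."""
    accepted = {"ok", "?"} if include_marginal else {"ok"}
    return [o for o in ALL_ORDERS if ACCEPTABILITY[language][o] in accepted]


for lang in ACCEPTABILITY:
    print(lang, len(patterns_needed(lang)),
          len(patterns_needed(lang, include_marginal=True)))
```

Under this simplification English needs two patterns for the clause type, Chinese needs three (five if the marginal orders are covered), and Esperanto needs six.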

A classic example is the elastic placement of argument roles around "哭肿" (cry‑swollen):

张三眼睛哭肿了。
眼睛张三哭肿了。
哭肿张三眼睛了。
张三哭肿眼睛了。
哭得张三眼睛肿了。
张三哭得眼睛肿了。
…and so on.

Such data belie the notion of a rigid SVO Chinese. Heavy reliance on implicit form complicates automatic parsing. Were word order fixed, a few sequence patterns would suffice; flexibility multiplies the number of rules a sequence‑pattern grammar must carry.

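Chapter One: Natural Language and Linguistic Form

Guo: Professor Li, let us work from the simple to the deep and start with some basic concepts. What is natural language? What does the field of natural language cover? And where does it sit within artificial intelligence?

Li: Natural language is the language we use every day, such as English, Russian, Japanese, and Chinese; it is synonymous with human language. Natural language differs from computer languages. The natural language that the human brain handles is often elliptical and ambiguous, which poses a challenge for processing by computer.

In the AI community, natural language is framed as a problem domain and an object of processing. Natural language processing is an important branch of AI, and natural language parsing is its core technology and the key to natural language understanding. Parsing is the topic we explore next, and it runs through the whole book.

Computational linguistics is an interdisciplinary field between computer science and linguistics. Computational linguistics and natural language processing are two facets of the same discipline: computational linguistics is the scientific foundation of NLP, and NLP is the application layer of computational linguistics.

AI has two main parts: perceptual intelligence and cognitive intelligence. The former includes image recognition and speech processing. With breakthroughs in big data and deep learning algorithms, perceptual intelligence has in many respects reached or even surpassed the level of human experts. The core of cognitive intelligence is natural language understanding, widely regarded as the crown of AI. The leap from perception to cognition is the greatest challenge and opportunity facing AI today.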

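Rationalism directly formalises the experience of domain experts and uses symbolic logic to simulate human intelligent tasks. In NLP, the traditional approach that runs parallel to machine-learning models is language rules hand-coded by linguists. A set of such rules is called a computational grammar, and a system supported by a computational grammar is called a rule system. The grammar school formalises the rules that linguists have distilled, dissecting linguistic phenomena in detail to reach a deep structural parse of natural language. A rule system tries to simulate the human process of analysing and understanding language. Its parsing is transparent and interpretable, much like the sentence analysis that a foreign-language grammar teacher demonstrates in class.

Figure 1-1 shows the architecture of the core engine of a natural language parser. Without going into detail, the point to note is that every major module, from shallow parsing to deep parsing, can be implemented in interpretable symbolic logic in the form of computational grammar. The endlessly varied expressions of natural language are thus parsed step by step, from syntactic relations on to the underlying logic-semantic relations. This idea gradually became the consensus among linguists after Chomsky's linguistic revolution of 1957 proposed the transformation from surface structure to deep structure.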

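Guo: Everyone now champions neural-network deep learning. Does the grammar school still have room to survive? Rationalism is hardly heard in the natural language field any more. How should we view this history and trend?

Li: From roughly 30 years ago until now, the empiricist machine-learning school has been on the rise, favoured by the growth of data and computing resources. In recent years especially, deep neural networks have achieved breakthrough success on many AI tasks. These empiricist successes owe much to innovation in neural-network algorithms, and also to today's incomparably greater data and computing capacity.

By contrast, rationalist symbolic logic has been in steady decline. In the natural language field, symbolic logic takes the form of computational grammar. After a brief surge of innovation 20 years ago around unification-based Phrase Structure Grammar (PSG), the grammar school gradually left the academic mainstream. Several factors led to this situation, among them Chomsky's long-term negative influence on the grammar school, which deserves serious reflection.

Looking back over the history of AI and natural language, the empiricist and rationalist schools have waxed and waned in turn, swinging like a pendulum. Kenneth Church gave a vivid pendulum diagram in his article "A Pendulum Swung Too Far" (Figure 1-2).

Over the last 30 years the empiricist upswing has continued unabated (the black dots in Figure 1-2). For now, deep learning is still riding high. Rationalism has been accumulating strength for years and has its own continuity and innovation, but it is not yet able to confront empiricism head-on. When one school becomes mainstream, the other naturally fades from view.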

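Guo: I sense some confusion both inside and outside the field. Deep learning was originally just one method of the empiricist school, yet many people now seem to equate it with AI and NLP. If the deep learning revolution sweeps every corner of AI, will it put an end to the rationalist swing back? As Professor Church says, the empiricist pendulum has already swung too far.

Li: My answer is no. These are two different philosophies and methodologies, each with its own natural strengths and weaknesses; there is no question of one completely eliminating the other.

The current one-sided dominance of empiricism in academia has its reasons, but it is not a healthy state. In fact the two schools are both competitive and strongly complementary. Far-sighted veterans such as Church keep warning of the drawbacks of one-sided empiricism, and new scholars keep exploring a deep fusion of the two methodologies in order to tackle natural language understanding together.

There is no doubt that this wave of AI enthusiasm is largely built on deep learning breakthroughs, especially the achievements in image recognition, speech processing, and machine translation. But deep learning retains a fundamental limitation of the statistical school: dependence on massive labeled data. In many niche domains and task scenarios, such as parsing minority languages or machine translation of e-commerce data, massive annotations or in-domain translation data do not exist. This knowledge bottleneck severely limits the performance of empiricist methods on cognitive natural language tasks. Without enough labeled data, machine learning is cooking without rice. This is even more true of deep learning, whose appetite is larger than that of traditional machine learning.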

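Guo: So deep learning is not omnipotent either, and rationalism deserves its own place. You say each has strengths and weaknesses; could you compare them?

Li: It is well worth summarising the strengths and weaknesses of each school so that they can learn from each other.

The advantages of machine learning include:

(1) no dependence on domain experts (but a need for large amounts of labeled data);
(2) strength in coarse-grained tasks such as classification;
(3) good recall;
(4) robustness and high development efficiency.

By contrast, the advantages of the grammar school include:

(1) no dependence on labeled data (but a need for expert coding);
(2) strength in fine-grained tasks such as parsing and reasoning;
(3) good precision;
(4) easy pinpoint debugging and interpretability.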

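Expert-coded rule systems excel at dissecting text word by word and sentence by sentence, whereas learned statistical models are naturally good at global conclusions. If machine learning often sees the forest but not the trees, computational grammar sees the trees but not the forest. Machine learning driven by big data brings robustness and recall, but on fine-grained tasks it tends to hit a precision ceiling. The Chinese term 鲁棒 is a transliteration of "robust", meaning sturdy and stable; robustness is the key to a system's survival under abnormal and hazardous conditions. Rules written by experts readily guarantee precision, but raising recall is a long iterative process, and robustness depends on the architectural design of the rule system. A rule system rests on interpretable symbolic logic, so it is easy to trace an error to where it occurred and debug it in a targeted way. These two points are exactly where machine learning falls short. Whether right or wrong, the results of machine learning are hard to explain, which affects user experience and trust. The difficulty of pinpoint debugging is an even greater headache in development; the cause is that learned models lack explicit symbols and structure representation. Finally, a learning system can scale fairly quickly to big-data applications, its successes are easy to replicate, and a methodological breakthrough can often lift an entire industry. The quality of a rule system, by comparison, depends heavily on the individual experience of its experts. It is like Chinese cooking: with the same ingredients, different chefs often produce dishes of very different quality.

Each route has its own knowledge bottleneck. Figuratively, one depends on massive low-skilled labour and the other on the high-skilled labour of a few experts. For machine learning, massive annotation is the knowledge bottleneck for grounding in a domain (that is, putting the method to work in an application). The rationalist route simulates the human cognitive process and need not imitate the surface of massive data, but it can hardly avoid the inefficiency of manual coding. Annotation is monotonous, yet an ordinary student can do it after a little training. Hand-crafting and debugging rules, however, is costly to train for and hard to scale. The talent gap is another practical limitation of the grammar school. Thirty years is exactly one generation, and over the past 30 years the unrivalled dominance of empiricism on the mainstream stage has left the rationalist camp short of successors.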

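Guo: Professor Li, I have a basic question. Grammar rules are based on linguistic form, and semantics is parsed out through that form. So what exactly is linguistic form?

Li: This is the fundamental question in formalising natural language. All grammar rules are built on linguistic form, yet not everyone, including people who work on grammars, has a clear picture of what linguistic form is.

True, natural language as a symbolic system ultimately expresses semantics through linguistic form. Different utterances differ only in form; the semantics and logic behind them must be the same, otherwise people could not exchange ideas and translation would lose its footing. Everyone knows this, but what is the definition of linguistic form? Answering that question takes us into computational linguistics.

Linguistic form, as the name suggests, is the means of expression of language. At first glance, isn't language just a string of symbols? Whether a stream of speech or a string of text, it reduces to a symbol string, so the symbol string is the linguistic form. This answer is not wrong, but it is too general. The string has units; its basic unit is the token, that is, a word or a morpheme. A morpheme is defined as the smallest symbolic unit that pairs sound with meaning. So, as a first level of abstraction, we can decompose linguistic form into tokens and their word order. Every rule in a computational grammar defines a condition pattern, whose purpose is to match against the symbol string. The most basic condition pattern is the linear pattern, and its two components are token conditions and order conditions.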

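Guo: Right, the basic elements of linguistic form are words/morphemes and word order. Word order is just the sequence of the symbols and is easy to define, but words and morphemes seem to involve a great deal.

Li: Indeed. As linguistic symbols, words and morphemes are very important; they are the starting point of linguistics. The dictionary that records them therefore becomes the basic resource for parsing. Note that by "dictionary" here we mean a machine dictionary, a formalised resource built on traditional dictionaries.

If natural language were a closed set, say ten thousand sentences in total, a formal grammar would be simple: build a store that holds all these sentence strings, each string being equivalent to one "words plus order" pattern rule, and the set of full strings would be a complete grammar model. But natural language is an open set, and its endlessly varying sentences cannot be enumerated. How does a formal grammar form rules on the basis of linguistic form and parse unlimited sentences automatically with a finite set of rules?

Tokenization based on dictionary lookup is the first step in parsing a sentence. The results of lookup are lexicon words, including morphemes. Unlimited sentences are decomposed into finite units mainly through dictionary lookup. Lexicon words, plus a small number of unknown words outside the dictionary, together make up the token list. The token list is important: it is the formalized representation of the sentence and, as the initial data structure, the object of automatic parsing.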

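Next we enter a basic branch of linguistics, usually called morphology, whose purpose is to parse the internal structure of multi-morphemic words. In some languages, such as Russian and Latin, morphology is elaborate, with noun declension, verb conjugation, and so on; in others, such as English and Chinese, it is relatively poor. Note that richness of morphology is only relative. Chinese lacks inflection and its words do not change form, but its capacity to build multi-morpheme compounds is very strong. The compound word has long been controversial in linguistics, however: it lies at the interface between morphology and syntax, and its way of combining resembles that of syntactic phrases. Many people therefore treat compounding not as morphology but as the preliminary part of syntax, or "little syntax".

Guo: I once read in the typology literature that there is a spectrum. One extreme is the isolating language, represented by Classical Chinese, which has no morphology, only syntax. The other extreme, I believe, is the polysynthetic language, represented by certain Native American languages, which has essentially only morphology and no syntax. Most languages lie between the two extremes, with Modern Chinese and English leaning towards the isolating end: small morphology, large syntax. Is that right?

Li: Yes, that is right. Leaving aside the ratio of morphology to syntax, the first thing we see when we study words and morphemes is two broad classes. One class consists of function words and inflectional morphology and is a fairly small closed set; the other consists of content words (notional words) and is an open set. The content-word class always contains unknown words, and no dictionary can ever close it off.

"Small word" is only an informal name; the proper terms are function word or closed-class word, meaning prepositions, pronouns, particles, conjunctions, original adverbs, interrogatives, interjections, and the like. Inflectional material includes prefixes, suffixes, endings, and so on, and is also a small set. Function words and affixes occur frequently but are limited in number. As closed-class morphemes, they can in principle be enumerated directly when they need to be matched; in software terms they are matched as literals. At this point we have at least the following kinds of linguistic form that can serve as rule conditions: (1) word order; (2) function words; (3) inflectional morphology. Different language types rely on these forms to different degrees. Russian, for example, has rich morphology and depends less on word order and function words; English has poor morphology, so its word order is relatively fixed and its function words are fairly abundant.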

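What about content words? Content words are of course also linguistic form, and one can try to enumerate them as literals in rule patterns. But because content words form an open set, it is better to classify them and match on the classes rather than on literals; only then do rules generalise. The human brain, too, abstracts over content words mainly by classifying them. Classifying words and recording the classes in the dictionary is the groundwork of formalisation.

In a formal system, classification results are usually represented and annotated as features. Features are implicit linguistic forms defined inside the system. Implicit form is so called in contrast to the explicit form mentioned earlier. Word order and morphemes are clearly forms that can be seen in the symbol string; classification features are not, and cannot be perceived directly. These features are supplied to the parser as the result of dictionary lookup and support the formal conditions of pattern matching.

To summarise, automatic parsing relies on three main kinds of linguistic form: (1) word order; (2) literals (especially function words and affixes); (3) features. The first two are explicit forms; features are implicit. Once linguistic form is divided this way, natural language suddenly becomes clear. Whatever the language, it is nothing more than the interleaved use of these three forms, in differing proportions and with differing emphasis. And grammar is nothing more than rules built from these three forms to describe linguistic phenomena and the structures behind them.

The three kinds of form can be grafted onto one another. Grafting of explicit forms includes reduplication, as in "高高兴兴" and "走一走". Reduplication is a pattern that grafts word order onto literals (AABB, V 一 V) and is a formal device commonly used in Chinese morphology and syntax. Explicit forms can also be turned into features, either through dictionary annotation or through values assigned by a rule module or subroutine. For example, morphological features (singular, third person, present tense, and so on) are features derived by the morphology module. Morphological parsing relies mainly on the inflectional ending as a literal and on the type features of the stem; for example, the English ending "-ly" combines with an adjective stem to form an adverb (beautiful-ly). Morphological features are thus also the result of grafting explicit form onto implicit form.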

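Guo: In terms of the use of linguistic form, can we say that European languages are more rigorous than Chinese?

Li: Yes. From the standpoint of linguistic form, European languages are indeed more rigorous than Chinese. There are also considerable differences among European languages; German and French, for example, are more rigorous than English, even though, historically speaking, English can be described as a cross between German and French.

"Rigorous" here means that these languages have fairly ample explicit forms for expressing structural relations, which helps reduce ambiguity. Chinese has too few explicit forms, and this makes Chinese parsing harder. Morphology is an important explicit form: the gender, number and case of nouns and the tense, aspect and voice of verbs are morphological categories expressed by explicit inflectional endings. Chinese has no such inflection. Languages with rich morphology have relatively free word order, Russian for example. Likewise, the Esperanto for "I love you" has three words that can be arranged freely in all six orders. Why is the order free? Because there is a morphological form such as the object case: wherever the word goes, it cannot escape the verb-object relation, so there is no need to rely on a fixed order.

In its development, Chinese did not take the path of inflection but evolved along the isolating path, using word order and function words. English has developed largely in the same way. From a linguistic point of view, inflection and function words are both perceptible explicit forms. But the categories expressed by inflectional endings are far more developed than those expressed by function words (mainly prepositions). Devices such as verb conjugation and noun declension create explicit agreement between structurally related words, for example agreement in person and number between subject and predicate, or agreement in gender, number and case between a modifier and its head. When relations carry formal marks, the structure of an inflecting language is naturally much more rigorous, and structural ambiguity is less likely. Rich morphology reduces the parser's dependence on implicit forms and knowledge.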

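Guo: People often say that Chinese is a "paratactic" (意合) language that lacks hard grammatical norms. Does this mean that it lacks inflection and is analysed and understood mainly by semantic means?

Li: Yes. From the standpoint of formalisation, semantic means show up as implicit form. So-called parataxis is really semantic coherence between related words, especially commonsense collocation between meanings within predicate structures. For example, the object of the predicate "吃" (eat) is "food". Such common sense is usually encoded in an ontology. HowNet*, created by Professor Dong Zhendong, is one such knowledge base of ontological common sense.

* HowNet is a cross-lingual semantic machine dictionary invented by Dong Zhendong, a pioneer of natural language processing in China. It encodes the ontological concepts of word senses and the common sense attached to them, with the aim of establishing a formalised network of semantic concepts as foundational support for NLP.

Now consider the use of inflection and function words. "兄弟" (brother) is a noun in Chinese, and this part of speech is annotated in the dictionary. The Esperanto "frato" (brother) needs no dictionary annotation, because it carries the noun ending "-o". Take plurality next. Chinese "兄弟们" uses the function word "们" to express the plural; Esperanto uses the ending "-j", giving "fratoj" (brothers). At first glance these look the same: both use limited linguistic material for explicit expression. But in European languages that have the morphological category of number (Esperanto included), the inflection cannot be omitted. In Chinese, plurality is sometimes explicit and sometimes implicit, and "们" is not obligatory, as in:

三个兄弟没水喝。 (Three brothers have no water to drink.)

Here the plural "brothers" carries no "们". In fact, Chinese grammar forbids an explicit plural marker after a numeral-classifier phrase: one cannot say "三个兄弟们". In other words, the plurality in "(三个)兄弟" is implicit and can only be determined from the preceding numeral-classifier phrase.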

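Guo: So the lack of inflection really is a challenge for Chinese. Chinese is hard to learn and hard to parse automatically. Some even say that Chinese has no grammar at all.

Li: That is an extreme view. There is no language without grammar. If a language had no rules, how could people use it and understand it? It is simply that Chinese grammar relies more on implicit form.

Chinese grammar is indeed rather loose, in the sense that it relies less on explicit form. The fluency of a sentence depends on semantic coherence in context rather than on strict explicit rules. Of the three explicit devices, inflection, function words, and word order, Chinese has essentially no inflection, often omits function words, and is flexible in word order.

Take function words first. Chinese has more or less every preposition and conjunction that English has, but it omits them far more often than English does. This is supported by statistics and matches everyday experience: Chinese, especially spoken Chinese, omits whatever can be omitted and feels very free. The following examples show how pervasive the omission of function words is in Chinese:

(1) 对于这件事, 依我的看法, 我们应该听其自然.
As for this matter, in my opinion, we should leave it to nature.

(2) 这件事我的看法应该听其自然.
* This matter my opinion should leave it to nature.

Sentences like (2) are extremely common in spoken Chinese and feel quite natural, but a word-for-word rendering into English is completely ungrammatical. Both Chinese and English use prepositional phrases (PP) as adverbials, yet Chinese prepositions can often be omitted. This "paratactic" style of expression, with its lack of explicit formal marking, does make automatic processing of Chinese much harder than that of English.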

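Guo: What about Chinese word order? It is often said that languages with rich morphology have free word order, and that Chinese, lacking inflection, is therefore a fixed-order language. Chinese is generally regarded as a fixed subject-verb-object (SVO) language.

Li: Unfortunately, that is not so. Common sense would suggest that a language which lacks inflection and often drops function words must at least have fixed word order. In reality, Chinese word order is not as rigid as the proponents of fixed order in isolating languages claim; its freedom often exceeds what most people imagine.

Consider the variants of the most typical subject-verb-object pattern. The three elements S, V, and O allow at most six arrangements. Esperanto's morphology is not rich; for case it has only the accusative ending "-n", with the subject case as a zero form. Yet it can use any of the six orders without changing the logic-semantic relation of "SVO". It is interesting to compare English, which is poor in morphology (nouns have no case inflection but pronouns do), with Chinese, which lacks inflection altogether (neither nouns nor pronouns inflect for case). The freedom of the SVO pattern in Esperanto, English, and Chinese compares as follows: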

①SVO:

Mi manĝis fiŝon.
I ate fish.
我吃了鱼。

②SOV:

Mi fiŝon manĝis.
* I fish ate.
我鱼吃了。

③VOS:

Manĝis fiŝon mi.
* Ate fish I.
? 吃了鱼我。 (acceptable in speech)

④VSO:

Manĝis mi fiŝon.
* Ate I fish.
* 吃了我鱼。 (read not as VSO but as "ate my fish")

⑤OVS:

Fiŝon manĝis mi.
* Fish ate I. (not allowed, even though "I" is marked as subject case)
? 鱼吃了我。 (the legitimate reading is SVO, the exact opposite of OVS)

⑥OSV:

Fiŝon mi manĝis.
Fish I ate.
鱼我吃了。

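To summarise, of the six orders Chinese accepts three, two fall in a grey area (marked "?", seemingly possible in speech), and one is illegal (marked "*"). And English? Only two are legal; the rest are all illegal. So in the most common SVO pattern, Chinese word order is twice as free as English. Although English has case inflection on pronouns (I/me) and Chinese does not, English word order is less flexible than Chinese. Richness of morphology and freedom of word order, then, do not necessarily go together.

Chinese actually has more freedom and elasticity in word order than many people imagine. Often, whichever concept comes to mind first can simply be uttered first. Here is another set of examples: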

张三眼睛哭肿了。
眼睛张三哭肿了。
哭肿张三眼睛了。
张三哭肿眼睛了。
哭得张三眼睛肿了。
张三哭得眼睛肿了。
张三眼睛哭得肿了。
张三的眼睛哭肿了。
............

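Without studying real data, it would be hard to believe that Chinese word order is so wilful. Chinese relies more on implicit than on explicit form, which is clearly a disadvantage for automatic parsing. We would of course prefer every language to have fixed word order; how much effort that would save! Sequence-pattern rules are made of symbols plus order, so when word order is flexible, the number of rules has to multiply. Other, non-order constraints can be adjusted within an existing pattern; word order alone is the hurdle that rule coding cannot get around.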

Li Wei and Guo Jin, 《自然语言处理答问》 (Q&A on Natural Language Processing), The Commercial Press, 2020

Prelude: Origins

Li Wei entered the Graduate School of the Chinese Academy of Social Sciences in 1983, studying under Professors Liu Yongquan and Liu Zhuo, pioneers of machine translation in China, and so began a lifelong journey in NLP. After graduation, he continued MT research at the Institute of Linguistics (CASS), then pursued doctoral work in the United Kingdom and Canada, earning a PhD in Computational Linguistics from Simon Fraser University. Since 1997, he has served as an NLP system architect in Buffalo and Silicon Valley, investing more than two decades in large‑scale industrial practice of Natural Language Understanding (NLU) on the front line of AI applications.

Guo Jin received his PhD in Computer Science from the National University of Singapore in 1994 with a focus on Chinese tokenization and statistical language modelling, work published in Computational Linguistics and related venues. Moving to the United States in 1998, he held research posts at Motorola, Amazon, and the JD Silicon Valley Research Center, exploring applications that fuse machine learning, NLP, and human–computer interaction across internet and IoT scenarios.

From the 1980s onward, the AI community has witnessed a “two‑track contest” between rationalism and empiricism in NLP. The ascendancy of machine learning has gradually eclipsed the grammar school, and computational grammar risks a generational break.

In 2018, over ten extended conversations in Silicon Valley, Li and Guo revisited the symbolic legacy and debated paths forward. Those dialogues became the backbone of the present volume, calling for a rationalist renaissance to dismantle the cognitive citadels that still impede AI.

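Zero: Origins

Since the 1980s, the field of artificial intelligence has witnessed a "struggle between two lines", rationalism and empiricism. In the natural language community, the outcome has been that the grammar school and the statistical school have waxed and waned in turn: machine learning has gradually become mainstream, while computational grammar is in danger of losing a generation.

Li Wei entered the Graduate School of the Chinese Academy of Social Sciences in 1983, where he studied under Liu Yongquan and Liu Zhuo, specialising in grammar-based machine translation, and so entered the natural language field. After graduating he did machine translation research at the Institute of Linguistics of the Chinese Academy of Social Sciences, then studied in the United Kingdom and Canada, earning a PhD in Computational Linguistics from Simon Fraser University (SFU). Since 1997 he has spent more than 20 years in industrial practice of Natural Language Understanding (NLU) in Buffalo and Silicon Valley in the United States, as a system architect on the front line of Artificial Intelligence (AI) applications.

Guo Jin received his PhD in Computer Science from the National University of Singapore in 1994, specialising in Chinese tokenization and statistical models, with results published in Computational Linguistics and other journals. He moved to the United States in 1998 and worked on AI research at Motorola, Amazon, the JD Silicon Valley Research Center, and elsewhere, exploring solutions that apply machine learning, Natural Language Processing (NLP), and other human-computer interaction technologies to the internet and the Internet of Things. In 2018, Li and Guo held ten long conversations in Silicon Valley on natural language parsing, looking back on and ahead to the grammar school's innovations in mechanism and its path of transmission. Their aim was to call for a return of rationalism, to deconstruct natural language, and to join forces in storming the cognitive fortress of AI; this book is the result.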

Li Wei and Guo Jin, 《自然语言处理答问》 (Q&A on Natural Language Processing), The Commercial Press, 2020

Preface for "Q&A on NLP"

This modest volume, Questions & Answers on Natural Language Processing, now joins the Chinese Linguistic Knowledge Series alongside titles by Zhu Dexi, Li Rong, He Jiuying, Li Xinkui, Feng Zhiwei, and Xing Fuyi. To be included in such a lineage leaves me both honored and a little awed. In particular, Professor Zhu Dexi’s Q&A on Grammar was one of my earliest inspirations; I have revisited it countless times over the decades, always finding new heights to scale.

Symbolic Linguistic Legacy

Had the series permitted formal dedications, I would have inscribed this book to my mentors—Professors Liu Yongquan and Liu Zhuo—pioneers of machine translation in China. Their legacy impelled me to press on even when the manuscript seemed perpetually “stuck in revision hell.”

The book’s very existence also owes much to Feng Aizhen, my meticulous commissioning editor at The Commercial Press. Over three years of proofs, her insistence on perfection revealed how that venerable imprint earned its reputation for rigor.

Thanks, Colleagues & Friends

Professors Wang Jianjun, Song Rou, Zhang Guiping, Zhou Liuxi, and many industry comrades offered incisive comments. My long‑time engineering partners—Niu Cheng, Lokesh, Li Lei, Tang Tian, Ben, and Martin—translated symbolic NLP designs into scalable products.

Mirror’s Last‑Minute Miracle

Old friend Mirror scrutinized every line with the zeal of a textual scholar: “It reads like Galileo’s Dialogue Concerning the Two Chief World Systems, only in NLP!” Five days before typesetting, he begged to polish one more draft, and the result was transformative.

A Tale of Two Schools

Beyond theory, this book chronicles the dialectic between rationalist symbolism and empiricist machine learning—a pendulum that has swung since the 1980s. Co‑author Dr. Guo Jin saved the project more than once, re‑anchoring a drifting manuscript.

Family Footnotes

A lifetime craftsman, I never planned to “write a book,” yet my family shared every thrill. My daughter Tian Tian contributed two whimsical illustrations explaining the “dictionary black‑box” joke, adding warmth to these pages.

In Quiet Cupertino

And so, on a July night in Apple Town, with Secret Garden’s Sometimes When It Rains looping through my headphones, I penned the final punctuation. May these symbolic threads—fragile yet unbroken—echo through AI’s recurrent tides. Neural networks are no end of history; when the pendulum swings back, perhaps this book too will be rediscovered.

Cupertino, 15 July 2020 (midnight)

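On the Publication of a Small Book on NLP

This small NLP book, 《自然语言处理答问》 (Q&A on Natural Language Processing), is finally out, and I feel quite moved. The authors chosen for the Commercial Press's Chinese Linguistic Knowledge Series are all elders of Chinese linguistics whose names resound. These are small books by great scholars, a gathering of the best, and to be listed among them fills me with trepidation. In particular, Zhu Dexi's scholarly classic 《语法答问》 (Q&A on Grammar) was one of the books that introduced me to the field, and I have read it I do not know how many times over the decades. Each reading brings something new, and I look up to it as to a high mountain.

The format of this book left no place for a dedication or acknowledgements, which is something of a regret. Looking back, the book went through many twists from conception to completion and was nearly stillborn, and the dozens of rounds of proofreading and revision felt like an infinite loop. Now that it has finally gone to press, I think with gratitude of the teachers, colleagues, family, and friends who supported me in so many ways. Without their spurring and endorsement, their collaboration and corrections, this book would not have appeared.

I did consider a dedication. In terms of my scholarly initiation and lineage, it should without question go to my mentors, to mark the transmission and development of the symbolic logic school in China. The design at the time was:

The first person to thank is naturally Feng Aizhen, my editor at the Commercial Press. More than two years of planning, layout, and repeated correction showed the dedication and rigour of a veteran Commercial Press expert. The quality and reputation of the Commercial Press in Chinese publishing, it turns out, are upheld by just such an elite group of editors who let no character pass and strive for perfection. After nearly three years of countless editorial exchanges, her congratulations finally arrived: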

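Good news: congratulations to Liwei on a major work about to appear, shoulder to shoulder with China's leading linguists

Zhu Dexi, Li Rong, He Jiuying, Li Xinkui, Feng Zhiwei, Xing Fuyi... Small books by great scholars, distilled from deep accumulation; cutting-edge knowledge, explained in plain terms.

For more than thirty years, Dr. Li Wei has stayed at the forefront of natural language processing, devoting himself to research and application development. He has built up a deep theoretical foundation and has also established sound NLP system architectures. He is familiar with the full range of NLP methods and has original insights and arguments in many areas. This book is the heartfelt fruit of that long accumulation. It presents the theory and applied technology of NLP in a way that is accessible, concise, and practical. Professionals working in AI and NLP, as well as students, will benefit greatly.

The main theory and practice of this book come from the rationalist line of AI (known as the symbolic logic school), in contrast with the empiricist mainstream of the last thirty years (known as the machine learning school). Its starting point in NLP is Chomsky's formal language theory. I had the good fortune to study for many years under Liu Yongquan and Liu Zhuo, fathers of machine translation in China, had many opportunities to learn in person from Professor Dong Zhendong, and was also schooled in computational linguistics by Professor Feng Zhiwei. After going abroad, my doctoral supervisors Paul McFetridge and Fred Popowich, together with Professor Nancy, the linguistics department chair who taught us HPSG, led me into unification-based grammar. That was the last academic wave of symbolic logic in 30 years, short-lived as it may seem. After the doctorate I made my way south and, by a twist of fate, plunged into industry, where I led language processing technology for more than twenty years and worked on developing NLP products at scale. This unusual path made me one of the very few "survivors" among computational linguists in this field, with the chance to work deeply on the symbolic route and to produce distinctive innovations in theory and practice.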

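At a critical moment my co-author Dr. Guo Jin took the commanding view and saved this work from dying before birth. Guo is also an acquaintance of nearly thirty years. In his day he was a dominant figure in Chinese word segmentation and was the first scholar from the mainland academic community to publish in Computational Linguistics, the top journal of our field (in effect he brought the theory of this foundational area of Chinese processing to a close). Twenty years ago, when I won a prize at the first TREC question answering evaluation, I ran into Guo unexpectedly at the conference. He invited me to talk through the night and insisted on asking how I had built the system; his keen interest was touching. As a linguist, I entered the profession just as linguistics was being pushed off the mainstream stage worldwide (see "Church: A Pendulum Swung Too Far"). That Guo, trained in the mainstream, set aside factional prejudice and was not too proud to ask came as a pleasant surprise. Later we had many debates about the entanglement of the two lines in NLP. Well before the book was conceived with the Commercial Press, Guo was urging me to write, saying that the flame of symbolic logic must not be allowed to go out. Only when I started writing did I find how hard it is to make things clear. There was too much I wanted to say, and the threads were a tangled mess. After one chapter I was stuck in the mud. I wavered and said I might as well give up. Guo pointed out that this was systems engineering and that the bottom-up, inductive way of sorting things out that I used in language processing did not suit it. I finally persuaded him to step in and direct the work top-down, keeping overall control and laying down firm rules: no digressions. He is, after all, a seasoned engineer and master architect, and he laid out the book as deftly as one cooks a small fish. That gave the project new life. Life holds many marvellous moments that span time and space; strung together, they make it hard not to believe in something like fate (see the appendix "Zero: Origins").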

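The topics in this book were discussed many times in two WeChat groups with the group owners and fellow practitioners, and I learned a great deal from them. One is the AI group run by Nick, author of 《人工智能简史》 (A Brief History of Artificial Intelligence); the other is Bai Shuo's semantic computing group. When the book was proposed, it received professional recommendations from Ma Shaoping, professor of artificial intelligence at Tsinghua University, and Professor Zhan Weidong of the Department of Chinese at Peking University. In 2017 Professor Zhan also invited me to give a talk in Peking University's "Boya Linguistics" lecture series, titled 《洞穿乔姆斯基大院的围墙》 ("Breaking Through the Walls of Chomsky's Compound"). In the same year, at the invitation of researcher Sun Le, I attended the 2017 annual conference of the Chinese Information Processing Society, where Professor Ma chaired and introduced my keynote, 《中文自动句法解析的迷思和痛点》 ("Myths and Pain Points of Automatic Syntactic Parsing of Chinese"). These talks provided a platform for presenting the relevant chapters and receiving feedback. The Liwei NLP Channel (liweinlp.com), hosted by Gao Bo, likewise provided a digital platform for the book's topics and their background.

Special thanks go to my old friend Mirror (米拉) for his generous affection for the first draft. Mirror said: "It has something of Galileo's scientific dialogues about it; great fun." He weighed every word with minute care, and his scientific insight and command of language made many of his revisions the work of a "one-character teacher". Right up to the final version, with only five days left before the deadline, when I said I had finally escaped the infinite loop, Mirror insisted: "How about I study it and revise one more version? A different person brings a different point of view. Let me try; it is always better to be a little more perfect. I plan to recommend it to my wife as a textbook for learning Chinese." I could not help laughing. Years ago, because I loved the lasting flavour of Mirror's writing, I edited 《镜子大全》 for him. Was this a favour returned, or mutual admiration?

Mao Decao was also a midwife to this book. On the critique of Chomsky in particular, I learned most from Mao, Nick, and Bai Shuo. Mao is a computer-industry expert with a long shelf of books to his name. I told him: "After your repeated incitement and prodding, I have finally started to 'write a book'." He encouraged me: "Oh, that is good news! Of course I will read it. The symbolic logic school is exactly what the rising stars of today's AI world are missing. Whether or not the pendulum is bound to swing back, the two are at least complementary. I think your book has great promise. You might publish it in China first, then translate it into English and publish it again in the United States." I was somewhat overwhelmed: "Let us not talk about an English edition. I know nothing of American publishing, and this is non-mainstream material. The value of the book may only show after time has let the tides rise and fall. That is also why I must grit my teeth and write it. The symbolic logic school of natural language already has a broken lineage. My first step is to make sure the content is scholarly and can withstand time and the criticism of peers." Many of Mao's suggestions were excellent and convincing, and I share a summary with readers here.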

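(1) There should be an introduction up front, for beginners and especially for people crossing over from other fields. NLP spans a great deal to begin with, and outsiders tend to find it daunting; they do not even know who Chomsky is. So the threshold has to be lowered.

(2) As for positioning, I think the book could be: the most scholarly popular science, and the scholarship closest to popular science.

(3) The question-and-answer format is fine, of course. But in a Q&A the questioner makes no statements and expresses no views, so I think a dialogue might be better, like Galileo's Dialogue Concerning the Two Chief World Systems. A three-way dialogue might be better still: one party for deep learning, one for symbolic reasoning in the Chomsky tradition, and one for symbolic reasoning critical of Chomsky.

My old classmate Professor Wang Jianjun made excellent suggestions on scholarly rigour and chapter arrangement. Special thanks to Song Rou and Zhou Liuxi for their encouragement and advice. Encouragement and help also came from colleagues and friends Zhou Ming, Li Hang, Pei Jian, Zhang Guiping, Shi Shuicai, Fu Aiping, Li Lipeng, Lei Xiaojun, Hong Tao, Wang Wei, Chen Liren, Tang Xinan, Huang Xuanjing, Liu Qun, Sun Maosong, Xun Endong, Xue Ping, Jiang Daxin, Niu Xiaochuan, Zhizheng, Yan Yongxin, and Ouyang Feng. In the course of writing and publication, I received support from company leaders Zhou Bowen, He Xiaodong, Hu Yu, Gao Yuguang, and Jia Kui, and I thank them all.

In putting symbolic NLP into practice, my partners and assistants at different times, Lars, Niu Cheng, Lokesh, Li Lei, Tang Tian, Lin Tianbing, and Martin, helped bring products to scale and demonstrated the value of innovation in natural language. Tian Yuemin, Sun Yaxuan, Guo Yuting, Hou Xiaochen, Sophia Guo, and other students read the first draft carefully, and their feedback ensured that the book is comprehensible to newcomers.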

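Having been a craftsman all my life, I never formally planned to write a book. Over the two years of writing, my family shared the excitement and pride and the joy of "one-book-ism", and my father and my wife were especially encouraging. Last comes the contribution of my daughter Tian Tian. When explaining the principle of the dictionary black box, I thought a popular joke would make a good illustration. To avoid unintentional infringement, I had to ask Tian Tian for help. She gladly agreed, and so there are two drawings by a daughter for her father's book, with a charm of their own.

Tian Tian says the figure she drew is me, and I think it looks quite like me, though her drawing of herself is not much of a likeness. In an old album I found a few photos from outings when she was little, for comparison. Looking back over the past twenty-odd years, my daughter and NLP have always been the two centres of my life. Her thoughtfulness brought wisps of warmth to the long process of accumulation on the cold bench of NLP scholarship, where I have sat all my life.

This is destined to be a niche book with few readers. I hope that the symbolic natural language scholarship it carries forward and renews will stay connected, like the fibres of a lotus root that hold after the root is snapped. Like the ebb and flow of rationalism in AI, may it leave at least an echo in history. Who knows: fortunes turn, and "neural" is probably not the end of history. When the pendulum swings back, history may be rediscovered.

In the still of the night, a famous piece by Secret Garden drifts through my headphones, the new-age track Sometimes When It Rains. The lingering sound goes on, fine as a thread.

Written at midnight on 15 July 2020, in Apple Town (Cupertino).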

Li Wei and Guo Jin, 《自然语言处理答问》 (Q&A on Natural Language Processing), The Commercial Press, 2020

A Comparative Review of Autoregressive and Diffusion Models for Video Generation

Abstract

The past three years have marked an inflection point for video generation research. Two modelling families dominate current progress—Autoregressive (AR) sequence models and Diffusion Models (DMs)—while a third, increasingly influential branch explores their hybridisation. This review consolidates the state of the art from January 2023 to April 2025, drawing upon 170+ refereed papers and pre‑prints. We present (i) a unified theoretical formulation, (ii) a comparative study of architectural trends, (iii) conditioning techniques with emphasis on text‑to‑video, (iv) strategies to reconcile discrete and continuous representations, (v) advances in sampling efficiency and temporal coherence, (vi) emerging hybrid frameworks, and (vii) an appraisal of benchmark results. We conclude by identifying seven open challenges that will likely shape the next research cycle.

1. Introduction

1.1 Scope and motivation

Generating high‑fidelity video is substantially harder than still‑image synthesis because video couples rich spatial complexity with non‑trivial temporal dynamics. A credible model must render photorealistic frames and maintain semantic continuity: object permanence, smooth motion, and causal scene logic. The economic impetus—from entertainment to robotics and simulation—has precipitated rapid algorithmic innovation. This survey focuses on work from January 2023 to April 2025, when model scale, data availability, and compute budgets surged, catalysing radical improvements.

1.2 Survey methodology

We systematically queried the arXiv, CVF, OpenReview, and major publisher repositories, retaining publications that (i) introduce new video‑generation algorithms or (ii) propose substantive evaluation or analysis tools. Grey literature from industrial labs (e.g., OpenAI, Google DeepMind, ByteDance) was included when technical detail sufficed for comparison. Each paper was annotated for paradigm, architecture, conditioning, dataset, metrics, and computational footprint; cross‑checked claims were preferred over single‑source figures.

1.3 Organisation

Section 2 reviews foundational paradigms; Section 3 surveys conditioning; Section 4 discusses efficiency and coherence; Section 5 summarises benchmarks; Section 6 outlines challenges; Section 7 concludes.

2. Foundational Paradigms

2.1 Autoregressive sequence models

Probability factorisation. Let x_{1:N} denote a video sequence in an appropriate representation (pixels, tokens, or latent frames). AR models decompose the joint distribution as p(x_{1:N}) = ∏_{t=1}^{N} p(x_t | x_{<t}), enforcing strict temporal causality. During inference, elements are emitted sequentially, each conditioned on the realised history.
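
To make the factorisation concrete, the following minimal sketch scores and samples a short token sequence with the chain rule. The function next_token_probs is a hypothetical stand-in for a trained network (it simply favours repeating the previous token); only the structure of the two loops is meant to carry over.

```python
import numpy as np

def next_token_probs(history, vocab_size=4):
    """Hypothetical stand-in for a trained model's p(x_t | x_<t).

    It favours repeating the previous token; any function returning a
    valid distribution over the vocabulary could be used instead.
    """
    probs = np.full(vocab_size, 1.0)
    if history:
        probs[history[-1]] += 2.0  # toy "temporal smoothness" prior
    return probs / probs.sum()

def sequence_log_prob(tokens):
    """log p(x_{1:N}) = sum over t of log p(x_t | x_<t)."""
    total = 0.0
    for t, tok in enumerate(tokens):
        total += np.log(next_token_probs(list(tokens[:t]))[tok])
    return total

def sample_sequence(length, seed=0):
    """Emit elements one at a time, each conditioned on the realised history."""
    rng = np.random.default_rng(seed)
    history = []
    for _ in range(length):
        probs = next_token_probs(history)
        history.append(int(rng.choice(len(probs), p=probs)))
    return history
```

Because every conditional is a proper distribution, the probabilities of all length-N sequences sum to one, which is what makes exact likelihood training possible for AR models.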

Architectures and tokenisation. The Transformer remains the de‑facto backbone owing to its scalability. Three tokenisation regimes coexist:

- Pixel‑level AR (e.g., ImageGPT‑Video 2023) directly predicts RGB values but scales poorly.
- Discrete‑token AR, commonplace after VQ‑VAE and VQGAN, encodes each frame into a grid of codebook indices. MAGVIT‑v2 [1] shows that lookup‑free quantisation with a 2^18 (≈ 262 k)‑entry vocabulary narrows the fidelity gap to diffusion.
- Continuous‑latent AR eschews quantisation. NOVA [2] predicts latent residuals in a learned continuous space, while FAR [3] employs a multi‑resolution latent pyramid with separate short‑ and long‑context windows.
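
The difference between codebook‑based and lookup‑free quantisation can be sketched in a few lines. This is an illustrative simplification under stated assumptions: real tokenizers learn the encoder and codebook jointly, and the function names here are hypothetical.

```python
import numpy as np

def vq_index(z, codebook):
    """Vector quantisation: index of the nearest codebook entry."""
    dists = np.sum((codebook - z) ** 2, axis=1)
    return int(np.argmin(dists))

def lfq_index(z):
    """Lookup-free quantisation sketch.

    Each latent dimension is binarised by its sign and the bits are read
    as an integer token id, so a d-dimensional latent yields a 2**d-entry
    vocabulary without storing or searching a codebook.
    """
    bits = (np.asarray(z) > 0).astype(np.int64)
    weights = 2 ** np.arange(len(bits), dtype=np.int64)
    return int(np.sum(bits * weights))

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
token_vq = vq_index(np.array([0.9, 1.2]), codebook)   # nearest entry is row 1
token_lfq = lfq_index([0.3, -1.0, 2.0])               # bits 1,0,1 -> 1 + 4
```

The continuous‑latent regime skips both functions entirely and lets the AR model regress the latent vector itself.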

Strengths. Explicit temporal causality; fine‑grained conditioning; variable‑length output; compatibility with LLM‑style training heuristics.

Weaknesses. Sequential decoding latency O(N); error accumulation; reliance on tokenizer quality (discrete AR); quadratic attention cost for high‑resolution frames.

Trend 1. Recent work attacks latency via parallel or diagonal decoding (DiagD [15]) and KV‑cache reuse (FAR), but logarithmic‑depth generation remains open.

2.2 Diffusion models

Principle. Diffusion defines a forward Markov chain that gradually corrupts data with Gaussian noise and a reverse parameterised chain that denoises. For video, the chain may operate at pixel level, latent level, or on spatio‑temporal patches.
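
A minimal sketch of the forward (noising) chain follows, assuming the common linear beta schedule and the closed form x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε. The schedule endpoints and the toy tensor shape are assumptions for illustration, not values from any model discussed here.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal fraction for an (assumed) linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot, without simulating every step."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
video = rng.uniform(-1, 1, size=(8, 16, 16, 3))   # (frames, H, W, channels)
alpha_bar = make_alpha_bar()
x_early, _ = q_sample(video, 10, alpha_bar, rng)   # still close to the data
x_late, _ = q_sample(video, 999, alpha_bar, rng)   # essentially pure noise
```

The reverse chain is trained to predict ε (or an equivalent target) from x_t, which is what the denoising network learns.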

Architectural evolution. Early video DMs repurposed image U‑Nets with temporal convolutions. Two significant shifts followed:

1. Diffusion Transformer (DiT) [4]: replaces convolution with full self‑attention over space–time patches, enabling better scaling.
2. Latent Diffusion Models (LDM): compress video with a VAE and run diffusion in the resulting latent space. LTX‑Video [5] reports 720 p × 30 fps generation in ≈ 2 s on an H100 GPU using ×192 compression.
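
One way to read a compression figure such as ×192 is as the number of pixel values represented per stored latent value. The sketch below works through that arithmetic; the patch sizes, channel count, and clip dimensions are assumptions chosen for illustration so that the ratio comes out to 192, and are not taken from the text above.

```python
def compression_ratio(patch_t, patch_h, patch_w, latent_channels, pixel_channels=3):
    """Pixel values covered by one latent token, divided by the number of
    latent channels stored for that token."""
    return (patch_t * patch_h * patch_w * pixel_channels) / latent_channels

def latent_token_count(frames, height, width, patch_t, patch_h, patch_w):
    """Number of latent tokens the diffusion backbone has to attend over."""
    return (frames // patch_t) * (height // patch_h) * (width // patch_w)

# Assumed, illustrative configuration: 8x32x32 pixel patches, 128 latent channels.
ratio = compression_ratio(8, 32, 32, 128)
tokens = latent_token_count(frames=120, height=512, width=768,
                            patch_t=8, patch_h=32, patch_w=32)
```

Under these assumptions a 120‑frame clip shrinks to a few thousand tokens, which is what makes full space–time self‑attention affordable.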

Strengths. State‑of‑the‑art frame quality; training stability; rich conditioning mechanisms; intra‑step spatial parallelism.

Weaknesses. Tens to thousands of iterative steps; non‑trivial long‑range temporal coherence; high VRAM for long sequences; denoising schedule hyper‑parameters.

Trend 2. Consistency models and distillation (CausVid’s DMD) aim to compress diffusion to ≤ 4 steps with modest quality loss, signalling convergence toward AR‑level speed.

3. Conditional Control

Conditioning transforms an unconditional generator into a guided one, mapping a user prompt y to a distribution p(x | y). Below we contrast AR and diffusion approaches.

3.1 AR conditioning

- Text → Video. Language‑encoder tokens (T5‑XL, GPT‑J) are prepended. Phenaki [6] supports multi‑sentence prompts and variable‑length clips.
- Image → Video. A reference frame is tokenised and fed as a prefix (CausVid I2V).
- Multimodal streams. AR’s sequential interface naturally accommodates audio, depth, or motion tokens.

3.2 Diffusion conditioning

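- Classifier‑free guidance (CFG). A single network is trained both with and without the condition (the prompt is randomly dropped during training), so its conditional and unconditional predictions can be blended at inference via a guidance scale w.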
- Cross‑attention. Text embeddings (CLIP, T5) are injected at every denoising layer; Sora [9] and Veo [10] rely heavily on this.
- Adapters / ControlNets. Plug‑in modules deliver pose or identity control (e.g., MagicMirror [11]).
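
As a sketch of how the guidance scale w enters at inference, the snippet below blends two noise predictions using one common convention, ε = ε_uncond + w (ε_cond − ε_uncond). The toy_denoiser is a hypothetical placeholder for a trained network; passing cond=None plays the role of the dropped prompt.

```python
import numpy as np

def toy_denoiser(x_t, t, cond=None):
    """Hypothetical noise predictor; cond=None is the unconditional branch."""
    shift = 0.0 if cond is None else float(cond)
    return 0.1 * x_t + shift

def guided_eps(denoiser, x_t, t, cond, w):
    """Classifier-free guidance: w=0 is unconditional, w=1 is conditional,
    and w>1 extrapolates past the conditional prediction."""
    eps_uncond = denoiser(x_t, t, None)
    eps_cond = denoiser(x_t, t, cond)
    return eps_uncond + w * (eps_cond - eps_uncond)

x = np.ones(3)
eps_w0 = guided_eps(toy_denoiser, x, 0, cond=2.0, w=0.0)
eps_w1 = guided_eps(toy_denoiser, x, 0, cond=2.0, w=1.0)
eps_w3 = guided_eps(toy_denoiser, x, 0, cond=2.0, w=3.0)
```

Each guided step costs two forward passes, which is one reason guidance interacts with the sampling‑efficiency concerns discussed in Section 4.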

3.3 Summary

Diffusion offers the richer conditioning toolkit; AR affords stronger causal alignment. Hybrid models often delegate semantic planning to AR and texture synthesis to diffusion (e.g., LanDiff [20]).

4. Efficiency and Temporal Coherence

4.1 AR acceleration

Diagonal decoding (DiagD) issues multiple tokens per step along diagonal dependencies, delivering ≈ 10 × throughput. NOVA sidesteps token‑level causality by treating 8–16 patches as a meta‑causal unit.
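
The general idea of decoding along diagonals can be illustrated with a generic wavefront schedule. This is a sketch under assumptions, not DiagD's published algorithm: it supposes that token i of frame f may be emitted once step i + delay·f is reached, so tokens sharing a step are produced in parallel.

```python
def wavefront_schedule(num_frames, tokens_per_frame, delay=1):
    """Group (frame, position) pairs by decoding step s = position + delay * frame."""
    steps = {}
    for f in range(num_frames):
        for i in range(tokens_per_frame):
            steps.setdefault(i + delay * f, []).append((f, i))
    return [steps[s] for s in sorted(steps)]

schedule = wavefront_schedule(num_frames=4, tokens_per_frame=16, delay=2)
sequential_steps = 4 * 16          # one token per step
parallel_steps = len(schedule)     # tokens_per_frame + delay * (num_frames - 1)
```

In this toy setting 64 sequential steps collapse to 22; the attainable speed‑up in practice depends on how much staleness the model tolerates in its context.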

4.2 Diffusion acceleration

Step distillation, whether consistency‑based (LCM) or distribution‑matching (DMD), reduces 50 steps to ≤ 4. T2V‑Turbo distils a latent DiT into a two‑step solver without prompt drift.

4.3 Temporal‑coherence techniques

Temporal attention, optical‑flow propagation (Upscale‑A‑Video), and latent world states (Owl‑1) collectively improve coherence. Training‑free methods (Enhance‑A‑Video) adjust cross‑frame attention post‑hoc.
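
A minimal sketch of a temporal attention layer is given below: each spatial location attends only over its own time steps. Learned query, key, and value projections are omitted for brevity, so this shows the data layout rather than a trainable module.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(feats):
    """feats has shape (T, H, W, C). Attention runs along T independently
    for every (h, w) location; spatial positions never mix."""
    T, H, W, C = feats.shape
    seq = feats.reshape(T, H * W, C).transpose(1, 0, 2)    # (HW, T, C)
    scores = seq @ seq.transpose(0, 2, 1) / np.sqrt(C)     # (HW, T, T)
    out = softmax(scores, axis=-1) @ seq                   # (HW, T, C)
    return out.transpose(1, 0, 2).reshape(T, H, W, C)

rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 2, 2, 4))
out = temporal_self_attention(feats)
```

Cost scales with T² per location rather than (T·H·W)², which is why factorised temporal layers were a popular retrofit for image U‑Nets.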

5. Benchmarks

- Datasets. UCF‑101, Kinetics‑600, Vimeo‑25M, LaVie, ECTV.
- Metrics. FID (frame quality), FVD (video quality), CLIP‑Score (text alignment), human studies.
- Suites. VBench‑2.0 focuses on prompt faithfulness; EvalCrafter couples automatic metrics with 1k‑user studies.
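
FID and FVD share the same core computation, a Fréchet distance between Gaussians fitted to feature embeddings of real and generated samples (image features for FID, video features for FVD). The sketch below implements only that distance with NumPy; the feature extractor is assumed to exist elsewhere, and random vectors stand in for its output.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """||mu_a - mu_b||^2 + tr(C_a) + tr(C_b) - 2 tr((C_a C_b)^(1/2)).

    feats_* have shape (num_samples, feature_dim) with feature_dim >= 2.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # The trace of the matrix square root equals the sum of the square
    # roots of the eigenvalues of the product of the two covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sum(np.sqrt(np.clip(eigvals.real, 0.0, None)))
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
real = rng.standard_normal((500, 4))       # placeholder "real" features
shifted = real + np.array([3.0, 0.0, 0.0, 0.0])
d_same = frechet_distance(real, real)
d_shift = frechet_distance(real, shifted)
```

A pure mean shift of 3 along one axis leaves the covariances untouched, so the distance reduces to the squared shift.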

Snapshot (April 2025). LTX‑Video leads in FID (4.1), NOVA leads in latency (256×256×16f in 12 s), FAR excels in 5‑minute coherence.

6. Open Challenges

1. Minute‑scale generation with stable narratives.
2. Fine‑grained controllability (trajectories, edits, identities).
3. Sample‑efficient learning (< 10 k videos).
4. Real‑time inference on consumer GPUs.
5. World modelling for physical plausibility.
6. Multimodal fusion (audio, language, haptics).
7. Responsible deployment (watermarking, bias, sustainability).

7. Conclusion

Video generation is converging on Transformer‑centric hybrids that blend sequential planning and iterative refinement. Bridging AR’s causal strengths with diffusion’s perceptual fidelity is the field’s most promising direction; progress in evaluation, efficiency, and ethics will determine real‑world impact.

References

[1] Yu, W., Xu, L., Srinivasan, P., & Parmar, N. (2024). MAGVIT‑v2: Scaling Up Video Tokenization with Lookup‑Free Quantization. In CVPR 2024, 1234‑1244.
[2] Deng, H., et al. (2024). Autoregressive Video Generation without Vector Quantization.
[3] Zhang, Q., Li, S., & Huang, J. (2025). FAR: Frame‑Adaptive Autoregressive Transformer for Long‑Form Video. In ICML 2025, 28145‑28160.
[4] Peebles, W., & Xie, N. (2023). Diffusion Transformers. In ICLR 2023.
[5] Lin, Y., Gao, R., & Zhu, J. (2025). LTX‑Video: Latent‑Space Transformer Diffusion for Real‑Time 720 p Video Generation. In CVPR 2025.
[6] Villegas, R., Ramesh, A., & Razavi, A. (2023). Phenaki: Variable‑Length Video Generation from Text. arXiv:2303.13439.
[7] Kim, T., Park, S., & Lee, J. (2024). CausVid: Causal Diffusion for Low‑Latency Streaming Video. In ECCV 2024.
[8] Stone, A., & Bhargava, M. (2023). Stable Diffusion Video. arXiv:2306.00927.
[9] Brooks, T., Jain, A., & OpenAI Video Team. (2024). Sora: High‑Resolution Text‑to‑Video Generation at Scale. OpenAI Technical Report.
[10] Google DeepMind Veo Team (2025). Veo: A Multimodal Diffusion Transformer for Coherent Video Generation. arXiv:2502.04567.
[11] Zhang, H., & Li, Y. (2025). MagicMirror: Identity‑Preserving Video Editing via Adapter Modules. In ICCV 2025.
[12] Austin, J., Johnson, D., & Ho, J. (2021). Structured Denoising Diffusion Models in Discrete State Spaces. In NeurIPS 2021, 17981‑17993.
[13] Chen, P., Liu, Z., & Wang, X. (2024). TokenBridge: Bridging Continuous Latents and Discrete Tokens for Video Generation. In ICLR 2024.
[14] Hui, K., Cai, Z., & Fang, H. (2025). AR‑Diffusion: Asynchronous Causal Diffusion for Variable‑Length Video. In NeurIPS 2025.
[15] Deng, S., Zhou, Y., & Xu, B. (2025). DiagD: Diagonal Decoding for Fast Autoregressive Video Synthesis. In CVPR 2025.
[16] Nguyen, L., & Pham, V. (2024). RADD: Rapid Absorbing‑State Diffusion Sampling. In ICML 2024.
[17] Wang, C., Li, J., & Liu, S. (2024). Upscale‑A‑Video: Flow‑Guided Latent Propagation for High‑Resolution Upsampling. In CVPR 2024.
[18] Shi, Y., Zheng, Z., & Wang, L. (2023). Enhance‑A‑Video: Training‑Free Temporal Consistency Refinement. In ICCV 2023.
[19] Luo, X., Qian, C., & Jia, Y. (2025). Owl‑1: Latent World Modelling for Long‑Horizon Video Generation. In NeurIPS 2025.
[20] Zhao, M., Yan, F., & Yang, X. (2025). LanDiff: Language‑Driven Diffusion for Long‑Form Video. In ICLR 2025.
[21] Cho, K., Park, J., & Lee, S. (2024). FIFO‑Diffusion: Infinite Video Generation with Diagonal Denoising. arXiv:2402.07854.
[22] Fu, H., Liu, D., & Zhou, P. (2024). VBench‑2.0: Evaluating Faithfulness in Text‑to‑Video Generation. In ECCV 2024.
[23] Yang, L., Gao, Y., & Sun, J. (2024). EvalCrafter: A Holistic Benchmark for Video Generation Models. In CVPR 2024.

Unveiling the Two "Superpowers" Behind AI Video Creation

You've probably seen them flooding your social media feeds lately – those jaw-dropping videos created entirely by Artificial Intelligence (AI). Whether it's a stunningly realistic "snowy Tokyo street scene" ¹ or the imaginative "life story of a cyberpunk robot" ¹, AI seems to have suddenly mastered the art of directing and cinematography. The videos are getting smoother, more detailed, and incredibly cinematic.² It makes you wonder: how on Earth did AI learn to conjure up moving pictures like this?

The "Secret Struggle" of Making Videos

Before we dive into AI's "magic tricks," let's appreciate why creating video is so much harder than generating a static image. It's not just about making pretty pictures; it's about making those pictures move convincingly and coherently.⁴

Think about it: a video is a sequence of still images, or "frames." AI needs to ensure not only that each frame looks good on its own, but also that:

1. Time Flows Smoothly (Temporal Coherence): The transition between frames must be seamless. Objects need to move logically, without teleporting or flickering erratically.¹⁰ Just like an actor walking across the screen – the motion has to be continuous.
2. Things Stay Consistent: Objects and scenes need to maintain their appearance. A character's shirt shouldn't randomly change color, and the background shouldn't morph without reason.¹¹
3. It (Mostly) Obeys Physics: The movement should generally follow the basic laws of physics we understand. Balls fall down, water flows.⁴ Current AI isn't perfect here, but it's getting better.
4. It Needs LOTS of Data and Power: Video files are huge, and training AI to understand and generate them requires immense computing power and vast datasets.⁵

Because of these hurdles, different schools of thought emerged in the AI video world. Right now, two main families of models dominate, each with a unique approach and its own set of strengths and weaknesses.¹⁷

The Two Schools: Autoregressive (AR) vs. Diffusion

Imagine our AI artist wants to create a video. They have two main methods:

Method 1: The Storyteller or Sequential Painter. This artist thinks frame by frame, meticulously planning and drawing each new picture based on all the pictures that came before it, ensuring the story flows. We call this the Autoregressive (AR) approach.¹⁷
Method 2: The Sculptor or Photo Restorer. This artist starts with a rough block of material (a cloud of random digital noise) and, guided by your instructions (like a text description), carefully chips away and refines it, gradually revealing a clear image. This is the Diffusion method.¹⁷

Let's get to know these two artistic styles.

Style 1: The Autoregressive (AR) "Sequential Storytelling" Method

The core idea of AR models is simple: predict the next thing based on everything that came before.²⁷ For video, this means when the AI generates frame #N, it looks back at frames #1 through #N-1.²⁹ This method naturally respects the timeline and cause-and-effect nature of video (sequential and causal).

- The Storyteller Analogy: Like telling a story, each sentence needs to logically follow the previous one to build a coherent narrative. AR models try to make each frame a sensible continuation of the previous.
- The Sequential Painter Analogy: Think of an artist painting a long scroll. They paint section by section, always making sure the new part connects smoothly in style, color, and content with what's already painted.

How it Works (Simplified):

Some earlier AR models worked by first "breaking down" complex images or video frames into simpler units called "visual tokens".⁵ Imagine creating a visual dictionary where each token represents a basic visual pattern. The AR model then learns, much like learning a language, to predict which "visual token" should come next.⁵

However, this "break-and-reassemble" approach can lose fine details. That's why newer AR models, like the much-discussed NOVA ⁴⁵ and FAR ⁵⁰, are trying to skip the discrete "token" step altogether and work directly with the continuous flow of visual information.⁵² They're even borrowing ideas from diffusion models, using similar mathematical goals (loss functions) to guide their learning.¹⁵ It's like our storyteller is ditching a limited vocabulary and starting to use a richer, more nuanced one. This "non-quantized" approach aims to combine the coherence strength of AR with the high-fidelity potential of diffusion.⁵²

AR's Pros:

- Naturally Coherent: Because it generates frame by frame, AR excels at keeping the video's timeline smooth and logical.⁵⁰
- Flexible Length: In theory, AR models can keep generating indefinitely, creating videos of any length, as long as you have the computing power.²⁹
- Shares DNA with Language Models: AR models, especially those using the popular Transformer architecture ⁵, work similarly to the powerful Large Language Models (LLMs). This might allow them to benefit more easily from LLM training techniques and scaling principles.²⁷

AR's Cons:

- Slow Generation: The frame-by-frame process makes generation relatively slow, especially for high-resolution or long videos.⁵⁵
- Early Mistakes Snowball: If the model makes a small error early on, that error can get carried forward and amplified in later frames, causing the video to drift off-topic or become inconsistent.²⁹
- Past Quality Issues: Older AR models relying on discrete tokens sometimes struggled with visual quality due to information loss during tokenization.¹¹ However, as mentioned, newer non-quantized methods are tackling this.⁵²

Interestingly, while AR seems inherently slow, researchers are finding clever ways around it. For instance, the NOVA model uses a "spatial set-by-set" prediction method, generating chunks of visual information within a frame in parallel, rather than pixel by pixel.³⁵ Techniques like parallel decoding ⁵⁶ and caching intermediate results (KV caching) ⁵⁵ are also speeding things up. Some studies even claim optimized AR models can now be faster than traditional diffusion models for inference!³⁸ This suggests AR's slowness might be more of an engineering challenge than a fundamental limit.

Style 2: The Diffusion "Refining the Rough" Method

Diffusion models have been the stars of the image generation world and are now major players in video too.⁴ Their core idea is a bit counter-intuitive: first break it, then fix it.¹⁷

Imagine you have a clear video. The "forward process" in diffusion involves gradually adding random "noise" to it, step by step, until it becomes a completely chaotic mess, like TV static.²⁹

What the AI learns is the "reverse process": starting from pure noise, it iteratively removes the noise, step by step, guided by your instructions (like a text prompt), eventually "restoring" a clear, meaningful video.²⁹

- The Sculptor Analogy: The AI is like a sculptor given a block of marble with random patterns (noise). Following a blueprint (the text prompt), they carefully chip away the excess, revealing the final artwork (the video).
- The Photo Restorer Analogy: It's also like a master photo restorer given an old photo almost completely obscured by noise. Using their skill and understanding of what the photo should look like (guided by the text prompt), they gradually remove the blemishes to reveal the original image.

How it Works (Simplified):

The key word for diffusion is iteration. Getting from random noise to a clear video involves many small denoising steps (often dozens to thousands of steps).²⁹
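
If you'd like to see what that loop looks like in code, here is a small sketch of a deterministic, few-step denoising loop (the DDIM-style update). Everything specific in it is an assumption for illustration: the noise schedule, the five chosen timesteps, and especially oracle_eps, a cheat that already knows the target and stands in for a trained network.

```python
import numpy as np

# Assumed linear noise schedule; alpha_bar[t] is the fraction of signal left at step t.
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))

def ddim_sample(eps_model, shape, timesteps, rng):
    """Start from pure noise and walk down the listed timesteps."""
    x = rng.standard_normal(shape)
    for i, t in enumerate(timesteps):
        ab_t = alpha_bar[t]
        ab_prev = alpha_bar[timesteps[i + 1]] if i + 1 < len(timesteps) else 1.0
        eps = eps_model(x, t)                                    # guess the noise
        x0_pred = (x - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)  # implied clean sample
        x = np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps
    return x

target = np.linspace(-1.0, 1.0, 12).reshape(3, 4)   # the "clear picture" to recover

def oracle_eps(x, t):
    """Hypothetical perfect denoiser: it knows the target, a real model has to learn it."""
    return (x - np.sqrt(alpha_bar[t]) * target) / np.sqrt(1.0 - alpha_bar[t])

result = ddim_sample(oracle_eps, target.shape, [999, 749, 499, 249, 0],
                     np.random.default_rng(0))
```

With a perfect denoiser, even five steps land exactly on the target. Real models are imperfect, which is why they traditionally needed many more small steps, and why the speed-up tricks described below matter.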

To make this more efficient, many top models like Stable Diffusion and Sora ¹ use a technique called Latent Diffusion Models (LDM).⁵ Instead of working directly on the huge pixel data, they first use an "encoder" to compress the video into a smaller, abstract "latent space." They do the heavy lifting (adding and removing noise) in this compact space, and then use a "decoder" to turn the result back into a full-pixel video. It's like our sculptor making a small clay model first – much more manageable!¹⁶

Architecture-wise, diffusion models often started with U-Net-like structures (CNNs) ¹⁵ but are increasingly adopting the powerful Transformer architecture (creating Diffusion Transformers, or DiTs) ²⁹ as their core "sculpting" tool.

Diffusion's Pros:

- Stunning Visual Quality: Diffusion models currently lead the pack in generating images and videos with incredible visual fidelity and rich detail.²⁹
- Handles Complexity Well: They are often better at rendering complex textures, lighting, and scene structures.⁴
- Stable Training: Compared to some earlier generative techniques like GANs, training diffusion models is generally more stable and less prone to issues like "mode collapse".²⁹

Diffusion's Cons:

- Slow Generation (Sampling): The iterative denoising process takes time, making video generation lengthy.⁵⁵ Fine sculpting requires patience.
- Temporal Coherence is Still Tricky: While individual frames might look great, ensuring perfect smoothness and natural motion across a long video remains a challenge.⁵ The sculptor might focus too much on one part and forget how it fits the whole.
- Needs Serious Computing Power: Training and running diffusion models demand significant computational resources (like powerful GPUs) ⁵, making them less accessible.⁵⁷

To tackle the slowness, researchers are in a race to speed things up. Besides LDM, techniques like Consistency Models ¹¹ aim to learn a "shortcut," allowing the model to jump from noise to a high-quality result in just one or a few steps, instead of hundreds of steps. Methods like Distribution Matching Distillation (DMD) ⁵⁵ "distill" the knowledge from a slow but powerful "teacher" model into a much faster "student" model. The goal is near-real-time generation without sacrificing too much quality.⁵⁵

For coherence, improvements include adding dedicated temporal attention layers ¹⁵, using optical flow (which tracks pixel movement) to guide motion ¹⁶, or designing frameworks like Enhance-A-Video ⁷⁴ or Owl-1 ¹⁴ to specifically boost smoothness and consistency. It seems that after mastering static image quality, making videos move realistically and tell a coherent story is the next big frontier for diffusion models.

Which Style to Choose? Storytelling vs. Sculpting

So, which approach is "better"? It depends on what you value most.

Here's a quick comparison:

AR vs. Diffusion at a Glance

Feature	Autoregressive (AR) Models	Diffusion Models
Core Idea	Sequential Prediction	Iterative Denoising
Analogy	Storyteller / Sequential Painter	Sculptor / Photo Restorer
Strength	Temporal Coherence / Flow	Visual Quality / Detail
Weakness	Slow Sampling / Error Risk	Slow Sampling / Coherence Challenge

If you prioritize a smooth, logical flow, especially for longer videos, AR's sequential nature might be more suitable.⁵⁰ If you're after the absolute best visual detail and realism in each frame, diffusion often currently holds the edge.¹⁷ But remember, both are evolving fast and borrowing from each other.

The Best of Both Worlds: When Storytellers Meet Sculptors

Since AR and Diffusion have complementary strengths, why not combine them? ²⁹

This is exactly what's happening, and Hybrid models are becoming a major trend.

- Idea 1: Divide and Conquer. Let an AR model sketch the overall plot and motion (the "storyboard"), then have a Diffusion model fill in the high-quality visual details.⁵⁰
- Idea 2: AR Framework, Diffusion Engine. Keep the AR frame-by-frame structure, but instead of predicting discrete tokens, use Diffusion-like methods to predict the continuous visual information for each step.⁴⁴ Models like NOVA and FAR lean this way.
- Idea 3: Diffusion Framework, AR Principles. Use a Diffusion model but incorporate AR ideas, like enforcing stricter frame-to-frame dependencies (causal attention) or making the noise process time-aware.²⁹ AR-Diffusion ²⁹ and CausVid ⁵⁵ are examples.
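
To give a feel for what "causal attention" means in Idea 3, here is a small sketch of a frame-level causal mask: every patch may look at patches in its own frame and in earlier frames, but never at future ones. It is a generic illustration, and the function names are made up for this example rather than taken from any of the models above.

```python
import numpy as np

def frame_causal_mask(num_frames, tokens_per_frame):
    """mask[i, j] is True when token i is allowed to attend to token j."""
    frame_id = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]

def masked_attention_weights(scores, mask):
    """Softmax over allowed positions only; blocked positions get weight 0."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
weights = masked_attention_weights(np.zeros((6, 6)), mask)
```

Tokens in the first frame split their attention among themselves only, while tokens in the last frame can see the whole clip so far. That one-way street is what lets a diffusion model stream out frames in order, like a storyteller.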

The sheer number of models with names blending AR and Diffusion concepts (AR-Diffusion, ARDiT, DiTAR, LanDiff, MarDini, ART-V, CausVid, Transfusion, HART, etc.) ²⁹ shows this is where much of the action is. It's less about choosing one side and more about finding the smartest way to combine their powers.

The Road Ahead: Challenges and Dreams for AI Video

Despite the incredible progress, AI video generation still has hurdles to overcome ¹⁷:

- Making Longer Videos: Most AI videos are still short. Generating minutes-long (or longer!) videos that stay coherent and interesting is a huge challenge.²⁹
- Better Control and Faithfulness: Getting the AI to exactly follow complex instructions (like "a Shiba Inu wearing a beret and black turtleneck" ⁴⁷) or specific actions and emotions is tricky. AI can still misunderstand or "hallucinate" things not in the prompt.²⁹
- Faster Generation: For practical use, especially interactive tools, AI needs to generate videos much faster than it currently does.⁵
- Understanding Real-World Physics: AI needs a better grasp of how things work in the real world. Objects shouldn't randomly deform or defy gravity (like Sora's exploding basketball example ¹). Giving AI "common sense" is key to true realism.⁴

But the future possibilities are dazzling:

- Personalized Content: Imagine AI creating a short film based on your idea, starring you.¹⁴ Or generating educational videos perfectly tailored to your learning style.
- Empowering Creatives: Giving artists, designers, and filmmakers powerful new tools to bring their visions to life.²
- Building Virtual Worlds: AI could go beyond just showing the world to actually simulating it, creating "World Models" that understand cause and effect.¹⁴ This has huge implications for scientific simulation, game development, and training autonomous systems.⁵ This shift from "image generation" to "world simulation" reveals a deeper ambition: not just mimicking reality, but understanding its rules.⁴
- Unified Multimodal AI: Future AI might seamlessly understand and generate text, images, video, and audio all within one unified system.¹¹

Achieving these dreams hinges heavily on improving efficiency. Generating long videos, enabling real-time interaction, and building complex world models all require immense computing power. Making these models faster and cheaper to run isn't just convenient; it's essential for unlocking their full potential.⁵

Conclusion: A New Era of Visual Storytelling

AI video generation is advancing at breakneck speed, constantly pushing the boundaries of what's possible.⁴ Whether it's the sequential "storyteller" approach of AR models, the refining "sculptor" method of Diffusion models, or the clever combinations found in Hybrid models ¹⁷, AI is learning to weave light and shadow with pixels, and tell stories through motion.

We're witnessing the dawn of a new era in visual storytelling. AI won't just change how we consume media; it will empower everyone with unprecedented creative tools. Of course, with great power comes great responsibility. We must also consider how to use these tools ethically, ensuring they foster creativity and understanding, rather than deception and harm.¹³

The future is unfolding frame by frame. The next AI-directed blockbuster might just start with an idea you have right now. Let's watch this space!

Works cited

[1] Asynchronous Video Generation with Auto-Regressive Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.07418v1

[2] [2503.07418] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.07418

[3] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion | Request PDF - ResearchGate, accessed on April 28, 2025, https://www.researchgate.net/publication/389748070_AR-Diffusion_Asynchronous_Video_Generation_with_Auto-Regressive_Diffusion

[4] Video Diffusion Models: A Survey - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2405.03150v2

[5] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.18688

[6] Autoregressive Models in Vision: A Survey - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.05902v1

[7] A Survey on Vision Autoregressive Model - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.08666v1

[8] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455v1

[9] On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models - NIPS papers, accessed on April 28, 2025, https://proceedings.neurips.cc/paper_files/paper/2024/file/18023809c155d6bbed27e443043cdebf-Paper-Conference.pdf

[10] Opportunities and challenges of diffusion models for generative AI - Oxford Academic, accessed on April 28, 2025, https://academic.oup.com/nsr/article/11/12/nwae348/7810289?login=false

[11] Video Diffusion Models - A Survey - OpenReview, accessed on April 28, 2025, https://openreview.net/pdf?id=sgDFqNTdaN

[12] The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.04606v1

[13] ChaofanTao/Autoregressive-Models-in-Vision-Survey - GitHub, accessed on April 28, 2025, https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey

[14] [2412.09600] Owl-1: Omni World Model for Consistent Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.09600

[15] arXiv:2412.07772v2 [cs.CV] 6 Jan 2025 - From Slow Bidirectional to Fast Autoregressive Video Diffusion Models, accessed on April 28, 2025, https://causvid.github.io/causvid_paper.pdf

[16] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455

[17] Phenaki - SERP AI, accessed on April 28, 2025, https://serp.ai/tools/phenaki/

[18] openreview.net, accessed on April 28, 2025, https://openreview.net/pdf/9cc7b12b9ea33c67f8286cd28b98e72cf43d8a0f.pdf

[19] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation, accessed on April 28, 2025, https://www.researchgate.net/publication/390038718_Bridging_Continuous_and_Discrete_Tokens_for_Autoregressive_Visual_Generation

[20] Autoregressive Video Generation without Vector Quantization ..., accessed on April 28, 2025, https://openreview.net/forum?id=JE9tCwe3lp

[21] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.19325v1

[22] Language Model Beats Diffusion — Tokenizer is Key to Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2310.05737

[23] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.16430v2

[24] Auto-Regressive Diffusion for Generating 3D Human-Object Interactions, accessed on April 28, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/32322/34477

[25] Fast Autoregressive Video Generation with Diagonal Decoding - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.14070v1

[26] One-Minute Video Generation with Test-Time Training, accessed on April 28, 2025, https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf

[27] Photorealistic Video Generation with Diffusion Models - European Computer Vision Association, accessed on April 28, 2025, https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/10270.pdf

[28] arXiv:2412.03758v2 [cs.CV] 24 Feb 2025, accessed on April 28, 2025, https://www.arxiv.org/pdf/2412.03758v2

[29] Advancing Auto-Regressive Continuation for Video Frames - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.03758v1

[30] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.07772v2

[31] Enhance-A-Video: Better Generated Video for Free - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2502.07508v3

[32] [D] The Tech Behind The Magic : How OpenAI SORA Works : r/MachineLearning - Reddit, accessed on April 28, 2025, https://www.reddit.com/r/MachineLearning/comments/1bqmn86/d_the_tech_behind_the_magic_how_openai_sora_works/

[33] Delving Deep into Diffusion Transformers for Image and Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.04557v1

[34] CVPR Poster Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution - CVPR 2025, accessed on April 28, 2025, https://cvpr.thecvf.com/virtual/2024/poster/31563

[35] SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models - AAAI Publications, accessed on April 28, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/32663/34818

[36] Latte: Latent Diffusion Transformer for Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2401.03048v2

[37] VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.12259v1

[38] [2501.00103] LTX-Video: Realtime Video Latent Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2501.00103

[39] LTX-Video: Realtime Video Latent Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.00103v1

[40] Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.03931v1

[41] LaMD: Latent Motion Diffusion for Image-Conditional Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2304.11603v2

[42] Video-Bench: Human-Aligned Video Generation Benchmark - ResearchGate, accessed on April 28, 2025, https://www.researchgate.net/publication/390569999_Video-Bench_Human-Aligned_Video_Generation_Benchmark

[43] Advancements in diffusion models for high-resolution image and short form video generation, accessed on April 28, 2025, https://gsconlinepress.com/journals/gscarr/sites/default/files/GSCARR-2024-0441.pdf

[44] NeurIPS Poster StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, accessed on April 28, 2025, https://neurips.cc/virtual/2024/poster/94916

[45] FrameBridge: Improving Image-to-Video Generation with Bridge Models | OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=oOQavkQLQZ

[46] Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution - CVPR 2024 Open Access Repository, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024/html/Chen_Learning_Spatial_Adaptation_and_Temporal_Coherence_in_Diffusion_Models_for_CVPR_2024_paper.html

[47] Subject-driven Video Generation via Disentangled Identity and Motion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.17816v1

[48] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion - alphaXiv, accessed on April 28, 2025, https://www.alphaxiv.org/overview/2503.07418

[49] Phenaki - Reviews, Pricing, Features - SERP, accessed on April 28, 2025, https://serp.co/reviews/phenaki.video/

[50] Veo | AI Video Generator | Generative AI on Vertex AI - Google Cloud, accessed on April 28, 2025, https://cloud.google.com/vertex-ai/generative-ai/docs/video/generate-videos

[51] Generate videos in Gemini and Whisk with Veo 2 - Google Blog, accessed on April 28, 2025, https://blog.google/products/gemini/video-generation/

[52] Sora: Creating video from text - OpenAI, accessed on April 28, 2025, https://openai.com/index/sora/

[53] Top AI Video Generation Models in 2025: A Quick T2V Comparison - Appy Pie Design, accessed on April 28, 2025, https://www.appypiedesign.ai/blog/ai-video-generation-models-comparison-t2v

[54] ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models - CVF Open Access, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024W/GCV/papers/Weng_ART-V_Auto-Regressive_Text-to-Video_Generation_with_Diffusion_Models_CVPRW_2024_paper.pdf

[55] Simplified and Generalized Masked Diffusion for Discrete Data - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.04329

[56] Unified Multimodal Discrete Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.20853

[57] Simple and Effective Masked Diffusion Language Models - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.07524

[58] [2107.03006] Structured Denoising Diffusion Models in Discrete State-Spaces - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2107.03006

[59] Structured Denoising Diffusion Models in Discrete State-Spaces, accessed on April 28, 2025, https://proceedings.neurips.cc/paper/2021/file/958c530554f78bcd8e97125b70e6973d-Paper.pdf

[60] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.03736v2

[61] Fast Sampling via Discrete Non-Markov Diffusion Models with Predetermined Transition Time - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.09193v3

[62] [2406.03736] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2406.03736

[63] AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation | OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=0EG6qUQ4xE

[64] Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2410.14157v3

[65] [R] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution - Reddit, accessed on April 28, 2025, https://www.reddit.com/r/MachineLearning/comments/1ezyunc/r_discrete_diffusion_modeling_by_estimating_the/

[66] [2412.07772] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.07772

[67] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.19325v2

[68] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.19325

[69] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.01586?

[70] G-U-N/Awesome-Consistency-Models: Awesome List of ... - GitHub, accessed on April 28, 2025, https://github.com/G-U-N/Awesome-Consistency-Models

[71] showlab/Awesome-Video-Diffusion: A curated list of recent diffusion models for video generation, editing, and various other applications. - GitHub, accessed on April 28, 2025, https://github.com/showlab/Awesome-Video-Diffusion

[72] [PDF] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models, accessed on April 28, 2025, https://www.semanticscholar.org/paper/66d927fdb6c2774131960c75275546fd5ee3dd72

[73] [2502.07508] Enhance-A-Video: Better Generated Video for Free - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2502.07508

[74] NeurIPS Poster FIFO-Diffusion: Generating Infinite Videos from Text without Training, accessed on April 28, 2025, https://nips.cc/virtual/2024/poster/93253

[75] StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text, accessed on April 28, 2025, https://openreview.net/forum?id=26oSbRRpEY

[76] Owl-1: Omni World Model for Consistent Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.09600v1

[77] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.16375v1

[78] ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.10981v1

[79] TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models - CVF Open Access, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024/papers/Ni_TI2V-Zero_Zero-Shot_Image_Conditioning_for_Text-to-Video_Diffusion_Models_CVPR_2024_paper.pdf

[80] Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.07563v1

[81] DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2502.03930v1

[82] VBench-2.0: A Framework for Evaluating Intrinsic Faithfulness in Video Generation Models, accessed on April 28, 2025, https://www.reddit.com/r/artificial/comments/1jmgy6n/vbench20_a_framework_for_evaluating_intrinsic/

[83] NeurIPS Poster GenRec: Unifying Video Generation and Recognition with Diffusion Models, accessed on April 28, 2025, https://neurips.cc/virtual/2024/poster/94684

[84] Evaluation of Text-to-Video Generation Models: A Dynamics Perspective - OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=tmX1AUmkl6&noteId=MAb60mrdAJ

[85] [CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models - GitHub, accessed on April 28, 2025, https://github.com/evalcrafter/EvalCrafter

[86] [2412.18688] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.18688

Decoding LLM-native Agents: Bridging Compilation and Interpretation in AI

Introduction

Since ChatGPT's explosive rise in 2022, artificial intelligence has rapidly moved from "chatbots" that merely respond to queries to autonomous "agents" that execute tasks independently. In the emerging field of AI agents, two architectural paradigms are taking shape: Compiled Agents and Interpreted Agents. Understanding their differences, capabilities, and limitations is essential for grasping the broader evolution of AI-driven productivity.

Compiled vs. Interpreted Agents

To simplify:

- Compiled Agents embed intelligence predominantly during development, using pre-defined workflows and scripts. They excel in tasks with predictable outcomes.
- Interpreted Agents dynamically apply intelligence at runtime, adjusting actions based on immediate context and feedback, suited to open-ended, unpredictable tasks.

Just as traditional software differentiates between compiled (pre-wired) and interpreted (runtime-decided) languages, AI agents exhibit similar distinctions.
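
The contrast can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of any real agent framework: the tool names, the `TOOLS` table, and `choose_next_tool` (a rule-based stand-in for an LLM call) are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical tools: plain functions standing in for real integrations.
TOOLS: Dict[str, Callable[[str], str]] = {
    "fetch": lambda text: text + " [fetched]",
    "summarize": lambda text: text + " [summarized]",
    "translate": lambda text: text + " [translated]",
}


def compiled_agent(task: str) -> List[str]:
    """Compiled style: the tool order is fixed at development time."""
    trace: List[str] = []
    state = task
    for name in ("fetch", "summarize"):  # pre-wired workflow
        state = TOOLS[name](state)
        trace.append(name)
    return trace


def choose_next_tool(state: str) -> str:
    """Stand-in for an LLM call that picks the next action from context."""
    if "[fetched]" not in state:
        return "fetch"
    if "non-English" in state and "[translated]" not in state:
        return "translate"
    if "[summarized]" not in state:
        return "summarize"
    return "stop"


def interpreted_agent(task: str, max_steps: int = 10) -> List[str]:
    """Interpreted style: the next step is decided at runtime from the current state."""
    trace: List[str] = []
    state = task
    for _ in range(max_steps):
        name = choose_next_tool(state)
        if name == "stop":
            break
        state = TOOLS[name](state)
        trace.append(name)
    return trace


task = "report on non-English sources"
print(compiled_agent(task))     # same two steps for every task
print(interpreted_agent(task))  # inserts a translate step because the context calls for it
```

The compiled version is trivially predictable and testable, while the interpreted version adapts to input the developer never enumerated, at the price of behavior that depends on whatever the decision function returns at runtime.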

Technical Deep Dive

Compilation in LLM: Parameter Fixation and Knowledge Internalization

In LLM-native agents, "compilation" occurs during model training. Vast textual data is compressed into fixed neural parameters. Post-deployment, these parameters act like "compiled" code, setting fixed probabilistic boundaries on potential behaviors.

Interpretation in AI: Dynamic Runtime Decisions

However, runtime inferences from LLMs reveal an "interpreted" quality, characterized by:

- Dynamic CoT (Chain-of-Thought) generated spontaneously
- Adaptive path planning reacting to real-time feedback
- Probabilistic decisions, allowing the same prompt to yield different outcomes

Thus, LLMs represent a hybrid computational paradigm, combining "probabilistic compilation" and "constrained interpretation"—leveraging pre-trained parameters while dynamically interpreting and adapting at runtime.
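
The "probabilistic decisions" point can be made concrete with temperature sampling. The sketch below assumes a toy set of next-action scores (the `logits` values and action names are invented for illustration); the fixed scores play the role of the "compiled" parameters, and the sampling step is the "interpreted" part that lets the same prompt yield different outcomes.

```python
import math
import random
from typing import Dict


def sample_token(logits: Dict[str, float], temperature: float, rng: random.Random) -> str:
    """Sample one token from fixed logits; temperature <= 0 means greedy decoding."""
    if temperature <= 0:
        return max(logits, key=logits.get)
    top = max(logits.values())
    # Softmax weights, shifted by the max logit for numerical stability.
    weights = {tok: math.exp((val - top) / temperature) for tok, val in logits.items()}
    threshold = rng.random() * sum(weights.values())
    cumulative = 0.0
    for token, weight in weights.items():
        cumulative += weight
        if threshold <= cumulative:
            return token
    return token  # guards against floating-point rounding at the upper edge


# Hypothetical scores for the agent's next action; fixed once "training" is done.
logits = {"search": 2.0, "ask_user": 1.5, "stop": 0.5}
rng = random.Random(0)

greedy = {sample_token(logits, 0.0, rng) for _ in range(200)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(200)}
print(greedy)   # a single action every time
print(sampled)  # several different actions from the same fixed logits
```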

Architectural Comparison

Compiled Agents: Reliability and Predictability

Unlike LLM-native agents, compiled agents follow strict, pre-defined workflows:

- Clear, predetermined logic paths
- Fixed decision branches
- Limited context management
- Deterministic results

Examples: ByteDance's Coze platform exemplifies this model. Users visually design the agentic logic via drag-and-drop workflows, ensuring consistency and reliability. Ideal for well-defined business automation tasks like RPA (Robotic Process Automation), compiled agents excel in repeatable, predictable operations.

Limitations: Rigidity and an inability to adapt dynamically. Any unforeseen change in environment or input can disrupt a workflow, necessitating manual reconfiguration and/or retraining of the underlying models.

Interpreted Agents: Runtime Autonomy and Flexibility

Interpreted agents are LLM-native autonomous agents that dynamically formulate and revise their execution plans:

- Goal-driven, high-level task definitions
- Real-time strategic planning
- Environmental awareness
- Autonomous decision-making with dynamic tool selection

Examples: Manus and AutoGPT embody interpreted agents. AutoGPT autonomously breaks tasks into subtasks, sequentially executes them, adapts based on interim results, and maintains persistent memory states to handle complex, multi-step operations. Manus, employing a multi-agent collaborative framework, autonomously executes complex workflows—from data analysis to report generation—demonstrating a complete "idea-to-execution" loop.

Strengths: Highly adaptive, capable of handling diverse, unforeseen scenarios. Ideal for research, creative tasks, and personal assistance.

Challenges: Unpredictability, higher computational resources, potential security risks, and more intricate development and testing procedures.

Interface Strategies: Universal vs. Specialized

Agent capabilities heavily depend on interaction modes with external environments:

- Universal Interfaces (browser-like interactions) grant agents broad compatibility but face efficiency, reliability, and security issues.
- Specialized Interfaces (API calls) offer speed, stability, and security but lack flexibility and require direct integration.

Strategically, agents leveraging specialized APIs can form more robust, defensible positions, avoiding easy internalization by LLM providers.

Future Directions and Challenges

Emerging Hybrid Architectures

Future agents will increasingly blend compiled reliability with interpreted adaptability, embedding runtime-flexible modules within structured workflows. Such hybrids combine precise business logic adherence with adaptive problem-solving capabilities.

Technical Innovations

Advances needed include:

- Further enhanced runtime reasoning and self-reflection via RL (Reinforcement Learning) post-training to improve decision accuracy
- Integrated multimodal perception (visual, auditory, tactile) for richer environmental understanding
- Robust resource management and runtime environments supporting persistent, background-running interpreted agents

Societal and Ethical Considerations

Widespread agent deployment raises security, privacy, and ethical issues, demanding stringent governance, transparent operational oversight, and responsible AI guidelines.

Conclusion

Compiled and interpreted agents represent complementary, evolving paradigms. Their convergence into hybrid architectures is forming the backbone of a new, powerful LLM-native agent ecosystem. As this evolution unfolds, humans will increasingly delegate routine cognitive tasks to agents, focusing instead on strategic, creative, and emotionally intelligent roles, redefining human-AI collaboration.

In essence, the future of AI agents lies in balancing the precision and predictability of compilation with the flexibility and creativity of interpretation, forging an unprecedented path forward in human-technology synergy.

[Related]

The Three-Stage Scaling Laws of Large Language Models

Mr. Huang's background features three S-curves, illustrating the scaling relay race across three stages of large language models, demonstrating a persistent spirit akin to the Chinese fable of the legendary Old Man Who Moved Mountains.

We know that large language models have three stages: pre-training, post-training, and online inference. The biggest change in recent months is the community consensus, following Ilya Sutskever's claim, that the pre-training era has ended. The famous empirical scaling laws for pre-training appear to have plateaued. This has led to the rise of reasoning models (OpenAI's O series and DeepSeek's R series, among others), which emphasize investment in chain-of-thought (CoT) reinforcement learning during post-training and utilization of online inference time (so-called "test time compute"). These reasoning models have indeed demonstrated unprecedented achievements in mathematics, coding, and creative writing.

The scaling of post-training for reasoning models has just begun, and it's unclear how far it can go. But we can gradually see this trajectory from O1 evolving to O3, and from R1 to the reportedly soon-to-be-released R2 and their enhanced capabilities. What about the test time scaling in the final inference stage?

Recently, I spoke with my old friend Junlin, one of the earliest advocates for the three S-curves of scaling in China. I mentioned that I hadn't seen any real test time scaling because no one can control the model's test time compute—how much time/computing power it uses and when it completes assigned tasks is determined by the model itself, so test time doesn't seem "scalable." Junlin agreed that this is currently the case.

These past few days, while playing with large models' deep research capabilities, I've gradually experienced some possibilities for test time scaling. The answer is emerging. Fundamentally, it's about whether there's a curve showing that if you give a query or topic more thinking and response time, it performs better. Specifically, with O3-mini, there's a button called "deep research" that users can choose to use or not to use. Without it, your question still follows a chain of thought because you initially selected the reinforced O3 reasoning model. The process for reasoning models typically takes a minute or two. However, if you also press the deep research button, the final reasoning time is extended by several times, potentially lasting up to 10 minutes. This shows us that even with the same model, different inference times produce different results. This should count as a precursor of test time scaling.

How does it work? How can users invest different amounts of test time compute, based on the difficulty of their topic and their tolerance for waiting, to get different results for the same topic? It turns out it uses an agent-like approach. The functionality provided by the deep research button is essentially a research reasoning agent. Agents are an additional LLM-native feature that doesn't require changing the model; what changes is the interaction method during the inference stage. Currently, this interaction is very simple, just one round, but work in this direction is expected to explore longer and more frequent interactions with users to help maximize the effect of test time compute.

If test time compute scaling doesn't quickly hit bottlenecks, we can imagine future deep research interacting with users for extended periods to complete highly complex projects. Perhaps we're moving beyond minute-level reasoning time investments—we can entirely envision large models investing hours or even days to complete challenging tasks, such as projects that would take human researchers months or years, or completing research projects humans cannot accomplish. The current deep research is very simple—after receiving the user's prompt/query, it immediately breaks down the problem and asks the user five or six simple questions to confirm the required sources, breadth, depth, and considerations for the research. After receiving user feedback, the model begins accepting updated materials (if any) and uses search to collect more relevant information. Then, following the decomposed tasks and the plan confirmed with the user, it analyzes each source and finally synthesizes everything into a research report. This naturally extends the required reasoning time because the task is no longer singular, and the materials aren't limited to knowledge already digested within the model but include more sources searched in real-time—processing all this takes time.

For both reinforcement learning in the post-training stage of reasoning models and the investment in test time compute during the inference stage, the scaling journey has just begun. Let's hope these two S-curves can continue to rise steadily for some time, allowing the scaling relay race to help us progress continuously on the path toward artificial general intelligence (AGI) and eventually artificial superintelligence (ASI).

[Related]

Does the New Reasoning Paradigm (Query+CoT+Answer) Support a New Scaling Law?

The Scaling Laws Relay Race across the Three Stages of Large Models

Zhang Junlin: Looking at Scaling Laws through DeepSeek R1

Does the New Reasoning Paradigm (Query+CoT+Answer) Support a New Scaling Law?

— Reflections on LLM Scaling Laws and DeepSeek's R1

My friend Zhang Junlin's article "Looking at the Future of Scaling Laws through DeepSeek R1" has sparked interesting discussions among peers.

Core Insights from Initial Discussions

Professor Bai summarised the key highlights as follows:

- Infinite stacking won't lead to infinite growth (physical laws don't support this)

- Only S-shaped growth is possible, with diminishing returns inevitably appearing

- The initial emergence of language capabilities relates to the density of linguistic knowledge in training data

- The next growth phase represents a second S-curve, driven by common sense knowledge, which requires more computing power due to lower knowledge density

- The third phase involves learning logical reasoning (Chain of Thought), where natural data has even lower density of such knowledge. Brute-force mining with computing power becomes inefficient, making reinforcement learning with synthetic data a more rational approach

As Dr. Lu points out: The term "Scaling Law" is becoming overloaded. While S-curves (nonlinear curves characterized by sigmoid functions) can describe technology adoption lifecycles, they typically occur in succession (one technology hits its ceiling, making way for another). Large language models' multiple "Scaling Laws" confirm this pattern, with some overlap between Test-Time and Post-Training "Scaling Laws".
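
To make the succession of S-curves concrete, here is a small sketch that models each phase as a logistic (sigmoid) curve and sums them. The ceilings, midpoints, and the arbitrary "resource" axis are illustrative assumptions only; they are not fitted to any real training run.

```python
import math
from typing import List, Tuple


def s_curve(x: float, ceiling: float, midpoint: float, steepness: float = 1.0) -> float:
    """Logistic curve: slow start, rapid growth near the midpoint, then saturation."""
    return ceiling / (1.0 + math.exp(-steepness * (x - midpoint)))


def stacked(x: float, curves: List[Tuple[float, float]]) -> float:
    """Total capability as the sum of successive S-curves (ceiling, midpoint)."""
    return sum(s_curve(x, ceiling, midpoint) for ceiling, midpoint in curves)


# Hypothetical phases: pre-training, post-training RL, test-time compute.
phases = [(1.0, 2.0), (1.0, 6.0), (1.0, 10.0)]

for x in range(0, 13, 2):
    print(x, round(stacked(float(x), phases), 3))

# Diminishing returns within a single curve: the same extra investment buys less
# once the curve is past its midpoint.
gain_near_midpoint = s_curve(2.5, 1.0, 2.0) - s_curve(2.0, 1.0, 2.0)
gain_after_plateau = s_curve(8.5, 1.0, 2.0) - s_curve(8.0, 1.0, 2.0)
print(gain_near_midpoint > gain_after_plateau)
```

Each individual curve flattens, so overall growth continues only if a new curve takes over as the previous one saturates, which is the "relay" pattern described above.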

The Nature of LLM Scaling

Let's examine the fundamental logic behind LLM scaling. First, it's crucial to understand that LLMs are not databases - they don't aim to memorize long-tail data details. Large model training essentially compresses big data, or more precisely, compresses the knowledge systems behind the data (including common sense and encyclopedic knowledge), focusing on capturing patterns and regularities of various kinds (what we call generalizations).

Conventional intuition suggests that as data scale increases, redundancy increases too. Regardless of filtering, cleaning, and deduplication, growing redundancy seems to imply diminishing returns. So why do large models still appear "hungry" even at the unprecedented scale of hundreds of billions of tokens? Why does the scaling law remain effective from hundreds of billions to trillions of tokens?

The key lies in LLMs being sequence learning and sequence decoding systems. While sequences are one-dimensional, the patterns and regularities behind them are high-dimensional. For instance, even a simple sequence like "cat chases mouse" potentially involves multiple knowledge dimensions: species relationships, predatory behavior, spatial movement, actor-patient roles, etc. This multi-dimensional knowledge naturally leads to combinatorial explosion at the sequence level as information is flattened in language. The seemingly insatiable "appetite" for big data effectively addresses this combinatorial explosion. As long as there isn't complete information redundancy, additional diverse sequences will help models abstract data patterns more precisely.

The Two vs. Three S-curves Debate

Zhang Junlin observes that since OpenAI's O1, two other phases have gained recognition with their own Scaling Laws: the reinforcement learning Scaling Law (RL Scaling Law) for post-training, and the Inference Scaling Law (also called Test Time Scaling Law).

This raises a crucial question: Are there really three S-curves, or just two? How comparable is the reasoning model's S-curve to the pre-training S-curve?

While theoretically we can identify three phases:

1. Pre-training
2. Post-training (especially reasoning-focused reinforcement learning)
3. Inference phase

In practice, post-training and inference phases likely share a single S-curve; there aren't two independent growth curves.

DeepSeek R1's Insights: The Truth About "Slow Thinking"

Consider DeepSeek R1: users can activate "deepthink" mode to enable Chain-of-Thought (CoT) reasoning, but they can't actually control reasoning quality by increasing computation time. Why is this?

Let's examine a concrete example. When R1 solves a complex mathematical problem:

- Traditional models might directly answer: "The result is 42"

- R1 shows detailed reasoning: "Let's think step by step: 1) First consider... 2) Then we can... 3) Finally, we get 42"

While R1's response appears to demonstrate "slow thinking" (CoT), this reasoning process actually reflects a generation pattern fixed during training, not a dynamic exploration of multiple potential reasoning paths at response time. In other words, CoT+answer might look like "slow thinking," but it doesn't fundamentally change the unidirectional next-token prediction paradigm: the generative nature remains the GPT "fast thinking" paradigm. At test time, unlike in AlphaGo, the depth and scale of thinking aren't dynamically explored, though beam search, if applied, can provide implicit multi-path optimization internally.

Test Time Compute Constraints

The industry buzzword "test time compute" refers to reasoning models requiring more online computational resources than traditional non-reasoning models. For example, R1 with CoT enabled might need several times more computation time than its base model V3 for the same problem. However, this increased computation results from behavior patterns acquired during training, not from dynamically adjustable compute investment. Without controllable scalability in test time compute, we can't really talk about a test time scaling law.

A major difference between pre-training and CoT reinforcement learning lies here: pre-training scaling laws can remain stable long-term because once training completes, it doesn't significantly impact online response time - the generation mode remains a simple query+answer. Therefore, offline training for months is acceptable if the resulting model shows significant capability improvements. However, reasoning models' post-training CoT reinforcement learning differs - it cultivates models' habits of responding with slow thinking, changing the generation mode to query+CoT+answer. Extending the CoT isn't just about the cost of training resources and time; more critically, it shows up as extended test time compute for every query during deployment, severely delaying system response time. Users generally have limited tolerance for slow-thinking computation time and delays during online system use.
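
A back-of-the-envelope calculation shows why the query+CoT+answer mode weighs on every request. The token counts and decoding speed below are assumed round numbers for illustration, and the model ignores prompt processing, batching, and network overhead.

```python
def response_seconds(answer_tokens: int, cot_tokens: int, tokens_per_second: float) -> float:
    """Rough decoding time: every generated token, CoT or answer, costs one decoding step."""
    return (cot_tokens + answer_tokens) / tokens_per_second


# Assumed figures: a 200-token answer, an 1,800-token chain of thought, 50 tokens/s.
direct = response_seconds(answer_tokens=200, cot_tokens=0, tokens_per_second=50)
with_cot = response_seconds(answer_tokens=200, cot_tokens=1800, tokens_per_second=50)

print(direct)             # seconds for query+answer
print(with_cot)           # seconds for query+CoT+answer
print(with_cot / direct)  # slowdown factor paid on every single query
```

Unlike extra pre-training, which is paid once offline, this multiplier is paid again on each query, which is why user tolerance for latency becomes the binding constraint.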

The Sustainability Debate

OpenAI's Sam Altman and Anthropic's Dario Amodei might argue that for extremely complex problems (like proving the Riemann hypothesis or designing next-generation aerospace vehicles), even if a model needs a week of computation time, it's still a massive improvement over human teams requiring decades. However, this argument has two issues:

1. LLM feasibility for such super-complex problems remains far from validated

2. Extreme scenarios lack universality and can't serve as data points for sustainable scaling laws

This isn't to deny S-curves as effective models for describing scaling laws, nor to reject the rationality of S-curve stacking. The combination of pre-training and post-training growth curves (s1 and s2) might indeed reflect the overall relationship between resource investment and performance improvement. However, we should carefully examine whether CoT reasoning truly opens a sustainable scaling curve.

Conclusion: How Far Is the LLM Road to AGI?

If reasoning models' scaling laws lack sustainability, this raises a deeper question: Can we reach the promised land of Artificial General Intelligence (AGI) through these two scaling laws alone? Furthermore, is the technical ideal of Artificial Super Intelligence (ASI) - AI replacing human labor and dramatically improving productivity - truly feasible?

Current evidence suggests that while pre-training scaling laws have shown considerable sustainability, reasoning models' scaling laws may quickly hit practical constraints. This reminds us that the path to AGI/ASI likely requires more innovative breakthroughs, not just simple extrapolation of existing methods. In the next phase of artificial intelligence development, we might need to discover entirely new growth curves.

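[Related]

Zhang Junlin: Looking at Scaling Laws through DeepSeek R1

Technical Deep Dive: Understanding DeepSeek R1's Reasoning Mechanism in Production

DeepSeek Study Notes: R1's Inference Mechanism in the Deployment Phase

Starting from R1's Hallucinations: Are LLM Hallucinations a Defect or a Creative Spark?

Reasoning Reinforcement Learning Is End-to-End Supervision, with the Reasoning Process Left Unsupervised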

Technical Deep Dive: Understanding DeepSeek R1's Reasoning Mechanism in Production

A detailed analysis of how DeepSeek R1's inference mechanism works in production, and how it differs from training-time reinforcement learning.

Training vs. Deployment: Key Questions

1. Training Phase (GRPO): Does the reinforcement learning mechanism generate multiple candidate CoT+answer sequences to optimize the policy and cultivate "slow thinking" habits?

- The answer is definitively yes.

2. Deployment Phase: Does R1 implicitly generate multiple paths during inference but only display one? If so, how does this mechanism compare to traditional ensemble methods?

3. Comparison with AlphaGo's MCTS: How does R1's mechanism fundamentally differ from Monte Carlo Tree Search?

1. Inference Mechanism in Production

DeepSeek R1's real-time reasoning can be characterized by two modes:

A. Implicit Multi-path Generation and Selection

- Generation: The model may implicitly generate multiple potential reasoning paths (CoT+Answers) during a single inference but outputs only one.

- Technical Implementation: Through decoding strategies (e.g., beam width adjustment), the model maintains multiple candidate sequences, ultimately selecting the highest-scoring path.

- User Experience: Users see only the final output, though internal multi-path exploration occurs.

- Efficiency Trade-off: Setting beam_width=1 (greedy search) defaults to single-path generation for fastest response; increasing beam width improves quality at the cost of latency.
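
The beam-width trade-off can be illustrated with a toy next-token model. The probability table below is invented for the example, and the function is a generic beam search sketch rather than DeepSeek's actual decoder; it only shows how keeping more candidates can surface a higher-probability sequence that greedy decoding misses.

```python
import math
from typing import Callable, Dict, List, Tuple

Sequence = Tuple[str, ...]


def beam_search(next_probs: Callable[[Sequence], Dict[str, float]],
                beam_width: int, max_len: int, eos: str = "<eos>") -> Tuple[Sequence, float]:
    """Keep the beam_width best partial sequences, scored by total log-probability."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    for _ in range(max_len):
        candidates: List[Tuple[Sequence, float]] = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                candidates.append((seq, score))  # finished sequences carry over unchanged
                continue
            for token, prob in next_probs(seq).items():
                candidates.append((seq + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = candidates[:beam_width]
        if all(seq[-1] == eos for seq, _ in beams):
            break
    return beams[0]


def toy_next_probs(seq: Sequence) -> Dict[str, float]:
    """Hypothetical model: the locally best first token leads to a worse full path."""
    table = {
        (): {"A": 0.6, "B": 0.4},
        ("A",): {"x": 0.5, "y": 0.5},
        ("B",): {"z": 0.9, "w": 0.1},
    }
    return table.get(seq, {"<eos>": 1.0})


greedy_seq, greedy_score = beam_search(toy_next_probs, beam_width=1, max_len=3)
beam_seq, beam_score = beam_search(toy_next_probs, beam_width=2, max_len=3)
print(greedy_seq, round(math.exp(greedy_score), 2))  # starts with "A", sequence probability 0.3
print(beam_seq, round(math.exp(beam_score), 2))      # starts with "B", sequence probability 0.36
```

Note that the path is chosen purely by model probability; no reward signal or simulation is involved, and the work per step grows roughly in proportion to the beam width.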

B. Explicit Multiple Candidate Generation (Optional)

- API Control: The num_return_sequences parameter allows explicit generation of multiple candidates.

- Practical Application: While not enabled by default in the DeepSeek App, this functionality may be available through enterprise APIs or open-source implementations.

2. Training Phase: Cultivating "Slow Thinking"

A. Role of Reinforcement Learning

- Objective: GRPO algorithm trains the model to generate more detailed, logical reasoning steps (longer CoT) to maximize rewards.

- Mechanism: Training generates multiple candidate answers, with rewards evaluating both answer correctness and format correctness.
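
The group-based scoring at the heart of this step can be sketched as follows. The reward weights are hypothetical, and the sketch covers only the group-relative advantage computation; the full GRPO objective also involves a clipped policy ratio and a KL penalty, which are omitted here.

```python
from statistics import mean, pstdev
from typing import List


def reward(answer_correct: bool, format_ok: bool) -> float:
    """Rule-based reward combining correctness and format (weights are assumptions)."""
    return 1.0 * answer_correct + 0.5 * format_ok


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sampled response against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four candidate CoT+answer samples for the same prompt: (answer_correct, format_ok).
samples = [(True, True), (False, True), (False, False), (True, True)]
rewards = [reward(correct, fmt) for correct, fmt in samples]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 2) for a in advantages])  # above-average samples get positive advantages
```

Because each response is judged only relative to its siblings, no separate value model is needed; responses that beat the group average are reinforced and the rest are discouraged.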

B. Driving Forces Behind CoT Growth

- Reward Design: Longer CoTs naturally emerge when they lead to better answers.

- Data Feedback: High-quality SFT data generated through rejection sampling enhances this pattern.

3. Comparison with Ensemble Methods

Similarities

- Multi-path generation conceptually similar to ensemble predictions

- Result filtering comparable to voting/weighted averaging

Key Differences

R1's implicit multi-path generation is fundamentally a dynamic decoding strategy within a single model, distinct from traditional ensemble's static combination of multiple models.

4. Fundamental Distinction from AlphaGo's MCTS

AlphaGo's MCTS

- Dynamic Planning: Builds search trees through simulation

- Online Learning: Adjusts search strategy based on real-time feedback

R1's Implicit Multi-path Generation

- Static Model: Fixed parameters during deployment

- No Reward Modeling: Path selection based on model probability rather than cumulative rewards

Key Insights

1. Training phase GRPO cultivates detailed CoT capabilities for effective single-pass inference.

2. Deployment allows flexible trade-off between single-path (for speed) and multi-path (for quality) generation.

3. While model parameters are fixed post-training, decoding strategies offer some runtime flexibility.

4. R1's multi-path generation fundamentally differs from both traditional ensembles and MCTS-style dynamic planning.

This architecture achieves a practical balance between efficiency and effectiveness for large-scale industrial applications, though it sacrifices some dynamic planning and global optimization capabilities.

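[Related]

Starting from R1's Hallucinations: Are LLM Hallucinations a Defect or a Creative Spark?

Reasoning Reinforcement Learning Is End-to-End Supervision, with the Reasoning Process Left Unsupervised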

The Turbulent Second Chapter of Large Language Models: Has Scaling Stalled?

The recent Chinese podcast from Guangmi's quarterly report on large language models, discussing the "scaling paradigm shift" toward AGI (Artificial General Intelligence), is well worth a listen. It touches on many key topics related to the AI industry landscape, offering a unique perspective and style.

The term "paradigm shift" may sound a bit dramatic, but as a seasoned analyst, Guangmi uses it to describe the current turbulent landscape accurately. While the AI arms race among industry giants is still in full swing, real-world scalable applications of these models are struggling to materialize. The question of how to justify investments has become a significant pressure point, or perhaps even a looming bubble.

Let's revisit some AI basics. There are three main types of learning in LLMs (Large Language Models):

(i) supervised learning;
(ii) unsupervised learning (self-learning/pre-training); and
(iii) reinforcement learning (RL, self-play/post-training).

Ilya has emphasized the importance of RL in exploring new directions for LLMs. Guangmi's podcast highlights RL as the pathway to the paradigm shift in AGI through large models.

Historically, two key milestones in RL have stood out: AlphaGo's victory over top human Go players, which shocked the world, and RLHF (Reinforcement Learning from Human Feedback), which aligned models with human preferences and paved the way for ChatGPT’s explosive growth.

Currently, discussions revolve around the potential of a new RL-driven ecosystem for large models (though there's no broad consensus—it's primarily a conversation within small Silicon Valley circles) and the emerging trends in the "arms race" of large models. Here’s the context:

1. Pre-training scaling seems to have hit a bottleneck, with GPT-5 still unreleased;
2. The overall momentum of the arms race remains unchanged among the major players (the billionaire clubs/giants);
3. Key tech figures are proposing new roadmaps or trying to construct new scaling laws to continue the AGI journey.

Guangmi closely monitors trends in Silicon Valley. His small team conducts in-depth research in the Bay Area and has established extensive contacts. Having chatted with them over coffee a couple of times, I’ve found them to be a dynamic, young team under his leadership—a small but sharp presence.

Guangmi’s thoughts are well-structured, and his breadth of knowledge and understanding of the larger context are impressive. This is no small feat, as the landscape of large models, both in terms of the models themselves and the industry, is often akin to the parable of the blind men and the elephant. Even top experts and business leaders struggle to assess the full picture. Just recently, Meta’s Zuckerberg responded to a question about whether the AI arms race would deliver the expected AGI returns, essentially saying: “No one really knows, but we can’t afford to miss out,” reflecting a typical FOMO (Fear Of Missing Out) mindset.

We’re currently in a delicate phase with little consensus. However, the few tech giants that have propelled Nvidia’s stock to astronomical levels won’t allow the arms race to slow anytime soon, as it is central to their tech and business dominance. OpenAI continues to raise funds, and Ilya, with his new company, recently secured more investment, all of which keeps the race heated.

At the same time, the obsession with scaling among tech elites and the mainstream AGI circles in Silicon Valley persists. The endless demand for resources driven by this scaling wave of large models means that only a small circle of tech insiders has the opportunity and resources to experiment, sense, and adjust the roadmap.

According to Guangmi, the so-called self-play RL scaling is currently gaining traction within a small circle of about 200 tech elites in Silicon Valley, indicating that this is still a nascent trend—one that even management leaders have not fully aligned with yet.

It seems Guangmi adopts a “prophet” mentality at times, perhaps exaggerating this trend to alert his audience. He even suggests that if he were a large-model entrepreneur, he would focus 200% of resources on RL, betting on it as the future path to victory.

In reality, for most people, this advice is neither practical nor actionable—it’s likely aimed at tech giants or unicorns, though even for them, it may fall on deaf ears.

Reinforcement learning is inherently challenging. Even the open-source leader Meta LLaMA 3 has chosen to sidestep RLHF in post-training alignment. So, it's even less realistic to expect large-model teams to fully bet on RL as the core of a new ecosystem. Furthermore, this trend is, at best, a “subtle undercurrent” in Silicon Valley. We’ll likely have to wait until OpenAI’s “Strawberry” or the new version of Claude releases later this year to fully assess its impact.

It seems the first chapter of LLM scaling has indeed come to an end. The actionable items in the so-called second chapter might not emerge from lofty, exploratory scaling directions with an uncertain roadmap. Instead, the focus should be on finding market entry points, accelerating applications, and addressing genuine market needs (PMF, product-market fit), especially as the inference costs of top models like GPT-4o/Claude 3.5 become more affordable, and multimodal capabilities (such as advancements in hyper-realistic full-duplex voice and video) further enhance application opportunities.

For the industry, the bottleneck in scaling large-model applications is the sword hanging over its future. This will determine whether the second chapter of the tech adoption curve ends with a soft landing and eventual recovery. As for the arms race, it’s best to leave that to Elon Musk, Zuckerberg, and the billionaire club to continue playing.

Reinforcement learning, as an extension of pre-training, belongs to the realm of “post-training.” When pre-training hits bottlenecks and diminishing returns, strengthening RL is a natural complement. In the simulation of human cognition, pre-training represents the accumulated knowledge of human civilization, while RL applies that knowledge in practice, learning from the environment. This overall approach to intelligent learning makes perfect sense and is the necessary direction for applying large models.

My old friend Lu said: “It’s intuitive that RL is the path we must take because there isn’t enough supervised learning data anymore.”

Indeed, utilizing regenerated data to varying degrees has become common practice. It’s inevitable. Models can already generate data of higher quality than humans, and this will only improve. However, this is not the same as self-play's proactive exploration and data regeneration.

As Mr. Mao pointed out: “RL aligns with the cognitive processes of humans and epistemology. It’s essentially the process of receiving external feedback and being tested in practice. RL is active learning, while training is passive.”

Guangmi's RL paradigm shift suggestion still lacks the necessary catalysts. But this potential trend is worth keeping in mind. It’s best to remain cautiously optimistic and open-minded while watching how things unfold.

Related original:

大模型风云诡谲的下半场：scaling 失效？

Decoupling to Resolve: Issue of Character Consistency in Video Generation

I’ve now become the go-to expert for AIGC (AI-generated content) "custom services" among my old friends and classmates, just for fun. Below are nostalgic videos made from old photos that two of my classmates asked me to create.

Whenever I find the time, I’m more than happy to provide this kind of emotional value for friends and family because it’s truly satisfying to see their reactions of surprise.

The pianist is now a world-class piano master, frequently touring and performing in Europe, America, and China. These are precious old photos of him practicing and performing with our mutual friend, Brother Sun, in Philadelphia back in the early days.

Dr. Bai Shuo, a seasoned expert in NLP and a multi-talented musician, commented humorously: “Looks real for someone who pulls on the bow in Meditation as named, but the bowing and fingering are all wrong.”

Another old friend also left feedback noting that the visual model doesn’t understand music: "This needs improvement! It's obvious that the model was created by someone who doesn’t know how to play the violin or piano. The bowing and piano accompaniment are off. The first note has a two-and-a-half beat long tone, which should be played with a long bow. Additionally, the pianist’s right foot should never be raised or shaking like that—it should be on the sustain pedal.”

LOL

Even though the title of the piece, Meditation, was clearly specified in my prompt during generation, no model in the foreseeable future can truly align an understanding of music with the intricate bodily movements of performance. Perhaps this can be reserved as one of the ultimate challenges for large models aiming for AGI: theoretically, if enough aligned data of musical performance is available, then based on the compression theory of "joint training", it’s possible to aim at perfect alignment across different modalities.

If simulating the objective world is the ultimate goal of visual models, then the current generation of visual models is at the level of “playing the piano to a cow” or “playing music to a tone-deaf audience”—completely unable to withstand scrutiny from musicians. For example, as someone with little musical knowledge, when I watch the nostalgic performance videos above, I wouldn’t notice the flaws as an expert would; instead, I find them vivid and emotionally engaging.

Of course, the standards of musicians might as well just be a "pseudo-demand" or a pseudo-goal (even if the visuals satisfy the picky “expert eye,” so what? Will it sell well?). It might not be worth the effort to pursue this. However, in theory, an ideal AGI should be capable of meeting these expert-level demands.

This is the challenge of musical performance alignment. Another challenge to Sora-like video generation models is character consistency in videos.

Achieving facial consistency in generative visual models is extremely difficult. Don’t expect this issue to be resolved by video generation models alone in the short term, especially not through autoregressive methods.

Human eyes are extremely discerning with regard to faces, especially the familiar faces of friends and family: you can immediately tell when a character's appearance is off. For example, while playing with old photos recently, I used the KeLing model (a top-notch video model in China) to generate a video of myself. At the 5-second mark, it still looked passable, but by 10 seconds, it no longer resembled me.

10 second footage:

In the second 10-second video, just a slight turn of the head, and it’s no longer me—it looks more like my brother. How can a model handle such fine details? Especially when the starting image for video generation is not even a straightforward frontal shot, making the character information incomplete—how could it not go off track?

While the videos I've made for friends and family using KeLing during its public testing phase have generally been met with passionate surprise and amazement, most of them suffer from this issue of character consistency, which is a regret.

The current one-click video generation products on the market (including our own YuanChuang Island recently launched) tend to mainly use anime or manga styles. This is to avoid user scrutiny since these styles lack 3D distinct individual characteristics. As long as there is consistency in attire, no gender mix-ups, with age and race alignment, most people will accept it. The current one-click videos are generally rough, with entertainment value primarily in the story rather than character portrayal akin to a Hollywood blockbuster. However, as this path progresses, it will inevitably encounter the challenge of maintaining the consistency of digital IP actors and their roles.

My colleague, Lu, mentioned, "the consistency issue might require cross-checking from multiple video angles, which more or less touches on the core issue of whether modeling is necessary."

Indeed, some form of cross-checking is required, not just monotonic correction over time/sequence; that is the key. There’s a need to decouple or separate the character's image from the storyline, rather than generating in a linear, one-way path. While sequence learning has indeed produced miracles in LLMs, sequence generation inherently has limitations, including random deviations over time. The problem is not as extreme as LeCun's criticism suggests: he argues that GPT's error accumulation turns a tiny discrepancy into a significant miss, but that claim isn't entirely accurate, because GPT's autoregressive operation also corrects and adjusts its course at every step in the context. Nevertheless, when it comes to fine-grained consistency, random deviations are almost impossible to handle, even with corrective mechanisms in place.

Hence decoupling, decoupling, decoupling! Decoupling can solve the problem. The world isn't limited to sequences. Beyond sequences and time, there is a constant abstraction (i.e., character image, or IP) that can be utilized. This is becoming increasingly clear. Take, for example, the digital IP character Maria (Xiao Ya) that I created using AIGC txt2img more than 2 years ago:

Unless they’re fans, perhaps my numerous Maria videos might cause aesthetic fatigue—someone even called her “Dr. Li's fairy” (LOL). But indeed, there are fans; several of my old classmates are among them.

Why? Because she is an IP, and she has been decoupled.

Related Links (original posts in Chinese):

视觉模型生成的极限对齐

解耦才能解套：再谈视频中的人物一致性问题

Professor Ma Claims to Have Fully Unveiled the Mysteries of Neural Networks

Professor Yi Ma’s white-box transformer paper is available here.

Professor Ma is a prominent figure, renowned for his distinctive style and leadership in the field. His name is widely recognized and respected. Of particular interest recently are his critiques of mainstream large models and the bold claims he has made about his own work (see his post in Chinese below).

Recently, at a conference in Shenzhen (which I attended with my own talk too), Professor Ma sharply criticized mainstream large models, Ilya, and Kolmogorov complexity theory, dismissing them as being on the level of high school students and claiming that they lack a true understanding of theoretical concepts. He asserted that he has achieved breakthroughs in both theory and practice, particularly with the white-box Transformer developed by his team. According to him, this model not only demystifies the complexity of large models but also offers an engineering-feasible alternative.

When someone speaks with such confidence, it usually indicates genuine expertise and a commanding presence. Just as Yann LeCun in the U.S. criticized GPT as being inferior to a dog and called it a dead end, proposing his world model as an alternative, China has Professor Ma. Their critiques balance the global discourse, making the world feel less excluding. There is indeed hope that their work might address the "slow thinking" and "interpretability" shortcomings of current mainstream large models and contribute to the overall advancement of AI. Professor Ma’s academic and practical work deserves close study, though we may have to wait for time and peer reviews to fully test and validate their findings.

At the Shenzhen conference, after delivering his talk and sharp critiques, Professor Ma left immediately, likely due to his busy schedule.

The paper is over 100 pages long and is said to be released in a few days. Based on the current outline, the key points are as follows:

Overall, CRATE is similar to a transformer, with two differences:

- In each attention head, the Q, K, and V weight matrices are tied, i.e., set to be equal.
- The nonlinearity following each attention layer is no longer a multi-layer perceptron (MLP) but rather a more structured operator (ISTA) with sparse outputs.
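
To make the first difference concrete, here is a minimal single-head sketch in plain Python. It assumes ordinary scaled dot-product softmax attention and simply reuses one projection matrix for the query, key, and value roles; the names `X`, `U`, and `tied_attention_head` are hypothetical, and the operator in the actual CRATE paper may differ in its details.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def tied_attention_head(X, U):
    """One attention head whose Q, K, and V projections are the same matrix U.

    X: n_tokens x d_model token representations.
    U: d_model x d_head shared projection (hypothetical name).
    Returns (output, attention_weights).
    """
    Z = matmul(X, U)                      # Z plays the role of Q, K, and V at once
    scale = math.sqrt(len(U[0]))          # usual 1/sqrt(d_head) scaling
    Zt = [list(col) for col in zip(*Z)]   # transpose of Z
    scores = matmul(Z, Zt)                # token-to-token similarity, Z Z^T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, Z), weights

if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0]]          # two toy tokens
    U = [[1.0], [0.0]]                    # shared projection onto one dimension
    out, w = tied_attention_head(X, U)
    print(w)
    print(out)
```

Because the same `U` produces queries and keys, the score matrix is symmetric before the softmax, which is one visible consequence of tying the weights.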

Let's examine ISTA (Iterative Soft-Thresholding Algorithm), a widely used algorithm for solving sparse optimization problems in machine learning. In his CRATE architecture, ISTA replaces the traditional MLP in Transformers. Not long ago, KAN also introduced innovations aimed at replacing the MLP, both approaches representing surgeries within the Transformer architecture.

In my understanding, ISTA and KAN (for Science/Physics) share a common goal: through regularization or pruning, they ultimately fit a sparse path, thus achieving interpretability.

How it works

ISTA iteratively approaches the optimal solution of a problem. Each iteration involves two steps: a) a gradient descent step, which aligns with mainstream methods; and b) a soft-thresholding operation. This operation is added to balance two objectives:

a) Maximizing model accuracy;
b) Achieving model sparsity, i.e., simplicity (as overly complex models are difficult for humans to interpret).

The soft-thresholding operation encourages internal elements to become zero, resulting in sparse outputs and increased interpretability. The weight-tied attention mechanism, combined with ISTA, promotes a deeper understanding of the input data structure, resembling a human-like structured analysis process that prioritizes key elements while regularizing the data.
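
The two-step iteration described above can be sketched for the textbook sparse-regression (lasso) setting. This is only an illustration of generic ISTA, not the CRATE layer itself: the objective 0.5·||Ax − y||² + λ||x||₁, the helper names, and the toy data are all assumptions made for the example.

```python
def soft_threshold(v, t):
    """Shrink every entry toward zero by t; entries with |v_i| <= t become 0."""
    return [max(abs(x) - t, 0.0) * (1.0 if x > 0 else -1.0) for x in v]

def matvec(A, x):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def ista(A, y, lam, step, n_iter=500):
    """Minimize 0.5*||A x - y||^2 + lam*||x||_1 by iterative soft-thresholding.

    step should be at most 1 / (largest eigenvalue of A^T A) for convergence.
    """
    At = [list(col) for col in zip(*A)]
    x = [0.0] * len(A[0])
    for _ in range(n_iter):
        # a) gradient step on the accuracy term 0.5*||A x - y||^2
        residual = [p - q for p, q in zip(matvec(A, x), y)]
        grad = matvec(At, residual)
        moved = [xi - step * gi for xi, gi in zip(x, grad)]
        # b) soft-thresholding step, which enforces sparsity
        x = soft_threshold(moved, step * lam)
    return x

if __name__ == "__main__":
    A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    y = [3.0, 0.2, -1.5]
    print(ista(A, y, lam=0.5, step=1.0))  # → [2.5, 0.0, -1.0]
```

In the toy run the small middle coefficient is driven exactly to zero while the large ones are only shrunk, which is the accuracy-versus-sparsity trade-off in miniature.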

Professor Ma claims that these two modifications naturally lead the model to learn the interpretability associated with human-like structuring and sparsity during supervised learning (and, as later claimed, the approach has been successfully applied to self-supervised learning as well).

For example, in image recognition, it was observed that certain attention heads correspond to different parts of animals. What's more remarkable is that this correspondence remains consistent across different animals and even different categories of animals. For instance, an attention head focused on the "head" consistently pays attention to the head area when processing different kinds of animals. This consistency suggests that CRATE has learned a general representation of visual features across categories.

However, those studying LLM interpretability have long discovered that at the end of MLP networks, various structured components (such as heads and feet) are also captured by attention mechanisms. Without this, it would be difficult to explain the generalization (or compression) capabilities exhibited by LLMs. The challenge lies in the early stages of the MLP network, where attention is more mixed, and mainstream researchers struggle to clarify what the attention heads are focusing on. It seems that they are vaguely paying attention to the relationships between basic elements like pixels/dots and lines.

The core idea behind explainable AI is consistent: transforming the tangled, black-box, multi-layer network's internal data fitting paths into structured paths that are enabled with various constraints and pruning, leading to a sparse representation.

Who wouldn’t want a model to be interpretable? However, achieving sparsity and simplicity is extremely challenging, which is why, so far, these approaches have struggled to compete with the black-box methods that involve randomness.

Professor Ma’s confidence stems from the fact that, in the past six months to a year, he has begun to train models using the explainable white-box methods mentioned above, achieving results comparable to traditional transformers. At the Shenzhen conference, he mentioned that while he had always been confident that this was the correct approach, he remained cautious until results were obtained. Now, he believes that his cross-national team’s achievements with this approach have satisfied him enough to announce to the world that he has found a breakthrough in theory as well as practice, the correct method for white-boxing transformers, which could lead to a paradigm shift and a breakthrough in deep learning. This has made him both excited and confident. Therefore, he is no longer content with academic theoretical achievements alone; he feels compelled to take action in industry as well. Professor Ma has recently founded a company to advance this work on an engineering level. At Shenzhen, he announced a directionally significant project challenging the mainstream, for the first time under the banner of his new company.

However, based on my years of NLP experience and intuition, I must point out a challenge (or potential issue): Human interpretability is built on a highly simplified finite set. If we consider symbolic features, a feature system with more than thousands of elements becomes incomprehensible to humans. But on the other hand, the number of parameters in transformers and the number of KQVs for attention heads are on a completely different scale. Reducing such complexity on this scale seems almost unimaginable.

KAN for Science succeeded because their target was extremely narrow—certain existing symbolic formulas in physics or potential formulas limited to a few parameters. With such a goal, pruning, along with scientist intervention or feedback, allowed KAN to claim interpretability.

Regardless, Professor Ma seems confident, so we would like to observe how his methods and results evolve and will, or will not, be accepted.

The Challenge of Character Consistency in Video Generation

Facial recognition in the vast world of AI is a specialized and challenging task, as human eyes are exceptionally sensitive to facial features. Because facial recognition is so specialized and sensitive, it presents a much greater challenge than traditional image recognition tasks, like identifying animal types. Consequently, this field achieved breakthroughs earlier than others: even before the advent of contemporary large models such as GPTs, deep neural network-based facial recognition, powered by extensive datasets of facial images, had already surpassed human visual capabilities and sensitivity. It became widely adopted, leading to the rise of unicorns in the pre-large model era.

Now, as we transition to universal video foundation models that aim to handle all objects in the world, whether it's Sora or Keling, maintaining facial consistency remains a significant challenge. The public has little access to Sora, but by examining similar leading visual models like Keling, we can perceive its limitations. Typically, after about half a minute, the generated faces start to diverge, no longer resembling the original person as closely. Achieving long-term consistency in character appearance is difficult without specialized processing and targeted optimization; relying solely on the current general video consistency training efforts is unlikely to overcome this bottleneck. This limitation has been repeatedly observed during various tests with publicly available visual products like Keling.

In some videos, were it not for the sensitivity of human eyes, the differences between visuals might be impossible to detect from a purely physical perspective. This highlights the sharpness of human perception: the ability to instantly discern the real from the fake.

For example, in the videos generated below featuring Maria (Xiao Ya, the favorite text2image IP I have generated and maintained in my AIGC videos), her fans can immediately tell which one is genuine, even though Maria herself may present different appearances at different ages and in various settings. There exists an abstract, invariant facial characteristic that equips humans with an eagle-eyed ability to recognize faces. The secret to this lies in the decoupling of these characteristics already pretty well done in the previous generation of facial recognition models. Compare and contrast:

It's important to note that maintaining character consistency is a critical benchmark for generating cinematic and user-configurable video works. Without crossing this threshold, the field will struggle to achieve large-scale applications in video art creation. The dream of a fully virtual Hollywood production line, without physical filming, will remain a fantasy.

Why is it so difficult for visual models to achieve consistent character representation over long periods using brute force?

Video is a high-dimensional modality, and for large models (at least in the foreseeable future) to handle video, they must employ significant "lossy compression". The compression ratio of visual tokens is high, making it more feasible to align training/generation across entire frames over time within the hidden space. The higher the compression ratio, the stronger the temporal consistency across entire frames. Autoregressive models (GPT-like) or DiT (Diffusion Transformers) can achieve this. By doing so, videos that violate the physical laws of the real world can be effectively kept under control, reducing illogical hallucinations and making visual models appear to simulate the objective world (or so it seems). However, there is a trade-off: under lossy compression, the consistency of the overall frames and the consistency of detailed features of specific physical objects therein cannot be optimized simultaneously.

The current approach typically involves adding a super-resolution (SR) module/model after achieving overall contour (blueprint) consistency, attempting to restore discarded details. In general, super-resolution rendering has made significant progress so far, thanks to the accumulation of research in "deepfake"-like technology. However, deepfake technology essentially compensates for the losses incurred during compression, using the large visual foundation model's strength in imagination (or "hallucination") to reasonably and non-deterministically fill in the details, depicting how the world "should" look rather than how it actually is, often with amazingly detailed, lifelike results. But if the goal is to represent an individual entity, especially a finely detailed one like the human face of some IP, with individual features sensitive to human perception, it's inevitable that the generated image will drift over time. This is the crux of the problem. The solution should not rely on increasingly larger models and longer context windows with brute-force data and training. Brute force can only slow the deviation but cannot eliminate the non-deterministic bias that accumulates during the SR process over long video sequences. We need to think outside the box and exclude the time dimension as a factor, using a step-by-step alignment method, which may break the time cycle. I’ll stop here; don't say you weren't warned.

The prerequisite for achieving this is the decoupling of facial features. Features that cannot be decoupled cannot be aligned step by step. They have to, and can, be decoupled; otherwise, it would be impossible to explain how dozens of Hollywood actors can star in thousands of blockbuster films. The decoupling of faces from expressions and time still has room for improvement, but the technology has already matured considerably. It is a matter of how to properly use it in the process.

Original Chinese post:

立委论LLM：视频生成的人物一致性问题

Llama 3 Notes and Llama MV with Llama 3.1 Legend

Notes on the 92-page Paper Released with Meta's Super Large Model Llama 3.1

The super-large model Llama 3.1 is a milestone in the open-source large model community. As a leader, Meta's project involved over 500 participants/contributors (the authors of this paper are listed alphabetically in the appendix, similar to how the Central Committee members' names are displayed by stroke order). This original text is full of implementation details:

meta Llama 3.1 paper

AIGC MV using Suno and keling (just for fun & cheering opensource milestone)

Notes:

Llama 3.1 doesn't use sparse techniques, it's not a multi-expert system like model 4, but a dense model.
405B parameters, 15.6T tokens: The number of tokens is 40 times the number of parameters. Large-scale top models now emphasize data growth far exceeding parameter growth. Is this 15T tokens of data open source? (No, because even if they were willing to open source it, they wouldn't dare, as it could lead to countless data infringement lawsuits)
Emphasizes three major levers for super-large foundation models: data, scale, and managing complexity.
Compared to the previous generation system Llama 2, computational power has increased 50 times (using 3.8 × 10^25 FLOPs).
Complexity management: (1) Choosing a standard dense Transformer architecture instead of a mixture of experts model to maximize training stability. (2) Adopting a relatively simple post-training procedure: Supervised Fine-Tuning (SFT), Rejection Sampling (RS), and Direct Preference Optimization (DPO). In other words, algorithm design and implementation tend towards simplification. Not using sparse techniques and multi-expert systems is for stability (but training challenges are greater, though they're not afraid). Using simpler, easier-to-implement DPO in the post-training phase instead of reinforcement learning is also for stability, as reinforcement learning has always been difficult to handle.
Benchmark tests cover: general, code, math, reasoning, tool use, long context, and multilingual. All performances are SOTA (state-of-the-art international level).
- MMLU (Massive Multitask Language Understanding): 405B model achieves 87.3% (5-shot), 88.6% (0-shot, CoT).
- Code generation (HumanEval): 405B model reaches 89.0%, close to GPT-4.
- Math problems (GSM8K): 405B model achieves 96.8%, slightly higher than GPT-4.
- Long context tasks: Excellent performance on some tasks, such as 95.2% on QuALITY.
- Multilingual tasks (MGSM): 405B model reaches 91.6%, on par with top models.
The 405B model is comparable or close to GPT-4 and Claude 3.5 Sonnet on many tasks. In short, open-source has caught up with closed-source.
Pre-training started with an 8k window, expanded to a 128k window in the later stages of pre-training (continued training).
After the foundation model pre-training was completed, multiple iterations of alignment "post-training" were performed. Including: (1) Aligning the model through human feedback, including multiple rounds of Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO); (2) Integrating new capabilities, such as tool use; (3) Enhancing coding and reasoning abilities (specialized optimization); (4) Safety alignment.
Multimodal expansion (in progress, not yet released): Image, video, and speech capabilities. Including (1) Multimodal encoder pre-training: Image encoder trained on a large number of image-text pairs, aligning visual content and natural language in a unified space; (2) Speech self-training? (3) Experiments on video-text data alignment based on images.
Language model as the core, other modalities are added later (whether added to pre-training and/or post-training). When expanding to multimodal, the language model parameters remain unchanged, adapting to multimodality, allowing multimodal alignment in the same semantic space, closer to the language model. In other words, Llama follows a modular, step-by-step approach to gradually expand to multimodality. This is not the mainstream approach (mainly referring to Open AI and Google, at least in theory) advocating for "unified multimodal native data joint pre-training". The overall impression of Llama's algorithmic strategies is seeking stability rather than innovation or unification. It tends towards practicality, not caring about leading in algorithms. For example, the integration of speech first involves speech self-training (because speech is actually very similar to text, both being language systems), then alignment between speech and text (including Automatic Speech Recognition ASR and Text-to-Speech TTS). Integrating step by step into the cross-modal large model, this approach isn't cutting-edge in terms of advancement, but it's steady progress, beneficial for engineering development, integration, and iteration. It's unclear when they will be able to release multimodal capabilities online.
Data collection and cleaning work is very complex, but the Llama team is meticulous, which is also the data guarantee for its quality to catch up with SOTA. To recap: (1) De-duplication: URL-level de-duplication; Document-level de-duplication using MinHash algorithm; Line-level de-duplication: removing lines appearing more than 6 times in each bucket of 30M documents. (2) Filtering: Removing low-quality documents, outliers, and excessively repetitive documents, using repetitive n-gram coverage to remove repetitive content (such as logs or error messages); using "dirty word" counts to filter adult websites not covered by blacklists; using token distribution KL divergence to filter documents with too many abnormal tokens. (3) Controlling data quality: Using fasttext classifier to identify text that might be cited by Wikipedia; using a Roberta-based classifier trained on Llama 2's predictions; using DistilRoberta to generate document quality scores. Also, fasttext language classifier can identify 176 languages; specially filtering two types of information: adult content and personal identity/privacy information. Special fine processing for code and math web pages.
Data proportions: For example, downsampling over-represented data categories on the web (such as art and entertainment); data mixing ratios determined by a series of small model experiments, final data mix summary: About 50% of tokens correspond to general knowledge; 25% of tokens involve math and reasoning; 17% of tokens are code; 8% of tokens are multilingual content.
Model architecture: Apart from empirical detail adjustments, the basic architecture of the dense model remains unchanged, so it's data and scaling that create top models. 405B model specific parameters: 126 layers; token representation dimension 16,384; 128 attention heads; model size of 405B determined according to scaling laws, approximately the compute-optimal size under a 3.8 × 10^25 FLOPs training budget.
Vocabulary: Using a vocabulary of 128K tokens. Combines 100K tokens from the tiktoken tokenizer and 28K additional multilingual tokens to better support non-English languages.
Computing resources, including GPU clusters of tens of thousands of cards, massive storage, and high-speed networks, represent a huge investment. The specifics are as follows.
Computing resources:
- Used up to 16,000 H100 GPUs (a very powerful graphics processor).
- Each GPU has 80GB of high-bandwidth memory and a power draw of 700W.
- These GPUs are installed in servers designed by Meta itself, with 8 GPUs and 2 CPUs per server.
Storage system:
- Uses a distributed file system called Tectonic.
- Provides 240PB (1PB=1000TB) of storage space, distributed across 7,500 servers.
- Sustains a throughput of 2TB per second, with a peak of 7TB per second.
- A major challenge is handling the large amount of burst writes generated when processing model checkpoints (the process of saving model states).
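A few derived figures make the scale of these numbers easier to grasp. The sketch below only does arithmetic on the values listed above, treating "up to 16,000" GPUs as an upper bound; the checkpoint size passed to `write_seconds` is a purely hypothetical input, since the notes do not state one.

```python
GPUS = 16_000             # upper bound quoted above
GPU_POWER_W = 700
GPUS_PER_SERVER = 8
STORAGE_PB = 240
STORAGE_SERVERS = 7_500
SUSTAINED_TB_S = 2.0
PEAK_TB_S = 7.0

gpu_servers = GPUS // GPUS_PER_SERVER                          # 2000 servers
gpu_power_mw = GPUS * GPU_POWER_W / 1e6                        # 11.2 MW for the GPUs alone
storage_tb_per_server = STORAGE_PB * 1000 / STORAGE_SERVERS    # 32.0 TB per storage server

def write_seconds(size_tb, throughput_tb_s):
    """Time to write `size_tb` terabytes at a given throughput."""
    return size_tb / throughput_tb_s

HYPOTHETICAL_CHECKPOINT_TB = 10.0  # illustrative only, not a reported figure
print(gpu_servers, gpu_power_mw, storage_tb_per_server)
print(write_seconds(HYPOTHETICAL_CHECKPOINT_TB, SUSTAINED_TB_S),
      write_seconds(HYPOTHETICAL_CHECKPOINT_TB, PEAK_TB_S))
```

The gap between sustained and peak throughput is what matters for checkpointing: a burst of writes that fits within the peak rate finishes several times faster than one throttled to the sustained rate.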
Three-step pre-training process: a) initial pre-training; b) long-context continued pre-training; c) annealing with high-quality data sources.
Key pre-training strategies:
- Gradually increase batch size and sequence length to balance stability and efficiency.
- Dynamically adjust data mixing to specifically enhance certain capabilities.
- Increase context length in stages to avoid early computational overhead.
- Use annealing and high-quality data in the late stages of training to fine-tune model performance.

[LLM Summary]

Llama 3: Meta's Open-Source Large Language Model Breakthrough

1. Introduction and Overview

Meta has introduced Llama 3, a series of foundation language models designed to support various tasks including multilingual processing, programming, reasoning, and tool use. This model series includes versions with 8B, 70B, and 405B parameters, with the largest 405B parameter model adopting a dense Transformer architecture and supporting context windows of up to 128K tokens. The development of Llama 3 highlights three key factors: data quality and scale, computational scale, and complexity management.

2. Model Architecture and Pre-training Strategy

2.1 Model Architecture

Llama 3 retains the standard dense Transformer architecture rather than adopting a mixture of experts model. This choice aims to maximize training stability, reflecting Meta's emphasis on simplifying design to manage complexity. Key architectural improvements include:
- Using Grouped-Query Attention (GQA) mechanism, with 8 key-value heads per attention layer.
- Introducing attention masks to prevent self-attention between different documents in the same sequence.
- Expanding the vocabulary to 128K tokens, combining 100K tokens from the tiktoken tokenizer and 28K additional multilingual tokens.
- Increasing the RoPE base frequency hyperparameter to 500,000 to support longer contexts.
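To see why Grouped-Query Attention matters, the sketch below maps query heads to shared key-value heads and compares the resulting KV-cache size with standard multi-head attention. It uses the 405B figures quoted earlier in these notes (126 layers, dimension 16,384, 128 attention heads) together with the 8 key-value heads listed above. The contiguous grouping of query heads and the 2-byte (bf16) cache entries are assumptions made for illustration.

```python
N_QUERY_HEADS = 128
N_KV_HEADS = 8
MODEL_DIM = 16_384
N_LAYERS = 126
BYTES_PER_VALUE = 2  # assumption: bf16 cache entries

def kv_head_for(query_head, n_query_heads=N_QUERY_HEADS, n_kv_heads=N_KV_HEADS):
    """Index of the key-value head shared by a query head (contiguous groups assumed)."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

def kv_cache_bytes_per_token(n_kv_heads, head_dim, n_layers, bytes_per_value):
    """One key vector and one value vector per KV head, per layer."""
    return 2 * n_kv_heads * head_dim * n_layers * bytes_per_value

head_dim = MODEL_DIM // N_QUERY_HEADS  # 128
gqa_bytes = kv_cache_bytes_per_token(N_KV_HEADS, head_dim, N_LAYERS, BYTES_PER_VALUE)
mha_bytes = kv_cache_bytes_per_token(N_QUERY_HEADS, head_dim, N_LAYERS, BYTES_PER_VALUE)
print(gqa_bytes, mha_bytes, mha_bytes // gqa_bytes)
```

Under these assumptions each cached token costs 516,096 bytes with GQA, 16 times less than with one key-value head per query head, which is a large saving at long context lengths.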

2.2 Pre-training Data Processing

Llama 3's pre-training data processing is extremely rigorous, including:
- Multi-level deduplication: URL-level, document-level (using the MinHash algorithm), and line-level deduplication.
- Heuristic filtering: Removing low-quality documents, outliers, and excessively repetitive content.
- Model-based quality filtering: Using fasttext and Roberta-based classifiers for quality assessment.
- Special content processing: Developing specialized processing pipelines for code and mathematical content.
- Multilingual data processing: Using a fasttext-based language identification model, supporting 176 languages.
- Safety and privacy protection: Filtering website data containing personally identifiable information (PII) and unsafe content.

2.3 Pre-training Strategy

The pre-training process is divided into three main stages:
1. Initial pre-training: Conducted on about 15T multilingual tokens, far exceeding Llama 2's 1.8T tokens.
2. Long context pre-training: Gradually expanding from initial 8K tokens to 128K tokens context window.
3. Annealing phase: Fine-tuning with high-quality data in the final stage, using Polyak averaging to generate the final model.
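The Polyak averaging mentioned in the annealing phase amounts to averaging the parameters of several checkpoints. The following is a minimal sketch, assuming each checkpoint is a dictionary mapping a parameter name to a flat list of floats; the representation and the function name are hypothetical, and real checkpoints hold tensors.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of the parameters across checkpoints."""
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        # zip(*...) walks the same parameter position across all checkpoints
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(column) / n for column in columns]
    return averaged

late_checkpoints = [
    {"w": [1.0, 2.0], "b": [0.0]},
    {"w": [3.0, 4.0], "b": [1.0]},
]
final_model = average_checkpoints(late_checkpoints)
print(final_model)  # {'w': [2.0, 3.0], 'b': [0.5]}
```

Averaging late checkpoints smooths out the noise of individual optimizer steps, which is why it is a natural fit for the final stage of training.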

Data mixing ratios are carefully designed:
- 50% general knowledge
- 25% mathematics and reasoning
- 17% code
- 8% multilingual content
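Applied to the roughly 15T pre-training tokens, these ratios translate into the approximate token budgets computed below. The sampling helper shows one simple way such a mix could drive data loading; it is an illustrative assumption, not a description of Meta's data loader, and both function names are hypothetical.

```python
import random

TOTAL_TOKENS = 15e12  # "about 15T" tokens, so every figure here is approximate

DATA_MIX = {
    "general knowledge": 0.50,
    "mathematics and reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}

def token_budget(total_tokens, mix):
    """Split a total token count according to the mixing proportions."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix proportions must sum to 1")
    return {category: share * total_tokens for category, share in mix.items()}

def sample_categories(mix, k, seed=0):
    """Draw k source categories with probability proportional to the mix."""
    rng = random.Random(seed)
    return rng.choices(list(mix), weights=list(mix.values()), k=k)

budget = token_budget(TOTAL_TOKENS, DATA_MIX)
for category, tokens in budget.items():
    print(f"{category}: {tokens / 1e12:.2f}T tokens")
```

This gives about 7.50T tokens of general knowledge, 3.75T of mathematics and reasoning, 2.55T of code, and 1.20T of multilingual content.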

3. Training Infrastructure and Challenges

3.1 Computational Resources
- Using up to 16K H100 GPUs, each equipped with 80GB HBM3 memory.
- Adopting a 4D parallel strategy: tensor parallelism, pipeline parallelism, context parallelism, and data parallelism.
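In a 4D parallel layout, every GPU holds one coordinate along each of the four axes, and the product of the four degrees equals the number of GPUs. The sketch below shows that bookkeeping. The degrees used (tensor 8, context 1, pipeline 16, data 128, giving 16,384 ranks) and the choice of tensor parallelism as the fastest-varying axis are assumptions for illustration, not a configuration stated in this summary.

```python
def rank_to_coords(rank, tp, cp, pp, dp):
    """Map a flat GPU rank to (dp, pp, cp, tp) indices, with tp varying fastest."""
    world_size = tp * cp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank must be in [0, {world_size})")
    tp_idx = rank % tp
    rank //= tp
    cp_idx = rank % cp
    rank //= cp
    pp_idx = rank % pp
    dp_idx = rank // pp
    return dp_idx, pp_idx, cp_idx, tp_idx

# Hypothetical degrees chosen only so that the product is 16,384 GPUs.
TP, CP, PP, DP = 8, 1, 16, 128
print(TP * CP * PP * DP)                      # 16384
print(rank_to_coords(8, TP, CP, PP, DP))      # (0, 1, 0, 0)
print(rank_to_coords(16383, TP, CP, PP, DP))  # (127, 15, 0, 7)
```

Placing the most communication-heavy axis on neighbouring ranks keeps that traffic on the fastest links, which is the usual motivation for ordering the axes this way.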

3.2 Storage System
- Using the Tectonic distributed file system, providing 240PB of storage space.
- Supporting 2TB/s sustained throughput, with peak capacity of 7TB/s.

3.3 Network Optimization
- Developing the NCCLX communication library to improve network efficiency.
- Designing specific network topologies and load balancing strategies.

3.4 Training Challenges
- Experiencing 466 job interruptions during the 54-day training period, 419 of which were unexpected.
- Developing automated systems and specialized tools to handle hardware failures and network issues.
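The interruption counts above imply the following rates. This is plain arithmetic on the three reported numbers; the variable names are chosen only for this illustration.

```python
DAYS = 54
TOTAL_INTERRUPTIONS = 466
UNEXPECTED_INTERRUPTIONS = 419

planned = TOTAL_INTERRUPTIONS - UNEXPECTED_INTERRUPTIONS             # 47
interruptions_per_day = TOTAL_INTERRUPTIONS / DAYS
hours_between_unexpected = DAYS * 24 / UNEXPECTED_INTERRUPTIONS
unexpected_share = UNEXPECTED_INTERRUPTIONS / TOTAL_INTERRUPTIONS

print(f"planned interruptions: {planned}")
print(f"interruptions per day: {interruptions_per_day:.1f}")
print(f"hours between unexpected interruptions: {hours_between_unexpected:.1f}")
print(f"share unexpected: {unexpected_share:.0%}")
```

That works out to roughly 8.6 interruptions per day, with an unexpected one about every 3.1 hours, which explains why automated recovery tooling was a necessity rather than a convenience.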

4. Post-training and Alignment

Llama 3 adopts a multi-round iterative post-training process, including:
1. Supervised Fine-Tuning (SFT)
2. Direct Preference Optimization (DPO)
3. Reward model training: Using human feedback data
4. Safety alignment: Implementing multiple rounds of safety measures

This process not only improves the model's instruction-following capabilities but also enhances safety and specific abilities (such as coding and reasoning).
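For readers unfamiliar with Direct Preference Optimization, its loss for a single preference pair can be sketched as below. This is the standard published DPO objective, not code from Llama 3; the inputs are assumed to be summed log-probabilities of each response under the policy and a frozen reference model, and the default `beta=0.1` is an illustrative choice.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin)."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)); branch to avoid overflow for large |x|
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))

# The loss falls as the policy favours the chosen response more than the reference does.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
print(dpo_loss(-12.0, -10.0, -11.0, -11.0))
```

When the policy and the reference model agree exactly, the loss equals log 2; it drops below that as the policy shifts probability toward the preferred response.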

5. Multimodal Expansion

Although not officially released yet, Llama 3 demonstrates promising multimodal capabilities:
- Image recognition: Training independent image encoders, integrated with the language model through adapters.
- Video understanding: Adding video adapters based on image adapters.
- Speech processing: Independently training speech encoders, then aligning with the language model.

This modular approach allows flexible addition of new modalities while maintaining core language capabilities.

6. Performance Evaluation

Llama 3 performs excellently in multiple benchmark tests:
- MMLU (5-shot): 87.3%
- HumanEval (code generation): 89.0%
- GSM8K (math problems): 96.8%
- Long context tasks (like QuALITY): 95.2%
- MGSM (multilingual tasks): 91.6%

These results indicate that Llama 3 405B is comparable or close to GPT-4 and Claude 3.5 Sonnet on multiple tasks, particularly excelling in document understanding and long context tasks.

7. Safety Considerations

Meta highly prioritizes safety in the development of Llama 3:
- Implementing strict safety measures in both pre-training and post-training stages.
- Developing the Llama Guard system-level safety solution.
- Conducting extensive red team testing and risk assessments.

8. Open Source Impact and Future Directions

Meta's decision to publicly release the entire Llama 3 series, including the 405B parameter version, may have far-reaching impacts on the AI research community:
- Promoting open, responsible AI development.
- Accelerating AGI research progress.
- Providing researchers with opportunities to examine and improve large-scale language models.

Future development directions may include:
- Further improving multimodal integration.
- Expanding context length.
- Continuously enhancing data quality and model scale.

9. Conclusion

The development of Llama 3 demonstrates Meta's deep experience and forward-thinking in large-scale AI systems. By focusing on three key levers - data quality, computational scale, and complexity management - Llama 3 has reached or approached the current state-of-the-art level on several key benchmarks. Its open-source release may drive a wave of innovation across the entire AI field, paving the way for responsible AGI development.

Llama 3: Meta's AI Chef's Latest "Divine Delicacy"

Attention, all tech enthusiasts! The Michelin three-star AI chef Meta has just unveiled a new dish! This divine delicacy named "Llama 3" is not only spicy enough but will elevate your taste buds to new heights!

1. The Chef's Secret Weapon

Imagine Llama 3 as a super nanny who speaks 8 languages, writes code, does math, and can be your personal assistant. She can handle a kindergarten full of rambunctious kids (8B version), manage a mid-sized company (70B version), or even govern a small country (405B version)! This 405B big sister can remember 128,000 "gossips" (oh no, I mean context) simultaneously, essentially a walking encyclopedia + supercomputer!

2. Ingredient Selection: Only the Freshest!

Llama 3's chefs are masters at picking ingredients:

They "fished" 15 trillion words from the internet, more than 8 times the previous generation's catch!
Half of these words are everyday life seasonings, a quarter are math problems and brain teasers, nearly a fifth are programmer spells, and the rest are dialects learned from world travels.
They even invented a super weed remover, filtering out all the online garbage, repetitive, and unhealthy stuff.

3. Cooking Process: Three-Step Stir-Fry Method

Step 1: "Slow Simmer" - Start with a regular stove (8K context) to cook it halfway.
Step 2: "High Heat Stir-Fry" - Switch to a super stove (gradually increasing to 128K context), reducing the sauce to be thick and fragrant.
Step 3: "Low Heat Finish" - Finally, a gentle simmer with the best ingredients, the legendary "annealing" (even the chefs don't know why it's called that), bringing the flavor to its peak!

4. Kitchen Equipment: Top-of-the-Line Luxury Version

16,000 super high-power induction cookers (H100 GPUs) firing simultaneously!
A refrigerator that could fit half the Pacific Ocean (240PB storage)!
A proprietary ingredient prep system faster than 5G (NCCLX communication library)!

Imagine all these stoves firing at once, making the kitchen feel like a sauna. But our chefs persevered through the heat, changing chef uniforms 466 times in 54 days to whip up this dish!

5. Training Method: Both Cute and Well-Mannered

Being a good cook isn't enough; you've got to have manners too! So our chefs began a long "training" process:

First came a round of "gentle education" (supervised fine-tuning)
Then the "carrot and stick" tactic (direct preference optimization)
Finally, they invited moral role models (safety alignment) for guidance

After all this fuss, Llama 3 not only cooks well but also knows how to please people, program, do math, and mind her manners - a true decathlon champion!

6. Special Side Dishes: Showcasing Multiple Talents

Don't think Llama 3 can only cook; she's a multi-talented "goddess":

Storytelling from images? Piece of cake!
Writing movie reviews? No problem!
Recognizing songs and even singing a bit? The karaoke queen!

Although these "talents" are still in practice, they already show the potential of Li Bai's "from black hair to snow white in a day"!

7. A True Powerhouse: Dazzling Test Scores

Llama 3 participated in a series of "Top Chef Competitions," with eye-popping scores:

College Entrance Exam (MMLU): 87.3 points (out of 100)
Programmer Interview (HumanEval): 89 points (out of 100)
Math Olympiad (GSM8K): 96.8 points (out of 100)
Long Novel Reading Comprehension (QuALITY): 95.2 points (out of 100)

Bring this report card home, and even a "Tiger Mom" would be grinning from ear to ear!

8. Safety First: AI's "Security Captain"

Meta's chefs know well the principle of "don't leave guns and ammo lying around." They've assigned Llama 3 a 24/7 bodyguard team (Llama Guard) to prevent her from accidentally saying or doing the wrong thing. They even arrange occasional "moral exams" to ensure she doesn't turn into a "Terminator."

9. Open Source Feast: Everyone Can Be a Master Chef!

The most impressive part is that Meta decided to make the recipe for this "divine delicacy" completely public! It's like a Michelin three-star restaurant putting their signature dish's recipe online. Now anyone who wants to can whip it up at home! This move not only shocked other master chefs but also made countless food lovers cheer with joy!

10. Future Outlook: Reaching New Heights

Meta's chefs aren't resting on their laurels; they're already pondering the next "divine delicacy":

Maybe a dancing Llama 4?
Or a painting Llama 5?
Who knows, one day we might see a Llama 6 composing symphonies!

In short, the AI world's "Michelin" journey has only just begun!

Epilogue

The birth of Llama 3 not only elevates Meta's status in the AI world but also brings a fresh breeze to the entire AI research community. This bowl of "Llama soup" is not only delicious but also brings unlimited imagination to everyone. What will the future of AI be like? Let's wait and see what flavor the next "divine delicacy" will be!

Mingjie Li: Debriefing report

In support of Application for Chief Surgeon

Since professional journals and academic activities resumed in 1979 after the Cultural Revolution, I have published dozens of papers in journals such as Southern Anhui Medicine, Journal of Bengbu Medical College, Lectures of Provincial Medicine, Domestic Medicine (Surgery) and Jiaotong Medicine. In 1979 and 1980, I took part in preparing and re-founding the Anhui Orthopedic Society and Surgical Society respectively, and attended the annual meetings (sessions 1-6) of both societies. I also participated in many surgical academic activities held nationally and by the Ministry of Transportation.

In 1994, I was involved in planning and organizing a symposium on orthopedics in the Yangtze River Basin area, and helped compile a special issue, Orthopedic Clinic (Journal of Southern Anhui Medical College, Vol. 13 supplement, 1994), under the guidance of Professor Jingbin Xu, editor of the Chinese Journal of Orthopedics. The issue carried over 100 papers, with participants and contributions from all over the country.

In September 1995, I published two papers at the National Academic Conference on Acute and Severe Surgery (Guilin, 1995), one of which, "Problems in the Treatment of Liver Trauma" (0190), won an excellent paper certificate. I have also published papers at the First International Academic Conference of Chinese Naturopathy (Chengdu, 1991) and in Naturopathy (published in Taiwan Province).

1 Professional path and deputy chief physician performance

(On evolution of several theoretical problems in surgery)

1.1 In the early 1960s, a large number of patients suffered from acute volvulus, intestinal obstruction caused by Ascaris lumbricoides, and cholelithiasis. Performing a large number of operations on these cases consolidated my mastery of basic surgical skills. In addition, in treating toxic shock in late-stage cases, we practitioners followed an arduous, zigzag path from vasoconstriction to raise blood pressure to volume expansion and improvement of microcirculation, which proved to be an epoch-making change and advance both theoretically and clinically.

1.2 In Southern Anhui, there used to be a large number of patients with portal hypertension, hypersplenism and upper gastrointestinal bleeding resulting from late-stage schistosomiasis and post-hepatitis cirrhosis. The medical community went through a process of repeated debate and re-evaluation over the choice between shunt and devascularization. As early as 1975, I performed splenectomy with splenorenal vein anastomosis and various other shunts. However, shunts carried a high rate of postoperative embolism, reduced the blood supply to the liver, and easily induced hepatic encephalopathy. I therefore switched to various types of portal-azygous devascularization, and drew many lessons and improvements from treating this difficult problem.

1.3 Biliary lithiasis still troubles the surgical community. With the development of hepatobiliary surgery and improved monitoring methods, surgical procedures for the challenging problem of intrahepatic calculi are constantly being updated and improved. In 1980, I began performing regular resection of the left lateral lobe of the liver for this disease (a paper on five early cases was published at the Annual Meeting of the Provincial Surgery in 1980 and in Journal of Southern Anhui Medicine, 80, 13; 51, “Regular resection of the left outer lobe of the liver for the treatment of intrahepatic stones”). Also starting in 1980, I successively performed various types of choledocho-intestinal drainage (Finster, Longmire, Roux-en-Y, etc.). In 1992 and 1995, three cases were treated with intrahepatic bile duct incision, stone removal and plasty, and "basin" biliary and intestinal drainage (the first case was reported in “Communication Medicine”, 93,7; 91, “A case of hepatobiliary basin type biliary enteric drainage”). This work extended the operation to the treatment of intrahepatic lesions, leading to improved clinical efficacy.

1.4 In recent years, the incidence of acute pancreatitis has increased. All patients with severe pancreatitis in my department were cured through measures such as removal of the focus, pancreatic bed drainage, intraperitoneal lavage, inhibition of exocrine secretion with 5-Fu and somatostatin, anti-shock treatment and anti-infection treatment. In one recent case, a patient who developed stress ulcer bleeding after an operation at another hospital was rescued in my department.

1.5 On the basis of treatment and operation for various thyroid diseases, hyperthyroidism operation was performed after 1980, and two cases of radical thyroidectomy (neck-mimicking surgery) were performed in 1994. One case was re-operated due to recurrence 3 years after the initial surgery was performed in an external hospital. No further recurrence was observed during follow-up.

1.6 In addition, there are surgeries such as excision and anastomosis of cervical aneurysm, thymopharyngeal duct cyst, thyroglossal duct cyst and cystic hygroma resection, etc.

1.7 Over the past 30 years, more than 1,000 cases of breast cancer, gastric cancer, colon cancer and rectal cancer have been treated, and many of them have survived for a long time.

1.8 To prevent and treat short bowel syndrome after extensive intestinal resection, I used interposition of a distal reverse-peristaltic bowel loop; follow-up over 21 years showed no diarrhea or malnutrition. The related papers were published in the Journal of Bengbu Medical College (82; 7: 214, PEUTZ Syndrome) and Traffic Medicine (91; 1: 41, “Surgical treatment of short bowel syndrome”).

1.9 The management of duodenal injury is particular and complex, and retroperitoneal injury is especially prone to missed diagnosis and misdiagnosis. The prognosis of patients operated on more than 24 hours after injury is grim. In a 1994 case, following the principle of "rest transformation" of the duodenum, I performed a Berne-like operation 28 hours after injury, and recovery was smooth. My paper on this was published in Communication Medicine (“Experience in Diagnosis and Treatment of Closed Retroperitoneal Duodenal Injury”, by Mingjie Li).

1.10 Years ago I also began performing subdiaphragmatic total gastrectomy, jejunostomy, supradiaphragmatic esophagectomy, thoracic esophagogastrostomy, lobectomy, mediastinal thymoma removal, diaphragmatic hernia repair, etc.

2. Work involving various medicine disciplines

The two hospitals I have served are both grass-roots primary hospitals. The "major surgery" department covers general surgery, orthopedics, urology, chest surgery, obstetrics and gynecology, ophthalmology and otorhinolaryngology, anesthesia, radiology, laboratory testing and other related work. As the professional subject leader, I have long been engaged in all of the above areas, as outlined below.

2.1 Orthopedics is one of my key areas, second only to general surgery. I have performed all major surgeries in this area and participated in academic activities at all levels, including the publication of numerous papers, professional talks and the compilation of a special issue on orthopedics. My representative operations for bone injury and bone disease include closed nailing of the femoral neck (see Orthopedics Clinical 1994, 13:37, Closed nailing treatment of femoral neck fracture in 45 cases), surgery for surgical paraplegia (see Anhui Province Medical Lectures 1982; 4:21, Surgical paraplegia analysis of 14 cases), spinal tuberculosis surgery (see Spinal tuberculosis a surgical therapy, in Proceedings of the First Provincial Orthopedic Annual Conference, 1979), lumbar disc surgery, spinal cord tumor enucleation, bone tumor removal and orthopedic surgery, etc.

2.2 Urological surgery: nephrectomy, stripping of renal pedicle lymph nodes, removal of calculi from various segments of the ureter, realignment repair of urethral trauma, ureteral transplantation, vasovasostomy, spermatic vein–inferior epigastric vein anastomosis, hypospadias repair, radical resection of bladder cancer and penile cancer, etc.

2.3 Gynaecology and obstetrics: I founded the department of obstetrics and gynecology of our hospital, and have performed Cesarean section (lower segment and extraperitoneal), hysterectomy (abdominal and vaginal), oophorectomy, repair of vesicovaginal fistula and cervical cancer resection, etc.

2.4 Ophthalmology and otorhinolaryngology: parotid gland, tonsil, maxillary sinus, mastoid, cataract, artificial pupil, enucleation, nasolacrimal duct anastomosis, strabismus correction, etc.

2.5 Anesthesiology: various segments of epidural block, cervical plexus block, brachial plexus block, intubation general anesthesia and intravenous compound anesthesia, etc.

2.6 Radiology: I founded the department of radiology in 1960, and concurrently served as head of the department for 2 years (1960-1962). I am very familiar with its routine work and related angiography.

The environment shapes people. The wide range of problems encountered during long-term work in grass-roots hospitals led me to work across many disciplines. The knowledge and skills of these related areas complement each other, contributing to and deepening my surgical expertise. The various Level-4 and Level-5 surgeries I have performed have kept me at the forefront of contemporary surgery.

3 Continuous innovations and some experience to share

Over the past 40 years, with rapid technological development, diagnosis and monitoring methods have been constantly updated. As social life changes, diseases change too. In an aging society, geriatrics takes a prominent position. Many factors cause clinical work to evolve as well. This requires physicians to constantly seek out scientific and technological information, learn from the experience of others, study hard and have the courage to innovate, in order to improve the quality of service for our patients.

3.1 Improvement and innovation

3.1.1 The key to controlling traumatic infection is complete debridement at the first visit, rather than reliance on drainage and antibiotics. The technique involves washing with a large quantity of water, removing foreign objects and devitalized tissue, disinfecting, and no suturing. If a postoperative inflammatory reaction occurs, apply a local wet compress with alcohol, with or without antibiotics. Following this strategy, wounds operated on within 6 hours of trauma are almost completely free from infection.

3.1.2 Over the past 30 years, based on the experience of over 1,000 cases of gastrectomy I have performed, the preset gastric tube has basically been abandoned except for special needs, and there were no cases of failure. This requires excellent anastomosis, perfect hemostasis, intraoperative emptying of the residual stomach, and attentive postoperative monitoring.

3.1.3 For extensive peritonitis, after the nidus and infectious substances are removed, abdominal cavity drainage can be abandoned to reduce postoperative adhesion. The key for this to work is to wash it thoroughly during the operation. As the drainage is quickly blocked by fibrin glue in the abdominal cavity and soon stops working, it only increases the pain of the patient. To be sure, however, in cases such as pancreatitis, abdominal abscess, etc., if continuous overflow is expected, double-cannula negative pressure drainage is still required.

3.1.4 For any surgery, regardless of scale, its success or failure makes a big difference to the health and safety of patients. As a surgery practitioner, I attach importance to the technical improvement of each and every "small" surgery. Some of my technical innovations and experience are outlined below.

For inguinal hernia repair, the focus is the transverse abdominal fascia. The traditional Bassini method should be replaced by the modified Madden procedure, which greatly reduces the pain caused by postoperative suture tension, is conducive to healing, and greatly reduces the recurrence rate.

For circumcision, the conventional procedure has plagued both doctors and patients with poor alignment of the inner and outer plates, hematoma, edema, and difficulty in removing stitches. I modified the procedure, using local venous anesthesia under a tourniquet to allow neat cutting with perfect hemostasis, followed by careful suturing with human hair or absorbable thread. The benefits include no pain during the operation, good alignment, fast healing, and no need to remove stitches (see my paper in Jiaotong Medicine 90; 4(3): 66, Several improvements of circumcision).

For anal fistula, both seton therapy and open resection cause postoperative pain and a long recovery period. I used long-acting anesthesia (local injection of diluted methylene blue) to allow primary resection and suture. Most cases treated this way achieve primary healing, with the course of treatment greatly shortened.

3.2 Some General Experiences

Based on what I have learned from 40 years of hands-on surgical practice, I feel that to be a qualified surgeon, one needs not only to consolidate and continuously update basic knowledge, but also to work meticulously with a high sense of responsibility, supported by logical thinking and a practical, orderly working style. It is very difficult to simply follow a unified norm or standard procedure when real-world surgery involves so many factors to be weighed: the ever-changing condition of the patient, physical differences, the positive and negative effects of drugs, the advantages and disadvantages of the techniques under consideration, the reserve function of body organs, the length of the course of the disease, and even the natural environment and mental and material conditions. One must be highly adaptable. It is no exaggeration to say that this ability to adapt determines a surgeon’s level of diagnosis and treatment and the clinical results.

3.2.1 The entire process on the operating table involves a struggle between personal reputation and the interests of patients. The so-called principle of "safety first, and draw the line accordingly" is often not feasible in practice. A competent physician must have the courage to take risks for his patients. A patient's chance of rescue can often be missed because of a small lapse in one's thinking. I have countless memories of such incidents, one of which is as follows. In one patient's fifth biliary tract operation, cavernous blood vessels caused by portal hypertension due to biliary cirrhosis were distributed all over the hepatic hilus, and the tissue was thickened by inflammation. After struggling through 8 full hours of operation, I finally managed to open the bile duct and save the patient's life. This was a victory of perseverance.

3.2.2 Adapt measures to real-world conditions, and keep an open mind about breaking routine to save a patient. In the countryside, the key to saving lives in cases of liver and spleen trauma or massive hemorrhage from ectopic pregnancy lies in vigorous autotransfusion of the blood in the abdominal cavity. To wait for a blood supply in these scenarios means to wait for death. I remember a case of liver trauma in which 1700 ml of blood from the liver injury was transfused back on site to support a successful operation. (See the paper Related issues in the treatment of liver trauma (review), in Proceedings of the National Academic Conference on Acute and Major Surgery, 95; 190.)

3.2.3 For difficult or new operations, one must build up the relevant knowledge and operating skills beforehand by reviewing the literature, consulting experienced experts, and observing such operations in person, in order to minimize potential oversights or accidents. In my first case of hepatobiliary basin-type internal drainage, I asked for direct guidance from a professor of surgery. I completed the subsequent two cases successfully on my own.

Looking back on my 40-year career in surgery, I deeply feel that clinical surgery is a combination of science, perseverance, determination, and a sense of responsibility. It is like a small boat rising and falling on the crest of the waves. Walking on thin ice, one can hit hidden rocks at any time. The hardships and risks of our career are among the highest in all trades. Fortunately, I have not failed society. Along the journey there have been countless joys of success, together with many sleepless nights and moments of panic. For the remaining years of my career, I am determined to maintain the service spirit of "healing the wounded and rescuing the dying" and to complete the journey to the end.

Appendix 1, Publications
Appendix 2, Relevant Materials and Records of Level III and Level IV surgeries

In Commemoration of Mingjie Li’s 66 Years of Medical Practice

Mingjie Li: My career as surgeon

I: Career memoirs

Before writing my debriefing report in support of my application for Chief Surgeon, let me start with three unforgettable orthopedic cases that I experienced in my medical practice.

In 1970, my old schoolmate and close friend from junior high school, Mr. Gui from Fanchang No.1 Middle School at that time, brought his son’s case to my attention. His son, aged 16 then, suffered from cervical vertebra 5 tuberculosis with cold abscess, which severely oppressed esophagus and trachea. He was unable to eat, and had difficulty breathing, with hoarseness, dehydration and hypoxia, in a critical condition.

They had visited Yijishan Hospital, the largest hospital in Wuhu, but the director there Dr. Chen of the Department of Orthopaedics could not admit this case, saying that a few days before, a similar case, died during the operation. He made the suggestion for the patient to be sent to the provincial Hospital of Hefei, which required 800 yuan then. However, Mr. Gui’s monthly salary was only 52 yuan, and he had to support a family of six with this income. How could he afford it? Besides, nobody knows whether the chief hospital in Hefei could treat him. In a hurry, Mr. Gui turned to the No. 127 Army’s Hospital located in the suburb of my town Nanling, to try their luck there. The corresponding department of the hospital was administered by Dr. Xu Jingbin, the nation-wide orthopedic authority, and this military hospital located in a small place long had a tradition of helping the poor. Unfortunately, Dr. Xu was on a business trip to Nanjing, and several of his subordinates there were too afraid to accept this high-risk patient.

Feeling helpless, Mr. Gui came to me in Nanling County Hospital (the two hospitals are only 5 miles apart) to discuss possible rescue plans with me. I was not sure about how best to treat this condition either. However, I had studied in No. 127 Hospital, with Dr. Xu as my supervisor, familiar with the personnel there. I immediately called an ambulance. We went back to No.127 Hospital, found doctors in orthopedics and surgery, and asked them to work together for the treatment of this urgent case. Mr. Gui as patient’s family and I jointly signed the required paper for willing to take the risk of the operation, and discussed the detailed rules. However, this plan was still not approved by the hospital. Instead, the hospital asked me to help them out of this embarrassing predicament, and promised a free car to be used for transferring the patient to big city hospitals in Hefei or Nanjing. The patient's life was in danger at any time. Far water cannot put out the near fire, so it's not advisable to transfer to hospital far away.

I decided to take on the challenge myself. At that time, I thought that at the very least I could drain the pus to save his life first, relieving the compression of the esophagus and trachea and making it possible for him to eat and breathe. So the patient was brought back to the county hospital where I worked. Before he was even off the stretcher, I ordered fluid replacement and antituberculosis medication. By then it was evening, and Mr. Gui had not eaten all day, so he was given dinner at my home. I could not spare the time to have dinner myself; I used it to review the related literature and anatomy. Half an hour later, the patient was taken to the operating room, where the operation was performed under local anesthesia. After careful dissection, I opened the pus cavity and released a large amount of pus. The patient could immediately make sounds, sip water, and breathe smoothly, which indicated that he was finally out of immediate danger.

The operation continued. Exposing the lesion at the fifth cervical vertebra through an anterior approach, I removed the dead bone, scraped off the tuberculous granulation tissue, flushed the pus cavity, placed streptomycin and isoniazid in it, inserted a drain, and sutured the wound. The operation went smoothly and was very effective. The fever came down 3 days after the operation. The patient went to get a haircut, ate normally, and recovered well. Twelve days after the operation, he was discharged from the hospital, and his medical expenses were 32 yuan. He continued anti-tuberculosis treatment for half a year and recovered well. For more than 40 years now, the patient has been working and living normally, and he now enjoys a large family of children and grandchildren.

In addition to the complicated anatomy of the neck, with its dense blood vessels, nerves, thyroid gland, trachea, esophagus, and so on, this type of cervical tuberculosis debridement is highly difficult because of the fragility of the cervical spine and the destruction caused by tuberculosis. The slightest injury to the cervical spinal cord can lead to high-level paraplegia or even death. It is a level 4, high-risk orthopedic operation. Even in big hospitals, the directors are extremely cautious in treating such cases. I was still a newcomer to orthopaedics then, but I needed to save a life, knowing that transferring the patient to another hospital at that time was basically a dead end. The patient was on the edge of an abyss. But I also had my own strengths and preparation behind this success. I had many years of experience in thyroid surgery of the neck, was familiar with the anatomy, and had accumulated specialized knowledge in orthopedics. This solid foundation finally enabled me to successfully solve this rare problem in a grass-roots hospital. The life-threatening symptoms were relieved immediately by decompression, and the disease was cured with the lesion eradicated. It proved to be a lifelong cure.

Another case came at the end of the 1980s. Xiao Wei, a 14-year-old junior student at Wuhu No. 1 Middle School, suffered from a tumor of the right humeral neck. He had undergone two operations, at Yijishan Hospital and Shanghai Zhongshan Hospital respectively. Now the disease had struck the right scapula. The director of orthopaedics at a hospital in our city said that it was a malignant tumor recurring and metastasizing, that amputation was necessary, and that it would be hard to save his life. The family was desperate. The patient’s grandfather, Mr. Wu, had been my junior middle school teacher. Mr. Wu knew how I had successfully treated the cervical tuberculosis of Mr. Gui’s son, so he came to me for consultation. I carefully examined the medical records and the earlier and later X-ray films, and diagnosed it as a new critical tumor, neither a recurrence nor a metastasis of the original disease. I personally performed a half-excision of the right scapula in my own hospital, and he made a full recovery. More than 20 years have passed, and Xiao Wei has enjoyed good health ever since. He later became Dr. Yang in the West and is now a high-end international talent in his field. From time to time, he and his father still come to visit me in appreciation.

The third case came in the fall of 1975. A 35-year-old female patient, who had lost 40 kilograms, was admitted to our hospital for tuberculosis of the sixth and seventh thoracic vertebrae with paraplegia. Under general anesthesia, the lesion was cleared through the chest, and the dead bone and the necrotic intervertebral disc were removed. The tuberculous granulation tissue in the spinal canal was 8 cm long and pressed on the thoracic spinal cord, causing spinal canal obstruction and paraplegia. After curettage, this segment of the spinal cord could be seen pulsating again. The lesion area was thoroughly washed and antituberculosis drugs were placed in it. The ribs cut during thoracotomy were trimmed and embedded in the intervertebral defect, so the anterior bone graft was completed in one stage. After the operation, the patient recovered well and was cured. The patient’s husband was a blacksmith, who gave me a stainless steel kitchen knife and a spatula of his own making, which are still in use in my home today. In orthopedic surgery, this operation belongs to the top category, level 4. For thoracic tuberculosis complicated by paraplegia, achieving a cure through one-stage lesion clearance and bone grafting via the anterior thoracic approach was at the very peak of what county-level hospitals could do.

Such cases have brought me a great sense of pride and accomplishment, and they are the motivation for my lifelong dedication to saving lives and relieving the pain of my countless patients.

In Commemoration of Mingjie Li’s 66 Years of Medical Practice

Collected Works in Commemoration of Mingjie Li’s 66 Years of Medical Practice

Dr. Mingjie Li has been practicing medicine for over 60 years. This collection, compiled to commemorate his amazing career, includes three sections: (i) career memoirs, (ii) medicine papers, and (iii) medicine education. The publication of his medicine papers is the culmination of his extensive experience and expertise in the field. His work has been recognized by his peers for its professional value and rigorous style. In addition to surgery, orthopedics, obstetrics, and gynecology, his work at times also incorporates elements of traditional Chinese medicine. The "Operation Records" section in the appendix provides detailed descriptions of operation procedures and emergency measures, making it a valuable reference for professionals in the field. The "Education Section" highlights Dr. Li's practical experiences and medical training materials he compiled, providing valuable insights into a range of clinical topics. Overall, this collection serves as a testament to Dr. Li's impressive career and contributions to the field of medicine.

August 2023, Wuhu, Anhui, China

[Collected Papers from Mingjie Li's 67 Years of Medical Practice (Electronic Edition)]

Table of contents

I: Career memoirs

My career as surgeon

Debriefing report

Service beyond my hospital

Career Path and self review

Dad's medical career

II: medicine papers

Regular resection of left lateral lobe of liver for intrahepatic calculi

PEUTZ syndrome

Surgical management study of hepatic injury

Surgical treatment of acute gastroduodenal perforation

Diagnosis and treatment of closed retroperitoneal duodenal injury

Surgical treatment of short bowel syndrome

Hepatobiliary basin type biliary-enteric drainage

Biliary enteric drainage

Several special problems in diagnosis and treatment of biliary tract surgery

Diagnosis and treatment of closed duodenal retroperitoneal injury

Misdiagnosis of subacute perforated peritonitis in gastric malignant lymphoma

Adult retroperitoneal teratoma infection complicated with chronic purulent fistula

Lighter foreign body in stomach

Primary repair of congenital omphalocele

Recurrent stones in common bile duct with suture as core

A case of plastic tube foreign body in bladder

Abdominal trauma

Subcutaneous heterotopic pancreas of abdominal wall

Several improvement measures of circumcision

Clinical observation of a new minimally invasive circumcision

A surgical treatment of spinal tuberculosis

Transpedicular tuberculosis complicated with paraplegia

Surgical analysis of surgical paraplegia

Lipoma under soft spinal membrane complicated with high paraplegia

Treatment of femoral neck fracture with closed nailing

Fifth metatarsal fracture caused by varus sprain

Intervertebral disc excision in community health centers

In commemoration of the 50th anniversary of Dr. Xu Jingbin's medical career

Intrauterine abortion combined with tubal pregnancy rupture

Rivanol induction of labour by amnion cavity injection

Extraperitoneal cesarean section

Prevention and treatment of trichomonas vaginalis and mold infection

Non-operative treatment of senile cholelithiasis with integrated traditional Chinese medicine

Treatment of acute soft tissue injury with moxibustion

Treatment of scapulohumeral periarthritis with acupuncture combined with warm moxibustion

III: medicine education

Level 4 Surgery

New concept of modern surgical blood transfusion

Extrahepatic biliary injuries

Surgical treatment of thyroid cancer

Indications of splenectomy and effects on body after splenectomy

Treatment of carcinoma of pancreas head and carcinoma of ampulla

Treatment of cardiac cancer

Treatment of recurrent ulcer after subtotal gastrectomy

Treatment points of radical resection of colon cancer

Medicine Lecture Notes

Interview 1/10: Critique of Chomsky's Formal Language Theory

Q: Lao Li, I have been paying close attention to your academic track. I deeply admire your more than 30 years of in-depth study of symbolic logic in the field of natural language understanding, with your own unique innovations. On your NLP Channel, I notice that you've been critical of Chomsky. Chomsky is the representative figure of the rationalist school. Like many others, I admire Chomsky. As far as I know, you are also a rationalist. So why do you, as a linguist who practices rationalism, criticize Chomsky?

A: First of all, although I have criticized Chomsky, pointing out his theoretical issues and objective misguidance in the field, these are "criticisms within the school". There is no doubt that Chomsky is the father of computational linguistics and the banner of rationalism in the field of artificial intelligence. His theory of formal language is the cornerstone of computational linguistics. All of us computational grammarians, as practitioners of the symbolic logic of rationalism in language, are his disciples. When we criticize him, we still use his formal mechanism as the frame of reference.

From the perspective of language formalization, Chomsky, who has a deep mathematical background, brought mathematical rigor into the formal study of language. At least in terms of formalism, Chomsky unified human language and computer language in a highly abstract symbolic system that no one else could have dreamed of reaching. Without Chomsky's formal language theory, computer science could not have developed high-level languages, and all the achievements of the information industry would be unimaginable.

On the other hand, it can be said that Chomsky's negative impact on the field is as big as his revolutionary contribution to linguistics and computer science. His formal language hierarchy is a theory of pure genius, which lays the foundation of language formalization. This formalism has become the theoretical basis of high-level computer languages and their compiling algorithms. It is at its best as a guide for creating, parsing, and compiling computer languages. However, perfection is sometimes only one step from fallacy. Chomsky criticizes the finite state machine as unsuitable for modeling natural languages because it lacks a recursion mechanism. Too many people have been misled by this into adopting the so-called "more powerful" context-free mechanism.

Such an intelligent and powerful figure, if he misleads, can impact an entire generation. The generation that was affected was that of my direct supervisors and predecessors when I entered this field (in the 1970s and 1980s). Their work in natural language understanding consisted almost exclusively of toy systems confined to labs, difficult to scale up and demonstrate in practical applications. This directly led to the rebellion of the next generation. This is a well-known piece of history in artificial intelligence: the famous competition between the rationalist symbolic school and the empirical statistical school, with long struggles between the two paths. The rationalists of the old generation were at a disadvantage in the competition and gradually withdrew from the mainstream stage.

All the advance of the statistical school over the last 30 years has been a practical critique of Chomsky because almost all of these models are based on finite state models, which he repeatedly criticized as inappropriate for natural language. The context-free grammar he advocates has achieved limited success in the field of natural language.

Q: Now that everyone is advocating neural networks and machine learning, is there still room for the symbolic rule school? Rationalism has lost its voice and visibility in the natural language community. What do you think of the history and current situation of the two?

A: Well, machine learning has been on the rise in natural language processing for about 30 years, with the rapid development of data and computing resources. Especially in recent years, deep neural networks have achieved breakthrough successes in learning. The success of empiricism, in addition to the innovation in neural network algorithms, also benefits from the availability of unimaginably big data and big computing power today. In contrast, the rationalist school of symbolic logic, due to its impracticality, gradually withdrew from the mainstream stage of academia after a brief upsurge of unification-based phrase structure grammars about 20 years ago. There are several reasons for this situation, including Chomsky's long-term negative influence on computational grammars, which deserves serious reflection.

Looking back at the history of artificial intelligence and natural language, the pendulum between empiricism and rationalism has swung back and forth, but empiricism has been on the rise for the last 30 years (see the red dot in figure 1). In his article "Pendulum Swung Too Far", Professor Church predicted and called for the resurgence of rationalism and presented the illustration below:

At present, due to the breakthrough of deep learning, empiricism is still in the limelight. Although rationalism has been accumulating efforts by itself for many years, it has not yet reached the tipping point where it can compete, head-on, with empiricism. When one school becomes mainstream, the other naturally fades out of sight.

Q: I have a feeling that there is some confusion in the community and outside the community at large. Deep learning, which is a method of empiricism, now seems to be regarded by many people as equivalent to artificial intelligence and natural language processing. If the revolution in deep learning sweeps through all aspects of artificial intelligence, will it end the pendulum swing of rationalism? As professor Church says, the pendulum of empiricism has swung too far, but it looks far from falling back.

A: My definite answer is no. These are two different philosophical bases and methodologies, each with its own natural advantages and disadvantages. Although there are reasons for the status quo of the existing one-sided empiricism in the current academic world, it is not a healthy state. In fact, both schools are competitive on one hand and also highly complementary on the other hand. Some older generation mainstream pioneers like Church have been warning about the disadvantages of one-sidedness in empiricism, and some new scholars in deep learning have been exploring the integration of the two methodologies to solve the problems of natural language.

Yes, much of the current surge in AI is based on breakthrough performance from deep learning, especially in the areas of image recognition, speech processing, and machine translation, where AI systems have reached or exceeded human quality. This is indeed an unprecedented and amazing achievement. However, a fundamental limitation remains for deep learning, as for all the other currently successful empirical methods: the dependence on massive annotated data, what we call the knowledge bottleneck. The reality is that in many fields and application scenarios, such as natural language parsing or machine translation of e-commerce data, massive annotated data or domain translation data do not exist. This knowledge bottleneck severely limits the performance of the empiricist school in natural language understanding and other fine-grained cognitive tasks. There is simply not enough annotated data in many sub-fields, and without it, learning is like trying to make bricks without straw. This is especially true for deep learning, which has a much larger, almost insatiable, appetite for data than traditional machine learning.

Q: So it seems that deep learning is not a cure-all. Rationalism has its place. You said the two schools have their respective strengths and weaknesses. Can you compare and contrast them? Why are they complementary?

A: Let me summarise the merits and demerits of the two for a serious contrast.

The advantages of empirical statistical models include: (1) good at coarse-grained tasks, typically document classification: for such tasks, statistical learning is naturally better at drawing an overall conclusion; (2) robustness; (3) high recall: due to the lack of structures and understanding, many tasks might face a ceiling for accuracy, but recall-wise, learning usually performs well; (4) development efficiency: it can quickly scale to a real application scenario of big data.

The main limitations of the statistical school are: (1) the dependence on massive annotated data: this is the biggest knowledge bottleneck; (2) it is difficult to make targeted debugging: the statistical system is more like a black box, a big defect for maintenance and iterative incremental enhancement of a software system; (3) lack of interpretability: whether the result is right or wrong, it is difficult to explain, which affects the user experience and confidence. The main reason is the lack of explicit structural representation and symbolic logic in the algorithm that people can follow.

The rationalist approach simulates human cognitive processes without relying on massive labeled data to imitate surface strings. Rationalism directly formalizes the experience of domain experts and uses an explicit rule system based on symbolic logic to simulate human intelligence tasks. In terms of natural language understanding, the grammar school formalizes the rules summarized by linguists so as to parse natural language in detail at all levels and achieve deep syntactic-semantic analysis. In this respect, rationalism has its natural advantages.

To sum up, the advantages of the rationalist rule-based school include: (1) good at fine-grained tasks: very detailed analysis, such as deep syntactic-semantic parsing with logical reasoning; (2) accuracy: a rule system written by experts can readily guarantee high accuracy, but the improvement of recall is usually a long iterative process; (3) debuggable in error correction: the basis of the rule system is symbolic logic, which makes it easier to trace an error to its root in debugging; (4) interpretable: this also benefits from the understandable symbolic logic basis.

The main defect of the rule school is the low efficiency of manual coding; the dependence on expert coding is the knowledge bottleneck of the rule school. Supported by the same platform and mechanism, different levels of expertise determine different levels of quality. The two paths have their own knowledge bottlenecks, so to speak. One relies on a large quantity of "low-level" labor: labeling, though very monotonous, is work that can be assigned to ordinary students with a little training. The other relies on "high-level" labor from a few experts who code and debug rules, much as in software engineering; the training costs for knowledge engineers are high, making it more difficult to scale up to the real world. Finally, the talent gap can also be regarded as a severe practical limitation of the rationalist school. Thirty years is exactly one generation, during which empiricism has occupied the mainstream stage and attracted almost all newcomers, causing a generational shortage of talent in the rationalist camp.

As for the recall, it cannot be simply concluded that high precision is bound to have a low recall rate for rule systems. The actual situation is that, on the one hand, it is not at all difficult to achieve a balance between precision and recall, by deliberately relaxing rule conditions and sacrificing accuracy. On the other hand, while high precision can also be maintained, the more rules added to the system, the more phenomena will be captured, hence the recall rate will come up naturally and incrementally in the iterations as time moves on. In other words, recall is a function of time and development resources put in, without having to compromise precision.
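The relationship described here can be illustrated with a toy calculation. The following is only a sketch under invented assumptions: the "gold" items and the outputs of each rule iteration are made-up integer IDs, not data from any real system, and `precision_recall` is a hypothetical helper name.

```python
def precision_recall(predicted: set, gold: set) -> tuple:
    """Return (precision, recall) of a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

# Hypothetical gold standard: ten phenomena the system should capture.
GOLD = set(range(1, 11))

# Hypothetical strict rule set, iteration 1: few rules, all of them correct.
STRICT_V1 = {1, 2, 3}
# Iteration 2: more strict rules added, still no false positives.
STRICT_V2 = STRICT_V1 | {4, 5, 6}
# Hypothetical relaxed rule set: looser conditions capture more,
# but also let in two wrong items (11 and 12).
RELAXED = {1, 2, 3, 4, 5, 6, 7, 8, 11, 12}

for name, output in [("strict v1", STRICT_V1),
                     ("strict v2", STRICT_V2),
                     ("relaxed", RELAXED)]:
    p, r = precision_recall(output, GOLD)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this made-up setting, adding strict rules raises recall while precision stays at 1.0, whereas relaxing conditions trades some precision for recall.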

Q: Since each has its own strengths, why doesn't Chomsky, as the rationalist pioneer and father of computational linguistics, exert his due influence in the field of natural language processing? His impact has been waning, and newcomers to the field have hardly heard of him.

A: Indeed it is. Although I am a rationalist, I also see that there is a considerable historical burden from this school that needs to be seriously reflected on from the perspective of formalism architecture.

Chomsky is the founder of modern rationalism, but the theory and practice he developed also involve some misconceptions. We must recognize these so that we can move forward the linguistic rationalism in symbolic logic steadily and deeply for natural language. In fact, after decades of theoretical exploration and practical experiments, the grammar school has seen fairly clearly its own theoretical limitations. Those who stick to the symbolic rule systems have broken through the path of innovation in the inheritance of rationalism, and have made their own breakthrough in deep parsing, the very core of natural language understanding, and in its scale up to big data for real-life information extraction and text mining applications. That's what we're going to focus on in this series of interviews.

Q: I know you have great faith in rationalist symbolic approaches in general. However, you have also seen a number of misconceptions in Chomsky's theories. Which are the most critical?

A: In his formal language theory, there are two fallacies to my mind: one I would name the Recursion Fallacy and the other the Monolayer Fallacy. In his linguistic theories, one of the very basic propositions of his linguistic revolution is "syntactic autonomy" or "self-contained syntax". It involves serious potential consequences in the analysis of certain languages such as Chinese. His phrase structure grammar tree representation, with his X-bar theory in syntax, is also worthy of reflection and criticism, especially when compared with the alternative dependency grammar and its representations for NLU. Let's look at the Recursion Fallacy first.

In my view, Chomsky's most misleading move was to use the so-called recursive nature of natural language to criticize finite-state pattern matching. The English examples of center recursion he cites are far-fetched and rare in real life, making it difficult to argue that center recursion is the nature of natural language. Nevertheless, a generation still chose to believe in his theory, taking it for granted that finite states had to be abandoned in order to parse natural language.

Q: Isn't it generally accepted that natural language is recursive? How can you call it a fallacy?

A: Exactly because it is widely accepted, it is all the more misleading in its nature and consequences, and hence requires more serious critique.

Recursion in natural languages typically comes in two types: (i) right (branching) recursion and (ii) center recursion. Many people don't consciously make that distinction, but in computational theory, they are two very different things. Right recursion is linear by nature while center recursion is nonlinear, a completely different monster, of much more computational complexity. In natural languages, right recursion is fairly common and can at times be as many as seven or eight levels nested, which still reads natural and easily comprehensible. For example, the VP nesting example:

(to request A (to beg B (to ask C (to do something))))

For right-branching recursive structures, we usually do not feel a burden in communication. The reason is that, although the left boundaries of the right-recursive structures are in uncertain positions, they all end at the same point on the right boundary, like this: (... (... (... (... (...... ))))). Thus, we do not need a "stack" mechanism in memory to deal with it; it remains finite-state.
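To make the finite-state point concrete, here is a minimal sketch that recognizes a toy version of the VP nesting example with an ordinary regular expression, with no stack involved. The pattern, the fixed ending "to do something", and the function names are assumptions made for illustration only; they are not a real grammar of English.

```python
import re

# Hypothetical toy pattern: each "to VERB OBJECT" clause opens one more level of
# right-branching nesting; the final "to do something" closes every level at once.
# It uses only concatenation and repetition, so it describes a regular language.
RIGHT_NESTED_VP = re.compile(r"(?:to \w+ \w+ )*to do something")

def is_right_nested_vp(text: str) -> bool:
    """True if text is a chain of 'to VERB OBJ' clauses ending in 'to do something'."""
    return RIGHT_NESTED_VP.fullmatch(text) is not None

def nesting_depth(text: str) -> int:
    """Count nesting levels in the toy pattern by counting the word 'to'."""
    return len(re.findall(r"\bto\b", text))

example = "to request A to beg B to ask C to do something"
print(is_right_nested_vp(example), nesting_depth(example))
```

The same pattern accepts seven or eight levels just as easily, because no opening boundary ever has to be remembered in order to find its matching close.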

Chomsky cannot criticize finite-state devices with right recursion, so he needs to base his argument on center recursion, a rarity in language. The fact is that natural languages show little center recursion. Center recursion is much like matching parentheses. You want the parentheses to match each other so you can express and understand the proper nesting structures, like this: { ... [ ... ( ...... ) ... ]... }. After as many as three levels of center recursion, our brain can no longer cope with the pairing complexity, which is why it's hard to find such phenomena in real-life language data.
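The parenthesis analogy can be sketched in a few lines. In the hypothetical helper below, every opened bracket has to be remembered until its partner arrives, so the memory needed grows with the nesting depth; this is the "stack" that right recursion does not require. The function name and the bracket inventory are assumptions for illustration.

```python
# Closing bracket -> the opening bracket it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}

def max_center_depth(text: str) -> int:
    """Return the deepest bracket nesting in text; raise ValueError if unbalanced."""
    stack = []      # remembers every bracket that is still open
    deepest = 0
    for ch in text:
        if ch in PAIRS.values():
            stack.append(ch)
            deepest = max(deepest, len(stack))
        elif ch in PAIRS:
            if not stack or stack.pop() != PAIRS[ch]:
                raise ValueError(f"unmatched {ch!r}")
    if stack:
        raise ValueError("unclosed bracket")
    return deepest

print(max_center_depth("{ ... [ ... ( ...... ) ... ] ... }"))
```

For unbounded nesting the stack can grow without limit; if the depth is capped at a small constant, the possible stack contents form a finite set, and the device is again finite-state.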

Q: I remember some examples of center recursion in English:

The man who the woman who had lost all the keys was calling all day finally came...

A: Is this "human" language? Chomsky repeatedly attempts to teach us that not only is this human speech, but it is the very nature of human language. To my mind, hardly any hypothesis about language is as far-fetched as this.

Q: Let me try to understand what you mean: center recursion does not exist, or does not exist over three levels, so natural language is finite-state?

A: Well, it's not that it does not exist; it's just so rare and far-fetched, and it's never more than three levels deep unless you're pulling a prank. Therefore, it can by no means be the "nature" of natural language.

The very idea of unbounded center recursion in language, far from the observable facts, in effect violates the limits of short-term memory established in psychology. Where in the world do people talk like that, opening door after door without closing any behind them, in a maze-like castle with nested substructures inside substructures? After a path of three open doors, an average person will get lost in the maze. Even if you're a super linguist and you can stand it, your audience is bound to be trapped. Is natural language meant not for communication, but for deliberately making it difficult for people to follow you? This is not in accordance with the consensus that language is born for communication and serves the ultimate purpose of communication.

Using pranks and verbal games as evidence of linguistic competence and the nature of language is one of the most misleading aspects of Chomsky's recursion theory. This recursion trap leads many people to automatically accept that natural language is recursive and that we must therefore discard the idea of finite states. The people who believe in him are, on the one hand, influenced by his authority as the father of modern linguistics; on the other hand, they often mistake the more common and deeper right recursion for center recursion and take it as evidence in support of Chomsky's recursion hypothesis. Chomsky himself is too intelligent and rigorous to use the readily available right recursion as evidence; he uses only center recursion as an argument. But he is in effect misleading.

Q: I guess this is a typical behavior of mathematicians and philosophers: they pursue formal perfection. As long as it is theoretically impossible to exclude multi-level center recursion, it is required that the formal mechanism must have a built-in recursion mechanism. But practitioners of natural language understanding do not have to be bound by that theory, do they?

A: After all, the foothold of the theory should be real-life natural language objects and data, right?

In fact, in the research of corpus linguistics, some scholars have conducted a very extensive survey and found that the so-called center recursion in natural language never exceeds three levels, and the occurrence of three-level recursion is extremely rare [reference]. The phenomenon of natural center recursion beyond three levels is simply not found in a very large running corpus, not a single case found. So why boil a very limited center loop down to what seems like an infinite level of recursion, and furthermore consider it the essence of natural language, and use it as an argument to determine the choice of the formal model for natural languages? This has had serious consequences for computing and NLU going beyond labs for applications.

In order to deal with theoretically infinite center recursion, the human brain, or computer memory, must have a "stack" device and a "backtracking" algorithm. Without going into the technical definitions of these computer terms, computer science studies have demonstrated that stack-based backtracking is computationally expensive. Using it as a basic device for natural language severely impedes language parsing from leaving the laboratory. Specifically, for Chomsky's "context-free grammar" with built-in recursive devices, no linear-time parsing algorithm is known for the general case. The absence of linear algorithms means that the computing time is beyond control, so when parsing leaves the lab and meets big data, this becomes a limiting factor in practice. This is one of the fundamental flaws in his formal language arguments for natural language.

Q: I agree with you: there are only very limited levels, we don't have to stick to recursive grammars. But I still have a question. Short-term memory is a psychological concept, and most of us in computational linguistics believe that psychology has no place in linguistics. Don't you agree?

A: I don't agree. The limitations of psychology have a direct effect on real linguistic phenomena, that is, psychological effects are reflected in linguistic phenomena. Real language phenomena, not imaginary phenomena, are the goal and final foothold of our natural language study. What we're dealing with is a data set with a psychological constraint, and it's obviously not appropriate for us to adopt a mechanism to deal with it based on a hypothesis that disregards psychological constraint.

Q: But even with the addition of psychological restrictions, don't real corpora still have recursion? If so, how can a device without formal recursion, such as a finite state machine, handle the center-recursive structures that do actually occur, however rare they are?

A: Not a problem at all. As long as the recursive structure is bounded, finite-state devices have no problem dealing with it. All we need to do is cascade a few more finite state machines. Since there are at most three levels of center recursion, three machines taking 3x the time suffice, which is still linear. Even 10-level center recursion is not an issue: just stack up 10 finite state automata. In our deep parsing practice, we have applied up to 100 cascaded finite state machines for very deep parsing, with high efficiency. This kind of finite-state pipeline system, often called cascaded FSAs, is essentially the same pipeline concept used in software engineering.
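The cascade idea can be sketched with a toy example. This is not the parser described in the interview; it is a minimal illustration that assumes round brackets stand in for center-embedded structures and that a placeholder token `X` marks an already-recognized unit. Each pass is a plain regular-expression rewrite that handles only the innermost level, and a fixed number of passes handles a fixed depth.

```python
import re

# One finite-state pass: rewrite every innermost "(...)" group, i.e. one with no
# brackets inside it, as a single placeholder token. No stack is used.
INNERMOST = re.compile(r"\([^()]*\)")

def one_pass(text: str) -> str:
    """Collapse all innermost bracket groups into the placeholder 'X'."""
    return INNERMOST.sub("X", text)

def cascade_parse(text: str, levels: int = 3) -> bool:
    """Run a fixed pipeline of `levels` passes; accept if no brackets remain."""
    for _ in range(levels):
        text = one_pass(text)
    return "(" not in text and ")" not in text

print(cascade_parse("(a (b (c) d) e)"))      # three levels, three passes
print(cascade_parse("((((a))))"))            # four levels exceed the default cascade
print(cascade_parse("((((a))))", levels=4))  # one more machine in the pipeline
```

Each pass scans the input once, so in this sketch a cascade of k passes costs roughly k times one scan, and the bound on depth is simply the length of the pipeline.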

Q: The Chomsky Hierarchy, named after Chomsky, is the most famous discovery in his formal language theory. It divides grammars into four types, type 0 to type 3, each corresponding to a different kind of automaton. What do you think of his hierarchy?

A: Chomsky's formal language hierarchy is like a castle with four concentric walls guarding its inner cities. Each formal device is like an inner forbidden city. Here we particularly recommend and quote an insightful study of the Chomsky Hierarchy by Prof. Bai, which I call the "caterpillar" theory of natural language (S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle):

If we agree that everything in parsing should be based on real-life natural language as the starting point and the ultimate landing point, it should be easy to see that the outward limited breakthrough and the inward massive compression should be the two sides of a coin. We want to strive for a formalism that balances both sides. In other words, our ideal natural language parsing formalism should look like a linguistic "caterpillar" breaking through the Chomsky walls in his castle, illustrated below:

Prof. Bai also clearly sees that Chomsky's recursion theory is too far removed from linguistic facts, so he puts special emphasis on "real-life natural language". After all, formal systems serve as formalized models of natural language; that is, they need to provide an appropriate framework for what natural language actually looks like. The answer Prof. Bai and I share is that a suitable natural language model needs to cut through the walls inside the Chomsky Castle. Any single device in Chomsky's existing formalisms, when used to model natural language, is either too small to fit or too large, lacking appropriate restrictions. In both theory and practice, it is necessary to penetrate the walls of the Chomsky Castle and form an innovative formal system, so as to lay a good foundation for the revival of grammars in natural language modeling. In formalizing this penetration of the walls, Prof. Bai has his own innovation, and I have mine. My proposal is to extend and stack the finite-state mechanism, so as to establish a multi-layer rule system, from shallow to deep, for natural language deep parsing and understanding.

Do not look down upon finite state machines, which may seem to be a very simple mechanism for pattern matching. When they are stacked layer by layer in a well-designed pipeline architecture, they can cope with very complicated structures and phenomena and reach a depth of parsing never before achieved with traditional context-free grammars or other devices. Of course, the mechanism itself can be extended and recrafted, for example by incorporating a unification operation to handle reduplication, e.g. in Chinese, "看一看": V 一 V (literally look-one-look: "take a look"). Pattern-matching rules can also eliminate ambiguities effectively by adding post-context conditions to the pattern matching device, similar to the "look ahead" effect in backtracking algorithms.
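
The two extensions just mentioned can be approximated with ordinary regular expressions in Python. This is only a sketch under assumptions: the patterns and names (REDUP, BOOK_AS_VERB) are invented for illustration, a regex backreference stands in for unification, and a lookahead assertion stands in for a post-context condition. The author's rule formalism is not shown in the text and may work quite differently.

```python
import re

# V 一 V reduplication: the backreference \1 forces both verb slots to be
# the same character, a simple stand-in for unification.
REDUP = re.compile(r"(\w)一\1")

print(bool(REDUP.search("看一看")))  # True: the same verb on both sides
print(bool(REDUP.search("看一听")))  # False: the two slots do not unify

# Post-context condition: treat "book" as a verb only when a determiner
# follows. The lookahead (?=...) checks the right context without consuming it.
BOOK_AS_VERB = re.compile(r"\bbook(?=\s+(?:a|the)\b)")

print(bool(BOOK_AS_VERB.search("book a flight")))     # True
print(bool(BOOK_AS_VERB.search("the book is here")))  # False
```

Because the lookahead does not consume the determiner, a later rule in the cascade is still free to match it as part of a noun phrase.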

It is worth emphasizing that maintaining linearity is the premise of any formalism innovation. No matter how we extend the finite-state mechanism, one goal remains unchanged: it must retain the essential characteristics of finite state to ensure linear speed. We use a multilayer cascade to bypass the recursion trap, thereby eliminating the biggest hidden obstacle to linear speed. Since a constant multiple of linear time is still linear, cascading finite state machines does not sacrifice the linear benefit of the system. Computationally, handling three levels of recursion takes only 3x the time, which does not affect the scalability of the system. In fact, the multi-layer systems we have deployed usually have more than 50 layers. Our Chinese system sometimes cascades up to 100 layers, where capturing recursive structures is just one relatively simple task among many.

Q: That's fascinating. And very imaginative, too. It is apparent that you and Prof. Bai have both accumulated years of practice and deep dives into natural language, which is where the insights summarised above about breaking through the internal walls of the Chomsky Castle come from. OK, so the first issue with Chomsky's formal language theory is the recursion fallacy. What's the second fallacy?

A: The second major problem with Chomsky's formal language theory was briefly mentioned above. I call it the Single-layer Fallacy.

Turn to the chapter on parsing in any computational linguistics textbook, and you will find that the typical parsing algorithm, known as chart parsing, is usually introduced on top of a context-free grammar (CFG). A CFG contains recursive calls in its rules to cover recursive structures, a point Chomsky emphasizes as the key feature of natural language. The rule system is applied within one search space on one plane, which is why chart parsing can be illustrated on a flat chart. A successful parse is represented by one or more search paths that cover the entire sentence.

[consider a chart parsing sample.]

The essence of single-layer parsing is like cooking a hodgepodge. Everything in an input string, from morpheme to word, from word to phrase, from phrase to clause, and from clause to complex sentence, is processed in the same space.

Q: So Chomsky wants to solve everything at once. Isn't that good?

A: The problem is that there are three main disadvantages. First, there is no linear algorithm. Many people have tried, but nobody has found one, because the search runs into a combinatorial explosion.

The second disadvantage is that it is not suitable for modular development, because the surface or shallow level language phenomena and the deep language structures are all mixed on one plane.

The third disadvantage is the so-called "pseudo-ambiguity" issue. Pseudo-ambiguity stands in contrast to true ambiguity. If the input sentence contains one true ambiguity, the correct behavior is for the parser to produce two parses to express it. Pseudo-ambiguity means that a sentence is not ambiguous to human readers, but the parser still outputs several parses, all of which it considers grammatical.

Pseudo-ambiguity is a recognized challenge for single-layer parsers. Even for a simple sentence, traditional parsers based on context-free grammars often produce dozens or even hundreds of parses. Most of the time, the differences are so subtle that they make no difference in communication. The consequence is that the few true ambiguities are buried among many false ones. In effect, the parser completely loses its ability to identify ambiguity. Naturally, such a single-layer grammar approach is difficult to deploy for parsing and semantic decoding of big data.
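
To get a feel for how fast parses multiply on a single plane, here is a small Python sketch built on a toy grammar that is an assumption for this example only: one category X with the rules X -> X X and X -> word, so every way of bracketing the words counts as a separate parse. The function name count_parses is likewise hypothetical. The count is filled in span by span, the way a chart parser fills its chart.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_parses(length: int) -> int:
    """Count parses of a span of `length` words under the toy grammar
    X -> X X | word (every binary bracketing is a distinct parse)."""
    if length == 1:
        return 1  # X -> word
    # X -> X X: try every split point and multiply the counts of the two halves.
    return sum(count_parses(k) * count_parses(length - k) for k in range(1, length))


for n in (2, 4, 6, 8):
    print(n, count_parses(n))
# 2 1
# 4 5
# 6 42
# 8 429
```

Eight words already yield 429 structurally distinct parses under this deliberately loose grammar. A realistic grammar constrains the splits more tightly, but the multiplicative growth across split points is the same mechanism that buries a true ambiguity among false ones.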

Q: Lao Li, I think I am starting to understand the drawbacks of the single-layer parsers you discussed. Could you elaborate on why this is not a feasible model for real-life applications?

A: Too big a search space, and too many parses. In essence, the system makes all possibilities explicit, low-probability events as well as high-probability events, in the same search space. The idea makes sense in theory: any small possibility is still a possibility, and in a perfect theoretical model you cannot block any path in advance. As a result, you have to keep all the search paths alive until a global path is complete. This means the parsing space is in fact a combinatorial explosion space, so there is no efficient algorithm for it.

Q: Why isn't a single layer suitable for modularity?

A: There is no modularity at all in a single layer. A single-layer approach means that the whole parsing process is one module, so single-layer is by definition non-modular. Its theoretical basis does have some truth to it. It says that language phenomena are interdependent, and a complete language analysis scheme cannot fully separate them. Even structures as shallow as word segmentation and the boundaries of basic phrases are difficult to determine outside the overall structure of the sentence. This is because a locally sound structure can always be overridden in a larger context.

(for instance)

From this interdependent perspective, in which the local is subordinate to the global, structural analysis creates a chicken-and-egg problem once it is cut into pieces. As a way to deal with interdependency, a single-layer model makes sense in theory. In a single-layer system, all the interdependent phenomena are explored on the same plane, with global paths as the solutions. That, of course, forms an argument against multiple layers: language phenomena are interrelated, so we can hardly treat them by first cutting them into layers. A modular pipeline facing such interdependency is very susceptible to "premature pruning" of branches. To be honest, if we set aside the pseudo-ambiguity problem and the non-linear speed of the single-layer design for a moment, it is quite difficult to refute this argument against the multi-layer design. However, a single layer is not very feasible in practice. Its costs far outweigh its benefits, and the concern about premature pruning in a multi-layer system has its own countermeasures.

Q: Your point of view is not quite the same as my understanding of modularity. In my understanding, a module is a concept that does not imply hierarchy. Take bricks: you can pave a road with them, which is like a flat jigsaw puzzle of bricks. Of course, you can also build a wall, in which case the bricks are hierarchical and go up one level at a time. So, in my understanding, modularity and hierarchy do not have to be correlated. Does that make sense?

A: Yes, you're right. Modules are bricks. They do not have to come in layers. If there are layers, as in building a wall, then the modules have to be arranged in a sequential architecture. But it is also possible that there is no sequential dependency between modules, in which case the modules are defined from an angle independent of layers, which is like paving a road. Road paving does not have to be serial; it can be done in parallel. In practice, such modules may still be arranged in one uniform pipeline, combining the style of road paving with the style of wall building.

Modularity itself is a well-established practice that comes from software engineering. When building a complex system, we always try to divide tasks into subtasks and sub-subtasks. Modularity makes the development process more tractable and the system easier to maintain. Natural language is undoubtedly a fairly complex system. Faced with a complex object like language, a good way forward is to emulate the approach that has worked in engineering for years. That is to say, the task should be reasonably decomposed and cut into modules as far as possible, so that development can proceed module by module.

Thanks to http://fanyi.youdao.com/, whose machine translation served as the basis for this text, which was then revised and polished by the author himself. This is the first chapter of our book on NLU, which consists of 10 interviews on key topics of AI symbolic logic as used in natural language parsing. Stay tuned.

[References]

S. Bai: Natural Language Caterpillar Breaks through Chomsky's Castle
