Neural Codec: Key Audio Techniques in the LLM Era

“Codec” is short for coder–decoder: an end-to-end machine for compressing and then restoring audio. The encoder compresses a signal into a more compact representation; the decoder reconstructs it as faithfully as possible.

In the LLM era, audio—like text—is often cut into short segments and encoded as a sequence of discrete tokens. The “audio dictionary” used for quantization is called a codebook, which works like a spice rack in a kitchen: discrete little vectors in separate slots, each with an index. Unlike text, which typically has a single vocabulary, neural audio coding often uses several codebooks at the same time step (audio unit), quantizing from coarse to fine in layers—a scheme known as RVQ (Residual Vector Quantization). During compression, the system picks one entry from each codebook, using their indices to “remember” that instant of sound in layers; during reconstruction, the vectors addressed by those indices are summed, layer by layer, to restore the flavor.
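To make the layered lookup concrete, here is a minimal NumPy sketch of residual vector quantization. The codebooks are random stand-ins, and the function names (`rvq_encode`, `rvq_decode`) are illustrative, not taken from any particular codec:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Quantize one latent vector x with a stack of codebooks.
    Each layer quantizes the residual left over by the previous layers."""
    indices, residual = [], x.copy()
    for cb in codebooks:                          # cb: (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))               # pick the nearest entry
        indices.append(idx)
        residual = residual - cb[idx]             # pass what's left to the next layer
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the chosen entries, layer by layer."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

rng = np.random.default_rng(0)
dim, n_layers, cb_size = 8, 4, 256
codebooks = [rng.normal(size=(cb_size, dim)) for _ in range(n_layers)]

x = rng.normal(size=dim)                          # one frame's latent vector
tokens = rvq_encode(x, codebooks)                 # four indices, one per layer
x_hat = rvq_decode(tokens, codebooks)
print(tokens, np.linalg.norm(x - x_hat))          # error shrinks as layers are added
```

Each extra layer only has to encode what the previous layers missed, which is why small per-layer codebooks add up to high fidelity.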

Earlier TTS pipelines typically went “Mel-spectrogram draft” (a continuous-valued “score”) → a neural vocoder that renders it back into a waveform. The LLM-native neural codec more often runs “semantic tokens → acoustic tokens → decode to waveform.” Both semantic tokens and acoustic tokens are discrete; they differ in granularity and division of labor.

Multi-layer token coding via RVQ is a key innovation for extending LLM methods to audio. Without it, a single flat codebook would need an astronomically large vocabulary to reach comparable fidelity. By its nature, layering simplifies the complex—divide and conquer. It’s an innovation in representation that pushes discretization all the way through.

“How many layers?” There’s no universal number. It’s a knob you turn together with bitrate, latency, and quality targets. Take Google’s SoundStream: it refines the same time step with residual quantization. The paper reports common 4/8/16-layer setups and even an extreme 80-layer experiment, showing training can remain stable with many layers. For a fixed target bitrate, you can trade fewer layers with larger codebooks against more layers with smaller ones—the design space is flexible.
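The trade-off is easy to quantify: bitrate ≈ frame rate × number of layers × log2(codebook size). A back-of-the-envelope sketch, assuming an illustrative 75 Hz frame rate (roughly what a 24 kHz codec with 320× downsampling produces):

```python
import math

def codec_bitrate(frame_rate_hz, n_layers, codebook_size):
    """Bits per second spent on RVQ indices."""
    return frame_rate_hz * n_layers * math.log2(codebook_size)

# Same ~6 kbps budget, two different corners of the design space:
print(codec_bitrate(75, 8, 1024))     # 8 layers x 10 bits = 6000 bps
print(codec_bitrate(75, 4, 2 ** 20))  # 4 layers x 20 bits = 6000 bps (much larger codebooks)
```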

Meta’s EnCodec follows a “multi-codebook, variable-bandwidth” approach: the 24 kHz model uses up to 32 codebooks (i.e., 32 RVQ layers), the 48 kHz model up to 16, with codebooks typically of size 1024. During training, it randomly selects a subset of layers to participate in quantization, so a single set of weights can serve 1.5/3/6/12/24 kbps, etc. In deployment you simply “open” as many layers as you need, striking a balance between quality and realtime latency.
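A minimal sketch of the idea, not EnCodec’s actual code, reusing the `rvq_encode`/`rvq_decode` helpers from the sketch above: during training a random prefix of the layers is kept (“quantizer dropout”), and at inference you decode with however many layers the bitrate budget allows.

```python
import numpy as np

def train_step_quantizer_dropout(x, codebooks, rng):
    """Quantizer dropout: train with a random prefix of the RVQ layers,
    so one set of weights covers several bitrates."""
    n_keep = rng.integers(1, len(codebooks) + 1)       # e.g. keep 1..32 layers
    tokens = rvq_encode(x, codebooks[:n_keep])
    x_hat = rvq_decode(tokens, codebooks[:n_keep])
    return x_hat                                       # feed into the reconstruction loss

def decode_at_bandwidth(tokens, codebooks, n_layers):
    """At inference, 'open' only the first n_layers codebooks."""
    return rvq_decode(tokens[:n_layers], codebooks[:n_layers])
```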

Don’t confuse “number of layers” with “hierarchical scales.” OpenAI’s Jukebox uses three time-scale levels in a VQ-VAE: the top level is sparse, the bottom dense, carrying information from long-range song structure down to timbral details. That’s “horizontal” stratification over time, not stacking multiple residual quantizers at the same time step.
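The shapes make the distinction concrete: horizontal stratification gives one token stream per time scale, each at its own frame rate, while vertical RVQ stacking gives several tokens per frame at a single frame rate. The frame rates below are illustrative assumptions, not the papers’ exact numbers:

```python
seconds = 10

# "Horizontal": three levels over time, each with its own (illustrative) frame rate.
jukebox_like = {
    "top":    seconds * 345,     # sparse, long-range song structure
    "middle": seconds * 1380,
    "bottom": seconds * 5510,    # dense, timbral detail
}

# "Vertical": one frame rate, several RVQ layers stacked on every frame (illustrative).
rvq_like_shape = (8, seconds * 75)   # (n_layers, n_frames)

print(jukebox_like)      # three separate streams of different lengths
print(rvq_like_shape)    # one grid: 8 tokens describing every 1/75 s frame
```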

A rule of thumb: for realtime, low-latency speech, 4–16 layers are common; for music or higher fidelity at 24 kHz, a few dozen layers aren’t unusual. The final choice isn’t doctrinaire—it depends on your target bitrate, acceptable latency, and how exacting the ear is about texture.

A neural codec begins with a neural encoder that massages raw waveforms into compact latent vectors; then a quantizer selects entries from the layered codebooks—coarse strokes for contour, finer strokes for texture; finally a neural decoder turns these discrete entries back into an audible waveform. The whole chain is trained end to end, aiming to tell the ear what it cares about—timbre, prosody, articulation, even a touch of room tail—using as few indices as possible. The result is a stable, reversible, bitrate-tunable discrete sequence.
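A toy PyTorch sketch of that chain, with illustrative layer sizes (a real codec uses deep convolutional stacks, a straight-through estimator so gradients pass the argmin, and perceptual or adversarial losses):

```python
import torch
import torch.nn as nn

class ToyNeuralCodec(nn.Module):
    """Encoder -> RVQ quantizer -> decoder, trained end to end (toy sizes)."""
    def __init__(self, dim=64, n_layers=4, cb_size=1024, frame=320):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=frame, stride=frame)      # wave -> latents
        self.codebooks = nn.ModuleList([nn.Embedding(cb_size, dim) for _ in range(n_layers)])
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=frame, stride=frame)

    def quantize(self, z):                        # z: (batch, dim, frames)
        residual, quantized, indices = z, torch.zeros_like(z), []
        for cb in self.codebooks:
            d = torch.cdist(residual.transpose(1, 2), cb.weight.unsqueeze(0))  # distances to entries
            idx = d.argmin(-1)                                                 # (batch, frames) token ids
            picked = cb(idx).transpose(1, 2)
            quantized = quantized + picked
            residual = residual - picked
            indices.append(idx)
        # (a real codec adds a straight-through estimator and commitment losses here)
        return quantized, torch.stack(indices, dim=1)                          # (batch, n_layers, frames)

    def forward(self, wav):                       # wav: (batch, 1, samples)
        z = self.encoder(wav)
        q, tokens = self.quantize(z)
        return self.decoder(q), tokens

codec = ToyNeuralCodec()
wav = torch.randn(1, 1, 24000)                    # 1 s of fake 24 kHz audio
recon, tokens = codec(wav)
print(tokens.shape)                               # torch.Size([1, 4, 75]): 4 layers x 75 frames
```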

In this setup, “audio tokens” are the indices themselves. Every few tens of milliseconds, the quantizer picks a set of indices from the layered codebooks to describe that slice; over time you obtain a readable, writable “acoustic text.” This differs from the traditional Mel spectrogram: the Mel is a continuous “photo” better suited to a conventional neural vocoder; audio tokens are a discrete “word string” that can both decode back to high-fidelity waveforms and be continued, edited, and aligned by GPT-style autoregressive models like text.
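For example, the (layers × frames) grid of indices can be flattened frame by frame into a single id sequence for a GPT-style model; the per-layer offset scheme below is one common convention, shown as an illustration rather than any specific model’s format:

```python
import numpy as np

def flatten_tokens(tokens, cb_size=1024):
    """tokens: (n_layers, n_frames) array of RVQ indices
    -> one interleaved id sequence a GPT-style model can read, continue, or edit.
    Layer k's ids are shifted by k * cb_size so ids from different layers never collide."""
    n_layers, n_frames = tokens.shape
    seq = []
    for t in range(n_frames):                 # frame by frame, coarse layer first
        for k in range(n_layers):
            seq.append(k * cb_size + int(tokens[k, t]))
    return seq

tokens = np.random.default_rng(0).integers(0, 1024, size=(4, 75))  # 4 layers x 75 frames (1 s)
print(len(flatten_tokens(tokens)))            # 300 "acoustic words" per second of audio
```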

In one line: a codec is the whole compress–restore machine; the codebook is just the rack holding the discrete basis vectors—what counts as tokens are their indices. A neural codec rebuilds this machine with neural networks so it can describe a sound using a string of discrete tokens and, when needed, sing that string back into real speech or music. Put simply, audio processing can now reuse the compression–and–reconstruction playbook that has proven so powerful for text LLMs—that’s the secret behind audio’s takeoff in the era of large models.
