Breakthroughs in Speech Technology in the Era of Large Models: Ultra-Realism and Full Duplex

As large language models (LLMs) expand into audio, progress has been breathtaking. “LLM-native” speech technology reached practical maturity roughly half a year ago, and the entire industry has surged forward. Two features mark this maturity: ultra-realistic speech and full-duplex interaction.

Audio tokenization—akin to what we do with text—has not only produced “model musicians” (Suno is the signature product here; I’ll cover it in a separate tech blog), but more importantly, it has catalyzed two changes that truly transform language interaction: ultra-realistic speech synthesis, and natural conversations where the system can listen and speak at the same time and be interrupted at any moment—that is, full duplex.
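
To make "tokenizing audio the way we tokenize text" concrete, here is a minimal sketch in Python. The NeuralCodec class and its numbers (8 codebooks, 75 token frames per second) are illustrative assumptions rather than any specific codec's API; a real EnCodec- or SoundStream-style model would put a learned encoder and residual vector quantizer where the stubs are.

```python
import numpy as np

# Hypothetical codec interface, for illustration only: speech in, a small grid of
# discrete token IDs out, and (crucially) back again.
class NeuralCodec:
    def __init__(self, n_codebooks: int = 8, codebook_size: int = 1024, frame_rate: int = 75):
        self.n_codebooks = n_codebooks      # residual codebooks, coarse to fine detail
        self.codebook_size = codebook_size  # entries per codebook
        self.frame_rate = frame_rate        # token frames per second of audio

    def encode(self, waveform: np.ndarray, sample_rate: int) -> np.ndarray:
        """Return an (n_codebooks, n_frames) grid of integer token IDs."""
        n_frames = int(len(waveform) / sample_rate * self.frame_rate)
        # A real codec runs a learned encoder + residual vector quantization here.
        return np.zeros((self.n_codebooks, n_frames), dtype=np.int64)

    def decode(self, tokens: np.ndarray, sample_rate: int) -> np.ndarray:
        """Invert the token grid back into a waveform; this reversibility is the point."""
        n_samples = int(tokens.shape[1] / self.frame_rate * sample_rate)
        return np.zeros(n_samples, dtype=np.float32)

codec = NeuralCodec()
speech = np.random.randn(16000 * 3).astype(np.float32)  # 3 seconds of 16 kHz audio
tokens = codec.encode(speech, sample_rate=16000)         # shape (8, 225): text-like symbols
restored = codec.decode(tokens, sample_rate=16000)       # tokens -> waveform again
print(tokens.shape, restored.shape)
```

Once speech lives in that discrete, reversible form, everything a language model already knows how to do with token sequences becomes available to audio as well.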

Start with ultra-realism. Old-school TTS always sounded like a well-trained announcer—articulate and polished, yet lacking the grain and personality of everyday speech. Once a neural codec compresses speech into reversible discrete tokens, synthesis no longer just “reads out” the text; it can also faithfully reproduce the manner of speaking, which is what makes everyday, down-to-earth speech possible. The model first drafts a sketch in the space of “speech semantics”: where to take a breath, when to lower the voice, which word should gently rise. It then renders that sketch as a string of audio tokens, and the decoder brings out the breath, the sibilants, and the room’s tail reverberation. You can hear the rasp in a voice, the faint nasal smile, and even a speaker’s “vocal persona” preserved across languages. Interaction no longer has a one-size-fits-all voice; it feels like chatting with a neighbor or a friend. The difference from the pre-LLM era lives in those micro-textures carried by the layered discrete codes (tokens).
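
The paragraph above describes a two-stage picture: first a plan in "speech semantics," then acoustic tokens for the decoder. The sketch below is only a schematic of that idea; plan_semantics, render_acoustic_tokens, and the prosody labels are invented stand-ins for learned models, not real components.

```python
from dataclasses import dataclass

@dataclass
class SemanticEvent:
    unit: str      # a word or sub-word unit
    prosody: str   # how to say it: "breath", "rising", "plain", ...

def plan_semantics(text: str) -> list[SemanticEvent]:
    """Stage 1 (stand-in): sketch *how* to say it -- breaths, softened words, rising ends."""
    plan = []
    words = text.split()
    for i, word in enumerate(words):
        if word.endswith("?") and i == len(words) - 1:
            prosody = "rising"                 # let the final word lift gently
        elif i > 0 and i % 6 == 0:
            prosody = "breath"                 # take a breath every few words
        else:
            prosody = "plain"
        plan.append(SemanticEvent(unit=word, prosody=prosody))
    return plan

def render_acoustic_tokens(plan: list[SemanticEvent]) -> list[int]:
    """Stage 2 (stand-in): expand the plan into codec tokens that carry the micro-texture."""
    tokens: list[int] = []
    for event in plan:
        # A real model predicts several codec tokens per unit; hashes fake the IDs here.
        tokens.extend(hash((event.unit, event.prosody, k)) % 1024 for k in range(6))
    return tokens

acoustic = render_acoustic_tokens(plan_semantics("Wait, shall we halve the salt first?"))
print(len(acoustic), "acoustic tokens")  # a codec decoder would turn these into a waveform
```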

Now full duplex. Early voice assistants were like walkie-talkies: you speak, then they speak, taking turns. Natural conversation is more like a kitchen scene: you’re chopping vegetables, a friend reminds you to preheat the oven, you cut in—“wait, let’s halve the salt first”—and your friend immediately pivots and responds to the new instruction. Achieving that natural feel looks like an interaction design problem (“interruptible,” “barge-in”), but underneath it requires three things working in concert: continuous listening—streaming the incoming speech into tokens in real time; speaking while listening—generating the audio tokens it is about to say while always leaving room to brake; and low-latency understanding and revision—completing the loop of listen → update plan → change what it says within a few hundred milliseconds.
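
A toy sketch of how those three requirements can share one loop, written here as a single asyncio coroutine. The queues, the plan_response callback (standing in for the large model), and all timing constants are assumptions for illustration, not any production system's API.

```python
import asyncio
from collections import deque
from typing import Callable

async def duplex_loop(mic: asyncio.Queue, speaker_out: list[int],
                      plan_response: Callable[[list[int]], list[int]]) -> None:
    heard: list[int] = []   # everything streamed in from the user so far
    pending = deque()       # audio tokens we still intend to say

    while True:
        # 1) Continuous listening: drain whatever audio tokens arrived during the last tick.
        new_input = False
        while not mic.empty():
            heard.append(mic.get_nowait())
            new_input = True

        # 2) Low-latency revision: anything new while we were talking means "hit the brakes"
        #    and re-plan -- the listen -> update plan -> change-what-we-say loop.
        if new_input:
            pending.clear()
            pending.extend(plan_response(heard))

        # 3) Speaking while listening: emit only a small chunk, then yield control again,
        #    so there is always braking room before the next chunk goes out.
        for _ in range(min(5, len(pending))):
            speaker_out.append(pending.popleft())

        await asyncio.sleep(0.1)  # one ~100 ms tick keeps the whole loop well under a second
```

Real systems replace the fixed tick with event-driven streaming, but the shape is the same: listening never blocks speaking, and speaking never outruns the chance to brake.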

Neural codecs shine again here: by turning both what is heard and what will be said into the same kind of discrete sequence (token string), the model can read and write on the same timeline, managing turn-taking like a human in natural conversation.
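
One way to picture "reading and writing on the same timeline": every frame carries both a heard token and a (possibly empty) spoken token. The numbers below are made up purely to show the shape of such an interleaved stream; the 80 ms frame size is an assumption for illustration.

```python
# (frame_index, heard_token, spoken_token); None marks silence on that side.
timeline = [
    (0, 812, None),  # user is talking, model stays quiet
    (1, 455, None),
    (2, None, 97),   # user pauses, model starts answering
    (3, 231, 98),    # user barges in while the model is mid-utterance
    (4, 640, None),  # model brakes: keeps reading, stops writing
]
for frame, heard, spoken in timeline:
    print(f"t={frame * 80:>3} ms  read={heard}  write={spoken}")
```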

These two advances are not isolated. The more ultra-realistic the synthesis, the more the token stream must explicitly encode pauses, stress, laughter, and other details. The smoother the full duplex, the more it relies on a stable, dense, and compact flow of audio tokens so the model can switch course on the fly without dropping a beat. Because the underlying representation has been unified into a “readable and writable token sequence,” we can naturally say, “Don’t be so formal—talk to me like you used to,” and the assistant instantly switches tone. We can also cut in mid-sentence—“Skip that—give me something concrete: go or not?”—and the back-and-forth remains fluid, never awkward.

In systems that truly deliver “ultra-realistic + full-duplex” today, the winning recipe is a hybrid of listen–think–speak. Up front, a streaming auditory model continuously turns your voice into a compact sequence of audio tokens; in the middle, a large model handles understanding and planning; at the back, the system writes its intended response as audio tokens and decodes them into a waveform in real time. The reason we don’t always hand everything to a single end-to-end autoregressive GPT that goes directly from audio-in to audio-out comes down to two constraints: millisecond-level latency requirements and the need for tight control over interruption and course correction.
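
Why the hybrid split helps with the first constraint shows up in a back-of-the-envelope latency budget. Every number below is an illustrative assumption, not a measurement; the point is only that each stage must be streaming and incremental for the sum to stay in the few-hundred-millisecond range.

```python
# Rough budget for one listen -> re-plan -> start-speaking turn (illustrative numbers).
budget_ms = {
    "streaming encoder emits the newest audio tokens":     80,
    "large model re-plans incrementally on those tokens": 150,
    "codec decoder renders the first outgoing chunk":      50,
    "output buffering / playback start":                   40,
}
for stage, ms in budget_ms.items():
    print(f"{stage:<52} {ms:>4} ms")
print(f"{'total (needs to stay in the few-hundred-ms range)':<52} {sum(budget_ms.values()):>4} ms")
```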

Think of it as a well-rehearsed band. The streaming speech encoder (ASR) is the drummer, keeping time in tens-of-milliseconds beats. The large model in the middle is the conductor, continuously re-orchestrating based on the latest auditory tokens. The neural-codec decoder (TTS) is the lead vocalist, singing while leaving braking distance so an interruption can “hit the brakes” instantly. In practice, both drummer and vocalist are often smaller, faster specialist models (streaming recognition such as RNN-T/Conformer, paired with fast neural-codec synthesis) rather than every detail being folded into one gigantic autoregressive stack; otherwise, interruptions and mid-utterance corrections could blow up latency.

This does not mean a return to the old ASR/TTS pipeline. The key change is that the underlying representation has been unified into discrete audio tokens: the listening side no longer produces only text characters; it can emit semantic or acoustic units as well. The speaking side no longer merely concatenates phonemes; it writes a reversible stream of codec tokens that carries breath, stress, and reverberant tails. The large model either plans in its hidden states or refines a two-step “semantic tokens → acoustic tokens” plan, then hands the result to the decoder to render. This keeps full-duplex latency low while preserving ultra-realistic texture.

Looking ahead, research is converging: end-to-end “spoken LLMs” are pulling listening and speaking into the same set of parameters, taking audio tokens directly as input and output (speech2speech). In engineering, however, conversation management—who speaks when, how to interrupt, how to revise—remains the guardrail for smoothness and robustness. Today’s best practice is like a hybrid car: the chassis is the unified, tokenized language; the engine is the large model; and the start/stop control is often delegated to small, specialized motors. The result is fast and stable—and it keeps the warmth of the human voice.

Published by

立委

Dr. 立委 is a consultant on multimodal large-model applications. He was previously VP of Engineering on the large-model team at 出门问问 (Mobvoi), focusing on large models and their AIGC applications. Before that, he spent 10 years as Chief Scientist at Netbase, where he directed the development of language-understanding and application systems for 18 languages: robust, running at line speed, scaling up to social-media big data, and semantically grounded in public-opinion-mining products, making Netbase a front-runner in industrial NLP deployment in the United States. Earlier, he was VP of R&D at Cymfony for eight years, where he won first place in the first question-answering evaluation (TREC-8 QA Track) and served as PI on 17 SBIR information-extraction projects.
