Advocates of “unified” models argue that the many signals of the world—text, images, audio, video—should be tokenized and mapped into the same semantic field (a shared hidden vector space), so they can be trained jointly and modeled with one brain.
A common recipe is to atomize other modalities the way we do text: turn them into token sequences so that sequence learning—the secret sauce behind LLMs' explosive success—can extend beyond text. Universal tokens lay the groundwork for this extension. Once text, images, audio, and video are all sliced into computable tokens of similar type, one step remains: make them play in the same band (for training and for generation).
The most straightforward trick is to give each instrument a passport. Before entering the model, every segment carries a modality tag—special start symbols telling the conductor whether the next stretch is image tokens, audio tokens, or text (e.g., <image>, <audio>). Positional encodings then act like a seating chart, indicating how these tokens are arranged in a sentence, on an image grid, or along a timeline. After a few rounds of training, the model internalizes these hints: look at pixels when needed, listen when needed, and weave everything into a coherent narrative when it speaks.
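As a minimal sketch of the “passport” step, assuming a shared embedding table and a vision front end that has already produced continuous image tokens; the special-token ids and the helper below are illustrative, not any specific model’s vocabulary:

```python
import torch

# Illustrative special-token ids for modality tags (not a real model's vocabulary).
SPECIAL = {"<image>": 50001, "</image>": 50002, "<audio>": 50003, "</audio>": 50004}

def build_sequence(text_ids, image_tokens, embed):
    """Concatenate tagged image tokens and text tokens into one backbone input.

    text_ids:     (T,) integer ids from the text tokenizer
    image_tokens: (N, d_model) continuous embeddings from a vision front end
    embed:        the backbone's shared token-embedding table (nn.Embedding)
    """
    # Passport: wrap the image span in <image> ... </image> tags.
    img_open = embed(torch.tensor([SPECIAL["<image>"]]))
    img_close = embed(torch.tensor([SPECIAL["</image>"]]))
    text_emb = embed(text_ids)

    seq = torch.cat([img_open, image_tokens, img_close, text_emb], dim=0)

    # Seating chart: flat 1-D positions here; real systems often use 2-D grid
    # positions for image patches and time indices for audio/video tokens.
    positions = torch.arange(seq.shape[0])
    return seq, positions
```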
A deeper level of fusion is handled by attention. Think of attention as a rehearsal room with glass walls: text can “glance” at image tokens and images can “nod back.” Over time, some heads specialize in image–image relations, others in image–text translation. Flamingo makes this explicit: cross-modal attention layers are inserted between LLM layers so that the word currently being generated can continually “look back” at the relevant image region. To users this shows up as abilities like step-by-step visual Q&A; under the hood, a stream of text tokens moves forward while aiming its attention at precisely the visual patches that matter.
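A rough sketch of that interleaving in PyTorch: a gated cross-attention block whose tanh gate starts at zero, so the frozen language model is unperturbed when training begins. This follows the spirit of Flamingo’s gated cross-attention but is a simplified stand-in, not the published implementation.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Text tokens (queries) attend to visual tokens (keys/values)."""
    def __init__(self, d_model=1024, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        # tanh gate initialized at zero: the block is a no-op when training starts,
        # so the pretrained language model's behavior is preserved initially.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_h, visual_h):
        attended, _ = self.attn(self.norm(text_h), visual_h, visual_h)
        return text_h + torch.tanh(self.gate) * attended

# Interleaving pattern (illustrative): a gated block before each frozen LLM layer,
# so every generated word can keep "looking back" at the visual tokens.
#   for cross_block, llm_layer in zip(cross_blocks, llm_layers):
#       h = cross_block(h, visual_tokens)
#       h = llm_layer(h)
```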
In engineering practice, a common compromise is front-end experts first, then merge. PaLM-E in robotics is a good example: images are first distilled by a pretrained vision encoder (e.g., ViT) into compact representations, then projected into perception tokens; text is tokenized as usual; robot state vectors can be appended when needed. With appropriate modality tags, all of these enter a shared Transformer backbone, where dialog, reasoning, and decision-making happen on a single score. In this analogy the front end acts like a pickup: it cleans and compresses raw signals.
More concretely: the “pickup” is a modality-specific encoder (ViT for images, an acoustic encoder for speech, etc.). The “front end” consists of the pickup plus a small projection/adapter that turns features into tokens the backbone can digest. The “backbone” is the unified Transformer that handles cross-modal alignment, long-range memory, and reasoning. In short, the backbone processes music, not the raw electrical currents of each microphone. It doesn’t touch pixels or raw waveforms; it ingests track-like, distilled features produced by the front end and aligns/reasons over them. Redundant detail and noise are filtered at the front; during generation, pixel-level detail is filled back in by a diffusion renderer or a decoder according to the backbone’s plan.
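In code, the pickup/adapter/backbone split might look like the sketch below; the module names, dimensions, and the optional robot-state token are illustrative assumptions rather than PaLM-E’s actual configuration.

```python
import torch
import torch.nn as nn

class FrontEnd(nn.Module):
    """Pickup + adapter: a modality-specific encoder plus a small projection."""
    def __init__(self, vision_encoder, enc_dim=768, d_model=1024):
        super().__init__()
        self.pickup = vision_encoder                # e.g. a pretrained ViT
        self.adapter = nn.Linear(enc_dim, d_model)  # project features to backbone width

    def forward(self, images):
        feats = self.pickup(images)                 # (B, N_patches, enc_dim) patch features
        return self.adapter(feats)                  # (B, N_patches, d_model) perception tokens

def assemble_inputs(perception_tokens, text_emb, state_vec=None):
    """Concatenate perception tokens, an optional robot-state token, and text
    embeddings into the single sequence consumed by the shared Transformer backbone."""
    parts = [perception_tokens]
    if state_vec is not None:
        parts.append(state_vec.unsqueeze(1))        # robot state as one extra token
    parts.append(text_emb)
    return torch.cat(parts, dim=1)
```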
For instance, a 224×224 RGB image has about 50k pixels, over 150k raw values across three channels. If you hand all of them to the backbone, attention wastes budget on repetitive textures. A ViT-style pickup patches, encodes, and compresses the image into roughly two hundred visual tokens—each like a short riff, concentrating edges, shapes, and relations that matter for understanding. Audio is similar: raw 16 kHz samples are first converted to tens or hundreds of latent vectors per second; rhythm and timbre are distilled, while noise and redundancy are filtered. The backbone consumes these high-semantic-density tokens (i.e., semantic tokens), so compute is lighter and crucial information remains intact. When trained end-to-end, gradients from the backbone flow back to the front end, encouraging it to keep what matters for downstream generation and discard redundancy.
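Back-of-the-envelope arithmetic behind those compression ratios (16×16 patches and roughly 50 codec frames per second are typical values, used here as assumptions):

```python
# 224x224 RGB image: raw values vs. ViT patch tokens (16x16 patches).
pixels = 224 * 224               # 50_176 pixels
values = pixels * 3              # ~150k raw channel values
patch_tokens = (224 // 16) ** 2  # 14 * 14 = 196 visual tokens

# 1 second of 16 kHz audio vs. a neural codec emitting ~50 latent frames per second.
samples = 16_000
audio_tokens = 50

print(f"image: {values:,} values -> {patch_tokens} tokens "
      f"(~{values // patch_tokens}x fewer elements)")
print(f"audio: {samples:,} samples -> {audio_tokens} tokens "
      f"(~{samples // audio_tokens}x fewer elements)")
```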
Those few, information-dense tokens inside the backbone don’t carry pixel-level detail. Detail returns through a wider channel on the rendering side during generation. The backbone sets the plan; the renderer does the fine work. Three practical approaches are popular on the generation side.
1) Composition → Orchestration. The front end is paired with a decoder: the encoder compresses content into compact codes (discrete codewords/tokens or low-dimensional latents), and the decoder can reconstruct pixels. At generation time, the backbone predicts/plans decodable representations—a string of codes or a per-frame latent vector—rather than pixels. The decoder then “orchestrates” texture, light, and materials from those codes. In image/video, the VQ-VAE/MAGVIT family follows this predict-codes → decode path. With residual vector quantization (RVQ), detail is added coarse-to-fine: the backbone first emits higher-level codes; the renderer produces a solid base; lower-level residual codes then refine it layer by layer (a minimal RVQ sketch follows this list).
2) Storyboard → Cinematography. The backbone provides structural plans—a low-res sketch (blueprint), motion hints/optical flow, keypoints, or a camera trajectory. Each frame is then handed to in-frame diffusion (latent-space rendering) to “develop” the image from noise under those conditions. Diffusion doesn’t need the backbone to carry high-frequency detail; it iteratively reveals detail in a high-res latent space. This is the “next-frame prediction + in-frame diffusion” split: temporal coherence by the backbone, visual richness by diffusion (a sketch of this split also follows the list).
3) From coarse to fine. The backbone first outputs a coarse result—low resolution or higher-level codes—then the rendering stack applies super-resolution (SR) and residual refinement to stack up resolution and texture. The farther down the pipeline, the wider the bandwidth, but that bandwidth is handled on the rendering side rather than bloating the backbone with pixel-length sequences. Many systems expose these as configurable quality gears: stop at 720p if the user is in a rush, or climb to 1080p/4K when desired (a cascade sketch follows the list).
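For approach 1), here is a minimal residual vector quantization sketch in PyTorch: toy random codebooks and nearest-neighbor lookup, meant only to show the coarse-to-fine encode/decode idea, not MAGVIT’s or any production codec’s implementation.

```python
import torch

def rvq_encode(x, codebooks):
    """Quantize x with a stack of codebooks; each level quantizes the previous residual."""
    residual, codes = x, []
    for cb in codebooks:                                 # cb: (K, D) codebook
        idx = torch.cdist(residual, cb).argmin(dim=-1)   # nearest codeword per vector
        codes.append(idx)
        residual = residual - cb[idx]                    # pass leftover detail to the next level
    return codes

def rvq_decode(codes, codebooks):
    """Sum the chosen codewords level by level: coarse base first, finer residuals after."""
    out = 0
    for idx, cb in zip(codes, codebooks):
        out = out + cb[idx]
    return out

# Toy usage: 3 levels, 512-entry codebooks, 64-dim latents.
codebooks = [torch.randn(512, 64) for _ in range(3)]
latents = torch.randn(10, 64)
codes = rvq_encode(latents, codebooks)
reconstruction = rvq_decode(codes, codebooks)
```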
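For approach 2), one way the split could be wired up, at sketch level: the backbone emits a structural plan per frame and a conditional denoiser refines each frame in latent space. The `backbone.predict_next` and `denoiser.step` calls are hypothetical placeholder interfaces, not any specific library’s API.

```python
import torch

def generate_video(backbone, denoiser, n_frames, steps=30, latent_shape=(4, 64, 64)):
    """Temporal coherence from the backbone, visual richness from per-frame diffusion."""
    history, frames = [], []
    for t in range(n_frames):
        # Backbone: next-frame prediction over compact tokens yields a structural
        # plan (low-res sketch, motion hints, keypoints), not pixels.
        plan = backbone.predict_next(history)        # hypothetical interface

        # In-frame diffusion: start from noise, denoise in latent space
        # conditioned on the plan; pixels are decoded from the latent later.
        z = torch.randn(latent_shape)
        for s in reversed(range(steps)):
            z = denoiser.step(z, cond=plan, step=s)  # hypothetical denoising step

        frames.append(z)
        history.append(plan)
    return frames
```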
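For approach 3), the quality gears might be expressed as a small cascade of residual super-resolution stages. The gear names, the 2× scale factor, and the shape-preserving `refiners` modules are illustrative assumptions.

```python
import torch.nn.functional as F

# Illustrative quality gears: how many super-resolution stages to climb.
GEARS = {"720p": 1, "1080p": 2, "4k": 3}

def render(coarse, refiners, quality="720p"):
    """coarse:   (B, C, H, W) low-res frame decoded from the backbone's codes
    refiners: shape-preserving modules, one per stage, predicting residual detail"""
    out = coarse
    for stage in range(GEARS[quality]):
        up = F.interpolate(out, scale_factor=2, mode="bilinear", align_corners=False)
        out = up + refiners[stage](up)   # residual refinement on top of the upsample
    return out
```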
One subtlety: the backbone does not simply “discard detail.” First, in joint end-to-end training, the front end and decoder co-adapt with the backbone so that what’s “compressed away” is true redundancy, while cues crucial for generation (edges, materials, style, rhythm) are preserved in a recoverable latent space. Second, many renderers look back at multi-scale front-end features (e.g., via U-Net skip connections or cross-attention during decoding), allowing them to query high-bandwidth details on demand—without hauling them through the backbone.
How the renderer “looks back” depends on the task:
- From-scratch generation (text-to-image/video): there’s no high-res reference to query. The renderer relies on learned statistics to “hallucinate” detail from the backbone’s directional plan; the image/video encoder helps shape the latent space during training but typically isn’t invoked at sampling time.
- Conditional generation/editing (image-to-image, inpainting, colorization, video continuation, style transfer): the renderer does “look back.” The reference image or previous frame is encoded to multi-scale features; the decoder/diffusion network uses skip connections or cross-attention to pull in high-res cues, aligning edges, textures, and brushwork precisely (a sketch follows this list).
- In the autoencoder/vector-quantization track, encoder and decoder are two sides of the same coin: the encoder compresses images/video into a shared latent language; the decoder restores pixels from latents/codes. They are trained around the same latent interface, two networks playing opposite roles over the same language. Whether the encoder is used at generation time depends on the task: unconditional generation needs only the decoder; conditional/editing and video consistency bring the encoder back to supply high-res detail.
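As a minimal sketch of that look-back path in the conditional/editing case: a toy U-Net-style decoder that concatenates multi-scale reference features at each upsampling step. Channel counts and shapes are arbitrary; this is not any particular model’s architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookBackDecoder(nn.Module):
    """Decoder that 'looks back' at multi-scale features of a reference image.

    ref_feats is a list of encoder feature maps, coarse to fine. At each step the
    decoder upsamples its latent and concatenates the matching-scale reference
    features, so high-res cues (edges, texture, brushwork) re-enter on the
    rendering side without ever passing through the backbone.
    """
    def __init__(self, latent_ch=64, ref_chs=(256, 128, 64), out_ch=3):
        super().__init__()
        chs = [latent_ch, 128, 64, 32]
        self.blocks = nn.ModuleList(
            nn.Conv2d(chs[i] + ref_chs[i], chs[i + 1], kernel_size=3, padding=1)
            for i in range(len(ref_chs))
        )
        self.to_rgb = nn.Conv2d(chs[-1], out_ch, kernel_size=1)

    def forward(self, latent, ref_feats):
        x = latent
        for block, skip in zip(self.blocks, ref_feats):
            x = F.interpolate(x, size=skip.shape[-2:], mode="nearest")
            x = F.relu(block(torch.cat([x, skip], dim=1)))  # skip connection pulls in detail
        return self.to_rgb(x)
```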
Put together, the pipeline is clear: the backbone sets the plan—semantic coherence and causal logic—without hauling pixels across long distances; the renderer does the fine work; it looks back when a reference exists, and otherwise lets diffusion “develop” detail locally. Decoding, diffusion, and super-resolution are the high-bandwidth back end that rebuilds the scene in place. Encoder and decoder share a latent interface, each doing its half of the job—two sides of the same coin.