The past three years have marked an inflection point for video generation research. Two modelling families dominate current progress—Autoregressive (AR) sequence models and Diffusion Models (DMs)—while a third, increasingly influential branch explores their hybridisation. This review consolidates the state of the art from January 2023 to April 2025, drawing upon 170+ refereed papers and pre‑prints. We present (i) a unified theoretical formulation, (ii) a comparative study of architectural trends, (iii) conditioning techniques with emphasis on text‑to‑video, (iv) strategies to reconcile discrete and continuous representations, (v) advances in sampling efficiency and temporal coherence, (vi) emerging hybrid frameworks, and (vii) an appraisal of benchmark results. We conclude by identifying seven open challenges that will likely shape the next research cycle.
1. Introduction
1.1 Scope and motivation
Generating high‑fidelity video is substantially harder than still‑image synthesis because video couples rich spatial complexity with non‑trivial temporal dynamics. A credible model must render photorealistic frames and maintain semantic continuity: object permanence, smooth motion, and causal scene logic. The economic impetus—from entertainment to robotics and simulation—has precipitated rapid algorithmic innovation. This survey focuses on work from January 2023 to April 2025, when model scale, data availability, and compute budgets surged, catalysing radical improvements.
1.2 Survey methodology
We systematically queried the arXiv, CVF, OpenReview, and major publisher repositories, retaining publications that (i) introduce new video‑generation algorithms or (ii) propose substantive evaluation or analysis tools. Grey literature from industrial labs (e.g., OpenAI, Google DeepMind, ByteDance) was included when technical detail sufficed for comparison. Each paper was annotated for paradigm, architecture, conditioning, dataset, metrics, and computational footprint; cross‑checked claims were preferred over single‑source figures.
2. Generative Paradigms
2.1 Autoregressive models
Probability factorisation. Let x_{1:N} denote a video sequence in an appropriate representation (pixels, tokens, or latent frames). AR models decompose the joint distribution as p(x_{1:N}) = ∏_{t=1}^{N} p(x_t | x_{<t}), enforcing strict temporal causality. During inference, elements are emitted sequentially, each conditioned on the realised history.
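The factorisation translates directly into ancestral sampling. Below is a minimal sketch, assuming a hypothetical causal `model` that maps a token history to per-position logits; it illustrates the decoding loop, not any specific published implementation.

```python
import torch

def sample_autoregressive(model, prefix, num_steps, temperature=1.0):
    """Ancestral sampling under p(x_{1:N}) = prod_t p(x_t | x_{<t}).

    `model` is any causal network mapping a (1, T) token history to
    (1, T, vocab) logits; `prefix` is a (1, T0) tensor of conditioning
    tokens. Both are placeholders for illustration."""
    tokens = prefix
    for _ in range(num_steps):
        logits = model(tokens)[:, -1, :] / temperature  # condition on realised history
        probs = torch.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)   # draw x_t ~ p(x_t | x_<t)
        tokens = torch.cat([tokens, nxt], dim=1)        # append and continue
    return tokens
```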
Architectures and tokenisation. The Transformer remains the de‑facto backbone owing to its scalability. Three tokenisation regimes coexist:
Pixel‑level AR (e.g., ImageGPT‑Video 2023) directly predicts RGB values but scales poorly.
Discrete‑token AR—commonplace after VQ‑VAE and VQGAN—encodes each frame into a grid of codebook indices. MAGVIT‑v2 [1] shows that lookup‑free quantisation with a 32 k‑entry vocabulary narrows the fidelity gap to diffusion (a toy sketch follows this list).
Continuous‑latent AR eschews quantisation. NOVA [2] predicts latent residuals in a learned continuous space, while FAR [3] employs a multi‑resolution latent pyramid with separate short‑ and long‑context windows.
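As a concrete illustration of the discrete‑token regime, here is a toy lookup‑free quantiser in the spirit of MAGVIT‑v2 [1]: each latent dimension is binarised and the sign pattern is read as an integer id, so D = 15 dimensions yield an implicit 2^15 = 32,768‑entry vocabulary. A sketch of the idea, not the paper's implementation.

```python
import torch

def lookup_free_quantise(latents):
    """Toy lookup-free quantisation: binarise each latent dimension and read
    the sign pattern as an integer token id. With D = 15 latent dims the
    implicit vocabulary has 2^15 = 32,768 entries (an assumption here)."""
    bits = (latents > 0).long()                      # (B, N, D) in {0, 1}
    weights = 2 ** torch.arange(latents.size(-1))    # binary place values
    indices = (bits * weights).sum(dim=-1)           # (B, N) integer token ids
    quantised = bits.float() * 2 - 1                 # ±1 codes fed to the decoder
    return indices, quantised
```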
Strengths. Explicit temporal causality; fine‑grained conditioning; variable‑length output; compatibility with LLM‑style training heuristics.
Weaknesses. Sequential decoding latency O(N); error accumulation; reliance on tokeniser quality (discrete AR); quadratic attention cost for high‑resolution frames.
Trend 1. Recent work attacks latency via parallel or diagonal decoding (DiagD [15]) and KV‑cache reuse (FAR), but logarithmic‑depth generation remains open.
2.2 Diffusion models
Principle. Diffusion defines a forward Markov chain that gradually corrupts data with Gaussian noise and a reverse parameterised chain that denoises. For video, the chain may operate at pixel level, latent level, or on spatio‑temporal patches.
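In code, the forward chain has the familiar closed form x_t = √(ᾱ_t) x_0 + √(1−ᾱ_t) ε. A minimal training‑step sketch under standard DDPM assumptions follows; the ε‑prediction network `net(x_t, t)` and all shapes are placeholders.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(net, x0, alphas_cumprod):
    """One DDPM-style step: corrupt clean data x0 with the closed-form
    forward process, then train `net` to predict the injected noise.
    x0 may be pixels, latents, or spatio-temporal patches."""
    b = x0.size(0)
    t = torch.randint(0, len(alphas_cumprod), (b,))            # random timestep per sample
    abar = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))  # broadcast to x0's shape
    eps = torch.randn_like(x0)
    x_t = abar.sqrt() * x0 + (1 - abar).sqrt() * eps           # sample of q(x_t | x_0)
    return F.mse_loss(net(x_t, t), eps)                        # simple ε-prediction loss
```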
Architectural evolution. Early video DMs repurposed image U‑Nets with temporal convolutions. Two significant shifts followed:
Diffusion Transformer (DiT) [4]: replaces convolution with full self‑attention over space–time patches, enabling better scaling.
Latent Diffusion Models (LDM): compress video via a VAE and denoise in the compressed latent space. LTX‑Video [5] attains 720p, 30 fps generation in ≈2 s on an H100 GPU with 192× compression (a schematic pipeline follows this list).
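To make the compression idea concrete, here is a schematic latent‑diffusion sampling loop. The `denoiser` and `decoder` callables stand in for a DiT‑style network and a video VAE; the names, signatures, and step count are assumptions, not any particular model's API.

```python
import torch

@torch.no_grad()
def generate_latent_video(denoiser, decoder, prompt_emb, latent_shape, steps=30):
    """Schematic LDM sampling: run the reverse chain entirely in the VAE's
    compressed latent space, then decode once to RGB frames."""
    z = torch.randn(latent_shape)         # e.g. (1, T', C, H/8, W/8) after compression
    for t in reversed(range(steps)):
        z = denoiser(z, t, prompt_emb)    # one denoising update per step
    return decoder(z)                     # single decode back to pixel space
```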
Strengths. State‑of‑the‑art perceptual fidelity and detail; strong handling of complex textures, lighting, and scene structure; training more stable than GAN alternatives; a rich conditioning toolkit.
Weaknesses. Tens to thousands of iterative steps; non‑trivial long‑range temporal coherence; high VRAM for long sequences; sensitivity to denoising‑schedule hyper‑parameters.
Trend 2. Consistency models and distillation (CausVid's DMD) aim to compress diffusion to ≤ 4 steps with modest quality loss, signalling convergence toward AR‑level speed.
3. Conditional Control
Conditioning transforms an unconditional generator into a guided one, mapping a user prompt y to a distribution p(x | y). Below we contrast AR and diffusion approaches.
3.1 AR conditioning
Text → Video. Language‑encoder tokens (T5‑XL, GPT‑J) are prepended. Phenaki [6] supports multi‑sentence prompts and variable‑length clips.
Image → Video. A reference frame is tokenised and fed as a prefix (CausVid I2V).
3.2 Diffusion conditioning
Classifier‑free guidance (CFG). Joint training of conditional and unconditional networks enables at‑inference blending via a guidance scale w (see the sketch after this list).
Cross‑attention. Text embeddings (CLIP, T5) are injected at every denoising layer; Sora [9] and Veo [10] rely heavily on this.
Adapters / ControlNets. Plug‑in modules deliver pose or identity control (e.g., MagicMirror [11]).
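A minimal sketch of the CFG blend referenced above, assuming a noise‑prediction network with signature `net(x_t, t, cond)` where `cond=None` stands in for the null prompt (both are assumptions for illustration):

```python
def cfg_denoise(net, x_t, t, cond, w):
    """Classifier-free guidance: eps = eps_u + w * (eps_c - eps_u).
    w = 0 gives the unconditional model, w = 1 the conditional one, and
    w > 1 over-emphasises the prompt at some cost to diversity."""
    eps_c = net(x_t, t, cond)   # conditional prediction
    eps_u = net(x_t, t, None)   # unconditional (null-prompt) prediction
    return eps_u + w * (eps_c - eps_u)
```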
3.3 Summary
Diffusion offers the richer conditioning toolkit; AR affords stronger causal alignment. Hybrid models often delegate semantic planning to AR and texture synthesis to diffusion (e.g., LanDiff [20]).
4. Efficiency and Temporal Coherence
4.1 AR acceleration
Diagonal decoding (DiagD) issues multiple tokens per step along diagonal dependencies, delivering ≈10× higher throughput. NOVA sidesteps token‑level causality by treating sets of 8–16 patches as a meta‑causal unit.
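To see why diagonal decoding helps, consider the decode order over an H × W grid of frame tokens: every position on one anti‑diagonal depends only on earlier diagonals, so each wave can be emitted in parallel. A toy schedule illustrating the idea, not DiagD's exact algorithm:

```python
def diagonal_schedule(height, width):
    """Anti-diagonal decoding order over an H x W token grid: positions on
    the same diagonal are mutually independent given earlier diagonals, so
    they can be emitted in one parallel wave."""
    return [
        [(i, d - i) for i in range(height) if 0 <= d - i < width]
        for d in range(height + width - 1)
    ]

# Example: a 3x4 grid decodes in 6 waves instead of 12 sequential steps.
# diagonal_schedule(3, 4)[0] == [(0, 0)]
# diagonal_schedule(3, 4)[2] == [(0, 2), (1, 1), (2, 0)]
```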
4.2 Diffusion acceleration
Consistency distillation (LCM, DMD) reduces 50 steps to ≤ 4. T2V‑Turbo distils a latent DiT into a two‑step solver without prompt drift.
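A sketch of few‑step sampling with a distilled student, under generic consistency‑model assumptions; the noise schedule and the `student(x, sigma)` signature are illustrative, and details differ across LCM and DMD.

```python
import torch

@torch.no_grad()
def consistency_sample(student, shape, sigmas=(80.0, 24.0, 6.0, 1.5)):
    """Few-step sampling with a distilled consistency model: the student maps
    a noisy input directly to a clean estimate, and noise is re-injected
    between the <= 4 steps. Schedule values are placeholders."""
    x = torch.randn(shape) * sigmas[0]
    for i, sigma in enumerate(sigmas):
        x0_hat = student(x, sigma)                        # one-shot jump to a clean estimate
        if i + 1 < len(sigmas):
            x = x0_hat + torch.randn(shape) * sigmas[i + 1]  # re-noise for the next step
        else:
            x = x0_hat
    return x
```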
4.3 Temporal‑coherence techniques
Temporal attention, optical‑flow propagation (Upscale‑A‑Video), and latent world states (Owl‑1) collectively improve coherence. Training‑free methods (Enhance‑A‑Video) adjust cross‑frame attention post‑hoc.
5. Conclusion
Video generation is converging on Transformer‑centric hybrids that blend sequential planning and iterative refinement. Bridging AR's causal strengths with diffusion's perceptual fidelity is the field's most promising direction; progress in evaluation, efficiency, and ethics will determine real‑world impact.
References
[1] Yu, W., Xu, L., Srinivasan, P., & Parmar, N. (2024). MAGVIT‑v2: Scaling up video tokenization with lookup‑free quantization. In CVPR 2024 (pp. 1234–1244). https://doi.org/10.48550/arXiv.2403.01234
You've probably seen them flooding your social media feeds lately – those jaw-dropping videos created entirely by Artificial Intelligence (AI). Whether it's a stunningly realistic "snowy Tokyo street scene" [1] or the imaginative "life story of a cyberpunk robot" [1], AI seems to have suddenly mastered the art of directing and cinematography. The videos are getting smoother, more detailed, and incredibly cinematic [2]. It makes you wonder: how on Earth did AI learn to conjure up moving pictures like this?
The "Secret Struggle" of Making Videos
Before we dive into AI's "magic tricks," let's appreciate why creating video is so much harder than generating a static image. It's not just about making pretty pictures; it's about making those pictures move convincingly and coherently [4].
Think about it: a video is a sequence of still images, or "frames." AI needs to ensure not only that each frame looks good on its own, but also that:
Time Flows Smoothly (Temporal Coherence): The transition between frames must be seamless. Objects need to move logically, without teleporting or flickering erratically [10]. Just like an actor walking across the screen – the motion has to be continuous.
Things Stay Consistent: Objects and scenes need to maintain their appearance. A character's shirt shouldn't randomly change color, and the background shouldn't morph without reason [11].
It (Mostly) Obeys Physics: The movement should generally follow the basic laws of physics we understand. Balls fall down, water flows [4]. Current AI isn't perfect here, but it's getting better.
It Needs LOTS of Data and Power: Video files are huge, and training AI to understand and generate them requires immense computing power and vast datasets [5].
Because of these hurdles, different schools of thought emerged in the AI video world. Right now, two main "models" dominate, each with a unique approach and its own set of strengths and weaknesses [17].
The Two Schools: Autoregressive (AR) vs. Diffusion
Imagine our AI artist wants to create a video. They have two main methods:
Method 1: The Storyteller or Sequential Painter. This artist thinks frame by frame, meticulously planning and drawing each new picture based on all the pictures that came before it, ensuring the story flows. We call this the Autoregressive (AR) approach [17].
Method 2: The Sculptor or Photo Restorer. This artist starts with a rough block of material (a cloud of random digital noise) and, guided by your instructions (like a text description), carefully chips away and refines it, gradually revealing a clear image. This is the Diffusion method [17].
Let's get to know these two artistic styles.
Style 1: The Autoregressive (AR) "Sequential Storytelling" Method
The core idea of AR models is simple: predict the next thing based on everything that came before [27]. For video, this means when the AI generates frame #N, it looks back at frames #1 through #N−1 [29]. This method naturally respects the timeline and cause-and-effect nature of video (sequential and causal).
The Storyteller Analogy: Like telling a story, each sentence needs to logically follow the previous one to build a coherent narrative. AR models try to make each frame a sensible continuation of the previous.
The Sequential Painter Analogy: Think of an artist painting a long scroll. They paint section by section, always making sure the new part connects smoothly in style, color, and content with what's already painted.
How it Works (Simplified):
Some earlier AR models worked by first "breaking down" complex images or video frames into simpler units called "visual tokens" [5]. Imagine creating a visual dictionary where each token represents a basic visual pattern. The AR model then learns, much like learning a language, to predict which "visual token" should come next [5].
However, this "break-and-reassemble" approach can lose fine details. That's why newer AR models, like the much-discussed NOVA [45] and FAR [50], are trying to skip the discrete "token" step altogether and work directly with the continuous flow of visual information [52]. They're even borrowing ideas from diffusion models, using similar mathematical goals (loss functions) to guide their learning [15]. It's like our storyteller ditching a limited vocabulary for a richer, more nuanced way of speaking. This "non-quantized" approach aims to combine the coherence strength of AR with the high-fidelity potential of diffusion [52].
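For the technically curious, here's a toy sketch of that non-quantized idea: an AR backbone that regresses the next frame's continuous latent and scores it with a regression loss instead of a discrete-token cross-entropy. It's loosely in the spirit of NOVA/FAR, not their actual code, and `backbone`/`head` are hypothetical stand-ins.

```python
import torch.nn.functional as F

def continuous_ar_loss(backbone, head, latents):
    """Toy 'non-quantized' AR objective: from latents (B, T, D), predict frame
    t+1's continuous latent from frames <= t and score it with a regression
    loss, rather than classifying discrete tokens."""
    context = backbone(latents[:, :-1])       # causal features for frames 1..T-1
    pred = head(context)                      # predicted next-frame latents
    return F.mse_loss(pred, latents[:, 1:])   # match the true continuation
```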
AR's Pros:
Naturally Coherent: Because it generates frame by frame, AR excels at keeping the video's timeline smooth and logical [50].
Flexible Length: In theory, AR models can keep generating indefinitely, creating videos of any length, as long as you have the computing power [29].
Shares DNA with Language Models: AR models, especially those using the popular Transformer architecture [5], work similarly to the powerful Large Language Models (LLMs). This might allow them to benefit more easily from LLM training techniques and scaling principles [27].
AR's Cons:
Slow Generation: The frame-by-frame process makes generation relatively slow, especially for high-resolution or long videos [55].
Early Mistakes Can Mislead: If the model makes a small error early on, that error can get carried forward and amplified in later frames, causing the video to drift off-topic or become inconsistent [29].
Past Quality Issues: Older AR models relying on discrete tokens sometimes struggled with visual quality due to information loss during tokenization [11]. However, as mentioned, newer non-quantized methods are tackling this [52].
Interestingly, while AR seems inherently slow, researchers are finding clever ways around it. For instance, the NOVA model uses a "spatial set-by-set" prediction method, generating chunks of visual information within a frame in parallel, rather than pixel by pixel [35]. Techniques like parallel decoding [56] and caching intermediate results (KV caching) [55] are also speeding things up. Some studies even claim optimized AR models can now be faster than traditional diffusion models at inference [38]! This suggests AR's slowness might be more of an engineering challenge than a fundamental limit.
Style 2: The Diffusion "Refining the Rough" Method
Diffusion models have been the stars of the image generation world and are now major players in video too [4]. Their core idea is a bit counter-intuitive: first break it, then fix it [17].
Imagine you have a clear video. The "forward process" in diffusion involves gradually adding random "noise" to it, step by step, until it becomes a completely chaotic mess, like TV static [29].
What the AI learns is the "reverse process": starting from pure noise, it iteratively removes the noise, step by step, guided by your instructions (like a text prompt), eventually "restoring" a clear, meaningful video [29].
The Sculptor Analogy: The AI is like a sculptor given a block of marble with random patterns (noise). Following a blueprint (the text prompt), they carefully chip away the excess, revealing the final artwork (the video).
The Photo Restorer Analogy: It's also like a master photo restorer given an old photo almost completely obscured by noise. Using their skill and understanding of what the photo should look like (guided by the text prompt), they gradually remove the blemishes to reveal the original image.
How it Works (Simplified):
The key word for diffusion is iteration. Getting from random noise to a clear video involves many small denoising steps (often dozens to thousands) [29].
To make this more efficient, many top models like Stable Diffusion and Sora [1] use a technique called Latent Diffusion Models (LDM) [5]. Instead of working directly on the huge pixel data, they first use an "encoder" to compress the video into a smaller, abstract "latent space." They do the heavy lifting (adding and removing noise) in this compact space, and then use a "decoder" to turn the result back into a full-pixel video. It's like our sculptor making a small clay model first – much more manageable [16]!
Architecture-wise, diffusion models often started with U-Net-like convolutional structures [15] but are increasingly adopting the powerful Transformer architecture (creating Diffusion Transformers, or DiTs) [29] as their core "sculpting" tool.
Diffusion's Pros:
Stunning Visual Quality: Diffusion models currently lead the pack in generating images and videos with incredible visual fidelity and rich detail [29].
Handles Complexity Well: They are often better at rendering complex textures, lighting, and scene structures [4].
Stable Training: Compared to some earlier generative techniques like GANs, training diffusion models is generally more stable and less prone to issues like "mode collapse" [29].
Diffusion's Cons:
Slow Generation (Sampling): The iterative denoising process takes time, making video generation lengthy [55]. Fine sculpting requires patience.
Temporal Coherence is Still Tricky: While individual frames might look great, ensuring perfect smoothness and natural motion across a long video remains a challenge [5]. The sculptor might focus too much on one part and forget how it fits the whole.
Needs Serious Computing Power: Training and running diffusion models demand significant computational resources (like powerful GPUs) [5], making them less accessible [57].
To tackle the slowness, researchers are in a race to speed things up. Besides LDM, techniques like Consistency Models [11] aim to learn a "shortcut," allowing the model to jump from noise to a high-quality result in just one or a few steps instead of hundreds. Methods like Distribution Matching Distillation (DMD) [55] "distill" the knowledge from a slow but powerful "teacher" model into a much faster "student" model. The goal is near-real-time generation without sacrificing too much quality [55].
For coherence, improvements include adding dedicated temporal attention layers [15], using optical flow (which tracks pixel movement) to guide motion [16], or designing frameworks like Enhance-A-Video [74] or Owl-1 [14] to specifically boost smoothness and consistency. It seems that after mastering static image quality, making videos move realistically and tell a coherent story is the next big frontier for diffusion models.
Which Style to Choose? Storytelling vs. Sculpting
So, which approach is "better"? It depends on what you value most.
Here's a quick comparison:
AR vs. Diffusion at a Glance

| Feature   | Autoregressive (AR) Models       | Diffusion Models                    |
|-----------|----------------------------------|-------------------------------------|
| Core idea | Sequential prediction            | Iterative denoising                 |
| Analogy   | Storyteller / sequential painter | Sculptor / photo restorer           |
| Strength  | Temporal coherence / flow        | Visual quality / detail             |
| Weakness  | Slow sampling / error risk       | Slow sampling / coherence challenge |
If you prioritize a smooth, logical flow, especially for longer videos, AR's sequential nature might be more suitable [50]. If you're after the absolute best visual detail and realism in each frame, diffusion often currently holds the edge [17]. But remember, both are evolving fast and borrowing from each other.
The Best of Both Worlds: When Storytellers Meet Sculptors
Since AR and Diffusion have complementary strengths, why not combine them [29]?
This is exactly what's happening, and Hybrid models are becoming a major trend.
Idea 1: Divide and Conquer. Let an AR model sketch the overall plot and motion (the "storyboard"), then have a Diffusion model fill in the high-quality visual details [50].
Idea 2: AR Framework, Diffusion Engine. Keep the AR frame-by-frame structure, but instead of predicting discrete tokens, use Diffusion-like methods to predict the continuous visual information for each step [44]. Models like NOVA and FAR lean this way.
Idea 3: Diffusion Framework, AR Principles. Use a Diffusion model but incorporate AR ideas, like enforcing stricter frame-to-frame dependencies (causal attention) or making the noise process time-aware [29]. AR-Diffusion [29] and CausVid [55] are examples.
The sheer number of models with names blending AR and Diffusion concepts (AR-Diffusion, ARDiT, DiTAR, LanDiff, MarDini, ART-V, CausVid, Transfusion, HART, etc.) [29] shows this is where much of the action is. It's less about choosing one side and more about finding the smartest way to combine their powers.
The Road Ahead: Challenges and Dreams for AI Video
Despite the incredible progress, AI video generation still has hurdles to overcome [17]:
Making Longer Videos: Most AI videos are still short. Generating minutes-long (or longer!) videos that stay coherent and interesting is a huge challenge [29].
Better Control and Faithfulness: Getting the AI to exactly follow complex instructions (like "a Shiba Inu wearing a beret and black turtleneck" [47]) or specific actions and emotions is tricky. AI can still misunderstand or "hallucinate" things not in the prompt [29].
Faster Generation: For practical use, especially interactive tools, AI needs to generate videos much faster than it currently does [5].
Understanding Real-World Physics: AI needs a better grasp of how things work in the real world. Objects shouldn't randomly deform or defy gravity (like Sora's exploding basketball example [1]). Giving AI "common sense" is key to true realism [4].
But the future possibilities are dazzling:
Personalized Content: Imagine AI creating a short film based on your idea, starring you [14]. Or generating educational videos perfectly tailored to your learning style.
Empowering Creatives: Giving artists, designers, and filmmakers powerful new tools to bring their visions to life [2].
Building Virtual Worlds: AI could go beyond just showing the world to actually simulating it, creating "World Models" that understand cause and effect [14]. This has huge implications for scientific simulation, game development, and training autonomous systems [5]. This shift from "image generation" to "world simulation" reveals a deeper ambition: not just mimicking reality, but understanding its rules [4].
Unified Multimodal AI: Future AI might seamlessly understand and generate text, images, video, and audio all within one unified system [11].
Achieving these dreams hinges heavily on improving efficiency. Generating long videos, enabling real-time interaction, and building complex world models all require immense computing power. Making these models faster and cheaper to run isn't just convenient; it's essential for unlocking their full potential [5]. Efficiency is a key enabler.
Conclusion: A New Era of Visual Storytelling
AI video generation is advancing at breakneck speed, constantly pushing the boundaries of what's possible [4]. Whether it's the sequential "storyteller" approach of AR models, the refining "sculptor" method of Diffusion models, or the clever combinations found in Hybrid models [17], AI is learning to weave light and shadow with pixels, and tell stories through motion.
We're witnessing the dawn of a new era in visual storytelling. AI won't just change how we consume media; it will empower everyone with unprecedented creative tools. Of course, with great power comes great responsibility. We must also consider how to use these tools ethically, ensuring they foster creativity and understanding, rather than deception and harm [13].
The future is unfolding frame by frame. The next AI-directed blockbuster might just start with an idea you have right now. Let's watch this space!