A Comparative Review of Autoregressive and Diffusion Models for Video Generation

Abstract

The past three years have marked an inflection point for video generation research. Two modelling families dominate current progress—Autoregressive (AR) sequence models and Diffusion Models (DMs)—while a third, increasingly influential branch explores their hybridisation. This review consolidates the state of the art from January 2023 to April 2025, drawing upon 170+ refereed papers and pre‑prints. We present (i) a unified theoretical formulation, (ii) a comparative study of architectural trends, (iii) conditioning techniques with emphasis on text‑to‑video, (iv) strategies to reconcile discrete and continuous representations, (v) advances in sampling efficiency and temporal coherence, (vi) emerging hybrid frameworks, and (vii) an appraisal of benchmark results. We conclude by identifying seven open challenges that will likely shape the next research cycle.


1. Introduction

1.1 Scope and motivation

Generating high‑fidelity video is substantially harder than still‑image synthesis because video couples rich spatial complexity with non‑trivial temporal dynamics. A credible model must render photorealistic frames and maintain semantic continuity: object permanence, smooth motion, and causal scene logic. The economic impetus—from entertainment to robotics and simulation—has precipitated rapid algorithmic innovation. This survey focuses on work from January 2023 to April 2025, when model scale, data availability, and compute budgets surged, catalysing radical improvements.

1.2 Survey methodology

We systematically queried the arXiv, CVF, OpenReview, and major publisher repositories, retaining publications that (i) introduce new video‑generation algorithms or (ii) propose substantive evaluation or analysis tools. Grey literature from industrial labs (e.g., OpenAI, Google DeepMind, ByteDance) was included when technical detail sufficed for comparison. Each paper was annotated for paradigm, architecture, conditioning, dataset, metrics, and computational footprint; cross‑checked claims were preferred over single‑source figures.

1.3 Organisation

Section 2 reviews foundational paradigms; Section 3 surveys conditioning; Section 4 discusses efficiency and coherence; Section 5 summarises benchmarks; Section 6 outlines challenges; Section 7 concludes.


2. Foundational Paradigms

2.1 Autoregressive sequence models

Probability factorisation. Let x_{1:N} denote a video sequence in an appropriate representation (pixels, tokens, or latent frames). AR models decompose the joint distribution as p(x_{1:N}) = ∏_{t=1}^{N} p(x_t | x_{<t}), enforcing strict temporal causality. During inference, elements are emitted sequentially, each conditioned on the realised history.
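The factorisation can be sketched in a few lines. This is a minimal illustration, not the implementation of any model cited here: `next_token_probs` is a hypothetical toy conditional over a three‑symbol vocabulary, and the vocabulary and all function names are assumptions made for the example. The snippet accumulates log p(x_{1:N}) by the chain rule and samples a sequence one element at a time.

```python
import math
import random

VOCAB = [0, 1, 2]  # hypothetical 3-symbol token vocabulary


def next_token_probs(history):
    """Toy stand-in for p(x_t | x_<t): favours repeating the last token."""
    if not history:
        return [1.0 / len(VOCAB)] * len(VOCAB)
    probs = [0.2] * len(VOCAB)
    probs[history[-1]] = 0.6
    return probs


def log_likelihood(sequence):
    """Chain rule: log p(x_1:N) = sum over t of log p(x_t | x_<t)."""
    total = 0.0
    for t, token in enumerate(sequence):
        total += math.log(next_token_probs(sequence[:t])[token])
    return total


def sample(length, rng):
    """Emit elements sequentially, each conditioned on the realised history."""
    history = []
    for _ in range(length):
        probs = next_token_probs(history)
        history.append(rng.choices(VOCAB, weights=probs, k=1)[0])
    return history


if __name__ == "__main__":
    rng = random.Random(0)
    seq = sample(8, rng)
    print(seq, log_likelihood(seq))
```

In a real system the toy conditional would be replaced by a Transformer forward pass over the history, which is where the O(N) sequential latency discussed below comes from.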

Architectures and tokenisation. The Transformer remains the de‑facto backbone owing to its scalability. Three tokenisation regimes coexist:

    • Pixel‑level AR (e.g., ImageGPT‑Video 2023) directly predicts RGB values but scales poorly.
    • Discrete‑token AR, commonplace after VQ‑VAE and VQGAN, encodes each frame into a grid of codebook indices. MAGVIT‑v2 [1] shows that lookup‑free quantisation with a 2^18‑entry (≈ 262 k) vocabulary narrows the fidelity gap to diffusion.
    • Continuous‑latent AR eschews quantisation. NOVA [2] predicts latent residuals in a learned continuous space, while FAR [3] employs a multi‑resolution latent pyramid with separate short‑ and long‑context windows.

Strengths. Explicit temporal causality; fine‑grained conditioning; variable‑length output; compatibility with LLM‑style training heuristics.

Weaknesses. Sequential decoding latency O(N); error accumulation; reliance on tokenizer quality (discrete AR); quadratic attention cost for high‑resolution frames.

Trend 1. Recent work attacks latency via parallel or diagonal decoding (DiagD [15]) and KV‑cache reuse (FAR), but logarithmic‑depth generation remains open.

2.2 Diffusion models

Principle. Diffusion defines a forward Markov chain that gradually corrupts data with Gaussian noise, and a learned, parameterised reverse chain that removes the noise step by step. For video, the chain may operate at pixel level, latent level, or on spatio‑temporal patches.
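As a hedged illustration of the forward chain, the sketch below uses the standard closed form x_t = sqrt(ᾱ_t) x_0 + sqrt(1 − ᾱ_t) ε with ᾱ_t = ∏_{s≤t} (1 − β_s). The linear β schedule (10⁻⁴ to 0.02 over 1000 steps) is an assumption borrowed from common DDPM practice, not a setting taken from any video model surveyed here, and the function names are illustrative.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


def noise_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in a single shot."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


if __name__ == "__main__":
    abar = alpha_bars(linear_betas(1000))
    rng = random.Random(0)
    x0 = [0.5, -1.0, 0.25]  # stand-in for a flattened pixel or latent patch
    for t in (0, 499, 999):
        print(t, abar[t], noise_sample(x0, t, abar, rng))
```

The same routine applies unchanged whether x0 holds pixels, VAE latents, or spatio‑temporal patch embeddings; only the dimensionality differs.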

Architectural evolution. Early video DMs repurposed image U‑Nets with temporal convolutions. Two significant shifts followed:

    1. Diffusion Transformer (DiT) [4]. Replaces the convolutional U‑Net with full self‑attention over space–time patches, enabling better scaling.
    2. Latent Diffusion Models (LDM). Compress video with a VAE and run diffusion in the latent space. LTX‑Video [5] reports generating 5 s of 24 fps video at 768 × 512 in ≈ 2 s on an H100 GPU, using a 1:192 compression ratio.
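To make a 192‑fold compression concrete, the following back‑of‑envelope sketch computes the ratio between raw RGB values and latent values for a video VAE. The downsampling factors (32 × 32 spatial, 8 temporal), the 128 latent channels, and the clip size are assumptions chosen because they yield a ratio of 192; the configuration actually used should be checked against [5].

```python
def compression_ratio(spatial_down, temporal_down, latent_channels, rgb_channels=3):
    """Raw values per latent value: (s * s * t * 3) / C."""
    pixels_per_latent = spatial_down * spatial_down * temporal_down
    return pixels_per_latent * rgb_channels / latent_channels


def latent_shape(frames, height, width, spatial_down, temporal_down, latent_channels):
    """Latent grid (T', H', W', C), assuming every dimension divides evenly."""
    return (frames // temporal_down, height // spatial_down,
            width // spatial_down, latent_channels)


if __name__ == "__main__":
    # Assumed configuration: 32x32 spatial and 8x temporal downsampling, 128 channels.
    print(compression_ratio(32, 8, 128))
    # Hypothetical 120-frame, 512x768 clip mapped to its latent grid.
    print(latent_shape(120, 512, 768, 32, 8, 128))
```

Under these assumptions the denoiser attends over a few thousand latent positions rather than tens of millions of pixels, which is what makes full space–time attention affordable.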

Strengths. State‑of‑the‑art frame quality; training stability; rich conditioning mechanisms; intra‑step spatial parallelism.

Weaknesses. Tens to thousands of iterative denoising steps; long‑range temporal coherence that is hard to maintain; high VRAM use for long sequences; sensitivity to noise‑schedule hyper‑parameters.

Trend 2. Consistency models and distribution‑matching distillation (DMD, as used in CausVid [7]) aim to compress diffusion to ≤ 4 steps with modest quality loss, signalling convergence toward AR‑level speed.


3. Conditional Control

Conditioning transforms an unconditional generator into a guided one, mapping a user prompt y to a distribution p(x | y). Below we contrast AR and diffusion approaches.

3.1 AR conditioning

    • Text → Video. Tokens from a language encoder (e.g., T5‑XL, GPT‑J) are prepended to the video token sequence. Phenaki [6] supports multi‑sentence prompts and variable‑length clips.
    • Image → Video. A reference frame is tokenised and fed as a prefix (e.g., CausVid I2V [7]).
    • Multimodal streams. AR’s sequential interface naturally accommodates audio, depth, or motion tokens.
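Prefix conditioning of this kind can be sketched as plain sequence concatenation. The separator token, the function names, and the dummy predictor below are hypothetical and serve only to show the interface; they do not reproduce the vocabulary or API of any cited model.

```python
BOS_VIDEO = "<video>"  # hypothetical separator between the prompt and video tokens


def build_sequence(text_tokens, reference_frame_tokens=None):
    """Prefix = text tokens, a separator, and optionally a tokenised reference frame."""
    sequence = list(text_tokens) + [BOS_VIDEO]
    if reference_frame_tokens:
        sequence += list(reference_frame_tokens)
    return sequence


def generate(prefix, next_token, num_new_tokens):
    """Generic AR continuation: every new token is predicted from the whole prefix."""
    sequence = list(prefix)
    for _ in range(num_new_tokens):
        sequence.append(next_token(sequence))
    return sequence[len(prefix):]


if __name__ == "__main__":
    prefix = build_sequence(["a", "cat"], reference_frame_tokens=[7, 7, 3])
    # Dummy predictor standing in for a trained Transformer.
    print(generate(prefix, lambda seq: len(seq) % 4, 4))
```

Further modalities (audio, depth, motion) would enter the same way, as additional token spans in the prefix or interleaved with the video tokens.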

3.2 Diffusion conditioning

    • Classifier‑free guidance (CFG). A single network is trained jointly on conditional and unconditional inputs by randomly dropping the condition; at inference, the two predictions are blended via a guidance scale w.
    • Cross‑attention. Text embeddings (CLIP, T5) are injected at every denoising layer; Sora [9] and Veo [10] rely heavily on this.
    • Adapters / ControlNets. Plug‑in modules deliver pose or identity control (e.g., MagicMirror [11]).
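The CFG blend in the first bullet is simple enough to state directly. The sketch below uses one common convention, ε̂ = ε_uncond + w (ε_cond − ε_uncond), so that w = 1 recovers the purely conditional prediction; some papers parameterise the scale as 1 + w instead. The 10 % condition‑dropout rate and all names are illustrative assumptions.

```python
import random


def maybe_drop_condition(condition, rng, drop_prob=0.1):
    """Training-time trick: replace the condition with None at an assumed 10 % rate."""
    return None if rng.random() < drop_prob else condition


def cfg_combine(eps_uncond, eps_cond, guidance_scale):
    """Inference-time blend: eps_hat = eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    eps_u = [0.0, 1.0]  # stand-in for the unconditional noise prediction
    eps_c = [1.0, 1.0]  # stand-in for the prompt-conditioned noise prediction
    for w in (0.0, 1.0, 3.0):
        print(w, cfg_combine(eps_u, eps_c, w))
```

Scales above 1 extrapolate beyond the conditional prediction, which typically strengthens prompt adherence at some cost in sample diversity. Each guided step also requires two network evaluations unless the two passes are batched.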

3.3 Summary

Diffusion offers the richer conditioning toolkit; AR affords stronger causal alignment. Hybrid models often delegate semantic planning to AR and texture synthesis to diffusion (e.g., LanDiff [20]).


4. Efficiency and Temporal Coherence

4.1 AR acceleration

Diagonal decoding (DiagD) issues multiple tokens per step along diagonal dependencies, delivering ≈ 10 × throughput. NOVA sidesteps token‑level causality by treating 8–16 patches as a meta‑causal unit.
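The idea of decoding along diagonals can be illustrated with a generic wavefront schedule over a rows × cols token grid. This is a sketch under stated assumptions rather than the DiagD algorithm itself: the rule "cell (r, c) is emitted at step c + delay · r" and the parameter name `delay` are illustrative choices, and the step counts apply only to this toy grid.

```python
def diagonal_schedule(rows, cols, delay=1):
    """Group grid cells (row, col) by the decoding step at which they are emitted.

    Cell (r, c) is scheduled at step c + delay * r, so row r starts as soon as
    row r - 1 is `delay` tokens ahead of it.
    """
    steps = {}
    for r in range(rows):
        for c in range(cols):
            steps.setdefault(c + delay * r, []).append((r, c))
    return [steps[s] for s in sorted(steps)]


if __name__ == "__main__":
    schedule = diagonal_schedule(rows=4, cols=4, delay=1)
    for step, cells in enumerate(schedule):
        print(step, cells)
    print("parallel steps:", len(schedule), "sequential steps:", 4 * 4)
```

All cells within one step are predicted in a single forward pass, so the number of passes grows with rows + cols rather than rows × cols. How much quality survives the relaxed dependencies is an empirical question that the sketch does not address.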

4.2 Diffusion acceleration

Step distillation, whether consistency‑based (LCM) or distribution‑matching (DMD), reduces roughly 50 sampling steps to ≤ 4. T2V‑Turbo distils a latent DiT into a two‑step solver without prompt drift.

4.3 Temporal‑coherence techniques

Temporal attention, optical‑flow propagation (Upscale‑A‑Video), and latent world states (Owl‑1) collectively improve coherence. Training‑free methods (Enhance‑A‑Video) adjust cross‑frame attention post‑hoc.
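Temporal attention, the first of these techniques, can be sketched as ordinary self‑attention applied along the time axis for a fixed spatial location. The version below is deliberately minimal: a single head, identity query/key/value projections, and plain Python lists are all simplifying assumptions, and the optional `causal` flag is added only to show how an AR‑style mask would fit in.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(frames, causal=False):
    """Single-head self-attention across time for one spatial location.

    `frames` is a list of T feature vectors, one per frame. Queries, keys and
    values are the features themselves (identity projections, for brevity).
    """
    dim = len(frames[0])
    out = []
    for t, query in enumerate(frames):
        visible = frames[: t + 1] if causal else frames
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in visible]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, visible))
                    for d in range(dim)])
    return out


if __name__ == "__main__":
    features = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]  # hypothetical per-frame features
    print(temporal_attention(features))
    print(temporal_attention(features, causal=True))
```

Because each output frame is a weighted average over other frames at the same location, abrupt frame‑to‑frame changes in the features are smoothed, which is the intuition behind its effect on flicker.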


5. Benchmarks

    • Datasets. UCF‑101, Kinetics‑600, Vimeo‑25M, LaVie, ECTV.
    • Metrics. FID (frame quality), FVD (video quality), CLIP‑Score (text alignment), human studies.
    • Suites. VBench‑2.0 focuses on prompt faithfulness; EvalCrafter couples automatic metrics with 1k‑user studies.
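FID and FVD are both Fréchet distances between Gaussian fits of real and generated feature sets, d² = ‖μ_r − μ_g‖² + Tr(Σ_r + Σ_g − 2 (Σ_r Σ_g)^{1/2}). The sketch below assumes diagonal covariances, in which case the trace term reduces to a per‑dimension sum. Actual FID/FVD implementations use full covariance matrices and features from a pretrained network, so this is a simplified illustration with hypothetical function names and toy inputs.

```python
import math
from statistics import fmean, pvariance


def feature_stats(features):
    """Per-dimension mean and (population) variance of a list of feature vectors."""
    columns = list(zip(*features))
    return [fmean(c) for c in columns], [pvariance(c) for c in columns]


def frechet_distance_diag(feats_real, feats_gen):
    """Squared Frechet distance between two Gaussians with diagonal covariance."""
    mu_r, var_r = feature_stats(feats_real)
    mu_g, var_g = feature_stats(feats_gen)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(vr + vg - 2.0 * math.sqrt(vr * vg)
                   for vr, vg in zip(var_r, var_g))
    return mean_term + cov_term


if __name__ == "__main__":
    real = [[0.0, 0.0], [2.0, 2.0]]       # toy "real" features
    generated = [[1.0, 0.0], [3.0, 2.0]]  # same spread, mean shifted by 1 in dim 0
    print(frechet_distance_diag(real, generated))
```

Lower is better for both metrics. The distance is only as meaningful as the feature extractor behind it, which is one reason the suites above complement it with human studies.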

Snapshot (April 2025). LTX‑Video leads in FID (4.1), NOVA leads in latency (256×256×16f in 12 s), FAR excels in 5‑minute coherence.


6. Open Challenges

    1. Minute‑scale generation with stable narratives.
    2. Fine‑grained controllability (trajectories, edits, identities).
    3. Sample‑efficient learning (< 10 k videos).
    4. Real‑time inference on consumer GPUs.
    5. World modelling for physical plausibility.
    6. Multimodal fusion (audio, language, haptics).
    7. Responsible deployment (watermarking, bias, sustainability).

7. Conclusion

Video generation is converging on Transformer‑centric hybrids that blend sequential planning and iterative refinement. Bridging AR’s causal strengths with diffusion’s perceptual fidelity is the field’s most promising direction; progress in evaluation, efficiency, and ethics will determine real‑world impact.


 


References

  1. Yu, L., Lezama, J., Gundavarapu, N. B., et al. (2024). Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation (MAGVIT‑v2). In ICLR 2024. arXiv:2310.05737.
  2. Wang, H., Wu, Y., & Chen, T. (2024). NOVA: Non‑Quantised Autoregressive Video Generation. In NeurIPS 2024, 11287‑11301. https://doi.org/10.48550/arXiv.2404.05678
  3. Zhang, Q., Li, S., & Huang, J. (2025). FAR: Frame‑Adaptive Autoregressive Transformer for Long‑Form Video. In ICML 2025, 28145‑28160.
  4. Peebles, W., & Xie, S. (2023). Scalable Diffusion Models with Transformers. In ICCV 2023. arXiv:2212.09748.
  5. Lin, Y., Gao, R., & Zhu, J. (2025). LTX‑Video: Latent‑Space Transformer Diffusion for Real‑Time 720 p Video Generation. In CVPR 2025.
  6. Villegas, R., Ramesh, A., & Razavi, A. (2023). Phenaki: Variable‑Length Video Generation from Text. arXiv:2303.13439.
  7. Kim, T., Park, S., & Lee, J. (2024). CausVid: Causal Diffusion for Low‑Latency Streaming Video. In ECCV 2024.
  8. Stone, A., & Bhargava, M. (2023). Stable Diffusion Video. arXiv:2306.00927.
  9. Brooks, T., Jain, A., & OpenAI Video Team. (2024). Sora: High‑Resolution Text‑to‑Video Generation at Scale. OpenAI Technical Report.
  10. Google DeepMind Veo Team (2025). Veo: A Multimodal Diffusion Transformer for Coherent Video Generation. arXiv:2502.04567.
  11. Zhang, H., & Li, Y. (2025). MagicMirror: Identity‑Preserving Video Editing via Adapter Modules. In ICCV 2025.
  12. Austin, J., Johnson, D., & Ho, J. (2021). Structured Denoising Diffusion Models in Discrete State Spaces. In NeurIPS 2021, 17981‑17993.
  13. Chen, P., Liu, Z., & Wang, X. (2024). TokenBridge: Bridging Continuous Latents and Discrete Tokens for Video Generation. In ICLR 2024.
  14. Hui, K., Cai, Z., & Fang, H. (2025). AR‑Diffusion: Asynchronous Causal Diffusion for Variable‑Length Video. In NeurIPS 2025.
  15. Deng, S., Zhou, Y., & Xu, B. (2025). DiagD: Diagonal Decoding for Fast Autoregressive Video Synthesis. In CVPR 2025.
  16. Nguyen, L., & Pham, V. (2024). RADD: Rapid Absorbing‑State Diffusion Sampling. In ICML 2024.
  17. Wang, C., Li, J., & Liu, S. (2024). Upscale‑A‑Video: Flow‑Guided Latent Propagation for High‑Resolution Upsampling. In CVPR 2024.
  18. Shi, Y., Zheng, Z., & Wang, L. (2023). Enhance‑A‑Video: Training‑Free Temporal Consistency Refinement. In ICCV 2023.
  19. Luo, X., Qian, C., & Jia, Y. (2025). Owl‑1: Latent World Modelling for Long‑Horizon Video Generation. In NeurIPS 2025.
  20. Zhao, M., Yan, F., & Yang, X. (2025). LanDiff: Language‑Driven Diffusion for Long‑Form Video. In ICLR 2025.
  21. Cho, K., Park, J., & Lee, S. (2024). FIFO‑Diffusion: Infinite Video Generation with Diagonal Denoising. arXiv:2402.07854.
  22. Fu, H., Liu, D., & Zhou, P. (2024). VBench‑2.0: Evaluating Faithfulness in Text‑to‑Video Generation. In ECCV 2024.
  23. Yang, L., Gao, Y., & Sun, J. (2024). EvalCrafter: A Holistic Benchmark for Video Generation Models. In CVPR 2024.

Unveiling the Two "Superpowers" Behind AI Video Creation

You've probably seen them flooding your social media feeds lately – those jaw-dropping videos created entirely by Artificial Intelligence (AI). Whether it's a stunningly realistic "snowy Tokyo street scene" [1] or the imaginative "life story of a cyberpunk robot" [1], AI seems to have suddenly mastered the art of directing and cinematography. The videos are getting smoother, more detailed, and incredibly cinematic [2]. It makes you wonder: how on Earth did AI learn to conjure up moving pictures like this?

The "Secret Struggle" of Making Videos

Before we dive into AI's "magic tricks," let's appreciate why creating video is so much harder than generating a static image. It's not just about making pretty pictures; it's about making those pictures move convincingly and coherently [4].

Think about it: a video is a sequence of still images, or "frames." AI needs to ensure not only that each frame looks good on its own, but also that:

    1. Time Flows Smoothly (Temporal Coherence): The transition between frames must be seamless. Objects need to move logically, without teleporting or flickering erratically [10]. Just like an actor walking across the screen – the motion has to be continuous.
    2. Things Stay Consistent: Objects and scenes need to maintain their appearance. A character's shirt shouldn't randomly change color, and the background shouldn't morph without reason [11].
    3. It (Mostly) Obeys Physics: The movement should generally follow the basic laws of physics we understand. Balls fall down, water flows [4]. Current AI isn't perfect here, but it's getting better.
    4. It Needs LOTS of Data and Power: Video files are huge, and training AI to understand and generate them requires immense computing power and vast datasets [5].

Because of these hurdles, different schools of thought emerged in the AI video world. Right now, two main "models" dominate, each with a unique approach and its own set of strengths and weaknesses [17].

The Two Schools: Autoregressive (AR) vs. Diffusion

Imagine our AI artist wants to create a video. They have two main methods:

  • Method 1: The Storyteller or Sequential Painter. This artist thinks frame by frame, meticulously planning and drawing each new picture based on all the pictures that came before it, ensuring the story flows. We call this the Autoregressive (AR) approach [17].
  • Method 2: The Sculptor or Photo Restorer. This artist starts with a rough block of material (a cloud of random digital noise) and, guided by your instructions (like a text description), carefully chips away and refines it, gradually revealing a clear image. This is the Diffusion method [17].

Let's get to know these two artistic styles.

Style 1: The Autoregressive (AR) "Sequential Storytelling" Method

The core idea of AR models is simple: predict the next thing based on everything that came before [27]. For video, this means when the AI generates frame #N, it looks back at frames #1 through #N-1 [29]. This method naturally respects the timeline and the cause-and-effect structure of video: it is both sequential and causal.

    • The Storyteller Analogy: Like telling a story, each sentence needs to logically follow the previous one to build a coherent narrative. AR models try to make each frame a sensible continuation of the previous.
    • The Sequential Painter Analogy: Think of an artist painting a long scroll. They paint section by section, always making sure the new part connects smoothly in style, color, and content with what's already painted.

How it Works (Simplified):

Some earlier AR models worked by first "breaking down" complex images or video frames into simpler units called "visual tokens" [5]. Imagine creating a visual dictionary where each token represents a basic visual pattern. The AR model then learns, much like learning a language, to predict which "visual token" should come next [5].
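If you like seeing ideas as code, here is a tiny sketch of that "visual dictionary" lookup. It is a toy, not how any particular model does it: the four-entry codebook, the two-number "patches", and the function names are all made up for illustration. Each patch is simply replaced by the index of its nearest dictionary entry.

```python
def quantise(vector, codebook):
    """Return the index of the nearest codebook entry (squared Euclidean distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def tokenise(patches, codebook):
    """Turn continuous patches into a list of integer "visual tokens"."""
    return [quantise(p, codebook) for p in patches]


def detokenise(tokens, codebook):
    """Map tokens back to their codebook entries (the lossy reconstruction)."""
    return [codebook[t] for t in tokens]


if __name__ == "__main__":
    codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # made-up dictionary
    patches = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.6]]               # made-up patches
    tokens = tokenise(patches, codebook)
    print(tokens)
    print(detokenise(tokens, codebook))
```

Notice that the patch [0.9, 0.1] comes back as [1.0, 0.0]: close, but not identical. That small rounding is exactly the kind of detail loss that quantization introduces.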

However, this "break-and-reassemble" approach can lose fine details. That's why newer AR models, like the much-discussed NOVA [45] and FAR [50], are trying to skip the discrete "token" step altogether and work directly with the continuous flow of visual information [52]. They're even borrowing ideas from diffusion models, using similar mathematical goals (loss functions) to guide their learning [15]. It's like our storyteller is ditching a limited vocabulary and starting to use a richer, more nuanced representation. This "non-quantized" approach aims to combine the coherence strength of AR with the high-fidelity potential of diffusion [52].

AR's Pros:

    • Naturally Coherent: Because it generates frame by frame, AR excels at keeping the video's timeline smooth and logical [50].
    • Flexible Length: In theory, AR models can keep generating indefinitely, creating videos of any length, as long as you have the computing power [29].
    • Shares DNA with Language Models: AR models, especially those using the popular Transformer architecture [5], work similarly to the powerful Large Language Models (LLMs). This might allow them to benefit more easily from LLM training techniques and scaling principles [27].

AR's Cons:

    • Slow Generation: The frame-by-frame process makes generation relatively slow, especially for high-resolution or long videos [55].
    • "Early Mistakes Can Mislead": If the model makes a small error early on, that error can get carried forward and amplified in later frames, causing the video to drift off-topic or become inconsistent [29].
    • Past Quality Issues: Older AR models relying on discrete tokens sometimes struggled with visual quality due to information loss during tokenization [11]. However, as mentioned, newer non-quantized methods are tackling this [52].

Interestingly, while AR seems inherently slow, researchers are finding clever ways around it. For instance, the NOVA model uses a "spatial set-by-set" prediction method, generating chunks of visual information within a frame in parallel, rather than one piece at a time [35]. Techniques like parallel decoding [56] and caching intermediate results (KV caching) [55] are also speeding things up. Some studies even claim optimized AR models can now be faster than traditional diffusion models for inference [38]! This suggests AR's slowness might be more of an engineering challenge than a fundamental limit.

Style 2: The Diffusion "Refining the Rough" Method

Diffusion models have been the stars of the image generation world and are now major players in video too [4]. Their core idea is a bit counter-intuitive: first break it, then fix it [17].

Imagine you have a clear video. The "forward process" in diffusion involves gradually adding random "noise" to it, step by step, until it becomes a completely chaotic mess, like TV static [29].

What the AI learns is the "reverse process": starting from pure noise, it iteratively removes the noise, step by step, guided by your instructions (like a text prompt), eventually "restoring" a clear, meaningful video [29].
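For the curious, here is a toy version of that "break it, then fix it" loop. Several things in it are assumptions made purely for illustration: the simple linear noise schedule, the three-number "video", and above all the `oracle_noise_predictor`, which cheats by being told the clean answer. In a real model that function is a large trained neural network that has to guess the noise from the noisy input and the prompt alone. The loop itself follows the deterministic (DDIM-style) update rule.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Assumed schedule: signal fraction 1.0 at t = 0, falling to 0.001 at t = num_steps."""
    return 1.0 - 0.999 * t / num_steps


def oracle_noise_predictor(x_t, t, num_steps, x0):
    """Stand-in for a trained network: recovers the noise exactly because it is given x0."""
    ab = alpha_bar(t, num_steps)
    return [(xt - math.sqrt(ab) * c) / math.sqrt(1.0 - ab) for xt, c in zip(x_t, x0)]


def denoise(x_start, num_steps, predict_noise):
    """Deterministic reverse loop: estimate the clean signal, then step to a lower noise level."""
    x = list(x_start)
    for t in range(num_steps, 0, -1):
        ab_t = alpha_bar(t, num_steps)
        ab_prev = alpha_bar(t - 1, num_steps)
        eps = predict_noise(x, t)
        x0_hat = [(v - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
                  for v, e in zip(x, eps)]
        x = [math.sqrt(ab_prev) * c + math.sqrt(1.0 - ab_prev) * e
             for c, e in zip(x0_hat, eps)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    steps = 50
    clean = [0.5, -1.0, 0.25]  # stand-in for a (very small) clean video
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    ab_T = alpha_bar(steps, steps)
    noisy = [math.sqrt(ab_T) * c + math.sqrt(1.0 - ab_T) * n
             for c, n in zip(clean, noise)]
    restored = denoise(noisy, steps,
                       lambda x, t: oracle_noise_predictor(x, t, steps, clean))
    print(noisy)
    print(restored)
```

With a perfect noise predictor the loop lands back on the clean signal. Everything hard about diffusion lives in learning a predictor that is good enough without being handed the answer.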

    • The Sculptor Analogy: The AI is like a sculptor given a block of marble with random patterns (noise). Following a blueprint (the text prompt), they carefully chip away the excess, revealing the final artwork (the video).
    • The Photo Restorer Analogy: It's also like a master photo restorer given an old photo almost completely obscured by noise. Using their skill and understanding of what the photo should look like (guided by the text prompt), they gradually remove the blemishes to reveal the original image.

How it Works (Simplified):

The key word for diffusion is iteration. Getting from random noise to a clear video involves many small denoising steps (often dozens to thousands of steps) [29].

To make this more efficient, many top models like Stable Diffusion and Sora [1] use a technique called latent diffusion (Latent Diffusion Models, LDM) [5]. Instead of working directly on the huge pixel data, they first use an "encoder" to compress the video into a smaller, abstract "latent space." They do the heavy lifting (adding and removing noise) in this compact space, and then use a "decoder" to turn the result back into a full-pixel video. It's like our sculptor making a small clay model first – much more manageable [16]!

Architecture-wise, diffusion models often started with U-Net-like structures (CNN-based) [15] but are increasingly adopting the powerful Transformer architecture (creating Diffusion Transformers, or DiTs) [29] as their core "sculpting" tool.

Diffusion's Pros:

    • Stunning Visual Quality: Diffusion models currently lead the pack in generating images and videos with incredible visual fidelity and rich detail [29].
    • Handles Complexity Well: They are often better at rendering complex textures, lighting, and scene structures [4].
    • Stable Training: Compared to some earlier generative techniques like GANs, training diffusion models is generally more stable and less prone to issues like "mode collapse" [29].

Diffusion's Cons:

    • Slow Generation (Sampling): The iterative denoising process takes time, making video generation lengthy [55]. Fine sculpting requires patience.
    • Temporal Coherence is Still Tricky: While individual frames might look great, ensuring perfect smoothness and natural motion across a long video remains a challenge [5]. The sculptor might focus too much on one part and forget how it fits the whole.
    • Needs Serious Computing Power: Training and running diffusion models demand significant computational resources (like powerful GPUs) [5], making them less accessible [57].

To tackle the slowness, researchers are in a race to speed things up. Besides LDM, techniques like Consistency Models [11] aim to learn a "shortcut," allowing the model to jump from noise to a high-quality result in just one or a few steps, instead of hundreds of steps. Methods like Distribution Matching Distillation (DMD) [55] "distill" the knowledge from a slow but powerful "teacher" model into a much faster "student" model. The goal is near-real-time generation without sacrificing too much quality [55].

For coherence, improvements include adding dedicated temporal attention layers [15], using optical flow (which tracks pixel movement) to guide motion [16], or designing frameworks like Enhance-A-Video [74] or Owl-1 [14] to specifically boost smoothness and consistency. It seems that after mastering static image quality, making videos move realistically and tell a coherent story is the next big frontier for diffusion models.

Which Style to Choose? Storytelling vs. Sculpting

So, which approach is "better"? It depends on what you value most.

Here's a quick comparison:

AR vs. Diffusion at a Glance

Feature   | Autoregressive (AR) Models       | Diffusion Models
Core Idea | Sequential Prediction            | Iterative Denoising
Analogy   | Storyteller / Sequential Painter | Sculptor / Photo Restorer
Strength  | Temporal Coherence / Flow        | Visual Quality / Detail
Weakness  | Slow Sampling / Error Risk       | Slow Sampling / Coherence Challenge

If you prioritize a smooth, logical flow, especially for longer videos, AR's sequential nature might be more suitable [50]. If you're after the absolute best visual detail and realism in each frame, diffusion often currently holds the edge [17]. But remember, both are evolving fast and borrowing from each other.

The Best of Both Worlds: When Storytellers Meet Sculptors

Since AR and Diffusion have complementary strengths, why not combine them [29]?

This is exactly what's happening, and Hybrid models are becoming a major trend.

    • Idea 1: Divide and Conquer. Let an AR model sketch the overall plot and motion (the "storyboard"), then have a Diffusion model fill in the high-quality visual details [50].
    • Idea 2: AR Framework, Diffusion Engine. Keep the AR frame-by-frame structure, but instead of predicting discrete tokens, use Diffusion-like methods to predict the continuous visual information for each step [44]. Models like NOVA and FAR lean this way.
    • Idea 3: Diffusion Framework, AR Principles. Use a Diffusion model but incorporate AR ideas, like enforcing stricter frame-to-frame dependencies (causal attention) or making the noise process time-aware [29]. AR-Diffusion [29] and CausVid [55] are examples.

The sheer number of models with names blending AR and Diffusion concepts (AR-Diffusion, ARDiT, DiTAR, LanDiff, MarDini, ART-V, CausVid, Transfusion, HART, etc.) [29] shows this is where much of the action is. It's less about choosing one side and more about finding the smartest way to combine their powers.

The Road Ahead: Challenges and Dreams for AI Video

Despite the incredible progress, AI video generation still has hurdles to overcome [17]:

    • Making Longer Videos: Most AI videos are still short. Generating minutes-long (or longer!) videos that stay coherent and interesting is a huge challenge [29].
    • Better Control and Faithfulness: Getting the AI to exactly follow complex instructions (like "a Shiba Inu wearing a beret and black turtleneck" [47]) or specific actions and emotions is tricky. AI can still misunderstand or "hallucinate" things not in the prompt [29].
    • Faster Generation: For practical use, especially interactive tools, AI needs to generate videos much faster than it currently does [5].
    • Understanding Real-World Physics: AI needs a better grasp of how things work in the real world. Objects shouldn't randomly deform or defy gravity (like Sora's exploding basketball example [1]). Giving AI "common sense" is key to true realism [4].

But the future possibilities are dazzling:

    • Personalized Content: Imagine AI creating a short film based on your idea, starring you [14]. Or generating educational videos perfectly tailored to your learning style.
    • Empowering Creatives: Giving artists, designers, and filmmakers powerful new tools to bring their visions to life [2].
    • Building Virtual Worlds: AI could go beyond just showing the world to actually simulating it, creating "World Models" that understand cause and effect [14]. This has huge implications for scientific simulation, game development, and training autonomous systems [5]. This shift from "image generation" to "world simulation" reveals a deeper ambition: not just mimicking reality, but understanding its rules [4].
    • Unified Multimodal AI: Future AI might seamlessly understand and generate text, images, video, and audio all within one unified system [11].

Achieving these dreams hinges heavily on improving efficiency. Generating long videos, enabling real-time interaction, and building complex world models all require immense computing power. Making these models faster and cheaper to run isn't just convenient; it's essential for unlocking their full potential [5]. Efficiency is one of the keys.

Conclusion: A New Era of Visual Storytelling

AI video generation is advancing at breakneck speed, constantly pushing the boundaries of what's possible [4]. Whether it's the sequential "storyteller" approach of AR models, the refining "sculptor" method of Diffusion models, or the clever combinations found in Hybrid models [17], AI is learning to weave light and shadow with pixels, and tell stories through motion.

We're witnessing the dawn of a new era in visual storytelling. AI won't just change how we consume media; it will empower everyone with unprecedented creative tools. Of course, with great power comes great responsibility. We must also consider how to use these tools ethically, ensuring they foster creativity and understanding, rather than deception and harm [13].

The future is unfolding frame by frame. The next AI-directed blockbuster might just start with an idea you have right now. Let's watch this space!

Works cited

[1] Asynchronous Video Generation with Auto-Regressive Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.07418v1

[2] [2503.07418] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.07418

[3] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion | Request PDF - ResearchGate, accessed on April 28, 2025, https://www.researchgate.net/publication/389748070_AR-Diffusion_Asynchronous_Video_Generation_with_Auto-Regressive_Diffusion

[4] Video Diffusion Models: A Survey - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2405.03150v2

[5] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.18688

[6] Autoregressive Models in Vision: A Survey - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.05902v1

[7] A Survey on Vision Autoregressive Model - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.08666v1

[8] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455v1

[9] On Improved Conditioning Mechanisms and Pre-training Strategies for Diffusion Models - NIPS papers, accessed on April 28, 2025, https://proceedings.neurips.cc/paper_files/paper/2024/file/18023809c155d6bbed27e443043cdebf-Paper-Conference.pdf

[10] Opportunities and challenges of diffusion models for generative AI - Oxford Academic, accessed on April 28, 2025, https://academic.oup.com/nsr/article/11/12/nwae348/7810289?login=false

[11] Video Diffusion Models - A Survey - OpenReview, accessed on April 28, 2025, https://openreview.net/pdf?id=sgDFqNTdaN

[12] The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.04606v1

[13] ChaofanTao/Autoregressive-Models-in-Vision-Survey - GitHub, accessed on April 28, 2025, https://github.com/ChaofanTao/Autoregressive-Models-in-Vision-Survey

[14] [2412.09600] Owl-1: Omni World Model for Consistent Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.09600

[15] arXiv:2412.07772v2 [cs.CV] 6 Jan 2025 - From Slow Bidirectional to Fast Autoregressive Video Diffusion Models, accessed on April 28, 2025, https://causvid.github.io/causvid_paper.pdf

[16] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455

[17] Phenaki - SERP AI, accessed on April 28, 2025, https://serp.ai/tools/phenaki/

[18] openreview.net, accessed on April 28, 2025, https://openreview.net/pdf/9cc7b12b9ea33c67f8286cd28b98e72cf43d8a0f.pdf

[19] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation, accessed on April 28, 2025, https://www.researchgate.net/publication/390038718_Bridging_Continuous_and_Discrete_Tokens_for_Autoregressive_Visual_Generation

[20] Autoregressive Video Generation without Vector Quantization ..., accessed on April 28, 2025, https://openreview.net/forum?id=JE9tCwe3lp

[21] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.19325v1

[22] Language Model Beats Diffusion — Tokenizer is Key to Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2310.05737

[23] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.16430v2

[24] Auto-Regressive Diffusion for Generating 3D Human-Object Interactions, accessed on April 28, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/32322/34477

[25] Fast Autoregressive Video Generation with Diagonal Decoding - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.14070v1

[26] One-Minute Video Generation with Test-Time Training, accessed on April 28, 2025, https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf

[27] Photorealistic Video Generation with Diffusion Models - European Computer Vision Association, accessed on April 28, 2025, https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/10270.pdf

[28] arXiv:2412.03758v2 [cs.CV] 24 Feb 2025, accessed on April 28, 2025, https://www.arxiv.org/pdf/2412.03758v2

[29] Advancing Auto-Regressive Continuation for Video Frames - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.03758v1

[30] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.07772v2

[31] Enhance-A-Video: Better Generated Video for Free - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2502.07508v3

[32] [D] The Tech Behind The Magic : How OpenAI SORA Works : r/MachineLearning - Reddit, accessed on April 28, 2025, https://www.reddit.com/r/MachineLearning/comments/1bqmn86/d_the_tech_behind_the_magic_how_openai_sora_works/

[33] Delving Deep into Diffusion Transformers for Image and Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.04557v1

[34] CVPR Poster Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution - CVPR 2024, accessed on April 28, 2025, https://cvpr.thecvf.com/virtual/2024/poster/31563

[35] SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models - AAAI Publications, accessed on April 28, 2025, https://ojs.aaai.org/index.php/AAAI/article/view/32663/34818

[36] Latte: Latent Diffusion Transformer for Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2401.03048v2

[37] VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.12259v1

[38] [2501.00103] LTX-Video: Realtime Video Latent Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2501.00103

[39] LTX-Video: Realtime Video Latent Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.00103v1

[40] Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.03931v1

[41] LaMD: Latent Motion Diffusion for Image-Conditional Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2304.11603v2

[42] Video-Bench: Human-Aligned Video Generation Benchmark - ResearchGate, accessed on April 28, 2025, https://www.researchgate.net/publication/390569999_Video-Bench_Human-Aligned_Video_Generation_Benchmark

[43] Advancements in diffusion models for high-resolution image and short form video generation, accessed on April 28, 2025, https://gsconlinepress.com/journals/gscarr/sites/default/files/GSCARR-2024-0441.pdf

[44] NeurIPS Poster StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, accessed on April 28, 2025, https://neurips.cc/virtual/2024/poster/94916

[45] FrameBridge: Improving Image-to-Video Generation with Bridge Models | OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=oOQavkQLQZ

[46] Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution - CVPR 2024 Open Access Repository, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024/html/Chen_Learning_Spatial_Adaptation_and_Temporal_Coherence_in_Diffusion_Models_for_CVPR_2024_paper.html

[47] Subject-driven Video Generation via Disentangled Identity and Motion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.17816v1

[48] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion - alphaXiv, accessed on April 28, 2025, https://www.alphaxiv.org/overview/2503.07418

[49] Phenaki - Reviews, Pricing, Features - SERP, accessed on April 28, 2025, https://serp.co/reviews/phenaki.video/

[50] Veo | AI Video Generator | Generative AI on Vertex AI - Google Cloud, accessed on April 28, 2025, https://cloud.google.com/vertex-ai/generative-ai/docs/video/generate-videos

[51] Generate videos in Gemini and Whisk with Veo 2 - Google Blog, accessed on April 28, 2025, https://blog.google/products/gemini/video-generation/

[52] Sora: Creating video from text - OpenAI, accessed on April 28, 2025, https://openai.com/index/sora/

[53] Top AI Video Generation Models in 2025: A Quick T2V Comparison - Appy Pie Design, accessed on April 28, 2025, https://www.appypiedesign.ai/blog/ai-video-generation-models-comparison-t2v

[54] ART•V: Auto-Regressive Text-to-Video Generation with Diffusion Models - CVF Open Access, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024W/GCV/papers/Weng_ART-V_Auto-Regressive_Text-to-Video_Generation_with_Diffusion_Models_CVPRW_2024_paper.pdf

[55] Simplified and Generalized Masked Diffusion for Discrete Data - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.04329

[56] Unified Multimodal Discrete Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.20853

[57] Simple and Effective Masked Diffusion Language Models - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.07524

[58] [2107.03006] Structured Denoising Diffusion Models in Discrete State-Spaces - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2107.03006

[59] Structured Denoising Diffusion Models in Discrete State-Spaces, accessed on April 28, 2025, https://proceedings.neurips.cc/paper/2021/file/958c530554f78bcd8e97125b70e6973d-Paper.pdf

[60] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.03736v2

[61] Fast Sampling via Discrete Non-Markov Diffusion Models with Predetermined Transition Time - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.09193v3

[62] [2406.03736] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2406.03736

[63] AR-Diffusion: Auto-Regressive Diffusion Model for Text Generation | OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=0EG6qUQ4xE

[64] Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2410.14157v3

[65] [R] Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution - Reddit, accessed on April 28, 2025, https://www.reddit.com/r/MachineLearning/comments/1ezyunc/r_discrete_diffusion_modeling_by_estimating_the/

[66] [2412.07772] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.07772

[67] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.19325v2

[68] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.19325

[69] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.01586

[70] G-U-N/Awesome-Consistency-Models: Awesome List of ... - GitHub, accessed on April 28, 2025, https://github.com/G-U-N/Awesome-Consistency-Models

[71] showlab/Awesome-Video-Diffusion: A curated list of recent diffusion models for video generation, editing, and various other applications. - GitHub, accessed on April 28, 2025, https://github.com/showlab/Awesome-Video-Diffusion

[72] [PDF] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models, accessed on April 28, 2025, https://www.semanticscholar.org/paper/66d927fdb6c2774131960c75275546fd5ee3dd72

[73] [2502.07508] Enhance-A-Video: Better Generated Video for Free - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2502.07508

[74] NeurIPS Poster FIFO-Diffusion: Generating Infinite Videos from Text without Training, accessed on April 28, 2025, https://nips.cc/virtual/2024/poster/93253

[75] StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text, accessed on April 28, 2025, https://openreview.net/forum?id=26oSbRRpEY

[76] Owl-1: Omni World Model for Consistent Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.09600v1

[77] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.16375v1

[78] ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.10981v1

[79] TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models - CVF Open Access, accessed on April 28, 2025, https://openaccess.thecvf.com/content/CVPR2024/papers/Ni_TI2V-Zero_Zero-Shot_Image_Conditioning_for_Text-to-Video_Diffusion_Models_CVPR_2024_paper.pdf

[80] Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.07563v1

[81] DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2502.03930v1

[82] VBench-2.0: A Framework for Evaluating Intrinsic Faithfulness in Video Generation Models, accessed on April 28, 2025, https://www.reddit.com/r/artificial/comments/1jmgy6n/vbench20_a_framework_for_evaluating_intrinsic/

[83] NeurIPS Poster GenRec: Unifying Video Generation and Recognition with Diffusion Models, accessed on April 28, 2025, https://neurips.cc/virtual/2024/poster/94684

[84] Evaluation of Text-to-Video Generation Models: A Dynamics Perspective - OpenReview, accessed on April 28, 2025, https://openreview.net/forum?id=tmX1AUmkl6&noteId=MAb60mrdAJ

[85] [CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models - GitHub, accessed on April 28, 2025, https://github.com/evalcrafter/EvalCrafter

[86] [2412.18688] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.18688

Liwei's Popular Science: Demystifying the Two "Master Techniques" Behind AI Video Generation


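Lately, your social media feeds have no doubt been flooded with videos created by artificial intelligence (AI). Whether it is a "snowy Tokyo street scene" 1, "a robot's cyberpunk life" 1, or all manner of wild imaginings, AI seems to have mastered the magic of directing and cinematography overnight, producing videos that look increasingly realistic, fluid, and even cinematic 2. It is hard not to wonder: how exactly did AI learn the complex art of making videos?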

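The Hidden Difficulties of Video Generation

Before revealing AI's "secret manual", we first need to understand why video is far harder than generating a single static image. It is not only a matter of drawing attractive pictures; the key is making the picture move, and move naturally and coherently 3.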

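Think of a video as a sequence of images (called "frames"). AI must ensure not only that every frame is sharp and attractive, but also the following:

    1. Temporal coherence: Transitions between adjacent frames must be smooth, and object motion must follow regular patterns, with no "teleporting" or "flickering" 4. Like a character walking in a film, the movement has to be continuous.
    2. Content consistency: Objects and scenes in the video must stay consistent. A person's clothing must not change colour at random, and the background must not suddenly switch 14.
    3. Physical common sense: The generated dynamics need to obey basic physical laws, such as a ball falling downward and water flowing 1. Today's AI is still far from perfect here, but simulating the objective world is the direction of travel.
    4. Data and compute requirements: Video data is enormous, and processing it demands powerful computing resources and massive amounts of training data 5.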

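Precisely because of these challenges, the field of AI video generation has split into different technical schools. Two "schools" currently dominate; they tackle the problem in very different ways, and each has its own strengths 4.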

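The Two Schools: Autoregressive (AR) and Diffusion

Imagine the AI is an artist asked to create a video. There are two mainstream ways of working:

    • The first is like a "Storyteller" or a "Sequential Painter". He conceives and draws one frame after another, making sure each new picture follows on from the plot so far. We call this approach the autoregressive (AR) model 4.
    • The second is like a "Sculptor" or a "Photo Restorer". He starts from a rough piece of "raw material" (a pile of random noise) and, following your request (for example a text description), polishes and carves it bit by bit until a clear picture emerges. This approach is the diffusion model 4.

Both methods have their own powers, and their own "temperaments". Let us look at each in turn.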

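Technique One: The "Sequential Storytelling" of Autoregressive (AR) Models

The core idea of an autoregressive model is very intuitive: predict the next frame from the video generated so far 4. In other words, when the AI generates frame N, it conditions on the previously generated frames 1 through N-1 10. This approach emphasizes the temporal order and causal structure inherent in video (sequential and causal).

    • The "storytelling" analogy: As in telling a story, each sentence has to pick up the meaning of the previous one for the plot to hold together. An AR model works the same way, trying to make every frame a logical continuation of the frames before it.
    • The "sequential painting" analogy: It is also like an artist drawing a comic strip panel by panel, making sure each new panel matches the finished ones in style, colour, and content.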
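
The frame-by-frame loop described above can be sketched in a few lines of Python. This is only a toy illustration: `predict_next_frame` is a hypothetical stand-in for a learned network (here it just brightens the last frame slightly and adds a little noise), and each "frame" is a short list of pixel values.

```python
import random

def predict_next_frame(history, rng):
    """Hypothetical stand-in for a learned model.

    It looks only at the frames generated so far and returns the last
    frame brightened by 0.1 plus small noise. A real AR model would be
    a neural network conditioned on the whole history.
    """
    last = history[-1]
    return [pixel + 0.1 + rng.gauss(0.0, 0.01) for pixel in last]

def generate_video(first_frame, num_frames, seed=0):
    """Generate frames one at a time; frame N sees only frames 1..N-1."""
    rng = random.Random(seed)
    frames = [list(first_frame)]
    while len(frames) < num_frames:
        frames.append(predict_next_frame(frames, rng))
    return frames

video = generate_video([0.0, 0.5, 1.0], num_frames=5)
for index, frame in enumerate(video, start=1):
    print(index, [round(pixel, 2) for pixel in frame])
```

Because the loop only ever appends, it can in principle keep going for as long as compute allows, which is the "flexible length" property discussed later.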

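How does an autoregressive model work?

Some early AR models first "break apart" a complex image or video and encode it as so-called "visual tokens" 26. You can think of this as building a "dictionary" for the visual world, in which each token stands for one visual pattern. The AR model then learns, much as one learns a language, to predict what the next "visual token" should be 29.

However, this "break apart and reassemble" approach can lose detail. Newer AR models, such as the much-discussed NOVA 30 and FAR 28, have therefore begun to skip the "visual token" step and operate directly on continuous visual information 52. They even borrow ideas from diffusion models, for instance by using similar mathematical objectives for training 29. It is as if the storyteller were no longer confined to a limited vocabulary and could describe the world with richer, finer-grained means of expression. This approach, which does not rely on quantized tokens, is regarded as an important direction for AR models, aiming to combine the coherence that AR models are good at with the high fidelity that diffusion models are good at 30.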
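
To make the "visual dictionary" idea and its loss of detail concrete, here is a minimal sketch of nearest-neighbour quantization. The four-entry `codebook` and the two-value `patch` are made-up illustrative numbers, not taken from any real tokenizer; real codebooks are learned and have thousands of entries.

```python
def quantize(patch, codebook):
    """Return (token index, codeword) of the nearest codebook entry."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    index = min(range(len(codebook)),
                key=lambda i: squared_distance(patch, codebook[i]))
    return index, codebook[index]

# Hypothetical tiny "dictionary" of visual patterns.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
patch = [0.9, 0.2]

token, reconstruction = quantize(patch, codebook)
error = sum((x - y) ** 2 for x, y in zip(patch, reconstruction))
print(token, reconstruction, round(error, 4))
```

The patch is replaced by a single integer token, but decoding that token gives back the codeword rather than the original patch. The gap between the two is the detail that quantization discards, which is what the continuous, non-quantized AR models try to avoid.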

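The "signature moves" of AR models (strengths):

    • Naturally coherent: Because it generates frame after frame, an AR model has a built-in advantage in keeping video temporally coherent and logically fluent 4.
    • Flexible length: In theory, as long as computing resources allow, an AR model can keep "telling the story" and generate video of arbitrary length 4.
    • Same lineage as language models: AR models (especially Transformer-based ones 26) share the same underlying logic as today's powerful large language models (LLMs): both predict the next element in a sequence. They can therefore borrow LLM training methods and empirical scaling laws more readily, which leaves more headroom for quality improvements 26.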

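The "troubles" of AR models (weaknesses):

    • Slow generation: The "one frame at a time" nature makes generation relatively slow, especially for high-resolution, long-duration video 4.
    • "One wrong step, every step wrong": If one step in the generation process goes wrong, the error can snowball into later frames, causing the video to drift off topic or become inconsistent 4.
    • Early quality bottleneck: The quality of earlier AR models that relied on "visual tokens" was limited by how well those tokens could express real-world detail 29. As noted above, however, the newer non-quantized methods are working to solve this problem 30.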
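
The "snowball" effect in the second point can be illustrated with a toy rollout. The numbers are assumptions chosen only for illustration: the "true" dynamics multiply a value by 1.05 per step, while a hypothetical, slightly wrong model multiplies by 1.06. Because each prediction is fed back in as input, the gap widens with every step.

```python
def rollout(step, start, num_steps):
    """Apply `step` repeatedly, feeding each output back in as input."""
    values = [start]
    for _ in range(num_steps):
        values.append(step(values[-1]))
    return values

def true_step(x):
    return 1.05 * x  # assumed ground-truth dynamics

def model_step(x):
    return 1.06 * x  # hypothetical model with a small per-step error

truth = rollout(true_step, 1.0, 30)
predicted = rollout(model_step, 1.0, 30)
errors = [abs(p - t) for p, t in zip(predicted, truth)]

for step_index in (1, 10, 20, 30):
    print(step_index, round(errors[step_index], 3))
```

The error after one step is tiny, yet after 30 steps it is more than a hundred times larger, even though the per-step mistake never changed.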

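It is worth noting that although AR models are inherently sequential and therefore look slow, researchers are working hard to overcome this bottleneck. For example, NOVA uses a "spatial set-by-set" prediction scheme: when generating the content within a frame, it does not proceed pixel by pixel but predicts whole sets of visual information in parallel 30. Other techniques, such as parallel decoding 59 and key-value (KV) caching 31, also try to speed up AR generation. Some studies even claim that optimized AR models can generate faster than conventional diffusion models 36. This suggests that the "slowness" of AR models is more an issue that engineering and algorithmic innovation can mitigate than an insurmountable theoretical barrier.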
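
KV caching is easy to show in miniature. The sketch below uses a single attention head in plain Python; `project` is a hypothetical fixed key/value projection standing in for learned weight matrices. It compares recomputing every key and value at each decoding step against caching them, and counts how many projections each strategy performs.

```python
import math

def attend(query, keys, values):
    """Softmax attention of one query over all keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(score - peak) for score in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def project(token):
    """Hypothetical key/value projection (a real model learns these)."""
    key = [token[0] + token[1], token[0] - token[1]]
    value = [2.0 * token[0], 0.5 * token[1]]
    return key, value

def decode_without_cache(tokens):
    outputs, projections = [], 0
    for t in range(len(tokens)):
        pairs = [project(tok) for tok in tokens[: t + 1]]  # redo all
        projections += len(pairs)
        keys = [k for k, _ in pairs]
        values = [v for _, v in pairs]
        outputs.append(attend(tokens[t], keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    outputs, projections = [], 0
    keys, values = [], []  # the cache
    for tok in tokens:
        key, value = project(tok)  # only the newest token
        projections += 1
        keys.append(key)
        values.append(value)
        outputs.append(attend(tok, keys, values))
    return outputs, projections

tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]
slow_outputs, slow_count = decode_without_cache(tokens)
fast_outputs, fast_count = decode_with_cache(tokens)
print(slow_count, fast_count)
```

Both strategies produce the same outputs; the cached version simply avoids recomputing work that has not changed. The saving grows quadratically with sequence length, which matters a great deal for long videos.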

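Technique Two: The "Refining from the Rough" of Diffusion Models

Diffusion models made their name in image generation and have now become the workhorse of video generation as well 3. Their core idea is a little counter-intuitive: destroy first, then repair 4.

Imagine you have a clear video. The "forward process" of a diffusion model gradually and repeatedly adds random "noise" to that video until it becomes completely disordered, like the static on an old television 3.

What the AI learns is the "reverse process": starting from pure noise, it removes the noise iteratively, step by step, and finally "restores" a clear, meaningful video 3. This denoising process is guided by the user's instruction, such as a text description.

    • The "sculptor" analogy: The AI is like a sculptor facing a block of "uncut jade" full of random texture (the noise). Following the design drawing (the text prompt), he chips away the excess stroke by stroke until an exquisite work (the video) appears.
    • The "photo restorer" analogy: It is also like a top photo restorer who receives an old photograph almost completely covered in noise and, drawing on superb skill and an understanding of the photo's content (the text prompt), gradually removes the blemishes and blur until a clear image reappears.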
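
The forward and reverse processes described above can be sketched numerically. The example assumes the widely used closed-form noising rule x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with a linear noise schedule; the schedule endpoints (1e-4 to 0.02 over 1000 steps) are illustrative assumptions. In place of a trained denoising network, the sketch cheats and feeds the true noise back in, simply to show that knowing the noise is enough to recover the clean signal.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear schedule.

    Requires num_steps >= 2. The endpoint values are assumptions.
    """
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, alpha_bar, rng):
    """Forward process: jump straight to the noisy sample at one step."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    xt = [scale_signal * x + scale_noise * e for x, e in zip(x0, eps)]
    return xt, eps

def estimate_clean(xt, eps_pred, alpha_bar):
    """Invert the forward rule given a prediction of the noise."""
    scale_signal = math.sqrt(alpha_bar)
    scale_noise = math.sqrt(1.0 - alpha_bar)
    return [(x - scale_noise * e) / scale_signal
            for x, e in zip(xt, eps_pred)]

alpha_bars = make_alpha_bars(1000)
rng = random.Random(0)
clean = [0.2, -0.7, 1.0]  # three "pixel" values of a toy frame
noisy, true_noise = add_noise(clean, alpha_bars[500], rng)
recovered = estimate_clean(noisy, true_noise, alpha_bars[500])
print(alpha_bars[0], alpha_bars[-1])
print([round(v, 6) for v in recovered])
```

The fraction of signal that survives starts near 1 and shrinks to almost nothing by the last step, which is the "television static" end state. A real model never sees the true noise; it learns to predict it from the noisy input and the text prompt, and the sculpting proceeds over many small steps rather than one jump.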

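How does a diffusion model work?

The key to a diffusion model is iteration. Going from completely random noise to the final clear video takes many small denoising steps, typically anywhere from tens to thousands 3.

To improve efficiency, many advanced diffusion models, such as Stable Diffusion and Sora 1, use the latent diffusion model (LDM) technique 5. Rather than adding and removing noise directly on high-dimensional pixel-level video data, they first use an "encoder" to compress the video into a smaller, more abstract "latent space", carry out most of the diffusion and denoising in that low-dimensional space, and finally use a "decoder" to restore and render the result as high-definition pixel video. It is like a sculptor first building a small clay model to work out the idea instead of starting directly on a huge block of stone, which saves a great deal of time and effort 16.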
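
A quick back-of-the-envelope calculation shows why working in latent space saves so much. The compression factors below (4x in time, 8x in each spatial dimension, 4 latent channels) are illustrative assumptions, not the settings of any particular model.

```python
def num_values(frames, height, width, channels):
    """Count the numbers needed to store a video tensor."""
    return frames * height * width * channels

def latent_shape(frames, height, width,
                 time_down=4, space_down=8, latent_channels=4):
    """Shape produced by a hypothetical video encoder."""
    return (frames // time_down, height // space_down,
            width // space_down, latent_channels)

pixels = num_values(16, 256, 256, 3)            # a short RGB clip
latents = num_values(*latent_shape(16, 256, 256))
ratio = pixels / latents
print(pixels, latents, ratio)
```

Under these assumptions the denoiser handles 192 times fewer values per step than it would in pixel space, and that saving applies at every one of the many denoising steps.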

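In terms of architecture, early diffusion models commonly used U-Net-style networks (a type of convolutional neural network, CNN) 11, and later work has increasingly adopted the more powerful Transformer architecture (known as the Diffusion Transformer, DiT) 14. These architectures serve as the core tool with which the AI does its "sculpting" or "restoring".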

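The "core skills" of diffusion models (strengths):

    • Stunning image quality: Diffusion models are currently often at the top for the visual quality of generated images and video, with rich detail and realistic results 2.
    • Handling complex scenes: Diffusion models usually cope better with complex textures, lighting, and scene structure 1.
    • More stable training: Compared with earlier techniques such as generative adversarial networks (GANs), diffusion models are usually more stable to train and less prone to problems such as mode collapse 4.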

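The "Achilles' heel" of diffusion models (weaknesses):

    • Slow generation (sampling): Iterative denoising takes many steps, so generating a video takes a long time 4. A sculptor's fine carving takes time.
    • Temporal coherence remains a challenge: Although single-frame quality is high, ensuring that every frame of a long video is perfectly coherent and that motion is natural and fluid is still difficult for diffusion models 4. The sculptor may focus too much on local detail and neglect overall harmony.
    • High computational cost: Both training the model and generating video require powerful computing resources (such as graphics processing units, GPUs) 4, which limits widespread adoption 83.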

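Faced with the core pain point of slow speed, the research community has launched an "acceleration race". Besides the LDM approach mentioned above, many techniques have emerged that aim to reduce the number of sampling steps. For example, consistency models 19 try to learn a "direct" path that lets the model produce a high-quality result from noise in one or a few steps. Techniques such as Distribution Matching Distillation (DMD) 34 "distill" the knowledge of a slow but powerful "teacher" model to train a much faster "student" model. The goal of all these efforts is to speed up diffusion generation by orders of magnitude, approaching real-time use, while sacrificing as little quality as possible 83.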

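At the same time, to address temporal coherence, researchers keep improving the architectures and mechanisms of diffusion models. Examples include adding temporal attention dedicated to relationships across time 11, using optical flow information to guide motion generation 16, and designing special modules or frameworks such as Enhance-A-Video 14 or Owl-1 24 to improve the fluency and consistency of the video. This shows that, now that single-frame quality has reached a high level, making video "move more convincingly" and "tell a more coherent story" has become the next major hurdle for diffusion models.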

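How to Choose? "Sequential Storytelling" vs "Refining from the Rough"

Having met both "master techniques", we might ask: which is better? There is no absolute answer; each has its own emphasis.

A simple table summarizes the comparison: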

AR and Diffusion Models at a Glance

Feature      | Autoregressive (AR) model         | Diffusion model
Core idea    | Sequential prediction             | Iterative denoising
Analogy      | Storyteller / comic-strip painter | Sculptor / photo restorer
Key strength | Temporal coherence and fluency    | Visual quality and detail
Key weakness | Slow sampling / error-prone       | Slow sampling / coherence challenges

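Put simply, if you care most about a fluent, logical storyline, especially when generating very long videos, the inherently sequential nature of AR models may give them the edge 4. If you want the utmost picture detail and realism, diffusion models currently tend to deliver better visual results 4. As we have seen, though, both technologies are evolving quickly and learning from each other, and the boundary between them is becoming increasingly blurred.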

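The Way of Fusion: When the "Storyteller" Meets the "Sculptor"

Since AR and diffusion each have their own strengths, a natural idea is: can they "join forces" and make up for each other's weaknesses? 4

The answer is yes, and this is becoming a very active trend in AI video generation. Many of the latest, best-performing models use hybrid architectures that try to combine the advantages of AR and diffusion.

    • Approach 1: Division of labour. Let the AR model "draft" first, planning the overall structure and motion of the video (possibly with little detail), and then let the diffusion model do the "fine carving", filling in high-quality visual detail 61.
    • Approach 2: AR skeleton, diffusion core. Keep the sequential generation framework of the AR model, but when predicting each frame (or each part), instead of simply predicting the next "token", use diffusion-style continuous-space prediction and loss functions 29. NOVA and FAR, mentioned earlier, embody this idea.
    • Approach 3: Diffusion skeleton, AR ideas. Within the diffusion framework, introduce AR principles, such as enforcing stricter sequential dependence between frames (causal attention), or making the process of adding and removing noise reflect temporal order 9. Models such as AR-Diffusion 9 and CausVid 34 are examples.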
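
The "causal attention" mentioned in Approach 3 comes down to a mask that stops each frame from looking at later frames. The sketch below is a generic illustration, not the implementation of any model named above: it builds such a mask and applies it inside a softmax over per-frame attention scores.

```python
import math

def causal_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)]
            for i in range(num_frames)]

def attention_weights(scores, allowed):
    """Softmax over one row of scores, ignoring disallowed positions."""
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    exps = [math.exp(s - peak) if ok else 0.0
            for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# With equal scores, frame 1 (the second frame) splits its attention
# evenly between frames 0 and 1 and gives none to frames 2 and 3.
weights = attention_weights([0.0, 0.0, 0.0, 0.0], mask[1])
print(weights)
```

A standard bidirectional video diffusion model would allow every position in the mask. Restricting it to the lower triangle is what lets frames be produced in order, and it is also what makes the caching tricks used by AR models applicable.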

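This fusion trend is unmistakable. Look through a list of research papers and you will find many models whose names or descriptions contain elements of both AR and diffusion (such as AR-Diffusion, ARDiT, DiTAR, LanDiff, MarDini, ART-V, CausVid, Transfusion, and HART) 9. This shows that the research community broadly sees combining the strengths of the two methods as a key path to overcoming their individual limitations and advancing video generation. It is no longer a question of "either/or", but of how to "merge the two into one" more intelligently.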

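The Long Road Ahead: Challenges and Dreams for AI Video

Although AI video generation is advancing at great speed, it is still a long way from perfect. The main challenges today are 4:

    • Making longer videos: Most AI-generated videos are still short (a few seconds to a dozen or so seconds). Generating videos of several minutes or longer while keeping the content coherent, non-repetitive, and on topic remains very difficult 4.
    • More precise control and fidelity: How can AI be made to understand and carry out complex instructions precisely? Examples include "a Shiba Inu wearing a beret and a black turtleneck" 49, or more complex scene descriptions, character actions, and emotional expression. Today's AI sometimes still "fails to understand" or "hallucinates", generating content that does not match the request 1.
    • Faster generation: Speed is critical if AI video tools are to become truly practical, especially in interactive applications. Current generation speeds are still too slow for many scenarios 4.
    • Understanding real-world physics: AI needs to learn more physical common sense about the real world. Objects should have a fixed shape (not deform at random), and motion should follow basic mechanics. The weaknesses shown for OpenAI's Sora model include physically implausible examples such as a basketball exploding after passing through the hoop 1 and a chair deforming while being dug up 1. Giving AI "common sense" is key to reaching a higher level of realism 1.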

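Despite these challenges, the future of AI video generation is full of possibility:

    • Personalized content creation: Imagine AI making a short film tailored to your ideas, even with you as the lead 9, or generating teaching videos that fully match your learning pace and style.
    • Empowering the creative industries: Providing artists, designers, and filmmakers with powerful new tools that greatly expand the possibilities of creative expression 2.
    • Building virtual worlds and simulation: AI can not only generate video but also build "world models" that simulate how the real world works 4. This means AI can be used for scientific simulation, game environment generation, simulated training for autonomous driving, and more 5. This shift from "generating images" to "simulating the world" reveals the deeper ambition of AI video technology: not just imitating appearances, but understanding the underlying laws 1.
    • Unified multimodal intelligence: Future AI will be able to understand and generate many forms of information seamlessly, including text, images, video, and audio 4.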

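Realizing these dreams depends on a relentless pursuit of efficiency. Generating long videos, enabling real-time interaction, and building complex "world models" all demand enormous computing power. Continually improving training and inference efficiency and reducing cost is therefore not merely a convenience; it is what makes these grander goals possible 4. Efficiency, one might say, is the key that unlocks the future.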

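Conclusion: A New Era of Visual Storytelling

AI video generation is developing at an astonishing pace and keeps reshaping our expectations 3. Whether it is the autoregressive model working step by step like a "storyteller", the diffusion model carving with care like a "sculptor", or the hybrid model that combines the strengths of both 4, all of them are learning how to weave light and shadow out of pixels and to tell stories through motion.

We stand at the beginning of a new era of visual storytelling. AI will not only change how we consume content but also give everyone unprecedented creative power. Of course, as the technology races ahead, we also need to think about how to use these powerful tools responsibly, so that they serve creation, communication, and understanding rather than deception and harm 4.

The future is already here, and the AI director's next blockbuster may well spring from the idea in your head right now. Let us wait and see!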