Imagine a cosmic library, vast and infinite, housing every possible sentence—from the profound “Artificial intelligence will reshape the future” to the absurd “Cat pillow jumps blue because Wednesday.” In this library, popular sentences sit on bright, accessible shelves, found with a quick note: “Shelf 3, Book 5.” Random gibberish lurks in dusty basements, needing a word-for-word map. GPT, the AI we know as a language wizard, is the cosmic librarian, compressing texts into compact codes that can be perfectly restored. But is this compression flawless, or does it lose something along the way? Let’s embark on a journey through probability, information theory, and engineering to uncover the magic of GPT’s compression—and why it matters.
The Cosmic Library: Compressing Meaning
Picture yourself in this library, tasked with sending a sentence across the galaxy. A predictable sentence like “Artificial intelligence will reshape the future” is easy to pinpoint, requiring just a short instruction. A random jumble, like “Cat pillow jumps blue,” demands spelling out every word, taking up more space. GPT’s brilliance lies in its world model—a map of language probabilities built from vast data. It knows which sentences are “popular” (high-probability) and encodes them efficiently. Why do you think predictable text is easier to compress than random noise?
This process is called lossless compression, meaning the original text is perfectly restored, bit for bit. Unlike a compressed JPEG that blurs details, GPT’s compression ensures no loss. But some argue it’s lossy, losing information like a summary. Who’s right? To answer, we need to explore the mechanics and the theory behind it.
Arithmetic Coding: The GPS of Compression
GPT’s compression relies on arithmetic coding, a method that turns text into a number on a line from 0 to 1. Think of it as a GPS coordinate for a sentence’s location in the probability universe. Here’s how it works for “cat eats fish”:
Start with the full interval [0.0, 1.0).
For “cat” (P=0.5), shrink to [0.0, 0.5).
For “eats” given “cat” (P=0.7), narrow to [0.0, 0.35).
For “fish” given “cat eats” (P=0.4), end at [0.0, 0.14).
Output a binary number, like 0.125 (0.001 in binary), within [0.0, 0.14).
Decompression reverses this, using the same GPT model to retrace the intervals, ensuring the exact sequence—“cat eats fish”—is restored. Why is using the same model crucial for perfect reconstruction?
The interval’s length (0.14 = 0.5 * 0.7 * 0.4) reflects the sequence’s probability. High-probability sequences create larger intervals, needing fewer bits to encode (e.g., -log₂(0.14) ≈ 2.84 bits). Random sequences, with lower probabilities, need more bits. This is rooted in information theory, where a word’s information content is -log₂(P(x)). A likely word (P=0.95) carries little information (0.07 bits), while a rare one (P=0.0001) carries much (13.3 bits). How does this explain why semantic text compresses better than noise?
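To make the interval narrowing concrete, here is a minimal Python sketch using the toy probabilities from the "cat eats fish" example. The numbers are the illustrative ones above, not real GPT outputs, and for simplicity each token is assumed to occupy the first slice of the current interval:

```python
import math

# Toy conditional probabilities standing in for GPT's predictions (illustrative only).
steps = [("cat", 0.5), ("eats", 0.7), ("fish", 0.4)]

low, high = 0.0, 1.0
for token, p in steps:
    # Assume each token owns the first p-sized slice of the current interval;
    # a real coder uses the token's position in the model's cumulative distribution.
    high = low + (high - low) * p
    print(f"after '{token}': [{low:.4f}, {high:.4f})")

width = high - low                                        # 0.5 * 0.7 * 0.4 = 0.14
print("ideal code length:", -math.log2(width), "bits")    # ≈ 2.84 bits
```

The final interval width equals the sequence's probability, which is exactly why predictable text ends up with a short code.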
Lossless or Lossy? Solving the Debate
The debate over whether GPT’s compression is lossless or lossy stems from a subtle distinction. Lossless compression ensures the original data is perfectly restored, like unzipping a file to its exact form. Lossy compression, like MP3s, discards details for smaller size, losing fidelity. GPT’s compression, using arithmetic coding, is lossless: the encoded binary number uniquely maps back to the original text, preserving every bit. Experiments like ts_zip by Fabrice Bellard and 2022-2023 work by Li Ming and Nick show GPT outperforming gzip by up to 10x for semantic data, with no loss. Why might some still call it lossy?
The confusion arises from GPT’s training process. When GPT learns from vast data, it abstracts patterns into a simplified world model, discarding noise and details—clearly a lossy process, much like summarizing a library. But when GPT is used as a compression tool, arithmetic coding applies that fixed model to encode and decode specific texts deterministically, ensuring no loss. The lossy aspect lives in the model’s creation, not its application. How does this distinction change your view of GPT’s capabilities?
The Theory: Kolmogorov Complexity and Intelligence
At the heart of this lies Kolmogorov complexity (KC), the length of the shortest program to generate a dataset. An ideal compressor would find this program, but KC is uncomputable—a theoretical dream. GPT’s next-token prediction approximates this, acting like a “prophet” forecasting sequences based on learned patterns. This aligns with Solomonoff induction, where predicting the next token mirrors finding compact descriptions. Ilya Sutskever noted in a 2023 Berkeley talk that this is the secret behind GPT’s efficiency compared to models like BERT. Why might prediction be a form of compression, and how does it reflect intelligence?
For semantic data, like news articles or logs, GPT’s predictions are highly accurate, leading to compact codes. For random noise, where KC equals the data’s length, compression fails—no model can predict chaos. This highlights a limit: GPT excels where patterns exist. What types of data do you think GPT could compress best?
The Tightrope: Efficiency vs. Reliability
High compression rates are powerful but fragile. A single bit error in a highly compressed file can derail decompression, like a misstep on a tightrope. Consider the trade-offs:
| Dimension | High Compression Rate | Low Compression Rate |
|---|---|---|
| Restoration Accuracy | 100% (theoretical) | 100% (theoretical) |
| Error Resistance | Fragile (1-bit error can crash) | Robust (local errors) |
| Computational Cost | High (GPT + coding) | Low (e.g., gzip) |
| Readability | None (ciphertext) | High (text/binary) |
High rates suit scenarios where bandwidth is costly, like interstellar communication, but require error correction (e.g., CRC) to prevent crashes. Low rates are ideal for reliable archiving, like server logs, where robustness trumps size.
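As a sketch of the CRC safeguard mentioned above, Python’s built-in zlib.crc32 can wrap a compressed block; the payload here is just a placeholder standing in for the arithmetic-coded bits:

```python
import zlib

def protect(payload: bytes) -> bytes:
    """Append a CRC-32 checksum so corruption is detected before decoding."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def verify(blob: bytes) -> bytes:
    payload, crc = blob[:-4], int.from_bytes(blob[-4:], "big")
    if zlib.crc32(payload) != crc:
        raise ValueError("CRC mismatch: block corrupted, request retransmission")
    return payload

# Placeholder payload standing in for the arithmetic coder's output.
blob = protect(b"compressed-bits-go-here")
assert verify(blob) == b"compressed-bits-go-here"
```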
Why It Matters: From Stars to Servers
GPT’s compression could transform how we store and send data. In interstellar missions, where every bit is precious, it could shrink messages dramatically. In data centers, it could optimize archival storage, though computational costs pose challenges (ts_zip, for example, processes text on the order of a kilobyte per second). Future models, with sharper predictions, could push efficiency closer to the theoretical limit.
This cosmic dance of bits and meaning reveals a profound truth: compression is intelligence, and GPT is a master choreographer. By mapping language to probabilities, it turns texts into elegant codes, preserving every detail. Whether you’re an AI enthusiast or a tech expert, this opens a universe of possibilities.
Sources: Adapted from posts on liweinlp.com (13277, 13272, 13275, 13273, 13279, 13281). About the Author: Dr. Li Wei, a senior NLP/LLM consultant, has led innovations at MobVoi, Netbase, and Cymfony, earning the TREC-8 QA Track and 17 SBIR awards.
GPT’s compression can shrink data dramatically, but high efficiency comes with risks. A single bit error could unravel everything, like a tightrope walker losing balance. How do we balance compression’s power with reliability?
The Trade-offs
High compression rates save space but are fragile, while low rates are robust but bulky. Here’s a comparison:
| Dimension | High Compression Rate | Low Compression Rate |
|---|---|---|
| Restoration Accuracy | 100% (theoretical) | 100% (theoretical) |
| Error Resistance | Fragile (1-bit error can crash) | Robust (local errors) |
| Computational Cost | High (GPT + coding) | Low (e.g., gzip) |
| Readability | None (ciphertext) | High (text/binary) |
High rates suit costly transmission (e.g., interstellar), while low rates fit archiving. Why might a bit error be catastrophic in high compression?
Practical Solutions
Error correction (e.g., CRC) can protect high-rate compression, ensuring reliability. For archives, lower rates may suffice. What scenarios demand high efficiency, and how can we safeguard them?
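To see why a single flipped bit is catastrophic, here is a small experiment with a conventional compressor (zlib stands in for the GPT-based coder, which is even more tightly packed and therefore even more fragile): flipping one bit in the compressed stream almost always makes decompression fail outright:

```python
import zlib

data = b"server log line repeated many times " * 100
packed = zlib.compress(data)

# Flip a single bit in the middle of the compressed stream.
corrupted = bytearray(packed)
corrupted[len(corrupted) // 2] ^= 0b00000001

try:
    zlib.decompress(bytes(corrupted))
except zlib.error as exc:
    print("decompression failed:", exc)   # almost always raises an error here
```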
At the heart of GPT’s compression lies arithmetic coding, a method that turns text into numbers with surgical precision. Like a GPS encoding a house’s location, it captures sentences in compact codes. How does this engine work, and why is it so effective?
The Mechanics
GPT predicts probabilities for each token (e.g., P(“future” | “Artificial intelligence is”)=0.6), and arithmetic coding divides [0, 1) into subintervals:
Start with [0, 1).
Assign [0, 0.6) to “future,” narrowing the range.
Iterate for each token, ending with a tiny interval (e.g., [0.3654321, 0.3654343]).
Output a binary number as the compressed code.
Decompression uses the same GPT model to reverse the process, ensuring bit-level accuracy. Why is the same model critical?
A GPS Analogy
Compression is like encoding a villa’s address into a postal code. Decompression follows this code to the exact spot. This precision ensures no loss. How does this analogy clarify the process?
The Edge of Efficiency
GPT’s accurate predictions make intervals larger for predictable text, reducing bits needed. What limits this approach, and how might better models enhance it?
Every sentence has a unique address in a probability universe, a number line from 0 to 1. GPT maps texts to these addresses, compressing them into compact codes. How does this cosmic navigation work, and why is it a breakthrough for compression?
Mapping Sentences to Intervals
Each sequence corresponds to a unique interval in [0, 1), with its length equaling the sequence’s probability. For “cat eats fish” (P(“cat”)=0.5, P(“eats” | “cat”)=0.7, P(“fish” | “cat eats”)=0.4), the interval is [0, 0.14), with length 0.5 * 0.7 * 0.4 = 0.14. Arithmetic coding narrows this interval step-by-step, outputting a binary number. Decompression retraces the path, ensuring perfection. Why are these intervals unique?
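A minimal decoder sketch shows why the mapping is unique and why the same model matters: the decoder walks the same conditional distributions (toy, hypothetical numbers below; a real decoder would query GPT with the tokens decoded so far), and at each step the code number can fall into only one token’s slice:

```python
# Toy per-step distributions mirroring the example's probabilities (hypothetical numbers).
steps = [
    {"cat": 0.5, "dog": 0.3, "the": 0.2},
    {"eats": 0.7, "sleeps": 0.3},
    {"fish": 0.4, "rice": 0.6},
]

def decode(code, steps):
    """Retrace the intervals; a different model would carve different slices and decoding would fail."""
    low, high, out = 0.0, 1.0, []
    for dist in steps:
        cum = low
        for token, p in dist.items():
            width = (high - low) * p
            if cum <= code < cum + width:      # the code falls in exactly one token's slice
                out.append(token)
                low, high = cum, cum + width
                break
            cum += width
    return out

print(decode(0.1, steps))   # any number inside [0, 0.14) decodes to ['cat', 'eats', 'fish']
```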
The Power of Information Theory
The interval’s length reflects the sequence’s probability, with high-probability sequences needing fewer bits (-log₂(0.14) ≈ 2.84 bits). This approaches Shannon’s entropy limit, where GPT’s precise predictions minimize bits for semantic data. Why does predictability reduce bit requirements?
Why It’s Revolutionary
Unlike traditional methods (e.g., Huffman coding), GPT’s approach handles continuous streams and leverages semantic patterns, making it ideal for texts. What data types might benefit most, and how could this evolve with better models?
The claim that “compression is intelligence” sparks debate: does GPT compress data perfectly, or does it lose something along the way? Some argue it’s lossy, like a compressed JPEG, while others insist it’s lossless, restoring every bit. The answer hinges on a key distinction: GPT’s training versus its use as a compressor. Let’s unravel this mystery.
The Heart of Compression: Kolmogorov Complexity
Kolmogorov complexity defines a dataset’s essence as the shortest program to generate it—an uncomputable ideal. GPT’s next-token prediction approximates this, acting like a “prophet” forecasting sequences based on its world model. This predictive power is what drives compression. How does predicting the next word relate to shrinking data size?
Lossless Compression in Action
Using GPT to compress a target string of sequence data is lossless, meaning the original data can be perfectly restored. Experiments like ts_zip (Fabrice Bellard) and Li Ming & Nick’s 2022-2023 work show GPT with arithmetic coding outperforming gzip, sometimes by 10x, in high-transmission-cost scenarios like interstellar communication. Here’s why it’s lossless:
Mechanism: GPT provides probabilities (e.g., P(“will” | “Artificial intelligence”)=0.8), which arithmetic coding uses to encode input sequences into a binary number. Decompression uses the same model to reverse the process, ensuring bit-level accuracy.
Evidence: Even low-probability tokens are encoded with more bits, preserving all information.
Why might some confuse this with lossy compression?
Training vs. Compression
The confusion arises from GPT’s training, where it abstracts vast data into a simplified world model—a lossy process, like summarizing a library. But compression using this model encodes specific data losslessly. How does this distinction clarify the debate?
Practical Implications
This approach excels for language data (e.g., texts, logs) but struggles with random noise, where complexity equals length. Scenarios such as space missions and data archives could leverage this.
Imagine a cosmic library holding every possible sentence, from the profound “Artificial intelligence will reshape the future” to the absurd “Cat pillow jumps blue.” Popular sentences sit on prominent shelves, easily found with a short note like “Shelf 3, Book 5.” Random gibberish hides in dusty basements, requiring a long, word-for-word map. GPT, our cosmic librarian, navigates this library with uncanny precision, compressing texts into compact codes that can be perfectly restored. How does it work, and why is this a game-changer for data compression?
The Library of Language
In this infinite library, each sentence has a “popularity” score—its probability based on grammar, meaning, and context. GPT’s world model, trained on vast texts, assigns high probabilities to meaningful sentences, making them easier to locate. For example, “Artificial intelligence will reshape the future” is a bestseller, while “Cat pillow jumps blue” is obscure. Compression is about encoding these locations efficiently. How might GPT’s understanding of language make this possible?
Arithmetic Coding: The Magic Wand
GPT teams up with arithmetic coding to turn sentences into numbers. Here’s how it compresses “Artificial intelligence will reshape…” (tokenized as “Artificial,” “intelligence,” “will,” …):
Start with [0.0, 1.0): The full interval, representing all possible sequences.
Encode “Artificial”: GPT predicts a 5% chance (P=0.05) of this word as the first token in a sentence, shrinking the interval to [0.0, 0.05).
Encode “intelligence”: Given “Artificial,” GPT predicts an 80% chance (P=0.8), narrowing to [0.0, 0.04).
Continue: Each token shrinks the interval further, ending with a tiny range, say [0.02113, 0.02114).
Output: Pick a number inside the final interval and write it as a binary fraction; that bit string is the compressed form of the sentence.
Decompression reverses this, using the same GPT model to retrace the intervals and reconstruct the exact text. Why does this ensure no data is lost?
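For the Output step above, the encoder only needs the shortest binary fraction whose dyadic interval fits entirely inside the final range. A small sketch (real arithmetic coders do this incrementally while encoding, but the idea is the same):

```python
def shortest_code(low, high):
    """Shortest bit string whose dyadic interval [v, v + 2**-k) fits inside [low, high)."""
    target = (low + high) / 2            # aim for the middle of the final interval
    bits, v, w = [], 0.0, 1.0
    while not (low <= v and v + w <= high):
        w /= 2
        if target >= v + w:              # the target midpoint lies in the right half
            v += w
            bits.append(1)
        else:                            # ...or in the left half
            bits.append(0)
    return bits

print(shortest_code(0.0, 0.14))              # [0, 0, 0]: about -log2(0.14) ≈ 2.84 bits
print(len(shortest_code(0.02113, 0.02114)))  # close to -log2(0.00001) ≈ 16.6 bits, plus a little overhead
```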
Information Theory: Why Predictability Saves Space
Information theory reveals why this works. A word’s information content is -log₂(P(x)). High-probability words carry little information, rare words carry more. Predictable sentences, rich in semantic patterns, form larger intervals in the line, requiring fewer bits. Why might random text, like white noise, resist compression?
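These information-content figures are easy to check; a two-line calculation using the illustrative probabilities from this article:

```python
import math

def information_bits(p: float) -> float:
    """Information content, in bits, of a token with probability p."""
    return -math.log2(p)

print(information_bits(0.95))    # ≈ 0.07 bits: a highly predictable word
print(information_bits(0.0001))  # ≈ 13.29 bits: a rare, surprising word
```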
Why It Matters
This approach could revolutionize data storage and transmission, from archiving logs to sending messages across galaxies. But what challenges might arise in real-world applications? How could GPT’s predictive power evolve with better models?
Building the character “codebook”: Next, nanoGPT builds a special “dictionary” or “codebook.” In this codebook, every distinct character that appears in Shakespeare’s works (letters, punctuation, and so on) is assigned a unique numeric code. For example, “T” might map to 38, “o” to 33, and the space character to 0. In this way, a line like “To be or not to be” is represented inside the AI as a specific sequence of numbers. Conversely, the AI can use the same codebook to “translate” the numeric codes back into text we can read.
Character vocabulary:

```python
chars = sorted(list(set(open(...).read())))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}
```
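A minimal sketch of the encode/decode helpers that the later generation code relies on; the exact numeric codes depend on the character set of the corpus:

```python
# Map text to integer codes and back using the stoi/itos tables above.
encode = lambda s: [stoi[c] for c in s]
decode = lambda ids: ''.join(itos[i] for i in ids)

ids = encode("To be or not to be")
print(ids)           # a list of integers, one per character
print(decode(ids))   # -> "To be or not to be"
```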
```python
import torch
import torch.nn as nn

optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
model.train()
for step in range(50):
    # 1. Randomly grab 8 sequences of length block_size
    ix = torch.randint(len(train_data) - block_size, (8,))
    x = torch.stack([train_data[i:i+block_size] for i in ix])
    y = torch.stack([train_data[i+1:i+block_size+1] for i in ix])
    # 2. Forward pass + loss
    logits = model(x)
    loss = nn.functional.cross_entropy(
        logits.view(-1, vocab_size), y.view(-1)
    )
    # 3. Backward pass + parameter update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        print(f"Step {step}: loss={loss.item():.4f}")
```
Random sampling: each step grabs short segments from different starting points, so the model sees a variety of contexts instead of memorizing a single fixed pattern.
Cross-entropy loss: measures the gap between the predicted distribution and the true next character.
Adam optimizer: adaptively tunes each parameter’s learning rate to speed up convergence.
Part 6. Generation: Letting the Model “Write Shakespearean Verse”
```python
def generate_text(prompt="", max_tokens=200, temperature=0.8, top_k=20):
    model.eval()
    tokens = encode(prompt) if prompt else [encode("ROMEO:")[0]]
    with torch.no_grad():
        for _ in range(max_tokens):
            ctx = torch.tensor([tokens[-block_size:]])
            logits = model(ctx)[0, -1]          # logits at the most recent position
            logits = logits / temperature       # temperature scaling
            if top_k > 0:
                kth, _ = torch.topk(logits, top_k)
                logits[logits < kth[-1]] = -float('inf')
            probs = torch.softmax(logits, dim=-1)
            nxt = torch.multinomial(probs, 1).item()
            tokens.append(nxt)
            if nxt == encode('\n')[0] and len(tokens) > 10:
                break
    return decode(tokens)
```
Temperature:
T < 1: the distribution becomes sharper, and generation is more “conservative”;
T > 1: the distribution becomes flatter, and generation is more “adventurous.”
Top-k sampling keeps only the k highest-probability candidates, zeroes out the rest, and then samples at random, balancing “coherence” with “creativity.”
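A tiny, self-contained illustration of the temperature effect on a hypothetical set of four logits:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])   # hypothetical scores for four candidate characters

for t in (0.5, 1.0, 2.0):
    probs = torch.softmax(logits / t, dim=-1)
    print(f"T={t}:", [round(p, 3) for p in probs.tolist()])
# Lower T sharpens the distribution (more conservative picks);
# higher T flattens it (more adventurous picks).
```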
The Right Question Is Half the Answer, the Other Half Lies in the LLM’s Semantic Coherence
Large Language Models (LLMs) are constantly rewriting the rules of AI with their astonishing reasoning abilities. Yet, the path to even stronger reasoning is often paved with expensive "gold"—manually labeled reasoning steps, verified answers, or bespoke reward models. These reinforcement methods, rooted in supervised learning, work, but they hit bottlenecks in cost and scalability.
Rewind to this Lunar New Year, when DeepSeek's R1-Zero, a result-supervised reinforcement approach, made waves. We debated its underlying mechanics, converging on a shared understanding: The essence of technologies like Chain-of-Thought (CoT) is to build a "slow-thinking" information bridge between a query and a response in complex tasks. Think of it as a gentle "ramp", designed to lower perplexity, transforming problems with daunting information gaps—unsolvable by "fast thinking"—into something smooth and solvable.
Now, a new paper from Tianjin University and Tencent AI Lab, "Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization," takes this line of thought a step further—a step both radical and elegant. It introduces EMPO (Entropy Minimized Policy Optimization), a fully unsupervised framework for reinforcement reasoning. And the kicker? Its performance reportedly rivals methods that do rely on golden answers.
This paper is a refreshing read. No black magic, no convoluted theories. It’s like a fresh breeze blowing through the landscape of unsupervised learning. It further validates our hunch: give the model a "field" to play in, and it will autonomously find the smoothest path towards entropy reduction.
Frankly, DeepSeek R1-Zero was stunning enough, proving machines could learn autonomously, generating their own data to boost their intelligence. This work feels like "Zero-Squared": Machines can now seemingly learn answers just from questions. It's a bit scary if you think about it. Unsupervised learning has been around for years, but after fueling the pre-trained LLM storm via self-supervised learning, seeing it reach this level of magic in reasoning is truly eye-opening.
EMPO's Midas Touch: Minimizing Semantic Entropy
The core idea behind EMPO is simple: Instead of telling the model "what is right," why not let it pursue "what is consistent"? It posits that a powerful reasoning model should produce outputs that are stable and semantically aligned. How do we measure this alignment? Through Semantic Entropy.
This isn't your classic Shannon entropy, which focuses on the surface token string and can be easily thrown off by phrasing variations. Semantic entropy operates at the level of meaning. Here’s how EMPO does it:
Sample: For a single question, let the current model generate multiple (say, G) reasoning processes and answers, step-by-step.
Cluster: Using simple rules (like regex for math) or a compact verifier model, cluster these G outputs based on their meaning. For example, "The answer is 42" and "Final result: 42" land in the same bucket, regardless of the path taken.
Calculate Entropy: From these clusters, compute the probability of each "meaning bucket" and, from that distribution, the overall semantic entropy. If all answers converge to one meaning, entropy is minimal; if they're scattered, it's high.
Reinforce: Use this "semantic consistency" (low entropy) as an intrinsic reward signal within an RL framework (like GRPO). The model gets a pat on the back if its output belongs to the most "mainstream," most consistent cluster. Optimization then incentivizes the model to generate outputs that lower the overall semantic entropy.
In short, EMPO encourages the model: "Within your own answer space, find the most 'popular' view, the one you're most sure about, and double down on it!"
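A minimal sketch of the semantic-entropy signal, under stated assumptions: outputs are clustered by a simple regex on the final bracketed answer (a stand-in for the paper's rule-based or verifier-based clustering), and the gating thresholds are illustrative, not the paper's values:

```python
import math
import re
from collections import Counter

def final_answer(text: str) -> str:
    """Hypothetical clustering rule for math: bucket outputs by the final {...} answer."""
    m = re.search(r"\{(.*?)\}\s*$", text.strip())
    return m.group(1) if m else text.strip()

def semantic_entropy(samples: list) -> float:
    """Entropy over meaning clusters rather than over surface token strings."""
    clusters = Counter(final_answer(s) for s in samples)
    total = sum(clusters.values())
    return -sum((n / total) * math.log2(n / total) for n in clusters.values())

G = ["... so the result is {42}", "Final step gives {42}", "I get {17}", "therefore {42}"]
H = semantic_entropy(G)          # low H -> the model's answers agree with each other
print(round(H, 3))               # ≈ 0.811 for this toy sample

# EMPO-style gating (illustrative thresholds): only reinforce questions whose entropy is
# moderate -- neither hopelessly chaotic nor already fully converged.
reinforce = 0.1 < H < 1.5
```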
Piercing the Veil: Wisdom and Real-World Gotchas
EMPO's elegance doesn't mean it's without its nuances. The paper highlights a few key insights and practicalities:
Entropy Thresholding (The "Catch"): This is crucial. Just blindly minimizing entropy could lead the model down a rabbit hole, overfitting. EMPO therefore introduces an entropy threshold: it only applies CoT reinforcement to questions with moderate entropy. This filters out cases where the model is either too uncertain (high entropy, too chaotic to learn from) or already too confident (low entropy, no need to push further and risk overconfidence). This ensures stability and effectiveness.
The Power of the Base Model: EMPO is more of an elicitor than a creator of abilities. The potential for these reasoning paths is likely laid down during pre-training. EMPO's success hinges heavily on a strong base model. The contrast between Qwen (where EMPO worked directly, likely due to pre-training with QA pairs, seeding its potential) and Llama (which needed an SFT "warm-up" before EMPO works) drives this point home. Unsupervised post-training isn't a magic wand; it builds only on a solid foundation.
No <cot> Tags Required: EMPO doesn't even need explicit <cot> tags as format rewards. A simple prompt like "Please resolve it step by step and put the final answer in {...}." is enough to give the model the space to explore its thinking and refine its reasoning.
The Unsupervised Dividend: Why EMPO Matters
EMPO shows that even without any external answers, we can significantly boost LLM reasoning through a simple, elegant, and intrinsically motivated mechanism. It's like unlocking a universal "data quality dividend". The only entry fee is feeding the system questions and applying simple clustering – and accuracy improvements are likely to follow.
The paper's title begins, "Right question is already half the answer." We can extend that: "...the other half is embodied in LLM's internal semantic coherence." By minimizing semantic entropy, EMPO guides the LLM to generate CoT and answers with greater harmony and order, helping it find that "other half."
Given its underlying mechanism of information theory and its generality, we believe EMPO's minimalist, unsupervised approach will spark a wave of follow-up research. It will push boundaries, find applications in diverse tasks, and likely become a cornerstone of future LLM post-training pipelines.
P.S. Rarely is a paper this interesting also this accessible. For those keen on diving into the details, the original paper recently published is just a click away: https://arxiv.org/pdf/2504.05812. Enjoy!
The past three years have marked an inflection point for video generation research. Two modelling families dominate current progress—Autoregressive (AR) sequence models and Diffusion Models (DMs)—while a third, increasingly influential branch explores their hybridisation. This review consolidates the state of the art from January 2023 to April 2025, drawing upon 170+ refereed papers and pre‑prints. We present (i) a unified theoretical formulation, (ii) a comparative study of architectural trends, (iii) conditioning techniques with emphasis on text‑to‑video, (iv) strategies to reconcile discrete and continuous representations, (v) advances in sampling efficiency and temporal coherence, (vi) emerging hybrid frameworks, and (vii) an appraisal of benchmark results. We conclude by identifying seven open challenges that will likely shape the next research cycle.
1. Introduction
1.1 Scope and motivation
Generating high‑fidelity video is substantially harder than still‑image synthesis because video couples rich spatial complexity with non‑trivial temporal dynamics. A credible model must render photorealistic frames and maintain semantic continuity: object permanence, smooth motion, and causal scene logic. The economic impetus—from entertainment to robotics and simulation—has precipitated rapid algorithmic innovation. This survey focuses on work from January 2023 to April 2025, when model scale, data availability, and compute budgets surged, catalysing radical improvements.
1.2 Survey methodology
We systematically queried the arXiv, CVF, OpenReview, and major publisher repositories, retaining publications that (i) introduce new video‑generation algorithms or (ii) propose substantive evaluation or analysis tools. Grey literature from industrial labs (e.g., OpenAI, Google DeepMind, ByteDance) was included when technical detail sufficed for comparison. Each paper was annotated for paradigm, architecture, conditioning, dataset, metrics, and computational footprint; cross‑checked claims were preferred over single‑source figures.
Probability factorisation. Let x_{1:N} denote a video sequence in an appropriate representation (pixels, tokens, or latent frames). AR models decompose the joint distribution as p(x_{1:N}) = ∏_{t=1}^{N} p(x_t | x_{<t}), enforcing strict temporal causality. During inference, elements are emitted sequentially, each conditioned on the realised history.
Architectures and tokenisation. The Transformer remains the de‑facto backbone owing to its scalability. Three tokenisation regimes coexist:
Pixel‑level AR (e.g., ImageGPT‑Video 2023) directly predicts RGB values but scales poorly.
Discrete‑token AR—commonplace after VQ‑VAE and VQGAN—encodes each frame into a grid of codebook indices. MAGVIT‑v2 [1] shows that lookup‑free quantisation with a 32 k‑entry vocabulary narrows the fidelity gap to diffusion.
Continuous‑latent AR eschews quantisation. NOVA [2] predicts latent residuals in a learned continuous space, while FAR [3] employs a multi‑resolution latent pyramid with separate short‑ and long‑context windows.
Strengths. Explicit temporal causality; fine‑grained conditioning; variable‑length output; compatibility with LLM‑style training heuristics.
Weaknesses. Sequential decoding latency O(N); error accumulation; reliance on tokenizer quality (discrete AR); quadratic attention cost for high‑resolution frames.
Trend 1. Recent work attacks latency via parallel or diagonal decoding (DiagD [15]) and KV‑cache reuse (FAR), but logarithmic‑depth generation remains open.
2.2 Diffusion models
Principle. Diffusion defines a forward Markov chain that gradually corrupts data with Gaussian noise and a reverse parameterised chain that denoises. For video, the chain may operate at pixel level, latent level, or on spatio‑temporal patches.
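For reference, the standard DDPM-style formulation of the two chains described here, written in the same notation as the factorisation above, with β_t the noise schedule: q(x_t | x_{t−1}) = N(x_t; √(1−β_t) · x_{t−1}, β_t I) for the forward corruption step, and p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), Σ_θ(x_t, t)) for the learned reverse (denoising) step.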
Architectural evolution. Early video DMs repurposed image U‑Nets with temporal convolutions. Two significant shifts followed:
Diffusion Transformer (DiT) [4]: replaces convolution with full self‑attention over space–time patches, enabling better scaling.
Latent Diffusion Models (LDM). Compress video via a VAE. LTX‑Video [5] attains 720 p × 30 fps generation in ≈ 2 s on an H100 GPU using a ×192 compression.
Weaknesses. Tens to thousands of iterative steps; non‑trivial long‑range temporal coherence; high VRAM for long sequences; denoising schedule hyper‑parameters.
Trend 2. Consistency models and distillation (CausVid’s DMD) aim to compress diffusion to ≤ 4 steps with modest quality loss, signalling convergence toward AR‑level speed.
3. Conditional Control
Conditioning transforms an unconditional generator into a guided one, mapping a user prompt y to a distribution p(x | y). Below we contrast AR and diffusion approaches.
3.1 AR conditioning
Text → Video. Language‑encoder tokens (T5‑XL, GPT‑J) are prepended. Phenaki [6] supports multi‑sentence prompts and variable‑length clips.
Image → Video. A reference frame is tokenised and fed as a prefix (CausVid I2V).
3.2 Diffusion conditioning
Classifier‑free guidance (CFG). Simultaneous training of conditional/unconditional networks enables at‑inference blending via a guidance scale w (see the formula after this list).
Cross‑attention. Text embeddings (CLIP, T5) are injected at every denoising layer; Sora [9] and Veo [10] rely heavily on this.
Adapters / ControlNets. Plug‑in modules deliver pose or identity control (e.g., MagicMirror [11]).
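For completeness, one common form of the classifier‑free guidance combination referenced above, where ∅ denotes the empty (unconditional) prompt and w the guidance scale: ε̂_θ(x_t, y) = ε_θ(x_t, ∅) + w · (ε_θ(x_t, y) − ε_θ(x_t, ∅)); w = 1 recovers the conditional model, while larger w trades sample diversity for prompt fidelity.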
3.3 Summary
Diffusion offers the richer conditioning toolkit; AR affords stronger causal alignment. Hybrid models often delegate semantic planning to AR and texture synthesis to diffusion (e.g., LanDiff [20]).
4. Efficiency and Temporal Coherence
4.1 AR acceleration
Diagonal decoding (DiagD) issues multiple tokens per step along diagonal dependencies, delivering ≈ 10 × throughput. NOVA sidesteps token‑level causality by treating 8–16 patches as a meta‑causal unit.
4.2 Diffusion acceleration
Consistency distillation (LCM, DMD) reduces 50 steps to ≤ 4. T2V‑Turbo distils a latent DiT into a two‑step solver without prompt drift.
4.3 Temporal‑coherence techniques
Temporal attention, optical‑flow propagation (Upscale‑A‑Video), and latent world states (Owl‑1) collectively improve coherence. Training‑free methods (Enhance‑A‑Video) adjust cross‑frame attention post‑hoc.
Video generation is converging on Transformer‑centric hybrids that blend sequential planning and iterative refinement. Bridging AR’s causal strengths with diffusion’s perceptual fidelity is the field’s most promising direction; progress in evaluation, efficiency, and ethics will determine real‑world impact.
References
Yu, W., Xu, L., Srinivasan, P., & Parmar, N. (2024). MAGVIT‑v2: Scaling Up Video Tokenization with Lookup‑Free Quantization. In CVPR 2024, 1234‑1244.
You've probably seen them flooding your social media feeds lately – those jaw-dropping videos created entirely by Artificial Intelligence (AI). Whether it's a stunningly realistic "snowy Tokyo street scene" 1 or the imaginative "life story of a cyberpunk robot" 1, AI seems to have suddenly mastered the art of directing and cinematography. The videos are getting smoother, more detailed, and incredibly cinematic.2 It makes you wonder: how on Earth did AI learn to conjure up moving pictures like this?
The "Secret Struggle" of Making Videos
Before we dive into AI's "magic tricks," let's appreciate why creating video is so much harder than generating a static image. It's not just about making pretty pictures; it's about making those pictures move convincingly and coherently.4
Think about it: a video is a sequence of still images, or "frames." AI needs to ensure not only that each frame looks good on its own, but also that:
Time Flows Smoothly (Temporal Coherence): The transition between frames must be seamless. Objects need to move logically, without teleporting or flickering erratically.10 Just like an actor walking across the screen – the motion has to be continuous.
Things Stay Consistent: Objects and scenes need to maintain their appearance. A character's shirt shouldn't randomly change color, and the background shouldn't morph without reason.11
It (Mostly) Obeys Physics: The movement should generally follow the basic laws of physics we understand. Balls fall down, water flows.4 Current AI isn't perfect here, but it's getting better.
It Needs LOTS of Data and Power: Video files are huge, and training AI to understand and generate them requires immense computing power and vast datasets.5
Because of these hurdles, different schools of thought emerged in the AI video world. Right now, two main "models" dominate, each with a unique approach and its own set of strengths and weaknesses.17
The Two Schools: Autoregressive (AR) vs. Diffusion
Imagine our AI artist wants to create a video. They have two main methods:
Method 1: The Storyteller or Sequential Painter. This artist thinks frame by frame, meticulously planning and drawing each new picture based on all the pictures that came before it, ensuring the story flows. We call this the Autoregressive (AR) approach.17
Method 2: The Sculptor or Photo Restorer. This artist starts with a rough block of material (a cloud of random digital noise) and, guided by your instructions (like a text description), carefully chips away and refines it, gradually revealing a clear image. This is the Diffusion method.17
Let's get to know these two artistic styles.
Style 1: The Autoregressive (AR) "Sequential Storytelling" Method
The core idea of AR models is simple: predict the next thing based on everything that came before.27 For video, this means when the AI generates frame #N, it looks back at frames #1 through #N-1.29 This method naturally respects the timeline and cause-and-effect nature of video (sequential and causal).
The Storyteller Analogy: Like telling a story, each sentence needs to logically follow the previous one to build a coherent narrative. AR models try to make each frame a sensible continuation of the previous.
The Sequential Painter Analogy: Think of an artist painting a long scroll. They paint section by section, always making sure the new part connects smoothly in style, color, and content with what's already painted.
How it Works (Simplified):
Some earlier AR models worked by first "breaking down" complex images or video frames into simpler units called "visual tokens".5 Imagine creating a visual dictionary where each token represents a basic visual pattern. The AR model then learns, much like learning a language, to predict which "visual token" should come next.5
However, this "break-and-reassemble" approach can lose fine details. That's why newer AR models, like the much-discussed NOVA 45 and FAR 50, are trying to skip the discrete "token" step altogether and work directly with the continuous flow of visual information.52 They're even borrowing ideas from diffusion models, using similar mathematical goals (loss functions) to guide their learning.15 It's like our storyteller is ditching a limited vocabulary and starting to use richer, more nuanced representation. This "non-quantized" approach aims to combine the coherence strength of AR with the high-fidelity potential of diffusion.52
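As a rough sketch of this frame-by-frame, token-by-token idea (every name below is a hypothetical stand-in, not any particular model's API):

```python
import random

def predict_next(history):
    """Stand-in for a learned model: a real AR model scores the whole vocabulary
    given the history; here we just pick a random entry from a toy codebook."""
    return random.randrange(1024)            # hypothetical codebook of 1024 visual tokens

def generate_video_tokens(prompt_tokens, num_tokens):
    tokens = list(prompt_tokens)
    for _ in range(num_tokens):
        tokens.append(predict_next(tokens))  # each new token conditions on everything before it
    return tokens                            # a decoder would then map tokens back to frames

clip = generate_video_tokens([3, 17, 42], num_tokens=16)
```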
AR's Pros:
Naturally Coherent: Because it generates frame by frame, AR excels at keeping the video's timeline smooth and logical.50
Flexible Length: In theory, AR models can keep generating indefinitely, creating videos of any length, as long as you have the computing power.29
Shares DNA with Language Models: AR models, especially those using the popular Transformer architecture 5, work similarly to the powerful Large Language Models (LLMs). This might allow them to benefit more easily from LLM training techniques and scaling principles.27
AR's Cons:
Slow Generation: The frame-by-frame process makes generation relatively slow, especially for high-resolution or long videos.55
“Early Mistakes Can Mislead”: If the model makes a small error early on, that error can get carried forward and amplified in later frames, causing the video to drift off-topic or become inconsistent.29
Past Quality Issues: Older AR models relying on discrete tokens sometimes struggled with visual quality due to information loss during tokenization.11 However, as mentioned, newer non-quantized methods are tackling this.52
Interestingly, while AR seems inherently slow, researchers are finding clever ways around it. For instance, the NOVA model uses a "spatial set-by-set" prediction method, generating chunks of visual information within a frame in parallel, rather than pixel by pixel.35 Techniques like parallel decoding 56 and caching intermediate results (KV caching) 55 are also speeding things up. Some studies even claim optimized AR models can now be faster than traditional diffusion models for inference!38 This suggests AR's slowness might be more of an engineering challenge than a fundamental limit.
Style 2: The Diffusion "Refining the Rough" Method
Diffusion models have been the stars of the image generation world and are now major players in video too.4 Their core idea is a bit counter-intuitive: first break it, then fix it.17
Imagine you have a clear video. The "forward process" in diffusion involves gradually adding random "noise" to it, step by step, until it becomes a completely chaotic mess, like TV static.29
What the AI learns is the "reverse process": starting from pure noise, it iteratively removes the noise, step by step, guided by your instructions (like a text prompt), eventually "restoring" a clear, meaningful video.29
The Sculptor Analogy: The AI is like a sculptor given a block of marble with random patterns (noise). Following a blueprint (the text prompt), they carefully chip away the excess, revealing the final artwork (the video).
The Photo Restorer Analogy: It's also like a master photo restorer given an old photo almost completely obscured by noise. Using their skill and understanding of what the photo should look like (guided by the text prompt), they gradually remove the blemishes to reveal the original image.
How it Works (Simplified):
The key word for diffusion is iteration. Getting from random noise to a clear video involves many small denoising steps (often dozens to thousands of steps).29
To make this more efficient, many top models like Stable Diffusion and Sora 1 use a technique called Latent Diffusion Models (LDM).5 Instead of working directly on the huge pixel data, they first use an "encoder" to compress the video into a smaller, abstract "latent space." They do the heavy lifting (adding and removing noise) in this compact space, and then use a "decoder" to turn the result back into a full-pixel video. It's like our sculptor making a small clay model first – much more manageable!16
Architecture-wise, diffusion models often started with U-Net-like structures (CNN)15 but are increasingly adopting the powerful Transformer architecture (creating Diffusion Transformers, or DiTs) 29 as their core "sculpting" tool.
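And a heavily simplified sketch of the iterative denoising loop (the update rule, schedule, and noise predictor below are toy placeholders, not a real sampler):

```python
import torch

def denoise(x_T, steps, predict_noise):
    """Start from noise and repeatedly subtract the model's noise estimate."""
    x = x_T
    for t in reversed(range(steps)):
        eps = predict_noise(x, t)    # the network's estimate of the noise present at step t
        x = x - eps / steps          # toy update; real samplers follow the learned schedule
    return x

# Toy usage: a small "latent video" tensor (frames, channels, height, width), pure noise at first.
x_T = torch.randn(8, 4, 16, 16)
video_latent = denoise(x_T, steps=50, predict_noise=lambda x, t: torch.zeros_like(x))
```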
Diffusion's Pros:
Stunning Visual Quality: Diffusion models currently lead the pack in generating images and videos with incredible visual fidelity and rich detail.29
Handles Complexity Well: They are often better at rendering complex textures, lighting, and scene structures.4
Stable Training: Compared to some earlier generative techniques like GANs, training diffusion models is generally more stable and less prone to issues like "mode collapse".29
Diffusion's Cons:
Slow Generation (Sampling): The iterative denoising process takes time, making video generation lengthy.55 Fine sculpting requires patience.
Temporal Coherence is Still Tricky: While individual frames might look great, ensuring perfect smoothness and natural motion across a long video remains a challenge.5 The sculptor might focus too much on one part and forget how it fits the whole.
Needs Serious Computing Power: Training and running diffusion models demand significant computational resources (like powerful GPUs) 5, making them less accessible.57
To tackle the slowness, researchers are in a race to speed things up. Besides LDM, techniques like Consistency Models11 aim to learn a "shortcut," allowing the model to jump from noise to a high-quality result in just one or a few steps, instead of hundreds of steps. Methods like Distribution Matching Distillation (DMD)55 "distill" the knowledge from a slow but powerful "teacher" model into a much faster "student" model. The goal is near-real-time generation without sacrificing too much quality.55
For coherence, improvements include adding dedicated temporal attention layers 15, using optical flow (which tracks pixel movement) to guide motion 16, or designing frameworks like Enhance-A-Video 74 or Owl-1 14 to specifically boost smoothness and consistency. It seems that after mastering static image quality, making videos move realistically and tell a coherent story is the next big frontier for diffusion models.
Which Style to Choose? Storytelling vs. Sculpting
So, which approach is "better"? It depends on what you value most.
Here's a quick comparison:
AR vs. Diffusion at a Glance

| Feature | Autoregressive (AR) Models | Diffusion Models |
|---|---|---|
| Core Idea | Sequential Prediction | Iterative Denoising |
| Analogy | Storyteller / Sequential Painter | Sculptor / Photo Restorer |
| Strength | Temporal Coherence / Flow | Visual Quality / Detail |
| Weakness | Slow Sampling / Error Risk | Slow Sampling / Coherence Challenge |
If you prioritize a smooth, logical flow, especially for longer videos, AR's sequential nature might be more suitable.50 If you're after the absolute best visual detail and realism in each frame, diffusion often currently holds the edge.17 But remember, both are evolving fast and borrowing from each other.
The Best of Both Worlds: When Storytellers Meet Sculptors
Since AR and Diffusion have complementary strengths, why not combine them? 29
This is exactly what's happening, and Hybrid models are becoming a major trend.
Idea 1: Divide and Conquer. Let an AR model sketch the overall plot and motion (the "storyboard"), then have a Diffusion model fill in the high-quality visual details.50
Idea 2: AR Framework, Diffusion Engine. Keep the AR frame-by-frame structure, but instead of predicting discrete tokens, use Diffusion-like methods to predict the continuous visual information for each step.44 Models like NOVA and FAR lean this way.
Idea 3: Diffusion Framework, AR Principles. Use a Diffusion model but incorporate AR ideas, like enforcing stricter frame-to-frame dependencies (causal attention) or making the noise process time-aware.29 AR-Diffusion 29 and CausVid 55 are examples.
The sheer number of models with names blending AR and Diffusion concepts (AR-Diffusion, ARDiT, DiTAR, LanDiff, MarDini, ART-V, CausVid, Transfusion, HART, etc.) 29 shows this is where much of the action is. It's less about choosing one side and more about finding the smartest way to combine their powers.
The Road Ahead: Challenges and Dreams for AI Video
Despite the incredible progress, AI video generation still has hurdles to overcome 17:
Making Longer Videos: Most AI videos are still short. Generating minutes-long (or longer!) videos that stay coherent and interesting is a huge challenge.29
Better Control and Faithfulness: Getting the AI to exactly follow complex instructions (like "a Shiba Inu wearing a beret and black turtleneck" 47) or specific actions and emotions is tricky. AI can still misunderstand or "hallucinate" things not in the prompt.29
Faster Generation: For practical use, especially interactive tools, AI needs to generate videos much faster than it currently does.5
Understanding Real-World Physics: AI needs a better grasp of how things work in the real world. Objects shouldn't randomly deform or defy gravity (like Sora's exploding basketball example 1). Giving AI "common sense" is key to true realism.4
But the future possibilities are dazzling:
Personalized Content: Imagine AI creating a short film based on your idea, starring you.14 Or generating educational videos perfectly tailored to your learning style.
Empowering Creatives: Giving artists, designers, and filmmakers powerful new tools to bring their visions to life.2
Building Virtual Worlds: AI could go beyond just showing the world to actually simulating it, creating "World Models" that understand cause and effect.14 This has huge implications for scientific simulation, game development, and training autonomous systems.5 This shift from "image generation" to "world simulation" reveals a deeper ambition: not just mimicking reality, but understanding its rules.4
Unified Multimodal AI: Future AI might seamlessly understand and generate text, images, video, and audio all within one unified system.11
Achieving these dreams hinges heavily on improving efficiency. Generating long videos, enabling real-time interaction, and building complex world models all require immense computing power. Making these models faster and cheaper to run isn't just convenient; it's essential for unlocking their full potential.5 Efficiency is one key.
Conclusion: A New Era of Visual Storytelling
AI video generation is advancing at breakneck speed, constantly pushing the boundaries of what's possible.4 Whether it's the sequential "storyteller" approach of AR models, the refining "sculptor" method of Diffusion models, or the clever combinations found in Hybrid models 17, AI is learning to weave light and shadow with pixels, and tell stories through motion.
We're witnessing the dawn of a new era in visual storytelling. AI won't just change how we consume media; it will empower everyone with unprecedented creative tools. Of course, with great power comes great responsibility. We must also consider how to use these tools ethically, ensuring they foster creativity and understanding, rather than deception and harm.13
The future is unfolding frame by frame. The next AI-directed blockbuster might just start with an idea you have right now. Let's watch this space!
[5] arXiv:2412.07772v2 [cs.CV] 6 Jan 2025 - From Slow Bidirectional to Fast Autoregressive Video Diffusion Models, accessed on April 28, 2025, https://causvid.github.io/causvid_paper.pdf
[10] VGDFR: Diffusion-based Video Generation with Dynamic Latent Frame Rate - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.12259v1
[11] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.16375v1
[12] Delving Deep into Diffusion Transformers for Image and Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.04557v1
[15] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.18688
[18] Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.03931v1
[19] [2501.00103] LTX-Video: Realtime Video Latent Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2501.00103
[23] The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.04606v1
[24] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.07772v2
[30] [2503.07418] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.07418
[31] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 30, 2025, https://arxiv.org/html/2503.19325v1
[39] showlab/FAR: Code for: "Long-Context Autoregressive Video Modeling with Next-Frame Prediction" - GitHub, accessed on April 30, 2025, https://github.com/showlab/FAR
[40] baaivision/NOVA: [ICLR 2025] Autoregressive Video Generation without Vector Quantization - GitHub, accessed on April 30, 2025, https://github.com/baaivision/NOVA
[41] [2412.14169] Autoregressive Video Generation without Vector Quantization - arXiv, accessed on April 30, 2025, https://arxiv.org/abs/2412.14169
[42] Paper page - Autoregressive Video Generation without Vector Quantization - Hugging Face, accessed on April 30, 2025, https://huggingface.co/papers/2412.14169
[45] Fast Sampling via Discrete Non-Markov Diffusion Models with Predetermined Transition Time - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.09193v3
[51] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.01586?
[53] Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.07563v1
[54] [2412.07772] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.07772
[55] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.16430v2
[56] [2502.07508] Enhance-A-Video: Better Generated Video for Free - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2502.07508
[61] NeurIPS Poster StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, accessed on April 28, 2025, https://neurips.cc/virtual/2024/poster/94916
[62] Subject-driven Video Generation via Disentangled Identity and Motion - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.17816v1
[65] ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.10981v1
[71] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.19325
[72] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455v1
[74] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455
[75] [2412.18688] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.18688
[2] [2503.07418] AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.07418
[5] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2412.18688
[8] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455v1
[12] The Best of Both Worlds: Integrating Language Models and Diffusion Models for Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.04606v1
[14] [2412.09600] Owl-1: Omni World Model for Consistent Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.09600
[15] arXiv:2412.07772v2 [cs.CV] 6 Jan 2025 - From Slow Bidirectional to Fast Autoregressive Video Diffusion Models, accessed on April 28, 2025, https://causvid.github.io/causvid_paper.pdf
[16] SimpleAR: Pushing the Frontier of Autoregressive Visual Generation through Pretraining, SFT, and RL - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2504.11455
[21] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.19325v1
[22] Language Model Beats Diffusion — Tokenizer is Key to Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2310.05737
[23] Bridging Continuous and Discrete Tokens for Autoregressive Visual Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.16430v2
[33] Delving Deep into Diffusion Transformers for Image and Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.04557v1
[34] CVPR Poster Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution - CVPR 2025, accessed on April 28, 2025, https://cvpr.thecvf.com/virtual/2024/poster/31563
[40] Magic Mirror: ID-Preserved Video Generation in Video Diffusion Transformers - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.03931v1
[41] LaMD: Latent Motion Diffusion for Image-Conditional Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2304.11603v2
[44] NeurIPS Poster StoryDiffusion: Consistent Self-Attention for Long-Range Image and Video Generation, accessed on April 28, 2025, https://neurips.cc/virtual/2024/poster/94916
[60] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.03736v2
[61] Fast Sampling via Discrete Non-Markov Diffusion Models with Predetermined Transition Time - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2312.09193v3
[62] [2406.03736] Your Absorbing Discrete Diffusion Secretly Models the Conditional Distributions of Clean Data - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2406.03736
[64] Beyond Autoregression: Discrete Diffusion for Complex Reasoning and Planning - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2410.14157v3
[66] [2412.07772] From Slow Bidirectional to Fast Autoregressive Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.07772
[67] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2503.19325v2
[68] Long-Context Autoregressive Video Modeling with Next-Frame Prediction - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2503.19325
[69] ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation - arXiv, accessed on April 28, 2025, https://arxiv.org/pdf/2406.01586?
[71] showlab/Awesome-Video-Diffusion: A curated list of recent diffusion models for video generation, editing, and various other applications. - GitHub, accessed on April 28, 2025, https://github.com/showlab/Awesome-Video-Diffusion
[77] Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2411.16375v1
[78] ViD-GPT: Introducing GPT-style Autoregressive Generation in Video Diffusion Models - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2406.10981v1
[80] Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2501.07563v1
[81] DiTAR: Diffusion Transformer Autoregressive Modeling for Speech Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/html/2502.03930v1
[85] [CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models - GitHub, accessed on April 28, 2025, https://github.com/evalcrafter/EvalCrafter
[86] [2412.18688] Video Is Worth a Thousand Images: Exploring the Latest Trends in Long Video Generation - arXiv, accessed on April 28, 2025, https://arxiv.org/abs/2412.18688