Decoupling to Resolve: The Issue of Character Consistency in Video Generation

Just for fun, I've become the go-to expert for AIGC (AI-generated content) "custom services" among my old friends and classmates. Below are nostalgic videos, made from old photos, that two of my classmates asked me to create.

Whenever I find the time, I’m more than happy to provide this kind of emotional value for friends and family because it’s truly satisfying to see their reactions of surprise.

The pianist is now a world-class piano master, frequently touring and performing in Europe, America, and China. These are precious old photos of him practicing and performing with our mutual friend, Brother Sun, in Philadelphia back in the early days.

Dr. Bai Shuo, a seasoned NLP expert and multi-talented musician, commented humorously: "The bow-pulling in the titular Meditation looks real enough, but the bowing and fingering are all wrong."

Another old friend also left feedback noting that the visual model doesn't understand music: "This needs improvement! It's obvious the model was built by someone who doesn't play the violin or piano. The bowing and the piano accompaniment are off. The first note is a long tone of two and a half beats, which should be played with a long bow. And the pianist's right foot should never be raised or shaking like that—it should rest on the sustain pedal."

LOL

Even though the title of the piece, Meditation, was clearly specified in my prompt during generation, no model in the foreseeable future will truly align an understanding of the music with the intricate bodily movements of performance. Perhaps this can be reserved as one of the ultimate challenges for large models aiming at AGI: in theory, if enough aligned data of musical performance were available, then, based on the compression theory behind "joint training," perfect alignment across modalities could be the goal.

If simulating the objective world is the ultimate goal of visual models, then the current generation of visual models is at the level of "playing the piano to a cow," i.e., playing music to a tone-deaf audience—completely unable to withstand scrutiny from musicians. I am a case in point: with little musical knowledge, I don't notice the flaws an expert would when watching the nostalgic performance videos above; instead, I find them vivid and emotionally engaging.

Of course, the standards of musicians may well be just a "pseudo-demand" or a pseudo-goal (even if the visuals satisfy the picky "expert eye," so what? Will it sell any better?), and it may not be worth the effort to pursue them. In theory, however, an ideal AGI should be capable of meeting such expert-level demands.

That is the challenge of musical-performance alignment. Another challenge for Sora-like video generation models is character consistency within a video.

Achieving facial consistency in generative visual models is extremely difficult. Don't expect this issue to be resolved by video generation models alone in the short term, least of all through purely autoregressive methods.

Human eyes are extremely discerning at face recognition, especially with the familiar faces of friends and family—you can immediately tell when a character's appearance is off. For example, while playing with old photos recently, I used the KeLing model (a top-notch video model in China) to generate a video of myself. At the 5-second mark it still looked passable, but by 10 seconds it no longer resembled me.

In the second 10-second video, just a slight turn of the head and it's no longer me—it looks more like my brother. How can a model handle such fine detail, especially when the starting image for the video is not even a straightforward frontal shot, leaving the character information incomplete? How could it not go off track?

The videos I've made for friends and family with KeLing during its public testing phase have generally been met with delighted surprise and amazement, yet most of them suffer from this character-consistency problem, which is a pity.

The one-click video generation products currently on the market (including our own recently launched YuanChuang Island) tend to favor anime or manga styles. This sidesteps user scrutiny, since such styles lack distinct, 3D individual features: as long as the attire stays consistent, genders aren't mixed up, and age and ethnicity roughly match, most people will accept the result. Today's one-click videos are generally rough, with the entertainment value lying mainly in the story rather than in character portrayal of Hollywood-blockbuster quality. As this path progresses, however, it will inevitably run into the challenge of keeping digital IP actors and their roles consistent.

My colleague Lu remarked, "The consistency issue might require cross-checking across multiple video angles, which more or less touches on the core question of whether explicit modeling is necessary."

Indeed, some form of cross-checking is required, not just monotonic correction along the time/sequence axis—that is the key. The character's image needs to be decoupled from the storyline rather than generated along a linear, one-way path. Sequence learning has worked miracles in LLMs, but sequence generation has inherent limitations, including random deviations that accumulate over time. It is not as dire as LeCun's criticism—that GPT's autoregressive errors compound until a tiny discrepancy becomes a huge miss—and his claim isn't entirely accurate, because GPT's autoregression also corrects and adjusts its course at every step within the context. Nevertheless, when it comes to fine-grained consistency, random deviations are almost impossible to eliminate, even with such corrective mechanisms in place.
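To make the drift concrete, here is a minimal toy sketch in Python—purely illustrative, not any real video model—contrasting a purely sequential process, in which each "frame" is conditioned only on the previous one, with a decoupled process that also conditions every frame on a fixed identity embedding (the IP). The embedding dimension, noise scale, anchor strength, and blending rule are all assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "identity embedding": a fixed vector standing in for the character (the IP).
dim = 64
identity = rng.normal(size=dim)
identity /= np.linalg.norm(identity)

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

num_frames = 240       # e.g. 10 seconds at 24 fps
noise_scale = 0.05     # per-frame random deviation (illustrative)
anchor_strength = 0.5  # how strongly each frame is pulled back toward the fixed IP

seq_frame = identity.copy()   # purely sequential: conditioned only on the previous frame
dec_frame = identity.copy()   # decoupled: also conditioned on the fixed identity embedding

for _ in range(num_frames):
    # Sequential generation: small deviations accumulate like a random walk.
    seq_frame = seq_frame + noise_scale * rng.normal(size=dim)
    # Decoupled generation: the same deviation, but the frame is blended back
    # toward the constant identity, so the error stays bounded.
    drifted = dec_frame + noise_scale * rng.normal(size=dim)
    dec_frame = (1 - anchor_strength) * drifted + anchor_strength * identity

print(f"similarity to the identity after {num_frames} frames:")
print(f"  sequential only  : {cosine(seq_frame, identity):.3f}")   # drifts far from the IP
print(f"  identity-anchored: {cosine(dec_frame, identity):.3f}")   # stays close to the IP
```

The specific numbers don't matter; the shape does: without a constant reference outside the sequence, small per-frame deviations compound into a different face, whereas anchoring every frame to the same identity keeps them bounded—which is exactly what decoupling the character from the storyline is meant to buy.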

Hence decoupling, decoupling, decoupling! Decoupling can solve the problem. The world isn't limited to sequences: beyond sequence and time there is a constant abstraction—the character's image, the IP—that can be exploited. This is becoming increasingly clear. Take, for example, the digital IP character Maria (Xiao Ya), whom I created with AIGC txt2img more than two years ago:

For those who aren't fans, my numerous Maria videos may well cause aesthetic fatigue—someone even called her "Dr. Li's fairy" (LOL). But there are indeed fans; several of my old classmates are among them.

Why? Because she is an IP, and she has been decoupled.

 

Related Links (original posts in Chinese):

视觉模型生成的极限对齐 (The Limits of Alignment for Visual Model Generation)

解耦才能解套:再谈视频中的人物一致性问题 (Only Decoupling Can Untangle It: More on Character Consistency in Video)

 

Posted by

Li Wei (立委)

Dr. Wei Li, VP at 问问, focused on large models and their applications. Former Chief Scientist at Netbase for 10 years, where he directed the development of language understanding and application systems for 18 languages—robust, running at line speed, scaling up to social-media big data, and semantically grounded in sentiment-mining products—making it a front-runner in industrial NLP deployment in the US. Former VP of R&D at Cymfony for eight years, winner of first place in the first question-answering evaluation (TREC-8 QA Track) and PI for 17 SBIR information-extraction projects.
