A review of Oasis: A Universe in a Transformer, and Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion
World simulators are explorable, interactive systems or models that mimic the real world. Advanced video generation models can function as world simulators, but to do so they need low latency in responding to input actions and the ability to generate long sequences. Long-sequence generation involves three capabilities: generating long videos in the first place, preventing error accumulation, and preserving long-term context. This post mainly focuses on how the related project Oasis: A Universe in a Transformer and Diffusion Forcing address these challenges.
Training and sampling of video diffusion models. Darker tokens have higher noise levels. |
Videos are sequential data. However, conventional video diffusion models are trained and run to denoise all tokens at a single shared noise level, treating each video clip as one object. This section reviews approaches to generating long videos with such VDMs.
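To make the shared-noise-level setup concrete, here is a minimal sketch of how a standard video diffusion model would noise a training clip. The toy linear schedule and tensor shapes are my assumptions, not any specific paper's design; the point is only that one timestep is drawn per clip and broadcast over every frame.

```python
import numpy as np

def noise_full_sequence(video, rng, timesteps=1000):
    """Full-sequence video diffusion training: the WHOLE clip shares one
    noise level, so the model treats the clip as a single object.
    video: (frames, channels, height, width). Toy linear schedule (assumption).
    """
    t = rng.integers(0, timesteps)            # one timestep for the entire clip
    alpha = 1.0 - t / timesteps               # signal fraction at this timestep
    noise = rng.standard_normal(video.shape)
    noisy = np.sqrt(alpha) * video + np.sqrt(1.0 - alpha) * noise
    return noisy, t

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 3, 16, 16))   # 8-frame clip
noisy, t = noise_full_sequence(video, rng)
```

Because `t` is a single scalar, every frame in `noisy` is corrupted to the same degree, which is exactly why such models cannot naturally condition a noisy future frame on an already-clean past frame.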
Conventional approaches are not appropriate for world simulators!
Oasis
Diffusion Forcing trains models to denoise tokens with independent noise levels, and the sampling noise schedule is then chosen carefully depending on the purpose. This training is cheaper than next-token prediction in the video domain, and the complexity added by independent noise levels is not excessive, since it applies only along the temporal dimension.
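The training change can be sketched as follows: instead of one timestep per clip, draw an independent timestep per frame. The schedule and shapes are again illustrative assumptions; only the per-frame `t` is the essential difference from the full-sequence sketch.

```python
import numpy as np

def noise_per_frame(video, rng, timesteps=1000):
    """Diffusion Forcing training: each frame gets an INDEPENDENT noise
    level, so one clip exposes the model to many partial-denoising patterns.
    video: (frames, channels, height, width). Toy linear schedule (assumption).
    """
    f = video.shape[0]
    t = rng.integers(0, timesteps, size=f)    # one timestep PER FRAME
    alpha = (1.0 - t / timesteps).reshape(f, 1, 1, 1)  # broadcast per frame
    noise = rng.standard_normal(video.shape)
    noisy = np.sqrt(alpha) * video + np.sqrt(1.0 - alpha) * noise
    return noisy, t

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 3, 16, 16))
noisy, t = noise_per_frame(video, rng)
```

A frame with a small `t` stays nearly clean while its neighbor may be pure noise, which is what later lets sampling condition noisy future frames on (nearly) clean past frames.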
Training in Diffusion Forcing |
Sampling in Diffusion Forcing |
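One concrete instance of a purpose-chosen sampling schedule is a "pyramid": earlier frames are denoised ahead of later ones, so each frame is generated while conditioning on partially denoised history. The exact offsets below are my illustrative assumption, not the paper's published schedule.

```python
import numpy as np

def pyramid_schedule(frames, steps):
    """Return a (steps, frames) array of noise levels in [0, 1] (0 = clean).
    Frame k lags frame k-1 by one unit of denoising progress, producing a
    staircase/pyramid pattern (illustrative assumption). Requires steps > frames.
    """
    sched = np.zeros((steps, frames))
    span = steps - frames                     # denoising steps spent per frame
    for s in range(steps):
        for k in range(frames):
            level = 1.0 - (s - k) / span      # later frames lag behind
            sched[s, k] = min(1.0, max(0.0, level))
    return sched

sched = pyramid_schedule(frames=4, steps=12)
```

At every step, earlier frames are at least as clean as later ones, and all frames reach noise level 0 by the final step.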
Oasis
Diffusion Forcing suggests deceiving the model into treating its generated clean tokens as slightly noisy, which prevents the model from trusting generated tokens as ground truth. However, this is out-of-distribution (OOD) inference, and there is no rule of thumb for how "slightly noisy" they should be claimed to be.
To avoid OOD inference, one may instead add a small amount of noise to the generated tokens and tell the model their true noise level. However, this approach may dilute details in the generated tokens.
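The two conditioning strategies above can be contrasted in a few lines. The `eps` noise level and the function itself are hypothetical names for illustration; the trade-off they encode is the one described in the text.

```python
import numpy as np

def condition_token(token, rng, strategy, eps=0.05):
    """Two ways to feed a generated (clean) token back as context:
    - 'pretend': keep the token clean but TELL the model it has noise
      level eps -> no detail is lost, but the (token, level) pair is OOD.
    - 'inject': actually add eps-level noise -> in-distribution, but fine
      details are diluted by the added noise.
    Returns (token_fed_to_model, noise_level_told_to_model). Illustrative.
    """
    if strategy == "pretend":
        return token, eps
    if strategy == "inject":
        noisy = np.sqrt(1.0 - eps) * token + np.sqrt(eps) * rng.standard_normal(token.shape)
        return noisy, eps
    raise ValueError(f"unknown strategy: {strategy}")

rng = np.random.default_rng(0)
tok = rng.standard_normal((3, 16, 16))
pretend_tok, lvl = condition_token(tok, rng, "pretend")
inject_tok, _ = condition_token(tok, rng, "inject")
```

With `pretend`, the tensor is bit-identical to the generated token; with `inject`, it is genuinely perturbed, so the declared noise level matches reality at the cost of detail.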
Oasis
Through the approaches above, Oasis can autoregressively generate long videos without much quality degradation. However, the model has no long-horizon memory, which leads to inconsistent videos. While there is no innovative breakthrough yet, I believe video models with long-term memory are an important next step.