Video Demo
Abstract
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, which bridges the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
Framework
LongVie 2 serves as a controllable video world model that integrates both dense and sparse control signals to provide world-level guidance for enhanced controllability. A degradation-aware training strategy improves long-term visual quality, while tail frames from preceding clips are incorporated as historical context to maintain temporal consistency over extended durations.
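The clip-by-clip generation described above can be illustrated with a minimal sketch. It is not the LongVie 2 implementation: `rollout`, `toy_generate_clip`, and `history_len` are hypothetical names, frames are stood in for by plain floats, and the assumption that each clip starts from the last frame generated so far is ours rather than something this page states.

```python
from typing import Callable, List, Sequence

Frame = float  # placeholder; a real frame would be an image tensor


def rollout(
    generate_clip: Callable[[Frame, Sequence[float], List[Frame]], List[Frame]],
    first_frame: Frame,
    controls_per_clip: Sequence[Sequence[float]],
    history_len: int = 2,
) -> List[Frame]:
    """Generate a long video one clip at a time (illustrative only)."""
    video: List[Frame] = [first_frame]
    history: List[Frame] = []  # no history is available for the first clip
    for controls in controls_per_clip:
        # Assumption: each clip is conditioned on the last frame produced so far,
        # on its control signals, and on tail frames of the preceding clip.
        clip = generate_clip(video[-1], controls, history)
        video.extend(clip)
        # Keep only the tail frames as historical context for the next clip.
        history = video[-history_len:] if history_len > 0 else []
    return video


def toy_generate_clip(
    start_frame: Frame, controls: Sequence[float], history: List[Frame]
) -> List[Frame]:
    """Hypothetical stand-in for the video model: each control nudges the frame."""
    frames, current = [], start_frame
    for c in controls:
        current = current + c
        frames.append(current)
    return frames


video = rollout(toy_generate_clip, 0.0, [[1, 1, 1], [2, 2]], history_len=2)
```

In this toy setup the second clip starts from the last frame of the first clip and receives that clip's final two frames as history. A real model would use that history to keep appearance and motion aligned across the clip boundary.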
Training Pipeline
Training pipeline of LongVie 2. In the first stage, we train the model using a standard ControlNet-based pipeline. In the second stage, we introduce degradation to the first frame to bridge the domain gap between ground-truth frames seen during training and generated frames seen during long-term inference. In the third stage, we incorporate historical frame information to ensure temporal consistency.
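The idea behind the second stage can be sketched as follows. This page does not say which degradations are used, so the additive Gaussian noise, the application probability `prob`, and the function name `degrade_first_frame` are all assumptions chosen purely for illustration.

```python
import random
from typing import List, Optional

Image = List[List[float]]  # toy grayscale image with pixel values in [0, 1]


def degrade_first_frame(
    frame: Image,
    noise_std: float = 0.05,
    prob: float = 0.5,
    rng: Optional[random.Random] = None,
) -> Image:
    """Return a possibly degraded copy of the conditioning frame.

    Assumption: degradation is additive Gaussian noise applied with
    probability `prob`; the actual degradations may differ.
    """
    if rng is None:
        rng = random.Random()
    if rng.random() >= prob:
        # Leave the frame clean for this training sample.
        return [row[:] for row in frame]
    # Perturb every pixel and clamp back into the valid range.
    return [
        [min(1.0, max(0.0, px + rng.gauss(0.0, noise_std))) for px in row]
        for row in frame
    ]
```

During training, the model would then be conditioned on the output of `degrade_first_frame` rather than on the clean ground-truth frame. This is meant to make it more tolerant of the imperfect frames it receives as input during long autoregressive inference.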
Generated Videos
Character World
Spring
Autumn
Winter
Environment World
Spring
Autumn
Winter
2-Minute Generation
Cutting Vegetables
Driving on Road
Walking at Night
5-Minute Ultra-Long Generation
Chemical Factory
Island Town
Skateboard
Quantitative Results
Quantitative comparison of LongVie 2 and baselines on LongVGenBench. We compare LongVie 2 with base diffusion models, controllable generation models, and world models, evaluating visual quality, controllability, and long-term consistency. Bold indicates the best result, and underline denotes the second-best.