LongVie 2

Video Demo

Abstract

Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multi-modal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, which bridges the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.

LongVie 2 is a controllable ultra-long video world model that autoregressively generates videos lasting up to 3–5 minutes. It is driven by world-level guidance integrating both dense and sparse control signals, trained with a degradation-aware strategy to bridge the gap between training and long-term inference, and enhanced with history-context modeling to maintain long-term temporal consistency.

Framework

LongVie 2 serves as a controllable video world model that integrates both dense and sparse control signals to provide world-level guidance for enhanced controllability. A degradation-aware training strategy improves long-term visual quality, while tail frames from preceding clips are incorporated as historical context to maintain temporal consistency over extended durations.
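The clip-by-clip generation with historical context described above can be illustrated with a minimal sketch. It is not the LongVie 2 implementation: the function `autoregressive_rollout`, the `generate_clip` callback, the `history_len` parameter, and the integer stand-ins for frames are all hypothetical names chosen for illustration. It only shows the control flow in which each clip starts from the last generated frame and receives the tail frames of the video so far as history.

```python
from typing import Callable, List, Sequence, Tuple

Frame = int  # placeholder: a real frame would be an image array or latent tensor
Controls = Tuple[str, str]  # placeholder for one clip's (dense, sparse) control signals


def autoregressive_rollout(
    first_frame: Frame,
    controls_per_clip: Sequence[Controls],
    generate_clip: Callable[[Frame, List[Frame], Controls], List[Frame]],
    history_len: int = 2,
) -> List[Frame]:
    """Generate a long video one clip at a time (hypothetical sketch).

    Each call to `generate_clip` receives the last frame generated so far,
    the tail frames of the video as history, and that clip's controls.
    """
    video: List[Frame] = [first_frame]
    history: List[Frame] = []  # the first clip has no preceding clip
    for controls in controls_per_clip:
        clip = generate_clip(video[-1], history, controls)
        video.extend(clip)
        # Keep only the tail frames as context for the next clip.
        history = video[-history_len:] if history_len > 0 else []
    return video


def toy_generator(start: Frame, history: List[Frame], controls: Controls) -> List[Frame]:
    # Toy stand-in for a video model: the next three "frames" are consecutive integers.
    return [start + 1, start + 2, start + 3]


frames = autoregressive_rollout(0, [("dense0", "sparse0"), ("dense1", "sparse1")], toy_generator)
print(frames)
```

In this toy setting the second clip is conditioned on the last two frames of the first clip, mirroring how tail frames carry context across clip boundaries.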

Training Pipeline

Training pipeline of LongVie 2. We first train the model using a standard ControlNet-based pipeline. In the second stage, we introduce degradation to the first frame to bridge the domain gap between ground truth and generated frames. Finally, to ensure temporal consistency, we incorporate historical frame information.
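As a rough illustration of the second stage, the sketch below corrupts the conditioning first frame during training so that it resembles an imperfect generated frame rather than a clean ground-truth one. The page does not specify which degradation is used, so the choice here (additive Gaussian noise applied with some probability) is purely an assumption, as are the names `degrade_first_frame`, `noise_std`, and `prob`.

```python
import random
from typing import List

Image = List[List[float]]  # grayscale pixels in [0, 1]; a stand-in for a real frame


def degrade_first_frame(
    frame: Image, noise_std: float, prob: float, rng: random.Random
) -> Image:
    """Return a possibly degraded copy of `frame` (hypothetical sketch).

    With probability `prob`, Gaussian noise is added to every pixel and the
    result is clamped to [0, 1]; otherwise an unchanged copy is returned.
    Gaussian noise is an assumed degradation, not necessarily the one used
    by LongVie 2.
    """
    if rng.random() >= prob:
        return [row[:] for row in frame]
    return [
        [min(1.0, max(0.0, pixel + rng.gauss(0.0, noise_std))) for pixel in row]
        for row in frame
    ]


rng = random.Random(0)
clean = [[0.5, 0.5], [0.5, 0.5]]
noisy = degrade_first_frame(clean, noise_std=0.1, prob=1.0, rng=rng)
print(noisy)
```

Applying the degradation only some of the time (via `prob`) would let the model still see clean first frames, which matches the first clip at inference time; whether LongVie 2 does this is not stated on the page.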

Generated Videos

Character World

Spring

Autumn

Winter

Environment World

Spring

Autumn

Winter

2-Minute Generation

Cutting Vegetables

Driving on Road

Walking at Night

5-Minute Ultra-Long Generation

Chemical Factory

Island Town

Skateboard

Quantitative Results

Quantitative comparison of LongVie 2 and baselines on LongVGenBench. We compare LongVie 2 with base diffusion models, controllable generation models, and world models on LongVGenBench, evaluating visual quality, controllability, and long-term consistency. Bold indicates the best performance, and underline denotes the second-best.