
LongVie : Multimodal-Guided Controllable
Ultra-Long Video Generation

1 Nanjing University    2 Fudan University    3 S-Lab, Nanyang Technological University   
4 Nvidia    5 Shanghai Artificial Intelligence Laboratory   

Video Gallery


Video Demo

Abstract


LongVie is a controllable ultra-long video generation framework guided by both dense and sparse control signals, with a degradation-aware training strategy that balances the contributions of the two modalities. It applies global normalization to the control signals and uses unified noise initialization to autoregressively generate videos up to one minute long.
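The two consistency mechanisms named above can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' implementation: `generate_clip` is a hypothetical stand-in for the underlying video diffusion model, and the exact normalization and conditioning details of LongVie may differ. The key ideas shown are (1) normalizing control signals with statistics computed over the whole video rather than per clip, and (2) drawing the initial noise once from a single seeded generator so every clip starts from the same noise.

```python
import numpy as np

def normalize_controls_globally(controls):
    """Normalize control signals (e.g., depth maps) with statistics computed
    over the WHOLE video rather than per clip, so their scale and offset stay
    consistent across autoregressively generated segments."""
    stacked = np.concatenate([c.ravel() for c in controls])
    mean, std = stacked.mean(), stacked.std() + 1e-8
    return [(c - mean) / std for c in controls]

def generate_long_video(controls_per_clip, generate_clip, seed=0):
    """Clip-by-clip autoregressive generation: one shared noise tensor
    (unified noise initialization) and each clip conditioned on the last
    frame of the previous clip."""
    controls_per_clip = normalize_controls_globally(controls_per_clip)
    rng = np.random.default_rng(seed)               # single seeded RNG
    noise = rng.standard_normal(controls_per_clip[0].shape)
    video, last_frame = [], None
    for controls in controls_per_clip:
        clip = generate_clip(noise, controls, first_frame=last_frame)
        last_frame = clip[-1]                       # chains the next clip
        video.extend(clip)
    return video
```

Because the normalization statistics and the initial noise are fixed once for the entire video, every generated clip sees conditioning signals on the same scale, which is what prevents the per-clip drift that causes temporal inconsistency.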

Rethinking Controllable Generation of Long Videos


  • When directly applying current controllable video generation models to long videos, we identify two main issues: temporal inconsistency and visual degradation.
  • Temporal inconsistency: The video often exhibits fine-grained discontinuities and abrupt changes between frames.
  • Visual degradation: The video may suffer from color drifting and reduced visual fidelity, leading to artifacts such as distorted railings or water ripples.
Framework


  • To address these challenges, we propose LongVie, the first autoregressive framework for controllable long video generation.
  • To enhance visual quality, we introduce a multi-modal control mechanism that combines dense and sparse control signals to leverage their complementary strengths. Additionally, we adopt a degradation-aware training strategy to balance their contributions effectively.
  • To improve temporal consistency, we incorporate unified noise initialization and global control signal normalization to ensure world-consistent generative dynamics across time steps.
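One way to implement a degradation-aware training strategy like the one described above is to randomly corrupt the dense control signal during training so the model cannot over-rely on it and the sparse branch keeps contributing. The snippet below is a hedged sketch under that assumption; the specific degradations LongVie applies are not spelled out here, so the random downsample-and-upsample used as the corruption is purely illustrative.

```python
import numpy as np

def degrade_dense_control(depth, rng, max_down=4):
    """Randomly downsample then nearest-neighbor upsample a dense control
    map (e.g., depth) as a training-time degradation. A factor of 1 leaves
    the map untouched, so the model still sees clean controls sometimes."""
    f = int(rng.integers(1, max_down + 1))  # random downsampling factor
    if f == 1:
        return depth.copy()
    small = depth[::f, ::f]                 # coarse version of the control
    up = np.repeat(np.repeat(small, f, axis=0), f, axis=1)
    return up[:depth.shape[0], :depth.shape[1]]
```

Applying such a degradation on a random subset of training steps forces the model to fall back on the sparse control signal when the dense one is unreliable, which is the balancing effect the training strategy is after.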

    For details on the design of the control component, we present two variant structures—(b) and (c)—that integrate dense and sparse control signals, in contrast to the standard ControlNet design shown in (a).

    Experiments


    Quantitative results of LongVie and baselines on our LongVGenBench. DAS-LV and Depth-LV refer to the adapted versions of DAS and depth-controlled CogVideo, respectively, for long video generation. Bold indicates the best performance, and underline denotes the second-best.


Ablation study of our proposed components. The blue block denotes experiments targeting temporal consistency, while the pink block denotes those focusing on visual quality.

    BibTeX

          @misc{gao2025longvie,
            title={LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation}, 
            author={Jianxiong Gao and Zhaoxi Chen and Xian Liu and Jianfeng Feng and Chenyang Si and Yanwei Fu and Yu Qiao and Ziwei Liu},
            year={2025},
            eprint={2508.03694},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2508.03694}, 
          }