
LongVie : Multimodal-Guided Controllable
Ultra-Long Video Generation

1 Nanjing University    2 Fudan University    3 S-Lab, Nanyang Technological University   
4 Nvidia    5 Shanghai Artificial Intelligence Laboratory   

Video Gallery


Video Demo

Abstract


LongVie is a controllable ultra-long video generation framework guided by both dense and sparse control signals, with a degradation-aware training strategy that balances the contributions of the two modalities. It applies global normalization to the control signals and uses unified noise initialization to autoregressively generate videos up to one minute long.
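The two consistency mechanisms named above can be sketched in a few lines. The snippet below is an illustrative toy, not the authors' implementation: `generate_clip` is a hypothetical stand-in for the underlying video diffusion model, and the exact normalization and conditioning details of LongVie may differ. The key ideas shown are (1) normalizing control signals with statistics computed over the whole video rather than per clip, and (2) drawing the initial noise once from a single seeded generator so every clip starts from the same noise.

```python
import numpy as np

def normalize_controls_globally(controls):
    """Normalize control signals (e.g., depth maps) with statistics computed
    over the WHOLE video rather than per clip, so their scale and offset stay
    consistent across autoregressively generated segments."""
    stacked = np.concatenate([c.ravel() for c in controls])
    mean, std = stacked.mean(), stacked.std() + 1e-8
    return [(c - mean) / std for c in controls]

def generate_long_video(controls_per_clip, generate_clip, seed=0):
    """Clip-by-clip autoregressive generation: one shared noise tensor
    (unified noise initialization) and each clip conditioned on the last
    frame of the previous clip."""
    controls_per_clip = normalize_controls_globally(controls_per_clip)
    rng = np.random.default_rng(seed)               # single seeded RNG
    noise = rng.standard_normal(controls_per_clip[0].shape)
    video, last_frame = [], None
    for controls in controls_per_clip:
        clip = generate_clip(noise, controls, first_frame=last_frame)
        last_frame = clip[-1]                       # chains the next clip
        video.extend(clip)
    return video
```

Because the normalization statistics and the initial noise are fixed once for the entire video, every generated clip sees conditioning signals on the same scale, which is what prevents the per-clip drift that causes temporal inconsistency.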

Rethinking Controllable Generation of Long Videos


  • When directly applying current controllable video generation models to long videos, we identify two main issues: temporal inconsistency and visual degradation.
  • Temporal inconsistency: The video often exhibits fine-grained discontinuities and abrupt changes between frames.
  • Visual degradation: The video may suffer from color drifting and reduced visual fidelity, leading to artifacts such as distorted railings or water ripples.
Framework


  • To address these challenges, we propose LongVie, the first autoregressive framework for controllable long video generation.
  • To enhance visual quality, we introduce a multi-modal control mechanism that combines dense and sparse control signals to leverage their complementary strengths. Additionally, we adopt a degradation-aware training strategy to balance their contributions effectively.
  • To improve temporal consistency, we incorporate unified noise initialization and global control signal normalization to ensure world-consistent generative dynamics across time steps.
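One way to implement a degradation-aware training strategy like the one described above is to randomly corrupt the dense control signal during training so the model cannot over-rely on it and the sparse branch keeps contributing. The snippet below is a hedged sketch under that assumption; the specific degradations LongVie applies are not spelled out here, so the random downsample-and-upsample used as the corruption is purely illustrative.

```python
import numpy as np

def degrade_dense_control(depth, rng, max_down=4):
    """Randomly downsample then nearest-neighbor upsample a dense control
    map (e.g., depth) as a training-time degradation. A factor of 1 leaves
    the map untouched, so the model still sees clean controls sometimes."""
    f = int(rng.integers(1, max_down + 1))  # random downsampling factor
    if f == 1:
        return depth.copy()
    small = depth[::f, ::f]                 # coarse version of the control
    up = np.repeat(np.repeat(small, f, axis=0), f, axis=1)
    return up[:depth.shape[0], :depth.shape[1]]
```

Applying such a degradation on a random subset of training steps forces the model to fall back on the sparse control signal when the dense one is unreliable, which is the balancing effect the training strategy is after.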

    For details on the design of the control component, we present two variant structures—(b) and (c)—that integrate dense and sparse control signals, in contrast to the standard ControlNet design shown in (a).

    Experiments


    Quantitative results of LongVie and baselines on our LongVGenBench. DAS-LV and Depth-LV refer to the adapted versions of DAS and depth-controlled CogVideo, respectively, for long video generation. Bold indicates the best performance, and underline denotes the second-best.


Ablation study of our proposed components. The blue block denotes experiments targeting temporal consistency, while the pink block denotes those focusing on visual quality.

    BibTeX

          @misc{gao2025longvie,
            title={LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation}, 
            author={Jianxiong Gao and Zhaoxi Chen and Xian Liu and Jianfeng Feng and Chenyang Si and Yanwei Fu and Yu Qiao and Ziwei Liu},
            year={2025},
            eprint={2508.03694},
            archivePrefix={arXiv},
            primaryClass={cs.CV},
            url={https://arxiv.org/abs/2508.03694}, 
          }