Video Gallery
Video Demo
Abstract
Rethinking Controllable Generation of Long Videos
Framework
Design of the control component. We present two variant structures, (b) and (c), that integrate dense and sparse control signals, in contrast to the standard ControlNet design shown in (a).
Experiments
Quantitative results of LongVie and baselines on our LongVGenBench. DAS-LV and Depth-LV refer to the adapted versions of DAS and depth-controlled CogVideo, respectively, for long video generation. Bold indicates the best performance, and underline denotes the second-best.
Ablation study for our proposed components. The blue block denotes experiments targeting temporal consistency, while the pink block denotes those focusing on visual quality.
BibTeX
@misc{gao2025longvie,
  title={LongVie: Multimodal-Guided Controllable Ultra-Long Video Generation},
  author={Jianxiong Gao and Zhaoxi Chen and Xian Liu and Jianfeng Feng and Chenyang Si and Yanwei Fu and Yu Qiao and Ziwei Liu},
  year={2025},
  eprint={2508.03694},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2508.03694},
}