VEnhancer

Generative Space-Time Enhancement for Video Generation

Jingwen He1,2,   Tianfan Xue1,   Dongyang Liu2,   Xinqi Lin2,   Peng Gao2,  
Dahua Lin1,   Yu Qiao2,   Wanli Ouyang1,2†,   Ziwei Liu3†
1The Chinese University of Hong Kong,   2Shanghai Artificial Intelligence Laboratory,  
3S-Lab, Nanyang Technological University
†Corresponding authors

Abstract

We present VEnhancer, a generative space-time enhancement framework that improves existing text-to-video results by adding details in the spatial domain and synthesizing detailed motion in the temporal domain. Given a generated low-quality video, our approach increases its spatial and temporal resolution simultaneously, with arbitrary up-sampling scales in space and time, through a unified video diffusion model. Furthermore, VEnhancer effectively removes spatial artifacts and temporal flickering in generated videos. To achieve this, building on a pretrained video diffusion model, we train a video ControlNet and inject it into the diffusion model as a condition on low-frame-rate, low-resolution videos. To train this video ControlNet effectively, we design space-time data augmentation as well as video-aware conditioning. Thanks to these designs, VEnhancer is stable during training and admits an elegant end-to-end training scheme. Extensive experiments show that VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos. Moreover, with VEnhancer, the existing open-source state-of-the-art text-to-video method VideoCrafter-2 reaches first place on the video generation benchmark VBench.
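As a rough illustration of the space-time data augmentation idea, the sketch below builds a degraded conditioning clip from a high-quality one via random spatial downscaling, random temporal subsampling of key frames, and noise augmentation. The function name, tensor layout, and factor ranges are illustrative assumptions, not the paper's exact recipe.

import torch
import torch.nn.functional as F

def space_time_augment(video, max_scale=8.0, max_stride=4, sigma_max=0.5):
    # video: (B, C, T, H, W) float tensor. Returns the degraded key-frame
    # clip together with the sampled downscaling factor s, temporal stride,
    # and noise level sigma (all ranges here are assumptions).
    b, c, t, h, w = video.shape

    # Random spatial downscaling factor s.
    s = float(torch.empty(1).uniform_(1.0, max_scale))
    low_h, low_w = max(int(h / s), 8), max(int(w / s), 8)

    # Random temporal subsampling: keep every stride-th frame as a key frame.
    stride = int(torch.randint(1, max_stride + 1, (1,)))
    key_frames = video[:, :, ::stride]
    tk = key_frames.shape[2]

    # Bilinear downscaling, applied per frame.
    frames = key_frames.permute(0, 2, 1, 3, 4).reshape(b * tk, c, h, w)
    frames = F.interpolate(frames, size=(low_h, low_w), mode="bilinear",
                           align_corners=False)
    low = frames.reshape(b, tk, c, low_h, low_w).permute(0, 2, 1, 3, 4)

    # Noise augmentation: perturb the condition and expose sigma to the model.
    sigma = float(torch.empty(1).uniform_(0.0, sigma_max))
    low = low + sigma * torch.randn_like(low)
    return low, s, stride, sigma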

Method

The architecture of VEnhancer. Following ControlNet, it copies the architecture and weights of the multi-frame encoder and middle block of a pretrained video diffusion model to build a trainable condition network. This video ControlNet accepts low-resolution key frames as well as full frames of noisy latents as inputs. In addition to the timestep $t$ and prompt $c_{text}$, the noise level $\sigma$ used for noise augmentation and the downscaling factor $s$ serve as further conditioning signals for the network.
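A minimal PyTorch sketch of this condition network is given below, under stated assumptions: the stand-in encoder/middle blocks are placeholders for the pretrained UNet modules, and all module names and dimensions are hypothetical. It shows the two ControlNet ingredients described above: trainable copies of the pretrained blocks joined through zero-initialized connections, and embeddings of $\sigma$ and $s$ alongside the timestep $t$ (the text prompt would enter through cross-attention inside the copied blocks).

import copy
import math
import torch
import torch.nn as nn

def zero_module(m):
    # ControlNet-style zero initialization: the condition branch adds
    # nothing at the start of training, preserving pretrained behavior.
    for p in m.parameters():
        nn.init.zeros_(p)
    return m

def sinusoidal_embedding(x, dim=256):
    # Standard sinusoidal embedding, reused here for t, sigma, and s.
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.cos(args), torch.sin(args)], dim=-1)

def embed_mlp(in_dim=256, emb_dim=1024):
    return nn.Sequential(nn.Linear(in_dim, emb_dim), nn.SiLU(),
                         nn.Linear(emb_dim, emb_dim))

class VideoControlNet(nn.Module):
    # Condition network built from trainable copies of the pretrained video
    # UNet's multi-frame encoder and middle block (passed in as modules).
    def __init__(self, encoder, middle_block, latent_ch=4, width=320):
        super().__init__()
        self.encoder = copy.deepcopy(encoder)            # trainable copy
        self.middle_block = copy.deepcopy(middle_block)  # trainable copy
        self.latent_in = nn.Conv3d(latent_ch, width, 3, padding=1)
        # Zero-initialized projection of the (upsampled) low-resolution,
        # low-frame-rate condition video into feature space.
        self.cond_in = zero_module(nn.Conv3d(latent_ch, width, 3, padding=1))
        # Video-aware conditioning: one embedding MLP each for the timestep
        # t, noise level sigma, and downscaling factor s; summed below.
        self.t_mlp, self.sigma_mlp, self.s_mlp = embed_mlp(), embed_mlp(), embed_mlp()
        # Zero-initialized output residual fed to the frozen UNet decoder.
        self.out = zero_module(nn.Conv3d(width, width, 1))

    def forward(self, noisy_latents, cond_video, t, sigma, s):
        # In the real model, emb (and the text prompt, via cross-attention)
        # is consumed inside every block of the copied encoder/middle;
        # the stand-in blocks used below simply ignore it.
        emb = (self.t_mlp(sinusoidal_embedding(t))
               + self.sigma_mlp(sinusoidal_embedding(sigma))
               + self.s_mlp(sinusoidal_embedding(s)))
        h = self.latent_in(noisy_latents) + self.cond_in(cond_video)
        h = self.middle_block(self.encoder(h))
        return self.out(h), emb  # residual features for the frozen decoder

# Usage with stand-in blocks (VEnhancer copies these from the pretrained UNet):
encoder = nn.Sequential(nn.Conv3d(320, 320, 3, padding=1), nn.SiLU())
middle = nn.Sequential(nn.Conv3d(320, 320, 3, padding=1), nn.SiLU())
net = VideoControlNet(encoder, middle)
z = torch.randn(1, 4, 8, 32, 32)     # noisy latents, full frame rate
cond = torch.randn(1, 4, 8, 32, 32)  # degraded key frames, aligned to z
res, emb = net(z, cond, t=torch.tensor([500.0]),
               sigma=torch.tensor([0.1]), s=torch.tensor([4.0]))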

VEnhancer Demo

BibTeX

@article{he2024venhancer,
  title={VEnhancer: Generative Space-Time Enhancement for Video Generation},
  author={He, Jingwen and Xue, Tianfan and Liu, Dongyang and Lin, Xinqi and Gao, Peng and Lin, Dahua and Qiao, Yu and Ouyang, Wanli and Liu, Ziwei},
  journal={arXiv preprint arXiv:2407.07667},
  year={2024}
}