RepVideo: Rethinking Cross-Layer Representation for Video Generation

¹S-Lab, Nanyang Technological University      ²Shanghai Artificial Intelligence Laboratory
Equal contribution.    Corresponding author.

Abstract

Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has focused primarily on scaling up model training, while offering limited insight into the direct impact of representations on the video generation process. In this paper, we first investigate the characteristics of features in intermediate layers, finding substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and negatively affect temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that RepVideo not only significantly enhances the ability to generate accurate spatial appearances, such as capturing complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.

Methodology

We investigate the transformer representations in video diffusion models and find that substantial variations in attention maps across layers lead to fragmented spatial semantics and reduced temporal consistency, which degrade video quality. Building on this analysis, we propose RepVideo, a framework that uses a feature cache module and a gating mechanism to aggregate and stabilize intermediate representations, enhancing both spatial detail and temporal coherence.
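A minimal PyTorch sketch of this aggregation idea is given below. The module and parameter names (CrossLayerAggregator, window, gate_proj) are our own illustrative choices, not the authors' released code: it caches the hidden states of the most recent transformer layers, averages them into a more stable representation, and blends that with the current layer's input through a learned gate before attention.

import torch
import torch.nn as nn

class CrossLayerAggregator(nn.Module):
    """Sketch of RepVideo-style cross-layer feature aggregation.

    Caches hidden states from the most recent `window` transformer
    layers, averages them into a more stable semantic representation,
    and blends that with the current layer's input via a learned gate.
    All names are illustrative, not the authors' released code.
    """

    def __init__(self, dim: int, window: int = 4):
        super().__init__()
        self.window = window
        self.cache: list[torch.Tensor] = []  # hidden states of recent layers
        # The gate predicts a per-token mixing weight in [0, 1].
        self.gate_proj = nn.Linear(2 * dim, dim)

    def reset(self):
        # Clear cached states at the start of each denoising step.
        self.cache.clear()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, tokens, dim), the input to this layer's attention.
        self.cache.append(hidden)
        if len(self.cache) > self.window:
            self.cache.pop(0)
        # Average features accumulated from neighboring layers.
        aggregated = torch.stack(self.cache, dim=0).mean(dim=0)
        # The gate decides, per token, how much aggregated context to inject.
        gate = torch.sigmoid(self.gate_proj(torch.cat([hidden, aggregated], dim=-1)))
        return gate * aggregated + (1.0 - gate) * hidden

Averaging over a sliding window of neighboring layers, rather than over all layers, keeps the aggregated feature close in distribution to the current layer while still smoothing out layer-to-layer fluctuations.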

[Figure: attention difference comparison across layers]

How RepVideo Improves Spatial Appearance

The RepVideo model consistently captures richer semantic information and maintains more coherent spatial details as the layers deepen.


The attention maps of RepVideo highlight subject boundaries more clearly than those of CogVideoX, showing that the aggregated features strengthen attention on salient spatial regions. This reduces inter-layer variability, preserves critical spatial information, and improves the model's ability to generate visually consistent scenes aligned with input prompts.
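As a rough diagnostic of this inter-layer variability (our own measurement sketch, not the paper's exact protocol, assuming attention maps of shape (batch, heads, queries, keys)), one can compare the head-averaged attention maps of adjacent layers:

import torch

def interlayer_attention_difference(attn_maps: list[torch.Tensor]) -> list[float]:
    """Mean absolute difference between attention maps of adjacent layers.

    attn_maps holds one tensor per layer, each of shape
    (batch, heads, queries, keys). Maps are averaged over heads before
    comparison; larger values indicate less stable attention across layers.
    """
    diffs = []
    for prev, curr in zip(attn_maps[:-1], attn_maps[1:]):
        diff = (curr.mean(dim=1) - prev.mean(dim=1)).abs().mean()
        diffs.append(diff.item())
    return diffs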


How RepVideo Improves Temporal Consistency

RepVideo achieves consistently higher cosine similarity between adjacent-frame features across all layers than CogVideoX-2B. As a result, RepVideo produces videos with smooth transitions and coherent motion, even in complex scenarios involving dynamic objects or environments.
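This per-layer statistic is straightforward to reproduce. The sketch below assumes layer features laid out as (frames, tokens, dim), which is our convention rather than anything prescribed by the paper:

import torch
import torch.nn.functional as F

def adjacent_frame_similarity(features: torch.Tensor) -> float:
    """Average cosine similarity between features of consecutive frames.

    features: (frames, tokens, dim), one layer's hidden states for a
    single video. Values close to 1 indicate temporally consistent
    representations; compute this per layer for the comparison above.
    """
    sim = F.cosine_similarity(features[:-1], features[1:], dim=-1)  # (frames-1, tokens)
    return sim.mean().item()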


Results

Quantitative Evaluation

Table I reports the Total Score together with key metrics: Motion Smoothness (temporal stability), Object Class and Multiple Objects (object diversity and clarity), and Spatial Relationship (coherence of object positioning). Human evaluations (Table II) show that our model surpasses state-of-the-art methods, with win ratios above 50% on all metrics, indicating superior semantic alignment, smoother transitions, and better visual quality.


Qualitative Evaluation

Comparison of our visual results with CogVideoX-2B (left: CogVideoX-2B; right: RepVideo).

More visual results

BibTeX

@article{si2025repvideo,
  author  = {Si, Chenyang and Fan, Weichen and Lv, Zhengyao and Huang, Ziqi and Qiao, Yu and Liu, Ziwei},
  title   = {RepVideo: Rethinking Cross-Layer Representation for Video Generation},
  journal = {arXiv preprint},
  year    = {2025},
}