Video generation has achieved remarkable progress with the introduction of diffusion models, which have significantly improved the quality of generated videos. However, recent research has primarily focused on scaling up model training, while offering limited insight into the direct impact of representations on the video generation process. In this paper, we first investigate the characteristics of features in intermediate layers and find substantial variations in attention maps across different layers. These variations lead to unstable semantic representations and contribute to cumulative differences between features, which ultimately reduce the similarity between adjacent frames and harm temporal coherence. To address this, we propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. These enhanced representations are then used as inputs to the attention mechanism, improving semantic expressiveness while ensuring feature consistency across adjacent frames. Extensive experiments demonstrate that RepVideo not only significantly improves the generation of accurate spatial appearances, such as complex spatial relationships between multiple objects, but also improves temporal consistency in video generation.
We investigate the transformer representations in video diffusion models, revealing that substantial variations in attention maps across layers lead to fragmented spatial semantics and reduced temporal consistency, which negatively impact video quality. Further, we propose RepVideo, a framework that leverages a feature cache module and a gating mechanism to aggregate and stabilize intermediate representations, enhancing both spatial detail and temporal coherence.
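The aggregation idea described above can be illustrated with a minimal sketch. It is not the released implementation: the `FeatureCache` class, the `gated_mix` function, the mean aggregation over a fixed window, and the scalar `gate` are all assumptions made for illustration, and plain Python lists stand in for tensors so the example runs without dependencies.

```python
from collections import deque


class FeatureCache:
    """Hypothetical cache holding the outputs of the most recent `window` layers."""

    def __init__(self, window):
        # deque with maxlen drops the oldest layer automatically.
        self.buffer = deque(maxlen=window)

    def push(self, features):
        # features: list of token vectors, each a list of floats.
        self.buffer.append(features)

    def mean(self):
        # Element-wise mean over all cached layers.
        n = len(self.buffer)
        tokens = len(self.buffer[0])
        dim = len(self.buffer[0][0])
        return [
            [sum(layer[t][d] for layer in self.buffer) / n for d in range(dim)]
            for t in range(tokens)
        ]


def gated_mix(current, aggregated, gate):
    """Blend the current layer input with the aggregated features.

    gate = 0 keeps the original input; gate = 1 uses only the aggregate.
    A scalar gate is an assumption; a learned gate could take other forms.
    """
    return [
        [(1.0 - gate) * c + gate * a for c, a in zip(cur_tok, agg_tok)]
        for cur_tok, agg_tok in zip(current, aggregated)
    ]


cache = FeatureCache(window=2)
layer_prev = [[1.0, 2.0], [3.0, 4.0]]  # toy output of layer i-1 (2 tokens, 2 channels)
layer_curr = [[3.0, 4.0], [5.0, 6.0]]  # toy output of layer i
cache.push(layer_prev)
cache.push(layer_curr)

aggregated = cache.mean()
mixed = gated_mix(layer_curr, aggregated, gate=0.5)
```

With these toy values, `mixed` lies halfway between the current layer's features and the two-layer mean, which is the smoothing effect the aggregation is meant to provide.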
The RepVideo model consistently captures richer semantic information and maintains more coherent spatial details as the layers deepen.
The attention maps of RepVideo highlight subject boundaries more clearly than those of CogVideoX, showing that aggregated features strengthen spatial regions. This reduces inter-layer variability, preserves critical spatial information, and improves the model’s ability to generate visually consistent scenes aligned with input prompts.
RepVideo achieves consistently higher cosine similarity between the features of adjacent frames than CogVideoX-2B across all layers. As a result, RepVideo produces videos with smooth transitions and coherent motion, even in complex scenarios involving dynamic objects or environments.
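As a rough illustration of this kind of measurement, the sketch below averages the cosine similarity between each frame's feature vector and the next one. The function names and the toy feature vectors are hypothetical, and the exact protocol used for the reported scores (which features, how they are pooled) is not specified here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adjacent_frame_similarity(frames):
    """Mean cosine similarity between each frame's features and the next frame's."""
    sims = [cosine_similarity(a, b) for a, b in zip(frames, frames[1:])]
    return sum(sims) / len(sims)


# Toy per-frame feature vectors (one vector per frame).
stable = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]    # features barely change
drifting = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]  # features flip every frame

score_stable = adjacent_frame_similarity(stable)
score_drifting = adjacent_frame_similarity(drifting)
```

Computing this score once per layer gives a curve over depth; a higher score means the features of consecutive frames stay closer together.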
Table I presents the Total Score and key metrics: Motion Smoothness (temporal stability), Object Class and Multiple Objects (diversity and clarity), and Spatial Relationship (coherence in positioning). Human evaluations (Table II) show our model surpasses state-of-the-art methods, with a win ratio over 50% across all metrics, highlighting superior semantic alignment, smoother transitions, and better visual quality.
Comparison of our visual results with CogVideoX-2B (left: CogVideoX-2B; right: RepVideo).
A litter of golden retriever puppies playing in snow. Their heads pop out of snow, covered in.
A serene waterfall cascading down moss-covered rocks, its soothing sound creating a harmonious symphony with nature.
A lone adventurer, clad in a bright red life jacket and a wide-brimmed hat, paddles a sleek, yellow kayak through a serene, crystal-clear lake surrounded by towering pine trees and majestic mountains.
Extreme close up of a 24 year old woman’s eye blinking, standing in Marrakech during magic hour, cinematic film shot in 70mm, depth of field, vivid colors, cinematic
Camera follows behind a white vintage SUV with a black roof rack as it speeds up a steep dirt road surrounded by pine trees on a steep mountain slope, dust kicks up from its tires, sunlight shines on the SUV as it speeds along the dirt road, casting a warm glow over scene
A soaring drone footage captures the majestic beauty of a coastal cliff, its red and yellow stratified rock faces rich in color and against the vibrant turquoise of the sea. Seabirds can be seen taking flight around the cliff's precipices.
A sleek, black motorcycle with chrome accents stands parked on a sunlit pier, its polished surface gleaming under the bright sky. Nearby, a luxurious white yacht with elegant lines is moored, gently bobbing on the calm, azure waters.
In the serene Arizona desert, a colossal stone bridge arches gracefully across a rugged canyon, its weathered surface blending seamlessly with the surrounding red rock formations. The scene is bathed in the warm, golden light of the setting sun, casting long shadows and highlighting the intricate textures of the canyon walls
A vintage red phone booth stands alone on a cobblestone street, bathed in the soft glow of a nearby streetlamp. The booth's glass panels reflect the dim light, revealing a glimpse of the old rotary phone inside. Surrounding the booth, ivy climbs up the nearby brick wall, adding a touch of nature to the urban setting.
A serene indoor library bathed in soft, golden light from tall, arched windows, casting gentle shadows on the polished wooden floor. Rows of towering bookshelves, filled with leather-bound volumes and colorful spines, create a labyrinth of knowledge.
The video features a mesmerizing view of a galaxy, with a central bright white star at the center, surrounded by a swirling pattern of blue and purple hues. The galaxy appears to be in motion, with the stars and dust particles creating a dynamic and captivating visual effect. The colors are vibrant, and the overall effect is one of awe and wonder, reminiscent of the vastness and beauty of the cosmos.
The video captures a serene winter scene with a prominent church structure situated in the center of a vast snow-covered field. The church, with its white walls and a red roof, stands out against the white snow. The surrounding landscape is dominated by towering mountains, covered in a thick layer of snow, and dense forests. The sky is clear and blue, indicating a sunny day. The camera pans slowly from left to right, providing a comprehensive view of the church and its surroundings.
The video captures a serene beach scene during sunset. The sky is painted with hues of orange and purple, and the sun is partially visible, casting a warm glow on the water. The waves are the main focus, with one large wave in the foreground and smaller waves in the background. The wave in the foreground is a deep blue-green color, with white foam at the crest, and it is breaking on the sandy beach. The water is calm and smooth, with no other objects or people visible in the scene.
Push upward at a low angle, slowly look up, a tiger with intense, fiery eyes, surrounded by flames. The tiger's fur glows with the reflection of the fire, emphasizing its fierce expression and strong presence. The flames frame the tiger, creating a dramatic, almost mythical atmosphere that highlights its raw power and intensity.
The video features a single parrot with a vibrant orange head, a yellow body, and green wings and tail. The parrot is perched on a metal bar, which appears to be part of a railing or fence. The background is a blurred view of greenery and buildings, suggesting an outdoor setting. The parrot's movements are minimal, with occasional head turns and slight body shifts.
A dramatic video scene featuring a human heart engulfed in intense flames, suspended against a dark, smoky background. The fire wraps around the heart, with bright orange and yellow flames dancing and crackling, highlighting the heart's texture and shape. The fiery effect creates a powerful and symbolic image, evoking themes of passion, intensity, and raw emotion.
The video begins with a dark, purple background, and as it progresses, a series of small, glowing particles appear, gradually forming the shape of a heart. The particles are predominantly pink and red, and they seem to be floating in a three-dimensional space. As the heart shape becomes more defined, the particles become denser and more concentrated around the heart's outline. The heart appears to be glowing with a soft light, and the particles seem to be emanating from the heart's center, creating a sense of depth and movement. The video ends with the fully formed heart, surrounded by a halo of glowing particles, against the same dark, purple background.
The video presents a serene landscape at sunrise. The sky is painted with hues of orange and pink, with the sun appearing as a large, glowing orb. The mountains in the background are silhouetted against the sky, with their peaks touching the clouds. The foreground features a calm lake reflecting the sky's colors, with a few trees and rocks scattered around its edges. The water is still, and the overall atmosphere is peaceful and tranquil.
The video begins with a plain white background, and as it progresses, a black circle appears in the center. Subsequently, a dog character with a blue body, white face, brown ears, and a pink nose emerges from the circle. The dog is adorned with a multicolored scarf around its neck, and as the video continues, it is surrounded by stars and heart-shaped balloons. The dog's facial expressions change slightly throughout the video, and the background remains consistently white.
The video features a single, small dog with a light brown coat, sitting on a snow-covered rock. The dog is wearing a red and white Santa hat, and its eyes are wide open, giving it a curious and alert expression. The background is a serene winter landscape with snow-covered trees and a soft, warm glow from the setting sun.
@article{si2025repvideo,
author = {Si, Chenyang and Fan, Weichen and Lv, Zhengyao and Huang, Ziqi and Qiao, Yu and Liu, Ziwei},
title = {RepVideo: Rethinking Cross-Layer Representation for Video Generation},
journal = {arXiv preprint},
year = {2025},
}