Diffusion models have achieved remarkable results in video synthesis, but their iterative denoising steps incur substantial computational overhead. Consistency models have made significant progress in accelerating diffusion models; however, directly applying them to video diffusion models often causes severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of consistency models, we identify a key conflict in the distillation process: optimization gradients and loss contributions differ significantly across timesteps. This discrepancy prevents the distilled student model from reaching an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), in which a semantic expert focuses on learning the semantic layout and motion while a detail expert specializes in fine-detail refinement. Furthermore, we introduce a Temporal Coherence Loss to improve motion consistency for the semantic expert, and apply GAN and Feature Matching losses to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly fewer sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation.
Diffusion models excel in image and video synthesis but are computationally intensive. Consistency Distillation accelerates sampling via knowledge distillation, yet it often degrades video quality, causing distorted layouts, unnatural motion, and loss of fine detail. Analyzing the sampling dynamics reveals that early steps undergo substantial, rapid changes that establish semantic layout and motion, while later steps refine fine details gradually and smoothly. This leads to different learning dynamics for high- and low-noise samples, as shown in Figure 2. Jointly distilling a single student model for both tasks can introduce optimization interference and result in suboptimal performance.
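One way to observe this discrepancy is to log how much loss each timestep contributes during distillation. The sketch below is a hypothetical diagnostic, not the paper's code: student, teacher_step, and the Karras-style sigma_schedule are illustrative stand-ins.

import torch

def sigma_schedule(t, num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # Karras-style noise schedule stand-in; the actual schedule may differ.
    r = t / max(num_steps - 1, 1)
    return (sigma_max ** (1 / rho)
            + r * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

def per_timestep_losses(student, teacher_step, x0, num_steps=50):
    """Log the distillation loss contributed by each timestep."""
    losses = []
    for t in range(num_steps):
        sigma = sigma_schedule(t, num_steps)
        x_t = x0 + sigma * torch.randn_like(x0)    # noise the clean sample
        with torch.no_grad():
            target = teacher_step(x_t, t)          # teacher's next ODE point
        pred = student(x_t, t)
        losses.append(torch.mean((pred - target) ** 2).item())
    return losses  # early, high-noise steps tend to dominate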
To validate this hypothesis, we train two expert denoisers. We first divide the ODE trajectory of the pre-trained model into two phases: a semantic synthesis phase and a detail refinement phase. We then train two distinct student expert denoisers, each responsible for fitting one of these sub-trajectories. During inference, we dynamically select the corresponding expert denoiser based on the noise level of the sample to predict the next position on the ODE trajectory. As shown in Figure 3, the combination of the two student expert denoisers achieves better performance, confirming our hypothesis.
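A minimal sketch of this expert-switching sampler, assuming a split point t_split on the trajectory; dual_expert_sample, semantic_expert, detail_expert, and ode_step are hypothetical names, not the released implementation:

import torch

@torch.no_grad()
def dual_expert_sample(semantic_expert, detail_expert, x_T,
                       timesteps, t_split, ode_step):
    """Sample along the ODE trajectory, switching experts at t_split."""
    x = x_T
    for t in timesteps:  # ordered from high noise to low noise
        # High-noise phase handles semantic layout and motion;
        # low-noise phase handles fine-detail refinement.
        expert = semantic_expert if t >= t_split else detail_expert
        x = ode_step(expert, x, t)  # predict the next point on the trajectory
    return x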
To improve parameter efficiency, we analyze the parameter differences between the two expert denoisers and find that they lie mainly in the timestep-dependent embedding layers and the linear layers within the attention blocks. Based on this, we propose a parameter-efficient Dual-Expert Consistency Model (DCM). First, we train a semantic expert denoiser on the semantic synthesis sub-trajectory and freeze it. Then, we introduce new timestep-dependent layers, apply LoRA to the linear layers of the attention blocks, and fine-tune these layers on the detail refinement sub-trajectory. This decouples the optimization of the two expert denoisers with minimal additional parameters and computational cost, achieving results comparable to using two separate experts. To account for the different training dynamics, we attach distinct optimization objectives: a Temporal Coherence Loss for the semantic expert to capture motion variations, and GAN and Feature Matching losses for the detail expert to enhance fine-grained synthesis and stabilize training.
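A rough PyTorch sketch of this design under stated assumptions: a frozen attention linear wrapped with a LoRA branch that is active only in the detail refinement phase, together with one plausible form of the Temporal Coherence Loss (the paper's exact formulation may differ). The new timestep-dependent embedding layers and the GAN / Feature Matching losses are omitted; all names and the rank are illustrative.

import torch
import torch.nn as nn

class PhaseLoRALinear(nn.Module):
    """Frozen semantic-expert linear layer with a LoRA branch that is
    enabled only during the detail-refinement phase."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # semantic expert stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op
        self.detail_phase = False              # toggled by the sampler

    def forward(self, x):
        out = self.base(x)
        if self.detail_phase:                  # detail expert = base + LoRA
            out = out + self.up(self.down(x))
        return out

def temporal_coherence_loss(pred, target):
    # Hypothetical form of the Temporal Coherence Loss: match frame-to-frame
    # differences so the student reproduces the teacher's motion dynamics.
    # pred / target: (B, T, C, H, W) video latents.
    return torch.mean(((pred[:, 1:] - pred[:, :-1])
                       - (target[:, 1:] - target[:, :-1])) ** 2)

At inference, the sampler would set detail_phase according to the current noise level, mirroring the expert selection sketched above.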
Comparison of our visual results with those of competing methods.
@article{lv2025dualexpert,
author = {Lv, Zhengyao and Si, Chenyang and Pan, Tianlin and Chen, Zhaoxi and Wong, Kwan-Yee K. and Qiao, Yu and Liu, Ziwei},
title = {Dual-Expert Consistency Model for Efficient and High-Quality Video Generation},
journal = {arXiv preprint},
year = {2025},
}