Diffusion models have achieved remarkable results in video synthesis, but their iterative denoising steps incur substantial computational overhead. Consistency models have made significant progress in accelerating diffusion models; however, directly applying them to video diffusion models often causes severe degradation of temporal consistency and appearance details. In this paper, by analyzing the training dynamics of consistency models, we identify a key conflict in the distillation process: optimization gradients and loss contributions differ significantly across timesteps. This discrepancy prevents the distilled student model from reaching an optimal state, leading to compromised temporal consistency and degraded appearance details. To address this issue, we propose a parameter-efficient Dual-Expert Consistency Model (DCM), in which a semantic expert focuses on learning the semantic layout and motion while a detail expert specializes in fine-detail refinement. Furthermore, we introduce a Temporal Coherence Loss to improve motion consistency for the semantic expert, and apply GAN and Feature Matching losses to enhance the synthesis quality of the detail expert. Our approach achieves state-of-the-art visual quality with significantly fewer sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation.
Diffusion models excel in image and video synthesis but are computationally intensive. Consistency Distillation accelerates sampling via knowledge distillation, yet it often degrades video quality, causing distorted layouts, unnatural motion, and loss of fine detail. Analyzing the sampling dynamics reveals that early steps undergo substantial, rapid changes that establish semantic layout and motion, while later steps refine fine details gradually and smoothly. This leads to different learning dynamics for high- and low-noise samples, as shown in Figure 2. Jointly distilling a single student model for both tasks can introduce optimization interference and result in suboptimal performance.
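One way to observe this discrepancy is to log how much loss each timestep contributes during distillation. The sketch below is a hypothetical diagnostic, not the paper's code: student, teacher_step, and the Karras-style sigma_schedule are illustrative stand-ins.

import torch

def sigma_schedule(t, num_steps, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    # Karras-style noise schedule stand-in; the actual schedule may differ.
    r = t / max(num_steps - 1, 1)
    return (sigma_max ** (1 / rho)
            + r * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

def per_timestep_losses(student, teacher_step, x0, num_steps=50):
    """Log the distillation loss contributed by each timestep."""
    losses = []
    for t in range(num_steps):
        sigma = sigma_schedule(t, num_steps)
        x_t = x0 + sigma * torch.randn_like(x0)    # noise the clean sample
        with torch.no_grad():
            target = teacher_step(x_t, t)          # teacher's next ODE point
        pred = student(x_t, t)
        losses.append(torch.mean((pred - target) ** 2).item())
    return losses  # early, high-noise steps tend to dominate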
To validate this hypothesis, we train two expert denoisers. We first divide the ODE trajectory of the pre-trained model into two phases: a semantic synthesis phase and a detail refinement phase. We then train two distinct student expert denoisers, each responsible for fitting one of these sub-trajectories. During inference, we dynamically select the corresponding expert denoiser based on the noise level of the sample to predict the next position on the ODE trajectory. As shown in Figure 3, the combination of the two student expert denoisers achieves better performance, confirming our hypothesis.
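A minimal sketch of this expert-switching sampler, assuming a split point t_split on the trajectory; dual_expert_sample, semantic_expert, detail_expert, and ode_step are hypothetical names, not the released implementation:

import torch

@torch.no_grad()
def dual_expert_sample(semantic_expert, detail_expert, x_T,
                       timesteps, t_split, ode_step):
    """Sample along the ODE trajectory, switching experts at t_split."""
    x = x_T
    for t in timesteps:  # ordered from high noise to low noise
        # High-noise phase handles semantic layout and motion;
        # low-noise phase handles fine-detail refinement.
        expert = semantic_expert if t >= t_split else detail_expert
        x = ode_step(expert, x, t)  # predict the next point on the trajectory
    return x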
To improve parameter efficiency, we analyze the parameter differences between the two expert denoisers and find that they lie mainly in the timestep-dependent embedding layers and the linear layers within the attention blocks. Based on this, we propose a parameter-efficient Dual-Expert Consistency Model (DCM). First, we train a semantic expert denoiser on the semantic synthesis sub-trajectory and freeze it. Then, we introduce new timestep-dependent layers, apply LoRA to the linear layers of the attention blocks, and fine-tune these layers on the detail refinement sub-trajectory. This decouples the optimization of the two expert denoisers with minimal additional parameters and computational cost, achieving results comparable to using two separate experts. To account for the different training dynamics, we attach distinct optimization objectives: a Temporal Coherence Loss for the semantic expert to capture motion variations, and GAN and Feature Matching losses for the detail expert to enhance fine-grained synthesis and stabilize training.
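A rough PyTorch sketch of this design under stated assumptions: a frozen attention linear wrapped with a LoRA branch that is active only in the detail refinement phase, together with one plausible form of the Temporal Coherence Loss (the paper's exact formulation may differ). The new timestep-dependent embedding layers and the GAN / Feature Matching losses are omitted; all names and the rank are illustrative.

import torch
import torch.nn as nn

class PhaseLoRALinear(nn.Module):
    """Frozen semantic-expert linear layer with a LoRA branch that is
    enabled only during the detail-refinement phase."""

    def __init__(self, base: nn.Linear, rank: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)            # semantic expert stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)         # adapter starts as a no-op
        self.detail_phase = False              # toggled by the sampler

    def forward(self, x):
        out = self.base(x)
        if self.detail_phase:                  # detail expert = base + LoRA
            out = out + self.up(self.down(x))
        return out

def temporal_coherence_loss(pred, target):
    # Hypothetical form of the Temporal Coherence Loss: match frame-to-frame
    # differences so the student reproduces the teacher's motion dynamics.
    # pred / target: (B, T, C, H, W) video latents.
    return torch.mean(((pred[:, 1:] - pred[:, :-1])
                       - (target[:, 1:] - target[:, :-1])) ** 2)

At inference, the sampler would set detail_phase according to the current noise level, mirroring the expert selection sketched above.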
Comparison of our visual results with those of competing methods.
@article{lv2025dualexpert,
author = {Lv, Zhengyao and Si, Chenyang and Pan, Tianlin and Chen, Zhaoxi and Wong, Kwan-Yee K. and Qiao, Yu and Liu, Ziwei},
title = {Dual-Expert Consistency Model for Efficient and High-Quality Video Generation},
journal = {arXiv preprint},
year = {2025},
}