FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

The University of Hong Kong1      S-Lab, Nanyang Technological University2
Shanghai Artificial Intelligence Laboratory3


Abstract

In this paper, we present FasterCache, a novel training-free strategy designed to accelerate the inference of video diffusion models with high-quality generation. By analyzing existing cache-based methods, we observe that directly reusing adjacent-step features degrades video quality due to the loss of subtle variations. We further perform a pioneering investigation of the acceleration potential of classifier-free guidance (CFG) and reveal significant redundancy between conditional and unconditional features within the same timestep. Capitalizing on these observations, we introduce FasterCache to substantially accelerate diffusion-based video generation. Our key contributions include a dynamic feature reuse strategy that preserves both feature distinction and temporal continuity, and CFG-Cache which optimizes the reuse of conditional and unconditional outputs to further enhance inference speed without compromising video quality. We empirically evaluate FasterCache on recent video diffusion models. Experimental results show that FasterCache can significantly accelerate video generation (e.g., 1.67× speedup on Vchitect-2.0) while keeping video quality comparable to the baseline, and consistently outperform existing methods in both inference speed and video quality.


Figure 1: Comparison of visual quality and inference speed with competing methods.

Dynamic Feature Reuse Strategy

Motivation

Attention feature reuse has become a primary focus for cache-based acceleration methods in video generation. Previous methods typically assume a high degree of feature similarity between adjacent timesteps in the iterative denoising process, and achieve accelerated inference by sharing features across consecutive timesteps. However, our investigation reveals that while features in the same attention module (e.g., spatial attention) appear to be nearly identical between adjacent timesteps, there exist some subtle yet discernible differences. As a result, a naive feature caching and reuse strategy often leads to degradation of details in generated videos.

Figure 2: Visual quality degradation caused by Vanilla Feature Reuse (left) and feature differences between adjacent timesteps (right).
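The adjacent-timestep comparison in Figure 2 (right) can be reproduced with a simple probe: record each attention module's output at consecutive timesteps via forward hooks and measure its relative change. The sketch below is illustrative, assuming a PyTorch model whose attention modules can be selected by name and return plain tensors; the name filter and the relative-L2 metric are our own choices, not the paper's.

```python
import torch

def attach_feature_probe(model, layer_filter=lambda name: "attn" in name):
    """Record per-layer attention outputs via forward hooks (diagnostic sketch)."""
    records, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Assumes the hooked module returns a single tensor.
            records.setdefault(name, []).append(output.detach().float().cpu())
        return hook

    for name, module in model.named_modules():
        if layer_filter(name):
            handles.append(module.register_forward_hook(make_hook(name)))
    return records, handles


def adjacent_step_change(records):
    """Relative L2 change of each layer's output between consecutive timesteps."""
    changes = {}
    for name, feats in records.items():
        changes[name] = [
            (torch.norm(feats[i] - feats[i - 1]) / torch.norm(feats[i - 1])).item()
            for i in range(1, len(feats))
        ]
    return changes
```

Running one sampling pass with the hooks attached and then calling `adjacent_step_change` yields curves of the kind plotted in Figure 2 (right): small but non-zero differences between adjacent timesteps.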

Implementation

Instead of directly reusing previously cached features at the current timestep, we propose a Dynamic Feature Reuse Strategy that more effectively captures and preserves critical details in the generated videos. We compute the attention outputs of each layer at timesteps \( t+2 \) and \( t \), denoted as \( F_{t+2} \) and \( F_{t} \), and store them in the feature cache as \( F^{t+2}_{cache} \) and \( F^{t}_{cache} \). The features at the subsequent timestep \( t-1 \) are then computed as:

\[ F_{t-1} = F^{t}_{cache} + \left( F^{t}_{cache} - F^{t+2}_{cache} \right) \cdot w(t), \]

where \( w(t) \) is a weighting function that modulates the contribution of the feature difference to account for variation between adjacent timesteps, ensuring both efficiency and the preservation of fine details in the generated videos. Consequently, our approach significantly accelerates inference while preserving the visual quality of the synthesized videos.
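A minimal PyTorch sketch of this reuse rule is shown below. Only the extrapolation formula above comes from the method description; the cache layout, the linear schedule used for \( w(t) \), and the choice of which timesteps recompute attention are illustrative assumptions.

```python
import torch

class DynamicFeatureCache:
    """Illustrative per-layer cache (a sketch, not the released implementation).

    Keeps the attention outputs computed at timesteps t+2 and t, and
    approximates the skipped timestep t-1 by extrapolating their difference:
        F_{t-1} = F_cache^t + (F_cache^t - F_cache^{t+2}) * w(t)
    """

    def __init__(self, num_timesteps: int):
        self.num_timesteps = num_timesteps
        self.feat_prev = None  # F_cache^{t+2}
        self.feat_last = None  # F_cache^{t}

    def store(self, feat: torch.Tensor) -> None:
        # Shift the cache: the last stored output becomes the older entry.
        self.feat_prev, self.feat_last = self.feat_last, feat.detach()

    def weight(self, t: int) -> float:
        # Assumed schedule: the difference term contributes more as sampling
        # proceeds (t decreases); the paper's exact w(t) may differ.
        return 1.0 - t / self.num_timesteps

    def extrapolate(self, t: int) -> torch.Tensor:
        # Dynamic feature reuse for the skipped timestep t-1.
        return self.feat_last + (self.feat_last - self.feat_prev) * self.weight(t)


def attention_forward(attn, hidden_states, timestep, cache, recompute: bool):
    """Hypothetical control flow inside a transformer block."""
    if recompute:                                  # e.g. timesteps t+2 and t run full attention
        out = attn(hidden_states)
        cache.store(out)
    else:                                          # skipped timestep t-1 reuses cached features;
        out = cache.extrapolate(timestep + 1)      # w is evaluated at t = (t-1) + 1
    return out
```

In practice one such cache would be kept per attention layer, and the split between recomputed and reused timesteps is the knob that trades inference speed against fidelity.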

CFG-Cache

Motivation

In the mid-to-late stages of sampling, the similarity between the conditional and unconditional outputs at the same timestep is remarkably high, far exceeding the similarity between adjacent timesteps. By contrast, directly reusing unconditional outputs from adjacent timesteps, as suggested in existing methods, causes significant error accumulation and degrades video quality. These observations indicate substantial redundancy in the CFG process and highlight the need for a new strategy that accelerates CFG without compromising the quality of the generated outputs.

Figure 3: (a) The MSE between conditional and unconditional outputs at the same timestep, and across adjacent timesteps. (b) Directly reusing unconditional outputs from previous timesteps leads to significantly degraded visual quality.
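The redundancy analysis summarized in Figure 3(a) boils down to comparing mean squared errors between recorded model outputs. A minimal sketch, assuming the conditional and unconditional noise predictions have been saved at every timestep of one sampling run, is given below.

```python
import torch

def cfg_redundancy(cond_outputs, uncond_outputs):
    """Quantify CFG redundancy from outputs recorded during one sampling run.

    cond_outputs / uncond_outputs: lists of noise-prediction tensors, one per
    timestep, recorded while sampling (earliest timestep first).
    Returns per-step MSE between the conditional and unconditional outputs at
    the same timestep, and MSE between unconditional outputs at adjacent steps.
    """
    same_step = [
        torch.mean((c - u) ** 2).item()
        for c, u in zip(cond_outputs, uncond_outputs)
    ]
    adjacent_step = [
        torch.mean((uncond_outputs[i] - uncond_outputs[i - 1]) ** 2).item()
        for i in range(1, len(uncond_outputs))
    ]
    return same_step, adjacent_step
```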

Implementation

Since both the conditional and unconditional outputs in CFG represent predicted noise, we analyze the differences between these two outputs in the frequency domain. We observe that, in the early and mid stages of the sampling process, the conditional and unconditional outputs at the same timestep exhibit a significant bias in the low-frequency components, which progressively shifts to the high-frequency components in the later steps. This suggests that, despite their overall similarity, these frequency-specific differences must be addressed to avoid the degradation of details.

Building on this discovery, we propose CFG-Cache, a novel approach designed to account for both high- and low-frequency biases, coupled with a timestep-adaptive enhancement technique.
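The sketch below illustrates one way such a cache could look in PyTorch: on full steps both CFG branches run and the low/high-frequency components of their difference are cached; on reuse steps only the conditional branch runs and the unconditional output is reconstructed with timestep-adaptive weights. The radial FFT mask, the cutoff, the enhancement weights `w_low`/`w_high`, and the early/late split are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.fft as fft

def split_frequencies(x: torch.Tensor, cutoff: float = 0.25):
    """Split a (B, C, F, H, W) tensor into low- and high-frequency parts
    using a simple radial mask over the spatial FFT (illustrative choice)."""
    freq = fft.fftshift(fft.fft2(x.float(), dim=(-2, -1)), dim=(-2, -1))
    h, w = x.shape[-2:]
    yy = torch.linspace(-1, 1, h, device=x.device).view(-1, 1)
    xx = torch.linspace(-1, 1, w, device=x.device).view(1, -1)
    low_mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).float()
    low = fft.ifft2(fft.ifftshift(freq * low_mask, dim=(-2, -1)), dim=(-2, -1)).real
    return low, x - low  # low- and high-frequency components (exact decomposition)


class CFGCache:
    """Sketch of CFG-Cache under the stated assumptions (not the official code)."""

    def __init__(self, num_timesteps: int, w_low: float = 1.1, w_high: float = 1.1):
        self.num_timesteps = num_timesteps
        self.w_low, self.w_high = w_low, w_high   # assumed enhancement weights
        self.bias_low = self.bias_high = None

    def update(self, eps_cond: torch.Tensor, eps_uncond: torch.Tensor) -> None:
        # Full step: both branches were evaluated; cache the frequency-split bias.
        self.bias_low, self.bias_high = split_frequencies(eps_uncond - eps_cond)

    def reconstruct_uncond(self, eps_cond: torch.Tensor, t: int) -> torch.Tensor:
        # Reuse step: only the conditional branch was run. Timestep-adaptive
        # enhancement emphasizes the low-frequency bias in early/mid steps and
        # the high-frequency bias in later steps (assumed schedule).
        early = t > self.num_timesteps // 2
        low = self.bias_low * (self.w_low if early else 1.0)
        high = self.bias_high * (1.0 if early else self.w_high)
        return eps_cond + low + high
```

On reuse steps this skips the unconditional forward pass entirely, which is where the additional speedup over feature reuse alone comes from.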

Figure 4: Overview of the CFG-Cache.

Evaluation

Quantitative Results

Table 1: Quantitative comparison with competing methods.

Visual Results

Open-Sora (48 frames, 480P): Original vs. \( \Delta \)-DiT vs. PAB vs. FasterCache
Open-Sora-Plan (65 frames, 512×512): Original vs. \( \Delta \)-DiT vs. PAB vs. FasterCache
Latte (16 frames, 512×512): Original vs. \( \Delta \)-DiT vs. PAB vs. FasterCache
CogVideoX (48 frames, 480P): Original vs. \( \Delta \)-DiT vs. FasterCache
Vchitect 2.0 (40 frames, 480P): Original vs. \( \Delta \)-DiT vs. FasterCache

Scalability and Generalization

Scaling to multiple GPUs

Table 2: Scaling to multiple GPUs with DSP.

Performance at different resolutions and lengths

Figure 5: Acceleration efficiency of our method at different video resolutions and lengths.

I2V and image synthesis performance

Figure 6: Visual results and inference time of our method on I2V and image synthesis models.
DynamiCrafter (16 frames, 1024×576): Original vs. FasterCache

BibTeX

@article{lv2024fastercache,
  title={FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality},
  author={Lv, Zhengyao and Si, Chenyang and Song, Junhao and Yang, Zhenyu and Qiao, Yu and Liu, Ziwei and Wong, Kwan-Yee K.},
  journal={arXiv preprint},
  year={2024}
}