FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality

The University of Hong Kong1      S-Lab, Nanyang Technological University2
Shanghai Artificial Intelligence Laboratory3


Abstract

In this paper, we present FasterCache, a novel training-free strategy designed to accelerate the inference of video diffusion models with high-quality generation. By analyzing existing cache-based methods, we observe that directly reusing adjacent-step features degrades video quality due to the loss of subtle variations. We further perform a pioneering investigation of the acceleration potential of classifier-free guidance (CFG) and reveal significant redundancy between conditional and unconditional features within the same timestep. Capitalizing on these observations, we introduce FasterCache to substantially accelerate diffusion-based video generation. Our key contributions include a dynamic feature reuse strategy that preserves both feature distinction and temporal continuity, and CFG-Cache which optimizes the reuse of conditional and unconditional outputs to further enhance inference speed without compromising video quality. We empirically evaluate FasterCache on recent video diffusion models. Experimental results show that FasterCache can significantly accelerate video generation (e.g., 1.67× speedup on Vchitect-2.0) while keeping video quality comparable to the baseline, and consistently outperform existing methods in both inference speed and video quality.


Figure 1: Comparison of visual quality and inference speed with competing methods.

Dynamic Feature Reuse Strategy

Motivation

Attention feature reuse has become a primary focus for cache-based acceleration methods in video generation. Previous methods typically assume a high degree of feature similarity between adjacent timesteps in the iterative denoising process, and achieve accelerated inference by sharing features across consecutive timesteps. However, our investigation reveals that while features in the same attention module (e.g., spatial attention) appear to be nearly identical between adjacent timesteps, there exist some subtle yet discernible differences. As a result, a naive feature caching and reuse strategy often leads to degradation of details in generated videos.

Figure 2: Visual quality degradation caused by Vanilla Feature Reuse (left) and feature differences between adjacent timesteps (right).
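The adjacent-timestep comparison in Figure 2 (right) can be reproduced with a simple probe: record each attention module's output at consecutive timesteps via forward hooks and measure its relative change. The sketch below is illustrative, assuming a PyTorch model whose attention modules can be selected by name and return plain tensors; the name filter and the relative-L2 metric are our own choices, not the paper's.

```python
import torch

def attach_feature_probe(model, layer_filter=lambda name: "attn" in name):
    """Record per-layer attention outputs via forward hooks (diagnostic sketch)."""
    records, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # Assumes the hooked module returns a single tensor.
            records.setdefault(name, []).append(output.detach().float().cpu())
        return hook

    for name, module in model.named_modules():
        if layer_filter(name):
            handles.append(module.register_forward_hook(make_hook(name)))
    return records, handles


def adjacent_step_change(records):
    """Relative L2 change of each layer's output between consecutive timesteps."""
    changes = {}
    for name, feats in records.items():
        changes[name] = [
            (torch.norm(feats[i] - feats[i - 1]) / torch.norm(feats[i - 1])).item()
            for i in range(1, len(feats))
        ]
    return changes
```

Running one sampling pass with the hooks attached and then calling `adjacent_step_change` yields curves of the kind plotted in Figure 2 (right): small but non-zero differences between adjacent timesteps.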

Implementation

Instead of directly reusing previously cached features at the current timestep, we propose a Dynamic Feature Reuse Strategy that more effectively captures and preserves critical details in the generated videos. We compute the attention outputs of each layer at timesteps \( t+2 \) and \( t \), denoted as \( F_{t+2} \) and \( F_{t} \), and store them in the feature cache as \( F^{t+2}_{cache} \) and \( F^{t}_{cache} \). The features at the subsequent timestep \( t-1 \) are then computed as:

\[ F_{t-1} = F^{t}_{cache} + \left( F^{t}_{cache} - F^{t+2}_{cache} \right) \cdot w(t), \]

where \( w(t) \) is a weighting function that modulates the contribution of the feature difference to account for variation between adjacent timesteps, ensuring both efficiency and the preservation of fine details in the generated videos. Consequently, our approach significantly accelerates inference while preserving the visual quality of the synthesized videos.
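A minimal PyTorch sketch of this reuse rule is shown below. Only the extrapolation formula above comes from the method description; the cache layout, the linear schedule used for \( w(t) \), and the choice of which timesteps recompute attention are illustrative assumptions.

```python
import torch

class DynamicFeatureCache:
    """Illustrative per-layer cache (a sketch, not the released implementation).

    Keeps the attention outputs computed at timesteps t+2 and t, and
    approximates the skipped timestep t-1 by extrapolating their difference:
        F_{t-1} = F_cache^t + (F_cache^t - F_cache^{t+2}) * w(t)
    """

    def __init__(self, num_timesteps: int):
        self.num_timesteps = num_timesteps
        self.feat_prev = None  # F_cache^{t+2}
        self.feat_last = None  # F_cache^{t}

    def store(self, feat: torch.Tensor) -> None:
        # Shift the cache: the last stored output becomes the older entry.
        self.feat_prev, self.feat_last = self.feat_last, feat.detach()

    def weight(self, t: int) -> float:
        # Assumed schedule: the difference term contributes more as sampling
        # proceeds (t decreases); the paper's exact w(t) may differ.
        return 1.0 - t / self.num_timesteps

    def extrapolate(self, t: int) -> torch.Tensor:
        # Dynamic feature reuse for the skipped timestep t-1.
        return self.feat_last + (self.feat_last - self.feat_prev) * self.weight(t)


def attention_forward(attn, hidden_states, timestep, cache, recompute: bool):
    """Hypothetical control flow inside a transformer block."""
    if recompute:                                  # e.g. timesteps t+2 and t run full attention
        out = attn(hidden_states)
        cache.store(out)
    else:                                          # skipped timestep t-1 reuses cached features;
        out = cache.extrapolate(timestep + 1)      # w is evaluated at t = (t-1) + 1
    return out
```

In practice one such cache would be kept per attention layer, and the split between recomputed and reused timesteps is the knob that trades inference speed against fidelity.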

CFG-Cache

Motivation

In the mid-to-late stages of sampling, the similarity between the conditional and unconditional outputs at the same timestep is remarkably high, far exceeding the similarity between adjacent timesteps. By contrast, directly reusing unconditional outputs from adjacent timesteps, as suggested in existing methods, causes significant error accumulation and degrades video quality. These observations indicate substantial redundancy in the CFG process and highlight the need for a new strategy that accelerates CFG without compromising the quality of the generated outputs.

Figure 3: (a) The MSE between conditional and unconditional outputs at the same timestep, and across adjacent timesteps. (b) Directly reusing unconditional outputs from previous timesteps leads to significantly degraded visual quality.
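The redundancy analysis summarized in Figure 3(a) boils down to comparing mean squared errors between recorded model outputs. A minimal sketch, assuming the conditional and unconditional noise predictions have been saved at every timestep of one sampling run, is given below.

```python
import torch

def cfg_redundancy(cond_outputs, uncond_outputs):
    """Quantify CFG redundancy from outputs recorded during one sampling run.

    cond_outputs / uncond_outputs: lists of noise-prediction tensors, one per
    timestep, recorded while sampling (earliest timestep first).
    Returns per-step MSE between the conditional and unconditional outputs at
    the same timestep, and MSE between unconditional outputs at adjacent steps.
    """
    same_step = [
        torch.mean((c - u) ** 2).item()
        for c, u in zip(cond_outputs, uncond_outputs)
    ]
    adjacent_step = [
        torch.mean((uncond_outputs[i] - uncond_outputs[i - 1]) ** 2).item()
        for i in range(1, len(uncond_outputs))
    ]
    return same_step, adjacent_step
```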

Implementation

Since both the conditional and unconditional outputs in CFG represent predicted noise, we analyze the differences between these two outputs in the frequency domain. We observe that, in the early and mid stages of the sampling process, the conditional and unconditional outputs at the same timestep exhibit a significant bias in the low-frequency components, which progressively shifts to the high-frequency components in the later steps. This suggests that, despite their overall similarity, these frequency-specific differences must be addressed to avoid the degradation of details.

Building on this discovery, we propose CFG-Cache, a novel approach designed to account for both high- and low-frequency biases, coupled with a timestep-adaptive enhancement technique.
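The sketch below illustrates one way such a cache could look in PyTorch: on full steps both CFG branches run and the low/high-frequency components of their difference are cached; on reuse steps only the conditional branch runs and the unconditional output is reconstructed with timestep-adaptive weights. The radial FFT mask, the cutoff, the enhancement weights `w_low`/`w_high`, and the early/late split are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.fft as fft

def split_frequencies(x: torch.Tensor, cutoff: float = 0.25):
    """Split a (B, C, F, H, W) tensor into low- and high-frequency parts
    using a simple radial mask over the spatial FFT (illustrative choice)."""
    freq = fft.fftshift(fft.fft2(x.float(), dim=(-2, -1)), dim=(-2, -1))
    h, w = x.shape[-2:]
    yy = torch.linspace(-1, 1, h, device=x.device).view(-1, 1)
    xx = torch.linspace(-1, 1, w, device=x.device).view(1, -1)
    low_mask = ((yy ** 2 + xx ** 2).sqrt() <= cutoff).float()
    low = fft.ifft2(fft.ifftshift(freq * low_mask, dim=(-2, -1)), dim=(-2, -1)).real
    return low, x - low  # low- and high-frequency components (exact decomposition)


class CFGCache:
    """Sketch of CFG-Cache under the stated assumptions (not the official code)."""

    def __init__(self, num_timesteps: int, w_low: float = 1.1, w_high: float = 1.1):
        self.num_timesteps = num_timesteps
        self.w_low, self.w_high = w_low, w_high   # assumed enhancement weights
        self.bias_low = self.bias_high = None

    def update(self, eps_cond: torch.Tensor, eps_uncond: torch.Tensor) -> None:
        # Full step: both branches were evaluated; cache the frequency-split bias.
        self.bias_low, self.bias_high = split_frequencies(eps_uncond - eps_cond)

    def reconstruct_uncond(self, eps_cond: torch.Tensor, t: int) -> torch.Tensor:
        # Reuse step: only the conditional branch was run. Timestep-adaptive
        # enhancement emphasizes the low-frequency bias in early/mid steps and
        # the high-frequency bias in later steps (assumed schedule).
        early = t > self.num_timesteps // 2
        low = self.bias_low * (self.w_low if early else 1.0)
        high = self.bias_high * (1.0 if early else self.w_high)
        return eps_cond + low + high
```

On reuse steps this skips the unconditional forward pass entirely, which is where the additional speedup over feature reuse alone comes from.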

Figure 4: Overview of the CFG-Cache.

Evaluation

Quantitative Results

Table 1: Quantitative comparison with competing methods.

Visual Results

Open-Sora (48 frames, 480P): Original vs. \( \Delta \)-DiT vs. PAB vs. FasterCache
Open-Sora-Plan (65 frames, 512×512): Original vs. \( \Delta \)-DiT vs. PAB vs. FasterCache
Latte (16 frames, 512×512): Original vs. \( \Delta \)-DiT vs. PAB vs. FasterCache
CogVideoX (48 frames, 480P): Original vs. \( \Delta \)-DiT vs. FasterCache
Vchitect 2.0 (40 frames, 480P): Original vs. \( \Delta \)-DiT vs. FasterCache

Scalability and Generalization

Scaling to multiple GPUs

Table 2: Scaling to multiple GPUs with DSP.

Performance at different resolutions and lengths

Figure 5: Acceleration efficiency of our method at different video resolutions and lengths.

I2V and image synthesis performance

Figure 6: Visual results and inference time of our method on I2V and image synthesis models.
DynamiCrafter (16 frames, 1024×576): Original vs. FasterCache

BibTeX

@article{lv2024fastercache,
  title={FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality},
  author={Lv, Zhengyao and Si, Chenyang and Song, Junhao and Yang, Zhenyu and Qiao, Yu and Liu, Ziwei and Wong, Kwan-Yee K.},
  journal={arXiv preprint},
  year={2024}
}