"The crisp green lettuce lay next to the juicy red tomato"
Lettuce and tomato
Red vase on wooden shelf
Clouds above skyscrapers
Apple, stone and fabric
Juicy watermelon
Banana split dessert
Multimodal Diffusion Transformers (MM-DiTs) have achieved remarkable progress in text-driven visual generation. However, even state-of-the-art MM-DiT models like FLUX struggle to achieve precise alignment between text prompts and generated content. We identify two key issues in the attention mechanism of MM-DiT, namely 1) the suppression of cross-modal attention due to the token imbalance between the visual and textual modalities and 2) the lack of timestep-aware attention weighting, both of which hinder alignment. To address these issues, we propose Temperature-Adjusted Cross-modal Attention (TACA), a parameter-efficient method that dynamically rebalances multimodal interactions through temperature scaling and timestep-dependent adjustment. When combined with LoRA fine-tuning, TACA significantly enhances text-image alignment on the T2I-CompBench benchmark with minimal computational overhead. We evaluate TACA on state-of-the-art models such as FLUX and SD3.5, demonstrating its ability to improve image-text alignment in terms of object appearance, attribute binding, and spatial relationships. Our findings highlight the importance of balancing cross-modal attention for improving semantic fidelity in text-to-image diffusion models. Our code will be made publicly available.
Even state-of-the-art Multimodal Diffusion Transformers (MM-DiTs) still struggle to produce images that align precisely with the provided text prompts. We observe two specific issues within the MM-DiT attention mechanism that contribute to this semantic misalignment: first, cross-modal attention between visual and text tokens is suppressed by the large imbalance in their numbers, and second, the attention weighting does not adapt to the varying needs of the denoising process across timesteps. These observations highlight the need for better control over how visual and textual information interact within the model to improve the semantic fidelity of generated images.
To mitigate the suppression of cross-attention caused by the dominance of visual tokens, we amplify the logits of visual-text interactions through a temperature coefficient \(\gamma > 1\). The modified attention probability for visual-text interaction becomes: \begin{equation} P_{\mathrm{vis-txt}}^{(i,\,j)} = \frac{ e^{{\color{blue}\gamma} s_{ij}^{\mathrm{vt}}/\tau}}{\sum_{k=1}^{N_{\mathrm{txt}}} e^{{\color{blue}\gamma} s_{ik}^{\mathrm{vt}}/\tau} + \sum_{k=1}^{N_{\mathrm{vis}}} e^{s_{ik}^{\mathrm{vv}}/\tau}}, \end{equation}
where \(s_{ik}^{\mathrm{vt}} = \boldsymbol Q^{(i)}_{\mathrm{vis}}\bigl(\boldsymbol K^{(k)}_{\mathrm{txt}}\bigr)^{T}/\sqrt{D}\) and \(s_{ik}^{\mathrm{vv}} = \boldsymbol Q^{(i)}_{\mathrm{vis}}\bigl(\boldsymbol K^{(k)}_{\mathrm{vis}}\bigr)^{T}/\sqrt{D}\).
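To make the rebalancing concrete, below is a minimal PyTorch sketch of this temperature-adjusted joint softmax for the visual queries of a single attention head. The function name, tensor shapes, and the default \(\gamma\) value are illustrative assumptions rather than the released implementation; the timestep-dependent adjustment of \(\gamma\) described in the paper is omitted here.

```python
import math
import torch

def taca_visual_attention(q_vis, k_txt, k_vis, v_txt, v_vis, gamma=1.3, tau=1.0):
    """Temperature-adjusted attention for visual queries (illustrative sketch).

    q_vis:        (N_vis, D) visual-token queries
    k_txt, v_txt: (N_txt, D) text-token keys / values
    k_vis, v_vis: (N_vis, D) visual-token keys / values
    gamma > 1 amplifies the visual-text logits before the joint softmax;
    tau is the softmax temperature from the equation above.
    """
    d = q_vis.shape[-1]
    # Scaled dot-product logits, as in standard attention.
    s_vt = q_vis @ k_txt.T / math.sqrt(d)   # visual-text logits s^vt
    s_vv = q_vis @ k_vis.T / math.sqrt(d)   # visual-visual logits s^vv

    # Scale only the visual-text logits by gamma, then take one joint softmax
    # over the concatenated (text + visual) keys, matching the equation above.
    logits = torch.cat([gamma * s_vt, s_vv], dim=-1) / tau
    probs = torch.softmax(logits, dim=-1)

    # Aggregate values in the same (text first, then visual) order.
    return probs @ torch.cat([v_txt, v_vis], dim=0)
```

Setting \(\gamma = 1\) and \(\tau = 1\) recovers standard joint attention, so the adjustment acts as a drop-in change to the visual-query rows of the MM-DiT attention logits.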
@article{lv2025taca,
  title={TACA: Rethinking Cross-Modal Interaction in Multimodal Diffusion Transformers},
  author={Lv, Zhengyao and Pan, Tianlin and Si, Chenyang and Chen, Zhaoxi and Zuo, Wangmeng and Liu, Ziwei and Wong, Kwan-Yee K.},
  journal={arXiv preprint},
  year={2025}
}