Cut2Next

Generating Next Shot via In-Context Tuning

Jingwen He1,2,   Hongbo Liu2,   Jiajun Li3,   Ziqi Huang3,  
Yu Qiao2,   Wanli Ouyang1†,   Ziwei Liu3†
1The Chinese University of Hong Kong,   2Shanghai Artificial Intelligence Laboratory,  
3S-Lab, Nanyang Technological University
†Corresponding authors

Abstract

Effective multi-shot generation demands purposeful, film-like transitions and strict cinematic continuity. Current methods, however, often prioritize basic visual consistency, neglecting crucial editing patterns (e.g., shot/reverse shot, cutaways) that drive narrative flow for compelling storytelling. This yields outputs that may be visually coherent but lack narrative sophistication and true cinematic integrity. To bridge this gap, we introduce Next Shot Generation (NSG): synthesizing a subsequent, high-quality shot that critically conforms to professional editing patterns while upholding rigorous cinematic continuity. Our framework, Cut2Next, leverages a Diffusion Transformer (DiT). It employs in-context tuning guided by a novel Hierarchical Multi-Prompting strategy. This strategy uses Relational Prompts to define the overall context and inter-shot editing styles, while Individual Prompts specify per-shot content and cinematographic attributes. Together, these guide Cut2Next to generate cinematically appropriate next shots. Architectural innovations, Context-Aware Condition Injection (CACI) and Hierarchical Attention Mask (HAM), further integrate these diverse signals without introducing new parameters. We construct the RawCuts (large-scale) and CuratedCuts (refined) datasets, both with hierarchical prompts, and introduce CutBench for evaluation. Experiments show Cut2Next excels in visual consistency and text fidelity. Crucially, user studies reveal a strong preference for Cut2Next, particularly for its adherence to intended editing patterns and overall cinematic continuity, validating its ability to generate high-quality, narratively expressive, and cinematically coherent subsequent shots.

Visual Comparison


Cut-Out, Multi-Angle

Input Shot

Cut2Next (Ours)

Flux-Kontext

BAGEL

Cutaway, Multi-Angle

Input Shot

Cut2Next (Ours)

Flux-Kontext

BAGEL

Shot/Reverse Shot

Input Shot

Cut2Next (Ours)

Flux-Kontext

BAGEL

Method

To address the specific challenges of Next Shot Generation, Cut2Next incorporates: (1) a Context-Aware Condition Injection (CACI) mechanism for nuanced integration of diverse conditional inputs, and (2) a Hierarchical Attention Mask (HAM) that orchestrates the flow of information between visual and textual tokens. This design lets Cut2Next exploit our rich hierarchical annotations and generate high-quality, cinematically coherent next shots that balance diversity with continuity, all without adding parameters to the base model.
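To make the Hierarchical Attention Mask idea concrete, the sketch below builds a block-structured boolean attention mask over a hypothetical token layout. The layout, group names, and visibility rules here are illustrative assumptions for exposition, not the paper's exact HAM specification: every token reads the Relational Prompt, visual tokens read only their own shot's Individual Prompt, and the next shot's visual tokens additionally attend to the previous shot's visual tokens for continuity.

```python
import numpy as np

def hierarchical_attention_mask(n_rel, n_ind1, n_vis1, n_ind2, n_vis2):
    """Illustrative hierarchical attention mask (assumed layout, not the
    paper's exact HAM). Token order along both axes:
    [relational prompt | individual prompt 1 | shot-1 visual |
     individual prompt 2 | shot-2 visual].
    mask[q, k] == True means query token q may attend to key token k.
    """
    sizes = [n_rel, n_ind1, n_vis1, n_ind2, n_vis2]
    offsets = np.cumsum([0] + sizes)
    n = offsets[-1]
    mask = np.zeros((n, n), dtype=bool)

    def allow(i, j):
        # Let all queries in group i attend to all keys in group j.
        mask[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]] = True

    for g in range(5):
        allow(g, g)   # full self-attention within each group
        allow(g, 0)   # every group reads the relational prompt
    allow(2, 1)       # shot-1 visual reads individual prompt 1
    allow(4, 3)       # shot-2 visual reads individual prompt 2
    allow(4, 2)       # shot-2 visual reads shot-1 visual (continuity)
    return mask

# Tiny example: 2 relational, 2+2 individual-prompt, 3+3 visual tokens.
m = hierarchical_attention_mask(2, 2, 3, 2, 3)
print(m.shape)  # (12, 12)
```

In a DiT this mask would be applied inside attention (e.g., by setting disallowed logits to negative infinity before the softmax), so hierarchical conditioning is enforced purely through masking rather than extra parameters, matching the parameter-free spirit described above.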

Cut2Next Demo

BibTeX

@article{he2025cut2next,
  title={Cut2Next: Generating Next Shot via In-Context Tuning},
  author={He, Jingwen and Liu, Hongbo and Li, Jiajun and Huang, Ziqi and Qiao, Yu and Ouyang, Wanli and Liu, Ziwei},
  journal={arXiv preprint arXiv:},
  year={2025}
}