ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

(* equal contributions)  († corresponding authors)
1 Tongji University    2 The Chinese University of Hong Kong   
3 Shanghai Artificial Intelligence Laboratory    4 S-Lab, Nanyang Technological University   

Video Demo

Overview

Here we show the overview of ShotBench. The benchmark covers eight core dimensions of cinematography: shot size, framing, camera angle, lens size, lighting type, lighting condition, composition, and camera movement.

Comparison of Benchmarks in Cinematography

Benchmark | Shot Size | Shot Framing | Camera Angle | Lens Size | Lighting Type | Lighting Condition | Composition | Camera Movement
MovieShots
MovieNet
CineScale2
CameraBench
CineTechBench
ShotBench (Ours)

Dataset Samples

Here we show some samples from the ShotBench and ShotQA datasets. Please download the datasets from Hugging Face to obtain the full version.

Shot Size case
Camera movement samples: Push in, Zoom in, Pull out, Zoom out, Dolly zoom, Arc, Tilt down, Tilt up, Static shot, Trucking left, Trucking right, Boom down, Boom up

Evaluation

We report the evaluation results for 24 VLMs and ShotVL below. ShotVL sets a new state of the art (SOTA) in overall performance among all evaluated models.
Abbreviations adopted: SS = Shot Size, SF = Shot Framing, CA = Camera Angle, LS = Lens Size, LT = Lighting Type, LC = Lighting Conditions, SC = Shot Composition, CM = Camera Movement. Underline marks the previous best performance in each group. Our ShotVL models provide a strong baseline for future research.
Models | SS | SF | CA | LS | LT | LC | SC | CM | Avg
Open-Sourced VLMs
Qwen2.5-VL-3B-Instruct | 54.6 | 56.6 | 43.1 | 36.6 | 59.3 | 45.1 | 41.5 | 31.9 | 46.1
Qwen2.5-VL-7B-Instruct | 69.1 | 73.5 | 53.2 | 47.0 | 60.5 | 47.4 | 49.9 | 30.2 | 53.8
LLaVA-NeXT-Video-7B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 31.3 | 34.4
LLaVA-Video-7B-Qwen2 | 56.9 | 65.4 | 45.1 | 36.0 | 63.5 | 45.4 | 37.4 | 35.3 | 48.1
LLaVA-Onevision-Qwen2-7B-Ov-Chat | 58.4 | 71.0 | 52.3 | 38.7 | 59.5 | 44.9 | 50.9 | 39.7 | 51.9
InternVL2.5-8B | 56.3 | 70.3 | 50.8 | 41.1 | 60.2 | 45.1 | 50.1 | 33.6 | 50.9
InternVL3-2B | 56.3 | 56.0 | 44.4 | 34.6 | 56.8 | 44.6 | 43.0 | 38.1 | 46.7
InternVL3-8B | 62.1 | 65.8 | 46.8 | 42.9 | 58.0 | 44.3 | 46.8 | 44.2 | 51.4
InternVL3-14B | 59.6 | 82.2 | 55.4 | 40.7 | 61.7 | 44.6 | 51.1 | 38.2 | 54.2
Internlm-xcomposer2d5-7B | 51.1 | 71.0 | 39.8 | 32.7 | 59.3 | 35.7 | 35.7 | 38.8 | 45.5
Ovis2-8B | 35.9 | 37.1 | 32.5 | 27.8 | 50.9 | 31.7 | 28.0 | 35.3 | 34.9
VILA1.5-3B | 33.4 | 44.9 | 32.1 | 28.6 | 50.6 | 35.7 | 28.4 | 21.5 | 34.4
VILA1.5-8B | 40.6 | 44.5 | 39.1 | 29.7 | 48.9 | 32.9 | 34.4 | 36.9 | 38.4
VILA1.5-13B | 36.7 | 54.6 | 40.7 | 34.8 | 52.8 | 35.4 | 34.2 | 31.3 | 40.1
Instructblip-vicuna-7B | 27.0 | 27.9 | 34.5 | 29.4 | 44.4 | 29.7 | 27.1 | 25.0 | 30.6
Instructblip-vicuna-13B | 26.8 | 29.2 | 27.9 | 28.0 | 39.0 | 24.0 | 27.1 | 22.0 | 28.0
InternVL2.5-38B | 67.8 | 85.4 | 55.4 | 41.7 | 61.7 | 48.9 | 52.4 | 44.0 | 57.2
InternVL3-38B | 68.0 | 84.0 | 51.9 | 43.6 | 64.4 | 46.9 | 54.7 | 44.6 | 57.3
Qwen2.5-VL-32B-Instruct | 62.3 | 76.6 | 51.0 | 48.3 | 61.7 | 44.0 | 52.2 | 43.8 | 55.0
Qwen2.5-VL-72B-Instruct | 75.1 | 82.9 | 56.7 | 46.8 | 59.0 | 49.4 | 54.1 | 48.9 | 59.1
InternVL3-78B | 69.7 | 80.0 | 54.5 | 44.0 | 65.5 | 47.4 | 51.8 | 44.4 | 57.2
Proprietary VLMs
Gemini-2.0-flash | 48.9 | 75.5 | 44.6 | 31.9 | 62.2 | 48.9 | 52.4 | 47.4 | 51.5
Gemini-2.5-flash-preview-04-17 | 57.7 | 82.9 | 51.4 | 43.8 | 65.2 | 45.7 | 45.9 | 43.5 | 54.5
GPT-4o | 69.3 | 83.1 | 58.2 | 48.9 | 63.2 | 48.0 | 55.2 | 48.3 | 59.3
Ours
ShotVL-3B | 77.9 | 85.6 | 68.8 | 59.3 | 65.7 | 53.1 | 57.4 | 51.7 | 65.1
ShotVL-7B | 81.2 | 90.1 | 78.0 | 68.5 | 70.1 | 64.3 | 45.7 | 62.9 | 70.1
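
The Avg column can be summarized with a short script. The sketch below copies the reported Avg values from the table and counts how many baseline models fall below 50%, finds the best open-source and proprietary baselines, and computes the margin of ShotVL-7B over the best baseline. The variable names are illustrative. How the Avg column is aggregated is not stated on this page; the unweighted mean over the eight dimensions used in the last step is an assumption and may not reproduce the reported Avg for every row.

```python
from statistics import mean

# Avg column copied from the results table (percent accuracy).
OPEN_SOURCE_AVG = {
    "Qwen2.5-VL-3B-Instruct": 46.1,
    "Qwen2.5-VL-7B-Instruct": 53.8,
    "LLaVA-NeXT-Video-7B": 34.4,
    "LLaVA-Video-7B-Qwen2": 48.1,
    "LLaVA-Onevision-Qwen2-7B-Ov-Chat": 51.9,
    "InternVL2.5-8B": 50.9,
    "InternVL3-2B": 46.7,
    "InternVL3-8B": 51.4,
    "InternVL3-14B": 54.2,
    "Internlm-xcomposer2d5-7B": 45.5,
    "Ovis2-8B": 34.9,
    "VILA1.5-3B": 34.4,
    "VILA1.5-8B": 38.4,
    "VILA1.5-13B": 40.1,
    "Instructblip-vicuna-7B": 30.6,
    "Instructblip-vicuna-13B": 28.0,
    "InternVL2.5-38B": 57.2,
    "InternVL3-38B": 57.3,
    "Qwen2.5-VL-32B-Instruct": 55.0,
    "Qwen2.5-VL-72B-Instruct": 59.1,
    "InternVL3-78B": 57.2,
}
PROPRIETARY_AVG = {
    "Gemini-2.0-flash": 51.5,
    "Gemini-2.5-flash-preview-04-17": 54.5,
    "GPT-4o": 59.3,
}
SHOTVL_AVG = {"ShotVL-3B": 65.1, "ShotVL-7B": 70.1}

# All 24 baseline VLMs (everything except the ShotVL models).
baselines = {**OPEN_SOURCE_AVG, **PROPRIETARY_AVG}

# Baselines whose overall accuracy is below 50%.
below_50 = [name for name, avg in baselines.items() if avg < 50.0]

# Best model per group and overall, by reported Avg.
best_open = max(OPEN_SOURCE_AVG, key=OPEN_SOURCE_AVG.get)
best_proprietary = max(PROPRIETARY_AVG, key=PROPRIETARY_AVG.get)
best_baseline = max(baselines, key=baselines.get)

# Margin of ShotVL-7B over the best baseline, in percentage points.
margin = SHOTVL_AVG["ShotVL-7B"] - baselines[best_baseline]

# Assumption: compare the reported Avg with an unweighted mean of the
# eight per-dimension scores (SS, SF, CA, LS, LT, LC, SC, CM) for GPT-4o.
GPT4O_DIMS = [69.3, 83.1, 58.2, 48.9, 63.2, 48.0, 55.2, 48.3]
gpt4o_unweighted = mean(GPT4O_DIMS)

print(f"{len(below_50)} of {len(baselines)} baselines are below 50%")
print(f"best open-source: {best_open}, best proprietary: {best_proprietary}")
print(f"ShotVL-7B margin over {best_baseline}: {margin:.1f} points")
print(f"GPT-4o unweighted mean: {gpt4o_unweighted:.1f}")
```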

Analysis and Findings

(1) Approximately half of the evaluated models attain an overall accuracy below 50%. Even leading models such as GPT-4o fail to reach 60% accuracy, underscoring the significant gap between current VLMs and a true understanding of cinematography.
(2) The overall performance differences between open-source and proprietary models are marginal.
(3) Within each series, larger models generally achieve higher accuracy.
(4) Stronger models perform well uniformly, without specific dimensional weaknesses.
(5) Fine-tuning with SFT and GRPO on the proposed ShotQA dataset effectively enhances the model's capability in cinematography understanding. Notably, the sequential training strategy of applying GRPO following SFT yields the best performance.
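
Finding (5) refers to GRPO (Group Relative Policy Optimization), which scores each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates only that group-relative advantage step. The binary correctness reward, the group of four sampled answers, the use of the population standard deviation, and all function names are assumptions made for illustration; this page does not describe the reward design or training setup actually used for ShotVL.

```python
from statistics import mean, pstdev


def accuracy_reward(predicted: str, answer: str) -> float:
    """Hypothetical binary reward: 1.0 if the chosen option matches the answer."""
    return 1.0 if predicted.strip().upper() == answer.strip().upper() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of responses to the same question.

    Each advantage is (reward - group mean) / (group std + eps), so responses
    that beat the group average get a positive advantage and the rest get a
    negative one. eps avoids division by zero when all rewards are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical sampled answers to one multiple-choice question
# whose correct option is "A".
sampled = ["A", "C", "A", "B"]
rewards = [accuracy_reward(p, "A") for p in sampled]
advantages = group_relative_advantages(rewards)

for choice, adv in zip(sampled, advantages):
    print(f"option {choice}: advantage {adv:+.3f}")
```

When every response in a group receives the same reward, all advantages are zero and that group contributes no learning signal.
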
Overall performance comparison across model families
Overall performance comparison of InternVL3, Qwen2.5-VL, and VILA-1.5 model families, highlighting variations by model size.
Performance on cinematographic dimensions
Performance of six Vision-Language Models (VLMs) across cinematographic dimensions; stronger models show uniformly high scores without obvious weak spots.

BibTeX

@misc{
      liu2025shotbench,
      title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models}, 
      author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
      year={2025},
      eprint={2506.21356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.21356}, 
    }