ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models

Hongbo Liu¹,³*, Jingwen He²,³*, Yi Jin¹, Dian Zheng³,
Yuhao Dong⁴, Fan Zhang³, Ziqi Huang⁴, Yinan He³, Yangguang Li³, Weichao Chen¹, Yu Qiao³,
Wanli Ouyang², Shengjie Zhao¹†, Ziwei Liu⁴†

(* equal contributions) († corresponding authors)

¹ Tongji University ² The Chinese University of Hong Kong
³ Shanghai Artificial Intelligence Laboratory ⁴ S-Lab, Nanyang Technological University

Overview

We introduce ShotBench, a comprehensive benchmark for evaluating VLMs’ understanding of cinematic language. It comprises over 3.5k expert-annotated QA pairs derived from images and video clips of over 200 critically acclaimed films (predominantly Oscar-nominated), covering eight distinct cinematography dimensions. This provides a rigorous new standard for assessing fine-grained visual comprehension in film.
We conducted an extensive evaluation of 24 leading VLMs, including prominent open-source and proprietary models, on ShotBench. Our results reveal a critical performance gap: even the most capable model, GPT-4o, achieves less than 60% average accuracy. This systematically quantifies the current limitations of VLMs in genuine cinematographic comprehension.
To address the identified limitations and facilitate future research, we constructed ShotQA, the first large-scale multimodal dataset for cinematography understanding, containing approximately 70k high-quality QA pairs. Leveraging ShotQA, we developed ShotVL, a novel VLM trained using Supervised Fine-Tuning (SFT) and Group Relative Policy Optimization (GRPO). ShotVL significantly surpasses all tested open-source and proprietary models, establishing a new state-of-the-art on ShotBench.

ShotBench covers eight core dimensions of cinematography: shot size, shot framing, camera angle, lens size, lighting type, lighting condition, composition, and camera movement.

Comparison of Benchmarks in Cinematography

Benchmark	Shot Size	Shot Framing	Camera Angle	Lens Size	Lighting Type	Lighting Condition	Composition	Camera Movement
MovieShots	✅	❌	❌	❌	❌	❌	❌	✅
MovieNet	✅	❌	❌	❌	❌	❌	❌	✅
CineScale2	❌	❌	✅	❌	❌	❌	❌	❌
CameraBench	❌	❌	❌	❌	❌	❌	❌	✅
CineTechBench	✅	❌	✅	✅	❌	✅	✅	✅
ShotBench (Ours)	✅	✅	✅	✅	✅	✅	✅	✅

Evaluation

We report the evaluation results for 24 VLMs and ShotVL below. ShotVL achieves the best overall performance among all evaluated models.

Abbreviations: SS = *Shot Size*, SF = *Shot Framing*, CA = *Camera Angle*, LS = *Lens Size*, LT = *Lighting Type*, LC = *Lighting Condition*, SC = *Shot Composition*, CM = *Camera Movement*. All scores are accuracy (%). Underlining marks the previous best performance in each group. **Our ShotVL models establish a new SOTA and provide a strong baseline for future research.**
Models	SS	SF	CA	LS	LT	LC	SC	CM	Avg
*Open-Source VLMs*
Qwen2.5-VL-3B-Instruct	54.6	56.6	43.1	36.6	59.3	45.1	41.5	31.9	46.1
Qwen2.5-VL-7B-Instruct	69.1	73.5	53.2	47.0	60.5	47.4	49.9	30.2	53.8
LLaVA-NeXT-Video-7B	35.9	37.1	32.5	27.8	50.9	31.7	28.0	31.3	34.4
LLaVA-Video-7B-Qwen2	56.9	65.4	45.1	36.0	63.5	45.4	37.4	35.3	48.1
LLaVA-Onevision-Qwen2-7B-Ov-Chat	58.4	71.0	52.3	38.7	59.5	44.9	50.9	39.7	51.9
InternVL2.5-8B	56.3	70.3	50.8	41.1	60.2	45.1	50.1	33.6	50.9
InternVL3-2B	56.3	56.0	44.4	34.6	56.8	44.6	43.0	38.1	46.7
InternVL3-8B	62.1	65.8	46.8	42.9	58.0	44.3	46.8	44.2	51.4
InternVL3-14B	59.6	82.2	55.4	40.7	61.7	44.6	51.1	38.2	54.2
Internlm-xcomposer2d5-7B	51.1	71.0	39.8	32.7	59.3	35.7	35.7	38.8	45.5
Ovis2-8B	35.9	37.1	32.5	27.8	50.9	31.7	28.0	35.3	34.9
VILA1.5-3B	33.4	44.9	32.1	28.6	50.6	35.7	28.4	21.5	34.4
VILA1.5-8B	40.6	44.5	39.1	29.7	48.9	32.9	34.4	36.9	38.4
VILA1.5-13B	36.7	54.6	40.7	34.8	52.8	35.4	34.2	31.3	40.1
Instructblip-vicuna-7B	27.0	27.9	34.5	29.4	44.4	29.7	27.1	25.0	30.6
Instructblip-vicuna-13B	26.8	29.2	27.9	28.0	39.0	24.0	27.1	22.0	28.0
InternVL2.5-38B	67.8	85.4	55.4	41.7	61.7	48.9	52.4	44.0	57.2
InternVL3-38B	68.0	84.0	51.9	43.6	64.4	46.9	54.7	44.6	57.3
Qwen2.5-VL-32B-Instruct	62.3	76.6	51.0	48.3	61.7	44.0	52.2	43.8	55.0
Qwen2.5-VL-72B-Instruct	75.1	82.9	56.7	46.8	59.0	49.4	54.1	48.9	59.1
InternVL3-78B	69.7	80.0	54.5	44.0	65.5	47.4	51.8	44.4	57.2
*Proprietary VLMs*
Gemini-2.0-flash	48.9	75.5	44.6	31.9	62.2	48.9	52.4	47.4	51.5
Gemini-2.5-flash-preview-04-17	57.7	82.9	51.4	43.8	65.2	45.7	45.9	43.5	54.5
GPT-4o	69.3	83.1	58.2	48.9	63.2	48.0	55.2	48.3	59.3
*Ours*
ShotVL-3B	77.9	85.6	68.8	59.3	65.7	53.1	57.4	51.7	65.1
ShotVL-7B	81.2	90.1	78.0	68.5	70.1	64.3	45.7	62.9	70.1
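
The scores above are per-dimension accuracies over the benchmark's QA pairs. As a rough sketch of how one such table row could be computed from model outputs, the Python below assumes a hypothetical record format with `dimension`, `answer`, and `prediction` fields, scores by exact match on an option label, and takes Avg as an unweighted mean over dimensions. None of these details are specified on this page, so the authors' actual scoring and aggregation may differ.

```python
from collections import defaultdict


def per_dimension_accuracy(records):
    """Return {dimension: accuracy in percent} plus an unweighted 'Avg'.

    Each record is a dict with hypothetical keys:
      'dimension'  - abbreviation such as 'SS' or 'CM'
      'answer'     - ground-truth option label, e.g. 'B'
      'prediction' - option label chosen by the model
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["dimension"]] += 1
        # Assumption: exact match on a normalized option label counts as correct.
        if r["prediction"].strip().upper() == r["answer"].strip().upper():
            correct[r["dimension"]] += 1
    scores = {d: 100.0 * correct[d] / total[d] for d in total}
    # Assumption: 'Avg' is the plain mean over dimensions,
    # not weighted by the number of questions in each dimension.
    scores["Avg"] = sum(scores.values()) / len(scores)
    return scores


# Toy records for illustration only; not ShotBench data.
records = [
    {"dimension": "SS", "answer": "A", "prediction": "a"},
    {"dimension": "SS", "answer": "B", "prediction": "C"},
    {"dimension": "CM", "answer": "D", "prediction": "D"},
]
scores = per_dimension_accuracy(records)
```

With the toy records, `scores` holds one entry per dimension plus `Avg`. A weighted average would instead divide the total number of correct answers by the total number of questions.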

Analysis and Findings

(1) Approximately half of the evaluated models attain an overall accuracy below 50%. Even the leading model, GPT-4o, fails to reach 60% accuracy, underscoring the significant gap between current VLMs and a true understanding of cinematography.
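
Both figures can be checked against the Avg column of the results table. The short Python check below copies the 24 baseline averages from that table (ShotVL excluded) and counts how many fall below 50%.

```python
# Avg column for the 24 baseline VLMs, copied from the results table (ShotVL excluded).
avg_scores = [
    46.1, 53.8, 34.4, 48.1, 51.9, 50.9, 46.7, 51.4, 54.2, 45.5, 34.9, 34.4,
    38.4, 40.1, 30.6, 28.0, 57.2, 57.3, 55.0, 59.1, 57.2,  # open-source
    51.5, 54.5, 59.3,                                       # proprietary
]

below_50 = sum(score < 50.0 for score in avg_scores)
share_below_50 = below_50 / len(avg_scores)
best = max(avg_scores)

print(f"{below_50} of {len(avg_scores)} models score below 50% ({share_below_50:.0%})")
print(f"best baseline average: {best}")
```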

(2) The overall performance differences between open-source and proprietary models are marginal.

(3) Within each series, larger models generally achieve higher accuracy.

(4) Stronger models perform well uniformly, without specific dimensional weaknesses.

(5) Fine-tuning with SFT and GRPO on the proposed ShotQA dataset effectively enhances the model's capability in cinematography understanding. Notably, the sequential training strategy of applying GRPO following SFT yields the best performance.
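
For readers unfamiliar with GRPO, its central idea is to sample a group of responses per prompt and score each response relative to the rest of its group, rather than training a separate value model. The sketch below illustrates only that normalization step in plain Python. The binary correctness reward, the `eps` constant, and the use of the population standard deviation are illustrative assumptions; ShotVL's actual reward design and training setup are not described on this page.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled responses.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group (an assumption)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group: 4 sampled answers to one question,
# with reward 1.0 for a correct answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# If every response in a group gets the same reward, all advantages are zero,
# so that group contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, which is what pushes the policy toward responses that beat their own group's average.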

Overall performance comparison across model families: InternVL3, Qwen2.5-VL, and VILA-1.5, highlighting variations by model size.

Performance on cinematographic dimensions: six Vision-Language Models (VLMs) compared across dimensions; stronger models show uniformly high scores without obvious weak spots.

BibTeX

@misc{liu2025shotbench,
      title={ShotBench: Expert-Level Cinematic Understanding in Vision-Language Models},
      author={Hongbo Liu and Jingwen He and Yi Jin and Dian Zheng and Yuhao Dong and Fan Zhang and Ziqi Huang and Yinan He and Yangguang Li and Weichao Chen and Yu Qiao and Wanli Ouyang and Shengjie Zhao and Ziwei Liu},
      year={2025},
      eprint={2506.21356},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2506.21356},
}