BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning

About

Object-level spatial-temporal understanding is essential for video question answering, yet existing multimodal large language models (MLLMs) encode frames holistically and lack explicit mechanisms for fine-grained object grounding. Recent work addresses this by serializing bounding box coordinates as text tokens, but this text-coordinate paradigm suffers from a fundamental modality mismatch: object information is inherently visual, yet encoding it as text incurs a high token cost that forces aggressive temporal downsampling. We propose BoxTuning, which resolves this mismatch by injecting object spatial-temporal information directly into the visual modality. Colored bounding boxes and trajectory trails are rendered onto video frames as visual prompts, with only a concise color-to-object legend retained as text. This cuts the text token cost by 87-93% in practice and preserves full temporal resolution. The trajectory trails further encode inter-frame motion direction and speed within each keyframe, recovering fine-grained dynamics that text-coordinate methods are forced to discard. Experimental results on five video QA benchmarks (CLEVRER, Perception Test, STAR, NExT-QA, IntentQA) show that BoxTuning surpasses text-coordinate baselines on spatially oriented tasks and nearly eliminates the accuracy degradation observed on reasoning-centric tasks, establishing visual prompting as a more natural and efficient paradigm for conveying object information to video MLLMs.
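
The visual-prompt idea described in the abstract (colored boxes, trajectory trails, and a short color-to-object legend) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`draw_box`, `draw_trail`, `render_keyframe`), the palette, the one-pixel line width, and the legend format are all hypothetical choices made for illustration, and the frame is a plain nested list rather than a real video frame.

```python
from typing import Dict, List, Tuple

Color = Tuple[int, int, int]
Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), inclusive pixel coordinates
Frame = List[List[Color]]         # frame[y][x] is an RGB pixel

# Hypothetical palette; the abstract does not specify which colors are used.
PALETTE: Dict[str, Color] = {
    "red": (255, 0, 0),
    "green": (0, 255, 0),
    "blue": (0, 0, 255),
    "yellow": (255, 255, 0),
}


def blank_frame(width: int, height: int, fill: Color = (0, 0, 0)) -> Frame:
    """Create a solid-color frame standing in for a decoded video frame."""
    return [[fill] * width for _ in range(height)]


def center(box: Box) -> Tuple[int, int]:
    x1, y1, x2, y2 = box
    return (x1 + x2) // 2, (y1 + y2) // 2


def draw_box(frame: Frame, box: Box, color: Color) -> None:
    """Draw a one-pixel box outline, clipped to the frame."""
    h, w = len(frame), len(frame[0])
    x1, y1, x2, y2 = box
    x1, x2 = max(0, x1), min(w - 1, x2)
    y1, y2 = max(0, y1), min(h - 1, y2)
    if x1 > x2 or y1 > y2:
        return  # box lies entirely outside the frame
    for x in range(x1, x2 + 1):
        frame[y1][x] = color
        frame[y2][x] = color
    for y in range(y1, y2 + 1):
        frame[y][x1] = color
        frame[y][x2] = color


def draw_trail(frame: Frame, centers: List[Tuple[int, int]], color: Color) -> None:
    """Connect successive box centers with straight one-pixel segments."""
    h, w = len(frame), len(frame[0])
    for (xa, ya), (xb, yb) in zip(centers, centers[1:]):
        steps = max(abs(xb - xa), abs(yb - ya), 1)
        for i in range(steps + 1):
            x = round(xa + (xb - xa) * i / steps)
            y = round(ya + (yb - ya) * i / steps)
            if 0 <= x < w and 0 <= y < h:
                frame[y][x] = color


def render_keyframe(frame: Frame, tracks: Dict[str, List[Box]]) -> str:
    """Overlay each object's trail and current box; return the text legend.

    `tracks` maps an object name to its boxes in temporal order, the last
    entry being the box in this keyframe. Objects beyond the palette size
    are skipped in this sketch.
    """
    legend = []
    for (name, boxes), (color_name, rgb) in zip(tracks.items(), PALETTE.items()):
        draw_trail(frame, [center(b) for b in boxes], rgb)
        draw_box(frame, boxes[-1], rgb)
        legend.append(f"{color_name}: {name}")
    return "; ".join(legend)


frame = blank_frame(32, 24)
tracks = {
    "ball": [(2, 2, 6, 6), (10, 2, 14, 6)],  # moved right since the previous frame
    "cube": [(20, 10, 28, 18)],              # single observation, no trail
}
legend = render_keyframe(frame, tracks)
print(legend)  # red: ball; green: cube
```

In this sketch the only text passed alongside the frame is the legend string, while the per-frame coordinates live in the pixels; a text-coordinate approach would instead serialize every box in `tracks` as tokens.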

Zekun Qian, Ruize Han, Wei Feng • 2026

Related benchmarks

Task	Dataset	Result	Rank
Video Question Answering	IntentQA	--	35
Video Reasoning	STAR	Score 67.7	19
Video QA	NExT-QA	Accuracy 79.7	7
Video Reasoning	Perception (test)	Accuracy 67.2	5
Video Reasoning	CLEVRER	Accuracy 78.5	4
