
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

About

We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini-1.5-Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination testing, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin • 2025
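
The abstract mentions applying DPO training on automatically constructed preference data. For readers unfamiliar with the objective, here is a minimal sketch of the standard DPO loss for a single preference pair. It is not taken from the Tarsier2 code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be sequence-level log-probabilities of the preferred and rejected descriptions under the trained model and a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    All inputs are summed log-probabilities of a full response.
    beta=0.1 is an illustrative default, not a value from the paper.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled preference logit: positive when the policy favors the chosen response.
    logit = beta * (chosen_margin - rejected_margin)

    # Loss is -log(sigmoid(logit)), computed in a numerically stable way.
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))
```

When the policy and reference agree exactly, the logit is zero and the loss equals log 2; the loss falls below that as the policy shifts probability toward the preferred description and rises above it when it shifts toward the rejected one.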

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Video Question Answering | MVBench (test) | Accuracy | 71.5 | 45 |
| Long Video Question Answering | MLVU | M-Avg | 67.9 | 39 |
| Long Video Question Answering | Long VideoBench (val) | Accuracy | 58.6 | 36 |
| Camera movement understanding | CameraBench 10K-sample VQA subset 1.0 (test) | Translation (In) Error | 61.5 | 24 |
| Visual Question Answering | CameraBench | Motion Steadiness Accuracy | 0.531 | 21 |
| Grounding | E.T.Bench | TVG F1 | 38.4 | 20 |
| Video Grounding | E.T. Bench-Grounding (test) | TVG F1 | 38.4 | 19 |
| Video Captioning | E.T. Bench-Captioning (test) | DVC F1 | 46.5 | 16 |
| Detailed Description Matching | EventHallusion | Accuracy (Entire) | 54.6 | 13 |
| Embodied Question Answering | RoboVQA | BLEU-1 | 77.1 | 13 |
Showing 10 of 26 rows
