
Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding

About

We introduce Tarsier2, a state-of-the-art large vision-language model (LVLM) designed for generating detailed and accurate video descriptions, while also exhibiting superior general video understanding capabilities. Tarsier2 achieves significant advancements through three key upgrades: (1) Scaling pre-training data from 11M to 40M video-text pairs, enriching both volume and diversity; (2) Performing fine-grained temporal alignment during supervised fine-tuning; (3) Using model-based sampling to automatically construct preference data and applying DPO training for optimization. Extensive experiments show that Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in detailed video description tasks. On the DREAM-1K benchmark, Tarsier2-7B improves F1 by 2.8% over GPT-4o and 5.8% over Gemini-1.5-Pro. In human side-by-side evaluations, Tarsier2-7B shows a +8.6% performance advantage over GPT-4o and +24.9% over Gemini-1.5-Pro. Tarsier2-7B also sets new state-of-the-art results across 15 public benchmarks, spanning tasks such as video question-answering, video grounding, hallucination test, and embodied question-answering, demonstrating its versatility as a robust generalist vision-language model.
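For readers unfamiliar with the DPO step mentioned in upgrade (3), the following is a minimal sketch of the standard DPO loss for a single preference pair. It follows the generic published DPO formulation and is not taken from the Tarsier2 code. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the abstract does not state the hyperparameters or how log-probabilities were computed.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for one (chosen, rejected) pair.

    Inputs are sequence-level log-probabilities under the trained policy
    and a frozen reference model. beta is an assumed, illustrative value.
    """
    # How much more the policy favors each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed in a numerically stable way.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

The loss falls below log 2 when the policy prefers the chosen description more strongly than the reference model does, and rises above it in the opposite case.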

Liping Yuan, Jiawei Wang, Haomiao Sun, Yuchen Zhang, Yuan Lin • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MVBench (test) | Accuracy | 71.5 | 38 |
| Long Video Question Answering | Long VideoBench (val) | Accuracy | 58.6 | 36 |
| Camera movement understanding | CameraBench 10K-sample VQA subset 1.0 (test) | Translation (In) Error | 61.5 | 24 |
| Long Video Question Answering | MLVU | M-Avg | 67.9 | 22 |
| Grounding | E.T.Bench | TVG F1 | 38.4 | 20 |
| Video Grounding | E.T. Bench-Grounding (test) | TVG F1 | 38.4 | 19 |
| Visual Question Answering | CameraBench | Motion Steadiness Accuracy | 0.531 | 17 |
| Video Captioning | E.T. Bench-Captioning (test) | DVC F1 | 46.5 | 16 |
| Detailed Description Matching | EventHallusion | Accuracy (Entire) | 54.6 | 13 |
| Embodied Question Answering | RoboVQA | BLEU-1 | 77.1 | 13 |
Showing 10 of 26 rows
