PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding

About

Vision-language models are integral to computer vision research, yet many high-performing models remain closed-source, obscuring their data, design, and training recipes. The research community has responded by using distillation from black-box models to label training data, achieving strong benchmark results at the cost of measurable scientific progress: without knowing the details of the teacher model and its data sources, scientific progress remains difficult to measure. In this paper, we study building a Perception Language Model (PLM) in a fully open and reproducible framework for transparent research in image and video understanding. We analyze standard training pipelines without distillation from proprietary models and explore large-scale synthetic data to identify critical data gaps, particularly in detailed video understanding. To bridge these gaps, we release 2.8M human-labeled instances of fine-grained video question-answer pairs and spatio-temporally grounded video captions. Additionally, we introduce PLM-VideoBench, a suite for evaluating challenging video understanding tasks, focusing on the ability to reason about the "what", "where", "when", and "how" of a video. We make our work fully reproducible by providing data, training recipes, code, and models. https://github.com/facebookresearch/perception_models
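The abstract describes the released data only as fine-grained video question-answer pairs, without specifying a schema or metric. The sketch below is purely illustrative: the record fields (`video_id`, `question`, `answer`) and the exact-match accuracy proxy are assumptions, not the actual PLM data format or the PLM-VideoBench scoring protocol.

```python
# Hypothetical sketch of scoring video-QA predictions against
# human-labeled ground truth. Field names and the exact-match
# metric are illustrative assumptions, not the PLM schema.
from dataclasses import dataclass


@dataclass
class VideoQARecord:
    video_id: str  # assumed identifier field
    question: str
    answer: str    # human-labeled ground-truth answer


def exact_match_accuracy(records, predictions):
    """Fraction of predictions that exactly match the ground truth,
    ignoring case and extra whitespace (a common VQA-style proxy)."""
    def norm(s):
        return " ".join(s.lower().split())
    correct = sum(
        norm(predictions[r.video_id]) == norm(r.answer) for r in records
    )
    return correct / len(records)


records = [
    VideoQARecord("v1", "What does the person pick up?", "a red mug"),
    VideoQARecord("v2", "When does the dog enter?", "after the door opens"),
]
preds = {"v1": "A red mug", "v2": "before the door opens"}
print(exact_match_accuracy(records, preds))  # → 0.5
```

A real harness would load the released annotations from the repository above and use the benchmark's own metrics; this only shows the general shape of such an evaluation loop.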

Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, Miguel Martin, Huiyu Wang, Hanoona Rasheed, Peize Sun, Po-Yao Huang, Daniel Bolya, Nikhila Ravi, Shashank Jain, Tammy Stark, Shane Moon, Babak Damavandi, Vivian Lee, Andrew Westbury, Salman Khan, Philipp Krähenbühl, Piotr Dollár, Lorenzo Torresani, Kristen Grauman, Christoph Feichtenhofer • 2025

Related benchmarks

Task                                 | Dataset              | Metric     | Result | Rank
-------------------------------------|----------------------|------------|--------|-----
Text-based Visual Question Answering | TextVQA (val)        | Accuracy   | 86.5   | 262
Long Video Understanding             | LongVideoBench (val) | Accuracy   | 57.9   | 210
Video Question Answering             | NExT-QA (test)       | Accuracy   | 84.1   | 204
Multimodal Understanding             | MMMU (val)           | MMMU Score | 46.1   | 152
Video Understanding                  | MVBench (test)       | Accuracy   | 77.1   | 151
Visual Question Answering            | VQA v2 (val)         | Accuracy   | 85.6   | 144
Multimodal Reasoning                 | MMStar               | --         | --     | 143
Mathematical Reasoning               | MathVista (testmini) | Accuracy   | 59.9   | 103
Video Question Answering             | MVBench              | Accuracy   | 77.1   | 90
Visual Question Answering            | ChartQA (test)       | Accuracy   | 85.5   | 86

Showing 10 of 104 rows.
