POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

About

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. To address this, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes, focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains optimal performance, while on long-form general visual understanding, the standby mode retains 97.7% to 99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
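To make the 1/40 to 1/10 range concrete, the arithmetic can be sketched as below. This is only an illustration of the stated ratios, not the authors' token-scaling method: the workload (1,024 frames at 256 visual tokens per frame) and the helper name `standby_token_budget` are hypothetical assumptions that do not come from the paper.

```python
from fractions import Fraction


def standby_token_budget(full_tokens: int, ratio: Fraction) -> int:
    """Return how many visual tokens remain when only `ratio` of them is kept.

    Hypothetical helper for illustration; rounds down to a whole token.
    """
    return full_tokens * ratio.numerator // ratio.denominator


# Assumed workload (not from the paper): 1,024 frames, 256 visual tokens each.
full_tokens = 1024 * 256

# The two ends of the range quoted in the abstract for standby mode.
low = standby_token_budget(full_tokens, Fraction(1, 40))
high = standby_token_budget(full_tokens, Fraction(1, 10))

print(f"full sequence: {full_tokens} visual tokens")
print(f"standby mode:  {low} to {high} visual tokens")
```

Under these assumed numbers, a sequence of roughly 262 thousand visual tokens would shrink to somewhere between about 6.5 thousand and 26 thousand tokens in standby mode.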

Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Xiao Zhou, Jie Zhou, Weidi Xie, Yanfeng Wang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Question Answering | ActivityNet-QA | Accuracy 55.4 | 376 |
| Video Question Answering | LongVideoBench | Accuracy 59.3 | 180 |
| Video Question Answering | EgoSchema | Accuracy 61.6 | 161 |
| Video Understanding | EgoSchema | -- | 158 |
| Video Question Answering | MLVU | Accuracy 72.1 | 143 |
| Long Video Understanding | LVBench | Accuracy 44.5 | 133 |
| Video Question Answering | LVBench | Accuracy 48.6 | 108 |
| Video Understanding | Opencompass Video Benchmark | MVBench Score 61 | 17 |
| Multimodal Understanding | Opencompass Image Benchmark (val) | MMBench Accuracy 82.1 | 12 |
| Video Understanding | TemporalBench | Accuracy 65.1 | 7 |
Showing 10 of 14 rows
