POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs

About

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. To address this, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes, focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, focus mode retains optimal performance, while on long-form general visual understanding, standby mode retains 97.7-99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
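
To give a sense of scale for the 1/40 to 1/10 reduction quoted above, the arithmetic can be sketched as follows. This is only an illustration: the function name, the frame count, and the tokens-per-frame figure are hypothetical and are not taken from the paper, and the sketch assumes the ratios apply to the full visual token count.

```python
# Reduction range quoted in the abstract: standby mode uses
# between 1/40 and 1/10 of the visual tokens.
STANDBY_MIN_DIVISOR = 10   # 1/10 of the tokens (least aggressive)
STANDBY_MAX_DIVISOR = 40   # 1/40 of the tokens (most aggressive)


def standby_token_range(full_tokens: int) -> tuple[int, int]:
    """Return the (low, high) visual token counts implied by the quoted range.

    `full_tokens` is a hypothetical full-resolution visual token count;
    results are rounded down to whole tokens.
    """
    if full_tokens < 0:
        raise ValueError("full_tokens must be non-negative")
    low = full_tokens // STANDBY_MAX_DIVISOR
    high = full_tokens // STANDBY_MIN_DIVISOR
    return low, high


# Hypothetical video: 1,000 frames at 100 visual tokens per frame.
full = 1_000 * 100
print(standby_token_range(full))  # (2500, 10000)
```

Under these assumed numbers, a 100,000-token visual sequence would shrink to somewhere between 2,500 and 10,000 tokens in standby mode.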

Haicheng Wang, Yuan Liu, Yikun Liu, Zhemeng Yu, Zhongyin Zhao, Yangxiu You, Zilin Yu, Le Tian, Xiao Zhou, Jie Zhou, Weidi Xie, Yanfeng Wang • 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	ActivityNet-QA	Accuracy 55.4	418
Long Video Understanding	LVBench	Accuracy 44.5	218
Video Question Answering	LongVideoBench	Accuracy 59.3	210
Video Question Answering	MLVU	Accuracy 72.1	194
Video Understanding	EgoSchema	--	185
Video Question Answering	EgoSchema	Accuracy 61.6	161
Video Question Answering	LVBench	Accuracy 48.6	108
Video Understanding	Opencompass Video Benchmark	MVBench Score 61	17
Multimodal Understanding	Opencompass Image Benchmark (val)	MMBench Accuracy 82.1	12
Video Understanding	TemporalBench	Accuracy 65.1	7

Showing 10 of 14 rows
