POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
About
Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable capabilities in cross-modal understanding and generation. However, the rapid growth of visual token sequences, especially in long-video and streaming scenarios, poses a major challenge to their scalability and real-world deployment. To address this, we introduce POINTS-Long, a native dual-mode MLLM featuring dynamic visual token scaling inspired by the human visual system. The model supports two complementary perception modes, focus mode and standby mode, enabling users to dynamically trade off efficiency and accuracy during inference. On fine-grained visual tasks, the focus mode retains optimal performance, while on long-form general visual understanding, the standby mode retains 97.7% to 99.7% of the original accuracy using only 1/40 to 1/10 of the visual tokens. Moreover, POINTS-Long natively supports streaming visual understanding via a dynamically detachable KV-cache design, allowing efficient maintenance of ultra-long visual memory. Our work provides new insights into the design of future MLLMs and lays the foundation for adaptive and efficient long-form visual understanding.
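To make the 1/40 to 1/10 token ratios concrete, the following is a minimal arithmetic sketch of the visual token budget in standby mode. The frame count and tokens-per-frame figures are hypothetical values chosen for illustration; they are not taken from the paper, and the helper `standby_token_budget` is not part of any POINTS-Long API.

```python
from fractions import Fraction


def standby_token_budget(num_frames: int, tokens_per_frame: int, ratio: Fraction) -> int:
    """Return the number of visual tokens kept when only `ratio` of the
    full token count is retained (rounded down)."""
    full_tokens = num_frames * tokens_per_frame
    return int(full_tokens * ratio)


# Hypothetical workload: 3600 sampled frames, 256 visual tokens per frame.
NUM_FRAMES = 3600
TOKENS_PER_FRAME = 256

full = NUM_FRAMES * TOKENS_PER_FRAME
low = standby_token_budget(NUM_FRAMES, TOKENS_PER_FRAME, Fraction(1, 40))
high = standby_token_budget(NUM_FRAMES, TOKENS_PER_FRAME, Fraction(1, 10))

print(f"full token count:   {full}")   # 921600
print(f"standby at 1/40:    {low}")    # 23040
print(f"standby at 1/10:    {high}")   # 92160
```

Under these assumed numbers, standby mode would shrink a sequence of roughly 920K visual tokens to between about 23K and 92K tokens, which is the range the abstract pairs with 97.7% to 99.7% retained accuracy.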
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | ActivityNet-QA | Accuracy | 55.4 | 376 |
| Video Question Answering | LongVideoBench | Accuracy | 59.3 | 180 |
| Video Question Answering | EgoSchema | Accuracy | 61.6 | 161 |
| Video Understanding | EgoSchema | -- | -- | 158 |
| Video Question Answering | MLVU | Accuracy | 72.1 | 143 |
| Long Video Understanding | LVBench | Accuracy | 44.5 | 133 |
| Video Question Answering | LVBench | Accuracy | 48.6 | 108 |
| Video Understanding | Opencompass Video Benchmark | MVBench Score | 61 | 17 |
| Multimodal Understanding | Opencompass Image Benchmark (val) | MMBench Accuracy | 82.1 | 12 |
| Video Understanding | TemporalBench | Accuracy | 65.1 | 7 |