ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding

About

While multimodal large language models (MLLMs) have shown remarkable success across a wide range of tasks, long-form video understanding remains a significant challenge. In this study, we focus on video understanding by MLLMs. This task is challenging because processing a full stream of RGB frames is computationally intractable and highly redundant, as self-attention have quadratic complexity with sequence length. In this paper, we propose ReMoRa, a video MLLM that processes videos by operating directly on their compressed representations. A sparse set of RGB keyframes is retained for appearance, while temporal dynamics are encoded as a motion representation, removing the need for sequential RGB frames. These motion representations act as a compact proxy for optical flow, capturing temporal dynamics without full frame decoding. To refine the noise and low fidelity of block-based motions, we introduce a module to denoise and generate a fine-grained motion representation. Furthermore, our model compresses these features in a way that scales linearly with sequence length. We demonstrate the effectiveness of ReMoRa through extensive experiments across a comprehensive suite of long-video understanding benchmarks. ReMoRa outperformed baseline methods on multiple challenging benchmarks, including LongVideoBench, NExT-QA, and MLVU.

Daichi Yashima, Shuhei Kurita, Yusuke Oda, Komei Sugiura• 2026

Related benchmarks

Task	Dataset	Result
Video Question Answering	ActivityNet-QA (test)	Accuracy60.5	288
Video Question Answering	MSVD-QA (test)	Accuracy73.1	279
Long Video Understanding	LongVideoBench	Score60.8	269
Long Video Understanding	MLVU	--	205
General Video Understanding	Video-MME	Accuracy64.4	82
Video Perception	Perception (test)	--	57
Multi-modal Video Evaluation	VideoMME	Score64.4	42
General Video Understanding	NEXT-QA	Accuracy84.2	21
Long Video Understanding	LongVideoBench (LVB)	Accuracy60.8	21
General Video Understanding	PerceptionTest (PercTest)	Accuracy67.7	21

Showing 10 of 11 rows

Other info

Follow for update

@wizwand_team Discord