HY-Himmel Technical Report: Hierarchical Interleaved Multi-stream Motion Encoding for Long Video Understanding

About

Long-video understanding with multimodal language models suffers from three compounding bottlenecks: heavy decode cost to obtain dense RGB frames, quadratic token growth with frame count, and weak motion perception under sparse keyframe sampling. We present HY-Himmel, a hierarchical video-language framework that allocates semantic and motion capacity separately. A small set of sparse anchor I-frames is routed to the expensive host ViT to ground object identity and scene layout, while the far denser inter-frame intervals are encoded by a lightweight compressed-domain tri-stream adapter that distils motion evidence from motion-vector maps, residual maps, and I-frame context into aligned motion tokens. These tokens are injected into the LLM via a differentiable placeholder mechanism after a dedicated Stage-1 contrastive alignment that places the motion representation in a geometry compatible with the frozen visual backbone. On Video-MME, HY-Himmel surpasses the dense 32-frame baseline by +2.3 pp (61.2% to 63.5%) while using 3.6× fewer context tokens. Extensive ablations over stream composition, motion encoder family, fusion mode, alignment objective, anchor count, LoRA rank, and video duration confirm that the full tri-stream is necessary and sufficient for the observed gains.
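The placeholder mechanism mentioned in the abstract is not specified in detail here. One common way to realise this kind of injection is to reserve a special token id in the prompt and overwrite the embeddings at those positions with the adapter's motion tokens. The sketch below illustrates only that index bookkeeping in NumPy. The function name `inject_motion_tokens`, the `placeholder_id` argument, and the array shapes are hypothetical and are not taken from the report. In an autodiff framework the same scatter would let gradients flow back to the motion tokens; NumPy does not show that part.

```python
import numpy as np

def inject_motion_tokens(token_ids, text_embeds, motion_tokens, placeholder_id):
    """Overwrite embeddings at placeholder positions with motion tokens.

    Hypothetical helper for illustration, not the authors' implementation.
    token_ids:     (seq_len,) integer token ids of the prompt
    text_embeds:   (seq_len, dim) embeddings looked up for those ids
    motion_tokens: (num_motion, dim) tokens produced by a motion adapter
    """
    token_ids = np.asarray(token_ids)
    text_embeds = np.asarray(text_embeds, dtype=float)
    motion_tokens = np.asarray(motion_tokens, dtype=float)

    # Positions in the sequence reserved for motion tokens, in order.
    positions = np.flatnonzero(token_ids == placeholder_id)
    if len(positions) != len(motion_tokens):
        raise ValueError(
            f"{len(positions)} placeholders but {len(motion_tokens)} motion tokens"
        )

    # Copy so the caller's embedding array is left untouched.
    fused = text_embeds.copy()
    fused[positions] = motion_tokens
    return fused
```

For example, a four-token prompt with two placeholders would receive two motion tokens, in sequence order, at the placeholder positions, while the remaining embeddings stay as they were.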

Haopeng Jin, Hongzhu Yi, Wenlong Zhao, Jinwen Luo, Shani Ye, Zhenyu Guan, Shiquan Dong, Tiankun Yang, Tao Yu • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Mathematical Reasoning | MathVista | Accuracy 71.8 | 366 |
| Visual Mathematical Reasoning | MathVision | Accuracy 26.9 | 254 |
| Long Video Understanding | MLVU | Accuracy 72.6 | 205 |
| Video Understanding | Video-MME without subtitles | Overall Score 64.9 | 108 |
| Video Question Answering | LongVideoBench (val) | Accuracy 62.5 | 87 |
| Multi-modal Video Understanding | MVBench | Accuracy 75.6 | 83 |
| Video Understanding | Video-MME | Accuracy 66.3 | 36 |
