Atom: Efficient On-Device Video-Language Pipelines Through Modular Reuse

About

Recent advances in video-language models have enabled powerful applications like video retrieval, captioning, and assembly. However, executing such multi-stage pipelines efficiently on mobile devices remains challenging due to redundant model loads and fragmented execution. We introduce Atom, an on-device system that restructures video-language pipelines for fast and efficient execution. Atom decomposes a billion-parameter model into reusable modules, such as the visual encoder and language decoder, and reuses them across subtasks like captioning, reasoning, and indexing. This reuse-centric design eliminates repeated model loading and enables parallel execution, reducing end-to-end latency without sacrificing performance. On commodity smartphones, Atom achieves 27–33% faster execution compared to non-reuse baselines, with only a marginal performance drop (≤ 2.3 Recall@1 in retrieval, ≤ 1.5 CIDEr in captioning). These results position Atom as a practical, scalable approach for efficient video-language understanding on edge devices.
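
The reuse idea in the abstract can be illustrated with a small sketch. This is not Atom's implementation: the `ModuleRegistry` class, the loader stubs, and the mapping from subtasks to modules are all hypothetical, chosen only to show how caching shared modules reduces the number of model loads compared with loading per subtask.

```python
from typing import Callable, Dict, List


class ModuleRegistry:
    """Hypothetical cache: loads each named module at most once."""

    def __init__(self, loaders: Dict[str, Callable[[], object]]):
        self._loaders = loaders
        self._cache: Dict[str, object] = {}
        self.load_count = 0  # number of loads actually performed

    def get(self, name: str) -> object:
        # Load on first request only; later requests reuse the cached module.
        if name not in self._cache:
            self._cache[name] = self._loaders[name]()
            self.load_count += 1
        return self._cache[name]


# Assumed mapping of subtasks to the modules they need (illustrative only).
SUBTASKS: Dict[str, List[str]] = {
    "captioning": ["visual_encoder", "language_decoder"],
    "reasoning": ["language_decoder"],
    "indexing": ["visual_encoder"],
}


def count_loads_with_reuse(subtasks: Dict[str, List[str]]) -> int:
    """Run every subtask against one shared registry and count loads."""
    names = {m for modules in subtasks.values() for m in modules}
    # Each loader is a stub standing in for an expensive weight load.
    loaders = {n: (lambda n=n: f"<{n} weights>") for n in names}
    registry = ModuleRegistry(loaders)
    for modules in subtasks.values():
        for m in modules:
            registry.get(m)
    return registry.load_count


def count_loads_without_reuse(subtasks: Dict[str, List[str]]) -> int:
    """Baseline: every subtask loads its own copy of each module it needs."""
    return sum(len(modules) for modules in subtasks.values())


if __name__ == "__main__":
    print("loads with reuse:", count_loads_with_reuse(SUBTASKS))
    print("loads without reuse:", count_loads_without_reuse(SUBTASKS))
```

Under this assumed mapping, the shared registry performs one load per distinct module, while the baseline performs one load per module per subtask. The sketch covers only the load-count aspect; it does not model parallel execution or latency.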

Kunjal Panchal, Saayan Mitra, Somdeb Sarkhel, Haoliang Wang, Ishita Dasgupta, Gang Wu, Hui Guan • 2025

Related benchmarks

Task                      Dataset          Result           Rank
Text-to-Video Retrieval   MSR-VTT          Recall@1 57.2    313
Text-to-Video Retrieval   MSVD             R@1 69.1         218
Video Captioning          MSR-VTT (test)   CIDEr 83.1       121
Video Captioning          MSVD (test)      CIDEr 1.664      111
Video Retrieval           DiDeMo (test)    R@1 60           7
