HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression

About

Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories, two capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations, and a Chapter Summary that composes them into concise, story-centric summaries. Rather than prompting for chapters directly, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built upon these designs, our HiVid-Narrator framework achieves higher narrative quality with fewer input tokens than existing methods.
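To make the dual-granularity annotation idea concrete, here is a minimal Python sketch of how event-level observations and chapter-level summaries could be held together on a shared timeline. This is only an illustration: the class names, field names, the `attach_events` helper, and the sample values are all hypothetical and are not the actual E-HVC schema or construction pipeline.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Event:
    """One event-level observation (fine granularity); times in seconds."""
    start: float
    end: float
    observation: str


@dataclass
class Chapter:
    """One chapter (coarse granularity) with a title and a short summary."""
    start: float
    end: float
    title: str
    summary: str
    events: List[Event] = field(default_factory=list)


def attach_events(chapters: List[Chapter], events: List[Event]) -> List[Event]:
    """Attach each event to the chapter whose [start, end) span contains the
    event's start time. Returns the events that fit no chapter.
    (Hypothetical helper, assumed for illustration only.)"""
    unassigned = []
    for ev in sorted(events, key=lambda e: e.start):
        for ch in chapters:
            if ch.start <= ev.start < ch.end:
                ch.events.append(ev)
                break
        else:
            # No chapter span covers this event's start time.
            unassigned.append(ev)
    return unassigned


# Invented sample data, not taken from the dataset.
chapters = [
    Chapter(0.0, 10.0, "Unboxing", "The host opens the package."),
    Chapter(10.0, 20.0, "Demo", "The host demonstrates the product."),
]
events = [
    Event(1.0, 3.0, "A sealed box is placed on the table."),
    Event(4.0, 9.0, "The box is opened and the product is lifted out."),
    Event(12.0, 15.0, "The product is switched on."),
    Event(25.0, 27.0, "Closing shot."),
]
leftover = attach_events(chapters, events)
```

In this sketch each chapter carries its own time span plus the events that fall inside it, so a chapter summary can always be traced back to the time-stamped observations beneath it.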

Haoxuan Li, Mengyan Li, Junjun Zheng • 2026

Related benchmarks

Task	Dataset	Result
Dense Video Captioning	ActivityNet Captions	METEOR 7.8	48
Dense Video Captioning	YouCook2	SODA_c 2.4	40
Video Captioning	E-HVC-Bench 1.0 (test)	SODA_c 14.48	11
Video Narrating	E-HVC-Bench (test)	SODA_c 14.48	4
