LiveCC: Learning Video LLM with Streaming Speech Transcription at Scale
About
Recent video large language models (Video LLMs) often depend on costly human annotations or proprietary model APIs (e.g., GPT-4o) to produce training data, which limits their training at scale. In this paper, we explore large-scale training of Video LLMs with cheap automatic speech recognition (ASR) transcripts. Specifically, we propose a novel streaming training approach that densely interleaves ASR words and video frames according to their timestamps. Compared to previous studies in vision-language representation with ASR, our method naturally fits the streaming characteristics of ASR, thus enabling the model to learn temporally aligned, fine-grained vision-language modeling. To support the training algorithm, we introduce a data production pipeline to process YouTube videos and their closed captions (CC, same as ASR), resulting in the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for high-quality supervised fine-tuning (SFT). Remarkably, even without SFT, the ASR-only pre-trained LiveCC-7B-Base model demonstrates competitive general video QA performance and exhibits a new capability in real-time video commentary. To evaluate this, we carefully design a new LiveSports-3K benchmark, using LLM-as-a-judge to assess free-form commentary. Experiments show our final LiveCC-7B-Instruct model can surpass advanced 72B models (Qwen2.5-VL-72B-Instruct, LLaVA-Video-72B) in commentary quality even when working in real-time mode. Meanwhile, it achieves state-of-the-art results at the 7B/8B scale on popular video QA benchmarks such as VideoMME and OVOBench, demonstrating the broad generalizability of our approach. All resources of this paper have been released at https://showlab.github.io/livecc.
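The timestamp-based interleaving described in the abstract can be illustrated with a minimal sketch. This is not the released LiveCC implementation: the `Word` type, the `interleave` function, the `fps=2.0` default, and the `<frame>` placeholder are all hypothetical names and values chosen for illustration. The sketch only shows the general idea of placing each ASR word after the frame whose time interval contains the word's start time.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # start time in seconds, taken from the ASR transcript


def interleave(words, duration, fps=2.0, frame_token="<frame>"):
    """Build a flat sequence of frame placeholders and ASR words.

    Each frame placeholder is followed by the words whose start time falls
    inside that frame's interval. All parameters are illustrative
    assumptions, not values taken from the paper.
    """
    interval = 1.0 / fps
    num_frames = int(duration * fps)
    ordered = sorted(words, key=lambda w: w.start)

    sequence = []
    i = 0
    for k in range(num_frames):
        interval_end = (k + 1) * interval
        sequence.append(frame_token)
        # Attach every word that starts before this frame's interval ends.
        while i < len(ordered) and ordered[i].start < interval_end:
            sequence.append(ordered[i].text)
            i += 1
    return sequence


if __name__ == "__main__":
    transcript = [Word("the", 0.1), Word("player", 0.6), Word("shoots", 1.2)]
    print(interleave(transcript, duration=2.0, fps=2.0))
```

In this sketch, words that start at or after `duration` are dropped, and a frame with no speech is simply followed by the next frame placeholder.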
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Question Answering | Video-MME Long Duration 1.0 | Accuracy (w/o subtitles) | 53.7 | 34 |
| Common and General Video Commentary | Ego4D | F1 | 17.12 | 18 |
| Streaming video captioning | LiveSports3k (test) | Winrate (%) | 43.2 | 10 |
| Common and General Video Commentary | Black Myth Wukong | LiveU | 5.76 | 9 |
| Live Gaming Commentary | Live Gaming Benchmark Solo Commentary | LiveU | 5.84 | 9 |
| Live Gaming Commentary | Live Gaming Benchmark Co-Commentary | LiveU | 4.29 | 9 |
| Live Gaming Commentary | Live Gaming Benchmark Overall | LiveU Score | 4.9 | 9 |
| Response Quality | Live Gaming Benchmark Solo Commentary | Time Difference | 1.04 | 9 |
| Response Quality | Live Gaming Benchmark Co-Commentary | Time Difference | 2.01 | 9 |
| Response Quality | Live Gaming Benchmark Overall | Time Difference | 2.13 | 9 |