SOVABench: A Vehicle Surveillance Action Retrieval Benchmark for Multimodal Large Language Models
About
Automatic identification of events and recurrent behavior analysis are critical for video surveillance. However, most existing content-based video retrieval benchmarks focus on scene-level similarity and do not evaluate the action discrimination required in surveillance. To address this gap, we introduce SOVABench (Surveillance Opposite Vehicle Actions Benchmark), a real-world retrieval benchmark built from surveillance footage and centered on vehicle-related actions. SOVABench defines two evaluation protocols (inter-pair and intra-pair) to assess cross-action discrimination and temporal direction understanding. Although action distinctions are generally intuitive for human observers, our experiments show that they remain challenging for state-of-the-art vision and multimodal models. Leveraging the visual reasoning and instruction-following capabilities of Multimodal Large Language Models (MLLMs), we present a training-free framework for producing interpretable embeddings from MLLM-generated descriptions for both images and videos. The framework achieves strong performance on SOVABench as well as on several spatial and counting benchmarks where contrastive Vision-Language Models often fail. The code, annotations, and instructions to construct the benchmark are publicly available.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Counting | CountBench | Accuracy | 76.4 | 52 |
| Spatial Reasoning | Visual Spatial Reasoning (VSR) | Accuracy | 53 | 48 |
| Video Action Retrieval | SOVABench Inter-pair 1.0 | mAP | 38.3 | 21 |
| Video Action Retrieval | SOVABench Intra-pair 1.0 | Pair-mAP | 53.9 | 21 |
| Visual Spatial Reasoning | What's Up (Split A) | Accuracy | 78.6 | 20 |
| Visual Spatial Reasoning | What's Up (Split B) | Accuracy | 46.3 | 20 |
| Object Counting | Visual7W Count | Accuracy | 55 | 6 |
| Spatial Understanding | SpatialBench Indoor | Accuracy | 38.6 | 6 |
| Spatial Understanding | SpatialBench Outdoor | Accuracy | 37.7 | 6 |
| Spatial Understanding | Spatial Understanding Suite | Spatial Avg. | 47.9 | 6 |
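
The retrieval rows above are reported as mean average precision (mAP). As a rough illustration of that metric only, the following sketch computes mAP for embedding-based retrieval. It assumes cosine similarity for ranking, a leave-one-out gallery, and relevance defined as sharing the query's action label; these choices, and all function names, are hypothetical and are not taken from the SOVABench protocol. The Pair-mAP variant is not covered here because its definition is not given on this page.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def average_precision(ranked_relevance):
    """Average precision for one query, given relevance flags in ranked order."""
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(embeddings, labels):
    """Leave-one-out mAP: each item queries all the others (assumed setup)."""
    aps = []
    for q, (q_emb, q_label) in enumerate(zip(embeddings, labels)):
        gallery = [i for i in range(len(embeddings)) if i != q]
        # Skip queries that have no relevant item in the gallery.
        if not any(labels[i] == q_label for i in gallery):
            continue
        gallery.sort(key=lambda i: cosine(q_emb, embeddings[i]), reverse=True)
        aps.append(average_precision([labels[i] == q_label for i in gallery]))
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    # Toy 2-D embeddings with made-up action labels, for illustration only.
    toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    toy_labels = ["enter", "enter", "exit", "exit"]
    print(mean_average_precision(toy_embeddings, toy_labels))
```

In practice the embeddings would come from whichever model is being evaluated, for example text embeddings of MLLM-generated descriptions as in the framework described in the abstract; the toy vectors here only exercise the metric.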