ARGOS: Who, Where, and When in Agentic Multi-Camera Person Search

About

We introduce ARGOS, the first benchmark and framework that reformulates multi-camera person search as an interactive reasoning problem requiring an agent to plan, question, and eliminate candidates under information asymmetry. An ARGOS agent receives a vague witness statement and must decide what to ask, when to invoke spatial or temporal tools, and how to interpret ambiguous responses, all within a limited turn budget. Reasoning is grounded in a Spatio-Temporal Topology Graph (STTG) encoding camera connectivity and empirically validated transition times. The benchmark comprises 2,691 tasks across 14 real-world scenarios in three progressive tracks: semantic perception (Who), spatial reasoning (Where), and temporal reasoning (When). Experiments with four LLM backbones show the benchmark is far from solved (best TWS: 0.383 on Track 2, 0.590 on Track 3), and ablations confirm that removing domain-specific tools drops accuracy by up to 49.6 percentage points.

Myungchul Kim, Kwanyong Park, Junmo Kim, In So Kweon • 2026
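
The abstract describes the Spatio-Temporal Topology Graph (STTG) only at a high level: it encodes camera connectivity and validated transition times. The following is a minimal sketch of how such a graph could be used to eliminate candidates, not the authors' implementation. The class names, the `feasible` and `eliminate` helpers, the camera identifiers, and the time windows are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    """Directed edge between two cameras with an assumed travel-time window in seconds."""
    src: str
    dst: str
    min_s: float
    max_s: float


class STTG:
    """Toy spatio-temporal topology graph (hypothetical structure, not the paper's)."""

    def __init__(self, transitions):
        # Index edges by (source camera, destination camera) for direct lookup.
        self._edges = {(t.src, t.dst): t for t in transitions}

    def feasible(self, src, t_src, dst, t_dst):
        """Return True if moving from src at t_src to dst at t_dst fits an edge's window."""
        edge = self._edges.get((src, dst))
        if edge is None:
            return False  # cameras are not directly connected
        return edge.min_s <= (t_dst - t_src) <= edge.max_s


def eliminate(graph, anchor, candidates):
    """Keep only candidates whose sighting is reachable from the anchor sighting."""
    cam, t = anchor
    return [c for c in candidates if graph.feasible(cam, t, c["camera"], c["time"])]


# Made-up topology: A -> B takes 20-60 s, A -> C takes 90-180 s.
graph = STTG([
    Transition("cam_A", "cam_B", 20, 60),
    Transition("cam_A", "cam_C", 90, 180),
])

candidates = [
    {"id": "p1", "camera": "cam_B", "time": 140.0},  # 40 s after the anchor: within window
    {"id": "p2", "camera": "cam_B", "time": 105.0},  # 5 s after the anchor: too fast
    {"id": "p3", "camera": "cam_D", "time": 150.0},  # no edge from cam_A to cam_D
]

survivors = eliminate(graph, ("cam_A", 100.0), candidates)
survivor_ids = [c["id"] for c in survivors]
print(survivor_ids)
```

In this sketch a spatial tool call would correspond to the edge lookup and a temporal tool call to the window check; how ARGOS actually splits these responsibilities is not stated in the abstract.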

Related benchmarks

Task	Dataset	Result
Agentic Person Search (Spatial Reasoning)	Track 2 Spatial	TWS 38.3	5
Agentic Person Search (Temporal Reasoning)	Track 3 Temporal	TWS 59	5
Agentic Person Search	Track 1 (Who)	SR@1 81.1	4
