STING-BEE: Towards Vision-Language Model for Real-World X-ray Baggage Security Inspection
About
Advancements in Computer-Aided Screening (CAS) systems are essential for improving the detection of security threats in X-ray baggage scans. However, current datasets are limited in representing real-world, sophisticated threats and concealment tactics, and existing approaches are constrained by a closed-set paradigm with predefined labels. To address these challenges, we introduce STCray, the first multimodal X-ray baggage security dataset, comprising 46,642 image-caption paired scans across 21 threat categories, generated using an X-ray scanner for airport security. STCray is developed with our specialized protocol, which ensures domain-aware, coherent captions that in turn yield multimodal instruction-following data for X-ray baggage security. This allows us to train a domain-aware visual AI assistant named STING-BEE that supports a range of vision-language tasks, including scene comprehension, referring threat localization, visual grounding, and visual question answering (VQA), establishing novel baselines for multimodal learning in X-ray baggage security. Further, STING-BEE shows state-of-the-art generalization in cross-domain settings. Code, data, and models are available at https://divs1159.github.io/STING-BEE/.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | SIXray and PIDray combined (test) | Instance Location (IL) | 49.22 | 10 |
| Multi-class classification | SIXray + PIDray combined (test) | mAP | 30.1 | 9 |
| Visual Grounding | SIXray and PIDray combined (test) | Acc@0.5 | 12.5 | 7 |
| Multi-object grounding | Combined SIXray and PIDray 1.0 (test) | Acc@0.5 | 8.7 | 6 |
| Scene Comprehension | SIXray and PIDray combined (test) | F1 Score | 34.7 | 6 |
| Scene Comprehension | SIXray, PIDray, and COMPASS-XP cross-domain (test) | F1 | 34.69 | 6 |
| Referring Threat Localization | Combined SIXray and PIDray 1.0 (test) | Accuracy @ IoU 0.5 | 8.9 | 3 |
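
The grounding and localization rows above report accuracy at an IoU threshold of 0.5. As a rough illustration of how such a metric is commonly computed, here is a minimal sketch. It is not the benchmark's evaluation code: the function names, the `(x1, y1, x2, y2)` box format, the one-prediction-per-sample pairing, and the `>=` comparison at the threshold are all assumptions made for this example.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], gts: Sequence[Box], threshold: float = 0.5
) -> float:
    """Percentage of samples whose predicted box overlaps the ground truth
    with IoU >= threshold (hypothetical helper; one box per sample)."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou(p, g) >= threshold for p, g in zip(preds, gts))
    return 100.0 * hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 10, 10), (0, 0, 2, 2)]
    ground_truth = [(0, 0, 10, 5), (1, 1, 3, 3)]
    print(accuracy_at_iou(predictions, ground_truth))
```

In this toy input the first pair overlaps with IoU exactly 0.5 and counts as a hit under the assumed `>=` rule, while the second pair overlaps with IoU 1/7 and does not. Evaluation protocols differ on that boundary case and on how multiple boxes per image are matched, so the benchmark's own scripts remain the reference.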