ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
About
Egocentric interaction perception is an essential branch of research on human-environment interaction and lays the groundwork for next-generation intelligent systems. However, existing egocentric interaction understanding methods cannot produce coherent textual and pixel-level responses simultaneously in response to user queries, which limits their flexibility across varying downstream application requirements. To comprehend egocentric interactions exhaustively, this paper presents a novel task named Egocentric Interaction Reasoning and pixel Grounding (Ego-IRG). Taking an egocentric image and a query as input, Ego-IRG is the first task that aims to resolve interactions through three crucial steps: analyzing, answering, and pixel grounding, yielding fluent textual and fine-grained pixel-level responses. Another challenge is that existing datasets do not meet the requirements of the Ego-IRG task. To address this limitation, this paper creates the Ego-IRGBench dataset through extensive manual effort; it includes over 20k egocentric images with 1.6 million queries and corresponding multimodal responses about interactions. Moreover, we design a unified ANNEXE model that uses multimodal large language models to generate text- and pixel-level outputs, enabling a comprehensive interpretation of egocentric interactions. Experiments on Ego-IRGBench demonstrate the effectiveness of our ANNEXE model compared with other works.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | EgoHOS in-domain (test) | Left Hand IoU | 91.5 | 13 |
| Egocentric Hand-Object Segmentation | mini-HOI4D out-of-distribution (test) | IoU (Left Hand) | 68.06 | 11 |
| Egocentric Hand-Object Segmentation | EgoHOS out-of-domain (test) | Left Hand IoU | 92.45 | 11 |
| Hand-object segmentation | EgoHOS out-of-domain (test) | Left Hand Accuracy | 0.9703 | 10 |
| Hand-object segmentation | HOI4D mini | Left Hand Accuracy | 96.54 | 10 |
| Analyzing sub-task | Ego-IRGBench (val) | METEOR | 0.563 | 5 |
| Analyzing sub-task | Ego-IRGBench (test) | METEOR | 0.563 | 5 |
| Referring Image Segmentation | Ego-IRGBench (val) | cIoU | 35.14 | 5 |
| Referring Image Segmentation | Ego-IRGBench (test) | cIoU | 36.02 | 5 |
| Answering | Ego-IRGBench 1.0 (val) | METEOR | 36.3 | 4 |
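
For readers unfamiliar with the cIoU metric in the table, the sketch below shows how cumulative IoU is commonly computed in referring segmentation work: intersections and unions are summed over all samples before dividing, in contrast to averaging per-sample IoU. This is an illustration under that assumption, not the evaluation code used for Ego-IRGBench; the function names `cumulative_iou` and `mean_iou` and the toy masks are hypothetical.

```python
from typing import List, Sequence

# A binary mask as rows of 0/1 values (hypothetical toy representation).
Mask = Sequence[Sequence[int]]


def _inter_union(pred: Mask, target: Mask) -> tuple:
    """Count intersection and union pixels for one prediction/target pair."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def cumulative_iou(preds: List[Mask], targets: List[Mask]) -> float:
    """Sum intersections and unions over the whole set, then divide once."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = _inter_union(pred, target)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def mean_iou(preds: List[Mask], targets: List[Mask]) -> float:
    """Average the per-sample IoU; samples with an empty union are skipped."""
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = _inter_union(pred, target)
        if union:
            scores.append(inter / union)
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: two 2x2 masks (assumed values, for illustration only).
preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]

print(cumulative_iou(preds, targets))  # (1 + 1) / (2 + 1)
print(mean_iou(preds, targets))        # (1/2 + 1/1) / 2
```

Because the cumulative form pools pixels across samples, large objects weigh more heavily than small ones, which is why the two numbers can differ on the same predictions.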