EARL: Towards a Unified Analysis-Guided Reinforcement Learning Framework for Egocentric Interaction Reasoning and Pixel Grounding
About
Understanding human-environment interactions from egocentric vision is essential for assistive robotics and embodied intelligent agents, yet existing multimodal large language models (MLLMs) still struggle with accurate interaction reasoning and fine-grained pixel grounding. To address this, this paper introduces EARL, an Egocentric Analysis-guided Reinforcement Learning framework that explicitly transfers coarse interaction semantics to query-oriented answering and grounding. Specifically, EARL adopts a two-stage parsing framework consisting of a coarse-grained interpretation stage and a fine-grained response stage. The first stage holistically interprets the egocentric interaction and generates a structured textual description. The second stage produces a textual answer and a pixel-level mask in response to the user query. To bridge the two stages, we extract a global interaction descriptor as a semantic prior and integrate it through a novel Analysis-guided Feature Synthesizer (AFS) for query-oriented reasoning. To optimize heterogeneous outputs, including textual answers, bounding boxes, and grounding masks, we design a multi-faceted reward function and train the response stage with GRPO. Experiments on Ego-IRGBench show that EARL achieves 65.48% cIoU for pixel grounding, outperforming previous RL-based methods by 8.37%, while out-of-distribution (OOD) grounding results on EgoHOS indicate strong transferability to unseen egocentric grounding scenarios.
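The abstract states that the response stage is trained with GRPO under a multi-faceted reward covering textual answers, bounding boxes, and grounding masks, but it does not give the exact reward terms or weights. The sketch below is a minimal illustration under stated assumptions: the component rewards (bag-of-words F1 for the answer, box IoU, mask IoU), the equal weights `w_text`, `w_box`, `w_mask`, and all function names are hypothetical, not EARL's actual design. Only the advantage step follows the standard GRPO recipe of normalizing each reward against the mean and standard deviation of its sampled group.

```python
from collections import Counter
from statistics import mean, pstdev

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Mask = list[list[int]]                   # binary H x W mask (0/1)


def token_f1(pred: str, ref: str) -> float:
    """Hypothetical text reward: bag-of-words F1 between answer and reference."""
    p, r = pred.lower().split(), ref.lower().split()
    overlap = sum((Counter(p) & Counter(r)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(r)
    return 2 * precision * recall / (precision + recall)


def box_iou(a: Box, b: Box) -> float:
    """IoU of two axis-aligned boxes."""
    def area(x: Box) -> float:
        return max(0.0, x[2] - x[0]) * max(0.0, x[3] - x[1])

    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def mask_iou(a: Mask, b: Mask) -> float:
    """IoU of two binary masks with the same shape."""
    pairs = [(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    inter = sum(1 for x, y in pairs if x and y)
    union = sum(1 for x, y in pairs if x or y)
    return inter / union if union else 0.0


def multi_faceted_reward(answer, box, mask, ref_answer, ref_box, ref_mask,
                         w_text=1.0, w_box=1.0, w_mask=1.0) -> float:
    """Hypothetical weighted sum of text, box, and mask rewards."""
    return (w_text * token_f1(answer, ref_answer)
            + w_box * box_iou(box, ref_box)
            + w_mask * mask_iou(mask, ref_mask))


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Usage: a group of G = 2 responses sampled for the same query.
ref = ("hand holds cup", (0, 0, 2, 2), [[1, 1], [1, 1]])
group = [
    ("hand holds cup", (0, 0, 2, 2), [[1, 1], [1, 1]]),   # matches reference
    ("hand opens door", (1, 1, 3, 3), [[1, 0], [0, 0]]),  # weaker response
]
rewards = [multi_faceted_reward(*resp, *ref) for resp in group]
advantages = group_advantages(rewards)
```

Because advantages are computed relative to the group, no learned value model is needed: the better response in each group gets a positive advantage and the worse one a negative advantage, whatever the absolute reward scale.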
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Pixel Grounding | EgoHOS out-of-distribution (test) | Left Hand IoU | 52.3 | 18 |
| Egocentric Interaction Answering | Ego-IRGBench (test) | METEOR | 93.9 | 15 |
| Egocentric Interaction Answering | Ego-IRGBench (val) | METEOR | 0.933 | 15 |
| Egocentric Interaction Grounding | Ego-IRGBench (test) | cIoU | 65.48 | 15 |
| Egocentric Interaction Grounding | Ego-IRGBench (val) | cIoU | 62.71 | 15 |
| Egocentric Interaction Analysis | Ego-IRGBench (test) | METEOR | 0.542 | 15 |
| Egocentric Interaction Analysis | Ego-IRGBench (val) | METEOR | 0.541 | 15 |
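The grounding rows report cIoU. Assuming it follows the common referring-segmentation convention (cumulative intersection divided by cumulative union over the whole split, rather than an average of per-image IoUs), it can be sketched as below. The function names are hypothetical, and `giou` is included only for contrast; this sketch scores an image with an empty union as 0.0, which is one of several conventions in use.

```python
Mask = list[list[int]]  # binary H x W mask (0/1)


def inter_union(pred: Mask, gt: Mask) -> tuple[int, int]:
    """Pixel counts of intersection and union for one image."""
    pairs = [(p, g) for rp, rg in zip(pred, gt) for p, g in zip(rp, rg)]
    inter = sum(1 for p, g in pairs if p and g)
    union = sum(1 for p, g in pairs if p or g)
    return inter, union


def ciou(preds: list[Mask], gts: list[Mask]) -> float:
    """Cumulative IoU: total intersection over total union across the split."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    total_inter = total_union = 0
    for pred, gt in zip(preds, gts):
        inter, union = inter_union(pred, gt)
        total_inter += inter
        total_union += union
    return total_inter / total_union if total_union else 0.0


def giou(preds: list[Mask], gts: list[Mask]) -> float:
    """For contrast: mean of per-image IoUs (empty union scored as 0.0)."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("preds and gts must be non-empty and equal length")
    ious = []
    for pred, gt in zip(preds, gts):
        inter, union = inter_union(pred, gt)
        ious.append(inter / union if union else 0.0)
    return sum(ious) / len(ious)


# Usage: two images, one partially correct and one exactly correct.
preds = [[[1, 1], [0, 0]], [[1]]]
gts = [[[1, 0], [0, 0]], [[1]]]
pooled = ciou(preds, gts)      # (1 + 1) / (2 + 1)
per_image = giou(preds, gts)   # (0.5 + 1.0) / 2
```

Because cIoU pools pixels across images, large masks dominate it, whereas per-image averaging weights every image equally; the two numbers therefore differ even on this tiny example.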