Mistake Attribution: Fine-Grained Mistake Understanding in Egocentric Videos
About
We introduce Mistake Attribution (MATT), a new task for fine-grained understanding of human mistakes in egocentric videos. While prior work detects whether a mistake occurs, MATT attributes the mistake to what part of the instruction is violated (semantic role), when in the video the deviation becomes irreversible (the Point-of-No-Return, PNR), and where the mistake appears in the PNR frame.

We develop MisEngine, a data engine that automatically constructs mistake samples from existing datasets with attribution-rich annotations. Applied to large egocentric corpora, MisEngine yields EPIC-KITCHENS-M and Ego4D-M -- two datasets up to two orders of magnitude larger than prior mistake datasets. We then present MisFormer, a unified attention-based model for mistake attribution across semantic, temporal, and spatial dimensions, trained with MisEngine supervision.

A human study demonstrates the ecological validity of our MisEngine-constructed mistake samples, confirming that EPIC-KITCHENS-M and Ego4D-M can serve as reliable benchmarks for mistake understanding. Experiments on both our datasets and prior benchmarks show that MisFormer, as a single unified model, outperforms task-specific SOTA methods by at least 6.66%, 21.81%, 18.7%, and 3.00% in video-language understanding, temporal localization, hand-object interaction, and mistake detection, respectively. Project page: https://yayuanli.github.io/MATT/
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Attribution | EPIC-KITCHENS-M (test) | Average Accuracy | 84.91 | 5 |
| Semantic Attribution | Ego4D-M (test) | Average Accuracy | 62.03 | 5 |
| Mistake Detection | EgoPER | F1@0.5 | 35.18 | 4 |
| Mistake Detection | EPIC-KITCHENS-M (test) | F1@0.5 | 78.05 | 3 |
| Mistake Detection | Ego4D-M (test) | F1@0.5 | 57.55 | 3 |
| Temporal Attribution | Ego4D-M (test) | MAE (frames) | 19.14 | 3 |
| Spatial Attribution | Ego4D-M | mIoU | 59.21 | 3 |
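
For reference, the spatial and temporal metrics above can be sketched as follows. This is an illustrative sketch, not the paper's evaluation code: the box format (`x1, y1, x2, y2`) and per-sample averaging are assumptions.

```python
def box_iou(a, b):
    """Intersection-over-union between two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(pred_boxes, gt_boxes):
    """mIoU for spatial attribution: average IoU over predicted/ground-truth box pairs."""
    return sum(box_iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(pred_boxes)

def pnr_mae(pred_frames, gt_frames):
    """MAE (frames) for temporal attribution: mean absolute error of predicted PNR frame indices."""
    return sum(abs(p - g) for p, g in zip(pred_frames, gt_frames)) / len(pred_frames)
```

F1@0.5 for mistake detection would then count a prediction as a true positive when its overlap with a ground-truth segment reaches the 0.5 threshold, following the same IoU idea in the temporal dimension.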