
SAVA-X: Ego-to-Exo Imitation Error Detection via Scene-Adaptive View Alignment and Bidirectional Cross View Fusion

About

Error detection is crucial in industrial training, healthcare, and assembly quality control. Most existing work assumes a single-view setting and cannot handle the practical case where a third-person (exo) demonstration is used to assess a first-person (ego) imitation. We formalize Ego→Exo Imitation Error Detection: given asynchronous, length-mismatched ego and exo videos, the model must localize procedural steps on the ego timeline and decide whether each is erroneous. This setting introduces cross-view domain shift, temporal misalignment, and heavy redundancy. Under a unified protocol, we adapt strong baselines from dense video captioning and temporal action detection and show that they struggle in this cross-view regime. We then propose SAVA-X, an Align-Fuse-Detect framework with (i) view-conditioned adaptive sampling, (ii) scene-adaptive view embeddings, and (iii) bidirectional cross-attention fusion. On the EgoMe benchmark, SAVA-X consistently improves AUPRC and mean tIoU over all baselines, and ablations confirm the complementary benefits of its components. Code is available at https://github.com/jack1ee/SAVAX.
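The bidirectional cross-attention fusion described above can be sketched in a few lines. This is a hypothetical, single-head NumPy illustration of the general technique (each stream queries the other, with a residual connection), not the authors' implementation; the names `cross_attend` and `bidirectional_fusion` and the residual design are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q_feats, kv_feats):
    # single-head scaled dot-product cross-attention:
    # queries from one view attend over keys/values of the other
    d = q_feats.shape[-1]
    attn = softmax(q_feats @ kv_feats.T / np.sqrt(d), axis=-1)
    return attn @ kv_feats

def bidirectional_fusion(ego, exo):
    # ego queries exo and exo queries ego; residual add keeps
    # each stream's original features alongside the fused context
    ego_fused = ego + cross_attend(ego, exo)
    exo_fused = exo + cross_attend(exo, ego)
    return ego_fused, exo_fused

rng = np.random.default_rng(0)
ego = rng.standard_normal((5, 16))   # T_ego=5 frames, d=16 features
exo = rng.standard_normal((8, 16))   # T_exo=8 frames (length-mismatched)
ego_f, exo_f = bidirectional_fusion(ego, exo)
print(ego_f.shape, exo_f.shape)  # (5, 16) (8, 16)
```

Note that the two timelines need not be the same length: attention handles the asynchronous, length-mismatched ego/exo pairing that the task formalizes.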

Xiang Li, Heqian Qiu, Lanxiao Wang, Benliu Qiu, Fanman Meng, Linfeng Xu, Hongliang Li • 2026

Related benchmarks

| Task                         | Dataset      | Metric            | Result | Rank |
|------------------------------|--------------|-------------------|--------|------|
| Imitation error detection    | EgoMe (val)  | AUPRC @ tIoU=0.3  | 33.56  | 6    |
| Imitation error detection    | EgoMe (test) | AUPRC @ tIoU=0.3  | 29.37  | 6    |
| Temporal Action Localization | EgoMe (val)  | AUPRC @ tIoU=0.3  | 69.02  | 6    |
| Temporal Action Localization | EgoMe (test) | AUPRC @ tIoU=0.3  | 66.32  | 6    |
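The benchmark metric evaluates localization at a temporal IoU threshold: a predicted step counts as a positive only if its tIoU with a ground-truth segment is at least 0.3. A minimal sketch of the tIoU computation (the function name `t_iou` is an assumption for illustration):

```python
def t_iou(a, b):
    """Temporal IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

# overlap = 2s, union = 8s -> tIoU = 0.25, below the 0.3 threshold
print(t_iou((2.0, 6.0), (4.0, 10.0)))  # 0.25
```

Predictions passing this threshold are then scored by their confidence to build the precision-recall curve whose area is the reported AUPRC.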
