Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction
About
Large language model (LLM)-based multi-agent systems (MAS) excel at collaborative problem solving but remain brittle to cascading errors: a single faulty step can propagate across agents and disrupt the trajectory. In this paper, we present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction. MASC recasts detection as history-conditioned anomaly scoring via two complementary designs: (1) Next-Execution Reconstruction, which predicts the embedding of the next step from the query and interaction history to capture causal consistency, and (2) Prototype-Guided Enhancement, which learns a prototype prior over normal-step embeddings and uses it to stabilize reconstruction and anomaly scoring under sparse context (e.g., early steps). When a step is flagged as anomalous, MASC triggers a correction agent to revise the acting agent's output before information flows downstream. On the Who&When benchmark, MASC consistently outperforms all baselines, improving step-level error detection by up to 8.47% AUC-ROC. When plugged into diverse MAS frameworks, it delivers consistent end-to-end gains across architectures, confirming that our metacognitive monitoring and targeted correction can mitigate error propagation with minimal overhead.
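The monitoring loop described above can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes step embeddings are already available as plain lists of floats, swaps the learned next-execution predictor for a simple blend of the context mean and a prototype vector, and scores anomalies as one minus cosine similarity. The names `predict_next`, `anomaly_score`, `monitor`, the blend constant `k`, the threshold, and the `correct` callback are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def predict_next(query, history, prototype, k=2.0):
    """Stand-in predictor (assumption, not the paper's learned model).

    Blends the mean of the query and past step embeddings with the prototype.
    The prototype weight k / (k + len(history)) is largest at early steps,
    mimicking the idea of stabilizing predictions under sparse context.
    """
    context = mean_vector([query] + history)
    w = k / (k + len(history))
    return [w * p + (1.0 - w) * c for p, c in zip(prototype, context)]


def anomaly_score(query, history, prototype, actual):
    """Score a step as 1 - cosine(predicted embedding, actual embedding)."""
    predicted = predict_next(query, history, prototype)
    return 1.0 - cosine(predicted, actual)


def monitor(query, steps, prototype, threshold, correct):
    """Score each step in order; pass flagged steps through `correct`
    before they enter the history that later steps are conditioned on."""
    history, flagged = [], []
    for index, embedding in enumerate(steps):
        if anomaly_score(query, history, prototype, embedding) > threshold:
            flagged.append(index)
            embedding = correct(embedding)  # hypothetical correction hook
        history.append(embedding)
    return history, flagged


# Toy 2-D example: the third step points away from everything seen so far.
query = [1.0, 0.0]
prototype = [1.0, 0.0]
steps = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [1.0, 0.05]]
history, flagged = monitor(
    query, steps, prototype, threshold=0.5, correct=lambda e: list(prototype)
)
print(flagged)
```

Correcting a flagged step before appending it to `history` reflects the abstract's point that revision happens before information flows downstream: later predictions are conditioned on the corrected step rather than the faulty one.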
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | GAIA | Accuracy (Pass@4) | 9.17 | 22 |
| Failure attribution | Who&When Hand-Crafted | Step-level Accuracy | 20.79 | 13 |
| Failure attribution | Who&When Total | Step-level Accuracy | 21.62 | 13 |
| Failure attribution | Who&When Algorithm-Generated | Step-level Accuracy | 21.72 | 13 |
| Error Forecasting | Who&When | Eta (%) | 42.19 | 6 |
| Tool-augmented Question Answering | Bamboogle | Accuracy | 51.47 | 4 |
| Tool-augmented Question Answering | 2Wiki | Accuracy | 42.83 | 4 |
| Tool-augmented Question Answering | HotpotQA | Accuracy | 51.33 | 4 |
| Tool-augmented Question Answering | MuSiQue | Accuracy | 10.67 | 4 |
| Tool-augmented Question Answering | MedQA | Accuracy | 71 | 4 |