FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments
About
Large language models (LLMs) are increasingly deployed as the decision-making core of autonomous agents that act on external environments. Yet in conversational benchmarks that simulate real-world, customer-centric issue resolution, these agents frequently fail because incorrect decisions cascade through later turns. The problem is particularly pronounced for open-source LLMs with fewer parameters, limited context windows, and constrained inference budgets, all of which increase error accumulation in agentic settings. To tackle these challenges, we present the Failure-Aware Meta-Agentic (FAMA) framework. FAMA operates in two stages: first, it analyzes failure trajectories from baseline agents to identify the most prevalent errors; second, it employs an orchestration mechanism that activates a minimal subset of specialized agents tailored to those failures, which inject targeted context for the tool-use agent before its decision-making step. Experiments across open-source LLMs demonstrate performance gains of up to 27% across evaluation modes over standard baselines. These results highlight that targeted context curation through specialized agents, aimed at common failures, is a valuable design principle for building reliable multi-turn tool-use LLM agents in settings that simulate real-world conversational scenarios.
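The two stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The `Trajectory` type, the error labels, the `top_k` cutoff, and the specialist functions are all hypothetical names introduced here for illustration. The sketch also assumes that failures have already been labeled, and that "activating a minimal subset" means running only the specialists whose error category ranks among the most frequent.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Trajectory:
    """One baseline-agent episode (hypothetical structure)."""
    task_id: str
    success: bool
    error_labels: List[str]  # assumed to come from a prior labeling step


def most_prevalent_errors(trajectories: List[Trajectory], top_k: int = 2) -> List[str]:
    """Stage 1: count error labels over failed trajectories only."""
    counts = Counter(
        label
        for t in trajectories
        if not t.success
        for label in t.error_labels
    )
    return [label for label, _ in counts.most_common(top_k)]


# A specialist maps the conversation so far to a short context snippet.
Specialist = Callable[[str], str]


def build_targeted_context(
    conversation: str,
    prevalent_errors: List[str],
    specialists: Dict[str, Specialist],
) -> str:
    """Stage 2: run only specialists matching prevalent errors and
    join their snippets into context for the tool-use agent."""
    snippets = [
        specialists[err](conversation)
        for err in prevalent_errors
        if err in specialists
    ]
    return "\n".join(snippets)


# Illustrative data only; the labels are invented for this sketch.
trajectories = [
    Trajectory("t1", False, ["policy_violation", "wrong_tool_args"]),
    Trajectory("t2", False, ["policy_violation"]),
    Trajectory("t3", True, []),
    Trajectory("t4", False, ["wrong_tool_args", "policy_violation"]),
    Trajectory("t5", False, ["premature_termination"]),
]

specialists: Dict[str, Specialist] = {
    "policy_violation": lambda conv: "Reminder: check the policy before acting.",
    "premature_termination": lambda conv: "Reminder: confirm the issue is resolved.",
}

prevalent = most_prevalent_errors(trajectories, top_k=2)
context = build_targeted_context("user: cancel my order", prevalent, specialists)
```

In this toy run only the policy specialist fires, because `premature_termination` is not among the two most frequent errors and no specialist is registered for `wrong_tool_args`. The resulting string would be prepended to the tool-use agent's prompt before it chooses its next action.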
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| LLM Agent Evaluation | Tau-bench retail | Pass@1 | 44.17 | 38 |
| Multi-turn agent task | ACEBench multi-turn (test) | Process Accuracy | 70.2 | 31 |
| LLM Agent Evaluation | Tau-bench airline | Pass@4 | 26.7 | 29 |
| Agentic Task Performance | τ-Telehealth | Pass^1 Rate | 45 | 16 |
| Agentic Task Performance | τ-Telecom | Pass@1 Success Rate | 52 | 16 |
| Tool-Use Agent Evaluation | τ-Bench Retail | Pass@1 Success Rate | 44.173 | 6 |
| Tool-Use Agent Evaluation | τ-Bench Airline | Pass@1 | 29.2 | 6 |
| Agent Task Completion | τ-Bench Retail | -- | -- | 5 |
| Agent Task Completion | τ-Bench Airline | Pass@1 | 36.8 | 3 |
| Tool-Use Agent Evaluation | τ-Bench Retail (test) | Pass@1 Success Rate | 34.6 | 3 |