| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Aggregate FTWP, ScienceWorld, WebShop | Gemma2-9B + MISE | Format Faithfulness Rate | 89.74 | 17 | 1mo ago |
| ScienceWorld (test) | LLaMA3-8B + MISE | Success Rate | 38.18 | 17 | 1mo ago |
| ScienceWorld (val) | Gemma2-9B + MISE | Success Rate | 44.08 | 17 | 1mo ago |
| FTWP (test) | GPT-4o | Success Rate | 51.08 | 17 | 1mo ago |
| FTWP (val) | GPT-4o | Success Rate | 43.66 | 17 | 1mo ago |
| Agent | Llama-3.2-1B-Instruct | Clean Success (Eager) | 100 | 4 | 13d ago |