SkillRevise: Improving LLM-Authored Agent Skills via Trace-Conditioned Skill Revision
About
Agent skills are procedural artifacts that enable LLM agents to execute workflows, verify constraints, and recover from failures. Existing self-evolving methods refine skills using accumulated trajectories. However, they struggle in cold-start settings, where only an initial, imperfect skill is available. Consequently, skill construction defaults to expert authoring or one-shot LLM generation. Expert-authored skills are costly and may not align with how LLM agents actually execute tasks, while one-shot generated skills can be syntactically well formed yet behaviorally weak. To bridge this gap, we propose SkillRevise, an execution-grounded framework that iteratively refines these initial skills. SkillRevise diagnoses skill defects from execution evidence, retrieves relevant repair principles from a general memory, and applies execution-anchored edits. It then re-executes each candidate, measures its empirical utility, and retains the best-performing skill version. Evaluated across three benchmarks and five LLMs, SkillRevise substantially outperforms one-shot baselines, improving the base agent's success rate on SkillsBench from 36.05% to 61.63% (a gain of 25.58 percentage points). Furthermore, the revised skills transfer well across models, which suggests they capture general procedural knowledge rather than model-specific artifacts.
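The revision loop described in the abstract (diagnose, retrieve, edit, re-execute, retain) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `Trace`, `success_rate`, and `revise_skill`, the callback signatures, the fixed `rounds` budget, and the choice to always revise from the best skill found so far are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Trace:
    """One execution record for a skill on a task (hypothetical structure)."""
    task: str
    success: bool
    log: str


def success_rate(traces: List[Trace]) -> float:
    """Empirical utility: fraction of successful executions."""
    return sum(t.success for t in traces) / len(traces) if traces else 0.0


def revise_skill(
    initial_skill: str,
    execute: Callable[[str], List[Trace]],               # run the agent with a skill
    diagnose: Callable[[str, List[Trace]], List[str]],   # defects found in the traces
    retrieve: Callable[[List[str]], List[str]],          # repair principles from memory
    edit: Callable[[str, List[str], List[str]], str],    # apply edits, return a candidate
    rounds: int = 3,                                      # assumed revision budget
) -> Tuple[str, float]:
    """Iteratively revise a skill, keeping the best version seen so far."""
    best_skill = initial_skill
    best_traces = execute(best_skill)
    best_score = success_rate(best_traces)

    for _ in range(rounds):
        defects = diagnose(best_skill, best_traces)
        if not defects:
            break  # nothing left to repair according to the traces
        principles = retrieve(defects)
        candidate = edit(best_skill, defects, principles)

        # Re-execute the candidate and keep it only if it measurably helps.
        cand_traces = execute(candidate)
        cand_score = success_rate(cand_traces)
        if cand_score > best_score:
            best_skill, best_traces, best_score = candidate, cand_traces, cand_score

    return best_skill, best_score
```

In practice the four callbacks would wrap LLM calls and an agent runtime; they are left abstract here so the control flow stays visible. Accepting a candidate only on a strict improvement is one simple retention rule among several possible ones.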
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Skill execution | SkillsBench | Overall Success Rate (avg@5) | 53 | 26 |
| Software Engineering Task Success | SWE-Skills-Bench Hard | Task Success Rate | 35 | 20 |
| Task success | SkillLearnBench Random | Success Count | 29 | 20 |
| Interactive Task Completion | ALFWorld cleaned 100-task v3 (mix of val-seen and val-unseen) | Success Rate | 71 | 12 |