SkillFlow: Scalable and Efficient Agent Skill Retrieval System

About

AI agents can extend their capabilities at inference time by loading reusable skills into context, yet equipping an agent with too many skills, particularly irrelevant ones, degrades performance. As community-driven skill repositories grow, agents need a way to selectively retrieve only the most relevant skills from a large library. We present SkillFlow, the first multi-stage retrieval pipeline designed for agent skill discovery, framing skill acquisition as an information retrieval problem over a corpus of ~36K community-contributed SKILL.md definitions indexed from GitHub. The pipeline progressively narrows a large candidate set through four stages: dense retrieval, two rounds of cross-encoder reranking, and LLM-based selection, balancing recall and precision at each stage. We evaluate SkillFlow on two coding benchmarks: SkillsBench, which provides 87 tasks and 229 matched skills, and Terminal-Bench, which provides 89 tasks but no matched skills. On SkillsBench, SkillFlow-retrieved skills raise Pass@1 from 9.2% to 16.4% (a relative gain of +78.3%, $p_{\text{adj}} = 3.64 \times 10^{-2}$), reaching 84.1% of the oracle ceiling. On Terminal-Bench, agents readily use the retrieved skills (70.1% use rate) yet show no performance gain, revealing that retrieval alone is insufficient when the corpus lacks high-quality, executable skills for the target domain. SkillFlow demonstrates that framing skill acquisition as an information retrieval task is an effective strategy, and that the practical impact of skill-augmented agents hinges on corpus coverage and skill quality, particularly the density of runnable code and bundled artifacts.
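
The four-stage narrowing described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names (`top_k`, `token_overlap`, `funnel`), the toy corpus, the cutoff sizes, and above all the scorer. A simple token-overlap score stands in for the dense retriever, both cross-encoder rerankers, and the LLM selector, none of which are specified in the abstract. It shows only the funnel shape, not the authors' implementation.

```python
from typing import Callable, List, Tuple

# A scorer maps (query, skill description) to a relevance score.
Scorer = Callable[[str, str], float]


def top_k(query: str, candidates: List[str], scorer: Scorer, k: int) -> List[str]:
    """Keep the k highest-scoring candidates (ties keep their input order)."""
    ranked = sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)
    return ranked[:k]


def token_overlap(query: str, doc: str) -> float:
    """Hypothetical stand-in scorer: Jaccard overlap of lowercase tokens."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    union = q | d
    return len(q & d) / len(union) if union else 0.0


def funnel(query: str, corpus: List[str], stages: List[Tuple[Scorer, int]]) -> List[str]:
    """Apply each (scorer, cutoff) stage in order, shrinking the candidate set."""
    candidates = list(corpus)
    for scorer, k in stages:
        candidates = top_k(query, candidates, scorer, k)
    return candidates


# Toy corpus of skill descriptions (invented for this example).
corpus = [
    "parse csv files with pandas",
    "resize images with pillow",
    "query sqlite database from python",
    "plot csv data with matplotlib",
    "send email over smtp",
]

# Four stages mirroring the abstract's structure; cutoffs are arbitrary.
stages = [
    (token_overlap, 4),  # stands in for dense retrieval
    (token_overlap, 3),  # stands in for cross-encoder reranking, round 1
    (token_overlap, 2),  # stands in for cross-encoder reranking, round 2
    (token_overlap, 1),  # stands in for LLM-based selection
]

selected = funnel("parse csv files", corpus, stages)
print(selected)
```

In a real pipeline each stage would use a progressively more expensive and more precise scorer, so the early stages favor recall over a large corpus and the later stages favor precision over a short list.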

Fangzhou Li, Pagkratios Tagkopoulos, Ilias Tagkopoulos • 2025

Related benchmarks

Task	Dataset	Result
Terminal Task Execution	Terminal-Bench 1.0 (test)	Avg Pass Rate 48	6
Skill retrieval	SkillsBench	Mean Skills Retrieved per Task 2.8	4
Skill-assisted task execution	SkillsBench 1.0 (test)	Pass@1 16.4	4
Skill retrieval	Terminal-Bench	Mean Skills Retrieved per Task 1.5	3
