
Pioneer Agent: Continual Improvement of Small Language Models in Production

About

Small language models are attractive for production deployment due to their low cost, fast inference, and ease of specialization. However, adapting them to a specific task remains a challenging engineering loop, driven not by training itself but by surrounding decisions: data curation, failure diagnosis, regression avoidance, and iteration control. We present Pioneer Agent, a closed-loop system that automates this lifecycle. In cold-start mode, given only a natural-language task description, the agent acquires data, constructs evaluation sets, and iteratively trains models by jointly optimizing data, hyperparameters, and learning strategy. In production mode, given a deployed model with labeled failures, it diagnoses error patterns, constructs targeted training data, and retrains under explicit regression constraints. To evaluate this setting, we introduce AdaptFT-Bench, a benchmark of synthetic inference logs with progressively increasing noise, designed to test the full adaptation loop: diagnosis, curriculum synthesis, retraining, and verification. Across eight cold-start benchmarks spanning reasoning, math, code generation, summarization, and classification, Pioneer Agent improves over base models by 1.6-83.8 points. On AdaptFT-Bench, it improves or preserves performance in all seven scenarios, while naive retraining degrades by up to 43 points. On two production-style deployments built from public benchmark tasks, it raises intent classification from 84.9% to 99.3% and Entity F1 from 0.345 to 0.810. Beyond performance gains, the agent often discovers effective training strategies, including chain-of-thought supervision, task-specific optimization, and quality-focused data curation, purely from downstream feedback.
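The production-mode loop described above (diagnose failures, build targeted training data, retrain under explicit regression constraints) can be sketched as a simple accept/reject gate. This is an illustrative sketch only; the class and function names (`Candidate`, `adaptation_step`, `max_regression`) are hypothetical and not from the paper.

```python
"""Minimal sketch of a regression-constrained adaptation step, assuming the
agent scores each retrained candidate on two sets: the diagnosed failure set
and a held-out regression set. All names here are illustrative."""

from dataclasses import dataclass


@dataclass
class Candidate:
    score_on_failures: float    # accuracy on the diagnosed failure cases
    score_on_regression: float  # accuracy on the held-out regression set


def adaptation_step(baseline_regression_score: float,
                    candidate: Candidate,
                    max_regression: float = 0.0) -> bool:
    """Accept a retrained candidate only if its regression-set score does
    not drop more than `max_regression` below the deployed baseline."""
    floor = baseline_regression_score - max_regression
    return candidate.score_on_regression >= floor


# A candidate that fixes failures but regresses on held-out data is rejected;
# one that fixes failures while preserving held-out accuracy is accepted.
rejected = adaptation_step(0.90, Candidate(0.95, 0.70))
accepted = adaptation_step(0.90, Candidate(0.95, 0.91))
```

The point of the gate is the asymmetry the abstract highlights: naive retraining optimizes only the failure set, while the constrained step refuses any candidate that trades held-out accuracy for failure fixes.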

Dhruv Atreja, Julia White, Nikhil Nayak, Kelton Zhang, Henrijs Princis, George Hurn-Maloney, Ash Lewis, Urchade Zaratiana • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (val) | Accuracy | 43.7 | 81 |
| Question Answering | ARC Challenge (val) | Accuracy | 93.3 | 76 |
| Mathematical Reasoning | GSM8K | Accuracy | 81.2 | 8 |
| Code Generation | HumanEval | pass@1 | 90.3 | 4 |
| General Question Answering | TriviaQA | Accuracy | 76.1 | 4 |
| Science Question Answering | ARC Challenge | Accuracy | 78.4 | 4 |
| Text Summarization | XSum | ROUGE-2 | 43.1 | 4 |
| Dialogue Summarization | SAMSum (val) | ROUGE-2 | 25.4 | 2 |
| Intent Classification | CLINC150 (deployment) | Count of Failures Fixed | 453 | 2 |
| Question Answering | TriviaQA (held-out val) | Accuracy | 48.6 | 2 |

Showing 10 of 12 rows.
