Share your thoughts, 1 month free Claude Pro on usSee more
WorkDL logo mark

Inference-Time Distillation: Cost-Efficient Agents Without Fine-Tuning or Manual Prompt Engineering

About

Deploying LLM agents at scale typically requires choosing between quality and cost. Existing cost-reduction approaches fail to preserve agility: the ability to iterate rapidly without human time bottlenecks. Prompt engineering is brittle and slows iteration, while fine-tuning requires multi-day training and commitment to fixed designs; both are impractical for iterative workflows and time-sensitive batch jobs. We demonstrate that established inference-time techniques--dynamic in-context learning and self-consistency cascades--can be leveraged to shift the cost-accuracy Pareto frontier while preserving agility. Practitioners run the teacher on a small task subset to collect demonstrations, then immediately deploy a cheaper student on the remainder. At each step, the system retrieves relevant teacher demonstrations as in-context examples. When multiple student samples agree, we proceed; when they diverge, we fall back to the teacher. This requires no prompt engineering or training. On ALFWorld, we match teacher accuracy at 2.5x lower cost (0.059 to 0.024 per episode). On AppWorld, we achieve 3.5x cost reduction while recovering 79% of teacher accuracy. Our empirical analyses provide guidance on key design choices: teacher database size, demonstration set size, retrieval strategy, and cascade thresholds. These analyses highlight inference-time levers for navigating cost-performance tradeoffs without sacrificing human development speed.

Vishnu Sarukkai, Asanshay Gupta, James Hong, Micha\"el Gharbi, Kayvon Fatahalian• 2025

Related benchmarks

TaskDatasetResultRank
Embodied ReasoningAlfWorld
Accuracy0.96
151
Multimodal app-use reasoningAppWorld
Cost0.1
7
Showing 2 of 2 rows

Other info

Follow for update