Orchard: An Open-Source Agentic Modeling Framework
About
Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Web Navigation Task Success | MIND2WEB ONLINE (test) | Task Success Rate (Overall) | 67 | 41 |
| Software Engineering Issue Resolution | SWE-bench Verified | Resolution Rate | 67.5 | 26 |
| GUI agent success | WebVoyager (test) | Success Rate | 74.1 | 18 |
| GUI agent success | DeepShop (test) | Success Rate | 64 | 17 |
| GUI agent success | WebVoyager, Online-Mind2Web, DeepShop (test average) | Average Success Rate | 68.4 | 17 |
| Personal Assistant Agent Performance | Claw-Eval general domain 0408 | Pass@3 | 59.6 | 13 |
| Software Issue Resolution | SWE-rebench 60-task Python subset v2 | -- | -- | 7 |
| Software Issue Resolution | SWE-rebench full Python v2 | Pass@1 | 22.36 | 1 |
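
The 68.4 average success rate listed for Orchard-GUI appears to agree with an unweighted mean of the three per-benchmark success rates quoted in the abstract. The source does not state how the average was computed, so the sketch below is only a consistency check under that assumption; the names `gui_success_rates` and `macro_average` are illustrative and do not come from the Orchard codebase.

```python
# Success rates (%) for Orchard-GUI as reported in the abstract.
gui_success_rates = {
    "WebVoyager": 74.1,
    "Online-Mind2Web": 67.0,
    "DeepShop": 64.0,
}


def macro_average(rates):
    """Unweighted mean across benchmarks, rounded to one decimal place.

    Assumption: each benchmark counts equally, regardless of how many
    tasks it contains.
    """
    return round(sum(rates.values()) / len(rates), 1)


average = macro_average(gui_success_rates)
print(average)
```

A task-weighted average would need the number of tasks in each benchmark, which this page does not give.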