
Orchard: An Open-Source Agentic Modeling Framework

About

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Web Navigation Task Success | MIND2WEB ONLINE (test) | Task Success Rate (Overall): 67 | 41 |
| Software Engineering Issue Resolution | SWE-bench Verified | Resolution Rate: 67.5 | 26 |
| GUI agent success | WebVoyager (test) | Success Rate: 74.1 | 18 |
| GUI agent success | DeepShop (test) | Success Rate: 64 | 17 |
| GUI agent success | WebVoyager, Online-Mind2Web, DeepShop (test average) | Average Success Rate: 68.4 | 17 |
| Personal Assistant Agent Performance | Claw-Eval general domain 0408 | Pass@3: 59.6 | 13 |
| Software Issue Resolution | SWE-rebench 60-task Python subset v2 | -- | 7 |
| Software Issue Resolution | SWE-rebench full Python v2 | Pass@1: 22.36 | 1 |
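The 68.4 average success rate listed for the three GUI benchmarks appears consistent with an unweighted mean of the individual scores reported for WebVoyager, Online-Mind2Web, and DeepShop. The short check below assumes a simple arithmetic mean rounded to one decimal place. The page does not state how the average is computed, and the helper name `mean_success_rate` is hypothetical.

```python
# Success rates (%) for the GUI agent, as reported in the abstract.
gui_scores = {
    "WebVoyager": 74.1,
    "Online-Mind2Web": 67.0,
    "DeepShop": 64.0,
}


def mean_success_rate(scores):
    """Unweighted arithmetic mean of the scores, rounded to one decimal.

    Assumption: each benchmark counts equally, regardless of how many
    tasks it contains.
    """
    return round(sum(scores.values()) / len(scores), 1)


average = mean_success_rate(gui_scores)
print(average)  # 68.4
```

A task-weighted average would need per-benchmark task counts, which are not given here.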

Other info

GitHub
