Operating System Control on AgentBench OS
[Chart: Accuracy over time, Oct 28, 2025 to May 6, 2026. Top result: INFOTREE, 37.6 Accuracy.]
Updated 26d ago
Evaluation Results
Method                  Details                    Date     Accuracy
INFOTREE                -                          2026.05  37.6
OpenHands CodeActAgent  Model=Qwen-2.5-32B-Cod...  2025.10  34.7
Tree-GRPO               -                          2026.05  33.8
Flat GRPO               -                          2026.05  31.4
OpenHands CodeActAgent  Model=Qwen-2.5-32B-Cod...  2025.10  27.8
OpenHands CodeActAgent  Model=Qwen-2.5-7B-Code...  2025.10  27.1
AgentLM                 Model=Llama-2-chat-70B...  2025.10  21.5
OpenHands CodeActAgent  Model=Qwen-2.5-14B-Cod...  2025.10  20.8
AgentLM                 Model=Llama-2-chat-13B...  2025.10  18.1
AgentLM                 Model=Llama-2-chat-7B,...  2025.10  17.4
AgentLM                 Model=Llama-2-chat-70B...  2025.10  9
AgentLM                 Model=Llama-2-chat-13B...  2025.10  9
AgentLM                 Model=Llama-2-chat-7B,...  2025.10  8.3
OpenHands CodeActAgent  Model=Qwen-2.5-7B-Code...  2025.10  3.5
OpenHands CodeActAgent  Model=Qwen-2.5-14B-Cod...  2025.10  2.8