Reasoning on GPQA Protocol A (test)
Top result: OpenHands CodeActAgent + GBT-SE, 87.3 Accuracy (Jan 30, 2026)
Updated 1mo ago
Evaluation Results
| Method | Configuration | Date | Accuracy | Coverage | Violation Rate | Success Rate (Unconstrained) | Avg Tokens | Avg Characters |
|---|---|---|---|---|---|---|---|---|
| OpenHands CodeActAgent + GBT-SE | Backbone=gpt-4o, Metho... | 2026.01 | 87.3 | 73 | 0.2 | 0 | 15 | 58 |
| OpenHands CodeActAgent + GBT-Basic | Backbone=gpt-4o, Metho... | 2026.01 | 78.8 | 71.9 | 0.2 | 0 | 16 | 62 |
| OpenHands CodeActAgent + Global guardrail only | Backbone=gpt-4o, Confi... | 2026.01 | 59.2 | - | 0.4 | 20 | 21 | 82 |
| OpenHands CodeActAgent | Backbone=gpt-4o, Mode=... | 2026.01 | 58.7 | - | 1.6 | 40 | 22 | 86 |
| Zero-shot prompting | Backbone=gpt-4o, Mode=... | 2026.01 | 53.6 | - | - | - | - | - |