Scientific Code Generation on ScienceAgentBench (test)

Best result: 32.8 SR (MAA)

Updated 1mo ago

Evaluation Results

Method	Links	SR
MAA 2026.06		32.8	-	-
SkillOpt 2026.06		31.9	-	-
MAA 2026.06		30.7	-	-
SkillOpt 2026.06		30.3	-	-
Reactive Update 2026.06		29.2	-	-
Trace2Skill 2026.06		28.6	-	-
Reactive Update 2026.06		28.4	-	-
OpenHands CodeAct 2026.03		27.5	86.3	73.5
LCP 2026.03		27.5	86.4	87.3
Trace2Skill 2026.06		27.2	-	-
LCP 2026.03		26.5	85.1	90.2
Single-shot (ExpeL-style) 2026.06		26.1	-	-
Single-shot (ExpeL-style) 2026.06		25.8	-	-
Self-Debug 2026.03		23.5	85.6	71.6
SkillOpt 2026.06		23.4	-	-
Frozen (no memory) 2026.06		22.7	-	-
Self-Debug 2026.03		22.6	84.4	83.3
SkillOpt 2026.06		22.6	-	-
Frozen (no memory) 2026.06		22.1	-	-
MAA 2026.06		21.7	-	-
MAA 2026.06		20.8	-	-
Trace2Skill 2026.06		20.4	-	-
Single-shot (ExpeL-style) 2026.06		19.8	-	-
OpenHands CodeAct 2026.03		19.6	83.1	78.4
Reactive Update 2026.06		19.5	-	-
Reactive Update 2026.06		18.3	-	-
Trace2Skill 2026.06		17.5	-	-
Single-shot (ExpeL-style) 2026.06		15.7	-	-
Frozen (no memory) 2026.06		14.8	-	-
Direct 2026.03		11.8	82.6	52.9
Frozen (no memory) 2026.06		11.3	-	-
Direct 2026.03		10.8	83.8	41.2