
General Reasoning on BIG-bench

81.6 Accuracy (General)

POES

Updated 3mo ago

Evaluation Results

Method	Accuracy (General)
POES 2026.04	81.6	-	-	-
IPOMP 2026.04	75.8	-	-	-
Random 2026.04	74.8	-	-	-
Prediction 2026.04	74.6	-	-	-
SESS 2026.04	71.9	-	-	-
POES 2026.04	71.3	-	-	-
Anchor 2026.04	70.9	-	-	-
POES 2026.04	63.3	-	-	-
POES 2026.04	61.2	-	-	-
POES 2026.04	59.3	-	-	-
POES 2026.04	56.6	-	-	-
IPOMP 2026.04	55.8	-	-	-
Prediction 2026.04	55	-	-	-
Anchor 2026.04	54.6	-	-	-
SESS 2026.04	53.8	-	-	-
Anchor 2026.04	53.3	-	-	-
Anchor 2026.04	52.1	-	-	-
IPOMP 2026.04	52	-	-	-
SESS 2026.04	51.7	-	-	-
Random 2026.04	51.2	-	-	-
Prediction 2026.04	51.2	-	-	-
IPOMP 2026.04	50.8	-	-	-
Random 2026.04	50.1	-	-	-
Prediction 2026.04	49.6	-	-	-
IPOMP 2026.04	49.6	-	-	-
Random 2026.04	49.5	-	-	-
SESS 2026.04	48.6	-	-	-
Random 2026.04	48.2	-	-	-
Prediction 2026.04	46.7	-	-	-
IPOMP 2026.04	46.7	-	-	-
SESS 2026.04	46.5	-	-	-
Anchor 2026.04	46.5	-	-	-
SESS 2026.04	46.4	-	-	-
Random 2026.04	46.4	-	-	-
Prediction 2026.04	44.6	-	-	-
Anchor 2026.04	42.5	-	-	-
Prompt based 2025.05	-	61.2	75.6	14.4
Prompt based 2025.05	-	61.2	74.4	13.2
SFT based 2025.05	-	74.6	-	-
ReflectEvo 2025.05	-	61.2	78.4	17.2
ReflectEvo 2025.05	-	61.2	67	5.8
ReflectEvo 2025.05	-	61.2	74.8	13.6
ReflectEvo 2025.05	-	61.2	75	13.8
Prompt based 2025.05	-	63.4	72	8.6
Prompt based 2025.05	-	63.4	75.2	11.8
Prompt based 2025.05	-	38.2	-	-
Prompt based 2025.05	-	38.2	52.4	14.2
SFT based 2025.05	-	61.6	-	-
ReflectEvo 2025.05	-	38.2	71.2	33
ReflectEvo 2025.05	-	38.2	45.4	7.2
ReflectEvo 2025.05	-	38.2	63	24.8
ReflectEvo 2025.05	-	38.2	59.6	21.4
Prompt based 2025.05	-	48	-	-
Prompt based 2025.05	-	48	67	19
Prompt based 2025.05	-	48	64.4	16.4
Prompt based 2025.05	-	36.6	-	-
Prompt based 2025.05	-	36.6	43.8	7.2
SFT based 2025.05	-	37.8	-	-
ReflectEvo 2025.05	-	36.6	51.6	15
ReflectEvo 2025.05	-	36.6	71.1	34.5
ReflectEvo 2025.05	-	36.6	50.2	13.6
ReflectEvo 2025.05	-	36.6	48.4	11.8
Prompt based 2025.05	-	54.4	-	-
Prompt based 2025.05	-	54.4	67.2	12.8
Prompt based 2025.05	-	54.4	68	13.6
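The three right-hand columns are unlabeled in the table. In every row that fills all three, the last value equals the third column minus the second (for example, 75.6 - 61.2 = 14.4). A minimal Python sketch of that consistency check follows. It assumes the rows are tab-separated with `-` marking a missing value, and the names `before`, `after`, and `gain` are hypothetical labels chosen for illustration, not column names from the source.

```python
from typing import List, Optional, Tuple


def parse_cell(cell: str) -> Optional[float]:
    """Convert one table cell to a float, treating '-' as missing."""
    cell = cell.strip()
    return None if cell == "-" else float(cell)


def parse_row(line: str) -> Tuple[str, List[Optional[float]]]:
    """Split a tab-separated row into the method label and its numeric cells."""
    method, *cells = line.split("\t")
    return method, [parse_cell(c) for c in cells]


def gain_is_consistent(line: str, tol: float = 0.05) -> Optional[bool]:
    """Check that the last column equals (third - second) within a tolerance.

    Returns None when any of the three values is missing.
    The labels before/after/gain are assumptions, not names from the table.
    """
    _method, values = parse_row(line)
    _general, before, after, gain = values
    if before is None or after is None or gain is None:
        return None
    return abs((after - before) - gain) <= tol


# Three rows copied from the table above.
rows = [
    "Prompt based 2025.05\t-\t61.2\t75.6\t14.4",
    "ReflectEvo 2025.05\t-\t38.2\t71.2\t33",
    "SFT based 2025.05\t-\t74.6\t-\t-",
]
results = [gain_is_consistent(r) for r in rows]
print(results)
```

The tolerance absorbs floating-point noise from values rounded to one decimal place. Rows with a missing value return `None` rather than `False`, so incomplete rows are not reported as inconsistent.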