
Reasoning Evaluation on BIG-Bench Hard (test)

87.3 Average Accuracy

GPT-4o

[Chart: average accuracy over time, Nov 9, 2023 to Jun 6, 2025; y-axis ticks at 36.652, 49.801, 62.95, 76.099]
Updated 8d ago

Evaluation Results

Method | Links
2025.05
87.3---------------------------------
2025.05
86.5---------------------------------
2025.05
85.9---------------------------------
2025.05
81.4---------------------------------
2025.05
80.4---------------------------------
2025.05
80---------------------------------
2025.05
79.7---------------------------------
2025.06
79.4-----70-82----84----74---83--76-------87
2025.05
79---------------------------------
78.4-----------79.4-------81.9-----70.8----81.3--
2025.05
78---------------------------------
2024.09
77.4-----------85.3-------69.2-----83.9----71.1--
2025.05
77.3---------------------------------
2025.05
76.6---------------------------------
2025.06
76.3-----63-86----85----74---79--67-------80
2025.05
76.2---------------------------------
2025.05
74.1---------------------------------
2025.05
73.8---------------------------------
2025.05
71.2---------------------------------
2025.05
70.4---------------------------------
70.3-----------90-------68-----71.6----51.6--
2025.05
70.1---------------------------------
2025.06
69.7-----63-95----59----66---62--68-------75
2025.05
69.7---------------------------------
2024.09
69.3-----------83.2-------62.4-----80.4----51.2--
2025.05
69.1---------------------------------
2025.05
67.5---------------------------------
2025.05
66.2---------------------------------
2025.06
66.1-----56-59----74----59---67--74-------74
2024.09
65.8-----------88.4-------50.4-----72.8----51.6--
2025.05
64.7---------------------------------
2023.11
63.5---63.758.76873.185.520.983.554.7----------------------
2024.09
63.2-----------72-------55.4-----72.3----53.2--
2024.09
63.1-----------68.5-------56-----68.1----59.6--
2025.05
63.1---------------------------------
2023.11
63.09---56.32-505290858-9276406042667078847473.91723673.0864826466644666-
2023.11
62.23---58.62-604292474-9472485840706874746671.74703467.9570726066624468-
2025.05
61.6---------------------------------
2023.11
59.3---59.653.663.468.879.317.381.151.3----------------------
2023.11
58.7---58.255.761.862.479.620.481.550.3----------------------
2023.11
58.36---49.43-385092660-8672444438826674567076.09644260.2666525858664660-
2025.06
58.1-----49-71----47----51---69--56-------66
2023.11
57.6---59.256.56164.576.316.478.948.2----------------------
2023.11
57.17---48.28-4652861274-8874365640665474586469.57664053.8564545664663052-
2025.06
57.1-----63-78----41----60---32--69-------57
2025.06
56.9-----57-63----50----50---72--53-------53
2025.05
56.6---------------------------------
55.3---58.954.4586074.4168238.8----------------------
2025.05
52.3---------------------------------
2025.06
51.9-----42-76----32----61---52--49-------51
2025.05
51.9---------------------------------
2024.09
50-----------50-------50-----50----50--
2025.05
49.7---------------------------------
2025.06
49.4-----30-70----42----52---44--55-------45
2025.06
48.7-----43-53----46----42---43--60-------44
2025.06
38.6-----53-49----28----22---27--51-------38
2023.11
-6262.191.6------------------------------
2023.11
-64.665.3-------------------------------
2023.11
-67.86898.7------------------------------
2023.11
-72.873.198.8------------------------------
2023.11
-70.4--------------------------------
2023.11
-73.475.298.2------------------------------
2023.11
-76.777.698.6------------------------------
2025.10
----66.169.869.873.990----85.443.478.573.9---86.9-91.5--82.3--69.549.98976.6--
2025.10
----65.370.97272.589.1----87.742.276.172.6---87.8-91.5--83.3--72.453.290.479.6--
2025.10
----65.269.369.873.989.6----87.840.379.875---86.6-91--81.6--72.553.888.886.1--
2025.10
----65.566.272.573.888.6----88.46578.772.1---87.4-93.4--82--69.254.382.681.8--
2025.10
----68.168.173.874.491----91.859.880.471.1---90-93.7--84--79.664.19394.2--