
Task-solving Performance on BIG-Bench Hard (test)

Causal Judgement: 68.3

Breeder

[Chart: results over time, Mar 3, 2025 to Oct 14, 2025]
Updated 9d ago

Evaluation Results

Method  Links
2025.10
68.3 - 89.8 74.5 - 74.6 68.4 93.2 77.8 72.4 - - - 90.8 - 92.7 - - 69.8 90.3 - - 85.9 60 92.7 96.3 - -
2025.10
68.2 - 91 73.4 - 72.8 56.9 93.2 77.4 71.8 - - - 88.2 - 91.9 - - 68.7 88.8 - - 83.3 66.2 92.4 93.8 - -
2025.10
68 - 89.2 73 - 74.7 67 94 82.2 72.1 - - - 89.9 - 92.5 - - 68.6 87.5 - - 80.3 60 91.1 94.8 - -
2025.10
68 - 91.5 74 - 75.4 71.2 94.8 80.9 73.3 - - - 89 - 93.3 - - 69.5 87.4 - - 84.7 66.2 94.6 97.9 - -
2025.03
45.582.520.1742.57.543.535.8360.537.171378.8354.3352.584.6785.8340.9749.567.6751.1764.0678.456669.3349.581.679545.3355.67
2025.03
42.3479.8319.1738.336.544.8333702910.1770.1749.552.1780.1785.6728.475068.6746.558.5961.6265.8366.1752.577.59545.8352.87
2025.03
40.3974.517.17306.6740.673654.6714.675.8345.8349.553.583.1785.8322.575327.1746.1755.4761.4565.8366.55278.679545.548.43
2025.03
40.158418.6747.336.1742.53359.524.6713.1771.834850.3381.678541.6745.6764.8351.1765.176.9460.3368.1754.3379.179546.1753.87
2025.03
2.19541419.56.529.516.553125.54440.553.584.587.526.042118.512.524.2251.525767.547.580.593.544.539.52
2025.03
067.5322.50024.53.53.5320.505238.527.58.336.565.56.50061819.580.5424120.73