Logic Reasoning on Tracking Shuffled Objects (BBH)

71.33 Accuracy

Role-Play Prompting

Updated 14d ago

Evaluation Results

Method  Links
2026.01  71.33
2026.01  70.4
2026.01  64.93
2026.01  64.67
2026.01  64.67
2026.01  64.53
2026.01  64.13
2026.01  61.33
2026.01  61.15
2024.09  60.03
2026.01  58.8
2026.01  58.67
2026.01  58.13
2026.01  53.47
2026.01  51.33
2026.01  51.07
2026.01  46.53
2024.09  42.4
2024.09  40
2024.09  38.8
2026.01  34.67
2026.01  32.44
2026.01  32.27
2026.01  31.87
2026.01  31.07
2026.01  30.13
2024.09  29.6
2026.01  29.07
2026.01  28.4
2024.09  28
2024.09  25.6
2024.09  24.4
2024.09  24.4
2024.09  24
2024.09  24
2024.09  24
2024.09  23.2
2024.09  20.8
2024.09  20.8
2026.05  20.5
2026.05  20.4
2024.09  20
2026.05  19.6
2026.05  18.6
2024.09  17.6
2024.09  16.8
2024.09  16.8
2024.09  16.8
2026.05  16.3
2024.09  16
2024.09  15.6
2024.09  13.6
2024.09  13.2
2024.09  12.6
2024.09  12.4
2024.09  10.8
2024.09  8.8
2024.09  2
2024.09  1.44