
Common Sense Reasoning on HellaSwag (test)

Accuracy: 83.9

Llama-3.3-70B-Base

[Chart: accuracy over time, Dec 18, 2025 to May 28, 2026]
Updated 5d ago

Evaluation Results

Date      Accuracy
2026.02   83.9
2026.01   83.43
2026.01   81.44
2026.01   79.31
2026.02   79.3
2025.12   78.5
2026.02   78.2
2026.02   77.7
2026.02   77.4
2026.02   77.2
2026.02   76.5
2026.02   76.2
2026.01   75.72
2026.02   75.7
2026.01   74.78
2025.12   74.5
2026.01   74.13
2026.01   74.12
2026.01   73.75
2026.02   73.6
2026.02   73.2
2026.02   73.2
2025.12   72.1
2025.12   71.8
2025.12   70.2
2025.12   69.5
2026.02   69.2
2026.01   68.4
2025.12   66.5
2025.12   65.2
2026.01   64.58
2025.12   64.1
2025.12   63.5
2025.12   60.2
2026.01   59.49
2026.05   59.4
2026.05   59.3
2026.05   59.2
2026.05   58.8
2026.05   58.7
2026.05   58.7
2026.01   58.63
2026.05   58.1
2026.05   57.8
2026.05   57.2
2025.12   55.8
2026.01   53.66
2026.01   52.77
2026.02   49
2026.05   48.4
2026.02   47.7
2026.02   47
2026.02   47
2026.05   47
2026.02   40.2
2026.02   35.5