
Zero-shot Evaluation on Reasoning Suite (BoolQ, PIQA, Hellaswag, Winogrande, ARC-e, ARC-c, OBQA)

77.7 Accuracy (BoolQ)

LLaMA2

[Chart: Accuracy (BoolQ) over time; date tick at Aug 9, 2025]

Evaluation Results

Date     BoolQ  PIQA  Hellaswag  Winogrande  ARC-e  ARC-c  OBQA  Avg
2025.08  77.7   79.1  76         69.1        74.6   46.2   44.2  66.7
2025.08  65.5   73.2  59.5       66.4        60.3   37.2   34.5  56.7
2025.08  65     71.6  59.4       56.2        41.8   30     -     -
2025.08  63.1   68.1  52.6       58.4        41.6   29.6   -     -
2025.08  61.5   72.6  57.7       58.9        53     29.9   36.8  52.9
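
In the three rows that report all seven tasks, the eighth value matches the unweighted mean of the seven accuracies, rounded to one decimal. The sketch below checks that arithmetic. It assumes the scores appear in the task order given in the title, and that the final value is an average (the page does not label it). The names `ROWS` and `macro_average` are hypothetical, introduced only for illustration.

```python
from statistics import mean

# Task order assumed from the page title (an assumption, not stated per column).
TASKS = ["BoolQ", "PIQA", "Hellaswag", "Winogrande", "ARC-e", "ARC-c", "OBQA"]

# Rows that report all seven tasks: (per-task accuracies, final listed value).
ROWS = [
    ([77.7, 79.1, 76.0, 69.1, 74.6, 46.2, 44.2], 66.7),
    ([65.5, 73.2, 59.5, 66.4, 60.3, 37.2, 34.5], 56.7),
    ([61.5, 72.6, 57.7, 58.9, 53.0, 29.9, 36.8], 52.9),
]


def macro_average(scores):
    """Unweighted mean over tasks, rounded to one decimal place."""
    return round(mean(scores), 1)


if __name__ == "__main__":
    for scores, listed in ROWS:
        computed = macro_average(scores)
        status = "matches" if computed == listed else "differs from"
        print(f"computed {computed} {status} listed {listed}")
```

The two rows with missing OBQA scores list no final value, which is consistent with an average taken only when all seven tasks are reported.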