Medical QA on Medical
[Chart: scores over time, date marker May 20, 2026. Labeled point: Llama-3.2-1B-Instruct, Clean Accuracy (Eager) = 100. Selectable series: Clean Accuracy (Eager), Clean Accuracy (Compiled), Trigger Score (Eager), Trigger Score (Compiled).]
Updated 13d ago
Evaluation Results
| Method | Details | Date | Clean Accuracy (Eager) | Clean Accuracy (Compiled) | Trigger Score (Eager) | Trigger Score (Compiled) |
|---|---|---|---|---|---|---|
| Llama-3.2-1B-Instruct | Execution Backend=CUDA... | 2026.05 | 100 | 100 | 46.3 | 56.2 |
| Llama-3.2-3B-Instruct | Execution Backend=CUDA... | 2026.05 | 100 | 100 | 51.2 | 55 |
| Qwen2.5-3B-Instruct | Execution Backend=CUDA... | 2026.05 | 100 | 100 | 51.2 | 57.5 |
| Qwen2.5-1.5B-Instruct | Execution Backend=CUDA... | 2026.05 | 55 | 52.5 | 56.2 | 65 |