Overrefusal Evaluation on GSM-8k
[Chart: RR by date, series "Baseline"; plotted date: Mar 12, 2026]
Updated 1mo ago
Evaluation Results
Method          Setting                   Date     RR
Baseline        Setting / Model=RLVR,...  2026.03  0
Db as Alpaca    Setting / Model=RLVR,...  2026.03  0
Db as Our Data  Setting / Model=RLVR,...  2026.03  0
Baseline        Setting / Model=P-SFT,... 2026.03  0.23
Db as Our Data  Setting / Model=P-SFT,... 2026.03  0.99
Db as Alpaca    Setting / Model=P-SFT,... 2026.03  51.18