Reward Modeling on Reward Bench safety subset prompt perturbations 2

-0.18 EF

Llama-3-OffsetBias-RM-8B

-0.21884  0.04333  0.3055  0.56767  Nov 30, 2025
Updated 4d ago

Evaluation Results

Method  Links
2025.11   -0.18    0.065    0.179    0.063   -0.003
2025.11   -0.118   0.028    0.094    0.058   -0.068
2025.11   -0.109   0.064    0.051   -0.023    0.002
2025.11   -0.105  -0.029   -0.122   -0.104    0.042
2025.11   -0.104   0.079   -0.067   -0.006    0.079
2025.11   -0.101   0.341   -0.007    0.188    0.222
2025.11   -0.07   -0.018   -0.026   -0.192   -0.01
2025.11   -0.068   0.087    0.061    0.084    0.016
2025.11   -0.052  -0.084    0.009   -0.031    0.01
2025.11   -0.051   0.001   -0.023    0.041   -0.015
2025.11   -0.029  -0.05    -0.042   -0.121    0.069
2025.11   -0.023   0.012   -0.038    0.031    0.134
2025.11   -0.011   0.31     0.078    0.072    0.312
2025.11   -0.009  -0.225    0.101    0.008   -0.029
2025.11    0.014   0.073    0.063    0.062    0.044
2025.11    0.029   0.069   -0.033   -0.071   -0.05
2025.11    0.042   0.071    0.028    0.111   -0.019
2025.11    0.055   0.019    0.003    0.027    0.012
2025.11    0.065   0.02     0.013   -0.064    0.119
2025.11    0.092   0.001   -0.128   -0.173    0.009
2025.11    0.126   0.057    0.064    0.076    0.073
2025.11    0.16    0.276   -0.055   -0.169    0.061
2025.11    0.183  -0.019   -0.207    0.036    0.249
2025.11    0.213   0.1      0.046    0.292    0.074
2025.11    0.239  -0.005    0.517    0.159    0.261
2025.11    0.791  -0.128   -0.186   -0.626    0.643