
Human-Model Agreement Evaluation on Rebuttal-RM dataset 1.0 (test)

Attitude (Pearson r): 0.839

Rebuttal-RM

Updated 5mo ago

Evaluation Results

Method	Links
Rebuttal-RM 2026.01		0.839	0.828	91	0.753	0.677	79	0.821	0.801	82	0.839	0.835	81	0.812
GPT-4.1 2026.01		0.743	0.712	80	0.739	0.671	75	0.779	0.763	74	0.804	0.756	68	0.745
Qwen3-8B 2026.01		0.718	0.672	62	0.609	0.568	71	0.622	0.577	69	0.718	0.745	72	0.664
DeepSeek-v3 2026.01		0.699	0.733	71	0.687	0.578	74	0.697	0.652	77	0.771	0.719	75	0.692
DeepSeek-r1 2026.01		0.646	0.633	79	0.708	0.615	76	0.71	0.664	72	0.742	0.701	62	0.705
Gemini-2.5 2026.01		0.62	0.509	75	0.605	0.593	54	0.627	0.607	52	0.711	0.705	61	0.616
Claude-3.5 2026.01		0.569	0.635	72	0.704	0.67	68	0.706	0.686	67	0.753	0.738	63	0.68
GLM-4-9B 2026.01		0.42	0.475	46	0.467	0.436	73	0.369	0.361	70	0.561	0.519	57	0.506
Llama-3.1-8B 2026.01		0.297	0.347	54	0.158	0.047	38	0.272	0.245	56	0.424	0.457	46	0.349
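
The headline metric above is a Pearson correlation between human and model judgments. As a rough illustration of how such an agreement score can be computed, here is a minimal sketch. The function name `pearson_r` and the example score lists are hypothetical; they are not taken from the Rebuttal-RM dataset or its evaluation code.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical scores for illustration only (not real dataset values).
    human_scores = [7, 5, 9, 3, 6]
    model_scores = [6, 5, 8, 4, 7]
    print(f"Pearson r = {pearson_r(human_scores, model_scores):.3f}")
```

A value near 1 means the model's scores rise and fall with the human scores; a value near 0 means little linear agreement.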