Machine Translation on anaphora benchmark (val+test)

54.41 BLEU

gpt-4o

Updated 4mo ago

Evaluation Results

Method	Links	BLEU
gpt-4o 2025.10		54.41	71.49	92.5	91.18	5.35
gpt-4 2025.10		52.86	71.35	92.75	91.67	3.78
gpt-3.5-turbo 2025.10		49.19	67.81	91.42	90.36	5.22
gpt-4-turbo 2025.10		49.17	67.41	89.9	90.3	4.17
gpt-4 2025.10		49.08	69.58	92.42	91.61	-
gpt-4o 2025.10		49.06	69.53	92.34	91.01	-
Phi-4 2025.10		49.01	68.28	91.8	90.52	5.58
LLaMA 3.3 2025.10		46.58	67.38	91.33	89.76	2.68
gpt-4-turbo 2025.10		45	66.51	91.63	91.14	-
gpt-3.5-turbo 2025.10		43.97	65.66	91.25	91.16	-
LLaMA 3.3 2025.10		43.9	66.3	91.79	90.19	-
Phi-4 2025.10		43.43	63.89	90.84	87.82	-
DeepSeek-R1 32B 2025.10		39.36	58.81	83.79	84.5	4.34
DeepSeek-R1 32B 2025.10		35.02	58.85	89.3	87.36	-
LLaMA 3.1 2025.10		34.13	57.76	88.91	85.94	-
DeepSeek-R1 14B 2025.10		34.04	58.2	88.55	85.51	-
DeepSeek-R1 14B 2025.10		32.98	53.29	79.28	79.04	-1.06
LLaMA 3.1 2025.10		30.35	54.41	86.78	84.64	-3.78
Mistral 2025.10		26.69	50.59	85.53	82.29	-
LLaMA 3.2 2025.10		26.67	51.24	86.67	81.63	-
Mistral 2025.10		26.66	49.79	84.69	80.47	-0.03
DeepSeek-R1 8B 2025.10		23.74	49.96	85.86	80.1	-
LLaMA 3.2 2025.10		23.19	47.41	82.77	75.84	-3.48
DeepSeek-R1 8B 2025.10		21.26	45.21	76.78	77.12	-2.48