Contextual Machine Translation on anaphora benchmark

0.54 BLEU

gpt-4o

Updated 4mo ago

Evaluation Results

Method	Links	BLEU	(unlabeled)	(unlabeled)	(unlabeled)
gpt-4o 2025.10		0.54	0.71	0.92	0.91
gpt-4 2025.10		0.53	0.71	0.93	0.92
gpt-4 2025.10		0.49	0.7	0.92	0.92
gpt-4o 2025.10		0.49	0.7	0.92	0.91
gpt-4-turbo 2025.10		0.49	0.67	0.9	0.9
gpt-3.5-turbo 2025.10		0.49	0.68	0.91	0.9
Phi-4 2025.10		0.49	0.68	0.92	0.91
Llama 3.3 2025.10		0.47	0.67	0.91	0.9
gpt-4-turbo 2025.10		0.45	0.67	0.92	0.91
gpt-3.5-turbo 2025.10		0.44	0.66	0.91	0.91
Llama 3.3 2025.10		0.44	0.66	0.92	0.9
Phi-4 2025.10		0.43	0.64	0.91	0.88
DeepSeek-R1 32B 2025.10		0.39	0.59	0.84	0.84
nllb-200 2025.10		0.37	0.6	0.9	0.87
DeepSeek-R1 32B 2025.10		0.35	0.59	0.89	0.87
Llama 3.1 2025.10		0.34	0.58	0.89	0.86
DeepSeek-R1 14B 2025.10		0.34	0.58	0.89	0.86
DeepSeek-R1 14B 2025.10		0.33	0.53	0.79	0.79
Llama 3.1 2025.10		0.3	0.54	0.87	0.85
Mistral 2025.10		0.27	0.51	0.86	0.82
Mistral 2025.10		0.27	0.5	0.85	0.8
Llama 3.2 2025.10		0.27	0.51	0.87	0.82
DeepSeek-R1 8B 2025.10		0.24	0.5	0.86	0.8
Llama 3.2 2025.10		0.23	0.47	0.83	0.76
DeepSeek-R1 8B 2025.10		0.21	0.45	0.77	0.77