Question Answering on MedXpertQA standard (test)

41.7 Accuracy

GPT-4.1

Updated 3mo ago

Evaluation Results

Method	Links	Accuracy
GPT-4.1 2025.11		41.7	0.551	0.559	0.564
GPT-4.1 2025.11		40.4	0.583	0.399	0.402
GPT-4.1 2025.11		40	0.656	0.309	0.29
GPT-4.1 2025.11		39.7	-	-	-
DeepSeek-V3 2025.11		31.3	0.548	0.542	0.573
DeepSeek-V3 2025.11		30.9	0.55	0.434	0.467
DeepSeek-V3 2025.11		30.7	-	-	-
DeepSeek-V3 2025.11		29.7	0.574	0.302	0.275
Qwen3-30B-A3B-Instruct 2025.11		25.8	0.565	0.672	0.689
Qwen3-30B-A3B-Instruct 2025.11		25.2	0.525	0.658	0.683
Qwen3-30B-A3B-Instruct 2025.11		25.1	0.582	0.339	0.359
Qwen3-30B-A3B-Instruct 2025.11		24.7	-	-	-
Qwen3-30B-A3B-Instruct 2025.11		24.7	0.508	0.734	0.732
Qwen3-30B-A3B-Instruct 2025.11		24.7	0.51	0.734	0.736
Baichuan-M2-32B-GPTQ-INT4 2025.11		21.1	0.516	0.608	0.66
Baichuan-M2-32B-GPTQ-INT4 2025.11		20.7	-	-	-
Baichuan-M2-32B-GPTQ-INT4 2025.11		20.7	0.481	0.788	0.789
Baichuan-M2-32B-GPTQ-INT4 2025.11		20.7	0.459	0.57	0.617
Baichuan-M2-32B-GPTQ-INT4 2025.11		20.7	0.539	0.303	0.345
Mistral-3.2-24B-Instruct 2025.11		20	-	-	-
Mistral-3.2-24B-Instruct 2025.11		20	0.526	0.753	0.763
Mistral-3.2-24B-Instruct 2025.11		20	0.539	0.52	0.576
Mistral-3.2-24B-Instruct 2025.11		20	0.549	0.39	0.444
Baichuan-M2-32B-GPTQ-INT4 2025.11		18.7	0.509	0.526	0.607
Mistral-3.2-24B-Instruct 2025.11		18.6	0.556	0.514	0.6
Mistral-3.2-24B-Instruct 2025.11		18.1	0.536	0.64	0.695
Qwen3-4B-Instruct 2025.11		17.6	-	-	-
Qwen3-4B-Instruct 2025.11		17.6	0.505	0.81	0.808
Qwen3-4B-Instruct 2025.11		17.6	0.497	0.534	0.572
Qwen3-4B-Instruct 2025.11		17.2	0.564	0.395	0.472
Qwen3-4B-Instruct 2025.11		16.9	0.518	0.733	0.768
Qwen3-4B-Instruct 2025.11		16	0.548	0.789	0.808