Speech-to-text reasoning and semantic understanding on VoiceBench (test)

Best Alpaca Eval score: 4.8 (Whisper + GPT-4o)

Updated 1mo ago

Evaluation Results

Method	Date	Alpaca Eval	Other VoiceBench metrics
Whisper + GPT-4o	2025.12	4.8	4.47	4.62	75.77	87.2	98.27	76.51	92.97	81.69	87.8
GPT-4o	2025.12	4.78	4.49	4.58	75.5	84.1	98.65	76.02	89.23	80.25	86.75
Qwen3-Omni-30B	2025.12	4.74	4.54	4.58	76.9	80.4	99.3	77.8	89.7	68.1	85.49
Qwen2.5	2025.12	4.66	4.55	4.62	62.03	80	99.04	70.14	84.84	71.57	82.69
Whisper + Qwen2.5	2025.12	4.64	4.33	4.21	58.5	52.85	98.27	63.99	78.24	69	76.05
Qwen2.5 (TN)	2025.12	4.61	4.53	4.56	63.84	56.3	98.85	66.11	74.07	64.51	77.52
AZEROS	2025.12	4.44	4.18	3.91	60.22	56.3	98.65	61.29	72.09	59.01	73.13
GLM-4-Voice	2025.12	3.97	3.42	3.18	36.98	52.8	88.08	25.92	53.41	39.75	56.48
Qwen2.5-Omni	2025.12	3.88	3.77	3.52	46.75	63.7	97.31	40.19	81.54	61.45	68.26
Phi-4-multimodal	2025.12	3.81	3.82	3.56	39.78	61.8	100	45.35	65.93	42.19	64.32
DeSTA2.5	2025.12	3.73	2.52	3.3	46.47	62.4	97.69	65.47	72.75	58.56	66.04
Qwen2-Audio	2025.12	3.42	3.29	2.76	31.65	53	99.04	26.35	48.35	36.14	53.77
Moshi	2025.12	2.01	1.6	1.3	15.64	47.4	44.23	10.12	25.93	24.04	29.51