
Functional correctness for backend applications on BaxBench

Functional Correctness: 69.8

Claude-Sonnet-4.5

Updated 4mo ago

Evaluation Results

Method	Links	Functional Correctness
Claude-Sonnet-4.5 2025.12		69.8
GPT-5 2025.12		67.1
DeepSeek-V3.1-Nex-N1 2025.12		59.7
Kimi-K2-thinking 2025.12		57.4
DeepSeek-V3.1 2025.12		50.1
Gemini-2.5-pro 2025.12		49.7
Qwen3-32B 2025.12		35.6
Qwen3-32B-Nex-N1 2025.12		34.8
GLM-4.6 2025.12		32.1
Qwen3-30B-A3B 2025.12		27.2
Minimax-M2 2025.12		23.4
Qwen3-30B-A3B-Nex-N1 2025.12		13.6
InternLM3-8B 2025.12		1.6
InternLM3-8B-Nex-N1 2025.12		0.3