Tool Use Evaluation on ShortcutsBench (clear instructions, 200 queries)

92.5 API Selection Accuracy

GPT-4o

Updated 2mo ago

Evaluation Results

Method	Links	API Selection Accuracy	(unlabeled)
GPT-4o 2024.08		92.5	42.5
DeepSeek V3 2024.08		91.5	53.5
DeepSeek V3 2024.08		89	53.5
Claude 3.5 2024.08		89	46
Claude 3.5 2024.08		88.5	50
Gemini 1.5 2024.08		87	46
GPT-4o 2024.08		86	45
Gemini 1.5 2024.08		83.5	46