Failure Reasoning and Correction on Real-World Benchmark (test)
[Chart: ROUGE-L. Top result: Dream2Fix-VLM, 62.1, Mar 13, 2026. Metric tabs: ROUGE-L, Cosine Similarity, Binary Success, Fuzzy Match Score, Accuracy.]
Updated 1mo ago
Evaluation Results
| Method           | Mode      | Date    | ROUGE-L | Cosine Similarity | Binary Success | Fuzzy Match Score | Accuracy |
|------------------|-----------|---------|---------|-------------------|----------------|-------------------|----------|
| Dream2Fix-VLM    | Zero-Shot | 2026.03 | 62.1    | 66.8              | 82             | 42.1              | 47.2     |
| Gemini-1.5-Flash | Zero-Shot | 2026.03 | 46.7    | 58.9              | 98             | 25                | 37.4     |
| GPT-4o           | Zero-Shot | 2026.03 | 19      | 47.3              | 72             | 22.1              | 12.6     |
| LLaVA-NeXT-34B   | Zero-Shot | 2026.03 | 9       | 9                 | 30             | 12.8              | 2.2      |
| Qwen2-VL-72B     | Zero-Shot | 2026.03 | 6.1     | 47.8              | 93             | 16.7              | 18.3     |
| Qwen2.5-VL-7B    | Zero-Shot | 2026.03 | 5.2     | 17.6              | 25             | 10.5              | 0.8      |
| LLaVA-NeXT-7B    | Zero-Shot | 2026.03 | 0       | 0                 | 0              | 35.4              | 0        |