Goal reconstruction on OneStop (New Item)
[Chart: BERTScore by date. Highlighted result: Incorrect Human (same critical span), BERTScore 0.651, May 4, 2025. Available metrics: BERTScore, QA Accuracy.]
Updated 3mo ago
Evaluation Results
| Method | Details | Date | BERTScore | QA Accuracy |
| --- | --- | --- | --- | --- |
| Incorrect Human (same critical span) | Human baseline conditi... | 2025.05 | 0.651 | 67.7 |
| Gemini few-shot | Model backbone=Gemini... | 2025.05 | 0.642 | 68.3 |
| DalEye-Llama | Input modality=Text +... | 2025.05 | 0.631 | 64.8 |
| DalEye-GPT | Input modality=Text +... | 2025.05 | 0.63 | 65.8 |
| Gemini zero-shot | Model backbone=Gemini... | 2025.05 | 0.629 | 66.4 |
| Text-only GPT-4o-mini | Input modality=Text-on... | 2025.05 | 0.619 | 61.9 |
| DalEye-LLaVA | Input modality=Text +... | 2025.05 | 0.618 | 61 |
| Text-only Llama 3.1 | Input modality=Text-on... | 2025.05 | 0.617 | 60.9 |
| Arbitrary Gemini 3 | Model backbone=Gemini... | 2025.05 | 0.612 | 63.6 |
| Incorrect Human (different critical span) | Human baseline conditi... | 2025.05 | 0.603 | 49 |
| Text-only LLaVA OneVision | Input modality=Text-on... | 2025.05 | 0.595 | 63.6 |