| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Scene Text-Centric Visual Question Answering | STVQA | Accuracy: 0.759 | 20 |
| Spatial Reasoning | STVQA 300 samples 7k (train) | Relative Score: 88.5 | 13 |
| Spatial Reasoning | STVQA-7k (test) | Relative Position Accuracy: 79.3 | 6 |
| Visual Question Answering | STVQA-7k | Relation Acc: 86.4 | 6 |