
GOAT-Bench

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Meme Abuse Detection | GOAT-Bench 1.0 (test) | Harmfulness Accuracy 75.1 | 32 |
| Multi-Modal Lifelong Navigation | GOAT-Bench unseen (val) | SR 62.7 | 22 |
| Lifelong Visual Navigation | GOAT-Bench 1/10-scale subset (val-unseen) | Success Rate 72.8 | 13 |
| Multi-modal Lifelong Navigation | GOAT-Bench Seen (val) | SPL 49 | 13 |
| Goal-conditioned navigation | GOAT-Bench | SR 73.7 | 12 |
| Harmful Meme Detection | GOAT-Bench In-Domain | Racism F1 88.5 | 11 |
| Embodied Navigation | GOAT-Bench unseen (val) | Success Rate (SR) 71.4 | 10 |
| Lifelong Multimodal Object Navigation | GOAT-Bench unseen (val) | s-SR 0.465 | 10 |
| Subtask Navigation | GOAT-Bench unseen (val) | SR 62.7 | 9 |
| Harmful Meme Detection | GOAT-Bench (Out-Of-Domain) | Racism F1 87.1 | 7 |
| Visual Navigation | Full GOAT-Bench Unseen (val) | SR 54.3 | 6 |
| Visual Navigation | Full GOAT-Bench Synonyms (val) | Success Rate (SR) 58.4 | 6 |
| Visual Navigation | Full GOAT-Bench Seen (val) | SR 56.7 | 6 |
| Multi-modal Lifelong Navigation | GOAT-Bench Seen-Synonyms (val) | SR 66.8 | 6 |
| Lifelong Navigation | GOAT-Bench evaluation 278 subtasks | SR 72.41 | 4 |
| Lifelong Navigation | GOAT-Bench full 2,780 subtasks (val) | Success Rate (SR) 64.14 | 4 |
| Object Navigation | GOAT-Bench unseen (val) | SR (%) 45 | 4 |