Evaluation Tasks

Benchmarks

Task Name	Dataset Name	SOTA Result
Zero-shot Evaluation	Evaluation Tasks Zero-shot Aggregate	Avg. Accuracy: 79.95	74
Language Modeling	Evaluation Tasks Zero-shot Average	Zero-shot Average Accuracy: 60.47	17
Path-Following	10,000 Evaluation Tasks Difficult n=2563	Path Length Mean (m): 0.32	3
Path-Following	10,000 Evaluation Tasks Medium n=4795	Mean Path Length (m): 0.57	3
Path-Following	10,000 Evaluation Tasks Easy n=2642	Path Length Mean (m): 0.53	3
Path-Following	Evaluation Tasks n=10000 (All)	Mean Path Length (m): 0.45	3
