In-the-wild model generalization

Benchmarks

Dataset Name	SOTA Method	Metric	Value
Human Bench Average	Qwen3VL-2B	NSE Score	57.9	14	5mo ago
Human Bench Text-based Demo	Qwen3VL-32B	NSE	23.4	14	5mo ago
Human Bench Vision-based Demo	ProgressLM-RL-3B	NSE	15.5	14	5mo ago
