GUI Agent Task Execution on Live mini-programs
[Chart: Success Rate over time. Highlighted point: Gemini-3-Flash, 49 (Feb 11, 2026).]
Evaluation Results
Method            Settings                    Date      Success Rate
Gemini-3-Flash    Grounding=w/ Grounding      2026.02   49
Claude-4.5-Opus   Grounding=w/ Grounding      2026.02   46.3
UI-Oceanus        Model Size=32B, Stage=...   2026.02   44.3
UI-Oceanus        Model Size=32B, Stage=...   2026.02   38.9
UI-Oceanus        Model Size=32B, Stage=...   2026.02   29.5
Seed-1.8          Grounding=w/ Grounding      2026.02   27.5