Step-GUI Technical Report
About
Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how can high-quality training data be acquired efficiently while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy at 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenSpot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation, whose hierarchical architecture combines low-level atomic operations with high-level task delegation to local specialist models, enabling high-privacy execution in which sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
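The hierarchical design described above can be illustrated with a minimal sketch. All names here (`LocalSpecialist`, `dispatch`, `ATOMIC_OPS`) are hypothetical and not part of the GUI-MCP specification; the sketch only shows the routing idea: low-level atomic operations execute directly, while higher-level tasks are delegated to a local specialist model so sensitive data never leaves the device.

```python
from dataclasses import dataclass, field

@dataclass
class LocalSpecialist:
    """Stand-in for an on-device model that handles high-level tasks (hypothetical)."""
    handled: list = field(default_factory=list)

    def delegate(self, task: str) -> str:
        # The task is recorded and processed locally; nothing is sent off-device.
        self.handled.append(task)
        return f"delegated:{task}"

# Illustrative set of low-level atomic operations (names are assumptions).
ATOMIC_OPS = {"tap", "swipe", "type"}

def dispatch(request: str, specialist: LocalSpecialist) -> str:
    """Route a request: atomic ops run directly, anything else is delegated locally."""
    op = request.split(":", 1)[0]
    if op in ATOMIC_OPS:
        return f"executed:{request}"        # low-level path
    return specialist.delegate(request)     # high-level, privacy-preserving path

spec = LocalSpecialist()
print(dispatch("tap:login_button", spec))   # low-level atomic operation
print(dispatch("book a taxi home", spec))   # high-level task, stays on-device
```

The two-tier split keeps a remote orchestrator limited to coarse intents, while fine-grained screen content is only ever seen by the local specialist.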
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 95.1 | 283 |
| Optical Character Recognition | OCRBench | -- | -- | 232 |
| GUI Grounding | ScreenSpot Pro | Accuracy | 62.6 | 163 |
| Multimodal Reasoning | MMStar | -- | -- | 143 |
| Mobile Task Automation | AndroidWorld (test) | Average Success Rate | 0.677 | 119 |
| GUI Grounding | OSWorld-G | Average Score | 70 | 107 |
| Mathematical Reasoning | MathVista mini | Accuracy | 74.4 | 102 |
| Visual Question Answering | SimpleVQA | Accuracy | 0.508 | 99 |
| Logical Reasoning | LogicVista | Accuracy | 53.7 | 84 |
| Multimodal Understanding | MMBench Chinese | MMB Benchmark (CN) | 88 | 70 |