Step-GUI Technical Report
About
Recent advances in multimodal large language models unlock unprecedented opportunities for GUI automation. However, a fundamental challenge remains: how to efficiently acquire high-quality training data while maintaining annotation reliability? We introduce a self-evolving training pipeline powered by the Calibrated Step Reward System, which converts model-generated trajectories into reliable training signals through trajectory-level calibration, achieving >90% annotation accuracy at 10-100x lower cost. Leveraging this pipeline, we introduce Step-GUI, a family of models (4B/8B) that achieves state-of-the-art GUI performance (8B: 80.2% AndroidWorld, 48.5% OSWorld, 62.6% ScreenSpot-Pro) while maintaining robust general capabilities. As GUI agent capabilities improve, practical deployment demands standardized interfaces across heterogeneous devices while protecting user privacy. To this end, we propose GUI-MCP, the first Model Context Protocol for GUI automation, with a hierarchical architecture that combines low-level atomic operations and high-level task delegation to local specialist models, enabling high-privacy execution where sensitive data stays on-device. Finally, to assess whether agents can handle authentic everyday usage, we introduce AndroidDaily, a benchmark grounded in real-world mobile usage patterns with 3,146 static actions and 235 end-to-end tasks across high-frequency daily scenarios (8B: static 89.91%, end-to-end 52.50%). Our work advances the development of practical GUI agents and demonstrates strong potential for real-world deployment in everyday digital interactions.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 95.1 | 203 |
| Optical Character Recognition | OCRBench | OCRBench Score | 88 | 83 |
| Multimodal Reasoning | MMStar | -- | -- | 81 |
| GUI Grounding | ScreenSpot Pro | Accuracy | 62.6 | 77 |
| GUI Grounding | OSWorld-G | Average Score | 70 | 74 |
| Mathematical Reasoning | MathVista mini | Accuracy | 74.4 | 72 |
| Multimodal Understanding | MMBench Chinese | MMB Benchmark (CN) | 88 | 70 |
| Multimodal Understanding | MMBench English | MMB Score | 89.1 | 55 |
| GUI Navigation | AndroidWorld latest (test) | Success Rate | 67.7 | 35 |
| Grounding | ScreenSpot v2 | Accuracy | 95.1 | 23 |