CAPTCHA Solving for Native GUI Agents: Automated Reasoning-Action Data Generation and Self-Corrective Training
About
GUI agents are rapidly shifting from multi-module pipelines to end-to-end, native vision-language models (VLMs) that perceive raw screenshots and directly interact with digital devices. Despite rapid progress on general GUI tasks, CAPTCHA solving remains a major challenge. Specialized CAPTCHA-solving pipelines exist, but they cannot handle general GUI tasks. To address this gap, we introduce ReCAP: a CAPTCHA-capable native GUI agent that can robustly solve modern, interactive CAPTCHA challenges while preserving its performance as a general GUI agent. We first develop a dynamic CAPTCHA system spanning seven representative CAPTCHA types, designed to stress primitive and complementary capabilities for CAPTCHA solving (e.g., robust OCR under heavy noise and text stylization, fine-grained visual understanding, and precise control). Then, we develop an automated data collection and curation pipeline that generates large-scale CAPTCHA interaction trajectories paired with reasoning traces. As CAPTCHA solving often requires multi-step interaction and recovery from intermediate mistakes, we further leverage failed trajectories to construct self-correction data, training agents to reflect on errors and correct their actions online. Across held-out test sets, ReCAP improves CAPTCHA-solving success from roughly 30% to 80%, while maintaining strong performance on general GUI-agent benchmarks.
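The abstract does not specify how failed trajectories are turned into self-correction data. As a minimal sketch of one plausible shape for that step, the Python below locates the first mistaken step in a failed trajectory and pairs the history up to that point with a reflection and a corrected action. Every name here (`Step`, `CorrectionSample`, `build_correction_sample`, the `correct` flag, and the `reference_actions` list) is a hypothetical illustration and not ReCAP's actual pipeline or data schema.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    observation: str  # hypothetical: identifier of the screenshot the agent saw
    reasoning: str    # reasoning trace produced before acting
    action: str       # action the agent emitted
    correct: bool     # hypothetical: verdict of some external step checker


@dataclass
class CorrectionSample:
    history: List[Step]     # steps up to and including the mistaken one
    reflection: str         # text describing what went wrong
    corrected_action: str   # action to take instead


def build_correction_sample(
    failed: List[Step], reference_actions: List[str]
) -> Optional[CorrectionSample]:
    """Build one self-correction sample from the first mistaken step, if any."""
    for i, step in enumerate(failed):
        if step.correct:
            continue
        if i >= len(reference_actions):
            return None  # no reference action available for this step
        target = reference_actions[i]
        reflection = (
            f"The action '{step.action}' at step {i} was a mistake; "
            f"the next action should be '{target}'."
        )
        return CorrectionSample(failed[: i + 1], reflection, target)
    return None  # no mistaken step found


# Assumed toy slider trajectory: the second step overshoots the gap.
trajectory = [
    Step("shot_0", "Locate the slider handle.", "press(40, 200)", True),
    Step("shot_1", "Drag to the gap.", "drag_to(260, 200)", False),
]
sample = build_correction_sample(trajectory, ["press(40, 200)", "drag_to(215, 200)"])
```

In this sketch the training target would be the reflection followed by the corrected action, conditioned on the erroneous history, so the model sees its own mistake before the fix.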
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 93.24 | 283 |
| CAPTCHA Solving | Dynamic CAPTCHA system (held-out test) | SR Text | 62.11 | 7 |
| Android GUI Navigation | Android Control low | Success Rate | 67.4 | 5 |
| CAPTCHA Solving | tencent vtt CAPTCHA zero-shot | Solve Rate (zero-shot) | 41 | 5 |
| CAPTCHA Solving | recaptcha zero-shot v2 | Solve Rate | 63 | 5 |
| CAPTCHA Solving | hcaptcha CAPTCHA zero-shot | Solve Rate | 26 | 5 |
| CAPTCHA Solving | geetest/slide CAPTCHA zero-shot | Solve Rate | 36 | 5 |
| CAPTCHA Solving | lemin CAPTCHA zero-shot | Solve Rate | 8 | 5 |
| CAPTCHA Solving | amazon waf CAPTCHA zero-shot | Solve Rate | 16 | 5 |
| CAPTCHA Solving | funcaptcha hand_number zero-shot | Solve Rate | 51 | 5 |