OpenClaw-RL: Train Any Agent Simply by Talking
About
Every agent interaction generates a next-state signal, namely the user reply, tool output, or terminal or GUI state change that follows each action, yet no existing agentic RL system recovers it as a live, online learning source. We present OpenClaw-RL, a framework that employs next-state signals to optimize personal agents online through infrastructure and methodology innovations. On the infrastructure side, we extend existing RL systems to a server-client architecture where the RL server hosts the policy behind an inference API and user terminals stream interaction data back over HTTP. From each observed next state, the system extracts two complementary training signals, evaluative and directive, via a separate asynchronous server so that neither signal extraction nor optimization blocks inference. On the methodology side, we introduce a hybrid RL objective that unifies both signal types in a single update: directive signals provide richer, token-level supervision but are sparser, while evaluative signals are more broadly available. To stabilize distillation under teacher-student mismatch, we propose overlap-guided hint selection, which picks the hint whose induced teacher distribution maximally overlaps with the student's top-$k$ tokens, together with a log-probability-difference clip that bounds per-token advantages. Applied to personal agents, OpenClaw-RL enables an agent to improve simply by being used, recovering conversational signals from user re-queries, corrections, and explicit feedback. Applied to general agents, OpenClaw-RL is the first RL framework to unify real-world agent settings spanning terminal, GUI, SWE, and tool-call environments, where we additionally demonstrate the utility of next-state signals in long-horizon settings.
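The overlap-guided hint selection and log-probability-difference clip mentioned above can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: overlap is taken to be the teacher probability mass that falls on the student's top-k tokens, the per-token advantage is taken to be the teacher-minus-student log-probability difference clamped to a symmetric range, and the names `select_hint`, `overlap_score`, and `clipped_advantage` are hypothetical rather than part of OpenClaw-RL.

```python
import math

def top_k_indices(probs, k):
    """Indices of the k largest probabilities (ties broken by lower index)."""
    return sorted(range(len(probs)), key=lambda i: (-probs[i], i))[:k]

def overlap_score(teacher_probs, student_probs, k):
    """Assumed overlap measure: teacher mass on the student's top-k tokens."""
    return sum(teacher_probs[i] for i in top_k_indices(student_probs, k))

def select_hint(teacher_probs_by_hint, student_probs, k):
    """Pick the hint whose teacher distribution overlaps most with the student."""
    return max(
        teacher_probs_by_hint,
        key=lambda hint: overlap_score(teacher_probs_by_hint[hint], student_probs, k),
    )

def clipped_advantage(teacher_logp, student_logp, clip):
    """Assumed per-token advantage: log-prob difference clamped to [-clip, clip]."""
    diff = teacher_logp - student_logp
    return max(-clip, min(clip, diff))

# Toy next-token distributions over a 4-token vocabulary (made-up numbers).
student = [0.5, 0.3, 0.15, 0.05]
teachers = {
    "hint_a": [0.05, 0.05, 0.1, 0.8],  # teacher pulled far from the student
    "hint_b": [0.4, 0.4, 0.1, 0.1],    # teacher close to the student's top-2
}
best = select_hint(teachers, student, k=2)
# A large teacher-student gap is bounded by the clip value.
bounded = clipped_advantage(math.log(0.4), math.log(0.01), clip=2.0)
```

In this toy case `hint_b` is selected because its teacher distribution places most of its mass on the tokens the student already ranks highest, and the large log-probability gap is bounded at the clip value instead of dominating the update.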
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long-term memory evaluation | Locomo | -- | 128 |
| Task-goal completion | AppWorld | Average Completion Score @47.65 | 7 |
| Multi-Turn Function Calling | BFCL BASE and LONG CONTEXT multi-turn v3 | Avg@4 Success Rate: 28.28 | 7 |
| Instruction-following and procedural reasoning | SOP-Bench | Accuracy: 100 | 5 |
| Sequential task management and state maintenance | Lifelong AgentBench | Accuracy: 70 | 5 |
| Financial workflow task execution | RealFin benchmark | Accuracy: 35 | 5 |
| Long-horizon Task Execution | Long-horizon complex tasks (test) | Success Rate: 80 | 3 |
| Web Browsing | WebCanvas | Primary Score: 0.72 | 2 |
| Web Browsing | Custom Tasks | Score: 50 | 2 |