Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs
About
A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with per-deployment error budget $\alpha$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}}\le\alpha+O(N_T^{-1/2})$, (ii) rate-optimal certification matching $\Theta(\bar\eta^{-2}\log(1/\delta))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only method among ten compared that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.
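The wrapper described above (one e-process per threshold, a Bonferroni split across the grid, acting at the largest certified threshold) can be illustrated with a minimal sketch. This is not the paper's construction. It assumes a fixed-size betting e-process, a calibrated risk score in $[0,1]$ where the model acts when the score is at most the threshold, and a binary verifiable loss observed every round. The class name `ThresholdEProcessGrid` and the parameters `bet`, `risk_score`, and `loss` are hypothetical.

```python
import math


class ThresholdEProcessGrid:
    """Hypothetical sketch: one betting e-process per risk threshold."""

    def __init__(self, alpha, delta, thresholds, bet=1.0):
        # The bet must keep every wealth factor strictly positive.
        if not 0.0 < bet < 1.0 / (1.0 - alpha):
            raise ValueError("bet must lie in (0, 1 / (1 - alpha))")
        self.alpha = alpha
        self.bet = bet
        self.thresholds = sorted(thresholds)
        k = len(self.thresholds)
        # Bonferroni split: each threshold must reach wealth k / delta.
        self.log_bar = math.log(k / delta)
        self.log_wealth = [0.0] * k
        self.certified = [False] * k

    def max_certified(self):
        """Largest certified threshold, or None if nothing is certified yet."""
        passed = [lam for lam, ok in zip(self.thresholds, self.certified) if ok]
        return max(passed) if passed else None

    def act(self, risk_score):
        """Decide using only past certifications (a predictable rule)."""
        lam = self.max_certified()
        return lam is not None and risk_score <= lam

    def update(self, risk_score, loss):
        """Fold in one round once its verifiable loss (0 or 1) is known."""
        # Under the null "error rate among acted rounds is at least alpha",
        # this factor has conditional mean at most 1 (a supermartingale step).
        factor = 1.0 + self.bet * (self.alpha - loss)
        for i, lam in enumerate(self.thresholds):
            if risk_score <= lam:  # threshold lam would have acted this round
                self.log_wealth[i] += math.log(factor)
                if self.log_wealth[i] >= self.log_bar:
                    self.certified[i] = True
```

In this sketch, Ville's inequality bounds the chance that a threshold whose selective risk exceeds $\alpha$ ever reaches wealth $K/\delta$ by $\delta/K$, so the union over the $K$ grid points is at most $\delta$. With $\alpha=\delta=0.1$, three thresholds, and `bet=1.0`, a threshold needs 36 consecutive error-free would-act rounds to certify, since $1.1^{36}\approx 30.9\ge 30$. The fixed bet is a simplification; an adaptive, predictable bet would certify faster when the margin below $\alpha$ is large.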
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Reinforcement Learning from Verifiable Rewards | HEAD-QA | AR | 90.9 | 30 |
| Distribution Shift Robustness | Sixteen Adversarial Cells MedQA + GSM8K (eval) | Violations | 0.00e+0 | 10 |
| Expert-Iteration RLVR | MedQA, HEAD-QA, ARC-C, and CaseHOLD | Pathwise Clean Score | 4 | 10 |
| Mathematical Reasoning | GSM8K | AR (%) | 71.7 | 10 |
| Natural Language Inference | medNLI | AR (%) | 28.9 | 10 |
| Question Answering | MedQA | AR (%) | 39.4 | 9 |
| Question Answering | CaseHOLD | AR (%) | 49.8 | 9 |
| Selective Risk Control | Scenario A (linear, heteroskedastic) | FCR | 3.4 | 8 |
| Selective Risk Control | Scenario B (nonlinear, SVM) | False Claim Rate (FCR) | 1.7 | 8 |
| Question Answering | PubMedQA | AR (%) | 63.5 | 8 |