Conformal Selective Acting: Anytime-Valid Risk Control for RLVR-Trained LLMs

About

A local specialist LLM, fine-tuned with reinforcement learning from verifiable rewards (RLVR) on operator-local data, is installed in a regulated organization with a per-deployment error budget $\alpha$. The operator needs a safety certificate for this deployment's stream at every round: no pooling across deployments, no waiting for a long-run average. Existing wrappers cannot deliver this on adaptive, online-updated streams: offline conformal-risk methods require exchangeability; online-conformal methods bound only long-run averages; non-exchangeable extensions are only marginally valid; and the closest anytime wrapper, A-RCPS, controls marginal rather than selective risk. Using a (test statistic, validity guarantee, deployment rule) framework, we identify one empty cell forced by deployment requirements: e-process per threshold, selective risk, anytime-pathwise validity, max-certified-threshold rule. Conformal Selective Acting (CSA) fills it as a per-round wrapper maintaining a Ville-type e-process per threshold on a Bonferroni grid, evaluated against the RLVR filtration. Under predictable updates and isotonic-calibrated monotone risk we prove (i) an anytime-pathwise selective-risk bound $R_T^{\mathrm{act}}\le\alpha+O(N_T^{-1/2})$, (ii) rate-optimal certification matching $\Theta(\bar\eta^{-2}\log(1/\delta))$, and (iii) a horizon-independent release-rate gap. Across eight specialist benchmarks ($480$ streams), sixteen adversarial distribution-shift cells ($160$ streams), and five live Expert-Iteration RLVR cells with online LoRA over four base models in three architecture families ($10{,}300$ rounds), CSA is the only one of ten compared methods that satisfies pathwise validity and non-refusing deployment on every cell. We do not propose a new LLM, training algorithm, or policy class; CSA is the deployment-side complement, orthogonal to the model, for operators who cannot use a frontier API.
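The phrase "a Ville-type e-process per threshold on a Bonferroni grid" with a "max-certified-threshold rule" can be illustrated with a small sketch. This is not the authors' implementation. It is a generic fixed-bet wealth process under several assumptions that the abstract does not state: losses are binary, a verifier label is available on every round (including rounds where the wrapper would abstain), the wrapper acts when a hypothetical uncertainty score is at most the threshold, and certification is sticky once the wealth crosses the bar. The names `ThresholdEProcess`, `SelectiveActingSketch`, and `simulate` are illustrative only.

```python
import math
import random


class ThresholdEProcess:
    """Fixed-bet wealth process for one threshold.

    Null hypothesis: the selective risk at this threshold is >= alpha.
    Under the null, the wealth is a nonnegative supermartingale.
    """

    def __init__(self, alpha):
        self.alpha = alpha
        # Assumed bet size; any value in [0, 1/(1-alpha)] keeps wealth nonnegative.
        self.bet = 0.5 / (1.0 - alpha)
        self.log_wealth = 0.0
        self.running_max = 0.0

    def update(self, loss):
        # loss is 0 (correct) or 1 (error); wealth grows on 0 and shrinks on 1.
        self.log_wealth += math.log1p(self.bet * (self.alpha - loss))
        self.running_max = max(self.running_max, self.log_wealth)


class SelectiveActingSketch:
    """One e-process per grid threshold, each tested at level delta / K."""

    def __init__(self, thresholds, alpha, delta):
        self.thresholds = sorted(thresholds)
        # Bonferroni over K thresholds: certify when wealth >= K / delta.
        self.log_bar = math.log(len(self.thresholds) / delta)
        self.procs = {lam: ThresholdEProcess(alpha) for lam in self.thresholds}

    def observe(self, uncertainty, loss):
        # A round counts toward every threshold that would have acted on it.
        for lam in self.thresholds:
            if uncertainty <= lam:
                self.procs[lam].update(loss)

    def certified_threshold(self):
        # Largest threshold whose wealth has ever crossed the bar, else None.
        ok = [lam for lam in self.thresholds
              if self.procs[lam].running_max >= self.log_bar]
        return max(ok) if ok else None


def simulate(seed, rounds, alpha=0.1, delta=0.05):
    """Synthetic stream: error probability rises with the uncertainty score."""
    rng = random.Random(seed)
    wrapper = SelectiveActingSketch([0.2, 0.4, 0.6, 0.8, 1.0], alpha, delta)
    for _ in range(rounds):
        u = rng.random()
        loss = 1 if rng.random() < 0.3 * u else 0
        wrapper.observe(u, loss)
    return wrapper.certified_threshold()


if __name__ == "__main__":
    print("certified threshold:", simulate(seed=0, rounds=5000))
```

By Ville's inequality, each wealth process exceeds $K/\delta$ at any time with probability at most $\delta/K$ when its null holds, so the union over the $K$ grid points is controlled at level $\delta$ simultaneously over all rounds. The fixed bet is the simplest choice; an adaptive, predictable bet would typically certify faster.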

Hamed Khosravi, Xiaoming Huo • 2026

Related benchmarks

Task	Dataset	Result
Reinforcement Learning from Verifiable Rewards	HEAD-QA	AR 90.9	30
Distribution Shift Robustness	Sixteen Adversarial Cells MedQA + GSM8K (eval)	Violations 0.00e+0	10
Expert-Iteration RLVR	MedQA, HEAD-QA, ARC-C, and CaseHOLD	Pathwise Clean Score 4	10
Mathematical Reasoning	GSM8K	AR (%) 71.7	10
Natural Language Inference	MedNLI	AR (%) 28.9	10
Question Answering	MedQA	AR (%) 39.4	9
Question Answering	CaseHOLD	AR (%) 49.8	9
Selective Risk Control	Scenario A (linear, heteroskedastic)	FCR 3.4	8
Selective Risk Control	Scenario B (nonlinear, SVM)	False Claim Rate (FCR) 1.7	8
Question Answering	PubMedQA	AR (%) 63.5	8

Showing 10 of 23 rows
