
Learning to Discover at Test Time

About

How can we use AI to discover a new state of the art for a scientific problem? Prior work in test-time scaling, such as AlphaEvolve, performs search by prompting a frozen LLM. We perform reinforcement learning at test time, so the LLM can continue to train, but now with experience specific to the test problem. This form of continual learning is quite special, because its goal is to produce one great solution rather than many good ones on average, and to solve this very problem rather than generalize to other problems. Therefore, our learning objective and search subroutine are designed to prioritize the most promising solutions. We call this method Test-Time Training to Discover (TTT-Discover). Following prior work, we focus on problems with continuous rewards. We report results for every problem we attempted, across mathematics, GPU kernel engineering, algorithm design, and biology. TTT-Discover sets the new state of the art in almost all of them: (i) Erdős' minimum overlap problem and an autocorrelation inequality; (ii) a GPUMode kernel competition (up to 2× faster than prior art); (iii) past AtCoder algorithm competitions; and (iv) a denoising problem in single-cell analysis. Our solutions are reviewed by experts or the organizers. All our results are achieved with an open model, OpenAI gpt-oss-120b, and can be reproduced with our publicly available code, in contrast to previous best results that required closed frontier models. Our test-time training runs are performed using Tinker, an API by Thinking Machines, with a cost of only a few hundred dollars per problem.
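The abstract's contrast between "one great solution" and "many good ones on average" can be illustrated with a toy weighting scheme. The sketch below is not the TTT-Discover objective, which the abstract does not specify. It only shows, under assumed names (`mean_baseline_weights`, `max_seeking_weights`, `temperature`) and made-up reward values, how a softmax over rewards concentrates the learning signal on the single best sample, whereas a mean-baseline weighting spreads it across all samples.

```python
import math

def mean_baseline_weights(rewards):
    """Average-case weighting: each sample's reward minus the batch mean."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]

def max_seeking_weights(rewards, temperature=0.1):
    """Softmax over rewards (an illustrative assumption, not the paper's objective).

    A low temperature puts almost all of the weight on the best sample.
    """
    top = max(rewards)  # subtract the max for numerical stability
    exps = [math.exp((r - top) / temperature) for r in rewards]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical continuous rewards for four sampled solutions.
rewards = [0.10, 0.20, 0.15, 0.90]
avg_w = mean_baseline_weights(rewards)
max_w = max_seeking_weights(rewards)
```

With these made-up rewards, the softmax weights place nearly all of their mass on the fourth sample, while the mean-baseline weights sum to zero and give every sample a share of the update.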

Mert Yuksekgonul, Daniel Koceja, Xinhao Li, Federico Bianchi, Jed McCaleb, Xiaolong Wang, Jan Kautz, Yejin Choi, James Zou, Carlos Guestrin, Yu Sun • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Single-cell Denoising | PBMC OpenProblems benchmark | Mean Score: 0.71 | 11 |
| Mathematics | Erdős' minimum overlap problem | Overlap Score: 38.0932 | 10 |
| Geometry | AtCoder Heuristic Contest ahc039 (official leaderboard) | Score: 5.67e+5 | 9 |
| Mathematical Optimization | Autocorrelation Inequalities | AC1: 1.5052 | 9 |
| Scheduling | AtCoder Heuristic Contest ahc058 (official leaderboard) | Score: 8.48e+8 | 8 |
| Single-cell RNA-seq denoising | Tabula (held-out) | Score: 73 | 8 |
| MLA Decode | GPUMode MLA Decode AMD MI300X Instance 3 | Decode Latency (µs): 1.67e+3 | 7 |
| MLA Decode | GPUMode MLA Decode AMD MI300X Instance 1 | Decode Latency (µs): 1.67e+3 | 7 |
| MLA Decode | GPUMode MLA Decode AMD MI300X Instance 2 | Runtime (µs): 1.71e+3 | 7 |
| GPU kernel engineering | TriMul kernel | TriMul Latency (µs): 1.16e+3 | 7 |
Showing 10 of 13 rows

Other info

GitHub