Playing Psychic: Using Thought Trees to Predict Reasoning Models Accuracy on Coding Tasks
About
Recent advances in large language models (LLMs) have shown that test-time scaling can substantially improve model performance on complex tasks, particularly in the coding domain. Under this paradigm, models use a larger token budget during inference to generate intermediate reasoning traces before producing a final answer. However, current evaluations primarily rely on competitive programming benchmarks, which may not capture the full range of reasoning abilities. In this work, we perform a systematic study of frontier reasoning models to understand their performance on real-world coding benchmarks. To gain further insight into the performance of such models, we devise a programmatic way to automatically generate coding tasks of arbitrary difficulty and structure from existing benchmarks. Using this framework, our analysis reveals that the structure of a reasoning trace, not just its contents, is a strong predictor of correctness. Motivated by this, we propose structured thought-trees as a means to represent reasoning traces. To illustrate their use, we train a lightweight classifier on features extracted from thought-trees to predict trace correctness, and demonstrate that flagging and retrying structurally anomalous traces based on the extracted features yields consistent gains at lower complexity levels.
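The flag-and-retry idea described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `ThoughtNode` class, the particular structural features, the threshold values, and the `solve_with_retry` helper are hypothetical and do not come from the paper. In particular, a fixed-threshold rule stands in for the trained lightweight classifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ThoughtNode:
    """One step of a reasoning trace (hypothetical representation)."""
    text: str
    children: List["ThoughtNode"] = field(default_factory=list)


def structural_features(root: ThoughtNode) -> Dict[str, float]:
    """Compute content-agnostic features of a thought-tree (assumed feature set)."""
    node_count = 0
    leaf_count = 0
    max_depth = 0
    stack = [(root, 1)]  # iterative traversal; the root has depth 1
    while stack:
        node, depth = stack.pop()
        node_count += 1
        max_depth = max(max_depth, depth)
        if not node.children:
            leaf_count += 1
        for child in node.children:
            stack.append((child, depth + 1))
    internal = node_count - leaf_count
    # Average number of children per internal node; 0.0 for a single-node tree.
    mean_branching = (node_count - 1) / internal if internal else 0.0
    return {
        "nodes": float(node_count),
        "leaves": float(leaf_count),
        "depth": float(max_depth),
        "mean_branching": mean_branching,
    }


def is_anomalous(features: Dict[str, float],
                 max_nodes: float = 50.0,
                 max_branching: float = 3.0) -> bool:
    """Stand-in for a trained classifier: a fixed-threshold rule (assumed values)."""
    return features["nodes"] > max_nodes or features["mean_branching"] > max_branching


def solve_with_retry(generate_trace: Callable[[], ThoughtNode],
                     max_attempts: int = 3) -> ThoughtNode:
    """Regenerate a trace while it is flagged, up to max_attempts generations."""
    trace = generate_trace()
    for _ in range(max_attempts - 1):
        if not is_anomalous(structural_features(trace)):
            break
        trace = generate_trace()
    return trace
```

In this sketch the decision to retry depends only on the shape of the tree, never on the text stored in its nodes, which mirrors the claim that structure alone carries predictive signal. A learned model over the same feature dictionary could replace `is_anomalous` without changing the retry loop.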
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Block Infilling | SAFIM | -- | 6 |
| Code Translation | CodeLingua Java → C | -- | 6 |
| Code Translation | CodeLingua Python → C | -- | 6 |
| Code Translation | CodeLingua Java → Fortran | -- | 6 |
| Output Prediction | CRUXEval | -- | 6 |
| Reasoning failure prediction and recovery | CRUXEval L2 | Accuracy: 77 | 4 |
| Reasoning failure prediction | CodeLingua (L1) | Accuracy: 73 | 2 |
| Reasoning failure prediction | CodeLingua (L2) | Accuracy: 75 | 2 |
| Reasoning failure prediction | CodeLingua (L3) | Accuracy: 76 | 2 |
| Reasoning failure prediction and recovery | CRUXEval L1 | Accuracy: 89 | 2 |