Deep symbolic regression: Recovering mathematical expressions from data via risk-seeking policy gradients
About
Discovering the underlying mathematical expressions describing a dataset is a core challenge for artificial intelligence. This is the problem of $\textit{symbolic regression}$. Despite recent advances in training neural networks to solve complex tasks, deep learning approaches to symbolic regression are underexplored. We propose a framework that leverages deep learning for symbolic regression via a simple idea: use a large model to search the space of small models. Specifically, we use a recurrent neural network to emit a distribution over tractable mathematical expressions and employ a novel risk-seeking policy gradient to train the network to generate better-fitting expressions. Our algorithm outperforms several baseline methods (including Eureqa, the gold standard for symbolic regression) in its ability to exactly recover symbolic expressions on a series of benchmark problems, both with and without added noise. More broadly, our contributions include a framework that can be applied to optimize hierarchical, variable-length objects under a black-box performance metric, with the ability to incorporate constraints in situ, and a risk-seeking policy gradient formulation that optimizes for best-case performance instead of expected performance.
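The risk-seeking objective described above can be illustrated with a small sketch. The abstract does not give the estimator, so the form below is an assumption: rewards under the empirical (1 - epsilon) quantile of a sampled batch are ignored, and the remaining samples are weighted by their margin above that quantile. The names `empirical_quantile`, `risk_seeking_weights`, `risk_seeking_loss`, and `epsilon` are hypothetical, and plain floats stand in for the log-probabilities that a recurrent network would assign to each sampled expression.

```python
import math


def empirical_quantile(values, q):
    """Linearly interpolated q-quantile of a list of numbers (0 <= q <= 1)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


def risk_seeking_weights(rewards, epsilon):
    """Per-sample weights: zero below the (1 - epsilon) reward quantile,
    otherwise the margin above it (an assumed form, see lead-in)."""
    threshold = empirical_quantile(rewards, 1.0 - epsilon)
    weights = [r - threshold if r >= threshold else 0.0 for r in rewards]
    return weights, threshold


def risk_seeking_loss(rewards, log_probs, epsilon):
    """Surrogate loss whose gradient with respect to the log-probabilities
    up-weights only the best-performing fraction of the batch."""
    weights, _ = risk_seeking_weights(rewards, epsilon)
    total = sum(w * lp for w, lp in zip(weights, log_probs))
    return -total / (epsilon * len(rewards))


# Toy batch: five sampled expressions with their fit rewards and log-probs.
rewards = [0.1, 0.2, 0.3, 0.4, 1.0]
log_probs = [-1.0, -2.0, -3.0, -4.0, -0.5]
weights, threshold = risk_seeking_weights(rewards, epsilon=0.2)
loss = risk_seeking_loss(rewards, log_probs, epsilon=0.2)
```

In this toy batch only the best-fitting sample receives a nonzero weight, which is the sense in which such an objective targets best-case rather than expected performance. A standard policy gradient would instead weight every sample by its reward relative to a baseline.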
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Symbolic Regression | 3D Advection Equation (test) | MSE | 0.139 | 60 |
| Symbolic Regression | E. coli growth LLM-SR Suite | NMSE | 0.182 | 44 |
| 1D Physics Modeling | 1D Burgers' equation (test) | MSE | 0.0059 | 38 |
| 1D Advection Equation Modeling | 1D Advection Equation | MSE | 0.159 | 38 |
| Modeling 1D Advection-Diffusion Equation | 1D Advection-Diffusion Equation S-I (test) | MSE | 0.0191 | 38 |
| Symbolic Regression | 2D Advection Equation (test) | MSE | 0.26 | 38 |
| Symbolic Regression | Oscillation 1 LLM-SR Suite | NMSE | 0.0104 | 30 |
| Symbolic Regression | SRBench black-box (test) | R^2 | 0.5625 | 28 |
| Symbolic Regression | LSR-Synth | Overall Acc (Tol 0.01) | 0.00e+0 | 22 |
| Symbolic Regression | Strogatz Dataset epsilon=0.01 (test) | R2 Score | 0.8199 | 20 |