SLR: Automated Synthesis for Scalable Logical Reasoning
About
We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
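The idea of a validation program that turns model outputs into verifiable rewards can be illustrated with a minimal Python sketch. This is not SLR's implementation: the abstract describes synthesized rules for inductive reasoning tasks, whereas here a candidate rule is assumed to be a plain Python predicate, and the names `partial_score`, `verifiable_reward`, `labeled`, `latent_rule`, and `candidate`, as well as the toy attributes `size` and `weight`, are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: an example is a dict of attributes,
# a rule is any predicate over such an example.
Example = Dict[str, int]
Rule = Callable[[Example], bool]
Labeled = List[Tuple[Example, bool]]


def partial_score(rule: Rule, labeled: Labeled) -> float:
    """Fraction of labeled examples that the candidate rule classifies correctly."""
    correct = sum(1 for example, label in labeled if rule(example) == label)
    return correct / len(labeled)


def verifiable_reward(rule: Rule, labeled: Labeled) -> float:
    """Binary reward: 1.0 only if the rule agrees with every label."""
    return 1.0 if partial_score(rule, labeled) == 1.0 else 0.0


# Toy task (assumed for illustration): labels come from a hidden ground-truth rule.
latent_rule: Rule = lambda ex: ex["size"] >= 3
labeled: Labeled = [
    ({"size": 3, "weight": 10}, True),
    ({"size": 1, "weight": 12}, False),
    ({"size": 5, "weight": 2}, True),
    ({"size": 2, "weight": 2}, False),
]

# A well-formed but logically wrong candidate, as a model might produce.
candidate: Rule = lambda ex: ex["weight"] >= 10

print(partial_score(candidate, labeled))        # 0.5
print(verifiable_reward(candidate, labeled))    # 0.0
print(verifiable_reward(latent_rule, labeled))  # 1.0
```

The candidate is a valid rule yet earns no reward, which mirrors the abstract's observation that syntactic validity does not imply correct logical inference.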
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Logical reasoning | SLR-BENCH Extended Leaderboard | -- | -- | 54 |
| Logical reasoning | SLR-BENCH (test) | LRL | 11.3 | 27 |
| Logical reasoning | SLR-BENCH | -- | -- | 14 |
| Inductive Prolog rule synthesis | SLR-Bench Basic tier 250 tasks 1 | Accuracy | 100 | 13 |
| Inductive Prolog rule synthesis | SLR-Bench Medium tier 250 tasks 1 | Accuracy | 74 | 13 |
| Inductive Prolog rule synthesis | SLR-Bench Overall 1,000 tasks (full) | Accuracy (%) | 77.8 | 13 |
| Inductive Prolog rule synthesis | SLR-Bench Hard tier 250 tasks 1 | Accuracy | 46 | 13 |
| Inductive Prolog rule synthesis | SLR-Bench Easy tier 1 (250 tasks) | Accuracy | 93 | 13 |
| Logical reasoning | CLUTRR rob train clean 23 all (test) | Accuracy | 35.6 | 3 |
| Logical reasoning | CLUTRR rob_train_sup_23_all (test) | Accuracy | 45.2 | 3 |