SLR: Automated Synthesis for Scalable Logical Reasoning

About

We introduce SLR, an end-to-end framework for systematic evaluation and training of Large Language Models (LLMs) via Scalable Logical Reasoning. Given a user's task specification, SLR automatically synthesizes (i) an instruction prompt for an inductive reasoning task, (ii) a validation program, executable on model outputs to provide verifiable rewards, and (iii) the latent ground-truth rule. This process is fully automated, scalable, requires no human annotations, and offers precise control over task difficulty. Using SLR, we create SLR-Bench, a benchmark comprising 19k prompts organized into 20 curriculum levels that progressively increase in relational, arithmetic, and recursive complexity. Large-scale evaluation reveals that contemporary LLMs readily produce syntactically valid rules, yet often fail at correct logical inference. Recent reasoning LLMs demonstrate improved performance but incur very high test-time computation, with costs exceeding $300 for just 1,000 prompts. Finally, curriculum learning via SLR doubles Llama-3-8B accuracy on SLR-Bench, achieving parity with Gemini-Flash-Thinking at a fraction of computational cost. Moreover, these reasoning capabilities generalize to a wide range of established benchmarks, underscoring the effectiveness of SLR for downstream reasoning.
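To make the idea of a "validation program, executable on model outputs to provide verifiable rewards" more concrete, here is a minimal Python sketch. It is not the SLR implementation. The benchmark table below refers to Prolog rule synthesis, whereas this sketch swaps in plain Python predicates as candidate rules. The names `verifiable_reward` and `partial_score`, the `car_colors` attribute, and the toy examples are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical representation: an example is a dict of attributes plus a boolean label.
Example = Tuple[Dict[str, list], bool]
# Hypothetical representation: a candidate rule is a predicate over an example's attributes.
Rule = Callable[[Dict[str, list]], bool]


def verifiable_reward(rule: Rule, examples: List[Example]) -> float:
    """Return 1.0 only if the rule classifies every labeled example correctly."""
    return 1.0 if all(rule(x) == label for x, label in examples) else 0.0


def partial_score(rule: Rule, examples: List[Example]) -> float:
    """Return the fraction of labeled examples the rule classifies correctly."""
    if not examples:
        return 0.0
    correct = sum(rule(x) == label for x, label in examples)
    return correct / len(examples)


# Toy task (assumed, not taken from SLR-Bench): the latent rule is "contains a red car".
examples: List[Example] = [
    ({"car_colors": ["red", "blue"]}, True),
    ({"car_colors": ["green"]}, False),
    ({"car_colors": ["red"]}, True),
    ({"car_colors": ["blue", "yellow"]}, False),
]


def ground_truth(x: Dict[str, list]) -> bool:
    """The latent rule that generated the labels."""
    return "red" in x["car_colors"]


def wrong_guess(x: Dict[str, list]) -> bool:
    """A well-formed rule that is nevertheless logically wrong for this task."""
    return len(x["car_colors"]) > 1


print(verifiable_reward(ground_truth, examples))  # 1.0
print(verifiable_reward(wrong_guess, examples))   # 0.0
print(partial_score(wrong_guess, examples))       # 0.5
```

The second candidate mirrors the failure mode described above: a rule can be well-formed and still classify the examples incorrectly, which a check of this kind detects without human annotation.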

Lukas Helff, Ahmad Omar, Felix Friedrich, Antonia Wüst, Hikaru Shindo, Rupert Mitchell, Tim Woydt, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting • 2025

Related benchmarks

Task	Dataset	Result
Logical reasoning	SLR-BENCH Extended Leaderboard	--	54
Logical reasoning	SLR-BENCH (test)	LRL 11.3	27
Logical reasoning	SLR-BENCH	--	14
inductive Prolog rule synthesis	SLR-Bench Basic tier 250 tasks 1	Accuracy 100	13
inductive Prolog rule synthesis	SLR-Bench Medium tier 250 tasks 1	Accuracy 74	13
inductive Prolog rule synthesis	SLR-Bench Overall 1,000 tasks (full)	Accuracy (%) 77.8	13
inductive Prolog rule synthesis	SLR-Bench Hard tier 250 tasks 1	Accuracy 46	13
inductive Prolog rule synthesis	SLR-Bench Easy tier 1 (250 tasks)	Accuracy 93	13
Logical reasoning	CLUTRR rob train clean 23 all (test)	Accuracy 35.6	3
Logical reasoning	CLUTRR rob_train_sup_23_all (test)	Accuracy 45.2	3

Showing 10 of 14 rows
