LLMs Are Already Good Tutors: Training-Free Prompt Optimization for Pedagogical Math Tutoring
About
Aligning LLMs for math tutoring typically requires RL-based training with multi-GPU infrastructure. We investigate whether training-free prompt optimization, which evolves only the system prompt via API calls, can serve as a practical alternative. We adapt 7 published methods and propose 5 education-specialized methods, evaluating these 12 methods under 5 conditions on 2 OOD benchmark suites. All 12 best-per-method configurations surpass the strongest RL-trained baseline (R_total = 0.633), and our ParetoGrad achieves the best Pareto balance across post-test solve rate, leak control, and helpfulness, rather than dominating any single component. Behavioral analysis with an 82-code educational codebook reveals that training-free methods rely on teaching-knowledge patterns at 2-3x the rate of RL-trained models, with a compensating reduction of roughly 10 percentage points in intent-level scaffolding. We also find a task-dependent reasoning mode effect that is consistent across the training-free and RL-based paradigms. Our approach enables efficient development of pedagogically aligned LLM tutors with prompts alone and minimal compute.
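To make the phrase "Pareto balance" concrete, the following is a minimal sketch of Pareto-front selection over the three objectives named above. It is not the ParetoGrad procedure itself, which the abstract does not specify; the function names, candidate names, and scores are all hypothetical, and each objective is assumed to be scaled so that higher is better.

```python
from typing import Dict, List, Tuple

# Hypothetical scores per candidate system prompt, oriented so higher is better:
# (post-test solve rate, leak control, helpfulness). Illustrative values only,
# not results from the paper.
Scores = Tuple[float, float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other, scores)
            for other_name, other in candidates.items()
            if other_name != name
        )
    ]


candidates = {
    "prompt_a": (0.60, 0.90, 0.70),
    "prompt_b": (0.55, 0.85, 0.65),  # worse than prompt_a on all three objectives
    "prompt_c": (0.70, 0.60, 0.75),  # trades leak control for solve rate
}
front = pareto_front(candidates)
print(front)
```

In this toy example, `prompt_a` and `prompt_c` both remain on the front because neither is better on every objective, which is the sense in which a method can offer the best balance without leading on any single component.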
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Math Tutoring | BigMath In-Domain | R_sol | 56.3 | 21 |
| Math Tutoring | OpenLearnLM OOD | CK | 6.97 | 21 |
| Math Tutoring | MTBench MathTutorBench OOD | Score (Sc) | 7.89 | 13 |