SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs
About
Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, in which students use answer-inducing prompts to elicit solutions rather than scaffolded instruction. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method significantly improves safety under two pedagogical jailbreak settings while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE
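The gating idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prerequisite graph `PREREQS`, the `Student` mastery scores, the `threshold` of 0.7, and the function names are all hypothetical, and the step that infers the target concept from a student query is assumed to have already happened.

```python
from dataclasses import dataclass, field

# Hypothetical prerequisite graph: concept -> concepts it directly depends on.
PREREQS = {
    "quadratic_formula": ["factoring", "square_roots"],
    "factoring": ["multiplication"],
    "square_roots": ["multiplication"],
    "multiplication": [],
}


@dataclass
class Student:
    # Hypothetical mastery state: concept -> score in [0, 1].
    mastery: dict = field(default_factory=dict)


def prerequisites(concept, graph):
    """Collect all transitive prerequisites of `concept` (excluding itself)."""
    seen = set()
    stack = list(graph.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def mastery_gaps(concept, student, graph, threshold=0.7):
    """Return prerequisites whose mastery score falls below `threshold`."""
    return sorted(
        c for c in prerequisites(concept, graph)
        if student.mastery.get(c, 0.0) < threshold
    )


def route(concept, student, graph, threshold=0.7):
    """Explicit gate: instruct if any gap exists, otherwise allow problem-solving."""
    gaps = mastery_gaps(concept, student, graph, threshold)
    if gaps:
        return "instruct", gaps
    return "solve", []


if __name__ == "__main__":
    learner = Student({"multiplication": 0.9, "factoring": 0.4, "square_roots": 0.8})
    print(route("quadratic_formula", learner, PREREQS))
```

In this sketch the gate depends only on the student's mastery state and the graph, not on how the request is phrased, so an answer-inducing prompt alone cannot switch the route from instructing to problem-solving. Whether the paper's gate has the same property is not stated in the abstract.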
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Pedagogical Tutoring | SHAPE | Safety Score | 100 | 42 |
| Pedagogical Dialogue Evaluation | SHAPE (test) | Safety Score | 93.66 | 33 |
| Adversarial safety and pedagogical evaluation | SHAPE | Delta Safe | -16.86 | 14 |
| Jailbreak Defense | Adversarial Jailbreak Attacks: Cipher, Instructional Constraint, Prefix Injection, Psychological Coercion (Alternative) | Safety Score (Cipher) | 100 | 5 |
| Jailbreak Safety Evaluation | SHAPE | Cipher Success Rate | 100 | 5 |