SHAPE: Unifying Safety, Helpfulness and Pedagogy for Educational LLMs

About

Large Language Models (LLMs) have been widely explored in educational scenarios. We identify a critical vulnerability in current educational LLMs, pedagogical jailbreaks, where students use answer-inducing prompts to elicit solutions rather than scaffolded instructions. To enable systematic study, we unify and formalize safe, helpful, and pedagogical behaviors with a knowledge-mastery graph and introduce SHAPE, a benchmark of 9,087 student-question pairs for evaluating tutoring behavior under adversarial pressure. We propose a graph-augmented tutoring pipeline that infers prerequisite concepts from queries, identifies mastery gaps, and routes generation between instructing and problem-solving via explicit gating. Experiments across multiple LLMs show that our method yields significantly improved safety under two pedagogical jailbreak settings, while maintaining near-ceiling helpfulness under the same evaluation protocol. Our code and data are available at https://github.com/MAPS-research/SHaPE
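
The gating idea described in the abstract (infer prerequisite concepts, find mastery gaps, then route between instructing and problem-solving) can be illustrated with a minimal sketch. This is not the authors' implementation: the prerequisite graph, the keyword-based concept lookup, the 0.7 mastery threshold, and every function name below are hypothetical stand-ins chosen for illustration. In the paper's pipeline the inference steps would presumably be model-driven.

```python
# Hypothetical prerequisite graph: concept -> direct prerequisites.
PREREQS = {
    "quadratic_equations": ["factoring", "linear_equations"],
    "factoring": ["multiplication"],
    "linear_equations": ["arithmetic"],
    "multiplication": ["arithmetic"],
    "arithmetic": [],
}

# Hypothetical keyword lookup standing in for model-based concept inference.
KEYWORDS = {
    "quadratic": "quadratic_equations",
    "factor": "factoring",
    "linear": "linear_equations",
}


def infer_concepts(query):
    """Return the set of concepts whose keyword appears in the query."""
    q = query.lower()
    return {concept for word, concept in KEYWORDS.items() if word in q}


def prerequisite_closure(concepts):
    """Return the given concepts plus all transitive prerequisites."""
    seen = set()
    stack = list(concepts)
    while stack:
        concept = stack.pop()
        if concept not in seen:
            seen.add(concept)
            stack.extend(PREREQS.get(concept, []))
    return seen


def mastery_gaps(concepts, mastery, threshold=0.7):
    """List concepts (sorted) whose mastery falls below the threshold."""
    needed = prerequisite_closure(concepts)
    return sorted(c for c in needed if mastery.get(c, 0.0) < threshold)


def route(query, mastery, threshold=0.7):
    """Explicit gate: 'instruct' if any gap exists, otherwise 'solve'."""
    concepts = infer_concepts(query)
    if not concepts:
        # Assumption: fall back to instructing when nothing is recognized.
        return ("instruct", [])
    gaps = mastery_gaps(concepts, mastery, threshold)
    return ("instruct", gaps) if gaps else ("solve", [])


student = {
    "arithmetic": 0.9,
    "multiplication": 0.9,
    "linear_equations": 0.8,
    "factoring": 0.4,
    "quadratic_equations": 0.2,
}
mode, gaps = route("Just give me the answer to this quadratic", student)
print(mode, gaps)
```

Under these assumptions, an answer-inducing prompt does not change the route: the gate depends only on the mastery state, so a student with gaps is sent to the instructing branch regardless of how the request is phrased.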

Sihang Zhao, Kangrui Yu, Youliang Yuan, Pinjia He, Hongyi Wen • 2026

Related benchmarks

Task | Dataset | Result | Rank
Pedagogical Tutoring | SHAPE | Safety Score: 100 | 42
Pedagogical Dialogue Evaluation | SHAPE (test) | Safety Score: 93.66 | 33
Adversarial safety and pedagogical evaluation | SHAPE | Delta Safe: -16.86 | 14
Jailbreak Defense | Adversarial Jailbreak Attacks Cipher, Instructional Constraint, Prefix Injection, Psychological Coercion (Alternative) | Safety Score (Cipher): 100 | 5
Jailbreak Safety Evaluation | SHAPE | Cipher Success Rate: 100 | 5
