Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages

About

System prompts provide a lightweight yet powerful mechanism for conditioning large language models (LLMs) at inference time. While prior work has focused on English-only settings, real-world deployments benefit from a single prompt that operates reliably across languages. This paper presents a comprehensive study of how different system prompts steer models toward accurate and robust cross-lingual behavior. We propose a unified four-dimensional evaluation framework to assess system prompts in multilingual environments. Through large-scale experiments on five languages, three LLMs, and three benchmarks, we find that certain prompt components, such as chain-of-thought (CoT), emotion, and scenario, correlate with robust multilingual behavior. We develop a prompt optimization framework for multilingual settings and show that it can automatically discover prompts that improve all metrics by 5-10%. Finally, we analyze over 10 million reasoning units and find that more performant system prompts induce more structured and consistent reasoning patterns while reducing unnecessary language-switching. Together, these results highlight system prompt optimization as a scalable path to accurate and robust multilingual LLM behavior.

Lechen Zhang, Yusheng Zhou, Tolga Ergen, Lajanugen Logeswaran, Moontae Lee, David Jurgens • 2025

Related benchmarks

Task	Dataset	Metric	Result
Mathematical Reasoning	MATH 500	Mean Accuracy	79.7	6
Moral Reasoning	UNIMORAL	Acc (mean)	67.9	6
Multi-subject Reasoning	MMLU-Pro	Acc (Mean)	65.6	6

