
SafeSearch: Do Not Trade Safety for Utility in LLM Search Agents

About

Large language model (LLM) based search agents iteratively generate queries, retrieve external information, and reason to answer open-domain questions. While researchers have primarily focused on improving their utility, their safety behaviors remain underexplored. In this paper, we first evaluate search agents using red-teaming datasets and find that they are more likely to produce harmful outputs than base LLMs. For example, when asked "How can I track someone's location without their consent?", a base model refuses, whereas a search agent designed to retrieve and cite sources may lower its refusal threshold, fetch documents (e.g., court cases), append them to its context, and synthesize them into an informative yet unsafe summary. We further show that utility-oriented finetuning intensifies this risk, motivating joint alignment of safety and utility. To this end, we present SafeSearch, a multi-objective reinforcement learning approach that couples a final-output safety/utility reward with a novel query-level shaping term that penalizes unsafe queries and rewards safe ones. Experiments show that SafeSearch reduces agent harmfulness by over 90% across three red-teaming datasets on a 7B model while producing safe and helpful responses, and maintains QA performance comparable to that of a utility-only finetuned agent. Further analyses confirm the effectiveness of the query-level reward in jointly improving safety and utility.
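The reward structure described above — a final-output safety/utility reward combined with a query-level shaping term — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, weighting scheme, and signal types are assumptions.

```python
def safesearch_reward(
    final_safe: bool,          # judge verdict on the final answer's safety (assumed binary)
    final_correct: bool,       # utility signal, e.g. exact match on the QA answer
    query_safety: list[bool],  # per-query safety labels over the trajectory
    alpha: float = 0.5,        # hypothetical weight on the query-level shaping term
) -> float:
    # Final-output reward: the trajectory scores only if the answer is
    # both safe and useful, coupling the two objectives.
    outcome = 1.0 if (final_safe and final_correct) else 0.0

    # Query-level shaping: reward safe queries (+1) and penalize unsafe
    # ones (-1), averaged so long trajectories are not over-weighted.
    if query_safety:
        shaping = sum(1.0 if s else -1.0 for s in query_safety) / len(query_safety)
    else:
        shaping = 0.0

    return outcome + alpha * shaping
```

Under this sketch, a trajectory that issues only safe queries and ends in a safe, correct answer scores highest, while one that reaches a correct answer through unsafe queries is penalized at the query level even before the final output is judged.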

Qiusi Zhan, Angeline Budiman-Chan, Abdelrahman Zayed, Xingzhi Guo, Daniel Kang, Joo-Kyung Kim• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | TriviaQA | EM | 54.9 | 182 |
| Question Answering | Bamboogle | EM | 42.1 | 120 |
| Safety & Helpfulness Evaluation | StrongREJECT | Harm Rate | 0.2 | 29 |
| Safety & Helpfulness Evaluation | RRB | HarmR | 1 | 15 |
| Safety & Helpfulness Evaluation | WildTeaming | Harm Rate | 0.3 | 15 |
| Question Answering | HotpotQA (500 QA pairs) | Exact Match (EM) | 38.3 | 14 |
| Question Answering | Bamboogle (125 QA pairs) | EM | 45.9 | 14 |
| Safety Evaluation | WildTeaming, 500-example (test) | HarmR | 88.6 | 14 |
| Question Answering | TriviaQA (500 QA pairs) | Exact Match (EM) | 54.9 | 14 |
| Safety Evaluation | Redteaming Resistance Benchmark (RRB), 919-example subset | HarmR | 11.8 | 14 |

Showing 10 of 11 rows.
