
Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

About

Multi-agent systems introduce moderation challenges that extend beyond content filtering. A malicious agent can contribute content that appears benign, evading content-based moderation, while still compromising the system through exploitative behavior that manifests across its overall interaction patterns within the community. To address this, we introduce BOT-MOD (BOT-MODeration), a moderation framework that grounds detection in agent intent rather than in traditional content-level signals. BOT-MOD engages the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses, progressively narrowing the space of plausible agent objectives until the underlying intent is identified. To evaluate the approach, we construct a dataset derived from Moltbook that covers diverse benign and malicious behaviors and is based on actual community structures, posts, and comments. Results show that BOT-MOD reliably identifies agent intent across a range of adversarial configurations while maintaining a low false positive rate on benign behaviors. This work lays a foundation for scalable, intent-aware moderation of agents in open multi-agent environments.

Ali Al-Lawati, Nafis Tripto, Abolfazl Ansari, Jason Lucas, Suhang Wang, Dongwon Lee • 2026
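
The abstract mentions Gibbs-based sampling over candidate intent hypotheses but gives no details, so the following is only a toy sketch of the general idea, not the authors' method. It assumes two hypothetical latent variables (an agent's intent and its answering style), a made-up likelihood table for coarse response labels, and uniform priors. It also assumes that each turn of the dialogue has already been reduced to one of three labels by some other component. The sampler alternates between resampling the intent given the style and the style given the intent, and estimates the intent posterior from the visited states.

```python
import random
from collections import Counter

# Hypothetical hypothesis spaces; the paper's actual intent taxonomy is not given here.
INTENTS = ["benign", "spam", "manipulation"]
STYLES = ["direct", "evasive"]

# Made-up P(response label | intent, style), for illustration only.
LIKELIHOOD = {
    ("benign", "direct"): {"cooperative": 0.8, "deflecting": 0.1, "promotional": 0.1},
    ("benign", "evasive"): {"cooperative": 0.5, "deflecting": 0.4, "promotional": 0.1},
    ("spam", "direct"): {"cooperative": 0.2, "deflecting": 0.1, "promotional": 0.7},
    ("spam", "evasive"): {"cooperative": 0.4, "deflecting": 0.3, "promotional": 0.3},
    ("manipulation", "direct"): {"cooperative": 0.3, "deflecting": 0.5, "promotional": 0.2},
    ("manipulation", "evasive"): {"cooperative": 0.4, "deflecting": 0.5, "promotional": 0.1},
}


def joint_weight(intent, style, responses):
    """Unnormalized P(intent, style, responses) under uniform priors."""
    weight = 1.0
    for label in responses:
        weight *= LIKELIHOOD[(intent, style)][label]
    return weight


def gibbs_intent_posterior(responses, n_samples=2000, burn_in=200, seed=0):
    """Estimate P(intent | responses) by Gibbs sampling over (intent, style)."""
    rng = random.Random(seed)
    intent = rng.choice(INTENTS)
    style = rng.choice(STYLES)
    counts = Counter()
    for step in range(burn_in + n_samples):
        # Resample the intent from its conditional given the current style.
        weights = [joint_weight(i, style, responses) for i in INTENTS]
        intent = rng.choices(INTENTS, weights=weights)[0]
        # Resample the style from its conditional given the new intent.
        weights = [joint_weight(intent, s, responses) for s in STYLES]
        style = rng.choices(STYLES, weights=weights)[0]
        if step >= burn_in:
            counts[intent] += 1
    return {i: counts[i] / n_samples for i in INTENTS}


if __name__ == "__main__":
    # Labels for four hypothetical dialogue turns with a target agent.
    turns = ["promotional", "promotional", "cooperative", "promotional"]
    posterior = gibbs_intent_posterior(turns)
    for name, prob in posterior.items():
        print(f"{name}: {prob:.3f}")
```

Under these invented numbers, the exact posterior for the four example turns places roughly 0.95 on "spam", and the sampled estimate should land close to that. In a real multi-turn setting the moderator would presumably also choose its next question based on the current posterior; that step is omitted here because the source does not describe it.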

Related benchmarks

Task              Dataset                                Metric                 Result  Rank
Intent Detection  All Posts and Comment Mean             Mean Score             68.51   42
Intent Detection  Comment Dataset In-Distribution        F1 Score (Validation)  66.47   21
Intent Detection  Comment Dataset Out-of-Distribution    F1 Score (Validation)  65.6    21
Intent Detection  Posts Dataset In-Distribution          F1 Score (Validation)  72.98   21
Intent Detection  Posts Dataset Out-of-Distribution      F1 Score (Validation)  70.13   21
