Moltbook Moderation: Uncovering Hidden Intent Through Multi-Turn Dialogue

About

The emergence of multi-agent systems introduces novel moderation challenges that extend beyond content filtering. Agents with malicious intent may contribute harmful content that appears benign in order to evade content-based moderation, while compromising the system through exploitative behavior that only becomes visible across their overall interaction patterns within the community. To address this, we introduce BOT-MOD (BOT-MODeration), a moderation framework that grounds detection in agent intent rather than traditional content-level signals. BOT-MOD identifies the underlying intent by engaging the target agent in a multi-turn exchange guided by Gibbs-based sampling over candidate intent hypotheses. This progressively narrows the space of plausible agent objectives until the underlying behavior is identified. To evaluate our approach, we construct a dataset derived from Moltbook that encompasses diverse benign and malicious behaviors based on actual community structures, posts, and comments. Results demonstrate that BOT-MOD reliably identifies agent intent across a range of adversarial configurations while maintaining a low false positive rate on benign behaviors. This work advances the foundation for scalable, intent-aware moderation of agents in open multi-agent environments.
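
As an illustration of the sampling idea in the abstract, the following is a minimal Python sketch of a Gibbs sampler over candidate intent hypotheses. It is not the BOT-MOD implementation: the intent labels, the reply categories, the probability tables, the per-reply "masked" variable, and the function name gibbs_intent_posterior are all assumptions made for this example.

```python
import random
from collections import Counter

# Hypothetical intents and reply categories; none of these come from the paper.
INTENTS = ["benign", "spam", "manipulation"]

# Assumed P(reply | intent) when a reply reveals the agent's intent.
REVEAL = {
    "benign":       {"helpful": 0.8, "promo": 0.1, "pressure": 0.1},
    "spam":         {"helpful": 0.2, "promo": 0.7, "pressure": 0.1},
    "manipulation": {"helpful": 0.2, "promo": 0.1, "pressure": 0.7},
}
# Assumed P(reply) when the agent masks its intent and imitates a benign user.
MASKED = {"helpful": 0.8, "promo": 0.1, "pressure": 0.1}
P_MASK = 0.5  # assumed prior probability that any single reply is masked


def sample_from(weights, rng):
    """Draw one key from a dict of unnormalized, non-negative weights."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]


def gibbs_intent_posterior(replies, sweeps=2000, burn_in=200, seed=0):
    """Estimate P(intent | replies) by alternating two conditional draws."""
    rng = random.Random(seed)
    intent = rng.choice(INTENTS)
    masked = [rng.random() < P_MASK for _ in replies]
    counts = Counter()
    for sweep in range(sweeps):
        # Step 1: resample the intent given the mask flags.
        # Masked replies do not depend on the intent, so only unmasked
        # replies contribute evidence (the prior over intents is uniform).
        weights = {}
        for h in INTENTS:
            p = 1.0
            for reply, is_masked in zip(replies, masked):
                if not is_masked:
                    p *= REVEAL[h][reply]
            weights[h] = p
        intent = sample_from(weights, rng)
        # Step 2: resample each mask flag given the current intent.
        for t, reply in enumerate(replies):
            p_masked = P_MASK * MASKED[reply]
            p_unmasked = (1.0 - P_MASK) * REVEAL[intent][reply]
            masked[t] = rng.random() < p_masked / (p_masked + p_unmasked)
        if sweep >= burn_in:
            counts[intent] += 1
    total = sum(counts.values())
    return {h: counts[h] / total for h in INTENTS}


if __name__ == "__main__":
    dialogue = ["helpful", "pressure", "pressure", "helpful", "pressure"]
    print(gibbs_intent_posterior(dialogue))
```

Calling the function again after each new reply, with all replies observed so far, gives a posterior that tends to concentrate on fewer hypotheses as evidence accumulates. That is one possible reading of how a multi-turn exchange could narrow the space of plausible agent objectives; the paper's actual probing and update rules may differ.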

Ali Al-Lawati, Nafis Tripto, Abolfazl Ansari, Jason Lucas, Suhang Wang, Dongwon Lee • 2026

Related benchmarks

Task	Dataset	Metric	Result
Intent Detection	All Posts and Comment Mean	Mean Score	68.51	42
Intent Detection	Comment Dataset In-Distribution	F1 Score (Validation)	66.47	21
Intent Detection	Comment Dataset Out-of-Distribution	F1 Score (Validation)	65.6	21
Intent Detection	Posts Dataset In-Distribution	F1 Score (Validation)	72.98	21
Intent Detection	Posts Dataset Out-of-Distribution	F1 Score (Validation)	70.13	21
