
MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

About

Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

Lirong Che, Shuo Wen, Shan Huang, Chuang Wang, Yuzhe Yang, Gregory Dudek, Xueqian Wang, Jian Su • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Floorplan Layout Generation | T2D | Micro-IoU: 81.67 | 10 |
| Object Placement | Bedroom 4x4 m, rect. | Object Count: 22.6 | 3 |
| Object Placement | Restaurant polygon | # Objects: 78.1 | 3 |
| Object Placement | Library polygon | Object Count: 88.6 | 3 |
| Object Placement | Classroom 8x8 m, rect. | Number of Objects: 57.3 | 3 |
| Object retrieval and navigation | MansionWorld Single floor | -- | 3 |
| Object retrieval and navigation | MansionWorld Double floors | -- | 3 |
| Object retrieval and navigation | MansionWorld Four floors | -- | 3 |
