Closing the Loop: Unified 3D Scene Generation and Immersive Interaction via LLM-RL Coupling

About

Recent advances in large language models (LLMs) have significantly improved language-driven 3D content generation, but most existing approaches still treat scene generation and user interaction as separate processes, limiting the adaptability and immersive potential of interactive multimedia systems. This paper presents a unified framework that closes the loop between language-driven 3D scene generation and immersive user interaction. Given natural language instructions, the system first constructs structured scene representations using LLMs, and then optimizes spatial layouts via reinforcement learning under geometric and semantic constraints. The generated environments are deployed in a virtual reality setting to facilitate HRI-in-the-loop, where user interactions provide continuous feedback to align generated content with human perception and usability. By tightly coupling generation and interaction, the proposed framework enables more responsive, adaptive, and realistic multimedia experiences. Experiments on the ALFRED benchmark demonstrate state-of-the-art performance in task-based scene generation. Furthermore, qualitative results and user studies show consistent improvements in immersion, interaction quality, and task efficiency, highlighting the importance of closed-loop integration of generation and interaction for next-generation multimedia systems. Our project page can be found at https://proj-showcase.github.io/h3ds/.
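The abstract says that layouts are optimized with reinforcement learning under geometric and semantic constraints, but it does not give the reward or the RL algorithm. As an illustration only, the following is a minimal sketch of how such constraints might be folded into a scalar reward. It assumes 2D axis-aligned object footprints, a rectangular room, and a list of object pairs that should sit close together. Every name here (`Box`, `overlap_area`, `out_of_bounds`, `layout_reward`, the weights) is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass(frozen=True)
class Box:
    """Axis-aligned 2D footprint of an object on the floor plan (hypothetical)."""
    name: str
    x: float  # centre x
    y: float  # centre y
    w: float  # extent along x
    d: float  # extent along y


def overlap_area(a: Box, b: Box) -> float:
    """Geometric constraint: area shared by two footprints (0 if disjoint)."""
    dx = min(a.x + a.w / 2, b.x + b.w / 2) - max(a.x - a.w / 2, b.x - b.w / 2)
    dy = min(a.y + a.d / 2, b.y + b.d / 2) - max(a.y - a.d / 2, b.y - b.d / 2)
    return max(dx, 0.0) * max(dy, 0.0)


def out_of_bounds(b: Box, room_w: float, room_d: float) -> bool:
    """Geometric constraint: True if any part of the footprint leaves the room."""
    return (
        b.x - b.w / 2 < 0
        or b.x + b.w / 2 > room_w
        or b.y - b.d / 2 < 0
        or b.y + b.d / 2 > room_d
    )


def layout_reward(boxes, room_w, room_d, near_pairs,
                  w_overlap=1.0, w_bounds=1.0, w_near=0.1):
    """Negative weighted sum of constraint violations (higher is better).

    near_pairs is an assumed stand-in for semantic constraints: pairs of
    object names (e.g. chair and table) that should be placed close together.
    """
    by_name = {b.name: b for b in boxes}
    overlap = sum(overlap_area(a, b) for a, b in combinations(boxes, 2))
    oob = sum(out_of_bounds(b, room_w, room_d) for b in boxes)
    dist = sum(
        math.hypot(by_name[p].x - by_name[q].x, by_name[p].y - by_name[q].y)
        for p, q in near_pairs
    )
    return -(w_overlap * overlap + w_bounds * oob + w_near * dist)


# Usage: a chair next to a table versus a chair placed inside the table.
good = [Box("table", 2.0, 2.0, 1.0, 1.0), Box("chair", 2.0, 3.0, 0.5, 0.5)]
bad = [Box("table", 2.0, 2.0, 1.0, 1.0), Box("chair", 2.0, 2.2, 0.5, 0.5)]
pairs = [("chair", "table")]
print(layout_reward(good, 4.0, 4.0, pairs) > layout_reward(bad, 4.0, 4.0, pairs))
```

In a setup like this, an RL agent would propose object placements and receive `layout_reward` as feedback. A closed-loop system as described in the abstract could additionally adjust the weights or the pair list from user interaction, though the abstract does not state how that feedback is incorporated.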

Anh H. Vo, Sungyo Lee, Phil-Joong Kim, Soo-Mi Choi, Yong-Guk Kim • 2026

Related benchmarks

Task                                 | Dataset                                     | Result                          | Rank
Object type prediction               | In-Distribution (ID)                        | Accuracy (ID): 100              | 9
Object type prediction               | Template Shift (TS)                         | Accuracy: 99.93                 | 9
Object type prediction               | Object Shift (OS)                           | Accuracy: 99.94                 | 9
Language-driven scene representation | ALFRED In-Distribution [ID]                 | --                              | 7
Language-driven scene representation | ALFRED Template Shift [TS]                  | --                              | 7
Language-driven scene representation | ALFRED Object Shift [OS]                    | --                              | 7
3D Indoor Scene Synthesis            | Human Evaluation Study Generated 3D Scenes  | Overall Score: 2.506            | 4
Indoor Scene Layout Generation       | 3D Indoor Scenes                            | Functional Appropriateness: 3.22 | 4
Object Placement                     | LLaMA (seen)                                | Object Count: 80.68             | 4
Object Placement                     | Qwen (unseen)                               | Object Count (CNT): 78.53       | 4
Showing 10 of 11 rows
