End-to-end Listen, Look, Speak and Act

About

Human interaction is inherently multimodal and full-duplex: we listen while watching, speak while acting, and fluidly adapt to turn-taking and interruptions. Realizing these capabilities is essential for building models that simulate humans. We present ELLSA (End-to-end Listen, Look, Speak and Act), which, to our knowledge, is the first full-duplex, end-to-end model that simultaneously perceives and generates across vision, text, speech, and action within a single architecture. This enables interaction patterns that were previously out of reach and yields more natural, human-like behaviors. At its core is a novel SA-MoE (Self-Attention Mixture-of-Experts) architecture that routes each modality to specialized experts and fuses them through a unified attention backbone. This provides a generalizable solution for joint multimodal perception and concurrent generation, leveraging strong pre-trained components while enabling efficient modality integration and mitigating modality interference. On speech-interaction and robot-manipulation benchmarks, ELLSA matches modality-specific baselines while uniquely supporting advanced multimodal and full-duplex behaviors such as dialogue and action turn-taking, defective instruction rejection, speaking-while-acting, context-grounded visual question answering, and action barge-ins. We contend that ELLSA represents a step toward more natural and general interactive intelligence, contributing to the broader pursuit of artificial general intelligence. All data, code, and model checkpoints will be released at https://github.com/bytedance/SALMONN/tree/ELLSA.
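To make the "route each modality to specialized experts, then fuse through shared attention" idea concrete, here is a toy sketch in plain Python. It is not the ELLSA implementation. It assumes hard routing by a modality tag, one linear expert per modality, and a single parameter-free attention head. The names `sa_moe_layer`, `joint_self_attention`, and `experts` are hypothetical and chosen only for illustration.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def joint_self_attention(tokens):
    """Single-head scaled dot-product attention over ALL tokens.

    Assumption: no learned Q/K/V projections; each token serves as
    its own query, key, and value to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

def sa_moe_layer(tokens, modalities, experts):
    """Hypothetical layer: per-modality expert, then shared attention.

    tokens:     list of feature vectors
    modalities: parallel list of modality tags, e.g. "speech"
    experts:    dict mapping a modality tag to that expert's weight matrix
    """
    # Step 1: hard-route each token to the expert for its modality.
    routed = [matvec(experts[m], x) for m, x in zip(modalities, tokens)]
    # Step 2: fuse all modalities through one shared attention pass.
    return joint_self_attention(routed)

# Toy usage: one speech token and one vision token, 2-dimensional features.
experts = {
    "speech": [[1.0, 0.0], [0.0, 1.0]],  # identity expert
    "vision": [[2.0, 0.0], [0.0, 2.0]],  # scales features by 2
}
tokens = [[1.0, 0.0], [0.0, 1.0]]
modalities = ["speech", "vision"]
out = sa_moe_layer(tokens, modalities, experts)
```

After the shared attention pass, the speech token's output has a nonzero second component that comes only from the vision token, which is the cross-modal fusion the abstract describes. A real model would use learned projections, multiple heads, and pre-trained experts.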

Siyin Wang, Wenyi Yu, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Lu Lu, Chao Zhang • 2025

Related benchmarks

Task	Dataset	Result
Robot Manipulation	LIBERO	Object Achievement: 95.8	1025
Sequential Robotic Manipulation	CALVIN	Success Rate (1 task): 96.7	63
Speech-to-Speech Question-Answering	WebQ	Accuracy: 36.5	36
Speech-to-Text Question-Answering	LlamaQ	Accuracy: 74.7	26
Speech-to-Text Question-Answering	TriviaQA	Accuracy: 45.2	26
Speech-to-Text Question-Answering	WebQ	Accuracy: 39.5	26
Speech-to-Speech Question-Answering	TriviaQA	Accuracy: 41.7	22
Dialogue turn-taking	Llama Q.	Success Rate: 100	3
Dialogue turn-taking	Web Q.	Success Rate: 100	3
Dialogue turn-taking	TriviaQA	Success Rate: 100	3

Showing 10 of 15 rows
