Seeing is Believing (and Predicting): Context-Aware Multi-Human Behavior Prediction with Vision Language Models

About

Accurately predicting human behaviors is crucial for mobile robots operating in human-populated environments. While prior research primarily focuses on predicting actions in single-human scenarios from an egocentric view, several robotic applications require understanding multiple human behaviors from a third-person perspective. To this end, we present CAMP-VLM (Context-Aware Multi-human behavior Prediction): a Vision Language Model (VLM)-based framework that incorporates contextual features from visual input and spatial awareness from scene graphs to enhance prediction of human-scene interactions. Due to the lack of suitable datasets for multi-human behavior prediction from an observer view, we fine-tune CAMP-VLM on synthetic human behavior data generated by a photorealistic simulator, and evaluate the resulting models on both synthetic and real-world sequences to assess their generalization capabilities. Leveraging Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), CAMP-VLM outperforms the best-performing baseline by up to 66.9% in prediction accuracy.

Utsav Panchal, Yuchen Liu, Luigi Palmieri, Ilche Georgievski, Marco Aiello • 2025

Related benchmarks

Task | Dataset | Result | Rank
Human Behavior Prediction | Synthetic Kitchen, 2 humans | Accuracy (Full Context): 48.2 | 3
Human Behavior Prediction | Synthetic Living Room, 2 humans | Accuracy (Full): 27.6 | 3
Human Behavior Prediction | Synthetic Bedroom, 2 humans | Accuracy (Full): 39.8 | 3
Human Behavior Prediction | Synthetic Kitchen, 3 humans | Accuracy (Full): 30.1 | 3
Human Behavior Prediction | Synthetic Living Room, 3 humans | Accuracy (Full): 22.7 | 3
Human Behavior Prediction | Real-world Multi-human Scenarios (Kitchen, 2 humans) | Accuracy (Full): 42.5 | 2
Human Behavior Prediction | Real-world Multi-human Scenarios (Living room, 2 humans) | Accuracy (Full): 36.2 | 2
Human Behavior Prediction | Real-world Multi-human Scenarios (Kitchen, 3 humans) | Accuracy (Full): 41 | 2
Human Behavior Prediction | Real-world Multi-human Scenarios (Living room, 3 humans) | Accuracy (Full): 33.4 | 2