
Integrating Vision-Centric Text Understanding for Conversational Recommender Systems

About

Conversational Recommender Systems (CRSs) have attracted growing attention for their ability to deliver personalized recommendations through natural language interactions. To more accurately infer user preferences from multi-turn conversations, recent works increasingly expand the conversational context (e.g., by incorporating diverse entity information or retrieving related dialogues). While such context enrichment can assist preference modeling, it also introduces longer and more heterogeneous inputs, leading to practical issues such as input length constraints, text style inconsistency, and irrelevant textual noise, thereby demanding stronger language-understanding capability. In this paper, we propose STARCRS, a Screen-Text-AwaRe Conversational Recommender System that integrates two complementary text understanding modes: (1) a screen-reading pathway that encodes auxiliary textual information as visual tokens, mimicking skim reading on a screen, and (2) an LLM-based textual pathway that focuses on a limited set of critical content for fine-grained reasoning. We design a knowledge-anchored fusion framework that combines contrastive alignment, cross-attention interaction, and adaptive gating to integrate the two modes for improved preference modeling and response generation. Extensive experiments on two widely used benchmarks demonstrate that STARCRS consistently improves both recommendation accuracy and generated response quality.
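The abstract names three fusion components: contrastive alignment, cross-attention interaction, and adaptive gating. The sketch below is a hypothetical PyTorch rendering of that combination, not the authors' implementation; all dimensions, layer choices, and the InfoNCE-style alignment loss are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KnowledgeAnchoredFusion(nn.Module):
    """Hypothetical sketch: textual-pathway tokens attend over
    screen-reading (visual) tokens, then an adaptive per-token gate
    mixes the attended features back into the textual stream."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Cross-attention: text tokens as queries, visual tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Adaptive gate: scalar in (0, 1) per text token, deciding how much
        # screen-reading evidence to inject.
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, 1), nn.Sigmoid(),
        )

    def forward(self, text_tok: torch.Tensor, vis_tok: torch.Tensor) -> torch.Tensor:
        attended, _ = self.cross_attn(text_tok, vis_tok, vis_tok)
        g = self.gate(torch.cat([text_tok, attended], dim=-1))
        return g * attended + (1 - g) * text_tok


def contrastive_align(text_emb: torch.Tensor, vis_emb: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """One plausible reading of "contrastive alignment": symmetric
    InfoNCE between pooled pathway embeddings of the same dialogue."""
    t = F.normalize(text_emb, dim=-1)
    v = F.normalize(vis_emb, dim=-1)
    logits = t @ v.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(t.size(0))        # matched pairs on the diagonal
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```

The gate lets the model fall back on the textual pathway when the screen-rendered auxiliary context is noisy, which matches the abstract's motivation of suppressing irrelevant textual noise.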

Wei Yuan, Shutong Qiao, Tong Chen, Quoc Viet Hung Nguyen, Zi Huang, Hongzhi Yin · 2026

Related benchmarks

| Task | Dataset | Metric | Value | Rank |
| --- | --- | --- | --- | --- |
| Conversation | INSPIRED | Distinct-2 | 3.997 | 27 |
| Conversation Performance | REDIAL | BLEU-2 | 5.1 | 12 |
| Recommendation | REDIAL | Recall@1 | 0.083 | 12 |
| Recommendation | INSPIRED | Recall@1 | 9.8 | 12 |
| Conversational Response Generation | REDIAL | Fluency | 82 | 6 |
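As a quick reference for the diversity metric in the table, Distinct-n is commonly defined as the ratio of unique n-grams to total n-grams across the generated responses; the benchmark's exact tokenization and scaling (e.g., reporting percentages) may differ from this sketch.

```python
def distinct_n(texts, n=2):
    """Distinct-n over a corpus of generated responses:
    unique n-grams / total n-grams (higher = more diverse).
    Whitespace tokenization is an assumption for illustration."""
    total, unique = 0, set()
    for t in texts:
        toks = t.split()
        grams = list(zip(*[toks[i:] for i in range(n)]))
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Example: "a b" appears twice, so 3 unique bigrams out of 4 total.
distinct_n(["a b c", "a b d"])  # → 0.75
```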
