Harmonizing Visual Text Comprehension and Generation

About

In this work, we present TextHarmony, a unified and versatile multimodal generative model proficient in comprehending and generating visual text. Simultaneously generating images and texts typically results in performance degradation due to the inherent inconsistency between vision and language modalities. To overcome this challenge, existing approaches resort to modality-specific data for supervised fine-tuning, necessitating distinct model instances. We propose Slide-LoRA, which dynamically aggregates modality-specific and modality-agnostic LoRA experts, partially decoupling the multimodal generation space. Slide-LoRA harmonizes the generation of vision and language within a singular model instance, thereby facilitating a more unified generative process. Additionally, we develop a high-quality image caption dataset, DetailedTextCaps-100K, synthesized with a sophisticated closed-source MLLM to enhance visual text generation capabilities further. Comprehensive experiments across various benchmarks demonstrate the effectiveness of the proposed approach. Empowered by Slide-LoRA, TextHarmony achieves comparable performance to modality-specific fine-tuning results with only a 2% increase in parameters and shows an average improvement of 2.5% in visual text comprehension tasks and 4.0% in visual text generation tasks. Our work delineates the viability of an integrated approach to multimodal generation within the visual text domain, setting a foundation for subsequent inquiries. Code is available at https://github.com/bytedance/TextHarmony.
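The abstract describes Slide-LoRA only at a high level: modality-specific and modality-agnostic LoRA experts are aggregated dynamically inside one model. As a rough illustration of that idea, the sketch below adds an always-on shared low-rank update and a softly gated mix of a text expert and an image expert to a frozen linear layer. It is not the authors' implementation. The class names (`LoRAExpert`, `GatedLoRALayer`), the softmax gate, the rank, and the initialization are all assumptions made for this example; see the linked repository for the actual design.

```python
import math
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def softmax(z):
    """Numerically stable softmax over a short list."""
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


class LoRAExpert:
    """Low-rank update delta(x) = B (A x), with A: rank x d_in, B: d_out x rank."""

    def __init__(self, d_in, d_out, rank, rng):
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        # B starts at zero, so the expert initially contributes nothing.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def __call__(self, x):
        return matvec(self.B, matvec(self.A, x))


class GatedLoRALayer:
    """Frozen weight W plus a shared expert and a gated pair of modality experts.

    Hypothetical structure for illustration only.
    """

    def __init__(self, W, rank=2, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen base weight, never updated in this sketch
        self.shared = LoRAExpert(d_in, d_out, rank, rng)  # modality-agnostic
        self.text = LoRAExpert(d_in, d_out, rank, rng)    # text-generation expert
        self.image = LoRAExpert(d_in, d_out, rank, rng)   # image-generation expert
        # Assumed gate: a linear map from the input to two logits.
        self.gate = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(2)]

    def gate_weights(self, x):
        """Return [w_text, w_image]; the two weights sum to 1."""
        return softmax(matvec(self.gate, x))

    def forward(self, x):
        w_text, w_image = self.gate_weights(x)
        base = matvec(self.W, x)
        shared = self.shared(x)
        text = self.text(x)
        image = self.image(x)
        return [
            b + s + w_text * t + w_image * i
            for b, s, t, i in zip(base, shared, text, image)
        ]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
layer = GatedLoRALayer(identity, rank=2, seed=0)
output = layer.forward([1.0, 2.0, 3.0])
```

Because every `B` matrix starts at zero, `output` equals the frozen layer's output until the experts are trained. Only the small `A`, `B`, and gate matrices would be trainable, which is the general reason LoRA-style adapters add few parameters relative to the base model.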

Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shu Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, Yuan Xie • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy: 61.1 | 496 |
| Chart Question Answering | ChartQA | Accuracy: 38.8 | 229 |
| Table Question Answering | WTQ | Accuracy: 28.3 | 101 |
| Document-oriented Visual Question Answering | DocVQA | Accuracy: 49.8 | 72 |
| Document Visual Question Answering | InfoVQA | -- | 32 |
| Text-to-Image Generation | MARIO-Eval | CLIPScore: 0.36 | 25 |
| Text-Centric Vision-Language Understanding | OCR Bench | Accuracy: 448 | 20 |
| Rain Removal | Rain 0.5 | PSNR (dB): 27.9788 | 20 |
| Scene Text-Centric Visual Question Answering | OCRVQA | Accuracy: 57.6 | 14 |
| Scene Text-Centric Visual Question Answering | STVQA | Accuracy: 0.513 | 14 |
Showing 10 of 15 rows
