Towards Semantic Equivalence of Tokenization in Multimodal LLM

About

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. A crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into the feature representations that are most beneficial for LLMs. However, existing vision tokenizers, which are essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting its semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim), equipped with SeTok, demonstrates superior performance across various tasks, as evidenced by the experimental results. The project page is at https://chocowu.github.io/SeTok-web/.

Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan • 2024
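
The abstract describes grouping visual features into semantic units, with the number of tokens depending on image complexity. The sketch below illustrates that general idea only; it is not the SeTok algorithm. It assumes a simple greedy "leader" clustering over patch features with cosine similarity, and the function name `group_patches` and the parameter `sim_threshold` are hypothetical choices made for this example.

```python
import numpy as np

def group_patches(features, sim_threshold=0.8):
    """Greedy leader clustering over patch features.

    Hypothetical stand-in for a dynamic clustering step; NOT the
    method from the paper. Returns one pooled token per cluster.
    """
    # Normalise rows so dot products are cosine similarities.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)

    leaders = []  # index of the patch that founded each cluster
    labels = np.empty(len(features), dtype=int)
    for i, vec in enumerate(unit):
        if leaders:
            sims = unit[leaders] @ vec
            best = int(np.argmax(sims))
            if sims[best] >= sim_threshold:
                labels[i] = best  # join the most similar cluster
                continue
        leaders.append(i)  # no cluster is close enough: start a new one
        labels[i] = len(leaders) - 1

    # One token per cluster: the mean of its member features.
    tokens = np.stack(
        [features[labels == k].mean(axis=0) for k in range(len(leaders))]
    )
    return tokens, labels

# A uniform "image" collapses to a single token; an image with two
# distinct regions yields two tokens.
uniform = np.tile([1.0, 0.0], (16, 1))
two_regions = np.vstack([np.tile([1.0, 0.0], (8, 1)),
                         np.tile([0.0, 1.0], (8, 1))])
print(len(group_patches(uniform)[0]), len(group_patches(two_regions)[0]))
```

The token count here is not fixed in advance: it falls out of how many distinct feature groups the input contains, which is the property the abstract attributes to a dynamic tokenizer.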

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy 89.1	2019
Visual Question Answering	VQA v2	Accuracy 78.5	1429
Visual Question Answering	GQA	Accuracy 65.6	1425
Multimodal Evaluation	MME	--	727
Multimodal Capability Evaluation	MM-Vet	Score 45.2	393
Referring Expression Segmentation	RefCOCO+ (testA)	cIoU 72.4	288
Referring Expression Segmentation	RefCOCO+ (val)	cIoU 68	272
Referring Expression Segmentation	RefCOCO+ (testB)	cIoU 61.2	256
Text-to-Image Generation	MS-COCO	FID 8.5	145
Referring Expression Segmentation	RefCOCOg (val (U))	cIoU 71.3	95

