Logics-Parsing-Omni Technical Report

About

Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models, and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.
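To make the three levels and the evidence anchoring idea concrete, here is a minimal Python sketch. It is an illustration only, not the released model's output format: the class names (`Detection`, `Recognition`, `Interpretation`), their fields, and the helper `unanchored` are all hypothetical and do not come from the paper or its repository. The sketch assumes that each high-level statement lists the IDs of the detections it relies on, so that statements without valid low-level support can be flagged.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """Level 1 (hypothetical schema): a localized object or event."""
    det_id: str
    label: str
    bbox: Optional[Tuple[float, float, float, float]] = None  # x0, y0, x1, y1
    time_span: Optional[Tuple[float, float]] = None  # start, end in seconds


@dataclass
class Recognition:
    """Level 2 (hypothetical schema): symbolized content for one detection."""
    det_id: str
    text: str  # e.g. OCR or ASR output
    attributes: dict = field(default_factory=dict)


@dataclass
class Interpretation:
    """Level 3 (hypothetical schema): a statement plus the evidence it cites."""
    statement: str
    evidence_ids: List[str]


def unanchored(detections, interpretations):
    """Return interpretations that cite no evidence or cite unknown detections."""
    known = {d.det_id for d in detections}
    return [
        i for i in interpretations
        if not i.evidence_ids or not set(i.evidence_ids) <= known
    ]


# Toy example data, invented for illustration.
detections = [
    Detection("d1", "title", bbox=(0.1, 0.05, 0.9, 0.12)),
    Detection("d2", "speech", time_span=(3.0, 7.5)),
]
recognitions = [
    Recognition("d1", "Quarterly Report"),
    Recognition("d2", "Revenue grew."),
]
interpretations = [
    Interpretation("The clip presents a quarterly report.", ["d1", "d2"]),
    Interpretation("The speaker is the CEO.", []),  # cites nothing: flagged
]

flagged = unanchored(detections, interpretations)
print([i.statement for i in flagged])
```

In this toy setup, only the second statement is flagged because it cites no detection. A stricter check could also require that every cited detection has a matching `Recognition` record.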

Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Yan Gao, Yuan Gao, Baoyu Hou, Guangzheng Hu, Shuzhao Li, Weixu Qiao, Weidong Ren, Yanan Wang, Boyu Yang, Fan Yang, Jiangtao Zhang, Lixin Zhang, Lin Qu, Hu Wei, Xiaoxiao Xu, Bing Zhao • 2026

Related benchmarks

Task	Dataset	Metric	Result
Document Parsing	OmniDocBench 1.5 (test)	Text Edit Error	0.047	132
Document reading	LogicsDocBench	Overall Score	84.9	20
Graphics Parsing	OmniParsingBench Graphics Module	Overall Accuracy	87.43	14
Difference Perception	OmniParsingBench Natural-Image Difference Module	Precision	68.9	7
Geometric Image Difference Parsing	OmniParsingBench Geo-Image Difference Module 1.0 (test)	Overall Score	52.81	7
Unified Multimodal Parsing	OmniParsingBench Natural Image	Overall Score	62.46	7
Unified Multimodal Parsing	OmniParsingBench Document	Perception Score	84.9	7
Visual Parsing	OmniParsingBench Natural Image module	Overall Score	62.46	7
Difference Cognition	OmniParsingBench Natural-Image Difference Module	Precision	60.9	7
Audio Parsing	OmniParsingBench Audio Module	Overall Score	53.75	3

Showing 10 of 15 rows
