Logics-Parsing-Omni Technical Report

About

Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. The framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, and introduces a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models, and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.

Xin An, Jingyi Cai, Xiangyang Chen, Huayao Liu, Peiting Liu, Peng Wang, Bei Yang, Xiuwen Zhu, Yongfan Chen, Yan Gao, Yuan Gao, Baoyu Hou, Guangzheng Hu, Shuzhao Li, Weixu Qiao, Weidong Ren, Yanan Wang, Boyu Yang, Fan Yang, Jiangtao Zhang, Lixin Zhang, Lin Qu, Hu Wei, Xiaoxiao Xu, Bing Zhao• 2026
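The three-level hierarchy and evidence anchoring mechanism described in the abstract can be sketched as a data model in which every high-level claim must cite grounded entities. This is a minimal illustrative sketch, not the released schema; all class and field names (`Detection`, `Entity`, `Interpretation`, `check_anchoring`) are hypothetical.

```python
from dataclasses import dataclass, field

# Level 1: Holistic Detection -- spatial-temporal grounding of an object/event.
@dataclass
class Detection:
    object_id: str
    bbox: tuple            # (x0, y0, x1, y1) for images/documents
    t_span: tuple = None   # (start_s, end_s) for audio-visual streams

# Level 2: Fine-grained Recognition -- symbolization (OCR/ASR) + attributes.
@dataclass
class Entity:
    object_id: str                        # anchors back to a Detection
    symbol: str                           # e.g. OCR text or ASR transcript
    attributes: dict = field(default_factory=dict)

# Level 3: Multi-level Interpreting -- a claim with its supporting evidence.
@dataclass
class Interpretation:
    claim: str
    evidence: list                        # object_ids of supporting entities

def check_anchoring(interp: Interpretation, entities: list) -> bool:
    """Evidence anchoring: every cited object_id must resolve to an entity,
    so each high-level claim stays locatable and traceable."""
    known = {e.object_id for e in entities}
    return all(eid in known for eid in interp.evidence)

# Usage: a grounded claim passes; a claim citing a missing entity fails.
det = Detection("obj-1", (10, 20, 200, 60))
ent = Entity("obj-1", "Total: $42.00", {"type": "text_line"})
ok = check_anchoring(Interpretation("The receipt total is $42.00", ["obj-1"]), [ent])
bad = check_anchoring(Interpretation("Unsupported claim", ["obj-9"]), [ent])
print(ok, bad)  # True False
```

The design point is that Level-3 outputs never float free: rejecting any interpretation whose evidence list does not resolve is what makes the induction "evidence-based".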

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Document Parsing | OmniDocBench 1.5 (test) | Text Edit Error | 0.047 | 111 |
| Document Reading | LogicsDocBench | Overall Score | 84.9 | 20 |
| Graphics Parsing | OmniParsingBench Graphics Module | Overall Accuracy | 87.43 | 14 |
| Difference Perception | OmniParsingBench Natural-Image Difference Module | Precision | 68.9 | 7 |
| Geometric Image Difference Parsing | OmniParsingBench Geo-Image Difference Module 1.0 (test) | Overall Score | 52.81 | 7 |
| Unified Multimodal Parsing | OmniParsingBench Natural Image | Overall Score | 62.46 | 7 |
| Unified Multimodal Parsing | OmniParsingBench Document | Perception Score | 84.9 | 7 |
| Visual Parsing | OmniParsingBench Natural Image Module | Overall Score | 62.46 | 7 |
| Difference Cognition | OmniParsingBench Natural-Image Difference Module | Precision | 60.9 | 7 |
| Audio Parsing | OmniParsingBench Audio Module | Overall Score | 53.75 | 3 |
(Showing 10 of 15 rows.)
