Logics-Parsing-Omni Technical Report
About
Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models, and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.
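As a rough illustration of the three levels and the evidence anchoring idea described above, the following minimal sketch models a parse result as plain Python dataclasses and checks that every high-level statement cites at least one known detection. All class, field, and function names here (`Detection`, `Recognition`, `Interpretation`, `ParseResult`, `unanchored`) are hypothetical and chosen for illustration only; they are not the output schema or API of the Logics-Parsing-Omni model, and the sample values are made up.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Detection:
    """Level 1 (hypothetical): where, and for streams when, something occurs."""
    det_id: str
    label: str
    bbox: Optional[Tuple[float, float, float, float]] = None  # x0, y0, x1, y1
    time_span: Optional[Tuple[float, float]] = None  # start, end in seconds


@dataclass
class Recognition:
    """Level 2 (hypothetical): symbolized text and attributes of one detection."""
    det_id: str
    text: str = ""
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Interpretation:
    """Level 3 (hypothetical): a statement plus the detection ids cited as evidence."""
    statement: str
    evidence: List[str]


@dataclass
class ParseResult:
    detections: List[Detection]
    recognitions: List[Recognition]
    interpretations: List[Interpretation]


def unanchored(result: ParseResult) -> List[str]:
    """Return statements that cite no evidence or cite an unknown detection id."""
    known = {d.det_id for d in result.detections}
    return [
        i.statement
        for i in result.interpretations
        if not i.evidence or any(e not in known for e in i.evidence)
    ]


# Made-up example: one document region and one audio segment.
result = ParseResult(
    detections=[
        Detection("d1", "title", bbox=(40.0, 20.0, 560.0, 60.0)),
        Detection("d2", "speech", time_span=(3.2, 7.8)),
    ],
    recognitions=[
        Recognition("d1", text="Quarterly Report"),
        Recognition("d2", text="Revenue grew this quarter.", attributes={"speaker": "A"}),
    ],
    interpretations=[
        Interpretation("The speaker summarizes the report.", evidence=["d1", "d2"]),
        Interpretation("The audience applauds.", evidence=["d9"]),  # no such detection
    ],
)

print(unanchored(result))  # ['The audience applauds.']
```

Keeping evidence as explicit identifiers, rather than free text, is one simple way to make interpretations locatable and traceable in the sense the abstract describes; the actual mechanism used by the framework may differ.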
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Document Parsing | OmniDocBench 1.5 (test) | Text Edit Error | 0.047 | 111 |
| Document Reading | LogicsDocBench | Overall Score | 84.9 | 20 |
| Graphics Parsing | OmniParsingBench Graphics Module | Overall Accuracy | 87.43 | 14 |
| Difference Perception | OmniParsingBench Natural-Image Difference Module | Precision | 68.9 | 7 |
| Geometric Image Difference Parsing | OmniParsingBench Geo-Image Difference Module 1.0 (test) | Overall Score | 52.81 | 7 |
| Unified Multimodal Parsing | OmniParsingBench Natural Image | Overall Score | 62.46 | 7 |
| Unified Multimodal Parsing | OmniParsingBench Document | Perception Score | 84.9 | 7 |
| Visual Parsing | OmniParsingBench Natural Image module | Overall Score | 62.46 | 7 |
| Difference Cognition | OmniParsingBench Natural-Image Difference Module | Precision | 60.9 | 7 |
| Audio Parsing | OmniParsingBench Audio Module | Overall Score | 53.75 | 3 |