LogiPart: Local Large Language Models for Data Exploration at Scale with Logical Partitioning

About

The discovery of deep, steerable taxonomies in large text corpora is currently restricted by a trade-off between the surface-level efficiency of topic models and the prohibitive, non-scalable assignment costs of LLM-integrated frameworks. We introduce \textbf{LogiPart}, a scalable, hypothesis-first framework for building interpretable hierarchical partitions that decouples hierarchy growth from expensive full-corpus LLM conditioning. LogiPart utilizes locally hosted LLMs on compact, embedding-aware samples to generate concise natural-language taxonomic predicates. These predicates are then evaluated efficiently across the entire corpus using zero-shot Natural Language Inference (NLI) combined with fast graph-based label propagation, achieving constant $O(1)$ generative token complexity per node relative to corpus size. We evaluate LogiPart across four diverse text corpora (totaling $\approx$140,000 documents). Using structured manifolds for \textbf{calibration}, we identify an empirical reasoning threshold at the 14B-parameter scale required for stable semantic grounding. On complex, high-entropy corpora (Wikipedia, US Bills), where traditional thematic metrics reveal an ``alignment gap,'' inverse logic validation confirms the stability of the induced logic, with individual taxonomic bisections maintaining an average per-node routing accuracy of up to 96\%. A qualitative audit by an independent LLM-as-a-judge confirms the discovery of meaningful functional axes, such as policy intent, that thematic ground-truth labels fail to capture. LogiPart enables frontier-level exploratory analysis on consumer-grade hardware, making hypothesis-driven taxonomic discovery feasible under realistic computational and governance constraints.

Tiago Fernandes Tavares• 2025

Related benchmarks

Task	Dataset	Result
Document Classification	AG-News (held-out)	F1 Score64	6
Document Classification	20 Newsgroups (held-out)	F130	6
Hierarchical classification	Wikipedia (held-out set)	NMI0.23	3
Hierarchical classification	US Bills (held-out set)	NMI0.17	3

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord