COREY: Entropy-Guided Runtime Chunk Scheduling for Selective Scan Kernels

About

Mamba selective state space models (SSMs) provide linear-time sequence modeling but remain sensitive to selective-scan chunk scheduling. We present COREY, a concept-and-feasibility runtime scheduler that maps fixed-bin activation entropy to chunk size. We evaluate COREY in three tiers: a prototype cost model, real-checkpoint kernel timing, and routed end-to-end ablations on modern GPUs. At the kernel level, a calibrated rule, \(H_{\mathrm{ref}}=\log K\), recovers the locally optimal chunk and matches a one-time static oracle, yielding \(4.41\times\) lower latency than an unoptimized baseline on a consumer GPU and \(3.90\times\)--\(4.04\times\) lower latency on a data-center accelerator. Routing this choice into a patched live scan kernel closes the engineering loop without improving end-to-end speed: in unified routed ablations, the best static chunk outperforms all entropy-guided and proxy schedulers. Sampled-histogram COREY adds \(+4.6\%\) overhead; a guarded fallback to Static-512 reduces this to \(+1.3\%\); and a lightweight sequence-length-keyed table further reduces it to \(+0.7\%\). However, both lower-overhead variants remain slower than the static oracle because they retain scheduling cost. On an 80-prompt LongBench subset, passive and routed inference are exactly output-equivalent, with \(100\%\) greedy-token agreement and zero metric deltas. A mixed-regime study shows that a single sequence-length rule matches the per-regime chunk oracle for balanced serving. COREY is therefore validated as a quality-preserving scheduling prototype, but current entropy statistics are not a robust throughput win over static chunk tuning on measured SSM checkpoint workloads. Source code: https://github.com/mabo1215/COREY_Transformer/.
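The entropy-to-chunk mapping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bin count, the value range, the candidate chunk sizes, and the direction of the mapping (higher normalized entropy selecting a larger chunk) are all assumptions made for illustration, and the function names `fixed_bin_entropy` and `entropy_to_chunk` are hypothetical. Only the normalization by \(H_{\mathrm{ref}}=\log K\), with \(K\) the number of bins, comes from the abstract.

```python
import math


def fixed_bin_entropy(values, num_bins=16, lo=-4.0, hi=4.0):
    """Shannon entropy (in nats) of a fixed-bin histogram of activations.

    num_bins, lo, and hi are illustrative assumptions, not values from the paper.
    """
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        # Clamp out-of-range values into the edge bins.
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    entropy = 0.0
    for c in counts:
        if c:
            p = c / total
            entropy -= p * math.log(p)
    return entropy


def entropy_to_chunk(entropy, num_bins=16, chunks=(64, 128, 256, 512)):
    """Pick a chunk size from entropy normalized by H_ref = log K.

    The candidate set and the monotone direction are assumptions.
    """
    h_ref = math.log(num_bins)
    ratio = min(max(entropy / h_ref, 0.0), 1.0)
    # Assumed direction: higher normalized entropy selects a larger chunk.
    idx = min(int(ratio * len(chunks)), len(chunks) - 1)
    return chunks[idx]
```

Under these assumptions, a constant activation vector has zero entropy and selects the smallest candidate chunk, while a histogram spread evenly over all \(K\) bins reaches \(H_{\mathrm{ref}}\) and selects the largest.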

Bo Ma, Jinsong Wu, Weiqi Yan • 2026

Related benchmarks

Task	Dataset	Result
Language Modeling	WikiText-103	PPL 809.4	216
Language Modeling	PG-19	Perplexity 11.66	206
Long-context Language Understanding	LongBench 20 samples/task	NarrQA Performance 1.91	4
Language Model Inference	Sequence Bucket Short	Latency (ms) 39.26	3
Language Model Inference	Sequence Bucket Medium	Latency (ms) 52.88	3
Language Model Inference	Sequence Bucket Long	Latency (ms) 69.58	3
Language Model Inference	Sequence Bucket Ultra-long	Latency (ms) 77.97	3
