
COREY: Entropy-Guided Runtime Chunk Scheduling for Selective Scan Kernels

About

Mamba selective state space models (SSMs) provide linear-time sequence modeling but remain sensitive to selective-scan chunk scheduling. We present COREY, a concept-and-feasibility runtime scheduler that maps fixed-bin activation entropy to chunk size. We evaluate COREY in three tiers: a prototype cost model, real-checkpoint kernel timing, and routed end-to-end ablations on modern GPUs. At the kernel level, a calibrated rule, \(H_{\mathrm{ref}}=\log K\), recovers the locally optimal chunk and matches a one-time static oracle, yielding \(4.41\times\) lower latency than an unoptimized baseline on a consumer GPU and \(3.90\times\) to \(4.04\times\) lower latency on a data-center accelerator. Routing this choice into a patched live scan kernel closes the engineering loop without improving end-to-end speed: in unified routed ablations, the best static chunk outperforms all entropy-guided and proxy schedulers. Sampled-histogram COREY adds \(+4.6\%\) overhead; a guarded fallback to Static-512 reduces this to \(+1.3\%\); and a lightweight sequence-length-keyed table further reduces it to \(+0.7\%\). However, both of these reduced-overhead variants remain slower than the static oracle because they retain scheduling cost. On an 80-prompt LongBench subset, passive and routed inference are exactly output-equivalent, with \(100\%\) greedy-token agreement and zero metric deltas. A mixed-regime study shows that a single sequence-length rule matches the per-regime chunk oracle for balanced serving. COREY is therefore validated as a quality-preserving scheduling prototype, but current entropy statistics are not a robust throughput win over static chunk tuning on measured SSM checkpoint workloads. Source code: https://github.com/mabo1215/COREY_Transformer/.
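The entropy-to-chunk mapping described above can be illustrated with a minimal sketch. It is not taken from the COREY repository. The function names, the candidate chunk sizes, the direction of the mapping (higher entropy selects a larger chunk), and the sample-count guard are all assumptions made for illustration. Only the reference entropy \(H_{\mathrm{ref}}=\log K\) and the fallback to a static chunk of 512 come from the abstract.

```python
import math


def fixed_bin_entropy(values, num_bins, lo, hi):
    """Shannon entropy (in nats) of a fixed-bin histogram over [lo, hi]."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        # Out-of-range values are clamped into the first or last bin.
        idx = int((v - lo) / width)
        idx = min(max(idx, 0), num_bins - 1)
        counts[idx] += 1
    total = len(values)
    entropy = 0.0
    for c in counts:
        if c:
            p = c / total
            entropy -= p * math.log(p)
    return entropy


def entropy_to_chunk(entropy, num_bins, chunk_sizes=(64, 128, 256, 512)):
    """Map the normalized entropy H / log K to a candidate chunk size.

    Assumption: the candidate list and the monotone "higher entropy,
    larger chunk" direction are hypothetical, not the paper's rule.
    """
    if num_bins < 2:
        raise ValueError("num_bins must be at least 2 so that log K > 0")
    h_ref = math.log(num_bins)  # H_ref = log K, the maximum possible entropy
    ratio = min(max(entropy / h_ref, 0.0), 1.0)
    idx = min(int(ratio * len(chunk_sizes)), len(chunk_sizes) - 1)
    return chunk_sizes[idx]


def guarded_chunk(values, num_bins, lo, hi, min_samples=32, fallback=512):
    """Fall back to a static chunk when too few samples are available.

    Assumption: the sample-count guard is a hypothetical trigger; the
    abstract only states that a guarded fallback to Static-512 exists.
    """
    if len(values) < min_samples:
        return fallback
    entropy = fixed_bin_entropy(values, num_bins, lo, hi)
    return entropy_to_chunk(entropy, num_bins)
```

Under these assumptions, activations spread evenly over all \(K\) bins reach \(H = \log K\) and select the largest candidate chunk, while activations concentrated in a single bin have zero entropy and select the smallest. The guard illustrates why a fallback reduces but does not remove overhead: the scheduler still has to run its check on every call.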

Bo Ma, Jinsong Wu, Weiqi Yan • 2026

Related benchmarks

Task                                 Dataset                     Metric              Result  Rank
Language Modeling                    WikiText-103                PPL                 809.4   216
Language Modeling                    PG-19                       Perplexity          11.66   206
Long-context Language Understanding  LongBench 20 samples/task   NarrQA Performance  1.91    4
Language Model Inference             Sequence Bucket Short       Latency (ms)        39.26   3
Language Model Inference             Sequence Bucket Medium      Latency (ms)        52.88   3
Language Model Inference             Sequence Bucket Long        Latency (ms)        69.58   3
Language Model Inference             Sequence Bucket Ultra-long  Latency (ms)        77.97   3
