Luna-2: Scalable Single-Token Evaluation with Small Language Models

About

Real-time guardrails require evaluation that is accurate, cheap, and fast, yet today's default, LLM-as-a-judge (LLMAJ), is slow, expensive, and operationally non-deterministic due to multi-token generation. We present Luna-2, a novel architecture that turns decoder-only small language models (SLMs) into a deterministic evaluation model that reliably computes complex task-specific LLMAJ metrics (e.g., toxicity, hallucination, tool selection quality) at accuracy on par with or higher than LLMAJ using frontier LLMs, while drastically reducing the cost and latency of computation. Each metric is implemented as a lightweight LoRA/PEFT head on top of a shared SLM backbone, enabling hundreds of specialized metrics to run concurrently on a single GPU, deployable locally next to AI systems in a privacy-preserving and latency-optimizing manner. Across content safety and hallucination benchmarks, Luna-2 matches the accuracy of state-of-the-art LLM-based evaluators while reducing inference cost by over 80x and latency by over 20x. In this paper, we outline the model architecture and training methodology and report real-world empirical results on accuracy, latency, and throughput. In production, Luna-2 is protecting 100M+ AI sessions and processing over 100B tokens per month for our customers, with eval cost savings of over $30M annually.
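
The abstract does not say how a single-token evaluator turns model output into a metric value, so the following is only a minimal sketch under one plausible assumption: the score is read from the logits of one output position, renormalized over two label tokens. The function name `single_token_score`, the token ids, and the example logits are all hypothetical and are not taken from the paper.

```python
import math


def single_token_score(logits, positive_id, negative_id):
    """Probability of the positive label, renormalized over two label tokens.

    Assumption (not from the paper): the evaluator reads its verdict from the
    logits of a single output position instead of sampling a multi-token answer.
    """
    pos, neg = logits[positive_id], logits[negative_id]
    m = max(pos, neg)  # subtract the max for numerical stability
    e_pos = math.exp(pos - m)
    e_neg = math.exp(neg - m)
    return e_pos / (e_pos + e_neg)


# Hypothetical logits over a tiny 4-token vocabulary.
logits = [0.1, 2.0, -1.0, 0.5]
score = single_token_score(logits, positive_id=1, negative_id=3)  # about 0.82
flagged = score >= 0.5  # a fixed threshold gives a repeatable decision
```

Because nothing is sampled, the same logits always produce the same score, which is one way an evaluator can avoid the run-to-run variation the abstract attributes to multi-token generation.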

Vatsal Goel, Rishon Dsouza, Nikhil Ega, Amey Ramesh Rambatla, Rob Friel, Shuai Shao, Yash Sheth • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Guardrail evaluation | Representative guardrail task ~1250 tokens | Evaluation Cost per 1K | 0.01 | 6 |
| PII (multi-label) | Representative guardrail dataset | F1 Score | 89 | 3 |
| Tone (multi-class) | Representative guardrail dataset | F1 Score | 92 | 3 |
| Context Adherence | Representative guardrail dataset | F1 Score | 95 | 3 |
| Prompt Injection | Representative guardrail dataset | F1 Score | 94 | 3 |
| Tool Selection Quality | Representative guardrail dataset | F1 Score | 94 | 3 |
