diff --git a/_posts/2025-12-14-halugate.md b/_posts/2025-12-14-halugate.md new file mode 100644 index 0000000..14b99d9 --- /dev/null +++ b/_posts/2025-12-14-halugate.md @@ -0,0 +1,510 @@ +--- +layout: post +title: "Token-Level Truth: Real-Time Hallucination Detection for Production LLMs" +author: "vLLM Semantic Router Team" +image: /assets/logos/vllm-logo-text-light.png +--- + +Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of **extrinsic hallucination**—where models confidently ignore the ground truth sitting right in front of them. + +Building on our [Signal-Decision Architecture](https://blog.vllm.ai/2025/11/19/signal-decision.html), we introduce **HaluGate**—a conditional, token-level hallucination detection pipeline that catches unsupported claims *before* they reach your users. No LLM-as-judge. No Python runtime. Just fast, explainable verification at the point of delivery. + +## The Problem: Hallucinations Block Production Deployment + +Hallucinations have become the single biggest barrier to deploying LLMs in production. Across industries—**legal** (fabricated case citations), **healthcare** (incorrect drug interactions), **finance** (invented financial data), **customer service** (non-existent policies)—the pattern is the same: AI generates plausible-sounding content that appears authoritative but crumbles under scrutiny. + +The challenge isn't obvious nonsense. It's *subtle fabrications embedded in otherwise accurate responses*—errors that require domain expertise or external verification to catch. For enterprises, this uncertainty makes LLM deployment a liability rather than an asset. + +## The Scenario: When Tools Work But Models Don't + +Let's make this concrete. Consider a typical function-calling interaction: + +> **User**: "When was the Eiffel Tower built?" +> +> **Tool Call**: `get_landmark_info("Eiffel Tower")` +> +> **Tool Response**: `{"name": "Eiffel Tower", "built": "1887-1889", "height": "330 meters", "location": "Paris, France"}` +> +> **LLM Response**: "The Eiffel Tower was **built in 1950** and stands at **500 meters** tall in Paris, France." + +The tool returned correct data. The model's response contains facts. But two of those "facts" are fabricated—**extrinsic hallucinations** that directly contradict the provided context. + +This failure mode is particularly insidious: + +- **Users trust it** because they see the tool was called +- **Traditional filters miss it** because there's no toxic or harmful content +- **Evaluation is expensive** if you rely on another LLM to judge + +What if we could detect these errors automatically, in real-time, with millisecond latency? + +## The Insight: Function Calling as Ground Truth + +Here's the key realization: **modern function-calling APIs already provide grounding context**. When users ask factual questions, models call tools—database lookups, API calls, document retrieval. These tool results are semantically equivalent to retrieved documents in RAG. + +![](/assets/figures/semantic-router/halugate-0.png) + +We don't need to build separate retrieval infrastructure. We don't need to call GPT-4 as a judge. 
We extract three components from the existing API flow: + +| Component | Source | Purpose | +|-----------|--------|---------| +| **Context** | Tool message content | Ground truth for verification | +| **Question** | User message | Intent understanding | +| **Answer** | Assistant response | Claims to verify | + +The question becomes: **Is the answer faithful to the context?** + +## Why Not Just Use LLM-as-Judge? + +The obvious solution—call another LLM to verify—has fundamental problems in production: + +| Approach | Latency | Cost | Explainability | +|----------|---------|------|----------------| +| GPT-4 as judge | 2-5 seconds | $0.01-0.03/request | Low (black box) | +| Local LLM judge | 500ms-2s | GPU compute | Low | +| **HaluGate** | **76-162ms** | **CPU only** | **High (token-level + NLI)** | + +LLM judges also suffer from: +- **Position bias**: Tendency to favor certain answer positions +- **Verbosity bias**: Longer answers rated higher regardless of accuracy +- **Self-preference**: Models favor outputs similar to their own style +- **Inconsistency**: Same input can yield different judgments + +We needed something faster, cheaper, and more explainable. + +## HaluGate: A Two-Stage Detection Pipeline + +HaluGate implements a **conditional two-stage pipeline** that balances efficiency with precision: + +![](/assets/figures/semantic-router/halugate-1.png) + +### Stage 1: HaluGate Sentinel (Prompt Classification) + +Not every query needs hallucination detection. Consider these prompts: + +| Prompt | Needs Fact-Check? | Reason | +|--------|-------------------|--------| +| "When was Einstein born?" | ✅ Yes | Verifiable fact | +| "Write a poem about autumn" | ❌ No | Creative task | +| "Debug this Python code" | ❌ No | Technical assistance | +| "What's your opinion on AI?" | ❌ No | Opinion request | +| "Is the Earth round?" | ✅ Yes | Factual claim | + +Running token-level detection on creative writing or code review is wasteful—and potentially produces false positives ("your poem contains unsupported claims!"). + +**Why pre-classification matters**: Token-level detection scales linearly with context length. For a 4K token RAG context, detection takes ~125ms; for 16K tokens, ~365ms. In production workloads where ~35% of queries are non-factual, pre-classification achieves a **72.2% efficiency gain**—skipping expensive detection entirely for creative, coding, and opinion queries. + +[HaluGate Sentinel](https://huggingface.co/llm-semantic-router/halugate-sentinel) is a ModernBERT-based classifier that answers one question: *Does this prompt warrant factual verification?* + +![](/assets/figures/semantic-router/halugate-2.png) + +The model is trained on a carefully curated mix of: + +**Fact-Check Needed (Positive Class)**: +- **Question Answering**: SQuAD, TriviaQA, Natural Questions, HotpotQA +- **Truthfulness**: TruthfulQA (common misconceptions) +- **Hallucination Benchmarks**: HaluEval, FactCHD +- **Information-Seeking Dialogue**: FaithDial, CoQA +- **RAG Datasets**: neural-bridge/rag-dataset-12000 + +**No Fact-Check Needed (Negative Class)**: +- **Creative Writing**: WritingPrompts, story generation +- **Code**: CodeSearchNet docstrings, programming tasks +- **Opinion/Instruction**: Dolly non-factual, Alpaca creative + +This binary classification achieves **96.4% validation accuracy** with **~12ms inference latency** via native Rust/Candle integration. + +### Stage 2: Token-Level Detection + NLI Explanation + +For prompts classified as fact-seeking, we run a two-model detection pipeline. 
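+
+To make the gating concrete, here is a minimal Go sketch of the conditional flow. The `classifyPrompt` heuristic is a toy stand-in for the Sentinel model (a real deployment invokes the ModernBERT classifier through the Candle bindings), and the names are illustrative rather than the router's actual API:
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// sentinelVerdict mirrors the Sentinel's binary output: a class plus a confidence.
+type sentinelVerdict struct {
+    factCheckNeeded bool
+    confidence      float64
+}
+
+// classifyPrompt fakes the Sentinel with a keyword heuristic; in production
+// this is a ~12ms ModernBERT forward pass on CPU.
+func classifyPrompt(prompt string) sentinelVerdict {
+    p := strings.ToLower(prompt)
+    for _, cue := range []string{"when was", "who", "how many", "is the"} {
+        if strings.Contains(p, cue) {
+            return sentinelVerdict{factCheckNeeded: true, confidence: 0.9}
+        }
+    }
+    return sentinelVerdict{factCheckNeeded: false, confidence: 0.9}
+}
+
+func main() {
+    const threshold = 0.6 // mirrors fact_check_model.threshold in the configuration reference
+    for _, prompt := range []string{
+        "When was the Eiffel Tower built?",
+        "Write a poem about autumn",
+    } {
+        v := classifyPrompt(prompt)
+        if v.factCheckNeeded && v.confidence >= threshold {
+            fmt.Printf("%q -> run Stage 2 detection\n", prompt)
+        } else {
+            fmt.Printf("%q -> skip detection, pass through\n", prompt)
+        }
+    }
+}
+```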
+
+#### Token-Level Hallucination Detection
+
+Unlike sentence-level classifiers that output a single "hallucinated/not hallucinated" label, **token-level detection** identifies *exactly which tokens* are unsupported by the context.
+
+![](/assets/figures/semantic-router/halugate-3.png)
+
+The model architecture:
+
+```text
+Input: [CLS] context [SEP] question [SEP] answer [SEP]
+         ↓
+   ModernBERT Encoder
+         ↓
+   Token Classification Head (Binary per token)
+         ↓
+Label: 0 = Supported, 1 = Hallucinated (for answer tokens only)
+```
+
+Key design decisions:
+- **Answer-only classification**: We only classify tokens in the answer segment, not context or question
+- **Span merging**: Consecutive hallucinated tokens are merged into spans for readability
+- **Confidence thresholding**: Configurable threshold (default 0.8) to balance precision/recall
+
+#### NLI Explanation Layer
+
+Knowing *that* something is hallucinated isn't enough—we need to know *why*. The NLI (Natural Language Inference) model classifies each detected span against the context:
+
+![](/assets/figures/semantic-router/halugate-4.png)
+
+| NLI Label | Meaning | Severity | Action |
+|-----------|---------|----------|--------|
+| **CONTRADICTION** | Claim conflicts with context | 4 (High) | Flag as error |
+| **NEUTRAL** | Claim not supported by context | 2 (Medium) | Flag as unverifiable |
+| **ENTAILMENT** | Context supports the claim | 0 | Filter false positive |
+
+**Why the ensemble works**: Token-level detection alone achieves only 59% F1 on the hallucinated class—nearly half of hallucinations are missed, and one-third of flags are false positives. We experimented with training a unified 5-class model (SUPPORTED/CONTRADICTION/FABRICATION/etc.), but it achieved only 21.7% F1—token-level classification simply cannot distinguish *why* something is wrong. The two-stage approach turns a mediocre detector into an actionable system: the LettuceDetect-style token detector provides recall (catching potential issues), while the NLI explainer provides precision (filtering false positives) and explainability (categorizing *why* each span is problematic).
+
+## Integration with Signal-Decision Architecture
+
+HaluGate doesn't operate in isolation—it's deeply integrated with our [Signal-Decision Architecture](https://blog.vllm.ai/2025/11/19/signal-decision.html) as a new signal type and plugin.
+
+### `fact_check` as a Signal Type
+
+Just as we have keyword, embedding, and domain signals, `fact_check` is now a first-class signal type:
+
+![](/assets/figures/semantic-router/halugate-5.png)
+
+> **Note**: Even frontier models show hallucination variance between releases. For example, [GPT-5.2's system card](https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf) demonstrates measurable hallucination delta compared to previous versions, highlighting the importance of continuous verification regardless of model sophistication.
+
+This allows decisions to be conditioned on whether the query is fact-seeking:
+
+```yaml
+decisions:
+  - name: "factual-query-with-verification"
+    priority: 100
+    rules:
+      operator: "AND"
+      conditions:
+        - type: "fact_check"
+          name: "needs_fact_check"
+        - type: "domain"
+          name: "general"
+    plugins:
+      - type: "hallucination"
+        configuration:
+          enabled: true
+          use_nli: true
+          hallucination_action: "header"
+```
+
+### Request-Response Context Propagation
+
+A key challenge: the classification happens at **request time**, but detection happens at **response time**. We need to propagate state across this boundary.
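+
+To make the propagation concrete, here is a simplified Go sketch: a single per-request struct that the request filter writes and the response filter later reads. All names are illustrative; the state the router actually carries is listed right after:
+
+```go
+package main
+
+import "fmt"
+
+// requestState is an illustrative, trimmed-down stand-in for the router's
+// RequestContext: one struct that lives for the whole request/response cycle.
+type requestState struct {
+    // Written at request time.
+    factCheckNeeded    bool
+    toolResultsContext string
+    userContent        string
+
+    // Written at response time.
+    hallucinationSpans []string
+}
+
+// onRequest runs before the LLM call: classify the prompt and stash the
+// tool results that will serve as ground truth later.
+func onRequest(s *requestState, user, toolJSON string) {
+    s.userContent = user
+    s.toolResultsContext = toolJSON
+    s.factCheckNeeded = true // pretend the Sentinel fired
+}
+
+// onResponse runs after the LLM call: verify only if the request-time
+// classification asked for it and grounding context exists.
+func onResponse(s *requestState, answer string) {
+    if !s.factCheckNeeded || s.toolResultsContext == "" {
+        return
+    }
+    // The detector and NLI explainer would run on answer here;
+    // hard-code a finding for the demo.
+    s.hallucinationSpans = []string{"1950", "500 meters"}
+}
+
+func main() {
+    s := &requestState{}
+    onRequest(s, "When was the Eiffel Tower built?", `{"built": "1887-1889"}`)
+    onResponse(s, "The Eiffel Tower was built in 1950 and is 500 meters tall.")
+    fmt.Println("flagged spans:", s.hallucinationSpans)
+}
+```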
+ +![](/assets/figures/semantic-router/halugate-6.png) + +The `RequestContext` structure carries all necessary state: + +```yaml +RequestContext: + # Classification results (set at request time) + FactCheckNeeded: true + FactCheckConfidence: 0.87 + + # Tool context (extracted at request time) + HasToolsForFactCheck: true + ToolResultsContext: "Built 1887-1889, 330 meters..." + UserContent: "When was the Eiffel Tower built?" + + # Detection results (set at response time) + HallucinationDetected: true + HallucinationSpans: ["1950", "500 meters"] + HallucinationConfidence: 0.92 +``` + +### The `hallucination` Plugin + +The hallucination plugin is configured per-decision, allowing fine-grained control: + +```yaml +plugins: + - type: "hallucination" + configuration: + enabled: true + use_nli: true # Enable NLI explanations + + # Action when hallucination detected + hallucination_action: "header" # "header" | "body" | "block" | "none" + + # Action when fact-check needed but no tool context + unverified_factual_action: "header" + + # Include detailed info in response + include_hallucination_details: true +``` + +| Action | Behavior | +|--------|----------| +| `header` | Add warning headers, pass response through | +| `body` | Inject warning into response body | +| `block` | Return error response, don't forward LLM output | +| `none` | Log only, no user-visible action | + +## Response Headers: Actionable Transparency + +Detection results are communicated via HTTP headers, enabling downstream systems to implement custom policies: + +```http +HTTP/1.1 200 OK +Content-Type: application/json +x-vsr-fact-check-needed: true +x-vsr-hallucination-detected: true +x-vsr-hallucination-spans: 1950; 500 meters +x-vsr-nli-contradictions: 2 +x-vsr-max-severity: 4 +``` + +For unverified factual responses (when tools aren't available): + +```http +HTTP/1.1 200 OK +x-vsr-fact-check-needed: true +x-vsr-unverified-factual-response: true +x-vsr-verification-context-missing: true +``` + +These headers enable: + +- **UI Disclaimers**: Show warnings to users when confidence is low +- **Human Review Queues**: Route flagged responses for manual review +- **Audit Logging**: Track unverified claims for compliance +- **Conditional Blocking**: Block high-severity contradictions + +## The Complete Pipeline: Three Paths + +![](/assets/figures/semantic-router/halugate-7.png) + +| Path | Condition | Latency Added | Action | +|------|-----------|---------------|--------| +| **Path 1** | Non-factual prompt | ~12ms (classifier only) | Pass through | +| **Path 2** | Factual + No tools | ~12ms | Add warning headers | +| **Path 3** | Factual + Tools available | 76-162ms | Full detection + headers | + +## Model Architecture Deep Dive + +Let's look at the three models that power HaluGate: + +![](/assets/figures/semantic-router/halugate-8.png) + +### HaluGate Sentinel: Binary Prompt Classification + +**Architecture**: ModernBERT-base + LoRA adapter + binary classification head + +**Training**: + +- **Base Model**: `answerdotai/ModernBERT-base` +- **Fine-tuning**: LoRA (rank=16, alpha=32, dropout=0.1) +- **Training Data**: 50,000 samples from 14 datasets +- **Loss**: CrossEntropy with class weights (handle imbalance) +- **Optimization**: AdamW, lr=2e-5, 3 epochs + +**Inference**: + +- **Input**: Raw prompt text +- **Output**: (class_id, confidence) +- **Latency**: ~12ms on CPU + +The LoRA approach allows efficient fine-tuning while preserving the pretrained knowledge. Only 2.2% of parameters (3.4M out of 149M) are updated during training. 
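+
+As a back-of-the-envelope check on that figure (our arithmetic, not a number from the model card): LoRA learns each adapted weight update as a low-rank product, so the trainable count per matrix is a small fraction of the matrix itself:
+
+```latex
+\Delta W = BA, \qquad W \in \mathbb{R}^{d \times k},\quad
+B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
+
+\text{trainable params per matrix} = r(d + k) \ll dk
+```
+
+With r = 16 and a square 768×768 projection, that is 16 × 1536 = 24,576 trainable parameters against 589,824 frozen ones, roughly 4% per adapted matrix. Summed over the adapted projections plus the classification head, a few million trainable parameters out of 149M is consistent with the 3.4M reported above.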
+ +### HaluGate Detector: Token-Level Binary Classification + +**Architecture**: ModernBERT-base + token classification head + +**Input Format**: + +```text +[CLS] The Eiffel Tower was built in 1887-1889 and is 330 meters tall. +[SEP] When was the Eiffel Tower built? +[SEP] The Eiffel Tower was built in 1950 and is 500 meters tall. [SEP] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + Answer tokens (classification targets) +``` + +**Output**: Binary label (0=Supported, 1=Hallucinated) for each answer token + +**Post-processing**: + +1. Filter predictions to answer segment only +2. Apply confidence threshold (default: 0.8) +3. Merge consecutive hallucinated tokens into spans +4. Return spans with confidence scores + +### HaluGate Explainer: Three-Way NLI Classification + +**Architecture**: ModernBERT-base fine-tuned on NLI + +**Input Format**: + +```text +[CLS] The Eiffel Tower was built in 1887-1889. [SEP] built in 1950 [SEP] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + Premise (context) Hypothesis (span) +``` + +**Output**: Three-way classification with confidence: + +- **ENTAILMENT** (0): Context supports the claim +- **NEUTRAL** (1): Cannot be determined from context +- **CONTRADICTION** (2): Context conflicts with claim + +**Severity Mapping**: + +| NLI Label | Severity Score | Interpretation | +|-----------|---------------|----------------| +| ENTAILMENT | 0 | Likely false positive—filter out | +| NEUTRAL | 2 | Claim is unverifiable | +| CONTRADICTION | 4 | Direct factual error | + +## Why Native Rust/Candle Matters + +All three models run natively via **Candle** (Hugging Face's Rust ML framework) with CGO bindings to Go: + +![](/assets/figures/semantic-router/halugate-9.png) + +Benefits of this approach: + +| Aspect | Python (PyTorch) | Native (Candle) | +|--------|------------------|-----------------| +| **Cold start** | 5-10s | <500ms | +| **Memory** | 2-4GB per model | 500MB-1GB per model | +| **Latency** | +50-100ms overhead | Near-zero overhead | +| **Deployment** | Python runtime required | Single binary | +| **Scaling** | GIL contention | True parallelism | + +This eliminates the need for a separate Python service, sidecars, or model servers—everything runs in-process. + +### Latency Breakdown + +Here's the measured latency for each component in the production pipeline: + +| Component | P50 | P99 | Notes | +|-----------|-----|-----|-------| +| Fact-check classifier | 12ms | 28ms | ModernBERT inference | +| Tool context extraction | 1ms | 3ms | JSON parsing | +| Hallucination detector | 45ms | 89ms | Token classification | +| NLI explainer | 18ms | 42ms | Per-span classification | +| **Total overhead** | **76ms** | **162ms** | When detection runs | + +The total overhead (76-162ms) is negligible compared to typical LLM generation times (5-30 seconds), making HaluGate practical for synchronous request processing. 
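+
+Before moving on to configuration, here is a minimal Go sketch of the detector's threshold-and-merge post-processing described above (types and probabilities are illustrative, not the router's internals):
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// tokenPred is one answer token with its hallucination probability
+// from the token classification head.
+type tokenPred struct {
+    text string
+    prob float64
+}
+
+// mergeSpans applies the confidence threshold and merges consecutive
+// flagged tokens into readable spans (post-processing steps 2 and 3).
+func mergeSpans(preds []tokenPred, threshold float64) []string {
+    var spans, current []string
+    flush := func() {
+        if len(current) > 0 {
+            spans = append(spans, strings.Join(current, " "))
+            current = nil
+        }
+    }
+    for _, p := range preds {
+        if p.prob >= threshold {
+            current = append(current, p.text)
+        } else {
+            flush()
+        }
+    }
+    flush()
+    return spans
+}
+
+func main() {
+    answer := []tokenPred{
+        {"built", 0.05}, {"in", 0.10}, {"1950", 0.97},
+        {"and", 0.08}, {"500", 0.95}, {"meters", 0.93}, {"tall", 0.40},
+    }
+    // Default 0.8 threshold yields two spans: "1950" and "500 meters".
+    fmt.Println(mergeSpans(answer, 0.8))
+}
+```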
+ +## Configuration Reference + +Complete configuration for hallucination mitigation: + +```yaml +# Model configuration +hallucination_mitigation: + # Stage 1: Prompt classification + fact_check_model: + model_id: "models/halugate-sentinel" + threshold: 0.6 # Confidence threshold for FACT_CHECK_NEEDED + use_cpu: true + + # Stage 2a: Token-level detection + hallucination_model: + model_id: "models/halugate-detector" + threshold: 0.8 # Token confidence threshold + use_cpu: true + + # Stage 2b: NLI explanation + nli_model: + model_id: "models/halugate-explainer" + threshold: 0.9 # NLI confidence threshold + use_cpu: true + +# Signal rules for fact-check classification +fact_check_rules: + - name: needs_fact_check + description: "Query contains factual claims that should be verified" + - name: no_fact_check_needed + description: "Query is creative, code-related, or opinion-based" + +# Decision with hallucination plugin +decisions: + - name: "verified-factual" + priority: 100 + rules: + operator: "AND" + conditions: + - type: "fact_check" + name: "needs_fact_check" + plugins: + - type: "hallucination" + configuration: + enabled: true + use_nli: true + hallucination_action: "header" + unverified_factual_action: "header" + include_hallucination_details: true +``` + +## Beyond Production: HaluGate as an Evaluation Framework + +While HaluGate is designed for real-time production use, the same pipeline can power **offline model evaluation**. Instead of intercepting live requests, we feed benchmark datasets through the detection pipeline to systematically measure hallucination rates across models. + +![](/assets/figures/semantic-router/halugate-10.png) + +### Evaluation Workflow + +The evaluation framework treats HaluGate as a hallucination scorer: + +1. **Load Dataset**: Use existing QA/RAG benchmarks (TriviaQA, Natural Questions, HotpotQA) or custom enterprise datasets with context-question pairs +2. **Generate Responses**: Run the model under test against each query with provided context +3. **Detect Hallucinations**: Pass (context, query, response) triples through HaluGate Detector +4. **Classify Severity**: Use HaluGate Explainer to categorize each flagged span +5. **Aggregate Metrics**: Compute hallucination rates, contradiction ratios, and per-category breakdowns + +## Limitations and Scope + +HaluGate specifically targets **extrinsic hallucinations**—where tool/RAG context provides grounding for verification. It has known limitations: + +### What HaluGate Cannot Detect + +| Limitation | Example | Reason | +|------------|---------|--------| +| **Intrinsic hallucinations** | Model says "Einstein was born in 1900" without any tool call | No context to verify against | +| **No-context scenarios** | User asks factual question, no tools defined | Missing ground truth | + +### Transparent Degradation + +For requests classified as fact-seeking but lacking tool context, we explicitly flag responses as "unverified factual" rather than silently passing them through: + +```http +x-vsr-fact-check-needed: true +x-vsr-unverified-factual-response: true +x-vsr-verification-context-missing: true +``` + +This transparency allows downstream systems to handle uncertainty appropriately. 
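+
+As one example of such handling, a gateway sitting in front of the router could translate these headers into policy. The header names below come from this post; the policy logic itself is a hypothetical sketch, not shipped behavior:
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/http"
+)
+
+// policyFor maps HaluGate's response headers to a downstream action.
+// Header names match the post; the decision rules are illustrative.
+func policyFor(h http.Header) string {
+    if h.Get("x-vsr-unverified-factual-response") == "true" {
+        return "attach UI disclaimer: answer could not be verified"
+    }
+    if h.Get("x-vsr-hallucination-detected") == "true" {
+        if h.Get("x-vsr-max-severity") == "4" { // an NLI CONTRADICTION was found
+            return "block and enqueue for human review"
+        }
+        return "pass through, log spans for audit"
+    }
+    return "pass through"
+}
+
+func main() {
+    h := http.Header{}
+    h.Set("x-vsr-hallucination-detected", "true")
+    h.Set("x-vsr-max-severity", "4")
+    fmt.Println(policyFor(h)) // block and enqueue for human review
+}
+```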
+
+## Acknowledgments
+
+HaluGate builds on excellent work from the research community:
+
+- **Token-level detection architecture**: Inspired by [LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect) from KRLabs—pioneering work in ModernBERT-based hallucination detection
+- **NLI models**: Built on [tasksource/ModernBERT-base-nli](https://huggingface.co/tasksource/ModernBERT-base-nli)—high-quality NLI fine-tuning
+- **Training datasets**: TruthfulQA, HaluEval, FaithDial, RAGTruth, and other publicly available benchmarks
+
+We're grateful to these teams for advancing the field of hallucination detection.
+
+## Conclusion
+
+HaluGate brings principled hallucination detection to production LLM deployments:
+
+- **Conditional verification**: Skip non-factual queries, verify factual ones
+- **Token-level precision**: Know exactly which claims are unsupported
+- **Explainable results**: NLI classification tells you *why* something is wrong
+- **Near-zero integration overhead**: Native in-process Rust inference, no Python sidecars
+- **Actionable transparency**: Headers enable downstream policy enforcement
+
+The next time your LLM calls a tool, receives accurate data, and still gets the answer wrong—HaluGate will catch it before your users do.
+
+---
+
+**Resources**:
+
+- [Signal-Decision Architecture Blog](https://blog.vllm.ai/2025/11/19/signal-decision.html)
+- [vLLM Semantic Router GitHub Repo](https://github.com/vllm-project/semantic-router)
+- [vLLM Semantic Router Documentation](https://vllm-semantic-router.com)
+
+**Join the discussion**: Share your use cases and feedback in the #semantic-router channel on vLLM Slack
diff --git a/assets/figures/semantic-router/halugate-0.png b/assets/figures/semantic-router/halugate-0.png
new file mode 100644
index 0000000..2d89d75
Binary files /dev/null and b/assets/figures/semantic-router/halugate-0.png differ
diff --git a/assets/figures/semantic-router/halugate-1.png b/assets/figures/semantic-router/halugate-1.png
new file mode 100644
index 0000000..870af99
Binary files /dev/null and b/assets/figures/semantic-router/halugate-1.png differ
diff --git a/assets/figures/semantic-router/halugate-10.png b/assets/figures/semantic-router/halugate-10.png
new file mode 100644
index 0000000..eb6a4b7
Binary files /dev/null and b/assets/figures/semantic-router/halugate-10.png differ
diff --git a/assets/figures/semantic-router/halugate-2.png b/assets/figures/semantic-router/halugate-2.png
new file mode 100644
index 0000000..1906fd1
Binary files /dev/null and b/assets/figures/semantic-router/halugate-2.png differ
diff --git a/assets/figures/semantic-router/halugate-3.png b/assets/figures/semantic-router/halugate-3.png
new file mode 100644
index 0000000..3ad66d6
Binary files /dev/null and b/assets/figures/semantic-router/halugate-3.png differ
diff --git a/assets/figures/semantic-router/halugate-4.png b/assets/figures/semantic-router/halugate-4.png
new file mode 100644
index 0000000..5403c57
Binary files /dev/null and b/assets/figures/semantic-router/halugate-4.png differ
diff --git a/assets/figures/semantic-router/halugate-5.png b/assets/figures/semantic-router/halugate-5.png
new file mode 100644
index 0000000..166cad1
Binary files /dev/null and b/assets/figures/semantic-router/halugate-5.png differ
diff --git a/assets/figures/semantic-router/halugate-6.png b/assets/figures/semantic-router/halugate-6.png
new file mode 100644
index 0000000..ee8c7b9
Binary files /dev/null and b/assets/figures/semantic-router/halugate-6.png differ
diff --git
a/assets/figures/semantic-router/halugate-7.png b/assets/figures/semantic-router/halugate-7.png new file mode 100644 index 0000000..d6dfb6c Binary files /dev/null and b/assets/figures/semantic-router/halugate-7.png differ diff --git a/assets/figures/semantic-router/halugate-8.png b/assets/figures/semantic-router/halugate-8.png new file mode 100644 index 0000000..715c319 Binary files /dev/null and b/assets/figures/semantic-router/halugate-8.png differ diff --git a/assets/figures/semantic-router/halugate-9.png b/assets/figures/semantic-router/halugate-9.png new file mode 100644 index 0000000..3dcf90a Binary files /dev/null and b/assets/figures/semantic-router/halugate-9.png differ