diff --git a/_posts/2025-12-14-halugate.md b/_posts/2025-12-14-halugate.md new file mode 100644 index 0000000..14b99d9 --- /dev/null +++ b/_posts/2025-12-14-halugate.md @@ -0,0 +1,510 @@ +--- +layout: post +title: "Token-Level Truth: Real-Time Hallucination Detection for Production LLMs" +author: "vLLM Semantic Router Team" +image: /assets/logos/vllm-logo-text-light.png +--- + +Your LLM just called a tool, received accurate data, and still got the answer wrong. Welcome to the world of **extrinsic hallucination**—where models confidently ignore the ground truth sitting right in front of them. + +Building on our [Signal-Decision Architecture](https://blog.vllm.ai/2025/11/19/signal-decision.html), we introduce **HaluGate**—a conditional, token-level hallucination detection pipeline that catches unsupported claims *before* they reach your users. No LLM-as-judge. No Python runtime. Just fast, explainable verification at the point of delivery. + +## The Problem: Hallucinations Block Production Deployment + +Hallucinations have become the single biggest barrier to deploying LLMs in production. Across industries—**legal** (fabricated case citations), **healthcare** (incorrect drug interactions), **finance** (invented financial data), **customer service** (non-existent policies)—the pattern is the same: AI generates plausible-sounding content that appears authoritative but crumbles under scrutiny. + +The challenge isn't obvious nonsense. It's *subtle fabrications embedded in otherwise accurate responses*—errors that require domain expertise or external verification to catch. For enterprises, this uncertainty makes LLM deployment a liability rather than an asset. + +## The Scenario: When Tools Work But Models Don't + +Let's make this concrete. Consider a typical function-calling interaction: + +> **User**: "When was the Eiffel Tower built?" +> +> **Tool Call**: `get_landmark_info("Eiffel Tower")` +> +> **Tool Response**: `{"name": "Eiffel Tower", "built": "1887-1889", "height": "330 meters", "location": "Paris, France"}` +> +> **LLM Response**: "The Eiffel Tower was **built in 1950** and stands at **500 meters** tall in Paris, France." + +The tool returned correct data. The model's response contains facts. But two of those "facts" are fabricated—**extrinsic hallucinations** that directly contradict the provided context. + +This failure mode is particularly insidious: + +- **Users trust it** because they see the tool was called +- **Traditional filters miss it** because there's no toxic or harmful content +- **Evaluation is expensive** if you rely on another LLM to judge + +What if we could detect these errors automatically, in real-time, with millisecond latency? + +## The Insight: Function Calling as Ground Truth + +Here's the key realization: **modern function-calling APIs already provide grounding context**. When users ask factual questions, models call tools—database lookups, API calls, document retrieval. These tool results are semantically equivalent to retrieved documents in RAG. + +![](/assets/figures/semantic-router/halugate-0.png) + +We don't need to build separate retrieval infrastructure. We don't need to call GPT-4 as a judge. 
We extract three components from the existing API flow: + +| Component | Source | Purpose | +|-----------|--------|---------| +| **Context** | Tool message content | Ground truth for verification | +| **Question** | User message | Intent understanding | +| **Answer** | Assistant response | Claims to verify | + +The question becomes: **Is the answer faithful to the context?** + +## Why Not Just Use LLM-as-Judge? + +The obvious solution—call another LLM to verify—has fundamental problems in production: + +| Approach | Latency | Cost | Explainability | +|----------|---------|------|----------------| +| GPT-4 as judge | 2-5 seconds | $0.01-0.03/request | Low (black box) | +| Local LLM judge | 500ms-2s | GPU compute | Low | +| **HaluGate** | **76-162ms** | **CPU only** | **High (token-level + NLI)** | + +LLM judges also suffer from: +- **Position bias**: Tendency to favor certain answer positions +- **Verbosity bias**: Longer answers rated higher regardless of accuracy +- **Self-preference**: Models favor outputs similar to their own style +- **Inconsistency**: Same input can yield different judgments + +We needed something faster, cheaper, and more explainable. + +## HaluGate: A Two-Stage Detection Pipeline + +HaluGate implements a **conditional two-stage pipeline** that balances efficiency with precision: + +![](/assets/figures/semantic-router/halugate-1.png) + +### Stage 1: HaluGate Sentinel (Prompt Classification) + +Not every query needs hallucination detection. Consider these prompts: + +| Prompt | Needs Fact-Check? | Reason | +|--------|-------------------|--------| +| "When was Einstein born?" | ✅ Yes | Verifiable fact | +| "Write a poem about autumn" | ❌ No | Creative task | +| "Debug this Python code" | ❌ No | Technical assistance | +| "What's your opinion on AI?" | ❌ No | Opinion request | +| "Is the Earth round?" | ✅ Yes | Factual claim | + +Running token-level detection on creative writing or code review is wasteful—and potentially produces false positives ("your poem contains unsupported claims!"). + +**Why pre-classification matters**: Token-level detection scales linearly with context length. For a 4K token RAG context, detection takes ~125ms; for 16K tokens, ~365ms. In production workloads where ~35% of queries are non-factual, pre-classification achieves a **72.2% efficiency gain**—skipping expensive detection entirely for creative, coding, and opinion queries. + +[HaluGate Sentinel](https://huggingface.co/llm-semantic-router/halugate-sentinel) is a ModernBERT-based classifier that answers one question: *Does this prompt warrant factual verification?* + +![](/assets/figures/semantic-router/halugate-2.png) + +The model is trained on a carefully curated mix of: + +**Fact-Check Needed (Positive Class)**: +- **Question Answering**: SQuAD, TriviaQA, Natural Questions, HotpotQA +- **Truthfulness**: TruthfulQA (common misconceptions) +- **Hallucination Benchmarks**: HaluEval, FactCHD +- **Information-Seeking Dialogue**: FaithDial, CoQA +- **RAG Datasets**: neural-bridge/rag-dataset-12000 + +**No Fact-Check Needed (Negative Class)**: +- **Creative Writing**: WritingPrompts, story generation +- **Code**: CodeSearchNet docstrings, programming tasks +- **Opinion/Instruction**: Dolly non-factual, Alpaca creative + +This binary classification achieves **96.4% validation accuracy** with **~12ms inference latency** via native Rust/Candle integration. + +### Stage 2: Token-Level Detection + NLI Explanation + +For prompts classified as fact-seeking, we run a two-model detection pipeline. 
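+
+To make the gating concrete, here is a minimal Go sketch of the conditional flow. The `classifyPrompt` heuristic is a toy stand-in for the Sentinel model (a real deployment invokes the ModernBERT classifier through the Candle bindings), and the names are illustrative rather than the router's actual API:
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// sentinelVerdict mirrors the Sentinel's binary output: a class plus a confidence.
+type sentinelVerdict struct {
+    factCheckNeeded bool
+    confidence      float64
+}
+
+// classifyPrompt fakes the Sentinel with a keyword heuristic; in production
+// this is a ~12ms ModernBERT forward pass on CPU.
+func classifyPrompt(prompt string) sentinelVerdict {
+    p := strings.ToLower(prompt)
+    for _, cue := range []string{"when was", "who", "how many", "is the"} {
+        if strings.Contains(p, cue) {
+            return sentinelVerdict{factCheckNeeded: true, confidence: 0.9}
+        }
+    }
+    return sentinelVerdict{factCheckNeeded: false, confidence: 0.9}
+}
+
+func main() {
+    const threshold = 0.6 // mirrors fact_check_model.threshold in the configuration reference
+    for _, prompt := range []string{
+        "When was the Eiffel Tower built?",
+        "Write a poem about autumn",
+    } {
+        v := classifyPrompt(prompt)
+        if v.factCheckNeeded && v.confidence >= threshold {
+            fmt.Printf("%q -> run Stage 2 detection\n", prompt)
+        } else {
+            fmt.Printf("%q -> skip detection, pass through\n", prompt)
+        }
+    }
+}
+```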
+
+#### Token-Level Hallucination Detection
+
+Unlike sentence-level classifiers that output a single "hallucinated/not hallucinated" label, **token-level detection** identifies *exactly which tokens* are unsupported by the context.
+
+![](/assets/figures/semantic-router/halugate-3.png)
+
+The model architecture:
+
+```text
+Input: [CLS] context [SEP] question [SEP] answer [SEP]
+         ↓
+   ModernBERT Encoder
+         ↓
+   Token Classification Head (Binary per token)
+         ↓
+Label: 0 = Supported, 1 = Hallucinated (for answer tokens only)
+```
+
+Key design decisions:
+- **Answer-only classification**: We only classify tokens in the answer segment, not context or question
+- **Span merging**: Consecutive hallucinated tokens are merged into spans for readability
+- **Confidence thresholding**: Configurable threshold (default 0.8) to balance precision/recall
+
+#### NLI Explanation Layer
+
+Knowing *that* something is hallucinated isn't enough—we need to know *why*. The NLI (Natural Language Inference) model classifies each detected span against the context:
+
+![](/assets/figures/semantic-router/halugate-4.png)
+
+| NLI Label | Meaning | Severity | Action |
+|-----------|---------|----------|--------|
+| **CONTRADICTION** | Claim conflicts with context | 4 (High) | Flag as error |
+| **NEUTRAL** | Claim not supported by context | 2 (Medium) | Flag as unverifiable |
+| **ENTAILMENT** | Context supports the claim | 0 | Filter false positive |
+
+**Why the ensemble works**: Token-level detection alone achieves only 59% F1 on the hallucinated class—nearly half of hallucinations are missed, and one-third of flags are false positives. We experimented with training a unified 5-class model (SUPPORTED/CONTRADICTION/FABRICATION/etc.), but it achieved only 21.7% F1—token-level classification simply cannot distinguish *why* something is wrong. The two-stage approach turns a mediocre detector into an actionable system: the LettuceDetect-style token detector provides recall (catching potential issues), while the NLI explainer provides precision (filtering false positives) and explainability (categorizing *why* each span is problematic).
+
+## Integration with Signal-Decision Architecture
+
+HaluGate doesn't operate in isolation—it's deeply integrated with our [Signal-Decision Architecture](https://blog.vllm.ai/2025/11/19/signal-decision.html) as a new signal type and plugin.
+
+### `fact_check` as a Signal Type
+
+Just as we have keyword, embedding, and domain signals, `fact_check` is now a first-class signal type:
+
+![](/assets/figures/semantic-router/halugate-5.png)
+
+> **Note**: Even frontier models show hallucination variance between releases. For example, [GPT-5.2's system card](https://cdn.openai.com/pdf/3a4153c8-c748-4b71-8e31-aecbde944f8d/oai_5_2_system-card.pdf) demonstrates measurable hallucination delta compared to previous versions, highlighting the importance of continuous verification regardless of model sophistication.
+
+This allows decisions to be conditioned on whether the query is fact-seeking:
+
+```yaml
+decisions:
+  - name: "factual-query-with-verification"
+    priority: 100
+    rules:
+      operator: "AND"
+      conditions:
+        - type: "fact_check"
+          name: "needs_fact_check"
+        - type: "domain"
+          name: "general"
+    plugins:
+      - type: "hallucination"
+        configuration:
+          enabled: true
+          use_nli: true
+          hallucination_action: "header"
+```
+
+### Request-Response Context Propagation
+
+A key challenge: the classification happens at **request time**, but detection happens at **response time**. We need to propagate state across this boundary.
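+
+To make the propagation concrete, here is a simplified Go sketch: a single per-request struct that the request filter writes and the response filter later reads. All names are illustrative; the state the router actually carries is listed right after:
+
+```go
+package main
+
+import "fmt"
+
+// requestState is an illustrative, trimmed-down stand-in for the router's
+// RequestContext: one struct that lives for the whole request/response cycle.
+type requestState struct {
+    // Written at request time.
+    factCheckNeeded    bool
+    toolResultsContext string
+    userContent        string
+
+    // Written at response time.
+    hallucinationSpans []string
+}
+
+// onRequest runs before the LLM call: classify the prompt and stash the
+// tool results that will serve as ground truth later.
+func onRequest(s *requestState, user, toolJSON string) {
+    s.userContent = user
+    s.toolResultsContext = toolJSON
+    s.factCheckNeeded = true // pretend the Sentinel fired
+}
+
+// onResponse runs after the LLM call: verify only if the request-time
+// classification asked for it and grounding context exists.
+func onResponse(s *requestState, answer string) {
+    if !s.factCheckNeeded || s.toolResultsContext == "" {
+        return
+    }
+    // The detector and NLI explainer would run on answer here;
+    // hard-code a finding for the demo.
+    s.hallucinationSpans = []string{"1950", "500 meters"}
+}
+
+func main() {
+    s := &requestState{}
+    onRequest(s, "When was the Eiffel Tower built?", `{"built": "1887-1889"}`)
+    onResponse(s, "The Eiffel Tower was built in 1950 and is 500 meters tall.")
+    fmt.Println("flagged spans:", s.hallucinationSpans)
+}
+```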
+ +![](/assets/figures/semantic-router/halugate-6.png) + +The `RequestContext` structure carries all necessary state: + +```yaml +RequestContext: + # Classification results (set at request time) + FactCheckNeeded: true + FactCheckConfidence: 0.87 + + # Tool context (extracted at request time) + HasToolsForFactCheck: true + ToolResultsContext: "Built 1887-1889, 330 meters..." + UserContent: "When was the Eiffel Tower built?" + + # Detection results (set at response time) + HallucinationDetected: true + HallucinationSpans: ["1950", "500 meters"] + HallucinationConfidence: 0.92 +``` + +### The `hallucination` Plugin + +The hallucination plugin is configured per-decision, allowing fine-grained control: + +```yaml +plugins: + - type: "hallucination" + configuration: + enabled: true + use_nli: true # Enable NLI explanations + + # Action when hallucination detected + hallucination_action: "header" # "header" | "body" | "block" | "none" + + # Action when fact-check needed but no tool context + unverified_factual_action: "header" + + # Include detailed info in response + include_hallucination_details: true +``` + +| Action | Behavior | +|--------|----------| +| `header` | Add warning headers, pass response through | +| `body` | Inject warning into response body | +| `block` | Return error response, don't forward LLM output | +| `none` | Log only, no user-visible action | + +## Response Headers: Actionable Transparency + +Detection results are communicated via HTTP headers, enabling downstream systems to implement custom policies: + +```http +HTTP/1.1 200 OK +Content-Type: application/json +x-vsr-fact-check-needed: true +x-vsr-hallucination-detected: true +x-vsr-hallucination-spans: 1950; 500 meters +x-vsr-nli-contradictions: 2 +x-vsr-max-severity: 4 +``` + +For unverified factual responses (when tools aren't available): + +```http +HTTP/1.1 200 OK +x-vsr-fact-check-needed: true +x-vsr-unverified-factual-response: true +x-vsr-verification-context-missing: true +``` + +These headers enable: + +- **UI Disclaimers**: Show warnings to users when confidence is low +- **Human Review Queues**: Route flagged responses for manual review +- **Audit Logging**: Track unverified claims for compliance +- **Conditional Blocking**: Block high-severity contradictions + +## The Complete Pipeline: Three Paths + +![](/assets/figures/semantic-router/halugate-7.png) + +| Path | Condition | Latency Added | Action | +|------|-----------|---------------|--------| +| **Path 1** | Non-factual prompt | ~12ms (classifier only) | Pass through | +| **Path 2** | Factual + No tools | ~12ms | Add warning headers | +| **Path 3** | Factual + Tools available | 76-162ms | Full detection + headers | + +## Model Architecture Deep Dive + +Let's look at the three models that power HaluGate: + +![](/assets/figures/semantic-router/halugate-8.png) + +### HaluGate Sentinel: Binary Prompt Classification + +**Architecture**: ModernBERT-base + LoRA adapter + binary classification head + +**Training**: + +- **Base Model**: `answerdotai/ModernBERT-base` +- **Fine-tuning**: LoRA (rank=16, alpha=32, dropout=0.1) +- **Training Data**: 50,000 samples from 14 datasets +- **Loss**: CrossEntropy with class weights (handle imbalance) +- **Optimization**: AdamW, lr=2e-5, 3 epochs + +**Inference**: + +- **Input**: Raw prompt text +- **Output**: (class_id, confidence) +- **Latency**: ~12ms on CPU + +The LoRA approach allows efficient fine-tuning while preserving the pretrained knowledge. Only 2.2% of parameters (3.4M out of 149M) are updated during training. 
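+
+As a back-of-the-envelope check on that figure (our arithmetic, not a number from the model card): LoRA learns each adapted weight update as a low-rank product, so the trainable count per matrix is a small fraction of the matrix itself:
+
+```latex
+\Delta W = BA, \qquad W \in \mathbb{R}^{d \times k},\quad
+B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
+
+\text{trainable params per matrix} = r(d + k) \ll dk
+```
+
+With r = 16 and a square 768×768 projection, that is 16 × 1536 = 24,576 trainable parameters against 589,824 frozen ones, roughly 4% per adapted matrix. Summed over the adapted projections plus the classification head, a few million trainable parameters out of 149M is consistent with the 3.4M reported above.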
+ +### HaluGate Detector: Token-Level Binary Classification + +**Architecture**: ModernBERT-base + token classification head + +**Input Format**: + +```text +[CLS] The Eiffel Tower was built in 1887-1889 and is 330 meters tall. +[SEP] When was the Eiffel Tower built? +[SEP] The Eiffel Tower was built in 1950 and is 500 meters tall. [SEP] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + Answer tokens (classification targets) +``` + +**Output**: Binary label (0=Supported, 1=Hallucinated) for each answer token + +**Post-processing**: + +1. Filter predictions to answer segment only +2. Apply confidence threshold (default: 0.8) +3. Merge consecutive hallucinated tokens into spans +4. Return spans with confidence scores + +### HaluGate Explainer: Three-Way NLI Classification + +**Architecture**: ModernBERT-base fine-tuned on NLI + +**Input Format**: + +```text +[CLS] The Eiffel Tower was built in 1887-1889. [SEP] built in 1950 [SEP] + ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^ + Premise (context) Hypothesis (span) +``` + +**Output**: Three-way classification with confidence: + +- **ENTAILMENT** (0): Context supports the claim +- **NEUTRAL** (1): Cannot be determined from context +- **CONTRADICTION** (2): Context conflicts with claim + +**Severity Mapping**: + +| NLI Label | Severity Score | Interpretation | +|-----------|---------------|----------------| +| ENTAILMENT | 0 | Likely false positive—filter out | +| NEUTRAL | 2 | Claim is unverifiable | +| CONTRADICTION | 4 | Direct factual error | + +## Why Native Rust/Candle Matters + +All three models run natively via **Candle** (Hugging Face's Rust ML framework) with CGO bindings to Go: + +![](/assets/figures/semantic-router/halugate-9.png) + +Benefits of this approach: + +| Aspect | Python (PyTorch) | Native (Candle) | +|--------|------------------|-----------------| +| **Cold start** | 5-10s | <500ms | +| **Memory** | 2-4GB per model | 500MB-1GB per model | +| **Latency** | +50-100ms overhead | Near-zero overhead | +| **Deployment** | Python runtime required | Single binary | +| **Scaling** | GIL contention | True parallelism | + +This eliminates the need for a separate Python service, sidecars, or model servers—everything runs in-process. + +### Latency Breakdown + +Here's the measured latency for each component in the production pipeline: + +| Component | P50 | P99 | Notes | +|-----------|-----|-----|-------| +| Fact-check classifier | 12ms | 28ms | ModernBERT inference | +| Tool context extraction | 1ms | 3ms | JSON parsing | +| Hallucination detector | 45ms | 89ms | Token classification | +| NLI explainer | 18ms | 42ms | Per-span classification | +| **Total overhead** | **76ms** | **162ms** | When detection runs | + +The total overhead (76-162ms) is negligible compared to typical LLM generation times (5-30 seconds), making HaluGate practical for synchronous request processing. 
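+
+Before moving on to configuration, here is a minimal Go sketch of the detector's threshold-and-merge post-processing described above (types and probabilities are illustrative, not the router's internals):
+
+```go
+package main
+
+import (
+    "fmt"
+    "strings"
+)
+
+// tokenPred is one answer token with its hallucination probability
+// from the token classification head.
+type tokenPred struct {
+    text string
+    prob float64
+}
+
+// mergeSpans applies the confidence threshold and merges consecutive
+// flagged tokens into readable spans (post-processing steps 2 and 3).
+func mergeSpans(preds []tokenPred, threshold float64) []string {
+    var spans, current []string
+    flush := func() {
+        if len(current) > 0 {
+            spans = append(spans, strings.Join(current, " "))
+            current = nil
+        }
+    }
+    for _, p := range preds {
+        if p.prob >= threshold {
+            current = append(current, p.text)
+        } else {
+            flush()
+        }
+    }
+    flush()
+    return spans
+}
+
+func main() {
+    answer := []tokenPred{
+        {"built", 0.05}, {"in", 0.10}, {"1950", 0.97},
+        {"and", 0.08}, {"500", 0.95}, {"meters", 0.93}, {"tall", 0.40},
+    }
+    // Default 0.8 threshold yields two spans: "1950" and "500 meters".
+    fmt.Println(mergeSpans(answer, 0.8))
+}
+```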
+ +## Configuration Reference + +Complete configuration for hallucination mitigation: + +```yaml +# Model configuration +hallucination_mitigation: + # Stage 1: Prompt classification + fact_check_model: + model_id: "models/halugate-sentinel" + threshold: 0.6 # Confidence threshold for FACT_CHECK_NEEDED + use_cpu: true + + # Stage 2a: Token-level detection + hallucination_model: + model_id: "models/halugate-detector" + threshold: 0.8 # Token confidence threshold + use_cpu: true + + # Stage 2b: NLI explanation + nli_model: + model_id: "models/halugate-explainer" + threshold: 0.9 # NLI confidence threshold + use_cpu: true + +# Signal rules for fact-check classification +fact_check_rules: + - name: needs_fact_check + description: "Query contains factual claims that should be verified" + - name: no_fact_check_needed + description: "Query is creative, code-related, or opinion-based" + +# Decision with hallucination plugin +decisions: + - name: "verified-factual" + priority: 100 + rules: + operator: "AND" + conditions: + - type: "fact_check" + name: "needs_fact_check" + plugins: + - type: "hallucination" + configuration: + enabled: true + use_nli: true + hallucination_action: "header" + unverified_factual_action: "header" + include_hallucination_details: true +``` + +## Beyond Production: HaluGate as an Evaluation Framework + +While HaluGate is designed for real-time production use, the same pipeline can power **offline model evaluation**. Instead of intercepting live requests, we feed benchmark datasets through the detection pipeline to systematically measure hallucination rates across models. + +![](/assets/figures/semantic-router/halugate-10.png) + +### Evaluation Workflow + +The evaluation framework treats HaluGate as a hallucination scorer: + +1. **Load Dataset**: Use existing QA/RAG benchmarks (TriviaQA, Natural Questions, HotpotQA) or custom enterprise datasets with context-question pairs +2. **Generate Responses**: Run the model under test against each query with provided context +3. **Detect Hallucinations**: Pass (context, query, response) triples through HaluGate Detector +4. **Classify Severity**: Use HaluGate Explainer to categorize each flagged span +5. **Aggregate Metrics**: Compute hallucination rates, contradiction ratios, and per-category breakdowns + +## Limitations and Scope + +HaluGate specifically targets **extrinsic hallucinations**—where tool/RAG context provides grounding for verification. It has known limitations: + +### What HaluGate Cannot Detect + +| Limitation | Example | Reason | +|------------|---------|--------| +| **Intrinsic hallucinations** | Model says "Einstein was born in 1900" without any tool call | No context to verify against | +| **No-context scenarios** | User asks factual question, no tools defined | Missing ground truth | + +### Transparent Degradation + +For requests classified as fact-seeking but lacking tool context, we explicitly flag responses as "unverified factual" rather than silently passing them through: + +```http +x-vsr-fact-check-needed: true +x-vsr-unverified-factual-response: true +x-vsr-verification-context-missing: true +``` + +This transparency allows downstream systems to handle uncertainty appropriately. 
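+
+As one example of such handling, a gateway sitting in front of the router could translate these headers into policy. The header names below come from this post; the policy logic itself is a hypothetical sketch, not shipped behavior:
+
+```go
+package main
+
+import (
+    "fmt"
+    "net/http"
+)
+
+// policyFor maps HaluGate's response headers to a downstream action.
+// Header names match the post; the decision rules are illustrative.
+func policyFor(h http.Header) string {
+    if h.Get("x-vsr-unverified-factual-response") == "true" {
+        return "attach UI disclaimer: answer could not be verified"
+    }
+    if h.Get("x-vsr-hallucination-detected") == "true" {
+        if h.Get("x-vsr-max-severity") == "4" { // an NLI CONTRADICTION was found
+            return "block and enqueue for human review"
+        }
+        return "pass through, log spans for audit"
+    }
+    return "pass through"
+}
+
+func main() {
+    h := http.Header{}
+    h.Set("x-vsr-hallucination-detected", "true")
+    h.Set("x-vsr-max-severity", "4")
+    fmt.Println(policyFor(h)) // block and enqueue for human review
+}
+```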
+
+## Acknowledgments
+
+HaluGate builds on excellent work from the research community:
+
+- **Token-level detection architecture**: Inspired by [LettuceDetect](https://github.com/KRLabsOrg/LettuceDetect) from KRLabs—pioneering work in ModernBERT-based hallucination detection
+- **NLI models**: Built on [tasksource/ModernBERT-base-nli](https://huggingface.co/tasksource/ModernBERT-base-nli)—high-quality NLI fine-tuning
+- **Training datasets**: TruthfulQA, HaluEval, FaithDial, RAGTruth, and other publicly available benchmarks
+
+We're grateful to these teams for advancing the field of hallucination detection.
+
+## Conclusion
+
+HaluGate brings principled hallucination detection to production LLM deployments:
+
+- **Conditional verification**: Skip non-factual queries, verify factual ones
+- **Token-level precision**: Know exactly which claims are unsupported
+- **Explainable results**: NLI classification tells you *why* something is wrong
+- **Near-zero integration overhead**: Native in-process Rust inference, no Python sidecars
+- **Actionable transparency**: Headers enable downstream policy enforcement
+
+The next time your LLM calls a tool, receives accurate data, and still gets the answer wrong—HaluGate will catch it before your users do.
+
+---
+
+**Resources**:
+
+- [Signal-Decision Architecture Blog](https://blog.vllm.ai/2025/11/19/signal-decision.html)
+- [vLLM Semantic Router GitHub Repo](https://github.com/vllm-project/semantic-router)
+- [vLLM Semantic Router Documentation](https://vllm-semantic-router.com)
+
+**Join the discussion**: Share your use cases and feedback in the #semantic-router channel on vLLM Slack
diff --git a/assets/figures/semantic-router/halugate-0.png b/assets/figures/semantic-router/halugate-0.png
new file mode 100644
index 0000000..2d89d75
Binary files /dev/null and b/assets/figures/semantic-router/halugate-0.png differ
diff --git a/assets/figures/semantic-router/halugate-1.png b/assets/figures/semantic-router/halugate-1.png
new file mode 100644
index 0000000..870af99
Binary files /dev/null and b/assets/figures/semantic-router/halugate-1.png differ
diff --git a/assets/figures/semantic-router/halugate-10.png b/assets/figures/semantic-router/halugate-10.png
new file mode 100644
index 0000000..eb6a4b7
Binary files /dev/null and b/assets/figures/semantic-router/halugate-10.png differ
diff --git a/assets/figures/semantic-router/halugate-2.png b/assets/figures/semantic-router/halugate-2.png
new file mode 100644
index 0000000..1906fd1
Binary files /dev/null and b/assets/figures/semantic-router/halugate-2.png differ
diff --git a/assets/figures/semantic-router/halugate-3.png b/assets/figures/semantic-router/halugate-3.png
new file mode 100644
index 0000000..3ad66d6
Binary files /dev/null and b/assets/figures/semantic-router/halugate-3.png differ
diff --git a/assets/figures/semantic-router/halugate-4.png b/assets/figures/semantic-router/halugate-4.png
new file mode 100644
index 0000000..5403c57
Binary files /dev/null and b/assets/figures/semantic-router/halugate-4.png differ
diff --git a/assets/figures/semantic-router/halugate-5.png b/assets/figures/semantic-router/halugate-5.png
new file mode 100644
index 0000000..166cad1
Binary files /dev/null and b/assets/figures/semantic-router/halugate-5.png differ
diff --git a/assets/figures/semantic-router/halugate-6.png b/assets/figures/semantic-router/halugate-6.png
new file mode 100644
index 0000000..ee8c7b9
Binary files /dev/null and b/assets/figures/semantic-router/halugate-6.png differ
diff --git
a/assets/figures/semantic-router/halugate-7.png b/assets/figures/semantic-router/halugate-7.png new file mode 100644 index 0000000..d6dfb6c Binary files /dev/null and b/assets/figures/semantic-router/halugate-7.png differ diff --git a/assets/figures/semantic-router/halugate-8.png b/assets/figures/semantic-router/halugate-8.png new file mode 100644 index 0000000..715c319 Binary files /dev/null and b/assets/figures/semantic-router/halugate-8.png differ diff --git a/assets/figures/semantic-router/halugate-9.png b/assets/figures/semantic-router/halugate-9.png new file mode 100644 index 0000000..3dcf90a Binary files /dev/null and b/assets/figures/semantic-router/halugate-9.png differ