vllm-project
diff --git a/_posts/2025-12-12-halugate.md b/_posts/2025-12-12-halugate.md
Lines changed: 16 additions & 2 deletions
diff --git a/assets/figures/semantic-router/halugate-10.png b/assets/figures/semantic-router/halugate-10.png
439 KB
@@ -438,6 +438,22 @@ decisions:
           include_hallucination_details: true
 ```
 
+## Beyond Production: HaluGate as an Evaluation Framework
+
+While HaluGate is designed for real-time production use, the same pipeline can power **offline model evaluation**. Instead of intercepting live requests, we feed benchmark datasets through the detection pipeline to systematically measure hallucination rates across models.
+
+![](/assets/figures/semantic-router/halugate-10.png)
+
+### Evaluation Workflow
+
+The evaluation framework treats HaluGate as a hallucination scorer:
+
+1. **Load Dataset**: Use existing QA/RAG benchmarks (TriviaQA, Natural Questions, HotpotQA) or custom enterprise datasets with context-question pairs
+2. **Generate Responses**: Run the model under test against each query with the provided context
+3. **Detect Hallucinations**: Pass (context, query, response) triples through HaluGate Detector
+4. **Classify Severity**: Use HaluGate Explainer to categorize each flagged span
+5. **Aggregate Metrics**: Compute hallucination rates, contradiction ratios, and per-category breakdowns
+
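The aggregation step of the workflow above can be sketched as follows. This is a minimal illustration, not HaluGate's actual API: `Example`, `Scorer`, and `aggregate` are hypothetical names, the scorer is a caller-supplied stand-in for the Detector and Explainer stages, and the category labels (such as `"contradiction"`) are assumed. The contradiction ratio is assumed here to mean the share of flagged spans labeled as contradictions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Example:
    """One (context, query, response) triple from the benchmark run."""
    context: str
    query: str
    response: str


# Hypothetical scorer signature: returns one category label per flagged span.
# An empty list means no hallucination was detected in the response.
Scorer = Callable[[Example], list[str]]


def aggregate(examples: Iterable[Example], scorer: Scorer) -> dict:
    """Compute the hallucination rate, contradiction ratio, and per-category counts."""
    total = 0
    flagged = 0
    categories: Counter = Counter()
    for ex in examples:
        total += 1
        labels = scorer(ex)
        if labels:
            flagged += 1  # response contains at least one flagged span
        categories.update(labels)
    spans = sum(categories.values())
    return {
        # Fraction of responses with at least one flagged span.
        "hallucination_rate": flagged / total if total else 0.0,
        # Assumed definition: contradiction spans over all flagged spans.
        "contradiction_ratio": categories["contradiction"] / spans if spans else 0.0,
        "per_category": dict(categories),
    }
```

Running `aggregate` once per model over the same dataset, with the same scorer, gives comparable numbers across models; the scorer is the only part that would need to call the real detection pipeline.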
 ## Limitations and Scope
 
 HaluGate specifically targets **extrinsic hallucinations**—where tool/RAG context provides grounding for verification. It has known limitations:
@@ -448,8 +464,6 @@ HaluGate specifically targets **extrinsic hallucinations**—where tool/RAG cont
 |------------|---------|--------|
 | **Intrinsic hallucinations** | Model says "Einstein was born in 1900" without any tool call | No context to verify against |
 | **No-context scenarios** | User asks factual question, no tools defined | Missing ground truth |
-| **Subtle distortions** | "approximately 330 meters" vs "exactly 330 meters" | Semantic similarity too high |
-| **Multi-hop reasoning** | Inference chains beyond direct context | Single-hop verification only |
 
 ### Transparent Degradation