Commit c9e9d3e

review
Signed-off-by: bitliu <[email protected]>
1 parent 8bf8496 commit c9e9d3e

2 files changed: +16 -2 lines changed

_posts/2025-12-12-halugate.md

Lines changed: 16 additions & 2 deletions
@@ -438,6 +438,22 @@ decisions:
 include_hallucination_details: true
 ```
 
+## Beyond Production: HaluGate as an Evaluation Framework
+
+While HaluGate is designed for real-time production use, the same pipeline can power **offline model evaluation**. Instead of intercepting live requests, we feed benchmark datasets through the detection pipeline to systematically measure hallucination rates across models.
+
+![](/assets/figures/semantic-router/halugate-10.png)
+
+### Evaluation Workflow
+
+The evaluation framework treats HaluGate as a hallucination scorer:
+
+1. **Load Dataset**: Use existing QA/RAG benchmarks (TriviaQA, Natural Questions, HotpotQA) or custom enterprise datasets with context-question pairs
+2. **Generate Responses**: Run the model under test against each query with provided context
+3. **Detect Hallucinations**: Pass (context, query, response) triples through HaluGate Detector
+4. **Classify Severity**: Use HaluGate Explainer to categorize each flagged span
+5. **Aggregate Metrics**: Compute hallucination rates, contradiction ratios, and per-category breakdowns
+
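As a rough illustration of step 5 in the workflow added by this commit, the sketch below aggregates detector output into the three metrics the post names. It is not HaluGate's actual API: the `FlaggedSpan` and `ScoredResponse` types, the category labels (`"contradiction"`, `"unsupported"`), and the metric definitions (hallucination rate as the share of responses with at least one flagged span, contradiction ratio as the share of flagged spans labeled as contradictions) are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class FlaggedSpan:
    """One span flagged in a response (hypothetical shape, not HaluGate's schema)."""
    text: str
    category: str  # assumed labels, e.g. "contradiction" or "unsupported"


@dataclass
class ScoredResponse:
    """Result for one (context, query, response) triple; empty spans means nothing was flagged."""
    query: str
    spans: list[FlaggedSpan] = field(default_factory=list)


def aggregate(results: list[ScoredResponse]) -> dict:
    """Compute hallucination rate, contradiction ratio, and a per-category breakdown."""
    total = len(results)
    flagged = sum(1 for r in results if r.spans)
    categories = Counter(s.category for r in results for s in r.spans)
    span_count = sum(categories.values())
    return {
        "responses": total,
        # Assumed definition: share of responses with at least one flagged span.
        "hallucination_rate": flagged / total if total else 0.0,
        # Assumed definition: share of flagged spans labeled "contradiction".
        "contradiction_ratio": categories["contradiction"] / span_count if span_count else 0.0,
        "per_category": dict(categories),
    }


# Toy input standing in for the output of steps 1-4.
results = [
    ScoredResponse("q1"),
    ScoredResponse("q2", [FlaggedSpan("span a", "contradiction")]),
    ScoredResponse("q3", [FlaggedSpan("span b", "contradiction"),
                          FlaggedSpan("span c", "unsupported")]),
    ScoredResponse("q4"),
]
metrics = aggregate(results)
print(metrics)
```

Running `aggregate` once per model under test over the same dataset would give directly comparable numbers, provided the metric definitions are fixed up front.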
 ## Limitations and Scope
 
 HaluGate specifically targets **extrinsic hallucinations**—where tool/RAG context provides grounding for verification. It has known limitations:
@@ -448,8 +464,6 @@ HaluGate specifically targets **extrinsic hallucinations**—where tool/RAG cont
 |------------|---------|--------|
 | **Intrinsic hallucinations** | Model says "Einstein was born in 1900" without any tool call | No context to verify against |
 | **No-context scenarios** | User asks factual question, no tools defined | Missing ground truth |
-| **Subtle distortions** | "approximately 330 meters" vs "exactly 330 meters" | Semantic similarity too high |
-| **Multi-hop reasoning** | Inference chains beyond direct context | Single-hop verification only |
 
 ### Transparent Degradation
 