Conversation
📝 Walkthrough

A new ban-list evaluator and reporting helpers were added. Documentation and the evaluation orchestration script were updated to require and propagate `BAN_LIST_WORDS`.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Runner as "run_all_evaluations.sh"
    participant Env as "Environment\n(BAN_LIST_WORDS)"
    participant Evaluator as "ban_list/run.py\n(BanList Validator)"
    participant Dataset as "Dataset\n(files)"
    participant Profiler as "Profiler"
    participant Outputs as "Outputs\n(predictions.csv, metrics.json)"
    Runner->>Env: export KEY=VALUE args
    Runner->>Evaluator: invoke ban_list runner
    Evaluator->>Env: read BAN_LIST_WORDS
    Evaluator->>Dataset: load dataset (source/target/labels)
    Evaluator->>Profiler: start profiling
    Evaluator->>Evaluator: run_ban_list(text) per sample -> (redacted_text, fail_flag)
    Evaluator->>Profiler: stop profiling
    Evaluator->>Outputs: write predictions.csv
    Evaluator->>Outputs: build_evaluation_report(...) -> write metrics.json
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 4
🧹 Nitpick comments (1)
backend/app/evaluation/ban_list/run.py (1)
28-28: Consider wrapping module-level execution in `if __name__ == "__main__"`.

`pd.read_csv(DATASET_PATH)`, the `BanList` instantiation, and all evaluation logic execute unconditionally at import time. Wrapping them in an entry-point guard would prevent accidental side effects if `run_ban_list` is ever imported by another module (e.g., a test), and also makes the `FileNotFoundError` easier to trace.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/ban_list/run.py` at line 28, The module currently executes pd.read_csv(DATASET_PATH) and the BanList instantiation/evaluation at import time; wrap that module-level work in an entry-point guard by moving the dataset load, the BanList(...) instantiation and the evaluation/run logic into a protected block (either a small main() function invoked under if __name__ == "__main__": or directly under that guard) so importing this module (e.g., for tests) does not trigger pd.read_csv or run the evaluation automatically.
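The guard the comment asks for can be sketched as follows; `run_ban_list` and `main()` here are illustrative stubs standing in for the module's real logic, not the actual code under review:

```python
# Hypothetical sketch of the suggested entry-point guard.
def run_ban_list(text: str) -> tuple[str, int]:
    # stand-in for the real validator call: pass everything through
    return text, 0


def main() -> None:
    # dataset loading and evaluation live here, so importing this
    # module (e.g., from a test) triggers no side effects
    samples = ["hello world", "another sample"]
    results = [run_ban_list(s) for s in samples]
    print(f"evaluated {len(results)} samples")


if __name__ == "__main__":
    main()
```

With this shape, `import run` in a test only defines the functions; the dataset read happens solely when the file is executed as a script.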
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Around line 48-53: The code unconditionally accesses df["target_text"] when
"label" is absent which raises KeyError if target_text is missing; update the
conditional in run.py to explicitly handle three cases: if "label" in df.columns
set df["y_true"] from df["label"]; elif both "source_text" and "target_text" in
df.columns compute df["y_true"] = (df["source_text"].astype(str) !=
df["target_text"].astype(str)).astype(int); else (e.g., only "source_text"
present) set df["y_true"] to a sensible default (such as 0 or NaN) or raise a
clear ValueError explaining missing "target_text"/"label" so callers know the
dataset is invalid. Ensure you reference the df, "y_true", "label",
"source_text", and "target_text" symbols when making the change.
- Around line 65-77: The latency calculations in the write_json payload use
p.latencies directly and will raise ZeroDivisionError/ValueError/IndexError when
p.latencies is empty and also compute p95 with an off-by-one; fix by guarding on
n = len(p.latencies): if n == 0 set latency stats to None (or 0 per project
convention) and otherwise compute mean = sum(p.latencies)/n, compute
sorted_latencies = sorted(p.latencies), compute p95_index = max(0,
math.ceil(0.95 * n) - 1) (import math) and use sorted_latencies[p95_index], and
compute max = max(p.latencies); update the block that builds the "latency_ms"
dict inside the write_json call that references p.latencies to use this guarded
logic.
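The guarded computation described above can be sketched as a standalone helper; the name `latency_stats` and the `None`-on-empty convention are illustrative assumptions:

```python
import math


def latency_stats(latencies: list[float]) -> dict:
    """Guarded latency summary: None stats for an empty list, and a
    ceil-based p95 index that avoids the off-by-one on small samples."""
    n = len(latencies)
    if n == 0:
        return {"mean_ms": None, "p95_ms": None, "max_ms": None}
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * n) - 1)
    return {
        "mean_ms": sum(latencies) / n,
        "p95_ms": ordered[p95_index],
        "max_ms": max(latencies),
    }
```

For a single-element list the p95 index resolves to 0, so the helper never indexes out of range.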
- Around line 30-38: Instantiate BanList with the explicit
on_fail=OnFailAction.FIX (update the validator creation to pass OnFailAction.FIX
and add an import for OnFailAction if missing) so intent matches
ban_list_safety_validator_config.py/PIIRemover; in run_ban_list (the function
using validator.validate) keep the existing branch that returns
(result.fix_value, 1) when isinstance(result, FailResult) and add a defensive
else branch that returns the original text (or result.value) with a 0 flag to
handle non-FailResult cases safely.
In `@backend/scripts/run_all_evaluations.sh`:
- Around line 10-14: The loop in run_all_evaluations.sh that iterates over
arguments and attempts to export assignments uses export "$arg", which triggers
ShellCheck SC2163; replace export "$arg" with declare -x "$arg" inside the
for-loop so the string-form assignment is exported in the portable/idiomatic way
(locate the for arg in "$@" loop and change the export call to declare -x).
---
Nitpick comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Line 28: The module currently executes pd.read_csv(DATASET_PATH) and the
BanList instantiation/evaluation at import time; wrap that module-level work in
an entry-point guard by moving the dataset load, the BanList(...) instantiation
and the evaluation/run logic into a protected block (either a small main()
function invoked under if __name__ == "__main__": or directly under that guard)
so importing this module (e.g., for tests) does not trigger pd.read_csv or run
the evaluation automatically.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- backend/README.md
- backend/app/evaluation/ban_list/run.py
- backend/scripts/run_all_evaluations.sh
♻️ Duplicate comments (1)
backend/scripts/run_all_evaluations.sh (1)
10-12: ⚠️ Potential issue | 🟡 Minor

Replace `export "$arg"` with `declare -x "$arg"` to avoid SC2163 and keep exports idiomatic.

This is the same ShellCheck warning noted earlier; the current form still doesn’t export the variable name reliably.

🔧 Proposed fix

```diff
 for arg in "$@"; do
   if [[ "$arg" == *=* ]]; then
-    export "$arg"
+    declare -x "$arg"
   fi
 done
```

To verify, re-run ShellCheck on this script after the change.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/scripts/run_all_evaluations.sh` around lines 10 - 12, The loop that exports key=value args (for arg in "$@"; do ... export "$arg") should use declare -x "$arg" instead of export "$arg" to avoid ShellCheck SC2163 and ensure the variable name is exported reliably; update the token inside the loop (the export call) to declare -x "$arg" in run_all_evaluations.sh and then re-run ShellCheck to verify the warning is resolved.
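A minimal, self-contained sketch of the suggested loop; the `set --` line only simulates script arguments for illustration:

```shell
#!/usr/bin/env bash
# Sketch of the suggested argument-export loop: only KEY=VALUE
# arguments are exported, and declare -x sidesteps ShellCheck SC2163.
set -- BAN_LIST_WORDS="sonography,gender check" ./not_an_assignment
for arg in "$@"; do
  if [[ "$arg" == *=* ]]; then
    declare -x "$arg"   # exports BAN_LIST_WORDS into the environment
  fi
done
echo "BAN_LIST_WORDS=$BAN_LIST_WORDS"
# prints: BAN_LIST_WORDS=sonography,gender check
```

Arguments without an `=` (like the trailing path above) are skipped, so only explicit assignments reach the environment.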
🧹 Nitpick comments (1)
backend/app/evaluation/common/helper.py (1)
41-51: Prevent `extra_fields` from overriding reserved report keys.

Right now `extra_fields` can clobber `guardrail` or `num_samples` (and could attempt to set `performance`). Consider guarding against reserved keys to avoid silent overrides.

💡 Suggested guard

```diff
 def build_evaluation_report(
     guardrail: str,
     num_samples: int,
     profiler: "Profiler",
     **extra_fields: Any,
 ) -> dict[str, Any]:
+    reserved = {"guardrail", "num_samples", "performance"}
+    overlap = reserved.intersection(extra_fields)
+    if overlap:
+        raise ValueError(f"extra_fields includes reserved keys: {', '.join(sorted(overlap))}")
     return {
         "guardrail": guardrail,
         "num_samples": num_samples,
         **extra_fields,
         "performance": build_performance_payload(profiler),
     }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/common/helper.py` around lines 41 - 51, In build_evaluation_report, prevent extra_fields from overriding reserved keys ("guardrail", "num_samples", "performance") by checking extra_fields.keys() for any intersection with that reserved set and failing fast or removing them; update the function (build_evaluation_report) to compute reserved = {"guardrail","num_samples","performance"}, compute conflicts = reserved ∩ extra_fields.keys(), and if conflicts is non-empty either raise a ValueError listing the conflicting keys or remove those keys from extra_fields and log/warn before building the return dict so the reserved values from the function cannot be clobbered.
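A runnable version of the guarded helper can be sketched with a stub `Profiler` and a simplified `build_performance_payload` standing in for the project's real ones:

```python
from typing import Any


class Profiler:
    """Stub standing in for the project's real Profiler."""

    def __init__(self, latencies: list[float]) -> None:
        self.latencies = latencies


def build_performance_payload(p: Profiler) -> dict[str, Any]:
    # simplified placeholder payload for this sketch
    return {"num_calls": len(p.latencies)}


def build_evaluation_report(
    guardrail: str,
    num_samples: int,
    profiler: Profiler,
    **extra_fields: Any,
) -> dict[str, Any]:
    # fail fast if a caller tries to smuggle in a reserved key
    reserved = {"guardrail", "num_samples", "performance"}
    overlap = reserved.intersection(extra_fields)
    if overlap:
        raise ValueError(
            f"extra_fields includes reserved keys: {', '.join(sorted(overlap))}"
        )
    return {
        "guardrail": guardrail,
        "num_samples": num_samples,
        **extra_fields,
        "performance": build_performance_payload(profiler),
    }
```

With the guard in place, `build_evaluation_report("ban_list", 3, profiler, performance={})` raises instead of silently replacing the profiler-derived payload.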
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@backend/scripts/run_all_evaluations.sh`:
- Around line 10-12: The loop that exports key=value args (for arg in "$@"; do
... export "$arg") should use declare -x "$arg" instead of export "$arg" to
avoid ShellCheck SC2163 and ensure the variable name is exported reliably;
update the token inside the loop (the export call) to declare -x "$arg" in
run_all_evaluations.sh and then re-run ShellCheck to verify the warning is
resolved.
---
Nitpick comments:
In `@backend/app/evaluation/common/helper.py`:
- Around line 41-51: In build_evaluation_report, prevent extra_fields from
overriding reserved keys ("guardrail", "num_samples", "performance") by checking
extra_fields.keys() for any intersection with that reserved set and failing fast
or removing them; update the function (build_evaluation_report) to compute
reserved = {"guardrail","num_samples","performance"}, compute conflicts =
reserved ∩ extra_fields.keys(), and if conflicts is non-empty either raise a
ValueError listing the conflicting keys or remove those keys from extra_fields
and log/warn before building the return dict so the reserved values from the
function cannot be clobbered.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- backend/app/evaluation/ban_list/run.py
- backend/app/evaluation/common/helper.py
- backend/app/evaluation/gender_assumption_bias/run.py
- backend/app/evaluation/lexical_slur/run.py
- backend/app/evaluation/pii/run.py
- backend/scripts/run_all_evaluations.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/evaluation/ban_list/run.py
Actionable comments posted: 1
♻️ Duplicate comments (3)
backend/app/evaluation/ban_list/run.py (3)
32-34: Pass `on_fail=OnFailAction.FIX` explicitly to `BanList` for codebase consistency.

This validator is instantiated without an `on_fail` argument, while every other location in the codebase that constructs a `BanList` (e.g., `ban_list_safety_validator_config.py`) and the `PIIRemover` validator both pass `on_fail` explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/ban_list/run.py` around lines 32 - 34, The BanList validator is created without an explicit on_fail which is inconsistent with other usages; update the constructor call for BanList (the variable validator created with BANNED_WORDS) to pass on_fail=OnFailAction.FIX, and if OnFailAction is not already imported where BanList is used, add the import for OnFailAction so the symbol resolves.
52-57: ⚠️ Potential issue | 🟠 Major

`KeyError` when neither `"label"` nor `"target_text"` column is present.

The `else` branch at line 54 unconditionally accesses `dataset["target_text"]` without verifying the column exists — identical to the unresolved issue flagged in a prior review.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/ban_list/run.py` around lines 52 - 57, The code unconditionally accesses dataset["target_text"] in the else branch, causing a KeyError when neither "label" nor "target_text" exists; update the logic that assigns dataset["y_true"] to first check for the presence of "target_text" and "source_text" in dataset.columns (or use dataset.get) and only compute the difference when both exist, otherwise raise a clear ValueError indicating the missing columns; reference the existing dataset variable and the assignment to "y_true" so the change touches the same block that currently checks "label" and falls back to comparing "source_text" vs "target_text".
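The three-case logic the prompt describes can be sketched like this; `derive_y_true` is a hypothetical helper name, not a function from the codebase:

```python
import pandas as pd


def derive_y_true(df: pd.DataFrame) -> pd.DataFrame:
    """Derive y_true with explicit handling of all three column layouts."""
    if "label" in df.columns:
        # case 1: labels are provided directly
        df["y_true"] = df["label"]
    elif {"source_text", "target_text"}.issubset(df.columns):
        # case 2: infer labels from whether the target differs from the source
        df["y_true"] = (
            df["source_text"].astype(str) != df["target_text"].astype(str)
        ).astype(int)
    else:
        # case 3: dataset is invalid for this evaluation; fail loudly
        raise ValueError(
            'Dataset must contain a "label" column, or both '
            '"source_text" and "target_text" columns.'
        )
    return df
```

The explicit `ValueError` replaces the silent `KeyError`, so callers see immediately which columns the dataset is missing.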
39-41: ⚠️ Potential issue | 🟠 Major

`or`-based fallback silently returns unredacted text when `fix_value` is `""`.

When every character of `text` is a banned term, `BanList._validate` sets `fix_value = ""` (completely redacted). `fix_value` is `Optional[Any]` — the "auto-fix value that would be applied if the Validator's on_fail method is 'fix'" — so an empty string is a valid and meaningful result. Using `or` treats `""` as falsy and substitutes the original `text`, returning unredacted content and corrupting both `redacted_text` and `exact_match` metrics.

🐛 Proposed fix

```diff
-    return (result.fix_value or text), 1
+    return (text if result.fix_value is None else result.fix_value), 1
```
Verify each finding against the current code and only fix it if needed. In `@backend/app/evaluation/ban_list/run.py` around lines 39 - 41, The code uses a falsy check ("result.fix_value or text") that treats an intentional empty-string fix_value as absent; change the fallback logic in the return so that you only substitute text when fix_value is None (not when it's empty string). Update the return that references FailResult/result to use an explicit None check (e.g., use result.fix_value if result.fix_value is not None else text) so empty-string fixes are preserved; adjust the logic in the function where FailResult is handled (the return in run.py that currently uses "or") to implement this explicit None-aware fallback.
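The None-aware fallback plus the defensive non-`FailResult` branch can be sketched together; the `FailResult`/`PassResult` classes below are local stand-ins for the real guardrails imports:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FailResult:
    """Stand-in for guardrails' FailResult."""
    fix_value: Optional[str]


@dataclass
class PassResult:
    """Stand-in for guardrails' PassResult."""
    value: str


def run_ban_list(text: str, result) -> tuple[str, int]:
    if isinstance(result, FailResult):
        # None-aware fallback: an empty-string fix_value is a valid
        # (fully redacted) result and must not be replaced with `text`
        return (text if result.fix_value is None else result.fix_value), 1
    # defensive else branch: any non-FailResult means the text passed
    return text, 0
```

Here `run_ban_list("bad", FailResult(fix_value=""))` returns `("", 1)`, preserving the full redaction that the `or`-based version would have discarded.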
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Line 28: The code builds BANNED_WORDS from BAN_LIST_WORDS_RAW but doesn't
guard against resulting empty list after filtering (e.g.,
BAN_LIST_WORDS_RAW=","), which can silently initialize BanList with no entries;
update the initialization logic that creates BANNED_WORDS to validate the
filtered result and raise/exit with a clear error (or fallback) when
BANNED_WORDS is empty—check the variable BAN_LIST_WORDS_RAW, compute
BANNED_WORDS as now, then if not BANNED_WORDS: log an explicit error and stop or
provide a sensible default before constructing the BanList to ensure BanList is
never created with an empty banned-words list.
---
Duplicate comments:
In `@backend/app/evaluation/ban_list/run.py`:
- Around line 32-34: The BanList validator is created without an explicit
on_fail which is inconsistent with other usages; update the constructor call for
BanList (the variable validator created with BANNED_WORDS) to pass
on_fail=OnFailAction.FIX, and if OnFailAction is not already imported where
BanList is used, add the import for OnFailAction so the symbol resolves.
- Around line 52-57: The code unconditionally accesses dataset["target_text"] in
the else branch, causing a KeyError when neither "label" nor "target_text"
exists; update the logic that assigns dataset["y_true"] to first check for the
presence of "target_text" and "source_text" in dataset.columns (or use
dataset.get) and only compute the difference when both exist, otherwise raise a
clear ValueError indicating the missing columns; reference the existing dataset
variable and the assignment to "y_true" so the change touches the same block
that currently checks "label" and falls back to comparing "source_text" vs
"target_text".
- Around line 39-41: The code uses a falsy check ("result.fix_value or text")
that treats an intentional empty-string fix_value as absent; change the fallback
logic in the return so that you only substitute text when fix_value is None (not
when it's empty string). Update the return that references FailResult/result to
use an explicit None check (e.g., use result.fix_value if result.fix_value is
not None else text) so empty-string fixes are preserved; adjust the logic in the
function where FailResult is handled (the return in run.py that currently uses
"or") to implement this explicit None-aware fallback.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- backend/README.md
- backend/app/evaluation/ban_list/run.py
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/README.md
| "BAN_LIST_WORDS must be set for ban_list evaluation (comma-separated)." | ||
| ) | ||
|
|
||
| BANNED_WORDS = [word.strip() for word in BAN_LIST_WORDS_RAW.split(",") if word.strip()] |
Guard against an empty `BANNED_WORDS` list after filtering.

`if not BAN_LIST_WORDS_RAW` only rejects an unset/empty variable. A value like `BAN_LIST_WORDS=","` passes that guard but produces `BANNED_WORDS = []`, silently initializing `BanList` with an empty banned-words list and rendering the entire evaluation meaningless.
🛡️ Proposed fix

```diff
 BANNED_WORDS = [word.strip() for word in BAN_LIST_WORDS_RAW.split(",") if word.strip()]
+if not BANNED_WORDS:
+    raise ValueError(
+        "BAN_LIST_WORDS contained no valid words after stripping whitespace and commas."
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@backend/app/evaluation/ban_list/run.py` at line 28, The code builds
BANNED_WORDS from BAN_LIST_WORDS_RAW but doesn't guard against resulting empty
list after filtering (e.g., BAN_LIST_WORDS_RAW=","), which can silently
initialize BanList with no entries; update the initialization logic that creates
BANNED_WORDS to validate the filtered result and raise/exit with a clear error
(or fallback) when BANNED_WORDS is empty—check the variable BAN_LIST_WORDS_RAW,
compute BANNED_WORDS as now, then if not BANNED_WORDS: log an explicit error and
stop or provide a sensible default before constructing the BanList to ensure
BanList is never created with an empty banned-words list.
Summary
Target issue is #63.
Explain the motivation for making this change. What existing problem does the pull request solve?
- `run_all_evaluations.sh` to run evaluations for all validators, including ban list.

To run evaluations, use this command:

```shell
scripts/run_all_evaluations.sh BAN_LIST_WORDS="sonography,gender check"
```

Checklist
Before submitting a pull request, please ensure that you mark these task.
`fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.
Please add here if any other information is required for the reviewer.
Summary by CodeRabbit
New Features
Documentation
Refactor