Score your redaction policies

Philter Scope

Philter Scope is a standalone audit tool that scores redaction policies against gold-standard test data. Stop guessing whether a policy change made the pipeline better. Measure it, version it, and fail the build when it regresses.

Evaluate a policy. Get precision, recall, and F1: measured, not claimed.

$ export PHILTERSCOPE_PHILTER_TOKEN=sk_your_philter_api_key
$ ./philterscope-audit \
    --golden ./examples/golden/ \
    --input  ./examples/raw/ \
    --output ./examples/ \
    --threshold 0.75

Precision: 0.972   (1,795 of 1,847 redactions correct)
Recall:    0.961   (1,795 of 1,868 ground-truth entities caught)
F1:        0.966
# Full report written to ./examples/report.json

Run philterscope-audit against your golden set and input, then explore the report in a browser with philterscope-serve.
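The three numbers in the sample output follow directly from the counts printed next to them. A minimal sketch of that arithmetic, assuming the standard definitions of precision, recall, and F1 (the function name is illustrative and not part of Philter Scope):

```python
def score(correct: int, redacted: int, gold: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from raw counts."""
    precision = correct / redacted if redacted else 0.0
    recall = correct / gold if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Counts taken from the sample run above.
precision, recall, f1 = score(correct=1795, redacted=1847, gold=1868)
print(f"Precision: {precision:.3f}")  # 0.972
print(f"Recall:    {recall:.3f}")     # 0.961
print(f"F1:        {f1:.3f}")         # 0.966
```

Precision divides by what the policy redacted, recall divides by what the golden set says should have been redacted, and F1 is the harmonic mean of the two.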

Score a policy against gold-standard data

Philter Scope compares a policy's redacted output against your labeled gold-standard set and reports precision, recall, and F1 for every entity type, so a change that regresses recall is caught before it reaches production.

scoring healthcare-v7.json · golden set: clinical-notes-500

Entity   Precision   Recall   F1
SSN      1.00        1.00     1.00
PHONE    0.99        0.98     0.98
PERSON   0.97        0.95     0.96
DATE     0.94        0.88     0.91
MRN      0.96        0.72     0.82   (12 of 43 missed)

Overall F1 is 0.94, but recall on MRN fell to 0.72, below the 0.75 threshold. That is the kind of regression Philter Scope fails a build on.
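A recall floor like the one in this report amounts to a simple per-entity check. The sketch below hard-codes the table's values; the data structure and function name are assumptions for illustration, not Philter Scope's report format or API.

```python
# Per-entity (precision, recall) copied from the table above.
scores = {
    "SSN": (1.00, 1.00),
    "PHONE": (0.99, 0.98),
    "PERSON": (0.97, 0.95),
    "DATE": (0.94, 0.88),
    "MRN": (0.96, 0.72),
}

def below_floor(scores, recall_floor):
    """Return (entity, recall) pairs under the floor, worst first."""
    failing = [(name, r) for name, (_, r) in scores.items() if r < recall_floor]
    return sorted(failing, key=lambda item: item[1])

failing = below_floor(scores, recall_floor=0.75)
for name, recall in failing:
    print(f"FAIL {name}: recall {recall:.2f} < 0.75")

# A CI step would exit non-zero when anything fails.
exit_code = 1 if failing else 0
```

Only MRN trips the floor here. Its 0.72 is 31 of 43 record numbers caught, which is the 43 in the golden set minus the 12 missed.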

Why Philter Scope

Reproducible benchmarks

Same test set, same metrics, every run. Two engineers comparing two policies see the same numbers. No more debates about whether the new rules are actually better.

Gold-standard comparison

Annotate a representative sample of your real text once. Philter Scope compares any policy output against that ground truth and reports precision, recall, and F1 per entity type, along with an entity type confusion matrix showing where detectors misclassify.

Per-entity breakdown

Aggregate scores hide problems. Philter Scope reports per-entity-type metrics so you can see exactly which detectors are weakest and where the next tuning pass should land.

CI integration

Run it as a step in your CI pipeline. Fail the build when precision or recall regresses below a threshold; catch policy regressions before they reach production.

Audit artifact

The evaluation report is the artifact regulators and auditors actually want to see. Demonstrate that your redaction pipeline is verifiably correct, not just "trust us, it works."

Open source

Pair it with Phileas and Philter, or use it against any redaction output. The evaluation logic is open: your QA team can read every line of the code that scores them.

See it in action

The Philter Scope dashboard breaks down precision, recall, and F1 by entity type, so you can see exactly where your policy is strong and where it needs tuning.

Screenshot: Philter Scope dashboard showing precision, recall, and F1 scores by entity type. Development moves quickly, so screenshots may not always reflect the current version.

Frequently asked questions

If something here isn’t covered, get in touch and we’ll answer.

What is Philter Scope?

Philter Scope is a standalone audit tool that scores redaction policies against gold-standard test data. Instead of guessing whether a policy change made your pipeline better, you measure it: Philter Scope reports precision, recall, and F1, per entity type, so you can version a policy and fail the build when it regresses.

What do precision, recall, and F1 actually tell me?

Precision is how much of what you redacted was genuinely sensitive, so it tells you where the policy is destroying useful data by over-redacting. Recall is how much of the real PII you actually caught, so it tells you exactly where data is leaking. F1 combines the two into a single score. Philter Scope reports all three per entity type, because an aggregate number can read 98 percent while quietly missing most medical record numbers.

Do I need labeled data to use it?

Yes. You annotate a representative sample of your real text once to create a gold-standard ("golden") set. Philter Scope then compares any policy's output against that ground truth and produces the scores, along with a confusion matrix showing where detectors misclassify one entity type as another.
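To make the confusion matrix concrete, here is a minimal sketch that counts (gold type, predicted type) pairs for spans already aligned between the golden set and the policy output. The sample pairs and the MISSED label are made up for illustration; Philter Scope's actual alignment and report format may differ.

```python
from collections import Counter

# (gold entity type, type the policy assigned) for each aligned span.
# "MISSED" marks a gold entity the policy did not redact. Hypothetical data.
pairs = [
    ("SSN", "SSN"),
    ("SSN", "SSN"),
    ("MRN", "MRN"),
    ("MRN", "PHONE"),   # an MRN misclassified as a phone number
    ("MRN", "MISSED"),
    ("PHONE", "PHONE"),
    ("DATE", "DATE"),
    ("DATE", "MISSED"),
]

# Each distinct pair is one cell of the confusion matrix.
confusion = Counter(pairs)

# Off-diagonal cells show where the policy got it wrong.
errors = {cell: n for cell, n in confusion.items() if cell[0] != cell[1]}

for (gold, predicted), n in sorted(errors.items()):
    print(f"{gold} -> {predicted}: {n}")
```

Cells where the two types match are correct redactions; everything else is either a miss or a misclassification, and the pair tells you which detector to look at.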

Can I run Philter Scope in CI?

Yes, and that is the point of it. Run it as a CI step, set a floor on precision or recall for the entity types you care about, and fail the build when a policy change regresses below the threshold. A regression that re-exposes Social Security numbers gets caught the same way a broken unit test would, before it reaches production.

Does Philter Scope only work with Philter?

It pairs naturally with Phileas and Philter, but it scores redaction output, so you can point it at the output of any redaction process. The evaluation logic is open source under the Apache License, version 2, so your QA team can read every line of the code that scores them.

Is Philter Scope open source?

Yes. Philter Scope is open source under the Apache License, version 2, and the code is on GitHub.

Why measuring redaction matters

Redaction feels binary. The data went in, redacted data came out, and a spot check of a few documents looked clean. But redaction is not binary; it is statistical. Every policy decides, entity by entity, what to catch and what to let through. The moment you change a rule, add a detector, or adopt a new model, you have placed a bet on thousands of decisions you will never read by hand. "It looked right on a few examples" is a hope, not an audit artifact. You cannot tune what you cannot measure.

A single headline accuracy number is almost as dangerous as no number at all, because aggregate scores hide exactly the failures that matter. A policy can report 98% overall while quietly missing most medical record numbers or over-redacting the dates your analysts depend on. That is why Philter Scope reports precision, recall, and F1 per entity type against a gold-standard set you annotate once. Precision tells you how much of what you redacted was actually sensitive, so you can see where the policy is destroying useful data. Recall tells you how much of the real PII you caught, so you can see exactly where data is leaking. Per entity, those two numbers turn tuning from guesswork into a targeted decision.
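A small worked example shows how an aggregate can hide a failing entity type, here using recall. The counts are invented for illustration: the rare type contributes so few instances that its misses barely move the pooled number.

```python
# entity type -> (gold entities, entities caught). Hypothetical counts.
counts = {
    "PERSON": (5000, 4950),
    "DATE": (4000, 3960),
    "MRN": (100, 30),
}

# Pooled recall across all entity types.
total_gold = sum(gold for gold, _ in counts.values())
total_caught = sum(caught for _, caught in counts.values())
overall_recall = total_caught / total_gold

# Recall computed separately for each entity type.
per_entity_recall = {name: caught / gold for name, (gold, caught) in counts.items()}

print(f"Overall recall: {overall_recall:.3f}")
for name, recall in per_entity_recall.items():
    print(f"{name:<7} recall: {recall:.2f}")
```

With these counts the pooled recall is about 0.98 while MRN recall is 0.30, so 70 of 100 record numbers leak behind a headline that looks healthy.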

Measurement is what makes a policy safe to change

Policies drift. Models get swapped, input shapes change, and someone tweaks a rule to fix one document. Without a gate, a regression that re-exposes Social Security numbers ships as silently as any other untested change. Scoring a policy in CI turns that risk into a build you can fail: set a floor on recall for the entity types you care about, and a regression is caught before it reaches production, the same way a broken unit test is. The policies you run through Phileas and Philter become something you can version, review, and verify rather than something you hope still works.
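One way to picture such a gate is to compare a candidate policy's per-entity recall with the last accepted baseline and fail when any type drops. This is a sketch under assumptions: the dictionaries, the tolerance, and the function name are hypothetical and do not reflect Philter Scope's report schema or flags.

```python
def recall_regressions(baseline, candidate, tolerance=0.01):
    """Return {entity: (old, new)} where recall fell by more than tolerance."""
    regressions = {}
    for entity, old in baseline.items():
        # An entity type missing from the candidate is treated as recall 0.
        new = candidate.get(entity, 0.0)
        if old - new > tolerance:
            regressions[entity] = (old, new)
    return regressions

# Hypothetical per-entity recall for two versions of a policy.
baseline = {"SSN": 1.00, "PERSON": 0.95, "MRN": 0.91}
candidate = {"SSN": 1.00, "PERSON": 0.96, "MRN": 0.72}

regressions = recall_regressions(baseline, candidate)
for entity, (old, new) in regressions.items():
    print(f"{entity}: recall {old:.2f} -> {new:.2f}")

# In CI, a non-zero exit status fails the build.
exit_code = 1 if regressions else 0
```

Comparing against a baseline catches a drop even when the new score still clears an absolute floor, which is why the two checks are often used together.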

Measurement is also what an auditor will accept. "Trust us, it works" is not a control. A reproducible report that scores your real redaction output against ground truth is the evidence regulators actually ask for, and it is the difference between asserting your pipeline is correct and proving it. For a deeper walk through the three metrics and how to read them, see "Privacy shouldn't be a guessing game."

Ready to use Philter Scope?

Grab the open source code and run it yourself, or work with our team directly. Pick the path that fits.
