Cutting False Positives on National IDs with Checksum Validation

A regular expression is good at matching a shape. It is bad at telling a real identifier from a number that merely looks like one. That gap is where false positives come from, and for national and financial identifiers it is a real problem.

Consider a Canadian Social Insurance Number (SIN): nine digits, often written 046 454 286. The obvious pattern is “three groups of three digits.” But that pattern also matches a tracking number, a padded order number, or any other nine-digit run that happens to share the format. Redact all of them and you have damaged documents and noisy output. Redact none of them and you have leaked a SIN.

The good news is that most national and financial identifiers carry a check digit: a built-in self-test that a genuine value satisfies and a random look-alike almost never does. Phileas can now use it.

The validator field

The custom identifier filter takes an optional validator. When you set it, a regex match is kept only if the named validator passes. The pattern still decides what could be an identifier; the validator decides whether it actually is one.

Here is a Canadian SIN identifier that validates each match with the Luhn checksum:

{
  "identifiers": {
    "identifiers": [
      {
        "classification": "canada-sin",
        "pattern": "\\b\\d{3}[ -]?\\d{3}[ -]?\\d{3}\\b",
        "validator": "luhn",
        "identifierFilterStrategies": [
          { "strategy": "REDACT", "redactionFormat": "[REDACTED-CANADA-SIN]" }
        ]
      }
    ]
  }
}

With that one validator line, the behavior changes in exactly the way you want:

  • 046 454 286 is Luhn-valid, so it is redacted.
  • 123 456 789 has the same shape but fails the checksum, so it is left in place.

Without the validator, both would have been redacted. The validator is what turns a broad pattern into a precise one, with no extra filters and no code.
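
To make the check concrete, here is a minimal Python sketch of the same idea: the pattern from the policy above proposes candidates, and a Luhn test decides which ones to keep. The helper names luhn_valid and validated_matches are illustrative assumptions, not Phileas's API; the Luhn algorithm itself is standard.

```python
import re

# Same shape as the policy pattern above: three groups of three digits.
SIN_PATTERN = re.compile(r"\b\d{3}[ -]?\d{3}[ -]?\d{3}\b")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def validated_matches(text: str) -> list[str]:
    """Keep only the pattern matches that also pass the checksum."""
    return [m.group(0) for m in SIN_PATTERN.finditer(text) if luhn_valid(m.group(0))]


sample = "SIN 046 454 286 on file; order 123 456 789 shipped."
print(validated_matches(sample))  # ['046 454 286']
```

Both numbers match the pattern, but only the first survives the checksum (its Luhn sum is 50; the second sums to 47).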

The identifiers it covers

Out of the box, the validators cover the checksum and structural algorithms behind a range of national and financial identifiers:

  • Canada: Social Insurance Number (SIN)
  • France: INSEE / NIR, SIREN, SIRET
  • Germany: tax ID (Steuer-ID), ID card number (Personalausweis)
  • Spain: DNI, NIE, CIF
  • Brazil: CPF, CNPJ
  • Banking: IBAN, SWIFT / BIC

Behind these are eight built-in validators: luhn, mod11, mod97, mod23-letter, es-cif, de-steuerid, de-personalausweis, and bic-structural. Adding another identifier is usually just a pattern plus one of them.
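
For a sense of what one of these checks involves, the following is a rough Python sketch of the mod-97 test that IBANs use (ISO 13616). It is an assumption that the built-in mod97 validator works along these lines; the function name mod97_valid is hypothetical, and a production validator would likely also check country-specific lengths.

```python
def mod97_valid(iban: str) -> bool:
    """Return True if `iban` passes the ISO 13616 mod-97 check."""
    compact = "".join(iban.split()).upper()
    if len(compact) < 5 or not compact.isascii() or not compact.isalnum():
        return False
    # Move the country code and the two check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Replace letters with numbers: A -> 10, B -> 11, ..., Z -> 35.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    # A genuine IBAN leaves a remainder of exactly 1.
    return int(numeric) % 97 == 1


print(mod97_valid("GB82 WEST 1234 5698 7654 32"))  # True
print(mod97_valid("GB82 WEST 1234 5698 7654 33"))  # False
```

The first value is the widely published example IBAN; the second changes one digit, which the mod-97 check always detects.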

Some validators take parameters. The mod-11 family, for example, selects its scheme with a variant:

"validator": { "name": "mod11", "params": { "variant": "cpf" } }

A validator name that the build does not implement is a policy error, not a silent no-op, so a policy can never quietly skip the check it asked for. And because the policy only ever names a built-in validator, there is no executable code in the policy itself: nothing to review for arbitrary-code-execution risk, and the same behavior on every runtime.
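
As an illustration of a parameterized check, here is a small Python sketch of the standard CPF check-digit arithmetic, dispatched through a variant name. The names mod11_valid and MOD11_VARIANTS are hypothetical, and treating an unknown variant as an error is an assumption that mirrors the fail-loudly behavior described above; none of this is Phileas's actual code.

```python
def _cpf_valid(candidate: str) -> bool:
    """CPF: nine base digits followed by two mod-11 check digits."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if len(digits) != 11:
        return False

    def check_digit(prefix: list[int]) -> int:
        # Weights count down to 2: 10..2 for the first check digit,
        # 11..2 for the second.
        start = len(prefix) + 1
        total = sum(d * w for d, w in zip(prefix, range(start, 1, -1)))
        remainder = total % 11
        return 0 if remainder < 2 else 11 - remainder

    return (check_digit(digits[:9]) == digits[9]
            and check_digit(digits[:10]) == digits[10])


# Hypothetical variant table; only "cpf" is sketched here.
MOD11_VARIANTS = {"cpf": _cpf_valid}


def mod11_valid(candidate: str, variant: str) -> bool:
    """Run the named mod-11 scheme, failing loudly on an unknown variant."""
    try:
        check = MOD11_VARIANTS[variant]
    except KeyError:
        raise ValueError(f"unknown mod11 variant: {variant!r}") from None
    return check(candidate)


print(mod11_valid("529.982.247-25", "cpf"))  # True
print(mod11_valid("529.982.247-26", "cpf"))  # False
```

This sketch does only the arithmetic. Repeated-digit values such as 111.111.111-11 satisfy it, and many CPF validators reject those separately; whether the built-in validator does so is not covered here.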

Ready-made policies, in three languages

You do not have to assemble these by hand. The open source policy library now includes grouped, validator-backed policies for Canada (SIN), France (NIR, SIREN, SIRET), Germany (Steuer-ID, Personalausweis), Spain (DNI, NIE, CIF), Brazil (CPF, CNPJ), and SWIFT/BIC codes. Each one accepts formatted and unformatted values and ships with example input and output.

The validators are implemented identically in Phileas for Java, Python, and .NET, so a policy behaves the same whether you call the HTTP API, embed the library, or run it in a .NET service.

What a checksum does and does not buy you

A validator narrows false positives. It does not eliminate them, and it does not turn detection into a guarantee. A checksum-valid number is not certain to be a real, issued identifier: a random nine-digit value passes the Luhn check one time in ten, because for any eight-digit prefix exactly one final digit satisfies the checksum. So a validator makes a generic identifier meaningfully more precise, and for high-stakes free text you can tighten it further by anchoring on a nearby cue (“SIN”, “CPF”) and redacting only the captured value. As always, detection is probabilistic, and you should validate any policy against your own representative documents before relying on it.
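
Both points can be sketched in a few lines of Python. The snippet counts how many values in a block of nine-digit numbers pass Luhn, then shows one way to anchor on a cue and keep only the captured number. The CUED_SIN pattern, the 20-character window, and the function names are assumptions made for illustration; they are not taken from the policy library.

```python
import re


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` satisfy the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


# Exactly one final digit per eight-digit prefix passes, so 1 in 10 overall.
passing = sum(luhn_valid(f"{n:09d}") for n in range(100_000))
print(passing)  # 10000

# Require the cue "SIN" within a short window, and capture only the number.
CUED_SIN = re.compile(
    r"\bSIN\b\D{0,20}?(\d{3}[ -]?\d{3}[ -]?\d{3})\b", re.IGNORECASE
)


def cued_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) offsets of cue-anchored, Luhn-valid numbers."""
    return [m.span(1) for m in CUED_SIN.finditer(text) if luhn_valid(m.group(1))]


text = "Parcel 046 454 286 shipped. SIN: 046 454 286."
for start, end in cued_spans(text):
    print(text[start:end])  # 046 454 286 (only the occurrence after the cue)
```

The parcel number is Luhn-valid too, which is the one-in-ten case in practice; the cue is what keeps it out of the result. The trade-off is recall: a SIN that appears without the cue nearby will be missed.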

Availability

The capability is available now on the main branch of Phileas (Java), phileas-python, and phileas-dotnet, so you can build and use it from source today. It lands in the next release of each: Phileas 4.1.0, phileas-python 1.1.0, and phileas-dotnet 1.2.0. The validator field is defined in the redaction policy schema as of PhiSQL 1.1.0. Everything here is open source under the Apache 2.0 license, and the ready-made policy-library entries are available now in the pii-redaction-policies repository.