National and financial ID numbers are some of the hardest values to redact cleanly. A regular expression can match their shape, but it cannot tell a real identifier from a number that merely looks like one. A pattern for a nine-digit Canadian Social Insurance Number also matches a tracking number, a padded order number, and any other nine-digit run. Redact all of them and you damage documents; redact none and you leak a SIN.

Most national and financial identifiers solve this for you. They carry a check digit, a built-in self-test that a genuine value always satisfies and a random look-alike usually fails. This guide shows how to use those checks in a redaction policy with the validator field, so a broad pattern becomes a precise one with no extra code. If you have not written a policy before, start with Writing your first redaction policy.

The recipe: anchor, capture, validate

A production-grade identifier policy has three parts, each tightening precision over the one before.

  1. A pattern that matches the format of the ID.
  2. A capture group so you anchor on a nearby cue word but redact only the value.
  3. A validator that keeps a match only if its check digit passes.

Here is a Canadian SIN identifier with all three. It looks for a cue such as “SIN” or “Social Insurance Number,” captures the nine-digit value after it, and validates that value with the Luhn checksum:

{
  "identifiers": {
    "identifiers": [
      {
        "classification": "canada-sin",
        "pattern": "\\b(?:SIN|Social Insurance Number)[\\s:#-]*((?:\\d{3}[ -]?){2}\\d{3})\\b",
        "groupNumber": 1,
        "validator": "luhn",
        "identifierFilterStrategies": [
          { "strategy": "REDACT", "redactionFormat": "[REDACTED-CANADA-SIN]" }
        ]
      }
    ]
  }
}

What each part buys you:

  • The digit pattern alone, without the cue word, would match any nine-digit run, including reference numbers.
  • The capture group (groupNumber: 1) means the cue word “SIN” anchors the match but is not redacted. Only the number is. This cuts matches that share the digit shape but appear in unrelated context.
  • The validator drops values that fail the checksum. 046 454 286 is Luhn-valid and is redacted; 123 456 789 has the same shape, fails the check, and is left in place. Without the validator, both would be redacted.

You can use any of the three on their own. A bare pattern plus a validator is already far more precise than a pattern alone. Adding the cue-word capture group is the extra step that pays off in noisy free text.
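The anchor, capture, validate sequence can be sketched outside the policy engine. The Python below is a minimal illustration, not the Phileas implementation: the function names luhn_valid and redact_sins are hypothetical, and the regular expression is the one from the policy above with the JSON escaping removed.

```python
import re

# Same pattern as the policy above, without JSON string escaping.
SIN_PATTERN = re.compile(
    r"\b(?:SIN|Social Insurance Number)[\s:#-]*((?:\d{3}[ -]?){2}\d{3})\b"
)


def luhn_valid(value: str) -> bool:
    """Standard mod-10 Luhn check; separators are ignored."""
    digits = [int(c) for c in value if c.isdecimal()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_sins(text: str, replacement: str = "[REDACTED-CANADA-SIN]") -> str:
    """Replace only capture group 1, and only when it passes the Luhn check."""

    def substitute(match: re.Match) -> str:
        whole = match.group(0)
        if not luhn_valid(match.group(1)):
            return whole  # failed the check: leave the text untouched
        # Offsets of the captured value inside the full match.
        start = match.start(1) - match.start(0)
        end = match.end(1) - match.start(0)
        return whole[:start] + replacement + whole[end:]

    return SIN_PATTERN.sub(substitute, text)


print(redact_sins("SIN: 046 454 286"))
print(redact_sins("SIN: 123 456 789"))
```

The first call redacts the value and keeps the cue word; the second leaves the line unchanged because 123 456 789 fails the checksum.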

Identifiers covered out of the box

The built-in validators cover the checksum and structural algorithms behind a range of national and financial identifiers. Each row links to a ready-made, validator-backed policy in the policy library that accepts formatted and unformatted values and ships with example input and output.

| Region | Identifiers | Validator(s) | Ready-made policy |
| --- | --- | --- | --- |
| Canada | Social Insurance Number (SIN) | luhn | Canadian SIN |
| Brazil | CPF, CNPJ | mod11 (variant cpf or cnpj) | Brazilian identifiers |
| France | NIR / INSEE, SIREN, SIRET | mod97 (variant nir), luhn | French identifiers |
| Germany | Steuer-ID, Personalausweis | de-steuerid, de-personalausweis | German identifiers |
| Spain | DNI, NIE, CIF | mod23-letter, es-cif | Spanish identifiers |
| Banking | IBAN, SWIFT / BIC | mod97 (variant iban), bic-structural | SWIFT/BIC codes |

The validator reference

A validator is written one of two ways. Validators that take no parameters can be a plain string:

"validator": "luhn"

Validators that take parameters use the object form with a name and params:

"validator": { "name": "mod11", "params": { "variant": "cpf" } }

The complete set of built-in validators:

| Validator | Parameters | Covers |
| --- | --- | --- |
| luhn | none | Standard mod-10 Luhn checksum. Canadian SIN, French SIREN and SIRET, and other Luhn-checked numbers. Separators are ignored, so a value may be formatted or not. |
| mod11 | variant: cpf or cnpj | Weighted mod-11 check digits for the Brazilian CPF (11 digits) and CNPJ (14 digits). |
| mod97 | variant: iban or nir; nir also takes substitutions | Value mod 97. iban is ISO 13616 (MOD-97-10); nir is the French INSEE/NIR key, with Corsica department substitutions (2A, 2B) configurable. |
| mod23-letter | substitutions (optional) | Control letter from a 23-entry table for the Spanish DNI and NIE. The NIE leading letter mapping is configurable. |
| es-cif | none | The Spanish CIF organization tax ID, whose control character may be a digit or a letter. |
| de-steuerid | none | The German tax ID (Steuer-ID / IdNr): 11 digits, structural digit-repetition rule, and ISO/IEC 7064 MOD 11,10 check digit. |
| de-personalausweis | none | ICAO 9303 check-digit validation for the German ID card (Personalausweis) number. |
| bic-structural | none | Structural check for a SWIFT/BIC code (ISO 9362). It has no checksum, so this validates length and segments and that the country segment is a valid ISO 3166-1 alpha-2 code. |
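To make one of these entries concrete, here is a minimal Python sketch of the MOD-97-10 check that the mod97 validator's iban variant refers to. It follows the published IBAN procedure (move the first four characters to the end, map letters to 10 through 35, and test for a remainder of 1). The function name iban_valid is hypothetical, the length bounds are a loose sanity check rather than per-country lengths, and nothing here is taken from the Phileas source.

```python
def iban_valid(value: str) -> bool:
    """MOD-97-10 check for an IBAN; spaces and letter case are ignored."""
    iban = "".join(value.split()).upper()
    # Loose structural guard only; real IBAN lengths are country-specific.
    if not (15 <= len(iban) <= 34) or not (iban.isascii() and iban.isalnum()):
        return False
    # Move country code and check digits to the end.
    rearranged = iban[4:] + iban[:4]
    # int(c, 36) maps 0-9 to 0-9 and A-Z to 10-35.
    numeric = "".join(str(int(c, 36)) for c in rearranged)
    return int(numeric) % 97 == 1


print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

The value used here is the widely published example IBAN, not a real account. Changing its check digits from 82 to 83 makes the function return False.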

A validator name the build does not implement is a policy error, not a silent no-op, so a policy can never quietly skip a check it asked for. And because a policy only ever names a built-in validator, there is no executable code in the policy itself: nothing to review for arbitrary-code-execution risk, and the same behavior on every runtime. The names are defined in the redaction policy schema; see the schema guide for where validator sits in a policy.
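The two validator forms and the fail-loudly rule for unknown names can be pictured with a small lookup table. The Python below is a sketch of that behavior only: REGISTRY, PolicyError, and resolve_validator are hypothetical names, and the real engine's internals may look nothing like this.

```python
from typing import Any, Callable, Dict, Union

ValidatorSpec = Union[str, Dict[str, Any]]


class PolicyError(ValueError):
    """Raised when a policy names a validator this build does not implement."""


def _luhn(value: str, params: Dict[str, Any]) -> bool:
    digits = [int(c) for c in value if c.isdecimal()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return bool(digits) and total % 10 == 0


# Hypothetical registry: only named, built-in checks can ever run.
REGISTRY: Dict[str, Callable[[str, Dict[str, Any]], bool]] = {"luhn": _luhn}


def resolve_validator(spec: ValidatorSpec) -> Callable[[str], bool]:
    """Accept the plain-string form or the {"name", "params"} object form."""
    if isinstance(spec, str):
        name, params = spec, {}
    else:
        name, params = spec["name"], spec.get("params", {})
    if name not in REGISTRY:
        # An unknown name is an error, never a silently skipped check.
        raise PolicyError(f"unknown validator: {name!r}")
    check = REGISTRY[name]
    return lambda value: check(value, params)


print(resolve_validator("luhn")("046 454 286"))
```

Because the policy contributes only a name and parameters, the set of code that can run is fixed by the registry, which is the property the paragraph above describes.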

A parameterized example: Brazilian CPF

The mod-11 family selects its scheme with a variant. Here is a CPF identifier that anchors on the “CPF” cue, captures the value, and validates it as a CPF:

{
  "identifiers": {
    "identifiers": [
      {
        "classification": "br-cpf",
        "pattern": "CPF[\\s:#-]*(\\d{3}\\.?\\d{3}\\.?\\d{3}-?\\d{2})",
        "groupNumber": 1,
        "validator": { "name": "mod11", "params": { "variant": "cpf" } },
        "identifierFilterStrategies": [
          { "strategy": "REDACT", "redactionFormat": "[REDACTED-CPF]" }
        ]
      }
    ]
  }
}

Swap "variant": "cpf" for "variant": "cnpj", change the cue word to CNPJ, and widen the pattern to fourteen digits (conventionally formatted as 00.000.000/0000-00) to redact CNPJ company registrations instead.
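For readers curious what the cpf variant checks, here is a minimal Python sketch of the standard CPF check-digit arithmetic. The name cpf_valid and the reject_repeated flag are hypothetical. Rejecting all-same-digit values such as 111.111.111-11 is a common convention, since they pass the arithmetic; whether the built-in validator applies that rule is an assumption this page does not confirm.

```python
def cpf_valid(value: str, reject_repeated: bool = True) -> bool:
    """Weighted mod-11 check for a Brazilian CPF; separators are ignored."""
    digits = [int(c) for c in value if c.isdecimal()]
    if len(digits) != 11:
        return False
    # Assumption: treat 000.000.000-00, 111.111.111-11, ... as invalid.
    if reject_repeated and len(set(digits)) == 1:
        return False

    def check_digit(body):
        # Weights run from len(body) + 1 down to 2.
        weights = range(len(body) + 1, 1, -1)
        remainder = sum(d * w for d, w in zip(body, weights)) * 10 % 11
        return 0 if remainder == 10 else remainder

    # The first check digit covers nine digits, the second covers ten.
    return (check_digit(digits[:9]) == digits[9]
            and check_digit(digits[:10]) == digits[10])


print(cpf_valid("529.982.247-25"))
```

The value 529.982.247-25 is a commonly cited arithmetic test value, not a known issued CPF. Changing its last digit makes the check fail.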

What a checksum does and does not buy you

A validator narrows false positives. It does not eliminate them, and it does not make detection a guarantee. A checksum-valid number is not certain to be a real, issued identifier: a random nine-digit value passes the Luhn check roughly one time in ten. Put the other way, the validator discards about nine in ten random look-alikes, which makes a generic identifier pattern meaningfully more precise. For high-stakes free text you can tighten it further by anchoring on a nearby cue word and redacting only the captured value, as the examples here do. Detection is probabilistic, and you remain responsible for validating any policy against your own representative documents before relying on it.
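The one-in-ten figure follows from how Luhn works: for any eight-digit prefix, exactly one of the ten possible final digits satisfies the check. The Python sketch below demonstrates this; the helper names are illustrative and not part of Phileas.

```python
def luhn_valid(value: str) -> bool:
    """Standard mod-10 Luhn check; separators are ignored."""
    digits = [int(c) for c in value if c.isdecimal()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return bool(digits) and total % 10 == 0


def passing_final_digits(prefix: str):
    """Return every final digit that makes prefix + digit Luhn-valid."""
    return [d for d in range(10) if luhn_valid(prefix + str(d))]


# Each eight-digit prefix has exactly one passing ninth digit.
print(passing_final_digits("04645428"))
print(passing_final_digits("12345678"))
```

The first prefix yields only 6, matching the 046 454 286 example earlier. A shape-only pattern therefore lets through ten candidates per prefix, and the checksum keeps one of them.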

Frequently asked questions

Do I need to write a pattern for every ID myself?
No. The policy library ships ready-made, validator-backed policies for the Canadian SIN, Brazilian CPF and CNPJ, French NIR, SIREN, and SIRET, German Steuer-ID and Personalausweis, Spanish DNI, NIE, and CIF, and SWIFT/BIC codes. Download one and adjust it, or use the patterns on this page as a starting point.

Does a checksum mean the ID is real?
No. A checksum confirms a value is internally consistent, not that it was ever issued to someone. A random nine-digit number passes the Luhn check about one time in ten. A validator makes a broad pattern much more precise; it does not turn detection into a guarantee. Always validate a policy against your own representative documents.

Is the validation reversible or does it change the value?
Neither. A validator only decides whether a match is kept or dropped. It never alters the text. How a kept value is transformed is controlled separately by the filter strategy (redact, mask, encrypt, and so on).

Are the validators the same across Phileas languages?
Yes. The validators are implemented identically in Phileas for Java, Python, and .NET, so a policy behaves the same whether you call the HTTP API, embed the library, or run it in a .NET service.