National and financial ID numbers are some of the hardest values to redact cleanly. A regular expression can match their shape, but it cannot tell a real identifier from a number that merely looks like one. A pattern for a nine-digit Canadian Social Insurance Number also matches a tracking number, a padded order number, and any other nine-digit run. Redact all of them and you damage documents; redact none and you leak a SIN.
Most national and financial identifiers solve this for you. They carry a check digit, a built-in self-test that a genuine value satisfies and a random look-alike almost never does. This guide shows how to use those checks in a redaction policy with the validator field, so a broad pattern becomes a precise one with no extra code. If you have not written a policy before, start with Writing your first redaction policy.
The recipe: anchor, capture, validate
A production-grade identifier policy has three parts, each tightening precision over the one before.
- A pattern that matches the format of the ID.
- A capture group so you anchor on a nearby cue word but redact only the value.
- A validator that keeps a match only if its check digit passes.
Here is a Canadian SIN identifier with all three. It looks for a cue such as “SIN” or “Social Insurance Number,” captures the nine-digit value after it, and validates that value with the Luhn checksum:
{
  "identifiers": {
    "identifiers": [
      {
        "classification": "canada-sin",
        "pattern": "\\b(?:SIN|Social Insurance Number)[\\s:#-]*((?:\\d{3}[ -]?){2}\\d{3})\\b",
        "groupNumber": 1,
        "validator": "luhn",
        "identifierFilterStrategies": [
          { "strategy": "REDACT", "redactionFormat": "[REDACTED-CANADA-SIN]" }
        ]
      }
    ]
  }
}
What each part buys you:
- The pattern alone would match any nine-digit run, including reference numbers.
- The capture group (groupNumber: 1) means the cue word “SIN” anchors the match but is not redacted. Only the number is. This cuts matches that share the digit shape but appear in unrelated context.
- The validator drops values that fail the checksum. 046 454 286 is Luhn-valid and is redacted; 123 456 789 has the same shape, fails the check, and is left in place. Without the validator, both would be redacted.
You can use any of the three on their own. A bare pattern plus a validator is already far more precise than a pattern alone. Adding the cue-word capture group is the extra step that pays off in noisy free text.
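To make the three parts concrete, here is a minimal Python sketch of anchor, capture, validate. It is an illustration only, not the redaction engine's implementation: the function names luhn_valid and redact_sins are hypothetical, and the regular expression is the policy's pattern rewritten as a Python raw string.

```python
import re

# The policy's "pattern" value, written as a Python raw string.
SIN_PATTERN = re.compile(
    r"\b(?:SIN|Social Insurance Number)[\s:#-]*((?:\d{3}[ -]?){2}\d{3})\b"
)


def luhn_valid(value: str) -> bool:
    """Return True if the ASCII digits in `value` pass the mod-10 Luhn check."""
    digits = [int(ch) for ch in value if "0" <= ch <= "9"]  # separators ignored
    if not digits:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def redact_sins(text: str, label: str = "[REDACTED-CANADA-SIN]") -> str:
    """Replace only the captured value, and only when it passes the check."""

    def replace(match: re.Match) -> str:
        if not luhn_valid(match.group(1)):
            return match.group(0)  # same shape, failed checksum: leave in place
        whole = match.group(0)
        start = match.start(1) - match.start(0)
        end = match.end(1) - match.start(0)
        return whole[:start] + label + whole[end:]  # the cue word is kept

    return SIN_PATTERN.sub(replace, text)


print(redact_sins("SIN: 046 454 286 on file"))
print(redact_sins("SIN: 123 456 789 on file"))
```

The first call redacts the number and keeps the cue; the second leaves the text unchanged because 123 456 789 fails the Luhn check.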
Identifiers covered out of the box
The built-in validators cover the checksum and structural algorithms behind a range of national and financial identifiers. Each row links to a ready-made, validator-backed policy in the policy library that accepts formatted and unformatted values and ships with example input and output.
| Region | Identifiers | Validator(s) | Ready-made policy |
|---|---|---|---|
| Canada | Social Insurance Number (SIN) | luhn | Canadian SIN |
| Brazil | CPF, CNPJ | mod11 (variant cpf or cnpj) | Brazilian identifiers |
| France | NIR / INSEE, SIREN, SIRET | mod97 (variant nir), luhn | French identifiers |
| Germany | Steuer-ID, Personalausweis | de-steuerid, de-personalausweis | German identifiers |
| Spain | DNI, NIE, CIF | mod23-letter, es-cif | Spanish identifiers |
| Banking | IBAN, SWIFT / BIC | mod97 (variant iban), bic-structural | SWIFT/BIC codes |
The validator reference
A validator is written one of two ways. Validators that take no parameters can be a plain string:
"validator": "luhn"
Validators that take parameters use the object form with a name and params:
"validator": { "name": "mod11", "params": { "variant": "cpf" } }
The complete set of built-in validators:
| Validator | Parameters | Covers |
|---|---|---|
| luhn | none | Standard mod-10 Luhn checksum. Canadian SIN, French SIREN and SIRET, and other Luhn-checked numbers. Separators are ignored, so a value may be formatted or not. |
| mod11 | variant: cpf or cnpj | Weighted mod-11 check digits for the Brazilian CPF (11 digits) and CNPJ (14 digits). |
| mod97 | variant: iban or nir; nir also takes substitutions | Value mod 97. iban is ISO 13616 (MOD-97-10); nir is the French INSEE/NIR key, with Corsica department substitutions (2A, 2B) configurable. |
| mod23-letter | substitutions (optional) | Control letter from a 23-entry table for the Spanish DNI and NIE. The NIE leading letter mapping is configurable. |
| es-cif | none | The Spanish CIF organization tax ID, whose control character may be a digit or a letter. |
| de-steuerid | none | The German tax ID (Steuer-ID / IdNr): 11 digits, structural digit-repetition rule, and ISO/IEC 7064 MOD 11,10 check digit. |
| de-personalausweis | none | ICAO 9303 check-digit validation for the German ID card (Personalausweis) number. |
| bic-structural | none | Structural check for a SWIFT/BIC code (ISO 9362). It has no checksum, so this validates length and segments and that the country segment is a valid ISO 3166-1 alpha-2 code. |
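For a sense of what two of these checks compute, the following Python sketch implements the publicly documented IBAN mod-97 test and the Spanish DNI/NIE control letter. It is a rough illustration under stated assumptions, not the built-in validators' code: the function names are hypothetical, the NIE mapping is the conventional X, Y, Z to 0, 1, 2 (the built-in validator lets you configure it), and per-country IBAN length rules are not checked.

```python
import string

ALPHANUM = string.digits + string.ascii_uppercase
DNI_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"     # 23-entry control-letter table
NIE_PREFIX = {"X": "0", "Y": "1", "Z": "2"}  # conventional leading-letter mapping


def iban_mod97_valid(iban: str) -> bool:
    """Mod-97 test: move the first four characters to the end, map letters
    to numbers (A=10 ... Z=35), and require the result mod 97 to equal 1."""
    compact = "".join(iban.split()).upper()
    if len(compact) < 5 or any(ch not in ALPHANUM for ch in compact):
        return False
    rearranged = compact[4:] + compact[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)  # base 36: A -> 10
    return int(numeric) % 97 == 1


def dni_nie_valid(value: str) -> bool:
    """Control letter is the table entry at (eight-digit number mod 23)."""
    compact = value.replace("-", "").replace(" ", "").upper()
    if len(compact) != 9:
        return False
    body, letter = compact[:8], compact[8]
    if body[0] in NIE_PREFIX:  # an NIE starts with X, Y, or Z
        body = NIE_PREFIX[body[0]] + body[1:]
    if any(ch not in string.digits for ch in body):
        return False
    return DNI_LETTERS[int(body) % 23] == letter


print(iban_mod97_valid("GB82 WEST 1234 5698 7654 32"))
print(dni_nie_valid("12345678Z"), dni_nie_valid("X1234567L"))
```

The sample values are widely published documentation examples rather than real accounts or identities.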
A validator name the build does not implement is a policy error, not a silent no-op, so a policy can never quietly skip a check it asked for. And because a policy only ever names a built-in validator, there is no executable code in the policy itself: nothing to review for arbitrary-code-execution risk, and the same behavior on every runtime. The names are defined in the redaction policy schema; see the schema guide for where validator sits in a policy.
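The two ways of writing a validator, and the rule that an unknown name is an error, can be sketched in a few lines of Python. This is a hypothetical illustration of the behavior described above, not the product's loader: normalize_validator and KNOWN_VALIDATORS are invented names, and the set is simply copied from the table in this guide.

```python
# Assumption: the names listed in this guide's validator table.
KNOWN_VALIDATORS = {
    "luhn", "mod11", "mod97", "mod23-letter",
    "es-cif", "de-steuerid", "de-personalausweis", "bic-structural",
}


def normalize_validator(spec):
    """Return (name, params) for either the string or the object form."""
    if isinstance(spec, str):
        name, params = spec, {}
    elif isinstance(spec, dict):
        name, params = spec.get("name"), dict(spec.get("params", {}))
    else:
        raise ValueError(f"validator must be a string or an object, got {type(spec).__name__}")
    if name not in KNOWN_VALIDATORS:
        # Fail loudly instead of silently skipping the check.
        raise ValueError(f"unknown validator: {name!r}")
    return name, params


print(normalize_validator("luhn"))
print(normalize_validator({"name": "mod11", "params": {"variant": "cpf"}}))
```

Raising on an unknown name mirrors the guarantee above: a typo such as "luhm" stops the policy from loading rather than disabling validation.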
A parameterized example: Brazilian CPF
The mod-11 family selects its scheme with a variant. Here is a CPF identifier that anchors on the “CPF” cue, captures the value, and validates it as a CPF:
{
  "identifiers": {
    "identifiers": [
      {
        "classification": "br-cpf",
        "pattern": "CPF[\\s:#-]*(\\d{3}\\.?\\d{3}\\.?\\d{3}-?\\d{2})",
        "groupNumber": 1,
        "validator": { "name": "mod11", "params": { "variant": "cpf" } },
        "identifierFilterStrategies": [
          { "strategy": "REDACT", "redactionFormat": "[REDACTED-CPF]" }
        ]
      }
    ]
  }
}
Swap "variant": "cpf" for "variant": "cnpj" and widen the pattern to fourteen digits to redact CNPJ company registrations instead.
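For readers curious about what the cpf variant checks, this Python sketch follows the publicly documented CPF algorithm: two check digits, each a weighted sum taken mod 11. It is an assumption-laden illustration, not the built-in validator; cpf_valid is a hypothetical name, and any extra rules the real validator may apply (for example, rejecting repeated-digit values) are not modeled.

```python
def cpf_valid(value: str) -> bool:
    """Return True if both CPF check digits match the first nine digits."""
    digits = [int(ch) for ch in value if "0" <= ch <= "9"]  # drop dots and dash
    if len(digits) != 11:
        return False

    def check_digit(prefix):
        # Weights count down from len(prefix) + 1 to 2.
        weights = range(len(prefix) + 1, 1, -1)
        total = sum(d * w for d, w in zip(prefix, weights))
        remainder = (total * 10) % 11
        return 0 if remainder == 10 else remainder

    first_ok = check_digit(digits[:9]) == digits[9]
    second_ok = check_digit(digits[:10]) == digits[10]
    return first_ok and second_ok


print(cpf_valid("529.982.247-25"))
print(cpf_valid("529.982.247-26"))
```

The sample value is a commonly used documentation example. Changing its last digit breaks the second check digit, so the second call reports a failure.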
What a checksum does and does not buy you
A validator narrows false positives. It does not eliminate them, and it does not make detection a guarantee. A checksum-valid number is not certain to be a real, issued identifier: a random nine-digit value passes the Luhn check roughly one time in ten. So a validator makes a generic identifier meaningfully more precise, and for high-stakes free text you can tighten it further by anchoring on a nearby cue word and redacting only the captured value, as the examples here do. Detection is probabilistic, and you remain responsible for validating any policy against your own representative documents before relying on it.
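The one-in-ten figure is easy to check yourself. The sketch below counts Luhn-valid values among the first 100,000 zero-padded nine-digit strings; luhn_valid and count_luhn_valid are illustrative names, not part of any policy API.

```python
def luhn_valid(value: str) -> bool:
    """Return True if the ASCII digits in `value` pass the mod-10 Luhn check."""
    digits = [int(ch) for ch in value if "0" <= ch <= "9"]
    if not digits:
        return False
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit, counting from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def count_luhn_valid(limit: int, width: int = 9) -> int:
    """Count Luhn-valid values among 0..limit-1, zero-padded to `width` digits."""
    return sum(luhn_valid(f"{n:0{width}d}") for n in range(limit))


sample = 100_000
valid = count_luhn_valid(sample)
print(valid, valid / sample)  # 10000 0.1
```

For any fixed first eight digits, exactly one final digit satisfies the check, which is why the count comes out at one tenth. That is the residual false-positive rate a cue word helps reduce further.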
Where to go next
- The policy library has ready-made, validator-backed policies for every ID in the table above. Download one and adapt it.
- The redaction policy schema guide is the field-by-field reference for the policy JSON, including where validator sits in a custom identifier.
- New to policies? Writing your first redaction policy walks from an empty file to a working one.
- For the background on why checksum validation matters, see Cutting false positives on national IDs.