What this policy does
Tuned specifically for the AI training and RAG ingestion use case, which has different priorities than other redaction scenarios:
- Bias toward over-redaction. Once data enters model weights or a vector store, it’s effectively unrecoverable. Catching a false positive (extra redacted token) is cheap; missing a true positive (PII baked into the model) is expensive. Name confidence threshold is > 55 (looser than general-purpose) for higher recall.
- Semantic tokens, not asterisks. Replaces PII with [PERSON], [ORG], [LOCATION], etc. This preserves grammatical structure, so the model still learns “patient X was treated by physician Y at facility Z” rather than just learning that asterisks appear randomly.
- Aggregates similar types. physicianName and personsName both map to [PERSON]. hospital maps to [ORG]. Reduces vocabulary fragmentation in the trained model.
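To make the token replacement and type aggregation concrete, here is a minimal Python sketch. It is not the policy engine's implementation: the `TYPE_TO_TOKEN` table, the `span_of` and `redact` helpers, the `[REDACTED]` fallback, and the sample sentence are all hypothetical. It assumes a detector has already produced non-overlapping spans.

```python
# Hypothetical mapping from fine-grained detector labels to shared tokens.
# Only the physicianName/personsName/hospital rows come from the policy text;
# the "location" key is an assumption for illustration.
TYPE_TO_TOKEN = {
    "physicianName": "[PERSON]",
    "personsName": "[PERSON]",
    "hospital": "[ORG]",
    "location": "[LOCATION]",
}


def span_of(text, phrase, entity_type):
    """Locate a phrase and return a (start, end, entity_type) span."""
    start = text.index(phrase)
    return (start, start + len(phrase), entity_type)


def redact(text, spans):
    """Replace each span with its semantic token.

    Assumes spans do not overlap. Unknown types fall back to a generic
    token (hypothetical) so nothing is left in the clear.
    """
    pieces = []
    cursor = 0
    for start, end, entity_type in sorted(spans):
        pieces.append(text[cursor:start])
        pieces.append(TYPE_TO_TOKEN.get(entity_type, "[REDACTED]"))
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


text = "Jane Roe was treated by Dr. Alan Smith at Mercy General."
spans = [
    span_of(text, "Jane Roe", "personsName"),
    span_of(text, "Dr. Alan Smith", "physicianName"),
    span_of(text, "Mercy General", "hospital"),
]
print(redact(text, spans))
# [PERSON] was treated by [PERSON] at [ORG].
```

Both name types collapse into one token, so the sentence keeps its shape while the trained vocabulary stays small.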
When to use this
- Pre-training corpus cleanup for healthcare, finance, legal, or other domain-specific LLMs
- RAG ingestion pipelines — redact source documents before chunking and embedding
- Fine-tuning dataset preparation — clean conversation logs, support tickets, etc. before SFT
- Synthetic data generation seed — strip PII from real examples before using them as templates for generating synthetic training data
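For the RAG ingestion case, the ordering (redact first, then chunk, then embed) can be sketched as follows. This is an illustration under assumptions, not a reference pipeline: `toy_redact` is a regex stand-in for a real redaction call, `[PHONE]` is a hypothetical token, and `chunk` is a naive fixed-width splitter.

```python
import re
from typing import Callable, Iterable, List

# Stand-in redactor (assumption): a real pipeline would call the redaction
# service with this policy instead of using a single regex.
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def toy_redact(text: str) -> str:
    return PHONE.sub("[PHONE]", text)


def chunk(text: str, size: int) -> List[str]:
    """Naive fixed-width chunking, for illustration only."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def prepare_chunks(docs: Iterable[str],
                   redact: Callable[[str], str],
                   size: int) -> List[str]:
    """Redact each whole document, then chunk the redacted text."""
    chunks: List[str] = []
    for doc in docs:
        clean = redact(doc)
        chunks.extend(chunk(clean, size))
    return chunks  # hand these to the embedding step


doc = "Call the clinic at 555-867-5309 to reschedule."
chunks = prepare_chunks([doc], toy_redact, size=24)

# Chunking first splits the phone number across a boundary in this example,
# so the stand-in pattern matches neither piece and the number survives.
chunk_first = [toy_redact(c) for c in chunk(doc, 24)]
```

Redacting the whole document before chunking gives the detector full context and avoids cutting an entity in half at a chunk boundary.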
When NOT to use this
- Production inference output redaction — too aggressive; will redact things the model legitimately needs (e.g., a chatbot answering “what’s the address of your nearest store”). For inference-time redaction, use Philter AI Proxy with a lighter policy.
- External data publication — does not meet HIPAA Safe Harbor (keeps some semantic structure that could enable re-identification). For publication, use hipaa-safe-harbor.json.
- Legal/court filings — wrong tool. Use legal/ policies instead.
When to customize
- Token vocabulary. If your training framework expects specific tokens (e.g., spaCy’s PER and ORG, BERT’s [PERSON]), adjust redactionFormat accordingly. Consistency with the downstream tokenizer matters.
- Confidence threshold. > 55 is loose. For very large training corpora where false positives are diluted, drop to > 45 for maximum recall. For smaller datasets where each over-redaction hurts more, raise to > 70.
- Domain-specific entities. Add custom identifiers patterns for entities specific to your domain (drug names, ICD codes, legal citations, ticker symbols). Decide whether to redact or preserve based on whether they’re identifying.
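The effect of moving the threshold, and the shape of a custom domain pattern, can be sketched in Python. Everything here is hypothetical: the `Detection` class, the sample confidences, and the simplified ICD-10-style regex are illustrations, not the policy schema. The sketch assumes confidences are on a 0-100 scale, matching the > 55 style conditions.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Detection:
    text: str
    entity_type: str
    confidence: float  # assumed 0-100 scale


def to_redact(detections: List[Detection], threshold: float) -> List[Detection]:
    """Keep detections whose confidence is strictly above the threshold."""
    return [d for d in detections if d.confidence > threshold]


# Made-up detector output for illustration.
detections = [
    Detection("Alan Smith", "personsName", 92.0),
    Detection("Mercy", "personsName", 61.0),
    Detection("May", "personsName", 48.0),
]

for threshold in (45, 55, 70):
    redacted = [d.text for d in to_redact(detections, threshold)]
    print(f"> {threshold}: {redacted}")

# Simplified ICD-10-style pattern (assumption, not a full validator):
# a letter other than U, a digit, an alphanumeric, optional ".xxxx" suffix.
ICD10 = re.compile(r"\b[A-TV-Z]\d[0-9A-Z](?:\.[0-9A-Z]{1,4})?\b")

codes = ICD10.findall("Diagnosis E11.9 and I10 recorded.")
print(codes)
```

Lowering the threshold pulls in the ambiguous token "May" (more recall, more over-redaction), and raising it drops "Mercy". Whether a matched code is redacted or preserved remains a policy decision, as noted above.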
Why this matters
“Beyond Regex: Why General LLMs Fail at PII Discovery” covers the failure modes of relying on the LLM itself to handle PII. The short version: production-grade redaction needs to happen before PII reaches the model, not as a post-hoc filter or in-context instruction.
A model trained on un-redacted PII can memorize it. Studies have demonstrated extraction attacks recovering specific phone numbers, addresses, and SSNs from production LLMs trained on web-scraped data. Redacting data before it is used for training is the only reliable defense.