AI Training · Philterd

LLM Training Data Preparation

Aggressive PII redaction for documents being fed into LLM training, fine-tuning, or RAG vector stores — preserves semantic structure with type tokens.

v1.0.0 · Updated 2026-05-18 · Philter >=3.0.0 · By Philterd

AI, LLM, fine-tuning, training data, RAG, ingestion

The policy

The full llm-training-data-prep.json file, identical to the downloadable version. Copy any part of it as needed.

{
  "config": {
    "splitting": {
      "enabled": true,
      "threshold": 8000
    }
  },
  "ignored": [],
  "identifiers": {
    "person": {
      "phEyeFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[PERSON]",
          "condition": "confidence > 55"
        }
      ]
    },
    "phoneNumber": {
      "phoneNumberFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[PHONE]"
        }
      ]
    },
    "emailAddress": {
      "emailAddressFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[EMAIL]"
        }
      ]
    },
    "ssn": {
      "ssnFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[SSN]"
        }
      ]
    },
    "creditCard": {
      "onlyValidCreditCardNumbers": true,
      "creditCardFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[CARD]"
        }
      ]
    },
    "ipAddress": {
      "ipAddressFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[IP]"
        }
      ]
    },
    "url": {
      "urlFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[URL]"
        }
      ]
    },
    "passportNumber": {
      "passportNumberFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[PASSPORT]"
        }
      ]
    },
    "driversLicense": {
      "driversLicenseFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[LICENSE]"
        }
      ]
    },
    "ibanCode": {
      "ibanCodeFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[IBAN]"
        }
      ]
    },
    "physicianName": {
      "physicianNameFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[PERSON]"
        }
      ]
    },
    "hospital": {
      "hospitalFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[ORG]"
        }
      ]
    },
    "city": {
      "cityFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[LOCATION]"
        }
      ]
    },
    "streetAddress": {
      "streetAddressFilterStrategies": [
        {
          "strategy": "REDACT",
          "redactionFormat": "[ADDRESS]"
        }
      ]
    }
  }
}

Example

Input

Patient John Smith was treated by Dr. Garcia at Mercy Hospital in Austin, TX. Contact: john@example.com or 555-867-5309.

Output

Patient [PERSON] was treated by [PERSON] at [ORG] in [LOCATION], TX. Contact: [EMAIL] or [PHONE].

Entities this policy acts on

PERSON, PHONE, EMAIL, SSN, CARD, IP, URL, PASSPORT, LICENSE, IBAN, ORG, LOCATION, ADDRESS

What this policy does

Tuned specifically for the AI training and RAG ingestion use case, which has different priorities than other redaction scenarios:

Bias toward over-redaction. Once data enters model weights or a vector store, it is effectively unrecoverable. A false positive (an extra redacted token) is cheap; a missed true positive (PII baked into the model) is expensive. The name confidence threshold is > 55, looser than a general-purpose policy would use, for higher recall.
Semantic tokens, not asterisks. Replaces PII with [PERSON], [ORG], [LOCATION], etc. This preserves grammatical structure, so the model still learns “patient X was treated by physician Y at facility Z” rather than learning that asterisks appear at random.
Aggregates similar types. physicianName and person both map to [PERSON]; hospital maps to [ORG]. This reduces vocabulary fragmentation in the trained model.
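
The difference between type tokens and asterisk masking can be illustrated with a small sketch. It does not use Philter; the detected spans are supplied by hand, and the helper names (span_of, apply_spans) are hypothetical, chosen only for this illustration.

```python
from typing import List, Tuple

# A detection is (start, end, token): character offsets plus the type token.
# These spans are hand-written stand-ins for what a redaction engine would find.
Span = Tuple[int, int, str]


def span_of(text: str, needle: str, token: str) -> Span:
    """Build a span for the first occurrence of needle in text."""
    start = text.index(needle)
    return (start, start + len(needle), token)


def apply_spans(text: str, spans: List[Span], use_type_tokens: bool = True) -> str:
    """Replace each span, working right to left so earlier offsets stay valid."""
    out = text
    for start, end, token in sorted(spans, key=lambda s: s[0], reverse=True):
        replacement = token if use_type_tokens else "*" * (end - start)
        out = out[:start] + replacement + out[end:]
    return out


text = "Patient John Smith was treated by Dr. Garcia at Mercy Hospital."
spans = [
    span_of(text, "John Smith", "[PERSON]"),
    span_of(text, "Dr. Garcia", "[PERSON]"),
    span_of(text, "Mercy Hospital", "[ORG]"),
]

typed = apply_spans(text, spans, use_type_tokens=True)
masked = apply_spans(text, spans, use_type_tokens=False)
print(typed)
print(masked)
```

The typed version still reads as a sentence about a person, a second person, and an organization, while the masked version carries no information about what was removed.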

When to use this

Pre-training corpus cleanup for healthcare, finance, legal, or other domain-specific LLMs
RAG ingestion pipelines — redact source documents before chunking and embedding
Fine-tuning dataset preparation — clean conversation logs, support tickets, etc. before SFT
Synthetic data generation seed — strip PII from real examples before using them as templates for generating synthetic training data
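
For the RAG ingestion case, a minimal sketch of "redact before chunking" might look like the following. It reuses the /api/filter endpoint and policy name from the curl example on this page; that the response body is the redacted text is an assumption, and the function names and the character-based chunking are illustrative choices, not part of Philter.

```python
import urllib.parse
import urllib.request
from typing import Callable, Iterable, Iterator


def redact_via_philter(
    text: str,
    base_url: str = "http://localhost:8080",
    policy: str = "llm-training-data-prep",
) -> str:
    """POST text to the filter endpoint shown in the curl example.

    Assumption: the response body is the redacted text, encoded as UTF-8.
    """
    query = urllib.parse.urlencode({"p": policy})
    request = urllib.request.Request(
        f"{base_url}/api/filter?{query}",
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")


def redact_then_chunk(
    documents: Iterable[str],
    redact: Callable[[str], str],
    chunk_size: int = 1000,
) -> Iterator[str]:
    """Redact each whole document first, then split it into fixed-size chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for document in documents:
        clean = redact(document)
        for start in range(0, len(clean), chunk_size):
            yield clean[start:start + chunk_size]


# Offline demonstration with a stand-in redactor, so no server is needed.
def stub_redact(text: str) -> str:
    return text.replace("john@example.com", "[EMAIL]")


chunks = list(
    redact_then_chunk(["Contact: john@example.com now"], stub_redact, chunk_size=10)
)
print(chunks)
```

With a running instance, passing redact_via_philter in place of stub_redact sends each document through the policy. Redacting the whole document before splitting avoids the case where an entity straddles a chunk boundary and is missed in both halves.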

When NOT to use this

Production inference output redaction: too aggressive; it will redact things the model legitimately needs (e.g., a chatbot answering “what’s the address of your nearest store”). For inference-time redaction, use Philter AI Proxy with a lighter policy.
External data publication: does not meet HIPAA Safe Harbor (it keeps some semantic structure that could enable re-identification). For publication, use hipaa-safe-harbor.json.
Legal/court filings: wrong tool. Use the legal/ policies instead.

When to customize

Token vocabulary. If your training framework expects specific tokens (e.g., spaCy-style labels such as PER and ORG, or bracketed placeholders such as [PERSON]), adjust redactionFormat accordingly. Consistency with the downstream tokenizer matters.
Confidence threshold. > 55 is loose. For very large training corpora where false positives are diluted, drop to > 45 for maximum recall. For smaller datasets where each over-redaction hurts more, raise to > 70.
Domain-specific entities. Add custom identifier patterns for entities specific to your domain (drug names, ICD codes, legal citations, ticker symbols). Decide whether to redact or preserve each one based on whether it is identifying.
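
The first two customizations can be scripted rather than edited by hand. The sketch below works on an excerpt of the policy above and relies only on the structure visible in that JSON (strategy lists under keys ending in FilterStrategies, a redactionFormat field, and a condition string of the form "confidence > N"). The function name, the replacement tokens, and the new threshold are hypothetical.

```python
import copy
import json
import re
from typing import Any, Dict, Optional

# Excerpt of the policy shown above, enough to demonstrate the rewrite.
policy = {
    "identifiers": {
        "person": {
            "phEyeFilterStrategies": [
                {
                    "strategy": "REDACT",
                    "redactionFormat": "[PERSON]",
                    "condition": "confidence > 55",
                }
            ]
        },
        "hospital": {
            "hospitalFilterStrategies": [
                {"strategy": "REDACT", "redactionFormat": "[ORG]"}
            ]
        },
    }
}


def customize_policy(
    source: Dict[str, Any],
    token_map: Dict[str, str],
    person_threshold: Optional[int] = None,
) -> Dict[str, Any]:
    """Return a copy with remapped tokens and, optionally, a new name threshold."""
    updated = copy.deepcopy(source)
    for name, identifier in updated.get("identifiers", {}).items():
        for key, strategies in identifier.items():
            if not key.endswith("FilterStrategies"):
                continue  # skip flags such as onlyValidCreditCardNumbers
            for strategy in strategies:
                fmt = strategy.get("redactionFormat")
                if fmt in token_map:
                    strategy["redactionFormat"] = token_map[fmt]
                if (
                    person_threshold is not None
                    and name == "person"
                    and "condition" in strategy
                ):
                    strategy["condition"] = re.sub(
                        r"confidence\s*>\s*\d+",
                        f"confidence > {person_threshold}",
                        strategy["condition"],
                    )
    return updated


custom = customize_policy(
    policy,
    token_map={"[PERSON]": "<PER>", "[ORG]": "<ORG>"},
    person_threshold=70,
)
print(json.dumps(custom, indent=2))
```

Working on a deep copy keeps the downloaded policy untouched, so the original and the tuned variant can be compared against the same sample documents.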

Why this matters

“Beyond Regex: Why General LLMs Fail at PII Discovery” covers the failure modes of relying on the LLM itself to handle PII. The short version: production-grade redaction needs to happen before PII reaches the model, not as a post-hoc filter or in-context instruction.

A model trained on un-redacted PII memorizes it. Studies have demonstrated extraction attacks recovering specific phone numbers, addresses, and SSNs from production LLMs trained on web-scraped data. Pre-training redaction is the only reliable defense.
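
Because PII that slips through is hard to remove later, a cheap spot check on the redacted output can flag obvious misses before data is embedded or used for training. The sketch below is not part of Philter and is not a substitute for it; the three patterns are deliberately simple, US-centric assumptions meant only as a final audit.

```python
import re
from typing import Dict, List

# Simple audit patterns (assumptions for illustration, not a complete detector).
RESIDUAL_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def find_residual_pii(text: str) -> Dict[str, List[str]]:
    """Return the matches per pattern name, omitting patterns with no matches."""
    findings: Dict[str, List[str]] = {}
    for name, pattern in RESIDUAL_PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[name] = matches
    return findings


raw = "Contact: john@example.com or 555-867-5309."
redacted = "Contact: [EMAIL] or [PHONE]."

print(find_residual_pii(raw))
print(find_residual_pii(redacted))
```

Any non-empty result on redacted text is a signal to inspect that document and, if needed, tighten the policy before the batch moves further down the pipeline.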

Use this policy

Download and load into your running Philter instance:

# Download the policy
curl -O https://raw.githubusercontent.com/philterd/pii-redaction-policies/main/policies/philterd/ai-training/llm-training-data-prep.json

# Upload to your Philter instance
curl -X POST http://localhost:8080/api/policies \
     -H "Content-Type: application/json" \
     --data @llm-training-data-prep.json

# Redact text using the policy
curl "http://localhost:8080/api/filter?p=llm-training-data-prep" \
     --data "your text here" \
     -H "Content-Type: text/plain"
