PHI redaction is the process of finding and removing Protected Health Information from text. When every required identifier is removed, the data that remains falls outside HIPAA scope. Philterd builds open source, self-hosted PHI redaction software that runs inside your own network, so clinical notes, claims, intake forms, and patient messages are de-identified before they ever reach a third party.
What counts as PHI
Under the HIPAA Safe Harbor method (45 CFR 164.514(b)(2)), data is de-identified only when all 18 identifier types are removed: names, geographic subdivisions smaller than a state, all elements of dates (except year) tied to an individual, including ages over 89, phone and fax numbers, email addresses, Social Security numbers, medical record numbers, health plan beneficiary numbers, account numbers, certificate and license numbers, vehicle and device identifiers, URLs and IP addresses, biometric identifiers, full-face photographs, and any other unique identifying number or code. A single clinical note can carry a dozen of these in a few sentences, which is why manual review does not scale.
How PHI redaction works
Effective PHI redaction combines two detection methods. Pattern-based rules (regex, checksums, and format validators) catch structured identifiers like MRNs, SSNs, and account numbers with high precision. NLP models catch the unstructured identifiers, including physician and patient names, hospital names, and locations, that no pattern can reliably match. Philter applies both, then redacts, masks, encrypts, or replaces each detection according to a policy you control. The practical trade-offs are covered in using an LLM or pattern-based rules for PII and PHI redaction.
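The pattern-based half of that pipeline can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Philter's implementation or API: the `Span` type, the `detect` and `apply_policy` functions, the MRN format (the literal label `MRN` followed by 6 to 10 digits), and the `POLICY` layout are all hypothetical. The SSN validator applies the published rules that an area of 000, 666, or 900-999, a group of 00, or a serial of 0000 is never issued.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One detected identifier (hypothetical type for this sketch)."""
    start: int
    end: int
    kind: str
    text: str


# Structured identifiers: a format pattern plus a validator where one exists.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")
# Assumed MRN format for illustration only; real formats vary by institution.
MRN_RE = re.compile(r"\bMRN[:#]?\s*(\d{6,10})\b")


def valid_ssn(area: str, group: str, serial: str) -> bool:
    """Reject number ranges that are never issued as SSNs."""
    if area in ("000", "666") or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


def detect(text: str) -> list[Span]:
    """Return pattern-based detections, ordered by position."""
    spans = []
    for m in SSN_RE.finditer(text):
        if valid_ssn(*m.groups()):
            spans.append(Span(m.start(), m.end(), "ssn", m.group()))
    for m in MRN_RE.finditer(text):
        # Only the digits are the identifier; the "MRN:" label is kept.
        spans.append(Span(m.start(1), m.end(1), "mrn", m.group(1)))
    return sorted(spans, key=lambda s: s.start)


# Hypothetical policy: identifier kind -> action.
POLICY = {"ssn": "redact", "mrn": "mask"}


def apply_policy(text: str, spans: list[Span], policy: dict[str, str]) -> str:
    """Rewrite each detection according to the policy; default is redact."""
    out, cursor = [], 0
    for s in spans:
        if s.start < cursor:  # skip detections overlapping an earlier one
            continue
        out.append(text[cursor:s.start])
        if policy.get(s.kind, "redact") == "mask":
            out.append("*" * len(s.text))
        else:
            out.append(f"[{s.kind.upper()}]")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)
```

Calling `apply_policy(note, detect(note), POLICY)` on a note masks the MRN digits and replaces a valid SSN with a `[SSN]` placeholder, while a string such as `000-12-3456` is left alone because it fails validation. Names, hospitals, and locations would need the NLP half, which this sketch does not attempt.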
Redaction is not the same as de-identification
Redacted PHI is still PHI: the data stays in HIPAA scope, and a BAA is still required for any vendor that handles it. De-identified data, whether by Safe Harbor or Expert Determination, is no longer PHI and leaves scope entirely. Most teams need both, applied to different workflows. Getting this line right is what makes a data pipeline defensible in an audit.
Self-hosted PHI redaction keeps data in your network
Redaction defines your compliance boundary, and the boundary is what an auditor examines. If clinical text has to leave your network to be redacted by a hosted API, the exposure has already happened. Philter and the Phileas library run inside your own VPC or on-premises, under your existing cloud BAA, so PHI is removed before anything crosses the wire. The detection logic is open source and auditable, not a black box, and policies are versioned artifacts, so “what did we redact, and how, on this date” is answerable with a diff.
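To make the "answerable with a diff" point concrete, here is a minimal sketch that assumes a policy can be represented as a JSON-serializable mapping kept under version control. The policy layout and the `policy_diff` helper are hypothetical and are not Philter's policy schema; the diffing itself uses Python's standard `difflib`.

```python
import difflib
import json


def policy_diff(old: dict, new: dict, old_label: str, new_label: str) -> list[str]:
    """Return a unified diff between two policy versions.

    Keys are sorted before comparison so that only real changes show up,
    not differences in key order.
    """
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return list(
        difflib.unified_diff(
            old_lines, new_lines, fromfile=old_label, tofile=new_label, lineterm=""
        )
    )


# Hypothetical policy versions, e.g. loaded from two tagged commits.
policy_v1 = {"mrn": "mask", "ssn": "mask"}
policy_v2 = {"mrn": "mask", "ssn": "redact"}

for line in policy_diff(policy_v1, policy_v2, "policy@v1", "policy@v2"):
    print(line)
```

The output shows the one changed rule, which is the kind of evidence an auditor can read without access to the redaction engine itself.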
Where teams apply it
- EHR exports and research pipelines. De-identify clinical narrative before it lands in the analytics warehouse. Start from the HIPAA Safe Harbor policy.
- Clinical chatbots and RAG. Philter AI Proxy strips PHI from prompts before they reach OpenAI, Anthropic, or Bedrock.
- Health-tech products. Embed redaction to give covered-entity customers an auditable answer on how you handle PHI.
See the full vertical picture in PHI redaction for healthcare, and how each capability maps to HIPAA and other regulations in the compliance matrix.
Get started
Philterd’s PHI redaction toolkit is free and open source under the Apache License, Version 2.0. Deploy Philter from your cloud marketplace into your own account, apply a HIPAA Safe Harbor policy, and measure precision and recall with Philter Scope before you wire it into production.
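For reference, precision and recall can be computed by comparing detected spans against a hand-labeled gold set. The sketch below shows only the standard definitions with exact span matching. It is not how Philter Scope works internally, and the `(start, end, kind)` tuple layout and the `precision_recall` function name are assumptions made for illustration.

```python
# A span is (start_offset, end_offset, identifier_kind); layout is hypothetical.
Span = tuple[int, int, str]


def precision_recall(predicted: set[Span], gold: set[Span]) -> tuple[float, float]:
    """Exact-match precision and recall of predicted spans against gold spans.

    precision = true positives / all predictions
    recall    = true positives / all gold annotations
    An empty denominator is treated as a perfect score of 1.0.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall
```

For redaction, recall is usually the number to watch first: a missed identifier is a leak, while a false positive only costs some readability. Exact matching is also strict, so an evaluation that should credit partial overlaps would need a different matching rule.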