The Philterd Blog
Deep dives into data de-identification, open-source privacy, and the future of self-hosted compliance.
Start here
Six posts, from "what is PII" to shipping redaction in production. New here? Start at the top. Evaluating? The middle ones do the heavy lifting.
- Start here
Introducing PhiSQL: The Query Language for PII Operations
What PII actually is (and isn't): the operational definition engineers and compliance teams can both work from.
- The architectural case
Why API-Based Redaction is a Security Antipattern
Why sending sensitive data through a managed redaction API is a deeper mistake than it looks, and what to do instead.
- Evaluation
The TCO of "Free" Cloud PII Redaction: AWS Comprehend, Google DLP, vs Self-Hosted at Scale
A worked-example TCO comparison of AWS Comprehend, Google DLP, and self-hosted Philter at production volumes. The break-even sits closer than most teams expect.
- Evaluation
Beyond Regex: Why General LLMs Fail at PII Discovery
The technical case for purpose-built NLP lenses over generic LLMs: what accuracy looks like when your models were trained for the job.
- Implementation
Automating HIPAA Safe Harbor: A Blueprint for Healthcare Data Pipelines
A concrete blueprint for automating HIPAA Safe Harbor de-identification across healthcare data pipelines: all 18 protected identifiers, end to end.
- Implementation
Building a Privacy-Aware RAG System
How to keep PII out of vector stores, retrieval contexts, and the LLM's response: the architecture pattern for production RAG on regulated data.
Introducing PhiSQL: The Query Language for PII Operations
PhiSQL is a declarative, SQL-like query language for PII privacy operations across the Philterd toolkit. The problem it solves and what ships in v0.1.
PhEye Update: Unified Branch, GPU Support, and Streamlined Testing
PhEye consolidates all model branches into one main branch, adds GPU-accelerated Docker images, and ships a one-command smoke test for every model variant.
Introducing Arbiter: Human-in-the-Loop PII Redaction
Automation handles most of the volume; humans handle the last few percent. Arbiter is the open source review surface that bridges the two, built on Philter.
The TCO of "Free" Cloud PII Redaction: AWS Comprehend, Google DLP, vs Self-Hosted at Scale
Per-character SaaS pricing looks cheap at demo scale and costly in production. A TCO comparison of AWS Comprehend, Google Cloud DLP, and self-hosted Philter.
The Phileas Trino Connector: PII Redaction as SQL
A walkthrough of the open source Phileas connector for Apache Trino: what it does, how to install it, how to redact from SQL queries, and federated patterns.
Redaction for Legal and E-Discovery: Privilege, Rule 9037, and the In-House Counsel's Pipeline
How automated redaction fits legal workflows: court filings, e-discovery production, privilege review, and M&A due diligence for in-house counsel.
The Ethics of Training: Why We Use Synthetic Data
A privacy tool should never be trained on the data it protects. Why Philterd's models are built entirely on synthetic data, and what that means for compliance.
Building a HIPAA-Compliant Medical Chatbot
Why generic RAG chatbots fail HIPAA, and a blueprint for building a medical chatbot that satisfies Safe Harbor at ingestion, retrieval, and inference.
Building a Privacy-Aware RAG System
RAG pipelines have two distinct PII leak vectors: ingestion and inference. A defense-in-depth blueprint with code, using Philter and the Philter AI Proxy.
Redaction for Financial Services: PCI DSS, GLBA, and the Real-World Data Pipeline
A practitioner's guide to redacting NPPI and cardholder data in financial workflows, mapping PCI DSS, GLBA, and state requirements to the Philterd toolkit.
Architecting Privacy in Kafka: Real-Time Redaction for Streaming Data
Three battle-tested patterns for redacting PII inside Apache Kafka pipelines: Phileas as an embedded library, Philter over HTTP, or a Kafka Connect transform.
Beyond Regex: Why General LLMs Fail at PII Discovery
Regex misses context; general LLMs over-redact and burn GPUs. The right answer is hybrid: pattern matching for the deterministic, specialized AI for the rest.
Compliance as Code: Integrating Philter into Your CI/CD Pipeline
Treat data privacy like a unit test. Wire Philter into GitHub Actions, GitLab CI, and pre-commit hooks so PII leaks fail the build, not production.
Migrating from AWS Comprehend to Philter: A Practical Transition Guide
A side-by-side guide for teams migrating PII detection from AWS Comprehend to self-hosted Philter: API translations, code samples, and a shadow-mode cutover.
Redaction for Insurance: Claims, Customer Data, and the State-by-State Patchwork
Insurance carriers sit at the intersection of GLBA, HIPAA, state rules, and the NAIC Model Law. A guide to redacting NPPI and PHI in claims and adjuster notes.
Open Source vs. Black Box: Why You Can't Afford "Trust Me" Privacy
For a CISO, "trust me" is not a strategy. Why auditable open source is the new enterprise standard for PII redaction, and what it means for compliance.
Automating HIPAA Safe Harbor: A Blueprint for Healthcare Data Pipelines
How the Philterd suite maps to the 18 HIPAA Safe Harbor identifiers, with a deployment blueprint for patient data lakes, research pipelines, and medical RAG.
Privacy Shouldn't Be a Guessing Game: Evaluating Redaction with Philter Scope
Stop hoping your redaction works. Philter Scope turns precision, recall, and F1 into a measurable, auditable health score for any redaction pipeline.
Why API-Based Redaction is a Security Antipattern
Sending sensitive data to a third-party redaction API opens the security holes you are trying to close. Why data sovereignty needs a self-hosted engine.
Redaction for Government and Federal Workloads: FedRAMP, CMMC, ITAR, and the Air-Gap Imperative
Why most commercial PII redaction tools fail federal workloads, and how Philterd's self-hosted architecture maps to FedRAMP, CMMC, ITAR, and air-gapped needs.
Deploying Philter in Air-Gapped Environments
Deploy Philter and the full Philterd toolkit into completely offline VPCs and disconnected cloud regions. No phone-home, no telemetry, no external dependencies.
From Phileas to Philter: The Evolution of Our Open Source Engine
How a focused open source experiment grew into the engine behind a full enterprise PII suite, and why both Phileas and Philter still ship independently.
Redaction for Education: FERPA, Student Records, and Research Data Pipelines
FERPA governs student records but rarely gets the attention HIPAA does. A practitioner's guide for university IT, edtech vendors, and research-data teams.
Snowflake PII Redaction: A Practical Integration Guide
Three production-grade patterns for redacting PII inside Snowflake: external functions, Java UDFs, and ETL-stage redaction, with code and trade-offs.
Philter 3.1.0
Philter 3.1.0 is now on the AWS, Google Cloud, and Azure marketplaces. Built on Phileas 2.12.0 with filter priorities, zip-code validation, and context windows.
Phileas 2.12.0
Phileas 2.12.0, the popular open source redaction library, is released. A new Philter built on it is coming to the AWS, Google Cloud, and Azure marketplaces.
Why Using an LLM to Redact PII and PHI is a Bad Idea
Lots of posts show how to redact PII and PHI text with a large language model (LLM). Can we really just let an LLM handle it? Here is why that is a bad idea.
Shielding Your Search: Redacting PII and PHI in OpenSearch with Phinder
Phinder is an open source OpenSearch plugin built on Phileas that redacts and de-identifies PII and PHI in your search results, keeping sensitive data private.
Phileas 2.10.0
Phileas 2.10.0 is released. This version removes the commons-csv and guava dependencies, adds a bloom filter, updates pdfbox to 3.0, and adds fixes.
Phileas in Graylog – Removing PII from Logs
Graylog has integrated Phileas, the open source PII and PHI redaction engine, into its log management platform to identify and redact sensitive data in logs.
Phileas 2.9.1
Phileas 2.9.1 is released: a new line separator in LineWidthSplitService, empty ph-eye spans no longer signal failure, and a default PhEyeConfiguration value.
Automatically Redacting PII and PHI from Files in Amazon S3 using Amazon Macie and Philter
Use Amazon Macie to find sensitive data in S3, then automatically redact PII and PHI such as SSNs and phone numbers from those files with Philter.
Philter as an AI Policy Layer
An AI policy layer inspects AI-generated text to prevent sensitive information from being exposed, removing names, addresses, telephone numbers, and more.
Phileas: The Open Source PII and PHI redaction engine
Introducing Phileas, the open source PII and PHI redaction engine, now available under the Apache license on GitHub. It powers both Philter and Phirestream.