<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>The Philterd Blog on Philterd — Zero-Trust PII Redaction for Cloud and AI</title><link>https://www.philterd.ai/blog/</link><description>Recent content in The Philterd Blog on Philterd — Zero-Trust PII Redaction for Cloud and AI</description><generator>Hugo</generator><language>en-us</language><lastBuildDate>Sat, 16 May 2026 13:00:00 +0000</lastBuildDate><atom:link href="https://www.philterd.ai/blog/index.xml" rel="self" type="application/rss+xml"/><item><title>Introducing Arbiter: Human-in-the-Loop PII Redaction</title><link>https://www.philterd.ai/blog/introducing-arbiter-human-in-the-loop-pii-redaction/</link><pubDate>Sat, 16 May 2026 13:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/introducing-arbiter-human-in-the-loop-pii-redaction/</guid><description>&lt;p&gt;Automated PII redaction is genuinely great at most of the job. Pattern matching nails SSNs, credit cards, and phone numbers; &lt;a href="https://www.philterd.ai/ph-eye/"&gt;PhEye&lt;/a&gt;'s NLP models catch names, addresses, and organizations in unstructured text; &lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt;'s policy engine ties it all together. For the majority of detections, the automated layer is faster, more consistent, and more comprehensive than human review.&lt;/p&gt;

&lt;p&gt;And then there's the last few percent.&lt;/p&gt;

&lt;p&gt;The detection that fires on the phrase &lt;em&gt;"Huntington's disease"&lt;/em&gt; because "Huntington" is a name. The credit-card-shaped number that's actually a transaction ID. The patient name that's also a famous historical figure. The street address inside a quoted news article that's part of the public record. The mention of a person's name in a context where the policy says it shouldn't be redacted, or vice versa. These are the cases where automation can't decide on its own &amp;mdash; and where ignoring the ambiguity means either over-redaction (data utility destroyed) or under-redaction (compliance failure).&lt;/p&gt;</description></item><item><title>The TCO of "Free" Cloud PII Redaction: AWS Comprehend, Google DLP, vs Self-Hosted at Scale</title><link>https://www.philterd.ai/blog/tco-cloud-pii-redaction-aws-comprehend-google-dlp-vs-self-hosted/</link><pubDate>Thu, 14 May 2026 11:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/tco-cloud-pii-redaction-aws-comprehend-google-dlp-vs-self-hosted/</guid><description>&lt;p class="compare-note"&gt;&lt;strong&gt;Pricing disclaimer:&lt;/strong&gt; the dollar figures, per-unit rates, and tier discounts below reflect the published prices for AWS Comprehend, Google Cloud DLP, and the AWS Marketplace at the time of writing. Cloud providers update their pricing frequently &amp;mdash; sometimes monthly, sometimes via opaque enterprise discount programs. Always verify the current rates on each provider's pricing page before using these numbers for a budgeting decision. The relative shape of the comparison (per-volume vs. per-instance billing) is stable; the specific dollar amounts may not be.&lt;/p&gt;</description></item><item><title>What is PII? A Practical Guide for Engineers and Compliance Teams</title><link>https://www.philterd.ai/blog/what-is-pii-a-practical-guide-for-engineers-and-compliance-teams/</link><pubDate>Sun, 10 May 2026 09:30:00 +0000</pubDate><guid>https://www.philterd.ai/blog/what-is-pii-a-practical-guide-for-engineers-and-compliance-teams/</guid><description>&lt;p&gt;"PII" is one of those terms that everyone in technology uses and almost nobody defines the same way. Ask a compliance lawyer, a database engineer, and an ML scientist what PII is, and you'll get three different answers &amp;mdash; all of them partially right. The disagreement doesn't matter much in conversation, but it matters a great deal when someone has to actually &lt;em&gt;act&lt;/em&gt; on it: write a redaction policy, audit a data lake, or sign off on a release.&lt;/p&gt;</description></item><item><title>The Hidden Difficulties of Redacting PDF Documents</title><link>https://www.philterd.ai/blog/the-hidden-difficulties-of-redacting-pdf-documents/</link><pubDate>Sat, 09 May 2026 10:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/the-hidden-difficulties-of-redacting-pdf-documents/</guid><description>&lt;p&gt;"Just black it out" is one of the most dangerous sentences in document handling. A user opens a PDF in Acrobat, drops a few black rectangles over the sensitive bits, saves the file, and ships it. The visible result looks redacted. The actual file contains every original character of the supposedly-hidden text, sitting under the rectangle, fully recoverable in under five seconds with a copy-paste.&lt;/p&gt;

&lt;p&gt;This isn't a hypothetical. It's the source of &lt;a href="https://www.nytimes.com/2019/01/08/us/politics/manafort-russia-collusion-trump.html" target="_blank" rel="noopener noreferrer"&gt;some of the most famous PII leaks in the last 20 years&lt;/a&gt;. PDF redaction is genuinely hard &amp;mdash; harder than redaction of plain text, harder than redaction of Word documents, harder than most engineers realize the first time they try to do it.&lt;/p&gt;</description></item><item><title>Prompt Engineering for Privacy: Practical Patterns for Not Leaking PII</title><link>https://www.philterd.ai/blog/prompt-engineering-for-privacy/</link><pubDate>Wed, 06 May 2026 11:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/prompt-engineering-for-privacy/</guid><description>&lt;p&gt;The prompt-engineering literature is enormous and almost entirely focused on getting better answers. It says very little about a quieter problem: &lt;strong&gt;every prompt sent to a hosted LLM is a data egress point&lt;/strong&gt;. The text leaves your perimeter, lands on a provider's servers, sits in their logs for some retention period, and (depending on the provider and the agreement) may be used to train future models. The model might return your PII to another user. The provider might suffer a breach. Either way, the data is no longer yours.&lt;/p&gt;</description></item><item><title>The Phileas Trino Connector: PII Redaction as SQL</title><link>https://www.philterd.ai/blog/the-phileas-trino-connector-redaction-as-sql/</link><pubDate>Sun, 03 May 2026 12:30:00 +0000</pubDate><guid>https://www.philterd.ai/blog/the-phileas-trino-connector-redaction-as-sql/</guid><description>&lt;p&gt;Apache Trino (formerly PrestoSQL) is the federated query engine that more and more organizations are using to query across data lakes, warehouses, and relational sources without ETL'ing data into a single place. 
The architectural promise is "query data where it lives"; the privacy implication is that &lt;em&gt;sensitive data in any connected source&lt;/em&gt; can land in the result set of &lt;em&gt;any query a user runs&lt;/em&gt;. PII that was carefully gated in one system becomes exposed the moment Trino joins it with anything else.&lt;/p&gt;</description></item><item><title>Redaction for Legal and E-Discovery: Privilege, Rule 9037, and the In-House Counsel's Pipeline</title><link>https://www.philterd.ai/blog/redaction-for-legal-and-e-discovery-privilege-rule-9037-counsel-pipeline/</link><pubDate>Thu, 30 Apr 2026 14:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/redaction-for-legal-and-e-discovery-privilege-rule-9037-counsel-pipeline/</guid><description>&lt;p&gt;Legal work has more redaction in it than almost any other industry &amp;mdash; and far less automation than the volume justifies. Court filings get hand-redacted by paralegals with black markers. Discovery productions get scrubbed in Relativity by associates billing $400/hour to draw rectangles over names. M&amp;amp;A due diligence rooms get sanitized one document at a time. The result is a category that spends enormous sums on a problem that's largely solvable with software.&lt;/p&gt;</description></item><item><title>PII in Vector Embeddings: A Defense Guide</title><link>https://www.philterd.ai/blog/pii-in-vector-embeddings-a-defense-guide/</link><pubDate>Wed, 22 Apr 2026 13:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/pii-in-vector-embeddings-a-defense-guide/</guid><description>&lt;p&gt;"It's just an array of floats" is the most reassuring sentence a vector-store skeptic can hear &amp;mdash; and the most misleading. Sentence embeddings produced by modern models are partially invertible: an attacker with access to the embeddings (but not the original text) can reconstruct meaningful approximations of the source. 
For teams storing embeddings of sensitive data, this turns "we don't expose the raw text" from a complete defense into half of one.&lt;/p&gt;</description></item><item><title>The Ethics of Training: Why We Use Synthetic Data</title><link>https://www.philterd.ai/blog/the-ethics-of-training-why-we-use-synthetic-data/</link><pubDate>Sun, 19 Apr 2026 13:37:57 +0000</pubDate><guid>https://www.philterd.ai/blog/the-ethics-of-training-why-we-use-synthetic-data/</guid><description>&lt;p&gt;In cybersecurity, trust is easy to lose and nearly impossible to regain. As a decision-maker, you're constantly weighing the benefits of new AI tools against the risk of a headline-making data leak. Most AI companies ask for your trust while simultaneously asking for your data to "improve their models."&lt;/p&gt;

&lt;p&gt;At Philterd, we believe that's a fundamental conflict of interest. &lt;strong&gt;A privacy tool should never be trained on the very data it is meant to protect.&lt;/strong&gt; That is why we've built our intelligence on a foundation of &lt;em&gt;synthetic data&lt;/em&gt;.&lt;/p&gt;</description></item><item><title>Building a HIPAA-Compliant Medical Chatbot</title><link>https://www.philterd.ai/blog/building-a-hipaa-compliant-medical-chatbot/</link><pubDate>Tue, 14 Apr 2026 15:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/building-a-hipaa-compliant-medical-chatbot/</guid><description>&lt;p&gt;Every health system in 2026 is building or evaluating an internal medical chatbot &amp;mdash; a Q&amp;amp;A interface over chart notes, drug references, clinical guidelines, or operational documentation. The reasoning is straightforward: physicians and care teams spend hours a day searching for information that should be one question away. The technology is straightforward too: it's a RAG system.&lt;/p&gt;

&lt;p&gt;The HIPAA story, however, is decidedly not straightforward. A generic RAG chatbot built with off-the-shelf components fails HIPAA at multiple points, and the failures aren't subtle. This post is the architectural blueprint for the version that passes &amp;mdash; a medical chatbot that satisfies Safe Harbor de-identification at ingestion, defends against PHI leakage at retrieval and inference, and produces the audit artifacts your compliance team will be asked to show.&lt;/p&gt;</description></item><item><title>Building a Privacy-Aware RAG System</title><link>https://www.philterd.ai/blog/building-a-privacy-aware-rag-system/</link><pubDate>Wed, 08 Apr 2026 14:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/building-a-privacy-aware-rag-system/</guid><description>&lt;p&gt;Retrieval-augmented generation is the dominant pattern for enterprise AI in 2026. Every customer-support assistant, internal-docs Q&amp;amp;A bot, and clinical-summary tool you've seen pitched in the last twelve months is, under the hood, a RAG system. The architecture is straightforward: index your documents into a vector store, retrieve the top-K relevant chunks for a user query, paste them into an LLM prompt, return the answer.&lt;/p&gt;

&lt;p&gt;The privacy problem is also straightforward, and it's bigger than most teams realize. &lt;strong&gt;RAG has two distinct PII leak vectors&lt;/strong&gt; &amp;mdash; one at ingestion, one at inference &amp;mdash; and protecting against only one of them is the same as protecting against neither.&lt;/p&gt;</description></item><item><title>Redaction for Financial Services: PCI DSS, GLBA, and the Real-World Data Pipeline</title><link>https://www.philterd.ai/blog/redaction-for-financial-services-pci-dss-glba-real-world-pipelines/</link><pubDate>Wed, 25 Mar 2026 16:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/redaction-for-financial-services-pci-dss-glba-real-world-pipelines/</guid><description>&lt;p&gt;Financial services has the strictest data handling requirements outside of healthcare &amp;mdash; and arguably more enforcement teeth, because every bank regulator (OCC, CFPB, FTC, state AGs) has a different angle of attack. Where healthcare has the relative clarity of HIPAA's 18 Safe Harbor identifiers, finance has &lt;em&gt;multiple&lt;/em&gt; overlapping regulatory regimes (PCI DSS, GLBA, SOX, BSA/AML, Reg E, state privacy laws) and a customer-facing surface area (call centers, mobile apps, chat, email) that generates unstructured PII at industrial scale.&lt;/p&gt;</description></item><item><title>PII vs PHI vs NPPI: An Engineer's Guide</title><link>https://www.philterd.ai/blog/pii-vs-phi-vs-nppi-an-engineers-guide/</link><pubDate>Wed, 18 Mar 2026 10:30:00 +0000</pubDate><guid>https://www.philterd.ai/blog/pii-vs-phi-vs-nppi-an-engineers-guide/</guid><description>&lt;p&gt;Few three-letter combinations cause more confusion in data privacy than PII, PHI, and NPPI. They overlap, they get used interchangeably, and the regulatory implications of mixing them up are real.&lt;/p&gt;

&lt;p&gt;This is the short, definitional reference. One paragraph each, the regulatory framework that defines it, and the architectural implication for the engineer who has to do something about it.&lt;/p&gt;

&lt;h2&gt;PII: the umbrella&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Personally Identifiable Information&lt;/strong&gt; is any data that can be used &amp;mdash; on its own or in combination with other data &amp;mdash; to identify a specific person. Names, email addresses, SSNs, IP addresses, and biometric data are PII. So are quasi-identifiers like zip code + date of birth + gender, which together are sufficient to uniquely identify ~87% of the U.S. population.&lt;/p&gt;</description></item><item><title>Architecting Privacy in Kafka: Real-Time Redaction for Streaming Data</title><link>https://www.philterd.ai/blog/architecting-privacy-in-kafka-real-time-redaction-for-streaming-data/</link><pubDate>Wed, 11 Mar 2026 20:42:02 +0000</pubDate><guid>https://www.philterd.ai/blog/architecting-privacy-in-kafka-real-time-redaction-for-streaming-data/</guid><description>&lt;p&gt;Most organizations think about PII the same way they think about backups: as a thing they'll worry about once it lands in a database. But in a streaming architecture, that's already too late. The moment a message hits a Kafka topic, dozens of downstream consumers &amp;mdash; analytics jobs, search indexers, archival sinks, ML pipelines &amp;mdash; can read it. &lt;strong&gt;Once a single SSN reaches one downstream system, it's everywhere.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the "PII at rest" problem inverted. The right place to redact streaming data isn't downstream &amp;mdash; it's &lt;em&gt;in flight&lt;/em&gt;, between the producers and the first consumer that doesn't need the raw values. We learned this pattern years ago building Phirestream; today the same approach is implemented with &lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; and the underlying &lt;a href="https://www.philterd.ai/phileas/"&gt;Phileas&lt;/a&gt; library, depending on the constraints of your stack.&lt;/p&gt;</description></item><item><title>Beyond Regex: Why General LLMs Fail at PII Discovery</title><link>https://www.philterd.ai/blog/beyond-regex-why-general-llms-fail-at-pii-discovery/</link><pubDate>Wed, 04 Mar 2026 20:44:54 +0000</pubDate><guid>https://www.philterd.ai/blog/beyond-regex-why-general-llms-fail-at-pii-discovery/</guid><description>&lt;p&gt;Regex was never meant for the messy reality of human language. It's great at finding a 10-digit number that looks like a phone identifier, but it's famously terrible at telling you &lt;em&gt;why&lt;/em&gt; that number exists. On the flip side, we're now seeing companies try to throw massive, general-purpose LLMs at the problem. Those models are incredible conversationalists &amp;mdash; but using them for PII discovery is like using a sledgehammer for surgery.&lt;/p&gt;</description></item><item><title>Compliance as Code: Integrating Philter into Your CI/CD Pipeline</title><link>https://www.philterd.ai/blog/compliance-as-code-integrating-philter-into-your-ci-cd-pipeline/</link><pubDate>Wed, 25 Feb 2026 21:51:13 +0000</pubDate><guid>https://www.philterd.ai/blog/compliance-as-code-integrating-philter-into-your-ci-cd-pipeline/</guid><description>&lt;p&gt;Engineering teams shifted security left a decade ago: SAST scanners, dependency audits, and IaC linters all run in CI now, blocking the merge button when something's off. 
&lt;strong&gt;Privacy is the next thing to shift.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most organizations still treat PII leaks the same way they treat bugs in production &amp;mdash; surfaced by an incident, triaged by an SRE, written up in a postmortem. That's the most expensive place to catch them. Every minute of triage, every regulator notification, every customer-trust call is downstream of a failure that should have failed a build.&lt;/p&gt;</description></item><item><title>Migrating from AWS Comprehend to Philter: A Practical Transition Guide</title><link>https://www.philterd.ai/blog/migrating-from-aws-comprehend-to-philter/</link><pubDate>Wed, 18 Feb 2026 13:30:00 +0000</pubDate><guid>https://www.philterd.ai/blog/migrating-from-aws-comprehend-to-philter/</guid><description>&lt;p&gt;Teams move from AWS Comprehend PII to &lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; for one of three reasons: &lt;a href="https://www.philterd.ai/blog/tco-cloud-pii-redaction-aws-comprehend-google-dlp-vs-self-hosted/"&gt;the bill got uncomfortable&lt;/a&gt;, the data path stopped passing compliance review, or the customization surface ran out (custom entity types, domain lenses, per-entity replacement strategies). Whatever the reason, the migration itself is more mechanical than most teams expect.&lt;/p&gt;

&lt;p&gt;This guide is the practical playbook: how Comprehend concepts map onto Philter, the code translation for the most common API calls, and a safe shadow-mode cutover pattern that lets you migrate without taking on detection-quality risk.&lt;/p&gt;</description></item><item><title>Redaction for Insurance: Claims, Customer Data, and the State-by-State Patchwork</title><link>https://www.philterd.ai/blog/redaction-for-insurance-claims-and-customer-data-glba-and-state-rules/</link><pubDate>Thu, 12 Feb 2026 14:30:00 +0000</pubDate><guid>https://www.philterd.ai/blog/redaction-for-insurance-claims-and-customer-data-glba-and-state-rules/</guid><description>&lt;p&gt;Insurance is the vertical most people forget to mention when discussing regulated data. Healthcare gets HIPAA; banking gets GLBA and PCI; tech gets GDPR and CCPA. Insurance gets &lt;em&gt;all of them at once&lt;/em&gt; &amp;mdash; plus state-level insurance commissioner rules, plus the NAIC Insurance Data Security Model Law, plus (for health insurers) HIPAA in addition to everything else.&lt;/p&gt;

&lt;p&gt;For carriers, third-party administrators, and insurtech platforms, that overlapping regulatory environment combines with a uniquely PII-dense data flow: claims adjusters write free-text notes, customers correspond via email and chat, third-party medical reports arrive in PDFs, agent training data captures live calls. PII is everywhere; the redaction problem is constant.&lt;/p&gt;</description></item><item><title>Open Source vs. Black Box: Why You Can't Afford "Trust Me" Privacy</title><link>https://www.philterd.ai/blog/open-source-vs-black-box-trust-me-privacy/</link><pubDate>Thu, 05 Feb 2026 10:28:11 +0000</pubDate><guid>https://www.philterd.ai/blog/open-source-vs-black-box-trust-me-privacy/</guid><description>&lt;p&gt;For a Chief Information Security Officer, the word &lt;em&gt;trust&lt;/em&gt; is a calculated risk. When you buy a security tool, you aren't just buying a feature &amp;mdash; you are inheriting the vendor's vulnerabilities, their blind spots, and their secret handling of your data.&lt;/p&gt;

&lt;p&gt;In PII redaction and data privacy, this risk is magnified. If a tool fails to catch a Social Security Number or a patient identifier, the liability doesn't fall on the software vendor &amp;mdash; &lt;strong&gt;it falls on you&lt;/strong&gt;. That is why black-box proprietary systems are becoming a relic of the past, and why auditable, open source tools are the new enterprise standard for 2026.&lt;/p&gt;</description></item><item><title>Automating HIPAA Safe Harbor: A Blueprint for Healthcare Data Pipelines</title><link>https://www.philterd.ai/blog/automating-hipaa-safe-harbor-a-blueprint-for-healthcare-data-pipelines/</link><pubDate>Sun, 25 Jan 2026 19:41:38 +0000</pubDate><guid>https://www.philterd.ai/blog/automating-hipaa-safe-harbor-a-blueprint-for-healthcare-data-pipelines/</guid><description>&lt;p&gt;For healthcare CTOs and Data Protection Officers, HIPAA Safe Harbor de-identification &amp;mdash; 45 CFR § 164.514(b)(2) &amp;mdash; sounds simple on paper. Remove 18 specific identifier categories, and the resulting dataset is no longer considered Protected Health Information (PHI). You can share it for research, train AI on it, or move it into less-restricted infrastructure.&lt;/p&gt;

&lt;p&gt;In practice, mechanically enforcing those 18 categories across patient narratives, clinical notes, intake forms, and the data flowing through modern AI pipelines is non-trivial. This post lays out how each of the 18 identifiers maps to a specific capability in the &lt;a href="https://www.philterd.ai/#toolkit"&gt;Philterd toolkit&lt;/a&gt;, and then sketches a reference architecture for three common healthcare data flows: patient data lakes, clinical research pipelines, and RAG-based medical AI systems.&lt;/p&gt;</description></item><item><title>Privacy Shouldn't Be a Guessing Game: Evaluating Redaction with Philter Scope</title><link>https://www.philterd.ai/blog/evaluating-redaction-with-philter-scope/</link><pubDate>Sun, 18 Jan 2026 08:28:10 +0000</pubDate><guid>https://www.philterd.ai/blog/evaluating-redaction-with-philter-scope/</guid><description>&lt;p&gt;In data privacy, &lt;em&gt;"I think we caught everything"&lt;/em&gt; is a dangerous sentence. When you're preparing a massive dataset for research or moving sensitive logs into a cloud environment, you can't rely on a gut feeling that your redaction tool is working. &lt;strong&gt;You need proof.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The problem is that most redaction engines are black boxes. You feed data in, you get redacted data out, but you have no clear way to measure what was missed or what was accidentally destroyed.&lt;/p&gt;</description></item><item><title>Why API-Based Redaction is a Security Antipattern</title><link>https://www.philterd.ai/blog/why-api-based-redaction-is-a-security-antipattern/</link><pubDate>Sat, 17 Jan 2026 15:07:49 +0000</pubDate><guid>https://www.philterd.ai/blog/why-api-based-redaction-is-a-security-antipattern/</guid><description>&lt;p&gt;In the rush to adopt AI and modern data processing, many organizations have fallen into a convenient but dangerous trap: &lt;em&gt;"Privacy-as-a-Service"&lt;/em&gt; APIs. It sounds simple &amp;mdash; you send your raw text to a third-party provider, they redact the sensitive bits, and send it back.&lt;/p&gt;

&lt;p&gt;But there is a fundamental flaw in this logic. &lt;strong&gt;To protect your PII (Personally Identifiable Information), you are starting the process by handing that PII over to someone else.&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>Redaction for Government and Federal Workloads: FedRAMP, CMMC, ITAR, and the Air-Gap Imperative</title><link>https://www.philterd.ai/blog/redaction-for-government-and-federal-fedramp-cmmc-itar/</link><pubDate>Fri, 09 Jan 2026 10:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/redaction-for-government-and-federal-fedramp-cmmc-itar/</guid><description>&lt;p&gt;Government workloads have the strictest data-handling requirements of any sector in the West &amp;mdash; and the smallest population of commercial tools that genuinely qualify. Most PII redaction products marketed to "enterprise" customers fail at the first authorization step because the underlying architecture assumes a hosted SaaS data path. Federal contractors quickly discover that "we redact your PII for you" means "we move your CUI to our infrastructure first," which is the opposite of what FedRAMP, CMMC, ITAR, and the rest of the federal compliance stack require.&lt;/p&gt;</description></item><item><title>Deploying Philter in Air-Gapped Environments</title><link>https://www.philterd.ai/blog/deploying-philter-in-air-gapped-environments/</link><pubDate>Sat, 27 Dec 2025 22:06:40 +0000</pubDate><guid>https://www.philterd.ai/blog/deploying-philter-in-air-gapped-environments/</guid><description>&lt;p&gt;In data security, &lt;em&gt;connected&lt;/em&gt; is often synonymous with &lt;em&gt;vulnerable&lt;/em&gt;. For high-security sectors &amp;mdash; defense, intelligence, national healthcare &amp;mdash; the gold standard isn't just a firewall; it's an air gap. When your data is so sensitive that it cannot exist on a network with outbound internet access, your software stack has to be just as self-sufficient.&lt;/p&gt;

&lt;p&gt;Most modern AI and PII redaction tools are cloud-first, meaning they constantly call home for updates, telemetry, or license verification. In an air-gapped environment, those tools don't just fail &amp;mdash; &lt;strong&gt;they won't even start.&lt;/strong&gt;&lt;/p&gt;</description></item><item><title>From Phileas to Philter: The Evolution of Our Open Source Engine</title><link>https://www.philterd.ai/blog/from-phileas-to-philter-the-evolution-of-our-open-source-engine/</link><pubDate>Fri, 19 Dec 2025 09:22:01 +0000</pubDate><guid>https://www.philterd.ai/blog/from-phileas-to-philter-the-evolution-of-our-open-source-engine/</guid><description>&lt;p&gt;In software, there's a saying that "nothing is ever finished &amp;mdash; it's just released." Looking back at the trajectory of our privacy engine, that sentiment couldn't be more accurate. What began as a focused open source experiment called &lt;a href="https://www.philterd.ai/phileas/"&gt;Phileas&lt;/a&gt; has evolved into &lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; &amp;mdash; the core of a comprehensive enterprise PII suite used by healthcare providers, government agencies, and global tech firms.&lt;/p&gt;

&lt;p&gt;Understanding the history of this engine isn't just a trip down memory lane &amp;mdash; it's an explanation of why the software is as stable and context-aware as it is today. Here is how we moved from simple pattern matching to high-velocity, hybrid privacy intelligence.&lt;/p&gt;</description></item><item><title>Redaction for Education: FERPA, Student Records, and Research Data Pipelines</title><link>https://www.philterd.ai/blog/redaction-for-education-ferpa-student-records-and-research-data/</link><pubDate>Sun, 14 Dec 2025 11:30:00 +0000</pubDate><guid>https://www.philterd.ai/blog/redaction-for-education-ferpa-student-records-and-research-data/</guid><description>&lt;p&gt;Education is the vertical that almost no one talks about when discussing regulated data &amp;mdash; and yet it's one of the largest categories of PII in the country. Every K-12 district, every university registrar, every learning-management-system vendor, every edtech startup, every educational researcher works under FERPA, the Family Educational Rights and Privacy Act of 1974. The regulatory framework is real, the penalties for institutions are severe (loss of federal funding), and the tooling situation is genuinely thin compared to healthcare or finance.&lt;/p&gt;</description></item><item><title>Snowflake PII Redaction: A Practical Integration Guide</title><link>https://www.philterd.ai/blog/snowflake-pii-redaction-a-practical-integration-guide/</link><pubDate>Sat, 06 Dec 2025 11:00:00 +0000</pubDate><guid>https://www.philterd.ai/blog/snowflake-pii-redaction-a-practical-integration-guide/</guid><description>&lt;p&gt;Snowflake is the data warehouse for a huge swath of mid-to-large enterprises, and that's exactly why it's where unstructured PII piles up. Customer-support transcripts, scanned-document text, application logs, chat history, transaction descriptions &amp;mdash; everything that's text and was ever called "data" eventually lands in a Snowflake table.&lt;/p&gt;

&lt;p&gt;Snowflake's built-in dynamic data masking and external tokenization handle column-level structured data well (an SSN column, a credit card column), but they don't address the harder problem: PII &lt;em&gt;buried inside free-text columns&lt;/em&gt;. A customer-service-ticket table with a TEXT column holding the full conversation is the canonical case &amp;mdash; the SSN is in there somewhere, but it's not in a column you can mask.&lt;/p&gt;</description></item><item><title>What is Data Redaction? A Practical Guide</title><link>https://www.philterd.ai/blog/what-is-data-redaction-a-practical-guide/</link><pubDate>Wed, 26 Nov 2025 10:30:00 +0000</pubDate><guid>https://www.philterd.ai/blog/what-is-data-redaction-a-practical-guide/</guid><description>&lt;p&gt;"Data redaction" is one of those terms everyone uses and few people define the same way. The mental image is usually a black bar over a sentence in a court filing &amp;mdash; an image that captures one specific technique while obscuring the broader category. Real-world redaction includes that, plus several other strategies that look very different from a black bar but solve the same underlying problem.&lt;/p&gt;

&lt;p&gt;This post is a practitioner's reference: what data redaction actually means, the five common strategies, when each one fits, and how to pick the right approach for your workload.&lt;/p&gt;</description></item><item><title>Using an LLM or Pattern-based Rules for PII/PHI Redaction</title><link>https://www.philterd.ai/blog/using-an-llm-or-pattern-based-rules-for-pii-phi-redaction/</link><pubDate>Thu, 01 May 2025 20:53:57 +0000</pubDate><guid>https://www.philterd.ai/blog/using-an-llm-or-pattern-based-rules-for-pii-phi-redaction/</guid><description>&lt;p&gt;In a data-driven world, protecting Personally Identifiable Information (PII) and Protected Health Information (PHI) is imperative. Whether you&amp;#8217;re securing customer data, complying with regulations like GDPR or HIPAA, or simply aiming for responsible data handling, you need a reliable way to redact sensitive information.&lt;/p&gt;

&lt;p&gt;Today, there are two primary approaches: Large Language Models (LLMs) and traditional pattern-based rules. LLMs have understandably received significant attention for their natural language understanding, but it&amp;#8217;s worth comparing their capabilities against the tried-and-true methods of pattern matching.&lt;/p&gt;</description></item><item><title>Philter 3.1.0</title><link>https://www.philterd.ai/blog/philter-3-1-0/</link><pubDate>Sun, 23 Mar 2025 15:19:36 +0000</pubDate><guid>https://www.philterd.ai/blog/philter-3-1-0/</guid><description>&lt;p&gt;&lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; 3.1.0 is now available on all three major cloud marketplaces.&lt;/p&gt;

&lt;ul&gt;
 &lt;li&gt;&lt;a href="https://aws.amazon.com/marketplace/pp/B07YVB8FFT?ref=_ptnr_mf_launch" target="_blank" rel="noopener noreferrer"&gt;Philter 3.1.0 on the AWS Marketplace&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="https://console.cloud.google.com/marketplace/product/philterd-public/philter" target="_blank" rel="noopener noreferrer"&gt;Philter 3.1.0 on the Google Cloud Marketplace&lt;/a&gt;&lt;/li&gt;
 &lt;li&gt;&lt;a href="https://azuremarketplace.microsoft.com/en-us/marketplace/apps/philterdllc1687189098111.philter?tab=Overview" target="_blank" rel="noopener noreferrer"&gt;Philter 3.1.0 on the Microsoft Azure Marketplace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's new in 3.1.0&lt;/h2&gt;

&lt;p&gt;Philter 3.1.0 is built on &lt;a href="https://www.philterd.ai/phileas/"&gt;Phileas&lt;/a&gt; 2.12.0, which brings:&lt;/p&gt;

&lt;ul&gt;
 &lt;li&gt;&lt;strong&gt;Filter priorities.&lt;/strong&gt; Each filter can now have its own priority that is used as a tie-breaker when the same text is identified by two filters. For example, if you're using the phone-number filter and an ID filter for 10-digit numbers, both may detect PII on the same text. The filter priority decides which label wins.&lt;/li&gt;
 &lt;li&gt;&lt;strong&gt;Zip code validation.&lt;/strong&gt; The zip-code filter can now optionally validate zip codes against an internal database. When enabled, a string that &lt;em&gt;looks&lt;/em&gt; like a zip code but doesn't actually exist won't be redacted &amp;mdash; reducing false positives on otherwise-numeric data.&lt;/li&gt;
 &lt;li&gt;&lt;strong&gt;Per-filter context window sizes.&lt;/strong&gt; The window size is roughly the number of words surrounding PII that the engine uses for contextual disambiguation. Previously every filter shared one window size; now each filter can set its own. Tighten the window where you want strict matching; widen it where surrounding context matters.&lt;/li&gt;
&lt;/ul&gt;
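&lt;p&gt;To make the filter-priority tie-breaker concrete, here is a minimal Python sketch of the idea. It is not the Phileas implementation: the &lt;code&gt;Span&lt;/code&gt; class, the &lt;code&gt;resolve_overlaps&lt;/code&gt; helper, and the convention that a larger number means a higher priority are all assumptions made for illustration, and the sketch only handles two filters matching exactly the same text.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A detection produced by one filter (hypothetical structure)."""
    start: int
    end: int        # exclusive offset
    label: str
    priority: int   # assumption: the larger number wins a tie


def resolve_overlaps(spans):
    """Keep one span per (start, end) range, preferring the highest priority."""
    winners = {}
    for span in spans:
        key = (span.start, span.end)
        current = winners.get(key)
        if current is None:
            winners[key] = span
        else:
            # max() keeps the existing winner when priorities are equal.
            winners[key] = max(current, span, key=lambda s: s.priority)
    return sorted(winners.values(), key=lambda s: s.start)


text = "Call 3045551234 today"
detections = [
    Span(5, 15, "phone-number", priority=1),
    Span(5, 15, "id-number", priority=5),
]
resolved = resolve_overlaps(detections)
for span in resolved:
    print(span.label, text[span.start:span.end])
```

&lt;p&gt;With these made-up priorities the ID filter outranks the phone-number filter, so the 10-digit number is labeled as an ID. Swapping the two priority values would label it as a phone number instead.&lt;/p&gt;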

&lt;h2&gt;What Philter is (in case you're new here)&lt;/h2&gt;

&lt;p&gt;Philter is open source software that redacts PII and PHI from text and PDF documents. It runs entirely inside your cloud &amp;mdash; your data never leaves your perimeter, never reaches a third-party API, and never lands in someone else's logs. A REST API takes text in and returns redacted text out:&lt;/p&gt;</description></item><item><title>Phileas 2.12.0</title><link>https://www.philterd.ai/blog/phileas-2-12-0/</link><pubDate>Thu, 20 Mar 2025 19:22:54 +0000</pubDate><guid>https://www.philterd.ai/blog/phileas-2-12-0/</guid><description>&lt;p&gt;&lt;a href="https://github.com/philterd/phileas"&gt;Phileas&lt;/a&gt; 2.12.0 has been released. This version of the popular open source redaction library brings:&lt;/p&gt;



&lt;ul class="wp-block-list"&gt;
&lt;li&gt;Filter priorities &amp;#8211; Each filter can have its own priority that is used as a tie-breaker in cases where text is identified by two filters. For example, if you are using the phone number filter and an ID filter for 10-digit numbers, both filters may detect PII on the same text. In this case, the filter priority determines whether the text is ultimately labeled as a phone number or an ID number.&lt;/li&gt;



&lt;li&gt;Zip code validation &amp;#8211; The zip code filter can now optionally attempt to validate zip codes. When enabled, if a zip code does not exist in the internal database, the zip code will not be redacted.&lt;/li&gt;



&lt;li&gt;Each filter can have a custom window size &amp;#8211; The window size is roughly the number of words surrounding PII that are used to provide contextual information about the PII. Previously, every filter had to use the same window size. Now, each filter can have the window size set independently.&lt;/li&gt;
&lt;/ul&gt;
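&lt;p&gt;As a rough illustration of the zip code validation option, the Python sketch below redacts a five-digit candidate only when it appears in a lookup set. This is not the Phileas code: the &lt;code&gt;KNOWN_ZIP_CODES&lt;/code&gt; set is a tiny stand-in for the internal database, and the &lt;code&gt;redact_zip_codes&lt;/code&gt; function, its &lt;code&gt;validate&lt;/code&gt; flag, and the &lt;code&gt;[REDACTED]&lt;/code&gt; placeholder are hypothetical names chosen for this example.&lt;/p&gt;

```python
import re

# Tiny stand-in for the internal zip code database (assumption for illustration).
KNOWN_ZIP_CODES = {"10001", "25301", "90210"}

# Any standalone run of five digits is treated as a zip code candidate.
ZIP_PATTERN = re.compile(r"\b\d{5}\b")


def redact_zip_codes(text, validate=True):
    """Redact zip code candidates; with validate=True, skip unknown codes."""
    def replace(match):
        candidate = match.group(0)
        if validate and candidate not in KNOWN_ZIP_CODES:
            return candidate  # looks like a zip code but is not a known one
        return "[REDACTED]"
    return ZIP_PATTERN.sub(replace, text)


sample = "Ship order 00000 to Beverly Hills, CA 90210."
print(redact_zip_codes(sample, validate=True))
print(redact_zip_codes(sample, validate=False))
```

&lt;p&gt;With validation on, only &lt;code&gt;90210&lt;/code&gt; is replaced and the order number &lt;code&gt;00000&lt;/code&gt; is left alone. With validation off, both five-digit values are redacted.&lt;/p&gt;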



&lt;p&gt;Look for a new version of &lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; built on Phileas 2.12.0 in the AWS, Google Cloud, and Azure marketplaces soon!&lt;/p&gt;</description></item><item><title>Why Using an LLM to Redact PII and PHI is a Bad Idea</title><link>https://www.philterd.ai/blog/why-using-an-llm-to-identify-and-redact-pii-and-phi-is-a-bad-idea/</link><pubDate>Mon, 17 Feb 2025 02:23:09 +0000</pubDate><guid>https://www.philterd.ai/blog/why-using-an-llm-to-identify-and-redact-pii-and-phi-is-a-bad-idea/</guid><description>&lt;p&gt;We have seen a lot of posts on various social media and blogging platforms &amp;#8211; and you probably have too &amp;#8211; showing how you can redact text using a large language model (LLM). They present a fairly simple solution to the complex problem of redaction. Can we really just let an LLM handle our text redaction and be done with it? The answer is simply no.&lt;/p&gt;



&lt;p&gt;Here is one such example: &lt;a href="https://ravichinni.medium.com/using-generative-ai-for-content-redaction-46ee61a3a4e6"&gt;https://ravichinni.medium.com/using-generative-ai-for-content-redaction-46ee61a3a4e6&lt;/a&gt; (&lt;strong&gt;Don&amp;#8217;t do this.&lt;/strong&gt;)&lt;/p&gt;</description></item><item><title>Shielding Your Search: Redacting PII and PHI in OpenSearch with Phinder</title><link>https://www.philterd.ai/blog/shielding-your-search-redacting-pii-and-phi-in-opensearch-with-phinder/</link><pubDate>Fri, 10 Jan 2025 13:26:28 +0000</pubDate><guid>https://www.philterd.ai/blog/shielding-your-search-redacting-pii-and-phi-in-opensearch-with-phinder/</guid><description>&lt;p&gt;In today&amp;#8217;s data-driven world, safeguarding Personally Identifiable Information (PII) and Protected Health Information (PHI) is paramount. When leveraging search platforms like OpenSearch, ensuring sensitive data remains confidential is crucial. Enter Phinder, an open-source OpenSearch plugin that leverages the power of the Phileas project to effectively redact and de-identify PII and PHI within your search results.&lt;/p&gt;



&lt;p&gt;This post explores how Phinder can bolster your data privacy and security when using OpenSearch. Phinder is available on GitHub at &lt;a href="https://github.com/philterd/phinder-pii-opensearch-plugin"&gt;https://github.com/philterd/phinder-pii-opensearch-plugin&lt;/a&gt;.&lt;/p&gt;</description></item><item><title>Phileas 2.10.0</title><link>https://www.philterd.ai/blog/phileas-2-10-0/</link><pubDate>Mon, 06 Jan 2025 14:03:31 +0000</pubDate><guid>https://www.philterd.ai/blog/phileas-2-10-0/</guid><description>&lt;p&gt;We are excited to announce the release of Phileas 2.10.0!&lt;/p&gt;



&lt;p&gt;What’s changed in this version:&lt;/p&gt;



&lt;p&gt;* Making FilterResponse not be a final record class by @jzonthemtn in #166&lt;br&gt;* Removing commons-csv dependency by @jzonthemtn in #174&lt;br&gt;* Removing guava dependency and adding bloom filter by @jzonthemtn in #172&lt;br&gt;* Update pdfbox to 3.0.* by @JessieAMorris in #177&lt;br&gt;* Fixes a bug with the policy service being hard coded to “local” by @JessieAMorris in #178&lt;br&gt;* Enable outputting the replacement value on PDFs by @JessieAMorris in #179&lt;br&gt;* Add truncation filter strategy for all filters by @JessieAMorris in #180&lt;br&gt;* Adding line about snapshots being published nightly. by @jzonthemtn in #182&lt;br&gt;* #183 Replacing redis test dependency. by @jzonthemtn in #184&lt;br&gt;* Replace the Lucene-based filter with a fuzzy dictionary filter by @jzonthemtn in #185&lt;/p&gt;

&lt;p&gt;GitHub release:&amp;nbsp;&lt;a href="https://github.com/philterd/phileas/releases/tag/2.10.0"&gt;https://github.com/philterd/phileas/releases/tag/2.10.0&lt;/a&gt;&lt;/p&gt;</description></item><item><title>Phileas in Graylog – Removing PII from Logs</title><link>https://www.philterd.ai/blog/phileas-in-graylog-removing-pii-from-logs/</link><pubDate>Sun, 01 Dec 2024 13:21:41 +0000</pubDate><guid>https://www.philterd.ai/blog/phileas-in-graylog-removing-pii-from-logs/</guid><description>&lt;figure class="wp-block-image"&gt;&lt;img decoding="async" src="https://www.philterd.ai/blog/images/graylog.png" alt=""/&gt;&lt;/figure&gt;



&lt;p&gt;We are very excited to share with you that Graylog has integrated &lt;a href="https://www.philterd.ai/phileas/"&gt;Phileas&lt;/a&gt;, the open source PII/PHI redaction engine, into their centralized log management solution. With this new integration, Graylog can identify and redact different types of PII (personally identifiable information) present in logs.&lt;/p&gt;



&lt;p&gt;The presence of PII in logs is a serious concern. Even careful application developers can find it difficult to prevent all PII from being included in logs. Error messages and stack traces can inadvertently include PII, exposing the business to risk and liability.&lt;/p&gt;</description></item><item><title>Phileas 2.9.1</title><link>https://www.philterd.ai/blog/phileas-2-9-1/</link><pubDate>Wed, 27 Nov 2024 14:04:01 +0000</pubDate><guid>https://www.philterd.ai/blog/phileas-2-9-1/</guid><description>&lt;p&gt;We are excited to announce the release of Phileas 2.9.1.&lt;/p&gt;



&lt;p&gt;What’s changed in this version:&lt;/p&gt;



&lt;p&gt;* LineWidthSplitService is using a new line separator instead of a space&lt;br&gt;* An empty list of spans from ph-eye does not indicate failure&lt;br&gt;* Have a default PhEyeConfiguration value in AbstractPhEyeConfiguration so a filter does not have to provide one&lt;/p&gt;



&lt;p&gt;GitHub release:&amp;nbsp;&lt;a href="https://github.com/philterd/phileas/releases/tag/2.9.1"&gt;https://github.com/philterd/phileas/releases/tag/2.9.1&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;Artifacts are available in the Philterd repository as described in the README.&lt;/p&gt;
</description></item><item><title>Automatically Redacting PII and PHI from Files in Amazon S3 using Amazon Macie and Philter</title><link>https://www.philterd.ai/blog/automatically-redacting-pii-and-phi-from-files-in-amazon-s3-using-amazon-macie-and-philter/</link><pubDate>Sun, 17 Nov 2024 13:23:14 +0000</pubDate><guid>https://www.philterd.ai/blog/automatically-redacting-pii-and-phi-from-files-in-amazon-s3-using-amazon-macie-and-philter/</guid><description>&lt;p&gt;&lt;a href="https://aws.amazon.com/macie/" target="_blank" rel="noopener noreferrer"&gt;Amazon Macie&lt;/a&gt; is "a data security service that discovers sensitive data using machine learning and pattern matching." With Amazon Macie you can find potentially sensitive information in files in your Amazon S3 buckets, but what do you do when Amazon Macie finds a file that contains an SSN, phone number, or other piece of sensitive information?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; is software that redacts PII, PHI, and other sensitive information from text. Philter runs entirely within your private cloud and does not require any external connectivity. Your data never leaves your private cloud and is not sent to any third party. In fact, you can run Philter without any external network connectivity, and we recommend doing so.&lt;/p&gt;</description></item><item><title>Philter as an AI Policy Layer</title><link>https://www.philterd.ai/blog/philter-as-an-ai-policy-layer/</link><pubDate>Thu, 10 Oct 2024 13:24:17 +0000</pubDate><guid>https://www.philterd.ai/blog/philter-as-an-ai-policy-layer/</guid><description>&lt;h1 class="wp-block-heading"&gt;A policy layer is an important part of every source of AI-generated text.&lt;/h1&gt;



&lt;p&gt;An AI policy layer is an important part of every source of AI-generated text because it inspects the AI-generated text to prevent sensitive information from being exposed. A policy layer can help remove information such as names, addresses, and telephone numbers from responses.&lt;/p&gt;
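&lt;p&gt;The inspect-then-release flow can be sketched in a few lines of Python. This is only an illustration of the concept, not how Philter works internally: &lt;code&gt;fake_model&lt;/code&gt; stands in for any text generator, &lt;code&gt;policy_layer&lt;/code&gt; is a hypothetical wrapper, and the single phone-number pattern is a deliberately narrow assumption. Catching names and addresses generally needs more than a regular expression.&lt;/p&gt;

```python
import re

# Assumption: one simple US-style phone pattern; a real policy covers far more.
PHONE_PATTERN = re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b")


def fake_model(prompt):
    """Stand-in for any AI text generator (hypothetical)."""
    return "You can reach the clinic at 304-555-0142 after 5pm."


def policy_layer(generate, prompt):
    """Call the generator, then inspect and redact its response before release."""
    response = generate(prompt)
    return PHONE_PATTERN.sub("[REDACTED]", response)


safe_response = policy_layer(fake_model, "How do I contact the clinic?")
print(safe_response)
```

&lt;p&gt;The caller only ever sees the output of &lt;code&gt;policy_layer&lt;/code&gt;, so the generated phone number never reaches the user.&lt;/p&gt;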



&lt;p&gt;In this blog post we will describe the function of an AI policy layer and how &lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; is well-suited for the role. Philter is available on the AWS Marketplace, Google Cloud Marketplace, and the Microsoft Azure Marketplace.&lt;/p&gt;</description></item><item><title>Redacting Text in Amazon Kinesis Data Firehose</title><link>https://www.philterd.ai/blog/redacting-text-in-amazon-kinesis-data-firehose/</link><pubDate>Mon, 09 Sep 2024 13:24:55 +0000</pubDate><guid>https://www.philterd.ai/blog/redacting-text-in-amazon-kinesis-data-firehose/</guid><description>&lt;p&gt;Amazon Kinesis Firehose is a managed streaming service designed to move large amounts of data from one place to another. For example, you can take data from sources such as Amazon CloudWatch, AWS IoT, and custom applications using the AWS SDK and deliver it to destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and other services. In this post we will use Amazon S3 as the firehose's destination.&lt;/p&gt;

&lt;p&gt;In some cases you may need to manipulate the data as it goes through the firehose to remove sensitive information. In this blog post we will show how Amazon Kinesis Firehose and AWS Lambda can be used in conjunction with &lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; to remove sensitive information (PII and PHI) from the text as it travels through the firehose.&lt;/p&gt;</description></item><item><title>Phileas — The Open Source PII and PHI redaction engine</title><link>https://www.philterd.ai/blog/phileas-the-open-source-pii-and-phi-redaction-engine/</link><pubDate>Mon, 22 May 2023 18:16:12 +0000</pubDate><guid>https://www.philterd.ai/blog/phileas-the-open-source-pii-and-phi-redaction-engine/</guid><description>&lt;figure class="wp-block-image"&gt;&lt;img decoding="async" src="https://www.philterd.ai/blog/images/0_JOqvXaaYsu9SSiR9.png" alt=""/&gt;&lt;/figure&gt;

&lt;p&gt;I am delighted to announce that the project providing the core PII and PHI redaction capabilities is now open source. Introducing &lt;a href="https://www.philterd.ai/phileas/"&gt;Phileas&lt;/a&gt;, the PII and PHI redaction engine &amp;mdash; available under the Apache 2.0 license on &lt;a href="https://github.com/philterd/phileas" target="_blank" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Both &lt;a href="https://www.philterd.ai/philter/"&gt;Philter&lt;/a&gt; and Phirestream use Phileas to identify and redact sensitive information like PII and PHI. Phileas does all of the heavy lifting, while Philter and Phirestream make its functionality user-friendly and provide the NLP models.&lt;/p&gt;</description></item><item><title>What is format-preserving encryption?</title><link>https://www.philterd.ai/blog/what-is-format-preserving-encryption/</link><pubDate>Sun, 21 May 2023 18:15:20 +0000</pubDate><guid>https://www.philterd.ai/blog/what-is-format-preserving-encryption/</guid><description>&lt;p&gt;In cryptography, you have plain text and cipher text. An encryption algorithm transforms the plain text into the cipher text. The cipher text won't look anything like the plain text &amp;mdash; in characters or length. There are many different encryption algorithms serving many different purposes, and the cipher text for each one will be different.&lt;/p&gt;
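&lt;p&gt;As a rough illustration of the length difference, the following Python sketch works out how large the output becomes when a 16-character value is encrypted with a block cipher such as AES in CBC mode. It assumes PKCS#7 padding, a 16-byte IV stored alongside the cipher text, and base64 encoding of the result. Those choices are assumptions for illustration, the &lt;code&gt;cbc_output_length&lt;/code&gt; helper is hypothetical, and no real encryption is performed: random bytes stand in for the cipher text because only the length matters here.&lt;/p&gt;

```python
import base64
import os

BLOCK_SIZE = 16  # AES block size in bytes


def cbc_output_length(plaintext_length):
    """Bytes of output for CBC with PKCS#7 padding plus a prepended 16-byte IV."""
    # PKCS#7 always adds padding, so an exact multiple gains a whole extra block.
    padded = (plaintext_length // BLOCK_SIZE + 1) * BLOCK_SIZE
    return BLOCK_SIZE + padded


card_number = "4111111111111111"  # well-known test number, 16 digits
raw_length = cbc_output_length(len(card_number))

# Random bytes of the right size stand in for the real cipher text.
encoded = base64.b64encode(os.urandom(raw_length)).decode("ascii")
print(len(card_number), raw_length, len(encoded))  # prints: 16 48 64
```

&lt;p&gt;Under these assumptions a 16-character input becomes 48 bytes, or 64 characters once base64-encoded, which is four times the original length.&lt;/p&gt;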

&lt;figure class="wp-block-image"&gt;&lt;img decoding="async" src="https://www.philterd.ai/blog/images/0_4iPG8EUUPaU123Th.png" alt=""/&gt;&lt;/figure&gt;

&lt;p&gt;Take the case of a credit card number, a common piece of sensitive information that's often encrypted. A credit card number is typically 16 digits long. Encrypting it with the industry-standard AES-128-CBC algorithm produces a cipher text much longer than 16 digits &amp;mdash; usually around 64 base64-encoded characters. If you're storing the credit card number in a database column configured for length 16, the cipher text won't fit.&lt;/p&gt;</description></item></channel></rss>