AI and Machine Learning

PII Guardrails for AI and LLM Workloads

Stop sensitive data from reaching LLM providers, vector stores, and training corpora. Drop-in middleware for prompts; ingestion-time redaction for RAG; pre-training data cleanup: all the patterns your security team will ask about.

Or deploy Philter yourself →

The AI privacy problem nobody talks about enough

Every team building AI features hits the same gate eventually: security review. The roadmap says “chatbot in Q3.” The security review says “you can’t send customer data to OpenAI.” Both are reasonable; the gap between them is where most AI projects stall.

The same problem shows up in three different shapes: prompts going out to hosted LLMs (chatbots, agents, summarizers), ingestion into vector stores for RAG, and training corpora for fine-tuning. Each has a different solution, but the underlying engine is the same: identify the PII before it crosses the boundary, decide what to do with it, and keep a record.

How Philterd handles AI/ML

LLM proxy middleware

Philter AI Proxy sits between your application and the LLM provider. Point your existing SDK at the proxy URL; everything else stays the same. Outbound prompts get PII-stripped before reaching OpenAI/Anthropic/Bedrock.

RAG ingestion redaction

Redact PII at the document-ingestion step, before chunking and embedding. Vector stores can’t leak what was never written to them. Defends against embedding inversion attacks and against retrieved-context leaks in the LLM’s response.

Training-data preparation

Aggressive redaction for pre-training and fine-tuning corpora. Once PII enters model weights it’s effectively unrecoverable; catch it before training, not in the model’s outputs.

Medical chatbot ready

Healthcare-specific policy that preserves clinical meaning (symptoms, medications, conditions) while stripping identifiers. Pair with Philter AI Proxy for end-to-end HIPAA-conscious chatbot architecture.

Drop-in for any LLM provider

OpenAI, Anthropic Claude, Google Gemini, and local LLMs via Ollama. One proxy URL, multiple providers.

Open source, self-hosted

Released under the permissive and business-friendly Apache license. Runs inside your VPC. No third-party API in the data path. The security review that blocked your AI rollout has a concrete answer.

Try it live

Try it out! Select one of the industries and click Redact to redact the text.

Input

Patient Margaret Collins, born on 04/12/1978, with SSN 523-88-4021 was admitted to the ER at St. Luke’s Medical Center. Her primary care physician, Dr. Howard Banks, can be reached at hbanks@stlukesmed.org or (555) 342-9187.

Redacted output

The redacted text appears here after you click Redact.

Do not enter PHI or PII.

Ready-to-use policies

Free, ready-to-use policies from the open source policy library. Download and load into your Philter instance.

AI Training v1.0.0

LLM Training Data Preparation

Aggressive PII redaction for documents being fed into LLM training, fine-tuning, or RAG vector stores — preserves semantic structure with type tokens.

AILLMfine-tuningtraining data

Healthcare v1.0.0

Medical Chatbot — User Input Redaction

Redact PHI from user messages to a healthcare chatbot before they reach the LLM — preserves clinical meaning while removing identifiers.

HIPAAPHIchatbotLLM

Browse all redaction policies →

Recent writing on AI/ML

Building a Privacy-Aware RAG System

RAG pipelines have two distinct PII leak vectors: ingestion and inference. A defense-in-depth blueprint with code, using Philter and the Philter AI Proxy.

Beyond Regex: Why General LLMs Fail at PII Discovery

Regex misses context, general LLMs over-redact and burn GPUs. The right answer is hybrid: pattern matching for the deterministic, specialized AI for the rest.

All blog posts →

Where AI teams start

Identify the data crossing. Is the problem outbound prompts (chatbot, summarizer, agent), ingestion (RAG vector store), or training (fine-tuning data)? Each calls for a different pattern.
For outbound prompts: deploy Philter AI Proxy and point your existing OpenAI/Anthropic/Bedrock SDK at it. Zero code changes in the application; full PII redaction in the prompt path.
For RAG ingestion: insert Philter as a pre-chunking step in the document pipeline. The policy library has a medical chatbot policy as a starting template; tune for your domain.
For training data: use the LLM training-data prep policy from the library. More aggressive than chatbot-input redaction because once it’s in the weights, it’s in the weights.
Measure with Philter Scope and iterate on the policy. AI workloads are usually higher-stakes than other redaction surfaces because the failure mode is “model memorized PII permanently.”

Common deployments

1. Healthcare chatbots. Patient-facing or provider-facing chatbots that call hosted LLMs. The user message can contain anything (“my mom Linda is 72 with hypertension and her MRN is…”). Philter AI Proxy with the medical chatbot policy strips identifiers before the prompt reaches the LLM, preserves clinical context for the model to actually be useful. Architecture .

2. Enterprise RAG systems. A company-internal RAG system trained on contracts, support tickets, customer correspondence, or product documentation that contains PII. Standard RAG architectures index unredacted documents, which means anyone with query access can retrieve PII. Privacy-aware RAG architecture inserts the redaction at ingestion, before chunking and embedding.

3. Fine-tuning corpora . Teams fine-tuning open-source models on real customer data (support tickets, transcripts, internal communications). The training-data prep policy handles the aggressive redaction needed for training data. Once PII is in the model weights it’s extractable, sometimes years later.

4. Outbound LLM gateways. Some enterprises proxy ALL outbound LLM traffic through a central gateway for governance, cost control, and PII guardrails. Philter AI Proxy is the PII layer in that gateway. Single point of policy enforcement across every team using LLMs in the org.

What teams need to be careful about

Post-hoc filtering doesn’t fix training data. If you trained on un-redacted PII, no inference-time filter recovers from that. Carlini et al.’s extraction research demonstrated extractability years after training. Pre-training redaction is the only reliable defense.
Embedding inversion is real. Published research shows you can reconstruct text chunks from their vector representations, especially with smaller embedding models. The “it’s just numbers in the vector store” argument doesn’t hold for sensitive content.
Defense in depth. Redacting prompts via Philter AI Proxy doesn’t eliminate the BAA requirement (for HIPAA workloads) or the data-processing agreement (for GDPR). It’s a layer; the legal agreements are a separate layer.
Tokens matter for downstream models. If your downstream LLM is fine-tuned to expect specific tokens (<patient>, [NAME]), align the redaction format with what the model expects. Mismatch hurts accuracy.

Build PII redaction into your AI/ML pipeline

Your AI roadmap doesn’t have to stall on security review. Talk to the engineers who built the Philter AI Proxy and the medical-chatbot policy and get a concrete architecture answer this week.

Or deploy Philter yourself →