Talk to an Expert

Tell us about your stack and the privacy problems you're trying to solve. We typically respond within one business day.

Prefer to skip the form? Pick a time on our calendar →
or send a message

← All industries

AI & Machine Learning

PII Guardrails for AI and LLM Workloads

Stop sensitive data from reaching LLM providers, vector stores, and training corpora. Drop-in middleware for prompts; ingestion-time redaction for RAG; pre-training data cleanup — all the patterns your security team will ask about.

Or deploy Philter yourself →

The AI privacy problem nobody talks about enough

Every team building AI features hits the same gate eventually: security review. The roadmap says “chatbot in Q3.” The security review says “you can’t send customer data to OpenAI.” Both are reasonable; the gap between them is where most AI projects stall.

The same problem shows up in three different shapes: prompts going out to hosted LLMs (chatbots, agents, summarizers), ingestion into vector stores for RAG, and training corpora for fine-tuning. Each has a different solution, but the underlying engine is the same: identify the PII before it crosses the boundary, decide what to do with it, and keep a record.

How Philterd handles AI/ML

LLM proxy middleware

Philter AI Proxy sits between your application and the LLM provider. Point your existing SDK at the proxy URL; everything else stays the same. Outbound prompts get PII-stripped before reaching OpenAI/Anthropic/Bedrock; optionally scan responses on the way back.

RAG ingestion redaction

Redact PII at the document-ingestion step, before chunking and embedding. Vector stores can’t leak what was never written to them. Defends against embedding inversion attacks and against retrieved-context leaks in the LLM’s response.

Training-data preparation

Aggressive redaction for pre-training and fine-tuning corpora. Once PII enters model weights it’s effectively unrecoverable; catch it before training, not in the model’s outputs.

Medical chatbot ready

Healthcare-specific policy that preserves clinical meaning (symptoms, medications, conditions) while stripping identifiers. Pair with Philter AI Proxy for end-to-end HIPAA-conscious chatbot architecture.

Drop-in for any LLM provider

OpenAI, Anthropic Claude, AWS Bedrock, Azure OpenAI, Google Vertex AI, local LLMs (Ollama, vLLM). One proxy URL, all providers.

Open source, self-hosted

Apache 2.0. Runs inside your VPC. No third-party API in the data path. The security review that blocked your AI rollout has a concrete answer.

Ready-to-use policies

Apache 2.0 policies from the open source policy library — download and load into your Philter instance.

AI Training v1.0.0

LLM Training Data Preparation

Aggressive PII redaction for documents being fed into LLM training, fine-tuning, or RAG vector stores — preserves semantic structure with type tokens.

AILLMfine-tuningtraining data
Healthcare v1.0.0

Medical Chatbot — User Input Redaction

Redact PHI from user messages to a healthcare chatbot before they reach the LLM — preserves clinical meaning while removing identifiers.

HIPAAPHIchatbotLLM

Browse the full policy library →

Recent writing on AI/ML

Building a Privacy-Aware RAG System

RAG pipelines have two distinct PII leak vectors: ingestion and inference. A defense-in-depth blueprint with code, using Philter, Philter AI Proxy, and the rest of the Philterd toolkit.

All blog posts →

Where AI teams start

Common deployments

1. Healthcare chatbots. Patient-facing or provider-facing chatbots that call hosted LLMs. The user message can contain anything (“my mom Linda is 72 with hypertension and her MRN is…”). Philter AI Proxy with the medical chatbot policy strips identifiers before the prompt reaches the LLM, preserves clinical context for the model to actually be useful. Architecture.

2. Enterprise RAG systems. A company-internal RAG system trained on contracts, support tickets, customer correspondence, or product documentation that contains PII. Standard RAG architectures index unredacted documents, which means anyone with query access can retrieve PII. Privacy-aware RAG architecture inserts the redaction at ingestion, before chunking and embedding.

3. Fine-tuning corpora. Teams fine-tuning open-source models on real customer data (support tickets, transcripts, internal communications). The training-data prep policy handles the aggressive redaction needed for training data — once PII is in the model weights it’s extractable, sometimes years later.

4. Outbound LLM gateways. Some enterprises proxy ALL outbound LLM traffic through a central gateway for governance, cost control, and PII guardrails. Philter AI Proxy is the PII layer in that gateway. Single point of policy enforcement across every team using LLMs in the org.

What teams need to be careful about

  • Post-hoc filtering doesn’t fix training data. If you trained on un-redacted PII, no inference-time filter recovers from that. Carlini et al.’s extraction research demonstrated extractability years after training. Pre-training redaction is the only reliable defense.
  • Embedding inversion is real. Published research shows you can reconstruct text chunks from their vector representations, especially with smaller embedding models. The “it’s just numbers in the vector store” argument doesn’t hold for sensitive content.
  • Defense in depth. Redacting prompts via Philter AI Proxy doesn’t eliminate the BAA requirement (for HIPAA workloads) or the data-processing agreement (for GDPR). It’s a layer; the legal agreements are a separate layer.
  • Tokens matter for downstream models. If your downstream LLM is fine-tuned to expect specific tokens (<patient>, [NAME]), align the redaction format with what the model expects. Mismatch hurts accuracy.

Build PII redaction into your AI/ML pipeline

Your AI roadmap doesn’t have to stall on security review. Talk to the engineers who built the Philter AI Proxy and the medical-chatbot policy — get a concrete architecture answer this week.

Or deploy Philter yourself →