Redacting PII in Apache Kafka

Redact PII and PHI in Apache Kafka with the philter-kafka-connect Single Message Transform, on source, sink, or MirrorMaker connectors.

This guide shows how to redact PII and PHI flowing through Apache Kafka using philter-kafka-connect, a Kafka Connect Single Message Transform (SMT) that redacts record keys and values with Philter and Phileas.

All of this is open source and self-hosted: the connector, Philter, Phileas, the PhEye detection models, and PhiSQL are Apache-2.0 licensed and run inside your own environment, so your data never leaves it.

If you want the conceptual lay of the land first, the companion blog post, Architecting Privacy in Kafka, compares three patterns: Phileas embedded in a Kafka Streams app, Philter called over HTTP, and a Kafka Connect transform. This guide is the hands-on version of that third pattern: one end-to-end, runnable integration built on the connector.

Redaction is probabilistic. These tools are designed to reduce how much sensitive data flows through your pipeline; they do not catch every instance, and you are responsible for validating the output against your own data.

Why redact PII in the stream

A Kafka topic is durable. Once a record with PII lands in a topic it is retained on disk, replicated across brokers, delivered to every consumer group, and fanned out to whatever the pipeline feeds: data lakes, search indexes, warehouses, and other clusters. Redacting in the pipeline keeps sensitive values from spreading to all of those places, instead of trying to scrub each downstream store after the fact. It is data minimization applied at the point the data moves.

That placement also supports your compliance efforts for regimes such as GDPR, HIPAA, and PCI DSS. One policy decides what is sensitive and how it is handled, and it is applied the same way on the way in, on the way out, and during cross-cluster replication. Philter records what it redacted, so the pipeline has an audit trail. The connector is one control, not a compliance guarantee: it reduces how much regulated data your topics and downstream systems ever see, and you still validate the result against your own data and obligations.

Why a Single Message Transform

Because redaction runs as an SMT, it applies anywhere a Kafka Connect transform can run, with one implementation:

on a source connector, to redact data before it ever lands in a Kafka topic;
on a sink connector, to redact before data leaves Kafka for an external system;
on MirrorMaker, to redact during cross-cluster replication.

That covers data entering Kafka, leaving Kafka, and crossing between clusters without a separate proxy in front of every producer. Because it is a standard Kafka Connect transform, it runs on any Connect deployment, including managed Kafka Connect such as Amazon MSK Connect, where you add it as a custom plugin.

Where the redaction engine runs

The connector’s detector setting chooses where the engine runs. It defaults to philter.

philter (default): the transform is a thin HTTP client to a separate Philter service, which holds the policy and routes name and entity detection through PhEye. This keeps one policy as the source of truth and gives detection identical to Philter’s batch and API paths, which matters for compliance and validation. Per-record NER inference also belongs in PhEye, which scales independently and can use a GPU, rather than in Connect workers tuned for I/O throughput.
embedded (planned): Phileas runs inside the Connect worker, for deployments that want no separate Philter service (fully self-contained and air-gapped when the policy uses a local ONNX model). This mode is planned and currently raises an error at startup; use philter today.

Note that remote-PhEye versus local-ONNX is not a connector setting. In embedded Phileas that choice is driven entirely by the policy (Phileas uses the local ONNX detector when the policy’s PhEyeConfiguration sets a modelPath, otherwise the remote PhEye endpoint), so the connector only decides whether to call a Philter service or embed the engine.
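The two decisions described above can be modeled as a small sketch. This is not the connector's or Phileas's actual code: the function names, the return labels, and the dictionary standing in for the policy's PhEyeConfiguration are all hypothetical, chosen only to show which setting drives which choice.

```python
def select_engine(detector="philter"):
    """Connector-level choice: where the redaction engine runs (hypothetical model)."""
    if detector == "philter":
        # Default: the transform acts as an HTTP client to a Philter service.
        return "philter-http-client"
    if detector == "embedded":
        # Planned mode: per the guide, it currently fails at startup.
        raise NotImplementedError("embedded mode is planned; use philter today")
    raise ValueError(f"unknown detector: {detector!r}")


def select_ner_backend(pheye_configuration):
    """Policy-level choice inside embedded Phileas (hypothetical model).

    `pheye_configuration` is a plain dict standing in for the policy's
    PhEyeConfiguration; only the presence of modelPath matters here.
    """
    if pheye_configuration.get("modelPath"):
        return "local-onnx"
    return "remote-pheye"


print(select_engine())                                        # connector default
print(select_ner_backend({"modelPath": "/models/ner.onnx"}))  # placeholder path
print(select_ner_backend({}))
```

The point of keeping the two functions separate is that the connector never sees the second choice: it belongs to the policy.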

The default philter mode needs a running Philter service. You can launch Philter from the AWS, Azure, or Google Cloud marketplace or run the container yourself.

What it can redact

Philter and Phileas detect more than 30 kinds of PII and PHI out of the box: names, Social Security numbers, credit card and bank account numbers, email addresses, phone numbers, street addresses, dates, ages, IP and MAC addresses, and more, plus custom identifiers you define with a regular expression and an optional checksum or structural validator. See Phileas for the full list.

A common use is streaming healthcare data. A pipeline carrying clinical events or messages can redact patient names (detected by the PhEye NER models) along with identifiers such as medical record numbers and Social Security numbers (matched by pattern, with validators to drop format-valid look-alikes), so downstream consumers, topics, and analytics stores receive the clinical signal without the identifying details.

Build and install the plugin

The connector is built from source and installed onto your Connect workers. From a checkout of philter-kafka-connect:

mvn clean package

This produces an installable plugin jar (with its runtime dependencies bundled) under target/. Copy that jar into a directory on each Connect worker’s plugin.path and restart the worker so Connect loads the plugin.

Configure the transform

Add the transform to any connector. At a minimum, register the transform, point it at your Philter service, and name the policy to apply:

transforms=redact
transforms.redact.type=ai.philterd.kafka.connect.RedactTransform
transforms.redact.philter.endpoint=http://philter:8080
transforms.redact.philter.api.key=${file:/secrets/philter:api-key}
transforms.redact.policy=phi

That relies on the defaults for everything else: the philter detector mode, redacting the record value, and fail-closed on error. The remaining options (redact the key instead, limit redaction to specific fields, set a context for consistent pseudonymization, and tune timeouts and retries) are documented in the connector configuration reference.
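On a distributed Connect worker, connector properties like these are usually submitted as JSON to the Kafka Connect REST API (POST /connectors). The following is a minimal sketch of building that request body. The connector name, file path, topic, and worker URL (http://localhost:8083) are placeholders assumed for illustration, and the FileStreamSource connector is just one possible host for the transform; only the transforms.* keys come from the configuration above.

```python
import json
import urllib.request


def build_connector_request(name, philter_endpoint, policy):
    """Return the JSON-serializable body for POST /connectors."""
    config = {
        # Placeholder source connector; the transform attaches to any connector.
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "file": "/data/input.txt",    # placeholder path
        "topic": "redacted-lines",    # placeholder topic
        # Transform settings, as in the properties shown above.
        "transforms": "redact",
        "transforms.redact.type": "ai.philterd.kafka.connect.RedactTransform",
        "transforms.redact.philter.endpoint": philter_endpoint,
        "transforms.redact.philter.api.key": "${file:/secrets/philter:api-key}",
        "transforms.redact.policy": policy,
    }
    return {"name": name, "config": config}


def submit(body, connect_url="http://localhost:8083"):
    """POST the connector definition to a Connect worker (URL is an assumption)."""
    request = urllib.request.Request(
        f"{connect_url}/connectors",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_connector_request("redact-demo", "http://philter:8080", "phi")
print(json.dumps(body, indent=2))
# submit(body)  # uncomment once a Connect worker is reachable
```

The ${file:...} reference is resolved by the worker, not by this script, so it only works if the worker has a file config provider enabled (for example Kafka's FileConfigProvider).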

A policy that detects names or other free-text entities needs Philter to have PhEye available; a pattern-only policy (email addresses, Social Security numbers, and similar) does not.

The policy and what comes out

The policy setting names a Philter redaction policy that decides which entities to detect and how to transform each one. A simple policy that redacts email addresses and Social Security numbers:

{
  "name": "email-ssn",
  "identifiers": {
    "emailAddress": { "emailAddressFilterStrategies": [ { "strategy": "REDACT" } ] },
    "ssn": { "ssnFilterStrategies": [ { "strategy": "REDACT" } ] }
  }
}

You can write policies as JSON like this, or author them in PhiSQL, a SQL-like language that compiles to the same JSON (the policy above is REDACT EMAIL_ADDRESS WITH REDACT; REDACT SSN WITH REDACT;). Load the policy into Philter and reference it by name from the connector with transforms.redact.policy=email-ssn; Philter applies it to each record the transform sends. Build policies with the policy library or the Redaction Policy Editor.

With that policy, the sensitive values in a record are replaced in place and the rest passes through untouched:

in:  Patient contact: jane.roe@example.org, SSN 987-65-4321.
out: Patient contact: {{{REDACTED-email-address}}}, SSN {{{REDACTED-ssn}}}.

REDACT swaps each detected value for a {{{REDACTED-<type>}}} placeholder by default; other strategies can mask, encrypt, or replace it with a realistic value instead.
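To make the placeholder substitution concrete, here is a toy sketch that reproduces the in/out example with two regular expressions. It is not how Philter detects entities: the patterns are deliberately simplistic and the function name is made up. It only illustrates the shape of the REDACT output.

```python
import re

# Illustrative patterns only; real detection in Philter is more thorough
# and includes validators that these toy expressions lack.
PATTERNS = [
    ("email-address", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")),
    ("ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]


def redact(text):
    """Replace each match with a {{{REDACTED-<type>}}} placeholder."""
    for entity_type, pattern in PATTERNS:
        text = pattern.sub("{{{REDACTED-" + entity_type + "}}}", text)
    return text


line = "Patient contact: jane.roe@example.org, SSN 987-65-4321."
print(redact(line))
# prints: Patient contact: {{{REDACTED-email-address}}}, SSN {{{REDACTED-ssn}}}.
```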

Fail closed by default

behavior.on.error defaults to fail-closed: if a redaction call fails, the connector fails the record rather than forwarding it with raw data. Configure a dead letter queue on the connector to capture failed records so a transient Philter outage does not silently drop data. Use fail-open only where forwarding unredacted data on error is acceptable.
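The difference between the two settings can be sketched in a few lines. This models the behavior described above rather than the connector's implementation: the function names are hypothetical, and the mode strings are assumed to be spelled as they appear in this guide.

```python
def apply_redaction(value, redact, on_error="fail-closed"):
    """Run `redact` on a record value, honoring the error mode (illustrative)."""
    try:
        return redact(value)
    except Exception:
        if on_error == "fail-open":
            # Forward the raw, unredacted value: acceptable only in rare cases.
            return value
        # fail-closed: surface the failure so the record is not forwarded raw.
        raise


def unreachable_philter(_value):
    """Stand-in for a redaction call during a Philter outage."""
    raise ConnectionError("Philter unreachable")


raw = "SSN 987-65-4321"
print(apply_redaction(raw, unreachable_philter, on_error="fail-open"))  # raw value leaks

try:
    apply_redaction(raw, unreachable_philter)  # default: fail-closed
except ConnectionError as error:
    print(f"record failed instead of leaking: {error}")
```

In the fail-closed path the exception is what Connect's error handling, and any dead letter queue you configured, would act on.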

Try it end to end

The repository’s demo/ directory has a runnable docker-compose stack (Kafka, Kafka Connect with the plugin, Philter, and MongoDB) and a walkthrough. It redacts on a FileStreamSource connector: lines are read from a file and produced to a topic, and the transform redacts each line through Philter using a small policy that catches email addresses and SSNs. You append a line such as:

Patient contact: jane.roe@example.org, SSN 987-65-4321.

and consume the topic to see the email address and SSN redacted while the rest of the line passes through. See demo/README.md for the full steps.

Where to go next

Pick or author the redaction policy the connector applies.
Read Architecting Privacy in Kafka for the embedded Kafka Streams and HTTP-proxy patterns, and when each fits.
Browse all integrations across the Philterd toolkit.