Redacting PII and PHI in OpenSearch Data Prepper Pipelines

OpenSearch Data Prepper is the server-side collector that most teams use to get logs and traces into OpenSearch. Data flows through it in a familiar shape: a source pulls records in, a chain of processors transforms them, and a sink writes them out to an index. That processor chain is exactly where a redaction step belongs.

The reason is the same one that drives in-stream redaction for Kafka: once a value lands in the search index, it is hard to contain. Search indices are queried broadly, copied into dashboards, and replicated for resilience. An SSN or a patient name that reaches the index has effectively reached everyone who can search it. The fix is to remove sensitive values before the OpenSearch sink, while the record is still inside the pipeline you control.

This guide shows where the redaction step fits in a Data Prepper pipeline and walks through two ways to implement it: embedding the Phileas library in a custom processor, or calling the Philter redaction API over HTTP.

Figure: An OpenSearch Data Prepper pipeline with a redaction step. Logs and traces enter through a source and a buffer, then pass through a redaction processor that either calls Philter over HTTP or embeds the Phileas library. The processor sits ahead of the OpenSearch sink and is driven by a single named policy, so raw PII never lands in the index.

Where the redaction step fits

A Data Prepper pipeline is defined in YAML with a source, an optional list of processors, and one or more sinks. A redaction processor is just another entry in the processor list, placed after any parsing you do (so the fields you want to redact are populated) and before the sink:

log-pipeline:
  source:
    http:
      port: 2021
  processor:
    - grok:
        match:
          message: ['%{COMMONAPACHELOG}']
    # Redaction runs here, after parsing and before the sink.
    - phileas_redact:
        policy_name: "healthcare"
        fields: ["message", "request"]
  sink:
    - opensearch:
        hosts: ["https://opensearch:9200"]
        index: "app-logs"

Because the step lives inside the pipeline, the raw record never leaves Data Prepper unredacted. Everything downstream of the sink (the index, dashboards, alerting, replicas) sees only the redacted form.

Data Prepper processors are Java plugins, so both options below are wired in the same way. They differ only in whether redaction happens in-process or over a network hop.

Option 1: Embed Phileas in a custom processor

If you want the lowest latency and no extra service to operate, embed Phileas directly. Phileas is the open source Java library underneath Philter; you add it as a dependency to a Data Prepper plugin and call it in-process, with no network round-trip.

// build.gradle
implementation 'io.philterd:phileas:4.0.0'
implementation 'org.opensearch.dataprepper:data-prepper-api:2.x'

The processor loads the policy once at startup and redacts the configured fields on each event. This example is illustrative; check the Data Prepper plugin API for the exact interfaces in your version.

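// Data Prepper 2.x and JDK imports. Phileas imports are omitted here because
// package names depend on the Phileas release you build against.
import org.opensearch.dataprepper.metrics.PluginMetrics;
import org.opensearch.dataprepper.model.annotations.DataPrepperPlugin;
import org.opensearch.dataprepper.model.annotations.DataPrepperPluginConstructor;
import org.opensearch.dataprepper.model.event.Event;
import org.opensearch.dataprepper.model.processor.AbstractProcessor;
import org.opensearch.dataprepper.model.processor.Processor;
import org.opensearch.dataprepper.model.record.Record;

import java.util.Collection;
import java.util.List;
import java.util.UUID;

@DataPrepperPlugin(name = "phileas_redact", pluginType = Processor.class,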
                   pluginConfigurationType = PhileasRedactConfig.class)
public class PhileasRedactProcessor
        extends AbstractProcessor<Record<Event>, Record<Event>> {

  private final PhileasFilterService phileas;
  private final String policyName;
  private final List<String> fields;

  @DataPrepperPluginConstructor
  public PhileasRedactProcessor(PhileasRedactConfig config, PluginMetrics metrics)
      throws Exception {
    super(metrics);
    this.policyName = config.getPolicyName();
    this.fields     = config.getFields();
    // Load the named policy from disk. Reload on a schedule if you
    // want policy updates without restarting the pipeline.
    Policy policy = Policy.fromFile("policies/" + policyName + ".json");
    this.phileas = new PhileasFilterService(policy);
  }

  @Override
  public Collection<Record<Event>> doExecute(Collection<Record<Event>> records) {
    for (Record<Event> record : records) {
      Event event = record.getData();
      for (String field : fields) {
        String value = event.get(field, String.class);
        if (value == null) continue;
        try {
          String redacted = phileas.filter(
              policyName, "default", UUID.randomUUID().toString(), value
          ).getFilteredText();
          event.put(field, redacted);
        } catch (Exception e) {
          // Tag the event so a downstream route can divert it
          // instead of indexing a record that failed to redact.
          // EventMetadata takes a list of tags in Data Prepper 2.x.
          event.getMetadata().addTags(List.of("redaction_failed"));
        }
      }
    }
    return records;
  }

  @Override public void prepareForShutdown() { }
  @Override public boolean isReadyForShutdown() { return true; }
  @Override public void shutdown() { }
}

When to use this: JVM-only deployments, tight latency budgets, and high record volume. Embedding Phileas avoids a network hop and keeps the redaction engine in the same process as the pipeline.

Trade-offs: the redaction logic now ships inside your Data Prepper plugin. Policy or library upgrades mean rebuilding and redeploying the plugin unless you add a hot-reload mechanism, and the redaction throughput scales with the pipeline rather than on its own.
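
The hot-reload mechanism mentioned above could look roughly like the sketch below. It is not part of Phileas or Data Prepper: `ReloadingPolicy`, its `forFile` factory, and the parser function are hypothetical names, and the policy is treated as an opaque value produced by whatever parser you supply. The idea is to cache the parsed policy and re-read it only when the file's modification time changes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.function.Function;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical helper: caches a parsed policy and reloads it when its version stamp changes. */
final class ReloadingPolicy<T> {
  private final LongSupplier stamp;  // e.g. the policy file's last-modified time
  private final Supplier<T> loader;  // re-reads and parses the policy
  private long loadedStamp;
  private T current;

  ReloadingPolicy(LongSupplier stamp, Supplier<T> loader) {
    this.stamp = stamp;
    this.loader = loader;
    // Load eagerly so an unreadable policy fails at startup, not mid-stream.
    this.loadedStamp = stamp.getAsLong();
    this.current = loader.get();
  }

  /** Returns the cached policy, reloading first if the stamp has changed. */
  synchronized T get() {
    try {
      long now = stamp.getAsLong();
      if (now != loadedStamp) {
        current = loader.get();
        loadedStamp = now;
      }
    } catch (RuntimeException e) {
      // Keep serving the last good policy; the next call retries the reload.
    }
    return current;
  }

  /** Convenience factory that watches a file's modification time. */
  static <T> ReloadingPolicy<T> forFile(Path file, Function<String, T> parser) {
    return new ReloadingPolicy<>(
        () -> {
          try {
            return Files.getLastModifiedTime(file).toMillis();
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        },
        () -> {
          try {
            return parser.apply(Files.readString(file));
          } catch (IOException e) {
            throw new UncheckedIOException(e);
          }
        });
  }
}
```

A processor would call `get()` at the top of each batch. Holding on to the last good policy when a reload fails is a deliberate choice here: a half-written policy file should not stop redaction.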

Option 2: Call the Philter API over HTTP

If you would rather run redaction as a standalone service (so several pipelines, and other systems entirely, share one redaction engine and one set of policies), deploy Philter and have the processor call its REST API. Philter is stateless and policy-driven, so you can scale it independently of Data Prepper.

The processor shape is the same; only the body of the loop changes to a POST /api/filter call, where c is the context and p is the policy name:

@Override
public Collection<Record<Event>> doExecute(Collection<Record<Event>> records) {
  for (Record<Event> record : records) {
    Event event = record.getData();
    for (String field : fields) {
      String value = event.get(field, String.class);
      if (value == null) continue;
      HttpRequest req = HttpRequest.newBuilder()
          .uri(URI.create(philterUrl + "/api/filter?c=default&p=" + policyName))
          .header("Content-Type", "text/plain")
          .POST(HttpRequest.BodyPublishers.ofString(value))
          .build();
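      try {
        HttpResponse<String> resp =
            http.send(req, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() == 200) {
          event.put(field, resp.body());
        } else {
          // Non-200 response: tag the event so a route can quarantine it.
          event.getMetadata().addTags(List.of("redaction_failed"));
        }
      } catch (InterruptedException e) {
        // Preserve the interrupt so pipeline shutdown is not swallowed.
        Thread.currentThread().interrupt();
        event.getMetadata().addTags(List.of("redaction_failed"));
      } catch (Exception e) {
        event.getMetadata().addTags(List.of("redaction_failed"));
      }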
    }
  }
  return records;
}
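
The loop above concatenates the policy name straight into the URL and sets no timeout. A slightly more defensive way to build the same request might look like the sketch below. The `/api/filter` path and the `c` and `p` parameters come from the example above; the `PhilterRequests` class, its method names, and the timeout parameter are hypothetical.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

/** Hypothetical helper for building Philter filter requests. */
final class PhilterRequests {
  private PhilterRequests() { }

  /** Builds the filter URI, URL-encoding the context and policy name. */
  static URI filterUri(String baseUrl, String context, String policyName) {
    // Tolerate a trailing slash on the configured base URL.
    String base = baseUrl.endsWith("/")
        ? baseUrl.substring(0, baseUrl.length() - 1)
        : baseUrl;
    return URI.create(base + "/api/filter"
        + "?c=" + URLEncoder.encode(context, StandardCharsets.UTF_8)
        + "&p=" + URLEncoder.encode(policyName, StandardCharsets.UTF_8));
  }

  /** Builds a plain-text POST with a per-request timeout. */
  static HttpRequest filterRequest(String baseUrl, String context,
                                   String policyName, String text,
                                   Duration timeout) {
    return HttpRequest.newBuilder()
        .uri(filterUri(baseUrl, context, policyName))
        .timeout(timeout)  // bound how long one slow call can stall a batch
        .header("Content-Type", "text/plain")
        .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
        .build();
  }
}
```

With a timeout set, a slow or unreachable Philter instance surfaces as an exception in the processor's catch block, so the event is tagged rather than holding up the pipeline indefinitely.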

If you would rather not build and maintain a Java plugin at all, the stock aws_lambda processor is an alternative: point it at a small Lambda function that calls Philter and returns the redacted event. That keeps your pipeline definition fully declarative at the cost of a Lambda invocation per batch.

When to use this: polyglot environments, an organization standardizing on one redaction service across many pipelines, or operations teams who prefer scaling a stateless HTTP service over shipping an embedded library.

Trade-offs: a network round-trip per call (typically a few milliseconds inside a VPC). Batch fields per call where you can, since the Philter API accepts batched documents, and run multiple Philter instances behind a load balancer for availability.

Drive the redaction with a named policy

Both options reference the redaction policy by name (healthcare above) rather than inlining rules. This is the same convention used across Philter and Phileas integrations: a policy is a JSON document that lists the detectors to run (identifiers like SSN and credit card, dictionaries, and trained PhEye lenses) and the strategy to apply to each match.

Keeping the policy as a named, versioned file pays off in a pipeline:

One source of truth. The same healthcare policy can drive this Data Prepper pipeline, a Kafka redactor, and an ad-hoc batch job, so redaction behaves consistently everywhere.
Reviewable changes. Policy edits go through version control and code review like any other change, and you can diff exactly what a given pipeline was redacting on a given day.
Tunable cost. Deterministic detectors (SSN, credit card, IBAN) are nearly free; trained lenses cost more. Enable only the detectors a given pipeline needs.
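
To see why deterministic detectors are so cheap, consider what one amounts to: a compiled pattern and a single pass over the text, with no model inference. The sketch below is only an illustration of that idea, not how Phileas implements its SSN filter. `SsnRedactor`, the simplified pattern, and the `[SSN]` replacement token are all assumptions.

```java
import java.util.regex.Pattern;

/** Illustrative only: a simplified, dash-delimited US SSN matcher. */
final class SsnRedactor {
  // Compiled once and reused. Matches 3-2-4 digit groups on word boundaries.
  // Real detectors also handle other separators and validate number ranges.
  private static final Pattern SSN =
      Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

  private SsnRedactor() { }

  /** Replaces every match with a fixed token. */
  static String redact(String text) {
    return SSN.matcher(text).replaceAll("[SSN]");
  }
}
```

For example, `SsnRedactor.redact("ssn 123-45-6789 on file")` returns `"ssn [SSN] on file"`. A trained lens, by contrast, has to run a model over the text, which is where the extra cost comes from.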

You can build and test policies with the Redaction Policy Editor and measure them against gold-standard data with Philter Scope before they reach production.

Validate against your own data

PII and PHI detection is probabilistic. A well-tuned policy is designed to reduce how much sensitive data reaches your index, but no policy catches every instance across every input format. Treat the redaction step as one control among several, and validate its output against representative samples of your own data before you rely on it.

Two tools help you do that continuously:

Philter Scope scores a policy’s precision and recall against a labeled test set, so you can catch a regression in the policy before it ships.
Phield watches the live pipeline and alerts on anomalies, such as a sudden drop in the number of detections for a given entity type, which can signal that a parsing change moved PII into a field the policy is not redacting.
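
For readers less familiar with the two metrics, precision is the share of detected spans that were really sensitive, and recall is the share of labeled sensitive spans that were detected. The sketch below shows the arithmetic under an exact-match assumption (same offsets and same entity type). It is not Philter Scope's implementation; `Span`, `Score`, and `PolicyScorer` are hypothetical names.

```java
import java.util.HashSet;
import java.util.Set;

/** A labeled or detected entity: character offsets plus an entity type. */
record Span(int start, int end, String type) { }

/** Precision and recall for one scoring run. */
record Score(double precision, double recall) { }

final class PolicyScorer {
  private PolicyScorer() { }

  /** Scores detected spans against labeled spans using exact matching. */
  static Score score(Set<Span> labeled, Set<Span> detected) {
    // True positives are the spans present in both sets.
    Set<Span> truePositives = new HashSet<>(detected);
    truePositives.retainAll(labeled);
    int tp = truePositives.size();

    // By convention here, an empty denominator scores as 1.0.
    double precision = detected.isEmpty() ? 1.0 : (double) tp / detected.size();
    double recall = labeled.isEmpty() ? 1.0 : (double) tp / labeled.size();
    return new Score(precision, recall);
  }
}
```

For redaction, recall is usually the number to watch most closely: a missed span is sensitive data that reaches the index, while a false positive only costs some readability.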

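Wiring a redaction_failed route into your pipeline matters here too. If the processor cannot redact a record, divert it to a separate index or dead-letter sink rather than indexing it unredacted. The processors above mark failures with an event tag, so the route condition tests the tag with hasTags. A sink with no routes receives every event, so the main sink needs its own route that excludes tagged events:

  route:
    - redaction-failed: 'hasTags("redaction_failed")'
    - redacted: 'not hasTags("redaction_failed")'
  sink:
    - opensearch:
        routes: [redaction-failed]
        hosts: ["https://opensearch:9200"]
        index: "quarantine"
    - opensearch:
        routes: [redacted]
        hosts: ["https://opensearch:9200"]
        index: "app-logs"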
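
Note that the processor examples tag a failed event but leave the raw value in the field, so the quarantine index still holds unredacted text and needs tight access control. A stricter, fail-closed variant overwrites the field with a placeholder when redaction throws. The sketch below shows that logic on a plain `Map` so it runs on its own. `FailClosedRedaction` and the placeholder text are hypothetical, and a real processor would apply the same idea to a Data Prepper `Event`.

```java
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical helper: never leave a raw value behind when redaction fails. */
final class FailClosedRedaction {
  static final String PLACEHOLDER = "[REDACTION_FAILED]";

  private FailClosedRedaction() { }

  /**
   * Redacts the named string fields in place.
   * Returns true only if every field was redacted without error.
   */
  static boolean redactFields(Map<String, Object> event,
                              List<String> fields,
                              UnaryOperator<String> redactor) {
    boolean allRedacted = true;
    for (String field : fields) {
      Object value = event.get(field);
      if (!(value instanceof String)) {
        continue;  // missing or non-string fields are skipped
      }
      try {
        event.put(field, redactor.apply((String) value));
      } catch (RuntimeException e) {
        // Fail closed: drop the raw value rather than pass it downstream.
        event.put(field, PLACEHOLDER);
        allRedacted = false;
      }
    }
    return allRedacted;
  }
}
```

The caller would still tag the event when the method returns false, so the quarantine route keeps working. The trade-off is that the original text is lost, which makes failures harder to debug unless the raw record is retained somewhere with restricted access.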

The bottom line

Data Prepper already has the right shape for privacy work: a processor chain that every record passes through on its way to the index. Adding a redaction step there, ahead of the OpenSearch sink, keeps raw PII and PHI out of the search index without changing how the rest of your observability stack consumes the data.

Embed Phileas when you want in-process speed and a JVM-only footprint, or call the Philter API when you want one shared redaction service across many pipelines. Both read the same named policy, run the same detectors, and produce the same audit story. Pick whichever fits next to the pipeline you already run.

Want help fitting this to your OpenSearch ingest path? Talk to our consulting team. We have stood up redaction for log and trace pipelines in healthcare and financial services, and we would be glad to compare notes on yours.
