Reference and how-to

Redacting PII in Spark and Databricks

Redact PII column by column in Apache Spark and Databricks with a Phileas UDF: a verified PySpark example, plus the performance tradeoffs to plan for.

Spark and Databricks pipelines move large volumes of text through transformations, and any of it can carry PII: support transcripts, application logs, free-text fields, scraped content. The cleanest place to redact it is in the job itself, as a column transformation, so the data is cleaned in the same pass that processes it and never has to leave the cluster.

There is no packaged Spark connector for this, and it does not need one. Phileas is a library, so you wrap it in a user-defined function (UDF) and apply it to any text column. Phileas ships for both the JVM (Java and Scala) and Python, so this guide shows the UDF both ways: PySpark, the natural fit for Databricks, and Java for Spark on the JVM. It is a build-it-yourself pattern, written as such, with the performance tradeoffs to plan for.

A PySpark redaction UDF

Install the Phileas Python library on your cluster, then define a UDF that redacts a string and apply it to a column. The Phileas service and policy are built once per worker and reused for every row, so the cost is paid once per executor, not per record.

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from phileas import Policy, FilterService

# The redaction policy: what to detect and how to transform it.
POLICY = {
    "identifiers": {
        "emailAddress": {"emailAddressFilterStrategies": [{"strategy": "REDACT", "redactionFormat": "[EMAIL]"}]},
        "ssn": {"ssnFilterStrategies": [{"strategy": "REDACT", "redactionFormat": "[SSN]"}]},
        "phoneNumber": {"phoneNumberFilterStrategies": [{"strategy": "REDACT", "redactionFormat": "[PHONE]"}]},
    }
}

# Build the service once per worker process, not once per row.
_service = None
_policy = None

def redact(text):
    global _service, _policy
    if text is None:
        return None
    if _service is None:
        _service = FilterService()
        _policy = Policy.from_dict(POLICY)
    return _service.filter(_policy, "spark", "doc", text).filtered_text

redact_udf = udf(redact, StringType())

Apply it to a column with the DataFrame API:

cleaned = df.withColumn("note_redacted", redact_udf(df["note"]))

On a small sample, Email a@b.com becomes Email [EMAIL], SSN 123-45-6789 becomes SSN [SSN], and text with no PII is returned unchanged. You can also register the UDF for Spark SQL:

spark.udf.register("redact", redact, StringType())
spark.sql("SELECT id, redact(note) AS note_redacted FROM notes")

On Databricks

The same code runs on Databricks unchanged. Install the Phileas Python library on the cluster (a cluster library or %pip install in the notebook), define the UDF, and apply it in a notebook or job. To expose only redacted data to downstream consumers, write the redacted DataFrame to a table, or wrap the redaction in a view, and grant access to that rather than the raw table.

A Java redaction UDF

For Spark on the JVM, the same pattern uses the Phileas Java library directly. Build the filter service and prepare the policy once with service.prepare(policy), hold the prepared handle statically so it is reused across rows, and register a UDF1<String, String> that calls prepared.filter(context, text) per row. Preparing resolves the policy’s filters and post-filters a single time, so each row skips that per-call work.

import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;
import ai.philterd.phileas.PhileasConfiguration;
import ai.philterd.phileas.policy.Policy;
import ai.philterd.phileas.services.context.DefaultContextService;
import ai.philterd.phileas.services.disambiguation.vector.InMemoryVectorService;
import ai.philterd.phileas.services.filters.filtering.PlainTextFilterService;
import com.google.gson.Gson;
import java.util.Properties;

public final class PhileasUdf {

    // Prepared once per executor JVM, then shared across all task threads. prepare()
    // resolves the policy's filters a single time; the handle is safe to call
    // concurrently, so redact() takes no lock and only the one-time init is synchronized.
    // Do not wrap the filter() call itself in a lock: that would serialize every row on
    // the executor and throw away Spark's parallelism.
    private static volatile PlainTextFilterService.PreparedPolicy prepared;

    private static synchronized PlainTextFilterService.PreparedPolicy init() throws Exception {
        if (prepared == null) {
            String json = "{ \"identifiers\": {"
                + " \"emailAddress\": {\"emailAddressFilterStrategies\":[{\"strategy\":\"REDACT\",\"redactionFormat\":\"[EMAIL]\"}]},"
                + " \"ssn\": {\"ssnFilterStrategies\":[{\"strategy\":\"REDACT\",\"redactionFormat\":\"[SSN]\"}]} } }";
            Policy policy = new Gson().fromJson(json, Policy.class);
            PlainTextFilterService service = new PlainTextFilterService(
                new PhileasConfiguration(new Properties()),
                new DefaultContextService(), new InMemoryVectorService(), null);
            // Resolve the policy once, then reuse the prepared handle per row.
            prepared = service.prepare(policy);
        }
        return prepared;
    }

    private static String redact(String text) throws Exception {
        if (text == null) {
            return null;
        }
        PlainTextFilterService.PreparedPolicy p = prepared;
        if (p == null) {
            p = init();
        }
        return p.filter("spark", text).getFilteredText();
    }

    public static void register(SparkSession spark) {
        spark.udf().register("redact", (UDF1<String, String>) PhileasUdf::redact, DataTypes.StringType);
    }
}

PhileasUdf.register(spark);

// Spark SQL
spark.sql("SELECT id, redact(note) AS note_redacted FROM notes").show(false);

// DataFrame API
df.selectExpr("id", "redact(note) AS note_redacted");

The Scala API is the same idea: prepare the policy once and register a UDF that calls prepared.filter(context, text).getFilteredText(). Whichever language, manage the Phileas dependency in your build and run on a JDK your Spark version supports.

Performance and what to plan for

A UDF runs on every row, so treat it as work on the hot path and tune accordingly.

Build and prepare once per worker, then share it lock-free. Constructing the Phileas service and resolving the policy is not free. Build the service and call service.prepare(policy) once per executor (the lazy singleton above), never per row, so each row reuses the prepared handle instead of re-resolving the policy. Share that one handle across the executor’s task threads; it is safe to call concurrently, so only the one-time init is synchronized. Do not put a lock around the filter() call itself, or you serialize every row and lose Spark’s parallelism.
Keep the policy lean for throughput. Pattern-based types (emails, phone numbers, SSNs, card numbers) are cheap. Model-based name detection is far heavier; if you only need the structured types, leave the models out.
For heavy name detection, use a service. Loading a model on every executor is expensive. When you need model-based detection at scale, run the model in PhEye as a service the job calls, rather than embedding it in every worker.
Partition and batch sensibly. Redaction parallelizes naturally across workers; size partitions so the per-row cost stays amortized.

A note on accuracy

Detection is probabilistic and configurable. These tools reduce how much sensitive data gets through; they do not catch every instance. Validate your policy against representative data from your own tables, set thresholds for your tolerance, and treat redaction as one layer of a defense-in-depth strategy. You remain responsible for the data you process.

Where to go next

Writing your first redaction policy and the policy schema guide to build the policy the UDF applies.
Self-hosted PII detection for running detection in your own boundary, including in Spark.
PhEye for serving name-detection models when you need them at scale.

Frequently asked questions

Is there a Spark connector to install?

Not a packaged one. This is a build-it-yourself pattern: you wrap Phileas in a Spark UDF, which is a few lines. The code below is small and runs in-process on your executors, so it does not need a separate connector. If you would rather have a maintained, packaged artifact, that is tracked separately.

Does data leave my cluster?

No. Phileas runs inside your Spark executors, so detection and redaction happen where the data already is. Nothing is sent to a third-party service. See self-hosted PII detection for the broader picture.

Will this scale to large jobs?

The UDF runs per row, so cost matters. Build the Phileas service once per worker (as shown), use a lean pattern-based policy for high-throughput jobs, and reserve model-based name detection for where you need it. For heavy NER at scale, run the model in PhEye as a service rather than loading it on every executor.