Snowflake PII Redaction: A Practical Integration Guide

Snowflake is the data warehouse for a huge swath of mid-to-large enterprises, and that’s exactly why it’s where unstructured PII piles up. Customer-support transcripts, scanned-document text, application logs, chat history, transaction descriptions: everything that’s text and was ever called “data” eventually lands in a Snowflake table.

Snowflake’s built-in dynamic data masking and external tokenization handle column-level structured data well (a SSN column, a credit card column), but they don’t address the harder problem: PII buried inside free-text columns. A customer-service-ticket table with a TEXT column holding the full conversation is the canonical case. The SSN is in there somewhere, but it’s not in a column you can mask.

This post walks through three production-grade patterns for solving that problem with Philter , with code, trade-offs, and guidance on when to pick each.

The three patterns

The patterns differ in where the redaction happens relative to Snowflake:

External function: Snowflake calls Philter via an HTTPS endpoint (typically AWS API Gateway / Lambda or Google Cloud Functions in front of a Philter instance).
Java UDF embedding Phileas: the redaction runs inside Snowflake’s compute layer using a UDF that includes the Phileas library directly.
ETL-stage redaction: redact before data lands in Snowflake, in your Snowpipe or Kafka Connect path.

All three are valid; the right pick depends on your latency tolerance, your existing platform, and where redaction needs to happen in the data flow.

Pattern 1: External function

The most flexible pattern. Snowflake exposes an EXTERNAL FUNCTION mechanism that lets a SQL query call out to an HTTPS endpoint. Wire that endpoint to a Philter instance and you can redact text from a SQL query as if it were a built-in function.

The setup has four pieces: a Philter deployment (typically on the AWS Marketplace), an API Gateway in front of it, an integration policy in Snowflake, and the function definition itself.

-- 1. Create the API integration (one-time, by ACCOUNTADMIN)
CREATE OR REPLACE API INTEGRATION philter_integration
  API_PROVIDER = aws_api_gateway
  API_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-philter-role'
  API_ALLOWED_PREFIXES = ('https://abc123.execute-api.us-east-1.amazonaws.com/prod/')
  ENABLED = TRUE;

-- 2. Create the external function
CREATE OR REPLACE EXTERNAL FUNCTION redact_pii(text VARCHAR)
  RETURNS VARCHAR
  API_INTEGRATION = philter_integration
  AS 'https://abc123.execute-api.us-east-1.amazonaws.com/prod/redact';

-- 3. Use it in queries
SELECT
    ticket_id,
    redact_pii(transcript) AS transcript_redacted
FROM customer_support.tickets
WHERE created_at > CURRENT_DATE - 7;

The Lambda function behind the API Gateway is a thin proxy: receive Snowflake’s batched payload, forward each row to Philter’s /api/filter endpoint, return the results. Snowflake batches automatically (typically 100 rows per call), so you’re not paying network overhead per row.

When to use: teams already running Philter as a service for other applications; SQL-driven workflows; analyst-friendly access where the redaction happens transparently inside the query.

Trade-offs: requires AWS / GCP integration setup; per-call network hop adds a few milliseconds; batched-payload limits cap the per-query rate. For huge bulk queries, the Java UDF or ETL-stage patterns are faster.

Pattern 2: Java UDF with embedded Phileas

Snowflake supports Java UDFs that can include external JAR dependencies. Phileas is a Java library; you can include it as a dependency in a UDF, load a policy from a Snowflake stage, and run redaction entirely inside Snowflake’s compute layer, with no external service required.

-- Upload Phileas + dependencies to a Snowflake stage
PUT file:///path/to/phileas-uber.jar @philter_jars/;
PUT file:///path/to/healthcare-policy.json @philter_jars/;

-- Create the UDF
CREATE OR REPLACE FUNCTION redact_pii(text VARCHAR)
  RETURNS VARCHAR
  LANGUAGE JAVA
  RUNTIME_VERSION = '11'
  IMPORTS = ('@philter_jars/phileas-uber.jar',
             '@philter_jars/healthcare-policy.json')
  HANDLER = 'PhileasRedactor.redact'
  AS $$
    import io.philterd.phileas.model.policy.Policy;
    import io.philterd.phileas.services.PhileasFilterService;
    import java.io.InputStream;
    import java.util.UUID;

    public class PhileasRedactor {
        private static final PhileasFilterService PHILEAS;
        static {
            try (InputStream in = PhileasRedactor.class
                    .getResourceAsStream("/healthcare-policy.json")) {
                PHILEAS = new PhileasFilterService(Policy.fromStream(in));
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
        public static String redact(String text) {
            if (text == null) return null;
            try {
                return PHILEAS.filter(
                    "healthcare", "default",
                    UUID.randomUUID().toString(),
                    text
                ).getFilteredText();
            } catch (Exception e) {
                return null;
            }
        }
    }
  $$;

When to use: you want zero external dependencies; you want redaction to happen inside Snowflake’s billing perimeter (no extra Lambda costs); your policy doesn’t change often (UDF deployment is heavier than updating an API).

Trade-offs: the JAR upload + UDF definition is more upfront work; updating the policy requires a UDF redeploy (or you load it from a stage at query time, with a cold-start cost on each call); JVM startup time is a noticeable fraction of small-table queries.

Pattern 3: ETL-stage redaction (before the data lands)

The architecturally cleanest pattern: redact before the data ever enters Snowflake. If your ingestion path is Kafka → Snowflake (via Snowpipe or Kafka Connect), insert the redaction stage in the streaming layer. If it’s S3 → Snowpipe, redact in the S3 writer.

  Producers ──▶ Kafka raw-events ──▶ Philter redaction stage
                                                │
                                                ▼
                                        Kafka redacted-events
                                                │
                                                ▼
                                          Snowpipe / Connect
                                                │
                                                ▼
                                       Snowflake (only clean data)

This is the pattern we detailed in “Architecting Privacy in Kafka” . The Snowflake side just sees clean data; no UDF, no external function, no per-query cost. Snowflake’s existing column masking can then handle structured PII in dedicated columns (SSN, credit card) while the text columns are already clean.

When to use: you control the ingestion path; you have a clear “boundary” where PII can be intercepted before Snowflake; you want zero per-query overhead.

Trade-offs: doesn’t help with existing data already in Snowflake (you need a one-time backfill pass); requires coordination with the team that owns the ingestion pipeline; can’t selectively redact based on the query (everything in Snowflake is redacted, period).

A hybrid you’ll actually want

Most production deployments end up combining patterns. A typical layout:

ETL-stage redaction for high-volume streaming data (Pattern 3), the cheapest per record and keeps Snowflake clean by default.
External function for ad-hoc analyst queries (Pattern 1), for when an analyst wants to query data that arrived before the streaming redaction was in place, or when they need a different policy than what was applied at ingestion.
Java UDF for batch ETL jobs inside Snowflake (Pattern 2), for periodic backfills, audit passes, ML feature engineering that has to happen inside Snowpark.

All three call the same Phileas engine with the same policy file. The hybrid setup lets you pick the right execution pattern per workload without forking the detection logic.

Performance characteristics

Some rules of thumb from the deployments we’ve seen:

External function (Pattern 1). 2-5ms per row beyond Philter’s own redaction time, dominated by the API Gateway round-trip. Best amortized over batched payloads (Snowflake automatically batches up to ~100 rows per call).
Java UDF (Pattern 2). Faster per row once the UDF is warm (no network hop), but JVM cold-start can add 1-2 seconds to the first call in a session. Use Snowflake’s UDF caching to keep the JVM warm across queries.
ETL-stage (Pattern 3). Effectively free at query time. The redaction cost moves to your streaming infrastructure, where it amortizes across the pipeline.

None of the three is “the fastest” universally; the right choice depends on whether your dominant cost is per-query latency or per-record throughput.

Validating the result

However you apply redaction, the question becomes: did it work? The same answer applies in Snowflake as anywhere else. Philter Scope measures precision and recall against a gold-standard set, and you should run it in CI on every policy change. The CI pattern works identically whether your redaction lives in a UDF, an external function, or an upstream Kafka stage.

The bottom line

Redacting PII inside Snowflake is solvable in three different ways, and the right pick depends on where in your data flow the redaction needs to happen. External functions are the most flexible. Java UDFs are the most self-contained. ETL-stage redaction is the architecturally cleanest. Most production stacks end up using more than one.

If you’re standing up Snowflake-resident PII redaction for the first time, or trying to migrate from a hand-rolled solution that’s not keeping up, get in touch . We’ve done this for healthcare claims warehouses, financial-services customer-data lakes, and SaaS-platform telemetry stores; the patterns above are stable, and the policy details are where the consulting work concentrates.