Talk to an Expert

Tell us about your stack and the privacy problems you're trying to solve. We typically respond within one business day.

Prefer email? support@philterd.ai

Prefer to skip the form? Pick a time on our calendar →
or send a message

Please do not enter PII or PHI in this form. If you need to share an example, use a sanitized one.

← All posts

Introducing PhiSQL: The Query Language for PII Operations

Every tool in the Philterd toolkit has its own way of being told what to do. Philter and Phileas read a JSON redaction policy. Phinder takes command-line flags to scan a bucket. Phield is configured with its own monitoring rules. Philter Scope is pointed at a gold-standard set to benchmark precision and recall. Each surface is reasonable on its own, but together they ask you to hold four different mental models for what is, conceptually, one job: deciding how sensitive data should be handled.

Worse, the most common surface (the redaction policy) is JSON. JSON is a fine machine format and a poor authoring format. A HIPAA Safe Harbor policy is dozens of nested objects. A reviewer looking at a pull request sees a wall of braces, not a decision. “Redact every SSN” should not be hard to read.

PhiSQL is our answer: a single declarative language for the whole PII privacy lifecycle. Today it covers redaction. Over time it will cover discovery, monitoring, and benchmarking too, with the same grammar and the same mental model across every tool.

What it looks like

Here is a complete PhiSQL policy:

-- Redact U.S. Social Security Numbers.
POLICY ssn_only;

REDACT SSN WITH MASK;

That compiles to the canonical Phileas JSON policy:

{
  "identifiers": {
    "ssn": {
      "ssnFilterStrategies": [
        { "strategy": "MASK" }
      ]
    }
  }
}

The PhiSQL is the thing a human reads and reviews. The JSON is the thing the runtime executes. You can still hand-write the JSON whenever you want to; PhiSQL is an authoring convenience on top of it, not a replacement for it.

Why SQL-like

SQL is the closest thing our industry has to a universal language for operating on data. Data engineers, analysts, and DBAs already think in SELECT, WHERE, and verbs that act on a target. A declarative, SQL-flavored syntax for privacy operations means the people who already manage the data can manage how its sensitive parts are handled, with a near-zero learning curve. REDACT SSN WITH MASK reads the way it works.

It also keeps policies reviewable. PhiSQL files diff cleanly in Git, fit in a pull request, and read like intent rather than configuration.

The verb families in v0.1

The v0.1 draft covers the redaction subset. There are three verb families, each with a worked example below.

POLICY: declare the policy

Every policy opens with a POLICY declaration. The name is optional and, when present, must match the filename. An optional DESCRIPTION compiles to a sibling Markdown file.

POLICY support_tickets
  DESCRIPTION 'Customer support ticket redaction with allowlist.';

REDACT: redact specific entities

REDACT applies a strategy to one or more entity types. Strategies include MASK, LAST_4, STATIC_REPLACE, and format-preserving encryption with FPE_ENCRYPT. A WHERE predicate can gate the rule on detection confidence.

-- PCI DSS v4.0 scope reduction: keep only the last 4 digits of a PAN,
-- and only when the detector is confident.
POLICY pci_dss_scope_reduction
  DESCRIPTION 'PCI DSS v4.0 scope reduction.';

REDACT CREDIT_CARD WITH LAST_4 WHERE CONFIDENCE > 0.85;

Strategies take named arguments where they need them:

REDACT FIRST_NAME, SURNAME WITH STATIC_REPLACE(value='Customer', scope=document);
REDACT EMAIL_ADDRESS WITH MASK;
REDACT PHONE_NUMBER WITH MASK;

DEIDENTIFY: apply mixed strategies across many entities

DEIDENTIFY is the convenience form for de-identification, where different entity types each get their own strategy. This is HIPAA Safe Harbor (45 CFR 164.514(b)(2)) expressed as one readable block:

POLICY hipaa_safe_harbor
  DESCRIPTION 'HIPAA Safe Harbor de-identification.';

DEIDENTIFY
  PHYSICIAN_NAME  AS RANDOM_REPLACE,
  HOSPITAL        AS RANDOM_REPLACE,
  DATE            AS TRUNCATE,
  AGE             AS REDACT,
  SSN             AS REDACT,
  PHONE_NUMBER    AS REDACT,
  EMAIL_ADDRESS   AS REDACT,
  STREET_ADDRESS  AS REDACT,
  CITY            AS REDACT,
  STATE           AS REDACT,
  ZIP_CODE        AS REDACT;

IGNORE: carve out exceptions

IGNORE keeps known-safe values from being redacted. Allowlist a customer’s own company names so they survive the redaction pass:

IGNORE TERMS ('Acme', 'AcmeCorp') FOR FIRST_NAME;
IGNORE TERMS ('Corp', 'Support', 'Engineering') FOR SURNAME;

Every one of these compiles to a standard Phileas JSON policy that Philter and Phileas already know how to run. There is no new runtime and no migration.

A specification with a reference implementation

PhiSQL is defined as an open specification, with a Java reference implementation that proves it. Both live in the philterd/phisql repository under the permissive Apache 2.0 license.

  • The specification lives under spec/v0.1/: an ANTLR4 grammar and EBNF, a catalog of entity types, strategies, keywords, and predicates, plus worked examples that pair each .phisql file with the JSON it compiles to.
  • The reference implementation is a Java parser and compiler in reference/, published to Maven as ai.philterd:phisql. Build it with mvn verify.

CI keeps the two from drifting: the reference implementation generates its parser directly from the spec’s grammar and parses every example file in its test suite, and a separate workflow validates that every example compiles to JSON that conforms to the canonical Phileas schema.

The governing principle is Phileas JSON leads; PhiSQL follows. Anything PhiSQL can express must be representable as Phileas JSON. PhiSQL never introduces a construct that has no JSON equivalent, the runtime never changes to accommodate the language, and existing JSON policies stay canonical forever. There is no second source of truth and no lock-in.

Trademark-governed conformance

“PhiSQL” is a registered trademark of Philterd, LLC. The specification is freely readable and freely implementable: anyone can write a parser, a compiler, or an SDK against it. What is reserved is the name. “PhiSQL” is meant to mean a defined thing, so the name is reserved for implementations that pass the conformance test suite (forthcoming at philterd/phisql-conformance).

While v0.1 is a draft, no implementation should claim conformance: the grammar and semantics can still change before v1.0. Pin a version if you build on it now.

What is coming next

v0.1 is deliberately scoped to redaction so the foundation is solid before the language grows. The plan is to extend the same grammar to the rest of the lifecycle, one verb family per tool:

  • Discovery via Phinder: inventory where PII actually lives across your storage.
  • Monitoring via Phield: watch how sensitive data flows and alert on the unexpected.
  • Benchmarking via Philter Scope: score a policy on precision and recall against a gold standard.
  • Cross-tool joins: compose operations that today require stitching several tools together by hand.

The syntax below is illustrative of the direction and is not yet implemented. It exists to show where the language is headed, not to be relied on:

-- Discovery (planned): inventory where PII lives.
FIND PII IN 's3://patient-records/' WHERE CONFIDENCE > 0.8;

-- Benchmarking (planned): score a policy on precision and recall.
BENCHMARK POLICY hipaa_safe_harbor AGAINST 'gold-standard/';

-- Monitoring (planned): alert on unexpected PII flow.
MONITOR PII ON 'kafka://topic/events' ALERT WHEN VOLUME > 1000;

Get involved

PhiSQL is being designed in the open, and the draft phase is exactly when feedback shapes the language most.

If your team already lives in SQL, PhiSQL should feel like home the moment you read your first policy. Tell us where it does not, and help us get it right before v1.0.