Introducing PhiSQL: The Query Language for PII Operations
Every tool in the Philterd toolkit has its own way of being told what to do. Philter and Phileas read a JSON redaction policy. Phinder takes command-line flags to scan a bucket. Phield is configured with its own monitoring rules. Philter Scope is pointed at a gold-standard set to benchmark precision and recall. Each surface is reasonable on its own, but together they ask you to hold four different mental models for what is, conceptually, one job: deciding how sensitive data should be handled.
Worse, the most common surface (the redaction policy) is JSON. JSON is a fine machine format and a poor authoring format. A HIPAA Safe Harbor policy is dozens of nested objects. A reviewer looking at a pull request sees a wall of braces, not a decision. “Redact every SSN” should not be hard to read.
PhiSQL is our answer: a single declarative language for the whole PII privacy lifecycle. Today it covers redaction. Over time it will cover discovery, monitoring, and benchmarking too, with the same grammar and the same mental model across every tool.
What it looks like
Here is a complete PhiSQL policy:
-- Redact U.S. Social Security Numbers.
POLICY ssn_only;
REDACT SSN WITH MASK;
That compiles to the canonical Phileas JSON policy:
{
"identifiers": {
"ssn": {
"ssnFilterStrategies": [
{ "strategy": "MASK" }
]
}
}
}
The PhiSQL is the thing a human reads and reviews. The JSON is the thing the runtime executes. You can still hand-write the JSON whenever you want to; PhiSQL is an authoring convenience on top of it, not a replacement for it.
Why SQL-like
SQL is the closest thing our industry has to a universal language for operating on data. Data engineers, analysts, and DBAs already think in SELECT, WHERE, and verbs that act on a target. A declarative, SQL-flavored syntax for privacy operations means the people who already manage the data can manage how its sensitive parts are handled, with a near-zero learning curve. REDACT SSN WITH MASK reads the way it works.
It also keeps policies reviewable. PhiSQL files diff cleanly in Git, fit in a pull request, and read like intent rather than configuration.
The verb families in v0.1
The v0.1 draft covers the redaction subset. There are three verb families, each with a worked example below.
POLICY: declare the policy
Every policy opens with a POLICY declaration. The name is optional and, when present, must match the filename. An optional DESCRIPTION compiles to a sibling Markdown file.
POLICY support_tickets
DESCRIPTION 'Customer support ticket redaction with allowlist.';
REDACT: redact specific entities
REDACT applies a strategy to one or more entity types. Strategies include MASK, LAST_4, STATIC_REPLACE, and format-preserving encryption with FPE_ENCRYPT. A WHERE predicate can gate the rule on detection confidence.
-- PCI DSS v4.0 scope reduction: keep only the last 4 digits of a PAN,
-- and only when the detector is confident.
POLICY pci_dss_scope_reduction
DESCRIPTION 'PCI DSS v4.0 scope reduction.';
REDACT CREDIT_CARD WITH LAST_4 WHERE CONFIDENCE > 0.85;
Strategies take named arguments where they need them:
REDACT FIRST_NAME, SURNAME WITH STATIC_REPLACE(value='Customer', scope=document);
REDACT EMAIL_ADDRESS WITH MASK;
REDACT PHONE_NUMBER WITH MASK;
DEIDENTIFY: apply mixed strategies across many entities
DEIDENTIFY is the convenience form for de-identification, where different entity types each get their own strategy. This is HIPAA Safe Harbor (45 CFR 164.514(b)(2)) expressed as one readable block:
POLICY hipaa_safe_harbor
DESCRIPTION 'HIPAA Safe Harbor de-identification.';
DEIDENTIFY
PHYSICIAN_NAME AS RANDOM_REPLACE,
HOSPITAL AS RANDOM_REPLACE,
DATE AS TRUNCATE,
AGE AS REDACT,
SSN AS REDACT,
PHONE_NUMBER AS REDACT,
EMAIL_ADDRESS AS REDACT,
STREET_ADDRESS AS REDACT,
CITY AS REDACT,
STATE AS REDACT,
ZIP_CODE AS REDACT;
IGNORE: carve out exceptions
IGNORE keeps known-safe values from being redacted. Allowlist a customer’s own company names so they survive the redaction pass:
IGNORE TERMS ('Acme', 'AcmeCorp') FOR FIRST_NAME;
IGNORE TERMS ('Corp', 'Support', 'Engineering') FOR SURNAME;
Every one of these compiles to a standard Phileas JSON policy that Philter and Phileas already know how to run. There is no new runtime and no migration.
A specification with a reference implementation
PhiSQL is defined as an open specification, with a Java reference implementation that proves it. Both live in the philterd/phisql repository under the permissive Apache 2.0 license.
- The specification lives under
spec/v0.1/: an ANTLR4 grammar and EBNF, a catalog of entity types, strategies, keywords, and predicates, plus worked examples that pair each.phisqlfile with the JSON it compiles to. - The reference implementation is a Java parser and compiler in
reference/, published to Maven asai.philterd:phisql. Build it withmvn verify.
CI keeps the two from drifting: the reference implementation generates its parser directly from the spec’s grammar and parses every example file in its test suite, and a separate workflow validates that every example compiles to JSON that conforms to the canonical Phileas schema.
The governing principle is Phileas JSON leads; PhiSQL follows. Anything PhiSQL can express must be representable as Phileas JSON. PhiSQL never introduces a construct that has no JSON equivalent, the runtime never changes to accommodate the language, and existing JSON policies stay canonical forever. There is no second source of truth and no lock-in.
Trademark-governed conformance
“PhiSQL” is a registered trademark of Philterd, LLC. The specification is freely readable and freely implementable: anyone can write a parser, a compiler, or an SDK against it. What is reserved is the name. “PhiSQL” is meant to mean a defined thing, so the name is reserved for implementations that pass the conformance test suite (forthcoming at philterd/phisql-conformance).
While v0.1 is a draft, no implementation should claim conformance: the grammar and semantics can still change before v1.0. Pin a version if you build on it now.
What is coming next
v0.1 is deliberately scoped to redaction so the foundation is solid before the language grows. The plan is to extend the same grammar to the rest of the lifecycle, one verb family per tool:
- Discovery via Phinder: inventory where PII actually lives across your storage.
- Monitoring via Phield: watch how sensitive data flows and alert on the unexpected.
- Benchmarking via Philter Scope: score a policy on precision and recall against a gold standard.
- Cross-tool joins: compose operations that today require stitching several tools together by hand.
The syntax below is illustrative of the direction and is not yet implemented. It exists to show where the language is headed, not to be relied on:
-- Discovery (planned): inventory where PII lives.
FIND PII IN 's3://patient-records/' WHERE CONFIDENCE > 0.8;
-- Benchmarking (planned): score a policy on precision and recall.
BENCHMARK POLICY hipaa_safe_harbor AGAINST 'gold-standard/';
-- Monitoring (planned): alert on unexpected PII flow.
MONITOR PII ON 'kafka://topic/events' ALERT WHEN VOLUME > 1000;
Get involved
PhiSQL is being designed in the open, and the draft phase is exactly when feedback shapes the language most.
- Read the PhiSQL overview on the site.
- Read the specification and the worked examples.
- Try the reference implementation.
- File issues and propose extensions in the repository.
If your team already lives in SQL, PhiSQL should feel like home the moment you read your first policy. Tell us where it does not, and help us get it right before v1.0.