How We Built PhEye's PII Name Models
We’re releasing three new models for PhEye: ph-eye-pii-en-small, ph-eye-pii-en-medium, and ph-eye-pii-en-large. They are English person-name detectors, fine-tuned from GLiNER and published on Hugging Face under a commercially friendly license.
Three sizes, one job: find names in unstructured text so the rest of the Philterd stack can redact them. This post is about why we built models that do only one thing, how we trained them, and how they fit into PhEye, Phileas, and Philter.
Why a model that finds only one thing
The central design decision in these models is what they don’t detect. They find names. Not emails, not phone numbers, not SSNs, not credit cards, not IP addresses. One label: name.
That’s deliberate. Names are the one PII type that genuinely needs a machine-learning model. They have no fixed format, they overlap with ordinary words, and whether a token is a name depends entirely on context. “Apple” is a fruit, a company, or a patient’s surname depending on the sentence around it. “Huntington” is a person in one document and a disease in the next. There is no regex for that; it takes a model that reads the sentence.
The other common PII types are the opposite. Emails, phone numbers, SSNs, credit cards, and IP addresses all follow regular patterns. They are caught more reliably, and for a tiny fraction of the compute, with regexes, checksum validation, and dictionary lookups. Spending a neural network on a Luhn-checkable credit card number is overkill, and worse, it’s less reliable than the deterministic check.
So we don’t. In the Philterd stack, the structured types are handled by the pattern-based detectors in Phileas and Philter, the logic layer. These models handle the part that actually requires intelligence. This is the same hybrid philosophy we apply everywhere: deterministic logic for the deterministic types, specialized AI for the hard, contextual ones. Each model focuses on the 20% of the problem that the regexes can’t touch.
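To make the deterministic side concrete, here is a minimal sketch of a pattern-plus-checksum detector for credit card numbers. It is an illustration, not the Phileas or Philter implementation: the regex, the function names, and the 13-digit minimum are assumptions chosen for the example.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 13:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return substrings that look like card numbers and pass the checksum."""
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]
```

The checksum rejects most random digit runs that happen to match the pattern, which is the kind of cheap, repeatable check a neural model cannot match.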
Recall comes first
A redaction model has an asymmetric cost function. Missing a name is a leak: sensitive data escapes. Flagging an extra span is over-redaction: annoying, but safe. Those are not equally bad, and we don’t treat them as such.
Every one of these models is recall-leaning by design. We tune for recall, not for the F1 number that would look best in a table. When in doubt, the model flags the span. You can always dial precision back up with a higher confidence threshold once you know your data; you can’t recover a name the model never emitted.
That priority shows up in the confidence thresholds we recommend per model (more on that below) and in how we evaluate: recall is the number we watch.
How we built them
The recipe is the same across all three sizes; only the base model changes.
Base models. We fine-tuned the GLiNER -v2.1 family: gliner_small-v2.1 (DeBERTa-v3-small), gliner_medium-v2.1 (DeBERTa-v3-base), and gliner_large-v2.1 (DeBERTa-v3-large). GLiNER is a strong zero-shot NER architecture, and the -v2.1 bases are Apache-2.0, which matters: a non-commercial base would make the whole derived model non-commercial. Three base sizes give us a small, low-latency model for on-device use, a mid-size default, and a high-capacity model for server-side workloads.
Training data. We trained on nvidia/Nemotron-PII, a synthetic PII dataset spanning 50+ industries with 100k training and 100k test documents, released under CC-BY-4.0. Synthetic data is a real advantage here: it’s diverse, it’s labeled, and it carries none of the consent and exposure problems of training a PII model on actual personal data.
Teaching the model what a name is. Nemotron labels names as separate first_name and last_name spans. PhEye, and standard NER benchmarks, want a single full-name span. So during data preparation we map both labels to name, then merge adjacent first/last spans into one span. “Maria” plus “Gonzalez” becomes “Maria Gonzalez,” which is what gets redacted downstream.
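The label mapping and merge step can be sketched as follows. This is an illustration under stated assumptions, not the actual preparation script: spans are assumed to be `(start, end, label)` character offsets, and "adjacent" is assumed to mean separated only by whitespace.

```python
NAME_LABELS = {"first_name", "last_name"}


def merge_name_spans(text, spans):
    """Map first_name/last_name spans to `name` and merge adjacent ones.

    `spans` is a list of (start, end, label) character offsets.
    Spans with any other label are dropped, since the models have one label.
    """
    names = sorted((s, e) for s, e, label in spans if label in NAME_LABELS)
    merged = []
    for start, end in names:
        if merged:
            prev_start, prev_end = merged[-1]
            # Merge when only whitespace (or nothing) separates the two spans.
            if text[prev_end:start].strip() == "":
                merged[-1] = (prev_start, max(prev_end, end))
                continue
        merged.append((start, end))
    return [(s, e, "name") for s, e in merged]
```

Spans separated by any non-whitespace text, such as "Maria or Lee", stay separate.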
Handling long documents. A DeBERTa-v3 model has a 512-token window. A name on page three of a long document would normally fall outside that window and never get learned. So we window long documents at roughly 1,200 characters, cutting on whitespace and never through a span, during both training and evaluation. That keeps training honest and makes evaluation match how PhEye actually serves inference.
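A minimal sketch of that windowing rule, assuming spans are `(start, end, label)` character offsets. The function name and the fallback behavior (letting a window run past the limit when no safe cut exists before it) are assumptions for illustration, not the pipeline's actual code.

```python
def window_document(text, spans, max_chars=1200):
    """Split `text` into windows of roughly `max_chars` characters.

    Cuts fall only on whitespace and never strictly inside a span.
    Returns a list of (window_text, window_spans) with offsets rebased
    to the start of each window.
    """
    def valid_cut(pos):
        return text[pos].isspace() and not any(s < pos < e for s, e, _ in spans)

    windows = []
    start, n = 0, len(text)
    while start < n:
        cut = n
        if n - start > max_chars:
            # Prefer the latest safe cut at or before the size limit.
            backward = range(start + max_chars, start, -1)
            cut = next((p for p in backward if valid_cut(p)), None)
            if cut is None:
                # No safe cut in range: let this window run long instead.
                forward = range(start + max_chars + 1, n)
                cut = next((p for p in forward if valid_cut(p)), n)
        inside = [(s - start, e - start, label)
                  for s, e, label in spans if start <= s and e <= cut]
        windows.append((text[start:cut], inside))
        start = cut
    return windows
```

Because no cut lands inside a span, every labeled name appears whole in exactly one window, and concatenating the windows reproduces the original text.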
Letting the validation curve decide. We don’t pick an epoch count. Training scores validation F1 after every epoch, keeps the best checkpoint, and stops early when F1 plateaus. The model trains exactly as long as it keeps improving and no longer.
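In outline, that loop looks like the sketch below. `train_one_epoch` and `evaluate_f1` are hypothetical callables standing in for the real training and validation steps, and the `patience` and `max_epochs` values are assumptions; the actual plateau criterion is not specified here.

```python
def train_with_early_stopping(train_one_epoch, evaluate_f1, patience=3, max_epochs=100):
    """Train until validation F1 stops improving; return the best checkpoint.

    `train_one_epoch(epoch)` returns a checkpoint object (hypothetical).
    `evaluate_f1(checkpoint)` returns validation F1 for it (hypothetical).
    `max_epochs` is only a safety cap, not a tuned epoch count.
    """
    best_f1, best_epoch, best_state = float("-inf"), -1, None
    for epoch in range(max_epochs):
        state = train_one_epoch(epoch)
        f1 = evaluate_f1(state)
        if f1 > best_f1:
            best_f1, best_epoch, best_state = f1, epoch, state
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop and keep the best.
            break
    return best_state, best_f1, best_epoch
```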
What the numbers say, and what they don’t
On the held-out Nemotron-PII test split, at each model’s recommended threshold, exact span-and-label matching:
| Model | Base | Threshold | Precision | Recall | F1 |
|---|---|---|---|---|---|
| ph-eye-pii-en-small | DeBERTa-v3-small (~580 MB) | 0.90 | 0.96 | 0.99 | 0.98 |
| ph-eye-pii-en-medium | DeBERTa-v3-base (~745 MB) | 0.70 | 0.96 | 0.99 | 0.98 |
| ph-eye-pii-en-large | DeBERTa-v3-large (~1.7 GB) | 0.95 | 0.97 | 0.99 | 0.98 |
Recall sits around 0.99 across the board: the models find nearly every name. The recommended thresholds differ by size because the models calibrate their confidence differently; each card documents the operating point we evaluated. medium is the recommended default for most uses.
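For readers who want to reproduce this style of scoring on their own data, exact span-and-label matching can be sketched as below. The tuple layout `(doc_id, start, end, label)` is an assumption for the example; a prediction counts as correct only if all four fields match a gold span.

```python
def span_prf(gold, predicted):
    """Precision, recall, and F1 under exact span-and-label matching.

    `gold` and `predicted` are iterables of (doc_id, start, end, label).
    A span that is off by even one character counts as a miss.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Exact matching is strict: a prediction that covers "Maria" but not "Gonzalez" counts as both a false positive and a missed name.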
These numbers are in-distribution. The models were trained and tested on the same synthetic Nemotron distribution, so they represent a ceiling, not a production guarantee. Accuracy on real, non-synthetic English will be lower; how much lower depends on how far your text sits from the training distribution. Validate precision and recall on your own data before you rely on these for anything that matters. A redaction model you haven’t measured on your own text is a model you don’t actually know.
Engineering for a redaction model specifically
A model whose job is to prevent data leaks has failure modes that ordinary NER models don’t, and we built guardrails for them.
One such guardrail is the ONNX export parity gate. We export each model to ONNX so PhEye can serve it efficiently, but a model that quietly loses a few points of recall during export would leak names in production while every dashboard still showed green. The export refuses to publish if the ONNX model disagrees with the original PyTorch model. A divergence aborts the release. We’d rather ship nothing than ship a model that’s silently worse than the one we evaluated. (We also tried int8 quantization: it collapsed recall for this task, so we don’t ship it.)
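A parity gate of this kind can be sketched as a comparison of two predictors over a fixed set of texts. Everything here is an assumption for illustration: the two `predict_*` callables, the `(start, end, label, score)` entity layout, and the score tolerance. The actual gate's criteria may differ.

```python
import math


def parity_gate(texts, predict_reference, predict_exported, score_tol=1e-3):
    """Raise if the exported model disagrees with the reference model.

    Each predictor (hypothetical) maps a text to a list of
    (start, end, label, score) tuples. Spans and labels must match
    exactly; scores must agree within `score_tol`.
    """
    for text in texts:
        ref = sorted(predict_reference(text))
        exp = sorted(predict_exported(text))
        same_spans = [r[:3] for r in ref] == [e[:3] for e in exp]
        same_scores = same_spans and all(
            math.isclose(r[3], e[3], rel_tol=0.0, abs_tol=score_tol)
            for r, e in zip(ref, exp)
        )
        if not same_scores:
            # Abort the release rather than publish a diverging export.
            raise RuntimeError(f"Export diverges from reference on: {text!r}")
    return True
```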
The full training and publishing pipeline (config-driven runs, reproducible evaluation, provenance capture, the publish gate) is what lets us release a family of models with confidence rather than one hand-tuned artifact.
How they fit into PhEye and the Philterd stack
A model on Hugging Face is just weights. PhEye is what turns it into a service.
PhEye is the model server for the Philterd toolkit. It loads one or more models as lenses, exposes a simple HTTP endpoint, and runs entirely inside your own infrastructure: no third-party API, no model-provider account, no text leaving your VPC. You POST text to /find and get back entities with confidence scores:
$ curl http://localhost:5000/find \
--data "Please forward the invoice to Maria Gonzalez."
[
{ "type": "PER", "text": "Maria Gonzalez", "confidence": 0.98 }
]
That confidence score is where the recall-leaning design pays off. Because the model errs toward flagging names, you tune the precision/recall balance at request time by filtering on confidence: accept everything at or above your cutoff, drop the rest, and set policy per entity type. You’re not stuck with whatever the model decided; you decide.
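The same request and a per-type cutoff can be sketched in Python with only the standard library. The endpoint and response shape follow the curl example above; the helper names, the default cutoff, and the threshold values are assumptions for illustration.

```python
import json
import urllib.request


def find_entities(text, url="http://localhost:5000/find"):
    """POST raw text to a PhEye /find endpoint and return the parsed JSON list.

    Mirrors the curl call: urllib sends the same default form content type
    that `curl --data` does. Adjust headers if your deployment expects others.
    """
    request = urllib.request.Request(url, data=text.encode("utf-8"), method="POST")
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def apply_thresholds(entities, thresholds, default=0.5):
    """Keep entities whose confidence meets the cutoff for their type.

    `thresholds` maps entity type to a minimum confidence (hypothetical
    policy); types not listed fall back to `default`.
    """
    return [
        entity for entity in entities
        if entity["confidence"] >= thresholds.get(entity["type"], default)
    ]
```

With a running server, `apply_thresholds(find_entities(text), {"PER": 0.70})` keeps only the name detections at or above 0.70.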
In a full deployment, Philter (or Phileas, or any HTTP client) calls PhEye to get the name detections, and combines them with its own regex, dictionary, and checksum rules for the structured PII types. The model finds the names; the deterministic layer catches the emails, SSNs, and card numbers; Philter merges everything and applies your redaction policy. The caller never has to know which models are loaded: it just gets entities and confidence scores back. These three models drop into that flow as the English name lens, in whichever size fits your latency and footprint budget: the small model on a CPU at the edge, the large model on a GPU server, the same recipe behind both.
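The merge step can be sketched as follows. This is not Philter's implementation: the `(start, end, label)` layout, the email regex, and the `[LABEL]` replacement format are assumptions. Overlapping detections are merged rather than dropped, so no flagged character survives redaction.

```python
import re

# Hypothetical pattern standing in for the deterministic layer.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def email_detections(text):
    """Return (start, end, 'email') for each email-like match."""
    return [(m.start(), m.end(), "email") for m in EMAIL_PATTERN.finditer(text)]


def redact(text, detections):
    """Replace every detected span with a [LABEL] placeholder.

    Overlapping spans are merged under the label of the leftmost one.
    """
    merged = []
    for start, end, label in sorted(detections):
        if merged and start < merged[-1][1]:
            prev_start, prev_end, prev_label = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), prev_label)
        else:
            merged.append((start, end, label))
    # Replace right to left so earlier offsets stay valid.
    for start, end, label in reversed(merged):
        text = text[:start] + f"[{label.upper()}]" + text[end:]
    return text
```

In use, the name spans would come from the model server and the rest from pattern detectors such as `email_detections`, with `redact` applied to the combined list.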
License, attribution, and getting started
All three models are released under CC-BY-4.0, inherited from the training data. The Nemotron-PII dataset is CC-BY-4.0 and permits commercial use with attribution to NVIDIA, so any use of these models must retain that attribution. The GLiNER -v2.1 bases are Apache-2.0.
Grab them on Hugging Face:
philterd/ph-eye-pii-en-small: low-latency, on-device
philterd/ph-eye-pii-en-medium: the recommended default
philterd/ph-eye-pii-en-large: highest capacity, server-side
To run one as a service, point PhEye at it. To see where it fits in an end-to-end redaction pipeline, start with Philter.
Finally, these models make no guarantee of accuracy or completeness. Name detection is probabilistic: they will miss some names and flag some non-names. You remain the data controller for any personal data you process. Measure them on your own text, set your thresholds accordingly, and treat them as one layer of a defense-in-depth privacy strategy, not a silver bullet.
Need a model that isn’t here, such as another language, a domain lens, or a custom entity type? Talk to the team.