Redaction for Education: FERPA, Student Records, and Research Data Pipelines

Education is the vertical that almost no one talks about when discussing regulated data, and yet it’s one of the largest categories of PII in the country. Every K-12 district, every university registrar, every learning-management-system vendor, every edtech startup, every educational researcher works under FERPA, the Family Educational Rights and Privacy Act of 1974. The regulatory framework is real, the penalties for institutions are severe (loss of federal funding), and the tooling situation is genuinely thin compared to healthcare or finance.

This post is for the people who actually have to operate FERPA-compliant pipelines: university IT departments, edtech platform engineers, institutional-research staff, and the LMS vendors that serve them. The architecture story is similar to HIPAA’s; the specifics differ in ways that matter.

What FERPA actually requires

FERPA (20 U.S.C. § 1232g; 34 CFR Part 99) governs education records, specifically records “directly related to a student” maintained by an educational institution or agency receiving federal funding. The core requirements:

No disclosure of personally identifiable information from education records without written consent from the eligible student (or parent, if the student is under 18 and not in postsecondary education).
Exceptions for school officials with legitimate educational interest, other schools the student is transferring to, specified educational research, court orders, health and safety emergencies, and a handful of others.
Right to inspect and review education records (the parent / eligible student side).
“Directory information” (name, address, phone, dates of attendance, degrees received, etc.) can be disclosed without consent unless the student / parent has opted out. Each institution defines its own directory list.

FERPA’s definition of “personally identifiable information” (34 CFR 99.3) is unusually broad: it includes direct identifiers (name, address, SSN, student ID) plus indirect identifiers (date and place of birth, mother’s maiden name) plus any other information that, alone or in combination, would allow a reasonable person in the school community to identify the student with reasonable certainty. That last phrase is the broadest “quasi-identifier” formulation in U.S. privacy law, broader than HIPAA’s specific 18 categories or GLBA’s NPPI list.

The practical identifier set

Distilled to a working identifier list for an educational redaction policy:

Identifier	Context	Philterd handling
Student names	PII unless designated directory information and the student has not opted out	NER + name dictionary; replace with consistent pseudonym for research datasets
Student ID number	Always PII	Custom identifier filter, institution-specific format
SSN	Always PII (historic use as student ID)	Built-in SSN detector
Date of birth	PII; date-shift for research datasets	Date filter with year-only or per-student offset strategies
Home address	PII (unless explicitly directory)	NER + address detector
Email / phone	PII (unless explicitly directory)	Built-in detectors
Parent / guardian names	PII	NER; same pseudonymization strategy as student names
Grades, GPAs, transcripts	Education records (always covered)	Custom identifier filter for grade patterns; structured handling in pipelines
Disciplinary records, IEP / 504 plans	Education records (high sensitivity)	Document-level handling (whole-document classification); not just entity extraction
Free-text teacher notes, counselor notes	Education records (PII-dense unstructured text)	Full NER pass with custom dictionaries for institution-specific vocabulary
Class section + small population identifiers	Quasi-identifier under 34 CFR 99.3	Custom dictionary; population-threshold validation similar to ZIP-code handling

The last row is the FERPA-specific challenge that doesn’t show up in HIPAA or GLBA: aggregations that would normally be safe under other frameworks can become re-identifying in small populations. “Math 401, Fall 2026, grade B” is identifying if only three students were in Math 401 that semester. Handling this requires the redaction system to be aware of cohort sizes when making release decisions. This is a domain-specific configuration rather than an out-of-the-box detector.
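A minimal sketch of cohort-aware suppression, written in plain Python rather than as a Philter feature. The threshold of 10, the function name, and the row layout are assumptions for illustration; the actual minimum cohort size is an institutional policy decision.

```python
from collections import Counter

# Assumed threshold; the real value comes from institutional policy.
MIN_COHORT_SIZE = 10

def suppress_small_cohorts(rows, quasi_keys, min_size=MIN_COHORT_SIZE):
    """Drop rows whose quasi-identifier combination occurs fewer than min_size times."""
    def cohort(row):
        return tuple(row[k] for k in quasi_keys)

    counts = Counter(cohort(r) for r in rows)
    kept = [r for r in rows if counts[cohort(r)] >= min_size]
    return kept, len(rows) - len(kept)

# Three students in MATH 401, twelve in MATH 101, same term.
rows = (
    [{"course": "MATH 401", "term": "Fall 2026", "grade": "B"}] * 3
    + [{"course": "MATH 101", "term": "Fall 2026", "grade": "B"}] * 12
)
kept, suppressed = suppress_small_cohorts(rows, quasi_keys=("course", "term"))
```

Here the three MATH 401 rows are dropped and the twelve MATH 101 rows survive. A production version would also need to consider whether suppressed counts can be recovered from published totals.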

Architecture 1: institutional research and analytics

University offices of institutional research, K-12 district analytics teams, and edtech platforms with multi-tenant analytics offerings all face the same problem: derive useful insights from student data without exposing PII to analysts who don’t have a legitimate educational interest in specific individuals.

  Student records ──▶ raw SIS / LMS extract (FERPA-controlled access)
                                  │
                                  ▼
                       Philter (FERPA policy)
                                  │
                                  ▼
                       De-identified analytics zone
                                  │
                  ┌───────────────┼────────────────┐
                  ▼               ▼                ▼
            Cohort analytics   Outcome studies   Dashboards

The non-obvious part for FERPA: the de-identified zone has to satisfy the “reasonable certainty” test, not just “the names are removed.” Per-student pseudonyms must be consistent across tables for joins to work, but the pseudonym function has to be one-way (the analytics team can’t reverse-engineer back to identities). Date shifts have to preserve event-interval semantics. Small-cohort suppression has to fire automatically.

This is more demanding than naive PII redaction. Phileas’s policy engine supports the consistent-pseudonym pattern out of the box; the small-cohort suppression typically needs a custom layer that runs after entity-level redaction.
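To make the pseudonym and date-shift properties concrete, here is a tool-independent sketch using a keyed hash (HMAC-SHA256) from the Python standard library. This is not Phileas’s implementation; the key handling, the `stu_` prefix, and the 180-day shift window are all assumptions.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical key; in practice it is held outside the analytics zone.
SECRET_KEY = b"replace-with-a-key-held-outside-the-analytics-zone"

def pseudonym(student_id, key=SECRET_KEY):
    """Same student ID always maps to the same pseudonym; not reversible without the key."""
    digest = hmac.new(key, student_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return "stu_" + digest[:16]

def date_offset_days(student_id, key=SECRET_KEY, max_days=180):
    """Derive a stable per-student offset in the range [-max_days, max_days]."""
    msg = b"date-shift:" + student_id.encode("utf-8")
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % (2 * max_days + 1) - max_days

def shift_date(student_id, d, key=SECRET_KEY):
    """Shift every date for one student by the same offset, preserving intervals."""
    return d + timedelta(days=date_offset_days(student_id, key))

enrolled = date(2026, 8, 24)
withdrew = date(2026, 10, 5)
shifted_enrolled = shift_date("S1234567", enrolled)
shifted_withdrew = shift_date("S1234567", withdrew)
```

Because both dates for a student move by the same offset, the gap between enrollment and withdrawal is unchanged, and the pseudonym joins cleanly across tables without exposing the underlying ID.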

Architecture 2: multi-institution research collaborations

Education research is a heavily collaborative field. A single longitudinal study may pull from multiple districts’ SIS data, plus standardized-test results, plus survey responses, plus LMS-engagement logs. Each source comes with its own FERPA consent framework (or 34 CFR 99.31 educational-research exception), and the resulting dataset has to be de-identified before it can be shared with collaborators outside the contributing institutions.

The architectural pattern is similar to clinical research pipelines under HIPAA: per-source ingestion with source-specific redaction policies, a unified de-identified zone with consistent pseudonyms across sources, and a documented release process. The precision/recall measurement story is the same too. The IRB packet needs metrics specific to the dataset, not vendor-quoted averages.
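Dataset-specific metrics can be computed from a hand-annotated sample. The sketch below assumes annotations are `(start, end, type)` tuples and counts only exact matches; the span format and the exact-match rule are assumptions, and a real evaluation may want partial-overlap scoring as well.

```python
def precision_recall(gold_spans, predicted_spans):
    """Exact-match precision and recall over (start, end, type) spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

# Hypothetical annotations for one document from the study's own data.
gold = {(0, 10, "NAME"), (25, 33, "STUDENT_ID"), (50, 60, "DATE")}
pred = {(0, 10, "NAME"), (25, 33, "STUDENT_ID"), (70, 75, "NAME")}
p, r = precision_recall(gold, pred)
```

In this example the detector found two of three gold spans and raised one false positive, so both precision and recall are two thirds. For de-identification, recall is usually the number the IRB cares about most, since a missed identifier is a disclosure.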

Architecture 3: LMS / edtech platform multi-tenancy

LMS vendors (Canvas, Blackboard, Brightspace) and edtech platforms serve hundreds or thousands of institutions. Each institution’s data has to stay isolated; PII can’t leak across tenants even by accident. For platforms that use shared backend services for analytics, ML training (e.g., learning-outcome prediction models), or product telemetry, every crossing of the tenant boundary is a point where redaction has to happen.

The pattern: PII redaction at the cross-tenant aggregation boundary. Per-tenant raw data stays in tenant-isolated storage; aggregated cross-tenant analytics use redacted versions. Any ML training that draws from multiple tenants pulls from the redacted side. Platform telemetry that flows into ops dashboards passes through Philter on the way out.
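The boundary can be sketched as a single ingestion path that writes raw text to tenant-isolated storage and only redacted text to anything shared. The class, the in-memory stores, and the email-only `stand_in_redactor` are hypothetical placeholders; in a real deployment the redactor would be a call into the redaction service with the tenant’s policy.

```python
import re

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def stand_in_redactor(text):
    """Placeholder for the real redaction call; masks email addresses only."""
    return EMAIL.sub("{{EMAIL}}", text)

class TenantBoundary:
    """Raw text stays per tenant; the shared side only ever sees redacted text."""

    def __init__(self, redact):
        self._redact = redact
        self.tenant_stores = {}   # tenant_id -> list of raw records
        self.shared_store = []    # cross-tenant analytics / ML training input

    def ingest(self, tenant_id, text):
        self.tenant_stores.setdefault(tenant_id, []).append(text)
        self.shared_store.append(self._redact(text))

boundary = TenantBoundary(stand_in_redactor)
boundary.ingest("district-a", "Contact jane@example.edu about the grade appeal")
```

The design point is that nothing downstream of `shared_store` has a code path to the raw record, so a new analytics job or training run cannot bypass redaction by accident.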

FERPA-specific gotchas

Four things that bite educational deployments specifically:

“Directory information” definitions vary. Each institution sets its own. A student who opted out of directory disclosure at one institution shouldn’t have their data leak as “directory” at another. Track the opt-out status as data, not as institutional policy.
Annual privacy notices. FERPA requires annual notification to parents / eligible students. The notice has to accurately describe what’s being collected and disclosed, which means your redaction policy and your privacy notice need to stay in sync. Version-control the policy alongside the notice.
Minor / adult transition at 18. FERPA rights transfer from parent to student when the student turns 18 or enrolls in a postsecondary institution at any age. Pipelines that hard-code parent-as-controller assumptions break at scale.
School official with legitimate educational interest. The most-used FERPA exception. The institution defines who qualifies and what “legitimate” means. Audit logs need to capture not just who accessed but under what justification. This is not technically a Philter problem, but the surrounding logging infrastructure has to support it.
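The first and third gotchas both come down to carrying privacy state with the student record instead of baking it into pipeline logic. A rough sketch, with hypothetical field and function names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StudentPrivacyState:
    age: int
    in_postsecondary: bool
    directory_opt_out: bool

def rights_holder(state):
    """Rights sit with the student at 18 or on postsecondary enrollment, else the parent."""
    if state.age >= 18 or state.in_postsecondary:
        return "student"
    return "parent"

def may_disclose_as_directory(field, institution_directory_fields, state):
    """A field is releasable as directory info only if this institution lists it
    and this student has not opted out."""
    return field in institution_directory_fields and not state.directory_opt_out

# Hypothetical institution-specific directory list.
directory_fields = {"name", "dates_of_attendance", "degrees_received"}
dual_enrolled = StudentPrivacyState(age=16, in_postsecondary=True, directory_opt_out=True)
```

For the dual-enrolled 16-year-old above, the student is the rights holder, and the opt-out blocks directory release of the name even though the institution lists it.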

The audit story

Department of Education enforcement is rare but consequential (loss of federal funding for the institution). When it does happen, the artifact set looks familiar:

Discovery inventory of where student PII lives across systems, refreshed on a schedule (Phinder).
Policy regression validation against a gold-standard set of representative records (Philter Scope in CI).
Per-disclosure logging showing what was released to whom under which exception.
Live monitoring of detection patterns to catch upstream format changes (Phield).
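For the per-disclosure log, one workable shape is an append-only JSON line per release. The schema and the basis labels below are assumptions for illustration, not an official FERPA taxonomy or a Philterd format.

```python
import json
from datetime import datetime, timezone

# Assumed labels for the disclosure bases discussed in this post.
DISCLOSURE_BASES = {
    "written_consent", "school_official", "transfer",
    "research", "court_order", "health_safety",
}

def disclosure_entry(recipient, basis, justification, dataset, policy_version, now=None):
    """Build one JSON log line recording what was released, to whom, and why."""
    if basis not in DISCLOSURE_BASES:
        raise ValueError(f"unknown disclosure basis: {basis}")
    if not justification.strip():
        raise ValueError("justification is required")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return json.dumps({
        "timestamp": timestamp,
        "recipient": recipient,
        "basis": basis,
        "justification": justification,
        "dataset": dataset,
        "policy_version": policy_version,
    }, sort_keys=True)

line = disclosure_entry(
    recipient="ir-analyst-042",
    basis="school_official",
    justification="Retention study approved by the registrar",
    dataset="fall-2026-cohort-extract",
    policy_version="ferpa-policy-v12",
)
```

Recording the policy version alongside the basis ties each release back to the redaction policy that was in force at the time, which is what an auditor will ask for.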

The bottom line

FERPA is one of the most-used and least-discussed regulatory frameworks in the U.S. Universities, K-12 districts, edtech vendors, and educational researchers all operate under it; the tooling story is genuinely thinner than what’s available for healthcare or finance. The architectural pattern that works is the same one that works for those better-served verticals: redact at the boundary between raw and analytic zones, maintain consistent pseudonyms for joinability, suppress small cohorts, audit per-disclosure.

The Philterd toolkit covers the textual scope of FERPA with the same open source, self-hosted, policy-driven engine we use for HIPAA and GLBA workloads. The institution-specific configuration (student-ID formats, course-numbering conventions, directory-info opt-out tracking) is where the work concentrates; the engine and architecture are the same.

If you’re standing up a FERPA-compliant pipeline for an institution, an LMS platform, or an education research project and want help designing the redaction layer, let’s talk. The framework is solvable; most education-vertical deployments fail by trying to retrofit consumer-grade tools onto requirements that need purpose-built ones.