The Hidden Difficulties of Redacting PDF Documents
"Just black it out" is one of the most dangerous sentences in document handling. A user opens a PDF in Acrobat, drops a few black rectangles over the sensitive bits, saves the file, and ships it. The visible result looks redacted. The actual file contains every original character of the supposedly-hidden text, sitting under the rectangle, fully recoverable in under five seconds with a copy-paste.
This isn't a hypothetical. It's the source of some of the most famous PII leaks in the last 20 years. PDF redaction is genuinely hard — harder than redaction of plain text, harder than redaction of Word documents, harder than most engineers realize the first time they try to do it.
This post walks through why PDFs are uniquely difficult, the famous cases where major institutions got it wrong, and what a real PDF redaction pipeline has to do to actually be safe.
Why PDFs are uniquely hard to redact
The PDF format is, by design, a layered representation of a document. What's visible on the rendered page is the output of those layers, not the underlying content. Several distinct categories of "hidden" content can carry PII that a naive redaction completely misses.
1. The text layer is independent of the visual layer
When you draw a black rectangle on a PDF using Acrobat's annotation tool, you're not removing text — you're adding an opaque shape on top of it. The rectangle looks like redaction, but the underlying text stream is unchanged. Anyone can:
- Open the file in any PDF viewer with text selection (almost all of them)
- Click-drag across the "redacted" rectangle
- Copy
- Paste the original text into any other application
This is the failure mode behind almost every famous PDF-redaction mishap. The text is right there; it's just covered by a rectangle.
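To make this concrete, here is a minimal Python sketch. It assumes a hypothetical, uncompressed page content stream in which a filled black rectangle is painted after the text; real streams are usually compressed and need a proper PDF parser, and a viewer annotation would live in a separate object, but the text operator is equally untouched in both cases. The name `shown_strings` is made up for illustration.

```python
import re

# Hypothetical uncompressed content stream: the page shows text first,
# then paints a filled black rectangle over the same area.
CONTENT_STREAM = (
    b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET\n"  # text object
    b"0 0 0 rg 70 695 150 16 re f\n"                      # black rectangle
)

# A PDF literal string followed by the Tj (show text) operator.
SHOW_TEXT = re.compile(rb"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def shown_strings(stream: bytes) -> list[str]:
    """Return every literal string passed to Tj in an uncompressed stream."""
    return [raw.decode("latin-1") for raw in SHOW_TEXT.findall(stream)]


print(shown_strings(CONTENT_STREAM))  # ['SSN: 123-45-6789']
```

The rectangle changes what is painted on the page, not what the text operator carries, which is why copy-paste still works.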
2. Invisible OCR text layers
When a scanned image is run through OCR (optical character recognition), the resulting PDF typically contains both the image (what you see) and an invisible text layer aligned with it (what your computer can search and copy). The text layer is intentionally invisible; it never shows up during normal reading.
If you "redact" the visible image with a rectangle, the OCR text layer underneath remains untouched. The page looks redacted and is still fully searchable and copyable.
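One common way OCR tools hide that layer is text rendering mode 3 (`3 Tr` in the content stream), which draws text with neither fill nor stroke. The sketch below, in Python, scans a hypothetical uncompressed stream for strings shown while that mode is active. It is a simplification: it only handles the `Tj` operator and ignores graphics-state save/restore, and the function name is illustrative.

```python
import re

# Tokenizer for an uncompressed content stream: literal strings or bare tokens.
TOKEN = re.compile(rb"\((?:[^()\\]|\\.)*\)|[^\s()]+")


def invisible_strings(stream: bytes) -> list[str]:
    """Return strings shown with Tj while text rendering mode 3 is active."""
    mode = 0    # 0 (fill, visible) is the default rendering mode
    prev = b""  # previous token, i.e. the operand of the next operator
    hidden = []
    for tok in TOKEN.findall(stream):
        if tok == b"Tr":
            mode = int(prev)
        elif tok == b"Tj" and mode == 3:
            hidden.append(prev[1:-1].decode("latin-1"))
        prev = tok
    return hidden


# Hypothetical scanned page: an image, then invisible OCR text over it.
SCANNED_PAGE = (
    b"q 612 0 0 792 0 0 cm /Im0 Do Q\n"
    b"BT 3 Tr /F1 10 Tf 72 700 Td (Patient: Jane Doe) Tj ET\n"
)

print(invisible_strings(SCANNED_PAGE))  # ['Patient: Jane Doe']
```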
3. Document metadata
Every PDF carries metadata: author name, organization, software used to create it, creation and modification timestamps, sometimes the original filename and full file path. Revision data from the authoring tool can follow the document too: depending on the export path, Microsoft Word's "Track Changes" history, if left enabled at the source, may end up in the PDF as embedded XML.
None of this is visible on the rendered page. None of it is touched by visual redaction. All of it can identify the document's author and origin in ways the redaction was supposed to prevent.
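As a rough illustration, the Python sketch below pulls the standard document information entries out of raw PDF bytes. It assumes an uncompressed information dictionary with plain literal strings; hex or UTF-16 strings, XMP metadata streams, and compressed object streams all need a real PDF parser. The sample bytes and the function name are hypothetical.

```python
import re

# Standard keys of the PDF document information dictionary.
INFO_KEYS = (b"Title", b"Author", b"Subject", b"Keywords",
             b"Creator", b"Producer", b"CreationDate", b"ModDate")


def info_entries(pdf_bytes: bytes) -> dict[str, str]:
    """Return info-dictionary entries that are stored as literal strings."""
    found = {}
    for key in INFO_KEYS:
        m = re.search(rb"/" + key + rb"\s*\(((?:[^()\\]|\\.)*)\)", pdf_bytes)
        if m:
            found[key.decode()] = m.group(1).decode("latin-1")
    return found


# Hypothetical fragment of a PDF file.
SAMPLE_INFO = (
    b"5 0 obj\n"
    b"<< /Author (J. Smith) /Creator (Microsoft Word)\n"
    b"   /CreationDate (D:20190108093000-05'00') >>\n"
    b"endobj\n"
)

print(info_entries(SAMPLE_INFO))
```

None of these values appear on the page, so a check like this belongs in the pipeline rather than in a visual review.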
4. Form fields and JavaScript
PDFs can include interactive form fields (text inputs, dropdowns, checkboxes) whose default values are stored separately from their rendered appearance. They can also include embedded JavaScript that may contain string literals with names, account numbers, or other identifiers used for form validation.
5. Layers and Optional Content Groups
The PDF spec supports layers (formally called Optional Content Groups). A document can ship with multiple layers, some of which are not rendered by default but can be enabled by an end user via the layers panel in Acrobat. A "hidden" layer can contain the un-redacted version of content visible on a different layer.
6. Embedded files and attachments
PDFs can carry arbitrary file attachments — spreadsheets, other PDFs, images, source documents. Attachments are stored as raw bytes inside the PDF and have no relationship to the visible page content. A reviewer who redacts the visible text and ships the file is also shipping every attachment, which may contain the very PII the redaction was meant to remove.
7. PDF Portfolios (nested PDFs)
A PDF Portfolio is a single PDF container holding multiple independent PDFs. The outer wrapper can be heavily redacted while the inner files are untouched. Anyone who opens the portfolio sees the inner documents directly.
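Form fields, JavaScript, layers, attachments, and portfolios all announce themselves through well-known names in the PDF's object structure. A cheap first-pass triage, sketched below in Python, is to look for those names in the raw bytes. This is an assumption-laden shortcut, not a parser: names inside compressed object streams or written with `#xx` escapes will be missed, so a hit means "inspect this", while a miss proves nothing. The function name and sample bytes are illustrative.

```python
# PDF names that signal content a visual redaction never touches.
MARKERS = {
    b"/AcroForm": "interactive form fields",
    b"/JavaScript": "embedded JavaScript",
    b"/OCProperties": "optional content groups (layers)",
    b"/EmbeddedFiles": "file attachments",
    b"/Collection": "PDF Portfolio",
}


def triage(pdf_bytes: bytes) -> list[str]:
    """List the hidden-content categories whose marker appears in the bytes."""
    return [label for name, label in MARKERS.items() if name in pdf_bytes]


# Hypothetical document catalog with attachments and layers.
SAMPLE_CATALOG = (
    b"<< /Type /Catalog /Names << /EmbeddedFiles 7 0 R >> "
    b"/OCProperties 9 0 R >>"
)

print(triage(SAMPLE_CATALOG))
# ['optional content groups (layers)', 'file attachments']
```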
8. Embedded fonts can leak source text
Subsetted embedded fonts include only the glyphs needed to render the document's text. If a word is redacted but a glyph unique to that word ("Á" in a document otherwise entirely English, for example) remains in the font subset, an attacker can sometimes infer something about the redacted content from the font composition alone.
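The inference is essentially a set difference. The Python sketch below assumes you already have the characters a font subset can draw and the text that remains in the file (extracting both requires a font-aware PDF library, which is not shown); the names and sample values are hypothetical.

```python
def unexplained_glyphs(subset_chars: set[str], remaining_text: str) -> set[str]:
    """Characters the font subset can draw that no remaining text uses."""
    return subset_chars - set(remaining_text)


# The subset was built for the original line; the name was later removed
# from the text, but the font was never re-subsetted.
subset = set("Patient: Álvarez")
leftover = unexplained_glyphs(subset, "Patient: ")

print(sorted(leftover))  # ['l', 'r', 'v', 'z', 'Á']
```

A handful of leftover glyphs will not reconstruct the text, but it can narrow the candidates considerably, which is why re-subsetting or replacing fonts is worth considering during reconstruction.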
Famous failures
The danger of cosmetic-only PDF redaction isn't theoretical. Major institutions have learned this the hard way, repeatedly, for at least two decades.
The Manafort filing (2019)
In January 2019, lawyers for Paul Manafort filed a heavily redacted court document responding to allegations from special counsel Robert Mueller's office. The redactions covered the text with black boxes instead of removing it. Reporters at multiple outlets discovered within hours that the redacted text could be revealed by copying it from the PDF and pasting it elsewhere. The "redacted" passages revealed that Manafort had shared internal polling data with a business associate the FBI assessed had ties to Russian intelligence. It was one of the most consequential factual disclosures of the entire Mueller investigation, and it leaked through a redaction tooling failure.
The TSA standard operating procedures leak (2009)
In 2009, the Transportation Security Administration published a heavily redacted version of its Screening Management Standard Operating Procedures online. The redactions, again, were black rectangles over text that remained in the document. Civil-liberties researchers and security writers downloaded the PDF, recovered the text beneath the rectangles, and published the original in full. The leaked passages revealed detailed screening procedures, exemption rules, and identification standards used at U.S. airports.
Government FOIA responses, repeatedly
Throughout the 2000s and 2010s, multiple federal agencies released FOIA documents with rectangle-only redaction. The list of incidents is long enough that civil-liberties groups built tooling specifically to detect and de-redact these failures, a category of tooling that wouldn't exist if the underlying problem weren't pervasive.
The common thread
Every one of these failures has the same root cause: the operator believed that drawing a black rectangle in a PDF viewer constituted redaction. It does not. It constitutes adding a rectangle on top of unchanged text.
What real PDF redaction has to do
A redaction pipeline that actually removes content from a PDF has to operate at the structural level, not the visual level. At minimum, that means:
- Parse the PDF structure. Identify the text objects, the OCR text layer, form fields, embedded JavaScript, attachments, and any Optional Content Groups.
- Extract every text source. Pull the actual characters out of every layer that contributes to the document's content — visible and invisible.
- Run each text source through PII detection. Apply the same redaction policy you'd use on plain text.
- Reconstruct the PDF with redacted content. Replace the original text objects with the redacted versions; remove invisible OCR layers that contained PII; strip metadata; remove attachments (or recursively redact them); flatten layers.
- Re-render the visible content. Generate visual indicators of redaction (replacement text, blocks, or whatever the policy specifies) so the rendered output matches what's actually in the file.
This is meaningfully more work than "draw a rectangle." It also requires a PDF parser capable of safely round-tripping the document — many "PDF redaction" tools mangle the file in unpredictable ways and produce output that's only superficially valid PDF.
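The middle of that pipeline (every extracted text source goes through the same policy, and some sources are dropped outright) can be sketched in a few lines of Python. Parsing and reconstruction are deliberately left out because they need a real PDF library. Everything here is hypothetical: the `TextSource` type, the `DROP_KINDS` policy, and the single SSN regex standing in for a full PII detector.

```python
import re
from dataclasses import dataclass


@dataclass
class TextSource:
    kind: str  # e.g. "page_text", "ocr_layer", "form_field", "metadata"
    text: str


# Stand-in detector: a real pipeline would apply its full PII policy here.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Sources this illustrative policy removes entirely instead of rewriting.
DROP_KINDS = {"metadata", "attachment"}


def redact_text(text: str) -> str:
    return SSN.sub("[SSN]", text)


def redact_sources(sources: list[TextSource]) -> list[TextSource]:
    """Apply one policy to every text source, visible or not."""
    kept = []
    for src in sources:
        if src.kind in DROP_KINDS:
            continue  # stripped, never copied into the output
        kept.append(TextSource(src.kind, redact_text(src.text)))
    return kept


sources = [
    TextSource("page_text", "SSN 123-45-6789"),
    TextSource("ocr_layer", "123-45-6789"),
    TextSource("metadata", "Author: J. Smith"),
]
print(redact_sources(sources))
```

The point of the shape is that no source gets a pass by default: each kind is either rewritten or explicitly dropped.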
Philter's approach to PDF redaction
Philter handles PDFs at the structural level. When a PDF is sent to Philter's API, the engine:
- Parses the PDF into its constituent text objects, image objects, form fields, and metadata.
- Extracts the text layer (including any OCR text layer added by image-based PDFs).
- Runs every text source through the same Phileas detection and redaction logic used for plain-text workloads — with the same per-entity policy strategies (mask, encrypt, replace, drop, etc.).
- Reconstructs the PDF with the redacted text actually replacing the original characters in the text stream — not overlaid on top of them.
- Returns the redacted PDF as the response body.
The result is a PDF where copying and pasting "redacted" content yields the redacted version, not the original. Because Philter is policy-driven, the same JSON policy file that controls plain-text redaction also controls PDF redaction — one configuration covers both.
A curl example
The simplest PDF redaction call sends a binary PDF and receives a redacted binary PDF back:
$ curl -X POST "http://localhost:8080/api/filter?p=default" \
-H "Content-type: application/pdf" \
-H "Accept: application/pdf" \
--data-binary @input.pdf \
--output redacted.pdf
# redacted.pdf now contains the same document with detected
# PII replaced according to the "default" policy.
Apply a specific policy by name (for example, your HIPAA Safe Harbor policy):
$ curl -X POST "http://localhost:8080/api/filter?p=hipaa-safe-harbor" \
-H "Content-type: application/pdf" \
-H "Accept: application/pdf" \
--data-binary @patient-chart.pdf \
--output patient-chart-redacted.pdf
For pipelines that need both the redacted PDF and the structured detection report (useful for audit logging), Philter returns the detected entities as JSON via a parallel /api/find endpoint:
$ curl -X POST "http://localhost:8080/api/find?p=hipaa-safe-harbor" \
-H "Content-type: application/pdf" \
--data-binary @patient-chart.pdf
[
{ "type": "SSN", "text": "123-45-6789", "page": 1, "confidence": 0.99 },
{ "type": "PER", "text": "Jane Doe", "page": 1, "confidence": 0.94 },
{ "type": "DATE", "text": "2026-03-14", "page": 2, "confidence": 0.91 }
]
The audit report goes to your SIEM or compliance log; the redacted PDF goes to the downstream consumer.
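For pipelines that call Philter from code rather than the shell, the first curl call above might translate to Python roughly as follows. This is a sketch using only the standard library. The helper names `build_filter_request` and `redact_pdf` are made up for illustration; the URL, the `p` parameter, and the headers are taken directly from the curl examples.

```python
import urllib.parse
import urllib.request


def build_filter_request(pdf_bytes: bytes, policy: str = "default",
                         base_url: str = "http://localhost:8080",
                         ) -> urllib.request.Request:
    """Build the POST request that the curl example sends to /api/filter."""
    query = urllib.parse.urlencode({"p": policy})
    return urllib.request.Request(
        f"{base_url}/api/filter?{query}",
        data=pdf_bytes,
        headers={"Content-Type": "application/pdf",
                 "Accept": "application/pdf"},
        method="POST",
    )


def redact_pdf(pdf_bytes: bytes, policy: str = "default") -> bytes:
    """Send the PDF to a running Philter instance and return the response body."""
    request = build_filter_request(pdf_bytes, policy)
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read()
```

Calling `redact_pdf(open("input.pdf", "rb").read())` would then return the redacted PDF bytes, assuming a Philter instance is listening on localhost:8080. Keeping request construction separate from sending makes the call easy to unit-test without a server.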
What Philter can't do automatically (yet)
A few categories of PDF content still require manual handling or upstream pipeline work:
- Image-only PDFs (no text layer). If the PDF is a scanned image with no OCR text layer, Philter has no text to redact. Run the document through OCR (Tesseract, AWS Textract, Google Document AI) first, then send the resulting text-layer PDF to Philter.
- Embedded files / attachments. Philter focuses on the document body. If your PDFs carry attached files with their own PII, your pipeline should extract attachments, redact each independently, and reattach the redacted versions — or strip them entirely depending on your retention policy.
- PDF Portfolios. The outer container needs to be unpacked, each inner PDF redacted independently, then optionally repackaged.
- Document metadata stripping. Configurable in the policy; check that author, organization, and edit-history metadata are removed for your use case.
- Faces and other image content. Visual content inside the page images isn't text and isn't in Philter's scope. For workflows that need to redact identifying images (faces in scanned IDs, signatures in scanned documents), pair Philter with a separate image-redaction step upstream.
Operational habits that prevent the failures
Three habits dramatically reduce the chance of shipping a PDF with broken redaction:
- Never use a PDF viewer's rectangle annotation as a redaction tool. Even if a particular reader has a real "redact" feature, that feature has the wrong defaults often enough that automated pipelines beat human tools for any volume above a handful of documents.
- Run the redacted output through verification. Extract text from the redacted PDF, search for any of the original sensitive values, fail loudly if any are found. This is the same idea as Philter Scope's measurement loop applied to PDF output specifically.
- Treat every attachment, every layer, every metadata field as a potential leak. The pipeline should explicitly decide how each is handled, not silently inherit the source document's choices.
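The verification habit is simple enough to sketch. The Python below assumes the text has already been extracted from the redacted PDF by whatever extractor your pipeline uses, and that you hold a list of the original sensitive values. The names `verify_redaction` and `RedactionLeak` are hypothetical.

```python
import re


class RedactionLeak(Exception):
    """Raised when a known sensitive value survives in the redacted output."""


def _normalize(value: str) -> str:
    # Drop whitespace and punctuation so "123 45 6789" matches "123-45-6789".
    return re.sub(r"[\W_]+", "", value).lower()


def verify_redaction(extracted_text: str, sensitive_values: list[str]) -> None:
    """Fail loudly if any sensitive value is still present in the output text."""
    haystack = _normalize(extracted_text)
    leaked = [v for v in sensitive_values
              if _normalize(v) and _normalize(v) in haystack]
    if leaked:
        # Report a count only, so the error itself never carries the PII.
        raise RedactionLeak(
            f"{len(leaked)} sensitive value(s) still present in output")


verify_redaction("Patient [NAME], SSN [SSN]", ["Jane Doe", "123-45-6789"])
```

Normalizing before comparison catches values that were reflowed or re-spaced during extraction, at the cost of occasional false positives on very short values; for a gate that blocks release, erring toward false positives is usually the safer trade.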
The bottom line
PDFs are not simple containers of visible text. They are layered, multi-stream, attachment-bearing, metadata-rich documents that leak in ways most operators don't anticipate. "Blacking out" content in a PDF viewer is behind many of the best-known accidental disclosures in modern document handling.
Real PDF redaction requires structural parsing, layer-aware text extraction, and reconstruction of the file with the original text actually replaced. Philter does this at the API layer; the same policy that handles your plain-text workloads handles your PDF workloads.
If you're standing up a document-redaction pipeline for a regulated workload and want to talk through the architecture (including the upstream OCR step and the downstream verification loop), get in touch. We've done PDF redaction at scale for healthcare chart workflows, legal e-discovery production, financial-services KYC, and government FOIA pipelines — the surface area is wider than most teams expect on day one.