The Philterd Blog on Philterd — Zero-Trust PII Redaction for Cloud and AI

Introducing Arbiter: Human-in-the-Loop PII Redaction

Sat, 16 May 2026 13:00:00 +0000

Automated PII redaction is genuinely great at most of the job. Pattern matching nails SSNs, credit cards, and phone numbers; PhEye's NLP models catch names, addresses, and organizations in unstructured text; Philter's policy engine ties it all together. For the majority of detections, the automated layer is faster, more consistent, and more comprehensive than human review.

And then there's the last few percent.

The detection that fires on the phrase "Huntington's disease" because "Huntington" is a name. The credit-card-shaped number that's actually a transaction ID. The patient name that's also a famous historical figure. The street address inside a quoted news article that's part of the public record. The mention of a person's name in a context where the policy says it shouldn't be redacted, or vice versa. These are the cases where automation can't decide on its own — and where ignoring the ambiguity means either over-redaction (data utility destroyed) or under-redaction (compliance failure).

The TCO of "Free" Cloud PII Redaction: AWS Comprehend, Google DLP, vs Self-Hosted at Scale

Thu, 14 May 2026 11:00:00 +0000

Pricing disclaimer: the dollar figures, per-unit rates, and tier discounts below reflect the published prices for AWS Comprehend, Google Cloud DLP, and the AWS Marketplace at the time of writing. Cloud providers update their pricing frequently — sometimes monthly, sometimes via opaque enterprise discount programs. Always verify the current rates on each provider's pricing page before using these numbers for a budgeting decision. The relative shape of the comparison (per-volume vs. per-instance billing) is stable; the specific dollar amounts may not be.

What is PII? A Practical Guide for Engineers and Compliance Teams

Sun, 10 May 2026 09:30:00 +0000

"PII" is one of those terms that everyone in technology uses and almost nobody defines the same way. Ask a compliance lawyer, a database engineer, and an ML scientist what PII is, and you'll get three different answers — all of them partially right. The disagreement doesn't matter much in conversation, but it matters a great deal when someone has to actually act on it: write a redaction policy, audit a data lake, or sign off on a release.

The Hidden Difficulties of Redacting PDF Documents

Sat, 09 May 2026 10:00:00 +0000

"Just black it out" is one of the most dangerous sentences in document handling. A user opens a PDF in Acrobat, drops a few black rectangles over the sensitive bits, saves the file, and ships it. The visible result looks redacted. The actual file contains every original character of the supposedly-hidden text, sitting under the rectangle, fully recoverable in under five seconds with a copy-paste.

This isn't a hypothetical. It's the source of some of the most famous PII leaks in the last 20 years. PDF redaction is genuinely hard — harder than redaction of plain text, harder than redaction of Word documents, harder than most engineers realize the first time they try to do it.

Prompt Engineering for Privacy: Practical Patterns for Not Leaking PII

Wed, 06 May 2026 11:00:00 +0000

The prompt-engineering literature is enormous and almost entirely focused on getting better answers. It says very little about a quieter problem: every prompt sent to a hosted LLM is a data egress point. The text leaves your perimeter, lands on a provider's servers, sits in their logs for some retention period, and (depending on the provider and the agreement) may be used to train future models. The model might return your PII to another user. The provider might suffer a breach. Either way, the data is no longer yours.

The Phileas Trino Connector: PII Redaction as SQL

Sun, 03 May 2026 12:30:00 +0000

Trino (formerly PrestoSQL) is the federated query engine that more and more organizations are using to query across data lakes, warehouses, and relational sources without ETL'ing data into a single place. The architectural promise is "query data where it lives"; the privacy implication is that sensitive data in any connected source can land in the result set of any query a user runs. PII that was carefully gated in one system becomes exposed the moment Trino joins it with anything else.

Redaction for Legal and E-Discovery: Privilege, Rule 9037, and the In-House Counsel's Pipeline

Thu, 30 Apr 2026 14:00:00 +0000

Legal work has more redaction in it than almost any other industry — and far less automation than the volume justifies. Court filings get hand-redacted by paralegals with black markers. Discovery productions get scrubbed in Relativity by associates billing $400/hour to draw rectangles over names. M&A due diligence rooms get sanitized one document at a time. The result is a category that spends enormous sums on a problem that's largely solvable with software.

PII in Vector Embeddings: A Defense Guide

Wed, 22 Apr 2026 13:00:00 +0000

"It's just an array of floats" is the most reassuring sentence a vector-store skeptic can hear — and the most misleading. Sentence embeddings produced by modern models are partially invertible: an attacker with access to the embeddings (but not the original text) can reconstruct meaningful approximations of the source. For teams storing embeddings of sensitive data, this turns "we don't expose the raw text" from a complete defense into half of one.

The Ethics of Training: Why We Use Synthetic Data

Sun, 19 Apr 2026 13:37:57 +0000

In cybersecurity, trust is easy to lose and nearly impossible to regain. As a decision-maker, you're constantly weighing the benefits of new AI tools against the risk of a headline-making data leak. Most AI companies ask for your trust while simultaneously asking for your data to "improve their models."

At Philterd, we believe that's a fundamental conflict of interest. A privacy tool should never be trained on the very data it is meant to protect. That is why we've built our intelligence on a foundation of synthetic data.

Building a HIPAA-Compliant Medical Chatbot

Tue, 14 Apr 2026 15:00:00 +0000

Every health system in 2026 is building or evaluating an internal medical chatbot — a Q&A interface over chart notes, drug references, clinical guidelines, or operational documentation. The reasoning is straightforward: physicians and care teams spend hours a day searching for information that should be one question away. The technology is straightforward too: it's a RAG system.

The HIPAA story, however, is decidedly not straightforward. A generic RAG chatbot built with off-the-shelf components fails HIPAA at multiple points, and the failures aren't subtle. This post is the architectural blueprint for the version that passes — a medical chatbot that satisfies Safe Harbor de-identification at ingestion, defends against PHI leakage at retrieval and inference, and produces the audit artifacts your compliance team will be asked to show.

Building a Privacy-Aware RAG System

Wed, 08 Apr 2026 14:00:00 +0000

Retrieval-augmented generation is the dominant pattern for enterprise AI in 2026. Every customer-support assistant, internal-docs Q&A bot, and clinical-summary tool you've seen pitched in the last twelve months is, under the hood, a RAG system. The architecture is straightforward: index your documents into a vector store, retrieve the top-K relevant chunks for a user query, paste them into an LLM prompt, return the answer.

The privacy problem is also straightforward, and it's bigger than most teams realize. RAG has two distinct PII leak vectors — one at ingestion, one at inference — and protecting against only one of them is the same as protecting against neither.
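
The two vectors can be sketched in a few lines. This is only an illustration under stated assumptions: `redact` is a toy regex standing in for a real redaction engine, `ToyIndex` is a hypothetical word-overlap retriever standing in for a vector store, and the inference-side leak is assumed to include PII typed into the user's query.

```python
import re

# Toy stand-in for a real redaction engine: masks SSN-shaped strings only.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text: str) -> str:
    return SSN.sub("[SSN]", text)

class ToyIndex:
    """Hypothetical in-memory stand-in for a vector store; ranks by shared words."""

    def __init__(self) -> None:
        self.chunks: list[str] = []

    def ingest(self, chunk: str) -> None:
        # Leak vector 1 (ingestion): redact before anything is stored.
        self.chunks.append(redact(chunk))

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        words = set(query.lower().split())
        ranked = sorted(
            self.chunks,
            key=lambda c: len(words & set(c.lower().split())),
            reverse=True,
        )
        return ranked[:k]

def build_prompt(index: ToyIndex, user_query: str) -> str:
    # Leak vector 2 (inference): redact the query before retrieval and prompting.
    safe_query = redact(user_query)
    context = "\n".join(index.retrieve(safe_query))
    return f"Context:\n{context}\n\nQuestion: {safe_query}"
```

Dropping either `redact` call leaves a raw SSN in the final prompt, which is the sense in which guarding one vector alone is not enough.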

Redaction for Financial Services: PCI DSS, GLBA, and the Real-World Data Pipeline

Wed, 25 Mar 2026 16:00:00 +0000

Financial services has the strictest data handling requirements outside of healthcare — and arguably more enforcement teeth, because every bank regulator (OCC, CFPB, FTC, state AGs) has a different angle of attack. Where healthcare has the relative clarity of HIPAA's 18 Safe Harbor identifiers, finance has multiple overlapping regulatory regimes (PCI DSS, GLBA, SOX, BSA/AML, Reg E, state privacy laws) and a customer-facing surface area (call centers, mobile apps, chat, email) that generates unstructured PII at industrial scale.

PII vs PHI vs NPPI: An Engineer's Guide

Wed, 18 Mar 2026 10:30:00 +0000

Few three-letter combinations cause more confusion in data privacy than PII, PHI, and NPPI. They overlap, they get used interchangeably, and the regulatory implications of mixing them up are real.

This is the short, definitional reference. One paragraph each, the regulatory framework that defines it, and the architectural implication for the engineer who has to do something about it.

PII: the umbrella

Personally Identifiable Information is any data that can be used — on its own or in combination with other data — to identify a specific person. Names, email addresses, SSNs, IP addresses, and biometric data are PII. So are quasi-identifiers like zip code + date of birth + gender, which together are sufficient to uniquely identify ~87% of the U.S. population.
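
To make the quasi-identifier point concrete, here is a minimal sketch that measures how many records become unique once fields are combined. The records are made up for illustration and `unique_fraction` is a hypothetical helper, not part of any Philterd tool.

```python
from collections import Counter

def unique_fraction(records, quasi_identifiers):
    """Fraction of records whose quasi-identifier combination appears exactly once."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys)

# Made-up records for illustration only.
records = [
    {"zip": "26003", "dob": "1980-01-02", "gender": "F"},
    {"zip": "26003", "dob": "1980-01-02", "gender": "M"},
    {"zip": "26003", "dob": "1975-06-30", "gender": "F"},
    {"zip": "26003", "dob": "1975-06-30", "gender": "F"},
]

by_zip = unique_fraction(records, ["zip"])
by_all = unique_fraction(records, ["zip", "dob", "gender"])
```

In this toy data, zip code alone singles out nobody, while the three fields together single out half of the records.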

Architecting Privacy in Kafka: Real-Time Redaction for Streaming Data

Wed, 11 Mar 2026 20:42:02 +0000

Most organizations think about PII the same way they think about backups: as a thing they'll worry about once it lands in a database. But in a streaming architecture, that's already too late. The moment a message hits a Kafka topic, dozens of downstream consumers — analytics jobs, search indexers, archival sinks, ML pipelines — can read it. Once a single SSN reaches one downstream system, it's everywhere.

This is the "PII at rest" problem inverted. The right place to redact streaming data isn't downstream — it's in flight, between the producers and the first consumer that doesn't need the raw values. We learned this pattern years ago building Phirestream; today the same approach is implemented with Philter and the underlying Phileas library, depending on the constraints of your stack.
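
The shape of in-flight redaction can be sketched without a Kafka client. In this illustration plain Python lists stand in for the raw and clean topics, and `redact` is a toy regex rather than Philter or Phileas; every name is hypothetical.

```python
import re
from typing import Iterable, Iterator

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(value: str) -> str:
    """Toy stand-in for a redaction engine; masks SSN-shaped strings only."""
    return SSN.sub("[SSN]", value)

def redact_in_flight(raw_messages: Iterable[str]) -> Iterator[str]:
    """Read from the raw topic and yield what should be written to the clean topic."""
    for message in raw_messages:
        yield redact(message)

# Lists stand in for topics; downstream consumers would subscribe to clean_topic only.
raw_topic = ["user=42 ssn=123-45-6789", "user=43 action=login"]
clean_topic = list(redact_in_flight(raw_topic))
```

The design point is placement: only the redacting stage reads the raw topic, so no downstream consumer ever sees the original values.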

Beyond Regex: Why General LLMs Fail at PII Discovery

Wed, 04 Mar 2026 20:44:54 +0000

Regex was never meant for the messy reality of human language. It's great at finding a 10-digit number that looks like a phone identifier, but it's famously terrible at telling you why that number exists. On the flip side, we're now seeing companies try to throw massive, general-purpose LLMs at the problem. Those models are incredible conversationalists — but using them for PII discovery is like using a sledgehammer for surgery.


Compliance as Code: Integrating Philter into Your CI/CD Pipeline

Wed, 25 Feb 2026 21:51:13 +0000

Engineering teams shifted security left a decade ago: SAST scanners, dependency audits, and IaC linters all run in CI now, blocking the merge button when something's off. Privacy is the next thing to shift.

Most organizations still treat PII leaks the same way they treat bugs in production — surfaced by an incident, triaged by an SRE, written up in a postmortem. That's the most expensive place to catch them. Every minute of triage, every regulator notification, every customer-trust call is downstream of a failure that should have failed a build.
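
A build-failing privacy check can be as small as the sketch below. It is an assumption-heavy illustration: a single SSN regex stands in for a real Philter scan, and `find_pii` and `main` are hypothetical names.

```python
import re
from pathlib import Path

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def find_pii(paths):
    """Return (path, line_number) pairs for lines containing an SSN-shaped string."""
    hits = []
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if SSN.search(line):
                hits.append((str(path), number))
    return hits

def main(argv):
    """Print each finding and return the exit code the CI step should use."""
    hits = find_pii(argv)
    for path, number in hits:
        print(f"{path}:{number}: possible SSN")
    # A non-zero exit code is what makes the CI job, and therefore the merge, fail.
    return 1 if hits else 0
```

A CI step would pass the changed files to `main` and hand its return value to `sys.exit`.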

Migrating from AWS Comprehend to Philter: A Practical Transition Guide

Wed, 18 Feb 2026 13:30:00 +0000

Teams move from AWS Comprehend PII to Philter for one of three reasons: the bill got uncomfortable, the data path stopped passing compliance review, or the customization surface ran out (custom entity types, domain lenses, per-entity replacement strategies). Whatever the reason, the migration itself is more mechanical than most teams expect.

This guide is the practical playbook: how Comprehend concepts map onto Philter, the code translation for the most common API calls, and a safe shadow-mode cutover pattern that lets you migrate without taking on detection-quality risk.
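
The shadow-mode idea can be sketched as follows. This is an illustration, not the guide's actual code: detectors are assumed to be plain callables returning `(start, end, label)` tuples, and both function names are hypothetical.

```python
def compare_detections(current, candidate):
    """Diff two collections of (start, end, label) spans for the same input text."""
    current, candidate = set(current), set(candidate)
    return {
        "agreed": sorted(current & candidate),
        "only_current": sorted(current - candidate),
        "only_candidate": sorted(candidate - current),
    }

def shadow_redact(text, current_detector, candidate_detector, log):
    """Serve results from the current system while logging how the candidate differs."""
    current = current_detector(text)
    candidate = candidate_detector(text)
    log.append(compare_detections(current, candidate))
    return current  # production output is unchanged during shadow mode
```

Because the caller still receives the current system's spans, disagreements can be reviewed offline before any traffic is cut over.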

Redaction for Insurance: Claims, Customer Data, and the State-by-State Patchwork

Thu, 12 Feb 2026 14:30:00 +0000

Insurance is the vertical most people forget to mention when discussing regulated data. Healthcare gets HIPAA; banking gets GLBA and PCI; tech gets GDPR and CCPA. Insurance gets all of them at once — plus state-level insurance commissioner rules, plus the NAIC Insurance Data Security Model Law, plus (for health insurers) HIPAA in addition to everything else.

For carriers, third-party administrators, and insurtech platforms, that overlapping regulatory environment combines with a uniquely PII-dense data flow: claims adjusters write free-text notes, customers correspond via email and chat, third-party medical reports arrive in PDFs, agent training data captures live calls. PII is everywhere; the redaction problem is constant.

Open Source vs. Black Box: Why You Can't Afford "Trust Me" Privacy

Thu, 05 Feb 2026 10:28:11 +0000

For a Chief Information Security Officer, the word trust is a calculated risk. When you buy a security tool, you aren't just buying a feature — you are inheriting the vendor's vulnerabilities, their blind spots, and their secret handling of your data.

In PII redaction and data privacy, this risk is magnified. If a tool fails to catch a Social Security Number or a patient identifier, the liability doesn't fall on the software vendor — it falls on you. That is why black-box proprietary systems are becoming a relic of the past, and why auditable, open source tools are the new enterprise standard for 2026.

Automating HIPAA Safe Harbor: A Blueprint for Healthcare Data Pipelines

Sun, 25 Jan 2026 19:41:38 +0000

For healthcare CTOs and Data Protection Officers, HIPAA Safe Harbor de-identification — 45 CFR § 164.514(b)(2) — sounds simple on paper. Remove 18 specific identifier categories, and the resulting dataset is no longer considered Protected Health Information (PHI). You can share it for research, train AI on it, or move it into less-restricted infrastructure.

In practice, mechanically enforcing those 18 categories across patient narratives, clinical notes, intake forms, and the data flowing through modern AI pipelines is non-trivial. This post lays out how each of the 18 identifiers maps to a specific capability in the Philterd toolkit, and then sketches a reference architecture for three common healthcare data flows: patient data lakes, clinical research pipelines, and RAG-based medical AI systems.

Privacy Shouldn't Be a Guessing Game: Evaluating Redaction with Philter Scope

Sun, 18 Jan 2026 08:28:10 +0000

In data privacy, "I think we caught everything" is a dangerous sentence. When you're preparing a massive dataset for research or moving sensitive logs into a cloud environment, you can't rely on a gut feeling that your redaction tool is working. You need proof.

The problem is that most redaction engines are black boxes. You feed data in, you get redacted data out, but you have no clear way to measure what was missed or what was accidentally destroyed.
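
"What was missed" and "what was accidentally destroyed" correspond to recall and precision. A minimal sketch, assuming exact-match `(start, end)` spans and a hand-labeled gold set; this is not necessarily how Philter Scope computes its numbers.

```python
def score(gold, predicted):
    """Return (precision, recall) for exact-match spans.

    Low recall means PII was missed; low precision means non-PII was redacted.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 1.0
    recall = true_positives / len(gold) if gold else 1.0
    return precision, recall

# Illustrative spans only: four labeled PII spans, two predictions, one correct.
gold = [(0, 4), (10, 21), (30, 42), (60, 70)]
predicted = [(0, 4), (50, 55)]
precision, recall = score(gold, predicted)
```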

Why API-Based Redaction is a Security Antipattern

Sat, 17 Jan 2026 15:07:49 +0000

In the rush to adopt AI and modern data processing, many organizations have fallen into a convenient but dangerous trap: "Privacy-as-a-Service" APIs. It sounds simple — you send your raw text to a third-party provider, they redact the sensitive bits, and send it back.

But there is a fundamental flaw in this logic. To protect your PII (Personally Identifiable Information), you are starting the process by handing that PII over to someone else.

Redaction for Government and Federal Workloads: FedRAMP, CMMC, ITAR, and the Air-Gap Imperative

Fri, 09 Jan 2026 10:00:00 +0000

Government workloads have the strictest data-handling requirements of any sector in the West — and the smallest population of commercial tools that genuinely qualify. Most PII redaction products marketed to "enterprise" customers fail at the first authorization step because the underlying architecture assumes a hosted SaaS data path. Federal contractors quickly discover that "we redact your PII for you" means "we move your CUI to our infrastructure first," which is the opposite of what FedRAMP, CMMC, ITAR, and the rest of the federal compliance stack require.

Deploying Philter in Air-Gapped Environments

Sat, 27 Dec 2025 22:06:40 +0000

In data security, connected is often synonymous with vulnerable. For high-security sectors — defense, intelligence, national healthcare — the gold standard isn't just a firewall; it's an air gap. When your data is so sensitive that it cannot exist on a network with outbound internet access, your software stack has to be just as self-sufficient.

Most modern AI and PII redaction tools are cloud-first, meaning they constantly call home for updates, telemetry, or license verification. In an air-gapped environment, those tools don't just fail — they won't even start.

From Phileas to Philter: The Evolution of Our Open Source Engine

Fri, 19 Dec 2025 09:22:01 +0000

In software, there's a saying that "nothing is ever finished — it's just released." Looking back at the trajectory of our privacy engine, that sentiment couldn't be more accurate. What began as a focused open source experiment called Phileas has evolved into Philter — the core of a comprehensive enterprise PII suite used by healthcare providers, government agencies, and global tech firms.

Understanding the history of this engine isn't just a trip down memory lane — it's an explanation of why the software is as stable and context-aware as it is today. Here is how we moved from simple pattern matching to high-velocity, hybrid privacy intelligence.

Redaction for Education: FERPA, Student Records, and Research Data Pipelines

Sun, 14 Dec 2025 11:30:00 +0000

Education is the vertical that almost no one talks about when discussing regulated data — and yet it's one of the largest categories of PII in the country. Every K-12 district, every university registrar, every learning-management-system vendor, every edtech startup, every educational researcher works under FERPA, the Family Educational Rights and Privacy Act of 1974. The regulatory framework is real, the penalties for institutions are severe (loss of federal funding), and the tooling situation is genuinely thin compared to healthcare or finance.

Snowflake PII Redaction: A Practical Integration Guide

Sat, 06 Dec 2025 11:00:00 +0000

Snowflake is the data warehouse for a huge swath of mid-to-large enterprises, and that's exactly why it's where unstructured PII piles up. Customer-support transcripts, scanned-document text, application logs, chat history, transaction descriptions — everything that's text and was ever called "data" eventually lands in a Snowflake table.

Snowflake's built-in dynamic data masking and external tokenization handle column-level structured data well (a SSN column, a credit card column), but they don't address the harder problem: PII buried inside free-text columns. A customer-service-ticket table with a TEXT column holding the full conversation is the canonical case — the SSN is in there somewhere, but it's not in a column you can mask.

What is Data Redaction? A Practical Guide

Wed, 26 Nov 2025 10:30:00 +0000

"Data redaction" is one of those terms everyone uses and few people define the same way. The mental image is usually a black bar over a sentence in a court filing — an image that captures one specific technique while obscuring the broader category. Real-world redaction includes that, plus several other strategies that look very different from a black bar but solve the same underlying problem.

This post is a practitioner's reference: what data redaction actually means, the five common strategies, when each one fits, and how to pick the right approach for your workload.

Using an LLM or Pattern-based Rules for PII/PHI Redaction

Thu, 01 May 2025 20:53:57 +0000

In our data-driven world, protecting Personally Identifiable Information (PII) and Protected Health Information (PHI) is imperative. Whether you’re securing customer data, complying with regulations like GDPR or HIPAA, or simply aiming for responsible data handling, you need a reliable way to redact sensitive information.

Today, there are two primary approaches: leveraging the power of Large Language Models (LLMs) and employing traditional pattern-based rules. While LLMs have understandably received significant attention for their impressive natural language understanding, it’s essential to compare their capabilities against the tried-and-true methods of pattern matching.

Philter 3.1.0

Sun, 23 Mar 2025 15:19:36 +0000

Philter 3.1.0 is now available on all three major cloud marketplaces.

What's new in 3.1.0

Philter 3.1.0 is built on Phileas 2.12.0, which brings:

* Filter priorities. Each filter can now have its own priority that is used as a tie-breaker when the same text is identified by two filters. For example, if you're using the phone-number filter and an ID filter for 10-digit numbers, both may detect PII on the same text. The filter priority decides which label wins.
* Zip code validation. The zip-code filter can now optionally validate zip codes against an internal database. When enabled, a string that looks like a zip code but doesn't actually exist won't be redacted, which reduces false positives on otherwise-numeric data.
* Per-filter context window sizes. The window size is roughly the number of words surrounding PII that the engine uses for contextual disambiguation. Previously every filter shared one window size; now each filter can set its own. Tighten the window where you want strict matching; widen it where surrounding context matters.
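
The priority tie-breaker can be illustrated with a small sketch. The `Span` type and `resolve_ties` function are hypothetical, not Phileas' actual classes, and the sketch assumes that a higher number means higher priority, which the release notes do not specify.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    priority: int  # assumption: a higher number wins the tie

def resolve_ties(spans):
    """Keep one span per (start, end) range, choosing the highest priority."""
    best = {}
    for span in spans:
        key = (span.start, span.end)
        if key not in best or span.priority > best[key].priority:
            best[key] = span
    return sorted(best.values(), key=lambda s: s.start)

# The same 10 characters are flagged by both a phone-number filter and an ID filter.
spans = [
    Span(12, 22, "phone-number", priority=1),
    Span(12, 22, "id", priority=5),
    Span(0, 4, "name", priority=1),
]
resolved = resolve_ties(spans)
```

Here the text at positions 12 to 22 ends up labeled `id`, because that filter was given the higher priority.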

What Philter is (in case you're new here)

Philter is open source software that redacts PII and PHI from text and PDF documents. It runs entirely inside your cloud — your data never leaves your perimeter, never reaches a third-party API, and never lands in someone else's logs. A REST API takes text in and returns redacted text out:

Phileas 2.12.0

Thu, 20 Mar 2025 19:22:54 +0000

Phileas 2.12.0 has been released. This version of the popular open source redaction library brings:

* Filter priorities – Each filter can have its own priority that is used as a tie-breaker in cases where text is identified by two filters. For example, if you are using the phone number filter and an ID filter for 10-digit numbers, both filters may detect PII on the same text. In this case, the filter priority will be used to determine the ultimate labeling of the text as either a phone number or an ID number.
* Zip code validation – The zip code filter can now optionally attempt to validate zip codes. When enabled, if a zip code does not exist in the internal database, the zip code will not be redacted.
* Each filter can have a custom window size – The window size is roughly the number of words surrounding PII that is used to provide contextual information about the PII. Previously, each filter had to use the same window size. Now, each filter can have the window size set independently.
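
The effect of zip code validation can be sketched as below. This is an illustration only: `KNOWN_ZIPS` is a three-entry stand-in for the internal database, and `redact_zips` is a hypothetical function, not the Phileas API.

```python
import re

ZIP = re.compile(r"\b\d{5}\b")

# Tiny illustrative set; a real database would hold every assigned zip code.
KNOWN_ZIPS = {"10001", "26003", "90210"}

def redact_zips(text, validate=True):
    """Replace 5-digit strings with [ZIP]; with validation on, skip unknown ones."""
    def replace(match):
        if validate and match.group() not in KNOWN_ZIPS:
            return match.group()  # looks like a zip code but is not one: keep it
        return "[ZIP]"
    return ZIP.sub(replace, text)

validated = redact_zips("Ship to 90210, order 00000")
unvalidated = redact_zips("Ship to 90210, order 00000", validate=False)
```

With validation enabled, the order number survives; with it disabled, both 5-digit strings are redacted.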

Look for a new version of Philter built on Phileas 2.12.0 soon in the AWS, Google Cloud, and Azure marketplaces!

Why Using an LLM to Redact PII and PHI is a Bad Idea

Mon, 17 Feb 2025 02:23:09 +0000

We have seen a lot of posts (and you probably have too) on various social media and blogging platforms showing how you can redact text using a large language model (LLM). They present a fairly simple solution to the complex problem of redaction. Can we really just let an LLM handle our text redaction and be done with it? The answer is simply no.

Here is one such example: https://ravichinni.medium.com/using-generative-ai-for-content-redaction-46ee61a3a4e6 (Don’t do this.)

Shielding Your Search: Redacting PII and PHI in OpenSearch with Phinder

Fri, 10 Jan 2025 13:26:28 +0000

In today’s data-driven world, safeguarding Personally Identifiable Information (PII) and Protected Health Information (PHI) is paramount. When leveraging search platforms like OpenSearch, ensuring sensitive data remains confidential is crucial. Enter Phinder, an open-source OpenSearch plugin that leverages the power of the Phileas project to effectively redact and de-identify PII and PHI within your search results.

This post explores how Phinder can bolster your data privacy and security when using OpenSearch. Phinder is available on GitHub at https://github.com/philterd/phinder-pii-opensearch-plugin.

Phileas 2.10.0

Mon, 06 Jan 2025 14:03:31 +0000

We are excited to announce the release of Phileas 2.10.0!

What’s changed in this version:

* Making FilterResponse not be a final record class by @jzonthemtn in #166
* Removing commons-csv dependency by @jzonthemtn in #174
* Removing guava dependency and adding bloom filter by @jzonthemtn in #172
* Update pdfbox to 3.0.* by @JessieAMorris in #177
* Fixes a bug with the policy service being hard coded to “local” by @JessieAMorris in #178
* Enable outputting the replacement value on PDFs by @JessieAMorris in #179
* Add truncation filter strategy for all filters by @JessieAMorris in #180
* Adding line about snapshots being published nightly. by @jzonthemtn in #182
* #183 Replacing redis test dependency. by @jzonthemtn in #184
* Replace the Lucene-based filter with a fuzzy dictionary filter by @jzonthemtn in #185

GitHub release: https://github.com/philterd/phileas/releases/tag/2.10.0

Phileas in Graylog – Removing PII from Logs

Sun, 01 Dec 2024 13:21:41 +0000

We are very excited to share with you that Graylog has integrated Phileas, the open source PII/PHI redaction engine, into their centralized log management solution. With this new integration, Graylog now has the ability to identify and redact different types of PII (personally identifiable information) present in logs.

The presence of PII in logs is a serious concern. Even careful application developers can find it difficult to prevent all PII from being included in logs. Error messages and stack traces can inadvertently include PII, exposing the business to risk and liability.

Phileas 2.9.1

Wed, 27 Nov 2024 14:04:01 +0000

We are excited to announce the release of Phileas 2.9.1.

What’s changed in this version:

* LineWidthSplitService is using a new line separator instead of a space
* An empty list of spans from ph-eye does not indicate failure
* Have a default PhEyeConfiguration value in AbstractPhEyeConfiguration so a filter does not have to provide one

GitHub release: https://github.com/philterd/phileas/releases/tag/2.9.1

Artifacts are available in the Philterd repository as described in the README.


Automatically Redacting PII and PHI from Files in Amazon S3 using Amazon Macie and Philter

Sun, 17 Nov 2024 13:23:14 +0000

Amazon Macie is "a data security service that discovers sensitive data using machine learning and pattern matching." With Amazon Macie you can find potentially sensitive information in files in your Amazon S3 buckets, but what do you do when Amazon Macie finds a file that contains an SSN, phone number, or other piece of sensitive information?

Philter is software that redacts PII, PHI, and other sensitive information from text. Philter runs entirely within your private cloud and does not require any external connectivity. Your data never leaves your private cloud and is not sent to any third party. In fact, you can run Philter without any external network connectivity and we recommend doing so.

Philter as an AI Policy Layer

Thu, 10 Oct 2024 13:24:17 +0000

An AI policy layer is an important part of every source of AI-generated text because it inspects the AI-generated text to prevent sensitive information from being exposed. A policy layer can help remove information such as names, addresses, and telephone numbers from responses.
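
As a rough sketch of the idea, a policy layer can be modeled as a wrapper around whatever produces the text. Everything here is hypothetical: `fake_model` stands in for an LLM call, and a single phone-number regex stands in for Philter, so names and addresses are not handled.

```python
import re

# Toy pattern for phone numbers written as 555-867-5309 or 555.867.5309.
PHONE = re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")

def with_policy_layer(generate):
    """Wrap a text generator so every response is filtered before it is returned."""
    def guarded(prompt):
        return PHONE.sub("[PHONE]", generate(prompt))
    return guarded

def fake_model(prompt):
    # Hypothetical stand-in for a call to an LLM.
    return "You can reach Dr. Smith at 555-867-5309."

safe_model = with_policy_layer(fake_model)
response = safe_model("How do I contact my doctor?")
```

The caller only ever holds `safe_model`, so no response can bypass the filter.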

In this blog post we will describe the function of an AI policy layer and how Philter is well-suited for the role. Philter is available on the AWS Marketplace, Google Cloud Marketplace, and the Microsoft Azure Marketplace.

Redacting Text in Amazon Kinesis Data Firehose

Mon, 09 Sep 2024 13:24:55 +0000

Amazon Kinesis Firehose is a managed streaming service designed to take large amounts of data from one place to another. For example, you can take data from sources such as Amazon CloudWatch, AWS IoT, and custom applications using the AWS SDK and deliver it to destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch, and other services. In this post we will use Amazon S3 as the firehose's destination.

In some cases you may need to manipulate the data as it goes through the firehose to remove sensitive information. In this blog post we will show how Amazon Kinesis Firehose and AWS Lambda can be used in conjunction with Philter to remove sensitive information (PII and PHI) from the text as it travels through the firehose.
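
A minimal sketch of such a transformation function is shown below. The event and response shapes (`records`, `recordId`, base64-encoded `data`, and a `result` of `Ok`) follow the Firehose data-transformation contract; the `redact` function is a toy regex stand-in, and a real deployment would call Philter at that point instead.

```python
import base64
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def redact(text):
    # Toy stand-in; a real deployment would send the text to Philter here.
    return SSN.sub("[SSN]", text)

def handler(event, context):
    """Firehose data-transformation handler: decode, redact, re-encode each record."""
    output = []
    for record in event["records"]:
        text = base64.b64decode(record["data"]).decode("utf-8")
        redacted = redact(text)
        output.append({
            "recordId": record["recordId"],  # must echo the incoming record ID
            "result": "Ok",
            "data": base64.b64encode(redacted.encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```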

Phileas — The Open Source PII and PHI redaction engine

Mon, 22 May 2023 18:16:12 +0000

I am delighted to announce that the project providing the core PII and PHI redaction capabilities is now open source. Introducing Phileas, the PII and PHI redaction engine — available under the Apache 2.0 license on GitHub.

Both Philter and Phirestream use Phileas to identify and redact sensitive information like PII and PHI. Phileas does all of the heavy lifting, while Philter and Phirestream make its functionality user-friendly and provide the NLP models.

What is format-preserving encryption?

Sun, 21 May 2023 18:15:20 +0000

In cryptography, you have plain text and cipher text. An encryption algorithm transforms the plain text into the cipher text. The cipher text won't look anything like the plain text — in characters or length. There are many different encryption algorithms serving many different purposes, and the cipher text for each one will be different.

Take the case of a credit card number, a common piece of sensitive information that's often encrypted. A credit card number is typically 16 digits long. Encrypting it with the industry-standard AES-128-CBC algorithm produces a cipher text much longer than 16 digits, usually around 64 base64-encoded characters. If you're storing the credit card number in a database column configured for length 16, the cipher text won't fit.
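
The length arithmetic can be checked without an encryption library. This sketch assumes PKCS#7 padding and a 16-byte IV stored in front of the cipher text, which is a common arrangement but not the only one; the helper names are hypothetical.

```python
import base64

BLOCK = 16  # AES block size in bytes

def cbc_ciphertext_len(plaintext_len: int) -> int:
    """Cipher text bytes under PKCS#7: padding is always added, so an input that
    already fills a whole block still gains one more block."""
    return (plaintext_len // BLOCK + 1) * BLOCK

def base64_len(byte_len: int) -> int:
    """Characters needed to base64-encode byte_len bytes."""
    return len(base64.b64encode(bytes(byte_len)))

card = "4111111111111111"                     # 16-digit test number
ciphertext = cbc_ciphertext_len(len(card))    # 32 bytes after padding
stored = BLOCK + ciphertext                   # 48 bytes with the IV prepended

without_iv_chars = base64_len(ciphertext)     # 44 characters
with_iv_chars = base64_len(stored)            # 64 characters
```

Under these assumptions the 16-digit number grows to 64 base64 characters, four times the width of the original column.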