Microsoft Built the Thing You Need Before You Feed Your Data to an LLM
Presidio is a free, open-source PII detector and anonymizer that has been quietly sitting on GitHub this whole time.
PII handling before LLM processing remains critical infrastructure, and Presidio and similar tools are now essential. The timing observation still holds: 'it has been sitting there for years' while everyone rediscovers it.
Microsoft, of all organizations, shipped a free open-source library that detects and scrubs PII from your data before you do something regrettable with it.
It's called Presidio. It finds names, phone numbers, credit card numbers, Social Security numbers, email addresses, IP addresses, and a couple dozen other things you probably shouldn't be piping into an API endpoint you don't control. Then it anonymizes them: redacts, replaces with fake values, masks, hashes, whatever you want.
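To make "redacts, replaces, hashes" a little more concrete, here is a toy sketch in plain Python of what applying a different operator per entity type can look like once the PII spans are known. This is not Presidio's API. `Span`, `anonymize`, and the operator functions are hypothetical names made up for illustration, and the spans are supplied by hand rather than detected. Only the entity labels (`PERSON`, `PHONE_NUMBER`, `EMAIL_ADDRESS`) follow Presidio's naming style.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A detected piece of PII: an entity label plus character offsets."""
    entity: str
    start: int
    end: int


# Hypothetical operators. Each takes the matched text and its entity label.
def redact(value: str, entity: str) -> str:
    """Remove the value entirely."""
    return ""


def replace(value: str, entity: str) -> str:
    """Swap the value for a placeholder naming the entity type."""
    return f"<{entity}>"


def mask(value: str, entity: str) -> str:
    """Hide everything except the last four characters."""
    return "*" * max(len(value) - 4, 0) + value[-4:]


def hash_value(value: str, entity: str) -> str:
    """Replace the value with its SHA-256 hex digest."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def anonymize(text, spans, operators, default=replace):
    """Apply one operator per span, chosen by entity label.

    Spans are assumed not to overlap. They are processed right to left
    so that earlier offsets stay valid after each substitution.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        op = operators.get(span.entity, default)
        original = text[span.start:span.end]
        text = text[:span.start] + op(original, span.entity) + text[span.end:]
    return text


if __name__ == "__main__":
    text = "Call Dana at 555-0142 or mail dana@example.com"
    spans = [
        Span("PERSON", 5, 9),
        Span("PHONE_NUMBER", 13, 21),
        Span("EMAIL_ADDRESS", 30, 46),
    ]
    print(anonymize(text, spans, {"PHONE_NUMBER": mask, "PERSON": redact}))
```

Choosing the operator per entity type is the useful part: you might mask a phone number so support staff can still recognize it, hash an identifier so records stay joinable, and drop a name outright.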
The timing is not subtle. Every enterprise right now is in the same conversation: "We want to use LLMs on our internal data but legal is nervous." Presidio is the answer legal is looking for, already built, already open source, already sitting there. It has been sitting there since at least 2021.
There's something quietly funny about the fact that the company selling Azure OpenAI subscriptions also maintains the library for making sure you don't accidentally feed your customers' home addresses into Azure OpenAI.
It runs as a Python library or a REST API, and it supports custom recognizers if the defaults don't cover your specific flavor of sensitive information. The anonymizer module is separate from the analyzer, so you can detect without destroying if you need to audit what's in your corpus first.
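The "detect without destroying" idea is easier to see in code. The sketch below is a toy, stdlib-only illustration of the pattern, not Presidio's actual classes: `RegexRecognizer`, `Analyzer`, `default_analyzer`, and `audit` are hypothetical names. So is the `EMPLOYEE_ID` format, which stands in for whatever in-house identifier the defaults would miss. The regexes are deliberately simplistic.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Where a piece of PII sits in the text, and what kind it is."""
    entity: str
    start: int
    end: int


class RegexRecognizer:
    """Toy recognizer: one entity label, one regular expression."""

    def __init__(self, entity: str, pattern: str):
        self.entity = entity
        self.pattern = re.compile(pattern)

    def find(self, text: str):
        return [Finding(self.entity, m.start(), m.end())
                for m in self.pattern.finditer(text)]


class Analyzer:
    """Runs every registered recognizer. It only reports, never rewrites."""

    def __init__(self, recognizers=()):
        self.recognizers = list(recognizers)

    def add_recognizer(self, recognizer: RegexRecognizer) -> None:
        self.recognizers.append(recognizer)

    def analyze(self, text: str):
        findings = [f for r in self.recognizers for f in r.find(text)]
        return sorted(findings, key=lambda f: f.start)


def default_analyzer() -> Analyzer:
    """A couple of simplified built-in patterns, for illustration only."""
    return Analyzer([
        RegexRecognizer("EMAIL_ADDRESS", r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        RegexRecognizer("IP_ADDRESS", r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    ])


def audit(analyzer: Analyzer, documents):
    """Count findings per entity type without modifying any document."""
    counts = Counter()
    for doc in documents:
        counts.update(f.entity for f in analyzer.analyze(doc))
    return counts


if __name__ == "__main__":
    analyzer = default_analyzer()
    # Custom recognizer for a made-up internal identifier format.
    analyzer.add_recognizer(RegexRecognizer("EMPLOYEE_ID", r"\bEMP-\d{6}\b"))
    corpus = [
        "Ticket from EMP-004217, reply to sam@example.org",
        "Login from 10.0.0.7 by EMP-000981",
    ]
    print(audit(analyzer, corpus))
```

Because the analysis step only returns offsets and labels, the same findings can feed a report for legal today and an anonymization pass later, without detecting twice or touching the source data in between.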
Not glamorous. Exactly the kind of thing that prevents a very bad news cycle.