Scrubadub: PII Redaction Library
Scrubadub is a Python library for removing personally identifiable information (PII) from unstructured text. It detects sensitive data such as names, email addresses, and phone numbers, and replaces each match with a configurable placeholder. The library is actively maintained and currently at version 2.0.1; major releases introduce new detectors and architectural changes.
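To make the detect-and-replace idea concrete, here is a minimal standard-library sketch. It is not scrubadub's implementation: the `PATTERNS` table and the `redact` function are hypothetical, the regular expressions are deliberately simplistic, and the `{{LABEL}}` placeholder style is an assumption chosen for illustration.

```python
import re

# Hypothetical, simplified patterns for illustration only.
# Real detectors handle many more formats and edge cases.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\b\d{4} \d{3}-\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a {{LABEL}} placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub("{{" + label + "}}", text)
    return text


sample = "Contact example@example.com, or 1800 555-5555."
redacted = redact(sample)
print(redacted)  # Contact {{EMAIL}}, or {{PHONE}}.
```

A library such as scrubadub wraps this loop in detector classes so that each kind of PII can be enabled, disabled, or configured independently.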
Warnings
- breaking: Version 2.0.0 split the library into smaller sub-packages and stopped loading every detector by default; only a default set is now loaded. Code that relied on previously auto-loaded detectors (e.g., spaCy, Stanford NER) needs explicit `add_detector()` calls and the relevant optional packages (`scrubadub_spacy`, `scrubadub_stanford`, `scrubadub_address`).
- breaking: Support for Python 2.7 and 3.5 was dropped in version 2.0.0. If you require these Python versions, use `scrubadub` version 1.2.2 or earlier.
- gotcha: Since version 2.0.0, only a default set of detectors is loaded when you initialize a `Scrubber` or call `scrubadub.clean()`. To use optional or external detectors (e.g., `SpacyNameDetector`, `AddressDetector`), install their packages and add them to your `Scrubber` instance explicitly.
- gotcha: Adding two detectors with the same name to a `Scrubber` instance raises a `KeyError`.
- gotcha: Version 2.0.1 fixed an incorrectly named `scikit-learn` dependency. Users may have hit installation problems with `scrubadub==2.0.0` because of this.
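The duplicate-name `KeyError` is easier to reason about with a toy model of a registry keyed by detector name. The sketch below uses only the standard library; `MiniScrubber` and its methods are hypothetical stand-ins and not scrubadub's actual code. It shows two ways around the error: remove the existing entry first, or register the second detector under a different name.

```python
class MiniScrubber:
    """Hypothetical stand-in for a scrubber that stores detectors by name."""

    def __init__(self):
        self._detectors = {}

    def add_detector(self, name, detector):
        # Refuse to silently overwrite a detector that is already registered.
        if name in self._detectors:
            raise KeyError(f"detector name already in use: {name!r}")
        self._detectors[name] = detector

    def remove_detector(self, name):
        del self._detectors[name]

    def names(self):
        return sorted(self._detectors)


scrubber = MiniScrubber()
scrubber.add_detector("email", object())

try:
    scrubber.add_detector("email", object())  # same name: rejected
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True

# Option 1: remove the existing detector, then add the replacement.
scrubber.remove_detector("email")
scrubber.add_detector("email", object())

# Option 2: register a second detector under a distinct name.
scrubber.add_detector("email_strict", object())

print(duplicate_rejected, scrubber.names())
```

With the real library, the same two strategies apply in principle: either take the conflicting detector out of the `Scrubber` before adding yours, or give your detector a name that is not already in use.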
Install
- core library
    pip install scrubadub
- core library with the optional detector packages
    pip install scrubadub scrubadub-spacy scrubadub-stanford scrubadub-address
Imports
- clean
    import scrubadub
    cleaned_text = scrubadub.clean(text)
- Scrubber
    from scrubadub import Scrubber
    scrubber = Scrubber()
    cleaned_text = scrubber.clean(text)
- Detector
    from scrubadub import Scrubber
    from scrubadub.detectors import EmailDetector
    scrubber = Scrubber()
    # Raises KeyError if a detector with the same name is already loaded (see Warnings)
    scrubber.add_detector(EmailDetector())
Quickstart
    import scrubadub

    text = "My cat can be contacted on example@example.com, or 1800 555-5555. His name is John Doe."
    cleaned_text = scrubadub.clean(text)
    print(cleaned_text)

    # For more control, use the Scrubber class
    from scrubadub import Scrubber
    from scrubadub.detectors import TextBlobNameDetector  # Example of an optional detector

    scrubber = Scrubber()
    # Add a detector if it's not enabled by default or for custom configuration
    scrubber.add_detector(TextBlobNameDetector())
    controlled_cleaned_text = scrubber.clean(text)
    print(controlled_cleaned_text)