Presidio Analyzer
Presidio Analyzer is a Python library and service for detecting Personally Identifiable Information (PII) entities in text. It combines predefined recognizers, regular expressions, and Named Entity Recognition (NER) models to identify sensitive data. The library is actively maintained: the current version is 2.2.362, and frequent releases add features, fix bugs, and improve detection.
Warnings
- breaking: Presidio underwent a significant revamp from V1 to V2 (starting around 2.0.0). This involved a migration from gRPC to HTTP-based APIs, changes in JSON payload formats (structured objects to flattened JSON, camelCase to snake_case), and deprecation of some services. Code written for V1 is not compatible with V2 APIs.
- gotcha: Many country-specific PII recognizers (e.g., for Singapore, Australia, Germany, Sweden) are disabled by default to prevent false positives when they are not explicitly needed. If you require detection for specific regional PII, you must explicitly enable these recognizers, either via a YAML configuration file or programmatically by adding them to the RecognizerRegistry.
- gotcha: The AnalyzerEngine relies on NLP models (like spaCy) for many detections. While `spacy` itself is a dependency, the language models (e.g., `en_core_web_lg`) must be downloaded separately using `python -m spacy download <model_name>`. Failure to do so will result in errors or limited detection capabilities.
- gotcha: Specific versions of underlying NLP libraries, particularly spaCy, might be explicitly restricted in certain `presidio-analyzer` releases. For example, `spacy.cli` was restricted for version 3.7.0 in release 2.2.356. Using an incompatible spaCy version can lead to unexpected behavior or errors.
- gotcha: Users employing static type checking (e.g., mypy) may encounter type errors in versions after 2.2.33, specifically related to the initialization of `AnonymizerEngine` and type mismatches for `RecognizerResult` between `presidio-analyzer` and `presidio-anonymizer`, which define the class separately.
Install
- pip install presidio-analyzer
- python -m spacy download en_core_web_lg
Imports
- AnalyzerEngine
from presidio_analyzer import AnalyzerEngine
- RecognizerRegistry
from presidio_analyzer.recognizer_registry import RecognizerRegistry
- NlpEngineProvider
from presidio_analyzer.nlp_engine import NlpEngineProvider
Quickstart
from presidio_analyzer import AnalyzerEngine
# Initialize the AnalyzerEngine
# This will load the default spaCy NLP model (en_core_web_lg if downloaded)
analyzer = AnalyzerEngine()
text = "My name is John Doe and my phone number is (123) 456-7890."
# Analyze the text for PII entities
# Pass a list of entity types to look for, or omit `entities` to search for all supported entities
results = analyzer.analyze(text=text, entities=["PERSON", "PHONE_NUMBER"], language='en')
for result in results:
    print(f"Entity: {result.entity_type}, Text: {text[result.start:result.end]}, Score: {result.score:.2f}")
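The `start`, `end`, `entity_type`, and `score` fields read in the Quickstart can also be used to mask the detected spans. The following is only a sketch: `Span` is a hypothetical stand-in for the analyzer's result objects so that the example runs without a language model, the scores are made up, and `mask` is an illustrative helper rather than part of Presidio (the separate `presidio-anonymizer` package is the supported tool for anonymization).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Span:
    """Hypothetical stand-in exposing the fields the Quickstart reads."""
    entity_type: str
    start: int
    end: int
    score: float


def mask(text: str, results: Iterable, min_score: float = 0.5) -> str:
    """Replace each sufficiently confident span with a <ENTITY_TYPE> tag."""
    kept = [r for r in results if r.score >= min_score]
    # Work from right to left so that earlier offsets stay valid.
    kept.sort(key=lambda r: r.start, reverse=True)
    limit = len(text)
    for r in kept:
        if r.end > limit:
            continue  # overlaps a span that was already replaced
        text = text[:r.start] + f"<{r.entity_type}>" + text[r.end:]
        limit = r.start
    return text


text = "My name is John Doe and my phone number is (123) 456-7890."
name, phone = "John Doe", "(123) 456-7890"
# Illustrative spans and scores; real values come from analyzer.analyze().
spans = [
    Span("PERSON", text.index(name), text.index(name) + len(name), 0.85),
    Span("PHONE_NUMBER", text.index(phone), text.index(phone) + len(phone), 0.75),
]
masked = mask(text, spans)
print(masked)
```

Because `mask` only reads attributes, the list returned by `analyzer.analyze(...)` can be passed in place of `spans`.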