EMR Validator
EMR Validator (emrvalidator) is a Python library for validating healthcare data. Users define validation rules in an Excel-based schema and apply them to data files in formats such as CSV. The current version is 1.0.2; the library is actively maintained, with minor releases for bug fixes and enhancements.
Common errors
- FileNotFoundError: [Errno 2] No such file or directory: 'your_file.csv'
  cause The `data_path` or `schema_path` provided to `EMRValidator` does not point to an existing file, or the path is incorrect.
  fix Ensure the `schema_path` and `data_path` arguments provide correct, absolute or relative paths to the respective files. Double-check file names, extensions, and the current working directory.
- KeyError: 'Column Name'
  cause The schema file (e.g., `schema.xlsx`) is missing one of the mandatory column headers expected by the validator, or there's a typo in a column name.
  fix Verify that your schema file includes all required column headers: 'Column Name', 'Data Type', 'Is Mandatory', 'Allowed Values', 'Min Length', 'Max Length', 'Regex Pattern'. Ensure correct spelling and casing for each header.
- ModuleNotFoundError: No module named 'emrvalidator'
  cause The `emrvalidator` library has not been installed in the current Python environment, or the environment where the script is run is different from where it was installed.
  fix Install the library using `pip install emrvalidator`. If already installed, ensure you are running your script in the correct Python environment.
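The first two errors can be caught before the validator is constructed. The following is a minimal sketch and not part of emrvalidator: `missing_paths` and `missing_headers` are hypothetical helper names, and the required header list is copied from the `KeyError` fix above. Reading the header row out of the schema file (for example with pandas or openpyxl) is left to the caller.

```python
from pathlib import Path

# Header names copied from the KeyError fix above.
REQUIRED_HEADERS = [
    "Column Name", "Data Type", "Is Mandatory", "Allowed Values",
    "Min Length", "Max Length", "Regex Pattern",
]


def missing_paths(*paths):
    """Return the given paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


def missing_headers(headers):
    """Return required schema headers absent from `headers`.

    The comparison is exact, so spelling and casing differences count as missing.
    """
    present = set(headers)
    return [h for h in REQUIRED_HEADERS if h not in present]


if __name__ == "__main__":
    # Hypothetical file names; replace with your own paths.
    print(missing_paths("schema.xlsx", "your_file.csv"))
    # 'Data type' differs in casing, so 'Data Type' is reported as missing.
    print(missing_headers(["Column Name", "Data type", "Is Mandatory"]))
```

An empty list from both helpers means the paths exist and every listed header is present; it says nothing about whether the rest of the schema is well-formed.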
Warnings
- gotcha The schema definition (e.g., `schema.xlsx`) must strictly adhere to the expected column headers and structure described in the documentation. Incorrect headers, missing mandatory columns, or deviations in format will lead to validation failures or `KeyError`.
- gotcha Data type definitions in the schema (`Data Type` column) must use specific keywords recognized by the library (e.g., 'STRING', 'INTEGER', 'DECIMAL', 'DATETIME', 'BOOLEAN'). Mismatches between these keywords and actual data types or unrecognized keywords will cause validation errors.
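The data type warning can be checked the same way before running a validation. This is a sketch under stated assumptions: the keyword set contains only the examples named above, so the library may accept keywords not listed here, and the exact, case-sensitive comparison is a guess at how strict the library is. `unrecognized_types` is a hypothetical helper, not an emrvalidator function.

```python
# Keywords named in the warning above; assumed, not an exhaustive list.
KNOWN_TYPES = {"STRING", "INTEGER", "DECIMAL", "DATETIME", "BOOLEAN"}


def unrecognized_types(type_by_column):
    """Return {column: data type} for entries whose type is not in KNOWN_TYPES.

    The match is exact and case-sensitive (an assumption about the library).
    """
    return {
        column: dtype
        for column, dtype in type_by_column.items()
        if dtype not in KNOWN_TYPES
    }


if __name__ == "__main__":
    schema_types = {"PatientID": "STRING", "Age": "INT", "Name": "string"}
    print(unrecognized_types(schema_types))
```

Any column returned here is worth checking against the library's documentation before assuming the keyword is wrong.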
Install
- pip install emrvalidator
Imports
- EMRValidator
from emrvalidator import EMRValidator
Quickstart
import os
import pandas as pd
from emrvalidator import EMRValidator
# --- Dummy file creation for runnable example START ---
# In a real scenario, you would have these files pre-existing.
schema_data = {
    "Column Name": ["PatientID", "Name", "Age", "AdmissionDate"],
    "Data Type": ["STRING", "STRING", "INTEGER", "DATETIME"],
    "Is Mandatory": ["YES", "YES", "YES", "NO"],
    "Allowed Values": ["", "", "", ""],
    "Min Length": ["", "2", "0", ""],
    "Max Length": ["", "50", "120", ""],
    "Regex Pattern": ["", "", "", ""],
}
schema_df = pd.DataFrame(schema_data)
# Using tempfile for demonstration, replace with your actual file paths
import tempfile
temp_dir = tempfile.gettempdir()
schema_path = os.path.join(temp_dir, "registry_schema.xlsx")
data_path = os.path.join(temp_dir, "registry_data.csv")
with pd.ExcelWriter(schema_path, engine='openpyxl') as writer:
    schema_df.to_excel(writer, index=False, sheet_name='Sheet1')
data_csv_content = """PatientID,Name,Age,AdmissionDate
P001,Alice,30,2023-01-15
P002,Bob,25,
P003,Charlie,40,2024-03-20
"""
with open(data_path, 'w') as f:
    f.write(data_csv_content)
# --- Dummy file creation for runnable example END ---
# Initialize the EMRValidator
# Replace 'schema_path' and 'data_path' with your actual file paths
validator = EMRValidator(schema_path=schema_path, data_path=data_path)
# Run the validation
validation_result = validator.validate()
# Get summary of validation
summary = validator.get_summary()
print("Validation Summary:")
print(summary)
# Get invalid records
invalid_records = validator.get_invalid_records()
if not invalid_records.empty:
    print("\nInvalid Records:")
    print(invalid_records)
else:
    print("\nNo invalid records found.")
# Get validated records
validated_records = validator.get_validated_records()
if not validated_records.empty:
    print("\nValidated Records:")
    print(validated_records)
# Clean up temporary files (optional, for demonstration)
os.remove(schema_path)
os.remove(data_path)