pyjanitor
pyjanitor is a Python library that extends pandas DataFrames with a clean, user-friendly API for data cleaning and preprocessing. Inspired by the R `janitor` package, it supports common data wrangling tasks such as cleaning column names and handling missing values, and its functions are exposed as DataFrame methods so they can be used in method chains. Currently at version 0.32.23, the library is actively developed, with frequent releases that address performance, add features, and deprecate functions to stay aligned with evolving pandas APIs.
Common errors
-
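ModuleNotFoundError: No module named 'pyjanitor'
cause: The code uses `import pyjanitor`. The package is installed under the name `pyjanitor` but imported under the name `janitor`, so `import pyjanitor` fails even when the package is installed. If the package is missing from the active environment altogether, `import janitor` fails with `No module named 'janitor'` instead.
fix: Import the library with `import janitor`. If that import also fails, install the package in your active environment with `pip install pyjanitor`. If you use virtual environments, activate the correct one before installing and before running your code.
-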
AttributeError: 'DataFrame' object has no attribute 'clean_names'
cause: The `janitor` module was not imported, so its methods have not been registered on pandas DataFrames.
fix: Add `import janitor` to your script after `import pandas as pd`. Importing the module registers pyjanitor's functions as DataFrame methods.
-
AttributeError: 'DataFrameGroupBy' object has no attribute 'mutate' (or similar for other deprecated methods on groupby objects)
cause: A deprecated pyjanitor method was called on a pandas GroupBy object, or the installed pyjanitor version predates the addition of that method to GroupBy objects.
fix: pandas GroupBy objects have no `assign` method, so compute the grouped values with `df.groupby(...)[column].transform(...)` and attach them to the DataFrame with `df.assign(...)`. Refer to pyjanitor's documentation for the methods available on GroupBy objects in your version.
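The grouped-column fix above can be sketched with native pandas only. This is a minimal illustration, not pyjanitor's own recommendation for every case; the DataFrame and the column names `team`, `score`, and `team_mean` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame(
    {
        "team": ["a", "a", "b", "b"],
        "score": [1, 3, 10, 20],
    }
)

# transform() returns one value per original row, aligned to df's index,
# so the result can be attached directly as a new column.
team_mean = df.groupby("team")["score"].transform("mean")

# assign() is a DataFrame method (not a GroupBy method) and returns a copy.
result = df.assign(team_mean=team_mean)

print(result)
```

Because `transform` preserves the original row order and index, this pattern fits into a method chain in the same way a grouped `mutate` call would.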
Warnings
- deprecated The `mutate` DataFrame method has been deprecated. Users are advised to transition to alternative approaches for adding or modifying columns.
- breaking Grouped ("by") operations that were previously called directly on DataFrames have moved to groupby objects for improved API consistency.
- deprecated Functions like `add_column`, `add_columns`, `remove_columns`, `rename_column`, `rename_columns`, and `filter_on` are slated for deprecation in a future 1.x release, as their functionality largely overlaps with native pandas methods.
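For the functions slated for deprecation, the sketch below shows one plausible set of native pandas counterparts. The pairing of each pyjanitor function with a specific pandas method is an assumption made for illustration, and the sample data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["Alice", "Bob"], "age": [24, 30], "tmp": [0, 1]})

result = (
    df
    # Assumed counterpart of add_column / add_columns
    .assign(age_next_year=lambda d: d["age"] + 1)
    # Assumed counterpart of remove_columns
    .drop(columns=["tmp"])
    # Assumed counterpart of rename_column / rename_columns
    .rename(columns={"name": "first_name"})
    # Assumed counterpart of filter_on
    .query("age > 25")
)

print(result)
```

Each of these pandas methods returns a new DataFrame, so the chain leaves `df` unchanged.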
Install
-
pip install pyjanitor
Imports
- janitor
import janitor
- pandas
import pandas as pd
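Because the package is installed as `pyjanitor` but imported as `janitor`, it can help to check that the import name resolves before relying on the registered DataFrame methods. The following is a minimal sketch using only the standard library; the helper name `janitor_available` is hypothetical.

```python
import importlib.util


def janitor_available() -> bool:
    """Return True if the `janitor` module can be found in this environment."""
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec("janitor") is not None


if janitor_available():
    # Imported for its side effect of registering DataFrame methods.
    import janitor  # noqa: F401

    print("janitor is importable")
else:
    print("janitor not found; try: pip install pyjanitor")
```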
Quickstart
import pandas as pd
import janitor
# Sample DataFrame with messy column names
data = {
    'First Name': ['Alice', 'Bob'],
    'Last-Name': ['Smith', 'Johnson'],
    'AGE (Years)': [24, 30],
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Clean column names using pyjanitor's clean_names()
cleaned_df = df.clean_names()
print("\nCleaned DataFrame:\n", cleaned_df)
print("\nCleaned column names:", cleaned_df.columns.tolist())