Clean-Text
Clean-Text (PyPI: `clean-text`) provides functions to preprocess and normalize text for NLP tasks. Its features include lowercasing, whitespace normalization, and the removal or replacement of emojis, URLs, digits, and punctuation. The package is installed as `clean-text` but imported as `cleantext`. It is currently at version 0.7.1 and has an active but infrequent release cadence.
Common errors
- ModuleNotFoundError: No module named 'clean_text'
  cause The module installed by the `clean-text` package is named `cleantext`, so `import clean_text` fails even when the package is present. The same error for `cleantext` means the package is not installed, or is installed in a different Python environment than the one currently active.
  fix Import with `from cleantext import clean`. If the import still fails, run `pip install clean-text` in your active Python environment and verify the installation with `pip show clean-text`.
- ImportError: Missing optional dependency 'emoji'. Install clean-text[emoji] to use this feature.
  cause A feature that relies on the `emoji` package (for example, emoji removal) was called in an environment where `emoji` cannot be imported.
  fix Install the missing package in the active environment, for example with `pip install emoji`, or reinstall `clean-text` so that pip resolves its dependencies again.
- My text is over-cleaned! All my punctuation/emojis/capitalization is gone!
  cause The `clean` function applies several transformations by default, including lowercasing and ASCII transliteration. The latter drops characters that have no ASCII equivalent, such as emojis.
  fix Review the `clean` function's parameters and set them explicitly. For example, `clean(text, lower=False, to_ascii=False, no_punct=False, no_emoji=False)` retains capitalization, non-ASCII characters, punctuation, and emojis.
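The first error above usually comes down to which names are visible to the interpreter that is actually running. The following is a minimal, standard-library-only sketch for checking that. The helper names `module_available` and `installed_version` are hypothetical, and the sketch assumes the distribution is called `clean-text` while its importable module is called `cleantext`, as in the upstream project.

```python
import importlib.util
from importlib import metadata
from typing import Optional


def module_available(name: str) -> bool:
    """Return True if the top-level module `name` can be found by this interpreter."""
    return importlib.util.find_spec(name) is not None


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Assumption: distribution name "clean-text", module name "cleantext".
    print("distribution clean-text:", installed_version("clean-text"))
    print("module cleantext:", module_available("cleantext"))
```

If the distribution reports a version but the module is not found, the script is probably running under a different interpreter than the one `pip` installed into.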
Warnings
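- gotcha The `clean` function transforms text by default: it lowercases the input and transliterates it to ASCII, which also drops emojis and diacritics. Removal or replacement of URLs, digits, and punctuation is controlled by flags such as `no_urls`, `no_digits`, and `no_punct`. Set every option your pipeline depends on explicitly to avoid unexpected data loss.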
- gotcha Emoji handling and ASCII transliteration rely on the third-party packages `emoji` and `unidecode` respectively. `unidecode` is optional and is installed through the `gpl` extra (`pip install clean-text[gpl]`); without it, transliteration falls back to a simpler method that gives coarser results. If `emoji` is missing from the environment, code that needs it fails with an `ImportError`.
- gotcha The `clean` function's `lang` parameter defaults to 'en'. It selects language-specific handling (for example, 'de' for German text) and does not detect the language itself, so leaving the default in place for non-English text can produce incorrect results.
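To see why emojis and diacritics can disappear when text is reduced to ASCII, the sketch below folds a string to ASCII using only the standard library. It is an illustration of the general technique, not the library's own implementation, and the function name `fold_to_ascii` is hypothetical.

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Decompose accented characters, then drop everything outside ASCII."""
    # NFKD splits "é" into "e" plus a combining accent; emojis have no decomposition.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" silently discards every non-ASCII code point.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# "Café <slightly smiling face> déjà vu", written with escapes to avoid encoding issues.
sample = "Caf\u00e9 \U0001F642 d\u00e9j\u00e0 vu"
print(repr(fold_to_ascii(sample)))  # 'Cafe  deja vu' (the emoji is gone, two spaces remain)
```

The accents are reduced to their base letters, while the emoji is removed outright, which is the kind of silent loss the warnings above describe.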
Install
- pip install clean-text
- pip install clean-text[gpl]
Imports
- clean
from cleantext import clean
Quickstart
from cleantext import clean

# Sample input with extra whitespace, emojis, a URL, and digits
text = " Hello World! 🙂 Check out my site: https://example.com This is a test. 123 🙂 "
# Default settings: lowercasing and ASCII transliteration, among other steps
cleaned_text = clean(text)
print(f"Original: '{text}'")
print(f"Cleaned: '{cleaned_text}'")

# To customize cleaning, for example, keep emojis and punctuation:
text_custom = " Hello World! 🙂 This is a test. :) "
# to_ascii=False keeps non-ASCII characters such as emojis
cleaned_custom = clean(text_custom, to_ascii=False, no_emoji=False, no_punct=False)
print(f"Custom Cleaned: '{cleaned_custom}'")
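One way to avoid depending on defaults is to keep a single explicit baseline of options and merge per-call overrides into it. The sketch below shows that pattern. `BASELINE_OPTIONS`, `build_options`, and `clean_with_baseline` are hypothetical names, the baseline contains only keyword names used in this document, and the import assumes the module is available as `cleantext`.

```python
from typing import Any, Dict

# Hypothetical baseline: every option the pipeline relies on is stated explicitly.
BASELINE_OPTIONS: Dict[str, Any] = {
    "lower": False,
    "no_punct": False,
    "no_emoji": False,
}


def build_options(**overrides: Any) -> Dict[str, Any]:
    """Merge per-call overrides into a copy of the baseline, rejecting unknown keys."""
    unknown = set(overrides) - set(BASELINE_OPTIONS)
    if unknown:
        raise KeyError(f"unknown option(s): {sorted(unknown)}")
    return {**BASELINE_OPTIONS, **overrides}


def clean_with_baseline(text: str, **overrides: Any) -> str:
    """Call `clean` with the merged options."""
    # Imported lazily so build_options stays usable without the package installed.
    from cleantext import clean  # assumption: module name is `cleantext`
    return clean(text, **build_options(**overrides))
```

Rejecting unknown keys is a design choice: a misspelled option name raises immediately instead of being passed along unnoticed. Add further options to the baseline as the pipeline needs them.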