cleantext

1.1.4 verified Fri May 01 auth: no python

An open-source Python package for cleaning raw text data. Version 1.1.4 provides simple functions to normalize whitespace and remove URLs, emojis, numbers, punctuation, and more. Releases are infrequent and the API is relatively stable. It is commonly used for quick text preprocessing in NLP pipelines.

pip install cleantext
error ModuleNotFoundError: No module named 'cleantext'
cause Package not installed.
fix
Run 'pip install cleantext' to install the package.
error TypeError: clean() got an unexpected keyword argument 'fix_unicode'
cause The 'fix_unicode' parameter was removed in version 1.0.0.
fix
Remove 'fix_unicode' from your function call. Unicode normalization is handled automatically.
error Result is a list, not a string
cause 'clean_words' returns a list of tokens, whereas 'clean' returns a single string.
fix
Call 'clean' instead, or use 'cleaned_text = ' '.join(clean_words(...))' to get a single string.
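Both runtime errors above (an unexpected keyword argument and an unexpected return type) can be guarded against without knowing the installed version in advance. The following is a minimal sketch, not part of cleantext: 'supported_kwargs' and 'as_text' are hypothetical helper names, and 'fake_clean' is a stand-in used only so the example runs without the package installed.

```python
import inspect


def supported_kwargs(func, kwargs):
    """Return only the keyword arguments that func's signature accepts."""
    params = inspect.signature(func).parameters
    # If func takes **kwargs, every keyword is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {name: value for name, value in kwargs.items() if name in params}


def as_text(result):
    """Return a single string whether result is a str or a list of tokens."""
    if isinstance(result, str):
        return result
    return " ".join(result)


# Hypothetical stand-in for a cleaning function; not the real cleantext API.
def fake_clean(text, lowercase=False):
    return text.lower().split() if lowercase else text.split()


# 'fix_unicode' is not in fake_clean's signature, so it is filtered out.
options = {"lowercase": True, "fix_unicode": True}
accepted = supported_kwargs(fake_clean, options)
print(accepted)
print(as_text(fake_clean("Hello  World", **accepted)))
```

In real code, pass the imported 'clean' in place of 'fake_clean'. Silently dropping unknown keywords hides the mismatch, so logging the dropped names may be the safer choice.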
gotcha 'clean' returns a single string, while 'clean_words' returns a list of tokens. If you expect a string, check the output type before using the result.
fix Use 'cleaned_text = ' '.join(clean_words(...))' if you need a single string built from the tokens.
deprecated The 'fix_unicode' parameter was removed in version 1.0.0. Using it raises an error.
fix Remove 'fix_unicode' from the call. Unicode normalization is now handled automatically.
breaking In version 1.0.0, the 'clean' function changed its default parameter values and removed some arguments like 'fix_unicode'. Code written for <1.0.0 may break.
fix Review your function calls against the installed signature, for example with 'help(clean)', and adjust parameter names accordingly.
gotcha The 'clean' function can remove too much. For instance, setting 'punct=True' removes all punctuation, including periods in abbreviations.
fix Set 'punct=False' to keep punctuation, then remove only the characters you do not want in a separate step.
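When blanket punctuation removal is too aggressive, one option is to protect known abbreviations before stripping and restore them afterwards. The following is a minimal sketch using only the standard library, independent of cleantext. The function name 'strip_punct_keep_abbrev' and the abbreviation list are illustrative assumptions, and the input is assumed to contain no NUL characters.

```python
import string

# Hypothetical abbreviation list; extend it for your own data.
ABBREVIATIONS = ("e.g.", "i.e.", "Dr.", "p.m.")


def strip_punct_keep_abbrev(text, abbreviations=ABBREVIATIONS):
    """Remove ASCII punctuation but keep the listed abbreviations intact."""
    protected = text
    # Swap each abbreviation for a NUL-delimited index that contains no punctuation.
    for i, abbr in enumerate(abbreviations):
        protected = protected.replace(abbr, f"\x00{i}\x00")
    # Delete every character in string.punctuation.
    stripped = protected.translate(str.maketrans("", "", string.punctuation))
    # Put the original abbreviations back.
    for i, abbr in enumerate(abbreviations):
        stripped = stripped.replace(f"\x00{i}\x00", abbr)
    return stripped


print(strip_punct_keep_abbrev("Dr. Smith arrived, e.g. at 5 p.m.!"))
# prints: Dr. Smith arrived e.g. at 5 p.m.
```

The result can then be passed to 'clean' with punctuation removal turned off, so the remaining flags still apply.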

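Basic example: removes URLs, numbers, punctuation, and extra spaces while keeping the original letter case.

import re

from cleantext import clean

text = "Hello! Check out https://example.com 😊 I have 5 apples..."

# Strip URLs up front with a regex so punctuation removal does not mangle them.
text = re.sub(r"https?://\S+", "", text)

# Flag names are assumed from the cleantext 1.1.4 README; run help(clean)
# to confirm them for your installed version.
cleaned = clean(
    text,
    clean_all=False,    # apply only the flags set below
    extra_spaces=True,  # collapse runs of whitespace
    lowercase=False,    # keep the original letter case
    numbers=True,       # remove digits
    punct=True,         # remove punctuation
)
print(cleaned)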