Text Preprocessing Library
`proces` is a Python library (version 0.1.7) for text preprocessing. It provides a configurable `TextCleaner` class that cleans and normalizes raw text for natural language processing (NLP) tasks. Its options cover removing HTML, URLs, mentions, hashtags, numbers, and punctuation, as well as case conversion and whitespace handling. Because this is a 0.x release, the API may change between versions.
Common errors
- ModuleNotFoundError: No module named 'proces'
  cause: The `proces` library is not installed in your current Python environment.
  fix: Install it with `pip install proces`. If you use a virtual environment, make sure it is activated first.
- AttributeError: module 'proces' has no attribute 'clean'
  cause: You are calling a method such as `clean()` directly on the imported `proces` module rather than on an instance of the `TextCleaner` class.
  fix: Instantiate the `TextCleaner` class first: `from proces import TextCleaner; cleaner = TextCleaner(); cleaned_text = cleaner.clean(my_text)`.
- TypeError: TextCleaner.__init__() got an unexpected keyword argument 'remove_emoji'
  cause: You are passing an unsupported configuration option to the `TextCleaner` constructor. The library accepts only its documented parameters.
  fix: Check the `proces` documentation (e.g., the GitHub README) for the available `TextCleaner` initialization parameters and pass only supported arguments.
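One way to guard against the `TypeError` above is to compare keyword arguments against a constructor's signature before calling it. The following is a minimal sketch using only the standard library's `inspect.signature`. The helper `split_supported_kwargs` and the stand-in class `DemoCleaner` are hypothetical names introduced for illustration; they are not part of `proces`, and `DemoCleaner` does not reflect the real `TextCleaner` parameter list.

```python
import inspect


def split_supported_kwargs(cls, **kwargs):
    """Split kwargs into (supported, unsupported) for cls's constructor."""
    params = inspect.signature(cls).parameters
    # A constructor that declares **kwargs accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    # Only these parameter kinds can be passed by keyword.
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    allowed = {name for name, p in params.items() if p.kind in keyword_kinds}
    supported = {k: v for k, v in kwargs.items() if k in allowed}
    unsupported = {k: v for k, v in kwargs.items() if k not in allowed}
    return supported, unsupported


class DemoCleaner:
    """Hypothetical stand-in; swap in the real TextCleaner when installed."""

    def __init__(self, lower=False, remove_punctuation=False):
        self.lower = lower
        self.remove_punctuation = remove_punctuation


supported, unsupported = split_supported_kwargs(
    DemoCleaner, lower=True, remove_emoji=True
)
if unsupported:
    print(f"Ignoring unsupported options: {sorted(unsupported)}")
cleaner = DemoCleaner(**supported)
```

Passing the real `TextCleaner` class in place of `DemoCleaner` should work the same way, since the check relies only on the constructor's declared signature.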
Warnings
- breaking: `proces` is in early development (version 0.x), so its API may change without strict backward-compatibility guarantees. Future minor versions might introduce breaking changes.
- gotcha: The generic package name `proces` is easily confused with Python's standard-library `multiprocessing` module or with other process-management libraries. Make sure you are importing the `proces` package for text preprocessing.
- gotcha: The `TextCleaner` class can remove stopwords, but it does not ship with a default stopword set. If `remove_stopwords=True` is set without a `stopwords_list`, the option has no effect.
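Given the version and naming caveats above, it can help to confirm which `proces` distribution, if any, is installed before relying on its API. Below is a small sketch built on the standard library's `importlib.metadata`; the helper name `installed_version` is an illustrative choice, and the `0.1.` prefix check simply mirrors the version this page describes.

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for dist_name, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("proces")
if version is None:
    print("proces is not installed in this environment")
elif not version.startswith("0.1."):
    # This page describes 0.1.7; other releases may differ.
    print(f"proces {version} found; re-check the API against its documentation")
else:
    print(f"proces {version} found")
```

Pinning the version in your requirements (for example `proces==0.1.7`) is a common way to avoid surprises from later 0.x releases.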
Install
- pip install proces
Imports
- TextCleaner
import proces; cleaner = proces.TextCleaner()
from proces import TextCleaner
Quickstart
from proces import TextCleaner
# Basic cleaning: lowercase, remove punctuation, strip whitespace
cleaner = TextCleaner(lower=True, remove_punctuation=True, strip_whitespace=True)
text_input = " Hello, World! This is a Sample Text with HTML <br> tags. And @mentions, #hashtags, links: http://example.com 123 "
cleaned_text = cleaner.clean(text_input)
print(f"Original: {text_input}")
print(f"Cleaned (basic): {cleaned_text}")
# Advanced cleaning: remove HTML, URLs, mentions, hashtags, numbers, replace with tokens
advanced_cleaner = TextCleaner(
    lower=True,
    remove_html=True,
    remove_urls=True,
    remove_mentions=True,
    remove_hashtags=True,
    remove_numbers=True,
    remove_punctuation=True,
    strip_whitespace=True,
    replace_numbers_with='<NUM>',
    replace_urls_with='<URL>',
    replace_mentions_with='<MENTION>',
    replace_hashtags_with='<HASHTAG>',
)
cleaned_advanced_text = advanced_cleaner.clean(text_input)
print(f"Cleaned (advanced): {cleaned_advanced_text}")
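To make the cleaning steps in the Quickstart more concrete, here is a rough standard-library approximation of what such a pipeline might do. It is a sketch under stated assumptions, not the implementation used by `proces`: the regular expressions, the step order, and the function name `rough_clean` are all illustrative, and the library's actual output may differ.

```python
import re
import string

# Illustrative patterns; real-world URL and HTML handling is more involved.
HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def rough_clean(text):
    """Approximate the cleaning steps with plain regexes (not proces itself)."""
    # Strip structured tokens first: removing punctuation earlier would
    # break the patterns that rely on '<', '://', '@', and '#'.
    text = HTML_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


sample = "Hello, World! <br> Visit http://example.com @user #tag 123 "
print(rough_clean(sample))  # hello world visit
```

The ordering matters in this sketch: structured tokens are removed before punctuation so that their delimiters are still present when the patterns run.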