Text Preprocessing Library

0.1.7 · active · verified Thu Apr 16

proces is a Python library for text preprocessing. Its `TextCleaner` class exposes options to clean, normalize, and prepare raw text for natural language processing (NLP) tasks: removing HTML, URLs, mentions, hashtags, numbers, and punctuation, converting case, and normalizing whitespace. Because this is a 0.x release (0.1.7), its API may change between versions.

Common errors

Warnings

Install

Imports

Quickstart

Demonstrates basic and advanced usage of the TextCleaner class to preprocess a sample string, applying various cleaning rules and replacement tokens.

from proces import TextCleaner

# Basic cleaning: lowercase, remove punctuation, strip whitespace
cleaner = TextCleaner(lower=True, remove_punctuation=True, strip_whitespace=True)
text_input = "  Hello, World! This is a Sample Text with HTML <br> tags. And @mentions, #hashtags, links: http://example.com 123  "
cleaned_text = cleaner.clean(text_input)
print(f"Original: {text_input}")
print(f"Cleaned (basic): {cleaned_text}")

# Advanced cleaning: remove HTML, URLs, mentions, hashtags, and numbers,
# substituting a placeholder token wherever a replace_*_with value is given
advanced_cleaner = TextCleaner(
    lower=True,
    remove_html=True,
    remove_urls=True,
    remove_mentions=True,
    remove_hashtags=True,
    remove_numbers=True,
    remove_punctuation=True,
    strip_whitespace=True,
    replace_numbers_with='<NUM>',
    replace_urls_with='<URL>',
    replace_mentions_with='<MENTION>',
    replace_hashtags_with='<HASHTAG>'
)
cleaned_advanced_text = advanced_cleaner.clean(text_input)
print(f"Cleaned (advanced): {cleaned_advanced_text}")
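
The advanced configuration combines placeholder tokens such as `<URL>` with punctuation removal, so the order in which cleaning steps run matters: if punctuation were stripped after the tokens were inserted, the angle brackets would be lost. The sketch below illustrates one possible ordering using only the standard library. It is not the proces implementation; `sketch_clean` and the regular expressions are hypothetical stand-ins, and the real `TextCleaner` may order its steps differently or produce different output.

```python
import re
import string

# Illustrative stand-in only: NOT the proces implementation.
# All names and patterns below are assumptions made for this sketch.
HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NUMBER_RE = re.compile(r"\d+")
TOKEN_RE = re.compile(r"(<[A-Z]+>)")  # matches the placeholder tokens inserted below
STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def sketch_clean(text: str) -> str:
    # 1. Drop HTML tags first, so the placeholder tokens added later
    #    are never mistaken for markup.
    text = HTML_RE.sub(" ", text)
    # 2. Replace URLs before numbers, so digits inside a URL are not
    #    tokenized separately.
    text = URL_RE.sub(" <URL> ", text)
    text = MENTION_RE.sub(" <MENTION> ", text)
    text = HASHTAG_RE.sub(" <HASHTAG> ", text)
    text = NUMBER_RE.sub(" <NUM> ", text)
    # 3. Lowercase and strip punctuation everywhere except inside tokens.
    parts = TOKEN_RE.split(text)
    cleaned = [
        part if TOKEN_RE.fullmatch(part) else part.lower().translate(STRIP_PUNCT)
        for part in parts
    ]
    # 4. Collapse runs of whitespace and trim both ends.
    return " ".join("".join(cleaned).split())


sample = (
    "  Hello, World! This is a Sample Text with HTML <br> tags. "
    "And @mentions, #hashtags, links: http://example.com 123  "
)
print(sketch_clean(sample))
```

With this ordering the sample string reduces to `hello world this is a sample text with html tags and <MENTION> <HASHTAG> links <URL> <NUM>`. Comparing that against the actual output of `advanced_cleaner.clean(text_input)` is a quick way to check how the installed version of proces handles tokens and punctuation together.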
