Bangla Unicode Normalizer
bnunicodenormalizer (v0.1.7) is a Python library for normalizing Bangla Unicode text. It cleans and standardizes Bangla text by resolving inconsistent character representations, digit forms, and other common irregularities, making the text suitable for Natural Language Processing (NLP) tasks. The library saw active development in mid-2023 and is currently in maintenance mode.
Common errors
- FileNotFoundError: [Errno 2] No such file or directory: '.../bnunicodenormalizer/romanize_map.json'
  cause: The `Normalizer` tried to load the default `romanize_map.json` but could not find it, often because of an incomplete installation or a non-standard environment.
  fix: Ensure `bnunicodenormalizer` is properly installed via `pip install bnunicodenormalizer`. If you pass a custom `romanize_mapping_path`, check that the file exists at that path and is readable.
- ModuleNotFoundError: No module named 'bnunicodenormalizer'
  cause: The `bnunicodenormalizer` package is not installed in the active Python environment.
  fix: Install the package using `pip install bnunicodenormalizer`. If you use virtual environments, make sure your IDE or terminal is using the environment where the package was installed.
- AttributeError: 'dict' object has no attribute 'normalized_text'
  cause: The output of `bn_normalize(text)` is a dictionary, and the code is accessing a key as if it were an attribute.
  fix: Access dictionary keys with square bracket notation, e.g., `result['normalized_text']`. The returned dictionary has the form `{'normalized_text': '...', 'detected_lang': '...'}`; `detected_lang` is present only if language detection is enabled.
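The dictionary-access fix above can be wrapped in a small helper that fails with a more useful message. This is a minimal sketch, not part of the library: `read_field` is a hypothetical name, and the default key `normalized_text` is taken from the description above. If an installed version uses different keys, the error lists the ones actually present.

```python
from typing import Any, Mapping


def read_field(result: Mapping[str, Any], key: str = "normalized_text") -> Any:
    """Return result[key], or raise a KeyError listing the keys that are present."""
    if not isinstance(result, Mapping):
        raise TypeError(f"expected a dict-like result, got {type(result).__name__}")
    if key not in result:
        available = ", ".join(sorted(result)) or "(none)"
        raise KeyError(f"{key!r} not in result; available keys: {available}")
    return result[key]


# Stand-in for a normalizer result; a real one would come from bn_normalize(text).
sample = {"normalized_text": "এই টেস্টিং টেক্সট।"}
print(read_field(sample))
```

Listing the available keys in the error makes a key-name mismatch visible immediately instead of surfacing later as a `None` value.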
Warnings
- gotcha The `Normalizer` class can optionally take a `romanize_mapping_path` argument. If a custom path is provided and is incorrect or the file is missing, it will result in a `FileNotFoundError`. If not provided, it attempts to load a default file from the package installation directory.
- breaking As a `0.x` library, minor version increments (e.g., from 0.1.x to 0.2.x) can introduce breaking changes, which SemVer permits while the major version is zero. No explicit breaking changes are documented between recent `0.1.x` versions.
- gotcha The library has a direct dependency on `fasttext`. Installing `fasttext` can sometimes be challenging due to its native dependencies (e.g., C++ compiler). If `fasttext` fails to install correctly, the language detection features of `bnunicodenormalizer` will be unavailable or may cause errors, even if normalization functions still work.
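A custom mapping path can be validated before it is handed to the `Normalizer`, so a bad path fails early with a clear message. The sketch below uses only the standard library. `check_mapping_file` is a hypothetical helper, and the assumption that the mapping file is JSON is inferred from the `.json` extension mentioned above.

```python
import json
from pathlib import Path


def check_mapping_file(path: str) -> Path:
    """Return the resolved path if it points to a parseable JSON file, else raise."""
    candidate = Path(path).expanduser()
    if not candidate.is_file():
        raise FileNotFoundError(f"mapping file not found: {candidate}")
    with candidate.open(encoding="utf-8") as handle:
        # json.JSONDecodeError (a ValueError subclass) is raised on malformed content.
        json.load(handle)
    return candidate.resolve()
```

Assuming the `romanize_mapping_path` argument behaves as described above, the checked path would then be passed as `Normalizer(romanize_mapping_path=str(check_mapping_file("my_map.json")))`, where `my_map.json` is a placeholder file name.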
Install
- pip install bnunicodenormalizer
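To confirm that the install landed in the interpreter you are actually running, the standard library can report which packages are importable without importing them. This is a minimal sketch; `missing_packages` is a hypothetical helper name, and `fasttext` is included only because it is named as a dependency above.

```python
import sys
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that the current interpreter cannot find."""
    return [name for name in names if find_spec(name) is None]


# Show which interpreter is running, then which packages it cannot see.
print(sys.executable)
print(missing_packages(["bnunicodenormalizer", "fasttext"]))
```

If `bnunicodenormalizer` appears in the printed list, run `pip install` with the interpreter shown on the first line, e.g. `python -m pip install bnunicodenormalizer`.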
Imports
- Normalizer
from bnunicodenormalizer import Normalizer
Quickstart
from bnunicodenormalizer import Normalizer
# Initialize the normalizer.
# By default, it attempts to load 'romanize_map.json' from its package directory.
bn_normalize = Normalizer()
text_to_normalize = "এই টেস্টিং টেক্সট। ১০০ টাকা ।"
result = bn_normalize(text_to_normalize)
normalized_text = result["normalized_text"]
print(f"Original: {text_to_normalize}")
print(f"Normalized: {normalized_text}")
# The result dictionary might also contain 'detected_lang'
# if fasttext is enabled and detects it.
# print(f"Detected language: {result.get('detected_lang', 'N/A')}")
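For more than one input, the same call can be applied line by line. The sketch below takes the normalizer as a parameter so it can be exercised without the library installed. `normalize_lines` is a hypothetical helper, and the `normalized_text` key is the one used in the quickstart above; pass a different `key` if your installed version returns other field names.

```python
from typing import Any, Callable, Iterable, List, Mapping, Optional


def normalize_lines(
    normalizer: Callable[[str], Mapping[str, Any]],
    lines: Iterable[str],
    key: str = "normalized_text",
) -> List[Optional[str]]:
    """Call the normalizer on each line and collect result[key] (None if absent)."""
    return [normalizer(line).get(key) for line in lines]


# Stand-in normalizer for demonstration; replace with Normalizer() in real use.
def fake_normalizer(text: str) -> dict:
    return {"normalized_text": " ".join(text.split())}


print(normalize_lines(fake_normalizer, ["১০০  টাকা", " এই   টেক্সট "]))
```

In real use, `normalize_lines(bn_normalize, lines)` would reuse the single `Normalizer` instance created in the quickstart rather than constructing one per line.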