UrduHack

1.1.1 verified Fri May 01 auth: no python maintenance

A natural language processing (NLP) library for the Urdu language that provides tokenization, normalization, lemmatization, and POS tagging. The current version is 1.1.1, released in 2020. The project appears to be in maintenance mode; its last commit was in 2021.

pip install urduhack
error ModuleNotFoundError: No module named 'tensorflow'
cause urduhack depends on TensorFlow, which is not installed automatically in some environments.
fix Run `pip install tensorflow` (or `pip install tensorflow-cpu`) before importing urduhack.
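The missing-dependency case above can be detected before the import fails. The following is a minimal sketch using only the standard library; the helper name `missing_modules` is hypothetical and is not part of urduhack.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    # find_spec returns None when a top-level module is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check the two modules this entry is about before importing either one.
missing = missing_modules(["tensorflow", "urduhack"])
if missing:
    print("Install first:", ", ".join(missing))
else:
    print("tensorflow and urduhack are both importable.")
```

Checking with `find_spec` avoids actually importing TensorFlow, which is slow to load, just to learn whether it is present.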
error AttributeError: module 'urduhack' has no attribute 'tokenize'
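cause The import path is incorrect, or the installed version differs from 1.1.1.
fix Confirm the installed version with `pip show urduhack` and list the available top-level names with `dir(urduhack)`. The 1.x documentation imports the tokenizers from the `urduhack.tokenization` module (`sentence_tokenizer` and `word_tokenizer`), so a top-level `urduhack.tokenize` may be missing from your install.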
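Checking which version is installed can also be scripted. This sketch relies on the standard-library `importlib.metadata` module (Python 3.8+); the helper name `installed_version` is hypothetical, and the `1.1.1` comparison simply mirrors the version this page describes.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("urduhack")
if version is None:
    print("urduhack is not installed")
elif version != "1.1.1":
    print(f"urduhack {version} found; this page describes 1.1.1")
else:
    print("urduhack 1.1.1 is installed")
```

Reading the package metadata does not import urduhack itself, so the check works even when TensorFlow is missing.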
gotcha The tokenizer returns a plain Python list of token strings, not spaCy-like objects, so tokens have no `.text` attribute.
fix Treat the output as a plain list of strings.
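To illustrate working with plain string tokens, here is a small sketch. The token list is written by hand for illustration and is not captured from an actual urduhack run; `token_text` is a hypothetical helper for code that must accept both plain strings and spaCy-like objects.

```python
from collections import Counter

# Hand-written stand-in for tokenizer output (an assumption, not real output).
tokens = ["اردو", "زبان", "پاکستان", "کی", "قومی", "زبان", "ہے", "۔"]


def token_text(token):
    """Return the string form of a token: the str itself, or its .text attribute."""
    return token if isinstance(token, str) else token.text


# Plain strings work directly with standard containers such as Counter.
counts = Counter(token_text(t) for t in tokens)
print(counts.most_common(3))
```

Because the tokens are ordinary strings, slicing, joining, and counting need no library-specific accessors.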
deprecated The library has not been updated since 2021 and may not work with Python 3.10 or later. Dependencies such as TensorFlow may also be outdated.
fix Consider alternatives such as `urdu-words` or `hazm` (which targets Persian rather than Urdu), or fork the library and update its dependencies.
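A script that depends on urduhack can fail early with a clear message on interpreters the library may not support. This is a sketch only: the `(3, 10)` cutoff is taken from the note above, not from a tested compatibility matrix, and `python_supported` is a hypothetical helper.

```python
import sys


def python_supported(version_info=sys.version_info, first_untested=(3, 10)):
    """Return True if the interpreter is older than the first untested release."""
    return tuple(version_info[:2]) < first_untested


if not python_supported():
    major, minor = sys.version_info[:2]
    print(f"Warning: urduhack may not work on Python {major}.{minor}")
```

Keeping the cutoff as a parameter makes it easy to adjust once a given Python version has been checked against a fork or an updated release.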
gotcha urduhack.normalize and urduhack.lemmatize download model files on first run, so an internet connection is required.
fix Run `urduhack.download()` once ahead of time, or let the functions download the models automatically on first use.

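Basic usage: download the models once, then tokenize Urdu text with the word tokenizer.

import urduhack
from urduhack.tokenization import word_tokenizer  # import path per the 1.x docs

# Fetch the model files on first run (requires an internet connection)
urduhack.download()
# Example: split a sentence into a plain list of token strings
text = "اردو زبان پاکستان کی قومی زبان ہے۔"
tokens = word_tokenizer(text)
print(tokens)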