Unidecode
Unidecode is a Python library that provides ASCII transliterations of Unicode text. It converts non-ASCII Unicode characters into their closest ASCII approximations, which is useful for tasks like generating URL slugs or integrating with legacy systems. The current version is 1.4.0, with releases occurring as improvements to transliteration tables are made, rather than on a fixed schedule.
Warnings
- breaking The output of `unidecode()` is not guaranteed to be stable across different versions of the library. Improvements to transliteration tables can cause the ASCII approximation for certain Unicode characters to change in new releases.
- gotcha Unidecode performs a context-free, character-by-character mapping and is not language-specific. Transliterations may therefore not match the linguistic rules or cultural expectations of a given language. For example, the German umlauts 'ä', 'ö', 'ü' become 'a', 'o', 'u' rather than 'ae', 'oe', 'ue', and Han characters receive a single fixed reading regardless of whether the text is Chinese or Japanese.
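- gotcha Unidecode requires a Python build with 'wide' Unicode characters (UCS-4 build) to correctly handle characters outside the Basic Multilingual Plane (BMP), such as mathematical alphanumeric symbols and emojis. 'Narrow' builds, which encode these characters as surrogate pairs, are not supported and produce incorrect transliterations. The narrow/wide distinction only exists on Python versions before 3.3; later versions always store full code points, so this matters only on legacy interpreters.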
- gotcha The `unidecode` function expects a Unicode string (Python 3 `str`) as input. Passing byte data (e.g., from reading a file in binary mode) will raise an exception or produce incorrect output. Decode bytes to `str` with the correct encoding before calling `unidecode`.
- gotcha The output of `unidecode` is a lossy approximation. Some characters map to `''` (empty string) or to a generic placeholder (like `?`), and the mapping ignores linguistic context. The transliterated output may therefore be perceived as incorrect or even offensive, so avoid exposing it directly to users without careful consideration.
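Two of the gotchas above, the language-agnostic mapping and the `str`-only input, can be worked around with a little pre-processing. The following is a minimal sketch, not part of Unidecode itself: `GERMAN_MAP`, `german_to_ascii`, and `bytes_to_ascii` are hypothetical names introduced for illustration, and the UTF-8 default is an assumption about the input data.

```python
from unidecode import unidecode

# Hypothetical German-specific table, applied BEFORE the generic mapping,
# so that umlauts become digraphs instead of bare vowels.
GERMAN_MAP = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
})


def german_to_ascii(text: str) -> str:
    """Expand German umlauts, then let unidecode handle everything else."""
    return unidecode(text.translate(GERMAN_MAP))


def bytes_to_ascii(data: bytes, encoding: str = "utf-8") -> str:
    """Decode raw bytes to str first; the encoding is an assumption about the data."""
    return unidecode(data.decode(encoding))


print(german_to_ascii("Müller"))
print(bytes_to_ascii("Łódź".encode("utf-8")))
```

Running the language-specific step first keeps the generic table as a fallback, so characters outside the custom map are still transliterated.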
Install
pip install unidecode
Imports
- unidecode
from unidecode import unidecode
Quickstart
import re

from unidecode import unidecode

# Basic transliteration
text_unicode = 'Łódź, 北京, Español'
text_ascii = unidecode(text_unicode)
print(f"Original: {text_unicode}")
print(f"Transliterated: {text_ascii}")

# Example for URL slug generation (common use case):
# transliterate, lowercase, then collapse every run of characters
# other than a-z and 0-9 (spaces, quotes, punctuation) into one hyphen.
article_title = 'The "Café" where you can find "Piñatas"!'
slug = re.sub(r'[^a-z0-9]+', '-', unidecode(article_title).lower()).strip('-')
print(f"\nURL Slug: {slug}")
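Because the output is lossy, it can help to detect characters that have no ASCII replacement instead of silently dropping them. The sketch below assumes the `errors` and `replace_str` parameters and the `UnidecodeError` exception available in recent Unidecode releases. `transliterate_strict` is a hypothetical helper name, and U+E000 (a Private Use Area character) is used only as a sample of an unmapped character.

```python
from unidecode import unidecode, UnidecodeError

# U+E000 is a Private Use Area character with no ASCII replacement.
sample = "a\ue000b"

# Default behaviour (errors="ignore"): the unmapped character is dropped.
print(repr(unidecode(sample)))

# errors="replace": substitute replace_str for the unmapped character.
print(repr(unidecode(sample, errors="replace", replace_str="?")))

# errors="preserve": keep the original character (output is no longer pure ASCII).
print(repr(unidecode(sample, errors="preserve")))


def transliterate_strict(text):
    """Return (ascii_text, None) on success, or (None, index) of the first unmapped character."""
    try:
        return unidecode(text, errors="strict"), None
    except UnidecodeError as exc:
        return None, exc.index


print(transliterate_strict(sample))
print(transliterate_strict("Café"))
```

Strict mode suits pipelines where a silently shortened string would be worse than a rejected one, such as when generating identifiers.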