SacreMoses
SacreMoses is a Python port of the widely-used Moses tokenizer, truecaser, and punctuation normalizer, used in many Natural Language Processing (NLP) pipelines, particularly machine translation workflows. The current version is 0.1.1. Releases are made periodically, delivering bug fixes, performance improvements, and closer alignment with the original Perl implementation.
Warnings
- breaking SacreMoses dropped official support for Python 2. If you are using Python 2, you must use `sacremoses==0.0.40` or an earlier version. Later versions (`sacremoses>=0.0.41`) require Python 3.
- breaking Version 0.1.0 introduced changes that can affect output, including how `use_known` works in `MosesTruecaser.truecase()` and how the order of `protected_patterns` is handled in `MosesTokenizer.tokenize()`.
- gotcha The `MosesPunctNormalizer` gained a `perl_parity:bool` argument in version 0.1.0 to align behavior with the latest Perl Moses implementation. This argument might become the default or only behavior in future releases.
- gotcha When implementing custom span tokenization by subclassing `MosesTokenizer`, be cautious with the interactions between `escape`, `unescape`, and `detokenize`: mismatched escaping between these steps can produce inconsistent round-trip results.
Install
- pip
pip install sacremoses
Imports
- MosesTokenizer
from sacremoses import MosesTokenizer
- MosesDetokenizer
from sacremoses import MosesDetokenizer
- MosesPunctNormalizer
from sacremoses import MosesPunctNormalizer
- MosesTruecaser
from sacremoses import MosesTruecaser
Quickstart
from sacremoses import MosesTokenizer, MosesDetokenizer, MosesPunctNormalizer
# Punctuation Normalization
mpn = MosesPunctNormalizer(lang='en')
text_to_normalize = 'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
normalized_text = mpn.normalize(text_to_normalize)
print(f"Normalized: {normalized_text}")
# Tokenization and Detokenization
mt = MosesTokenizer(lang='en')
md = MosesDetokenizer(lang='en')
sample_text = "Hello, world! This is a test sentence with numbers 123 and some special characters like @#$%."
tokenized_list = mt.tokenize(sample_text)
detokenized_text = md.detokenize(tokenized_list)
print(f"Original: {sample_text}")
print(f"Tokenized: {tokenized_list}")
print(f"Detokenized: {detokenized_text}")