SacreMoses

0.1.1 · active · verified Fri Apr 10

SacreMoses is a Python port of the Moses tokenizer, truecaser, and punctuation normalizer, tools commonly used for text preprocessing in Natural Language Processing (NLP), particularly in machine translation pipelines. The current version is 0.1.1. Releases are made periodically and cover bug fixes, performance improvements, and closer alignment with the original Perl implementation.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates two core features: punctuation normalization, and tokenization followed by detokenization of a sample English sentence.

from sacremoses import MosesTokenizer, MosesDetokenizer, MosesPunctNormalizer

# Punctuation Normalization
mpn = MosesPunctNormalizer(lang='en')
text_to_normalize = 'THIS EBOOK IS OTHERWISE PROVIDED TO YOU "AS-IS."'
normalized_text = mpn.normalize(text_to_normalize)
print(f"Normalized: {normalized_text}")

# Tokenization and Detokenization
mt = MosesTokenizer(lang='en')
md = MosesDetokenizer(lang='en')

sample_text = "Hello, world! This is a test sentence with numbers 123 and some special characters like @#$%."
tokenized_list = mt.tokenize(sample_text)
detokenized_text = md.detokenize(tokenized_list)

print(f"Original: {sample_text}")
print(f"Tokenized: {tokenized_list}")
print(f"Detokenized: {detokenized_text}")
