{"id":2759,"library":"sacremoses","title":"SacreMoses","description":"SacreMoses is a Python port of the widely-used Moses tokenizer, truecaser, and punctuation normalizer tools, essential for many Natural Language Processing (NLP) tasks, particularly in machine translation workflows. The current version is 0.1.1. Releases are made periodically, addressing bug fixes, performance improvements, and alignment with the original Perl implementation.","status":"active","version":"0.1.1","language":"en","source_language":"en","source_url":"https://github.com/hplt-project/sacremoses","tags":["NLP","tokenization","natural language processing","machine translation","text processing"],"install":[{"cmd":"pip install sacremoses","lang":"bash","label":"Install latest version"}],"dependencies":[],"imports":[{"note":"MosesTokenizer was previously part of NLTK but was moved to sacremoses due to licensing issues.","wrong":"from nltk.tokenize.moses import MosesTokenizer","symbol":"MosesTokenizer","correct":"from sacremoses import MosesTokenizer"},{"symbol":"MosesDetokenizer","correct":"from sacremoses import MosesDetokenizer"},{"symbol":"MosesPunctNormalizer","correct":"from sacremoses import MosesPunctNormalizer"},{"symbol":"MosesTruecaser","correct":"from sacremoses import MosesTruecaser"}],"quickstart":{"code":"from sacremoses import MosesTokenizer, MosesDetokenizer, MosesPunctNormalizer\n\n# Punctuation Normalization\nmpn = MosesPunctNormalizer(lang='en')\ntext_to_normalize = 'THIS EBOOK IS OTHERWISE PROVIDED TO YOU \"AS-IS.\"'\nnormalized_text = mpn.normalize(text_to_normalize)\nprint(f\"Normalized: {normalized_text}\")\n\n# Tokenization and Detokenization\nmt = MosesTokenizer(lang='en')\nmd = MosesDetokenizer(lang='en')\n\nsample_text = \"Hello, world! This is a test sentence with numbers 123 and some special characters like @#$%.\"\ntokenized_list = mt.tokenize(sample_text)\ndetokenized_text = md.detokenize(tokenized_list)\n\nprint(f\"Original: {sample_text}\")\nprint(f\"Tokenized: {tokenized_list}\")\nprint(f\"Detokenized: {detokenized_text}\")","lang":"python","description":"This quickstart demonstrates the core functionalities: punctuation normalization and the tokenization/detokenization of a sample English sentence."},"warnings":[{"fix":"Upgrade to Python 3.8+ or pin sacremoses to version 0.0.40.","message":"SacreMoses dropped official support for Python 2. If you are using Python 2, you must use `sacremoses==0.0.40` or an earlier version. Later versions (`sacremoses>=0.0.41`) require Python 3.","severity":"breaking","affected_versions":"<0.0.41 (Python 2)"},{"fix":"Review your code for any reliance on previous behavior, especially regarding truecasing and custom protected patterns. Retest your NLP pipeline after updating.","message":"Version 0.1.0 introduced changes that can affect output, including how `use_known` works in `MosesTruecaser.truecase()` and how the order of `protected_patterns` is handled in `MosesTokenizer.tokenize()`.","severity":"breaking","affected_versions":">=0.1.0"},{"fix":"Consider explicitly setting `perl_parity=True` in `MosesPunctNormalizer` to ensure future compatibility and consistent behavior with the latest Perl Moses. Be aware that this might subtly change normalization output.","message":"The `MosesPunctNormalizer` gained a `perl_parity:bool` argument in version 0.1.0 to align behavior with the latest Perl Moses implementation. This argument might become the default or only behavior in future releases.","severity":"gotcha","affected_versions":">=0.1.0"},{"fix":"Thoroughly test any custom span tokenization logic, especially when dealing with escaped characters or complex detokenization scenarios. Consult existing implementations if available.","message":"When implementing custom span tokenization by subclassing `MosesTokenizer`, be cautious with the `escape`, `unescape`, and `detokenize` interactions, as they can sometimes lead to flaky results.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}