LaTeX Codec
latexcodec is a Python library providing a lexer and codec for converting text between LaTeX markup and Unicode. It is suited to short segments of LaTeX, such as paragraphs or entries in a BibTeX file, and does not compile full LaTeX documents. The current stable version is 3.0.1; the project is maintained, with infrequent releases.
Warnings
- breaking: Versions prior to 3.0.1 are incompatible with Python 3.13+ due to the removal of `pkg_resources.open_text`. Users on Python 3.13 and newer must upgrade to `latexcodec` 3.0.1 or later.
- deprecated: The maintainer strongly encourages users to consider `pylatexenc` as a superior alternative to `latexcodec` for LaTeX code processing.
- gotcha: This library is designed for processing short fragments of LaTeX text (e.g., paragraphs, BibTeX entries). It is not a LaTeX compiler and is not intended for processing whole documents.
- gotcha: When decoding LaTeX, commands that do not directly represent characters (e.g., macros, formatting commands like `\textbf`) or that the codec does not recognize are passed through unchanged. The result can be a 'hybrid' Unicode string that still contains unexpanded LaTeX commands.
- gotcha: Encoding Unicode to LaTeX raises an error under the default `strict` error handler when a character has no known LaTeX translation and cannot be represented in the input encoding (ASCII by default). For more robust encoding, use the `ulatex+utf8` codec or pass the `'keep'` error handler to `ulatex` to retain unencodable characters.
- gotcha: Decoding canonicalizes certain LaTeX elements: comments are dropped, paragraph breaks become double newlines, and spacing after LaTeX commands is standardized. The decoded text can therefore differ subtly in structure from the original LaTeX source.
Install
pip install latexcodec
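Since releases before 3.0.1 do not work on Python 3.13+ (see Warnings), it may be worth checking the installed version after installing. The following is a minimal sketch using only the standard library. `parse_version` and `needs_upgrade` are hypothetical helper names, not part of latexcodec, and the parser assumes plain `X.Y.Z` version strings without pre-release suffixes.

```python
import sys
from importlib import metadata

# Minimum latexcodec version for Python 3.13+, taken from the Warnings section.
MIN_VERSION_FOR_PY313 = (3, 0, 1)


def parse_version(text):
    """Parse a plain 'X.Y.Z' string into a tuple of ints (assumption: no suffixes)."""
    return tuple(int(part) for part in text.split(".")[:3])


def needs_upgrade(installed, python_version=None):
    """Return True if `installed` is too old for the given Python version."""
    if python_version is None:
        python_version = sys.version_info[:2]
    if tuple(python_version[:2]) < (3, 13):
        return False  # the warning only concerns Python 3.13 and newer
    return parse_version(installed) < MIN_VERSION_FOR_PY313


if __name__ == "__main__":
    try:
        installed = metadata.version("latexcodec")
    except metadata.PackageNotFoundError:
        print("latexcodec is not installed")
    else:
        status = "upgrade needed" if needs_upgrade(installed) else "ok"
        print(f"latexcodec {installed}: {status}")
```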
Imports
- latexcodec
import latexcodec
import codecs # Use codecs.decode() and codecs.encode()
Quickstart
import codecs
import latexcodec # This registers the 'latex' and 'ulatex' codecs
# Decode LaTeX to Unicode
latex_text = r"I like b\"all{\oe}ns and M\"uller."
unicode_output = codecs.decode(latex_text, "ulatex")
print(f"Decoded LaTeX: {unicode_output}")
# Encode Unicode to LaTeX
unicode_input = "élève"
latex_output = codecs.encode(unicode_input, "ulatex")
print(f"Encoded Unicode: {latex_output}")
# Example with specific encoding (e.g., Latin-1)
latin1_latex_bytes = b"\xfe" # Represents 'þ' in Latin-1
decoded_latin1 = latin1_latex_bytes.decode("latex+latin1")
print(f"Decoded Latin1 LaTeX: {decoded_latin1}")
# Example with error handling during encoding for unrepresentable characters
unicode_with_unrepresentable = "A keyboard: ⌨"
# Using 'keep' error handler with 'ulatex' codec
encoded_kept = codecs.encode(unicode_with_unrepresentable, "ulatex", "keep")
print(f"Encoded with 'keep' error (ulatex): {encoded_kept}")
# Using 'ulatex+utf8' for robust encoding of all Unicode characters
encoded_utf8 = codecs.encode(unicode_with_unrepresentable, "ulatex+utf8")
print(f"Encoded with ulatex+utf8: {encoded_utf8}")