Pandoc Documents for Python
Pandoc is a powerful, open-source command-line tool for converting documents between various formats (e.g., Markdown, HTML, LaTeX, PDF, Word). The `pandoc` Python library (version 2.4, released August 7, 2024) provides Python bindings to interact with Pandoc's document model, allowing for in-Python analysis, creation, and transformation of documents. It leverages the underlying Haskell-based Pandoc executable, which must be installed separately. The library generally follows an active release cadence, with updates to support recent Pandoc executable versions.
Warnings
- breaking The Python `pandoc` library is a thin wrapper and does not bundle the Pandoc executable. Users MUST install the Pandoc command-line tool separately (e.g., via `conda install pandoc`, `sudo apt install pandoc`, or `brew install pandoc`). Failure to do so will result in runtime errors as the Python library will not find the `pandoc` binary.
- gotcha The `pandoc` Python library should not be confused with `pypandoc`. While both are Python wrappers for Pandoc, `pypandoc` offers a `pypandoc_binary` package that bundles the Pandoc executable, whereas `pandoc` (this library) always requires a separate installation of the underlying Pandoc tool.
- gotcha When programmatically interacting with the Pandoc executable (e.g., via Python's `subprocess` module or the `pandoc` library's underlying calls), command-line arguments, especially those with values, must be passed as distinct items in a list. Combining an option and a shell-quoted value into a single string (e.g., `'-Vtitle="My Title"'`) can lead to incorrect parsing by Pandoc. Instead, use `['-V', 'title=My Title']`: no shell is involved when arguments are passed as a list, so embedded quotes would reach Pandoc literally and become part of the value.
- breaking The underlying Pandoc executable (version 3.1 and later) changed how it parses code block attributes. The syntax `` ```{lang} `` (without a leading dot) is no longer interpreted as a language class but as a literal string. The correct syntax for specifying a language class is `` ```{.lang} ``. This change in the Pandoc executable can affect how the Python `pandoc` library processes Markdown documents.
- gotcha The Python `pandoc` library is tested against specific versions of the Pandoc executable. While it might issue a warning for unsupported Pandoc executable versions instead of failing, using an incompatible version could lead to unexpected behavior or incorrect document transformations due to differences in the underlying document model.
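The argument-list warning above can be sketched with the standard library alone. This is a minimal illustration, not the library's own code: the helper name `build_pandoc_args` and the file names are hypothetical, and only Pandoc's `-o` and `-V KEY=VALUE` options are assumed.

```python
import shlex


def build_pandoc_args(input_path, output_path, variables=None):
    """Build a pandoc command as a list, one list item per argument.

    `build_pandoc_args` is a hypothetical helper written for this sketch.
    """
    args = ["pandoc", input_path, "-o", output_path]
    for name, value in (variables or {}).items():
        # The option and its value are separate items, and the value carries
        # no shell quotes: nothing strips them when no shell is involved.
        args += ["-V", f"{name}={value}"]
    return args


args = build_pandoc_args("in.md", "out.html", {"title": "My Title"})
print(args)
# For logging only: the equivalent shell command line, quoted by shlex.
print(shlex.join(args))
```

A list built this way can be handed directly to `subprocess.run(args, check=True)` once the Pandoc executable is installed.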
Install
- pip install --upgrade pandoc
- conda install -c conda-forge pandoc
- sudo apt install pandoc
- brew install pandoc
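Because the Python package does not bundle the executable, it can help to confirm that a `pandoc` binary is reachable before using the library. The following is a small sketch using only the standard library; the function names are hypothetical and chosen for this example.

```python
import shutil
import subprocess


def find_pandoc():
    """Return the path of the `pandoc` executable on PATH, or None if absent."""
    return shutil.which("pandoc")


def pandoc_version(executable):
    """Return the first line printed by `pandoc --version`."""
    result = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[0]


def describe_pandoc():
    """Summarize whether the executable is installed and which version it reports."""
    path = find_pandoc()
    if path is None:
        return "pandoc executable not found on PATH; install it separately"
    return f"{path}: {pandoc_version(path)}"


if __name__ == "__main__":
    print(describe_pandoc())
```

Running the script prints either the location and reported version of the executable or a reminder that it still needs to be installed.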
Imports
- pandoc
import pandoc
- types
from pandoc.types import Str, Space, Para, Meta
Quickstart
import pandoc
from pandoc.types import Str, Space, Para, Meta
# Read a simple markdown string into a Pandoc document object
text = "Hello world!"
doc = pandoc.read(text)
print(f"Initial document: {doc}")
# Access and modify an element in the document's Abstract Syntax Tree (AST)
# For "Hello world!", doc is Pandoc(Meta({}), [Para([Str('Hello'), Space(), Str('world!')])])
# doc[1] is the list of blocks, so the paragraph is at doc[1][0]
# The 'world!' text is at doc[1][0][0][2][0]
paragraph = doc[1][0]
# Replace the text 'world!' with 'Python!'
# paragraph[0] is the list of inlines: 0: Str('Hello'), 1: Space(), 2: Str('world!')
# The text is the first argument of the Str element: Str('world!')[0]
paragraph[0][2][0] = 'Python!'
# Write the modified document back to a markdown string
modified_text = pandoc.write(doc)
print(f"Modified document text: {modified_text.strip()}")
# Example of converting to a different format (like read and write above, this calls the pandoc executable)
# doc_to_convert = pandoc.read("# My Title\n\nHello from Pandoc!", format='markdown')
# html_output = pandoc.write(doc_to_convert, format='html')
# print(f"HTML output:\n{html_output}")