rtfparse: RTF Parser

0.9.5 · active · verified Wed Apr 15

rtfparse is a Python library for parsing Microsoft Rich Text Format (RTF) documents. It builds an in-memory object that represents the document's tree structure, and a renderer then turns that object into output. The library currently provides one renderer (`HTML_Decapsulator`), which extracts HTML encapsulated in RTF. This is particularly useful for HTML-formatted emails saved by Microsoft Outlook, where the HTML body is often stored inside (frequently compressed) RTF. The design is meant to support custom renderers for other RTF processing needs. The library is actively maintained: version 0.9.5 focuses on bug fixes and documentation, and image embedding is planned for version 1.x.
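To make the idea of a "tree structure plus renderer" concrete, here is a minimal standard-library sketch. It is not rtfparse's implementation or API: the `Group` class and the `parse_groups` and `plain_text` functions are hypothetical names chosen for illustration. The toy parser assumes well-formed input with balanced braces, and it ignores destinations, encodings, and most other RTF rules.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    """A brace-delimited RTF group. Children are nested Groups or strings."""
    children: list = field(default_factory=list)


def parse_groups(rtf: str) -> Group:
    """Build a nested Group tree from RTF text (toy parser, balanced braces assumed)."""
    root = Group()
    stack = [root]
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "{":
            # Open a new group nested inside the current one
            group = Group()
            stack[-1].children.append(group)
            stack.append(group)
            i += 1
        elif ch == "}":
            # Close the current group
            stack.pop()
            i += 1
        elif ch == "\\":
            j = i + 1
            while j < len(rtf) and rtf[j].isalpha():
                j += 1
            if j == i + 1:
                # Control symbol: a backslash plus one non-letter character
                j = min(i + 2, len(rtf))
                stack[-1].children.append(rtf[i:j])
                i = j
                continue
            # Control word: letters, then an optional numeric parameter
            while j < len(rtf) and (rtf[j].isdigit() or rtf[j] == "-"):
                j += 1
            stack[-1].children.append(rtf[i:j])
            # A single space after a control word is a delimiter, not text
            if j < len(rtf) and rtf[j] == " ":
                j += 1
            i = j
        else:
            # Plain text runs until the next brace or backslash
            j = i
            while j < len(rtf) and rtf[j] not in "{}\\":
                j += 1
            stack[-1].children.append(rtf[i:j])
            i = j
    return root.children[0]


def plain_text(node) -> str:
    """A tiny 'custom renderer': keep text, map \\par to a newline, drop other controls."""
    if isinstance(node, Group):
        return "".join(plain_text(child) for child in node.children)
    if node == "\\par":
        return "\n"
    return "" if node.startswith("\\") else node


sample = r"{\rtf1\ansi\deff0 This is some {\b bold} text and a line break.\par This is a new line.}"
tree = parse_groups(sample)
print(plain_text(tree))
```

Separating the tree from the function that walks it is what allows several renderers to share one parse. A second renderer would be another function over the same `Group` tree.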

Warnings

Install

Imports

Quickstart

This quickstart parses an RTF document programmatically and then uses the library's HTML decapsulation renderer to extract any embedded HTML. For demonstration purposes, the RTF string is first written to a temporary file. The extracted HTML is saved to a second temporary file and then printed to the console.

import tempfile
from pathlib import Path

from rtfparse.parser import Rtf_Parser
from rtfparse.renderers.html_decapsulator import HTML_Decapsulator

# Create a dummy RTF file for demonstration.
# In a real scenario, you would read an existing RTF file.
# Note: this sample is plain RTF without encapsulated HTML (no \fromhtml1 in its header),
# so the renderer may produce little or no meaningful HTML for it.
rtf_content = r"{\rtf1\ansi\deff0 This is some {\b bold} text and a line break.\par This is a new line.}"
temp_dir = Path(tempfile.gettempdir())
temp_rtf_file = temp_dir / "dummy_example.rtf"
temp_html_file = temp_dir / "extracted_example.html"

try:
    # Write the dummy RTF content to a temporary file
    temp_rtf_file.write_text(rtf_content, encoding="ascii")
    print(f"Created temporary RTF file: {temp_rtf_file}")

    # Programmatic usage: parse the RTF file into an in-memory tree
    parser = Rtf_Parser(rtf_path=temp_rtf_file)
    parsed_document = parser.parse_file()

    # Render the parsed tree, writing the extracted HTML to the output file
    renderer = HTML_Decapsulator()
    with open(temp_html_file, mode="w", encoding="utf-8") as html_file:
        renderer.render(parsed_document, html_file)

    print(f"RTF parsed and HTML extracted to: {temp_html_file}")

    # Display the extracted HTML content
    print("\nExtracted HTML content:")
    print(temp_html_file.read_text(encoding="utf-8"))

except Exception as e:
    # Broad catch keeps the demo simple; narrow this in real code
    print(f"An error occurred: {e}")
finally:
    # Clean up temporary files
    for temp_file in (temp_rtf_file, temp_html_file):
        if temp_file.exists():
            temp_file.unlink()
            print(f"Cleaned up {temp_file}")
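The `rtf_content` string in the quickstart is ordinary RTF, not RTF that wraps HTML. Microsoft's RTF encapsulation specification (MS-OXRTFEX) has encapsulating documents announce themselves with the `\fromhtml1` control word in the header, so a cheap pre-check before rendering can be sketched with the standard library alone. This sketch is independent of rtfparse: the function name `has_encapsulated_html`, the 1024-character header window, and the hand-written `encapsulated` sample are all assumptions for illustration.

```python
import re

# Matches the \fromhtml1 control word, but not a longer parameter such as \fromhtml10
_FROMHTML = re.compile(r"\\fromhtml1(?!\d)")


def has_encapsulated_html(rtf_text: str, header_chars: int = 1024) -> bool:
    """Return True if the start of the RTF text contains the fromhtml1 control word."""
    return bool(_FROMHTML.search(rtf_text[:header_chars]))


# Plain RTF, as in the quickstart: no encapsulated HTML
plain = r"{\rtf1\ansi\deff0 This is some {\b bold} text and a line break.\par This is a new line.}"

# Hand-written illustrative sample of an encapsulating header (not taken from a real email)
encapsulated = r"{\rtf1\ansi\ansicpg1252\fromhtml1 \deff0 {\*\htmltag64 <p>}Hello{\*\htmltag72 </p>}}"

print(has_encapsulated_html(plain))
print(has_encapsulated_html(encapsulated))
```

A check like this lets a script skip or flag documents that have no HTML to extract, instead of writing an empty or misleading output file.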
