{"id":6864,"library":"rtfparse","title":"rtfparse: RTF Parser","description":"rtfparse is a Python library for parsing Microsoft Rich Text Format (RTF) documents. It constructs an in-memory object representing the RTF document's tree structure. The library currently provides a renderer (HTML_Decapsulator) that extracts HTML encapsulated in RTF, which is particularly useful for HTML-formatted emails saved by Microsoft Outlook, whose RTF bodies are often compressed. It aims to support custom renderers for diverse RTF processing needs. The library is actively maintained: version 0.9.5 focuses on bug fixes and documentation, and image embedding is planned for version 1.x.","status":"active","version":"0.9.5","language":"en","source_language":"en","source_url":"https://github.com/fleetingbytes/rtfparse","tags":["RTF","parser","document parsing","HTML extraction","Microsoft Rich Text Format"],"install":[{"cmd":"pip install rtfparse","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Required for CLI usage to process MS Outlook message files (.msg) that contain RTF content.","package":"extract-msg","optional":true},{"reason":"Required for CLI usage to decompress RTF content from MS Outlook message files.","package":"compressed_rtf","optional":true},{"reason":"Provides command-line argument completion for the rtfparse executable.","package":"argcomplete","optional":true}],"imports":[{"symbol":"Rtf_Parser","correct":"from rtfparse.parser import Rtf_Parser"},{"symbol":"HTML_Decapsulator","correct":"from rtfparse.renderers.html_decapsulator import HTML_Decapsulator"}],"quickstart":{"code":"import pathlib\nfrom rtfparse.parser import Rtf_Parser\nfrom rtfparse.renderers.html_decapsulator import HTML_Decapsulator\nimport os\nimport tempfile\n\n# Create a dummy RTF file for demonstration\n# In a real scenario, you would read an existing RTF file.\nrtf_content = r\"{\\rtf1\\ansi\\deff0 This is some {\\b bold} text and a line break.\\par This is 
a new line.}\"\ntemp_rtf_file = pathlib.Path(tempfile.gettempdir()) / \"dummy_example.rtf\"\ntemp_html_file = pathlib.Path(tempfile.gettempdir()) / \"extracted_example.html\"\n\ntry:\n    # Write dummy RTF content to a temporary file\n    with open(temp_rtf_file, \"w\", encoding=\"ascii\") as f:\n        f.write(rtf_content)\n\n    print(f\"Created temporary RTF file: {temp_rtf_file}\")\n\n    # Programmatic usage: Parse the RTF file\n    parser = Rtf_Parser(rtf_path=temp_rtf_file)\n    parsed_document = parser.parse_file()\n\n    # Render the parsed RTF to extract HTML content\n    renderer = HTML_Decapsulator()\n    with open(temp_html_file, mode=\"w\", encoding=\"utf-8\") as html_file:\n        renderer.render(parsed_document, html_file)\n\n    print(f\"RTF parsed and HTML extracted to: {temp_html_file}\")\n\n    # Display the extracted HTML content\n    with open(temp_html_file, \"r\", encoding=\"utf-8\") as f:\n        print(\"\\nExtracted HTML content:\")\n        print(f.read())\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\nfinally:\n    # Clean up temporary files\n    if temp_rtf_file.exists():\n        os.remove(temp_rtf_file)\n        print(f\"Cleaned up {temp_rtf_file}\")\n    if temp_html_file.exists():\n        os.remove(temp_html_file)\n        print(f\"Cleaned up {temp_html_file}\")","lang":"python","description":"This quickstart demonstrates how to programmatically parse an RTF string (written to a temporary file for demonstration) and then use the `HTML_Decapsulator` renderer to extract any embedded HTML content, saving it to another temporary file. The extracted HTML is then printed to the console. In practice, the input would be an existing RTF file that encapsulates HTML, such as one saved by Microsoft Outlook."},"warnings":[{"fix":"Users can press 'A' for automatic configuration during the first run or manually configure settings if preferred. 
For programmatic use, this initial setup does not directly interfere with script execution, but logs might still be generated.","message":"The first execution of the `rtfparse` executable (CLI) will trigger a configuration wizard. This wizard creates a `.rtfparse` folder in the user's home directory to store configuration files and logs. This automatic setup might be unexpected for some users.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For precise styling, manual CSS application or a more advanced RTF-to-HTML conversion library capable of interpreting and translating RTF formatting directives might be necessary. This library focuses on decapsulating HTML that is already embedded in the RTF, not on rendering RTF formatting as HTML.","message":"The `HTML_Decapsulator` primarily extracts raw HTML embedded within the RTF structure. It does not fully re-apply RTF-specific styles (such as bold, font size, or font face) when producing plain HTML, so the output can lose some of the visual formatting shown by the original RTF document.","severity":"gotcha","affected_versions":"All versions up to 0.9.5"},{"fix":"Users needing to embed images should await the 1.x release or implement custom logic to handle image embedding after HTML extraction.","message":"The `--embed-img` option for the `rtfparse` command-line interface, intended for embedding images into decapsulated HTML, is currently non-functional in all 0.x.x versions. This feature is planned for implementation in `rtfparse` version 1.x.","severity":"gotcha","affected_versions":"All 0.x.x versions up to 0.9.5"},{"fix":"Always ensure that the `rtf_path` provided to `Rtf_Parser` is a valid and accessible `pathlib.Path` object pointing to an existing RTF file. Use `pathlib.Path.exists()` for verification or implement robust error handling.","message":"When providing file paths to `Rtf_Parser`, incorrect or non-existent paths will lead to a `FileNotFoundError`. 
This is a common pitfall, especially when dealing with dynamic paths or different operating system conventions.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Simplify RTF documents where possible, especially if the primary goal is HTML conversion. Be aware that precise replication of all RTF formatting in HTML is often difficult across different parsing and rendering engines.","message":"RTF documents can contain complex elements (e.g., text boxes, columns, embedded images, headers/footers, nested tables) that are challenging to convert accurately to HTML. When parsing such complex RTF files, the resulting HTML might show significant formatting errors, missing elements, or layout problems.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}