RTFDE - RTF Encapsulated HTML Extractor
RTFDE (RTF De-Encapsulator) is a Python library that extracts HTML content from RTF-encapsulated HTML, a format commonly found in Exchange MSG email files. It parses and de-encapsulates the RTF and works on raw byte input. The library is currently at version 0.1.2.2 and is actively maintained, with regular bug fixes and minor updates.
Warnings
- breaking Starting with version 0.1.0, the library accepts only `bytes` as input. Earlier versions also accepted string input, which is no longer supported.
- gotcha Versions of `rtfde` prior to 0.1.2.2 contained a bug (Issue #34) where invalid Unicode escape sequences within RTF byte strings could lead to parsing errors or incorrect output.
- gotcha The library is designed to extract HTML from complex and sometimes malformed RTF structures, especially those found in email attachments. Poorly formed RTF input may still yield partial, incorrect, or no HTML.
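The bytes-only requirement in the first warning can be handled at the boundary of your own code. The helper below is a minimal sketch and not part of RTFDE: the name `to_rtf_bytes` and the default `ascii` encoding are assumptions made for illustration (RTF is normally 7-bit ASCII, but a string decoded with another codec should be re-encoded with that same codec).

```python
def to_rtf_bytes(data, encoding="ascii"):
    """Return RTF input as bytes (hypothetical helper, not an RTFDE API).

    bytes-like input is passed through; str input is encoded with `encoding`.
    """
    if isinstance(data, (bytes, bytearray, memoryview)):
        return bytes(data)
    if isinstance(data, str):
        # Raises UnicodeEncodeError if the text does not fit the encoding,
        # which is safer than silently corrupting the RTF.
        return data.encode(encoding)
    raise TypeError(f"expected bytes or str, got {type(data).__name__}")
```

Where possible, it is simpler to avoid the conversion altogether by reading the RTF in binary mode (`open(path, 'rb')`).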
Install
pip install rtfde
Imports
- DeEncapsulator
from RTFDE.deencapsulate import DeEncapsulator
Quickstart
from RTFDE.deencapsulate import DeEncapsulator

# The input must be bytes, so open the file in binary mode.
# 'message_body.rtf' is a placeholder for an RTF body extracted from an MSG file.
with open('message_body.rtf', 'rb') as fp:
    raw_rtf = fp.read()

rtf_obj = DeEncapsulator(raw_rtf)  # the raw RTF is passed to the constructor
rtf_obj.deencapsulate()            # parses the RTF and populates the results
if rtf_obj.content_type == 'html':
    print(rtf_obj.html)
else:
    print(rtf_obj.text)
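Because malformed or non-encapsulated RTF can produce partial or no output, a cheap pre-check before parsing may help when processing many messages. The sketch below relies only on the fact that encapsulated HTML declares the `\fromhtml1` control word in the RTF header; the function name `looks_like_encapsulated_html` and the 1024-byte header window are assumptions for illustration, and this heuristic is not RTFDE's own validation.

```python
def looks_like_encapsulated_html(raw_rtf, header_window=1024):
    """Heuristic check (hypothetical helper, not an RTFDE API).

    Returns True if the start of `raw_rtf` looks like an RTF document
    whose header declares encapsulated HTML via \\fromhtml1.
    """
    if not isinstance(raw_rtf, (bytes, bytearray)):
        raise TypeError("raw_rtf must be bytes")
    # Only inspect the beginning of the document; the window size is arbitrary.
    head = bytes(raw_rtf[:header_window]).lstrip()
    return head.startswith(b"{\\rtf1") and b"\\fromhtml1" in head
```

A `False` result only means the input is unlikely to contain encapsulated HTML; a `True` result does not guarantee that de-encapsulation will succeed.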