inscriptis
Inscriptis (v2.7.1) is a Python-based HTML-to-text conversion library, command-line client, and web service. It produces layout-aware text representations of HTML content, with support for nested tables and a subset of CSS, and can optionally emit annotated output. The library is actively maintained, with regular releases that add support for new Python versions and new features.
Warnings
- breaking The `XmlAnnotationProcessor` (introduced in 2.6.0) now requires a root element; by default the generated XML is wrapped in a `<content>` root element. If you used this processor directly, the structure of your XML output will change.
- deprecated Support for Python 3.9 has been removed as of version 2.7.0. Python 3.8 support was deprecated in 2.5.1 and subsequently removed.
- gotcha When processing very complex HTML pages, `inscriptis` (which uses `lxml` internally) may exhibit increased memory consumption due to `lxml`'s tendency to reuse memory rather than releasing it back to the operating system.
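One common way to work around the memory behaviour described in the last warning is to run each large conversion in a short-lived child process, so that whatever memory `lxml` holds on to is returned to the operating system when the child exits. The following is a minimal sketch of that idea, not a feature of inscriptis itself; the helper name `get_text_isolated` and its `timeout` parameter are hypothetical.

```python
import subprocess
import sys

# Code executed in the child interpreter: read HTML from stdin,
# convert it with inscriptis, and write the text to stdout.
_CHILD_CODE = (
    "import sys\n"
    "from inscriptis import get_text\n"
    "sys.stdout.write(get_text(sys.stdin.read()))\n"
)


def get_text_isolated(html: str, timeout: float = 60.0) -> str:
    """Convert HTML to text in a separate Python process (hypothetical helper).

    All memory allocated during the conversion belongs to the child
    process and is released when that process terminates.
    """
    result = subprocess.run(
        # "-X utf8" makes the child use UTF-8 for stdin/stdout.
        [sys.executable, "-X", "utf8", "-c", _CHILD_CODE],
        input=html,
        capture_output=True,
        encoding="utf-8",
        check=True,        # raise CalledProcessError if the child fails
        timeout=timeout,   # avoid hanging on pathological input
    )
    return result.stdout


if __name__ == "__main__":
    print(get_text_isolated("<html><body><p>Hello <b>world</b></p></body></html>"))
```

Starting an interpreter per document adds noticeable overhead, so this trade-off probably only pays off for very large pages or long-running services.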
Install
- pip install inscriptis
- pip install inscriptis[web-service]
Imports
- get_text
from inscriptis import get_text
Quickstart
import urllib.request

from inscriptis import get_text

url = "https://www.informationscience.ch"

# Download the page; fall back to a small error document if the request fails.
try:
    with urllib.request.urlopen(url) as response:
        html_content = response.read().decode('utf-8')
except Exception as e:
    html_content = f"<html><body><p>Error fetching URL: {e}</p></body></html>"

# Convert the HTML to a layout-aware plain-text representation.
text = get_text(html_content)
print(text)