{"id":4055,"library":"inscriptis","title":"inscriptis","description":"Inscriptis is a Python-based HTML to text conversion library, command line client, and Web service (v2.7.1). It specializes in providing high-quality, layout-aware text representations of HTML content, including support for nested tables and a subset of CSS, and offers optional annotated output. The library is actively maintained with regular releases addressing new Python versions and feature enhancements.","status":"active","version":"2.7.1","language":"en","source_language":"en","source_url":"https://github.com/weblyzard/inscriptis","tags":["html-to-text","converter","text-extraction","nlp","web-scraping"],"install":[{"cmd":"pip install inscriptis","lang":"bash","label":"Core library"},{"cmd":"pip install inscriptis[web-service]","lang":"bash","label":"With Web service (FastAPI/Uvicorn)"}],"dependencies":[{"reason":"Used for fetching web content.","package":"requests","optional":false},{"reason":"HTML parsing backend.","package":"lxml","optional":false},{"reason":"Required for the optional web-service.","package":"fastapi","optional":true},{"reason":"Required for the optional web-service.","package":"uvicorn","optional":true}],"imports":[{"symbol":"get_text","correct":"from inscriptis import get_text"}],"quickstart":{"code":"import urllib.request\nfrom inscriptis import get_text\n\nurl = \"https://www.informationscience.ch\"\ntry:\n    with urllib.request.urlopen(url) as response:\n        html_content = response.read().decode('utf-8')\nexcept Exception as e:\n    html_content = f\"<html><body><p>Error fetching URL: {e}</p></body></html>\"\n\ntext = get_text(html_content)\nprint(text)","lang":"python","description":"Convert HTML from a URL to plain text, preserving layout and structure. The example fetches content from 'https://www.informationscience.ch' and prints its text representation."},"warnings":[{"fix":"If using `XmlAnnotationProcessor`, be aware of the new `<content>` root element. 
The element name can be overridden by passing the `root_element` parameter to the processor call.","message":"As of 2.6.0, the `XmlAnnotationProcessor` always emits a root element, named `<content>` by default. If you consume this processor's XML output directly, expect its structure to change.","severity":"breaking","affected_versions":">=2.6.0"},{"fix":"Upgrade to Python 3.10 or newer (below 3.15) to stay compatible with `inscriptis`.","message":"Support for Python 3.9 was removed in version 2.7.0. Python 3.8 support was deprecated in 2.5.1 and subsequently removed.","severity":"deprecated","affected_versions":">=2.7.0 (for Python 3.9), >=2.5.1 (for Python 3.8)"},{"fix":"For long-running services that process many complex HTML documents, monitor memory usage and consider restarting processes periodically or simplifying the HTML input where possible. This is a characteristic of `lxml` rather than a bug in `inscriptis` itself.","message":"When processing very complex HTML pages, `inscriptis` (which uses `lxml` internally) may show increased memory consumption because `lxml` tends to reuse memory rather than release it back to the operating system.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}