{"id":3513,"library":"html-text","title":"html-text","description":"html-text is a Python library that extracts clean, readable plain text from HTML content. It goes beyond simple text extraction by removing content that is not normally visible to users, such as inline styles, JavaScript, and comments. It normalizes whitespace and, by default, adds newlines after block-level elements (e.g., headers, paragraphs) so that the output more closely resembles browser rendering, which makes it suitable for text classification or further natural language processing. The current version is 0.7.1, and the project is actively maintained.","status":"active","version":"0.7.1","language":"en","source_language":"en","source_url":"https://github.com/zytedata/html-text","tags":["html","text extraction","web scraping","cleaning","lxml","nlp"],"install":[{"cmd":"pip install html-text","lang":"bash","label":"Install with pip"}],"dependencies":[{"reason":"Core dependency for HTML parsing; may require system-level development packages for installation.","package":"lxml","optional":false}],"imports":[{"symbol":"extract_text","correct":"from html_text import extract_text"},{"symbol":"parse_html","correct":"from html_text import parse_html"},{"symbol":"cleaner","correct":"from html_text import cleaner"},{"symbol":"etree_to_text","correct":"from html_text import etree_to_text"},{"symbol":"cleaned_selector","correct":"from html_text import cleaned_selector"}],"quickstart":{"code":"import html_text\n\nhtml_content = '<h1>Hello</h1><p>This is a <b>paragraph</b> with <span>inline</span> text.</p>'\n\n# Default behaviour: layout guessing inserts newlines after block-level elements\nplain_text = html_text.extract_text(html_content)\nprint(plain_text)\n\n# To get text without layout-driven newlines (e.g., after h1)\nplain_text_flat = html_text.extract_text(html_content, guess_layout=False)\nprint(plain_text_flat)","lang":"python","description":"Demonstrates basic usage of `html_text.extract_text` to convert an HTML string into plain text, including an example of disabling layout guessing for flatter output."},"warnings":[{"fix":"For flatter output, call `html_text.extract_text(html_string, guess_layout=False)`.","message":"By default, `html_text.extract_text()` attempts to 'guess layout' and inserts newlines after block-level HTML elements (e.g., <h1>, <p>) to improve readability. If flat, single-line text output is desired, explicitly set `guess_layout=False`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Always clean manually before using lower-level extraction functions, e.g., `cleaned_tree = html_text.cleaner.clean_html(tree)` or `cleaned_sel = html_text.cleaned_selector(html_content)`.","message":"When working with pre-parsed `lxml.html.HtmlElement` trees or `parsel.Selector` objects, the lower-level functions `html_text.etree_to_text()` and `html_text.selector_to_text()` do NOT clean the HTML automatically. Clean the input first with `html_text.cleaner.clean_html()` or `html_text.cleaned_selector()`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Refer to the `lxml` installation guide (often linked from its PyPI page or documentation) for the system dependencies required before running `pip install lxml` or `pip install html-text`.","message":"The `html-text` library depends on `lxml`. Installing `lxml` can be difficult on some operating systems: compiling its C extensions may require system-level development packages (e.g., `libxml2-dev` and `libxslt-dev` on Debian/Ubuntu, or Xcode Command Line Tools on macOS).","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}