{"id":2086,"library":"justext","title":"justext","description":"justext is a heuristic-based boilerplate removal tool for HTML documents. It extracts the main content from web pages, discarding navigation, advertisements, and other extraneous elements. The current version is 3.0.2, and it typically releases updates for bug fixes and compatibility issues.","status":"active","version":"3.0.2","language":"en","source_language":"en","source_url":"https://github.com/miso-belica/jusText","tags":["web scraping","html parsing","boilerplate removal","nlp","text extraction"],"install":[{"cmd":"pip install justext","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core dependency for HTML parsing and DOM manipulation. Changes in lxml can affect justext's parsing capabilities.","package":"lxml","optional":false}],"imports":[{"note":"The primary function is justext.justext(). Accessing it via 'import justext' and then 'justext.justext()' is the most common pattern.","wrong":"from justext import justext # While technically possible, justext is often used as a module directly, and justext.justext() is the main function.","symbol":"justext","correct":"import justext"}],"quickstart":{"code":"import requests\nimport justext\n\n# Example URL (replace with a real URL for actual testing)\nurl = \"https://www.python.org\"\n\ntry:\n    response = requests.get(url, timeout=5)\n    # justext expects bytes as input\n    html_content = response.content\n\n    # Get the English stoplist\n    stoplist = justext.get_stoplist(\"English\")\n\n    # Process the HTML content\n    paragraphs = justext.justext(html_content, stoplist)\n\n    print(f\"Extracted text from {url}:\")\n    for paragraph in paragraphs:\n        if not paragraph.is_boilerplate:\n            print(paragraph.text)\n\nexcept requests.exceptions.RequestException as e:\n    print(f\"Error fetching URL: {e}\")\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")","lang":"python","description":"This quickstart demonstrates how to fetch an HTML document using `requests`, then pass its raw byte content to `justext.justext()` along with a predefined stoplist (e.g., 'English') to extract and print human-readable text, filtering out boilerplate."},"warnings":[{"fix":"Upgrade to Python 3.5 or newer. For projects requiring older Python versions, consider pinning justext to < 3.0.0 (e.g., `justext<3.0.0`).","message":"justext v3.0.0 dropped support for Python 3.4 and older versions (including Python 2.x). Attempts to install or run on these versions will fail or lead to unexpected behavior.","severity":"breaking","affected_versions":"< 3.0.0 (running on Python 3.4- or 2.x)"},{"fix":"Always provide HTML content as bytes. If you have a string, ensure it's encoded correctly before passing it: `html_string.encode('utf-8')`.","message":"The `justext.justext()` function expects raw HTML content as bytes (e.g., from `response.content`). Passing a decoded string (e.g., `response.text`) can lead to parsing errors or incorrect results due to encoding issues with `lxml`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure you are using `justext` version 3.0.1 or higher for better `lxml` compatibility, and 3.0.0 or higher for Python 3.8+ environments. Regularly update `justext` to its latest stable release.","message":"Older versions of justext (specifically before v3.0.1) had compatibility issues with newer versions of `lxml`, leading to parsing errors. Similarly, versions before v3.0.0 would fail on Python 3.8+ due to the removal of `cgi.escape`.","severity":"gotcha","affected_versions":"< 3.0.1 (lxml compatibility), < 3.0.0 (Python 3.8+)"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}