{"id":2319,"library":"trafilatura","title":"Trafilatura","description":"Trafilatura is a Python package and command-line tool for gathering text and metadata from the web. It covers crawling, scraping, and main-content extraction from web pages, with output as CSV, JSON, HTML, Markdown, TXT, or XML. The library is actively maintained with frequent releases and also offers navigation and deduplication features.","status":"active","version":"2.0.0","language":"en","source_language":"en","source_url":"https://github.com/adbar/trafilatura","tags":["web scraping","text extraction","web crawling","metadata","NLP"],"install":[{"cmd":"pip install trafilatura","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Provides CA certificates for secure (TLS) connections.","package":"certifi"},{"reason":"For character encoding detection.","package":"charset_normalizer>=3.4.0"},{"reason":"Underlying library for URL management and parsing.","package":"courlan>=1.3.2"},{"reason":"For robust date extraction from HTML.","package":"htmldate>=1.9.2"},{"reason":"Used as a fallback for text extraction.","package":"justext>=3.0.1"},{"reason":"Core dependency for HTML parsing and XPath operations.","package":"lxml>=5.3.0"},{"reason":"HTTP client for fetching web pages.","package":"urllib3<3,>=1.26"},{"reason":"Optional: For faster character encoding detection.","package":"cchardet","optional":true},{"reason":"Optional: For language detection.","package":"langid","optional":true}],"imports":[{"symbol":"fetch_url","correct":"from trafilatura import fetch_url"},{"symbol":"extract","correct":"from trafilatura import extract"},{"note":"The `as_dict` argument is deprecated; `bare_extraction()` now returns a `Document` object. 
Use the `.as_dict()` method on the returned object instead.","wrong":"result = bare_extraction(html_content, as_dict=True)","symbol":"bare_extraction","correct":"from trafilatura import bare_extraction"},{"note":"Returned by `bare_extraction()`; exposes the extracted data and provides an `.as_dict()` method.","symbol":"Document","correct":"from trafilatura.settings import Document"}],"quickstart":{"code":"from trafilatura import fetch_url, extract\n\n# Example URL from the GitHub blog.\n# A fixed public URL is used here; in production, URLs would typically come from a variable or a list.\nurl = 'https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/'\n\nprint(f\"Fetching URL: {url}\")\n# fetch_url() returns None if the download fails\ndownloaded_html = fetch_url(url)\n\nif downloaded_html:\n    print(\"Content successfully downloaded. Extracting...\")\n    # Extract main content and comments as plain text by default\n    extracted_text = extract(downloaded_html)\n\n    if extracted_text:\n        print(\"--- Extracted Text (first 500 chars) ---\")\n        print(extracted_text[:500])\n        print(\"...\")\n\n        # Example of custom output: JSON with metadata\n        # Note: with_metadata=True is required for metadata inclusion since v1.11.0\n        print(\"\\n--- Extracting as JSON with metadata ---\")\n        extracted_json = extract(downloaded_html, output_format=\"json\", with_metadata=True)\n        if extracted_json:\n            print(extracted_json[:500])\n            print(\"...\")\n        else:\n            print(\"Failed to extract content as JSON.\")\n\n    else:\n        print(\"No text extracted from the downloaded HTML.\")\nelse:\n    print(f\"Failed to download content from {url}\")","lang":"python","description":"This quickstart demonstrates how to fetch a web page and extract its main text content using `trafilatura`. 
It includes a basic plain-text extraction and an example of structured JSON output with metadata."},"warnings":[{"fix":"Upgrade your Python environment to version 3.8 or newer.","message":"Python 3.6 and 3.7 are no longer supported. Users must upgrade to Python 3.8 or higher.","severity":"breaking","affected_versions":"2.0.0+"},{"fix":"Get a dictionary representation by calling the `.as_dict()` method on the returned `Document` object: `doc = bare_extraction(...); result_dict = doc.as_dict()`.","message":"The `bare_extraction()` function now returns an instance of the `Document` class by default. The `as_dict` argument is deprecated.","severity":"breaking","affected_versions":"2.0.0+"},{"fix":"Use the `fast` argument instead: `extract(html, fast=True)`.","message":"The `no_fallback` argument of the `bare_extraction()` and `extract()` functions is deprecated.","severity":"breaking","affected_versions":"2.0.0+"},{"fix":"`fetch_url()` now returns a decoded Unicode string (or `None` on failure). To get the full response object and control decoding yourself, use `fetch_response()` directly.","message":"The `decode` argument in `fetch_url()` has been removed.","severity":"breaking","affected_versions":"2.0.0+"},{"fix":"To include metadata in your output, explicitly set `with_metadata=True` in your `extract()` calls or use the `--with-metadata` CLI flag.","message":"Metadata is now skipped by default (`with_metadata=False`).","severity":"breaking","affected_versions":"1.11.0+"},{"fix":"Use the specified output format options (e.g., `--json`, `--xml`, `--markdown`) instead of `-out`.","message":"The command-line interface (CLI) enforces a fixed list of output formats. The `-out` argument is deprecated.","severity":"breaking","affected_versions":"1.12.0+"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}