{"id":7211,"library":"extruct","title":"Extruct","description":"Extruct is a Python library for extracting embedded metadata from HTML markup. It currently supports W3C's HTML Microdata, embedded JSON-LD, Microformat (via mf2py), Facebook's Open Graph, experimental RDFa (via rdflib), and Dublin Core Metadata (DC-HTML-2003). The library is actively maintained with its current stable version being 0.18.0.","status":"active","version":"0.18.0","language":"en","source_language":"en","source_url":"https://github.com/scrapinghub/extruct","tags":["web scraping","metadata extraction","html parsing","json-ld","microdata","opengraph","rdfa"],"install":[{"cmd":"pip install extruct","lang":"bash","label":"Core library"},{"cmd":"pip install 'extruct[cli]'","lang":"bash","label":"With command-line tool dependencies"}],"dependencies":[{"reason":"Core dependency for HTML parsing.","package":"lxml","optional":false},{"reason":"Used for HTML utilities like getting the base URL.","package":"w3lib","optional":false},{"reason":"Required for Microformat extraction.","package":"mf2py","optional":false},{"reason":"Required for experimental RDFa extraction.","package":"rdflib","optional":false},{"reason":"Used for robust JSON-LD parsing.","package":"jstyleson","optional":false},{"reason":"Used for cleaning HTML before parsing.","package":"lxml-html-clean","optional":false},{"reason":"Related to RDFa parsing.","package":"pyrdfa3","optional":false},{"reason":"Optional dependency for the command-line interface to fetch URLs.","package":"requests","optional":true}],"imports":[{"note":"The primary function for all-in-one metadata extraction.","symbol":"extract","correct":"from extruct import extract"},{"note":"Commonly imported for resolving relative URLs in extracted metadata.","symbol":"get_base_url","correct":"from w3lib.html import get_base_url"},{"note":"Example of importing a specific extractor if only certain formats are needed.","symbol":"OpenGraphExtractor","correct":"from 
extruct.opengraph import OpenGraphExtractor"}],"quickstart":{"code":"import extruct\nimport requests\nfrom w3lib.html import get_base_url\nimport pprint\n\npp = pprint.PrettyPrinter(indent=2)\n\n# Replace with a real URL to test\nurl = 'http://quotes.toscrape.com/scroll'\nr = requests.get(url)\nbase_url = get_base_url(r.text, r.url)\n\ndata = extruct.extract(r.text, base_url=base_url, uniform=True, syntaxes=['json-ld', 'microdata', 'opengraph'])\n\npp.pprint(data)","lang":"python","description":"This quickstart fetches HTML content from a URL, determines the base URL for resolving relative paths, and then uses `extruct.extract` to retrieve structured metadata in common formats (JSON-LD, Microdata, Open Graph). The `uniform=True` parameter ensures a consistent output structure for easier processing."},"warnings":[{"fix":"Upgrade `extruct` to version 0.18.0 or newer. If an upgrade is not feasible, temporarily pin `lxml` to a version less than 5.1.0 (e.g., `pip install 'lxml<5.1.0'`; the quotes stop the shell from treating `<` as input redirection).","message":"Versions of `extruct` prior to 0.18.0 might encounter `ImportError: cannot import name '_ElementStringResult' from 'lxml.etree'` when used with `lxml` versions 5.1.0 or higher due to internal API changes in `lxml`.","severity":"breaking","affected_versions":"<0.18.0"},{"fix":"Use the `uniform=True` parameter in `extruct.extract()` to ensure a more consistent output structure (e.g., always a list). 
Always check the type of the returned data before attempting to access elements by index or key.","message":"The output structure of `extruct` can be inconsistent for certain metadata types, sometimes returning a list of dictionaries and other times a single dictionary, which can lead to `TypeError` or `IndexError` if not handled carefully in post-processing.","severity":"gotcha","affected_versions":"All versions"},{"fix":"To optimize performance and resource usage, specify only the required syntaxes using the `syntaxes` parameter (e.g., `syntaxes=['json-ld', 'opengraph']`).","message":"Extracting all supported syntaxes from very large or complex HTML documents can be memory-intensive and slow. By default, `extruct.extract()` attempts all formats.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Install `extruct` with the `cli` extra to include `requests`: `pip install 'extruct[cli]'`.","message":"The command-line tool `extruct` (e.g., `extruct 'http://example.com'`) requires the `requests` library, which is an optional dependency and not installed by default with a basic `pip install extruct`.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Update `extruct` to version 0.18.0 or later. If updating `extruct` is not possible, downgrade `lxml` to a version prior to 5.1.0 (e.g., `pip install lxml==5.0.1`).","cause":"An incompatibility between older `extruct` versions (<0.18.0) and `lxml` versions 5.1.0 or newer.","error":"ImportError: cannot import name '_ElementStringResult' from 'lxml.etree'"},{"fix":"First, inspect the source HTML for the presence of Microdata, JSON-LD, Open Graph, etc. 
Second, always provide the `base_url` parameter to `extruct.extract(html_string, base_url=actual_url)` to ensure proper resolution of relative URLs and images.","cause":"The target HTML either does not contain metadata in the formats `extruct` supports, or relative URLs were not resolved because `base_url` was omitted.","error":"Empty dictionary or unexpected missing metadata in extruct output."},{"fix":"Install `extruct` with its command-line interface dependencies using `pip install 'extruct[cli]'`.","cause":"The `requests` library, which the command-line interface uses to fetch web pages, is an optional dependency and not installed by default.","error":"ModuleNotFoundError: No module named 'requests' when running the `extruct` command-line tool."}]}