Extruct
Extruct is a Python library for extracting embedded metadata from HTML markup. It currently supports W3C's HTML Microdata, embedded JSON-LD, Microformats (via mf2py), Facebook's Open Graph, experimental RDFa (via rdflib), and Dublin Core Metadata (DC-HTML-2003). The library is actively maintained; the current stable version is 0.18.0.
Common errors
- ImportError: cannot import name '_ElementStringResult' from 'lxml.etree'
  cause: An incompatibility between older `extruct` versions (<0.18.0) and `lxml` versions 5.1.0 or newer.
  fix: Update `extruct` to version 0.18.0 or later. If updating `extruct` is not possible, downgrade `lxml` to a version prior to 5.1.0 (e.g., `pip install lxml==5.0.1`).
- Empty dictionary or unexpected missing metadata in extruct output.
  cause: The target HTML either does not contain metadata in the formats `extruct` supports, or relative URLs were not resolved because `base_url` was omitted.
  fix: First, inspect the source HTML for the presence of Microdata, JSON-LD, Open Graph, etc. Second, always provide the `base_url` parameter to `extruct.extract(html_string, base_url=actual_url)` to ensure proper resolution of relative URLs and images.
- ModuleNotFoundError: No module named 'requests' when running the `extruct` command-line tool.
  cause: The `requests` library, which the command-line interface uses to fetch web pages, is an optional dependency and not installed by default.
  fix: Install `extruct` with its command-line interface dependencies using `pip install 'extruct[cli]'`.
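The first error above depends only on which versions are installed, so it can be detected without importing `extruct` at all. The following is a minimal sketch, not part of `extruct`: `parse_version`, `is_affected`, and `installed_version` are hypothetical helper names, and the thresholds are the ones quoted in the entry above.

```python
import re
from importlib import metadata
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Return three numeric components, e.g. '5.1.0' -> (5, 1, 0).

    Simplification: pre-release suffixes such as 'b1' are ignored.
    """
    parts = [int(p) for p in re.findall(r"\d+", version)[:3]]
    return tuple(parts + [0] * (3 - len(parts)))


def is_affected(extruct_version: str, lxml_version: str) -> bool:
    """True when extruct < 0.18.0 is paired with lxml >= 5.1.0."""
    return (
        parse_version(extruct_version) < (0, 18, 0)
        and parse_version(lxml_version) >= (5, 1, 0)
    )


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    extruct_v = installed_version("extruct")
    lxml_v = installed_version("lxml")
    if extruct_v is None or lxml_v is None:
        print("extruct and/or lxml is not installed")
    elif is_affected(extruct_v, lxml_v):
        print(f"extruct {extruct_v} with lxml {lxml_v}: "
              "upgrade extruct or pin lxml below 5.1.0")
    else:
        print(f"extruct {extruct_v} with lxml {lxml_v}: no known conflict")
```

Running this in the environment that raises the `ImportError` shows which of the two fixes applies.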
Warnings
- breaking Versions of `extruct` prior to 0.18.0 might encounter `ImportError: cannot import name '_ElementStringResult' from 'lxml.etree'` when used with `lxml` versions 5.1.0 or higher due to internal API changes in `lxml`.
- gotcha The output structure of `extruct` can be inconsistent for certain metadata types, sometimes returning a list of dictionaries and other times a single dictionary, which can lead to `TypeError` or `IndexError` if not handled carefully in post-processing.
- gotcha Extracting all supported syntaxes from very large or complex HTML documents can be memory-intensive and slow. Because `extruct.extract()` attempts all formats by default, pass the `syntaxes` parameter to restrict extraction to the formats you need.
- gotcha The command-line tool `extruct` (e.g., `extruct 'http://example.com'`) requires the `requests` library, which is an optional dependency and not installed by default with a basic `pip install extruct`.
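The list-versus-dictionary gotcha above can be handled by normalizing every value before post-processing. This is a minimal sketch in plain Python: `as_item_list` is a hypothetical helper, and `sample` is made-up data shaped like an `extract()` result, not real library output.

```python
from typing import Any, Dict, List


def as_item_list(value: Any) -> List[Dict[str, Any]]:
    """Coerce None, a single dict, or a list of dicts into a list of dicts."""
    if value is None:
        return []
    if isinstance(value, dict):
        return [value]
    # Keep only dict items so later code can index keys safely.
    return [item for item in value if isinstance(item, dict)]


# Hypothetical sample: one syntax yields a list, one a bare dict, one nothing.
sample = {
    "json-ld": [{"@type": "Person", "name": "Ada"}],
    "microdata": {"type": "https://schema.org/Book",
                  "properties": {"name": "Notes"}},
    "opengraph": [],
}

normalized = {syntax: as_item_list(items) for syntax, items in sample.items()}

for syntax, items in normalized.items():
    print(syntax, len(items))
```

After normalization, every value can be iterated the same way, which avoids the `TypeError` and `IndexError` cases that arise from indexing into a bare dictionary or an empty list.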
Install
- pip install extruct
- pip install 'extruct[cli]'
Imports
- extract: `from extruct import extract`
- get_base_url: `from w3lib.html import get_base_url`
- OpenGraphExtractor: `from extruct.opengraph import OpenGraphExtractor`
Quickstart
    import extruct
    import requests
    from w3lib.html import get_base_url
    import pprint

    pp = pprint.PrettyPrinter(indent=2)

    # Replace with a real URL to test
    url = 'http://quotes.toscrape.com/scroll'
    r = requests.get(url)
    base_url = get_base_url(r.text, r.url)

    data = extruct.extract(r.text, base_url=base_url, uniform=True, syntaxes=['json-ld', 'microdata', 'opengraph'])
    pp.pprint(data)