Extruct

0.18.0 · active · verified Thu Apr 16

Extruct is a Python library for extracting embedded metadata from HTML markup. It supports W3C HTML Microdata, embedded JSON-LD, Microformats (via mf2py), Facebook's Open Graph, RDFa (experimental, via rdflib), and Dublin Core Metadata (DC-HTML-2003). The library is actively maintained; the current stable version is 0.18.0.

Quickstart

This quickstart fetches a page over HTTP, determines the base URL used to resolve relative links, and calls `extruct.extract` to pull structured metadata in three syntaxes (JSON-LD, Microdata, Open Graph). The result is a dict keyed by syntax name. Passing `uniform=True` asks extruct to normalize the items of each syntax into a common list-of-dicts shape with `@context` and `@type` keys, so the same processing code can handle all of them.

import extruct
import requests
from w3lib.html import get_base_url
import pprint

pp = pprint.PrettyPrinter(indent=2)

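# Example URL; replace it with the page you want to inspect
url = 'http://quotes.toscrape.com/scroll'
r = requests.get(url, timeout=10)
r.raise_for_status()  # stop early on HTTP errors instead of parsing an error page
# Honors a <base href> tag if the page has one; otherwise falls back to r.url
base_url = get_base_url(r.text, r.url)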

data = extruct.extract(r.text, base_url=base_url, uniform=True, syntaxes=['json-ld', 'microdata', 'opengraph'])

pp.pprint(data)
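
Once the data is extracted, a common next step is to group items by their type regardless of which syntax they came from. The following is a minimal sketch, not part of extruct itself: `group_by_type` is a hypothetical helper, and `sample` is made-up data that only assumes the uniform output shape described above (a dict keyed by syntax name, each value a list of dicts carrying an `@type` key that may be a string or a list).

```python
from collections import defaultdict

# Hypothetical sample, shaped like the dict extruct.extract(..., uniform=True)
# is assumed to return; the values are invented for illustration only.
sample = {
    "json-ld": [
        {"@context": "https://schema.org", "@type": "Product", "name": "Example widget"},
        {"@context": "https://schema.org", "@type": ["Article", "NewsArticle"], "headline": "Example"},
    ],
    "microdata": [],
    "opengraph": [
        {"@context": {"og": "http://ogp.me/ns#"}, "@type": "website", "og:title": "Example page"},
    ],
}


def group_by_type(data):
    """Group extracted items by '@type', remembering the source syntax."""
    grouped = defaultdict(list)
    for syntax, items in data.items():
        for item in items:
            types = item.get("@type", "unknown")
            # '@type' may be a single string or a list of strings
            if isinstance(types, str):
                types = [types]
            for type_name in types:
                grouped[type_name].append((syntax, item))
    return dict(grouped)


grouped = group_by_type(sample)
for type_name, entries in sorted(grouped.items()):
    print(type_name, [syntax for syntax, _ in entries])
```

In the quickstart, the same helper could be called as `group_by_type(data)`. Items without an `@type` key end up under `"unknown"` rather than being dropped.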
