Trafilatura

2.0.0 · active · verified Thu Apr 09

Trafilatura is a Python package and command-line tool for gathering text and metadata from the web. It handles crawling, scraping, and main-content extraction from web pages, and it can output results as CSV, JSON, HTML, Markdown, TXT, or XML. The library is actively maintained, with frequent releases, and also provides navigation and deduplication features.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to fetch a web page and extract its main text content using `trafilatura`. It includes a basic extraction to plain text and an example of extracting structured JSON output with metadata.

from trafilatura import fetch_url, extract

# Example URL: a fixed, public post on the GitHub blog.
# In production, the URL would typically come from a variable or a list.
url = 'https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/'

print(f"Fetching URL: {url}")
downloaded_html = fetch_url(url)

if downloaded_html:
    print("Content successfully downloaded. Extracting...")
    # With default settings, extract() returns the main content, plus any page comments, as plain text
    extracted_text = extract(downloaded_html)
    
    if extracted_text:
        print("--- Extracted Text (first 500 chars) ---")
        print(extracted_text[:500])
        print("...")

        # Example of custom output: JSON with metadata
        # Note: with_metadata=True is required for metadata inclusion since v1.11.0
        print("\n--- Extracting as JSON with metadata ---")
        extracted_json = extract(downloaded_html, output_format="json", with_metadata=True)
        if extracted_json:
            print(extracted_json[:500])
            print("...")
        else:
            print("Failed to extract content as JSON.")

    else:
        print("No text extracted from the downloaded HTML.")
else:
    print(f"Failed to download content from {url}")
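
The JSON output in the quickstart is a plain string, so it can be loaded with the standard `json` module to reach individual fields. The sketch below shows one way to do that. The helper name `summarize_extraction` is hypothetical, the sample string is hand-written rather than real Trafilatura output, and the field names (`title`, `date`, `text`) are assumptions to check against what the installed version actually emits.

```python
import json
from typing import Optional


def summarize_extraction(extracted_json: Optional[str], preview_chars: int = 200) -> dict:
    """Turn a JSON string from extract(..., output_format="json") into a small summary.

    Field names are assumptions; .get() keeps the helper tolerant of missing keys.
    """
    if not extracted_json:
        # extract() returns None when nothing could be extracted.
        return {}
    record = json.loads(extracted_json)
    text = record.get("text") or ""
    return {
        "title": record.get("title"),
        "date": record.get("date"),
        "text_preview": text[:preview_chars],
        "text_length": len(text),
    }


# Hand-written stand-in for extract() output, for illustration only.
sample_json = json.dumps(
    {"title": "Sample article", "date": "2019-03-29", "text": "Hello world."}
)
summary = summarize_extraction(sample_json)
print(summary["title"], summary["text_length"])
```

In the quickstart, a call such as `summarize_extraction(extracted_json)` could stand in for printing the first 500 characters of the raw string.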
