Trafilatura
Trafilatura is a Python package and command-line tool for gathering text and metadata from the web. It covers crawling, scraping, and main-content extraction from web pages, and supports output formats including CSV, JSON, HTML, Markdown, TXT, and XML. The library is actively maintained with frequent releases and also provides navigation and deduplication features.
Warnings
- breaking: Python 3.6 and 3.7 are no longer supported. Upgrade to Python 3.8 or higher.
- breaking: `bare_extraction()` now returns an instance of the `Document` class by default. The `as_dict` argument is deprecated.
- breaking: The `no_fallback` argument of `bare_extraction()` and `extract()` is deprecated; use `fast` instead.
- breaking: The `decode` argument of `fetch_url()` has been removed; use `fetch_response()` when the undecoded response is needed.
- breaking: Metadata is no longer included in the output by default (`with_metadata=False`). Pass `with_metadata=True` to include it.
- breaking: The command-line interface (CLI) enforces a fixed list of output formats. The `-out` argument is deprecated.
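Code that has to run against both older and newer releases may receive either a dictionary or a `Document` from `bare_extraction()`. The sketch below shows one possible way to normalize the result. It is not part of Trafilatura: `document_to_dict` and `extract_record` are hypothetical helper names, and the code assumes that `Document` exposes an `as_dict()` method.

```python
from typing import Any, Dict, Optional


def document_to_dict(result: Any) -> Optional[Dict[str, Any]]:
    """Normalize the output of bare_extraction() to a plain dict (hypothetical helper)."""
    if result is None:
        # Extraction failed or produced nothing usable.
        return None
    if isinstance(result, dict):
        # Older behaviour: the result is already a dictionary.
        return result
    # Newer behaviour: assumed to be a Document with an as_dict() method.
    return result.as_dict()


def extract_record(html: str) -> Optional[Dict[str, Any]]:
    """Run bare_extraction() with metadata enabled and return a dict (hypothetical helper)."""
    # Imported lazily so that document_to_dict() stays usable without the package.
    from trafilatura import bare_extraction

    return document_to_dict(bare_extraction(html, with_metadata=True))
```

Calling `extract_record(downloaded_html)` then yields a dictionary, or `None` when nothing could be extracted, regardless of which return type the installed version uses.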
Install
pip install trafilatura
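After installing, a quick check along the following lines can confirm that the interpreter meets the Python 3.8 minimum and that the package is visible. This is only a sketch: `python_is_supported` and `installed_version` are hypothetical helpers built on the standard library's `importlib.metadata`, not Trafilatura functions.

```python
import sys
from importlib import metadata
from typing import Optional

# Minimum interpreter version, per the supported-version warning (3.8 or higher).
MIN_PYTHON = (3, 8)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_version(package: str = "trafilatura") -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("Python supported:", python_is_supported())
print("trafilatura version:", installed_version() or "not installed")
```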
Imports
- fetch_url
from trafilatura import fetch_url
- extract
from trafilatura import extract
- bare_extraction
from trafilatura import bare_extraction
- Document
from trafilatura.settings import Document
Quickstart
from trafilatura import fetch_url, extract

# Example URL from the GitHub blog.
# In a production setting, the URL would typically come from a variable or a list;
# this example uses a fixed public URL.
url = 'https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/'

print(f"Fetching URL: {url}")
downloaded_html = fetch_url(url)

if downloaded_html:
    print("Content successfully downloaded. Extracting...")
    # Extract main content and comments as plain text (the default output)
    extracted_text = extract(downloaded_html)
    if extracted_text:
        print("--- Extracted Text (first 500 chars) ---")
        print(extracted_text[:500])
        print("...")

        # Example of custom output: JSON with metadata
        # Note: with_metadata=True is required for metadata inclusion since v1.11.0
        print("\n--- Extracting as JSON with metadata ---")
        extracted_json = extract(downloaded_html, output_format="json", with_metadata=True)
        if extracted_json:
            print(extracted_json[:500])
            print("...")
        else:
            print("Failed to extract content as JSON.")
    else:
        print("No text extracted from the downloaded HTML.")
else:
    print(f"Failed to download content from {url}")
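The Quickstart handles a single URL. For a list of URLs, the same fetch-then-extract steps can be wrapped in a loop. The following is a minimal sketch under stated assumptions: `extract_many` and `extract_many_with_trafilatura` are hypothetical helper names, not Trafilatura functions, and the fetch and extract callables are passed in as parameters so the loop can be exercised without network access.

```python
from typing import Callable, Dict, Iterable, Optional


def extract_many(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    extract: Callable[[str], Optional[str]],
) -> Dict[str, Optional[str]]:
    """Fetch and extract each URL; map failures to None (hypothetical helper)."""
    results: Dict[str, Optional[str]] = {}
    for url in urls:
        html = fetch(url)
        # Skip extraction when the download failed (None or empty content).
        results[url] = extract(html) if html else None
    return results


def extract_many_with_trafilatura(urls: Iterable[str]) -> Dict[str, Optional[str]]:
    """Same loop, wired to fetch_url() and extract() with their default settings."""
    # Imported lazily so extract_many() can be used and tested on its own.
    from trafilatura import extract, fetch_url

    return extract_many(urls, fetch_url, extract)
```

Calling `extract_many_with_trafilatura([...])` returns a dictionary mapping each URL to its extracted text, or to `None` where the download or the extraction failed.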