htmldate
htmldate is a Python library designed for fast and robust extraction of original and updated publication dates from URLs and web pages. It is actively maintained with frequent minor releases, often addressing bug fixes, dependency updates, and improvements to extraction heuristics.
Warnings
- breaking As of `v1.9.0`, htmldate requires Python 3.8 or newer. Older Python versions are no longer supported and may run into compatibility issues or installation failures.
- gotcha The behavior of the `original_date` parameter was fixed in `v1.7.0` so that original publication dates and updated dates from meta properties are distinguished more accurately. If you relied on the pre-1.7.0 behavior for this distinction, your results may change.
- gotcha `v1.6.0` introduced stricter extraction patterns to favor precision and replaced `lxml.html.Cleaner`. As a result, `htmldate` may no longer find a date on some pages where it previously did, or may extract a different (and potentially more accurate) date.
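Since the warnings above tie behavior changes to specific releases, it can help to check which of them apply to the installed version. The following is a minimal sketch and not part of htmldate: `parse_version` and `behavior_notes` are hypothetical helper names, and the version thresholds are taken only from the notes above.

```python
from importlib import metadata
from itertools import takewhile


def parse_version(text):
    """Turn '1.9.0' or '1.7.0rc1' into a (major, minor, patch) tuple."""
    parts = []
    for piece in text.split(".")[:3]:
        # Keep only the leading digits, so '0rc1' becomes 0
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def behavior_notes(version_text):
    """List which of the warnings above apply to a given htmldate version."""
    version = parse_version(version_text)
    notes = []
    if version >= (1, 6, 0):
        notes.append("stricter extraction patterns (v1.6.0)")
    if version >= (1, 7, 0):
        notes.append("original-date fix for meta properties (v1.7.0)")
    if version >= (1, 9, 0):
        notes.append("Python 3.8+ required (v1.9.0)")
    return notes


if __name__ == "__main__":
    try:
        installed = metadata.version("htmldate")
    except metadata.PackageNotFoundError:
        print("htmldate is not installed")
    else:
        for note in behavior_notes(installed):
            print(f"htmldate {installed}: {note}")
```

Running a check like this before and after an upgrade is one way to see which behavior changes to expect when comparing extraction results.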
Install
pip install htmldate
Imports
- find_date
from htmldate import find_date
Quickstart
from htmldate import find_date
# Example 1: Extract a date from HTML, passing the page URL as an extra hint
url = 'https://www.example.com/news/article'
# find_date also accepts a URL string directly and downloads the page itself.
# To fetch the page yourself instead, handle network errors, e.g.:
# import requests
# html_content = requests.get(url, timeout=10).text
# Mock HTML content is used here for reproducibility
html_content = """
<html><head><meta property="article:published_time" content="2023-10-26T10:00:00Z"></head>
<body><h1>Latest News</h1><p>Published: October 26, 2023</p></body></html>
"""
# The document is the first positional argument (htmldoc); url is an optional keyword
date_from_html = find_date(html_content, url=url)
print(f"Date extracted from HTML: {date_from_html}")
# Example 2: Compare the original publication date with the most recent update
# original_date=True asks the extractor to prefer the original publication date;
# the default (False) prefers the most recent date, such as a last-modified time.
html_content_updated = """
<html><head><meta property="article:published_time" content="2023-10-26T10:00:00Z">
<meta property="article:modified_time" content="2024-03-15T14:30:00Z"></head>
<body><h1>Latest News</h1><p>Published: October 26, 2023</p><p>Last Updated: March 15, 2024</p></body></html>
"""
original_date = find_date(html_content_updated, original_date=True)
updated_date = find_date(html_content_updated, original_date=False)
print(f"Original Date: {original_date}")
print(f"Updated Date: {updated_date}")
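`find_date` returns a string (in `%Y-%m-%d` form by default) or `None` when no date is found, so results usually need a little post-processing before they can be compared. The sketch below shows one way to do that using only the standard library. `to_date` and `days_between` are hypothetical helpers, and the literal strings are placeholders standing in for values `find_date` might return, not actual output.

```python
from datetime import date, datetime
from typing import Optional

DATE_FORMAT = "%Y-%m-%d"  # htmldate's default output format


def to_date(value: Optional[str]) -> Optional[date]:
    """Convert a find_date-style result to a date, passing None through."""
    if value is None:
        return None
    return datetime.strptime(value, DATE_FORMAT).date()


def days_between(original: Optional[str], updated: Optional[str]) -> Optional[int]:
    """Days from the original date to the updated one, or None if either is missing."""
    first, last = to_date(original), to_date(updated)
    if first is None or last is None:
        return None
    return (last - first).days


# Placeholder values standing in for find_date results (assumed, not real output)
print(days_between("2023-10-26", "2024-03-15"))
print(days_between(None, "2024-03-15"))  # no original date was found
```

Handling `None` explicitly matters here because stricter extraction can leave some pages without a detected date.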