htmldate

1.9.4 · active · verified Thu Apr 09

htmldate is a Python library for fast and robust extraction of original and updated publication dates from URLs and web pages. It is actively maintained, with frequent minor releases that typically bring bug fixes, dependency updates, and improvements to the extraction heuristics.

Warnings

breaking As of `v1.9.0`, htmldate requires Python 3.8 or newer. Older Python versions are no longer supported and may run into compatibility issues or installation failures.
Fix: Run your project on Python 3.8 or newer.
gotcha The behavior of the `original_date` parameter was fixed in `v1.7.0` so that it distinguishes more accurately between original publication dates and updated dates found in meta properties. If you relied on the pre-1.7.0 behavior for this distinction, your results might change.
Fix: If you used `original_date=True` in versions prior to `1.7.0`, review your code and verify that the extracted dates are as expected after upgrading.
gotcha In `v1.6.0`, the library introduced stricter extraction patterns to favor precision and replaced its use of `lxml.html.Cleaner`. As a result, `htmldate` might no longer find a date on some pages where it previously did, or might extract a different (and potentially more accurate) date.
Fix: Expect extraction results to differ for some URLs when upgrading from versions older than `1.6.0`, and re-evaluate critical extractions after the upgrade.
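
One way to re-evaluate critical extractions after an upgrade is to compare fresh results against dates recorded with the previous version. The following is a minimal sketch of that idea, not part of htmldate: `compare_with_baseline`, `pages`, and `baseline` are hypothetical names, and the extractor is passed in as a callable so that `find_date` (or a stand-in) can be used.

```
from typing import Callable, Dict, Optional, Tuple


def compare_with_baseline(
    pages: Dict[str, str],
    baseline: Dict[str, Optional[str]],
    extractor: Callable[[str], Optional[str]],
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (expected, actual)} for every page whose extracted date changed."""
    changed = {}
    for name, html in pages.items():
        actual = extractor(html)        # date string, or None if nothing was found
        expected = baseline.get(name)   # date recorded before the upgrade
        if actual != expected:
            changed[name] = (expected, actual)
    return changed


# Typical use (requires htmldate to be installed):
#   from htmldate import find_date
#   changed = compare_with_baseline(pages, baseline, find_date)
```

An empty result means every page still yields the recorded date; any entry is a page worth inspecting by hand.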

Install

pip install htmldate

Imports

find_date
```
from htmldate import find_date
```

Quickstart

To use `htmldate`, import the `find_date` function and pass it either a URL (the page is then downloaded for you) or an HTML string as the first argument. When you pass an HTML string, it is often useful to also supply the `url` parameter, so that a date embedded in the URL itself can be taken into account. The `original_date` parameter tells the extractor to prefer the original publication date over later update dates. `find_date` returns the date as a string, or `None` if no date is found.

```
from htmldate import find_date

# Example 1: Extract a date from HTML that is already in memory.
# To process a live page instead, pass the URL itself: find_date(url)
# (for real-world usage, consider handling network errors).
url = 'https://www.example.com/news/article'

# Mock HTML content for reproducibility
html_content = """
<html><head><meta property="article:published_time" content="2023-10-26T10:00:00Z"></head>
<body><h1>Latest News</h1><p>Published: October 26, 2023</p></body></html>
"""

# The HTML is the first positional argument; `url` is an optional hint.
date_from_html = find_date(html_content, url=url)
print(f"Date extracted from HTML: {date_from_html}")

# Example 2: Extract the original publication date (if available and different from the updated date).
# The 'original_date' parameter tells the extractor to prefer the publication date.
html_content_updated = """
<html><head><meta property="article:published_time" content="2023-10-26T10:00:00Z">
<meta property="article:modified_time" content="2024-03-15T14:30:00Z"></head>
<body><h1>Latest News</h1><p>Published: October 26, 2023</p><p>Last Updated: March 15, 2024</p></body></html>
"""
original_date = find_date(html_content_updated, original_date=True)
updated_date = find_date(html_content_updated, original_date=False)

print(f"Original Date: {original_date}")
print(f"Updated Date: {updated_date}")
```
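
Because `find_date` returns a string, or `None` when no date is found, callers usually need a small guard before working with the result. The sketch below shows one possible wrapper that converts the result into a `datetime.date`. `extract_as_date` and its `extractor` argument are hypothetical names introduced for illustration; only `find_date` and its `outputformat` parameter come from htmldate.

```
from datetime import date, datetime
from typing import Callable, Optional

DATE_FORMAT = "%Y-%m-%d"  # also htmldate's default output format


def extract_as_date(
    html: str,
    extractor: Optional[Callable[..., Optional[str]]] = None,
) -> Optional[date]:
    """Return the extracted date as a datetime.date, or None if nothing was found."""
    if extractor is None:
        # Imported lazily so the helper can be exercised with a stand-in extractor.
        from htmldate import find_date
        extractor = find_date
    result = extractor(html, outputformat=DATE_FORMAT)
    if result is None:
        return None
    return datetime.strptime(result, DATE_FORMAT).date()
```

Passing the same format string to both the extractor and `strptime` keeps the two in sync if the format is ever changed.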
