htmldate

1.9.4 · active · verified Thu Apr 09

htmldate is a Python library for fast and robust extraction of original and updated publication dates from URLs and web pages. It is actively maintained, with frequent minor releases that typically bring bug fixes, dependency updates, and improvements to the extraction heuristics.

Warnings

Install

Imports

Quickstart

To use `htmldate`, import the `find_date` function. Its first argument is either an HTML string or a URL string; given a URL, the library downloads the page itself. When passing HTML, it is often useful to also supply the `url` parameter, so that a date embedded in the page's URL can serve as an additional hint. Set `original_date=True` to prefer the original publication date. With the default, `original_date=False`, the extractor favors the most recent date, such as a last-modified timestamp.

from htmldate import find_date

# Example 1: extract a date from HTML, passing the page URL as an extra hint.
# find_date() also accepts a URL string as its first argument and fetches the
# page itself; inline HTML is used here so the example is reproducible.
# To fetch a live page yourself (needs `import requests` and error handling):
# html_content = requests.get(url, timeout=10).text
url = 'https://www.example.com/news/article'
html_content = """
<html><head><meta property="article:published_time" content="2023-10-26T10:00:00Z"></head>
<body><h1>Latest News</h1><p>Published: October 26, 2023</p></body></html>
"""

# The HTML goes in as the first positional argument; `url` is keyword-only context.
date_from_html = find_date(html_content, url=url)
print(f"Date extracted from HTML: {date_from_html}")

# Example 2: choose between the original publication date and the latest update.
# original_date=True asks for the earliest (publication) date;
# original_date=False (the default) favors the most recent one.
html_content_updated = """
<html><head><meta property="article:published_time" content="2023-10-26T10:00:00Z">
<meta property="article:modified_time" content="2024-03-15T14:30:00Z"></head>
<body><h1>Latest News</h1><p>Published: October 26, 2023</p><p>Last Updated: March 15, 2024</p></body></html>
"""
original = find_date(html_content_updated, original_date=True)
updated = find_date(html_content_updated, original_date=False)

print(f"Original Date: {original}")
print(f"Updated Date: {updated}")
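By default `find_date` returns its result as a `YYYY-MM-DD` string, or `None` when no date can be found, so calling code usually needs a small conversion step before doing date arithmetic. The sketch below shows one way to do that with the standard library only. The helpers `to_date` and `days_between` are hypothetical names introduced for illustration and are not part of `htmldate`, and the hard-coded strings stand in for real `find_date` results.

```python
from datetime import date, datetime
from typing import Optional


def to_date(value: Optional[str], fmt: str = "%Y-%m-%d") -> Optional[date]:
    """Convert an extractor result (a date string or None) into a datetime.date.

    Hypothetical helper, not part of htmldate. `fmt` must match the output
    format that was used when the string was produced.
    """
    if value is None:
        return None
    return datetime.strptime(value, fmt).date()


def days_between(published: Optional[str], updated: Optional[str]) -> Optional[int]:
    """Days from publication to last update, or None if either date is missing."""
    first, last = to_date(published), to_date(updated)
    if first is None or last is None:
        return None
    return (last - first).days


# Hard-coded strings stand in for find_date() results.
print(to_date("2023-10-26"))                      # 2023-10-26
print(to_date(None))                              # None
print(days_between("2023-10-26", "2024-03-15"))   # 141
```

Handling `None` explicitly keeps a page without a detectable date from raising an exception further down the pipeline.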
