Ultimate Sitemap Parser
Ultimate Sitemap Parser (USP) is a performant and robust Python library for parsing and crawling sitemaps. It supports all major sitemap formats (XML, Google News, plain text, RSS/Atom), handles nested sitemaps, is error-tolerant, and efficiently processes large hierarchies using a lazy-loading generator for pages. The library is actively maintained, with frequent releases; the current version is 1.8.0.
Common errors
-
ModuleNotFoundError: No module named 'ultimate_sitemap_parser'
cause: The top-level package for importing is `usp`, not `ultimate_sitemap_parser`.
fix: Change your import statements from `import ultimate_sitemap_parser` or `from ultimate_sitemap_parser.tree import ...` to `import usp` or `from usp.tree import ...` respectively.
-
AttributeError: 'InvalidSitemap' object has no attribute 'all_pages'
cause: The `sitemap_tree_for_homepage` function or other parsing methods returned an `InvalidSitemap` object, indicating that the sitemap could not be fetched or parsed successfully.
fix: Check the URL for correctness and accessibility. The `InvalidSitemap` object itself might contain information about the failure (e.g., HTTP status code or parsing error). Inspect the logs for details during the parsing process.
-
SitemapException: Maximum recursion depth exceeded (URL: ...)
cause: The library detected a circular reference within the sitemap hierarchy (a sitemap linking back to itself or an ancestor) or an excessively deep, potentially infinite, recursion.
fix: This usually indicates a malformed sitemap structure on the target website. Review the sitemap files for unintended circular dependencies. The library prevents infinite loops, but this exception signals a problematic sitemap design.
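The circular-reference case can be illustrated with a small sketch that does not depend on the library. It is not USP's internal implementation: the `find_cycle` helper and the dictionary mapping each sitemap URL to its child sitemaps are hypothetical, and only show how checking the current chain of ancestors catches a sitemap that links back to itself or to an ancestor.

```python
from typing import Dict, List, Optional


def find_cycle(children: Dict[str, List[str]], root: str) -> Optional[List[str]]:
    """Return the first chain of sitemap URLs that loops back on itself, or None.

    `children` is a hypothetical mapping from a sitemap URL to the sitemap
    URLs it references; it is not a USP data structure.
    """

    def visit(url: str, ancestors: List[str]) -> Optional[List[str]]:
        if url in ancestors:
            # The URL is already on the current path: a circular reference.
            return ancestors[ancestors.index(url):] + [url]
        for child in children.get(url, []):
            found = visit(child, ancestors + [url])
            if found is not None:
                return found
        return None

    return visit(root, [])


# Assumed example hierarchy: the news sitemap points back at the index.
hierarchy = {
    "https://www.example.org/sitemap.xml": ["https://www.example.org/news.xml"],
    "https://www.example.org/news.xml": ["https://www.example.org/sitemap.xml"],
}
cycle = find_cycle(hierarchy, "https://www.example.org/sitemap.xml")
print(cycle)
```

A returned chain names the sitemap files to review on the target website; `None` means no loop was found in the mapping that was supplied.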
Warnings
- breaking Python 3.8 is no longer supported as of version 1.3.0. The minimum required Python version for recent releases (including 1.8.0) is 3.10.
- breaking If you use custom web clients by subclassing `AbstractWebClient`, you must implement the new `url()` method as of version 1.3.0. This method should return the actual URL fetched after any redirects.
- gotcha Malformed sitemap XML (e.g., incorrect namespace, missing tags, invalid URLs, improper encoding) can lead to `InvalidSitemap` objects or parsing failures, even though the library is error-tolerant.
- gotcha While designed for efficiency, processing extremely large sitemaps (e.g., >50MB uncompressed or >50,000 URLs) can still be resource-intensive or hit memory limits.
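One way to keep memory use bounded with very large sitemaps is to consume a lazy page generator in fixed-size batches instead of building one big list. The sketch below uses only the standard library; `in_batches` and `fake_pages` are hypothetical names, and `fake_pages` merely stands in for a lazy source such as `tree.all_pages()`.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def in_batches(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield lists of at most `size` items without materialising `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def fake_pages(count: int) -> Iterator[str]:
    # Stand-in for a lazy page generator such as tree.all_pages().
    for i in range(count):
        yield f"https://www.example.org/page-{i}"


batch_sizes = [len(batch) for batch in in_batches(fake_pages(7), 3)]
print(batch_sizes)
```

Only one batch is held in memory at a time, so each batch can be written to disk or a database before the next one is pulled from the generator.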
Install
-
pip install ultimate-sitemap-parser
Imports
- sitemap_tree_for_homepage
wrong: from ultimate_sitemap_parser.tree import sitemap_tree_for_homepage
correct: from usp.tree import sitemap_tree_for_homepage
- sitemap_from_str
from usp.tree import sitemap_from_str
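Because the distribution name (`ultimate-sitemap-parser`) differs from the import name (`usp`), a quick check of which module names resolve in the current environment can help when diagnosing a `ModuleNotFoundError`. This is a minimal sketch using the standard library's `importlib.util.find_spec`; the variable names are illustrative.

```python
from importlib.util import find_spec

# The distribution name with underscores is not an importable module.
wrong_name_found = find_spec("ultimate_sitemap_parser") is not None

# `usp` resolves only if the package is installed in this environment.
usp_found = find_spec("usp") is not None

print(f"ultimate_sitemap_parser importable: {wrong_name_found}")
print(f"usp importable: {usp_found}")
```

If `usp` is reported as not importable, the package is probably missing from the active interpreter or virtual environment.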
Quickstart
from usp.tree import sitemap_tree_for_homepage

# Replace with the target website URL for sitemap discovery
target_url = "https://www.example.org/"

try:
    # Fetches sitemaps, discovers nested sitemaps, and builds a tree structure
    tree = sitemap_tree_for_homepage(target_url)
    print(f"Successfully parsed sitemap for: {target_url}")
    print("Listing all discovered pages:")
    # Iterate through all pages found across the sitemap hierarchy
    # Uses a generator for memory efficiency with large sitemaps
    page_count = 0
    for page in tree.all_pages():
        print(page.url)
        page_count += 1
    print(f"Found {page_count} pages.")
except Exception as e:
    print(f"An error occurred: {e}")
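As a complement to the quickstart, the sketch below collects unique page URLs from any object that exposes an `all_pages()` method and returns an empty list when that method is missing, instead of raising `AttributeError`. The `collect_urls` helper and the `_StubPage` and `_StubTree` classes are hypothetical and exist only so the example runs without network access; with the library, the object returned by `sitemap_tree_for_homepage` would be passed in place of the stub.

```python
from typing import Any, Iterator, List


def collect_urls(tree: Any) -> List[str]:
    """Return unique page URLs in first-seen order, or [] if `tree` has no pages API."""
    all_pages = getattr(tree, "all_pages", None)
    if not callable(all_pages):
        # Nothing to iterate, e.g. the object does not provide all_pages().
        return []
    seen = set()
    urls: List[str] = []
    for page in all_pages():
        if page.url not in seen:
            seen.add(page.url)
            urls.append(page.url)
    return urls


class _StubPage:
    # Hypothetical stand-in exposing only the `url` attribute used above.
    def __init__(self, url: str) -> None:
        self.url = url


class _StubTree:
    # Hypothetical stand-in for a parsed sitemap tree; repeats one URL on purpose.
    def all_pages(self) -> Iterator[_StubPage]:
        for path in ("a", "b", "a"):
            yield _StubPage(f"https://www.example.org/{path}")


urls = collect_urls(_StubTree())
print(urls)
```

Unlike iterating the generator directly, this keeps every unique URL in memory, so it suits small and medium sites better than very large hierarchies.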