URL Canon
urlcanon is a URL canonicalization and normalization library for Python and Java, currently at version 0.3.1. It provides a URL parser that preserves input bytes, a predefined set of canonicalization rules that aim to match browser parsing behavior, and an alternative URL serialization format called SSURT. The library is described as stable and is in production use, but it makes no API or output stability guarantees yet, and the Python and Java implementations differ in features. Releases follow no fixed cadence; updates are published as needed.
Common errors
- My original ParsedUrl object changed after calling a canonicalization function, but I expected a new object.
  cause Canonicalization functions like `urlcanon.whatwg()` modify the `ParsedUrl` object in place rather than returning a new instance.
  fix Explicitly create a copy of the `ParsedUrl` object before canonicalizing if you need to retain the original. Example: `original_url = urlcanon.parse_url('...'); canonicalized_url = urlcanon.parse_url(str(original_url)); urlcanon.whatwg(canonicalized_url)`.
- The canonicalized URL doesn't match my expectation (e.g., specific path segments or casing were not changed).
  cause The `urlcanon` library implements a specific set of canonicalization rules (e.g., the WHATWG standard). Your expectations might differ from those rules, or you might be hitting an edge case that the default rules do not cover.
  fix Review the specific canonicalization rules implemented by `urlcanon` (e.g., by inspecting the source for `whatwg`). Test with a variety of URLs to understand its behavior. If necessary, implement custom canonicalization steps using `urlcanon`'s components.
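One way to act on the advice to test with a variety of URLs is a small comparison harness. The sketch below is an illustration, not part of urlcanon: `compare` and `lowercase_scheme_and_host` are hypothetical names, and the stand-in canonicalizer uses only the standard library so the snippet runs without the package installed.

```python
from urllib.parse import urlsplit, urlunsplit


def lowercase_scheme_and_host(url):
    """Stand-in canonicalizer (NOT urlcanon): lowercases scheme and host only."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path,        # deliberately left untouched
        parts.query,
        parts.fragment,
    ))


def compare(urls, canonicalize):
    """Return (before, after, changed) for each URL so surprises stand out."""
    rows = []
    for url in urls:
        after = canonicalize(url)
        rows.append((url, after, url != after))
    return rows


samples = [
    "HTTP://EXAMPLE.com/Foo/../Bar",
    "http://example.com/",
]
rows = compare(samples, lowercase_scheme_and_host)
for before, after, changed in rows:
    marker = "changed" if changed else "same"
    print(f"{marker:8} {before} -> {after}")
```

To inspect urlcanon itself, pass a wrapper that calls `urlcanon.parse_url`, applies `urlcanon.whatwg` to the result, and returns `str()` of the parsed object. Rows marked as unchanged, or changed in unexpected ways, point to the rules worth reading in the source.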
Warnings
- breaking The library explicitly states 'no API or output stability guarantees yet'. Future versions may introduce breaking changes without prior notice.
- gotcha Canonicalization functions (e.g., `urlcanon.whatwg()`) modify the `ParsedUrl` object in place. They do not return a new, canonicalized object.
- gotcha There are known differences in features and behavior between the Python and Java versions of urlcanon.
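The in-place behavior noted above can be worked around by parsing twice and canonicalizing only one of the results. The following is a minimal sketch under stated assumptions: `canonicalized_copy` is a hypothetical helper, and `FakeParsed` and `fake_canonicalize` are stand-ins that mimic a mutable parsed object so the snippet runs without urlcanon installed.

```python
def canonicalized_copy(url, parse, canonicalize):
    """Return (original, canonical) as two independent parsed objects.

    `parse` and `canonicalize` are injected so the helper stays
    library-agnostic; with urlcanon they would presumably be
    urlcanon.parse_url and urlcanon.whatwg.
    """
    original = parse(url)
    canonical = parse(url)    # second parse: a separate object
    canonicalize(canonical)   # mutates `canonical` only
    return original, canonical


class FakeParsed:
    """Stand-in for a mutable parsed-URL object (not part of urlcanon)."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text


def fake_canonicalize(parsed):
    """Stand-in canonicalizer: edits the object in place, returns nothing."""
    parsed.text = parsed.text.lower()


original, canonical = canonicalized_copy(
    "HTTP://EXAMPLE.com/", FakeParsed, fake_canonicalize
)
print(f"original:  {original}")
print(f"canonical: {canonical}")
```

Because the helper never relies on the canonicalizer's return value, it behaves the same whether or not a future version of the library changes what the call returns.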
Install
- pip install urlcanon
Imports
- parse_url
import urlcanon
parsed_url = urlcanon.parse_url(input_url)
- whatwg
import urlcanon
urlcanon.whatwg(parsed_url)
- ParsedUrl
from urlcanon.parse import ParsedUrl
- MatchRule
from urlcanon import MatchRule
Quickstart
import urlcanon
input_url = "http://///EXAMPLE.com:80/foo/../bar"
# Parse the URL. This preserves the original input string.
parsed_url = urlcanon.parse_url(input_url)
print(f"Original parsed URL: {parsed_url}")
# Apply WHATWG canonicalization rules (modifies parsed_url in place)
urlcanon.whatwg(parsed_url)
print(f"Canonicalized URL (WHATWG): {parsed_url}")
# Get the SSURT representation (suitable for sorting/prefix-matching)
ssurt_representation = parsed_url.ssurt()
print(f"SSURT: {ssurt_representation}")
# Example of using a MatchRule
rule = urlcanon.MatchRule(ssurt=b'com,example,//:http/bar')
url_to_check = b'HTtp:////eXAMple.Com/bar//baz//..///quu'
applies = urlcanon.whatwg.rule_applies(rule, url_to_check)
print(f"Does rule apply to '{url_to_check.decode()}': {applies}")
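The Quickstart describes SSURT as suitable for sorting and prefix matching. To illustrate why a host-reversed key has that property, here is a simplified sketch that builds an SSURT-like string with the standard library. It is not the library's implementation: `ssurt_like_key` is a hypothetical name, it handles only scheme, host, and path, and it ignores ports, userinfo, queries, fragments, and IP-address hosts.

```python
from urllib.parse import urlsplit


def ssurt_like_key(url):
    """Build a simplified SSURT-like key: reversed host labels, then scheme, then path."""
    parts = urlsplit(url)
    labels = parts.hostname.split(".")
    # "example.com" becomes "com,example," so related hosts share a prefix
    reversed_host = "".join(label + "," for label in reversed(labels))
    return f"{reversed_host}//:{parts.scheme}{parts.path}"


urls = [
    "http://example.com/bar/baz",
    "http://example.org/",
    "http://www.example.com/",
    "http://example.com/bar",
]
keys = sorted(ssurt_like_key(u) for u in urls)

# Sorting places everything under example.com next to each other
for key in keys:
    print(key)

# A rule for "example.com/bar and below" reduces to a plain prefix test
prefix = ssurt_like_key("http://example.com/bar")
matches = [key for key in keys if key.startswith(prefix)]
print(matches)
```

In this sketch the key for `http://example.com/bar` has the same text as the `ssurt` value passed to `MatchRule` in the Quickstart, which is what lets a rule match a URL and everything beneath it with a single prefix comparison.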