{"id":7842,"library":"urlcanon","title":"URL Canon","description":"urlcanon is a URL canonicalization and normalization library for Python and Java, currently at version 0.3.1. It provides a URL parser that preserves input bytes, a predefined set of canonicalization rules aiming to match browser parsing behavior, and an alternative URL serialization format called SSURT. The library is stable and in production use, though API and output stability are not yet guaranteed, and feature sets differ between its Python and Java implementations. It does not have a strict release cadence but updates as needed.","status":"active","version":"0.3.1","language":"en","source_language":"en","source_url":"https://github.com/iipc/urlcanon","tags":["url canonicalization","url normalization","web scraping","web archiving","url parsing"],"install":[{"cmd":"pip install urlcanon","lang":"bash","label":"Install with pip"}],"dependencies":[{"reason":"Used for handling internationalized domain names (IDN) in URL processing.","package":"idna","optional":false}],"imports":[{"symbol":"parse_url","correct":"import urlcanon\nparsed_url = urlcanon.parse_url(input_url)"},{"note":"whatwg is a canonicalizer function that modifies the ParsedUrl object in place.","symbol":"whatwg","correct":"import urlcanon\nurlcanon.whatwg(parsed_url)"},{"symbol":"ParsedUrl","correct":"from urlcanon.parse import ParsedUrl"},{"symbol":"MatchRule","correct":"from urlcanon import MatchRule"}],"quickstart":{"code":"import urlcanon\n\ninput_url = \"http://///EXAMPLE.com:80/foo/../bar\"\n\n# Parse the URL. This preserves the original input string.\nparsed_url = urlcanon.parse_url(input_url)\nprint(f\"Original parsed URL: {parsed_url}\")\n\n# Apply WHATWG canonicalization rules (modifies parsed_url in place)\nurlcanon.whatwg(parsed_url)\nprint(f\"Canonicalized URL (WHATWG): {parsed_url}\")\n\n# Get the SSURT representation (suitable for sorting/prefix-matching)\nssurt_representation = parsed_url.ssurt()\nprint(f\"SSURT: {ssurt_representation}\")\n\n# Example of using a MatchRule\nrule = urlcanon.MatchRule(ssurt=b'com,example,//:http/bar')\nurl_to_check = b'HTtp:////eXAMple.Com/bar//baz//..///quu'\napplies = urlcanon.whatwg.rule_applies(rule, url_to_check)\nprint(f\"Does rule apply to '{url_to_check.decode()}': {applies}\")","lang":"python","description":"Demonstrates parsing a URL, applying WHATWG canonicalization rules, obtaining the SSURT representation, and checking a URL against a MatchRule. Note that canonicalization functions like `whatwg` modify the `ParsedUrl` object in place."},"warnings":[{"fix":"Always pin to a specific patch version (e.g., `urlcanon==0.3.1`) and thoroughly test when upgrading.","message":"The library explicitly states 'no API or output stability guarantees yet'. Future versions may introduce breaking changes without prior notice.","severity":"breaking","affected_versions":"0.3.1 and earlier"},{"fix":"If you need to preserve the original `ParsedUrl` object, create a copy before applying canonicalization: `modified_url = urlcanon.parse_url(str(original_url))`.","message":"Canonicalization functions (e.g., `urlcanon.whatwg()`) modify the `ParsedUrl` object in place. They do not return a new, canonicalized object.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Do not assume feature parity between Python and Java implementations. Refer to the specific language's documentation or source code for exact behavior.","message":"There are known differences in features and behavior between the Python and Java versions of urlcanon.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Explicitly create a copy of the `ParsedUrl` object before canonicalizing if you need to retain the original. Example: `original_url = urlcanon.parse_url('...'); canonicalized_url = urlcanon.parse_url(str(original_url)); urlcanon.whatwg(canonicalized_url)`.","cause":"Canonicalization methods like `urlcanon.whatwg()` perform in-place modification on the `ParsedUrl` object rather than returning a new instance.","error":"My original ParsedUrl object changed after calling a canonicalization function, but I expected a new object."},{"fix":"Review the specific canonicalization rules implemented by `urlcanon` (e.g., by inspecting the source for `whatwg`). Test with a variety of URLs to understand its behavior. If necessary, implement custom canonicalization steps using `urlcanon`'s components.","cause":"The `urlcanon` library implements a specific set of canonicalization rules (e.g., WHATWG standard). Your expectations might differ or you might be encountering behavior specific to edge cases not covered by the default rules.","error":"The canonicalized URL doesn't match my expectation (e.g., specific path segments or casing were not changed)."}]}