URL Canon

0.3.1 · active · verified Thu Apr 16

urlcanon is a URL canonicalization and normalization library for Python and Java, currently at version 0.3.1. It provides a URL parser that preserves the input bytes, a set of predefined canonicalization rules that aim to match browser parsing behavior, and an alternative URL serialization format called SSURT. The library is used in production, but neither its API nor its output is guaranteed to remain stable, and the Python and Java implementations differ in feature set. Releases follow no fixed cadence and are published as needed.
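
To make "canonicalization" concrete, here is a minimal standard-library sketch of a few typical normalization steps: lowercasing the scheme and host, dropping a default port, and resolving dot segments. It is not urlcanon's rule set. `simple_canonicalize`, `remove_dot_segments`, and `DEFAULT_PORTS` are hypothetical names used only for illustration, and the sketch does not attempt browser-style recovery of malformed input such as extra slashes after the scheme.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: only these two default ports are handled.
DEFAULT_PORTS = {"http": 80, "https": 443}


def remove_dot_segments(path: str) -> str:
    """Resolve '.' and '..' segments in an absolute path."""
    segments = path.split("/")
    out = []
    for seg in segments:
        if seg == "..":
            # Never pop the leading empty segment that represents the root.
            if len(out) > 1:
                out.pop()
        elif seg != ".":
            out.append(seg)
    # A trailing '.' or '..' refers to a directory, so keep a trailing slash.
    if segments[-1] in (".", ".."):
        out.append("")
    return "/".join(out)


def simple_canonicalize(url: str) -> str:
    """Apply a small, illustrative subset of URL normalization steps."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Omit the port when it is absent or equals the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    path = remove_dot_segments(parts.path) or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


print(simple_canonicalize("HTTP://EXAMPLE.com:80/foo/../bar"))
# → http://example.com/bar
```

A browser-compatible canonicalizer has to cover many more cases (percent-encoding, IDNA hosts, IP address forms, stray slashes and whitespace), which is the gap a dedicated library is meant to fill.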

Quickstart

This example parses a URL, applies the WHATWG canonicalization rules, obtains the SSURT representation, and checks a URL against a `MatchRule`. Note that canonicalizers such as `whatwg` modify the `ParsedUrl` object in place.

import urlcanon

input_url = "http://///EXAMPLE.com:80/foo/../bar"

# Parse the URL. This preserves the original input string.
parsed_url = urlcanon.parse_url(input_url)
print(f"Original parsed URL: {parsed_url}")

# Apply WHATWG canonicalization rules (modifies parsed_url in place)
urlcanon.whatwg(parsed_url)
print(f"Canonicalized URL (WHATWG): {parsed_url}")

# Get the SSURT representation (suitable for sorting/prefix-matching)
ssurt_representation = parsed_url.ssurt()
print(f"SSURT: {ssurt_representation}")

# Example of using a MatchRule
rule = urlcanon.MatchRule(ssurt=b'com,example,//:http/bar')
url_to_check = b'HTtp:////eXAMple.Com/bar//baz//..///quu'
applies = urlcanon.whatwg.rule_applies(rule, url_to_check)
print(f"Does rule apply to '{url_to_check.decode()}': {applies}")
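
The reason an SSURT string suits sorting and prefix matching can be sketched with the standard library alone. The host labels are reversed so that the most general part (the TLD) comes first, which means all URLs under one domain share a common string prefix. This is a simplified illustration, not urlcanon's implementation: `ssurt_like` is a hypothetical helper, the input is assumed to be already canonicalized, userinfo, query, and fragment are ignored, and the position of the port is an assumption inferred from the `com,example,//:http/bar` form used above.

```python
from urllib.parse import urlsplit


def ssurt_like(url: str) -> str:
    """Build an SSURT-style key: reversed host, then port, scheme, and path."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    # "example.com" -> "com,example," (trailing comma terminates the host)
    reversed_host = ",".join(reversed(host.split("."))) + ","
    # Assumption: the port, if any, sits between "//" and ":scheme".
    port = "" if parts.port is None else str(parts.port)
    return f"{reversed_host}//{port}:{parts.scheme}{parts.path}"


rule_prefix = "com,example,//:http/bar"

print(ssurt_like("http://example.com/bar"))
# → com,example,//:http/bar

# A URL below /bar on the same host shares the rule's prefix.
print(ssurt_like("http://example.com/bar/baz/quu").startswith(rule_prefix))
# → True

# A different TLD diverges at the very first label.
print(ssurt_like("http://example.org/bar").startswith(rule_prefix))
# → False
```

Because the key starts with the reversed host, a plain lexicographic sort places every URL of a domain next to its subdomains, and a rule can be matched with a simple prefix comparison.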
