w3lib

2.4.1 · active · verified Thu Apr 09

w3lib is a Python library of web-related utility functions, commonly used in web scraping and data processing. It provides tools for URL manipulation, HTML cleaning, HTTP header parsing, and more. The current version is 2.4.1; new versions typically appear every few months, often to update Python version support or deliver minor bug fixes and improvements.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates basic usage of w3lib for URL canonicalization, HTML tag removal, and HTTP header parsing.

from w3lib.url import canonicalize_url
from w3lib.html import remove_tags
from w3lib.http import headers_raw_to_dict

# URL canonicalization (by default, sorts query arguments and drops the fragment)
url = "http://example.com/path/../foo.html?a=1&b=2#frag"
canonical_url = canonicalize_url(url)
print(f"Canonical URL: {canonical_url}")

# HTML tag removal
html_content = "<div>Hello <b>world</b>!</div>"
clean_text = remove_tags(html_content)
print(f"Clean text: {clean_text}")

# HTTP header parsing (input is raw bytes; the result maps header names to lists of values)
raw_headers = b"Content-Type: text/html\r\nUser-Agent: my-spider/1.0"
parsed_headers = headers_raw_to_dict(raw_headers)
print(f"Parsed Headers: {parsed_headers}")
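To give a feel for what URL canonicalization involves, here is a rough standard-library sketch of two of its steps: sorting query arguments and dropping the fragment. It is not w3lib's implementation. The helper name sketch_canonicalize and its keep_fragment parameter are hypothetical, and the real canonicalize_url does more (for example, percent-encoding normalization).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def sketch_canonicalize(url: str, keep_fragment: bool = False) -> str:
    """Hypothetical helper, not part of w3lib: a simplified canonical form."""
    parts = urlsplit(url)
    # Sort query arguments by name, then by value, keeping blank values.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs)
    # Drop the fragment unless the caller asks to keep it.
    fragment = parts.fragment if keep_fragment else ""
    # Lowercase the host and use "/" when the path is empty.
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path or "/", query, fragment)
    )


print(sketch_canonicalize("http://example.com/do?c=3&b=5&b=2&a=50#frag"))
# http://example.com/do?a=50&b=2&b=5&c=3
```

Two URLs that differ only in argument order or fragment map to the same string under this sketch, which is the property that makes canonical URLs useful for deduplication in a crawler.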
