w3lib

2.4.1 · active · verified Thu Apr 09

w3lib is a Python library of web-related utility functions, commonly used in web scraping and data processing. It provides tools for URL manipulation, HTML cleaning, HTTP header parsing, and more. The current version is 2.4.1. New releases typically appear every few months, often to update Python version support or to ship minor bug fixes and improvements.

Warnings

breaking Python 2 support was dropped in v2.0.0, which requires Python 3.6+.
Fix: Migrate to Python 3. The 2.0.x series needs Python 3.6 or later; current versions (2.4.x) need Python 3.10+.
breaking Several Python versions have had their support dropped in recent releases (e.g., 3.7 in v2.2.0, 3.8 in v2.3.0, 3.9 in v2.4.0). Ensure your Python environment meets the minimum requirement for the installed w3lib version.
Fix: Upgrade your Python environment to Python 3.10+ for w3lib versions 2.4.x and newer.
breaking The utility functions `w3lib.util.str_to_unicode`, `w3lib.util.to_native_str`, and `w3lib.util.unicode_to_str` were removed in v2.3.0. These were deprecated in v2.0.0.
Fix: Replace calls to these functions with standard Python string encoding and decoding methods (e.g., `.encode()` and `.decode()`).
gotcha The behavior of `w3lib.url.canonicalize_url` and `w3lib.url.safe_url_string` has changed regarding how they handle `%23` and userinfo components, potentially affecting URL fingerprinting or comparisons.
Fix: Review your application's reliance on the exact output of these functions, especially if URL fingerprinting or strict URL comparisons are critical. Test thoroughly after upgrading.
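
One way to act on the last warning is to record the outputs your application depends on and compare them after an upgrade. The sketch below is a minimal, hypothetical helper: `find_changed_outputs`, `lowercase_normalizer`, and the sample data are illustrative and not part of w3lib. It accepts any URL-normalizing callable, so in real use you would pass `w3lib.url.canonicalize_url` or `w3lib.url.safe_url_string` instead of the stand-in.

```python
from typing import Callable, Dict, List, Tuple


def find_changed_outputs(
    normalize: Callable[[str], str],
    recorded: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (input_url, recorded_output, current_output) for every mismatch."""
    changed = []
    for url, old_output in recorded.items():
        new_output = normalize(url)
        if new_output != old_output:
            changed.append((url, old_output, new_output))
    return changed


# Stand-in normalizer so the sketch runs without w3lib installed.
# In real use: from w3lib.url import canonicalize_url, and pass that instead.
def lowercase_normalizer(url: str) -> str:
    return url.lower()


# Outputs recorded before the upgrade (made-up sample data).
recorded_outputs = {
    "HTTP://Example.com/A": "http://example.com/a",
    "http://example.com/%23x": "http://example.com/#x",
}

changes = find_changed_outputs(lowercase_normalizer, recorded_outputs)
for url, old, new in changes:
    print(f"{url}: {old!r} -> {new!r}")
```

Any URL reported here is one whose fingerprint or comparison key would differ between the two library versions.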

Install

`pip install w3lib` installs the latest version.
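
Because w3lib 2.4.x requires Python 3.10+ (see Warnings), it can help to check the interpreter before or after installing. The following is a rough sketch using only the standard library; the helper names `python_supports_w3lib_2_4` and `installed_w3lib_version` are made up for illustration.

```python
import sys
from importlib import metadata

# Minimum Python version for w3lib 2.4.x, as stated in the warnings above.
MIN_PYTHON_FOR_W3LIB_2_4 = (3, 10)


def python_supports_w3lib_2_4(version_info=sys.version_info) -> bool:
    """Return True if the given interpreter version is 3.10 or newer."""
    return tuple(version_info[:2]) >= MIN_PYTHON_FOR_W3LIB_2_4


def installed_w3lib_version():
    """Return the installed w3lib version string, or None if not installed."""
    try:
        return metadata.version("w3lib")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python OK for w3lib 2.4.x:", python_supports_w3lib_2_4())
    print("Installed w3lib:", installed_w3lib_version())
```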

Imports

canonicalize_url
```
from w3lib.url import canonicalize_url
```
remove_tags
```
from w3lib.html import remove_tags
```

headers_raw_to_dict

from w3lib.http import headers_raw_to_dict

str_to_unicode
```
# w3lib.util.str_to_unicode was removed in v2.3.0; this import no longer works:
# from w3lib.util import str_to_unicode
```
Functions `str_to_unicode`, `to_native_str`, `unicode_to_str` were removed in v2.3.0. Use standard Python string encoding/decoding methods instead.
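
As a possible migration path, the removed helpers can be replaced with thin wrappers around `.decode()` and `.encode()`. The names `ensure_text` and `ensure_bytes` below are hypothetical, and the UTF-8 default is an assumption; pick the encoding your data actually uses.

```python
def ensure_text(value, encoding="utf-8", errors="strict"):
    """Return value as str, decoding bytes with the given encoding."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors)
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


def ensure_bytes(value, encoding="utf-8", errors="strict"):
    """Return value as bytes, encoding str with the given encoding."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding, errors)
    raise TypeError(f"expected str or bytes, got {type(value).__name__}")


# Round trip: bytes -> str -> bytes
text = ensure_text(b"caf\xc3\xa9")
raw = ensure_bytes(text)
print(text, raw)
```

Where the input type is already known, calling `.decode()` or `.encode()` directly is simpler than a wrapper.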

Quickstart

This quickstart demonstrates basic usage of w3lib for URL canonicalization, HTML tag removal, and HTTP header parsing.

from w3lib.url import canonicalize_url
from w3lib.html import remove_tags
from w3lib.http import headers_raw_to_dict

# URL canonicalization
url = "http://example.com/path/../foo.html?a=1&b=2#frag"
canonical_url = canonicalize_url(url)
print(f"Canonical URL: {canonical_url}")

# HTML tag removal
html_content = "<div>Hello <b>world</b>!</div>"
clean_text = remove_tags(html_content)
print(f"Clean text: {clean_text}")

# HTTP headers parsing
raw_headers = b"Content-Type: text/html\r\nUser-Agent: my-spider/1.0"
parsed_headers = headers_raw_to_dict(raw_headers)
print(f"Parsed Headers: {parsed_headers}")
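
The parsed headers are easier to work with once converted to text. The sketch below assumes the parsed result maps bytes header names to lists of bytes values; the sample dictionary is hand-written to match that assumption rather than produced by w3lib, and `decode_headers` is a hypothetical helper.

```python
from typing import Dict, List


def decode_headers(
    headers: Dict[bytes, List[bytes]],
    encoding: str = "latin-1",
) -> Dict[str, List[str]]:
    """Decode header names and values from bytes to str."""
    return {
        name.decode(encoding): [value.decode(encoding) for value in values]
        for name, values in headers.items()
    }


# Hand-written sample in the assumed shape (bytes name -> list of bytes values).
sample_headers = {
    b"Content-Type": [b"text/html"],
    b"User-Agent": [b"my-spider/1.0"],
}

decoded = decode_headers(sample_headers)
print(decoded["Content-Type"][0])
```

Latin-1 is used as the default here because it can decode any byte sequence without raising; switch to another encoding if your headers are known to use one.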
