w3lib
w3lib is a Python library offering a collection of web-related utility functions, commonly used in web scraping and data processing contexts. It provides tools for URL manipulation, HTML cleaning, HTTP header parsing, and more. The current version is 2.4.1, and it typically releases new versions every few months, often tied to Python version support updates or minor bug fixes/improvements.
Warnings
- breaking Python 2 support was dropped in v2.0.0. Users must migrate to Python 3.6 or later.
- breaking Recent releases have each dropped support for an older Python version (3.7 in v2.2.0, 3.8 in v2.3.0, 3.9 in v2.4.0). Ensure your Python environment meets the minimum requirement for the installed w3lib version.
- breaking The utility functions `w3lib.util.str_to_unicode`, `w3lib.util.to_native_str`, and `w3lib.util.unicode_to_str` were removed in v2.3.0. These were deprecated in v2.0.0.
- gotcha The behavior of `w3lib.url.canonicalize_url` and `w3lib.url.safe_url_string` has changed regarding how they handle `%23` and userinfo components, potentially affecting URL fingerprinting or comparisons.
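The version warnings above can be turned into a small pre-flight check. The following is a minimal sketch, not part of w3lib: the table is built only from the releases named in the warnings, and the helper names (`parse_version`, `min_python_for`, `is_supported`) are hypothetical.

```python
import sys

# Minimum Python per w3lib release, taken only from the warnings above.
# Releases between two entries inherit the earlier floor, which may be too
# permissive (2.1.x, for example, is not covered by the warnings).
MIN_PYTHON_BY_W3LIB = [
    ((2, 0, 0), (3, 6)),   # 2.0 series: Python 2 dropped, 3.6+ required
    ((2, 2, 0), (3, 8)),   # 3.7 dropped
    ((2, 3, 0), (3, 9)),   # 3.8 dropped
    ((2, 4, 0), (3, 10)),  # 3.9 dropped
]


def parse_version(text):
    """Turn '2.4.1' into (2, 4, 1); assumes plain numeric release versions."""
    parts = [int(part) for part in text.split(".")[:3]]
    parts += [0] * (3 - len(parts))  # pad '2.4' to (2, 4, 0)
    return tuple(parts)


def min_python_for(w3lib_version):
    """Return the lowest supported (major, minor), or None for w3lib 1.x."""
    wanted = parse_version(w3lib_version)
    floor = None
    for release, python_floor in MIN_PYTHON_BY_W3LIB:
        if wanted >= release:
            floor = python_floor
    return floor


def is_supported(w3lib_version, python_version=sys.version_info[:2]):
    """True if the given Python version meets the floor for that release."""
    floor = min_python_for(w3lib_version)
    return floor is not None and tuple(python_version) >= floor
```

The installed version string can be obtained with `importlib.metadata.version("w3lib")` from the standard library and passed to `is_supported`.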
Install
pip install w3lib
Imports
- canonicalize_url
from w3lib.url import canonicalize_url
- remove_tags
from w3lib.html import remove_tags
- headers_raw_to_dict
from w3lib.http import headers_raw_to_dict
- str_to_unicode
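Not importable on current versions: w3lib.util.str_to_unicode was removed in v2.3.0 (deprecated since v2.0.0). w3lib.util.to_unicode covers the same bytes-to-text conversion.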
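Code that still calls the removed `w3lib.util` helpers needs a replacement for the bytes/text conversions they performed. The sketch below uses only the standard library; `to_text` and `to_raw` are hypothetical names chosen for illustration, and `w3lib.util.to_unicode` and `w3lib.util.to_bytes` are, to my knowledge, the library's own equivalents.

```python
def to_text(value, encoding="utf-8", errors="strict"):
    """Return value as str, decoding bytes; other types are rejected."""
    if isinstance(value, str):
        return value
    if isinstance(value, (bytes, bytearray)):
        return bytes(value).decode(encoding, errors)
    raise TypeError(f"to_text expects bytes or str, got {type(value).__name__}")


def to_raw(value, encoding="utf-8", errors="strict"):
    """Return value as bytes, encoding str; other types are rejected."""
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        return value.encode(encoding, errors)
    raise TypeError(f"to_raw expects bytes or str, got {type(value).__name__}")


# Round trip: bytes in, text out, bytes back.
text = to_text(b"caf\xc3\xa9")
raw = to_raw(text)
```

Keeping the conversion in one small helper makes it easy to swap in the library functions later without touching call sites.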
Quickstart
from w3lib.url import canonicalize_url
from w3lib.html import remove_tags
from w3lib.http import headers_raw_to_dict
# URL canonicalization: sorts query arguments and, by default, drops the fragment
url = "http://example.com/path/../foo.html?a=1&b=2#frag"
canonical_url = canonicalize_url(url)
print(f"Canonical URL: {canonical_url}")
# HTML tag removal
html_content = "<div>Hello <b>world</b>!</div>"
clean_text = remove_tags(html_content)
print(f"Clean text: {clean_text}")
# HTTP header parsing: takes raw bytes, returns a dict of bytes names to lists of bytes values
raw_headers = b"Content-Type: text/html\r\nUser-Agent: my-spider/1.0"
parsed_headers = headers_raw_to_dict(raw_headers)
print(f"Parsed Headers: {parsed_headers}")
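`headers_raw_to_dict` returns bytes keys mapped to lists of bytes values, which is awkward for lookups by header name. The following sketch shows one way to decode such a dict into text with lowercased names. `decode_headers` is a hypothetical helper, not a w3lib function, and `sample` is hard-coded to the shape the quickstart call is expected to produce so that the block runs without w3lib installed.

```python
def decode_headers(parsed, encoding="latin-1"):
    """Convert {bytes: [bytes, ...]} into {str: [str, ...]}, lowercasing names."""
    decoded = {}
    for name, values in parsed.items():
        key = name.decode(encoding).lower()
        # Names differing only in case are merged into one entry.
        decoded.setdefault(key, []).extend(v.decode(encoding) for v in values)
    return decoded


# Assumed shape of the quickstart's parsed_headers value.
sample = {b"Content-Type": [b"text/html"], b"User-Agent": [b"my-spider/1.0"]}
headers = decode_headers(sample)
print(headers["content-type"][0])
```

Latin-1 is used here because it can decode any byte sequence without raising; pass a different `encoding` if the headers are known to use one.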