{"id":1781,"library":"w3lib","title":"w3lib","description":"w3lib is a Python library offering a collection of web-related utility functions, commonly used in web scraping and data processing contexts. It provides tools for URL manipulation, HTML cleaning, HTTP header parsing, and more. The current version is 2.4.1, and it typically releases new versions every few months, often tied to Python version support updates or minor bug fixes/improvements.","status":"active","version":"2.4.1","language":"en","source_language":"en","source_url":"https://github.com/scrapy/w3lib","tags":["web scraping","utilities","url parsing","html parsing","http","scrapy"],"install":[{"cmd":"pip install w3lib","lang":"bash","label":"Install latest version"}],"dependencies":[],"imports":[{"symbol":"canonicalize_url","correct":"from w3lib.url import canonicalize_url"},{"symbol":"remove_tags","correct":"from w3lib.html import remove_tags"},{"symbol":"headers_raw_to_dict","correct":"from w3lib.http import headers_raw_to_dict"},{"note":"Functions `str_to_unicode`, `to_native_str`, `unicode_to_str` were removed in v2.3.0. Use standard Python string encoding/decoding methods instead.","wrong":"from w3lib.util import str_to_unicode","symbol":"str_to_unicode","correct":"w3lib.util.str_to_unicode is removed in v2.3.0+"}],"quickstart":{"code":"from w3lib.url import canonicalize_url\nfrom w3lib.html import remove_tags\nfrom w3lib.http import headers_raw_to_dict\n\n# URL canonicalization\nurl = \"http://example.com/path/../foo.html?a=1&b=2#frag\"\ncanonical_url = canonicalize_url(url)\nprint(f\"Canonical URL: {canonical_url}\")\n\n# HTML tag removal\nhtml_content = \"<div>Hello <b>world</b>!</div>\"\nclean_text = remove_tags(html_content)\nprint(f\"Clean text: {clean_text}\")\n\n# HTTP headers parsing\nraw_headers = b\"Content-Type: text/html\\r\\nUser-Agent: my-spider/1.0\"\nparsed_headers = headers_raw_to_dict(raw_headers)\nprint(f\"Parsed Headers: {parsed_headers}\")","lang":"python","description":"This quickstart demonstrates basic usage of w3lib for URL canonicalization, HTML tag removal, and HTTP header parsing."},"warnings":[{"fix":"Upgrade your Python environment to 3.6 or later. For current versions (2.4.x), Python 3.10+ is required.","message":"Python 2 support was dropped in v2.0.0, which requires Python 3.6+.","severity":"breaking","affected_versions":">=2.0.0"},{"fix":"Upgrade your Python environment to Python 3.10+ for w3lib versions 2.4.x and newer.","message":"Several Python versions have had their support dropped in recent releases (e.g., 3.7 in v2.2.0, 3.8 in v2.3.0, 3.9 in v2.4.0). Ensure your Python environment meets the minimum requirement for the installed w3lib version.","severity":"breaking","affected_versions":">=2.2.0"},{"fix":"Replace calls to these functions with standard Python string encoding and decoding methods (e.g., `.encode()` and `.decode()`).","message":"The utility functions `w3lib.util.str_to_unicode`, `w3lib.util.to_native_str`, and `w3lib.util.unicode_to_str` were removed in v2.3.0. These were deprecated in v2.0.0.","severity":"breaking","affected_versions":">=2.3.0"},{"fix":"Review your application's reliance on the exact output of these functions, especially if URL fingerprinting or strict URL comparisons are critical. Test thoroughly after upgrading.","message":"The behavior of `w3lib.url.canonicalize_url` and `w3lib.url.safe_url_string` has changed regarding how they handle `%23` and userinfo components, potentially affecting URL fingerprinting or comparisons.","severity":"gotcha","affected_versions":">=2.0.1, >=2.2.1"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}