{"id":1972,"library":"courlan","title":"Courlan","description":"Courlan (version 1.3.2) is a Python library for cleaning, filtering, normalizing, and sampling URLs to make web data collection more efficient. It provides heuristics that filter links by spam, content type, and language. The project is actively developed, with minor releases typically every few months.","status":"active","version":"1.3.2","language":"en","source_language":"en","source_url":"https://github.com/adbar/courlan","tags":["url-processing","web-scraping","data-cleaning","link-extraction","url-filtering"],"install":[{"cmd":"pip install courlan","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Provides locale data used by the language-based URL filters.","package":"babel","optional":false},{"reason":"Extracts domain and top-level-domain information from URLs.","package":"tld","optional":false},{"reason":"Used for network checks such as redirect resolution.","package":"urllib3","optional":false},{"reason":"Standard-library module that enables bzip2 compression for UrlStore; missing from some minimal Python builds.","package":"bz2","optional":true},{"reason":"Enables zlib compression for UrlStore if desired. 
Part of the standard library, though some minimal Python builds omit it.","package":"zlib","optional":true}],"imports":[{"symbol":"clean_url","correct":"from courlan import clean_url"},{"symbol":"extract_links","correct":"from courlan import extract_links"},{"symbol":"UrlStore","correct":"from courlan import UrlStore"}],"quickstart":{"code":"from courlan import clean_url, extract_links, UrlStore\n\n# Example 1: Clean a URL\nraw_url = 'http://www.Example.com/path/?query=value#fragment'\ncleaned_url = clean_url(raw_url)\nprint(f\"Cleaned URL: {cleaned_url}\")\n\n# Example 2: Extract links from HTML\nhtml_content = \"\"\"\n<html>\n<body>\n    <a href=\"/relative/path\">Relative Link</a>\n    <a href=\"https://example.org/absolute\">Absolute Link</a>\n    <a href=\"http://invalid.com?utm_source=foo\">Tracker Link</a>\n</body>\n</html>\n\"\"\"\n\n# `url` is the address of the page and is used to resolve relative links.\n# extract_links() returns a set of URL strings; with the default\n# external_bool=False only links internal to that site are kept.\nextracted_links = extract_links(html_content, url='https://example.com/base')\nprint(f\"Extracted links: {sorted(extracted_links)}\")\n\n# Example 3: Using UrlStore\nstore = UrlStore()\nstore.add_urls(['https://example.com/page1', 'https://example.org/page2'])\n\nprint(f\"URLs in store: {store.total_url_number()}\")\n\n# get_url() returns the next unvisited URL for a host and marks it as visited\nvisited_url = store.get_url('https://example.com')\nprint(f\"Visited: {visited_url}\")\n\nunvisited_urls = store.find_unvisited_urls('https://example.org')\nprint(f\"Unvisited URLs: {unvisited_urls}\")","lang":"python","description":"This quickstart demonstrates basic URL cleaning, link extraction from HTML, and use of `UrlStore` to manage visited and unvisited URLs. Link extraction benefits from passing the page address as `url` so that relative links can be resolved, and `UrlStore` tracks known and visited URLs per host for web crawling applications."},"warnings":[{"fix":"Upgrade your Python environment to version 3.8 or higher.","message":"Python 3.6 and 3.7 support was officially dropped with version 1.3.0. 
Users on these older Python versions must upgrade to Python 3.8 or newer to use courlan 1.3.0 and later.","severity":"breaking","affected_versions":">=1.3.0"},{"fix":"For `UrlStore.get_download_urls()`, remove the `timelimit` parameter. For other `UrlStore` methods, rename `timelimit` to `time_limit`.","message":"The `timelimit` parameter was removed from the `UrlStore.get_download_urls()` method in version 1.3.2. For other `UrlStore` methods, the parameter was renamed from `timelimit` to `time_limit` in version 1.1.0, and the old name was deprecated in 1.2.0.","severity":"breaking","affected_versions":">=1.3.2 for `get_download_urls()`, >=1.1.0 for other `UrlStore` methods"},{"fix":"Pass the address of the page through the `url` parameter of `extract_links()` so that relative links can be resolved against it. Note that `clean_url()` does not accept a base URL, so it cannot resolve relative links on its own. Consult the latest documentation for alternatives as the parameter's removal approaches.","message":"The `base_url` parameter in `extract_links()` was deprecated in version 1.3.1 and is scheduled for removal. It still works for now, but new code should avoid it.","severity":"deprecated","affected_versions":">=1.3.1"},{"fix":"Use a Python build that includes the standard-library `bz2` and `zlib` modules. When compiling Python from source, this usually means having the bzip2 and zlib development libraries installed beforehand.","message":"Starting with version 1.3.1, `UrlStore` compression using `bz2` or `zlib` is optional. Both are Python standard-library modules that some minimal Python builds omit. If you request compression when the corresponding module is unavailable, `UrlStore` may raise an error or fall back to uncompressed storage.","severity":"gotcha","affected_versions":">=1.3.1"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}