Courlan
Courlan (version 1.3.2) is a Python library designed to clean, filter, and sample URLs, optimizing data collection workflows. It includes features for spam detection, content type filtering, and language identification. The library maintains an active development pace with minor releases typically every few months.
Common errors
- ModuleNotFoundError: No module named 'courlan.core'
  cause: Importing internal submodules directly (e.g. `courlan.core`) that are not exposed as public import paths.
  fix: The commonly used functions are available directly from the top-level `courlan` package. Import them from there, e.g. `from courlan import get_base_url`, or check the documentation for the correct import path rather than guessing a submodule.
- AttributeError: module 'courlan' has no attribute 'filter_url'
  cause: Calling a function (such as `filter_url`) that does not exist, has been renamed, or belongs to a different object or class within the library.
  fix: Consult the `courlan` documentation or source code to confirm the correct function name and location; for link filtering, `courlan` provides `filter_links()`.
- courlan cache clear
  cause: Not a literal error message: users search for this when `courlan` behaves unexpectedly, such as processing URLs inconsistently or not applying new filtering rules, often due to stale internal caches.
  fix: Reset the internal caches with `from courlan.meta import clear_caches; clear_caches()`.
Warnings
- breaking Python 3.6 and 3.7 support was officially dropped with version 1.3.0. Users on these older Python versions must upgrade to Python 3.8 or newer to use courlan 1.3.0 and later.
- breaking The `timelimit` parameter was entirely removed from the `UrlStore.get_download_urls()` method in version 1.3.2. For other `UrlStore` methods, the parameter was renamed from `timelimit` to `time_limit` in version 1.1.0, with the old name being deprecated in 1.2.0.
- deprecated The `base_url` parameter in `extract_links()` was deprecated in version 1.3.1 and is scheduled for removal. While it currently still works, it's advised to avoid its use.
- gotcha Starting with version 1.3.1, `UrlStore` compression using `bz2` or `zlib` is optional. These modules ship with Python's standard library but can be absent from minimal or custom builds; without them, `UrlStore` will raise an error or fall back to an uncompressed state.
Install
-
pip install courlan
Imports
- clean_url
from courlan import clean_url
- extract_links
from courlan import extract_links
- UrlStore
from courlan import UrlStore
Quickstart
from courlan import clean_url, extract_links, UrlStore
# Example 1: Clean a URL
raw_url = 'http://www.Example.com/path/?query=value#fragment'
cleaned_url = clean_url(raw_url)
print(f"Cleaned URL: {cleaned_url}")
# Example 2: Extract links from HTML
html_content = """
<html>
<body>
<a href="/relative/path">Relative Link</a>
<a href="https://example.org/absolute">Absolute Link</a>
<a href="http://invalid.com?utm_source=foo">Tracker Link</a>
</body>
</html>
"""
extracted_links = extract_links(
    html_content,
    url='https://example.com/base',
    strict=True,  # apply strict filtering rules
)
# extract_links returns a set of URLs
print(f"Extracted links: {sorted(extracted_links)}")
# Example 3: Using UrlStore
store = UrlStore()
store.add_urls(['https://example.com/page1', 'https://example.org/page2'])
print(f"URLs in store: {store.total_url_number()}")
# Retrieving a URL for download marks it as visited
url = store.get_url('https://example.com')
print(f"Now visited: {url}")
print(f"Unvisited URLs: {store.find_unvisited_urls('https://example.com')}")