Courlan
Courlan (version 1.3.2) is a Python library that cleans, filters, and samples URLs to streamline data collection workflows. It includes heuristics for spam detection, content-type filtering, and language identification. Development is active, with minor releases typically every few months.
Warnings
- breaking Support for Python 3.6 and 3.7 was officially dropped in version 1.3.0. Users on these versions must upgrade to Python 3.8 or newer to use courlan 1.3.0 and later.
- breaking The `timelimit` parameter was entirely removed from the `UrlStore.get_download_urls()` method in version 1.3.2. For other `UrlStore` methods, the parameter was renamed from `timelimit` to `time_limit` in version 1.1.0, with the old name being deprecated in 1.2.0.
- deprecated The `base_url` parameter in `extract_links()` was deprecated in version 1.3.1 and is scheduled for removal. It still works for now, but avoid it in new code and pass the full `url` instead.
- gotcha Starting with version 1.3.1, `UrlStore` compression using `bz2` or `zlib` is optional. Both are standard-library modules (not third-party packages), but they can be absent from minimal Python builds; if the required module is unavailable, `UrlStore` will raise an error or fall back to storing URLs uncompressed.
Install
-
pip install courlan
Imports
- clean_url
from courlan import clean_url
- extract_links
from courlan import extract_links
- UrlStore
from courlan import UrlStore
Quickstart
from courlan import clean_url, extract_links, UrlStore
# Example 1: Clean a URL
raw_url = 'http://www.Example.com/path/?query=value#fragment'
cleaned_url = clean_url(raw_url)
print(f"Cleaned URL: {cleaned_url}")
# Example 2: Extract links from raw HTML
html_content = """
<html>
<body>
<a href="/relative/path">Relative Link</a>
<a href="https://example.org/absolute">Absolute Link</a>
<a href="http://invalid.com?utm_source=foo">Tracker Link</a>
</body>
</html>
"""
extracted_links = extract_links(
    html_content,
    url='https://example.com/base',  # resolves relative links against this URL
    external_bool=False,  # keep only links on the same site as url
    strict=True,
)
# extract_links returns a set of cleaned, absolute URLs
print(f"Extracted links: {sorted(extracted_links)}")
# Example 3: Using UrlStore
store = UrlStore()
store.add_urls(['https://example.com/page1', 'https://example.org/page2'])
print(f"URLs in store: {store.total_url_number()}")
# get_url() returns an unvisited URL for the given website and marks it as visited
next_url = store.get_url('https://example.com')
unvisited_urls = store.find_unvisited_urls('https://example.com')
print(f"Unvisited URLs for example.com: {unvisited_urls}")