Courlan

1.3.2 · active · verified Thu Apr 09

Courlan (version 1.3.2) is a Python library for cleaning, filtering, normalizing, and sampling URLs, streamlining data collection workflows. It includes heuristics for spam detection, content-type filtering, and language identification based on URL components. The library is actively developed, with minor releases typically every few months.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates basic URL cleaning, link extraction from HTML, and the `UrlStore` class for managing visited and unvisited URLs. Link extraction uses the page's own URL (the `url` parameter) to resolve relative links, and `UrlStore` provides efficient, per-domain tracking for web crawling applications.

from courlan import clean_url, extract_links, UrlStore

# Example 1: Clean a URL
raw_url = 'http://www.Example.com/path/?query=value#fragment'
cleaned_url = clean_url(raw_url)
print(f"Cleaned URL: {cleaned_url}")

# Example 2: Extract links from HTML
html_content = """
<html>
<body>
    <a href="/relative/path">Relative Link</a>
    <a href="https://example.org/absolute">Absolute Link</a>
    <a href="http://invalid.com?utm_source=foo">Tracker Link</a>
</body>
</html>
"""

extracted_links = extract_links(
    html_content,
    url='https://example.com/base',  # page URL, used to resolve relative links
    external_bool=False,             # False: internal links; True: external links
)
# extract_links() returns a set of link strings
print(f"Extracted links: {sorted(extracted_links)}")

# Example 3: Using UrlStore
store = UrlStore()
store.add_urls(['https://example.com/page1', 'https://example.org/page2'])

print(f"URLs in store: {store.total_url_number}")

# Retrieve the next URL for a given host (marked as visited by default)
next_url = store.get_url('https://example.com')

unvisited_urls = store.find_unvisited_urls('https://example.com')
print(f"Unvisited URLs: {unvisited_urls}")
