Crawlee

1.6.2 · active · verified Tue Apr 14

Crawlee is a Python library for web scraping and browser automation that provides a unified framework for building reliable crawlers. It offers automatic parallel crawling, proxy rotation, session management, and persistent storage, and supports techniques ranging from static HTML parsing with BeautifulSoup or Parsel to dynamic, JavaScript-rendered content with Playwright. Currently at version 1.6.2, Crawlee for Python has been stable since its v1.0 release in September 2025 and follows semantic versioning, so breaking changes are reserved for major releases.

Warnings

Install
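The package is installed from PyPI; per the Crawlee documentation, optional extras pull in the dependencies for each crawler type (the extras names below are taken from those docs):

```shell
# Core library only
pip install crawlee

# With the Playwright-based crawler used in the quickstart,
# then download the browser binaries it drives
pip install 'crawlee[playwright]'
playwright install
```

For the static-HTML crawlers, the `crawlee[beautifulsoup]` or `crawlee[parsel]` extras can be used instead.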

Imports

Quickstart

This quickstart demonstrates how to use `PlaywrightCrawler` to recursively crawl a website, extract each page's title and content length, and store the results. It highlights Crawlee's asynchronous design: defining a request handler, enqueuing newly discovered links, and exporting the collected data. Remember to install the browser binaries separately with `playwright install` when using `PlaywrightCrawler`.

import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    # Initialize the PlaywrightCrawler.
    # For a real project, consider using persistent storage via storage_client argument.
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,  # Limit for demo purposes
        headless=True,  # Run in headless mode (set to False to see the browser)
        browser_type='chromium',  # Use the Chromium browser
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page using Playwright API.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
            'content_length': len(await context.page.content()),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Extract all links on the page and enqueue them for further crawling.
        await context.enqueue_links()

    # Run the crawler with an initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

    # Optionally, export the entire dataset to a file (e.g., CSV, JSON).
    await crawler.export_data('crawlee_results.json')
    crawler.log.info('Crawling finished. Data exported to crawlee_results.json')

if __name__ == '__main__':
    asyncio.run(main())
