Crawlee
Crawlee is a Python library for web scraping and browser automation. It provides a robust, flexible framework with features such as automatic parallel crawling, proxy rotation, session management, and persistent storage, and it supports techniques ranging from static HTML parsing with BeautifulSoup and Parsel to dynamic JavaScript-rendered content with Playwright. Currently at version 1.6.2, Crawlee for Python has been stable since its v1.0 release in September 2025 and follows semantic versioning, so breaking changes are reserved for major releases.
Warnings
- breaking Crawlee v1.0 (September 2025) introduced significant breaking changes, including a new unified storage client system, `HttpResponse.read` becoming an async method, removal of `Request.id`, and changes to internal HTTP client defaults (HttpxHttpClient replaced by ImpitHttpClient).
- breaking Breaking changes in v0.6 (March 2025) refactored class names by removing the 'Base' prefix from abstract classes and merged `HttpCrawlerOptions` into `BasicCrawlerOptions`. The `Session.cookies` attribute was changed from a simple dictionary to a `SessionCookies` class for multi-domain support, and `PlaywrightCrawler` now uses a persistent browser context by default.
- gotcha When using `PlaywrightCrawler`, you must separately install the necessary browser binaries using `playwright install`. Installing `crawlee[playwright]` only installs the Python Playwright library, not the browsers themselves.
- gotcha Crawlee is built on `asyncio`. All crawler operations (e.g., `crawler.run()`, `context.push_data()`, `context.enqueue_links()`, and Playwright API calls like `context.page.title()`) are coroutines and *must* be `await`ed inside `async` functions. A coroutine that is never awaited simply does not execute, typically producing only a "coroutine was never awaited" warning rather than an explicit error.
- gotcha To use crawlers like `BeautifulSoupCrawler`, `ParselCrawler`, or `PlaywrightCrawler`, you need to install Crawlee with the corresponding optional extras (e.g., `crawlee[beautifulsoup]`, `crawlee[parsel]`, or `crawlee[playwright]`). A basic `pip install crawlee` only provides the core framework.
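The await gotcha above can be illustrated without Crawlee itself. In this sketch, `push_data` is a hypothetical stand-in for an async crawler operation such as `context.push_data()`:

```python
import asyncio

results: list[dict] = []

async def push_data(item: dict) -> None:
    # Stand-in for an async crawler operation such as context.push_data().
    await asyncio.sleep(0)
    results.append(item)

async def main() -> None:
    # Correct: awaiting the coroutine actually executes it.
    await push_data({'url': 'https://example.com'})
    # Wrong: push_data({...}) without await would create a coroutine object
    # that never runs (Python warns: "coroutine ... was never awaited").

asyncio.run(main())
```

The same rule applies to every handler you register on a Crawlee router: the handler itself must be `async def`, and every Crawlee call inside it must be awaited.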
Install
- pip install crawlee
- pip install 'crawlee[all]'
- playwright install
Imports
- BeautifulSoupCrawler
from crawlee.crawlers import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
- ParselCrawler
from crawlee.crawlers import ParselCrawler, ParselCrawlingContext
- PlaywrightCrawler
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext
- AdaptivePlaywrightCrawler
from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext
- SqlStorageClient
from crawlee.storage_clients import SqlStorageClient
- ImpitHttpClient
from crawlee.http_clients import ImpitHttpClient
Quickstart
import asyncio

from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Initialize the PlaywrightCrawler.
    # For a real project, consider using persistent storage via the storage_client argument.
    crawler = PlaywrightCrawler(
        max_requests_per_crawl=10,  # Limit for demo purposes.
        headless=True,  # Run in headless mode (set to False to see the browser).
        browser_type='chromium',  # Use the Chromium browser.
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page using the Playwright API.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
            'content_length': len(await context.page.content()),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Extract all links on the page and enqueue them for further crawling.
        await context.enqueue_links()

    # Run the crawler with an initial list of URLs.
    await crawler.run(['https://crawlee.dev'])

    # Optionally, export the entire dataset to a file (e.g., CSV, JSON).
    await crawler.export_data('crawlee_results.json')
    crawler.log.info('Crawling finished. Data exported to crawlee_results.json')


if __name__ == '__main__':
    asyncio.run(main())