{"id":5889,"library":"crawlee","title":"Crawlee","description":"Crawlee is a Python library for web scraping and browser automation, providing a robust and flexible framework for building web scraping tasks. It offers features like automatic parallel crawling, proxy rotation, session management, and persistent storage, supporting various techniques from static HTML parsing with BeautifulSoup and Parsel to dynamic JavaScript-rendered content with Playwright. Currently at version 1.6.2, Crawlee for Python has been stable since its v1.0 release in September 2025 and follows semantic versioning, meaning breaking changes are reserved for major releases.","status":"active","version":"1.6.2","language":"en","source_language":"en","source_url":"https://github.com/apify/crawlee-python","tags":["web scraping","browser automation","asyncio","playwright","beautifulsoup","parsel","crawler","apify"],"install":[{"cmd":"pip install crawlee","lang":"bash","label":"Core functionality"},{"cmd":"pip install 'crawlee[all]'","lang":"bash","label":"With all optional features (BeautifulSoup, Parsel, Playwright, CLI)"},{"cmd":"playwright install","lang":"bash","label":"Install browser binaries for PlaywrightCrawler (if using Playwright)"}],"dependencies":[{"reason":"Required runtime environment.","package":"python","optional":false},{"reason":"Optional for BeautifulSoupCrawler for HTML parsing.","package":"beautifulsoup4","optional":true},{"reason":"Optional for ParselCrawler for HTML/XML parsing with CSS/XPath selectors.","package":"parsel","optional":true},{"reason":"Optional for PlaywrightCrawler for browser automation and JavaScript-rendered content.","package":"playwright","optional":true},{"reason":"Optional, for using the Crawlee CLI to create new projects.","package":"uv","optional":true},{"reason":"Alternative to uv, for using the Crawlee CLI to create new projects.","package":"pipx","optional":true}],"imports":[{"symbol":"BeautifulSoupCrawler","correct":"from crawlee.crawlers 
import BeautifulSoupCrawler, BeautifulSoupCrawlingContext"},{"symbol":"ParselCrawler","correct":"from crawlee.crawlers import ParselCrawler, ParselCrawlingContext"},{"symbol":"PlaywrightCrawler","correct":"from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext"},{"symbol":"AdaptivePlaywrightCrawler","correct":"from crawlee.crawlers import AdaptivePlaywrightCrawler, AdaptivePlaywrightCrawlingContext"},{"symbol":"SqlStorageClient","correct":"from crawlee.storage_clients import SqlStorageClient"},{"symbol":"ImpitHttpClient","correct":"from crawlee.http_clients import ImpitHttpClient"}],"quickstart":{"code":"import asyncio\n\nfrom crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext\n\n\nasync def main() -> None:\n    # Initialize the PlaywrightCrawler.\n    # For a real project, consider using persistent storage via the storage_client argument.\n    crawler = PlaywrightCrawler(\n        max_requests_per_crawl=10,  # Limit for demo purposes\n        headless=True,  # Run in headless mode (set to False to see the browser)\n        browser_type='chromium',  # Use the Chromium browser\n    )\n\n    # Define the default request handler, which will be called for every request.\n    @crawler.router.default_handler\n    async def request_handler(context: PlaywrightCrawlingContext) -> None:\n        context.log.info(f'Processing {context.request.url} ...')\n\n        # Extract data from the page using the Playwright API.\n        data = {\n            'url': context.request.url,\n            'title': await context.page.title(),\n            'content_length': len(await context.page.content()),\n        }\n\n        # Push the extracted data to the default dataset.\n        await context.push_data(data)\n\n        # Extract all links on the page and enqueue them for further crawling.\n        await context.enqueue_links()\n\n    # Run the crawler with an initial list of URLs.\n    await crawler.run(['https://crawlee.dev'])\n\n    # 
Optionally, export the entire dataset to a file (e.g., CSV, JSON).\n    await crawler.export_data('crawlee_results.json')\n    crawler.log.info('Crawling finished. Data exported to crawlee_results.json')\n\nif __name__ == '__main__':\n    asyncio.run(main())\n","lang":"python","description":"This quickstart demonstrates how to use `PlaywrightCrawler` to recursively crawl a website, extract the page title and content length, and store the results. It highlights the asynchronous nature of Crawlee, defining a request handler, enqueuing new links, and exporting collected data. Remember to install browser binaries separately with `playwright install` if using PlaywrightCrawler."},"warnings":[{"fix":"Review the 'Upgrading to v1' guide in the official documentation. Update imports for storage clients and HTTP clients. Ensure all I/O operations (like `HttpResponse.read()`) are awaited. Adjust code relying on old `Request` object properties.","message":"Crawlee v1.0 (September 2025) introduced significant breaking changes, including a new unified storage client system, `HttpResponse.read` becoming an async method, removal of `Request.id`, and changes to internal HTTP client defaults (HttpxHttpClient replaced by ImpitHttpClient).","severity":"breaking","affected_versions":"<1.0"},{"fix":"Update class names (e.g., `BaseStorageClient` to `StorageClient`). Adjust crawler option instantiation. Adapt cookie handling to the new `SessionCookies` object. If custom browser context behavior is needed for Playwright, configure `user_data_dir` or explicitly manage contexts.","message":"Breaking changes in v0.6 (March 2025) refactored class names by removing the 'Base' prefix from abstract classes and merged `HttpCrawlerOptions` into `BasicCrawlerOptions`. 
The `Session.cookies` attribute was changed from a simple dictionary to a `SessionCookies` class for multi-domain support, and `PlaywrightCrawler` now uses a persistent browser context by default.","severity":"breaking","affected_versions":"<0.6"},{"fix":"After installing `crawlee[playwright]`, run `playwright install` in your environment. To install only a specific browser (e.g., Firefox or WebKit), run `playwright install firefox` or `playwright install webkit`.","message":"When using `PlaywrightCrawler`, you must separately install the necessary browser binaries using `playwright install`. Installing `crawlee[playwright]` only installs the Python Playwright library, not the browsers themselves.","severity":"gotcha","affected_versions":"All"},{"fix":"Ensure all calls to asynchronous Crawlee methods are prefixed with `await` and executed within an `async def` function, which is then run using `asyncio.run()`.","message":"Crawlee is built on `asyncio`. All crawler operations (e.g., `crawler.run()`, `context.push_data()`, `context.enqueue_links()`, and Playwright API calls like `context.page.title()`) are asynchronous and must be awaited inside `async` functions. A call that is not awaited only creates a coroutine object and never runs; Python does not raise an error and typically emits only a `RuntimeWarning` that the coroutine was never awaited.","severity":"gotcha","affected_versions":"All"},{"fix":"Install with specific extras (e.g., `pip install 'crawlee[beautifulsoup]'`) or all extras (`pip install 'crawlee[all]'`) depending on your needs.","message":"To use crawlers like `BeautifulSoupCrawler`, `ParselCrawler`, or `PlaywrightCrawler`, you need to install Crawlee with the corresponding optional extras (e.g., `crawlee[beautifulsoup]`, `crawlee[parsel]`, or `crawlee[playwright]`). A basic `pip install crawlee` only provides the core framework.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z","problems":[]}