Crawl4AI: LLM Friendly Web Crawler & Scraper
Crawl4AI is an open-source, LLM-friendly web crawler and scraper designed for AI agents, RAG, and data pipelines. It provides fast, controllable, and customizable web content extraction, typically converting pages into clean Markdown suitable for LLM ingestion. The library supports dynamic content handling, caching, custom hooks, real-time monitoring via Docker, and flexible deployment options. It is actively maintained with frequent minor releases focused on performance, bot-detection avoidance, and security; the current release is 0.8.6. [8, 9]
Warnings
- breaking Critical Security Hotfix (v0.8.6): The `litellm` dependency was replaced with `unclecode-litellm` due to a PyPI supply chain compromise. Users on `v0.8.5` or earlier are strongly advised to upgrade immediately to `v0.8.6` or later to mitigate this risk. [13]
- gotcha Mandatory Playwright Installation: After installing `crawl4ai` via pip, you must run `playwright install` to download and set up the required browser binaries. Failing to do so will result in runtime errors when attempting to crawl. [6, 10, 13]
- breaking Docker API Hooks Disabled by Default (v0.8.0): For security reasons (Remote Code Execution vulnerability fix), hooks in the Docker API are now disabled by default. If you rely on hooks with the Docker API, you will need to re-enable them with caution. [2]
- breaking Legacy Browser Modules Removed (v0.6.0): Modules under `crawl4ai/browser/*` were removed. Also, the `AsyncPlaywrightCrawlerStrategy.get_page` function signature changed. Update imports and method calls accordingly. [1]
Install
- pip install -U crawl4ai
- playwright install
- pip install "crawl4ai[torch]"
- pip install "crawl4ai[transformer]"
Imports
- AsyncWebCrawler
from crawl4ai import AsyncWebCrawler
Quickstart
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # Initialize the crawler. Ensure 'playwright install' has been run.
    async with AsyncWebCrawler() as crawler:
        # Perform a basic crawl and extract content as Markdown
        result = await crawler.arun(
            url="https://www.nbcnews.com/business"
        )
        print("--- Extracted Markdown ---")
        print(result.markdown[:500])  # Print first 500 chars of Markdown

        # Example of getting raw HTML
        # result_html = await crawler.arun(
        #     url="https://www.nbcnews.com/business",
        #     include_raw_html=True
        # )
        # print("--- Raw HTML ---")
        # print(result_html.html[:500])

if __name__ == "__main__":
    asyncio.run(main())
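Since the overview positions Crawl4AI as a source of clean Markdown for RAG pipelines, a common next step is splitting `result.markdown` into bounded chunks for embedding. Below is a minimal, hedged sketch: `chunk_markdown` is a hypothetical helper written here for illustration, not part of the crawl4ai API; it operates on any Markdown string, breaking on paragraph boundaries.

```python
def chunk_markdown(markdown: str, max_chars: int = 500) -> list[str]:
    """Split Markdown into chunks of at most max_chars, preferring
    paragraph boundaries (blank lines) as break points."""
    chunks: list[str] = []
    current = ""
    for para in markdown.split("\n\n"):
        # Start a new chunk if appending this paragraph would overflow
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks

# Example input standing in for result.markdown from the Quickstart above
sample = "# Title\n\n" + "\n\n".join(
    f"Paragraph {i}. " + "x" * 80 for i in range(10)
)
chunks = chunk_markdown(sample, max_chars=300)
print(len(chunks), max(len(c) for c in chunks))
```

Paragraph-boundary chunking keeps each piece coherent for embedding; in a real pipeline you would pass each chunk to your embedding model instead of printing.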