Scrapling
Scrapling is an adaptive, high-performance Python library designed for robust web scraping. It emphasizes undetectability and efficiency, automatically bypassing common anti-bot systems like Cloudflare and adapting to minor website structure changes. It offers various fetcher types for HTTP, headless browser, and stealthy browser interactions, alongside a Scrapy-like asynchronous spider framework for large-scale, concurrent crawling. The library is actively maintained with frequent updates, focusing on stealth, performance, and developer experience.
Warnings
- breaking Version 0.4 introduced a new asynchronous Spider framework and significant API changes. Existing scraping logic written for previous versions, especially those not using the new Spider API, may require substantial refactoring. Users are advised to review the v0.4 release notes for specific breaking changes.
- breaking In version 0.3.13, Scrapling dropped its `Camoufox` integration entirely. Scrapers that relied on `Camoufox` will break or behave differently.
- gotcha To use browser-based fetchers (like `StealthyFetcher` or `DynamicFetcher`), `pip install scrapling` is not sufficient. You must also run `scrapling install` (or `playwright install` directly if Playwright is installed separately) to download the necessary browser binaries.
- gotcha The adaptive scraping feature, which allows selectors to auto-relocate elements after website changes, needs to be explicitly enabled using `auto_save=True` during initial scraping and `adaptive=True` for subsequent scraping runs.
- gotcha For `TextHandler` and `Selector` classes, the method to retrieve all matched text or elements is `getall()` (e.g., `page.css('selector').getall()`), not `get_all()`.
- gotcha When running spiders, the `robots_txt_obey` option (introduced in v0.4.4) is disabled by default. If enabled, the spider will pre-fetch and respect `robots.txt` rules, including `Disallow`, `Crawl-delay`, and `Request-rate` directives, which can affect crawling speed and scope.
- gotcha Supplying proxy credentials, CDP URLs, or user_data_dir paths can expose sensitive data or connect to untrusted remote browsers. Always ensure these sources are secure and trustworthy.
Install
- pip install "scrapling[fetchers]"
- scrapling install
Imports
- Fetcher
from scrapling.defaults import Fetcher
from scrapling.fetchers import Fetcher
- StealthyFetcher
from scrapling.fetchers import StealthyFetcher
- Spider
from scrapling.spiders import Spider
- Response
from scrapling.spiders import Response
- FetcherSession
from scrapling.fetchers import FetcherSession
Quickstart
from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
import asyncio
# --- Basic HTTP Fetching ---
print("\n--- Basic HTTP Fetching ---")
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
authors = page.css('.quote .author::text').getall()
print(f"First quote: {quotes[0]}\nAuthor: {authors[0]}")
# --- Basic Spider Framework ---
print("\n--- Basic Spider Framework ---")
class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com"]

    async def parse(self, response: Response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(""),
                "author": quote.css("small.author::text").get(""),
            }

async def run_spider():
    # Note: for production, prefer `QuotesSpider().start()`, which manages
    # the event loop itself. For direct asyncio integration as below, ensure
    # no other event loop is already running.
    result = await QuotesSpider().start_async()
    print(f"Scraped {len(result.items)} items with the spider.")
    if result.items:
        print(f"First item from spider: {result.items[0]}")

if __name__ == "__main__":
    # The basic HTTP fetch above runs synchronously at import time; the
    # spider needs an async context, so we wrap it in asyncio.run here.
    asyncio.run(run_spider())