Scrapling

0.4.6 · active · verified Wed Apr 15

Scrapling is an adaptive, high-performance Python library for robust web scraping. It emphasizes undetectability and efficiency: it is designed to bypass common anti-bot systems such as Cloudflare automatically and to adapt to minor changes in a website's structure. It provides separate fetchers for plain HTTP requests, headless-browser sessions, and stealthy browser sessions, alongside a Scrapy-like asynchronous spider framework for large-scale, concurrent crawling. The library is actively maintained with frequent updates that focus on stealth, performance, and developer experience.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates basic HTTP fetching with `Fetcher` to extract data using CSS selectors. It also includes a minimal example of Scrapling's `Spider` framework for structured, asynchronous crawling, similar to Scrapy.

from scrapling.fetchers import Fetcher
from scrapling.spiders import Spider, Response
import asyncio

# --- Basic HTTP Fetching ---
print("\n--- Basic HTTP Fetching ---")
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
authors = page.css('.quote .author::text').getall()
print(f"First quote: {quotes[0]}\nAuthor: {authors[0]}")

# --- Basic Spider Framework ---
print("\n--- Basic Spider Framework ---")
class QuotesSpider(Spider):
    name = "quotes_spider"
    start_urls = ["https://quotes.toscrape.com"]

    async def parse(self, response: Response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(""),
                "author": quote.css("small.author::text").get(""),
            }

async def run_spider():
    # Note: in a plain script, consider `QuotesSpider().start()`, which handles the event loop for you.
    # Awaiting `start_async()` directly, as below, only works if asyncio.run() is not
    # called while another event loop is already running.
    result = await QuotesSpider().start_async()
    print(f"Scraped {len(result.items)} items with the spider.")
    if result.items:
        print(f"First item from spider: {result.items[0]}")

if __name__ == "__main__":
    # The basic HTTP fetch above has already run synchronously at module level.
    # The spider coroutine needs an event loop when run outside `Spider().start()`,
    # so this example wraps it in asyncio.run().
    asyncio.run(run_spider())
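
The comments above warn that `asyncio.run` must not be called while another event loop is already running. The sketch below shows one way to guard against that using only the standard library. It does not import Scrapling: `fake_crawl`, `loop_is_running`, and `run_blocking` are hypothetical names introduced for illustration, and `fake_crawl` merely stands in for a coroutine such as `run_spider()`.

```python
import asyncio


async def fake_crawl():
    """Hypothetical stand-in for a spider coroutine such as run_spider()."""
    await asyncio.sleep(0)  # yield to the event loop once, like real I/O would
    return ["item"]


def loop_is_running() -> bool:
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


def run_blocking(coro_factory):
    """Run a coroutine to completion, refusing to nest event loops.

    Takes a zero-argument callable rather than a coroutine object so that
    no coroutine is created (and left un-awaited) when the guard trips.
    """
    if loop_is_running():
        raise RuntimeError(
            "An event loop is already running; await the coroutine instead."
        )
    return asyncio.run(coro_factory())


items = run_blocking(fake_crawl)
print(f"Collected {len(items)} item(s).")
```

In a plain script the guard passes and `asyncio.run` drives the coroutine. Inside an already-running loop (for example, within another coroutine) the helper raises immediately, which is the cue to `await` the coroutine directly instead.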
