Scrapy

2.15.0 · active · verified Fri Apr 10

Scrapy is a high-level Python framework for web crawling and web scraping, designed to extract structured data from websites quickly. It is actively maintained with frequent releases and is used for tasks ranging from data mining to information processing and automated testing. The current version is 2.15.0.

Quickstart

This quickstart shows a basic Scrapy spider that crawls the 'humor' tag on 'quotes.toscrape.com'. It extracts the author and text of each quote, then follows the 'Next' pagination link to continue crawling. The `start_urls` attribute lists the initial URLs. The `parse` method handles each response: it yields one item per quote and, if a next-page link exists, schedules a new request with `response.follow`, passing `parse` again as the callback.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/tag/humor/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "author": quote.xpath("span/small/text()").get(),
                "text": quote.css("span.text::text").get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

# To run this spider, save it as a .py file (e.g., quotes_spider.py) and execute:
# scrapy runspider quotes_spider.py -o quotes.jsonl
