Scrapy
Scrapy is a high-level Python framework for web crawling and web scraping, designed for fast extraction of structured data from websites. It is actively maintained with frequent releases and is used for tasks ranging from data mining to information processing and automated testing. The current version is 2.15.0.
Warnings
- breaking Scrapy has progressively dropped support for older Python versions. Scrapy 2.12.0 dropped Python 3.8 support, and Scrapy 2.14.0 dropped Python 3.9 and PyPy 3.10. Make sure your environment runs Python 3.10 or later (`requires_python >=3.10`).
- breaking Since Scrapy 2.13.0, the synchronous `start_requests()` method for yielding initial requests is deprecated in favor of the asynchronous `start()` method. `start_urls` still works as a shortcut, but spiders that build their initial requests by hand should define `async def start(self)`.
- breaking Starting with Scrapy 2.13.0, the asyncio reactor is enabled by default. This can affect projects that set up a Twisted-specific reactor themselves. Custom spider middlewares that do not explicitly support asynchronous output may also need updating, since they may now log warnings.
- breaking Scrapy 2.14.2 includes a security fix where values from the `Referrer-Policy` header of HTTP responses are no longer executed as Python callables. Additionally, 301 redirects of POST requests are now converted into GET requests, aligning with the HTTP standard.
- deprecated `scrapy.utils.defer.maybeDeferred_coro()` and other related `scrapy.utils.defer` functions (e.g., `mustbe_deferred`, `defer_succeed`, `defer_fail`) are deprecated in Scrapy 2.14.1. Users are encouraged to use direct Twisted functions or coroutines.
- gotcha Request and Response objects now define `__slots__`, meaning you cannot assign arbitrary attributes directly (e.g., `response.foo = 1`). Attempting to do so will raise an `AttributeError`.
- gotcha In Scrapy 2.13.3, the default project template changed the values for `DOWNLOAD_DELAY` (from 0 to 1 second) and `CONCURRENT_REQUESTS_PER_DOMAIN` (from 8 to 1) to promote more polite crawling. New projects will inherit these slower defaults.
Install
-
pip install scrapy
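After installing, it may be worth confirming that the interpreter meets the Python 3.10 minimum mentioned in the warnings. The following is a small sketch using only the standard library. The helper names `python_is_supported()` and `installed_scrapy_version()` are hypothetical and not part of Scrapy.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)  # minimum Python version stated in the warnings above


def python_is_supported(version_info=sys.version_info):
    # Compare only (major, minor) against the required minimum.
    return tuple(version_info[:2]) >= MIN_PYTHON


def installed_scrapy_version():
    # Return the installed version string, or None if Scrapy is not installed.
    try:
        return metadata.version("scrapy")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("Python supported:", python_is_supported())
    print("Scrapy version:", installed_scrapy_version())
```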
Imports
- scrapy
import scrapy
- Spider
from scrapy import Spider
- Request
from scrapy import Request
- Item
from scrapy import Item
- Field
from scrapy.item import Field
- CrawlSpider
from scrapy.spiders import CrawlSpider
- LinkExtractor
from scrapy.linkextractors import LinkExtractor
- AsyncCrawlerProcess
from scrapy.crawler import AsyncCrawlerProcess
Quickstart
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/tag/humor/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "author": quote.xpath("span/small/text()").get(),
                "text": quote.css("span.text::text").get(),
            }

        next_page = response.css('li.next a::attr("href")').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)


# To run this spider, save it as a .py file (e.g., quotes_spider.py) and execute:
# scrapy runspider quotes_spider.py -o quotes.jsonl