{"id":2764,"library":"scrapy","title":"Scrapy","description":"Scrapy is a high-level Python web crawling and web scraping framework, designed for fast extraction of structured data from websites. It's actively maintained with frequent releases, supporting applications from data mining to information processing and automated testing. The current version is 2.15.0.","status":"active","version":"2.15.0","language":"en","source_language":"en","source_url":"https://github.com/scrapy/scrapy","tags":["web scraping","crawling","spider","async","framework"],"install":[{"cmd":"pip install scrapy","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Asynchronous networking framework, core dependency. Scrapy 2.15.0 adds experimental support for running without a Twisted reactor.","package":"Twisted"},{"reason":"Efficient XML and HTML parser.","package":"lxml"},{"reason":"HTML/XML data extraction library built on lxml.","package":"parsel"},{"reason":"Multi-purpose helper for URLs and web page encodings.","package":"w3lib"},{"reason":"Deals with network-level security needs.","package":"cryptography"},{"reason":"Deals with network-level security needs.","package":"pyOpenSSL"},{"reason":"Required (>=1.2.0) for improved protection against decompression bombs in HttpCompressionMiddleware since Scrapy 2.13.4.","package":"brotli"},{"reason":"Experimental HTTPX-based download handler in Scrapy 2.15.0.","package":"httpx","optional":true}],"imports":[{"symbol":"scrapy","correct":"import scrapy"},{"note":"Importing Spider directly from scrapy is generally preferred and cleaner than from scrapy.spider.","wrong":"from scrapy.spider import Spider","symbol":"Spider","correct":"from scrapy import Spider"},{"symbol":"Request","correct":"from scrapy import Request"},{"symbol":"Item","correct":"from scrapy import Item"},{"symbol":"Field","correct":"from scrapy.item import Field"},{"symbol":"CrawlSpider","correct":"from scrapy.spiders import 
CrawlSpider"},{"symbol":"LinkExtractor","correct":"from scrapy.linkextractors import LinkExtractor"},{"note":"Used for running Scrapy from scripts. AsyncCrawlerProcess returns coroutines, while CrawlerProcess returns Deferred objects.","symbol":"AsyncCrawlerProcess","correct":"from scrapy.crawler import AsyncCrawlerProcess"}],"quickstart":{"code":"import scrapy\n\nclass QuotesSpider(scrapy.Spider):\n    name = \"quotes\"\n    start_urls = [\n        \"https://quotes.toscrape.com/tag/humor/\",\n    ]\n\n    def parse(self, response):\n        for quote in response.css(\"div.quote\"):\n            yield {\n                \"author\": quote.xpath(\"span/small/text()\").get(),\n                \"text\": quote.css(\"span.text::text\").get(),\n            }\n\n        next_page = response.css('li.next a::attr(\"href\")').get()\n        if next_page is not None:\n            yield response.follow(next_page, self.parse)\n\n# To run this spider, save it as a .py file (e.g., quotes_spider.py) and execute:\n# scrapy runspider quotes_spider.py -o quotes.jsonl","lang":"python","description":"This quickstart demonstrates a basic Scrapy spider that crawls the 'quotes.toscrape.com' website, specifically the 'humor' tag. It extracts the author and text of each quote, then follows the 'Next Page' link to continue crawling. The `start_urls` attribute defines the initial URLs, and the `parse` method handles the response, extracting data and scheduling new requests using `response.follow` for pagination."},"warnings":[{"fix":"Upgrade to Python 3.10 or newer. Use a virtual environment for isolated Scrapy installations.","message":"Scrapy has progressively dropped support for older Python versions. Scrapy 2.12.0 dropped Python 3.8 support, and Scrapy 2.14.0 dropped Python 3.9 and PyPy 3.10. 
Ensure your environment meets the `requires_python >=3.10` requirement.","severity":"breaking","affected_versions":">=2.12.0"},{"fix":"Migrate `def start_requests(self)` to `async def start(self)`, especially if generating your initial requests involves asynchronous operations (e.g., database calls).","message":"Scrapy 2.13.0 introduced the asynchronous `start()` method as the preferred way to yield initial requests, in place of the synchronous `start_requests()`. `start_urls` still works as a shortcut, but initial requests that depend on asynchronous operations should be generated in `async def start(self)`.","severity":"breaking","affected_versions":">=2.13.0"},{"fix":"Review and update custom middlewares to support asynchronous spider output by defining `process_spider_output` as an asynchronous generator or implementing `process_spider_output_async`.","message":"The asyncio reactor has been enabled by default since Scrapy 2.13.0. This can affect applications that set up a Twisted-specific reactor, and custom spider middlewares that do not explicitly support asynchronous output may now log warnings.","severity":"breaking","affected_versions":">=2.13.0"},{"fix":"Do not depend on `Referrer-Policy` header values being resolved as Python callables. Expect POST requests that receive a 301 redirect to be re-sent as GET requests.","message":"Scrapy 2.14.2 includes a security fix where values from the `Referrer-Policy` header of HTTP responses are no longer executed as Python callables. Additionally, 301 redirects of POST requests are now converted into GET requests, aligning with the HTTP standard.","severity":"breaking","affected_versions":">=2.14.2"},{"fix":"Replace calls to `scrapy.utils.defer` functions with their `twisted.internet.defer` equivalents or appropriate coroutine patterns. 
For `maybeDeferred_coro()`, consider `twisted.internet.defer.maybeDeferred` if staying with Deferreds.","message":"`scrapy.utils.defer.maybeDeferred_coro()` and other related `scrapy.utils.defer` functions (e.g., `mustbe_deferred`, `defer_succeed`, `defer_fail`) are deprecated in Scrapy 2.14.1. Users are encouraged to use direct Twisted functions or coroutines.","severity":"deprecated","affected_versions":">=2.14.1"},{"fix":"Store per-request/response data in the `request.meta` or `request.cb_kwargs` mappings instead of attaching new attributes to the objects.","message":"Request and Response objects now define `__slots__`, meaning you cannot assign arbitrary attributes directly (e.g., `response.foo = 1`). Attempting to do so will raise an `AttributeError`.","severity":"gotcha","affected_versions":">=2.15.0"},{"fix":"Adjust `DOWNLOAD_DELAY` and `CONCURRENT_REQUESTS_PER_DOMAIN` in your `settings.py` if you require higher concurrency or a faster crawl rate for your specific use case.","message":"In Scrapy 2.13.3, the default project template changed the values for `DOWNLOAD_DELAY` (from 0 to 1 second) and `CONCURRENT_REQUESTS_PER_DOMAIN` (from 8 to 1) to promote more polite crawling. New projects will inherit these slower defaults.","severity":"gotcha","affected_versions":">=2.13.3"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}