Scrapy
High-level web crawling and scraping framework. Current version is 2.14.1 (Jan 2026). Requires Python >=3.10. Two major breaking changes in 2.13: start_requests() (sync) replaced by start() (async), and TWISTED_REACTOR now defaults to asyncio — both can silently break existing spiders.
Warnings
- breaking TWISTED_REACTOR now defaults to asyncio (twisted.internet.asyncioreactor.AsyncioSelectorReactor) as of 2.13. Projects that relied on the previous default of no explicit reactor (Twisted's own default) may behave differently, and code that is incompatible with the asyncio reactor can break silently.
- breaking start_requests() (sync) replaced by start() (async) in 2.13. The iteration behavior changed: start requests now run continuously rather than stopping when the scheduler has pending requests. This can cause different crawl ordering and memory behavior on large crawls.
- breaking Python 3.9 dropped in Scrapy 2.13. Minimum is now Python 3.10.
- gotcha response.css() and response.xpath() return SelectorList, not strings. Forgetting .get() or .getall() returns a SelectorList object, not the text. A common source of silent data bugs.
- gotcha A parse callback must yield items/requests (or return an iterable of them). A bare return item drops output, and a return inside a generator ends iteration early, silently discarding anything not yet yielded. Prefer yield throughout.
- gotcha Running scrapy crawl outside a Scrapy project directory fails: the crawl command is only available inside a project, and the CLI locates a project via a scrapy.cfg file in the current or a parent directory. For standalone spider files, use scrapy runspider instead.
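Given the reactor change above, it can be safer to pin TWISTED_REACTOR explicitly in the project's settings.py instead of relying on the version-dependent default. A minimal sketch (values shown are the standard ones, not project-specific):

```python
# settings.py (sketch): pin the reactor explicitly so upgrades don't change it.
# Since Scrapy 2.13 this is the default; stating it makes the choice visible.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Or, to restore the pre-2.13 behavior of not installing a specific reactor:
# TWISTED_REACTOR = None
```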
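The SelectorList gotcha can be reproduced without running a spider, using parsel (the selector library that backs response.css()/response.xpath()); a small sketch with inline sample HTML:

```python
from parsel import Selector  # Scrapy's response.css()/.xpath() return these types

html = '<div class="quote"><span class="text">Hello</span></div>'
sel = Selector(text=html)

no_get = sel.css('span.text::text')          # SelectorList, not a string
text = sel.css('span.text::text').get()      # first match as str: 'Hello'
texts = sel.css('span.text::text').getall()  # all matches: ['Hello']

print(type(no_get).__name__)  # SelectorList
print(text, texts)
```

Yielding no_get into an item instead of text is the silent-data-bug case the warning describes.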
Install
- pip install Scrapy
- scrapy startproject myproject
Imports
- Spider.start
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    # New async start() method (2.13+); preferred over start_requests()
    async def start(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
Quickstart
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
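Assuming the spider above is saved as quotes_spider.py (filename hypothetical), it can be run as a standalone file, with no project required, via runspider; -O exports scraped items to a feed file, overwriting it if it exists:

```shell
# Run the standalone spider file and export items to JSON (-O overwrites the file)
scrapy runspider quotes_spider.py -O quotes.json

# Inside a project created with `scrapy startproject`, the equivalent is:
scrapy crawl quotes -O quotes.json
```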