Scrapy
High-level web crawling and scraping framework. Current version is 2.14.1 (Jan 2026). Requires Python >=3.10. Two major breaking changes in 2.13: start_requests() (sync) replaced by start() (async), and TWISTED_REACTOR now defaults to asyncio — both can silently break existing spiders.
Warnings
- breaking TWISTED_REACTOR now defaults to asyncio (twisted.internet.asyncioreactor.AsyncioSelectorReactor) as of 2.13. Projects that relied on the previous default of no explicit reactor (Twisted's own default) may behave differently, and code that is incompatible with the asyncio reactor can break silently.
- breaking start_requests() (sync) replaced by start() (async) in 2.13. The iteration behavior changed: start requests now run continuously rather than stopping when the scheduler has pending requests. This can cause different crawl ordering and memory behavior on large crawls.
- breaking Python 3.9 dropped in Scrapy 2.13. Minimum is now Python 3.10.
- gotcha response.css() and response.xpath() return SelectorList, not strings. Forgetting .get() or .getall() returns a SelectorList object, not the text. A common source of silent data bugs.
- gotcha A parse callback must yield items/requests (or return an iterable of them). A bare return item drops output, and a return inside a generator ends iteration early, silently discarding anything not yet yielded. Prefer yield throughout.
- gotcha Running scrapy crawl outside a Scrapy project directory fails: the crawl command is only available inside a project, and the CLI locates a project via a scrapy.cfg file in the current or a parent directory. For standalone spider files, use scrapy runspider instead.
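Given the reactor change above, it can be safer to pin TWISTED_REACTOR explicitly in the project's settings.py instead of relying on the version-dependent default. A minimal sketch (values shown are the standard ones, not project-specific):

```python
# settings.py (sketch): pin the reactor explicitly so upgrades don't change it.
# Since Scrapy 2.13 this is the default; stating it makes the choice visible.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Or, to restore the pre-2.13 behavior of not installing a specific reactor:
# TWISTED_REACTOR = None
```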
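The SelectorList gotcha can be reproduced without running a spider, using parsel (the selector library that backs response.css()/response.xpath()); a small sketch with inline sample HTML:

```python
from parsel import Selector  # Scrapy's response.css()/.xpath() return these types

html = '<div class="quote"><span class="text">Hello</span></div>'
sel = Selector(text=html)

no_get = sel.css('span.text::text')          # SelectorList, not a string
text = sel.css('span.text::text').get()      # first match as str: 'Hello'
texts = sel.css('span.text::text').getall()  # all matches: ['Hello']

print(type(no_get).__name__)  # SelectorList
print(text, texts)
```

Yielding no_get into an item instead of text is the silent-data-bug case the warning describes.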
Install
- pip install Scrapy
- scrapy startproject myproject
Imports
- Spider.start
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    # New async start() method (2.13+); preferred over start_requests()
    async def start(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
Quickstart
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
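Assuming the spider above is saved as quotes_spider.py (filename hypothetical), it can be run as a standalone file, with no project required, via runspider; -O exports scraped items to a feed file, overwriting it if it exists:

```shell
# Run the standalone spider file and export items to JSON (-O overwrites the file)
scrapy runspider quotes_spider.py -O quotes.json

# Inside a project created with `scrapy startproject`, the equivalent is:
scrapy crawl quotes -O quotes.json
```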