Scrapy
2.14.1 · verified Tue May 12 · auth: no · python install: verified · quickstart: verified
High-level web crawling and scraping framework. Current version is 2.14.1 (Jan 2026). Requires Python >=3.10. Two major breaking changes in 2.13: start_requests() (sync) replaced by start() (async), and TWISTED_REACTOR now defaults to asyncio — both can silently break existing spiders.
pip install Scrapy

Common errors

error: 'scrapy' is not recognized as an internal or external command, operable program or batch file.
cause: Scrapy is not installed, or its console script is not on your PATH. This is common when pip installs to a user-specific directory that is not on PATH, or when a virtual environment is not activated.
fix: Install Scrapy with pip install scrapy (or pip3 install scrapy). Make sure the directory containing the scrapy executable (Python's Scripts directory on Windows, bin inside a virtual environment) is on your PATH. Alternatively, run Scrapy commands as python -m scrapy.

error: ModuleNotFoundError: No module named 'scrapy'
cause: Scrapy is not installed for the Python interpreter currently in use, or a local file or directory named scrapy.py or scrapy is shadowing the installed library.
fix: Verify Scrapy is installed for your active environment with pip show scrapy. If not, install it with pip install scrapy (or python3.x -m pip install scrapy for a specific interpreter). Check your project directory and Python path for conflicting files or folders named scrapy.

error: AttributeError: 'Spider' object has no attribute 'start_requests'
cause: In Scrapy 2.13 and newer, the asynchronous entry point start_requests() was replaced by async def start(). If you define async def start_requests(), it is ignored, and the engine fails when it tries to call a synchronous start_requests() that does not exist.
fix: If your spider issues its initial requests asynchronously, rename async def start_requests(self) to async def start(self); start() should be an async generator yielding Request objects. If you intend to keep a synchronous start_requests(), make sure it is not declared async.

error: twisted.internet.error.ReactorAlreadyRunning
cause: Scrapy 2.13+ defaults TWISTED_REACTOR to asyncio (twisted.internet.asyncioreactor.AsyncioSelectorReactor), but other code or a third-party library may implicitly or explicitly install a different Twisted reactor before Scrapy can configure its own. Importing twisted.internet.reactor too early is a common trigger.
fix: Explicitly set TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor' in settings.py. Look for early imports of twisted.internet.reactor or other Twisted components and move them into local scopes or after Scrapy's reactor initialization if possible. When running Scrapy from a script, call scrapy.utils.reactor.install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor') at the very beginning.

error: ValueError: Missing scheme in request url: ...
cause: A scrapy.Request was created with a URL that lacks a scheme (e.g., http:// or https://); the URL is incomplete or malformed.
fix: Make sure every URL passed to scrapy.Request includes a valid scheme. For example, use yield scrapy.Request('https://example.com/') rather than yield scrapy.Request('example.com'). For relative URLs extracted from a page, response.follow() or response.urljoin() resolve them against the page URL.

Warnings
breaking: TWISTED_REACTOR default changed to asyncio (AsyncioSelectorReactor) in 2.13. Projects that relied on the previous default of None may behave differently, and Twisted code incompatible with the asyncio reactor can break silently.
fix: Explicitly set TWISTED_REACTOR in settings.py if you need a specific reactor. To restore the old behavior, set TWISTED_REACTOR = None. New projects should keep the asyncio default.
breaking: start_requests() (sync) was replaced by async start() in 2.13, and iteration behavior changed: start requests are now consumed continuously rather than pausing while the scheduler has pending requests. This can change crawl ordering and memory behavior on large crawls.
fix: Override start() instead of start_requests() in new spiders. Existing start_requests() implementations still work, but with the new iteration behavior; see 'Delaying start request iteration' in the docs to restore the previous behavior.
breaking: Python 3.9 support was dropped in Scrapy 2.13; the minimum is now Python 3.10.
fix: Pin Scrapy<2.13 in Python 3.9 environments.
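In a requirements file, the pin can be scoped to the interpreter version with PEP 508 environment markers (a sketch; adapt to your dependency tooling):

```
# requirements.txt: keep 3.9 environments on the last compatible release line
Scrapy<2.13; python_version < "3.10"
Scrapy>=2.13; python_version >= "3.10"
```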
gotcha: response.css() and response.xpath() return a SelectorList, not strings. Forgetting .get() or .getall() leaves you with a SelectorList object instead of the extracted text, a common source of silent data bugs.
fix: Use .get() for the first match (returns str or None) and .getall() for all matches (returns a list of str). Example: response.css('h1::text').get(), not response.css('h1::text').
gotcha: Using return instead of yield in a parse callback that is a generator (i.e., one that contains any yield) causes the returned items/requests to be silently dropped.
fix: Replace return [item1, item2] with yield item1; yield item2, or return an iterable from a callback that contains no yield at all. Scrapy 2.13 added a warning for this (WARN_ON_GENERATOR_RETURN_VALUE setting).
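This is plain Python generator semantics, not something Scrapy-specific, so a stdlib-only sketch demonstrates it: once a function body contains yield, any return value becomes StopIteration.value and never reaches whoever iterates the generator.

```python
def bad_parse():
    """Mimics a Scrapy callback that yields somewhere but returns its items."""
    if False:
        yield  # the mere presence of yield makes this a generator function
    return [{"title": "dropped"}]  # becomes StopIteration.value, silently lost

def good_parse():
    yield {"title": "kept"}  # yielded items are what the consumer actually sees

print(list(bad_parse()))   # [] (the returned list vanished)
print(list(good_parse()))  # [{'title': 'kept'}]
```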
gotcha: Running scrapy crawl outside a Scrapy project directory raises ConfigError. The scrapy CLI requires a scrapy.cfg file in the current or a parent directory.
fix: Run scrapy crawl from inside a project directory (anywhere under the directory containing scrapy.cfg). Create a project first with scrapy startproject myproject.
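For reference, scrapy.cfg is a small INI file that startproject generates at the project root; a minimal one (the module name myproject is assumed from the example above) looks roughly like:

```
# scrapy.cfg - its presence at the project root is what the CLI
# searches for in the current and parent directories
[settings]
default = myproject.settings
```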
Install

scrapy startproject myproject

Install compatibility (verified; last tested: 2026-05-12)

| python | os / libc     | variant | status | wheel | install | import | disk   |
|--------|---------------|---------|--------|-------|---------|--------|--------|
| 3.9    | alpine (musl) | Scrapy  | -      | -     | -       | 1.08s  | 90.2M  |
| 3.9    | alpine (musl) | default | -      | -     | -       | -      | -      |
| 3.9    | slim (glibc)  | Scrapy  | -      | -     | -       | 1.03s  | 91M    |
| 3.9    | slim (glibc)  | default | -      | -     | -       | -      | -      |
| 3.10   | alpine (musl) | Scrapy  | -      | -     | -       | 1.20s  | 90.2M  |
| 3.10   | alpine (musl) | default | -      | -     | -       | -      | -      |
| 3.10   | slim (glibc)  | Scrapy  | -      | -     | -       | 0.94s  | 91M    |
| 3.10   | slim (glibc)  | default | -      | -     | -       | -      | -      |
| 3.11   | alpine (musl) | Scrapy  | -      | -     | -       | 1.63s  | 102.0M |
| 3.11   | alpine (musl) | default | -      | -     | -       | -      | -      |
| 3.11   | slim (glibc)  | Scrapy  | -      | -     | -       | 1.41s  | 103M   |
| 3.11   | slim (glibc)  | default | -      | -     | -       | -      | -      |
| 3.12   | alpine (musl) | Scrapy  | -      | -     | -       | 1.74s  | 91.7M  |
| 3.12   | alpine (musl) | default | -      | -     | -       | -      | -      |
| 3.12   | slim (glibc)  | Scrapy  | -      | -     | -       | 1.75s  | 92M    |
| 3.12   | slim (glibc)  | default | -      | -     | -       | -      | -      |
| 3.13   | alpine (musl) | Scrapy  | -      | -     | -       | 1.71s  | 91.0M  |
| 3.13   | alpine (musl) | default | -      | -     | -       | -      | -      |
| 3.13   | slim (glibc)  | Scrapy  | -      | -     | -       | 1.70s  | 92M    |
| 3.13   | slim (glibc)  | default | -      | -     | -       | -      | -      |
Imports

Spider.start

wrong:

# Old sync start_requests() — still works but deprecated pattern
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, callback=self.parse)

correct:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['https://example.com']

    # New async start() method (2.13+) — preferred over start_requests()
    async def start(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        yield {'title': response.css('title::text').get()}
Quickstart (verified; last tested: 2026-04-23)

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
        # Follow pagination
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, self.parse)
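The quickstart can be run without creating a project via the runspider command (assuming the code above is saved as quotes_spider.py; -O overwrites the output file, while -o appends):

```
# Standalone spider file, no project needed
scrapy runspider quotes_spider.py -O quotes.json

# Inside a project, run by spider name instead
scrapy crawl quotes -O quotes.json
```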