scrapy-splash

0.11.1 verified Fri May 01 auth: no python

scrapy-splash provides JavaScript support for Scrapy using Splash, a headless browser. Version 0.11.1 is the latest release. The library allows rendering JavaScript-heavy pages by delegating requests to a Splash instance. Releases are sporadic; the last major update (0.10.0) added support for Python 3.12/3.13 and Scrapy 2.12+, and deprecated old dupefilter/cache storage components.

pip install scrapy-splash
error ModuleNotFoundError: No module named 'scrapy_splash'
cause scrapy-splash is not installed or installed in a different environment.
fix
Run 'pip install scrapy-splash' in the correct virtual environment.
error ConnectionError: Splash HTTP status 502: upstream connect error or disconnect/reset before headers. retried
cause Splash service is not running or unreachable. Default host/port is localhost:8050.
fix
Start Splash: 'sudo docker run -p 8050:8050 scrapinghub/splash'. Or set SPLASH_URL to the correct endpoint.
error KeyError: 'SPLASH_URL'
cause You haven't set the SPLASH_URL setting in Scrapy settings.
fix
Add 'SPLASH_URL = "http://localhost:8050"' to your settings.py.
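A quick way to tell a connection problem from a configuration problem is to probe the Splash instance before starting a crawl. The sketch below uses only the standard library; the helper name splash_is_up is hypothetical, and it assumes the instance exposes Splash's /_ping health endpoint at the same base URL as SPLASH_URL.

```python
import json
import urllib.error
import urllib.request


def splash_is_up(base_url="http://localhost:8050", timeout=2.0):
    """Return True if a Splash instance answers at base_url, else False."""
    # Assumption: the instance exposes the /_ping health endpoint.
    ping_url = base_url.rstrip("/") + "/_ping"
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout) as resp:
            payload = json.loads(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, bad URL, or non-JSON body.
        return False
    return isinstance(payload, dict) and payload.get("status") == "ok"
```

If the helper returns False for the URL in SPLASH_URL, the container is probably not running or the port mapping is wrong.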
breaking In scrapy-splash 0.10.0, SplashAwareDupeFilter and SplashAwareFSCacheStorage are deprecated. Remove them from your settings and fall back to Scrapy's default components for DUPEFILTER_CLASS and HTTPCACHE_STORAGE. A new SplashRequestFingerprinter component is provided to maintain request fingerprinting for Splash requests.
fix Remove DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' and HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage' from settings. Instead, set REQUEST_FINGERPRINTER_CLASS = 'scrapy_splash.SplashRequestFingerprinter' and let Scrapy use its default dupefilter and cache storage.
gotcha HttpAuthMiddleware credentials leak: If you use the http_user and http_pass spider attributes for Splash authentication, those credentials are also sent with every non-Splash request (including robots.txt), exposing them to the target site. Use the SPLASH_USER and SPLASH_PASS settings instead.
fix Upgrade to >=0.8.0, set SPLASH_USER and SPLASH_PASS in settings.py, and remove http_user/http_pass from the spider attributes.
deprecated SplashJsonResponse.body_as_unicode() is deprecated since 0.9.0; use .text instead.
fix Replace calls to response.body_as_unicode() with response.text.

Basic spider using SplashRequest to render JavaScript.

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # With the default render.html endpoint, response is a SplashTextResponse
        yield {'title': response.css('title::text').get()}
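A spider like this only reaches Splash once the scrapy-splash middlewares are enabled in the project settings. The fragment below is a sketch of that wiring, assuming the middleware paths and priority values from the scrapy-splash README and a project that defines no other custom middlewares.

```python
# settings.py fragment: middleware wiring for SplashRequest.
# Priorities follow the scrapy-splash README; adjust them if your project
# already registers other middlewares.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
```

Giving SplashMiddleware a lower number than HttpCompressionMiddleware follows the README's recommended ordering.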