scrapy-splash
0.11.1 | verified Fri May 01 | auth: no | python
scrapy-splash provides JavaScript support for Scrapy using Splash, a headless browser. Version 0.11.1 is the latest release. The library allows rendering JavaScript-heavy pages by delegating requests to a Splash instance. Releases are sporadic; the last major update (0.10.0) added support for Python 3.12/3.13 and Scrapy 2.12+, and deprecated old dupefilter/cache storage components.
pip install scrapy-splash
Common errors
error ModuleNotFoundError: No module named 'scrapy_splash'
cause scrapy-splash is not installed, or it is installed in a different environment from the one running Scrapy.
fix Run 'pip install scrapy-splash' in the correct virtual environment.
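To tell "not installed" apart from "installed in another environment", it can help to ask the interpreter that runs Scrapy directly. The following is a minimal sketch using only the standard library; the helper name module_available is hypothetical and not part of scrapy-splash.

```python
import importlib.util
import sys


def module_available(name: str) -> bool:
    """Return True if the current interpreter can locate the top-level module."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # The interpreter path shows which environment is actually in use.
    print("interpreter:", sys.executable)
    print("scrapy_splash importable:", module_available("scrapy_splash"))
```

If the interpreter path is not the virtual environment you expected, install the package with that interpreter, for example 'python -m pip install scrapy-splash'.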
error ConnectionError: Splash HTTP status 502: upstream connect error or disconnect/reset before headers. retried
cause The Splash service is not running or is unreachable. The default host and port are localhost:8050.
fix Start Splash: 'sudo docker run -p 8050:8050 scrapinghub/splash'. Or set SPLASH_URL to the correct endpoint.
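Before debugging the spider itself, it may be worth confirming that Splash answers at all. The sketch below assumes the instance exposes Splash's /_ping HTTP endpoint at the default address; the function name splash_is_reachable is hypothetical.

```python
import urllib.request


def splash_is_reachable(splash_url: str = "http://localhost:8050",
                        timeout: float = 2.0) -> bool:
    """Return True if a GET to <splash_url>/_ping answers with HTTP 200."""
    ping_url = splash_url.rstrip("/") + "/_ping"
    try:
        with urllib.request.urlopen(ping_url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers a malformed URL.
        return False


if __name__ == "__main__":
    print("Splash reachable:", splash_is_reachable())
```

A False result points at the Splash service or the network path (container not started, wrong port mapping, wrong SPLASH_URL) rather than at the spider code.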
error KeyError: 'SPLASH_URL'
cause SPLASH_URL is not set in the Scrapy settings.
fix Add 'SPLASH_URL = "http://localhost:8050"' to your settings.py.
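For context, a settings.py fragment that sets SPLASH_URL usually also enables the scrapy-splash middlewares. The sketch below is an assumption based on the layout in the scrapy-splash README; the middleware paths and priority numbers should be checked against the version you install.

```python
# settings.py (fragment)

# Address of the running Splash instance.
SPLASH_URL = "http://localhost:8050"

# Downloader middlewares; priorities are assumed from the project README.
DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}

# Spider middleware that deduplicates repeated Splash arguments.
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
```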
Warnings
breaking In scrapy-splash 0.10.0, SplashAwareDupeFilter and SplashAwareFSCacheStorage are deprecated. Remove them from your settings and let Scrapy use its default components for DUPEFILTER_CLASS and HTTPCACHE_STORAGE instead. A new SplashRequestFingerprinter component is provided to keep request fingerprinting correct for Splash requests.
fix Remove DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter' and HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage' from settings. Instead, set REQUEST_FINGERPRINTER_CLASS = 'scrapy_splash.SplashRequestFingerprinter' and let Scrapy use its default dupefilter and cache storage.
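The migration can be pictured as a transformation of a settings dictionary. The helper below is a hypothetical illustration, not part of scrapy-splash; it assumes Scrapy's standard REQUEST_FINGERPRINTER_CLASS setting is where the fingerprinter is configured and HTTPCACHE_STORAGE is where the cache storage was set.

```python
DEPRECATED_DUPEFILTER = "scrapy_splash.SplashAwareDupeFilter"
DEPRECATED_CACHE = "scrapy_splash.SplashAwareFSCacheStorage"
FINGERPRINTER = "scrapy_splash.SplashRequestFingerprinter"


def migrate_splash_settings(settings: dict) -> dict:
    """Return a copy of settings with the deprecated Splash components removed."""
    migrated = dict(settings)
    # Drop the deprecated classes so Scrapy falls back to its defaults.
    if migrated.get("DUPEFILTER_CLASS") == DEPRECATED_DUPEFILTER:
        del migrated["DUPEFILTER_CLASS"]
    if migrated.get("HTTPCACHE_STORAGE") == DEPRECATED_CACHE:
        del migrated["HTTPCACHE_STORAGE"]
    # Fingerprinting for Splash requests is handled by the new component.
    migrated["REQUEST_FINGERPRINTER_CLASS"] = FINGERPRINTER
    return migrated


old = {
    "SPLASH_URL": "http://localhost:8050",
    "DUPEFILTER_CLASS": DEPRECATED_DUPEFILTER,
    "HTTPCACHE_STORAGE": DEPRECATED_CACHE,
}
new = migrate_splash_settings(old)
print(sorted(new))
```

In a real project the same change is made by hand in settings.py; the function only makes the before and after states explicit.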
gotcha HttpAuthMiddleware credentials leak: If you use the http_user and http_pass spider attributes for Splash authentication, those credentials are sent with every non-Splash request (including robots.txt requests) and are exposed to the request target. Use the SPLASH_USER and SPLASH_PASS settings instead.
fix Set SPLASH_USER and SPLASH_PASS in settings.py and remove http_user/http_pass from spider attributes. Upgrade to >=0.8.0.
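One way to apply this without hard-coding secrets is to read the credentials from the environment in settings.py. This is a minimal sketch; reusing SPLASH_USER and SPLASH_PASS as environment variable names is an assumption made here for illustration, and any naming scheme would work.

```python
# settings.py (fragment)
import os

# Credentials for the Splash instance only; they are not spider attributes,
# so they are not attached to ordinary (non-Splash) requests.
SPLASH_USER = os.environ.get("SPLASH_USER", "")
SPLASH_PASS = os.environ.get("SPLASH_PASS", "")
```

After this change, delete any http_user and http_pass attributes from the spider class so HttpAuthMiddleware no longer picks them up.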
deprecated SplashJsonResponse.body_as_unicode() is deprecated since 0.9.0; use .text instead.
fix Replace calls to response.body_as_unicode() with response.text.
Imports
- SplashRequest
wrong: from scrapy_splash.request import SplashRequest
correct: from scrapy_splash import SplashRequest
- SplashAwareDupeFilter
from scrapy_splash import SplashAwareDupeFilter
- SplashAwareFSCacheStorage
from scrapy_splash import SplashAwareFSCacheStorage
Quickstart
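import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com']

    def start_requests(self):
        # Route each start URL through Splash so JavaScript runs before parsing.
        for url in self.start_urls:
            # 'wait' gives the page 0.5 seconds to finish rendering.
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # With the default render.html endpoint, response holds the rendered
        # HTML (a SplashTextResponse); SplashJsonResponse is returned for
        # JSON endpoints such as render.json or execute.
        yield {'title': response.css('title::text').get()}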