Scrapy Playwright
Scrapy Playwright is a Scrapy download handler that integrates Playwright for Python, so Scrapy spiders can scrape dynamic pages that depend on JavaScript rendering. It brings browser automation into the Scrapy framework, allowing spiders to interact with page elements while keeping Scrapy's crawling and scheduling model. The library is actively maintained; the current version is 0.0.46.
Warnings
- breaking: Python 3.8 support was dropped in `scrapy-playwright==0.0.44`. Users on Python 3.8 or older must upgrade to Python >=3.10 to use recent versions.
- breaking: The import path for `HTTP11DownloadHandler` was updated in `scrapy-playwright==0.0.45`. This may affect projects that customize download handlers or import the handler directly.
- deprecated: Passing positional arguments to the function set in `PLAYWRIGHT_PROCESS_REQUEST_HEADERS` was deprecated in version 0.0.41. The function should now accept its arguments by keyword.
- gotcha: Playwright needs browser binaries, which are installed separately from the Python package. `pip install playwright` does not install them; run `playwright install` afterwards, otherwise no browser can be launched.
- gotcha: Scrapy must be configured to use an asyncio-compatible Twisted reactor (e.g., `twisted.internet.asyncioreactor.AsyncioSelectorReactor`) for `scrapy-playwright` to work. Without it, requests routed through Playwright fail.
- gotcha: Browser instances consume significant memory and CPU. Running many concurrent Playwright pages can exhaust system resources, causing crashes or slow crawls. The cost is higher still when browsers run headful (non-headless).
- gotcha: `scrapy-playwright` uses Playwright's `Page.route` and `Page.unroute` methods internally. Calling them directly in user code can interfere with the library and lead to unexpected behavior.
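Two of the warnings above translate directly into settings, so a minimal sketch may help. It assumes a hypothetical `settings.py` excerpt: the numeric limits are illustrative, `custom_headers` and the `x-example-browser` header are made-up names, and the keyword names in the function signature follow the project README as best recalled, so they should be checked against the installed version.

```python
# Hypothetical settings.py excerpt; every value below is illustrative.

# Bound how many browser pages can be open at once, to limit memory and CPU use.
CONCURRENT_REQUESTS = 8
PLAYWRIGHT_MAX_CONTEXTS = 2
PLAYWRIGHT_MAX_PAGES_PER_CONTEXT = 4


def custom_headers(*, browser_type_name, playwright_request, scrapy_request_data, **kwargs):
    """Return the headers to send for one browser request.

    The bare ``*`` makes every argument keyword-only, matching the
    keyword-based calling convention (assumed argument names).
    """
    # Start from the headers the browser was going to send; copy so the
    # original mapping is left untouched.
    headers = dict(playwright_request.headers)
    # Hypothetical extra header, added only to show a modification.
    headers["x-example-browser"] = browser_type_name
    return headers


PLAYWRIGHT_PROCESS_REQUEST_HEADERS = custom_headers
```

Accepting `**kwargs` is a defensive choice in this sketch: if a later release passes additional keyword arguments, the function keeps working.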
Install
- pip install scrapy-playwright playwright
- playwright install
Imports
- ScrapyPlaywrightDownloadHandler
from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler
- PageMethod
from scrapy_playwright.page import PageMethod
- Request (the package has no separate request class; Playwright is enabled per request with meta={"playwright": True})
from scrapy import Request
Quickstart
import scrapy
from scrapy_playwright.page import PageMethod

# settings.py configuration (add these to your project's settings.py)
# DOWNLOAD_HANDLERS = {
#     "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
#     "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
# }
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# PLAYWRIGHT_BROWSER_TYPE = "chromium"  # or "firefox", "webkit"
# PLAYWRIGHT_LAUNCH_OPTIONS = {
#     "headless": True,  # Set to False for visual debugging
#     "timeout": 60000,  # 60 seconds
# }

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["https://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        # Wait until the JavaScript-rendered quotes are in the DOM
                        PageMethod("wait_for_selector", "div.quote"),
                        # PageMethod("screenshot", path="screenshot.png", full_page=True),
                    ],
                },
                callback=self.parse_quotes,
            )

    async def parse_quotes(self, response):
        # The response is a regular Scrapy response whose body is the rendered page
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow pagination, rendering the next page with Playwright as well
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield scrapy.Request(
                response.urljoin(next_page),
                meta={
                    'playwright': True,
                    'playwright_page_methods': [
                        PageMethod("wait_for_selector", "div.quote"),
                    ],
                },
                callback=self.parse_quotes,
            )
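
To try the spider outside a full Scrapy project, the settings shown as comments in the Quickstart can be passed to Scrapy's `CrawlerProcess` instead of living in `settings.py`. The following is a minimal sketch under that assumption; `PLAYWRIGHT_SETTINGS`, `build_settings`, and `run_spider` are hypothetical helper names, not part of either library.

```python
# Settings copied from the Quickstart comments, expressed as a plain dict.
PLAYWRIGHT_SETTINGS = {
    "DOWNLOAD_HANDLERS": {
        "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    },
    "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    "PLAYWRIGHT_BROWSER_TYPE": "chromium",
    "PLAYWRIGHT_LAUNCH_OPTIONS": {"headless": True, "timeout": 60000},
}


def build_settings(overrides=None):
    """Return a copy of the base settings with optional overrides applied."""
    settings = dict(PLAYWRIGHT_SETTINGS)
    settings.update(overrides or {})
    return settings


def run_spider(spider_cls, overrides=None):
    """Run a single spider in this process and block until it finishes."""
    # Imported here so the settings helpers above work without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings=build_settings(overrides))
    process.crawl(spider_cls)
    process.start()
```

With the Quickstart spider defined in the same file, calling `run_spider(MySpider)` under an `if __name__ == "__main__":` guard would start the crawl; `run_spider(MySpider, {"PLAYWRIGHT_BROWSER_TYPE": "firefox"})` would switch the browser for that run.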