Scrapy Playwright

0.0.46 · active · verified Tue Apr 14

Scrapy Playwright is a Scrapy download handler that integrates Playwright for Python, so that Scrapy spiders can scrape dynamic pages whose content is rendered by JavaScript. It brings browser automation into the Scrapy framework, making it possible to interact with complex page elements while keeping Scrapy's crawling and scheduling model. The library is actively maintained with frequent releases; the current version is 0.0.46.

Quickstart

To use Scrapy Playwright, configure your project's `settings.py` to register `ScrapyPlaywrightDownloadHandler` for the `http` and `https` schemes, and set `TWISTED_REACTOR` to the asyncio-based reactor. Then, in your spider, create `scrapy.Request` objects with `meta={'playwright': True}`; requests without that key are downloaded by Scrapy's regular handler. You can also define page interactions by passing a list of `PageMethod` objects in the `playwright_page_methods` meta key, for example to wait for an element or perform an action before the response reaches the callback.

import scrapy
from scrapy_playwright.page import PageMethod

# settings.py configuration (add these to your project's settings.py)
# DOWNLOAD_HANDLERS = {
#     "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
#     "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
# }
# TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
# PLAYWRIGHT_BROWSER_TYPE = "chromium" # or 'firefox', 'webkit'
# PLAYWRIGHT_LAUNCH_OPTIONS = {
#     "headless": True, # Set to False for visual debugging
#     "timeout": 60000 # 60 seconds
# }

class MySpider(scrapy.Spider):
    name = "my_spider"
    start_urls = ["https://quotes.toscrape.com/js/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "div.quote"),
                        # PageMethod("screenshot", path="screenshot.png", full_page=True),
                    ],
                },
                callback=self.parse_quotes,
            )

    async def parse_quotes(self, response):
        # The response is a regular Scrapy response whose body is the HTML
        # after Playwright has rendered the page, so the usual selectors work.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield scrapy.Request(
                response.urljoin(next_page),
                meta={
                    "playwright": True,
                    "playwright_page_methods": [
                        PageMethod("wait_for_selector", "div.quote"),
                    ],
                },
                callback=self.parse_quotes,
            )
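
The settings shown as comments at the top of the example could be written out in the project's `settings.py` roughly as follows. This is a minimal sketch rather than a complete settings file: the handler path, reactor, browser type and launch options are taken from the comments above, while `PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT` is an extra scrapy-playwright setting not used in the original example, and its value here is only an assumption to be tuned.

```python
# settings.py (sketch; only the Playwright-related settings are shown)

# Route both schemes through the Playwright download handler.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# The handler requires the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Browser to launch: "chromium", "firefox" or "webkit".
PLAYWRIGHT_BROWSER_TYPE = "chromium"

# Keyword arguments passed on when the browser is launched.
PLAYWRIGHT_LAUNCH_OPTIONS = {
    "headless": True,   # set to False for visual debugging
    "timeout": 60_000,  # browser launch timeout in milliseconds (60 s)
}

# Assumption: not part of the original example. Page navigation timeout
# in milliseconds; the value is illustrative only.
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30_000
```

With these in place, only requests that carry `meta={"playwright": True}` are rendered in the browser, so a spider can mix browser-rendered and plain HTTP requests.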
