{"id":6237,"library":"scrapy-playwright","title":"Scrapy Playwright","description":"Scrapy Playwright is a Scrapy Download Handler that integrates Playwright for Python, enabling Scrapy spiders to effectively scrape dynamic web pages that rely heavily on JavaScript rendering. It allows for browser automation within the Scrapy framework, facilitating interaction with complex web elements, while maintaining Scrapy's efficient crawling and scheduling model. The library is actively maintained with frequent releases, currently at version 0.0.46.","status":"active","version":"0.0.46","language":"en","source_language":"en","source_url":"https://github.com/scrapy-plugins/scrapy-playwright","tags":["Scrapy","Playwright","web scraping","automation","dynamic content","JavaScript","headless browser"],"install":[{"cmd":"pip install scrapy-playwright playwright","lang":"bash","label":"Install package"},{"cmd":"playwright install","lang":"bash","label":"Install browser binaries"}],"dependencies":[{"reason":"Core web scraping framework; requires >=2.7","package":"Scrapy","optional":false},{"reason":"Headless browser automation library; requires >=1.40 (Python version)","package":"playwright","optional":false}],"imports":[{"note":"Used in settings.py to enable Playwright for HTTP/HTTPS requests.","symbol":"ScrapyPlaywrightDownloadHandler","correct":"from scrapy_playwright.handler import ScrapyPlaywrightDownloadHandler"},{"note":"Used to define actions (e.g., click, wait_for_selector) within a Playwright request's meta field.","symbol":"PageMethod","correct":"from scrapy_playwright.page import PageMethod"},{"note":"While Request objects are usually from 'scrapy', some examples might mistakenly use 'page' or an old path for Playwright-specific request objects. 
The standard approach is to use `scrapy.Request` with `meta={'playwright': True}`; the scrapy-playwright README does not document a `PlaywrightRequest` class, so verify that either import path exists in your installed version before relying on it.","wrong":"from scrapy_playwright.page import PlaywrightRequest","symbol":"PlaywrightRequest","correct":"from scrapy_playwright.request import PlaywrightRequest"}],"quickstart":{"code":"import scrapy\nfrom scrapy_playwright.page import PageMethod\n\n# settings.py configuration (add these to your project's settings.py)\n# DOWNLOAD_HANDLERS = {\n#     \"http\": \"scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler\",\n#     \"https\": \"scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler\",\n# }\n# TWISTED_REACTOR = \"twisted.internet.asyncioreactor.AsyncioSelectorReactor\"\n# PLAYWRIGHT_BROWSER_TYPE = \"chromium\" # or 'firefox', 'webkit'\n# PLAYWRIGHT_LAUNCH_OPTIONS = {\n#     \"headless\": True, # Set to False for visual debugging\n#     \"timeout\": 60000 # browser launch timeout in milliseconds (60 seconds)\n# }\n\nclass MySpider(scrapy.Spider):\n    name = \"my_spider\"\n    start_urls = [\"https://quotes.toscrape.com/js/\"]\n\n    def start_requests(self):\n        for url in self.start_urls:\n            yield scrapy.Request(\n                url,\n                meta={\n                    \"playwright\": True,\n                    \"playwright_page_methods\": [\n                        PageMethod(\"wait_for_selector\", \"div.quote\"),\n                        # PageMethod(\"screenshot\", path=\"screenshot.png\", full_page=True),\n                    ],\n                },\n                callback=self.parse_quotes,\n            )\n\n    async def parse_quotes(self, response):\n        # response is a regular Scrapy response; its body holds the Playwright-rendered HTML\n        for quote in response.css('div.quote'):\n            yield {\n                'text': quote.css('span.text::text').get(),\n                'author': quote.css('small.author::text').get(),\n                'tags': quote.css('div.tags a.tag::text').getall(),\n            }\n\n        # Follow pagination, rendering the next page with Playwright as well\n        next_page = response.css('li.next a::attr(href)').get()\n        if next_page is not None:\n            yield scrapy.Request(\n                response.urljoin(next_page),\n                meta={\n                    \"playwright\": True,\n                    \"playwright_page_methods\": [\n                        PageMethod(\"wait_for_selector\", \"div.quote\"),\n                    ],\n                },\n                callback=self.parse_quotes,\n            )\n","lang":"python","description":"To use Scrapy Playwright, configure your `settings.py` file to register the `ScrapyPlaywrightDownloadHandler` for `http` and `https` and set `TWISTED_REACTOR` to the asyncio reactor. Then, within your spider, create `scrapy.Request` objects with `meta={'playwright': True}`. You can also define page interactions using `playwright_page_methods` with `PageMethod` objects to wait for elements or perform actions before parsing."},"warnings":[{"fix":"Upgrade Python to 3.10 or newer. Ensure your project's `requires-python` metadata is set accordingly.","message":"Python 3.8 support was dropped in `scrapy-playwright==0.0.44`. Users on Python 3.8 or older must upgrade their Python version to >=3.10 to use recent versions.","severity":"breaking","affected_versions":">=0.0.44"},{"fix":"Review and update any custom configurations or direct imports related to Scrapy's HTTP download handlers if they are dependent on `scrapy-playwright`'s internal path for this component.","message":"The import path for `HTTP11DownloadHandler` was updated in `scrapy-playwright==0.0.45`. This might affect projects with highly customized download handler setups or direct imports.","severity":"breaking","affected_versions":">=0.0.45"},{"fix":"Update custom `PLAYWRIGHT_PROCESS_REQUEST_HEADERS` functions to accept keyword arguments instead of positional ones.","message":"Positional argument handling for the function passed to the `PLAYWRIGHT_PROCESS_REQUEST_HEADERS` setting was deprecated in version 0.0.41. 
Arguments should now be handled by keyword.","severity":"deprecated","affected_versions":">=0.0.41"},{"fix":"Run `playwright install` in your terminal after installing the Python package to download the necessary browser binaries (Chromium, Firefox, WebKit).","message":"Playwright requires browser binaries to be installed separately after the Python package. The `pip install playwright` command does not install these binaries by default. Without them, Playwright will not function.","severity":"gotcha","affected_versions":"All"},{"fix":"Add `TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'` to your `settings.py` file.","message":"Scrapy must be configured to use an asyncio-compatible Twisted reactor (e.g., `twisted.internet.asyncioreactor.AsyncioSelectorReactor`) for `scrapy-playwright` to work correctly. Failing to do so will lead to asynchronous request failures.","severity":"gotcha","affected_versions":"All"},{"fix":"Manage concurrency using Scrapy's `CONCURRENT_REQUESTS` and `CONCURRENT_REQUESTS_PER_DOMAIN` settings. Additionally, `PLAYWRIGHT_MAX_PAGES_PER_CONTEXT` can limit Playwright's parallel page usage. Always use `headless=True` in `PLAYWRIGHT_LAUNCH_OPTIONS` unless debugging visually.","message":"Playwright instances consume significant memory and CPU. Running many concurrent Playwright pages can exhaust system resources, leading to crashes or slow performance. This is especially critical when running non-headless browsers.","severity":"gotcha","affected_versions":"All"},{"fix":"Avoid directly calling `Page.route` or `Page.unroute` on Playwright page objects obtained via `response.meta['playwright_page']` unless you have a deep understanding of the internal workings and are prepared for potential conflicts.","message":"Playwright's `Page.route` and `Page.unroute` methods are used internally by `scrapy-playwright`. 
Directly using these methods in user code can interfere with the library's functionality and lead to unexpected behavior.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-14T00:00:00.000Z","next_check":"2026-07-13T00:00:00.000Z","problems":[]}