Apify SDK for Python

Version 3.3.2 · verified Thu Apr 16 · auth: no · python

The Apify SDK for Python is the official library for creating Apify Actors. Actors are serverless cloud programs that can perform various web scraping and automation tasks. This SDK provides tools for Actor lifecycle management, local storage emulation, and event handling, allowing developers to build scalable data extraction solutions. It is actively maintained, with the current stable version being 3.3.2.

pip install apify
error AttributeError: 'Actor' object has no attribute 'main'
cause Attempting to call the `main()` method on the `Actor` class, which was removed in Apify SDK v2.0 and replaced by the `async with Actor:` context manager pattern.
fix
Refactor your Actor's entry point to use the asynchronous context manager: define `async def main()`, put your logic inside an `async with Actor:` block, and run it with `asyncio.run(main())`.
error TypeError: RequestQueue.add_request() got an unexpected keyword argument 'url'
cause The method was called with a `url=` keyword argument. In Apify SDK v2.0+, `RequestQueue.add_request()` takes the request as its first positional argument, either a URL string or an `apify.Request` object, and has no `url` parameter.
fix
Pass the URL string or an `apify.Request` object as the first argument. A plain string is enough for simple URLs: `await request_queue.add_request('http://example.com')`. For requests that need extra options, build the object with `Request.from_url()`, which fills in the required `unique_key`, rather than calling the bare constructor: `from apify import Request; await request_queue.add_request(Request.from_url('http://example.com'))`.
error ApifyApiError: Actor input schema validation failed
cause The input provided to the Actor (either via Apify Console, API, or local `INPUT.json`) does not conform to the `INPUT_SCHEMA.json` defined for the Actor. This validation happens before the Actor's code even starts.
fix
Review your Actor's `INPUT_SCHEMA.json` and make sure the input matches the schema exactly, including data types, required fields, and patterns. Use the visual input schema editor in Apify Console or the `apify validate-schema` CLI command to check that the schema itself is valid.
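A rough local pre-check can catch the most common mismatches (missing required fields, wrong top-level types) before a run is started. The sketch below uses only the standard library; `check_input`, `JSON_TYPES`, and `example_schema` are hypothetical names, the schema is a simplified stand-in, and this is not the validator the Apify platform runs.

```python
from typing import Any

# Simplified mapping from JSON Schema type names to Python types (assumption for this sketch).
JSON_TYPES: dict[str, tuple[type, ...]] = {
    'string': (str,),
    'integer': (int,),
    'number': (int, float),
    'boolean': (bool,),
    'array': (list,),
    'object': (dict,),
}


def check_input(actor_input: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems: list[str] = []

    # Every field listed under "required" must be present.
    for field in schema.get('required', []):
        if field not in actor_input:
            problems.append(f'missing required field: {field}')

    # Fields that are present must match the declared top-level type.
    for field, spec in schema.get('properties', {}).items():
        if field not in actor_input:
            continue
        expected = JSON_TYPES.get(spec.get('type', ''))
        if expected is None:
            continue  # unknown or missing type: nothing to check in this sketch
        value = actor_input[field]
        # bool is a subclass of int in Python, so reject it for numeric fields explicitly.
        is_stray_bool = isinstance(value, bool) and bool not in expected
        if not isinstance(value, expected) or is_stray_bool:
            problems.append(f'field {field!r} should be of type {spec["type"]}')

    return problems


# Hypothetical, heavily simplified schema used only for illustration.
example_schema = {
    'properties': {
        'start_urls': {'type': 'array'},
        'max_pages': {'type': 'integer'},
    },
    'required': ['start_urls'],
}

if __name__ == '__main__':
    for problem in check_input({'max_pages': '10'}, example_schema):
        print(problem)
```

A check like this only narrows down where to look; patterns, nested items, and editor-specific rules still need the real schema tooling.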
breaking The Apify SDK v3.0 introduced significant breaking changes from v2.x, including a complete overhaul of storage APIs (Dataset, KeyValueStore, RequestQueue). Older methods like `from_storage_object`, `get_info`, and `storage_object` have been removed or replaced. Default storage IDs in configuration changed from 'default' to `None`.
fix Consult the official 'Upgrading to v3' documentation. Replace removed methods with their v3 counterparts (e.g., `open` for storages, `get_metadata` for info). Adapt to `crawlee` v1.0 storage API changes.
breaking The `Actor.main()` method was removed in SDK v2.0 and is no longer supported in v3.x. Its functionality is replaced by the `async with Actor:` context manager, which handles initialization and graceful shutdown automatically.
fix Move your Actor's main logic into an `async with Actor:` block inside an `async def main()` coroutine, and start it with `asyncio.run(main())`. The context manager then takes care of lifecycle management and resource handling.
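To make the lifecycle guarantee concrete, here is a minimal standard-library sketch of how an asynchronous context manager runs setup on entry and teardown on exit, even when the body raises. `LifecycleDemo` and `run` are hypothetical stand-ins written for illustration; they are not part of the Apify SDK and only mimic the shape of the pattern.

```python
import asyncio


class LifecycleDemo:
    """Hypothetical stand-in for an Actor-like object; not part of the Apify SDK."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def __aenter__(self) -> 'LifecycleDemo':
        # Setup runs when the `async with` block is entered.
        self.events.append('init')
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        # Teardown runs when the block is left, with or without an exception.
        self.events.append('fail' if exc_type else 'exit')
        return False  # do not swallow the exception in this sketch


async def run(should_fail: bool) -> list[str]:
    demo = LifecycleDemo()
    try:
        async with demo:
            demo.events.append('work')
            if should_fail:
                raise RuntimeError('simulated failure')
    except RuntimeError:
        pass  # the teardown step has already been recorded at this point
    return demo.events


if __name__ == '__main__':
    print(asyncio.run(run(should_fail=False)))
    print(asyncio.run(run(should_fail=True)))
```

The point of the pattern is that cleanup cannot be forgotten: there is no separate shutdown call for the author of the main logic to remember.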
breaking Apify SDK v3.x requires Python 3.10 or higher. The previous major version, v2.x, required Python 3.9+, having dropped the 3.8 support that v1.x still had.
fix Ensure your Python environment is running version 3.10 or newer. Upgrade Python if necessary.
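An early version guard in the entry script can turn an obscure import or syntax failure into a clear message. This is a small sketch using only the standard library; `MIN_PYTHON` and `is_supported` are hypothetical names, and the minimum is taken from the requirement stated above.

```python
import sys

# Minimum interpreter version for SDK v3.x, as stated in the entry above.
MIN_PYTHON = (3, 10)


def is_supported(version: tuple, minimum: tuple = MIN_PYTHON) -> bool:
    """Compare (major, minor) tuples; tuple comparison is element-wise."""
    return tuple(version[:2]) >= tuple(minimum)


if __name__ == '__main__':
    current = (sys.version_info.major, sys.version_info.minor)
    if is_supported(current):
        print(f'Python {current[0]}.{current[1]} is supported')
    else:
        print(f'Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required, found {current[0]}.{current[1]}')
```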
gotcha In Apify SDK v3.0+, local storage is automatically purged (cleared) at the start of an Actor run (during `Actor.init()` or `async with Actor:`). This differs from v2.x, where the `--purge` CLI argument was required.
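fix If you need to preserve local storage between runs for testing or specific workflows, disable automatic purging through the configuration. The `Configuration` class exposes a `purge_on_start` flag, so a configured Actor would look like `async with Actor(configuration=Configuration(purge_on_start=False)):` (with `from apify import Actor, Configuration`). Check the `Configuration` reference for your SDK version before relying on it.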
gotcha Python's mutable default arguments can lead to unexpected behavior if not handled correctly. If a function's default argument is a mutable object (like a list or dictionary) and it's modified within the function, the change persists across subsequent calls, leading to state leakage.
fix Avoid using mutable objects as default arguments. Instead, use `None` as the default and initialize the mutable object inside the function if `None` is detected: `def func(arg=None): arg = arg if arg is not None else []`.
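The difference is easiest to see side by side. In this short sketch, `collect_bad` and `collect_good` are hypothetical helper names; the first shares one list across calls, the second creates a fresh list whenever none is passed.

```python
def collect_bad(url, seen=[]):
    # The default list is created once, when the function is defined,
    # so every call without `seen` appends to the same object.
    seen.append(url)
    return seen


def collect_good(url, seen=None):
    # A fresh list is created on each call that does not pass `seen`.
    seen = seen if seen is not None else []
    seen.append(url)
    return seen


if __name__ == '__main__':
    first = collect_bad('https://a.example')
    second = collect_bad('https://b.example')
    print(first is second)  # both names point at the shared default list

    print(collect_good('https://a.example'))
    print(collect_good('https://b.example'))
```

In an Actor this matters most for helpers that accumulate URLs or results: state leaking between calls can silently mix data from unrelated requests.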
pip install "apify[scrapy]"

This quickstart demonstrates how to create a simple Apify Actor that fetches URLs from an input, scrapes their titles using HTTPX and BeautifulSoup, and pushes the extracted data to the default dataset. It utilizes the `async with Actor:` context manager for proper lifecycle management and `RequestQueue` for managing URLs.

import asyncio
import httpx
from bs4 import BeautifulSoup
from apify import Actor

async def main() -> None:
    async with Actor:
        # Retrieve the Actor input, or use a default if not provided
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])

        # Open the default request queue
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs
        for start_url_obj in start_urls:
            url = start_url_obj.get('url')
            if url:
                await request_queue.add_request(url)

        # Process the URLs from the request queue
        while True:
            request = await request_queue.fetch_next_request()

            if not request:
                break

            Actor.log.info(f'Processing {request.url}')
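            try:
                # Follow redirects so that 3xx responses do not trip raise_for_status()
                async with httpx.AsyncClient(follow_redirects=True) as client:
                    response = await client.get(request.url)
                    response.raise_for_status()

                soup = BeautifulSoup(response.content, 'html.parser')
                data = {
                    'url': request.url,
                    'title': soup.title.string if soup.title else None,
                    'status_code': response.status_code,
                }
                await Actor.push_data(data)
            except httpx.HTTPError as e:
                # HTTPError covers both bad status codes and transport errors (timeouts, DNS failures)
                Actor.log.error(f'Failed to fetch {request.url}: {e}')
            finally:
                # Mark the request as handled either way so the loop can terminate
                await request_queue.mark_request_handled(request)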

if __name__ == '__main__':
    asyncio.run(main())
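The quickstart assumes every `start_urls` entry is a dict with a `url` key. If the input might also contain plain strings or malformed entries, a small normalization step before enqueueing keeps the loop simple. The helper below is a hypothetical sketch (`extract_urls` is not an SDK function) and uses only the standard library.

```python
from typing import Any


def extract_urls(start_urls: list[Any]) -> list[str]:
    """Return de-duplicated URL strings from a mixed list of strings and {'url': ...} dicts."""
    urls: list[str] = []
    seen: set[str] = set()

    for entry in start_urls:
        if isinstance(entry, str):
            url = entry
        elif isinstance(entry, dict):
            url = entry.get('url')
        else:
            url = None  # unsupported entry type: skip it

        if not isinstance(url, str):
            continue
        url = url.strip()
        # Keep the first occurrence of each non-empty URL, preserving input order.
        if url and url not in seen:
            seen.add(url)
            urls.append(url)

    return urls


if __name__ == '__main__':
    mixed = ['https://apify.com', {'url': 'https://example.com'}, {'url': None}, {}, 42]
    for url in extract_urls(mixed):
        print(url)
```

With a helper like this, the enqueue loop in the quickstart could iterate over `extract_urls(start_urls)` and pass each string straight to `add_request()`.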