Apify SDK for Python
The Apify SDK for Python is the official library for creating Apify Actors. Actors are serverless cloud programs that can perform various web scraping and automation tasks. This SDK provides tools for Actor lifecycle management, local storage emulation, and event handling, allowing developers to build scalable data extraction solutions. It is actively maintained, with the current stable version being 3.3.2.
Common errors
- AttributeError: 'Actor' object has no attribute 'main'
  cause: Calling the `main()` method on the `Actor` class, which was removed in Apify SDK v2.0 and replaced by the `async with Actor:` context manager pattern.
  fix: Refactor your Actor's entry point to use the asynchronous context manager: define `async def main():`, put your logic inside `async with Actor:`, and run it with `asyncio.run(main())`.
- TypeError: RequestQueue.add_request() got an unexpected keyword argument 'url'
  cause: In Apify SDK v2.0+, `RequestQueue.add_request()` takes the request as its first positional argument, typically an `apify.Request` object. It does not accept a `url=` keyword, so calls such as `add_request(url='http://example.com')` fail.
  fix: Pass the request positionally. For simple cases a plain URL string is usually accepted; for requests that need more configuration, build a `Request` object first: `from apify import Request; await request_queue.add_request(Request.from_url('http://example.com'))`.
- ApifyApiError: Actor input schema validation failed
  cause: The input provided to the Actor (via Apify Console, the API, or a local `INPUT.json`) does not conform to the `INPUT_SCHEMA.json` defined for the Actor. This validation happens before the Actor's code starts.
  fix: Review your Actor's `INPUT_SCHEMA.json` and make sure the input matches it exactly, including data types, required fields, and patterns. Use the Apify Console's visual input schema editor or the `apify validate-schema` CLI command to check that the schema itself is valid.
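Because schema validation fails before any Actor code runs, it can help to check a local input against the schema's `required` list and basic `type` declarations before starting a run. The following is a minimal sketch using only the standard library. The `check_input` helper and `SAMPLE_SCHEMA` are hypothetical names invented for illustration, and the check covers only a small part of what the platform's validator enforces.

```python
from __future__ import annotations

from typing import Any

# Simplified mapping from JSON-schema type names to Python types (assumption: a subset only).
_TYPE_MAP = {
    'string': str,
    'integer': int,
    'number': (int, float),
    'boolean': bool,
    'array': list,
    'object': dict,
}

# Hypothetical schema, shaped like the 'required' and 'properties' parts of an input schema.
SAMPLE_SCHEMA = {
    'required': ['start_urls'],
    'properties': {
        'start_urls': {'type': 'array'},
        'max_pages': {'type': 'integer'},
    },
}


def check_input(schema: dict[str, Any], actor_input: dict[str, Any]) -> list[str]:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    for field in schema.get('required', []):
        if field not in actor_input:
            problems.append(f'missing required field: {field}')
    for field, spec in schema.get('properties', {}).items():
        type_name = spec.get('type')
        expected = _TYPE_MAP.get(type_name)
        if field not in actor_input or expected is None:
            continue
        value = actor_input[field]
        # bool is a subclass of int in Python, so reject it explicitly for numeric fields.
        is_bool_as_number = isinstance(value, bool) and type_name in ('integer', 'number')
        if is_bool_as_number or not isinstance(value, expected):
            problems.append(f'field {field!r} should be of type {type_name}')
    return problems


problems = check_input(SAMPLE_SCHEMA, {'max_pages': 'ten'})
for problem in problems:
    print(problem)
```

Running a check like this against a local `INPUT.json` before deployment catches the most common mismatches (missing fields, wrong types) early; patterns, enums, and nested structures would still need the real validator.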
Warnings
- breaking The Apify SDK v3.0 introduced significant breaking changes from v2.x, including a complete overhaul of storage APIs (Dataset, KeyValueStore, RequestQueue). Older methods like `from_storage_object`, `get_info`, and `storage_object` have been removed or replaced. Default storage IDs in configuration changed from 'default' to `None`.
- breaking The `Actor.main()` method was removed in SDK v2.0 and is no longer supported in v3.x. Its functionality is replaced by the `async with Actor:` context manager, which handles initialization and graceful shutdown automatically.
- breaking Apify SDK v3.x requires Python 3.10 or higher. Previous versions (v2.x) supported Python 3.9+ (v1.x dropped 3.8 support).
- gotcha In Apify SDK v3.0+, local storage is automatically purged (cleared) at the start of an Actor run (during `Actor.init()` or `async with Actor:`). This differs from v2.x, where the `--purge` CLI argument was required.
- gotcha Python's mutable default arguments can lead to unexpected behavior if not handled correctly. If a function's default argument is a mutable object (like a list or dictionary) and it's modified within the function, the change persists across subsequent calls, leading to state leakage.
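The mutable default argument gotcha can be illustrated with a short, SDK-independent sketch. The function names `collect_bad` and `collect_good` and the example URLs are made up for this illustration. The first function shares one list across calls; the second uses `None` as a sentinel and creates a fresh list each time.

```python
from __future__ import annotations


def collect_bad(url: str, seen: list[str] = []) -> list[str]:
    """Pitfall: the default list is created once, when the function is defined."""
    seen.append(url)
    return seen


def collect_good(url: str, seen: list[str] | None = None) -> list[str]:
    """Fix: default to None and create a new list inside the function."""
    if seen is None:
        seen = []
    seen.append(url)
    return seen


# Copies are taken so each snapshot reflects the state right after the call.
first_bad = list(collect_bad('https://example.com/a'))
second_bad = list(collect_bad('https://example.com/b'))  # also contains the first URL

first_good = collect_good('https://example.com/a')
second_good = collect_good('https://example.com/b')  # contains only its own URL

print(second_bad)
print(second_good)
```

In an Actor this matters most for helper functions called once per request: a shared default list or dictionary quietly accumulates data from every earlier request.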
Install
- pip install apify
- pip install "apify[scrapy]"
Imports
- Actor
from apify import Actor
- Request
from apify import Request
(use the top-level import above rather than `from apify.storages import Request`)
Quickstart
import asyncio

import httpx
from bs4 import BeautifulSoup

from apify import Actor


async def main() -> None:
    async with Actor:
        # Retrieve the Actor input, or use a default if not provided
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])

        # Open the default request queue
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs
        for start_url_obj in start_urls:
            url = start_url_obj.get('url')
            if url:
                await request_queue.add_request(url)

        # Process the URLs from the request queue
        while True:
            request = await request_queue.fetch_next_request()
            if not request:
                break

            Actor.log.info(f'Processing {request.url}')
            try:
                async with httpx.AsyncClient() as client:
                    response = await client.get(request.url)
                    response.raise_for_status()

                soup = BeautifulSoup(response.content, 'html.parser')
                data = {
                    'url': request.url,
                    'title': soup.title.string if soup.title else None,
                    'status_code': response.status_code,
                }
                await Actor.push_data(data)
            except httpx.HTTPError as e:
                # httpx.HTTPError covers both bad status codes and network-level failures
                Actor.log.error(f'Failed to fetch {request.url}: {e}')
            finally:
                # Mark the request as handled whether or not the fetch succeeded
                await request_queue.mark_request_handled(request)


if __name__ == '__main__':
    asyncio.run(main())
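The Quickstart reads `start_urls` as a list of objects with a `url` key and skips entries without one. That step can be pulled out into a small pure function that is easy to test without running an Actor. The sketch below is one possible way to do it: `extract_start_urls` and `DEFAULT_START_URLS` are hypothetical names, and accepting bare strings and removing duplicates are assumptions added for illustration, not SDK behavior.

```python
from __future__ import annotations

from typing import Any

# Same fallback the Quickstart uses when no input is provided.
DEFAULT_START_URLS = [{'url': 'https://apify.com'}]


def extract_start_urls(actor_input: dict[str, Any] | None) -> list[str]:
    """Return the non-empty, de-duplicated URLs from the Actor input, in order."""
    # Note: unlike the Quickstart, an empty 'start_urls' list also falls back to the default.
    entries = (actor_input or {}).get('start_urls') or DEFAULT_START_URLS
    urls = []
    for entry in entries:
        # Accept either {'url': '...'} objects or bare strings (the latter is an assumption).
        url = entry.get('url') if isinstance(entry, dict) else entry
        if isinstance(url, str) and url and url not in urls:
            urls.append(url)
    return urls


example_input = {
    'start_urls': [
        {'url': 'https://a.example'},
        {'url': ''},
        {},
        'https://b.example',
        {'url': 'https://a.example'},
    ]
}
print(extract_start_urls(example_input))
```

Inside the Actor, the enqueue loop would then iterate over `extract_start_urls(actor_input)` and pass each URL to `request_queue.add_request()`.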