Apify SDK for Python

3.3.2 · active · verified Thu Apr 16

The Apify SDK for Python is the official library for creating Apify Actors. Actors are serverless cloud programs that can perform various web scraping and automation tasks. This SDK provides tools for Actor lifecycle management, local storage emulation, and event handling, allowing developers to build scalable data extraction solutions. It is actively maintained, with the current stable version being 3.3.2.

Install
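The SDK is published on PyPI as the `apify` package:

```shell
pip install apify
```

HTTPX and BeautifulSoup, used in the quickstart below, are separate dependencies (`pip install httpx beautifulsoup4`).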

Imports
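The quickstart below relies on these imports; `httpx` and `bs4` are third-party packages, while `asyncio` comes from the standard library:

```python
import asyncio                   # standard-library event loop runner
import httpx                     # async HTTP client
from bs4 import BeautifulSoup    # HTML parser
from apify import Actor          # Actor lifecycle, input, and storage APIs
```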

Quickstart

This quickstart demonstrates how to create a simple Apify Actor that fetches URLs from its input, scrapes each page's title with HTTPX and BeautifulSoup, and pushes the extracted data to the default dataset. It uses the `async with Actor:` context manager for proper lifecycle management and a `RequestQueue` to manage the URLs.

import asyncio
import httpx
from bs4 import BeautifulSoup
from apify import Actor

async def main() -> None:
    async with Actor:
        # Retrieve the Actor input, or use a default if not provided
        actor_input = await Actor.get_input() or {}
        start_urls = actor_input.get('start_urls', [{'url': 'https://apify.com'}])

        # Open the default request queue
        request_queue = await Actor.open_request_queue()

        # Enqueue the start URLs
        for start_url_obj in start_urls:
            url = start_url_obj.get('url')
            if url:
                await request_queue.add_request(url)

        # Process the URLs from the request queue, reusing a single HTTP client
        async with httpx.AsyncClient() as client:
            while True:
                request = await request_queue.fetch_next_request()
                if not request:
                    break

                Actor.log.info(f'Processing {request.url}')
                try:
                    response = await client.get(request.url)
                    response.raise_for_status()

                    soup = BeautifulSoup(response.content, 'html.parser')
                    data = {
                        'url': request.url,
                        'title': soup.title.string if soup.title else None,
                        'status_code': response.status_code,
                    }
                    await Actor.push_data(data)
                except httpx.HTTPError as e:
                    # httpx.HTTPError covers both connection errors and non-2xx responses
                    Actor.log.error(f'Failed to fetch {request.url}: {e}')
                finally:
                    # Mark the request handled so it is not fetched again
                    await request_queue.mark_request_handled(request)

if __name__ == '__main__':
    asyncio.run(main())
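`Actor.get_input()` returns the Actor's JSON input as a dict. An input matching what the quickstart expects might look like this (the URLs are illustrative):

```json
{
  "start_urls": [
    { "url": "https://apify.com" },
    { "url": "https://crawlee.dev" }
  ]
}
```

When running locally with the Apify CLI, the input is typically read from `storage/key_value_stores/default/INPUT.json`.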
