ScrapeGraph Python SDK

1.46.0 · active · verified Thu Apr 16

ScrapeGraph Python SDK (version 1.46.0) is the official client for the ScrapeGraphAI API. It supports AI-powered web scraping, search, crawling, and structured data extraction driven by natural-language prompts. The library hides complexities such as proxy management and JavaScript rendering, and ships both synchronous and asynchronous clients. The project is actively developed and updated frequently.

Quickstart

This quickstart initializes the ScrapeGraph client with an API key (preferably read from an environment variable), then calls the `smartscraper` service to extract structured data from a webpage. A Pydantic `BaseModel` defines the desired output schema, which gives data validation and type safety. Always close the client after use.

import os
import sys

from scrapegraph_py import Client
from pydantic import BaseModel, Field

# Read the ScrapeGraphAI API key from the SGAI_API_KEY environment variable.
# For quick testing, you can replace the placeholder below with your key.
api_key = os.environ.get('SGAI_API_KEY', 'your_scrapegraph_api_key_here')

if not api_key or api_key == 'your_scrapegraph_api_key_here':
    # Exit with a non-zero status so a missing key is visible to callers.
    sys.exit("Error: set the SGAI_API_KEY environment variable or replace 'your_scrapegraph_api_key_here' with your actual API key.")

client = Client(api_key=api_key)

class ArticleData(BaseModel):
    title: str = Field(description="The article title")
    author: str = Field(description="The author's name")
    publish_date: str = Field(description="Article publication date")
    content: str = Field(description="Main article content")

try:
    # Use SmartScraper to extract structured data from a webpage
    response = client.smartscraper(
        website_url="https://example.com/blog/article-example",
        user_prompt="Extract the article information",
        output_schema=ArticleData
    )

    print(f"Title: {response.title}")
    print(f"Author: {response.author}")
    print(f"Published: {response.publish_date}")
    print(f"Content snippet: {response.content[:100]}...")

finally:
    # Always close the client connection
    client.close()
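
The two housekeeping steps in the quickstart, reading the key from the environment and always closing the client, can also be written with small helpers and the standard library's `contextlib.closing`. The sketch below is illustrative only: `StubClient`, `load_api_key`, and `run` are hypothetical names, and `StubClient` stands in for the real `Client` so the example runs without an API key or network access. It assumes only that the real client exposes `close()`, as the quickstart shows.

```python
import os
from contextlib import closing


class StubClient:
    """Hypothetical stand-in for scrapegraph_py.Client; only close() is modelled."""

    def __init__(self, api_key: str) -> None:
        self.api_key = api_key
        self.closed = False

    def close(self) -> None:
        self.closed = True


def load_api_key(env=os.environ, name: str = "SGAI_API_KEY") -> str:
    """Return the API key from the given mapping, or raise if it is missing."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set")
    return key


def run(env) -> StubClient:
    # closing() calls client.close() when the block exits, even on an exception.
    with closing(StubClient(api_key=load_api_key(env))) as client:
        pass  # the smartscraper call from the quickstart would go here
    return client
```

Raising from `load_api_key` instead of exiting keeps the helper usable inside larger programs, where the caller decides how to report a missing key.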

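
To illustrate what the schema contributes independently of the API, the sketch below validates a payload against `ArticleData` locally. The helper `to_article` is hypothetical, and the idea that a dict response may nest the extracted fields under a `"result"` key is an assumption about the payload shape, not something the quickstart confirms. If the SDK already returns an `ArticleData` instance, the helper passes it through unchanged.

```python
from pydantic import BaseModel, Field, ValidationError


class ArticleData(BaseModel):
    title: str = Field(description="The article title")
    author: str = Field(description="The author's name")
    publish_date: str = Field(description="Article publication date")
    content: str = Field(description="Main article content")


def to_article(response) -> ArticleData:
    """Coerce a response into ArticleData (hypothetical helper)."""
    if isinstance(response, ArticleData):
        return response
    # Assumption: a dict response may wrap the fields under "result".
    payload = response.get("result", response)
    # Raises ValidationError if required fields are missing or mistyped.
    return ArticleData(**payload)


sample = {
    "title": "Example title",
    "author": "Example author",
    "publish_date": "2024-01-01",
    "content": "Example body text",
}
article = to_article({"result": sample})
print(article.title)
```

A payload missing any of the four fields raises `ValidationError`, which is the failure mode the schema is meant to surface early.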