ScrapeGraph Python SDK
ScrapeGraph Python SDK (version 1.46.0) is the official client for the ScrapeGraphAI API. It enables AI-powered web scraping, search, crawling, and structured data extraction using natural language prompts. The library abstracts away complexities like proxy management and JavaScript rendering, and offers both synchronous and asynchronous clients. The project is under active development with frequent releases.
Common errors
- ModuleNotFoundError: No module named 'scrapegraph_py'
  Cause: your script has the same name as the package (`scrapegraph_py.py` or `scrapegraphai.py`), so Python imports your local file instead of the installed library.
  Fix: rename the script to something different, e.g., `my_scraper.py`.
- ImportError: cannot import name 'OpenAI' from 'scrapegraphai.models'
  Cause: `OpenAI` is not exposed (or has been moved/renamed) in the current SDK version; the SDK handles LLM integration internally via the API key.
  Fix: avoid importing LLM classes such as `OpenAI` from `scrapegraphai.models`. Instead, configure the desired LLM (e.g., OpenAI, Gemini) through the ScrapeGraphAI platform's API key; the `scrapegraph-py` SDK talks to the ScrapeGraphAI API, which manages the underlying LLMs.
- scrapegraph_py.exceptions.APIError: Invalid API key
  Cause: the provided API key is incorrect, expired, or missing, so the `Client` cannot authenticate with the ScrapeGraphAI service.
  Fix: double-check the key for typos. Ensure the `SGAI_API_KEY` environment variable is set, or pass a valid key via `Client(api_key='...')`. Obtain a new key from the ScrapeGraphAI Dashboard if necessary.
- scrapegraph_py.exceptions.APIError: Invalid URL format
  Cause: the `website_url` parameter passed to a scraping method (e.g., `smartscraper`) is malformed or not a valid URL.
  Fix: verify that `website_url` is a complete, correctly formatted URL including the scheme (e.g., 'https://example.com').
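The 'Invalid URL format' error can often be caught before the API call with a quick client-side check. A minimal sketch using only the standard library; the helper name `is_valid_url` is illustrative, not part of the SDK:

```python
from urllib.parse import urlparse

def is_valid_url(url: str) -> bool:
    """Return True if url has an http(s) scheme and a network location."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

# Validate before handing the URL to a scraping method:
print(is_valid_url("https://example.com"))  # True
print(is_valid_url("example.com"))          # False: missing scheme
```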
Warnings
- breaking The ScrapeGraph ecosystem has seen API changes, particularly with the transition to 'v2 API surface' in related projects like Scrapegraph-ai. While `scrapegraph-py` (the SDK) strives for stability, older code using `ScrapeGraphClient` functions (e.g., `smart_scraper(client, url, prompt)`) may need to be updated to the `Client` class methods (e.g., `client.smartscraper(website_url, user_prompt)`).
- gotcha Failing to set the `SGAI_API_KEY` environment variable or providing an invalid API key will result in authentication errors (HTTP 401 Unauthorized) or 'Insufficient credits' errors. The client will not be able to perform API calls.
- gotcha When using local LLMs with certain functionalities, a 'Model not found, using default token size (8192)' error or an `ImportError: Could not import transformers python package` might occur. This indicates issues with LLM configuration or missing dependencies.
- gotcha Running web scraping operations too frequently or without adhering to website policies (e.g., `robots.txt`, terms of service) can lead to IP blocking (HTTP 429 Too Many Requests) or other service unavailability errors (HTTP 500, 503).
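Transient 429/5xx responses like those described above are commonly handled with exponential backoff. A minimal, SDK-agnostic sketch: the `call` parameter stands in for any client method, and the exact exception type you should retry on depends on the SDK version:

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def with_backoff(call: Callable[[], T], retries: int = 3, base_delay: float = 1.0) -> T:
    """Retry `call` on failure, doubling the delay after each attempt."""
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; propagate the last error
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")

# Hypothetical usage:
# with_backoff(lambda: client.smartscraper(website_url=url, user_prompt=prompt))
```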
Install
- pip install scrapegraph-py
- pip install scrapegraph-py[html]
- pip install scrapegraph-py[langchain]
Imports
- Client
from scrapegraph_py import Client
- ScrapeGraphClient (legacy interface; see Warnings)
from scrapegraph_py import ScrapeGraphClient
- TimeRange
from scrapegraph_py.models import TimeRange
Quickstart
import os
from scrapegraph_py import Client
from pydantic import BaseModel, Field

# Set your ScrapeGraph AI API key.
# It's recommended to set this as an environment variable: SGAI_API_KEY
api_key = os.environ.get('SGAI_API_KEY', 'your_scrapegraph_api_key_here')
if not api_key or api_key == 'your_scrapegraph_api_key_here':
    print("Warning: set the SGAI_API_KEY environment variable or replace 'your_scrapegraph_api_key_here' with your actual API key.")
    raise SystemExit(1)

client = Client(api_key=api_key)

class ArticleData(BaseModel):
    title: str = Field(description="The article title")
    author: str = Field(description="The author's name")
    publish_date: str = Field(description="Article publication date")
    content: str = Field(description="Main article content")

try:
    # Use SmartScraper to extract structured data from a webpage
    response = client.smartscraper(
        website_url="https://example.com/blog/article-example",
        user_prompt="Extract the article information",
        output_schema=ArticleData,
    )
    # smartscraper returns a dict; the extracted fields live under 'result'
    result = response["result"]
    print(f"Title: {result['title']}")
    print(f"Author: {result['author']}")
    print(f"Published: {result['publish_date']}")
    print(f"Content snippet: {result['content'][:100]}...")
finally:
    # Always close the client connection
    client.close()
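The try/finally pattern in the quickstart can be shortened with `contextlib.closing`, which guarantees `close()` is called even if an exception is raised. Sketched here with a stand-in client class so the pattern is runnable on its own; substitute the real `Client` in practice:

```python
from contextlib import closing

class DummyClient:
    """Stand-in for scrapegraph_py.Client; only models the close() contract."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

client = DummyClient()
with closing(client):
    pass  # call client.smartscraper(...) here with the real SDK
print(client.closed)  # True
```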