ScrapingBee Python SDK
ScrapingBee is a web scraping API that handles headless browsers and rotates proxies for you. The Python SDK simplifies interaction with this API, offering features like JavaScript rendering, proxy rotation, AI-powered data extraction, and screenshot capabilities. It is currently at version 2.0.2 and receives regular updates, focusing on reliability and new API features.
Warnings
- breaking Version 2.0.0 fixed the URL encoding of parameters. Code that relied on, or worked around, the earlier (potentially incorrect) encoding may now produce different request URLs or have its parameters interpreted differently. [cite: GitHub release v2.0.0]
- breaking Python 3.6 support was officially dropped in version 1.2.0. Users running on Python 3.6 will encounter issues or be unable to upgrade past v1.1.8. [cite: GitHub release v1.2.0]
- gotcha Hardcoding your ScrapingBee API key directly into your scripts is a security risk. It should be stored securely, ideally in an environment variable.
- gotcha By default, `render_js` is `True` for `client.get()` requests, so JavaScript is executed and each request costs 5 credits. For simple static HTML pages this increases credit usage unnecessarily; set `render_js` to `False` in those cases.
- gotcha ScrapingBee plans have limits on concurrent requests. Exceeding this limit can lead to requests being queued or failing.
- gotcha `extract_rules` are powerful but, like any CSS/XPath selectors, they can break when the target website's HTML structure changes. Release v2.0.2 specifically fixed the handling of AI extract rules, which suggests this is an area where issues can arise. [cite: GitHub release v2.0.2]
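The concurrency gotcha above can be handled on the client side by capping how many requests are in flight at once. The following is a minimal sketch, not part of the SDK: `fetch_all` is a hypothetical helper, and the default of 5 is an assumed value that should be replaced with your plan's actual limit. It takes any callable, so the SDK call is passed in rather than imported here.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, urls, max_concurrency=5):
    """Call fetch(url) for each URL with at most max_concurrency calls in flight.

    Results are returned in the same order as urls.
    """
    if max_concurrency < 1:
        raise ValueError("max_concurrency must be at least 1")
    # The pool size is the cap: no more than max_concurrency calls run at once.
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        return list(pool.map(fetch, urls))


# Hypothetical usage with the SDK (not run here; needs a real API key):
#   client = ScrapingBeeClient(api_key=api_key)
#   responses = fetch_all(
#       lambda u: client.get(u, params={'render_js': False}),  # static pages
#       ['https://example.com/a', 'https://example.com/b'],
#       max_concurrency=5,  # assumed value; use your plan's actual limit
#   )
```

Passing `render_js` as `False` in the usage comment also addresses the credit gotcha for pages that do not need JavaScript.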
Install
pip install scrapingbee
Imports
- ScrapingBeeClient
from scrapingbee import ScrapingBeeClient
Quickstart
import json
import os

from scrapingbee import ScrapingBeeClient

# Store your API key in an environment variable rather than in the script
api_key = os.environ.get('SCRAPINGBEE_API_KEY', 'YOUR_API_KEY')
if api_key == 'YOUR_API_KEY':
    print("Warning: Replace 'YOUR_API_KEY' or set the SCRAPINGBEE_API_KEY environment variable.")

client = ScrapingBeeClient(api_key=api_key)
url_to_scrape = 'https://www.scrapingbee.com/blog/'

try:
    response = client.get(
        url_to_scrape,
        params={
            'render_js': True,  # Set to False to save credits if JavaScript rendering is not needed
            'extract_rules': {
                'title': 'h1',
                'subtitle': '#subtitle',
                'articles': {'selector': 'article h2 a', 'type': 'list', 'output': 'text'}
            }
        }
    )
    if response.ok:
        # With extract_rules, the body is usually JSON. The header may carry a
        # charset suffix, so check for the media type rather than exact equality.
        if 'application/json' in response.headers.get('content-type', ''):
            data = json.loads(response.content)
            print(json.dumps(data, indent=2))
        else:
            # Otherwise, it's the raw HTML
            print(response.text[:500])  # Print first 500 characters of HTML
    else:
        print(f"Failed to scrape {url_to_scrape}: Status {response.status_code}, Content: {response.text[:200]}")
except Exception as e:
    print(f"An error occurred: {e}")
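The quickstart branches on the response's Content-Type header to decide between JSON and raw HTML. That decision can be factored into a small helper that compares only the media type, so a header such as `application/json; charset=utf-8` is still treated as JSON. This is a sketch only: `parse_body` is a hypothetical name, not an SDK function.

```python
import json


def parse_body(content_type, body):
    """Return parsed JSON for JSON responses, otherwise the body as text."""
    # Keep only the media type, dropping parameters such as '; charset=utf-8'.
    media_type = (content_type or '').split(';', 1)[0].strip().lower()
    if media_type == 'application/json':
        return json.loads(body)
    if isinstance(body, bytes):
        return body.decode('utf-8', errors='replace')
    return body
```

With a response from the quickstart, it could be called as `parse_body(response.headers.get('content-type'), response.content)`.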