Robots Exclusion Protocol File Parser
robotspy is a Python library for parsing `robots.txt` files, implementing the Robots Exclusion Protocol (REP) as defined by RFC 9309. It lets applications determine whether a web crawler is permitted to access a given URL path on a server. The library is actively maintained (current version 0.13.0), with regular releases covering bug fixes and closer alignment with Google's robots.txt parsing behavior.
Common errors
- AttributeError: 'NoneType' object has no attribute 'can_fetch'
  Cause: The `read()` method of `RobotFileParser` was not called, or failed to fetch/parse the `robots.txt` file, leaving the parser in an uninitialized state.
  Fix: Always call `parser.read()` and handle potential network or parsing errors before calling `can_fetch()` or other methods that rely on the parsed content. Example: `try: parser.read() ... except Exception as e: print(f'Failed to read: {e}')`.
- urllib.error.HTTPError: HTTP Error 403: Forbidden
  Cause: The web server denied the request to fetch `robots.txt`, often because the request did not include a valid or recognized `User-Agent` header, or was identified as suspicious.
  Fix: When initializing `RobotFileParser`, provide a specific `user_agent` string: `parser = RobotFileParser(url='...', user_agent='MyCustomCrawler/1.0 AppleWebKit/537.36')`. Use a user agent that identifies your crawler and respects server policies.
- UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position ...
  Cause: The `robots.txt` file is encoded in something other than UTF-8, and either `robotspy` is older than v0.8.0 or it cannot automatically determine the correct encoding for an unusual file.
  Fix: Upgrade to `robotspy >= 0.8.0`, which includes improvements for handling different encodings. If the issue persists with extremely unusual encodings, you may need to fetch the file manually and decode it before feeding it to `RobotFileParser` (though `robotspy` does not expose an API for this beyond the URL).
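If you do need to handle an oddly encoded file yourself, a minimal decoding fallback can help. This is a hedged sketch of one common approach (not part of the robotspy API): try UTF-8 first, strip a BOM if present, and fall back to Latin-1, which maps every byte value and therefore never raises.

```python
def decode_robots(raw: bytes) -> str:
    """Best-effort decode of a fetched robots.txt body (illustrative helper)."""
    # RFC 9309 assumes UTF-8; strip a UTF-8 BOM if the server included one.
    if raw.startswith(b"\xef\xbb\xbf"):
        raw = raw[3:]
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 decodes any byte sequence, so this never fails,
        # at the cost of possibly misinterpreting exotic encodings.
        return raw.decode("latin-1")

print(decode_robots(b"User-agent: *\nDisallow: /tmp\xff"))
```

The Latin-1 fallback is lossy for genuinely different encodings (e.g. UTF-16), but it guarantees you get a string to work with instead of an exception.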
Warnings
- The interpretation of the '?' character in URL path patterns changed in v0.10.0. Previously, it was treated as a wildcard matching any single character; now it is treated as a literal '?' character, aligning with common `robots.txt` parsing behavior.
- The parser's behavior for handling user-agent product tokens was updated in v0.9.0 to align more closely with Google's robots parser. Older versions might have been more aggressive in discarding parts of a user-agent string containing malformed product tokens.
- Prior to v0.8.0, `robotspy` might have struggled or failed to correctly parse `robots.txt` files that were not UTF-8 encoded. The library was improved to handle non-UTF-8 encodings more robustly.
- When fetching `robots.txt` files, some websites (e.g., Cloudflare-protected sites) may return a 403 Forbidden error if no user agent is specified in the HTTP request. While `robotspy >= 0.8.0` adds a default user agent, custom or more specific user agents might still be needed.
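The v0.10.0 change to '?' handling is easiest to see with a tiny pattern matcher. The helper below is purely illustrative (it is not robotspy's implementation): it applies the RFC 9309 matching rules where `*` matches any character sequence, a trailing `$` anchors the end of the path, and everything else, including '?', matches literally.

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Illustrative robots.txt path matcher following RFC 9309 semantics."""
    anchored = pattern.endswith("$")  # trailing '$' anchors the end of the path
    if anchored:
        pattern = pattern[:-1]
    # '*' becomes '.*'; every other character, '?' included, is escaped
    # so it matches literally (the post-v0.10.0 behavior).
    regex = "".join(".*" if ch == "*" else re.escape(ch) for ch in pattern)
    regex = "^" + regex + ("$" if anchored else "")
    return re.match(regex, path) is not None

print(rule_matches("/search?q=*", "/search?q=robots"))  # '?' matches literally
print(rule_matches("/a?c", "/abc"))  # False: '?' no longer matches any single char
print(rule_matches("/fish$", "/fishing"))  # False: '$' anchors the match
```

Under the pre-v0.10.0 wildcard interpretation, `/a?c` would have matched `/abc`; with literal matching it does not.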
Install
- pip install robotspy
Imports
- RobotFileParser
from robotspy import RobotFileParser
Quickstart
from robotspy import RobotFileParser
# Note: robotspy uses urllib.request internally to fetch robots.txt
# Initialize the parser with the URL of the robots.txt file
# Using a well-known site's robots.txt for demonstration
parser = RobotFileParser(url="https://www.google.com/robots.txt")
try:
    # Read the robots.txt file from the specified URL.
    # This performs an HTTP GET request to fetch the file.
    parser.read()
    print(f"User-agent '*' can fetch /: {parser.can_fetch('*', '/')}")
    print(f"User-agent 'Googlebot' can fetch /search: {parser.can_fetch('Googlebot', '/search')}")
    print(f"User-agent 'Googlebot' can fetch /images/search: {parser.can_fetch('Googlebot', '/images/search')}")
    print(f"User-agent 'AdsBot-Google' can fetch /ads: {parser.can_fetch('AdsBot-Google', '/ads')}")
except Exception as e:
    print(f"Error reading robots.txt: {e}")
    print("Ensure you have network connectivity and the URL is correct.")
# Example of initializing with a custom user agent for fetching the robots.txt file itself
# This can help avoid 403 errors from some servers.
parser_custom_ua = RobotFileParser(url="https://www.google.com/robots.txt", user_agent="MyCustomCrawler/1.0")
try:
    parser_custom_ua.read()
    print(f"\nUsing 'MyCustomCrawler/1.0' to fetch, then checking for 'Googlebot': {parser_custom_ua.can_fetch('Googlebot', '/')}")
except Exception as e:
    print(f"Error reading robots.txt with custom user agent: {e}")
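For offline experimentation without any network access, the standard library's `urllib.robotparser.RobotFileParser` (which robotspy's interface mirrors) can consume rule lines directly via `parse()`. This sketch uses the stdlib parser, not robotspy itself:

```python
from urllib.robotparser import RobotFileParser

# Inline rules: everything under /private/ is off limits, the rest is allowed.
rules = """\
User-agent: *
Disallow: /private/
Allow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)  # feed the lines directly; no HTTP request is made

print(parser.can_fetch("*", "/public/page.html"))    # True
print(parser.can_fetch("*", "/private/secret.txt"))  # False
```

This pattern is handy in unit tests, where you want deterministic results and no dependency on a remote server's robots.txt.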