Robots Exclusion Protocol File Parser

0.13.0 · active · verified Thu Apr 16

robotspy is a Python library for parsing `robots.txt` files, implementing the Robots Exclusion Protocol (REP) as defined by RFC 9309. It lets applications determine whether a web crawler is permitted to access a given URL path on a server. The library is actively maintained, currently at version 0.13.0, with a steady release cadence of bug fixes and alignment with Google's robots.txt parsing behavior.
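The core decision robotspy makes can be sketched in a few lines: under RFC 9309, the rule with the longest matching path prefix wins, and on a tie `Allow` beats `Disallow`. The function below is a simplified illustration of that rule (literal prefixes only, ignoring `*` and `$` wildcards), not robotspy's actual implementation:

```python
def rep_allowed(rules, path):
    """Decide access under RFC 9309's longest-match rule.

    rules: list of ("allow" | "disallow", path_prefix) pairs.
    Simplified sketch: literal prefixes only, no * or $ wildcards.
    """
    best_len = -1
    allowed = True  # no matching rule means access is allowed
    for directive, prefix in rules:
        if path.startswith(prefix) and len(prefix) >= best_len:
            # Longest match wins; on a tie, "allow" wins.
            if len(prefix) > best_len or directive == "allow":
                best_len = len(prefix)
                allowed = directive == "allow"
    return allowed

rules = [("disallow", "/private/"), ("allow", "/private/shared/")]
print(rep_allowed(rules, "/private/notes.html"))     # False
print(rep_allowed(rules, "/private/shared/a.html"))  # True
print(rep_allowed(rules, "/index.html"))             # True
```

Because precedence depends on match length rather than file order, the two rules above give the same results no matter which is listed first.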

Install

pip install robotspy

Imports

import robots  # the robotspy distribution installs the `robots` module

Quickstart

This example creates a parser with `RobotsParser.from_uri()`, which fetches and parses the `robots.txt` file at the given URL, and then checks crawling permissions for various user agents and paths using `can_fetch()`. Note that the robotspy distribution installs a module named `robots`, so the import is `import robots`, not `import robotspy`.

import robots

try:
    # Initialize the parser from the URL of a robots.txt file.
    # from_uri() performs an HTTP GET request to fetch and parse the file.
    parser = robots.RobotsParser.from_uri("https://www.google.com/robots.txt")

    print(f"User-agent '*' can fetch /: {parser.can_fetch('*', '/')}")
    print(f"User-agent 'Googlebot' can fetch /search: {parser.can_fetch('Googlebot', '/search')}")
    print(f"User-agent 'Googlebot' can fetch /images/search: {parser.can_fetch('Googlebot', '/images/search')}")
    print(f"User-agent 'AdsBot-Google' can fetch /ads: {parser.can_fetch('AdsBot-Google', '/ads')}")

except Exception as e:
    print(f"Error reading robots.txt: {e}")
    print("Ensure you have network connectivity and the URL is correct.")

# Some servers return 403 to the default urllib User-Agent. To fetch
# robots.txt with a custom User-Agent header, retrieve the file yourself
# and parse its content with RobotsParser.from_string().
import urllib.request

request = urllib.request.Request(
    "https://www.google.com/robots.txt",
    headers={"User-Agent": "MyCustomCrawler/1.0"},
)
try:
    with urllib.request.urlopen(request) as response:
        content = response.read().decode("utf-8")
    parser_custom = robots.RobotsParser.from_string(content)
    print(f"\n'Googlebot' can fetch /: {parser_custom.can_fetch('Googlebot', '/')}")
except Exception as e:
    print(f"Error fetching robots.txt with custom user agent: {e}")
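For a quick offline sanity check without network access or robotspy installed, the standard library's `urllib.robotparser` offers a similar interface. Note it applies rules in file order (first match wins) rather than the longest-match rule robotspy follows, so results can differ on files that mix Allow and Disallow; with a single Disallow rule, as below, both agree:

```python
from urllib.robotparser import RobotFileParser

# An inline robots.txt with one blanket rule for all user agents.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rfp = RobotFileParser()
rfp.parse(robots_txt.splitlines())  # parse from lines, no fetch needed

print(rfp.can_fetch("*", "/private/secret.html"))  # False
print(rfp.can_fetch("*", "/docs/index.html"))      # True
```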
