{"library":"robotspy","title":"Robots Exclusion Protocol File Parser","description":"robotspy is a Python library for parsing `robots.txt` files. It implements the Robots Exclusion Protocol (REP) as defined by RFC 9309 and lets applications determine whether a web crawler is permitted to access a given URL path on a server. The library is actively maintained (current version 0.13.0), with regular releases that fix bugs and improve adherence to Google's parsing behavior.","language":"python","status":"active","last_verified":"Thu Apr 16","install":{"commands":["pip install robotspy"],"cli":null},"imports":["from robotspy import RobotFileParser"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"from robotspy import RobotFileParser\n\n# Initialize the parser with the URL of the robots.txt file\n# Using a well-known site's robots.txt for demonstration\nparser = RobotFileParser(url=\"https://www.google.com/robots.txt\")\n\ntry:\n    # Read the robots.txt file from the specified URL\n    # This performs an HTTP GET request to fetch the file.\n    parser.read()\n\n    print(f\"User-agent '*' can fetch /: {parser.can_fetch('*', '/')}\")\n    print(f\"User-agent 'Googlebot' can fetch /search: {parser.can_fetch('Googlebot', '/search')}\")\n    print(f\"User-agent 'Googlebot' can fetch /images/search: {parser.can_fetch('Googlebot', '/images/search')}\")\n    print(f\"User-agent 'AdsBot-Google' can fetch /ads: {parser.can_fetch('AdsBot-Google', '/ads')}\")\n\nexcept Exception as e:\n    print(f\"Error reading robots.txt: {e}\")\n    print(\"Ensure you have network connectivity and the URL is correct.\")\n\n# Example of initializing with a custom user agent for fetching the robots.txt file itself\n# This can help avoid 403 errors from some servers.\nparser_custom_ua = RobotFileParser(url=\"https://www.google.com/robots.txt\", user_agent=\"MyCustomCrawler/1.0\")\ntry:\n    parser_custom_ua.read()\n    print(f\"\\nUsing 'MyCustomCrawler/1.0' to fetch, then checking for 'Googlebot': {parser_custom_ua.can_fetch('Googlebot', '/')}\")\nexcept Exception as e:\n    print(f\"Error reading robots.txt with custom user agent: {e}\")","lang":"python","description":"This example initializes `RobotFileParser` with a URL, fetches the `robots.txt` file with `read()`, and then checks crawling permissions for several user agents and paths with `can_fetch()`. It also shows how to pass a custom `user_agent` when initializing the parser; that value is used when fetching the `robots.txt` file itself.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}