{"id":9280,"library":"robotspy","title":"Robots Exclusion Protocol File Parser","description":"robotspy is a Python library for parsing `robots.txt` files, implementing the Robots Exclusion Protocol (REP) as defined by RFC 9309. It allows applications to determine whether a web crawler is permitted to access a given URL path on a server. The library is actively maintained, with the current version being 0.13.0, and has a steady release cadence addressing bug fixes and adherence to Google's parsing behavior.","status":"active","version":"0.13.0","language":"en","source_language":"en","source_url":"https://github.com/andreburgaud/robotspy","tags":["web scraping","robots.txt","parser","SEO","web crawling"],"install":[{"cmd":"pip install robotspy","lang":"bash","label":"Install stable version"}],"dependencies":[],"imports":[{"symbol":"RobotFileParser","correct":"from robotspy import RobotFileParser"}],"quickstart":{"code":"from robotspy import RobotFileParser\nimport urllib.request # robotspy uses urllib.request internally\n\n# Initialize the parser with the URL of the robots.txt file\n# Using a well-known site's robots.txt for demonstration\nparser = RobotFileParser(url=\"https://www.google.com/robots.txt\")\n\ntry:\n    # Read the robots.txt file from the specified URL\n    # This performs an HTTP GET request to fetch the file.\n    parser.read()\n\n    print(f\"User-agent '*' can fetch /: {parser.can_fetch('*', '/')}\")\n    print(f\"User-agent 'Googlebot' can fetch /search: {parser.can_fetch('Googlebot', '/search')}\")\n    print(f\"User-agent 'Googlebot' can fetch /images/search: {parser.can_fetch('Googlebot', '/images/search')}\")\n    print(f\"User-agent 'AdsBot-Google' can fetch /ads: {parser.can_fetch('AdsBot-Google', '/ads')}\")\n\nexcept Exception as e:\n    print(f\"Error reading robots.txt: {e}\")\n    print(\"Ensure you have network connectivity and the URL is correct.\")\n\n# Example of initializing with a custom user agent for fetching the robots.txt file itself\n# This can help avoid 403 errors from some servers.\nparser_custom_ua = RobotFileParser(url=\"https://www.google.com/robots.txt\", user_agent=\"MyCustomCrawler/1.0\")\ntry:\n    parser_custom_ua.read()\n    print(f\"\\nUsing 'MyCustomCrawler/1.0' to fetch, then checking for 'Googlebot': {parser_custom_ua.can_fetch('Googlebot', '/')}\")\nexcept Exception as e:\n    print(f\"Error reading robots.txt with custom user agent: {e}\")","lang":"python","description":"This example demonstrates how to initialize `RobotFileParser` with a URL, fetch the `robots.txt` file using `read()`, and then check crawling permissions for various user agents and paths using `can_fetch()`. It also shows how to set a custom `user_agent` when initializing the parser, which is used for fetching the `robots.txt` file itself."},"warnings":[{"fix":"Upgrade to `robotspy >= 0.10.0` to ensure correct and standard handling of the '?' character in `robots.txt` disallow/allow rules.","message":"The interpretation of the '?' character in URL path patterns changed in v0.10.0. Previously, it was treated as a wildcard matching any single character; now it is treated as a literal '?' 
"warnings":[{"fix":"Upgrade to `robotspy >= 0.10.0` to get standard handling of the '?' character in `robots.txt` allow/disallow rules.","message":"The interpretation of the '?' character in URL path patterns changed in v0.10.0. Previously it was treated as a wildcard matching any single character; it is now treated as a literal '?', aligning with common `robots.txt` parsing behavior.","severity":"gotcha","affected_versions":"<0.10.0"},{"fix":"Upgrade to `robotspy >= 0.9.0` for more accurate parsing of `User-agent` lines, especially those with non-standard or partially malformed tokens.","message":"The handling of user-agent product tokens was updated in v0.9.0 to align more closely with Google's robots parser. Older versions were more aggressive about discarding parts of a user-agent line that contained malformed product tokens.","severity":"gotcha","affected_versions":"<0.9.0"},{"fix":"Upgrade to `robotspy >= 0.8.0` for better compatibility with `robots.txt` files in various character encodings.","message":"Prior to v0.8.0, `robotspy` could fail to parse `robots.txt` files that were not UTF-8 encoded. The library now handles non-UTF-8 encodings more robustly.","severity":"gotcha","affected_versions":"<0.8.0"},{"fix":"If you encounter 403 errors, ensure you are using `robotspy >= 0.8.0`, which sends a default user agent when fetching. If a server still refuses the request, fetch the file yourself with a descriptive `User-Agent` header (e.g., via `urllib.request.Request`) and parse the text with `robots.RobotsParser.from_string(...)`.","message":"When fetching `robots.txt` files, some websites (e.g., Cloudflare-protected sites) return a 403 Forbidden error if the HTTP request carries no recognizable user agent. `robotspy >= 0.8.0` adds a default user agent, but some servers still require a custom, descriptive one.","severity":"gotcha","affected_versions":"<0.8.0"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Create the parser with `robots.RobotsParser.from_uri(...)` or `robots.RobotsParser.from_string(...)`, wrap network access in `try/except`, and confirm the variable actually holds a parser object before calling `can_fetch()`.","cause":"`can_fetch()` was called on a variable holding `None`, typically because the code that was supposed to create the parser failed or never assigned its result.","error":"AttributeError: 'NoneType' object has no attribute 'can_fetch'"},{"fix":"Upgrade to `robotspy >= 0.8.0`, which sends a default user agent when fetching. If the server still refuses the request, download `robots.txt` yourself with a descriptive `User-Agent` header that identifies your crawler, then parse the text with `robots.RobotsParser.from_string(...)`.","cause":"The web server denied the request to fetch `robots.txt`, often because the request did not include a valid or recognized `User-Agent` header, or was identified as suspicious.","error":"urllib.error.HTTPError: HTTP Error 403: Forbidden"},{"fix":"Upgrade to `robotspy >= 0.8.0`, which improves handling of different encodings. If the issue persists with an unusual encoding, fetch the file manually, decode it with the correct codec, and pass the resulting text to `robots.RobotsParser.from_string(...)`.","cause":"The `robots.txt` file is encoded in something other than UTF-8 and was read either by `robotspy` prior to v0.8.0 or by code that assumed UTF-8 when decoding.","error":"UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position ..."}]}