Protego
Protego is a pure-Python robots.txt parser that supports modern conventions such as those defined by Google. As of version 0.6.0 it supports Python 3.10 and newer, with releases tracking new Python versions, and it is widely used for web scraping and compliance checking.
Warnings
- breaking Version 0.6.0 dropped official support for Python 3.9 and PyPy 3.10. Users on these Python versions should use Protego 0.5.x or upgrade their Python environment.
- breaking Version 0.4.0 dropped official support for Python 3.8.
- breaking Version 0.3.0 dropped support for Python 2.7, 3.5, 3.6, and 3.7. The `six` dependency was also removed in this version, making it Python 3 only.
- gotcha In Protego 0.3.0 and later, `Protego.parse()` will raise a `ValueError` if the `robotstxt_body` argument is not a string.
- gotcha Version 0.1.16 fixed the parsing of absolute URLs in `Allow` and `Disallow` directives, which previously ignored their protocol and netloc. Older versions may misinterpret such directives, leading to incorrect access decisions.
- gotcha Version 0.5.0 restructured the internal code from a single `protego.py` file into multiple modules. While the public API `from protego import Protego` remains stable, direct imports of internal modules (if any were used) would have broken.
Install
-
pip install protego
Imports
- Protego
from protego import Protego
Quickstart
from protego import Protego
robotstxt_content = """
User-agent: *
Disallow: /admin/
Allow: /admin/login
Crawl-delay: 5
Sitemap: http://example.com/sitemap.xml
"""
rp = Protego.parse(robotstxt_content)
# Check if a URL can be fetched by a user agent
can_fetch_admin = rp.can_fetch("http://example.com/admin/settings", "mybot")
can_fetch_login = rp.can_fetch("http://example.com/admin/login", "mybot")
print(f"Can 'mybot' fetch /admin/settings? {can_fetch_admin}")
print(f"Can 'mybot' fetch /admin/login? {can_fetch_login}")
print(f"Crawl delay for 'mybot': {rp.crawl_delay('mybot')} seconds")
print(f"Sitemaps: {list(rp.sitemaps)}")