Protego

0.6.0 · active · verified Wed Apr 01

Protego is a pure-Python robots.txt parser that supports modern conventions such as those in Google's robots.txt specification, including wildcard patterns and the Sitemap directive. As of version 0.6.0 it supports Python 3.10 and newer, with regular updates tracking new Python releases. It is widely used for web scraping and compliance checking.

Warnings

Install

pip install protego

Imports

from protego import Protego

Quickstart

Parse robots.txt content, then check whether a given user agent may fetch specific URLs, and retrieve the crawl delay and sitemap URLs.

from protego import Protego

robotstxt_content = """
User-agent: *
Disallow: /admin/
Allow: /admin/login
Crawl-delay: 5
Sitemap: http://example.com/sitemap.xml
"""

rp = Protego.parse(robotstxt_content)

# Check if a URL can be fetched by a user agent
can_fetch_admin = rp.can_fetch("http://example.com/admin/settings", "mybot")  # False: matches Disallow: /admin/
can_fetch_login = rp.can_fetch("http://example.com/admin/login", "mybot")     # True: the more specific Allow rule wins

print(f"Can 'mybot' fetch /admin/settings? {can_fetch_admin}")
print(f"Can 'mybot' fetch /admin/login? {can_fetch_login}")
print(f"Crawl delay for 'mybot': {rp.crawl_delay('mybot')} seconds")
print(f"Sitemaps: {list(rp.sitemaps)}")
