{"library":"robots-txt-parser","title":"Robots.txt Parser","description":"`robots-txt-parser` is a lightweight, promise-based library for parsing `robots.txt` files in Node.js. The current version is 2.0.3. It supports wildcards in rules, configurable caching of `robots.txt` content, and asynchronous operation through either promises or traditional callbacks. It is aimed at developers building web crawlers, scrapers, and other automated bots that must respect a site's crawling policies. Key differentiators are its Node.js focus and a clear API for checking whether a URL is crawlable, retrieving sitemaps, and reading crawl delays. The project maintains a stable release cadence, and the 2.x major version has been actively supported since late 2018. Users can configure the default user agent string and, through `allowOnNeutral`, the result returned when the allow and disallow rules for a URL are balanced.","language":"javascript","status":"active","last_verified":"Sun Apr 19","install":{"commands":["npm install robots-txt-parser"],"cli":null},"imports":["const robotsParser = require('robots-txt-parser');","import robotsParser from 'robots-txt-parser';","const robots = robotsParser({ userAgent: 'MyBot' });"],"auth":{"required":false,"env_vars":[]},"quickstart":{"code":"const robotsParser = require('robots-txt-parser');\n\nconst robots = robotsParser({\n  userAgent: 'Googlebot',\n  allowOnNeutral: false,\n});\n\nasync function checkCrawlability() {\n  try {\n    const domainUrl = 'http://example.com';\n    await robots.useRobotsFor(domainUrl); // Fetch and parse robots.txt for the domain\n\n    console.log(`Checking crawlability for ${domainUrl}/news...`);\n\n    // Synchronous check against the rules loaded above\n    const canCrawlSyncResult = robots.canCrawlSync(`${domainUrl}/news`);\n    console.log(`Crawlable (sync): ${canCrawlSyncResult}`);\n\n    // Promise-based check\n    const canCrawlPromiseResult = await robots.canCrawl(`${domainUrl}/news`);\n    console.log(`Crawlable (promise): ${canCrawlPromiseResult}`);\n\n    // Callback-based check\n    robots.canCrawl(`${domainUrl}/articles`, (value) => {\n      console.log(`Crawlable (callback for ${domainUrl}/articles): ${value}`);\n    });\n\n    const sitemaps = await robots.getSitemaps();\n    console.log('Sitemaps found:', sitemaps);\n\n  } catch (error) {\n    console.error('Error during robots.txt parsing or crawl check:', error);\n  }\n}\n\ncheckCrawlability();","lang":"javascript","description":"This quickstart initializes the robots.txt parser, fetches the rules for a domain, and checks URL crawlability with the synchronous, promise-based, and callback APIs. It also retrieves the domain's sitemaps.","tag":null,"tag_description":null,"last_tested":null,"results":[]},"compatibility":null}