Robot Directives Parser
The `robot-directives` package (current stable version 0.4.0) provides a focused utility for parsing and interpreting robot directives found within HTML `<meta name="robots">` tags and `X-Robots-Tag` HTTP headers. It allows developers to programmatically determine a crawler's allowed or disallowed actions based on these instructions, such as `noindex`, `nofollow`, `noarchive`, and `unavailable_after`. The library handles the cascading logic of multiple directives, user-agent specific rules, and resolves conflicts based on a `restrictive` default (mimicking Googlebot's behavior). It explicitly differentiates itself by not handling the underlying HTML parsing, requiring users to extract meta tag content themselves. While not on a rapid release cycle, the package offers a stable API for its specialized parsing tasks, including a comprehensive set of constants for all standard robot directives and static methods for general utility.
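Because the library leaves HTML parsing to the caller, the `content` attribute of a robots meta tag has to be pulled out before it reaches `meta()`. A minimal sketch of that extraction step (the regex approach and the sample HTML are illustrative assumptions; use a proper HTML parser for real pages):

```js
// The library does not parse HTML, so the meta tag's content attribute must
// be extracted first. This regex-based extraction is an illustrative sketch
// for simple, well-formed markup; use a real HTML parser in production.
const html = '<head><meta name="robots" content="noindex,nofollow"></head>';

const match = html.match(/<meta\s+name=["']robots["']\s+content=["']([^"']*)["']/i);
const content = match ? match[1] : null;

console.log(content); // 'noindex,nofollow'
// This extracted string is what gets passed on, e.g. robots.meta('robots', content);
```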
Common errors
- `TypeError: RobotDirectives is not a constructor`
  - Cause: Using `import { RobotDirectives } from 'robot-directives';` or other incorrect destructuring of a CommonJS default export.
  - Fix: For CommonJS, use `const RobotDirectives = require('robot-directives');`. For ESM, use `import RobotDirectives from 'robot-directives';`.
- `ReferenceError: require is not defined`
  - Cause: Calling `require()` in an ES Module context without proper setup or bundler configuration.
  - Fix: In an ESM file (e.g., `"type": "module"` in `package.json`), use `import RobotDirectives from 'robot-directives';` instead of `const RobotDirectives = require('robot-directives');`.
- Directive 'X' not working as expected (e.g., `all` isn't overriding other directives)
  - Cause: Misunderstanding the default behavior of the `allIsReadonly` or `restrictive` options.
  - Fix: Review the constructor options `allIsReadonly` (default `true`) and `restrictive` (default `true`). Adjust them in the `new RobotDirectives(options)` call if the defaults don't match your expectations.
Warnings
- Gotcha: The `allIsReadonly` option defaults to `true`, so declaring an `all` directive will not overwrite other directives. This can be counter-intuitive if you expect `all` to be absolute, but it matches how most search crawlers behave.
- Gotcha: The `restrictive` option defaults to `true`, resolving directive conflicts (e.g., `noindex,index`) by selecting the most restrictive value (`noindex`). While this mimics Googlebot, other crawlers may resolve conflicts differently.
- Gotcha: Evaluation of the `unavailable_after` directive depends on the `currentTime` option. If `currentTime` is misconfigured (e.g., time-zone issues or a static date), the directive may be treated as expired when it isn't, or vice versa.
- Gotcha: This library explicitly does NOT parse HTML. You must extract the `content` attribute from `<meta name="robots">` tags yourself and pass it to the `meta()` method.
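To make the `restrictive` default concrete, here is a standalone sketch of that conflict-resolution rule (an illustration of the behavior described above, not the library's actual code):

```js
// Standalone illustration of restrictive conflict resolution: when both a
// directive and its negation appear, the restrictive ("no*") form wins.
// This mimics the documented behavior; it is not the library's internal code.
function resolveRestrictive(directives) {
  const tokens = directives.split(',').map(t => t.trim().toLowerCase());
  const result = new Set(tokens);
  for (const t of tokens) {
    if (result.has(t) && result.has('no' + t)) {
      result.delete(t); // keep the restrictive variant
    }
  }
  return [...result];
}

console.log(resolveRestrictive('noindex,index')); // [ 'noindex' ]
console.log(resolveRestrictive('index,follow'));  // [ 'index', 'follow' ]
```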
Install
- `npm install robot-directives`
- `yarn add robot-directives`
- `pnpm add robot-directives`
Imports
- `RobotDirectives` (the default export)
  - ESM: `import RobotDirectives from 'robot-directives';`
  - CommonJS: `const RobotDirectives = require('robot-directives');`
- `RobotDirectives.NOINDEX` (directive constant, a static property of the class)
  - `const { NOINDEX } = RobotDirectives;`
- `RobotDirectives.isBot` (static helper method)
  - `const { isBot } = RobotDirectives;`
Quickstart
```js
const RobotDirectives = require('robot-directives');

// Instantiate with default options
const robots = new RobotDirectives({
  // Example: override the default userAgent if needed
  // userAgent: 'Googlebot/2.1 (web crawler) (+http://www.google.com/bot.html)',

  // Example: override the current time for 'unavailable_after' testing
  // currentTime: () => new Date('Jan 1 2025').getTime()
});

// Add directives from an X-Robots-Tag HTTP header
robots.header('googlebot: noindex, nosnippet');

// Add directives from HTML meta tags
robots.meta('robots', 'noarchive,nofollow');
robots.meta('bingbot', 'unavailable_after: 1-Jan-3000 00:00:00 EST');

// Check specific directives
console.log('Is nofollow?', robots.is(RobotDirectives.NOFOLLOW));
// Expected: true
console.log('Is noindex for Googlebot?', robots.is(RobotDirectives.NOINDEX, { userAgent: 'Googlebot' }));
// Expected: true
console.log('Is noarchive?', robots.is(RobotDirectives.NOARCHIVE));
// Expected: true

// Check for a directive that is not present
console.log('Is index?', robots.is(RobotDirectives.INDEX));
// Expected: false

// Check whether 'unavailable_after' has passed (overriding the current time)
console.log('Is noindex for Bingbot after 3000?', robots.is(RobotDirectives.NOINDEX, {
  currentTime: () => new Date('Jan 2 3000').getTime(), // set current time past the unavailable_after date
  userAgent: 'Bingbot/2.0'
}));
// Expected: true

// Use the static helper function
console.log('Is "googlebot" a recognized bot name?', RobotDirectives.isBot('googlebot'));
// Expected: true
console.log('Is "mycustombot" a recognized bot name?', RobotDirectives.isBot('mycustombot'));
// Expected: false
```
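The Bingbot check above works only because `currentTime` is overridden to a date past the `unavailable_after` cutoff. The comparison behind that can be sketched standalone (an illustration of the idea behind the `currentTime` option, not the library's implementation; the ISO date strings are assumptions for readability):

```js
// Standalone illustration of evaluating an 'unavailable_after' directive
// against an injectable clock. This mimics the idea behind the library's
// currentTime option; it is not the library's actual code.
function isUnavailable(directive, currentTime = () => Date.now()) {
  const match = directive.match(/unavailable_after:\s*(.+)/i);
  if (!match) return false; // no unavailable_after directive present
  const expiry = new Date(match[1]).getTime();
  if (Number.isNaN(expiry)) return false; // unparseable date: treat as not expired
  return currentTime() > expiry;
}

const directive = 'unavailable_after: 3000-01-01T00:00:00Z';
console.log(isUnavailable(directive, () => new Date('3000-01-02T00:00:00Z').getTime())); // true
console.log(isUnavailable(directive, () => new Date('2025-01-01T00:00:00Z').getTime())); // false
```

Injecting the clock this way is what makes expiry behavior testable without waiting for real time to pass, which is exactly why the quickstart overrides `currentTime` in the `is()` call.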