Policies
The policy components that apply cross-cutting crawl rules.
Policies express cross-cutting rules that govern crawl behavior beyond per-URL filtering. A job lists policies under config.policies. This page catalogs the available kinds and their parameters.
How policies are configured
Each entry in config.policies is an object with a kind and an optional params object. Policies decide when a URL is actually scheduled out of the frontier to be crawled. When the policies say yes, the URL is scheduled now; when they say no, it is returned to the frontier to be reconsidered later. This is what distinguishes a policy from a filter: a filter makes a one-time admit or reject decision when a URL is discovered, while a policy is re-evaluated each time the orchestrator considers scheduling a URL, so a URL held back now can still be scheduled on a later pass once its condition is met.
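The difference between a filter and a policy can be sketched in a few lines. This is only an illustration under stated assumptions: `Filter`, `Policy`, `discover`, and `scheduling_pass` are hypothetical names, the frontier is modeled as a simple queue, and none of this is the orchestrator's actual API.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List

# Hypothetical types for illustration only; not the crawler's real API.
Filter = Callable[[str], bool]         # one-time admit/reject at discovery
Policy = Callable[[str, float], bool]  # re-evaluated on every scheduling pass


def discover(url: str, filters: Iterable[Filter], frontier: Deque[str]) -> None:
    """Admit a newly discovered URL only if every filter accepts it."""
    if all(f(url) for f in filters):
        frontier.append(url)


def scheduling_pass(frontier: Deque[str], policies: List[Policy], now: float) -> List[str]:
    """Consider each frontier URL once: schedule it or return it to the frontier."""
    scheduled = []
    for _ in range(len(frontier)):
        url = frontier.popleft()
        if all(p(url, now) for p in policies):
            scheduled.append(url)
        else:
            frontier.append(url)  # held back, reconsidered on a later pass
    return scheduled
```

A URL rejected by a filter never enters the frontier, while a URL deferred by a policy stays in it and can be scheduled by a later call to `scheduling_pass` once the policy's condition holds.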
config:
  policies:
    - kind: robots_txt
      params:
        default_delay: 10s
recrawl
Holds back URLs that were visited too recently to be crawled again, based on a minimum delay. A URL whose last visit is older than the delay is allowed through; one visited more recently is deferred.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| scope | local or global | no | local | Whether the last visit is looked up within this run (local) or across all runs (global). |
| minimum_delay | duration | no | 72h | The minimum time since the last visit before a URL may be crawled again. |
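The recrawl decision described above can be sketched as follows. This is a minimal illustration, not the actual implementation: `recrawl_allows` and the two visit stores are hypothetical, and two behaviors the text does not specify are assumptions here (a URL with no recorded visit is allowed, and a visit exactly `minimum_delay` old is allowed).

```python
from datetime import datetime, timedelta
from typing import Dict


def recrawl_allows(
    url: str,
    now: datetime,
    run_visits: Dict[str, datetime],   # hypothetical: visits recorded in this run
    all_visits: Dict[str, datetime],   # hypothetical: visits recorded across all runs
    scope: str = "local",
    minimum_delay: timedelta = timedelta(hours=72),
) -> bool:
    """Return True if the URL may be crawled now, False if it should be deferred."""
    store = run_visits if scope == "local" else all_visits
    last_visit = store.get(url)
    if last_visit is None:
        return True  # assumption: never-visited URLs are not held back
    # Assumption: a visit exactly minimum_delay old counts as old enough.
    return now - last_visit >= minimum_delay
```

With `scope: global` and `minimum_delay: 24h`, a URL visited 30 hours ago in an earlier run would be allowed, while the default 72h delay would defer it.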
policies:
  - kind: recrawl
    params:
      scope: global
      minimum_delay: 24h
robots_txt
Enforces a crawl delay between requests to the same host, for politeness. It uses the delay declared in the host's robots rules when one is present, and the configured default_delay otherwise.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| default_delay | duration or range | no | 10s | The delay to apply when robots rules do not declare one. A single duration is fixed; a two-element [min, max] list picks a random delay in that range per request. |
| scope | local or global | no | global | Whether the delay is tracked per run (local) or across all runs (global). |
policies:
  - kind: robots_txt
    params:
      default_delay: [5s, 15s]
      scope: global
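
How the delay is chosen can be sketched as below. This is a hedged illustration rather than the real implementation: `effective_delay` is a hypothetical helper, delays are plain seconds instead of duration strings, and a uniform distribution over the range is an assumption (the text only says the delay is random within the range).

```python
import random
from typing import Optional, Sequence, Union

# Hypothetical representation: seconds as a number, or a [min, max] pair.
Delay = Union[float, Sequence[float]]


def effective_delay(robots_delay: Optional[float], default_delay: Delay = 10.0) -> float:
    """Pick the per-request delay in seconds for a host."""
    if robots_delay is not None:
        return float(robots_delay)  # a delay declared in robots rules wins
    if isinstance(default_delay, (int, float)):
        return float(default_delay)  # single duration: fixed delay
    low, high = default_delay
    # Assumption: uniform draw; called once per request, so each request differs.
    return random.uniform(low, high)
```

For example, `effective_delay(None, [5.0, 15.0])` corresponds to `default_delay: [5s, 15s]` for a host whose robots rules declare no delay, and `effective_delay(3.0, [5.0, 15.0])` returns the declared 3 seconds.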