
Policies

The policy components that apply cross-cutting crawl rules.

Policies express cross-cutting rules that govern crawl behavior beyond per-URL filtering. A job lists policies under config.policies. This page catalogs the available kinds and their parameters.

How policies are configured

Each entry in config.policies is an object with a kind and an optional params object.

Policies decide when a URL in the frontier is actually scheduled for crawling. When the policies say yes, the URL is scheduled now; when they say no, it is returned to the frontier to be reconsidered later. This is what distinguishes a policy from a filter: a filter makes a one-time admit-or-reject decision when a URL is discovered, while a policy is re-evaluated each time the orchestrator considers scheduling a URL. A URL held back now can therefore still be scheduled on a later pass, once its condition is met.

config:
  policies:
    - kind: robots_txt
      params:
        default_delay: 10s
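
The difference between a filter and a policy can be modeled in a few lines of Python. This is an illustrative sketch only, not inndx's implementation: the names discover and schedule_pass, the use of a deque as the frontier, and the rule that a URL is scheduled only when every policy allows it are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Deque, List

# Hypothetical model: filters and policies are both predicates on a URL.
Filter = Callable[[str], bool]
Policy = Callable[[str], bool]


def discover(url: str, filters: List[Filter], frontier: Deque[str]) -> None:
    """Filters run once, at discovery; a rejected URL never enters the frontier."""
    if all(f(url) for f in filters):
        frontier.append(url)


def schedule_pass(frontier: Deque[str], policies: List[Policy]) -> List[str]:
    """One scheduling pass; policies are re-evaluated for every URL considered."""
    scheduled: List[str] = []
    for _ in range(len(frontier)):
        url = frontier.popleft()
        # Assumption: a URL is scheduled only if all policies allow it.
        if all(p(url) for p in policies):
            scheduled.append(url)   # policies say yes: schedule now
        else:
            frontier.append(url)    # policies say no: reconsider on a later pass
    return scheduled
```

Calling schedule_pass repeatedly on the same frontier shows the key property: a URL deferred on one pass is returned by a later pass as soon as the policies start allowing it.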

recrawl

Holds back URLs that were visited too recently to be crawled again, based on a minimum delay. A URL whose last visit is older than the delay is allowed through; one visited more recently is deferred.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| scope | local or global | no | local | Whether the last visit is looked up within this run (local) or across all runs (global). |
| minimum_delay | duration | no | 72h | The minimum time since the last visit before a URL may be crawled again. |

policies:
  - kind: recrawl
    params:
      scope: global
      minimum_delay: 24h
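
The recrawl decision can be sketched in Python as follows. This is a minimal model under stated assumptions, not the actual implementation: the function names, the two visit dictionaries, the treatment of a never-visited URL as allowed, and the strict comparison at the exact boundary are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, Optional


def last_visit_for(url: str, scope: str,
                   run_visits: Dict[str, datetime],
                   all_visits: Dict[str, datetime]) -> Optional[datetime]:
    """Look up the last visit in this run (local) or across all runs (global)."""
    store = run_visits if scope == "local" else all_visits
    return store.get(url)


def recrawl_allows(last_visit: Optional[datetime], now: datetime,
                   minimum_delay: timedelta = timedelta(hours=72)) -> bool:
    """Return True if the URL may be scheduled, False if it should be deferred."""
    if last_visit is None:
        # Assumption: a URL with no recorded visit is allowed through.
        return True
    # Allowed only when the last visit is older than the minimum delay.
    return now - last_visit > minimum_delay
```

With scope: global and minimum_delay: 24h, as in the example above, a URL visited 25 hours ago in any run would be allowed, while one visited an hour ago would be deferred.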

robots_txt

Enforces a crawl delay between requests to the same host, for politeness. It uses the delay declared in the host's robots rules when one is present, and the configured default_delay otherwise.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| default_delay | duration or range | no | 10s | The delay to apply when robots rules do not declare one. A single duration is fixed; a two-element [min, max] list picks a random delay in that range per request. |
| scope | local or global | no | global | Whether the delay is tracked per run (local) or across all runs (global). |

policies:
  - kind: robots_txt
    params:
      default_delay: [5s, 15s]
      scope: global
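
How the delay for a request might be chosen can be sketched in Python. This is a hedged illustration, not the library's code: effective_delay and DelaySpec are hypothetical names, and drawing the random delay from a uniform distribution is an assumption, since the page only says a random delay in the range is picked.

```python
import random
from datetime import timedelta
from typing import Optional, Tuple, Union

# Hypothetical type: a fixed duration, or a (min, max) range.
DelaySpec = Union[timedelta, Tuple[timedelta, timedelta]]


def effective_delay(robots_delay: Optional[timedelta],
                    default_delay: DelaySpec = timedelta(seconds=10)) -> timedelta:
    """Pick the delay to wait before the next request to a host."""
    if robots_delay is not None:
        # A delay declared in the host's robots rules takes precedence.
        return robots_delay
    if isinstance(default_delay, tuple):
        low, high = default_delay
        # Assumption: the per-request delay is drawn uniformly from [min, max].
        seconds = random.uniform(low.total_seconds(), high.total_seconds())
        return timedelta(seconds=seconds)
    return default_delay
```

With default_delay: [5s, 15s], as in the example above, each request to a host without a declared delay would wait somewhere between 5 and 15 seconds.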
