Filters

The filter components that admit or reject discovered URLs.

Filters decide which discovered URLs are allowed into the frontier. A job lists filters under config.filters, each selected by kind. This page catalogs the available kinds and their parameters.

How filters are configured

Each entry in config.filters is an object with a kind and an optional params object. Filters are applied in order, and a URL must pass every filter to be admitted to the queue. Filters that take allow and deny lists share a common rule: a URL is admitted when it matches an allow entry and no deny entry, and default_allow decides the outcome for a URL that matches neither list.

config:
  filters:
    - kind: regex_patterns
      params:
        allow:
          - https://example\.com/.+
    - kind: budget
      params:
        by: host
        limit: 1000
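
The admission rules described above can be sketched in a few lines of Python. This is an illustration only, not the crawler's implementation: the function names `allow_deny_decision` and `admit` are hypothetical, and the use of `re.fullmatch` assumes that a pattern must match the entire URL string.

```python
import re
from typing import Callable, Iterable, Sequence


def allow_deny_decision(
    url: str,
    allow: Iterable[str] = (),
    deny: Iterable[str] = (),
    default_allow: bool = False,
) -> bool:
    """Apply the shared allow/deny rule (hypothetical helper).

    Assumption: each pattern is a regex that must match the whole URL.
    """
    matches_allow = any(re.fullmatch(p, url) for p in allow)
    matches_deny = any(re.fullmatch(p, url) for p in deny)
    if matches_deny:
        return False          # a deny match always rejects
    if matches_allow:
        return True           # allow match and no deny match
    return default_allow      # matched neither list


def admit(url: str, filters: Sequence[Callable[[str], bool]]) -> bool:
    """Run filters in order; the URL must pass every one of them."""
    return all(f(url) for f in filters)  # stops at the first rejection


docs_only = lambda u: allow_deny_decision(
    u,
    allow=[r"https://example\.com/docs/.+"],
    deny=[r"https://example\.com/docs/changelog/.+"],
)
print(admit("https://example.com/docs/intro", [docs_only]))
print(admit("https://example.com/docs/changelog/v1", [docs_only]))
```

Because `all` short-circuits, a filter placed later in the list never sees a URL that an earlier filter rejected, which is one reason the order of entries can matter for stateful filters such as budget.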

regex_patterns and url_patterns

Both kinds admit or reject by matching the URL. regex_patterns matches the full URL string against regular expressions. url_patterns matches individual parts of the URL, each given as its own pattern, which is more readable when you only care about a host and path prefix.

Both share the same parameters. For regex_patterns, each entry in allow and deny is a regular expression string. For url_patterns, each entry is a URL pattern object.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| allow | list of regex strings or URL patterns | no | none | Patterns a URL may match to be admitted. |
| deny | list of regex strings or URL patterns | no | none | Patterns that reject a URL. |
| default_allow | boolean | no | false | Whether a URL matching neither list is admitted. |

filters:
  - kind: regex_patterns
    params:
      allow:
        - https://example\.com/docs/.+
      deny:
        - https://example\.com/docs/changelog/.+
  - kind: url_patterns
    params:
      allow:
        - hostname: example.com
          pathname: /docs/*

URL pattern

A url_patterns entry matches against individual parts of a URL. Only the components you set are matched; the rest are ignored.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| protocol | string | no | none | Match the URL scheme, such as https. |
| username | string | no | none | Match the userinfo username. |
| password | string | no | none | Match the userinfo password. |
| hostname | string | no | none | Match the host. |
| port | string | no | none | Match the port. |
| pathname | string | no | none | Match the path. |
| search | string | no | none | Match the query string. |
| hash | string | no | none | Match the fragment. |
| base_url | string | no | none | A base URL that relative pattern components are resolved against. |
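
To make "only the components you set are matched" concrete, here is a rough Python sketch. It assumes glob-style wildcards (via `fnmatchcase`) as a stand-in for the real pattern syntax, which may differ, and it does not handle base_url. The helper names `url_components` and `matches_pattern` are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the component names used by a URL pattern."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "username": parts.username or "",
        "password": parts.password or "",
        "hostname": parts.hostname or "",
        "port": "" if parts.port is None else str(parts.port),
        "pathname": parts.path,
        "search": parts.query,
        "hash": parts.fragment,
    }


def matches_pattern(url: str, pattern: dict) -> bool:
    """Check only the components present in the pattern.

    Assumption: '*' is a glob wildcard; keys that are not URL
    components (such as base_url) are skipped in this sketch.
    """
    components = url_components(url)
    return all(
        fnmatchcase(components[name], glob)
        for name, glob in pattern.items()
        if name in components
    )


pattern = {"hostname": "example.com", "pathname": "/docs/*"}
print(matches_pattern("https://example.com/docs/intro", pattern))
print(matches_pattern("https://example.com/blog/post", pattern))
```

Since the pattern above sets neither protocol nor port, a URL such as http://example.com:8080/docs/intro would match it as well.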

max_depth

Admits a URL only if it is within a given link distance of the seeds.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| max_depth | integer | yes | none | The maximum depth, counted in links from a seed, that a URL may be admitted at. |

filters:
  - kind: max_depth
    params:
      max_depth: 3

recrawl

Controls whether a URL that has already been seen may be admitted again.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| allow | boolean | no | none | Whether already-seen URLs may be re-admitted. |
| scope | local or global | no | none | Whether "already seen" is judged within this run (local) or across all runs (global). |
| minimum_delay | duration | no | none | The minimum time that must pass before a seen URL is re-admitted. |
| exclude_roots | boolean | no | none | Whether root URLs are excluded from recrawl handling. |
| exclude_path_depth | integer | no | none | A path depth below which URLs are excluded from recrawl handling. |

filters:
  - kind: recrawl
    params:
      allow: true
      scope: global
      minimum_delay: 24h
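
How allow and minimum_delay might interact can be sketched as follows. This is a guess at the semantics, not a description of the actual filter: it assumes an unseen URL always passes, that allow: false rejects every seen URL, and that a seen URL passes once at least minimum_delay has elapsed. The scope and exclude_* parameters are left out, the helper names are hypothetical, and the `m` unit is an assumption (only `s` and `h` appear on this page).

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed unit suffixes; only "s" and "h" are shown in the docs.
_UNITS = {"s": "seconds", "m": "minutes", "h": "hours"}


def parse_duration(text: str) -> timedelta:
    """Parse a single-unit duration such as 30s, 1h, or 24h."""
    value, unit = int(text[:-1]), text[-1]
    return timedelta(**{_UNITS[unit]: value})


def may_admit(
    last_seen: Optional[datetime],
    now: datetime,
    allow: bool,
    minimum_delay: Optional[str] = None,
) -> bool:
    """Decide whether a URL passes the (assumed) recrawl rule."""
    if last_seen is None:
        return True                      # never seen: nothing to recrawl
    if not allow:
        return False                     # seen, and recrawling is off
    if minimum_delay is None:
        return True                      # seen, recrawl allowed, no delay
    return now - last_seen >= parse_duration(minimum_delay)


now = datetime(2024, 1, 2, 12, 0)
print(may_admit(now - timedelta(hours=23), now, allow=True, minimum_delay="24h"))
print(may_admit(now - timedelta(hours=24), now, allow=True, minimum_delay="24h"))
```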

robots_txt

Rejects URLs that a site's robots rules disallow.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| cache_ttl | duration | no | none | How long a fetched robots ruleset is cached before being refetched. Durations are written like 30s, 1h. |

filters:
  - kind: robots_txt
    params:
      cache_ttl: 1h
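
The effect of cache_ttl can be illustrated with Python's standard `urllib.robotparser`. The sketch below is not the crawler's code: the `RobotsCache` class, its injected `fetch_rules` callable, and the injected clock are all hypothetical, chosen so the example runs without network access.

```python
import time
from typing import Callable, Dict, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Cache one parsed robots ruleset per origin for ttl_seconds."""

    def __init__(
        self,
        fetch_rules: Callable[[str], str],   # origin -> robots.txt text
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._fetch_rules = fetch_rules
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[float, RobotFileParser]] = {}

    def _rules_for(self, origin: str) -> RobotFileParser:
        now = self._clock()
        cached = self._cache.get(origin)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]                 # still fresh: reuse
        parser = RobotFileParser()
        parser.parse(self._fetch_rules(origin).splitlines())
        self._cache[origin] = (now, parser)  # refetched: restart the TTL
        return parser

    def allowed(self, url: str, user_agent: str = "*") -> bool:
        parts = urlsplit(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        return self._rules_for(origin).can_fetch(user_agent, url)


# Stand-in fetcher and clock so the example is self-contained.
fetch_calls = []

def fake_fetch(origin: str) -> str:
    fetch_calls.append(origin)
    return "User-agent: *\nDisallow: /private/\n"

fake_time = [0.0]
cache = RobotsCache(fake_fetch, ttl_seconds=3600, clock=lambda: fake_time[0])
print(cache.allowed("https://example.com/docs/intro"))
print(cache.allowed("https://example.com/private/x"))
```

A longer TTL means fewer robots fetches per host, at the cost of reacting more slowly when a site changes its rules.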

budget

Caps how many URLs are admitted for each value of a chosen dimension, for example per host.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| by | host | yes | none | The dimension the budget is counted against. |
| limit | integer | yes | none | The maximum number of URLs admitted per dimension value. |

filters:
  - kind: budget
    params:
      by: host
      limit: 1000
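
A per-host budget amounts to a counter keyed by host. The following is a minimal sketch under that assumption; the `HostBudget` class is hypothetical, and the real filter may count or persist admissions differently.

```python
from collections import Counter
from urllib.parse import urlsplit


class HostBudget:
    """Admit at most `limit` URLs per host (illustrative only)."""

    def __init__(self, limit: int):
        self.limit = limit
        self._admitted = Counter()   # host -> URLs admitted so far

    def admit(self, url: str) -> bool:
        host = urlsplit(url).hostname or ""
        if self._admitted[host] >= self.limit:
            return False             # budget for this host is spent
        self._admitted[host] += 1
        return True


budget = HostBudget(limit=2)
results = [
    budget.admit("https://example.com/a"),
    budget.admit("https://example.com/b"),
    budget.admit("https://example.com/c"),   # third URL on the same host
    budget.admit("https://other.org/a"),     # separate host, separate count
]
print(results)
```

In this sketch only admitted URLs count against the limit, so a URL rejected by an earlier filter in the chain would not consume budget.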

url_labels and host_labels

Admit or reject based on labels attached to the URL itself (url_labels) or to its host (host_labels). This lets labels written during one crawl steer a later one.

Both share the same parameters:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| allow | map of string to string | no | none | Labels a URL or host must carry to be admitted. |
| deny | map of string to string | no | none | Labels that reject a URL or host. |

filters:
  - kind: url_labels
    params:
      deny:
        skip: "true"
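
One plausible reading of the label rule is sketched below. It assumes that any matching deny pair rejects, that every allow pair must be present, and that an empty allow map admits everything not denied. These semantics are assumptions made for illustration, and `label_decision` is a hypothetical name.

```python
from typing import Mapping, Optional


def label_decision(
    labels: Mapping[str, str],
    allow: Optional[Mapping[str, str]] = None,
    deny: Optional[Mapping[str, str]] = None,
) -> bool:
    """Decide admission from the labels on a URL or host (assumed rule)."""
    allow = allow or {}
    deny = deny or {}
    # Any deny pair present on the item rejects it.
    if any(labels.get(key) == value for key, value in deny.items()):
        return False
    # Every allow pair must be present; an empty allow map passes.
    return all(labels.get(key) == value for key, value in allow.items())


print(label_decision({"skip": "true"}, deny={"skip": "true"}))
print(label_decision({"section": "docs"}, deny={"skip": "true"}))
```

Label values are compared as strings here, which matches the quoted "true" in the configuration example.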

interleave

Mixes URLs from different hosts together in the queue rather than draining one host before the next, which spreads load across hosts.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| shuffle | boolean | no | none | Whether to shuffle the interleaved order. |

filters:
  - kind: interleave
    params:
      shuffle: true
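
Interleaving by host can be pictured as a round-robin over per-host queues. The sketch below is an assumption about the behavior rather than the actual algorithm: `interleave_by_host` is a hypothetical function, and shuffle is interpreted here as randomizing the order in which hosts take their turns.

```python
import random
from collections import defaultdict, deque
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


def interleave_by_host(
    urls: Iterable[str],
    shuffle: bool = False,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Take one URL per host per round until every queue is empty."""
    queues = defaultdict(deque)          # host -> URLs in arrival order
    for url in urls:
        queues[urlsplit(url).hostname or ""].append(url)

    hosts = list(queues)                 # hosts in first-seen order
    if shuffle:
        (rng or random).shuffle(hosts)   # assumed meaning of `shuffle`

    ordered = []
    while hosts:
        remaining = []
        for host in hosts:
            ordered.append(queues[host].popleft())
            if queues[host]:
                remaining.append(host)   # host still has URLs queued
        hosts = remaining
    return ordered


urls = [
    "https://a.example/1", "https://a.example/2", "https://a.example/3",
    "https://b.example/1",
    "https://c.example/1", "https://c.example/2",
]
for url in interleave_by_host(urls):
    print(url)
```

Without interleaving, the three a.example URLs would be fetched back to back; with it, consecutive requests to the same host are separated whenever other hosts still have URLs queued.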
