Filters
The filter components that admit or reject discovered URLs.
Filters decide which discovered URLs are allowed into the frontier. A job lists filters under config.filters, each selected by kind. This page catalogs the available kinds and their parameters.
How filters are configured
Each entry in config.filters is an object with a kind and an optional params object. Filters are applied in order, and a URL must pass every filter to be admitted to the queue. Filters that take allow and deny lists share a common rule: a URL is admitted when it matches an allow entry and no deny entry, and default_allow decides the outcome for a URL that matches neither list.
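
The ordering rule above can be pictured as a chain of predicates. The following is a minimal Python sketch, not the crawler's implementation: `passes_all`, `is_https`, and `is_short` are hypothetical names introduced for illustration, and stopping at the first rejection is an assumption.

```python
from typing import Callable, Iterable

# A filter is modeled as a predicate: True admits the URL, False rejects it.
UrlFilter = Callable[[str], bool]


def passes_all(url: str, filters: Iterable[UrlFilter]) -> bool:
    """Apply filters in order; the first rejection stops evaluation (assumed)."""
    for url_filter in filters:
        if not url_filter(url):
            return False
    return True


# Hypothetical stand-ins for two configured filters.
def is_https(url: str) -> bool:
    return url.startswith("https://")


def is_short(url: str) -> bool:
    return len(url) < 100


chain = [is_https, is_short]
print(passes_all("https://example.com/docs", chain))  # True
print(passes_all("http://example.com/docs", chain))   # False
```

Because every filter must pass, the order of entries would affect only how early a URL is rejected, not the final outcome, as long as the filters are stateless.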
config:
  filters:
    - kind: regex_patterns
      params:
        allow:
          - https://example\.com/.+
    - kind: budget
      params:
        by: host
        limit: 1000

regex_patterns and url_patterns
Both kinds admit or reject by matching the URL. regex_patterns matches the full URL string against regular expressions. url_patterns matches individual parts of the URL, each given as its own pattern, which is more readable when you only care about a host and path prefix.
Both share the same parameters. For regex_patterns, each entry in allow and deny is a regular expression string. For url_patterns, each entry is a URL pattern object.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| allow | list of regex strings or URL patterns | no | none | Patterns a URL may match to be admitted. |
| deny | list of regex strings or URL patterns | no | none | Patterns that reject a URL. |
| default_allow | boolean | no | false | Whether a URL matching neither list is admitted. |
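
The allow, deny, and default_allow rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the crawler's own code: `is_admitted` is a hypothetical name, and each pattern is assumed to have to match the whole URL (`re.fullmatch`).

```python
import re
from typing import Sequence


def is_admitted(
    url: str,
    allow: Sequence[str] = (),
    deny: Sequence[str] = (),
    default_allow: bool = False,
) -> bool:
    # Assumption: a pattern matches only if it covers the full URL string.
    in_allow = any(re.fullmatch(pattern, url) for pattern in allow)
    in_deny = any(re.fullmatch(pattern, url) for pattern in deny)
    if in_deny:
        return False          # a deny match always rejects
    if in_allow:
        return True           # allow match with no deny match admits
    return default_allow      # neither list matched


allow = [r"https://example\.com/docs/.+"]
deny = [r"https://example\.com/docs/changelog/.+"]

print(is_admitted("https://example.com/docs/intro", allow, deny))         # True
print(is_admitted("https://example.com/docs/changelog/v2", allow, deny))  # False
print(is_admitted("https://example.com/blog/post", allow, deny))          # False
print(is_admitted("https://example.com/blog/post", allow, deny, default_allow=True))  # True
```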
filters:
  - kind: regex_patterns
    params:
      allow:
        - https://example\.com/docs/.+
      deny:
        - https://example\.com/docs/changelog/.+
  - kind: url_patterns
    params:
      allow:
        - hostname: example.com
          pathname: /docs/*

URL pattern
A url_patterns entry matches against individual parts of a URL. Only the components you set are matched; the rest are ignored.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| protocol | string | no | none | Match the URL scheme, such as https. |
| username | string | no | none | Match the userinfo username. |
| password | string | no | none | Match the userinfo password. |
| hostname | string | no | none | Match the host. |
| port | string | no | none | Match the port. |
| pathname | string | no | none | Match the path. |
| search | string | no | none | Match the query string. |
| hash | string | no | none | Match the fragment. |
| base_url | string | no | none | A base URL that relative pattern components are resolved against. |
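
To illustrate how only the components you set take part in matching, here is a rough Python sketch. It is not the crawler's matcher: `url_components` and `matches_pattern` are hypothetical helpers, shell-style globbing via `fnmatch` is assumed as a stand-in for the real pattern syntax, and base_url resolution is not modeled.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def url_components(url: str) -> dict:
    """Split a URL into the component names used by the table above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "username": parts.username or "",
        "password": parts.password or "",
        "hostname": parts.hostname or "",
        "port": "" if parts.port is None else str(parts.port),
        "pathname": parts.path,
        "search": parts.query,
        "hash": parts.fragment,
    }


def matches_pattern(url: str, pattern: dict) -> bool:
    """True if every component set in the pattern matches; unset ones are ignored."""
    components = url_components(url)
    return all(
        fnmatchcase(components[name], glob)
        for name, glob in pattern.items()
        if name != "base_url"  # not modeled in this sketch
    )


pattern = {"hostname": "example.com", "pathname": "/docs/*"}
print(matches_pattern("https://example.com/docs/intro", pattern))      # True
print(matches_pattern("http://example.com/docs/api?v=2#top", pattern)) # True
print(matches_pattern("https://other.org/docs/intro", pattern))        # False
```

The second URL still matches because the pattern sets neither protocol, search, nor hash, so those parts are not compared.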
max_depth
Admits a URL only if it is within a given link distance of the seeds.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| max_depth | integer | yes | none | The maximum depth, counted in links from a seed, that a URL may be admitted at. |
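
Depth counting can be sketched as a breadth-first walk over a link graph. The Python below is a hedged illustration: `admitted_by_depth` and the `links` mapping are hypothetical, and it assumes that seeds sit at depth 0 and that the limit is inclusive.

```python
from collections import deque
from typing import Dict, Iterable, List


def admitted_by_depth(
    links: Dict[str, List[str]], seeds: Iterable[str], max_depth: int
) -> Dict[str, int]:
    """Return {url: depth} for URLs reachable within max_depth links of a seed."""
    depth = {seed: 0 for seed in seeds}  # assumption: seeds are depth 0
    queue = deque(depth)
    while queue:
        url = queue.popleft()
        if depth[url] == max_depth:
            continue  # links discovered here would exceed the limit
        for target in links.get(url, ()):
            if target not in depth:
                depth[target] = depth[url] + 1
                queue.append(target)
    return depth


# Hypothetical link graph: / -> /a -> /b -> /c
links = {
    "https://example.com/": ["https://example.com/a"],
    "https://example.com/a": ["https://example.com/b"],
    "https://example.com/b": ["https://example.com/c"],
}
result = admitted_by_depth(links, ["https://example.com/"], max_depth=2)
print(sorted(result.values()))            # [0, 1, 2]
print("https://example.com/c" in result)  # False
```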
filters:
  - kind: max_depth
    params:
      max_depth: 3

recrawl
Controls whether a URL that has already been seen may be admitted again.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| allow | boolean | no | none | Whether already-seen URLs may be re-admitted. |
| scope | local or global | no | none | Whether "already seen" is judged within this run (local) or across all runs (global). |
| minimum_delay | duration | no | none | The minimum time that must pass before a seen URL is re-admitted. |
| exclude_roots | boolean | no | none | Whether root URLs are excluded from recrawl handling. |
| exclude_path_depth | integer | no | none | A path depth below which URLs are excluded from recrawl handling. |
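
One way to read the interaction of allow and minimum_delay is sketched below in Python. This is an assumption-laden illustration, not the crawler's logic: `may_admit` is a hypothetical function, scope is represented only by where `last_seen` comes from, and exclude_roots and exclude_path_depth are not modeled.

```python
from datetime import datetime, timedelta
from typing import Optional


def may_admit(
    last_seen: Optional[datetime],
    now: datetime,
    allow: bool,
    minimum_delay: timedelta,
) -> bool:
    if last_seen is None:
        return True   # never seen before, so this is not a recrawl
    if not allow:
        return False  # recrawling is switched off
    # Assumption: the delay is measured from the last time the URL was seen.
    return now - last_seen >= minimum_delay


now = datetime(2024, 1, 2, 12, 0)
day = timedelta(hours=24)

print(may_admit(None, now, allow=False, minimum_delay=day))                        # True
print(may_admit(now - timedelta(hours=3), now, allow=True, minimum_delay=day))     # False
print(may_admit(now - timedelta(hours=30), now, allow=True, minimum_delay=day))    # True
print(may_admit(now - timedelta(hours=30), now, allow=False, minimum_delay=day))   # False
```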
filters:
  - kind: recrawl
    params:
      allow: true
      scope: global
      minimum_delay: 24h

robots_txt
Rejects URLs that a site's robots rules disallow.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| cache_ttl | duration | no | none | How long a fetched robots ruleset is cached before being refetched. Durations are written like 30s, 1h. |
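
The effect of cache_ttl can be sketched with Python's standard `urllib.robotparser`. This is an illustration only: `RobotsCache`, the injected `fetch_rules` callable, the per-host cache key, and the `examplebot` user agent are all hypothetical, and the real filter may cache and fetch differently.

```python
import time
from typing import Callable, Dict, Tuple
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Caches parsed robots rules per host and refetches after ttl_seconds."""

    def __init__(
        self,
        fetch_rules: Callable[[str], str],  # hypothetical: host -> robots.txt text
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch_rules
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, RobotFileParser]] = {}

    def _parser_for(self, host: str) -> RobotFileParser:
        now = self._clock()
        cached = self._entries.get(host)
        if cached is None or now - cached[0] >= self._ttl:
            parser = RobotFileParser()
            parser.parse(self._fetch(host).splitlines())
            cached = (now, parser)
            self._entries[host] = cached
        return cached[1]

    def allows(self, user_agent: str, url: str) -> bool:
        host = urlsplit(url).netloc
        return self._parser_for(host).can_fetch(user_agent, url)


# Stand-ins so the example runs without network access.
fetches = []
fake_now = [0.0]


def fake_fetch(host: str) -> str:
    fetches.append(host)
    return "User-agent: *\nDisallow: /private/\n"


cache = RobotsCache(fake_fetch, ttl_seconds=3600, clock=lambda: fake_now[0])
print(cache.allows("examplebot", "https://example.com/private/x"))  # False
print(cache.allows("examplebot", "https://example.com/docs/"))      # True
print(len(fetches))  # 1, the second check reused the cached rules
```

Injecting the fetcher and the clock keeps the sketch testable; a real deployment would fetch `/robots.txt` over HTTP.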
filters:
  - kind: robots_txt
    params:
      cache_ttl: 1h

budget
Caps how many URLs are admitted for each value of a chosen dimension, such as each host.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| by | host | yes | none | The dimension the budget is counted against. |
| limit | integer | yes | none | The maximum number of URLs admitted per dimension value. |
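
A per-host budget amounts to a counter keyed by host. The Python sketch below is illustrative only: `HostBudget` is a hypothetical class, and it assumes that only admitted URLs count against the limit.

```python
from collections import Counter
from urllib.parse import urlsplit


class HostBudget:
    """Admits at most `limit` URLs per host."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.admitted = Counter()  # host -> number of URLs admitted so far

    def admit(self, url: str) -> bool:
        host = urlsplit(url).hostname or ""
        if self.admitted[host] >= self.limit:
            return False  # budget for this host is used up
        self.admitted[host] += 1
        return True


budget = HostBudget(limit=2)
print(budget.admit("https://example.com/a"))  # True
print(budget.admit("https://example.com/b"))  # True
print(budget.admit("https://example.com/c"))  # False
print(budget.admit("https://other.org/a"))    # True
```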
filters:
  - kind: budget
    params:
      by: host
      limit: 1000

url_labels and host_labels
Admit or reject based on labels attached to the URL itself (url_labels) or to its host (host_labels). This lets labels written during one crawl steer a later one.
Both share the same parameters:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| allow | map of string to string | no | none | Labels a URL or host must carry to be admitted. |
| deny | map of string to string | no | none | Labels that reject a URL or host. |
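
Label matching can be sketched as a comparison of key and value pairs. The Python below is a guess at the semantics, not a statement of them: `labels_admit` is a hypothetical function, and it assumes that any matching deny pair rejects, that every allow pair must be present, and that an empty allow map admits.

```python
from typing import Dict, Optional


def labels_admit(
    labels: Dict[str, str],
    allow: Optional[Dict[str, str]] = None,
    deny: Optional[Dict[str, str]] = None,
) -> bool:
    allow = allow or {}
    deny = deny or {}
    # Assumption: carrying any denied label pair rejects the URL or host.
    if any(labels.get(key) == value for key, value in deny.items()):
        return False
    # Assumption: every allowed label pair must be carried.
    return all(labels.get(key) == value for key, value in allow.items())


print(labels_admit({"skip": "true"}, deny={"skip": "true"}))   # False
print(labels_admit({"skip": "false"}, deny={"skip": "true"}))  # True
print(labels_admit({"lang": "en"}, allow={"lang": "en"}))      # True
print(labels_admit({}, allow={"lang": "en"}))                  # False
```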
filters:
  - kind: url_labels
    params:
      deny:
        skip: "true"

interleave
Mixes URLs from different hosts together in the queue rather than draining one host before the next, which spreads load across hosts.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| shuffle | boolean | no | none | Whether to shuffle the interleaved order. |
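
Interleaving can be pictured as a round-robin over per-host queues. The Python sketch below is illustrative: `interleave` is a hypothetical function, and treating shuffle as a shuffle of the host order is an assumption, since the exact meaning of the option is not specified here.

```python
import random
from collections import defaultdict
from itertools import zip_longest
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


def interleave(
    urls: Iterable[str], shuffle: bool = False, rng: Optional[random.Random] = None
) -> List[str]:
    """Take one URL per host in turn instead of draining one host first."""
    by_host = defaultdict(list)
    for url in urls:
        by_host[urlsplit(url).hostname].append(url)
    queues = list(by_host.values())
    if shuffle:
        # Assumption: shuffling changes the order in which hosts are visited.
        (rng or random).shuffle(queues)
    missing = object()  # padding marker for hosts that run out of URLs
    return [
        url
        for round_ in zip_longest(*queues, fillvalue=missing)
        for url in round_
        if url is not missing
    ]


urls = [
    "https://a.example/1",
    "https://a.example/2",
    "https://a.example/3",
    "https://b.example/1",
]
print(interleave(urls))
# ['https://a.example/1', 'https://b.example/1', 'https://a.example/2', 'https://a.example/3']
```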
filters:
  - kind: interleave
    params:
      shuffle: true