Crawl manifest
The complete schema of a crawl-job manifest, field by field.
A crawl manifest is the definition of a single crawl job. You pass it to the run command as a file, or send its config object to the orchestrator's HTTP API to create a job. This page documents every field. For the catalog of kind values that each component field accepts, see Components in the Reference section.
Manifests can be written in YAML or JSON; the two are interchangeable. YAML is used throughout this page for readability.
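
To illustrate the equivalence, the following minimal sketch serializes a small manifest to JSON with Python's standard json module. The in-memory dict stands in for a parsed YAML file, and the field values are placeholders borrowed from the examples on this page.

```python
import json

# A small manifest as plain Python data; the values are illustrative placeholders.
manifest = {
    "name": "my-crawl",
    "labels": {"team": "research"},
    "config": {
        "seeds": [
            {"kind": "static_list", "params": {"urls": ["https://example.com/"]}}
        ]
    },
}

# The JSON form carries exactly the same structure as the YAML form.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```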
Top-level structure
A manifest has three top-level fields:
```yaml
name: my-crawl
labels:
  team: research
config:
  # ... the crawl configuration, documented below
```

| Field | Type | Required | Description |
|---|---|---|---|
| name | string (1 to 255 chars) | yes | Human-readable name for the job. |
| labels | map of string to string | no | Arbitrary key-value labels attached to the job, useful for filtering and organization. Defaults to empty. |
| config | object | yes | The crawl configuration. All crawl behavior lives here. |

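
The constraints in the table can be checked before a manifest is submitted. The following is a minimal sketch, assuming the manifest has already been parsed into a Python dict; `validate_top_level` is a hypothetical helper for illustration and not part of the crawler itself.

```python
def validate_top_level(manifest: dict) -> list[str]:
    """Return a list of problems found in the top-level fields (empty if none)."""
    problems = []

    # name: required string of 1 to 255 characters.
    name = manifest.get("name")
    if not isinstance(name, str) or not (1 <= len(name) <= 255):
        problems.append("name must be a string of 1 to 255 characters")

    # labels: optional map of string to string, defaulting to empty.
    labels = manifest.get("labels", {})
    if not isinstance(labels, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in labels.items()
    ):
        problems.append("labels must map strings to strings")

    # config: required object.
    if not isinstance(manifest.get("config"), dict):
        problems.append("config is required and must be an object")

    return problems
```

For example, `validate_top_level({"name": "my-crawl", "config": {}})` returns an empty list, while a manifest without a name yields one problem.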
The config object
config is where the crawl is defined. Every field has a default, so a minimal manifest only needs to override what it cares about (in practice, at least seeds, a stopping rule, and a pipeline). The fields are:
| Field | Type | Default | Description |
|---|---|---|---|
| user_agent | string or random | a built-in default | The User-Agent sent with requests. Setting it to random rotates a realistic agent on each request. |
| max_queue_size | integer | 1000 | Maximum number of URLs held in the in-flight frontier queue. Acts as a backpressure ceiling. |
| speed | object | see Speed | Rate-limiting and batching controls. |
| starting_criteria | list | empty | Rules deciding whether a new run may start. See Starting criteria. |
| stopping_criteria | list | empty | Rules deciding when a run is finished. See Stopping criteria. |
| seeds | list | empty | Where the crawl begins. See Seeds. |
| ranker | object | breadth-first | How queued URLs are ordered. See Rankers. |
| fetcher | object | see Fetcher | How pages are retrieved. |
| policies | list | empty | Cross-cutting crawl policies. See Policies. |
| filters | list | empty | Rules that admit or reject discovered URLs. See Filters. |
| mutators | list | empty | Transforms applied to URLs before they enter the frontier. See Mutators. |
| pipelines | list | empty | The parsing pipelines applied to fetched content. See Pipelines. |

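
Because seeds, stopping_criteria, and pipelines all default to empty, a manifest that omits them is still schema-valid but is unlikely to do anything useful. The following sketch is a lint for that case, assuming the config has been parsed into a Python dict; `missing_essentials` is a hypothetical helper, and the check reflects the practical advice above rather than a rule the schema enforces.

```python
# Fields a useful crawl needs in practice, even though each defaults to empty.
ESSENTIALS = ("seeds", "stopping_criteria", "pipelines")


def missing_essentials(config: dict) -> list[str]:
    """Return the essential list fields that are absent or empty in config."""
    return [field for field in ESSENTIALS if not config.get(field)]
```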
Each component field follows the same shape: an object with a kind naming the implementation and an optional params object configuring it.
```yaml
seeds:
  - kind: static_list
    params:
      urls:
        - https://example.com/
```

Speed
config.speed controls how aggressively the crawl issues work, modeled as a token bucket:
```yaml
speed:
  bucket_capacity: 10
  batch_size_factor: 1.0
  refill_strategy: fixed
  refill_rate: 1.0
```

| Field | Type | Default | Description |
|---|---|---|---|
| bucket_capacity | integer | 10 | Maximum tokens available, which bounds how many URLs can be scheduled in one evaluation. |
| batch_size_factor | float | 1.0 | Multiplier applied to bucket_capacity to compute the effective scheduling batch size. |
| refill_strategy | enum | fixed | How tokens are replenished over time. |
| refill_rate | float | 1.0 | Tokens added per refill interval. |

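
The arithmetic behind these fields can be sketched in a few lines. This is an illustrative model only, not the scheduler's actual code: the `Speed` and `TokenBucket` classes are hypothetical, rounding the batch size down is an assumption, and so are the bucket starting full and the fixed strategy adding refill_rate tokens per interval up to bucket_capacity. How the batch size interacts with the tokens actually available is not specified on this page, so the two parts are kept separate.

```python
import math
from dataclasses import dataclass


@dataclass
class Speed:
    bucket_capacity: int = 10
    batch_size_factor: float = 1.0
    refill_rate: float = 1.0

    def effective_batch_size(self) -> int:
        # Assumption: fractional results are rounded down.
        return math.floor(self.bucket_capacity * self.batch_size_factor)


class TokenBucket:
    """Fixed-refill bucket: refill_rate tokens per interval, capped at capacity."""

    def __init__(self, speed: Speed) -> None:
        self.capacity = speed.bucket_capacity
        self.refill_rate = speed.refill_rate
        self.tokens = float(speed.bucket_capacity)  # assumption: starts full

    def refill(self) -> None:
        # Called once per refill interval; never exceeds the capacity.
        self.tokens = min(self.capacity, self.tokens + self.refill_rate)

    def take(self, wanted: int) -> int:
        # Grant as many whole tokens as are available, up to the request.
        granted = min(wanted, int(self.tokens))
        self.tokens -= granted
        return granted
```

With the values from the complete example later on this page (bucket_capacity 4, batch_size_factor 2.0), `Speed(4, 2.0).effective_batch_size()` evaluates to 8.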
Fetcher
config.fetcher configures retrieval. Only client is commonly set; the rest have sensible defaults:
```yaml
fetcher:
  client:
    kind: standard
  timeout: 30s
  max_redirects: 10
  max_retries: 3
  retry_on_codes: [429, 451, 500, 502, 503, 504, 526]
```

| Field | Type | Default | Description |
|---|---|---|---|
| client | object | standard HTTP | The fetch client. See Clients. |
| middleware | list | empty | Per-request transforms. See Middleware. |
| timeout | duration | 30s | Per-request timeout. Durations are written like 30s, 2m, 500ms. |
| max_redirects | integer | 10 | Maximum redirects followed per request. |
| max_retries | integer | 3 | Maximum retry attempts for a failed fetch. |
| retry_on_codes | list of integers | [429, 451, 500, 502, 503, 504, 526] | HTTP status codes that trigger a retry. |
| proxies | list of strings | none | Optional proxy URLs to route requests through. |

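
To make the duration format and the retry settings concrete, here is a small sketch. Both helpers are hypothetical and written for illustration: `parse_duration` handles only the three unit suffixes shown in the table (the real parser may accept more), and `should_retry` assumes that max_retries counts retries after the first attempt.

```python
import re

# Only the unit suffixes shown on this page, expressed in milliseconds.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000}
_DURATION_RE = re.compile(r"(\d+)(ms|s|m)")

# Default value of retry_on_codes from the table above.
DEFAULT_RETRY_CODES = frozenset({429, 451, 500, 502, 503, 504, 526})


def parse_duration(text: str) -> float:
    """Convert a duration such as '30s', '2m', or '500ms' to seconds."""
    match = _DURATION_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNIT_MS[unit] / 1000


def should_retry(status: int, retries_so_far: int, max_retries: int = 3,
                 retry_on_codes=DEFAULT_RETRY_CODES) -> bool:
    """Retry only listed status codes, and only while retries remain."""
    return status in retry_on_codes and retries_so_far < max_retries
```

Under these assumptions a 503 response is retried until three retries have been made, while a 404 is never retried because it is not in retry_on_codes.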
Pipelines
config.pipelines is the heart of parsing. Each pipeline processes fetched content through an ordered set of steps, discovers links through a navigator, and delivers results through actions. A job may declare multiple pipelines.
```yaml
pipelines:
  - identifier: main
    navigator:
      kind: anchor
    steps:
      - kind: extractor
        params:
          kind: markdown
    actions:
      - kind: to_blob
        params:
          directory: output
```

| Field | Type | Default | Description |
|---|---|---|---|
| identifier | string | empty | A name for the pipeline, used in logs and to distinguish multiple pipelines. |
| guards | list | none | Early-exit checks evaluated before the pipeline runs. See Guards. |
| navigator | object | none | How child links are discovered. See Navigators. |
| steps | list | empty | The ordered processing steps (conditions, extractors, resolvers). See The parser pipeline. |
| actions | list | empty | What to do with the result. See Sink actions. |
| priority | integer | none | Optional ordering when multiple pipelines apply. |
| behavior | enum | default | How this pipeline interacts with others that match the same content. |

A step is itself a kind plus params. The common kinds are extractor (pull structured data, see Extractors), conditions (include or exclude the page), and asset_resolver (resolve and download referenced assets, see Resolvers).
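
Since navigators, steps, and actions all share the kind-plus-params shape, a parsed manifest can be walked generically, for instance to list every kind it uses before looking them up in the Components catalog. The following is a minimal sketch over plain Python data; `collect_kinds` is a hypothetical helper, not part of the crawler.

```python
def collect_kinds(node, path="config"):
    """Yield (path, kind) for every object in the tree that declares a kind."""
    if isinstance(node, dict):
        if isinstance(node.get("kind"), str):
            yield path, node["kind"]
        for key, value in node.items():
            yield from collect_kinds(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from collect_kinds(item, f"{path}[{index}]")


# The pipeline from this section, as parsed data.
pipelines = [
    {
        "identifier": "main",
        "navigator": {"kind": "anchor"},
        "steps": [{"kind": "extractor", "params": {"kind": "markdown"}}],
        "actions": [{"kind": "to_blob", "params": {"directory": "output"}}],
    }
]

for location, kind in collect_kinds(pipelines, "pipelines"):
    print(f"{location}: {kind}")
```

Note that the extractor step is reported twice: once for the step itself (extractor) and once for the kind nested in its params (markdown).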
A complete example
The following manifest crawls a single site breadth-first, keeps the crawl on-domain with a regex filter, normalizes URLs, resolves a few images per page, converts pages to markdown, and writes everything to blob storage. It mirrors the example shipped at manifests/test-crawl.yml in the repository.
```yaml
name: docs-crawl
labels:
  job: docs-crawl
config:
  user_agent: random
  max_queue_size: 15
  speed:
    bucket_capacity: 4
    batch_size_factor: 2.0
    refill_strategy: fixed
    refill_rate: 1.0
  stopping_criteria:
    - kind: max_urls
      params:
        max_urls: 50
  seeds:
    - kind: static_list
      params:
        urls:
          - https://example.com/docs
  ranker:
    kind: breadth
  fetcher:
    client:
      kind: standard
    timeout: 30s
  filters:
    - kind: regex_patterns
      params:
        allow:
          - https://example\.com/docs/.+
  mutators:
    - kind: sanitize
      params:
        strip_fragment: true
        strip_query: true
  pipelines:
    - navigator:
        kind: anchor
      steps:
        - kind: asset_resolver
          params:
            concurrency: 4
            timeout: 10s
            max_retries: 3
            resolvers:
              - kind: xpath
                params:
                  xpaths:
                    - //body/img/@src
                  max_items: 3
        - kind: extractor
          params:
            kind: markdown
      actions:
        - kind: to_blob
          params:
            directory: output
            key_strategy: 5min
            include_assets: true
```

Related
- Author a crawl job walks through building a manifest from scratch.
- Components catalogs every kind and its parameters.
- Configuration covers server and deployment settings, which are separate from per-job manifest settings.