Crawl manifest

The complete schema of a crawl-job manifest, field by field.

A crawl manifest is the definition of a single crawl job. You pass it to the run command as a file, or send its config object to the orchestrator's HTTP API to create a job. This page documents every field. For the catalog of kind values each component field accepts, follow the links into the Components Reference.

Manifests can be written in YAML or JSON; the two are interchangeable. YAML is used throughout this page for readability.

Top-level structure

A manifest has three top-level fields:

name: my-crawl
labels:
  team: research
config:
  # ... the crawl configuration, documented below

Field	Type	Required	Description
`name`	string (1 to 255 chars)	yes	Human-readable name for the job.
`labels`	map of string to string	no	Arbitrary key-value labels attached to the job, useful for filtering and organization. Defaults to empty.
`config`	object	yes	The crawl configuration. All crawl behavior lives here.
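
The constraints in this table are simple enough to check before submitting a manifest. The following is a minimal sketch, not the orchestrator's own validation: `validate_top_level` is a hypothetical helper, and the manifest is built as a plain dictionary and printed as JSON only to show the JSON form of the YAML example above.

```python
import json


def validate_top_level(manifest: dict) -> list[str]:
    """Return the problems found in the top-level fields (empty if none)."""
    problems = []

    # name: required string of 1 to 255 characters
    name = manifest.get("name")
    if not isinstance(name, str) or not (1 <= len(name) <= 255):
        problems.append("name must be a string of 1 to 255 characters")

    # labels: optional map of string to string, defaults to empty
    labels = manifest.get("labels", {})
    if not isinstance(labels, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in labels.items()
    ):
        problems.append("labels must map strings to strings")

    # config: required object
    if not isinstance(manifest.get("config"), dict):
        problems.append("config is required and must be an object")

    return problems


manifest = {"name": "my-crawl", "labels": {"team": "research"}, "config": {}}
print(json.dumps(manifest, indent=2))  # the same manifest in JSON form
print(validate_top_level(manifest))
```
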

The config object

`config` defines the crawl. Every field has a default, so a manifest only needs to set the fields it wants to change. In practice that means at least `seeds`, a stopping rule under `stopping_criteria`, and one entry under `pipelines`. The fields are:

Field	Type	Default	Description
`user_agent`	string or `random`	a built-in default	The User-Agent sent with requests. `random` rotates a realistic agent per request.
`max_queue_size`	integer	`1000`	Maximum number of URLs held in the in-flight frontier queue. Acts as a backpressure ceiling.
`speed`	object	see Speed	Rate-limiting and batching controls.
`starting_criteria`	list	empty	Rules deciding whether a new run may start. See Starting criteria.
`stopping_criteria`	list	empty	Rules deciding when a run is finished. See Stopping criteria.
`seeds`	list	empty	Where the crawl begins. See Seeds.
`ranker`	object	breadth-first	How queued URLs are ordered. See Rankers.
`fetcher`	object	see Fetcher	How pages are retrieved.
`policies`	list	empty	Cross-cutting crawl policies. See Policies.
`filters`	list	empty	Rules that admit or reject discovered URLs. See Filters.
`mutators`	list	empty	Transforms applied to URLs before they enter the frontier. See Mutators.
`pipelines`	list	empty	The parsing pipelines applied to fetched content. See Pipelines.

Each component follows the same shape: an object with a `kind` naming the implementation and an optional `params` object configuring it. In list-valued fields such as `seeds`, every list entry has this shape:

seeds:
  - kind: static_list
    params:
      urls:
        - https://example.com/
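
When manifests are generated from code rather than written by hand, the kind-plus-params shape can be captured in one small helper. This is a sketch only: `component` is a hypothetical function, not part of any shipped client library, and it uses only kinds that appear on this page.

```python
from typing import Any


def component(kind: str, **params: Any) -> dict:
    """Build a component entry: a kind, plus params only when some are given."""
    entry: dict[str, Any] = {"kind": kind}
    if params:
        entry["params"] = params
    return entry


config = {
    # list-valued field: each entry is a component
    "seeds": [component("static_list", urls=["https://example.com/"])],
    # single-valued field: the value itself is a component, here without params
    "ranker": component("breadth"),
}
```
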

Speed

`config.speed` controls how fast the crawl schedules URLs, modeled as a token bucket:

speed:
  bucket_capacity: 10
  batch_size_factor: 1.0
  refill_strategy: fixed
  refill_rate: 1.0

Field	Type	Default	Description
`bucket_capacity`	integer	`10`	Maximum tokens available, which bounds how many URLs can be scheduled in one evaluation.
`batch_size_factor`	float	`1.0`	Multiplier applied to `bucket_capacity` to compute the effective scheduling batch size.
`refill_strategy`	enum	`fixed`	How tokens are replenished over time.
`refill_rate`	float	`1.0`	Tokens added per refill interval.
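
To make the token-bucket vocabulary concrete, here is a minimal sketch of a bucket with a `fixed` refill. It is an illustration under stated assumptions, not the scheduler's implementation: the class and function names are hypothetical, the bucket is assumed to start full, one call to `refill` stands for one refill interval (the interval length is not specified on this page), and the effective batch size is assumed to round down.

```python
import math
from dataclasses import dataclass


@dataclass
class Speed:
    bucket_capacity: int = 10
    batch_size_factor: float = 1.0
    refill_rate: float = 1.0


class TokenBucket:
    def __init__(self, speed: Speed):
        self.speed = speed
        self.tokens = float(speed.bucket_capacity)  # assumption: starts full

    def refill(self, intervals: int = 1) -> None:
        # fixed strategy: add refill_rate tokens per interval, capped at capacity
        added = self.speed.refill_rate * intervals
        self.tokens = min(self.speed.bucket_capacity, self.tokens + added)

    def take(self, wanted: int) -> int:
        # grant as many whole tokens as are available, up to the request
        granted = min(wanted, int(self.tokens))
        self.tokens -= granted
        return granted


def effective_batch_size(speed: Speed) -> int:
    # bucket_capacity multiplied by batch_size_factor (rounding down is an assumption)
    return math.floor(speed.bucket_capacity * speed.batch_size_factor)
```

With `bucket_capacity: 4` and `batch_size_factor: 2.0`, `effective_batch_size` returns 8, while a full bucket in this sketch grants at most 4 tokens at once and then regains `refill_rate` tokens per interval.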

Fetcher

`config.fetcher` configures retrieval. Only `client` is commonly set; the rest have sensible defaults:

fetcher:
  client:
    kind: standard
  timeout: 30s
  max_redirects: 10
  max_retries: 3
  retry_on_codes: [429, 451, 500, 502, 503, 504, 526]

Field	Type	Default	Description
`client`	object	standard HTTP	The fetch client. See Clients.
`middleware`	list	empty	Per-request transforms. See Middleware.
`timeout`	duration	`30s`	Per-request timeout. Durations are written like `30s`, `2m`, `500ms`.
`max_redirects`	integer	`10`	Maximum redirects followed per request.
`max_retries`	integer	`3`	Maximum retry attempts for a failed fetch.
`retry_on_codes`	list of integers	`[429, 451, 500, 502, 503, 504, 526]`	HTTP status codes that trigger a retry.
`proxies`	list of strings	none	Optional proxy URLs to route requests through.
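
The duration format and the retry settings can be read as two small rules. The sketch below is an approximation under assumptions: `parse_duration` and `should_retry` are hypothetical helpers, only the three units shown in the table (`ms`, `s`, `m`) are handled because the full unit set is not listed here, and `max_retries` is assumed to count retries after the first attempt.

```python
import re
from datetime import timedelta

# Only the units that appear on this page; the real parser may accept more.
_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes"}
_DURATION = re.compile(r"(\d+)(ms|s|m)")

DEFAULT_RETRY_CODES = frozenset({429, 451, 500, 502, 503, 504, 526})


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '30s', '2m', or '500ms'."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def should_retry(
    status: int,
    retries_made: int,
    max_retries: int = 3,
    retry_on_codes: frozenset = DEFAULT_RETRY_CODES,
) -> bool:
    """Retry only on a listed status code and while retries remain."""
    return status in retry_on_codes and retries_made < max_retries


print(parse_duration("30s").total_seconds())
print(should_retry(503, retries_made=0))
```
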

Pipelines

`config.pipelines` is the heart of parsing. Each pipeline processes fetched content through an ordered set of steps, discovers links through a navigator, and delivers results through actions. A job may declare multiple pipelines.

pipelines:
  - identifier: main
    navigator:
      kind: anchor
    steps:
      - kind: extractor
        params:
          kind: markdown
    actions:
      - kind: to_blob
        params:
          directory: output

Field	Type	Default	Description
`identifier`	string	empty	A name for the pipeline, used in logs and to distinguish multiple pipelines.
`guards`	list	none	Early-exit checks evaluated before the pipeline runs. See Guards.
`navigator`	object	none	How child links are discovered. See Navigators.
`steps`	list	empty	The ordered processing steps (conditions, extractors, resolvers). See the parser pipeline.
`actions`	list	empty	What to do with the result. See Sink actions.
`priority`	integer	none	Optional ordering when multiple pipelines apply.
`behavior`	enum	default	How this pipeline interacts with others that match the same content.

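A step is itself a `kind` plus `params`. The common kinds are `extractor` (pull structured data, see Extractors), `conditions` (include or exclude the page), and `asset_resolver` (resolve and download referenced assets, see Resolvers). Some steps nest a second `kind` inside `params`: in the example above, the `extractor` step selects the `markdown` extractor this way.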
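
Because a step's `params` can carry its own `kind`, it helps to see how a parsed pipeline nests. The following sketch walks the pipeline from the example above, written as a plain dictionary, and lists what it declares. `describe_pipeline` is a hypothetical helper for illustration and says nothing about how the crawler executes a pipeline.

```python
def describe_pipeline(pipeline: dict) -> list[str]:
    """Summarize a pipeline entry: identifier, navigator, steps, actions."""
    lines = [f"pipeline: {pipeline.get('identifier', '') or '(unnamed)'}"]

    navigator = pipeline.get("navigator")
    if navigator:
        lines.append(f"navigator: {navigator['kind']}")

    for step in pipeline.get("steps", []):
        # a step may nest a second kind inside params (extractor -> markdown)
        inner = step.get("params", {}).get("kind")
        lines.append(f"step: {step['kind']}" + (f" ({inner})" if inner else ""))

    for action in pipeline.get("actions", []):
        lines.append(f"action: {action['kind']}")

    return lines


main = {
    "identifier": "main",
    "navigator": {"kind": "anchor"},
    "steps": [{"kind": "extractor", "params": {"kind": "markdown"}}],
    "actions": [{"kind": "to_blob", "params": {"directory": "output"}}],
}
print("\n".join(describe_pipeline(main)))
```
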

A complete example

The following manifest crawls a single site breadth-first, keeps the crawl on-domain with a regex filter, normalizes URLs, resolves a few images per page, converts pages to markdown, and writes everything to blob storage. It mirrors the example shipped at manifests/test-crawl.yml in the repository.

name: docs-crawl
labels:
  job: docs-crawl
config:
  user_agent: random
  max_queue_size: 15
  speed:
    bucket_capacity: 4
    batch_size_factor: 2.0
    refill_strategy: fixed
    refill_rate: 1.0
  stopping_criteria:
    - kind: max_urls
      params:
        max_urls: 50
  seeds:
    - kind: static_list
      params:
        urls:
          - https://example.com/docs
  ranker:
    kind: breadth
  fetcher:
    client:
      kind: standard
    timeout: 30s
  filters:
    - kind: regex_patterns
      params:
        allow:
          - https://example\.com/docs/.+
  mutators:
    - kind: sanitize
      params:
        strip_fragment: true
        strip_query: true
  pipelines:
    - navigator:
        kind: anchor
      steps:
        - kind: asset_resolver
          params:
            concurrency: 4
            timeout: 10s
            max_retries: 3
            resolvers:
              - kind: xpath
                params:
                  xpaths:
                    - //body/img/@src
                  max_items: 3
        - kind: extractor
          params:
            kind: markdown
      actions:
        - kind: to_blob
          params:
            directory: output
            key_strategy: 5min
            include_assets: true
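
To see what the filter and mutator in this manifest would do to individual URLs, the sketch below reproduces them with the Python standard library. It approximates the behavior under assumptions: the `allow` pattern is assumed to be matched against the whole URL, `sanitize` is assumed to simply drop the query string and fragment, and both function names are hypothetical.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# The allow pattern from the manifest above.
ALLOW = [re.compile(r"https://example\.com/docs/.+")]


def allowed(url: str) -> bool:
    """Admit a URL if any allow pattern matches it in full (assumption)."""
    return any(pattern.fullmatch(url) for pattern in ALLOW)


def sanitize(url: str, strip_fragment: bool = True, strip_query: bool = True) -> str:
    """Drop the fragment and/or query string from a URL."""
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme,
        parts.netloc,
        parts.path,
        "" if strip_query else parts.query,
        "" if strip_fragment else parts.fragment,
    ))


for url in (
    "https://example.com/docs/intro?page=2#top",
    "https://example.com/docs",
    "https://example.com/blog/post",
):
    print(url, allowed(url), sanitize(url))
```

Under these assumptions the pattern admits pages below `/docs/` but not the seed URL `https://example.com/docs` itself, since `.+` requires at least one character after the trailing slash. Whether seeds are subject to filters is not stated on this page, so check the Filters reference before relying on either behavior.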

Author a crawl job walks through building a manifest from scratch.
Components catalogs every kind and its parameters.
Configuration covers server and deployment settings, which are separate from per-job manifest settings.