Crawl manifest

The complete schema of a crawl-job manifest, field by field.

A crawl manifest is the definition of a single crawl job. You pass it to the run command as a file, or send its config object to the orchestrator's HTTP API to create a job. This page documents every field. For the catalog of kind values that each component field accepts, follow the links into the Components section of the Reference.

Manifests can be written in YAML or JSON; the two are interchangeable. YAML is used throughout this page for readability.
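Because this page only shows YAML, a short sketch of the JSON form may help. It uses Python's standard json module to serialize the same nested structure; the field values are placeholders borrowed from the example below, and the empty config is a stand-in rather than a usable crawl configuration.

```python
import json

# The same structure a YAML manifest would describe, held as plain Python data.
# Values are placeholders; an empty config is only a stand-in for this sketch.
manifest = {
    "name": "my-crawl",
    "labels": {"team": "research"},
    "config": {},
}

# Serialize to the JSON form of the manifest.
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Parsing the JSON text back with json.loads yields the same dictionary, which is the sense in which the two formats carry the same data.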

Top-level structure

A manifest has three top-level fields:

name: my-crawl
labels:
  team: research
config:
  # ... the crawl configuration, documented below

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| name | string (1 to 255 chars) | yes | Human-readable name for the job. |
| labels | map of string to string | no | Arbitrary key-value labels attached to the job, useful for filtering and organization. Defaults to empty. |
| config | object | yes | The crawl configuration. All crawl behavior lives here. |
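The constraints in the table above can be checked before a manifest is submitted. The following is a minimal sketch, not the orchestrator's own validation: validate_manifest is a hypothetical helper, and it assumes the 1 to 255 limit counts characters rather than bytes.

```python
from typing import List


def validate_manifest(manifest: dict) -> List[str]:
    """Return a list of problems; an empty list means the top level looks valid."""
    problems = []

    # name: required string, 1 to 255 characters (assumed to mean characters, not bytes).
    name = manifest.get("name")
    if not isinstance(name, str) or not (1 <= len(name) <= 255):
        problems.append("name must be a string of 1 to 255 characters")

    # labels: optional map of string to string, defaulting to empty.
    labels = manifest.get("labels", {})
    if not isinstance(labels, dict) or not all(
        isinstance(k, str) and isinstance(v, str) for k, v in labels.items()
    ):
        problems.append("labels must map strings to strings")

    # config: required object.
    if not isinstance(manifest.get("config"), dict):
        problems.append("config is required and must be an object")

    return problems


print(validate_manifest({"name": "my-crawl", "config": {}}))
```

A manifest with an empty name, a non-string label value, and no config would produce three problems under these checks.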

The config object

config is where the crawl is defined. Every field has a default, so a minimal manifest only needs to override what it cares about (in practice, at least seeds, a stopping rule, and a pipeline). The fields are:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| user_agent | string or random | a built-in default | The User-Agent sent with requests. random rotates a realistic agent per request. |
| max_queue_size | integer | 1000 | Maximum number of URLs held in the in-flight frontier queue. Acts as a backpressure ceiling. |
| speed | object | see Speed | Rate-limiting and batching controls. |
| starting_criteria | list | empty | Rules deciding whether a new run may start. See Starting criteria. |
| stopping_criteria | list | empty | Rules deciding when a run is finished. See Stopping criteria. |
| seeds | list | empty | Where the crawl begins. See Seeds. |
| ranker | object | breadth-first | How queued URLs are ordered. See Rankers. |
| fetcher | object | see Fetcher | How pages are retrieved. |
| policies | list | empty | Cross-cutting crawl policies. See Policies. |
| filters | list | empty | Rules that admit or reject discovered URLs. See Filters. |
| mutators | list | empty | Transforms applied to URLs before they enter the frontier. See Mutators. |
| pipelines | list | empty | The parsing pipelines applied to fetched content. See Pipelines. |
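To make the "every field has a default" rule concrete, here is a small sketch that merges a partial config over the defaults listed above and flags the three fields a manifest needs in practice. The helper names are hypothetical, the check is a lint-style convenience rather than something the orchestrator is known to enforce, and fields whose defaults are described only in prose (user_agent, speed, ranker, fetcher) are left out.

```python
from typing import List

# Defaults copied from the table above. Fields whose default is described only
# in prose (user_agent, speed, ranker, fetcher) are omitted from this sketch.
CONFIG_DEFAULTS = {
    "max_queue_size": 1000,
    "starting_criteria": [],
    "stopping_criteria": [],
    "seeds": [],
    "policies": [],
    "filters": [],
    "mutators": [],
    "pipelines": [],
}

# "In practice, at least seeds, a stopping rule, and a pipeline."
ESSENTIAL_FIELDS = ("seeds", "stopping_criteria", "pipelines")


def with_defaults(config: dict) -> dict:
    """Return a new dict with the given config layered over the defaults."""
    merged = {
        key: (list(value) if isinstance(value, list) else value)
        for key, value in CONFIG_DEFAULTS.items()
    }
    merged.update(config)
    return merged


def missing_essentials(config: dict) -> List[str]:
    """List the essential fields that are still empty after applying defaults."""
    merged = with_defaults(config)
    return [name for name in ESSENTIAL_FIELDS if not merged[name]]


print(missing_essentials({"max_queue_size": 15}))
```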

Each component field follows the same shape: an object with a kind naming the implementation and an optional params object configuring it.

seeds:
  - kind: static_list
    params:
      urls:
        - https://example.com/
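When manifests are generated programmatically, the kind/params shape is easy to build with one helper. This is a sketch under the assumption that omitting params is equivalent to not configuring the component; component is a hypothetical function, not part of any shipped client library.

```python
from typing import Optional


def component(kind: str, params: Optional[dict] = None) -> dict:
    """Build a {kind, params} object; params is left out when not given."""
    if not kind:
        raise ValueError("kind must be a non-empty string")
    obj = {"kind": kind}
    if params is not None:
        obj["params"] = params
    return obj


# The seeds entry from the example above, and a ranker with no params.
seeds = [component("static_list", {"urls": ["https://example.com/"]})]
ranker = component("breadth")
print(seeds, ranker)
```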

Speed

config.speed controls how aggressively the crawl issues work, modeled as a token bucket:

speed:
  bucket_capacity: 10
  batch_size_factor: 1.0
  refill_strategy: fixed
  refill_rate: 1.0

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| bucket_capacity | integer | 10 | Maximum tokens available, which bounds how many URLs can be scheduled in one evaluation. |
| batch_size_factor | float | 1.0 | Multiplier applied to bucket_capacity to compute the effective scheduling batch size. |
| refill_strategy | enum | fixed | How tokens are replenished over time. |
| refill_rate | float | 1.0 | Tokens added per refill interval. |
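The arithmetic behind these fields can be sketched as follows. This is an illustration of a generic token bucket, not the scheduler's actual code: the rounding of the batch size, the bucket starting full, and the meaning of one refill interval are all assumptions, and how the batch size and the token count interact at scheduling time is not specified on this page, so the sketch keeps them as two separate calculations.

```python
import math
from dataclasses import dataclass


@dataclass
class Speed:
    # Defaults taken from the table above.
    bucket_capacity: int = 10
    batch_size_factor: float = 1.0
    refill_strategy: str = "fixed"
    refill_rate: float = 1.0


def effective_batch_size(speed: Speed) -> int:
    # bucket_capacity multiplied by batch_size_factor.
    # Assumption: the product is rounded down.
    return math.floor(speed.bucket_capacity * speed.batch_size_factor)


class TokenBucket:
    def __init__(self, speed: Speed):
        self.speed = speed
        self.tokens = float(speed.bucket_capacity)  # assumption: starts full

    def refill(self, intervals: int = 1) -> None:
        # "fixed" as assumed here: add refill_rate per interval, capped at capacity.
        added = self.speed.refill_rate * intervals
        self.tokens = min(float(self.speed.bucket_capacity), self.tokens + added)

    def take(self, wanted: int) -> int:
        # Grant as many whole tokens as are available, up to the amount wanted.
        granted = min(wanted, int(self.tokens))
        self.tokens -= granted
        return granted


speed = Speed(bucket_capacity=4, batch_size_factor=2.0)
bucket = TokenBucket(speed)
print(effective_batch_size(speed), bucket.take(8))
```

With a capacity of 4 and a factor of 2.0, the computed batch size is 8, while the bucket in this sketch can grant at most 4 tokens until it is refilled.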

Fetcher

config.fetcher configures retrieval. Only client is commonly set; the rest have sensible defaults:

fetcher:
  client:
    kind: standard
  timeout: 30s
  max_redirects: 10
  max_retries: 3
  retry_on_codes: [429, 451, 500, 502, 503, 504, 526]

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| client | object | standard HTTP | The fetch client. See Clients. |
| middleware | list | empty | Per-request transforms. See Middleware. |
| timeout | duration | 30s | Per-request timeout. Durations are written like 30s, 2m, 500ms. |
| max_redirects | integer | 10 | Maximum redirects followed per request. |
| max_retries | integer | 3 | Maximum retry attempts for a failed fetch. |
| retry_on_codes | list of integers | [429, 451, 500, 502, 503, 504, 526] | HTTP status codes that trigger a retry. |
| proxies | list of strings | none | Optional proxy URLs to route requests through. |
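Two of these fields lend themselves to a short illustration: the duration format and the retry rule. The sketch below handles only the three units shown on this page (ms, s, m); whether the real parser accepts other units or combined values is not stated here. The retry check assumes that max_retries counts retries after the first attempt, which is also an assumption.

```python
import re
from datetime import timedelta

# Only the units that appear on this page; other units are not assumed.
_UNITS = {"ms": "milliseconds", "s": "seconds", "m": "minutes"}
_DURATION = re.compile(r"(\d+)(ms|s|m)")

DEFAULT_RETRY_CODES = (429, 451, 500, 502, 503, 504, 526)


def parse_duration(text: str) -> timedelta:
    """Parse values such as '30s', '2m', or '500ms' into a timedelta."""
    match = _DURATION.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def should_retry(status: int, retries_so_far: int, max_retries: int = 3,
                 retry_on_codes=DEFAULT_RETRY_CODES) -> bool:
    """Retry only listed status codes, and only while retries remain."""
    return status in retry_on_codes and retries_so_far < max_retries


print(parse_duration("500ms"), should_retry(503, 0), should_retry(404, 0))
```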

Pipelines

config.pipelines is the heart of parsing. Each pipeline processes fetched content through an ordered set of steps, discovers links through a navigator, and delivers results through actions. A job may declare multiple pipelines.

pipelines:
  - identifier: main
    navigator:
      kind: anchor
    steps:
      - kind: extractor
        params:
          kind: markdown
    actions:
      - kind: to_blob
        params:
          directory: output

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| identifier | string | empty | A name for the pipeline, used in logs and to distinguish multiple pipelines. |
| guards | list | none | Early-exit checks evaluated before the pipeline runs. See Guards. |
| navigator | object | none | How child links are discovered. See Navigators. |
| steps | list | empty | The ordered processing steps (conditions, extractors, resolvers). See the parser pipeline. |
| actions | list | empty | What to do with the result. See Sink actions. |
| priority | integer | none | Optional ordering when multiple pipelines apply. |
| behavior | enum | default | How this pipeline interacts with others that match the same content. |

A step is itself a kind plus params. The common kinds are extractor (pull structured data, see Extractors), conditions (include or exclude the page), and asset_resolver (resolve and download referenced assets, see Resolvers).
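The flow described in this section (guards first, then ordered steps, link discovery through the navigator, and delivery through actions) can be pictured with a toy model. Everything here is hypothetical: the Pipeline class, the use of plain callables, the chaining of each step's output into the next, and the point at which the navigator runs are assumptions made for illustration, not a description of the real engine.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class Pipeline:
    identifier: str = ""
    guards: List[Callable[[str], bool]] = field(default_factory=list)
    navigator: Optional[Callable[[str], List[str]]] = None
    steps: List[Callable[[Any], Any]] = field(default_factory=list)
    actions: List[Callable[[Any], None]] = field(default_factory=list)


def run_pipeline(pipeline: Pipeline, content: str) -> Optional[List[str]]:
    """Return discovered links, or None if a guard stopped the pipeline."""
    # Guards are early-exit checks evaluated before anything else runs.
    if not all(guard(content) for guard in pipeline.guards):
        return None
    # The navigator discovers child links (assumed here to see the raw content).
    links = pipeline.navigator(content) if pipeline.navigator else []
    # Steps run in order; each is assumed to receive the previous step's output.
    result: Any = content
    for step in pipeline.steps:
        result = step(result)
    # Actions deliver the final result.
    for action in pipeline.actions:
        action(result)
    return links


sink: List[str] = []
main = Pipeline(
    identifier="main",
    navigator=lambda text: [w for w in text.split() if w.startswith("https://")],
    steps=[str.upper],
    actions=[sink.append],
)
print(run_pipeline(main, "see https://example.com/a"), sink)
```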

A complete example

The following manifest crawls a single site breadth-first, keeps the crawl on-domain with a regex filter, normalizes URLs, resolves up to three images per page, converts pages to Markdown, and writes everything to blob storage. It mirrors the example shipped at manifests/test-crawl.yml in the repository.

name: docs-crawl
labels:
  job: docs-crawl
config:
  user_agent: random
  max_queue_size: 15
  speed:
    bucket_capacity: 4
    batch_size_factor: 2.0
    refill_strategy: fixed
    refill_rate: 1.0
  stopping_criteria:
    - kind: max_urls
      params:
        max_urls: 50
  seeds:
    - kind: static_list
      params:
        urls:
          - https://example.com/docs
  ranker:
    kind: breadth
  fetcher:
    client:
      kind: standard
    timeout: 30s
  filters:
    - kind: regex_patterns
      params:
        allow:
          - https://example\.com/docs/.+
  mutators:
    - kind: sanitize
      params:
        strip_fragment: true
        strip_query: true
  pipelines:
    - navigator:
        kind: anchor
      steps:
        - kind: asset_resolver
          params:
            concurrency: 4
            timeout: 10s
            max_retries: 3
            resolvers:
              - kind: xpath
                params:
                  xpaths:
                    - //body/img/@src
                  max_items: 3
        - kind: extractor
          params:
            kind: markdown
      actions:
        - kind: to_blob
          params:
            directory: output
            key_strategy: 5min
            include_assets: true
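To see what the sanitize mutator and the regex_patterns filter in this example are doing to individual URLs, the following sketch reproduces their likely effect with Python's standard library. It is an approximation under stated assumptions: that strip_fragment and strip_query simply drop those URL parts, and that an allow pattern must match the whole URL. The function names are hypothetical, and the order in which the crawler applies mutators and filters is not stated on this page.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def sanitize(url: str, strip_fragment: bool = True, strip_query: bool = True) -> str:
    """Drop the query string and/or fragment from a URL (assumed behavior)."""
    parts = urlsplit(url)
    query = "" if strip_query else parts.query
    fragment = "" if strip_fragment else parts.fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


# The allow pattern from the manifest above.
ALLOW = [re.compile(r"https://example\.com/docs/.+")]


def allowed(url: str) -> bool:
    # Assumption: a URL is admitted when any allow pattern matches the whole URL.
    return any(pattern.fullmatch(url) for pattern in ALLOW)


clean = sanitize("https://example.com/docs/intro?page=2#top")
print(clean, allowed(clean), allowed("https://example.com/blog/post"))
```

Under the full-match assumption, the seed URL https://example.com/docs would not itself match the allow pattern; whether filters are applied to seeds is not stated here.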

• Author a crawl job walks through building a manifest from scratch.
• Components catalogs every kind and its parameters.
• Configuration covers server and deployment settings, which are separate from per-job manifest settings.
