Author a crawl job

Build a crawl-job manifest from the ground up, choosing each component deliberately.

This guide walks through writing a crawl-job manifest piece by piece, explaining the choices at each step. By the end you will be able to assemble a manifest for a real crawl rather than copying an example. For the exhaustive field reference, see Crawl manifest.

A manifest has three top-level fields: a name, an optional set of labels, and a config object that holds all the crawl behavior. Every field inside config has a default, so you only specify what you want to change. The sections below build a config up from the smallest useful job to a complete one.

Start with seeds and a stopping rule

The two things every crawl needs are a place to start and a reason to stop. Seeds provide the starting URLs. A stopping criterion ends the run so it does not crawl forever.

The simplest seed is a fixed list of URLs, and the simplest stopping rule is a cap on the number of URLs visited:

name: docs-crawl
config:
  seeds:
    - kind: static_list
      params:
        urls:
          - https://example.com/docs
  stopping_criteria:
    - kind: max_urls
      params:
        max_urls: 50

This is already a runnable job: it starts at one URL and stops after fifty URLs have been processed. Other stopping criteria include max_depth (stop once the crawl passes a given link distance from the seeds), max_age (stop after a set wall-clock duration), and max_empty_evaluations (stop once the crawl stops finding new work). You can list more than one; the run stops as soon as any one of them is met.
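
The "any of them" rule can be pictured with a small Python sketch. This is only an illustration of the semantics, not how inndx evaluates criteria internally: the RunState class, the helper names, and the 600-second limit are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RunState:
    """Hypothetical snapshot of a run's progress (not a real inndx type)."""
    urls_processed: int
    elapsed_seconds: float


# Each criterion is modeled as a predicate over the run state.
Criterion = Callable[[RunState], bool]


def max_urls(limit: int) -> Criterion:
    # Met once the number of processed URLs reaches the cap.
    return lambda state: state.urls_processed >= limit


def max_age(seconds: float) -> Criterion:
    # Met once the run has been going for the given wall-clock time.
    return lambda state: state.elapsed_seconds >= seconds


def should_stop(state: RunState, criteria: List[Criterion]) -> bool:
    # The run ends as soon as any single criterion is met.
    return any(criterion(state) for criterion in criteria)


criteria = [max_urls(50), max_age(600)]
print(should_stop(RunState(urls_processed=12, elapsed_seconds=30), criteria))   # False
print(should_stop(RunState(urls_processed=12, elapsed_seconds=601), criteria))  # True
```

In the second call only the age limit is reached, yet the run stops. That is why pairing a URL cap with a time limit works as a safety net for slow sites.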

Keep the crawl on target with filters

Left unfiltered, a crawl follows every link it discovers and can wander off the site you care about. Filters decide which discovered URLs are admitted to the queue and which are rejected.

The most common filter matches URLs against patterns. Use regex_patterns to allow and deny by regular expression:

  filters:
    - kind: regex_patterns
      params:
        allow:
          - https://example\.com/docs/.+
        deny:
          - https://example\.com/docs/changelog/.+

A URL must match at least one allow pattern and must not match any deny pattern to be admitted. Other useful filters include max_depth (admit only URLs within a given link distance of the seeds), budget (cap how many URLs are admitted per host), and robots_txt (respect a site's robots rules). Filters work together, so you can combine a pattern filter with a per-host budget to stay both on-topic and polite.
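
The allow-and-deny rule can be tried out with a short Python sketch that uses the two patterns from the snippet. It assumes a pattern has to match the entire URL (re.fullmatch). Whether regex_patterns anchors patterns this way is not stated here, so check the filter reference before relying on it. The is_admitted helper is hypothetical.

```python
import re

ALLOW = [r"https://example\.com/docs/.+"]
DENY = [r"https://example\.com/docs/changelog/.+"]


def is_admitted(url, allow=ALLOW, deny=DENY):
    # Assumption: a pattern must match the whole URL, not just a prefix.
    matches_allow = any(re.fullmatch(pattern, url) for pattern in allow)
    matches_deny = any(re.fullmatch(pattern, url) for pattern in deny)
    # Admitted only if some allow pattern matches and no deny pattern does.
    return matches_allow and not matches_deny


for url in [
    "https://example.com/docs/intro",         # allowed
    "https://example.com/docs/changelog/v2",  # allowed, then denied
    "https://example.com/blog/post",          # never allowed
    "https://example.com/docs",               # no path after /docs/
]:
    print(url, is_admitted(url))
```

Under this reading the bare https://example.com/docs URL does not match the allow pattern, because the pattern requires at least one character after /docs/. If discovered links to that URL should be admitted too, the pattern would need widening.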

Choose a traversal order

The ranker decides the order in which queued URLs are visited. The choice affects how coverage builds up over a partial crawl.

  ranker:
    kind: breadth

A breadth ranker visits URLs closer to the seeds first, which spreads coverage evenly and is a good default for site-wide crawls. A depth ranker follows links deeper before widening, which reaches far pages sooner but covers the site unevenly if the run is cut short. There is also a page_rank ranker that orders URLs by link importance.
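
The difference shows most clearly when a run is cut short. The Python sketch below walks a small, made-up link graph with a visit limit of three, once taking the oldest queued URL first (breadth) and once the newest (depth). It illustrates the two orderings in general and is not the rankers' actual implementation.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/1", "/docs/a/2"],
    "/docs/b": ["/docs/b/1"],
    "/docs/a/1": [],
    "/docs/a/2": [],
    "/docs/b/1": [],
}


def visit_order(seed, breadth_first, limit):
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier and len(order) < limit:
        # Breadth takes the oldest queued URL; depth takes the newest.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for child in LINKS[url]:
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


print(visit_order("/docs", breadth_first=True, limit=3))
# ['/docs', '/docs/a', '/docs/b']
print(visit_order("/docs", breadth_first=False, limit=3))
# ['/docs', '/docs/b', '/docs/b/1']
```

With the same budget of three visits, the breadth order touches both top-level sections, while the depth order reaches a second-level page but never visits /docs/a.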

Normalize URLs

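Many sites link to the same page through several URLs that differ only by a trailing fragment or query string. Each variant would otherwise be queued as a separate URL. The sanitize mutator strips those parts before a URL enters the queue, so duplicates collapse into one. Enable strip_query only when the query string does not select different content (pagination parameters, for example), because stripping it collapses those pages as well: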

  mutators:
    - kind: sanitize
      params:
        strip_fragment: true
        strip_query: true
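
To see what this kind of normalization does to a URL, here is a rough Python equivalent built on the standard library. The sanitize function is a hypothetical stand-in that mirrors the two options above. The real mutator may normalize more than this.

```python
from urllib.parse import urlsplit, urlunsplit


def sanitize(url, strip_fragment=True, strip_query=True):
    parts = urlsplit(url)
    # Blank out the components that should not distinguish one URL from another.
    query = "" if strip_query else parts.query
    fragment = "" if strip_fragment else parts.fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


variants = [
    "https://example.com/docs/intro",
    "https://example.com/docs/intro#install",
    "https://example.com/docs/intro?utm_source=newsletter",
]
print({sanitize(url) for url in variants})
# {'https://example.com/docs/intro'}

# Keeping the query preserves URLs that differ by a meaningful parameter.
print(sanitize("https://example.com/docs/list?page=2#top", strip_query=False))
# https://example.com/docs/list?page=2
```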

For the full set of seeds, filters, rankers, and mutators, see Seed and filter URLs.

Configure fetching

The fetcher retrieves each URL. Its most important setting is the client, which decides how the request is made. The default client, standard, fetches raw HTML with a plain HTTP request rather than a browser, which makes it fast:

  fetcher:
    client:
      kind: standard
    timeout: 30s
    max_retries: 3

The standard client cannot run JavaScript, so sites that render their content in the browser come back empty or incomplete. For those, inndx offers browser-based clients. See Fetch with a browser for when and how to use them.

Define the parsing pipeline

A pipeline turns fetched content into results. It discovers links to follow through a navigator, processes content through an ordered list of steps, and delivers output through actions. A job can declare more than one pipeline.

  pipelines:
    - navigator:
        kind: anchor
      steps:
        - kind: extractor
          params:
            kind: markdown
      actions:
        - kind: to_blob
          params:
            directory: output

The anchor navigator discovers child links from the page's anchor tags, which is what lets the crawl widen beyond its seeds. The extractor step of kind markdown converts each page to clean markdown. The to_blob action writes the result to blob storage under the output directory.
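
For intuition about anchor-based link discovery, the Python sketch below collects the href of every anchor tag in a page and resolves it against the page's URL. The AnchorCollector class and discover_links function are hypothetical. They show the general technique, not the anchor navigator's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorCollector(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the URL of the fetched page.
            self.links.append(urljoin(self.base_url, href))


def discover_links(base_url, html):
    collector = AnchorCollector(base_url)
    collector.feed(html)
    return collector.links


page = (
    '<p><a href="/docs/install">Install</a> '
    '<a href="guide/start">Start</a> '
    '<a name="top">no href</a></p>'
)
print(discover_links("https://example.com/docs/", page))
# ['https://example.com/docs/install', 'https://example.com/docs/guide/start']
```

Each discovered URL would then pass through the mutators and filters described earlier before it is queued.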

Steps run in order. Besides extractor, a step can be a condition (decide whether to keep the page before extracting) or an asset_resolver (resolve and download referenced assets such as images). For the full set of extractors see Extract structured data; for delivery options see Deliver results.

Validate and run

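Put the sections together into one manifest file, docs-crawl.yml. The deny pattern and max_retries from the earlier snippets are left out here, so those settings fall back to their defaults: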

name: docs-crawl
config:
  seeds:
    - kind: static_list
      params:
        urls:
          - https://example.com/docs
  stopping_criteria:
    - kind: max_urls
      params:
        max_urls: 50
  filters:
    - kind: regex_patterns
      params:
        allow:
          - https://example\.com/docs/.+
  ranker:
    kind: breadth
  mutators:
    - kind: sanitize
      params:
        strip_fragment: true
        strip_query: true
  fetcher:
    client:
      kind: standard
    timeout: 30s
  pipelines:
    - navigator:
        kind: anchor
      steps:
        - kind: extractor
          params:
            kind: markdown
      actions:
        - kind: to_blob
          params:
            directory: output

The quickest way to try a manifest is the run command, which executes a single crawl from a file without standing up a server:

inndx run docs-crawl.yml

To create a persistent job that you can run repeatedly, trigger on a schedule, and inspect over time, submit the manifest instead of using run. Add --start-run to start a run immediately on creation:

crawlctl jobs create -f docs-crawl.yml --start-run

The response is the created job, including its generated id. Use that id (or the job's name) to start more runs, attach triggers, and query progress. Once the job exists, keep it up to date declaratively with crawlctl jobs apply -f docs-crawl.yml, which creates it if it is missing or updates it if it already exists. To confirm a run is underway and watch it advance, see Inspect runs and logs.