Your first crawl

Write a small manifest and run a complete crawl in a single process with the run command.

The fastest way to see inndx work is the run command. It takes one crawl-job manifest, executes the entire pipeline in a single process, and exits when the crawl finishes. No server, no database to provision, no message broker. This page takes you from an empty file to markdown results on disk.

You need image access

The inndx container image is distributed through a private registry that you receive access to during enterprise onboarding. If you have not set that up yet, see your onboarding materials for registry access. The commands below assume you can pull from that registry.

Write a manifest

A manifest is a YAML file describing one crawl job. Create first-crawl.yml with the following content. Every block is selected by a kind; read the comments to see what each one does.

name: first-crawl
config:
  # Start from a single page.
  seeds:
    - kind: static_list
      params:
        urls:
          - https://example.com/
  # Stop after fetching at most 10 URLs so the demo finishes quickly.
  stopping_criteria:
    - kind: max_urls
      params:
        max_urls: 10
  # Crawl breadth-first.
  ranker:
    kind: breadth
  # Use the plain HTTP client (no browser).
  fetcher:
    client:
      kind: standard
    timeout: 30s
  # Follow links, then convert each page to markdown.
  pipelines:
    - navigator:
        kind: anchor
      steps:
        - kind: extractor
          params:
            kind: markdown
      # Write each result to a local directory.
      actions:
        - kind: to_blob
          params:
            directory: output

This manifest starts at https://example.com/, follows anchor links breadth-first, stops after at most 10 URLs, converts each fetched page to markdown, and writes the results to an output directory.

Run it

Run the manifest inside the container, mounting the current directory so inndx can read your manifest and write results back out:

docker run --rm \
  -v "$PWD:/work" -w /work \
  registry.pogue.dev/inndx/inndx:0.3.0 \
  run first-crawl.yml

The run subcommand takes the manifest path as its only required argument. For single-process runs, inndx automatically uses an in-memory database, cache, and broker, so there is nothing else to configure.

What you should see

inndx logs each stage as URLs move through the pipeline. You will see lines indicating URLs being scheduled, fetched, and parsed, and the run stopping once the max_urls limit is reached:

INFO inndx: starting crawl job run name=first-crawl
INFO scheduled batch urls=1
INFO fetched url=https://example.com/ status=200
INFO parsed url=https://example.com/ extracted=markdown
...
INFO crawl job run finished reason=max_urls

When it exits, look in ./output. You will find the markdown extracted from each fetched page.

Tune the crawl

Try raising max_urls, swapping ranker to kind: depth for depth-first traversal, or adding a filters block to keep the crawl on one site. The full set of options is in Reference, Crawl manifest.
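
As a rough sketch, the first two suggestions could look like the fragment below. It shows only the two blocks that change and assumes everything else in first-crawl.yml stays as written. The value 50 is an arbitrary illustration, not a recommended setting. The filters block is left out because its schema is not shown on this page; see the reference for its parameters.

```yaml
config:
  # Allow a larger crawl than the demo (50 is an illustrative value only).
  stopping_criteria:
    - kind: max_urls
      params:
        max_urls: 50
  # Traverse depth-first instead of breadth-first.
  ranker:
    kind: depth
```

After editing the manifest, run the same docker run command again to see how the order and number of fetched pages change.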

Next

run is great for one-off crawls and trying things out, but it has no API and exits as soon as the crawl finishes. To create and manage jobs over HTTP, move on to Run the server.
