Your first crawl
Write a small manifest and run a complete crawl in a single process with the run command.
The fastest way to see inndx work is the run command. It takes one crawl-job manifest, executes the entire pipeline in a single process, and exits when the crawl finishes. No server, no database to provision, no message broker. This page takes you from an empty file to markdown results on disk.
The inndx container image is distributed through a private registry you receive access to during enterprise onboarding. If you have not set that up yet, see your onboarding materials for registry access. The commands below assume you can pull from that registry.
Write a manifest
A manifest is a YAML file describing one crawl job. Create first-crawl.yml with the following content. Every block is selected by a kind; read the comments to see what each one does.
name: first-crawl
config:
  # Start from a single page.
  seeds:
    - kind: static_list
      params:
        urls:
          - https://example.com/
  # Stop after fetching at most 10 URLs so the demo finishes quickly.
  stopping_criteria:
    - kind: max_urls
      params:
        max_urls: 10
  # Crawl breadth-first.
  ranker:
    kind: breadth
  # Use the plain HTTP client (no browser).
  fetcher:
    client:
      kind: standard
      timeout: 30s
  # Follow links, then convert each page to markdown.
  pipelines:
    - navigator:
        kind: anchor
      steps:
        - kind: extractor
          params:
            kind: markdown
      # Write each result to a local directory.
      actions:
        - kind: to_blob
          params:
            directory: output
This crawls example.com, follows anchor links breadth-first up to 10 pages, converts each page to markdown, and writes the results into an output directory.
Run it
Run the manifest inside the container, mounting the current directory so inndx can read your manifest and write results back out:
docker run --rm \
  -v "$PWD:/work" -w /work \
  registry.pogue.dev/inndx/inndx:0.3.0 \
  run first-crawl.yml
The run subcommand takes the manifest path as its only required argument. inndx applies an in-memory database, cache, and broker automatically for single-process runs, so there is nothing else to configure.
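If you want to keep the log for later, one option is to wrap the same command in a small shell function that tees its output to a file. This is a sketch, not an inndx feature: run_crawl and crawl.log are illustrative names, and since this page does not say which stream inndx logs to, both stdout and stderr are captured.

```shell
#!/bin/sh
# Run a manifest with the container command from this page and keep a copy
# of the log. run_crawl and crawl.log are illustrative names, not part of inndx.
run_crawl() {
    manifest="${1:-first-crawl.yml}"
    # 2>&1 merges stderr into stdout so the log is captured whichever
    # stream inndx writes to; tee prints it live and saves it to crawl.log.
    docker run --rm \
        -v "$PWD:/work" -w /work \
        registry.pogue.dev/inndx/inndx:0.3.0 \
        run "$manifest" 2>&1 | tee crawl.log
}
```

Call it as run_crawl first-crawl.yml from the directory that holds the manifest. Because the command ends in a pipe, the function's exit status is tee's, not the container's.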
What you should see
inndx logs each stage as URLs move through the pipeline. You will see lines indicating URLs being scheduled, fetched, and parsed, and the run stopping once the max_urls limit is reached:
INFO inndx: starting crawl job run name=first-crawl
INFO scheduled batch urls=1
INFO fetched url=https://example.com/ status=200
INFO parsed url=https://example.com/ extracted=markdown
...
INFO crawl job run finished reason=max_urls
When it exits, look in ./output. You will find the markdown extracted from each fetched page.
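A quick way to check what landed in the output directory is to count and list the files there. The helper below is a minimal sketch using standard tools only. summarize_output is a hypothetical name, and it makes no assumption about the file names or extensions inndx writes.

```shell
#!/bin/sh
# Summarize a results directory: print the file count, then each file's size.
# "output" matches the to_blob directory in first-crawl.yml.
summarize_output() {
    dir="${1:-output}"
    if [ ! -d "$dir" ]; then
        echo "no such directory: $dir" >&2
        return 1
    fi
    # Count regular files at any depth; tr strips the padding some wc builds add.
    count=$(find "$dir" -type f | wc -l | tr -d ' ')
    echo "files: $count"
    # Print the byte size of each file (wc adds a total line for several files).
    find "$dir" -type f -exec wc -c {} +
}
```

Call it as summarize_output output after the run exits. With max_urls set to 10, the count should not exceed what that limit allows, though the exact number of files per page is not specified here.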
Try raising max_urls, swapping ranker to kind: depth for depth-first traversal, or adding a filters block to keep the crawl on one site. The full set of options is in Reference, Crawl manifest.
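The first two suggestions above can be tried without touching the original manifest by writing a modified copy. The sketch below does this with sed. make_variant, second-crawl.yml, and the limit of 50 are illustrative choices, and the substitutions assume the manifest still contains the exact lines max_urls: 10 and kind: breadth.

```shell
#!/bin/sh
# Copy a manifest, raising max_urls from 10 to a new limit and switching the
# ranker from breadth to depth. make_variant is an illustrative helper.
make_variant() {
    src="$1"; dst="$2"; limit="${3:-50}"
    # The trailing $ anchors each match to the end of the line, so the
    # "- kind: max_urls" line under stopping_criteria is left untouched.
    sed -e "s/max_urls: 10\$/max_urls: $limit/" \
        -e 's/kind: breadth$/kind: depth/' \
        "$src" > "$dst"
}
```

For example, make_variant first-crawl.yml second-crawl.yml 50 writes the copy, which you can then pass to run in place of first-crawl.yml. The copy keeps name: first-crawl, so edit that line by hand if you want the runs to be distinguishable in the logs.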
Next
run is great for one-off crawls and trying things out, but it has no API and exits as soon as the crawl finishes. To create and manage jobs over HTTP, move on to Run the server.