Concepts in 5 minutes

The smallest mental model of inndx you need before running your first crawl.

Before you run anything, it helps to hold a small map of how inndx works. This page gives you just enough vocabulary to follow the rest of the Get started guides. You do not need to understand the internals yet. The Concepts section covers those in depth.

A crawl is a pipeline

inndx turns a crawl job (a single configuration describing what to crawl and how) into structured results. When a job runs, it produces a crawl job run, or just "run": one execution with its own lifecycle and logs.

Work flows through four stages:

1. Orchestrator

The Orchestrator is the control plane. It decides which URLs to fetch next, holds the frontier (the set of URLs queued for fetching), and manages the job and run lifecycle, scheduling, and triggers.

2. Fetcher

The Fetcher retrieves each URL's content, either with a plain HTTP client or a real browser, and stores the raw response.

3. Parser

The Parser turns raw content into structured data, such as markdown and extracted fields. It also discovers new links and hands them back to the Orchestrator, which widens the crawl.

4. Sink

The Sink delivers the parsed results to one or more outputs, such as blob storage or a webhook.
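To make the four stages concrete, here is a toy sketch in Python. It is not inndx code: the function names, the in-memory FAKE_WEB dictionary, and the record shape are all assumptions made for illustration. It only shows how a frontier, a fetch step, a parse step, and a sink can fit together in a loop.

```python
from collections import deque
import re

# Hypothetical in-memory "web" (URL -> raw HTML); stands in for real fetching.
FAKE_WEB = {
    "https://example.com/": '<a href="https://example.com/a">A</a> <a href="https://example.com/b">B</a>',
    "https://example.com/a": '<a href="https://example.com/b">B</a> page a',
    "https://example.com/b": "page b, no links",
}

def fetch(url):
    """Fetcher: retrieve the raw content for one URL."""
    return FAKE_WEB.get(url, "")

def parse(url, raw):
    """Parser: build a structured record and discover new links."""
    links = re.findall(r'href="([^"]+)"', raw)
    text = re.sub(r"<[^>]+>", "", raw).strip()
    return {"url": url, "text": text}, links

def crawl(seeds, sink):
    """Orchestrator: hold the frontier and decide which URL to fetch next."""
    frontier = deque(seeds)   # URLs queued for fetching
    seen = set(seeds)         # avoid queueing the same URL twice
    while frontier:
        url = frontier.popleft()
        record, links = parse(url, fetch(url))
        sink(record)          # Sink: deliver the parsed result
        for link in links:    # discovered links widen the crawl
            if link not in seen:
                seen.add(link)
                frontier.append(link)

results = []
crawl(["https://example.com/"], results.append)
print([r["url"] for r in results])
```

In this sketch the sink is just a list's append method. Any callable that accepts a record would do, which mirrors the idea that results can go to more than one kind of output.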

A fifth service, the analytics service, watches the whole pipeline and records audit and per-run logs you can query over an API. You will meet it in Inspect runs and logs.

Everything is configurable by kind

Almost every behavior in the pipeline is pluggable. A crawl job picks each behavior by naming a kind. For example, a seed of kind: static_list starts the crawl from a fixed list of URLs; a filter of kind: pattern keeps or drops URLs by regular expression; an extractor of kind: markdown converts a page to markdown.

This is why the manifest you will write is mostly a list of kind plus params blocks. The full catalog of available kinds lives in Reference, Components.
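One common way to implement this sort of pluggability is a registry that maps each kind name to a factory. The Python sketch below illustrates the idea only. The registry, the build function, and the parameter names urls and allow are hypothetical and are not taken from inndx; the real kinds and their params are listed in the Reference.

```python
import re

# Hypothetical registry: kind name -> factory that accepts a params dict.
REGISTRY = {}

def register(kind):
    """Decorator that records a factory under the given kind name."""
    def decorator(factory):
        REGISTRY[kind] = factory
        return factory
    return decorator

@register("static_list")
def static_list(params):
    # Seed: start from a fixed list of URLs ("urls" is an assumed param name).
    urls = list(params["urls"])
    return lambda: urls

@register("pattern")
def pattern(params):
    # Filter: keep URLs matching a regular expression ("allow" is assumed).
    allow = re.compile(params["allow"])
    return lambda url: bool(allow.search(url))

def build(block):
    """Look up the block's kind and build the component from its params."""
    kind = block["kind"]
    if kind not in REGISTRY:
        raise ValueError(f"unknown kind: {kind}")
    return REGISTRY[kind](block.get("params", {}))

seed = build({
    "kind": "static_list",
    "params": {"urls": ["https://example.com/docs", "https://example.com/blog"]},
})
keep = build({"kind": "pattern", "params": {"allow": r"/docs"}})

kept = [u for u in seed() if keep(u)]
print(kept)
```

Each dictionary passed to build has the same shape as a kind plus params block, which is why a manifest built this way reads as a list of such blocks.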

Two ways to run it

You will use both in the next pages:

  • run executes a single crawl job from a manifest file, in one process, and exits when the crawl finishes. It is the fastest way to see a result. Start here in Your first crawl.
  • dev starts the services as a long-running server with an HTTP API, so you can create and manage jobs over REST. You will use it in Run the server.

One process now, many later

In these tutorials everything runs in a single process for simplicity. In production each stage is deployed separately and the stages communicate through a message broker. That is a deployment concern, not a code change. See Deployment modes.
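The sketch below shows one way such a split can stay out of stage code: each stage talks only to a small channel interface, so an in-memory queue can later be replaced by a broker-backed channel with the same send and receive methods. This is an illustration under assumptions, not inndx's actual design. The class InMemoryChannel, the function fetcher_step, and the message shape are all hypothetical.

```python
import queue

class InMemoryChannel:
    """Single-process transport: a thin wrapper over queue.Queue."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, msg):
        self._q.put(msg)

    def receive(self):
        # Short timeout so an empty channel raises queue.Empty
        # instead of blocking forever.
        return self._q.get(timeout=1)

def fetcher_step(inbox, outbox):
    """Hypothetical Fetcher step: read one URL message, emit raw content."""
    msg = inbox.receive()
    outbox.send({"url": msg["url"], "raw": f"<raw content of {msg['url']}>"})

# Wire two channels together in one process.
to_fetch = InMemoryChannel()
fetched = InMemoryChannel()

to_fetch.send({"url": "https://example.com/"})
fetcher_step(to_fetch, fetched)

result = fetched.receive()
print(result)
```

Because fetcher_step depends only on send and receive, a channel backed by a message broker could be passed in without touching the stage itself.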

What you will do next

In Your first crawl you will write a small manifest, execute it with the run command, and watch inndx fetch pages, convert them to markdown, and write the results out.
