
Reliability

How inndx ensures work is not lost when a service restarts mid-crawl.

When inndx runs in distributed mode, four separate services are running at the same time. Any one of them can restart because of a deployment, a crash, or a node replacement. This page explains what inndx guarantees in those situations and what that means for your results.

Services are decoupled through a message broker

The stages do not call each other directly. When the orchestrator has URLs ready for the fetcher, it does not make a request to the fetcher's API. Instead it publishes to a shared message broker, and the fetcher picks work up from there on its own terms. The same pattern holds for every hand-off in the pipeline.

This decoupling means a restart in one stage does not cascade into the others. If the parser restarts, the fetcher keeps fetching and its output waits in the broker. When the parser comes back up, it resumes from where it left off. Each stage fails and recovers independently.
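The hand-off described above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory `Broker` class and a made-up topic name, `fetched`. Neither is taken from inndx, and a real broker would be a separate, durable service.

```python
from collections import deque


class Broker:
    """Hypothetical in-memory stand-in for the shared message broker."""

    def __init__(self):
        self.queues = {}

    def publish(self, topic, message):
        self.queues.setdefault(topic, deque()).append(message)

    def poll(self, topic):
        queue = self.queues.get(topic)
        return queue.popleft() if queue else None


def fetcher_publish(broker, urls):
    # The fetcher never calls the parser; it only publishes its output.
    for url in urls:
        broker.publish("fetched", {"url": url})


def parser_drain(broker):
    # The parser pulls work at its own pace once it is running.
    parsed = []
    while (message := broker.poll("fetched")) is not None:
        parsed.append(message["url"])
    return parsed


broker = Broker()

# The parser is "down": the fetcher keeps publishing and output waits in the broker.
fetcher_publish(broker, ["https://example.com/a", "https://example.com/b"])
waiting = len(broker.queues["fetched"])

# The parser comes back and picks up everything that accumulated.
recovered = parser_drain(broker)
```

Because the fetcher only depends on the broker being reachable, the parser's downtime shows up as a growing queue rather than as an error in the fetcher.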

Work is not lost on restart

inndx guarantees that no work is silently dropped and that each unit of work is processed exactly once. To achieve this, inndx uses an outbox pattern: a stage writes outgoing work to durable storage in the same operation as the state change it represents, and only then does a background process deliver that work to the broker. If a service crashes between the write and the delivery, the work is still in storage when the service comes back, and delivery proceeds from there. Once a consumer has processed a piece of work, that work is marked complete and is not delivered again.

The practical effect is that a service restart mid-crawl does not silently drop or duplicate URLs. The crawl picks back up from a safe checkpoint.
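As a rough illustration of the outbox pattern in general, the sketch below uses SQLite as a stand-in for durable storage. The table names (`urls`, `outbox`), the functions `schedule_fetch` and `relay_once`, and the `publish` callback are all hypothetical. This is not inndx's schema or code, and an in-memory database is used only so the example runs on its own.

```python
import json
import sqlite3

# In-memory for the example only; the pattern relies on storage that survives restarts.
outbox_db = sqlite3.connect(":memory:")
outbox_db.executescript("""
    CREATE TABLE urls (url TEXT PRIMARY KEY, state TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        payload TEXT NOT NULL,
        delivered INTEGER NOT NULL DEFAULT 0
    );
""")


def schedule_fetch(url):
    # The state change and the outgoing message commit together or not at all.
    with outbox_db:
        outbox_db.execute(
            "INSERT INTO urls (url, state) VALUES (?, 'scheduled')", (url,)
        )
        outbox_db.execute(
            "INSERT INTO outbox (payload) VALUES (?)", (json.dumps({"url": url}),)
        )


def relay_once(publish):
    # Background relay: deliver pending rows in order, then mark each one delivered.
    rows = outbox_db.execute(
        "SELECT id, payload FROM outbox WHERE delivered = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))
        with outbox_db:
            outbox_db.execute(
                "UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,)
            )
    return len(rows)


broker_messages = []
schedule_fetch("https://example.com/a")

# Imagine the service restarting here: the row is still waiting in the outbox.
pending_after_restart = outbox_db.execute(
    "SELECT COUNT(*) FROM outbox WHERE delivered = 0"
).fetchone()[0]

delivered_first_pass = relay_once(broker_messages.append)
delivered_second_pass = relay_once(broker_messages.append)
```

In this sketch, a crash between `publish` and the `UPDATE` would cause the relay to publish the same row again on its next pass, so the consuming side would still need to recognise work it has already completed.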

What this means for your results

Each unit of work is processed exactly once. Once a stage has consumed and handled a piece of work, it is not handed that work again, even after a restart. An interrupted crawl that is resumed continues from a safe checkpoint without re-processing work that was already completed.
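One common way to get this behaviour on the consuming side is to record a completion marker in the same transaction as the result. The sketch below shows that idea with SQLite. The `completed` and `results` tables, the `handle` function, and the `work_id` values are hypothetical and are not taken from inndx.

```python
import sqlite3

consumer_db = sqlite3.connect(":memory:")
consumer_db.execute("CREATE TABLE completed (work_id TEXT PRIMARY KEY)")
consumer_db.execute("CREATE TABLE results (work_id TEXT, output TEXT)")


def handle(work_id, url):
    """Process a unit of work unless it is already marked complete.

    Returns True if the work ran, False if it was skipped as a repeat.
    """
    try:
        with consumer_db:
            # The PRIMARY KEY rejects a second completion marker for the same work,
            # which rolls back the whole transaction, including the result row.
            consumer_db.execute(
                "INSERT INTO completed (work_id) VALUES (?)", (work_id,)
            )
            consumer_db.execute(
                "INSERT INTO results (work_id, output) VALUES (?, ?)",
                (work_id, f"parsed:{url}"),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_attempt = handle("w-1", "https://example.com/a")
# Simulate the same work arriving again after a restart.
second_attempt = handle("w-1", "https://example.com/a")
result_count = consumer_db.execute("SELECT COUNT(*) FROM results").fetchone()[0]
```

Writing the marker and the result together means a repeat delivery cannot produce a second result row, even if it arrives after a restart.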

Throughput through batching

Internally, work moves between stages in batches rather than one URL at a time. Batch sizes and timeouts are configurable per service. Batching reduces overhead on the broker and the database and keeps throughput high even at large crawl volumes.
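A batch that is bounded by both a size and a timeout can be sketched as follows. The `Batcher` class and its `max_size` and `max_wait_seconds` parameters are hypothetical names chosen for illustration. They are not inndx's configuration keys.

```python
import time


class Batcher:
    """Collects items and flushes when the batch is full or has waited too long."""

    def __init__(self, max_size, max_wait_seconds, flush, clock=time.monotonic):
        self.max_size = max_size
        self.max_wait_seconds = max_wait_seconds
        self.flush = flush
        self.clock = clock
        self.items = []
        self.first_added_at = None

    def add(self, item):
        if not self.items:
            # Start the timeout when the first item of a new batch arrives.
            self.first_added_at = self.clock()
        self.items.append(item)
        if len(self.items) >= self.max_size:
            self._emit()

    def tick(self):
        # Called periodically; flushes a partial batch once it has waited long enough.
        if self.items and self.clock() - self.first_added_at >= self.max_wait_seconds:
            self._emit()

    def _emit(self):
        batch, self.items = self.items, []
        self.first_added_at = None
        self.flush(batch)


# A fake clock keeps the example deterministic.
now = [0.0]
batches = []
batcher = Batcher(
    max_size=3, max_wait_seconds=5.0, flush=batches.append, clock=lambda: now[0]
)

for n in range(4):
    batcher.add(f"https://example.com/{n}")  # the third add triggers a size flush

now[0] = 6.0
batcher.tick()  # the leftover item is flushed by the timeout
```

The size limit caps how much work one broker or database round trip carries, while the timeout keeps a slow trickle of URLs from waiting indefinitely for a batch to fill.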
