Architecture overview

How the inndx services and shared infrastructure fit together.

inndx crawls the web by passing each URL through a short sequence of stages, each responsible for one part of the job. This page is the map: what each part does, what infrastructure they all share, and how they connect. Read it first, then follow the links at the end into the more detailed concept pages.

The four stages

A crawl moves through four stages. Each is a separate concern, and in production each can run and scale on its own.

Orchestrator. The orchestrator decides what to crawl. It owns your crawl jobs and their runs, holds the set of URLs waiting to be visited, applies your seeding and filtering rules to choose what goes next, and tracks a run from start to finish. It is also where scheduled and reactive triggers fire to start new runs.

Fetcher. The fetcher retrieves the content of each URL the orchestrator schedules. It performs the request (over plain HTTP, or with a real browser for sites that need JavaScript), follows redirects, and hands the retrieved content onward.

Parser. The parser turns retrieved pages into results. It extracts the structured data you asked for, decides whether a page should be kept, and discovers new links to follow so the crawl can widen.

Sink. The sink delivers results. It runs the output actions you configured against each parsed result, for example writing it to storage or labelling the URL it came from.
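
The division of responsibilities above can be sketched in a few lines of Python. This is an illustration only, not inndx code: every name here (`Orchestrator`, `fetch`, `parse`, `sink`, `run`, the `allow` predicate, and the `site` dictionary that stands in for the network) is hypothetical. For brevity the sketch wires the stages together in one loop, whereas in inndx they communicate through the message broker.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    body: str


@dataclass
class Result:
    url: str
    data: dict
    links: list
    keep: bool


class Orchestrator:
    """Holds the URLs waiting to be visited and decides what goes next."""

    def __init__(self, seeds, allow):
        self.allow = allow          # hypothetical filtering rule: a predicate on the URL
        self.seen = set()
        self.frontier = deque()
        for url in seeds:           # seeding
            self.offer(url)

    def offer(self, url):
        # Schedule a URL only once, and only if the filter accepts it.
        if url not in self.seen and self.allow(url):
            self.seen.add(url)
            self.frontier.append(url)

    def next_url(self):
        return self.frontier.popleft() if self.frontier else None


def fetch(url, site):
    # Stand-in for an HTTP or browser request: look the URL up in a dict.
    return Page(url, site.get(url, ""))


def parse(page):
    # Extract data, decide whether to keep the page, and discover links.
    links = [token for token in page.body.split() if token.startswith("/")]
    return Result(page.url, {"length": len(page.body)}, links, keep=bool(page.body))


def sink(result, store):
    # Stand-in for an output action: write the extracted data to storage.
    store[result.url] = result.data


def run(seeds, site, allow):
    orchestrator = Orchestrator(seeds, allow)
    store = {}
    while (url := orchestrator.next_url()) is not None:
        result = parse(fetch(url, site))
        if result.keep:
            sink(result, store)
        for link in result.links:   # discovered links widen the crawl
            orchestrator.offer(link)
    return store
```

Calling `run(["/a"], site, allow)` with a small `site` dictionary visits each reachable, allowed URL once and returns the delivered results keyed by URL.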

The analytics service

Alongside the four stages, inndx runs an analytics service. It is the source of the operational history you can query: an audit log of configuration changes, a per-run crawl log of what happened to each URL, and aggregated metrics and latency for a run. When you want to see how a run progressed or what changed and when, that data comes from here.
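
As a rough illustration of the kind of aggregation described here, the sketch below rolls per-URL crawl log entries up into per-run metrics. The record shape (`CrawlLogEntry` and its fields), the outcome values, and the `summarize` function are assumptions made for this example; they are not the analytics service's actual schema or API.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CrawlLogEntry:
    # Hypothetical fields, chosen for illustration only.
    run_id: str
    url: str
    outcome: str          # e.g. "fetched", "filtered", "failed"
    latency_ms: float


def summarize(entries, run_id):
    """Aggregate the crawl log entries of one run into simple metrics."""
    rows = [entry for entry in entries if entry.run_id == run_id]
    return {
        "urls": len(rows),
        "outcomes": dict(Counter(entry.outcome for entry in rows)),
        "mean_latency_ms": mean(entry.latency_ms for entry in rows) if rows else None,
    }
```

A real deployment would query the analytics service rather than aggregate in memory; the point is only that run-level metrics are derived from the per-URL log.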

The infrastructure it relies on

inndx does not store everything itself. It relies on a few pieces of supporting infrastructure that you provide and point it at:

  • A database holds your crawl state: jobs, runs, the URLs seen, and the crawl graph.
  • Blob storage holds large content: the raw pages that were fetched, other assets, and the results you deliver to storage.
  • A cache speeds up repeated work during a crawl.
  • A message broker carries work between the stages so they stay decoupled.

In a quick local setup these can be lightweight and run on the same machine. In production they are real, separately operated services. What each one is and how it is provisioned for a supported deployment are covered in your onboarding materials.

How it fits together

The stages do not call each other directly. Each one publishes its output to the message broker, and the next stage consumes it. This is what lets inndx run as a single process for evaluation and as independently scaled services in production without changing anything but configuration.
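
The publish-and-consume pattern can be sketched with an in-memory stand-in for the broker. Everything below is hypothetical: the `Broker` class, the topic names (`schedule`, `fetch`, `parse`, `deliver`), and the `site` dictionary that replaces real fetching are invented for illustration and do not reflect inndx's actual interfaces. The sketch shows only the shape of the design: no stage holds a reference to another stage, only to the broker.

```python
from collections import defaultdict, deque


class Broker:
    """In-memory stand-in for a message broker: one FIFO queue per topic."""

    def __init__(self):
        self.queues = defaultdict(deque)
        self.handlers = {}

    def subscribe(self, topic, handler):
        self.handlers[topic] = handler

    def publish(self, topic, message):
        self.queues[topic].append(message)

    def drain(self):
        """Deliver queued messages until every subscribed topic is empty."""
        progressed = True
        while progressed:
            progressed = False
            for topic, handler in list(self.handlers.items()):
                queue = self.queues[topic]
                while queue:
                    handler(queue.popleft())
                    progressed = True


def build_pipeline(broker, site, delivered):
    seen = set()

    def orchestrator(url):
        # Schedule each URL at most once.
        if url not in seen:
            seen.add(url)
            broker.publish("fetch", url)

    def fetcher(url):
        # Stand-in for a real request: look the URL up in a dict.
        broker.publish("parse", (url, site.get(url, "")))

    def parser(message):
        url, body = message
        for link in body.split():
            broker.publish("schedule", link)   # discovered links go back for scheduling
        broker.publish("deliver", (url, len(body)))

    def sink(message):
        delivered.append(message)

    broker.subscribe("schedule", orchestrator)
    broker.subscribe("fetch", fetcher)
    broker.subscribe("parse", parser)
    broker.subscribe("deliver", sink)
```

To try it, create a `Broker`, call `build_pipeline(broker, site, delivered)` with a small dictionary of pages and an empty list, publish a seed URL to `schedule`, and call `broker.drain()`. Because the stages share nothing but topic names, replacing the in-memory queues with a networked broker would not require changing the stage functions, which is the property that lets the same stages run in one process or as separate services.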

[Figure: Crawl pipeline. Diagram showing the Orchestrator, Fetcher, Parser, and Sink with the Message broker, alongside Analytics, Database, Blob storage, and Cache; one edge is labelled "discovered links".]

Where to go next
