The crawl pipeline

Follow a URL end to end as it moves from seed to delivered result.

When you start a crawl, inndx takes your starting URLs and passes each one through four stages in sequence: the orchestrator, the fetcher, the parser, and the sink. This page traces that journey so you can build an accurate mental model of what happens to a URL and why your results look the way they do.

Starting from seeds

A crawl begins with seeds: the URLs you supply as entry points. The orchestrator loads those seeds, applies any filters and ranking rules you have configured, and selects an initial batch to work through. Filters decide which URLs are eligible at all; rankers determine the order in which eligible URLs are visited. If you have configured URL mutators, they are applied here too, for example, to normalize query strings before fetching.
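As a rough illustration of how mutators, filters, and rankers could combine to produce a batch, here is a minimal Python sketch. It is not inndx's implementation or API: every name in it (normalize_query, same_host_filter, path_depth, select_batch) is hypothetical, and the order of operations (mutate, deduplicate, filter, then rank) is an assumption made for the example.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def normalize_query(url: str) -> str:
    """Hypothetical mutator: sort query parameters and drop the fragment."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def same_host_filter(host: str):
    """Hypothetical filter: only URLs on the given host are eligible."""
    return lambda url: urlsplit(url).netloc == host


def path_depth(url: str) -> int:
    """Hypothetical ranker key: URLs with shallower paths are visited first."""
    return len([seg for seg in urlsplit(url).path.split("/") if seg])


def select_batch(seeds, filters, rank_key, mutators, batch_size):
    """Mutate, deduplicate, filter, and rank seeds (assumed order)."""
    mutated = []
    for url in seeds:
        for mutate in mutators:
            url = mutate(url)
        mutated.append(url)
    unique = list(dict.fromkeys(mutated))  # preserves first-seen order
    eligible = [u for u in unique if all(f(u) for f in filters)]
    return sorted(eligible, key=rank_key)[:batch_size]
```

In this sketch, two seeds that differ only in query parameter order collapse into a single URL after mutation, so the batch never contains both.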

Once the orchestrator has a batch ready, it passes the batch to the fetcher.

Fetching content

The fetcher receives each scheduled URL and retrieves its content. For most sites this is a plain HTTP request. For sites that rely on JavaScript to render their content, the fetcher can use a real browser instead. Once the response arrives, the raw content is held so the parser can work on it, and the fetcher signals that the URL has been successfully retrieved.

If a URL fails to fetch, the failure is recorded and reported back to the orchestrator so it can be accounted for in the run's progress.
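The success and failure paths described above might be modeled along these lines. This is a sketch under stated assumptions, not inndx's fetcher: FetchOutcome, fetch_all, and summarize are hypothetical names, and the transport is an injected callable that stands in for either a plain HTTP client or a browser.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FetchOutcome:
    """Hypothetical record of one fetch attempt."""
    url: str
    ok: bool
    body: Optional[bytes] = None
    error: Optional[str] = None


def fetch_all(urls, transport: Callable[[str], bytes]):
    """Fetch each URL, recording failures instead of aborting the batch."""
    outcomes = []
    for url in urls:
        try:
            outcomes.append(FetchOutcome(url, True, body=transport(url)))
        except Exception as exc:
            # The failure is kept so it can be reported back and counted.
            message = f"{type(exc).__name__}: {exc}"
            outcomes.append(FetchOutcome(url, False, error=message))
    return outcomes


def summarize(outcomes):
    """Progress figures an orchestrator could use (assumed shape)."""
    fetched = sum(1 for o in outcomes if o.ok)
    return {"fetched": fetched, "failed": len(outcomes) - fetched}
```

Keeping failed URLs in the same list as successful ones means progress figures always account for every scheduled URL.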

Parsing and extracting data

The parser picks up where the fetcher left off. For each fetched URL it runs a configurable sequence of steps.

First, a guard may short-circuit the page entirely if an early condition is not met, avoiding unnecessary work. Then conditions decide whether the page should be included or skipped based on rules like URL patterns or content heuristics. For pages that pass, extractors pull out the structured data you asked for: raw content, rendered markdown, or fields mapped by XPath or CSS selectors. Navigators scan the page for links to follow. Resolvers handle any referenced assets you want to capture.

The parser produces two outputs from each page: the structured extraction result, which goes to the sink, and the new links it discovered, which go back to the orchestrator to widen the crawl.
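The sequence of guard, conditions, extractors, and navigators, together with the two outputs, can be sketched as follows. The names (ParseResult, parse_page) and callable signatures are hypothetical and chosen only for illustration. Resolvers are omitted, link discovery uses Python's standard html.parser as a stand-in navigator, and the sketch assumes that a skipped page contributes no links, which the text above does not specify.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Stand-in navigator: collects href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


@dataclass
class ParseResult:
    """Hypothetical parser output: data for the sink, links for the orchestrator."""
    url: str
    included: bool
    data: dict = field(default_factory=dict)
    links: list = field(default_factory=list)


def parse_page(url, html, guard, conditions, extractors):
    # Guard: short-circuit before any further work is done.
    if not guard(url, html):
        return ParseResult(url, included=False)
    # Conditions: every rule must pass for the page to be included.
    if not all(cond(url, html) for cond in conditions):
        return ParseResult(url, included=False)
    # Extractors: each named extractor contributes one field.
    data = {name: extract(html) for name, extract in extractors.items()}
    # Navigator: resolve discovered links against the page URL.
    collector = _LinkCollector()
    collector.feed(html)
    links = [urljoin(url, href) for href in collector.hrefs]
    return ParseResult(url, True, data, links)
```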

Delivering results

The sink receives the structured result for each parsed page and runs your configured output actions against it. Actions include writing results to blob storage, logging them, or attaching labels to the URL record. Multiple actions can run against the same result.
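To make the idea of several actions running against one result concrete, here is a small sketch. The action factories and run_sink are hypothetical names rather than inndx components, and an in-memory dict stands in for blob storage.

```python
import json


def make_log_action(log: list):
    """Hypothetical action: append a one-line summary to a log list."""
    def action(url, result, record):
        log.append(f"{url}: {len(result)} fields")
    return action


def make_store_action(store: dict):
    """Hypothetical action: write the result as JSON (dict stands in for blob storage)."""
    def action(url, result, record):
        store[url] = json.dumps(result, sort_keys=True)
    return action


def make_label_action(label: str):
    """Hypothetical action: attach a label to the URL record."""
    def action(url, result, record):
        record.setdefault("labels", []).append(label)
    return action


def run_sink(url, result, actions, record):
    """Run every configured action against the same result, in order."""
    for action in actions:
        action(url, result, record)
    return record
```

Each action sees the same result object, so adding or removing one action does not change what the others receive.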

The sink is the end of the line for a URL: once its actions complete, that URL's journey through the pipeline is finished.

The feedback loop and how a crawl ends

The links the parser discovers re-enter the orchestrator's URL queue. The orchestrator applies your filters and rankers to them just as it did the original seeds, and the ones that pass get scheduled for the fetcher. This is how a crawl that starts from a handful of seed URLs can grow to cover an entire site.

The crawl continues this loop until your stopping criteria are met. You can limit it by maximum URL count, maximum crawl depth, elapsed time, or how many consecutive evaluation cycles have found no new URLs to visit. When a stopping criterion is satisfied, the orchestrator closes the run. If you have configured a trigger on the job, the trigger may automatically start the next run, either on a schedule or in reaction to the run that just finished.
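The feedback loop and the four kinds of stopping criteria can be sketched in a few lines. This is an illustration under assumptions, not inndx's scheduler: StopCriteria and crawl are hypothetical names, one "evaluation cycle" is taken to mean one processed URL, the depth limit simply stops link discovery beyond the limit, and an empty queue ends the run with the reason "frontier_exhausted".

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopCriteria:
    """Hypothetical limits; None means the limit is not set."""
    max_urls: Optional[int] = None
    max_depth: Optional[int] = None
    max_seconds: Optional[float] = None
    max_idle_cycles: Optional[int] = None


def crawl(seeds, discover, criteria, clock=time.monotonic):
    """Visit URLs breadth-first; return (visited, reason the run closed)."""
    started = clock()
    queue = deque((url, 0) for url in seeds)
    seen = set(seeds)
    visited = []
    idle_cycles = 0
    while queue:
        if criteria.max_urls is not None and len(visited) >= criteria.max_urls:
            return visited, "max_urls"
        if criteria.max_seconds is not None and clock() - started >= criteria.max_seconds:
            return visited, "max_seconds"
        if criteria.max_idle_cycles is not None and idle_cycles >= criteria.max_idle_cycles:
            return visited, "max_idle_cycles"
        url, depth = queue.popleft()
        visited.append(url)
        fresh = []
        # Links are followed only while the page is above the depth limit.
        if criteria.max_depth is None or depth < criteria.max_depth:
            for link in discover(url):
                if link not in seen:
                    seen.add(link)
                    fresh.append(link)
        # Discovered links re-enter the queue one level deeper.
        queue.extend((link, depth + 1) for link in fresh)
        idle_cycles = 0 if fresh else idle_cycles + 1
    return visited, "frontier_exhausted"
```

The discover callable stands in for the fetch and parse stages: given a URL, it returns the links found on that page.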

Where to go next

The individual stages have configurable components you select in your crawl manifest. The full catalog of what you can plug in at each stage is in the Components reference. For the manifest fields that control seeds, filters, stopping criteria, and the rest, see the Crawl manifest reference.