What inndx records

The data inndx keeps across jobs, runs, the crawl graph, and analytics.

As inndx crawls, it builds up a record of what it has done: the jobs you have configured, every run they have produced, the URLs and links it has encountered, and a detailed log of each URL's journey through the pipeline. This page describes what is recorded and where it lives, which is the information you need for querying results, planning backups, and estimating capacity.

Jobs, runs, and triggers

A crawl job is the configuration you create: the seeds, the filters, the parser setup, the output actions, and the stopping criteria. It is a persistent record that you can update, pause, or delete. The job itself does not crawl anything; it defines what a crawl should do.

A run is a single execution of a job. Every time a job is started, manually or by a trigger, a new run is created. The run tracks progress from start to finish: how many URLs have been scheduled, fetched, parsed, and delivered, and what state the run is currently in. Completed runs are kept so you can review what happened and compare across executions.

A trigger is a rule attached to a job that starts new runs automatically. Triggers can fire on a cron schedule or in reaction to a run finishing. Triggers are stored alongside the job and can be enabled, disabled, or deleted independently.
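The relationship between the three records can be sketched as a small data model. This is an illustrative sketch only, not inndx's actual schema: the class names, field names, and run states below are assumptions chosen to mirror the description above.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class RunState(Enum):
    # Hypothetical state names; the real set of run states may differ.
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"


@dataclass
class Trigger:
    # A trigger fires either on a cron schedule or when another run finishes.
    cron: Optional[str] = None          # e.g. "0 3 * * *"
    after_run_of: Optional[str] = None  # id of the job whose run completion fires this
    enabled: bool = True                # can be toggled without touching the job


@dataclass
class Run:
    # One execution of a job, with per-stage progress counters.
    job_id: str
    state: RunState = RunState.PENDING
    scheduled: int = 0
    fetched: int = 0
    parsed: int = 0
    delivered: int = 0


@dataclass
class Job:
    # The persistent configuration; it defines a crawl but does not crawl.
    job_id: str
    seeds: List[str]
    triggers: List[Trigger] = field(default_factory=list)
    runs: List[Run] = field(default_factory=list)

    def start_run(self) -> Run:
        """Create a new run record; the job configuration is left unchanged."""
        run = Run(job_id=self.job_id)
        self.runs.append(run)
        return run
```

In this sketch, calling `start_run()` twice on the same job yields two separate `Run` records, which matches the idea that completed runs are kept and can be compared.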

Hosts, URLs, and the crawl graph

As inndx discovers and visits URLs, it builds a graph of what it has seen. Every unique hostname gets a record, and every unique URL gets a record. These are global across all jobs and runs, so if two different jobs visit the same URL, they share the same URL record.

During a run, inndx tracks which URLs have been visited as part of that run, including the depth at which they were reached. This per-run frontier record is what allows the run to resume from a checkpoint, and it is what the stopping criteria are evaluated against.

When the parser discovers that one page links to another, that parent-to-child relationship is recorded as an edge in the crawl graph. This gives you a navigable map of the link structure inndx found. You can also attach arbitrary key-value labels to hosts and URLs, which is useful for tagging, filtering, and downstream processing.
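To make the split between global records and per-run state concrete, here is a minimal in-memory sketch. It is not how inndx stores the graph: the `CrawlGraph` class, its method names, and the choice to keep the first recorded depth for a URL are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple


class CrawlGraph:
    """Toy model: global URL records, parent-to-child edges, per-run frontier."""

    def __init__(self) -> None:
        # Global: one record per unique URL, shared by all jobs and runs.
        # The value is the URL's key-value labels.
        self.urls: Dict[str, Dict[str, str]] = {}
        # Global: (parent_url, child_url) edges discovered by the parser.
        self.edges: Set[Tuple[str, str]] = set()
        # Per run: run_id -> {url: depth at which the run reached it}.
        self.frontier: Dict[str, Dict[str, int]] = defaultdict(dict)

    def visit(self, run_id: str, url: str, depth: int) -> None:
        # Reuse the global URL record if another run already created it.
        self.urls.setdefault(url, {})
        # Assumption: the first depth recorded for a URL in a run is kept.
        self.frontier[run_id].setdefault(url, depth)

    def link(self, parent: str, child: str) -> None:
        self.urls.setdefault(parent, {})
        self.urls.setdefault(child, {})
        self.edges.add((parent, child))

    def label(self, url: str, key: str, value: str) -> None:
        self.urls.setdefault(url, {})[key] = value
```

If two runs visit the same URL, this model ends up with one entry in `urls` and two frontier entries, one per run, each with its own depth.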

Logs and metrics

The analytics service maintains two queryable histories.

The audit log records every change to your configuration: when a job was created or updated, when a trigger was added, when a data schema changed. Each entry captures what changed and when. This is the record to consult when you need to understand how your setup evolved over time or to audit a specific change.

The crawl log records each URL's passage through the pipeline for a given run: when it was scheduled, when it was fetched, when it was parsed, and whether any step failed. You can query the crawl log for a specific run to see the full history of every URL that was processed, including timings and error details for anything that did not complete successfully.
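The kind of question the crawl log answers can be sketched with plain Python over a list of entries. This does not show the real query interface: the `CrawlLogEntry` fields and both helper functions are hypothetical and stand in for whatever the analytics service actually exposes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class CrawlLogEntry:
    # Hypothetical shape of one crawl log row.
    run_id: str
    url: str
    stage: str                   # e.g. "scheduled", "fetched", "parsed"
    timestamp: float             # seconds since the epoch
    error: Optional[str] = None  # set only when the step failed


def failures_for_run(entries: Iterable[CrawlLogEntry], run_id: str) -> List[CrawlLogEntry]:
    """Return every failed step recorded for the given run."""
    return [e for e in entries if e.run_id == run_id and e.error is not None]


def url_history(entries: Iterable[CrawlLogEntry], run_id: str, url: str) -> List[CrawlLogEntry]:
    """Return one URL's steps within a run, ordered by time."""
    matching = (e for e in entries if e.run_id == run_id and e.url == url)
    return sorted(matching, key=lambda e: e.timestamp)
```

The difference between consecutive timestamps in a URL's history gives the per-stage timing mentioned above.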

The analytics service also exposes aggregated metrics and latency series for a run, useful for understanding throughput and identifying bottlenecks.

Where data lives

inndx keeps data in three places:

The relational database holds your crawl state: jobs, runs, triggers, hosts, URLs, the crawl graph, and the per-run frontier. This is the primary operational store. It is what you back up to preserve your job configurations and crawl history, and it is what grows with the number of URLs inndx has ever seen.

Blob storage holds large content: the raw pages retrieved by the fetcher and any results written to storage by the sink. Blob storage typically holds the bulk of the data by volume. It is separate from the database so it can be sized and retained independently.

A columnar store holds the audit log, the crawl log, and run metrics, separate from the relational database. A columnar store is optimized for the kind of queries that logs are used for: aggregations, time-range scans, and filtering across large numbers of entries. This store grows with run activity; a high-throughput crawl produces a large number of log entries per run.
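For rough capacity planning, the two growth patterns described above can be written as simple arithmetic. This is a back-of-the-envelope sketch under stated assumptions, not a sizing formula from inndx: the function name is hypothetical, and it assumes one crawl log row per URL per pipeline stage (scheduled, fetched, parsed), which may not match the real log granularity.

```python
def estimate_record_counts(
    unique_urls_ever: int,
    urls_per_run: int,
    runs: int,
    stages_per_url: int = 3,  # assumption: scheduled, fetched, parsed
) -> dict:
    """Estimate record counts for the relational and columnar stores."""
    return {
        # The relational store grows with the number of distinct URLs seen.
        "url_records": unique_urls_ever,
        # The columnar store grows with run activity.
        "crawl_log_rows": urls_per_run * runs * stages_per_url,
    }
```

Multiplying each count by a measured average record size for your own deployment turns this into a storage estimate.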

For backup purposes, the relational database and the columnar store both need point-in-time recovery coverage. Blob storage retention depends on how long you need raw content and results to remain accessible.
