Inspect runs and logs

Query audit logs, crawl logs, metrics, and latency from the analytics service.

The analytics service records what happened during every run and every configuration change, and exposes it over an API. This guide shows how to query that data to monitor and debug crawls. For the endpoint reference, see HTTP API.

All of these endpoints are read-only GET requests that return paginated results and accept query parameters to narrow what comes back. A parameter that takes a list of values uses empty-bracket syntax, repeating the key as key[]=value. For example, two statuses are written statuses[]=error&statuses[]=excluded. Single-value parameters such as window and occurred_after are written plainly with no brackets. When you send bracketed parameters with curl, pass -g (--globoff) so that curl does not try to interpret the brackets as a URL glob range.
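As a minimal sketch of that encoding, the Python helper below builds a query string in which list values become repeated key[]=value pairs and single values are written plainly. The name build_query is hypothetical, and leaving brackets and colons unescaped is a choice made to match the curl examples in this guide, not something the API is known to require.

```python
from urllib.parse import urlencode


def build_query(params):
    """Encode params; list values become repeated key[]=value pairs."""
    pairs = []
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            # List parameter: repeat the key with empty brackets.
            pairs.extend((f"{key}[]", item) for item in value)
        else:
            # Single-value parameter: written plainly, no brackets.
            pairs.append((key, value))
    # safe="[]:" keeps brackets and timestamp colons unescaped (an assumption
    # made for readability; percent-encoded forms are equally valid URLs).
    return urlencode(pairs, safe="[]:")


print(build_query({"statuses": ["error", "excluded"], "window": "hour"}))
```

Append the result to an endpoint path after a ? to form the full request URL.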

Audit logs

The audit log records changes to your configuration: jobs, triggers, schemas, and other resources being created, updated, or deleted. Query it to see how a setup evolved and who changed what.

curl 'http://localhost:8022/v1/audit_logs'

Each entry has a resource_type and resource_id naming what changed, an action describing the change, an actor identifying who made it, an occurred_at timestamp, and optional metadata. Narrow the results with query parameters such as resource_types, actions, actors, and a time range:

curl -g 'http://localhost:8022/v1/audit_logs?resource_types[]=crawl_job&occurred_after=2026-06-01T00:00:00Z'
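To answer "who changed what" from a page of results, you can group entries by actor. The sketch below assumes you have already parsed the response into a list of dictionaries carrying the fields described above, and that actor and occurred_at are plain strings. The response envelope is not covered here, and the sample values are made up for illustration.

```python
from collections import defaultdict


def changes_by_actor(entries):
    """Group audit entries by actor, oldest change first."""
    grouped = defaultdict(list)
    # UTC timestamps in one consistent ISO 8601 format sort correctly as strings.
    for entry in sorted(entries, key=lambda e: e["occurred_at"]):
        resource = f'{entry["resource_type"]}/{entry["resource_id"]}'
        grouped[entry["actor"]].append((entry["occurred_at"], entry["action"], resource))
    return dict(grouped)


# Made-up sample entries; real action and actor values may look different.
sample = [
    {"resource_type": "crawl_job", "resource_id": "job-1", "action": "updated",
     "actor": "alice", "occurred_at": "2026-06-02T09:00:00Z"},
    {"resource_type": "crawl_job", "resource_id": "job-1", "action": "created",
     "actor": "alice", "occurred_at": "2026-06-01T12:00:00Z"},
    {"resource_type": "trigger", "resource_id": "trg-7", "action": "deleted",
     "actor": "bob", "occurred_at": "2026-06-03T08:30:00Z"},
]

for actor, changes in changes_by_actor(sample).items():
    print(actor, changes)
```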

Crawl logs

The crawl log records each URL's passage through the pipeline for a single run. Query it by run ID:

curl 'http://localhost:8022/v1/runs/<run-id>/logs'

Each entry covers one URL at one step of the pipeline. The step is schedule, fetch, or parse, and the status is ok, error, or excluded. Alongside those, an entry carries the url and its host, the pipeline_id, the http_status and latency_ms for the step, the depth the URL was reached at, and, when something went wrong, an error_reason and error_detail.
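If you process entries in code, it can help to give them a typed shape. The sketch below models one entry using the field names listed above. The class name CrawlLogEntry, the Python types, and the choice of which fields are optional are assumptions rather than part of the API, and the sample values are made up.

```python
from dataclasses import dataclass, fields
from typing import Optional

STEPS = {"schedule", "fetch", "parse"}
STATUSES = {"ok", "error", "excluded"}


@dataclass
class CrawlLogEntry:
    url: str
    host: str
    step: str
    status: str
    # Which fields may be absent is an assumption; adjust to the real payload.
    pipeline_id: Optional[str] = None
    http_status: Optional[int] = None
    latency_ms: Optional[float] = None
    depth: Optional[int] = None
    error_reason: Optional[str] = None
    error_detail: Optional[str] = None

    def __post_init__(self):
        # Reject values outside the documented step and status sets.
        if self.step not in STEPS:
            raise ValueError(f"unknown step: {self.step!r}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    @classmethod
    def from_dict(cls, raw):
        """Build an entry from a parsed JSON object, ignoring unknown keys."""
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})


entry = CrawlLogEntry.from_dict({
    "url": "https://example.com/a", "host": "example.com",
    "step": "fetch", "status": "error", "http_status": 429,
    "latency_ms": 120, "depth": 2, "error_reason": "rate_limited",  # made-up value
})
print(entry.step, entry.status, entry.http_status)
```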

Narrow the log with query parameters. To see only failures, filter by status; to focus on one stage, filter by step:

curl -g 'http://localhost:8022/v1/runs/<run-id>/logs?statuses[]=error'
curl -g 'http://localhost:8022/v1/runs/<run-id>/logs?steps[]=fetch&hosts[]=example.com'

You can also filter by error_reasons[], pipeline_ids[], and a time range with occurred_after and occurred_before.

Metrics and latency

For a high-level view of a run rather than individual URLs, the analytics service aggregates the crawl log into metric and latency series.

The metrics endpoint counts log entries over time, bucketed by a window (minute, five_minute, hour, or day). Group the counts by one or more dimensions to break them down, for example by status to watch successes against failures:

curl -g 'http://localhost:8022/v1/runs/<run-id>/metrics?window=hour&group_by[]=status'

The response is a series of buckets, each with its dimensions and a value count. Available dimensions to group or filter by include step, status, error_reason, host, pipeline_id, job_id, and run_id. To restrict to specific dimension values rather than break them out, use the filters map, written with the dimension as a key, for example filters[status]=error:

curl -g 'http://localhost:8022/v1/runs/<run-id>/metrics?window=hour&group_by[]=host&filters[status]=error'
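Once you have the buckets, summing them across time gives a per-dimension total, for example errors per host for the query above. The sketch below assumes each bucket is a dictionary with a dimensions map and a numeric value. Those key names are guesses at the response shape, so check them against the HTTP API reference before relying on them.

```python
from collections import Counter


def totals_by(buckets, dimension):
    """Sum bucket values across all time windows for one grouped dimension."""
    totals = Counter()
    for bucket in buckets:
        # "dimensions" and "value" are assumed key names, not confirmed ones.
        totals[bucket["dimensions"][dimension]] += bucket["value"]
    return totals


# Made-up hourly buckets, grouped by host and filtered to errors.
buckets = [
    {"dimensions": {"host": "example.com"}, "value": 4},
    {"dimensions": {"host": "example.org"}, "value": 1},
    {"dimensions": {"host": "example.com"}, "value": 3},
]

for host, errors in totals_by(buckets, "host").most_common():
    print(host, errors)
```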

The latency endpoint summarizes step timings over the same windows, reporting percentile and average latencies per bucket:

curl -g 'http://localhost:8022/v1/runs/<run-id>/latency?window=hour&steps[]=fetch'

Each bucket reports p50_ms, p95_ms, p99_ms, avg_ms, and a sample_count. Querying latency for the fetch step is the quickest way to see whether retrieval is your bottleneck.
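To condense several buckets into one figure, you can weight each bucket's avg_ms by its sample_count; percentiles cannot be combined this way, so the sketch below only flags the buckets whose p95_ms exceeds a limit you choose. It assumes the buckets are already parsed into dictionaries with the field names above. The function names, the threshold, and the sample numbers are illustrative.

```python
def weighted_avg_ms(buckets):
    """Overall average latency, weighting each bucket by its sample_count."""
    total_samples = sum(b["sample_count"] for b in buckets)
    if total_samples == 0:
        return None  # no samples, so no meaningful average
    weighted_sum = sum(b["avg_ms"] * b["sample_count"] for b in buckets)
    return weighted_sum / total_samples


def slow_buckets(buckets, p95_limit_ms):
    """Return the buckets whose p95 latency exceeds the given limit."""
    return [b for b in buckets if b["p95_ms"] > p95_limit_ms]


# Made-up buckets for the fetch step.
latency = [
    {"p50_ms": 80, "p95_ms": 250, "p99_ms": 400, "avg_ms": 100, "sample_count": 10},
    {"p50_ms": 300, "p95_ms": 1200, "p99_ms": 2000, "avg_ms": 400, "sample_count": 30},
]

print(weighted_avg_ms(latency))
print(len(slow_buckets(latency, p95_limit_ms=1000)))
```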

Using logs to debug a crawl

When a crawl returns fewer results than expected, the crawl log tells you where URLs were lost. Work the pipeline in order.

Start by counting outcomes for the run to see the overall shape:

curl -g 'http://localhost:8022/v1/runs/<run-id>/metrics?window=day&group_by[]=step&group_by[]=status'

A large excluded count at the parse step means a condition step is dropping pages; a large error count at the fetch step means retrieval is failing. Then pull the matching entries to see why. For fetch failures, filter to errors at the fetch step and read the http_status, error_reason, and error_detail:

curl -g 'http://localhost:8022/v1/runs/<run-id>/logs?steps[]=fetch&statuses[]=error'

A run of 403 or 429 HTTP statuses points to access or rate-limiting problems, which you can address with fetch settings or politeness controls. Entries with a status of excluded at the parse step name the pipeline_id that dropped the page, so you can tell which condition was responsible. URLs that never appear in the log at all were never scheduled, which usually means a filter rejected them before they reached the queue; revisit your filters in Seed and filter URLs.
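The same triage can be scripted over log entries you have already downloaded. The sketch below assumes the entries are dictionaries with the crawl log fields described earlier. The function names and sample data are illustrative, and it performs no HTTP requests itself.

```python
from collections import Counter


def triage(entries):
    """Summarize where URLs were lost: outcomes, fetch errors, parse exclusions."""
    outcomes = Counter((e["step"], e["status"]) for e in entries)
    # HTTP statuses seen on failed fetches (403 or 429 suggests access or rate limits).
    fetch_http = Counter(
        e.get("http_status") for e in entries
        if e["step"] == "fetch" and e["status"] == "error"
    )
    # Pipelines that excluded pages at the parse step.
    dropped_by = Counter(
        e.get("pipeline_id") for e in entries
        if e["step"] == "parse" and e["status"] == "excluded"
    )
    return {"outcomes": outcomes, "fetch_http": fetch_http, "dropped_by": dropped_by}


def never_scheduled(expected_urls, entries):
    """URLs you expected to crawl that do not appear anywhere in the log."""
    seen = {e["url"] for e in entries}
    return sorted(set(expected_urls) - seen)


# Made-up entries for one run.
log = [
    {"url": "https://example.com/a", "step": "fetch", "status": "error", "http_status": 429},
    {"url": "https://example.com/b", "step": "fetch", "status": "error", "http_status": 429},
    {"url": "https://example.com/c", "step": "parse", "status": "excluded", "pipeline_id": "p-1"},
    {"url": "https://example.com/d", "step": "parse", "status": "ok", "pipeline_id": "p-1"},
]

report = triage(log)
print(report["outcomes"].most_common())
print(never_scheduled(["https://example.com/a", "https://example.com/z"], log))
```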
