Inspect runs and logs

Query audit logs, crawl logs, metrics, and latency from the analytics service.

The analytics service records what happened during every run and every configuration change, and exposes it for querying. This guide shows how to query that data to monitor and debug crawls. The examples below default to crawlctl, with the equivalent dashboard and raw HTTP API calls alongside it. For the endpoint reference, see HTTP API.

The API endpoints behind these commands are read-only GET requests that return paginated results and accept query parameters to narrow what comes back. A parameter that takes a list of values uses empty-bracket syntax, repeating the key as key[]=value. For example, two statuses are written statuses[]=error&statuses[]=excluded. Single-value parameters such as window and occurred_after are written plainly with no brackets.
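As an illustration of the bracket syntax, here is a minimal Python sketch that turns a dictionary of parameters into a query string. The helper name build_query is hypothetical, and keeping the brackets literal rather than percent-encoded is an assumption made to match the example above; the service may accept either form.

```python
from urllib.parse import urlencode


def build_query(params):
    """Build a query string, writing list values as repeated key[]=value pairs.

    Hypothetical helper for illustration; not part of crawlctl or the API.
    """
    pairs = []
    for key, value in params.items():
        if isinstance(value, (list, tuple)):
            # List-valued parameter: repeat the key with empty brackets.
            pairs.extend((f"{key}[]", item) for item in value)
        else:
            # Single-valued parameter: plain key, no brackets.
            pairs.append((key, value))
    # safe="[]" keeps the brackets literal instead of percent-encoding them
    # (an assumption, chosen to match the form shown in this guide).
    return urlencode(pairs, safe="[]")


query = build_query({"statuses": ["error", "excluded"], "window": "hour"})
print(query)  # statuses[]=error&statuses[]=excluded&window=hour
```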

Audit logs

The audit log records changes to your configuration: jobs, triggers, schemas, and other resources being created, updated, or deleted. Query it to see how a setup evolved and who changed what.

crawlctl audit-logs list

Each entry has a resource_type and resource_id naming what changed, an action describing the change, an actor identifying who made it, an occurred_at timestamp, and optional metadata. Narrow the results by resource type, action, actor, or a time range:

crawlctl audit-logs list --resource-type crawl_job --after 7d

Crawl logs

The crawl log records each URL's passage through the pipeline for a single run. Query it by run ID:

crawlctl runs logs <run-id>

Each entry covers one URL at one step of the pipeline. The step is schedule, fetch, or parse, and the status is ok, error, or excluded. Alongside those, an entry carries the url and its host, the pipeline_id, the http_status and latency_ms for the step, the depth the URL was reached at, and, when something went wrong, an error_reason and error_detail.
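If you process entries in your own scripts, it can help to give them a typed shape. The sketch below models one entry in Python using the field names listed above. The Python types, which fields are optional, and the sample values (including the error_reason string) are assumptions for illustration, not the service's schema.

```python
from dataclasses import dataclass, fields
from typing import Optional

STEPS = {"schedule", "fetch", "parse"}
STATUSES = {"ok", "error", "excluded"}


@dataclass
class CrawlLogEntry:
    # Field names come from this guide; types and optionality are assumptions.
    url: str
    host: str
    step: str
    status: str
    pipeline_id: Optional[str] = None
    http_status: Optional[int] = None
    latency_ms: Optional[float] = None
    depth: Optional[int] = None
    error_reason: Optional[str] = None
    error_detail: Optional[str] = None

    def __post_init__(self):
        # Reject values outside the documented step and status sets.
        if self.step not in STEPS:
            raise ValueError(f"unknown step: {self.step!r}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")


def parse_entry(raw):
    """Build an entry from a dict, ignoring keys this sketch does not model."""
    known = {f.name for f in fields(CrawlLogEntry)}
    return CrawlLogEntry(**{k: v for k, v in raw.items() if k in known})


# Hypothetical sample entry, not real service output.
entry = parse_entry({
    "url": "https://example.com/a",
    "host": "example.com",
    "step": "fetch",
    "status": "error",
    "http_status": 429,
    "latency_ms": 812,
    "depth": 2,
    "error_reason": "rate_limited",
    "unmodelled_key": "ignored",
})
print(entry.step, entry.status, entry.http_status)
```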

Narrow the log by status, step, host, error reason, pipeline, or a time range. To see only failures, filter by status; to focus on one stage, filter by step:

crawlctl runs logs <run-id> --status error
crawlctl runs logs <run-id> --step fetch --host example.com

The same command accepts the remaining filters listed above: error reason, pipeline ID, and time range.

Metrics and latency

For a high-level view of a run rather than individual URLs, the analytics service aggregates the crawl log into metric and latency series.

The metrics query counts log entries over time, bucketed by a window (minute, five_minute, hour, or day). Group the counts by one or more dimensions to break them down, for example by status to watch successes against failures:

crawlctl runs metrics <run-id> --window hour --group-by status

The response is a series of buckets, each with its dimensions and a count. Available dimensions to group or filter by include step, status, error_reason, host, pipeline_id, job_id, and run_id. To restrict to specific dimension values rather than break them out, filter on a dimension:

crawlctl runs metrics <run-id> --window hour --group-by host --filter status=error
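To make the difference between grouping and filtering concrete, the following Python sketch reproduces the idea locally: it drops entries that fail the filter, then counts the rest per window and per grouped dimension. It is an illustration under stated assumptions, not the service's implementation. The timestamp key, the alignment of buckets to UTC epoch boundaries, and the sample entries are all hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone

# Window names from this guide, mapped to a length in seconds.
WINDOW_SECONDS = {"minute": 60, "five_minute": 300, "hour": 3600, "day": 86400}


def bucket_counts(entries, window, group_by, filters=None):
    """Count entries per (bucket start, grouped dimension values)."""
    size = WINDOW_SECONDS[window]
    filters = filters or {}
    counts = Counter()
    for entry in entries:
        # Filtering restricts which entries are counted at all.
        if any(entry.get(dim) != value for dim, value in filters.items()):
            continue
        # "timestamp" is an assumed field name for when the step happened.
        ts = int(entry["timestamp"].timestamp())
        start = datetime.fromtimestamp(ts - ts % size, tz=timezone.utc)
        # Grouping splits the remaining entries into one series per value.
        key = (start,) + tuple(entry.get(dim) for dim in group_by)
        counts[key] += 1
    return dict(counts)


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Hypothetical entries for illustration.
entries = [
    {"timestamp": at(10, 5), "host": "a.example", "status": "error"},
    {"timestamp": at(10, 40), "host": "a.example", "status": "error"},
    {"timestamp": at(10, 50), "host": "b.example", "status": "ok"},
    {"timestamp": at(11, 10), "host": "b.example", "status": "error"},
]

result = bucket_counts(entries, "hour", ["host"], {"status": "error"})
for key, count in sorted(result.items()):
    print(key[0].isoformat(), key[1], count)
```

In this sample the ok entry never reaches a bucket, because the filter removes it before counting, while the two hosts each get their own series.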

The latency query summarizes step timings over the same windows, reporting percentile and average latencies per bucket:

crawlctl runs latency <run-id> --window hour --step fetch

Each bucket reports p50_ms, p95_ms, p99_ms, avg_ms, and a sample_count. Querying latency for the fetch step is the quickest way to see whether retrieval is your bottleneck.
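For intuition about what these fields summarize, here is a small Python sketch that computes the same five values from a list of latency_ms samples. It uses the nearest-rank percentile definition, which is an assumption: this guide does not say which percentile method the service uses, so its numbers may differ slightly. The sample latencies are made up.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a non-empty list sorted in ascending order."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize_latency(latencies_ms):
    """Summarize one bucket's samples using the field names from this guide."""
    values = sorted(latencies_ms)
    if not values:
        return {"sample_count": 0}
    return {
        "p50_ms": percentile(values, 50),
        "p95_ms": percentile(values, 95),
        "p99_ms": percentile(values, 99),
        "avg_ms": sum(values) / len(values),
        "sample_count": len(values),
    }


# Hypothetical fetch latencies: nine quick responses and one slow outlier.
summary = summarize_latency([120, 80, 100, 2000, 90, 110, 95, 105, 85, 115])
print(summary)
```

With these samples the median is 100 ms while the average is 290 ms, because a single 2000 ms fetch pulls the mean up. That is why comparing p50_ms with p95_ms and p99_ms tends to be more informative than avg_ms alone when looking for a bottleneck.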

Using logs to debug a crawl

When a crawl returns fewer results than expected, the crawl log tells you where URLs were lost. Work the pipeline in order.

Start by counting outcomes for the run to see the overall shape:

crawlctl runs metrics <run-id> --window day --group-by step --group-by status

A large excluded count at the parse step means a condition step is dropping pages; a large error count at the fetch step means retrieval is failing. Then pull the matching entries to see why. For fetch failures, filter to errors at the fetch step and read the http_status, error_reason, and error_detail:

crawlctl runs logs <run-id> --step fetch --status error

A run of 403 or 429 HTTP statuses points at access or rate-limiting issues to address with fetch settings or politeness controls. Entries with a status of excluded at the parse step name the pipeline_id that dropped the page, so you can tell which condition was responsible. URLs that never appear in the log at all were never scheduled, which usually means a filter rejected them before they reached the queue; revisit your filters in Seed and filter URLs.
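The triage above can be sketched as a short Python script over entries you have already retrieved. The category labels, function names, and sample data are hypothetical, and the entries are assumed to be dictionaries keyed by the field names used in this guide.

```python
from collections import Counter


def triage(entries):
    """Tally likely causes of lost URLs, following the checks in this guide."""
    findings = Counter()
    for e in entries:
        step, status = e.get("step"), e.get("status")
        if step == "fetch" and status == "error":
            if e.get("http_status") in (403, 429):
                # Access or rate-limiting trouble, tallied per host.
                findings[("access_or_rate_limit", e.get("host"))] += 1
            else:
                findings[("fetch_error", e.get("error_reason"))] += 1
        elif step == "parse" and status == "excluded":
            # The pipeline_id names the pipeline that dropped the page.
            findings[("excluded_by_pipeline", e.get("pipeline_id"))] += 1
    return dict(findings)


def never_scheduled(expected_urls, entries):
    """URLs with no log entry at all, so probably rejected by a filter."""
    seen = {e["url"] for e in entries}
    return sorted(set(expected_urls) - seen)


# Hypothetical log entries for illustration.
log = [
    {"url": "https://a.example/1", "host": "a.example", "step": "fetch",
     "status": "error", "http_status": 429},
    {"url": "https://a.example/2", "host": "a.example", "step": "fetch",
     "status": "error", "http_status": 403},
    {"url": "https://b.example/1", "host": "b.example", "step": "parse",
     "status": "excluded", "pipeline_id": "pl_1"},
    {"url": "https://b.example/2", "host": "b.example", "step": "parse",
     "status": "ok"},
]

report = triage(log)
missing = never_scheduled(["https://a.example/1", "https://c.example/1"], log)
print(report)
print(missing)
```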
