Schedule recurring crawls

Run crawls on a cron schedule or in reaction to other runs finishing.

Crawls often need to repeat: nightly refreshes, or a follow-up crawl triggered when another finishes. inndx models both with triggers. This guide shows how to configure them. For the full reference, see Trigger conditions.

A trigger is attached to a crawl job and starts new runs of that job automatically. You create one with a name and a condition. The condition's kind decides what makes the trigger fire. This guide assumes you already have a job and its id; see Author a crawl job for creating one.

Scheduled triggers

A schedule trigger fires on a cron schedule. Its expression is a seven-field cron expression, evaluated in UTC:

second  minute  hour  day-of-month  month  day-of-week  year
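
Standard crontab expressions have only five fields, so it is easy to paste one in by mistake. As a minimal sketch, the helper below labels each field of an expression using the order shown above and rejects any expression that does not have exactly seven fields. The function name label_cron_fields is hypothetical and is not part of crawlctl; it checks only the field count, not the values.

```python
# Field order taken from the layout above.
FIELD_NAMES = (
    "second", "minute", "hour",
    "day-of-month", "month", "day-of-week", "year",
)


def label_cron_fields(expression: str) -> dict[str, str]:
    """Map each whitespace-separated field to its name (hypothetical helper).

    Raises ValueError when the expression does not have seven fields,
    which catches five-field crontab expressions pasted in by mistake.
    """
    fields = expression.split()
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(fields)}"
        )
    return dict(zip(FIELD_NAMES, fields))


labels = label_cron_fields("0 0 12 * * * *")
print(labels["hour"])  # prints 12
```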

For example, to run a job every day at 02:30 UTC, attach a trigger to the job (<job-id>):

crawlctl triggers create --job <job-id> \
  --set name=nightly-refresh \
  --set condition.kind=schedule \
  --set condition.params.expression="0 30 2 * * * *"

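The leading 0 30 2 sets the second, minute, and hour fields; the remaining * * * * match any day-of-month, month, day-of-week, and year, so the schedule recurs daily. Because the expression is evaluated in UTC, convert your intended local time to UTC when writing it. If your local zone observes daylight saving time, the UTC equivalent shifts by an hour when the clocks change.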
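
The conversion to UTC can be sketched in a few lines of Python. This is an illustration only: the function name daily_utc_expression is hypothetical, and the example uses a fixed UTC+2 offset in place of a real time zone. The offset is looked up for a specific date, so a zone with daylight saving time can produce a different expression in summer than in winter.

```python
from datetime import date, datetime, time, timedelta, timezone, tzinfo


def daily_utc_expression(local_time: time, tz: tzinfo, on: date) -> str:
    """Build a seven-field daily expression in UTC for a local wall-clock time.

    Hypothetical helper. The UTC offset is taken for the date `on`, so the
    result is only valid while that offset stays in effect.
    """
    local_dt = datetime.combine(on, local_time, tzinfo=tz)
    utc_dt = local_dt.astimezone(timezone.utc)
    # Field order: second minute hour day-of-month month day-of-week year
    return f"{utc_dt.second} {utc_dt.minute} {utc_dt.hour} * * * *"


# A fixed UTC+2 offset stands in for a real zone here; a zoneinfo.ZoneInfo
# object such as ZoneInfo("Europe/Berlin") can be passed instead.
utc_plus_2 = timezone(timedelta(hours=2))
print(daily_utc_expression(time(4, 30), utc_plus_2, date(2024, 7, 1)))
# prints 0 30 2 * * * *
```

Here 04:30 at UTC+2 corresponds to 02:30 UTC, which gives the same expression as the example above.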

The response includes the trigger's id and its enabled state. A trigger starts enabled.

Reactive triggers

An on_finish trigger fires when a run of the job it is attached to finishes. Use it to chain a follow-up run after each run completes, for example to re-crawl as soon as the previous pass ends:

crawlctl triggers create --job <job-id> \
  --set name=loop-on-finish \
  --set condition.kind=on_finish

You can hold the next run back with an optional delay. The trigger then waits for that duration after the run finishes before starting the next one:

crawlctl triggers create --job <job-id> \
  --set name=loop-with-delay \
  --set condition.kind=on_finish \
  --set condition.params.delay=1h

An on_finish trigger reacts only to the job it belongs to, not to other jobs finishing.
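
Unlike a schedule, an on_finish chain has no fixed period: each start depends on when the previous run finished. The sketch below models that timing from the behaviour described above. It is not how the orchestrator is implemented, and the function name chained_starts and the constant run duration are assumptions made for the illustration.

```python
from datetime import datetime, timedelta, timezone


def chained_starts(
    first_start: datetime,
    run_duration: timedelta,
    delay: timedelta,
    count: int,
) -> list[datetime]:
    """Start times of `count` runs chained by an on_finish trigger.

    Hypothetical model: every run takes `run_duration`, and the next run
    starts `delay` after the previous one finishes.
    """
    starts = []
    start = first_start
    for _ in range(count):
        starts.append(start)
        finished = start + run_duration
        start = finished + delay
    return starts


first = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
for start in chained_starts(first, timedelta(minutes=20), timedelta(hours=1), 3):
    print(start.strftime("%H:%M"))
# prints 00:00, 01:20, 02:40 on separate lines
```

With a 20-minute run and a one-hour delay, runs start every 80 minutes, so the spacing grows or shrinks with the run duration.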

To keep a trigger's definition in a file and reconcile it the same way as a job, use crawlctl triggers apply --job <job-id> -f trigger.yaml instead of create; it creates the trigger if it does not exist and updates it if it does.

Preventing overlapping runs

A schedule that fires faster than a run completes, or a misfired manual start, can leave two runs of the same job active at once. To prevent that, add an exclusive starting criterion to the job's config. A starting criterion is checked before a new run begins and can refuse to start it:

config:
  starting_criteria:
    - kind: exclusive
      params:
        by: job_id
        limit: 1

With by: job_id and limit: 1, the orchestrator allows only one active run per job; a trigger that fires while a run is still going does not start a second one. Set this on jobs whose triggers may fire while a previous run is unfinished.
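
The check that an exclusive criterion performs can be pictured as counting active runs grouped by job id. The sketch below models that rule only. The function name may_start and the list-of-ids input are assumptions, and the orchestrator's real implementation is not shown in this guide.

```python
from collections import Counter


def may_start(job_id: str, active_job_ids: list[str], limit: int = 1) -> bool:
    """Return True if a new run of `job_id` may start (hypothetical model).

    `active_job_ids` holds the job id of every run that is currently
    active; a new run is refused once `limit` runs of that job are active.
    """
    active_for_job = Counter(active_job_ids)[job_id]
    return active_for_job < limit


print(may_start("job-a", ["job-a"]))  # prints False: job-a already has a run
print(may_start("job-b", ["job-a"]))  # prints True: other jobs do not count
```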

Inspecting triggered runs

Runs created by a trigger are ordinary runs of the job, so you list them the same way you list any of a job's runs:

crawlctl runs list --job <job-id>

To see when a specific trigger fired, query its executions. Each execution records two timestamps, scheduled_at and executed_at:

crawlctl triggers executions <trigger-id>
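
Comparing the two timestamps shows how late an execution ran. The sketch below computes that lag. It assumes the timestamps are ISO 8601 strings with an explicit UTC offset, which this guide does not confirm, and the function name execution_lag_seconds is hypothetical.

```python
from datetime import datetime


def execution_lag_seconds(scheduled_at: str, executed_at: str) -> float:
    """Seconds between the scheduled and the actual execution time.

    Assumes ISO 8601 timestamps with an explicit offset, for example
    "2024-01-01T02:30:00+00:00"; adjust the parsing if the real output
    uses a different format.
    """
    scheduled = datetime.fromisoformat(scheduled_at)
    executed = datetime.fromisoformat(executed_at)
    return (executed - scheduled).total_seconds()


lag = execution_lag_seconds(
    "2024-01-01T02:30:00+00:00",
    "2024-01-01T02:30:04+00:00",
)
print(lag)  # prints 4.0
```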

To pause a trigger without deleting it, disable it; you can re-enable it later:

crawlctl triggers disable <trigger-id>
crawlctl triggers enable <trigger-id>

To follow what a triggered run actually did, see Inspect runs and logs.
