Deliver results

Send parsed crawl results to blob storage, the local filesystem, and URL labels.

The sink stage decides what happens to each parsed result. This guide covers the available delivery actions and how to combine them. For the full action catalog, see Sink actions.

Actions are configured per pipeline, under config.pipelines[].actions. Each entry is an object with a kind and optional params. Every result a pipeline produces is handed to its actions.
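
As a rough sketch of where this sits in a manifest, the fragment below assumes the nesting implied by the config.pipelines[].actions path. Any other keys a pipeline needs are omitted, since this guide does not show them:

```yaml
# Sketch only: nesting inferred from the path config.pipelines[].actions.
# Other pipeline keys are intentionally left out.
config:
  pipelines:
    - actions:
        - kind: to_blob        # which action to run
          params:              # optional, action-specific settings
            directory: output
```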

Writing to blob storage

The to_blob action writes each result to blob storage. It is the main way to collect crawl output:

actions:
  - kind: to_blob
    params:
      directory: output
      key_strategy: hash
      include_assets: true

The directory prefixes every key written, so all of a job's output is grouped under one path. Each result is stored at <directory>/<key>/data.

The key_strategy decides the per-result key:

  • hash keys each result by a hash of its URL. A given URL always maps to the same location, so a later crawl of the same URL overwrites the earlier copy. Use it when you want one current copy per URL.
  • 5min keys each result under a timestamp bucket, YYYYMMDD/HH_MM/<url-hash>, rounded to five-minute windows. The same URL crawled at different times lands in different buckets, so history is preserved. Use it when you want snapshots over time rather than a single current copy.

Set include_assets: true to also write assets resolved during parsing (such as images pulled in by an asset_resolver step). Assets are stored alongside the result under <directory>/<key>/assets/<asset-type>/<asset-hash>.
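
To illustrate the layout, here is a sketch that combines the 5min strategy with assets. The paths in the comments follow the patterns described above. The date and time are made up, and the example assumes a crawl at 14:07 falls into the 14:05 window (rounding down), which this guide does not state explicitly:

```yaml
actions:
  - kind: to_blob
    params:
      directory: output
      key_strategy: 5min
      include_assets: true

# Illustrative layout for a result crawled on 2024-03-05 at 14:07
# (assuming windows round down to the start of the five-minute bucket):
#   output/20240305/14_05/<url-hash>/data
#   output/20240305/14_05/<url-hash>/assets/<asset-type>/<asset-hash>
```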

By default the action writes to the deployment's default blob storage. To target a specific configured backend, name it with storage_identifier:

actions:
  - kind: to_blob
    params:
      storage_identifier: archive
      directory: output

Writing a result emits a BlobSaved log event (and AssetBlobSaved per asset), which you can follow to confirm delivery.

Writing to the local filesystem

The to_file action writes results to a directory on the machine running the sink, using the same data and assets layout as to_blob. Its only parameter is include_assets; the destination directory is set in the sink's configuration, not in the manifest:

actions:
  - kind: to_file
    params:
      include_assets: true

Use to_file for local evaluation and quick inspection, where reaching into a directory is more convenient than a blob backend.

Labeling URLs

The label_url action writes key-value labels back onto the URL record for each result. Those labels persist on the URL and can be read later by label-based filters and conditions, which lets one pass of a crawl steer a later one:

actions:
  - kind: label_url
    params:
      labels:
        category: product
        extracted: "true"

For how labels are consumed, see the url_labels filter in Seed and filter URLs.

Logging for debugging

The log action emits a log line for each result instead of delivering it anywhere. It is a development aid for confirming a pipeline produces results and seeing them flow:

actions:
  - kind: log
    params:
      level: info
      event: ResultProduced

level sets the log level (trace, debug, info, warn, or error), and event sets the event name on the emitted line so you can grep for it. Once a crawl is working, remove the action or lower its level.

Running multiple actions

A pipeline can list several actions. Each result is passed to every action in turn, in the order listed, so you can deliver to more than one target at once. A common combination writes results to blob storage, tags the URL, and logs during development:

actions:
  - kind: to_blob
    params:
      directory: output
      key_strategy: hash
  - kind: label_url
    params:
      labels:
        delivered: "true"
  - kind: log
    params:
      level: debug
      event: Delivered

The result content is made available to each action independently, so writing to blob storage and labeling the URL do not interfere with one another.
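
Because each action receives the result independently, one pipeline could in principle keep both a current copy and a history of each URL. The sketch below assumes the same kind may be listed twice, which this guide does not confirm. It reuses the archive backend name from the earlier example, and the directory names are illustrative:

```yaml
actions:
  # One current copy per URL, in the default blob storage.
  - kind: to_blob
    params:
      directory: current
      key_strategy: hash
  # Time-bucketed snapshots, in a separately configured backend.
  - kind: to_blob
    params:
      storage_identifier: archive   # assumed to be defined in the server configuration
      directory: history
      key_strategy: 5min
```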

Where results land

The to_blob action writes to whichever blob storage the deployment is configured with. In a local setup that may be a directory on disk; in production it is typically an object store such as S3. Which backends exist and how they are addressed (including the names you pass to storage_identifier) are set in the server configuration, not in the manifest. See Configuration for how blob storage backends are defined.
