Deliver results

Deliver parsed crawl results to blob storage or the local filesystem, and write labels back onto URLs.

The sink stage decides what happens to each parsed result. This guide covers the available delivery actions and how to combine them. For the full action catalog, see Sink actions.

Actions are configured per pipeline, under config.pipelines[].actions. Each entry is an object with a kind and optional params. Every result a pipeline produces is handed to its actions.
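
As a rough sketch of where this sits in a manifest, the fragment below nests a single to_blob action under one pipeline entry. Only the config.pipelines[].actions path and the kind/params shape come from this guide; any other fields a pipeline needs are left out here rather than guessed:

```yaml
config:
  pipelines:
    # One pipeline entry; its other fields are omitted in this sketch.
    - actions:
        # Each action is an object with a kind and optional params.
        - kind: to_blob
          params:
            directory: output
```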

Writing to blob storage

The to_blob action writes each result to blob storage. It is the main way to collect crawl output:

actions:
  - kind: to_blob
    params:
      directory: output
      key_strategy: hash
      include_assets: true

The directory prefixes every key written, so all of a job's output is grouped under one path. Each result is stored at <directory>/<key>/data.

The key_strategy decides the per-result key:

hash keys each result by a hash of its URL. A given URL always maps to the same location, so a later crawl of the same URL overwrites the earlier copy. Use it when you want one current copy per URL.
5min keys each result under a timestamp bucket, YYYYMMDD/HH_MM/<url-hash>, rounded to five-minute windows. The same URL crawled at different times lands in different buckets, so history is preserved. Use it when you want snapshots over time rather than a single current copy.
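
For comparison with the hash example earlier in this section, here is a minimal sketch of the same action keyed by time bucket. The comments restate the layout described in this section using its own placeholders; no concrete timestamp is shown, because the exact rounding behavior is not spelled out here:

```yaml
actions:
  - kind: to_blob
    params:
      directory: output
      # Snapshot per crawl window instead of one current copy per URL.
      key_strategy: 5min
      # Each result is then stored at:
      #   output/YYYYMMDD/HH_MM/<url-hash>/data
```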

Set include_assets: true to also write assets resolved during parsing (such as images pulled in by an asset_resolver step). Assets are stored alongside the result under <directory>/<key>/assets/<asset-type>/<asset-hash>.

By default the action writes to the deployment's default blob storage. To target a specific configured backend instead, name it under storage:

actions:
  - kind: to_blob
    params:
      storage:
        type: identifier
        target: archive
      directory: output

A to_blob action can also carry its own object storage connection instead of naming a backend the deployment already configured. To do so, give storage a type of connection rather than identifier. This is useful when different crawl jobs need to write to different customer-owned buckets that were never set up on the deployment ahead of time:

actions:
  - kind: to_blob
    params:
      directory: output
      storage:
        type: connection
        endpoint_url: https://s3.us-east-1.amazonaws.com
        bucket: customer-bucket
        access_key: { ref: s3-access-key }
        secret_key: { ref: s3-secret-key, backend: vault_primary }

access_key and secret_key each accept either a literal string or a reference like the ones shown above. A reference is resolved from a configured secret backend (such as a HashiCorp Vault server or the deployment's own environment variables) only when the blob is written, and the resolved value is never written to the deployment's database. See Secrets for how secret backends are configured, and Storage source for the full storage field reference. A storage field is either an identifier or a connection, never both; omitting it keeps the default behavior of writing to the deployment's default backend.
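
To illustrate the two accepted forms side by side, the sketch below gives access_key as a literal string and secret_key as a reference. The literal value is a made-up placeholder, not a real credential, and the other fields are copied from the connection example above:

```yaml
actions:
  - kind: to_blob
    params:
      directory: output
      storage:
        type: connection
        endpoint_url: https://s3.us-east-1.amazonaws.com
        bucket: customer-bucket
        # Literal string form (hypothetical placeholder value).
        access_key: EXAMPLE_ACCESS_KEY_ID
        # Reference form, resolved from a secret backend when the blob is written.
        secret_key: { ref: s3-secret-key, backend: vault_primary }
```

A reference is usually the safer choice for secret_key, since a literal string is stored as part of the manifest itself.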

Writing a result emits a BlobSaved log event (and AssetBlobSaved per asset), which you can follow to confirm delivery.

Writing to the local filesystem

The to_file action writes results to a directory on the machine running the sink, using the same data and assets layout as to_blob. Its only parameter is include_assets; the destination directory is set in the sink's configuration, not in the manifest:

actions:
  - kind: to_file
    params:
      include_assets: true

Use to_file for local evaluation and quick inspection, where reaching into a directory is more convenient than a blob backend.

Labeling URLs

The label_url action writes key-value labels back onto the URL record for each result. Those labels persist on the URL and can be read later by label-based filters and conditions, which lets one pass of a crawl steer a later one:

actions:
  - kind: label_url
    params:
      labels:
        category: product
        extracted: "true"

For how labels are consumed, see the url_labels filter in Seed and filter URLs.

Logging for debugging

The log action emits a log line for each result instead of delivering it anywhere. It is a development aid for confirming a pipeline produces results and seeing them flow:

actions:
  - kind: log
    params:
      level: info
      event: ResultProduced

level sets the log level (trace, debug, info, warn, or error), and event sets the event name on the emitted line so you can grep for it. Once a crawl is working, remove the action or lower its level.

Running multiple actions

A pipeline can list several actions. Each result is passed to every action in turn, in the order listed, so you can deliver to more than one target at once. A common combination writes results to blob storage, tags the URL, and logs during development:

actions:
  - kind: to_blob
    params:
      directory: output
      key_strategy: hash
  - kind: label_url
    params:
      labels:
        delivered: "true"
  - kind: log
    params:
      level: debug
      event: Delivered

The result content is made available to each action independently, so writing to blob storage and labeling the URL do not interfere with one another.
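
One possible use of this, sketched under the assumption that a pipeline may list the same kind of action more than once (this guide does not state that explicitly), is to keep one current copy per URL and a time-bucketed history side by side. The directory names latest and history are illustrative only:

```yaml
actions:
  # One current copy per URL, overwritten on each later crawl.
  - kind: to_blob
    params:
      directory: latest
      key_strategy: hash
  # Time-bucketed snapshots of the same results.
  - kind: to_blob
    params:
      directory: history
      key_strategy: 5min
```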

Where results land

By default, the to_blob action writes to the blob storage the deployment is configured with. If its storage is a connection, it writes directly to the object store that connection names instead. In a local setup the default backend may be a directory on disk; in production it is typically an object store such as S3. Which backends exist and how they are addressed (including the names you pass to storage.target) is set in the server configuration, not in the manifest. See Configuration for how blob storage backends and secret backends are defined.