Configuration

Every configuration section and its environment-variable mapping.

inndx is configured hierarchically: built-in defaults, then a config file, then environment variables (highest priority). This page documents what each setting does. Production wiring of these settings with secrets is part of enterprise onboarding; here we describe the knobs themselves.

The settings are grouped into the shared sections every service reads (General, Data stores, Secrets, Messaging and coordination, Runtime, Retention, and Observability) and the per-service settings that each service adds on top.

How configuration is loaded

Configuration is resolved in three layers, each overriding the one before:

Built-in defaults.
A configuration file, if one is given with the --config-file flag or the INNDX_CONFIG_FILE environment variable. The file may be YAML, TOML, or JSON; the format is detected from the file extension.
Environment variables prefixed with INNDX__, which override both the file and the defaults.

Run inndx config to print the fully resolved configuration.
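
As an illustration of the layering, here is a minimal sketch of a configuration file. The file name and the value 9000 are hypothetical; server.port, server.host, and their defaults come from the Server section of this page.

```yaml
# config.yaml (hypothetical file name), passed with --config-file
server:
  # Layer 2: overrides the built-in default of 8022.
  port: 9000
  # host is not set here, so the built-in default (0.0.0.0) applies.
# Layer 3: setting INNDX__SERVER__PORT in the environment would override 9000.
```

With this file and no INNDX__SERVER__PORT set, the resolved server.port would be expected to be 9000; inndx config can be used to confirm what was actually resolved.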

Several sections are tagged unions that select a backend with a kind field (for example the cache, broker, blob storage, and columnar store). For those, set kind and then the fields for that kind.
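
A minimal sketch of two tagged-union sections side by side. The kinds and field names come from the Cache and Broker sections of this page; the hostnames are placeholders.

```yaml
cache:
  kind: redis_cluster   # selects the backend
  urls:                 # field that belongs to the redis_cluster kind
    - redis://cache-1.example.com:6379
    - redis://cache-2.example.com:6379
broker:
  kind: memory          # this kind takes no further fields
```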

Environment variables

Environment variables map onto the nested configuration with a double-underscore separator: INNDX__SECTION__FIELD. For example, INNDX__SERVER__PORT=8022 sets the server.port value.

INNDX__SERVER__PORT=8022
INNDX__DATABASE__PRIMARY=postgresql://user:pass@db:5432/inndx
INNDX__COLUMNAR_STORE__KIND=duckdb
INNDX__CACHE__KIND=redis
INNDX__CACHE__URL=redis://localhost:6379

Nested values follow the same rule, with each level joined by double underscores:

Lists index their entries by position: append __0__, __1__, and so on. A field within an entry follows as another segment. For example, the first configured proxy's identifier is INNDX__FETCHER__PROXIES__0__IDENTIFIER, and the first address of that proxy is INNDX__FETCHER__PROXIES__0__ADDRESSES__0.
Maps use each entry's key as a path segment. A secondary blob store named archive, for example, is configured under INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__*:
```
INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__KIND=file
INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__DATA_DIR=./archive
```
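
For comparison, the variables used as examples above correspond to the following nesting in a YAML configuration file. The proxy values are borrowed from the Proxies section of this page; this is a sketch of the correspondence, not additional required configuration.

```yaml
fetcher:
  proxies:
    - identifier: myproxy            # INNDX__FETCHER__PROXIES__0__IDENTIFIER
      addresses:
        - proxy1.example.com:8080    # INNDX__FETCHER__PROXIES__0__ADDRESSES__0
blob_storage:
  secondary:
    archive:                         # the map key becomes a path segment
      kind: file                     # INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__KIND
      data_dir: ./archive            # INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__DATA_DIR
```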

A list-valued field may alternatively be set from a single variable holding a JSON array.

Duration values

Fields typed as a duration accept a human-readable string such as 30s, 5m, 1h, or a combination like 1h 30m. The same format is used everywhere a duration appears on this page.
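
A short sketch of duration values in a configuration file, using duration-typed fields documented further down this page. The values are illustrative only, not recommendations.

```yaml
fetcher:
  timeout: 30s
orchestrator:
  trigger_evaluate_interval: 5m
database:
  options:
    idle_timeout: 1h
    max_lifetime: "1h 30m"   # quoted for readability, since the value contains a space
```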

General

General identity settings, under the meta section.

Field	Type	Default	Description
`default_tenant_id`	string	`default`	The tenant used when a request or job does not specify one.
`env`	string	`development`	A label for the environment, used in telemetry and logging.

Data stores

Where inndx persists crawl state, analytics, and content.

Database

The relational database holding crawl state. The driver is selected by the URL scheme (sqlite://... or postgresql://...).

Field	Type	Default	Description
`primary`	string	`sqlite://database.db?mode=rwc`	The primary database connection URL.
`replicas`	list of strings	empty	Read-replica connection URLs.
`options`	database options	defaults	Connection-pool and logging options.

Database options

Field	Type	Default	Description
`max_connections`	integer	none	Maximum connections in the pool.
`min_connections`	integer	none	Minimum idle connections kept in the pool.
`connect_timeout`	duration	none	Maximum time to wait establishing a connection.
`idle_timeout`	duration	none	How long an idle connection is kept before being closed.
`acquire_timeout`	duration	none	Maximum time to wait acquiring a connection from the pool.
`max_lifetime`	duration	none	Maximum lifetime of a connection before it is recycled.
`enable_logging`	boolean	`false`	Whether to log database statements.
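
A minimal sketch of a PostgreSQL setup with one read replica and a few pool options. The replica hostname and the numeric values are hypothetical; the field names are those listed in the two tables above.

```yaml
database:
  primary: postgresql://user:pass@db:5432/inndx
  replicas:
    - postgresql://user:pass@db-replica-1:5432/inndx   # hypothetical replica host
  options:
    max_connections: 20     # illustrative value
    min_connections: 2      # illustrative value
    acquire_timeout: 5s
    enable_logging: false
```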

Columnar store

The columnar store holding analytics data (audit logs, crawl logs, metrics). Selected by kind.

Kind	Fields	Description
`duckdb`	`path` (string)	An embedded DuckDB store. The default, with `path` defaulting to `analytics.db`. Use `:memory:` for an in-memory store.
`clickhouse`	`url`, `username`, `password`, `database`	An external ClickHouse store.

columnar_store:
  kind: duckdb
  path: analytics.db
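
For the clickhouse kind, a sketch using the four fields listed in the table above might look as follows. The URL format, hostname, and database name are assumptions made for illustration; check what your ClickHouse deployment exposes.

```yaml
columnar_store:
  kind: clickhouse
  url: http://clickhouse.example.com:8123   # assumed URL format and host
  username: inndx
  password: <your-password>
  database: inndx                           # hypothetical database name
```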

Blob storage

Blob storage holds raw content and results. A deployment has one primary store and an optional map of named secondary stores; the storage.target on a to_blob action's identifier source selects a secondary store by name. Instead of naming a store configured here, a to_blob action can carry its own object storage connection inline. See Secrets for how that connection's credentials can be given as literal values or resolved from a configured secret backend.

Field	Type	Default	Description
`primary`	blob store	file `./data`	The default blob store.
`secondary`	map of name to blob store	empty	Additional named blob stores.

Blob store

A blob store is selected by kind.

Kind	Fields	Description
`memory`	none	An in-process store.
`file`	`data_dir` (string)	A local-filesystem store. The default, with `data_dir` defaulting to `./data`.
`s3`	`endpoint_url`, `bucket`, `access_key`, `secret_key`, `region`, `force_path_style`, `multipart_threshold`, `delete_batch_size`	An S3-compatible store. Only `endpoint_url` and `bucket` are required.

blob_storage:
  primary:
    kind: s3
    endpoint_url: https://s3.example.com
    bucket: inndx
    access_key: <your-access-key>
    secret_key: <your-secret-key>
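
Secondary stores sit next to the primary one as a map keyed by name. The sketch below keeps the default file store as primary and adds two secondary stores; the store names exports and scratch and the bucket name are hypothetical.

```yaml
blob_storage:
  primary:
    kind: file
    data_dir: ./data
  secondary:
    exports:                   # hypothetical store name
      kind: s3
      endpoint_url: https://s3.example.com
      bucket: inndx-exports    # hypothetical bucket name
      access_key: <your-access-key>
      secret_key: <your-secret-key>
    scratch:                   # hypothetical store name
      kind: memory
```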

Secrets

Some component parameters carry connection details directly (an access key, a token, a password) rather than naming something the deployment has already configured. Any such field accepts either a literal value or a reference to a secret in a configured secret backend. A reference is resolved only at the moment the field's value is needed, and the resolved value is never written back to the deployment's own database. A reference is an object with a ref field naming the secret and an optional backend field naming the configured backend to resolve it from:

access_key: { ref: s3-access-key }
secret_key: { ref: s3-secret-key, backend: vault_primary }

When backend is omitted, resolution uses the backend named env, which always exists and needs no configuration: it reads the secret's name as an environment variable on the machine running the service. Additional backends are configured under secrets.backends, a map from operator-chosen names to backend definitions, each selected by kind. A reference's backend field names one of these map keys, not a kind, so a deployment can run more than one backend of the same kind (for example, two separate HashiCorp Vault servers) by giving each one a different key.

Field	Type	Default	Description
`backends`	map of name to secret backend	empty	Additional, named secret backends. The backend named `env` is always implicitly available and cannot be redefined here.
`cache_ttl_seconds`	integer	`30`	How long a resolved secret value is kept in this service's own process memory before a repeated reference is resolved again. This cache is never shared between processes and is never written to disk or to any other service.

Secret backend

A secret backend is selected by kind.

Kind	Fields	Description
`vault`	`address`, `token`, `mount`	A HashiCorp Vault server's KV version 2 secrets engine.

secrets:
  backends:
    vault_primary:
      kind: vault
      address: https://vault.example.com
      token: <your-vault-token>
      mount: secret
  cache_ttl_seconds: 30

A secret name for the vault backend is a path into that mount, optionally followed by #field_name when the secret stored at that path has more than one field. For example, database/creds#password reads the password field of the secret at database/creds, while a path with no #field_name reads that secret's only field.
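
The fragment below sketches how such references might look on fields that accept them. It is not a complete configuration section: the secret paths and the environment variable name are hypothetical, and vault_primary is the backend key from the example above. The references are quoted so that the # cannot be mistaken for the start of a YAML comment.

```yaml
# Multi-field secret at storage/s3: pick one field with #field_name.
access_key: { ref: "storage/s3#access_key", backend: vault_primary }
secret_key: { ref: "storage/s3#secret_key", backend: vault_primary }
# Single-field secret: the path alone is enough.
password: { ref: "proxy/password", backend: vault_primary }
# No backend given: the built-in env backend reads the API_TOKEN environment variable.
token: { ref: API_TOKEN }
```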

Messaging and coordination

How services share state and discover one another. In single-process dev mode the in-process defaults suffice; a distributed deployment points these at shared backends.

Cache

The cache shared across services. Selected by kind.

Kind	Fields	Description
`memory`	`max_capacity` (integer)	An in-process cache. The default, with `max_capacity` defaulting to 1000000.
`redis`	`url` (string)	A Redis cache.
`redis_cluster`	`urls` (list of strings)	A Redis Cluster cache.

cache:
  kind: redis
  url: redis://localhost:6379

Broker

The message broker carrying work between services. Selected by kind.

Kind	Fields	Description
`memory`	none	An in-process broker. The default; suitable only for single-process mode.
`kafka`	`bootstrap_servers` (string)	A Kafka broker. `bootstrap_servers` is a comma-separated list.

broker:
  kind: kafka
  bootstrap_servers: localhost:9092
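
Since bootstrap_servers is a comma-separated list, a multi-broker cluster can be sketched as a single string. The hostnames are placeholders.

```yaml
broker:
  kind: kafka
  bootstrap_servers: kafka-1.example.com:9092,kafka-2.example.com:9092
```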

Service registry

The registry services use to discover one another. Selected by kind, with the same options as the cache: memory (default), redis (url), and redis_cluster (urls).

service_registry:
  kind: redis
  url: redis://localhost:6379

Runtime

How each service process binds and schedules its own work.

Server

The HTTP server each service binds.

Field	Type	Default	Description
`host`	string	`0.0.0.0`	The address the HTTP server binds to.
`port`	integer	`8022`	The port the HTTP server binds to.

Task queue

The in-process worker queue each service uses to run its work.

Field	Type	Default	Description
`worker_count`	integer	`4`	Number of worker tasks.
`max_queue_capacity`	integer	`256`	Maximum number of queued tasks before backpressure applies.
`batch_size`	integer	`16`	Number of tasks processed per batch.
`batch_timeout_ms`	integer	`10`	Maximum time, in milliseconds, to wait filling a batch.

Retention

Background, age-based deletion of stale data, so that storage does not grow without bound.

Field	Type	Default	Description
`enabled`	boolean	`true`	Whether retention deletion runs. When set to `false`, all retention deletion is disabled and every resource is kept forever, regardless of any `ttl_seconds` configured.
`sweep_interval_seconds`	integer	`300`	How often retention is checked.
`resources`	resources	see below	Per-resource retention settings.

Retention resources

Each entry below is independently configured under retention.resources.

Field	Type	Default	Description
`ttl_seconds`	integer or none	resource-dependent, see below	How old data must be before it is deleted. Omitted (the default for every resource except `outbox_events`) means this resource is kept forever.
`batch_size`	integer	`500`	Maximum number of rows deleted at a time for this resource.

The configurable resources are:

Resource	Default `ttl_seconds`	Description
`outbox_events`	`86400` (1 day)	Internal event delivery records. The only resource kept for a limited time by default; every other resource below is kept forever until configured otherwise.
`crawl_job_runs`	none	Completed crawl job runs. Deleting a run also deletes everything that belongs only to that run (its visited links, link graph, URL queue, and artifacts).
`trigger_executions`	none	Crawl job trigger execution history. Execution records for a trigger that has been deleted are removed regardless of age; this setting controls how long execution history is kept for triggers that still exist.
`fingerprints`	none	Content fingerprints used for duplicate detection. Content that is still being seen is never deleted, regardless of how long ago it was first seen.
`audit_log`	none	Audit log entries.
`crawl_log`	none	Crawl log entries. Deleting a crawl job run also immediately deletes that run's crawl log entries; this setting controls how long crawl log entries are kept otherwise.

A small number of global, deduplicated resources (URL hosts, URL links, and their labels) are not configurable here and are always kept, since they may be shared across many crawl job runs.

retention:
  enabled: true
  sweep_interval_seconds: 300
  resources:
    crawl_job_runs:
      ttl_seconds: 2592000 # 30 days
      batch_size: 100
    trigger_executions:
      ttl_seconds: 7776000 # 90 days

Observability

Tracing, metrics, and error reporting.

Telemetry

OpenTelemetry export of traces and metrics.

Field	Type	Default	Description
`enabled`	boolean	`false`	Whether telemetry export is enabled.
`endpoint`	string	none	The OTLP endpoint to export to.

Sentry

Error reporting to Sentry.

Field	Type	Default	Description
`enabled`	boolean	`false`	Whether Sentry reporting is enabled.
`dsn`	string	empty	The Sentry DSN.

Per-service settings

Each service has its own configuration section (orchestrator, fetcher, parser, sink, analytics) holding its component groups and a few operational fields alongside them. Every component is an object with an enabled flag in addition to its own settings, so any component can be turned on or off. What each component does and its per-job parameters live in the component reference; this page covers which components each service has and the operational fields next to them.

orchestrator

orchestrator:
  filters:
    robots_txt:
      enabled: false
  stopping_criteria:
    max_urls:
      enabled: true
  start_stop_interval: 10s
  run_evaluate_concurrency: 8

The component groups under orchestrator are listed below. Each group holds the components shown; every component is an object with an enabled flag and is enabled by default. See the linked reference pages for what each does and its per-job parameters.

Group	Components
`filters` (reference)	`max_depth`, `regex_patterns`, `url_patterns`, `budget`, `recrawl`, `robots_txt`, `url_labels`, `host_labels`, `interleave`
`mutators` (reference)	`sanitize`
`policies` (reference)	`recrawl`, `robots_txt`
`rankers` (reference)	`depth`, `breadth`
`seeds` (reference)	`sitemap`, `host_labels`, `host_labels_sitemap`
`starting_criteria` (reference)	`exclusive`
`stopping_criteria` (reference)	`max_age`, `max_depth`, `max_empty_evaluations`, `max_urls`

The scheduling fields directly under orchestrator:

Field	Type	Default	Description
`start_stop_interval`	duration	`10s`	How often the orchestrator checks for runs to start or stop.
`run_evaluate_interval`	duration	`5s`	How often a run is evaluated to schedule more work.
`run_evaluate_concurrency`	integer	`8`	How many runs are evaluated concurrently.
`trigger_evaluate_interval`	duration	`15s`	How often triggers are checked.
`trigger_execution_interval`	duration	`10s`	How often due triggers are executed.

fetcher

The fetcher section configures the browser clients and any proxies, in addition to enabling clients and middleware. The browser-client settings are the operational counterparts of the per-job client parameters described on Clients.

fetcher:
  clients:
    cdp:
      enabled: true
      headless: true
  middleware:
    set_language:
      enabled: false
  timeout: 30s
  concurrency: 10

Field	Type	Default	Description
`clients`	see Clients		Client-specific settings.
`middleware`	see Middleware		Which middleware are enabled.
`timeout`	duration	`30s`	Default fetch timeout.
`concurrency`	integer	`10`	Maximum number of in-flight sessions across all clients.
`proxies`	list of proxies	empty	Proxy server configurations.

Proxies

The proxies list configures the connection to the proxy servers:

Field	Type	Default	Description
`identifier`	string	none	The proxy identifier, used in client parameters to select this proxy.
`addresses`	proxy addresses	none	The proxy server addresses.
`scheme`	string	none	The proxy scheme: one of `http`, `https`, `socks5`, or `socks5h`.
`username`	string	none	The proxy username, if required.
`password`	string	none	The proxy password, if required.

The proxy addresses can be a list of explicit hosts and ports:

fetcher:
  proxies:
    - identifier: myproxy
      addresses:
        - proxy1.example.com:8080
        - proxy2.example.com:8080
      scheme: http
      username: user
      password: pass

Or they can be given as a template that is expanded into hosts and ports:

fetcher:
  proxies:
    - identifier: myproxy
      addresses:
        host:
          - proxy1.example.com
          - proxy2.example.com
        port:
          start: 8080
          end: 8085
      scheme: http
      username: user
      password: pass

In the template form, host under addresses is either a single host or a list of hosts, and port is a single port, a list of ports, or a range defined with start and end.
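
The other template forms can be sketched the same way: a single host with a list of ports, and a list of hosts with a single port. The identifiers and hostnames are hypothetical, and username and password are left out on the assumption that these proxies do not require them.

```yaml
fetcher:
  proxies:
    - identifier: single-host      # hypothetical identifier
      addresses:
        host: proxy.example.com    # a single host
        port:                      # a list of ports
          - 8080
          - 8081
      scheme: socks5
    - identifier: single-port      # hypothetical identifier
      addresses:
        host:                      # a list of hosts
          - proxy1.example.com
          - proxy2.example.com
        port: 8080                 # a single port
      scheme: http
```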

Clients

The clients section configures the available fetcher clients. Each client has its own settings and, like other components, an enabled flag to turn it on or off. All clients besides standard are disabled by default.

The standard (HTTP) client has no startup settings of its own. The browser clients are below.

`playwright` client

Field	Type	Default	Description
`browsers`	list of strings	`[chrome]`	The browser engines to launch. Each is one of `inndx`, `chrome`, `firefox`, `safari`, or `random`.

`cdp` client

Field	Type	Default	Description
`max_concurrency`	integer	`10`	Maximum concurrent browser pages.
`headless`	boolean	`true`	Whether the browser runs headless.
`chrome_executable`	string	none	Path to a Chrome executable, if not the bundled one.
`ignore_certificate_errors`	boolean	`false`	Whether to ignore TLS certificate errors.
`disable_gpu`	boolean	`true`	Whether to disable GPU acceleration.
`disable_quic`	boolean	`false`	Whether to disable the QUIC protocol.
`args`	list of strings	empty	Extra command-line arguments passed to the browser.

`stealth_cdp` client

The stealth_cdp client has the same startup settings as the cdp client.

Field	Type	Default	Description
`max_concurrency`	integer	`10`	Maximum concurrent browser pages.
`headless`	boolean	`true`	Whether the browser runs headless.
`chrome_executable`	string	none	Path to a Chrome executable, if not the bundled one.
`ignore_certificate_errors`	boolean	`false`	Whether to ignore TLS certificate errors.
`disable_gpu`	boolean	`true`	Whether to disable GPU acceleration.
`disable_quic`	boolean	`false`	Whether to disable the QUIC protocol.
`args`	list of strings	empty	Extra command-line arguments passed to the browser.

`remote_cdp` client

Field	Type	Default	Description
`max_concurrency`	integer	`10`	Maximum concurrent pages across all remote browsers.
`browsers`	list of remote browsers	empty	The remote browsers to connect to.

`remote_stealth_cdp` client

The remote_stealth_cdp client has the same startup settings as the remote_cdp client.

Field	Type	Default	Description
`max_concurrency`	integer	`10`	Maximum concurrent pages across all remote browsers.
`browsers`	list of remote browsers	empty	The remote browsers to connect to.

Remote browsers

The browsers list on the remote_cdp and remote_stealth_cdp clients configures the connection to each remote browser:

Field	Type	Default	Description
`identifier`	string	none	The browser identifier, used in client parameters to select this browser.
`endpoints`	list of strings	none	The browser's debugging protocol endpoints.
`scope`	string	`session`	The connection scope: `session` keeps a single connection for the duration of the runtime, while `request` opens a new connection for each request.
`ignore_https_errors`	boolean	`false`	Whether to ignore TLS certificate errors.
`cache_enabled`	boolean	`true`	Whether to enable the browser cache.
`resolve_ip`	boolean	`true`	Whether to resolve the endpoint hosts before connecting; if false, the client connects to the endpoints as given.
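
A minimal sketch of a remote_cdp client with one remote browser. The identifier, hostname, and endpoint URL format are assumptions made for illustration; the field names are those in the tables above.

```yaml
fetcher:
  clients:
    remote_cdp:
      enabled: true                # clients other than standard are disabled by default
      max_concurrency: 10
      browsers:
        - identifier: mybrowser    # hypothetical identifier
          endpoints:
            - ws://browser.example.com:9222   # assumed endpoint format
          scope: request           # open a new connection per request
          cache_enabled: false
```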

Middleware

The middleware section enables or disables request and response middleware. Each is an object with an enabled flag, and all are enabled by default. See Middleware for what each does and its per-job parameters.

Component	Description
`mutate_path`	Rewrites the request path.
`mutate_method`	Rewrites the request method.
`mutate_headers`	Adds or rewrites request or response headers.
`set_language`	Sets the request's language headers.

parser

parser:
  conditions:
    fingerprint:
      enabled: false
  concurrency: 10

The component groups under parser are listed below. Each group holds the components shown; every component is an object with an enabled flag and is enabled by default. See the linked reference pages for what each does and its per-job parameters.

Group	Components
`conditions`	`expression`, `regex_patterns`, `url_patterns`, `heuristics`, `fingerprint`, `host_labels`, `url_labels`
`heuristics` (reference)	`regex_patterns`, `url_patterns`, `host_pattern`, `url_path_segment`, `url_path_keyword`, `open_graph`, `json_ld`, `text_link_density`, `content_density`
`extractors` (reference)	`markdown`, `data_map`, `host_data_map`
`resolvers` (reference)	`xpath`
`navigators` (reference)	`anchor`
`guards` (reference)	`expression`
`feature_sets`	`selectors`, `data_map`, `host_data_map`

The heuristics group lists the heuristic kinds available to the heuristics condition, and feature_sets defines the feature sets used by fingerprint conditions and the data-map extractors.

The field directly under parser:

Field	Type	Default	Description
`concurrency`	integer	`10`	How many tasks the parser processes concurrently.

sink

sink:
  actions:
    to_file:
      enabled: true
      directory: ./output
  concurrency: 10

The actions group under sink enables or disables the delivery actions. Each is an object with an enabled flag, and all are enabled by default. See Actions for what each does and its per-job parameters.

Component	Description
`to_blob`	Writes each result to a blob store.
`to_file`	Writes each result to a local directory. Also takes a `directory` field setting where results are written.
`label_url`	Applies labels to the result's URL.

The field directly under sink:

Field	Type	Default	Description
`concurrency`	integer	`10`	How many results the sink delivers concurrently.

analytics

The analytics service currently has no service-specific components or fields beyond the shared sections.