Configuration

Every configuration section and its environment-variable mapping.

inndx is configured hierarchically: built-in defaults, then a config file, then environment variables (highest priority). This page documents what each setting does. Production wiring of these settings with secrets is part of enterprise onboarding; here we describe the knobs themselves.

How configuration is loaded

Configuration is resolved in three layers, each overriding the one before:

  1. Built-in defaults.
  2. A configuration file, if one is given with the --config-file flag or the INNDX_CONFIG_FILE environment variable. The file may be YAML, TOML, or JSON; the format is detected from the file extension.
  3. Environment variables prefixed with INNDX__ (note the double underscore), which override the file and defaults.

Environment variables map onto the nested configuration with a double-underscore separator: INNDX__SECTION__FIELD. For example, INNDX__SERVER__PORT=8022 sets the server.port value. Run inndx config to print the fully resolved configuration.
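
As a rough illustration of how the three layers combine, consider the sketch below. The filename config.yaml and the host value are hypothetical; the section and field names come from the Server table on this page.

```yaml
# config.yaml (hypothetical filename), passed with --config-file config.yaml
# or by setting INNDX_CONFIG_FILE=config.yaml.
server:
  host: 127.0.0.1   # overrides the built-in default 0.0.0.0
  port: 8022        # same value as the built-in default

# If INNDX__SERVER__PORT=9000 is also set in the environment, the resolved
# server.port is 9000, because environment variables take priority over the file.
```

Running inndx config afterwards should print the merged result, which is a quick way to confirm which layer won.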

Several sections are tagged unions that select a backend with a kind field (for example the cache, broker, blob storage, and columnar store). For those, set kind and then the fields for that kind.

Meta

General identity settings.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| default_tenant_id | string | default | The tenant used when a request or job does not specify one. |
| env | string | development | A label for the environment, used in telemetry and logging. |
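
A minimal sketch of these settings in a YAML config file. The top-level key meta is an assumption inferred from the section heading, and the tenant name is hypothetical; confirm the key names against the output of inndx config.

```yaml
meta:                          # section key assumed from the heading
  default_tenant_id: acme      # hypothetical tenant name
  env: staging                 # free-form label used in telemetry and logging
```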

Server

The HTTP server each service binds.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| host | string | 0.0.0.0 | The address the HTTP server binds to. |
| port | integer | 8022 | The port the HTTP server binds to. |

Database

The relational database holding crawl state. The driver is selected by the URL scheme (sqlite://... or postgresql://...).

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| primary | string | sqlite://database.db?mode=rwc | The primary database connection URL. |
| replicas | list of strings | empty | Read-replica connection URLs. |
| options | database options | defaults | Connection-pool and logging options. |

Database options

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_connections | integer | none | Maximum connections in the pool. |
| min_connections | integer | none | Minimum idle connections kept in the pool. |
| connect_timeout | duration | none | Maximum time to wait establishing a connection. |
| idle_timeout | duration | none | How long an idle connection is kept before being closed. |
| acquire_timeout | duration | none | Maximum time to wait acquiring a connection from the pool. |
| max_lifetime | duration | none | Maximum lifetime of a connection before it is recycled. |
| enable_logging | boolean | false | Whether to log database statements. |
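
Putting the database fields and options together, a PostgreSQL setup with one read replica might look like the sketch below. The host names and pool sizes are illustrative, and the duration syntax (such as 10s) is assumed to match the defaults shown for the orchestrator elsewhere on this page.

```yaml
database:
  primary: postgresql://user:pass@db:5432/inndx
  replicas:
    - postgresql://user:pass@db-replica-1:5432/inndx   # hypothetical replica host
  options:
    max_connections: 20      # illustrative pool size
    min_connections: 2
    connect_timeout: 10s     # duration format assumed
    acquire_timeout: 5s
    enable_logging: false
```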

Columnar store

The columnar store holding analytics data (audit logs, crawl logs, metrics). Selected by kind.

| Kind | Fields | Description |
| --- | --- | --- |
| duckdb | path (string) | An embedded DuckDB store. The default, with path defaulting to analytics.db. Use :memory: for an in-memory store. |
| clickhouse | url, username, password, database | An external ClickHouse store. |

columnar_store:
  kind: duckdb
  path: analytics.db
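
For comparison, a ClickHouse-backed store could be sketched as follows. The field names come from the table above; the URL, credentials, and database name are placeholders, and this page does not state which URL scheme or port the ClickHouse backend expects.

```yaml
columnar_store:
  kind: clickhouse
  url: http://localhost:8123    # illustrative; 8123 is ClickHouse's default HTTP port
  username: <your-username>
  password: <your-password>
  database: inndx               # illustrative database name
```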

Cache

The cache shared across services. Selected by kind.

| Kind | Fields | Description |
| --- | --- | --- |
| memory | max_capacity (integer) | An in-process cache. The default, with max_capacity defaulting to 1000000. |
| redis | url (string) | A Redis cache. |
| redis_cluster | urls (list of strings) | A Redis Cluster cache. |

cache:
  kind: redis
  url: redis://localhost:6379
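
The redis_cluster kind takes a list rather than a single URL. A sketch with three hypothetical node addresses:

```yaml
cache:
  kind: redis_cluster
  urls:                          # hypothetical cluster nodes
    - redis://redis-node-1:6379
    - redis://redis-node-2:6379
    - redis://redis-node-3:6379
```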

Broker

The message broker carrying work between services. Selected by kind.

| Kind | Fields | Description |
| --- | --- | --- |
| memory | none | An in-process broker. The default; suitable only for single-process mode. |
| kafka | bootstrap_servers (string) | A Kafka broker. bootstrap_servers is a comma-separated list. |

broker:
  kind: kafka
  bootstrap_servers: localhost:9092
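
Because bootstrap_servers is a single comma-separated string rather than a YAML list, a multi-broker setup would be written on one line. The host names below are hypothetical.

```yaml
broker:
  kind: kafka
  # One string, comma-separated; not a YAML sequence.
  bootstrap_servers: kafka-1:9092,kafka-2:9092,kafka-3:9092
```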

Service registry

The registry services use to discover one another. Selected by kind, with the same options as the cache: memory (default), redis (url), and redis_cluster (urls).

service_registry:
  kind: redis
  url: redis://localhost:6379

Blob storage

Blob storage holds raw content and results. A deployment has one primary store and an optional map of named secondary stores; the storage_identifier on a to_blob action selects a secondary store by name.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| primary | blob store | file, with data_dir ./data | The default blob store. |
| secondary | map of name to blob store | empty | Additional named blob stores. |

Blob store

A blob store is selected by kind.

| Kind | Fields | Description |
| --- | --- | --- |
| memory | none | An in-process store. |
| file | data_dir (string) | A local-filesystem store. The default, with data_dir defaulting to ./data. |
| s3 | endpoint_url, bucket, access_key, secret_key, region, force_path_style, multipart_threshold, delete_batch_size | An S3-compatible store. Only endpoint_url and bucket are required. |

blob_storage:
  primary:
    kind: s3
    endpoint_url: https://s3.example.com
    bucket: inndx
    access_key: <your-access-key>
    secret_key: <your-secret-key>

Task queue

The in-process worker queue each service uses to run its work.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| worker_count | integer | 4 | Number of worker tasks. |
| max_queue_capacity | integer | 256 | Maximum number of queued tasks before backpressure applies. |
| batch_size | integer | 16 | Number of tasks processed per batch. |
| batch_timeout_ms | integer | 10 | Maximum time, in milliseconds, to wait filling a batch. |
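
A sketch of a tuned task queue for a larger host. The top-level key task_queue is an assumption inferred from the section heading, and the values are illustrative rather than recommendations.

```yaml
task_queue:                # section key assumed from the heading
  worker_count: 8          # default is 4
  max_queue_capacity: 512  # default is 256; backpressure applies beyond this
  batch_size: 32           # default is 16
  batch_timeout_ms: 20     # default is 10; plain integer, in milliseconds
```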

Telemetry

OpenTelemetry export of traces and metrics.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Whether telemetry export is enabled. |
| endpoint | string | none | The OTLP endpoint to export to. |
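
A minimal sketch that turns on export to a local collector. The top-level key telemetry is assumed from the section heading, and this page does not say whether the exporter speaks OTLP over gRPC or HTTP, so treat the endpoint as a placeholder.

```yaml
telemetry:                          # section key assumed from the heading
  enabled: true
  endpoint: http://localhost:4317   # 4317 is the conventional OTLP/gRPC port; adjust for your collector
```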

Sentry

Error reporting to Sentry.

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| enabled | boolean | false | Whether Sentry reporting is enabled. |
| dsn | string | empty | The Sentry DSN. |
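
Enabling Sentry would then look roughly like this, assuming the section key is sentry as the heading suggests. The DSN is a placeholder.

```yaml
sentry:                    # section key assumed from the heading
  enabled: true
  dsn: <your-sentry-dsn>
```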

Per-service settings

Each service has its own configuration section (orchestrator, fetcher, parser, sink, analytics) holding two things: which components it has enabled, and operational tuning.

Every component a service offers can be turned on or off. Each component group (such as orchestrator.filters or parser.conditions) lists its components, and each is an object with an enabled flag:

orchestrator:
  filters:
    robots_txt:
      enabled: false

The operational tuning per service is below.

orchestrator

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| start_stop_interval | duration | 10s | How often the orchestrator checks for runs to start or stop. |
| run_evaluate_interval | duration | 5s | How often a run is evaluated to schedule more work. |
| run_evaluate_concurrency | integer | 8 | How many runs are evaluated concurrently. |
| trigger_evaluate_interval | duration | 15s | How often triggers are checked. |
| trigger_execution_interval | duration | 10s | How often due triggers are executed. |
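
As a sketch, a deployment that prefers less frequent polling could relax the intervals as follows. This assumes the tuning fields sit directly under the orchestrator section, alongside component groups such as filters; the values are illustrative.

```yaml
orchestrator:
  start_stop_interval: 30s         # default 10s
  run_evaluate_interval: 10s       # default 5s
  run_evaluate_concurrency: 4      # default 8
  trigger_evaluate_interval: 30s   # default 15s
  trigger_execution_interval: 20s  # default 10s
```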

fetcher

The fetcher section configures the browser clients and any proxies, in addition to enabling clients and middleware. The browser-client settings are operational counterparts to the per-job client params on Clients.

The cdp client settings:

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| max_concurrency | integer | 10 | Maximum concurrent browser pages. |
| headless | boolean | true | Whether the browser runs headless. |
| chrome_executable | string | none | Path to a Chrome executable, if not the bundled one. |
| ignore_certificate_errors | boolean | false | Whether to ignore TLS certificate errors. |
| disable_gpu | boolean | true | Whether to disable GPU acceleration. |
| disable_quic | boolean | false | Whether to disable the QUIC protocol. |
| args | list of strings | empty | Extra command-line arguments passed to the browser. |

The remote_cdp client connects to external browsers. Its settings carry a max_concurrency (default 10) and a list of browsers, each with an identifier, a list of endpoints, a connection scope, and flags such as ignore_https_errors, cache_enabled, and resolve_ip.

Proxies are configured as a list under the fetcher section, each with an identifier, addresses, a scheme, and optional username and password.
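
The proxy description above might translate into YAML along these lines. This is only a sketch: the key names (proxies, identifier, addresses, scheme, username, password) are inferred from the prose rather than confirmed, and the address format is a guess, so check the resolved output of inndx config before relying on it.

```yaml
fetcher:
  proxies:                         # key name assumed from the description
    - identifier: datacenter-1     # hypothetical proxy name
      scheme: http                 # assumed value
      addresses:
        - proxy.example.com:3128   # hypothetical address; format assumed
      username: <your-username>    # optional
      password: <your-password>    # optional
```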

parser

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| concurrency | integer | 10 | How many pages the parser processes concurrently. |

sink

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| concurrency | integer | 10 | How many results the sink delivers concurrently. |

The to_file action also takes a directory here, setting where to_file writes results:

sink:
  actions:
    to_file:
      enabled: true
      directory: ./output

analytics

The analytics service currently has no service-specific tuning beyond the shared sections.

Environment variable mapping

Every field maps to an environment variable named INNDX__ followed by the section and field, joined by double underscores. Some examples:

INNDX__SERVER__PORT=8022
INNDX__DATABASE__PRIMARY=postgresql://user:pass@db:5432/inndx
INNDX__COLUMNAR_STORE__KIND=duckdb
INNDX__CACHE__KIND=redis
INNDX__CACHE__URL=redis://localhost:6379
INNDX__BROKER__KIND=kafka
INNDX__BROKER__BOOTSTRAP_SERVERS=localhost:9092
INNDX__SERVICE_REGISTRY__KIND=redis
INNDX__SERVICE_REGISTRY__URL=redis://localhost:6379
INNDX__BLOB_STORAGE__PRIMARY__KIND=file
INNDX__BLOB_STORAGE__PRIMARY__DATA_DIR=./data

Named entries in a map use their key as another path segment. A secondary blob store named archive, for example, is configured as:

INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__KIND=file
INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__DATA_DIR=./archive
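
For reference, the same archive store expressed in a YAML config file would look like the sketch below, with the map key archive appearing as one more level of nesting under secondary.

```yaml
blob_storage:
  secondary:
    archive:              # the map key; becomes the ARCHIVE path segment in the variable name
      kind: file
      data_dir: ./archive
```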
