Configuration
Every configuration section and its environment-variable mapping.
inndx is configured hierarchically: built-in defaults, then a config file, then environment variables (highest priority). This page documents what each setting does. Production wiring of these settings with secrets is part of enterprise onboarding; here we describe the knobs themselves.
How configuration is loaded
Configuration is resolved in three layers, each overriding the one before:
- Built-in defaults.
- A configuration file, if one is given with the --config-file flag or the INNDX_CONFIG_FILE environment variable. The file may be YAML, TOML, or JSON; the format is detected from the file extension.
- Environment variables prefixed with INNDX, which override the file and defaults.
Environment variables map onto the nested configuration with a double-underscore separator: INNDX__SECTION__FIELD. For example, INNDX__SERVER__PORT=8022 sets the server.port value. Run inndx config to print the fully resolved configuration.
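As a sketch of that mapping, the environment variable in the example above corresponds to the following nesting in a YAML config file. Only the server.port path is taken from this page; other sections and fields should nest the same way.

```yaml
# Equivalent of INNDX__SERVER__PORT=8022 in a YAML config file.
# If both are set, the environment variable takes priority over the file.
server:
  port: 8022
```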
Several sections are tagged unions that select a backend with a kind field (for example the cache, broker, blob storage, and columnar store). For those, set kind and then the fields for that kind.
Meta
General identity settings.
| Field | Type | Default | Description |
|---|---|---|---|
| default_tenant_id | string | default | The tenant used when a request or job does not specify one. |
| env | string | development | A label for the environment, used in telemetry and logging. |
Server
The HTTP server each service binds.
| Field | Type | Default | Description |
|---|---|---|---|
| host | string | 0.0.0.0 | The address the HTTP server binds to. |
| port | integer | 8022 | The port the HTTP server binds to. |
Database
The relational database holding crawl state. The driver is selected by the URL scheme (sqlite://... or postgresql://...).
| Field | Type | Default | Description |
|---|---|---|---|
| primary | string | sqlite://database.db?mode=rwc | The primary database connection URL. |
| replicas | list of strings | empty | Read-replica connection URLs. |
| options | database options | defaults | Connection-pool and logging options. |
Database options
| Field | Type | Default | Description |
|---|---|---|---|
| max_connections | integer | none | Maximum connections in the pool. |
| min_connections | integer | none | Minimum idle connections kept in the pool. |
| connect_timeout | duration | none | Maximum time to wait establishing a connection. |
| idle_timeout | duration | none | How long an idle connection is kept before being closed. |
| acquire_timeout | duration | none | Maximum time to wait acquiring a connection from the pool. |
| max_lifetime | duration | none | Maximum lifetime of a connection before it is recycled. |
| enable_logging | boolean | false | Whether to log database statements. |
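
A minimal sketch of a PostgreSQL setup with one read replica and a bounded pool might look like the following. The host names and pool values are illustrative rather than recommendations, and the duration strings assume the same format as the orchestrator interval defaults (for example 10s).

```yaml
database:
  # Scheme selects the driver: sqlite://... or postgresql://...
  primary: postgresql://user:pass@db:5432/inndx
  replicas:
    # Hypothetical replica host, for illustration only.
    - postgresql://user:pass@db-replica:5432/inndx
  options:
    max_connections: 20      # illustrative value
    min_connections: 2       # illustrative value
    connect_timeout: 5s      # duration format is an assumption
    acquire_timeout: 10s     # duration format is an assumption
    enable_logging: false
```

Fields left out of options keep their defaults, which the table above lists as none for the pool and timeout settings.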
Columnar store
The columnar store holding analytics data (audit logs, crawl logs, metrics). Selected by kind.
| Kind | Fields | Description |
|---|---|---|
| duckdb | path (string) | An embedded DuckDB store. The default, with path defaulting to analytics.db. Use :memory: for an in-memory store. |
| clickhouse | url, username, password, database | An external ClickHouse store. |
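
For comparison with the embedded default, a ClickHouse configuration might be sketched as follows. The field names come from the table above; the URL, credentials, and database name are placeholders, and the protocol and port expected in url are not stated on this page, so treat that value as an assumption.

```yaml
columnar_store:
  kind: clickhouse
  url: http://localhost:8123   # placeholder; protocol and port are assumptions
  username: <your-username>
  password: <your-password>
  database: inndx              # placeholder database name
```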
columnar_store:
  kind: duckdb
  path: analytics.db
Cache
The cache shared across services. Selected by kind.
| Kind | Fields | Description |
|---|---|---|
| memory | max_capacity (integer) | An in-process cache. The default, with max_capacity defaulting to 1000000. |
| redis | url (string) | A Redis cache. |
| redis_cluster | urls (list of strings) | A Redis Cluster cache. |
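
The redis_cluster kind takes a list rather than a single URL. A sketch with three hypothetical node addresses:

```yaml
cache:
  kind: redis_cluster
  urls:
    # Node host names are placeholders for illustration.
    - redis://node-1:6379
    - redis://node-2:6379
    - redis://node-3:6379
```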
cache:
  kind: redis
  url: redis://localhost:6379
Broker
The message broker carrying work between services. Selected by kind.
| Kind | Fields | Description |
|---|---|---|
| memory | none | An in-process broker. The default; suitable only for single-process mode. |
| kafka | bootstrap_servers (string) | A Kafka broker. bootstrap_servers is a comma-separated list. |
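
Because bootstrap_servers is a single comma-separated string rather than a YAML list, a multi-broker setup might be written as below. The broker host names are hypothetical.

```yaml
broker:
  kind: kafka
  # One string, hosts separated by commas (not a YAML sequence).
  bootstrap_servers: kafka-1:9092,kafka-2:9092,kafka-3:9092
```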
broker:
  kind: kafka
  bootstrap_servers: localhost:9092
Service registry
The registry services use to discover one another. Selected by kind, with the same options as the cache: memory (default), redis (url), and redis_cluster (urls).
service_registry:
  kind: redis
  url: redis://localhost:6379
Blob storage
Blob storage holds raw content and results. A deployment has one primary store and an optional map of named secondary stores; the storage_identifier on a to_blob action selects a secondary store by name.
| Field | Type | Default | Description |
|---|---|---|---|
| primary | blob store | file ./data | The default blob store. |
| secondary | map of name to blob store | empty | Additional named blob stores. |
Blob store
A blob store is selected by kind.
| Kind | Fields | Description |
|---|---|---|
| memory | none | An in-process store. |
| file | data_dir (string) | A local-filesystem store. The default, with data_dir defaulting to ./data. |
| s3 | endpoint_url, bucket, access_key, secret_key, region, force_path_style, multipart_threshold, delete_batch_size | An S3-compatible store. Only endpoint_url and bucket are required. |
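
To illustrate how the primary store and the secondary map fit together, the sketch below declares the default file store plus one secondary store named archive. The name archive and its directory mirror the environment-variable example later on this page; a to_blob action would select it by setting storage_identifier to archive.

```yaml
blob_storage:
  primary:
    kind: file
    data_dir: ./data
  secondary:
    # The map key is the store name that storage_identifier refers to.
    archive:
      kind: file
      data_dir: ./archive
```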
blob_storage:
  primary:
    kind: s3
    endpoint_url: https://s3.example.com
    bucket: inndx
    access_key: <your-access-key>
    secret_key: <your-secret-key>
Task queue
The in-process worker queue each service uses to run its work.
| Field | Type | Default | Description |
|---|---|---|---|
| worker_count | integer | 4 | Number of worker tasks. |
| max_queue_capacity | integer | 256 | Maximum number of queued tasks before backpressure applies. |
| batch_size | integer | 16 | Number of tasks processed per batch. |
| batch_timeout_ms | integer | 10 | Maximum time, in milliseconds, to wait filling a batch. |
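
A sketch of a tuned task queue follows. The section key task_queue is an assumption, inferred from how other multi-word sections are named (columnar_store, service_registry); this page does not state it. The values are illustrative, not recommendations.

```yaml
# Section key "task_queue" is assumed, not confirmed by this page.
task_queue:
  worker_count: 8            # default is 4
  max_queue_capacity: 512    # default is 256
  batch_size: 16             # default
  batch_timeout_ms: 10       # default, in milliseconds
```

Running inndx config shows the resolved section names and can confirm whether the assumed key is correct.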
Telemetry
OpenTelemetry export of traces and metrics.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Whether telemetry export is enabled. |
| endpoint | string | none | The OTLP endpoint to export to. |
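
Enabling export to a local collector might be sketched as follows. The section key telemetry is assumed from the section title, and the endpoint is a placeholder: port 4317 is the conventional OTLP gRPC port, but whether inndx exports over gRPC or HTTP is not stated on this page.

```yaml
# Section key "telemetry" is assumed from the section title.
telemetry:
  enabled: true
  endpoint: http://localhost:4317   # placeholder collector address
```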
Sentry
Error reporting to Sentry.
| Field | Type | Default | Description |
|---|---|---|---|
| enabled | boolean | false | Whether Sentry reporting is enabled. |
| dsn | string | empty | The Sentry DSN. |
Per-service settings
Each service has its own configuration section (orchestrator, fetcher, parser, sink, analytics) holding two things: which components it has enabled, and operational tuning.
Every component a service offers can be turned on or off. Each component group (such as orchestrator.filters or parser.conditions) lists its components, and each is an object with an enabled flag:
orchestrator:
  filters:
    robots_txt:
      enabled: false
The operational tuning per service is below.
orchestrator
| Field | Type | Default | Description |
|---|---|---|---|
| start_stop_interval | duration | 10s | How often the orchestrator checks for runs to start or stop. |
| run_evaluate_interval | duration | 5s | How often a run is evaluated to schedule more work. |
| run_evaluate_concurrency | integer | 8 | How many runs are evaluated concurrently. |
| trigger_evaluate_interval | duration | 15s | How often triggers are checked. |
| trigger_execution_interval | duration | 10s | How often due triggers are executed. |
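
As an illustration, the fragment below slows run evaluation and raises its concurrency while leaving the trigger intervals at their defaults. It assumes the tuning fields sit directly under the orchestrator section, alongside component groups such as filters; the values are examples only.

```yaml
orchestrator:
  # Tuning fields are assumed to live at the top level of the section.
  run_evaluate_interval: 10s     # default is 5s
  run_evaluate_concurrency: 16   # default is 8
```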
fetcher
The fetcher section configures the browser clients and any proxies, in addition to enabling clients and middleware. The browser-client settings are operational counterparts to the per-job client params on Clients.
The cdp client settings:
| Field | Type | Default | Description |
|---|---|---|---|
| max_concurrency | integer | 10 | Maximum concurrent browser pages. |
| headless | boolean | true | Whether the browser runs headless. |
| chrome_executable | string | none | Path to a Chrome executable, if not the bundled one. |
| ignore_certificate_errors | boolean | false | Whether to ignore TLS certificate errors. |
| disable_gpu | boolean | true | Whether to disable GPU acceleration. |
| disable_quic | boolean | false | Whether to disable the QUIC protocol. |
| args | list of strings | empty | Extra command-line arguments passed to the browser. |
The remote_cdp client connects to external browsers. Its settings carry a max_concurrency (default 10) and a list of browsers, each with an identifier, a list of endpoints, a connection scope, and flags such as ignore_https_errors, cache_enabled, and resolve_ip.
Proxies are configured as a list under the fetcher section, each with an identifier, addresses, a scheme, and optional username and password.
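A proxy list might be sketched as below. The per-proxy field names come from the sentence above, but the list key proxies, the shape of addresses as a list, and the scheme value are assumptions, as are the identifier and address shown.

```yaml
fetcher:
  # The key "proxies" is assumed; this page only says proxies are a list
  # under the fetcher section.
  proxies:
    - identifier: example-proxy        # hypothetical name
      addresses:
        - proxy.example.com:8080       # hypothetical address
      scheme: http                     # assumed value
      username: <your-username>        # optional
      password: <your-password>        # optional
```

Checking the output of inndx config is the safest way to confirm the exact key names before relying on this layout.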
parser
| Field | Type | Default | Description |
|---|---|---|---|
| concurrency | integer | 10 | How many pages the parser processes concurrently. |
sink
| Field | Type | Default | Description |
|---|---|---|---|
| concurrency | integer | 10 | How many results the sink delivers concurrently. |
The to_file action also takes a directory here, setting where to_file writes results:
sink:
  actions:
    to_file:
      enabled: true
      directory: ./output
analytics
The analytics service currently has no service-specific tuning beyond the shared sections.
Environment variable mapping
Every field maps to an environment variable named INNDX__ followed by the section and field, joined by double underscores. Some examples:
INNDX__SERVER__PORT=8022
INNDX__DATABASE__PRIMARY=postgresql://user:pass@db:5432/inndx
INNDX__COLUMNAR_STORE__KIND=duckdb
INNDX__CACHE__KIND=redis
INNDX__CACHE__URL=redis://localhost:6379
INNDX__BROKER__KIND=kafka
INNDX__BROKER__BOOTSTRAP_SERVERS=localhost:9092
INNDX__SERVICE_REGISTRY__KIND=redis
INNDX__SERVICE_REGISTRY__URL=redis://localhost:6379
INNDX__BLOB_STORAGE__PRIMARY__KIND=file
INNDX__BLOB_STORAGE__PRIMARY__DATA_DIR=./data
Named entries in a map use their key as another path segment. A secondary blob store named archive, for example, is configured as:
INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__KIND=file
INNDX__BLOB_STORAGE__SECONDARY__ARCHIVE__DATA_DIR=./archive