Fetch with a browser

Crawl JavaScript-heavy sites using the browser-based fetch clients.

The default fetch client retrieves raw HTML over plain HTTP, which is fast but cannot run JavaScript. For sites that render content client-side, inndx offers browser-based clients. This guide explains when you need one and how to configure it. For the full client catalog, see Clients.

The fetch client is set under config.fetcher.client in a crawl manifest, as an object with a kind and optional params.

When you need a browser

The standard client returns the HTML exactly as the server sends it, before any JavaScript runs. Many modern sites send a near-empty HTML shell and build the real page in the browser. Against those sites the standard client produces thin or empty results.

The signs that you need a browser client:

  • The fetched markdown or extracted fields come back empty or contain only a loading placeholder, even though the page looks full in your own browser.
  • The page's real content does not appear in "view source" in a browser, only in the rendered view.
  • Links you expect the crawl to discover are never found, because they are injected by a script.

A quick way to check is to fetch a single URL with each client and compare. The fetcher exposes a single-URL endpoint for exactly this:

curl -X POST 'http://localhost:8022/v1/fetch' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/app",
    "client": { "kind": "standard" }
  }'

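Repeat the request with a browser client, for example "client": { "kind": "playwright" }, and compare the two responses. If the standard response is missing content that the browser client returns, switch that crawl to a browser client.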

Choosing a client

inndx offers three browser clients in addition to standard. The browser clients differ in how the browser is driven and where it runs; standard is listed first for comparison.

  • standard: plain HTTP, no JavaScript. The fastest and cheapest option. Use it whenever the site works without scripting.
  • playwright: a full browser driven through Playwright. Renders JavaScript and supports screenshots. Use it for general client-side-rendered sites.
  • cdp: a Chrome instance driven directly over the Chrome DevTools Protocol. Renders JavaScript and adds finer control, including blocking specific resource types and fingerprint patching. Use it when you need that extra control or lighter-weight rendering than Playwright.
  • remote_cdp: the same Chrome DevTools Protocol control, but pointed at a browser running elsewhere rather than inside the fetcher. The fetcher connects to an external Chrome DevTools Protocol endpoint over WebSockets. Use it when you run a separate pool of browsers and want the fetcher to drive them.

All three browser clients are far heavier than standard: each fetch drives a real browser, which costs significantly more CPU, memory, and time per URL. Reach for a browser client only for the sites that require one. You can run multiple crawls with different clients, so a standard crawl and a browser crawl can coexist.
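As a sketch of how the two can coexist, the fragment below shows the fetcher section of two separate crawl manifests, written as two YAML documents. Only the config.fetcher.client path and the client kinds come from this guide; the rest of each manifest is omitted, and how you split sites between the two crawls is up to you.

```yaml
# Manifest for a crawl of sites that work without scripting.
config:
  fetcher:
    client:
      kind: standard
---
# Separate manifest for a crawl of client-side-rendered sites.
config:
  fetcher:
    client:
      kind: playwright
      params:
        javascript_enabled: true
```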

The browser image variant

The playwright and cdp clients run a browser inside the fetcher process, so they require the browser variant of the inndx container image. The browser image is named with a -browsers suffix on the tag, for example:

inndx-image:0.3.0-browsers

It bundles Playwright and the Chrome DevTools Protocol runtime, with Chrome and Chrome for Testing included. Deploy the fetcher from this image whenever a crawl uses playwright or cdp.

The remote_cdp client is the exception: because the browser runs externally, the fetcher itself does not need the bundled browser, so the base image is sufficient for it. The external browser pool it connects to is provisioned separately.
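For a fetcher that serves playwright or cdp crawls, a deployment might look like the Docker Compose sketch below. This is an illustration under assumptions, not a documented deployment method: the service name is hypothetical, the image reference is the example tag above without any registry prefix, and the port is taken from the single-URL curl example earlier in this guide.

```yaml
# docker-compose.yml (hypothetical deployment sketch)
services:
  fetcher:
    # Browser variant: needed when a crawl uses playwright or cdp.
    image: inndx-image:0.3.0-browsers
    ports:
      # Assumption: the fetcher listens on 8022, as in the curl example.
      - "8022:8022"
```

A fetcher that only serves standard or remote_cdp crawls could use the base image in the same position.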

For the list of image variants, see Images.

Configuring a browser fetch

A browser fetch is configured by setting the client kind and its params. A minimal Playwright client:

config:
  fetcher:
    client:
      kind: playwright
      params:
        javascript_enabled: true
    timeout: 30s

The browser clients accept a screenshot option to capture the rendered page:

config:
  fetcher:
    client:
      kind: playwright
      params:
        javascript_enabled: true
        screenshot:
          enabled: true
          full_page: true

The cdp and remote_cdp clients add two options worth knowing. intercept_resources blocks named resource types from loading, which speeds up rendering and lowers cost when you do not need them; for example, skip images and fonts when you only want text. patch_fingerprint adjusts the browser's fingerprint. A cdp client tuned for text extraction:

config:
  fetcher:
    client:
      kind: cdp
      params:
        javascript_enabled: true
        intercept_resources:
          - image
          - media
          - font
        patch_fingerprint: true

The remote_cdp client accepts the same options. In addition, its groups option selects which external browser pool the fetcher connects to.
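A remote_cdp configuration might look like the sketch below. The shape of groups is an assumption here (a list of pool names under params), and browser-pool-a is a hypothetical pool name; check the Clients reference for the actual schema before relying on it.

```yaml
config:
  fetcher:
    client:
      kind: remote_cdp
      params:
        javascript_enabled: true
        intercept_resources:
          - image
          - font
        # Assumption: groups is a list of pool names under params.
        # "browser-pool-a" is a hypothetical name, not a default.
        groups:
          - browser-pool-a
```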

Cost and scaling notes

Browser fetches are the most expensive part of a crawl. A single browser fetch can take many times longer than a standard fetch and holds far more memory while it runs, so a browser crawl needs more fetcher capacity for the same throughput. Two levers reduce the cost: restrict the browser client to only the crawls that need it, and use intercept_resources to avoid loading assets you will not extract.

Sizing a browser-fetching deployment (how many fetcher instances and how much memory per instance) depends on your target throughput and the sites you crawl. That guidance is part of your enterprise onboarding materials.
