
Parse

Run extraction pipelines against supplied content.

The parse endpoints run one or more pipelines over content you supply, so you can test extraction without crawling.

These endpoints follow the same pagination and error conventions as the rest of the API.

Parse a document

POST /v1/parse

Runs the supplied pipelines over a single document and returns the results.

Request body
url
string, required

The URL the content came from.

format: uri
body
string, required

The base64-encoded content to parse.

content_type
string, required

The MIME type of the content.

redirects
Redirect[], default: []

The redirects that were followed to reach the content. Each entry has url, location, side ("server" or "client"), and type ("permanent" or "temporary").

pipelines
object[], default: []

The pipelines to evaluate against the content. Each pipeline has identifier (string, default "default"), optional guards, an optional navigator, steps (defaults to a single extractor), an optional priority, and behavior (string, default "continue"). See the parser components and the crawl manifest for the full shape of guards, navigators, and steps.
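The optional redirects and pipelines fields can be assembled as plain dictionaries. The following is a minimal Python sketch that assumes only the field names and allowed values listed above. The helper names make_redirect and make_pipeline are hypothetical, and guards, navigator, steps, and priority are omitted because their shape is defined in the parser components and crawl manifest documentation.

```python
import json

# Allowed values as described for Redirect entries.
ALLOWED_SIDES = {"server", "client"}
ALLOWED_TYPES = {"permanent", "temporary"}


def make_redirect(url, location, side, type_):
    """Build one Redirect entry, rejecting values outside the documented sets."""
    if side not in ALLOWED_SIDES:
        raise ValueError(f"side must be one of {sorted(ALLOWED_SIDES)}")
    if type_ not in ALLOWED_TYPES:
        raise ValueError(f"type must be one of {sorted(ALLOWED_TYPES)}")
    return {"url": url, "location": location, "side": side, "type": type_}


def make_pipeline(identifier="default", behavior="continue"):
    """Build a minimal pipeline using the documented defaults (hypothetical helper)."""
    return {"identifier": identifier, "behavior": behavior}


redirects = [
    make_redirect(
        "http://example.com/article",
        "https://example.com/article",
        "server",
        "permanent",
    )
]
pipelines = [make_pipeline()]

# These two lists would be added to the parse request body next to url, body, and content_type.
print(json.dumps({"redirects": redirects, "pipelines": pipelines}, indent=2))
```

Validating side and type on the client is a convenience for catching typos early; it does not replace whatever validation the server performs.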

Responses

curl -X POST 'http://localhost:8022/v1/parse' \
  -H 'X-Tenant-Id: acme' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/article",
    "body": "PGh0bWw+Li4uPC9odG1sPg==",
    "content_type": "text/html"
  }'
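The same request can be built from Python. This is a minimal sketch using only the standard library. It assumes the local base URL and the X-Tenant-Id header shown in the curl example; build_parse_request is a hypothetical helper name. The response shape is not described here, so the sketch leaves the send step commented out and does not interpret the result.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:8022"  # assumption: taken from the curl example


def build_parse_request(url, raw_content, content_type, tenant_id="acme"):
    """Build a POST /v1/parse request; raw_content is bytes and is base64-encoded here."""
    payload = {
        "url": url,
        "body": base64.b64encode(raw_content).decode("ascii"),
        "content_type": content_type,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/parse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Tenant-Id": tenant_id, "Content-Type": "application/json"},
        method="POST",
    )


request = build_parse_request(
    "https://example.com/article", b"<html>...</html>", "text/html"
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it against a running instance:
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Encoding the bytes inside the helper keeps callers from accidentally sending raw HTML in the body field, which must be base64.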

Parse a batch

POST /v1/parse/batch

Runs pipelines over multiple documents in a single request. Per-item failures are returned as error items rather than failing the whole request.

Request body
items
object[], required

The documents to parse. Each item has the same shape as the single parse request body.

min: 1, max: 64
Responses

curl -X POST 'http://localhost:8022/v1/parse/batch' \
  -H 'X-Tenant-Id: acme' \
  -H 'Content-Type: application/json' \
  -d '{
    "items": [
      {
        "url": "https://example.com/article",
        "body": "PGh0bWw+Li4uPC9odG1sPg==",
        "content_type": "text/html"
      }
    ]
  }'
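Because a batch accepts between 1 and 64 items, a larger set of documents has to be split across several requests. The following Python sketch shows one way to do that split. The names make_item and batch_payloads are hypothetical, and the example URLs are placeholders; sending each payload and reading per-item error items is left out because the response shape is not described here.

```python
import base64

MAX_BATCH_ITEMS = 64  # documented upper bound for items


def make_item(url, raw_content, content_type):
    """Build one batch item with the same shape as the single parse request body."""
    return {
        "url": url,
        "body": base64.b64encode(raw_content).decode("ascii"),
        "content_type": content_type,
    }


def batch_payloads(items, max_items=MAX_BATCH_ITEMS):
    """Yield request bodies of at most max_items items each; an empty input yields nothing."""
    for start in range(0, len(items), max_items):
        yield {"items": items[start:start + max_items]}


items = [
    make_item(f"https://example.com/page/{n}", b"<html>...</html>", "text/html")
    for n in range(150)
]
payloads = list(batch_payloads(items))
print([len(p["items"]) for p in payloads])  # → [64, 64, 22]
```

Each yielded dictionary can be serialized with json.dumps and posted to /v1/parse/batch in the same way as the curl example.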
