Extract structured data

Turn fetched pages into clean markdown or structured records.

Fetching gives you raw HTML; the value comes from what you extract. This guide covers the extractor options, from whole-page markdown to field-level extraction with data maps and schemas. For the full extractor catalog, see Extractors.

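Extraction happens inside a pipeline. An extractor step under config.pipelines[].steps has kind: extractor, and its params hold a nested kind that names the extractor plus, optionally, a nested params block with that extractor's own settings. The sections below cover each kind.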

Markdown and raw extraction

The simplest extractors take the whole page. The markdown extractor converts the page to clean markdown, which is ideal when you want readable content rather than specific fields:

steps:
  - kind: extractor
    params:
      kind: markdown

You can narrow it to a region of the page with content_xpath, and drop unwanted elements with skip_tags:

steps:
  - kind: extractor
    params:
      kind: markdown
      params:
        content_xpath: //main
        skip_tags:
          - nav
          - footer

The raw extractor keeps the original content as-is, without converting it. Use it when you want to store the unmodified page.

Field extraction with data maps

When you want specific fields rather than the whole page, use a data map. A data map is a list of named fields, each with a source that says where the value comes from. Sources can be an XPath expression, a CSS selector, a value carried in the crawl context, or a static constant.

The data_map extractor carries its map inline in the manifest:

steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()
            - name: price
              source:
                type: extractor
                kind: selector
                params:
                  selectors:
                    - expression: .price
                      accessor:
                        type: text
                        recursive: true

Each field has a name and a source. An extractor source of kind xpath takes one or more expressions; an extractor source of kind selector takes one or more CSS selectors, each with an optional accessor that says whether to read the element's text, an attribute, or its HTML.
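To make the two source kinds concrete, the sketch below reproduces the same title and price lookups with Python's standard-library ElementTree. It is an illustration only, not how inndx evaluates a map: the sample page is made up, ElementTree supports just a subset of XPath, the class predicate stands in for the .price CSS selector, and reading recursive: true as "include text from descendant elements" is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, kept well-formed so ElementTree can parse it.
PAGE = """<html><body>
<h1>Blue Widget</h1>
<div class="price"><span>$</span>9.99</div>
</body></html>"""

doc = ET.fromstring(PAGE)

# Counterpart of the xpath source //h1/text(): the text of the first h1.
title = doc.findtext(".//h1")

# Counterpart of the selector source .price (exact class match only).
price_el = doc.find(".//*[@class='price']")

# Text held directly by the element, skipping what is inside child elements.
direct_text = (price_el.text or "") + "".join(child.tail or "" for child in price_el)

# Assumed meaning of recursive: true: text of the element and all descendants.
recursive_text = "".join(price_el.itertext())

print(title)           # Blue Widget
print(direct_text)     # 9.99
print(recursive_text)  # $9.99
```

The difference between the last two values is the reason an accessor matters: when a price is split across nested elements, a non-recursive read can silently lose part of it.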

To extract a repeating structure, such as a list of products, give a field a root that selects the repeated elements and an item that describes each one:

fields:
  - name: products
    root:
      type: selector
      kind: selector
      params:
        expressions:
          - .product-card
    item:
      name: product
      fields:
        - name: title
          source:
            type: extractor
            kind: xpath
            params:
              expressions:
                - .//h2/text()
        - name: price
          source:
            type: extractor
            kind: selector
            params:
              selectors:
                - expression: .price
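The root and item split can be pictured as a loop: root selects each repeated element, and the item's fields are then evaluated relative to that element. The following is a minimal sketch of that idea using Python's standard library, not inndx's implementation. The sample page, the function name extract_products, and the list-of-dicts output shape are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical listing page with two repeated product cards.
PAGE = """<div>
  <div class="product-card"><h2>Blue Widget</h2><span class="price">9.99</span></div>
  <div class="product-card"><h2>Red Widget</h2><span class="price">12.50</span></div>
</div>"""


def extract_products(html: str) -> list[dict[str, str]]:
    doc = ET.fromstring(html)
    products = []
    # root: select every repeated element once.
    for card in doc.findall(".//*[@class='product-card']"):
        # item: each lookup is relative to the current card, not the whole page.
        price_el = card.find(".//*[@class='price']")
        products.append({
            "title": card.findtext(".//h2", default=""),
            "price": "".join(price_el.itertext()) if price_el is not None else "",
        })
    return products


products = extract_products(PAGE)
```

This is also why the item's XPath in the map starts with .// rather than //: a leading // would search the whole document again instead of the current card.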

When the same map applies to every page on a host, you can store it once instead of repeating it in every job. A stored map is created against a host and the host_data_map extractor looks it up automatically by host at extraction time:

steps:
  - kind: extractor
    params:
      kind: host_data_map
      params:
        validate: true

Create a stored map by posting it to the host's data_maps endpoint. A stored map references a data schema by id (see the next section):

curl -X POST 'http://localhost:8022/v1/hosts/<host-hash>/data_maps' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema_id": "<schema-id>",
    "map": {
      "fields": [
        { "name": "title", "source": { "type": "extractor", "kind": "xpath", "params": { "expressions": ["//h1/text()"] } } }
      ]
    }
  }'

Validating against a schema

A data schema describes the shape the extracted record should have, expressed as a JSON Schema document. Validating extracted data against a schema catches pages that changed structure or fields that came back empty, so bad records do not flow downstream silently.

Create a schema by name with its JSON Schema body:

curl -X POST 'http://localhost:8022/v1/schemas' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "product",
    "schema": {
      "type": "object",
      "required": ["title", "price"],
      "properties": {
        "title": { "type": "string" },
        "price": { "type": "string" }
      }
    }
  }'
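To see what this schema does and does not reject, here is a hand-rolled check that covers only the keywords used above (required and type: string) plus minLength. It is a sketch for reasoning about the schema, not the validator inndx runs; a real JSON Schema library would normally be used instead. The function name and error strings are made up.

```python
SCHEMA = {
    "type": "object",
    "required": ["title", "price"],
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "string"},
    },
}


def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors = []
    for name in schema.get("required", []):
        if name not in record:
            errors.append(f"missing required field: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in record:
            continue
        value = record[name]
        if rules.get("type") == "string":
            if not isinstance(value, str):
                errors.append(f"{name}: expected string")
            elif len(value) < rules.get("minLength", 0):
                errors.append(f"{name}: shorter than minLength")
    return errors


ok = check_record({"title": "Example", "price": "9.99"}, SCHEMA)
missing = check_record({"title": "Example"}, SCHEMA)
null_price = check_record({"title": "Example", "price": None}, SCHEMA)
empty_title = check_record({"title": "", "price": "9.99"}, SCHEMA)
```

Note that in JSON Schema an empty string still satisfies type: string, so empty_title passes here. If an empty field should count as a failure, adding "minLength": 1 to that property is the standard way to express it.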

The response includes the schema's id. To validate as you extract, set schema_id to that id on an inline data_map extractor. A stored map already carries a schema_id, so with host_data_map you only need to set validate: true.

steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        schema_id: <schema-id>
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()

You can also check a record against a schema directly, which is useful while designing one:

curl -X POST 'http://localhost:8022/v1/schemas/<schema-id>/validate' \
  -H 'Content-Type: application/json' \
  -d '{ "data": { "title": "Example", "price": "9.99" } }'

Conditions: deciding what to keep

Not every fetched page is worth extracting. A condition step runs before extraction and decides whether the pipeline should continue for this page. Conditions are how you skip index pages, off-topic content, or near-duplicates.

steps:
  - kind: condition
    params:
      - kind: url_patterns
        params:
          allow:
            - pathname: /products/*
  - kind: extractor
    params:
      kind: data_map
      params:
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()

A condition step holds a list of conditions. Besides url_patterns, used in the example above, other kinds include regex_patterns, expression (a custom expression), host_labels and url_labels (match on labels), heuristics (score a page on signals such as content density or structured-data markup), and fingerprint (detect near-duplicate pages). Because the condition comes before the extractor in the step list, a page that fails it is dropped before any extraction work is done.
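The ordering argument can be sketched in a few lines: conditions run first, and the extractor is never called for a page that fails. This is a conceptual model only. Treating the pathname pattern as a shell-style glob and requiring every condition to pass are assumptions, and pathname_allowed, run_steps, and fake_extractor are hypothetical names.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse


def pathname_allowed(url: str, allow: list[str]) -> bool:
    """True if the URL's path matches any allowed glob pattern (assumed semantics)."""
    path = urlparse(url).path
    return any(fnmatch(path, pattern) for pattern in allow)


def run_steps(url, html, conditions, extractor):
    """Run the extractor only if every condition passes; otherwise drop the page."""
    for condition in conditions:
        if not condition(url, html):
            return None  # dropped before any extraction work
    return extractor(html)


calls = []  # records each page the extractor actually sees


def fake_extractor(html):
    calls.append(html)
    return {"title": "Example"}


conditions = [lambda url, html: pathname_allowed(url, ["/products/*"])]

kept = run_steps("https://example.com/products/1", "<h1>Example</h1>", conditions, fake_extractor)
dropped = run_steps("https://example.com/about", "<h1>About</h1>", conditions, fake_extractor)
```

After both calls, calls holds a single entry: the product page was extracted and the about page never reached the extractor.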

Testing extraction

You do not have to run a whole crawl to see what an extractor produces. The parser exposes endpoints that apply extraction to content you supply.

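To test a stored data map against a page, send the page content to the map's apply endpoint. The content is sent base64-encoded, so encode the page into a variable first and feed the body in through a heredoc, which keeps the JSON quotes clean. The -w0 flag (GNU coreutils) turns off line wrapping so the encoded value stays on one line; where that flag is unavailable, base64 < page.html | tr -d '\n' produces the same result: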

CONTENT=$(base64 -w0 page.html)

curl -X POST 'http://localhost:8022/v1/data_maps/<data-map-id>/apply' \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{ "content": "$CONTENT" }
EOF

The response contains the extracted data.
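If you would rather script this than use the shell, the same request can be assembled with Python's standard library. The sketch below only builds the request object, so it runs without a server. The function name build_apply_request and the map id used in the example are hypothetical; the URL path and the content field mirror the curl command above.

```python
import base64
import json
import urllib.request


def build_apply_request(base_url: str, data_map_id: str, page: bytes) -> urllib.request.Request:
    """Build the POST request for a stored map's apply endpoint."""
    # b64encode never inserts line breaks, so no equivalent of -w0 is needed.
    body = json.dumps({"content": base64.b64encode(page).decode("ascii")})
    return urllib.request.Request(
        f"{base_url}/v1/data_maps/{data_map_id}/apply",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical id and inline page content, for illustration only.
req = build_apply_request("http://localhost:8022", "map-123", b"<h1>Hi</h1>")
```

Passing req to urllib.request.urlopen sends it. Using json.dumps for the body avoids the quoting concerns that the heredoc works around in the shell version.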

To test a full pipeline, including conditions and the extractor, against a URL's content, use the parse endpoint. It takes the URL, the page body (base64-encoded), its content_type, and the pipelines to run:

BODY=$(base64 -w0 page.html)

curl -X POST 'http://localhost:8022/v1/parse' \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "url": "https://example.com/products/1",
  "content_type": "text/html",
  "body": "$BODY",
  "pipelines": [
    {
      "steps": [
        { "kind": "extractor", "params": { "kind": "markdown" } }
      ]
    }
  ]
}
EOF

The response lists a result per pipeline, each with the discovered links and the extracted data. Iterate on your map or conditions against a saved page until the output is right, then put the same extractor into your crawl job.
