inndx/

Extractors

Pipeline steps that pull structured data out of fetched content.

Extractors produce the output of a crawl: markdown, raw content, or structured records. They appear as extractor steps in a pipeline. This page catalogs the available kinds and their parameters.

How extractors are configured

An extractor is a pipeline step with kind: extractor. Its params carry a kind that selects the extractor and, where the extractor takes options, a nested params object that holds them:

steps:
  - kind: extractor
    params:
      kind: markdown

raw

Emits the page content unchanged. It takes no parameters.

steps:
  - kind: extractor
    params:
      kind: raw

markdown

Converts the page to clean markdown.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| content_xpath | string | no | //body | The XPath of the region to convert. |
| skip_tags | list of strings | no | built-in list | Tag names to drop before converting. Defaults to a built-in list of non-content tags: script, style, noscript, iframe, template, head, svg, canvas, object, embed, form, button, input, textarea, select, label, option, and dialog. |

steps:
  - kind: extractor
    params:
      kind: markdown
      params:
        content_xpath: //main
        skip_tags:
          - nav
          - footer

data_map

Extracts named fields using a data map supplied inline.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| map | data map | yes | none | The data map describing the fields to extract. |
| schema_id | UUID | no | none | A data schema to validate the extracted record against. |

steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()
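
The optional schema_id sits beside map in the extractor's options. A minimal sketch, assuming the same nesting as the example above; the UUID is a placeholder, not a real schema:

```yaml
steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        # Placeholder UUID; substitute the ID of an existing data schema.
        schema_id: 00000000-0000-0000-0000-000000000000
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()
```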

host_data_map

Extracts fields using a data map stored for the page's host, looked up automatically at extraction time.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| validate | boolean | no | true | Whether to validate the extracted record against the stored map's data schema. |

steps:
  - kind: extractor
    params:
      kind: host_data_map
      params:
        validate: true

Data map

A data map is a list of fields to extract. It is used by the data_map extractor and by data-map feature sets.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| fields | list of fields | yes | none | The fields to extract. At least one is required. |

Field

A field always has a name. Its value is then defined in one of three mutually exclusive ways: a source (a single value), nested fields (a group of sub-fields), or a root and item (a repeated structure, such as a list of records).

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| name | string | yes | none | The output name of the field. |
| source | field source | conditional | none | For a single-value field, where the value comes from. |
| fields | list of fields | conditional | none | For a group field, the nested sub-fields. |
| root | root selector | conditional | none | For a repeated field, the selector for the repeated elements. |
| item | field | conditional | none | For a repeated field, the field describing each element. |
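
The three forms can be sketched side by side in one data map. This is an illustrative fragment assembled from the tables on this page, not a tested configuration: the field names and XPath expressions are hypothetical, and it assumes that the expressions inside item are evaluated relative to each element matched by root.

```yaml
fields:
  # Single value: defined by a source.
  - name: title
    source:
      type: extractor
      kind: xpath
      params:
        expressions:
          - //h1/text()
  # Group: defined by nested fields.
  - name: author
    fields:
      - name: name
        source:
          type: extractor
          kind: xpath
          params:
            expressions:
              - '//span[@class="author"]/text()'
  # Repeated: defined by a root and an item.
  - name: tags
    root:
      type: selector
      kind: xpath
      params:
        expressions:
          - '//ul[@class="tags"]/li'
    item:
      name: tag
      source:
        type: extractor
        kind: xpath
        params:
          expressions:
            # Assumed to be relative to each matched li element.
            - ./text()
```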

Field source

A field source is an object whose type selects where a value comes from.

| Type | Fields | Description |
| --- | --- | --- |
| extractor (xpath) | kind: xpath, params.expressions (list of strings) | Extract the value with XPath expressions. |
| extractor (selector) | kind: selector, params.selectors (list of selector expressions) | Extract the value with CSS selectors. |
| context | key (string) | Take the value from the crawl context by key. |
| static | value (any) | Use a constant value. |
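
As a rough illustration, the fragment below uses each source type once. The field names, the CSS selector, the context key, and the static value are all hypothetical; only the shape of each source follows the table above.

```yaml
fields:
  - name: heading
    source:
      type: extractor
      kind: xpath
      params:
        expressions:
          - //h1/text()
  - name: summary
    source:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: p.summary
  - name: crawl_id
    source:
      type: context
      key: crawl_id        # hypothetical context key
  - name: source_type
    source:
      type: static
      value: article       # constant written to every record
```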

Selector expression

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| expression | string | yes | none | The CSS selector. |
| accessor | accessor | no | none | What part of the matched element to read. Defaults to its text. |

Accessor

An accessor is an object with a type field selecting one of three forms.

| Type | Field | Description |
| --- | --- | --- |
| attribute | name (string) | Read the named attribute. |
| text | recursive (boolean) | Read the text, optionally including descendant text. |
| html | outer (boolean) | Read the element's HTML, inner or outer. |
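
A sketch of the three accessor forms, each attached to a selector expression in its own field. The field names and CSS selectors are hypothetical, and the comments state how the boolean options are assumed to behave based on the descriptions above.

```yaml
fields:
  - name: download_url
    source:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: a.download
            accessor:
              type: attribute
              name: href
  - name: body_text
    source:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: article
            accessor:
              type: text
              recursive: true   # assumed: include descendant text
  - name: body_html
    source:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: article
            accessor:
              type: html
              outer: false      # assumed: inner HTML only
```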

Root selector

A root selector locates the repeated elements of a repeated field.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| type | selector | yes | none | Always selector. |
| kind | xpath or selector | yes | none | Whether the expressions are XPath or CSS selectors. |
| params.expressions | list of strings | yes | none | The expressions selecting the repeated elements. |
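
For illustration, here is a repeated field whose root uses CSS selectors and whose item is a group. The names and selectors are hypothetical, and the fragment assumes that each sub-field of item is resolved against one matched root element at a time.

```yaml
fields:
  - name: products
    root:
      type: selector
      kind: selector
      params:
        expressions:
          - li.product
    item:
      name: product
      fields:
        - name: name
          source:
            type: extractor
            kind: selector
            params:
              selectors:
                - expression: h2
        - name: link
          source:
            type: extractor
            kind: selector
            params:
              selectors:
                - expression: a
                  accessor:
                    type: attribute
                    name: href
```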
