Extractors
Pipeline steps that pull structured data out of fetched content.
Extractors produce the output of a crawl: markdown, raw content, or structured records. They appear as extractor steps in a pipeline. This page catalogs the available kinds and their parameters.
How extractors are configured
An extractor is a pipeline step whose params carry a kind selecting the extractor and its options:
steps:
  - kind: extractor
    params:
      kind: markdown
raw
Emits the page content unchanged. It takes no parameters.
steps:
  - kind: extractor
    params:
      kind: raw
markdown
Converts the page to clean markdown.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| content_xpath | string | no | //body | The XPath of the region to convert. |
| skip_tags | list of strings | no | built-in list | Tag names to drop before converting. Defaults to a built-in list of non-content tags: script, style, noscript, iframe, template, head, svg, canvas, object, embed, form, button, input, textarea, select, label, option, and dialog. |
steps:
  - kind: extractor
    params:
      kind: markdown
      params:
        content_xpath: //main
        skip_tags:
          - nav
          - footer
data_map
Extracts named fields using a data map supplied inline.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| map | data map | yes | none | The data map describing the fields to extract. |
| schema_id | UUID | no | none | A data schema to validate the extracted record against. |
steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()
host_data_map
Extracts fields using a data map stored for the page's host, looked up automatically at extraction time.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| validate | boolean | no | true | Whether to validate the extracted record against the stored map's data schema. |
steps:
  - kind: extractor
    params:
      kind: host_data_map
      params:
        validate: true
Data map
A data map is a list of fields to extract. It is used by the data_map extractor and by data-map feature sets.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| fields | list of fields | yes | none | The fields to extract. At least one is required. |
Field
A field always has a name. Its value is then defined in one of three mutually exclusive ways: a source (a single value), nested fields (a group of sub-fields), or a root and item (a repeated structure, such as a list of records).
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| name | string | yes | none | The output name of the field. |
| source | field source | conditional | none | For a single-value field, where the value comes from. |
| fields | list of fields | conditional | none | For a group field, the nested sub-fields. |
| root | root selector | conditional | none | For a repeated field, the selector for the repeated elements. |
| item | field | conditional | none | For a repeated field, the field describing each element. |
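As an illustration of the group form, the sketch below nests two sub-fields under one parent field. The field names and XPath expressions are hypothetical, and the layout assumes sub-fields are written the same way as top-level fields.

```yaml
# Sketch of a group field: "author" has no source of its own,
# only nested sub-fields. All names and expressions are hypothetical.
fields:
  - name: author
    fields:
      - name: name
        source:
          type: extractor
          kind: xpath
          params:
            expressions:
              - '//span[@class="author-name"]/text()'
      - name: profile_url
        source:
          type: extractor
          kind: xpath
          params:
            expressions:
              - '//a[@class="author-link"]/@href'
```

Because the three forms are mutually exclusive, the parent field here carries fields only, with no source, root, or item.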
Field source
A field source is an object whose type selects where a value comes from.
| Type | Fields | Description |
|---|---|---|
| extractor (xpath) | kind: xpath, params.expressions (list of strings) | Extract the value with XPath expressions. |
| extractor (selector) | kind: selector, params.selectors (list of selector expressions) | Extract the value with CSS selectors. |
| context | key (string) | Take the value from the crawl context by key. |
| static | value (any) | Use a constant value. |
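The source types in the table might be combined in one data map roughly as follows. This is a sketch only: the field names, the CSS selector, the context key crawl_id, and the static value are all hypothetical.

```yaml
# Sketch of three single-value fields, one per non-XPath source type.
fields:
  - name: headline
    source:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: h1          # hypothetical CSS selector
  - name: crawl_id
    source:
      type: context
      key: crawl_id                 # hypothetical crawl context key
  - name: content_type
    source:
      type: static
      value: article                # constant written to every record
```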
Selector expression
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| expression | string | yes | none | The CSS selector. |
| accessor | accessor | no | none | What part of the matched element to read. Defaults to its text. |
Accessor
An accessor is an object with a type field selecting one of three forms.
| Type | Field | Description |
|---|---|---|
| attribute | name (string) | Read the named attribute. |
| text | recursive (boolean) | Read the text, optionally including descendant text. |
| html | outer (boolean) | Read the element's HTML, inner or outer. |
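The three accessor forms could appear inside selector expressions as sketched below. The field names and CSS selectors are hypothetical, and the comments on recursive and outer reflect an assumed reading of the table above (true includes descendant text, true returns the outer HTML).

```yaml
# Sketch: one field per accessor type. Names and selectors are hypothetical.
fields:
  - name: download_link
    source:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: a.download
            accessor:
              type: attribute
              name: href            # read the href attribute
  - name: summary
    source:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: div.summary
            accessor:
              type: text
              recursive: true       # assumed: include descendant text
  - name: body_html
    source:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: article
            accessor:
              type: html
              outer: false          # assumed: inner HTML only
```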
Root selector
A root selector locates the repeated elements of a repeated field.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| type | selector | yes | none | Always selector. |
| kind | xpath or selector | yes | none | Whether the expressions are XPath or CSS selectors. |
| params.expressions | list of strings | yes | none | The expressions selecting the repeated elements. |
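A repeated field pairs a root selector with an item, as in the sketch below. The field names and XPath expressions are hypothetical, and the relative expressions inside item assume that each item is evaluated against one matched root element, which this page does not state.

```yaml
# Sketch of a repeated field producing a list of product records.
fields:
  - name: products
    root:
      type: selector
      kind: xpath
      params:
        expressions:
          - '//ul[@class="products"]/li'   # one match per repeated element
    item:
      name: product
      fields:
        - name: title
          source:
            type: extractor
            kind: xpath
            params:
              expressions:
                - './/h2/text()'           # assumed relative to each root match
        - name: price
          source:
            type: extractor
            kind: xpath
            params:
              expressions:
                - './/span[@class="price"]/text()'
```

Here the item is itself a group field, so each repeated element would yield a record with title and price.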