Extract structured data
Turn fetched pages into clean markdown or structured records.
Fetching gives you raw HTML; the value comes from what you extract. This guide covers the extractor options, from whole-page markdown to field-level extraction with data maps and schemas. For the full extractor catalog, see Extractors.
Extraction happens inside a pipeline. An extractor step under config.pipelines[].steps takes a kind naming the extractor and its params. The sections below cover each kind.
Markdown and raw extraction
The simplest extractors take the whole page. The markdown extractor converts the page to clean markdown, which is ideal when you want readable content rather than specific fields:
steps:
  - kind: extractor
    params:
      kind: markdown

You can narrow it to a region of the page with content_xpath, and drop unwanted elements with skip_tags:
steps:
  - kind: extractor
    params:
      kind: markdown
      params:
        content_xpath: //main
        skip_tags:
          - nav
          - footer

The raw extractor keeps the original content as-is, without converting it. Use it when you want to store the unmodified page.
Field extraction with data maps
When you want specific fields rather than the whole page, use a data map. A data map is a list of named fields, each with a source that says where the value comes from. Sources can be an XPath expression, a CSS selector, a value carried in the crawl context, or a static constant.
The data_map extractor carries its map inline in the manifest:
steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()
            - name: price
              source:
                type: extractor
                kind: selector
                params:
                  selectors:
                    - expression: .price
                      accessor:
                        type: text
                        recursive: true

Each field has a name and a source. An extractor source of kind xpath takes one or more expressions; an extractor source of kind selector takes one or more CSS selectors, each with an optional accessor that says whether to read the element's text, an attribute, or its HTML.
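Maps with many fields get repetitive to write by hand. As a minimal sketch, the field structure described above can be generated from two small helpers and serialized to JSON for use in a request body. The helper names xpath_field and selector_field are hypothetical and not part of the product; only the dictionary shape is taken from the example above.

```python
import json


def xpath_field(name, *expressions):
    """Build a field whose value comes from one or more XPath expressions."""
    return {
        "name": name,
        "source": {
            "type": "extractor",
            "kind": "xpath",
            "params": {"expressions": list(expressions)},
        },
    }


def selector_field(name, css, accessor_type=None):
    """Build a field whose value comes from a CSS selector.

    The accessor is optional, so it is only emitted when requested.
    """
    selector = {"expression": css}
    if accessor_type is not None:
        selector["accessor"] = {"type": accessor_type}
    return {
        "name": name,
        "source": {
            "type": "extractor",
            "kind": "selector",
            "params": {"selectors": [selector]},
        },
    }


# Same two fields as the inline map above (hypothetical helpers, same shape).
data_map = {
    "fields": [
        xpath_field("title", "//h1/text()"),
        selector_field("price", ".price", accessor_type="text"),
    ]
}

print(json.dumps(data_map, indent=2))
```

The resulting dictionary mirrors the map block of the manifest, so it can be dumped to JSON or YAML as needed.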
To extract a repeating structure, such as a list of products, give a field a root that selects the repeated elements and an item that describes each one:
fields:
  - name: products
    root:
      type: selector
      kind: selector
      params:
        expressions:
          - .product-card
    item:
      name: product
      fields:
        - name: title
          source:
            type: extractor
            kind: xpath
            params:
              expressions:
                - .//h2/text()
        - name: price
          source:
            type: extractor
            kind: selector
            params:
              selectors:
                - expression: .price

When the same map applies to every page on a host, you can store it once instead of repeating it in every job. A stored map is created against a host, and the host_data_map extractor looks it up automatically by host at extraction time:
steps:
  - kind: extractor
    params:
      kind: host_data_map
      params:
        validate: true

Create a stored map by posting it to the host's data-maps endpoint. A stored map references a data schema by id (see the next section):
curl -X POST 'http://localhost:8022/v1/hosts/<host-hash>/data_maps' \
  -H 'Content-Type: application/json' \
  -d '{
    "schema_id": "<schema-id>",
    "map": {
      "fields": [
        { "name": "title", "source": { "type": "extractor", "kind": "xpath", "params": { "expressions": ["//h1/text()"] } } }
      ]
    }
  }'

Validating against a schema
A data schema describes the shape the extracted record should have, expressed as a JSON Schema document. Validating extracted data against a schema catches pages that changed structure or fields that came back empty, so bad records do not flow downstream silently.
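To make concrete what validation catches, here is a minimal local sketch that checks a record for required fields and primitive types. It is a hypothetical stand-in, not the service's validator: it understands only the required and properties keywords of JSON Schema, and it assumes that a field whose selector matched nothing is either absent from the record or null.

```python
# Hypothetical product schema, limited to the keywords this sketch understands.
PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["title", "price"],
    "properties": {
        "title": {"type": "string"},
        "price": {"type": "string"},
    },
}


def matches_type(value, type_name):
    """Check a value against a primitive JSON Schema type name."""
    if type_name == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    python_type = {
        "string": str,
        "boolean": bool,
        "object": dict,
        "array": list,
    }.get(type_name)
    # Unknown or missing type names are not checked by this sketch.
    return python_type is None or isinstance(value, python_type)


def check_record(record, schema):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, rule in schema.get("properties", {}).items():
        if key in record and not matches_type(record[key], rule.get("type")):
            problems.append(f"{key}: expected {rule['type']}")
    return problems


good = check_record({"title": "Example", "price": "9.99"}, PRODUCT_SCHEMA)
missing = check_record({"title": "Example"}, PRODUCT_SCHEMA)
empty = check_record({"title": None, "price": "9.99"}, PRODUCT_SCHEMA)
print(good, missing, empty)
```

A record from a page whose layout changed would typically fail one of these two checks, which is the signal that keeps it from flowing downstream unnoticed.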
Create a schema by name with its JSON Schema body:
curl -X POST 'http://localhost:8022/v1/schemas' \
  -H 'Content-Type: application/json' \
  -d '{
    "name": "product",
    "schema": {
      "type": "object",
      "required": ["title", "price"],
      "properties": {
        "title": { "type": "string" },
        "price": { "type": "string" }
      }
    }
  }'

The response includes the schema's id. A stored map always carries a schema_id, so with host_data_map you only need to set validate: true. With an inline data_map extractor, reference the id with schema_id to validate as you extract:
steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        schema_id: <schema-id>
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()

You can also check a record against a schema directly, which is useful while designing one:
curl -X POST 'http://localhost:8022/v1/schemas/<schema-id>/validate' \
  -H 'Content-Type: application/json' \
  -d '{ "data": { "title": "Example", "price": "9.99" } }'

Conditions: deciding what to keep
Not every fetched page is worth extracting. A condition step runs before extraction and decides whether the pipeline should continue for this page. Conditions are how you skip index pages, off-topic content, or near-duplicates.
steps:
  - kind: condition
    params:
      - kind: url_patterns
        params:
          allow:
            - pathname: /products/*
  - kind: extractor
    params:
      kind: data_map
      params:
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()

A condition step holds a list of conditions. Besides url_patterns and regex_patterns, conditions include expression (a custom expression), host_labels and url_labels (match on labels), heuristics (score a page on signals such as content density or structured-data markup), and fingerprint (detect near-duplicate pages). Because the condition comes before the extractor in the step list, a page that fails it is dropped before any extraction work is done.
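The ordering guarantee can be illustrated with a small sketch: a gate that checks the URL's pathname against the allow list and only then calls the extractor. This assumes that the pathname pattern uses glob-style matching where * matches any run of characters (including slashes); the actual matching rules of url_patterns may differ, and the function names here are hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Assumed glob semantics for the pathname pattern from the example above.
ALLOW_PATHNAMES = ["/products/*"]


def passes_url_patterns(url, allow=ALLOW_PATHNAMES):
    """Return True when the URL's path matches at least one allowed pattern."""
    path = urlsplit(url).path
    return any(fnmatchcase(path, pattern) for pattern in allow)


def run_steps(url, extract):
    """Run the condition first; call the extractor only if it passes."""
    if not passes_url_patterns(url):
        return None  # page dropped before any extraction work
    return extract(url)


extracted = []


def fake_extract(url):
    """Stand-in extractor that records which URLs reached it."""
    extracted.append(url)
    return {"url": url}


run_steps("https://example.com/products/1", fake_extract)
run_steps("https://example.com/about", fake_extract)
print(extracted)
```

Only the product URL reaches the stand-in extractor; the other page is dropped at the condition.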
Testing extraction
You do not have to run a whole crawl to see what an extractor produces. The parser exposes endpoints that apply extraction to content you supply.
To test a stored data map against a page, send the page content to the map's apply endpoint. The content is sent base64-encoded, so encode the page into a variable first and feed the body in through a heredoc, which keeps the JSON quotes clean:
CONTENT=$(base64 -w0 page.html)
curl -X POST 'http://localhost:8022/v1/data_maps/<data-map-id>/apply' \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{ "content": "$CONTENT" }
EOF

The response contains the extracted data.
To test a full pipeline, including conditions and the extractor, against a URL's content, use the parse endpoint. It takes the URL, the page body (base64-encoded), its content_type, and the pipelines to run:
BODY=$(base64 -w0 page.html)
curl -X POST 'http://localhost:8022/v1/parse' \
  -H 'Content-Type: application/json' \
  -d @- <<EOF
{
  "url": "https://example.com/products/1",
  "content_type": "text/html",
  "body": "$BODY",
  "pipelines": [
    {
      "steps": [
        { "kind": "extractor", "params": { "kind": "markdown" } }
      ]
    }
  ]
}
EOF

The response lists a result per pipeline, each with the discovered links and the extracted data. Iterate on your map or conditions against a saved page until the output is right, then put the same extractor into your crawl job.
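If you iterate from a script rather than a shell, the same parse request can be built with the Python standard library, which also avoids the GNU-specific -w0 flag of base64. This is a sketch under the assumptions already shown above (the localhost:8022 address and the url, content_type, body, and pipelines fields); the helper names are hypothetical, and the network call is left commented out so the snippet runs without a parser listening.

```python
import base64
import json
import urllib.request

PARSE_URL = "http://localhost:8022/v1/parse"  # address used in the examples above


def build_parse_request(url, html, pipelines, content_type="text/html"):
    """Assemble the JSON body for the parse endpoint from raw page bytes."""
    return {
        "url": url,
        "content_type": content_type,
        # b64encode never inserts line breaks, so the value is JSON-safe.
        "body": base64.b64encode(html).decode("ascii"),
        "pipelines": pipelines,
    }


def post_json(endpoint, payload):
    """POST a JSON payload and decode the JSON response."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


page = b"<html><body><h1>Example</h1></body></html>"
payload = build_parse_request(
    "https://example.com/products/1",
    page,
    [{"steps": [{"kind": "extractor", "params": {"kind": "markdown"}}]}],
)
print(json.dumps(payload, indent=2))

# Uncomment with the parser running locally:
# result = post_json(PARSE_URL, payload)
```

Swapping the pipelines argument for a condition plus a data_map extractor lets you rerun the same saved page after each change to the map.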