Extract structured data

Turn fetched pages into clean markdown or structured records.

Fetching gives you raw HTML; the value comes from what you extract. This guide covers the extractor options, from whole-page markdown to field-level extraction with data maps and schemas. For the full extractor catalog, see Extractors.

Extraction happens inside a pipeline. An extractor step under config.pipelines[].steps has kind: extractor, and its params hold a nested kind naming the extractor plus a nested params block with that extractor's options. The sections below cover each kind.
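As a rough sketch of where such a step sits, the fragment below assumes the manifest nests pipelines under a top-level config key, as the path config.pipelines[].steps suggests. Any other manifest keys are omitted, and the layout outside steps is an assumption rather than a confirmed shape:

```yaml
# Sketch only: placement of an extractor step in a manifest.
# The top-level layout is inferred from the path config.pipelines[].steps.
config:
  pipelines:
    - steps:
        - kind: extractor          # the step kind
          params:
            kind: markdown         # which extractor to run
            params:                # that extractor's own options
              content_xpath: //main
```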

Markdown and raw extraction

The simplest extractors take the whole page. The markdown extractor converts the page to clean markdown, which is ideal when you want readable content rather than specific fields:

steps:
  - kind: extractor
    params:
      kind: markdown

You can narrow it to a region of the page with content_xpath, and drop unwanted elements with skip_tags:

steps:
  - kind: extractor
    params:
      kind: markdown
      params:
        content_xpath: //main
        skip_tags:
          - nav
          - footer

The raw extractor keeps the original content as-is, without converting it. Use it when you want to store the unmodified page.
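A minimal sketch of a raw step, assuming the extractor is named raw in the same way the markdown extractor is named markdown. The identifier is an assumption, so check the Extractors catalog for the exact kind:

```yaml
steps:
  - kind: extractor
    params:
      kind: raw   # assumed identifier for the raw extractor
```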

Field extraction with data maps

When you want specific fields rather than the whole page, use a data map. A data map is a list of named fields, each with a source that says where the value comes from. Sources can be an XPath expression, a CSS selector, JSON embedded in the page (such as JSON-LD) navigated with a CEL expression, a value carried in the crawl context, or a static constant. See Data maps for the full field shape.

The data_map extractor carries its map inline in the manifest:

steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()
            - name: price
              source:
                type: extractor
                kind: selector
                params:
                  selectors:
                    - expression: .price
                      accessor:
                        type: text
                        recursive: true

Each field has a name and a source. An extractor source of kind xpath takes one or more expressions; an extractor source of kind selector takes one or more CSS selectors, each with an optional accessor that says whether to read the element's text, an attribute, or its HTML.
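Since an xpath source accepts more than one expression, a field can list an alternative location for the same value. The sketch below is illustrative: the og:title expression is a hypothetical second location, and how results from several expressions are combined is not covered in this guide, so confirm it in Data maps before relying on any ordering:

```yaml
fields:
  - name: title
    source:
      type: extractor
      kind: xpath
      params:
        expressions:
          - //h1/text()
          # Hypothetical alternative location for the same value.
          - '//meta[@property="og:title"]/@content'
```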

To extract a repeating structure, such as a list of products, give a field a root that selects the repeated elements and an item that describes each one:

fields:
  - name: products
    root:
      type: extractor
      kind: selector
      params:
        selectors:
          - expression: .product-card
    item:
      name: product
      fields:
        - name: title
          source:
            type: extractor
            kind: xpath
            params:
              expressions:
                - .//h2/text()
        - name: price
          source:
            type: extractor
            kind: selector
            params:
              selectors:
                - expression: .price

Many product and content pages embed pre-structured data directly in the page, most commonly as JSON-LD: a <script type="application/ld+json"> tag holding a JSON description of what's on the page. An extractor source of kind json reads it in three steps. It locates the element with a selector (the same shape used for selector sources), parses the element's text content as JSON, and evaluates a CEL (Common Expression Language) expression against the result to pick out the value. The parsed JSON is bound to a variable named data:

fields:
  - name: sku
    source:
      type: extractor
      kind: json
      params:
        selectors:
          - expression: script[type="application/ld+json"]
        expression: data.sku

The expression's result becomes the field's value unchanged, so it can be a whole nested object or array rather than only a single scalar. Given an offers object nested under the JSON-LD payload, data.offers returns that object as-is. Pages that wrap their JSON-LD in a @graph array of mixed types can use CEL's filter to pick out the entry you want: data["@graph"].filter(g, g["@type"] == "Product")[0].offers.price. See JSON expression for the full reference.
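Put together as fields, the two expressions above might look like the following sketch. The field names offers and price are illustrative, and the two fields target different page shapes: the first assumes a plain JSON-LD object, the second a payload wrapped in @graph. The CEL expression is quoted so that YAML treats it as a single string:

```yaml
fields:
  # Whole nested object: the field value is the offers object itself.
  - name: offers
    source:
      type: extractor
      kind: json
      params:
        selectors:
          - expression: script[type="application/ld+json"]
        expression: data.offers
  # Single scalar picked out of a @graph-wrapped payload.
  - name: price
    source:
      type: extractor
      kind: json
      params:
        selectors:
          - expression: script[type="application/ld+json"]
        expression: 'data["@graph"].filter(g, g["@type"] == "Product")[0].offers.price'
```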

When the same map applies to every page on a host, you can store it once instead of repeating it in every job. A stored map is created against a host, and the host_data_map extractor looks it up automatically by host at extraction time:

steps:
  - kind: extractor
    params:
      kind: host_data_map
      params:
        validate: true

Create a stored map against the host. A stored map references a data schema by id (see the next section):

crawlctl data-maps create --host <host-hash> -f data-map.yaml

# data-map.yaml
schema_id: <schema-id>
map:
  fields:
    - name: title
      source:
        type: extractor
        kind: xpath
        params:
          expressions:
            - //h1/text()

Validating against a schema

A data schema describes the shape the extracted record should have, expressed as a JSON Schema document. Validating extracted data against a schema catches pages that changed structure or fields that came back empty, so bad records do not flow downstream silently. Schemas also support extension keywords for default values, computed fields, context-sourced fields, and value transforms; see Data schemas for the full reference.

Create a schema by name with its JSON Schema body:

crawlctl schemas create -f schema.yaml

# schema.yaml
name: product
schema:
  type: object
  required: [title, price]
  properties:
    title:
      type: string
    price:
      type: string

The response includes the schema's id. To validate as you extract, set schema_id to that id on an inline data_map extractor. A stored map always carries a schema_id, so with host_data_map you only need to set validate: true. The inline form looks like this:

steps:
  - kind: extractor
    params:
      kind: data_map
      params:
        schema_id: <schema-id>
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()

You can also check a record against a schema directly, which is useful while designing one:

echo '{ "data": { "title": "Example", "price": "9.99" } }' | crawlctl schemas validate <schema-id> -f -
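The product schema shown earlier only checks types, so an empty string still passes as a title. A sketch of a stricter variant follows. It assumes the validator honors the standard JSON Schema keywords minLength and pattern; the schema name and the price format are illustrative:

```yaml
# schema.yaml (stricter variant; sketch)
name: product-strict          # hypothetical schema name
schema:
  type: object
  required: [title, price]
  properties:
    title:
      type: string
      minLength: 1            # reject empty titles
    price:
      type: string
      # Digits with an optional two-digit fraction, e.g. "9.99".
      # Adjust to the format the site actually uses.
      pattern: '^[0-9]+(\.[0-9]{2})?$'
```

Running the validate command against a few hand-written records is a quick way to check that the constraints reject what you expect before wiring the schema into a job.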

Conditions: deciding what to keep

Not every fetched page is worth extracting. A condition step runs before extraction and decides whether the pipeline should continue for this page. Conditions are how you skip index pages, off-topic content, or near-duplicates.

steps:
  - kind: condition
    params:
      - kind: url_patterns
        params:
          allow:
            - pathname: /products/*
  - kind: extractor
    params:
      kind: data_map
      params:
        map:
          fields:
            - name: title
              source:
                type: extractor
                kind: xpath
                params:
                  expressions:
                    - //h1/text()

A condition step holds a list of conditions. Besides url_patterns and regex_patterns, conditions include expression (a custom expression), host_labels and url_labels (match on labels), heuristics (score a page on signals such as content density or structured-data markup), and fingerprint (detect near-duplicate pages). Because the condition comes before the extractor in the step list, a page that fails it is dropped before any extraction work is done.

Testing extraction

You do not have to run a whole crawl to see what an extractor produces. The parser exposes endpoints that apply extraction to content you supply.

To test a stored data map against a page, apply it to the page's content directly:

crawlctl data-maps extract <data-map-id> -f page.html

The response contains the extracted data.

To test a full pipeline, conditions and extractor included, parse saved content directly. Supply the URL the content was fetched from, its content type, the content itself, and the pipelines to run:

crawlctl parse https://example.com/products/1 \
  --content-type text/html \
  --content page.html \
  -f pipelines.yaml

# pipelines.yaml
pipelines:
  - steps:
      - kind: extractor
        params:
          kind: markdown
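To exercise a condition together with field extraction, the pipelines file can combine the steps shown earlier in this guide. The sketch below reuses only those shapes. It assumes the url_patterns condition matches the pathname of the URL passed on the command line, so /products/1 would be expected to satisfy /products/*:

```yaml
# pipelines.yaml (sketch combining steps from earlier sections)
pipelines:
  - steps:
      # Stop early unless the URL path is under /products/.
      - kind: condition
        params:
          - kind: url_patterns
            params:
              allow:
                - pathname: /products/*
      # Extract a single field from pages that pass the condition.
      - kind: extractor
        params:
          kind: data_map
          params:
            map:
              fields:
                - name: title
                  source:
                    type: extractor
                    kind: xpath
                    params:
                      expressions:
                        - //h1/text()
```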