Output formats

The available output formats and guidance on choosing among them.

The Scrape API returns a page in one or more formats per call. Specify what you want in the formats array of your request. If you omit formats, the API defaults to markdown.

Markdown

The page converted to clean markdown. Navigation, ads, cookie banners, and other page chrome are stripped. Headings, lists, code blocks, and links are preserved.

{ "kind": "markdown" }

Use markdown when feeding content into an LLM, building a retrieval index, or in any other case where the text structure matters more than the original markup.

Skipping tags

Pass a skip_tags array to strip specific HTML elements before the markdown conversion:

{ "kind": "markdown", "skip_tags": ["nav", "footer", "aside"] }

This is useful when a page has persistent elements like sidebars or related-article sections that add noise to the output.
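As a rough sketch, a request body asking for markdown with some elements skipped could be assembled as below. The helper name build_scrape_request and the top-level url key are assumptions made for illustration; only formats, kind, and skip_tags come from this page.

```python
import json


def build_scrape_request(url, formats=None):
    """Assemble a request body as a plain dict.

    The top-level "url" key is an assumption; this page only documents
    the contents of the "formats" array.
    """
    body = {"url": url}
    if formats is not None:
        # Leaving "formats" out lets the API fall back to its markdown default.
        body["formats"] = formats
    return body


markdown_format = {"kind": "markdown", "skip_tags": ["nav", "footer", "aside"]}
request_body = build_scrape_request("https://example.com", [markdown_format])
print(json.dumps(request_body, indent=2))
```

Calling build_scrape_request with no formats argument leaves the key out entirely, which relies on the documented markdown default.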

HTML

The page's article content as cleaned HTML, without the surrounding page chrome.

{ "kind": "html" }

Use HTML when you need the structural markup, want to do further DOM processing, or are passing the content to a system that can handle HTML natively.

JSON

A structured object extracted from the page. Pass a fields array describing the values to pull out, and the response comes back shaped to match.

{ "kind": "json", "fields": [] }

Use JSON when you need specific values out of a page rather than the whole document.

Fields

Each entry in fields is one of three shapes:

Scalar extracts a single value:

{ "name": "title", "extractor": { "kind": "selector", "params": { "selectors": [{ "expression": "h1" }] } } }

Object groups nested fields under a key:

{
  "name": "author",
  "fields": [
    { "name": "name", "extractor": { "kind": "selector", "params": { "selectors": [{ "expression": ".byline .name" }] } } }
  ]
}

Array repeats item once for each element matched by root. At the top level root is evaluated against the whole page; when the array is nested inside another array, root is evaluated relative to the parent item:

{
  "name": "posts",
  "root": { "kind": "selector", "params": { "expressions": [".post"] } },
  "item": {
    "name": "title",
    "extractor": { "kind": "selector", "params": { "selectors": [{ "expression": "h2" }] } }
  }
}

If root is omitted, item is extracted once from the current context.
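The three shapes nest quickly, so small builder functions can keep a field list readable. The sketch below is one possible way to do that on the client side; the helper names (selector, root_selector, scalar, obj, array) are hypothetical and not part of any SDK. They only produce the same dictionaries as the JSON shown above.

```python
def selector(expression):
    """Build a selector extractor with a single CSS selector."""
    return {"kind": "selector", "params": {"selectors": [{"expression": expression}]}}


def root_selector(expression):
    """Build an array root, mirroring the shape used in the posts example."""
    return {"kind": "selector", "params": {"expressions": [expression]}}


def scalar(name, extractor):
    """Scalar field: a single extracted value."""
    return {"name": name, "extractor": extractor}


def obj(name, fields):
    """Object field: nested fields grouped under one key."""
    return {"name": name, "fields": fields}


def array(name, item, root=None):
    """Array field: repeats item per root match; root is optional."""
    field = {"name": name, "item": item}
    if root is not None:
        field["root"] = root
    return field


author = obj("author", [scalar("name", selector(".byline .name"))])
posts = array("posts", scalar("title", selector("h2")), root=root_selector(".post"))
```

Here author and posts are equal to the object and array examples above, so they can be placed directly into a fields array.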

Extractors

A scalar field's extractor has a kind, which is one of xpath, selector, or json:

xpath evaluates a list of XPath expressions and returns the first match:

{ "kind": "xpath", "params": { "expressions": ["//h1"] } }

selector evaluates a list of CSS selectors and returns the first match. Each selector can include an accessor describing what to read off the matched element:

{ "kind": "selector", "params": { "selectors": [{ "expression": "img", "accessor": { "type": "attribute", "name": "src" } }] } }

The accessor types are:

Type	Fields	Description
`text`	`recursive`	The element's text content. Set `recursive` to include descendant text.
`html`	`outer`	The element's HTML. Set `outer` to include the element's own tag.
`attribute`	`name`	The named attribute's value.

If accessor is omitted, the element's text content is used.
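To make the accessor types concrete, here is a local approximation using Python's standard xml.etree.ElementTree on a well-formed fragment. It is not the API's implementation. The function name read_accessor is hypothetical, and it is an assumption that recursive and outer behave as off when unset.

```python
import xml.etree.ElementTree as ET


def read_accessor(element, accessor=None):
    """Approximate what each accessor type reads from one element."""
    accessor = accessor or {"type": "text"}  # omitted accessor: text content
    kind = accessor["type"]
    if kind == "attribute":
        return element.get(accessor["name"])
    if kind == "text":
        if accessor.get("recursive"):
            # Own text plus the text of every descendant, in document order.
            return "".join(element.itertext())
        # Own text only: leading text plus the text that follows each child.
        return (element.text or "") + "".join(child.tail or "" for child in element)
    if kind == "html":
        if accessor.get("outer"):
            return ET.tostring(element, encoding="unicode")
        # Inner markup: leading text plus each serialized child (with its tail).
        return (element.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in element
        )
    raise ValueError(f"unknown accessor type: {kind}")


el = ET.fromstring('<p class="lead">Hello <b>big</b> world</p>')
own_text = read_accessor(el)
all_text = read_accessor(el, {"type": "text", "recursive": True})
outer_html = read_accessor(el, {"type": "html", "outer": True})
css_class = read_accessor(el, {"type": "attribute", "name": "class"})
```

In this sketch all_text includes the word inside the nested b element while own_text does not, which is the distinction the recursive field describes.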

json targets a <script> tag containing embedded JSON (for example a JSON-LD block or a framework's hydration payload). selectors are CSS selectors that locate the script tag, and expression is a CEL expression evaluated against that tag's parsed JSON content to pull out the value you want:

{
  "kind": "json",
  "params": {
    "selectors": [{ "expression": "script[type='application/ld+json']" }],
    "expression": "datePublished"
  }
}
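The two steps of the json extractor (locate the script tag, then evaluate an expression against its parsed content) can be approximated locally with the Python standard library. This is an illustration of the idea, not the service's implementation: the sample HTML and its date are made up, the class name JsonLdCollector is hypothetical, and a plain dictionary lookup stands in for the CEL expression datePublished.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text inside <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


# Made-up sample page for illustration only.
html_doc = """<html><head>
<script type="application/ld+json">{"@type": "Article", "datePublished": "2024-01-15"}</script>
</head><body><h1>Hello</h1></body></html>"""

collector = JsonLdCollector()
collector.feed(html_doc)                    # step 1: locate the script tag
payload = json.loads(collector.blocks[0])   # parse its JSON content
date_published = payload["datePublished"]   # step 2: stand-in for the expression
print(date_published)
```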

Full example

Extracting a page title and a list of articles, each with a title and link:

{
  "kind": "json",
  "fields": [
    {
      "name": "title",
      "extractor": { "kind": "selector", "params": { "selectors": [{ "expression": "h1" }] } }
    },
    {
      "name": "articles",
      "root": { "kind": "selector", "params": { "expressions": [".article"] } },
      "item": {
        "name": "article",
        "fields": [
          {
            "name": "title",
            "extractor": { "kind": "selector", "params": { "selectors": [{ "expression": "h2" }] } }
          },
          {
            "name": "url",
            "extractor": {
              "kind": "selector",
              "params": {
                "selectors": [{ "expression": "a", "accessor": { "type": "attribute", "name": "href" } }]
              }
            }
          }
        ]
      }
    }
  ]
}

The response's data is shaped to match:

{
  "title": "Latest news",
  "articles": [
    { "title": "First post", "url": "/posts/1" },
    { "title": "Second post", "url": "/posts/2" }
  ]
}

Binary

The raw response bytes, base64-encoded in the HTTP response. The SDKs decode the value for you.

{ "kind": "binary" }

Binary applies when the requested URL points directly at a non-HTML resource such as an image or PDF. You cannot request the binary format for an HTML page.
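If you call the HTTP API directly rather than through an SDK, you decode the base64 value yourself. A minimal sketch, assuming the binary result carries its payload in a content key like the markdown and html results do (that key name is an assumption for binary, and decode_binary_result is a hypothetical helper):

```python
import base64


def decode_binary_result(result):
    """Return the raw bytes of a binary result.

    Assumes the base64 string lives under "content"; check the actual
    response shape before relying on this.
    """
    if result.get("kind") != "binary":
        raise ValueError("not a binary result")
    return base64.b64decode(result["content"])


# Stand-in result built locally so the sketch runs without a network call.
sample = {"kind": "binary", "content": base64.b64encode(b"%PDF-1.7\n").decode("ascii")}
raw = decode_binary_result(sample)

# The decoded bytes can then be written to disk or passed to another library.
is_pdf = raw.startswith(b"%PDF")
```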

Requesting multiple formats

You can request more than one format in a single call. The response includes one result per format:

{
  "url": "https://example.com",
  "results": [
    { "kind": "markdown", "content": "..." },
    { "kind": "html", "content": "..." }
  ]
}

A single call that requests two formats is still billed as one unit at the base price, since the page is only fetched once.
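Because each result carries its own kind, a multi-format response is easy to index on the client. The sketch below uses a hand-written response in the shape shown above; the content strings are placeholders, not real API output.

```python
# Formats array for a request that asks for both markdown and HTML.
formats = [{"kind": "markdown"}, {"kind": "html"}]

# Hand-written response in the documented shape (placeholder content).
response = {
    "url": "https://example.com",
    "results": [
        {"kind": "markdown", "content": "# Title"},
        {"kind": "html", "content": "<h1>Title</h1>"},
    ],
}

# Index the results by kind so each format can be looked up directly.
by_kind = {result["kind"]: result["content"] for result in response["results"]}

markdown_text = by_kind["markdown"]
html_text = by_kind["html"]
```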