Parse
Run extraction pipelines against supplied content.
The parse endpoints run one or more pipelines over content you supply, so you can test extraction without crawling.
These endpoints follow the shared pagination and error conventions.
Parse a document
POST /v1/parse
Runs the supplied pipelines over a single document and returns the results.
- url string required
The URL the content came from.
format: uri
- body string required
The base64-encoded content to parse.
- content_type string required
The MIME type of the content.
- redirects Redirect[] default: []
The redirects that were followed to reach the content. Each entry has url, location, side ("server" or "client"), and type ("permanent" or "temporary").
- pipelines object[] default: []
The pipelines to evaluate against the content. Each pipeline has identifier (string, default "default"), optional guards, an optional navigator, steps (defaults to a single extractor), an optional priority, and behavior (string, default "continue"). See the parser components and the crawl manifest for the full shape of guards, navigators, and steps.
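The request fields above can also be assembled programmatically. What follows is a minimal Python sketch, not an official client: the helper names `build_parse_request` and `post_parse` are hypothetical, the base URL and `X-Tenant-Id` header are copied from the curl example on this page, and the reading of a redirect entry (`url` as the redirecting address, `location` as its target) is an assumption.

```python
import base64
import json
import urllib.request

# Assumption: base URL copied from the curl example on this page.
BASE_URL = "http://localhost:8022"


def build_parse_request(url, content, content_type, redirects=None, pipelines=None):
    """Assemble a /v1/parse request body (hypothetical helper, not a client API)."""
    payload = {
        "url": url,
        # The body field carries the raw content bytes, base64-encoded.
        "body": base64.b64encode(content).decode("ascii"),
        "content_type": content_type,
    }
    # Both fields are documented with a default of [], so they are only
    # included when the caller supplies them.
    if redirects:
        payload["redirects"] = list(redirects)
    if pipelines:
        payload["pipelines"] = list(pipelines)
    return payload


def post_parse(payload, tenant_id):
    """POST the payload to /v1/parse and return the raw response text."""
    request = urllib.request.Request(
        BASE_URL + "/v1/parse",
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-Tenant-Id": tenant_id, "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


payload = build_parse_request(
    url="https://example.com/article",
    content=b"<html>...</html>",
    content_type="text/html",
    redirects=[
        {
            # Assumption: url is the redirecting address, location its target.
            "url": "http://example.com/article",
            "location": "https://example.com/article",
            "side": "server",
            "type": "permanent",
        }
    ],
    # Only the two string fields with documented defaults are set here.
    pipelines=[{"identifier": "default", "behavior": "continue"}],
)
print(json.dumps(payload, indent=2))
```

Calling `post_parse(payload, "acme")` would send the request to a locally running service. The sketch returns the response text undecoded because the response shape is not described in this section.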
curl -X POST 'http://localhost:8022/v1/parse' \
  -H 'X-Tenant-Id: acme' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/article",
    "body": "PGh0bWw+Li4uPC9odG1sPg==",
    "content_type": "text/html"
  }'
Parse a batch
POST /v1/parse/batch
Runs pipelines over multiple documents in a single request. Per-item failures are returned as error items rather than failing the whole request.
- items object[] required
The documents to parse. Each item has the same shape as the single parse request body.
min: 1
max: 64
curl -X POST 'http://localhost:8022/v1/parse/batch' \
  -H 'X-Tenant-Id: acme' \
  -H 'Content-Type: application/json' \
  -d '{
    "items": [
      {
        "url": "https://example.com/article",
        "body": "PGh0bWw+Li4uPC9odG1sPg==",
        "content_type": "text/html"
      }
    ]
  }'
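Because items accepts between 1 and 64 documents, a larger set has to be split across several requests. The Python sketch below shows one way to do that. The helper names `make_item` and `chunk_items` are hypothetical, and the page URLs are placeholders for illustration.

```python
import base64

# Limit taken from the items constraint on this page (min: 1, max: 64).
MAX_BATCH_ITEMS = 64


def make_item(url, content, content_type):
    """Build one batch item; it has the same shape as a single parse request."""
    return {
        "url": url,
        "body": base64.b64encode(content).decode("ascii"),
        "content_type": content_type,
    }


def chunk_items(items, size=MAX_BATCH_ITEMS):
    """Split items into consecutive request bodies of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # An empty input yields no request bodies, so the minimum of one item
    # per request is never violated.
    return [{"items": items[start:start + size]} for start in range(0, len(items), size)]


documents = [
    make_item(f"https://example.com/page/{n}", b"<html>...</html>", "text/html")
    for n in range(130)
]
batches = chunk_items(documents)
print([len(batch["items"]) for batch in batches])  # prints [64, 64, 2]
```

Each element of `batches` can be sent as the JSON body of one /v1/parse/batch request. Since per-item failures come back as error items, a caller would still need to inspect every returned item rather than rely on the request succeeding as a whole.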