Author a crawl job
Build a crawl-job manifest from the ground up, choosing each component deliberately.
This guide walks through writing a crawl-job manifest piece by piece, explaining the choices at each step. By the end you will be able to assemble a manifest for a real crawl rather than copying an example. For the exhaustive field reference, see Crawl manifest.
A manifest has three top-level fields: a name, an optional set of labels, and a config object that holds all the crawl behavior. Every field inside config has a default, so you only specify what you want to change. The sections below build a config up from the smallest useful job to a complete one.
Start with seeds and a stopping rule
The two things every crawl needs are a place to start and a reason to stop. Seeds provide the starting URLs. A stopping criterion ends the run so it does not crawl forever.
The simplest seed is a fixed list of URLs, and the simplest stopping rule is a cap on the number of URLs visited:
    name: docs-crawl
    config:
      seeds:
        - kind: static_list
          params:
            urls:
              - https://example.com/docs
      stopping_criteria:
        - kind: max_urls
          params:
            max_urls: 50

This is already a runnable job: it starts at one URL and stops after fifty have been processed. Other stopping criteria include max_depth (stop past a link distance from the seeds), max_age (stop after a wall-clock duration), and max_empty_evaluations (stop once the crawl stops finding new work). You can list more than one; the run stops when any of them is met.
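The "any of them" rule behaves like a logical OR over independent checks. The following Python sketch illustrates that idea only; it is not inndx code, and the CrawlState fields and helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CrawlState:
    """Hypothetical snapshot of a run's progress (not an inndx type)."""
    urls_processed: int
    elapsed_seconds: float


# Each criterion is a predicate that returns True when the run should stop.
Criterion = Callable[[CrawlState], bool]


def max_urls(limit: int) -> Criterion:
    return lambda state: state.urls_processed >= limit


def max_age(seconds: float) -> Criterion:
    return lambda state: state.elapsed_seconds >= seconds


def should_stop(state: CrawlState, criteria: List[Criterion]) -> bool:
    # The run stops as soon as any single criterion is met.
    return any(criterion(state) for criterion in criteria)


criteria = [max_urls(50), max_age(600)]
print(should_stop(CrawlState(urls_processed=12, elapsed_seconds=30), criteria))   # False
print(should_stop(CrawlState(urls_processed=50, elapsed_seconds=30), criteria))   # True
print(should_stop(CrawlState(urls_processed=12, elapsed_seconds=900), criteria))  # True
```

Pairing a URL cap with a time limit in this way bounds a run by whichever limit is reached first.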
Keep the crawl on target with filters
Left unfiltered, a crawl follows every link it discovers and can wander off the site you care about. Filters decide which discovered URLs are admitted to the queue and which are rejected.
The most common filter matches URLs against patterns. Use regex_patterns to allow and deny by regular expression:
    filters:
      - kind: regex_patterns
        params:
          allow:
            - https://example\.com/docs/.+
          deny:
            - https://example\.com/docs/changelog/.+

A URL must match an allow pattern and must not match a deny pattern to be admitted. Other useful filters include max_depth (admit only URLs within a link distance of the seeds), budget (cap how many URLs are admitted per host), and robots_txt (respect a site's robots rules). Filters work together, so you can combine a pattern filter with a per-host budget to stay both on-topic and polite.
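To check patterns before running a crawl, the allow-and-deny rule can be reproduced in a few lines of Python. This is a sketch under one stated assumption: patterns are treated as matching the whole URL. How inndx anchors its patterns may differ, and the is_admitted helper is hypothetical.

```python
import re
from typing import Iterable


def is_admitted(url: str, allow: Iterable[str], deny: Iterable[str]) -> bool:
    """Admit a URL that matches some allow pattern and no deny pattern.

    Assumption: a pattern must match the entire URL (re.fullmatch).
    """
    allowed = any(re.fullmatch(pattern, url) for pattern in allow)
    denied = any(re.fullmatch(pattern, url) for pattern in deny)
    return allowed and not denied


ALLOW = [r"https://example\.com/docs/.+"]
DENY = [r"https://example\.com/docs/changelog/.+"]

print(is_admitted("https://example.com/docs/intro", ALLOW, DENY))          # True
print(is_admitted("https://example.com/docs/changelog/v2", ALLOW, DENY))   # False
print(is_admitted("https://example.com/blog/post", ALLOW, DENY))           # False
```

Under this assumption the bare https://example.com/docs URL does not match the allow pattern, because .+ requires at least one character after the final slash. Whether seed URLs are themselves subject to filters is not covered here.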
Choose a traversal order
The ranker decides the order in which queued URLs are visited. The choice affects how coverage builds up over a partial crawl.
    ranker:
      kind: breadth

A breadth ranker visits URLs closer to the seeds first, which spreads coverage evenly and is a good default for site-wide crawls. A depth ranker follows links deeper before widening, which reaches far pages sooner but covers the site unevenly if the run is cut short. There is also a page_rank ranker that orders URLs by link importance.
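The difference between breadth and depth ordering shows up most clearly when a run is cut short. The sketch below walks a small, hypothetical link graph in both orders with a cap of three pages. It models the two orderings as a queue and a stack, which is an illustration of the idea and not a description of how inndx implements its rankers.

```python
from collections import deque
from typing import Dict, List

# Hypothetical link graph: each page maps to the pages it links to.
LINKS: Dict[str, List[str]] = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/1", "/docs/a/2"],
    "/docs/b": ["/docs/b/1"],
    "/docs/a/1": [],
    "/docs/a/2": [],
    "/docs/b/1": [],
}


def visit_order(seed: str, breadth_first: bool, limit: int) -> List[str]:
    """Return the first `limit` pages visited when starting from `seed`."""
    queue = deque([seed])
    seen = {seed}
    visited: List[str] = []
    while queue and len(visited) < limit:
        # Breadth takes the oldest queued URL; depth takes the newest.
        url = queue.popleft() if breadth_first else queue.pop()
        visited.append(url)
        for child in LINKS[url]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return visited


print(visit_order("/docs", breadth_first=True, limit=3))
# ['/docs', '/docs/a', '/docs/b']
print(visit_order("/docs", breadth_first=False, limit=3))
# ['/docs', '/docs/b', '/docs/b/1']
```

With the cap at three, the breadth order touches both top-level sections, while the depth order never reaches /docs/a.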
Normalize URLs
Many sites link to the same page through several URLs that differ only by a trailing fragment or query string. Each variant would otherwise be queued as a separate URL. The sanitize mutator strips those parts before a URL enters the queue, so duplicates collapse into one:
    mutators:
      - kind: sanitize
        params:
          strip_fragment: true
          strip_query: true

For the full set of seeds, filters, rankers, and mutators, see Seed and filter URLs.
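The effect of stripping fragments and query strings can be previewed with Python's standard urllib.parse module. This is a rough sketch of the idea; the sanitize function here is a hypothetical stand-in, and the real mutator may normalize URLs in additional ways.

```python
from urllib.parse import urlsplit, urlunsplit


def sanitize(url: str, strip_fragment: bool = True, strip_query: bool = True) -> str:
    """Drop the fragment and/or query string from a URL."""
    parts = urlsplit(url)
    query = "" if strip_query else parts.query
    fragment = "" if strip_fragment else parts.fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


variants = [
    "https://example.com/docs/intro",
    "https://example.com/docs/intro#install",
    "https://example.com/docs/intro?ref=nav",
    "https://example.com/docs/intro?ref=nav#install",
]

# All four variants collapse into a single queue entry.
print({sanitize(url) for url in variants})
# {'https://example.com/docs/intro'}
```

Stripping the query string also merges URLs whose query selects different content, such as ?page=2, so strip_query suits sites where the query carries only tracking or navigation hints.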
Configure fetching
The fetcher retrieves each URL. Its most important setting is the client, which decides how the request is made. The default standard client fetches raw HTML over plain HTTP and is fast:
    fetcher:
      client:
        kind: standard
      timeout: 30s
      max_retries: 3

The standard client cannot run JavaScript, so sites that render their content in the browser come back empty or incomplete. For those, inndx offers browser-based clients. See Fetch with a browser for when and how to use them.
Define the parsing pipeline
A pipeline turns fetched content into results. It discovers links to follow through a navigator, processes content through an ordered list of steps, and delivers output through actions. A job can declare more than one pipeline.
    pipelines:
      - navigator:
          kind: anchor
        steps:
          - kind: extractor
            params:
              kind: markdown
        actions:
          - kind: to_blob
            params:
              directory: output

The anchor navigator discovers child links from the page's anchor tags, which is what lets the crawl widen beyond its seeds. The extractor step of kind markdown converts each page to clean markdown. The to_blob action writes the result to blob storage under the output directory.
Steps run in order. Besides extractor, a step can be a condition (decide whether to keep the page before extracting) or an asset_resolver (resolve and download referenced assets such as images). For the full set of extractors see Extract structured data; for delivery options see Deliver results.
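The ordering matters because an early step can keep a page from ever reaching later steps or actions. The Python sketch below models that flow with plain functions. Every name in it is hypothetical, and it makes no claim about how inndx represents pages or steps internally.

```python
from typing import Callable, List, Optional

Page = dict
# A step returns the (possibly enriched) page, or None to drop it.
Step = Callable[[Page], Optional[Page]]
Action = Callable[[Page], None]


def keep_if_long_enough(page: Page) -> Optional[Page]:
    # Stand-in for a condition step: drop near-empty pages before extraction.
    return page if len(page["html"]) >= 20 else None


def extract_length(page: Page) -> Optional[Page]:
    # Stand-in for an extractor step: adds a derived field to the page.
    return {**page, "length": len(page["html"])}


def run_pipeline(page: Page, steps: List[Step], actions: List[Action]) -> bool:
    """Run steps in order; deliver through actions only if no step dropped the page."""
    current: Optional[Page] = page
    for step in steps:
        current = step(current)
        if current is None:
            return False
    for action in actions:
        action(current)
    return True


delivered: List[Page] = []
steps = [keep_if_long_enough, extract_length]
run_pipeline({"url": "/docs/a", "html": "<h1>Install</h1><p>Steps</p>"}, steps, [delivered.append])
run_pipeline({"url": "/docs/b", "html": "<p></p>"}, steps, [delivered.append])
print([page["url"] for page in delivered])  # ['/docs/a']
```

Placing the condition first means the near-empty page is dropped before any extraction work is spent on it.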
Validate and run
Put the sections together into one manifest file, docs-crawl.yml:
    name: docs-crawl
    config:
      seeds:
        - kind: static_list
          params:
            urls:
              - https://example.com/docs
      stopping_criteria:
        - kind: max_urls
          params:
            max_urls: 50
      filters:
        - kind: regex_patterns
          params:
            allow:
              - https://example\.com/docs/.+
      ranker:
        kind: breadth
      mutators:
        - kind: sanitize
          params:
            strip_fragment: true
            strip_query: true
      fetcher:
        client:
          kind: standard
        timeout: 30s
      pipelines:
        - navigator:
            kind: anchor
          steps:
            - kind: extractor
              params:
                kind: markdown
          actions:
            - kind: to_blob
              params:
                directory: output

The quickest way to try a manifest is the run command, which executes a single crawl from a file without standing up a server:
    inndx run docs-crawl.yml

To create a persistent job that you can run repeatedly, trigger on a schedule, and inspect over time, send it to the orchestrator API instead. The API takes JSON, so the same manifest is posted as a JSON body. Add the create_run=true query parameter to start a run immediately on creation:

    curl -X POST 'http://localhost:8022/v1/jobs?create_run=true' \
      -H 'Content-Type: application/json' \
      -d '{
        "name": "docs-crawl",
        "config": {
          "seeds": [
            { "kind": "static_list", "params": { "urls": ["https://example.com/docs"] } }
          ],
          "stopping_criteria": [
            { "kind": "max_urls", "params": { "max_urls": 50 } }
          ],
          "filters": [
            { "kind": "regex_patterns", "params": { "allow": ["https://example\\.com/docs/.+"] } }
          ],
          "ranker": { "kind": "breadth" },
          "mutators": [
            { "kind": "sanitize", "params": { "strip_fragment": true, "strip_query": true } }
          ],
          "fetcher": { "client": { "kind": "standard" }, "timeout": "30s" },
          "pipelines": [
            {
              "navigator": { "kind": "anchor" },
              "steps": [
                { "kind": "extractor", "params": { "kind": "markdown" } }
              ],
              "actions": [
                { "kind": "to_blob", "params": { "directory": "output" } }
              ]
            }
          ]
        }
      }'

The response is the created job, including its generated id. Use that id to start more runs, attach triggers, and query progress. To confirm a run is underway and watch it advance, see Inspect runs and logs.
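If you would rather build the request from a script than hand-write the JSON, the same call can be prepared with Python's standard library. This sketch only constructs the request; it assumes the endpoint and query parameter shown in the curl example, and the build_create_job_request helper is a hypothetical name. Letting json.dumps serialize the manifest also takes care of the backslash escaping in the regex pattern.

```python
import json
import urllib.request

# The same manifest as the YAML file, expressed as a Python dict.
manifest = {
    "name": "docs-crawl",
    "config": {
        "seeds": [
            {"kind": "static_list", "params": {"urls": ["https://example.com/docs"]}}
        ],
        "stopping_criteria": [
            {"kind": "max_urls", "params": {"max_urls": 50}}
        ],
        "filters": [
            {"kind": "regex_patterns", "params": {"allow": [r"https://example\.com/docs/.+"]}}
        ],
        "ranker": {"kind": "breadth"},
        "mutators": [
            {"kind": "sanitize", "params": {"strip_fragment": True, "strip_query": True}}
        ],
        "fetcher": {"client": {"kind": "standard"}, "timeout": "30s"},
        "pipelines": [
            {
                "navigator": {"kind": "anchor"},
                "steps": [{"kind": "extractor", "params": {"kind": "markdown"}}],
                "actions": [{"kind": "to_blob", "params": {"directory": "output"}}],
            }
        ],
    },
}


def build_create_job_request(
    manifest: dict,
    base_url: str = "http://localhost:8022",
    create_run: bool = True,
) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that creates a job."""
    url = f"{base_url}/v1/jobs"
    if create_run:
        url += "?create_run=true"
    body = json.dumps(manifest).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_job_request(manifest)
print(request.get_method(), request.full_url)
# POST http://localhost:8022/v1/jobs?create_run=true
```

To send the request against a running orchestrator, pass it to urllib.request.urlopen and decode the JSON response body.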