Seeds
The seed components that determine where a crawl begins.
Seeds provide the initial set of URLs a crawl starts from. A job lists one or more seeds under config.seeds, each selected by kind. This page catalogs the available kinds and their parameters.
How seeds are configured
Each entry in config.seeds is an object with a kind and, for most kinds, a params object. A job may list several seeds, and their URLs combine into the starting set:
```yaml
config:
  seeds:
    - kind: static_list
      params:
        urls:
          - https://example.com/
    - kind: sitemap
      params:
        urls:
          - https://example.com/sitemap.xml
```

static_list
A fixed list of URLs supplied directly in the manifest.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | list of strings | yes | none | The URLs to start the crawl from. |
```yaml
seeds:
  - kind: static_list
    params:
      urls:
        - https://example.com/docs
        - https://example.com/blog
```

sitemap
Seeds from the URLs listed in one or more sitemaps.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| urls | list of strings | no | none | Sitemap URLs to read. |
| limit | integer | no | none | Maximum number of URLs to take from the sitemaps. |
| concurrency | integer | no | none | How many sitemaps to fetch in parallel. |
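
To make the limit field concrete, here is a minimal Python sketch of reading sitemap documents and stopping once a total cap is reached. It is not the crawler's implementation: the function name urls_from_sitemaps and the in-memory EXAMPLE document are hypothetical, and the sketch assumes that limit applies to the combined total across all listed sitemaps and that URLs are taken in document order.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def urls_from_sitemaps(sitemap_docs, limit=None):
    """Collect <loc> URLs from sitemap XML strings (hypothetical helper).

    Assumption: `limit` caps the total across all sitemaps, not each one.
    """
    collected = []
    for doc in sitemap_docs:
        root = ET.fromstring(doc)
        for loc in root.iter(f"{SITEMAP_NS}loc"):
            if limit is not None and len(collected) >= limit:
                return collected
            if loc.text:
                collected.append(loc.text.strip())
    return collected


# A small in-memory sitemap so the sketch runs without network access.
EXAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""

print(urls_from_sitemaps([EXAMPLE], limit=2))
```

With limit omitted, the sketch returns every URL it finds, which mirrors the table's default of none.
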
```yaml
seeds:
  - kind: sitemap
    params:
      urls:
        - https://example.com/sitemap.xml
      limit: 500
```

host_labels
Seeds from hosts that carry the given labels. Labels are key-value tags previously attached to host records, so this kind seeds the crawl from whichever known hosts match the given labels.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| labels | map of string to string | yes | none | Labels a host must carry to be selected. |
| limit | integer | no | none | Maximum number of hosts to seed from. |
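
One plausible reading of label selection can be sketched in Python as follows. The function select_hosts and the KNOWN_HOSTS data are hypothetical, and the sketch assumes that a host is selected only when it carries every listed label with the same value. The actual matching rule may differ.

```python
def select_hosts(hosts, labels, limit=None):
    """Return names of hosts whose labels include every requested pair.

    `hosts` maps a host name to its label dict (hypothetical structure).
    Assumption: all requested labels must match (logical AND).
    """
    selected = [
        name
        for name, host_labels in hosts.items()
        if all(host_labels.get(key) == value for key, value in labels.items())
    ]
    return selected if limit is None else selected[:limit]


# Illustrative host records, not real data.
KNOWN_HOSTS = {
    "a.example": {"tier": "priority", "region": "eu"},
    "b.example": {"tier": "standard"},
    "c.example": {"tier": "priority"},
}

print(select_hosts(KNOWN_HOSTS, {"tier": "priority"}))
# prints ['a.example', 'c.example']
print(select_hosts(KNOWN_HOSTS, {"tier": "priority"}, limit=1))
# prints ['a.example']
```
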
```yaml
seeds:
  - kind: host_labels
    params:
      labels:
        tier: priority
      limit: 100
```

host_labels_sitemap
Selects hosts by label, then seeds from each selected host's sitemap. It combines host_labels selection with sitemap seeding.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| labels | map of string to string | no | none | Labels a host must carry to be selected. |
| host_limit | integer | no | none | Maximum number of hosts to select. |
| link_limit | integer | no | none | Maximum number of URLs to take per host sitemap. |
| concurrency | integer | no | none | How many host sitemaps to fetch in parallel. |
```yaml
seeds:
  - kind: host_labels_sitemap
    params:
      labels:
        tier: priority
      host_limit: 50
      link_limit: 200
```
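
The way host_limit and link_limit interact can be sketched in Python as below. This is an illustration under stated assumptions, not the actual implementation: seed_urls and HOSTS are hypothetical names, each host is assumed to have one sitemap whose URLs are already available as a list, all listed labels are assumed to be required, and omitting labels is assumed to match every host.

```python
def seed_urls(hosts, labels=None, host_limit=None, link_limit=None):
    """Select hosts by label, then take up to link_limit URLs from each.

    `hosts` maps a host name to a (labels, sitemap_urls) pair
    (hypothetical structure standing in for host records and sitemaps).
    """
    labels = labels or {}  # assumption: no labels means every host matches
    matching = [
        name
        for name, (host_labels, _) in hosts.items()
        if all(host_labels.get(key) == value for key, value in labels.items())
    ]
    if host_limit is not None:
        matching = matching[:host_limit]

    seeds = []
    for name in matching:
        links = hosts[name][1]
        seeds.extend(links if link_limit is None else links[:link_limit])
    return seeds


# Illustrative data only.
HOSTS = {
    "a.example": ({"tier": "priority"},
                  ["https://a.example/1", "https://a.example/2", "https://a.example/3"]),
    "b.example": ({"tier": "standard"}, ["https://b.example/1"]),
    "c.example": ({"tier": "priority"}, ["https://c.example/1"]),
}

print(seed_urls(HOSTS, {"tier": "priority"}, host_limit=2, link_limit=2))
```

Under these assumptions, the example manifest above (host_limit: 50, link_limit: 200) would yield at most 50 × 200 = 10,000 seed URLs.
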