Seed and filter URLs

Control where a crawl starts and which URLs it is allowed to follow.

A crawl is shaped as much by what it excludes as by where it begins. This guide shows how to seed a crawl and how to use filters, rankers, and mutators together to keep it focused and well-behaved. For the full catalog of components, see Components.

Four component settings live under config in a crawl manifest: seeds, filters, ranker, and mutators. Of these, seeds, filters, and mutators are lists of entries, while ranker takes a single entry. Each entry is an object with a kind and an optional params.

Seeding a crawl

Seeds are the starting URLs. A crawl can have more than one seed, and the kinds can be mixed.

The static_list seed is a fixed list of URLs you supply directly. Use it when you know exactly where the crawl should begin:

seeds:
  - kind: static_list
    params:
      urls:
        - https://example.com/docs
        - https://example.com/blog

The sitemap seed reads a site's sitemap and starts from the URLs it lists. Use it to cover a site broadly without enumerating pages yourself. You can cap how many URLs it pulls with limit:

seeds:
  - kind: sitemap
    params:
      urls:
        - https://example.com/sitemap.xml
      limit: 500
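
To make the effect of limit concrete, the following Python sketch reads a small inline sitemap and keeps only the first few URLs. It is an illustration, not the crawler's implementation: the function name sitemap_seeds and the sample sitemap are made up, and applying limit per sitemap (rather than across all listed sitemaps) is an assumption.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Made-up sitemap used only for this example.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/docs/install</loc></url>
  <url><loc>https://example.com/blog</loc></url>
</urlset>"""


def sitemap_seeds(xml_text, limit=None):
    """Return the <loc> URLs of a sitemap, in document order, capped at limit."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]
    # Assumption: limit keeps the first N URLs in the order the sitemap lists them.
    return urls if limit is None else urls[:limit]


print(sitemap_seeds(SITEMAP_XML, limit=2))
# ['https://example.com/docs', 'https://example.com/docs/install']
```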

The host_labels seed starts from hosts that carry labels you have applied. Labels are key-value tags attached to host records (see Deliver results for how labels get written). This seed is useful when you maintain a set of hosts and want to crawl whichever ones match a tag:

seeds:
  - kind: host_labels
    params:
      labels:
        tier: priority
      limit: 100

There is also host_labels_sitemap, which combines the two: it selects hosts by label and then seeds from each host's sitemap.

Admitting and rejecting URLs with filters

As the crawl discovers links, filters decide which URLs are admitted to the queue. A URL must pass every filter to be admitted. Filters are how you keep a crawl on-topic and polite.

The pattern filters are the workhorses. The regex_patterns filter matches the full URL against regular expressions:

filters:
  - kind: regex_patterns
    params:
      allow:
        - https://example\.com/docs/.+
      deny:
        - https://example\.com/docs/changelog/.+

The url_patterns filter matches against individual parts of the URL (hostname, pathname, and so on), each given as its own pattern. This is easier to read than a single full-URL regex when you only care about, for example, staying on one host and one path prefix:

filters:
  - kind: url_patterns
    params:
      allow:
        - hostname: example.com
          pathname: /docs/*
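
The idea of matching each URL part separately can be sketched in Python as follows. This is a rough model only: the helper matches_parts is hypothetical, it handles just the hostname and pathname parts, and it assumes shell-style wildcards in which * matches any run of characters. The exact pattern syntax the filter uses is not specified in this guide.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def matches_parts(url, pattern):
    """Return True when every part named in pattern matches that part of url.

    pattern is a dict such as {"hostname": "example.com", "pathname": "/docs/*"}.
    Only hostname and pathname are modeled in this sketch.
    """
    parts = urlsplit(url)
    actual = {"hostname": parts.hostname or "", "pathname": parts.path}
    # Assumption: each part is compared with a case-sensitive glob.
    return all(fnmatchcase(actual[name], glob) for name, glob in pattern.items())


entry = {"hostname": "example.com", "pathname": "/docs/*"}
print(matches_parts("https://example.com/docs/intro", entry))  # True
print(matches_parts("https://example.com/blog/post", entry))   # False
```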

Both pattern filters accept allow, deny, and default_allow. A URL is admitted when it matches an allow entry and no deny entry; default_allow sets what happens to a URL that matches neither list.
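
The admission rule just described can be written out as a short Python sketch. The function is_admitted is hypothetical, and two details are assumptions rather than documented behavior: that each expression must match the whole URL, and that a URL matching both lists is rejected because the rule requires "no deny entry".

```python
import re


def is_admitted(url, allow, deny, default_allow):
    """Decide admission for one URL from allow and deny regex lists."""
    # A deny match always rejects, even if an allow entry also matches.
    if any(re.fullmatch(pattern, url) for pattern in deny):
        return False
    # An allow match (with no deny match) admits.
    if any(re.fullmatch(pattern, url) for pattern in allow):
        return True
    # Neither list matched: fall back to default_allow.
    return default_allow


allow = [r"https://example\.com/docs/.+"]
deny = [r"https://example\.com/docs/changelog/.+"]

print(is_admitted("https://example.com/docs/intro", allow, deny, False))         # True
print(is_admitted("https://example.com/docs/changelog/v2", allow, deny, False))  # False
print(is_admitted("https://example.com/blog/post", allow, deny, True))           # True
```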

Beyond patterns, several filters bound the crawl in other ways:

max_depth admits a URL only if it is within a given link distance of the seeds (max_depth).
budget caps how many URLs are admitted for each value of a chosen dimension, such as each host (by: host, limit).
robots_txt rejects URLs a site's robots rules disallow, with a configurable cache lifetime (cache_ttl).
recrawl controls whether a URL already seen may be visited again, and how long must pass before it is (scope, minimum_delay).
host_labels and url_labels admit or reject based on labels attached to hosts or URLs (allow, deny label maps), which lets earlier results steer later crawling.
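
As one illustration of the bounding filters listed above, the kind of check a robots_txt filter performs can be approximated with Python's standard urllib.robotparser module. This is a sketch under stated assumptions: the rules text and the user agent name example-crawler are made up, the helper allowed_by_robots is hypothetical, and caching (cache_ttl) is not modeled.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots rules used only for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_text, user_agent, url):
    """Return True if the given robots rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access happens.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed_by_robots(ROBOTS_TXT, "example-crawler", "https://example.com/docs/intro"))  # True
print(allowed_by_robots(ROBOTS_TXT, "example-crawler", "https://example.com/private/x"))   # False
```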

A typical on-domain crawl combines a pattern filter with a per-host budget:

filters:
  - kind: regex_patterns
    params:
      allow:
        - https://example\.com/.+
  - kind: budget
    params:
      by: host
      limit: 1000
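
How a pattern filter and a budget interact under the "must pass every filter" rule can be modeled in a few lines of Python. The classes RegexAllow and HostBudget and the function admit are hypothetical names for this sketch. It assumes that filters run in the order listed, that a URL matching no allow entry is rejected, and that only URLs passing the earlier filters count against the budget.

```python
import re
from collections import Counter
from urllib.parse import urlsplit


class RegexAllow:
    """Admit a URL only if it fully matches one of the allow expressions."""

    def __init__(self, allow):
        self.allow = [re.compile(pattern) for pattern in allow]

    def admits(self, url):
        return any(pattern.fullmatch(url) for pattern in self.allow)


class HostBudget:
    """Admit at most limit URLs per host, counting as URLs are admitted."""

    def __init__(self, limit):
        self.limit = limit
        self.admitted = Counter()

    def admits(self, url):
        host = urlsplit(url).hostname
        if self.admitted[host] >= self.limit:
            return False
        self.admitted[host] += 1
        return True


def admit(url, filters):
    # all() stops at the first failing filter, so a URL rejected by the
    # pattern filter never reaches (or consumes) the budget.
    return all(f.admits(url) for f in filters)


filters = [RegexAllow([r"https://example\.com/.+"]), HostBudget(limit=2)]
urls = [
    "https://example.com/a",
    "https://other.test/x",
    "https://example.com/b",
    "https://example.com/c",
]
print([admit(u, filters) for u in urls])  # [True, False, True, False]
```

The third URL from example.com is rejected only because the budget of 2 is used up; the off-domain URL is rejected earlier and leaves the budget untouched.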

Ordering with rankers

The ranker sets the order in which admitted URLs leave the queue. It does not change which URLs are admitted, only when each one is reached. The order matters most when a run is stopped before it finishes, because only the URLs ranked first will have been crawled.

ranker:
  kind: breadth

A breadth ranker visits URLs closer to the seeds first, spreading coverage evenly across the site. A depth ranker follows a branch of links deep before widening, reaching distant pages sooner at the cost of even coverage. The page_rank ranker orders URLs by their link importance within the crawl, prioritizing well-connected pages; it accepts tuning parameters such as damping_factor, max_iterations, and tolerance.
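
The difference between breadth and depth ordering is easiest to see on a tiny link graph. The Python sketch below models the two as a first-in-first-out queue and a last-in-first-out stack; the function crawl_order and the sample site are made up, and the real rankers may break ties or handle newly discovered links differently.

```python
from collections import deque

# Made-up site: each page maps to the links found on it.
LINKS = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
}


def crawl_order(links, seed, strategy):
    """Return the visit order from seed for strategy 'breadth' or 'depth'."""
    queue = deque([seed])
    seen = {seed}
    order = []
    while queue:
        # Breadth takes the oldest entry; depth takes the newest.
        url = queue.popleft() if strategy == "breadth" else queue.pop()
        order.append(url)
        children = [c for c in links.get(url, []) if c not in seen]
        if strategy == "depth":
            # Reverse so the first-listed link is the next one popped.
            children.reverse()
        for child in children:
            seen.add(child)
            queue.append(child)
    return order


print(crawl_order(LINKS, "/", "breadth"))  # ['/', '/a', '/b', '/a/1', '/a/2', '/b/1']
print(crawl_order(LINKS, "/", "depth"))    # ['/', '/a', '/a/1', '/a/2', '/b', '/b/1']
```

If the run stopped after three pages, the breadth order would have covered both top-level sections, while the depth order would have reached a second-level page but not yet touched /b.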

Normalizing URLs with mutators

Mutators transform a URL before it enters the queue. Without normalization, a single page reached through several URLs that differ only by a trailing fragment or tracking query string is queued several times over.

The sanitize mutator strips those parts so the variants collapse into one queue entry:

mutators:
  - kind: sanitize
    params:
      strip_fragment: true
      strip_query: true

strip_fragment removes the #section portion of a URL; strip_query removes the ?key=value portion. Strip the query only when it does not change which page is served, since some sites use query parameters to select real content.
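
The effect of the two options can be reproduced with Python's standard urllib.parse module. The function sanitize below is a hypothetical stand-in for the mutator, written only to show which part of the URL each option removes; the mutator's actual behavior on edge cases is not covered here.

```python
from urllib.parse import urlsplit, urlunsplit


def sanitize(url, strip_fragment=False, strip_query=False):
    """Return url with the fragment and/or query string removed."""
    parts = urlsplit(url)
    query = "" if strip_query else parts.query
    fragment = "" if strip_fragment else parts.fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, fragment))


variants = [
    "https://example.com/docs/intro",
    "https://example.com/docs/intro#setup",
    "https://example.com/docs/intro?utm_source=newsletter",
]
# All three variants collapse to a single queue entry.
print({sanitize(u, strip_fragment=True, strip_query=True) for u in variants})
# {'https://example.com/docs/intro'}
```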

Putting it together

The following manifest section seeds from a sitemap, stays on one host and path prefix, caps per-host volume, visits breadth-first, and normalizes URLs:

config:
  seeds:
    - kind: sitemap
      params:
        urls:
          - https://example.com/sitemap.xml
        limit: 500
  filters:
    - kind: regex_patterns
      params:
        allow:
          - https://example\.com/docs/.+
    - kind: budget
      params:
        by: host
        limit: 1000
  ranker:
    kind: breadth
  mutators:
    - kind: sanitize
      params:
        strip_fragment: true
        strip_query: true