Heuristics
A condition that keeps a page by scoring it against weighted heuristics.
A condition decides whether a fetched page is processed and its result kept. Conditions are configured inside a condition step under a pipeline's steps; a single condition step holds a list of conditions. This page documents the heuristics condition and the heuristic kinds it uses.
The heuristics condition scores a page against a list of heuristics and keeps it when the combined score meets a threshold. Each heuristic looks at one signal (the URL shape, structured-data markup, text density, and so on), produces a score, and contributes to the total according to its weight.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| threshold | float (0 to 1) | no | 0.65 | The combined score at or above which the page is kept. |
| exclude_roots | boolean | no | true | Whether root URLs are exempt from scoring and always kept. |
| heuristics | list of heuristic kinds | no | empty | The heuristics to score with. At least one is required when the list is set. |
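
This page does not state the exact formula used to combine scores. As an illustration only, assume the combined score $S$ is a weight-normalized average of the per-heuristic scores $s_i$ with weights $w_i$ (this formula and the numbers below are assumptions, not confirmed behavior):

$$
S = \frac{\sum_i w_i \, s_i}{\sum_i w_i}
$$

Under that assumption, two heuristics with weight 1.0 each and hypothetical scores of 0.8 and 0.4 give

$$
S = \frac{1.0 \cdot 0.8 + 1.0 \cdot 0.4}{1.0 + 1.0} = 0.6,
$$

which is below the default threshold of 0.65, so the page would not be kept. If the second score were 0.5 instead, $S$ would be 0.65 and the page would be kept, because the threshold is met at or above.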
```yaml
steps:
  - kind: condition
    params:
      - kind: heuristics
        params:
          threshold: 0.65
          heuristics:
            - kind: content_density
              params:
                weight: 1.0
                min_block_words: 15
            - kind: url_path_segment
              params:
                min_segments: 3
```

Heuristic kinds
Each entry in the heuristics list is an object with a kind and a params object. Most heuristics share a weight (how much the heuristic counts toward the combined score). Pattern-based heuristics also share a score (the score given on a match) and accumulate (whether repeated matches add up rather than scoring once).
url_path_segment
Scores pages whose URL path has at least a minimum number of segments, on the basis that deeper paths are more often content pages.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| min_segments | integer | no | 3 | The minimum number of path segments to score. |
url_path_keyword
Scores pages whose URL path contains any of the given keywords.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| keywords | list of strings | no | empty | Keywords to look for in the URL path. |
| score | float (0.1 to 1) | no | 0.5 | The score given when a keyword is found. |
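
A minimal sketch of a url_path_keyword entry in the heuristics list, using the fields from the table above. The keyword values are illustrative, and the table does not say whether matching is case-sensitive or applies to whole segments, so treat that as unspecified.

```yaml
- kind: url_path_keyword
  params:
    weight: 0.5          # counts for half as much as a weight-1.0 heuristic
    keywords:            # illustrative values, not defaults
      - article
      - blog
      - news
    score: 0.5           # score given when any keyword is found
```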
open_graph
Scores pages that carry Open Graph markup of given types or fields.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| types | Open Graph types | no | see table | Open Graph object types to score. |
| fields | list of Open Graph fields | no | empty | Specific Open Graph fields to score. |
Open Graph types
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| types | list of strings | no | empty | The Open Graph object types to match (such as article). |
| score | float (0.1 to 1) | no | 0.5 | The score given when a type matches. |
Open Graph field
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| selectors | list of strings | no | empty | Selectors identifying the Open Graph field. |
| score | float (0.1 to 1) | no | 0.5 | The score given when the field is present. |
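
The three tables above nest inside one another, which a sketch may make easier to see. This example assumes that the outer types field holds a single object with its own types and score, and that fields is a list of such objects with selectors and score. The selector string is a placeholder, since this page does not specify the selector syntax.

```yaml
- kind: open_graph
  params:
    weight: 0.8
    types:
      types:             # Open Graph object types to match
        - article
      score: 0.6
    fields:
      - selectors:       # placeholder value; selector syntax is an assumption
          - "og:title"
        score: 0.3
```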
json_ld
Scores pages that carry JSON-LD structured data of given types.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| types | JSON-LD types | no | see table | JSON-LD types to score. |
JSON-LD types
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| types | list of strings | no | empty | The JSON-LD types to match (such as Article). |
| score | float (0.1 to 1) | no | 0.5 | The score given when a type matches. |
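
A sketch of a json_ld entry, assuming the outer types field holds one object with the types and score fields listed in the JSON-LD types table. NewsArticle is an illustrative schema.org type, not a documented default.

```yaml
- kind: json_ld
  params:
    weight: 1.0
    types:
      types:             # JSON-LD types to match
        - Article
        - NewsArticle    # illustrative addition
      score: 0.7
```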
text_link_density
Scores pages by the ratio of text to links, distinguishing content from navigation.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| min_density | float (0 to 1) | no | 0.2 | The minimum text-to-link density to score. |
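
A short sketch of a text_link_density entry that raises the minimum density above the default of 0.2. The chosen values are illustrative only.

```yaml
- kind: text_link_density
  params:
    weight: 0.8
    min_density: 0.3     # stricter than the 0.2 default
```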
content_density
Scores pages that contain dense blocks of text.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| min_block_words | integer | no | 15 | The minimum number of words in a block for it to count. |
regex_patterns
Scores pages whose URL matches regular expressions.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| patterns | list of regex strings | no | empty | Patterns to match the URL against. |
| score | float (0.1 to 1) | no | 0.5 | The score given on a match. |
| accumulate | boolean | no | false | Whether multiple matches add up rather than scoring once. |
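
A sketch of a regex_patterns entry with accumulate enabled. The patterns are illustrative, and this page does not state the regex dialect or whether a pattern must match the whole URL or only part of it, so both are assumptions here. Single-quoted YAML strings keep the backslashes literal.

```yaml
- kind: regex_patterns
  params:
    weight: 0.7
    patterns:                  # illustrative patterns, assuming partial matching
      - '/\d{4}/\d{2}/'        # a year/month path such as /2024/05/
      - '\.html$'              # URL ends in .html
    score: 0.5
    accumulate: true           # multiple matches add up rather than scoring once
```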
url_patterns
Scores pages whose URL matches URL-component patterns.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| patterns | list of URL patterns | no | empty | URL-component patterns to match. |
| score | float (0.1 to 1) | no | 0.5 | The score given on a match. |
| accumulate | boolean | no | false | Whether multiple matches add up rather than scoring once. |
host_patterns
Scores pages using a built-in set of patterns for a content category.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| weight | float (0 to 1) | yes | none | The heuristic's weight in the combined score. |
| score | float (0.1 to 1) | yes | none | The score given on a match. |
| category | enum | yes | none | The content category to match: articles, blogs, products, forums, sections, or others. |
| accumulate | boolean | no | false | Whether multiple matches add up rather than scoring once. |
| format | regex or url_pattern | no | none | Which pattern format the built-in set uses. |
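
Unlike the other kinds, host_patterns has no defaults for weight, score, or category, so all three must be given. A minimal sketch with illustrative values:

```yaml
- kind: host_patterns
  params:
    weight: 1.0          # required
    score: 0.5           # required
    category: blogs      # required; one of the categories listed above
    accumulate: false
    format: regex        # optional; regex or url_pattern
```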