Fingerprint
A condition that detects duplicate or near-duplicate pages by fingerprint.
A condition decides whether a fetched page is processed and its result kept. Conditions are configured inside a condition step under a pipeline's steps; a single condition step holds a list of conditions. This page documents the fingerprint condition.
The fingerprint condition computes a fingerprint of a page's content and compares it against other fingerprints to detect duplicates and near-duplicates. A page whose similarity to an existing fingerprint reaches the threshold is treated as a duplicate.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| target | comparison target | no | spatial, local scope | What the fingerprint is compared against. |
| threshold | float (0 to 1) | no | 0.98 | The similarity at or above which a page is treated as a duplicate. |
| batch_size | integer | no | 1000 | How many fingerprints are compared per batch. |
| exclude_roots | boolean | no | true | Whether root URLs are exempt from fingerprinting. |
| namespace | string | no | default | The namespace fingerprints are grouped and compared under. |
| feature_set | feature set | no | selectors (body) | How the features that form the fingerprint are extracted from the page. |
```yaml
steps:
  - kind: condition
    params:
      - kind: fingerprint
        params:
          threshold: 0.98
          target:
            spatial:
              scope: global
          feature_set:
            kind: selectors
            params:
              selectors:
                - expression: //main
              split: false
```
Comparison target
The target selects what a page's fingerprint is compared against. It is either the string temporal or a spatial object.
temporal compares the page against a saved fingerprint for the same URL, which detects changes to one URL's content over time. spatial compares the page against fingerprints from other URLs, which detects duplicates and near-duplicates across different URLs.
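As a minimal sketch, a condition meant to watch a single URL for content changes would use the string form of the target. Only the target field is shown; the rest of the fingerprint condition is omitted.

```yaml
# String form: compare against the saved fingerprint for the same URL.
target: temporal
```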
The spatial form takes a scope:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| scope | local or global | yes | none | Whether the comparison is within this run (local) or across all runs (global). |
```yaml
target:
  spatial:
    scope: global
```
Feature set
A feature set decides which parts of a page form its fingerprint. It is an object with a kind and a params object.
| Kind | Description |
|---|---|
| selectors | Extract features from page elements selected by expression. |
| data_map | Extract features using an inline data map. |
| host_data_map | Extract features using a stored per-host data map. |
selectors
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| selectors | list of feature selectors | yes | body | The elements to extract features from. |
| split | boolean | no | false | Whether each selected element forms a separate feature. |
Feature selector
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| expression | string | yes | none | A selector expression identifying the element. |
| accessor | accessor | no | none | What part of the element to read. Defaults to the element text. |
Accessor
An accessor is an object with a type field selecting one of three forms.
| Type | Field | Description |
|---|---|---|
| attribute | name (string) | Read the named attribute. |
| text | recursive (boolean) | Read the text, optionally including descendant text. |
| html | outer (boolean) | Read the element's HTML, inner or outer. |
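
Putting the selector and accessor fields together, a selectors feature set might look like the sketch below. The expressions and the attribute name are hypothetical, chosen only for illustration. Placing each accessor's own field (recursive, name) next to its type key is an assumption based on the tables above, not a confirmed layout.

```yaml
feature_set:
  kind: selectors
  params:
    selectors:
      # Hypothetical expression: read the text of the main element,
      # including descendant text.
      - expression: //main
        accessor:
          type: text
          recursive: true
      # Hypothetical expression: read the content attribute of a meta tag.
      - expression: '//meta[@name="description"]'
        accessor:
          type: attribute
          name: content
    # Each selected element forms a separate feature.
    split: true
```

With split set to false, which is the default, the selected elements would not be split into separate features.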
data_map
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| map | data map | yes | none | The data map that extracts the features. |
| fields | list of strings | yes | none | Which mapped fields form the fingerprint. |
| split | boolean | no | false | Whether each field forms a separate feature. |
host_data_map
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| schema_id | UUID | yes | none | The schema whose stored host data map extracts the features. |
| fields | list of strings | yes | none | Which mapped fields form the fingerprint. |
| split | boolean | no | false | Whether each field forms a separate feature. |
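
A host_data_map feature set could be sketched as follows. The UUID is a placeholder and the field names are hypothetical; real values depend on the schema and the data map stored for the host.

```yaml
feature_set:
  kind: host_data_map
  params:
    # Placeholder UUID; substitute the ID of an existing schema.
    schema_id: 00000000-0000-0000-0000-000000000000
    # Hypothetical mapped field names that form the fingerprint.
    fields:
      - title
      - price
    # Combine the fields rather than treating each as a separate feature.
    split: false
```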