Fingerprint

A condition that detects duplicate or near-duplicate pages by fingerprint.

A condition decides whether a fetched page is processed and its result kept. Conditions are configured inside a condition step under a pipeline's steps; a single condition step holds a list of conditions. This page documents the fingerprint condition.

The fingerprint condition computes a fingerprint of a page's content and compares it against other fingerprints to detect duplicates and near-duplicates. A page whose similarity to an existing fingerprint reaches the threshold is treated as a duplicate.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| target | comparison target | no | spatial, local scope | What the fingerprint is compared against. |
| threshold | float (0 to 1) | no | 0.98 | The similarity at or above which a page is treated as a duplicate. |
| batch_size | integer | no | 1000 | How many fingerprints are compared per batch. |
| exclude_roots | boolean | no | true | Whether root URLs are exempt from fingerprinting. |
| namespace | string | no | default | The namespace fingerprints are grouped and compared under. |
| feature_set | feature set | no | selectors (body) | How the features that form the fingerprint are extracted from the page. |

```yaml
steps:
  - kind: condition
    params:
      - kind: fingerprint
        params:
          threshold: 0.98
          target:
            spatial:
              scope: global
          feature_set:
            kind: selectors
            params:
              selectors:
                - expression: //main
              split: false
```

Comparison target

The target selects what a page's fingerprint is compared against. It is either the string temporal or a spatial object.

temporal compares the page against a saved fingerprint for the same URL, which detects changes to one URL's content over time. spatial compares the page against fingerprints from other URLs, which detects duplicates and near-duplicates across different URLs.
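As a minimal sketch of the temporal form, the condition step below sets target to the plain string temporal, so each page is compared against the fingerprint saved for its own URL. The surrounding step layout mirrors the example earlier on this page. The threshold is written out only for clarity and matches the documented default.

```yaml
steps:
  - kind: condition
    params:
      - kind: fingerprint
        params:
          # Compare against the saved fingerprint for the same URL.
          target: temporal
          # Documented default, shown explicitly for readability.
          threshold: 0.98
```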

The spatial form takes a scope:

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| scope | local or global | yes | none | Whether the comparison is within this run (local) or across all runs (global). |

```yaml
target:
  spatial:
    scope: global
```

Feature set

A feature set decides which parts of a page form its fingerprint. It is an object with a kind and a params object.

| Kind | Description |
| --- | --- |
| selectors | Extract features from page elements selected by expression. |
| data_map | Extract features using an inline data map. |
| host_data_map | Extract features using a stored per-host data map. |

selectors

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| selectors | list of feature selectors | yes | body | The elements to extract features from. |
| split | boolean | no | false | Whether each selected element forms a separate feature. |

Feature selector

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| expression | string | yes | none | A selector expression identifying the element. |
| accessor | accessor | no | none | What part of the element to read. Defaults to the element text. |

Accessor

An accessor is an object with a type field selecting one of three forms.

| Type | Field | Description |
| --- | --- | --- |
| attribute | name (string) | Read the named attribute. |
| text | recursive (boolean) | Read the text, optionally including descendant text. |
| html | outer (boolean) | Read the element's HTML, inner or outer. |
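The sketch below shows how feature selectors and accessors might combine in a selectors feature set. It assumes each accessor's extra field (name, recursive, outer) sits beside type in the same object, which the table above suggests but does not state outright. The expressions are illustrative and not taken from a real configuration.

```yaml
feature_set:
  kind: selectors
  params:
    selectors:
      # Read the article text, including text of descendant elements.
      - expression: //article
        accessor:
          type: text
          recursive: true
      # Read one attribute instead of the element text (hypothetical expression).
      - expression: '//link[@rel="canonical"]'
        accessor:
          type: attribute
          name: href
    # Each selected element forms its own feature.
    split: true
```

With split set to false instead, the selected elements would not form separate features.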

data_map

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| map | data map | yes | none | The data map that extracts the features. |
| fields | list of strings | yes | none | Which mapped fields form the fingerprint. |
| split | boolean | no | false | Whether each field forms a separate feature. |

host_data_map

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| schema_id | UUID | yes | none | The schema whose stored host data map extracts the features. |
| fields | list of strings | yes | none | Which mapped fields form the fingerprint. |
| split | boolean | no | false | Whether each field forms a separate feature. |
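A host_data_map feature set could look roughly like the following. The schema_id is a placeholder UUID, and the field names title and body are hypothetical; real values depend on the schema and the fields its stored data map produces.

```yaml
feature_set:
  kind: host_data_map
  params:
    # Placeholder UUID; substitute the ID of an existing schema.
    schema_id: 00000000-0000-0000-0000-000000000000
    # Hypothetical field names produced by the stored data map.
    fields:
      - title
      - body
    # Documented default: the fields do not form separate features.
    split: false
```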
