
Heuristics

A condition that keeps a page by scoring it against weighted heuristics.

A condition decides whether a fetched page is processed and its result kept. Conditions are configured inside a condition step under a pipeline's steps; a single condition step holds a list of conditions. This page documents the heuristics condition and the heuristic kinds it uses.

The heuristics condition scores a page against a list of heuristics and keeps it when the combined score meets a threshold. Each heuristic looks at one signal (the URL shape, structured-data markup, text density, and so on), produces a score, and contributes to the total according to its weight.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| threshold | float (0 to 1) | no | 0.65 | The combined score at or above which the page is kept. |
| exclude_roots | boolean | no | true | Whether root URLs are exempt from scoring and always kept. |
| heuristics | list of heuristic kinds | no | empty | The heuristics to score with. At least one is required when the list is set. |

```yaml
steps:
  - kind: condition
    params:
      - kind: heuristics
        params:
          threshold: 0.65
          heuristics:
            - kind: content_density
              params:
                weight: 1.0
                min_block_words: 15
            - kind: url_path_segment
              params:
                min_segments: 3
```

Heuristic kinds

Each entry in the heuristics list is an object with a kind and a params object. Most heuristics share a weight (how much the heuristic counts toward the combined score). Pattern-based heuristics also share a score (the score given on a match) and accumulate (whether repeated matches add up rather than scoring once).

url_path_segment

Scores pages whose URL path has at least a minimum number of segments, on the basis that deeper paths are more often content pages.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| min_segments | integer | no | 3 | The minimum number of path segments to score. |
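
As a rough illustration, a url_path_segment entry in the heuristics list might look like the sketch below. The values are arbitrary and chosen only for the example; the entry is shown under a bare heuristics key rather than inside a full pipeline.

```yaml
heuristics:
  - kind: url_path_segment
    params:
      weight: 0.5        # illustrative value; defaults to 1.0 when omitted
      min_segments: 4    # illustrative value; defaults to 3 when omitted
```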

url_path_keyword

Scores pages whose URL path contains any of the given keywords.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| keywords | list of strings | no | empty | Keywords to look for in the URL path. |
| score | float (0.1 to 1) | no | 0.5 | The score given when a keyword is found. |
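
A minimal sketch of a url_path_keyword entry follows. The keywords and numeric values are made up for the example and are not recommended settings.

```yaml
heuristics:
  - kind: url_path_keyword
    params:
      weight: 0.8          # illustrative value
      keywords:            # hypothetical keywords, chosen only for this example
        - blog
        - article
      score: 0.6           # illustrative value; must fall within 0.1 to 1
```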

open_graph

Scores pages that carry Open Graph markup of given types or fields.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| types | Open Graph types | no | see table | Open Graph object types to score. |
| fields | list of Open Graph fields | no | empty | Specific Open Graph fields to score. |

Open Graph types

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| types | list of strings | no | empty | The Open Graph object types to match (such as article). |
| score | float (0.1 to 1) | no | 0.5 | The score given when a type matches. |

Open Graph field

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| selectors | list of strings | no | empty | Selectors identifying the Open Graph field. |
| score | float (0.1 to 1) | no | 0.5 | The score given when the field is present. |
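
The sketch below shows how an open_graph entry might be written. It assumes the Open Graph types object is a nested mapping with its own types and score keys, which the tables suggest but do not show explicitly. The fields list is left out because the selector syntax is not described on this page.

```yaml
heuristics:
  - kind: open_graph
    params:
      weight: 1.0
      types:               # assumed shape: the "Open Graph types" object as a nested mapping
        types:
          - article
        score: 0.7         # illustrative value
```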

json_ld

Scores pages that carry JSON-LD structured data of given types.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| types | JSON-LD types | no | see table | JSON-LD types to score. |

JSON-LD types

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| types | list of strings | no | empty | The JSON-LD types to match (such as Article). |
| score | float (0.1 to 1) | no | 0.5 | The score given when a type matches. |
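
A json_ld entry could look roughly like this. As with the other nested objects, the sketch assumes the JSON-LD types object is written as a nested mapping with its own types and score keys; the values are illustrative only.

```yaml
heuristics:
  - kind: json_ld
    params:
      weight: 1.0
      types:               # assumed shape: the "JSON-LD types" object as a nested mapping
        types:
          - Article
        score: 0.7         # illustrative value
```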

Scores pages by the ratio of text to links, distinguishing content from navigation.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| min_density | float (0 to 1) | no | 0.2 | The minimum text-to-link density to score. |

content_density

Scores pages that contain dense blocks of text.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| min_block_words | integer | no | 15 | The minimum number of words in a block for it to count. |

regex_patterns

Scores pages whose URL matches regular expressions.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| patterns | list of regex strings | no | empty | Patterns to match the URL against. |
| score | float (0.1 to 1) | no | 0.5 | The score given on a match. |
| accumulate | boolean | no | false | Whether multiple matches add up rather than scoring once. |
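
For illustration, a regex_patterns entry might be sketched as follows. The patterns are hypothetical and use only basic regular-expression syntax, since the regex flavor is not stated on this page. Single quotes keep YAML from interpreting the backslashes.

```yaml
heuristics:
  - kind: regex_patterns
    params:
      weight: 0.9                  # illustrative value
      patterns:                    # hypothetical patterns, for illustration only
        - '/blog/'
        - '\.html$'
      score: 0.4                   # illustrative value
      accumulate: true             # let multiple matching patterns add up
```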

url_patterns

Scores pages whose URL matches URL-component patterns.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | no | 1.0 | The heuristic's weight in the combined score. |
| patterns | list of URL patterns | no | empty | URL-component patterns to match. |
| score | float (0.1 to 1) | no | 0.5 | The score given on a match. |
| accumulate | boolean | no | false | Whether multiple matches add up rather than scoring once. |

host_patterns

Scores pages using a built-in set of patterns for a content category.

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| weight | float (0 to 1) | yes | none | The heuristic's weight in the combined score. |
| score | float (0.1 to 1) | yes | none | The score given on a match. |
| category | enum | yes | none | The content category to match: articles, blogs, products, forums, sections, or others. |
| accumulate | boolean | no | false | Whether multiple matches add up rather than scoring once. |
| format | regex or url_pattern | no | none | Which pattern format the built-in set uses. |
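
Unlike the other kinds, host_patterns has no defaults for weight, score, or category, so all three need to be set. A minimal sketch, with values picked only for the example:

```yaml
heuristics:
  - kind: host_patterns
    params:
      weight: 1.0          # required; illustrative value
      score: 0.5           # required; illustrative value
      category: articles   # required; one of the categories listed in the table
      format: regex        # optional; regex or url_pattern
```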
