Lexical Metrics, Baselines, and Feature Weights

Matheel can blend lexical similarity with semantic and code-aware scores.

Lexical Metrics and Baselines

levenshtein: Normalized edit-distance similarity.
jaro_winkler: String similarity that rewards shared prefixes and near matches.
winnowing: Token fingerprint overlap based on k-grams and sliding-window minima.
gst: Greedy String Tiling over token sequences.
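To make the winnowing entry concrete, here is a minimal sketch of token fingerprinting with k-grams and sliding-window minima. It is not Matheel's implementation: the `tokenize` regex, the SHA-1 based hash, and the Jaccard overlap are all assumptions chosen for illustration, and `k` and `window` only mirror the spirit of winnowing_kgram and winnowing_window.

```python
import hashlib
import re


def tokenize(code):
    # Hypothetical tokenizer: identifiers, integer runs, and single punctuation marks.
    return re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", code)


def kgram_hashes(tokens, k):
    # Hash every run of k consecutive tokens.
    hashes = []
    for i in range(len(tokens) - k + 1):
        gram = " ".join(tokens[i:i + k])
        digest = hashlib.sha1(gram.encode("utf-8")).hexdigest()
        hashes.append(int(digest[:8], 16))
    return hashes


def winnow(hashes, window):
    # Keep the minimum hash of each sliding window as a fingerprint.
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}


def winnowing_similarity(a, b, k=5, window=4):
    # Jaccard overlap of the two fingerprint sets (an assumed overlap measure).
    tokens_a, tokens_b = tokenize(a), tokenize(b)
    prints_a = winnow(kgram_hashes(tokens_a, k), window)
    prints_b = winnow(kgram_hashes(tokens_b, k), window)
    if not prints_a or not prints_b:
        # Too few tokens to fingerprint: fall back to exact token equality.
        return 1.0 if tokens_a == tokens_b else 0.0
    return len(prints_a & prints_b) / len(prints_a | prints_b)


original = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a,b):\n  return a+b\n"
# Whitespace-only changes leave the token stream, and so the fingerprints, unchanged.
same_tokens_score = winnowing_similarity(original, reformatted)
```

Larger `k` values make each fingerprint more specific, while larger `window` values keep fewer fingerprints per file.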

By default, Winnowing and GST tokenize the preprocessed code with Matheel's shared code-token regex. Set lexical_tokenizer="parser" (CLI: --lexical-tokenizer parser) to use tree-sitter leaf node types instead, for supported languages. Parser-derived token streams capture more structure and are less sensitive to identifier renaming.
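The effect of a type-based token stream can be illustrated without tree-sitter. The sketch below uses Python's standard tokenize module as a stand-in: it keeps only each token's type and discards its text. This is an analogy under that assumption, not the stream Matheel or tree-sitter actually produces.

```python
import io
import tokenize


def token_type_stream(source):
    # Return the sequence of token type names, ignoring the token text itself.
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return [tokenize.tok_name[tok.exact_type] for tok in tokens]


original = "total = a + b\n"
renamed = "result = x + y\n"
changed_operator = "total = a * b\n"

# Renaming identifiers leaves the type stream untouched; changing an operator does not.
rename_is_invisible = token_type_stream(original) == token_type_stream(renamed)
operator_is_visible = token_type_stream(original) != token_type_stream(changed_operator)
```

Because the renamed snippet yields the same type stream, any token-overlap metric computed on that stream scores the pair as identical.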

Lexical metrics are most useful when:

identifier changes are small
formatting is similar
you want a lightweight non-embedding metric in the final score
you want token-level overlap without loading an embedding model
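To see why small identifier changes keep a lexical score high, here is a minimal sketch of a normalized edit-distance similarity. The normalization used here, one minus the distance divided by the longer length, is an assumption; Matheel's levenshtein metric may normalize differently.

```python
def levenshtein_distance(a, b):
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a, b):
    # Assumed normalization: 1 - distance / length of the longer string.
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


original = "sum = a + b"
small_rename = "sum = a + c"
large_rename = "accumulated_total = first + second"

small_score = levenshtein_similarity(original, small_rename)
large_score = levenshtein_similarity(original, large_rename)
```

A one-character rename costs a single edit, so `small_score` stays close to 1.0, while the longer identifiers in `large_rename` push `large_score` far lower.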

Parameters

feature_weights
levenshtein_weights
jaro_winkler_prefix_weight
winnowing_kgram
winnowing_window
gst_min_match_length
lexical_tokenizer

Canonical Weight Format

Pass feature_weights as a dictionary in Python, or as repeated name=value entries on the command line:

feature_weights = {
    "semantic": 0.5,
    "levenshtein": 0.25,
    "jaro_winkler": 0.1,
    "winnowing": 0.1,
    "gst": 0.05,
}

CLI:

python examples/sample_data.py --output sample_pairs.zip --overwrite
matheel compare sample_pairs.zip \
  --feature-weight semantic=0.5 \
  --feature-weight levenshtein=0.25 \
  --feature-weight jaro_winkler=0.1 \
  --feature-weight winnowing=0.1 \
  --feature-weight gst=0.05

Normalization

Matheel normalizes the supplied weights automatically.

That means:

values only need to be non-negative
they do not need to sum to 1.0
the final blend depends only on the relative sizes of the weights, so scaling every weight by the same factor leaves it unchanged
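A minimal sketch of this kind of normalization, assuming each weight is divided by the sum of all weights. The function name `normalize_weights` is hypothetical and is not part of Matheel's API.

```python
def normalize_weights(weights):
    # Assumed rule: divide each non-negative weight by the total.
    if any(value < 0 for value in weights.values()):
        raise ValueError("weights must be non-negative")
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be positive")
    return {name: value / total for name, value in weights.items()}


as_counts = normalize_weights({"semantic": 2, "levenshtein": 1, "jaro_winkler": 1})
as_fractions = normalize_weights({"semantic": 0.5, "levenshtein": 0.25, "jaro_winkler": 0.25})

print(as_counts)
# {'semantic': 0.5, 'levenshtein': 0.25, 'jaro_winkler': 0.25}
```

Both inputs describe the same 2:1:1 ratio, so they normalize to the same weights.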

Default Blend

If you omit feature_weights, Matheel uses:

{
    "semantic": 0.7,
    "levenshtein": 0.3,
    "jaro_winkler": 0.0,
}
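As a worked example of what the default blend implies, the sketch below combines per-metric scores as a weighted average. The weighted-average form and the score values are assumptions for illustration; `blend` is a hypothetical helper, not a Matheel function.

```python
DEFAULT_WEIGHTS = {"semantic": 0.7, "levenshtein": 0.3, "jaro_winkler": 0.0}


def blend(scores, weights):
    # Weighted average of the per-metric scores (assumed blend formula).
    total = sum(weights.values())
    weighted = sum(weights[name] * scores.get(name, 0.0) for name in weights)
    return weighted / total


# Hypothetical scores for one pair of files.
pair_scores = {"semantic": 0.9, "levenshtein": 0.6, "jaro_winkler": 0.8}

final_score = blend(pair_scores, DEFAULT_WEIGHTS)
print(round(final_score, 2))
# 0.81
```

With a weight of 0.0, the jaro_winkler score has no effect on the result: 0.7 * 0.9 + 0.3 * 0.6 = 0.81.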

Baseline-Specific Options

CLI:

python examples/sample_data.py --output sample_pairs.zip --overwrite
matheel compare sample_pairs.zip \
  --feature-weight winnowing=1.0 \
  --winnowing-kgram 5 \
  --winnowing-window 4 \
  --lexical-tokenizer raw

python examples/sample_data.py --output sample_pairs.zip --overwrite
matheel compare sample_pairs.zip \
  --feature-weight gst=1.0 \
  --gst-min-match-length 5 \
  --code-language python \
  --lexical-tokenizer parser
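In the usual formulation of Greedy String Tiling, the minimum match length is the shortest run of matching tokens that can become a tile. The sketch below follows that textbook formulation in its simplest cubic-time form. It is not Matheel's implementation, and the coverage-based score is an assumed normalization.

```python
def greedy_string_tiling(a, b, min_match_length=5):
    # Repeatedly claim the longest unmarked common runs as tiles.
    if min_match_length < 1:
        raise ValueError("min_match_length must be at least 1")
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match_length
        matches = []
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches, max_len = [(i, j, k)], k
                elif k == max_len:
                    matches.append((i, j, k))
        if not matches:
            return tiles
        for i, j, k in matches:
            # Skip matches that overlap a tile placed earlier in this round.
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for offset in range(k):
                marked_a[i + offset] = True
                marked_b[j + offset] = True
            tiles.append((i, j, k))


def gst_similarity(a, b, min_match_length=5):
    # Assumed score: share of tokens in both sequences covered by tiles.
    if not a and not b:
        return 1.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match_length))
    return 2.0 * covered / (len(a) + len(b))


tokens_a = list("abcdefgh")
tokens_b = list("xabcdyefgh")
loose = gst_similarity(tokens_a, tokens_b, min_match_length=3)
strict = gst_similarity(tokens_a, tokens_b, min_match_length=5)
```

Here the two shared runs are four tokens long, so they count as tiles when the minimum is 3 and are ignored when it is 5. Raising the minimum filters out short coincidental matches at the cost of missing small copied fragments.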