Lexical Metrics, Baselines, and Feature Weights
Matheel can blend lexical similarity with semantic and code-aware scores.
Lexical Metrics and Baselines
- levenshtein: Normalized edit-distance similarity.
- jaro_winkler: String similarity that rewards shared prefixes and near matches.
- winnowing: Token fingerprint overlap based on k-grams and sliding-window minima.
- gst: Greedy String Tiling over token sequences.
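As a concrete illustration of the first metric, normalized edit-distance similarity can be sketched as follows. This is an independent sketch of the standard dynamic-programming algorithm, not Matheel's internal implementation:

```python
def levenshtein_similarity(a: str, b: str) -> float:
    """Normalized edit-distance similarity: 1 - distance / max length."""
    if not a and not b:
        return 1.0
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))
```

For example, "kitten" vs "sitting" has edit distance 3 over a maximum length of 7, giving a similarity of about 0.571.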
By default, winnowing and GST tokenize with Matheel's shared code-token regex after preprocessing. Set lexical_tokenizer="parser" (or pass --lexical-tokenizer parser on the CLI) to tokenize with tree-sitter leaf node types for supported languages. Parser-derived token streams are more structural and less sensitive to identifier renaming.
These are most useful when:
- identifier changes are small
- formatting is similar
- you want a lightweight non-embedding metric in the final score
- you want token-level overlap without loading an embedding model
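The default regex tokenization can be approximated with a short sketch. The pattern below is a hypothetical stand-in; Matheel's actual shared code-token regex may differ:

```python
import re

# Hypothetical stand-in for the shared code-token regex: identifiers,
# integer literals, and single punctuation characters.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|[^\w\s]")

def tokenize(code: str) -> list[str]:
    """Split source text into identifier, number, and punctuation tokens."""
    return TOKEN_RE.findall(code)
```

Under this scheme, "x = y + 1" yields the tokens ["x", "=", "y", "+", "1"], which then feed the k-gram and tiling metrics.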
Parameters
- feature_weights
- levenshtein_weights
- jaro_winkler_prefix_weight
- winnowing_kgram
- winnowing_window
- gst_min_match_length
- lexical_tokenizer
Canonical Weight Format
Use feature_weights as a dictionary or repeated name=value entries:
```python
feature_weights = {
    "semantic": 0.5,
    "levenshtein": 0.25,
    "jaro_winkler": 0.1,
    "winnowing": 0.1,
    "gst": 0.05,
}
```
CLI:
```shell
python examples/sample_data.py --output sample_pairs.zip --overwrite
matheel compare sample_pairs.zip \
  --feature-weight semantic=0.5 \
  --feature-weight levenshtein=0.25 \
  --feature-weight jaro_winkler=0.1 \
  --feature-weight winnowing=0.1 \
  --feature-weight gst=0.05
```
Normalization
Matheel normalizes the supplied weights automatically.
That means:
- values only need to be non-negative
- they do not need to sum to 1.0
- the final blend stays stable after normalization
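The effect of normalization can be sketched in a few lines. This is a simplified stand-in, not Matheel's internal code:

```python
def normalize_weights(weights: dict[str, float]) -> dict[str, float]:
    """Scale non-negative weights so they sum to 1.0."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    return {name: w / total for name, w in weights.items()}
```

So supplying semantic=2, levenshtein=1, jaro_winkler=1 blends exactly like supplying 0.5, 0.25, 0.25.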
Default Blend
If you omit feature_weights, Matheel uses:
```python
{
    "semantic": 0.7,
    "levenshtein": 0.3,
    "jaro_winkler": 0.0,
}
```
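How weighted scores combine into one value can be sketched as follows. The blend() helper is a hypothetical illustration, not part of Matheel's API:

```python
def blend(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-metric similarity scores.

    Weights are normalized by their sum, so any non-negative scale works.
    Metrics missing from `scores` contribute 0.0.
    """
    total = sum(weights.values())
    return sum(w * scores.get(name, 0.0) for name, w in weights.items()) / total
```

With the default blend, semantic=0.8 and levenshtein=0.6 produce 0.7 * 0.8 + 0.3 * 0.6 = 0.74.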
Baseline-Specific Options
CLI:
```shell
python examples/sample_data.py --output sample_pairs.zip --overwrite
matheel compare sample_pairs.zip \
  --feature-weight winnowing=1.0 \
  --winnowing-kgram 5 \
  --winnowing-window 4 \
  --lexical-tokenizer raw
```
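To make the kgram and window knobs concrete, here is a minimal sketch of winnowing fingerprinting. It is an independent illustration of the standard algorithm, not Matheel's code, and the Jaccard-style overlap at the end is likewise an assumption:

```python
def winnow(tokens: list[str], k: int = 5, window: int = 4) -> set[int]:
    """Fingerprint a token stream: hash every k-gram, then keep the
    minimum hash from each sliding window of hashes.

    Note: Python's hash() is salted per process, so fingerprints are
    only comparable within a single run.
    """
    hashes = [hash(tuple(tokens[i:i + k])) for i in range(len(tokens) - k + 1)]
    if not hashes:
        return set()
    if len(hashes) < window:
        return {min(hashes)}
    return {min(hashes[i:i + window]) for i in range(len(hashes) - window + 1)}

def winnowing_similarity(a: list[str], b: list[str],
                         k: int = 5, window: int = 4) -> float:
    """Jaccard overlap of the two fingerprint sets."""
    fa, fb = winnow(a, k, window), winnow(b, k, window)
    if not fa or not fb:
        return 0.0
    return len(fa & fb) / len(fa | fb)
```

Larger kgram values demand longer exact token runs before anything matches; larger window values keep fewer fingerprints and make the comparison coarser.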
```shell
python examples/sample_data.py --output sample_pairs.zip --overwrite
matheel compare sample_pairs.zip \
  --feature-weight gst=1.0 \
  --gst-min-match-length 5 \
  --code-language python \
  --lexical-tokenizer parser
```
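A compact sketch of Greedy String Tiling shows what gst-min-match-length controls. This is an independent, unoptimized illustration of the classic algorithm (quadratic in the input sizes), not Matheel's implementation:

```python
def gst_coverage(a: list[str], b: list[str], min_match: int = 5) -> int:
    """Greedy String Tiling: repeatedly take the longest common unmarked
    token run of at least min_match tokens, mark it as a tile, and
    return the total number of tiled tokens."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiled = 0
    while True:
        best = None
        best_len = min_match - 1  # only accept runs >= min_match
        for i in range(len(a)):
            for j in range(len(b)):
                length = 0
                while (i + length < len(a) and j + length < len(b)
                       and not marked_a[i + length]
                       and not marked_b[j + length]
                       and a[i + length] == b[j + length]):
                    length += 1
                if length > best_len:
                    best_len, best = length, (i, j)
        if best is None:
            break
        i, j = best
        for offset in range(best_len):  # mark the tile in both sequences
            marked_a[i + offset] = True
            marked_b[j + offset] = True
        tiled += best_len
    return tiled
```

Raising min_match filters out short coincidental overlaps; a similarity score can then be derived from the coverage, for example as 2 * tiled / (len(a) + len(b)).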