Scoring and Calibration
Matheel scores are useful for ranking pairs inside one run. They are not automatically calibrated probabilities.
Different algorithms expose different evidence:
- embedding backends compare vector geometry
- Levenshtein and Jaro-Winkler compare string similarity
- Winnowing and GST compare token overlap
- code metrics compare code-specific structure, tokens, or model alignments
Use raw scores for inspection and ranking. Use labeled data when you need thresholds that carry a specific meaning such as "likely same solution" or "needs review".
Score Ranges
Use feature_score_range(...) when you need the documented range in code.
| Feature or metric | Default range | Interpretation |
|---|---|---|
| semantic with cosine | -1.0 to 1.0 | Larger means closer vector direction. |
| semantic with dot | unbounded | Larger means a stronger dot product, but magnitude depends on vector scale. |
| semantic with euclidean | negative distance to 0.0 | 0.0 means identical vectors; farther vectors are more negative. |
| semantic with manhattan | negative distance to 0.0 | 0.0 means identical vectors; farther vectors are more negative. |
| normalized semantic scores | 0.0 to 1.0 | Enable normalize_semantic_scores=True before blending raw dot, euclidean, or manhattan scores with other metrics. |
| levenshtein | 0.0 to 1.0 | Normalized edit-distance similarity. |
| jaro_winkler | 0.0 to 1.0 | String similarity with prefix weighting. |
| winnowing | 0.0 to 1.0 | Fingerprint overlap. |
| gst | 0.0 to 1.0 | Covered-token overlap from greedy string tiling. |
| code_metric | usually 0.0 to 1.0 | Depends on the selected code metric. Built-in CodeBLEU, CrystalBLEU, RUBY, TSED, and CodeBERTScore paths are intended to be bounded. |
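These ranges can also be resolved programmatically with feature_score_range(...). A minimal sketch, assuming the function is importable from the top-level matheel package and takes a feature name; the import path and return shape are assumptions, not confirmed API details:

from matheel import feature_score_range  # import path is an assumption

# Look up the documented range for a bounded metric; the return shape is assumed here.
levenshtein_range = feature_score_range("levenshtein")
print(levenshtein_range)  # per the table above, this should describe a 0.0 to 1.0 range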
A weighted blend is an arithmetic combination of the selected feature scores after the feature weights are normalized, so a blended score is only as comparable as its component scales. Matheel guards mixed-feature blends that include raw unbounded or distance-style semantic scores; enable semantic normalization when you need a 0-1 blend.
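To illustrate the arithmetic only (this is a plain-Python sketch, not Matheel's internal blending code), with every component on a 0-1 scale:

# Illustrative sketch of a weighted blend with normalized weights; values are made up.
feature_scores = {"levenshtein": 0.74, "winnowing": 0.61, "gst": 0.58}
feature_weights = {"levenshtein": 2.0, "winnowing": 1.0, "gst": 1.0}

total_weight = sum(feature_weights.values())
normalized = {name: weight / total_weight for name, weight in feature_weights.items()}

# Only meaningful because every component score sits on the same 0-1 scale.
blended = sum(normalized[name] * feature_scores[name] for name in feature_scores)
print(round(blended, 3))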
Thresholds
A threshold should be calibrated for the dataset and workflow that will use it.
Good threshold labels are operational:
- "same assignment solution"
- "manual review needed"
- "known clone pair"
- "not related"
Avoid treating a threshold from one model, language, assignment, or preprocessing setup as universal.
Calibrating on Labeled Pairs
Use a small labeled sample that matches the workflow:
- Collect representative code pairs.
- Label each pair with the decision you care about.
- Score the pairs with the same Matheel options you plan to use later.
- Choose the threshold that optimizes the metric you care about.
- Re-check the threshold when you change models, metrics, languages, preprocessing, chunking, or feature weights.
from matheel.calibration import calibrate_threshold
labeled_scores = [
    (0.95, True),
    (0.82, True),
    (0.61, False),
    (0.40, False),
]
report = calibrate_threshold(labeled_scores)
print(report["threshold"])
print(report["precision"], report["recall"], report["f1"])
calibrate_threshold(...) accepts either (score, label) pairs or mappings:
from matheel.calibration import calibrate_threshold
report = calibrate_threshold(
    [
        {"similarity_score": 0.92, "label": "same"},
        {"similarity_score": 0.48, "label": "different"},
    ],
    score_key="similarity_score",
    label_key="label",
    positive_label="same",
)
Use evaluate_threshold(...) when you already have a threshold and want confusion counts, precision, recall, F1, and accuracy.
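A minimal sketch of that call, assuming evaluate_threshold(...) accepts the same labeled pairs as calibrate_threshold(...) plus a threshold argument and returns those fields in a dict; the keyword name and result keys are assumptions:

from matheel.calibration import evaluate_threshold

labeled_scores = [(0.95, True), (0.82, True), (0.61, False), (0.40, False)]

# The threshold keyword and the result keys below are assumed, not confirmed API details.
report = evaluate_threshold(labeled_scores, threshold=0.7)
print(report["precision"], report["recall"], report["f1"], report["accuracy"])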
Threshold Tuning Reports
Use tune-threshold for a reproducible threshold sweep over scored pair rows. Unlike the fuller calibration report, this workflow also works for all-positive or all-negative samples and records a warning when ROC or precision-recall summaries are not meaningful.
matheel tune-threshold scored_pairs.csv \
--score-column similarity_score \
--label-column label \
--optimize f1 \
--output-dir threshold_tuning
The output includes:
- threshold_tuning_threshold_sweep.csv: threshold, confusion counts, precision, recall, F1, and accuracy.
- threshold_tuning_summary.json: selected threshold and reproducibility settings.
- threshold_tuning_report.json: machine-readable report payload.
- threshold_tuning_report.html: local HTML summary.
Python usage:
from matheel.calibration import write_threshold_tuning_report_artifacts
# scored_pairs: the labeled, scored rows, e.g. mappings with similarity_score and label keys
report, artifacts = write_threshold_tuning_report_artifacts(
    scored_pairs,
    "threshold_tuning",
    score_key="similarity_score",
    label_key="label",
    optimize="f1",
)
print(report["summary"]["optimized_threshold"]["threshold"])
print(artifacts["threshold_sweep_csv"])
Calibration Reports
For scored pair CSVs, export a threshold sweep, ROC curve, precision-recall curve, and summary JSON:
matheel calibration-report scored_pairs.csv \
--score-column similarity_score \
--label-column label \
--output-dir calibration_artifacts
The output includes:
- calibration_threshold_sweep.csv: threshold, confusion counts, precision, recall, F1, and accuracy.
- calibration_roc.csv: threshold rows with false-positive and true-positive rates.
- calibration_precision_recall.csv: threshold rows with precision and recall.
- calibration_summary.json: AUROC, average precision, and the optimized threshold.
- calibration_report.json: all curve rows plus the summary in one JSON artifact.
The ROC and precision-recall reports require at least one positive and one negative label. Single-class labeled samples fail with a clear error because AUROC and average precision are not meaningful for them.
from matheel.calibration import calibration_report
report = calibration_report(
    [
        {"similarity_score": 0.95, "label": 1},
        {"similarity_score": 0.82, "label": 1},
        {"similarity_score": 0.30, "label": 0},
        {"similarity_score": 0.10, "label": 0},
    ],
    score_key="similarity_score",
    label_key="label",
)
print(report["summary"]["auroc"])
print(report["summary"]["optimized_threshold"]["threshold"])
Comparing Runs
When comparing algorithms or configurations:
- keep the input set fixed
- keep labels separate from scoring
- compare ranking quality and calibrated threshold quality, not just raw averages
- report the active feature_weights, vector_backend, code_metric, preprocessing, and chunking settings
- avoid comparing raw dot or negative distance scores directly against 0-1 metrics
The comparison suite records feature_set, vector_backend, code_metric, chunking_method, summary score statistics, and elapsed time. Use those fields to keep comparisons auditable.
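A hedged sketch of using those recorded fields to keep a comparison honest; the record dicts and the mean_score field here are hypothetical placeholders, not the suite's exact output schema:

# Hypothetical run records shaped after the fields listed above; values are placeholders.
run_a = {"feature_set": ["semantic"], "vector_backend": "backend_a", "code_metric": None,
         "chunking_method": "none", "mean_score": 0.71}
run_b = {"feature_set": ["semantic"], "vector_backend": "backend_a", "code_metric": None,
         "chunking_method": "none", "mean_score": 0.64}

# Only compare score statistics when the configuration fields match.
config_keys = ("feature_set", "vector_backend", "code_metric", "chunking_method")
if all(run_a[key] == run_b[key] for key in config_keys):
    print("comparable runs; mean score difference:", run_a["mean_score"] - run_b["mean_score"])
else:
    print("configurations differ; reconcile settings before comparing scores")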