Reproducible Benchmark Demo
This demo creates a tiny synthetic pair-classification dataset, adapts it through the dataset manifest workflow, scores it with an offline lexical baseline, and writes auditable outputs.
It is a workflow example, not a benchmark claim. Real datasets are not bundled with Matheel, and users should provide or download datasets according to each source's terms.
Run the Demo
From the repository root:
python examples/evaluation/reproducible_benchmark_demo.py \
--output-dir benchmark_outputs/synthetic_pair_benchmark \
--overwrite
The command writes:
benchmark_outputs/synthetic_pair_benchmark/
    raw/pairs.csv
    dataset_manifest.json
    benchmark_config.json
    dataset/
    results/
        scored_pairs.csv
        pair_metrics.json
        resample_metrics.csv
        resample_summary.csv
        reproducibility.json
The generated files capture:
- synthetic raw inputs and the normalized Matheel dataset
- source, adapter, and destination choices in dataset_manifest.json
- threshold, feature weights, preprocessing, language, and resampling seed in benchmark_config.json
- scored rows and aggregate pair-classification metrics
- fold-level metrics and interval summaries
- package versions, Python metadata, platform metadata, source fingerprint, and run metadata
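To inspect the aggregates programmatically, you can load the JSON outputs with the standard library. The snippet below is a minimal sketch: it assumes only that pair_metrics.json and reproducibility.json are JSON documents, not any particular key names.

import json
from pathlib import Path

results = Path("benchmark_outputs/synthetic_pair_benchmark/results")

# Aggregate pair-classification metrics written by the demo.
metrics = json.loads((results / "pair_metrics.json").read_text())
print(json.dumps(metrics, indent=2))

# Environment and provenance metadata for the run.
repro = json.loads((results / "reproducibility.json").read_text())
print(json.dumps(repro, indent=2))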
CLI Equivalent
After generating the synthetic inputs, you can rerun the scoring step through the CLI:
matheel evaluate-pairs \
--manifest benchmark_outputs/synthetic_pair_benchmark/dataset_manifest.json \
--feature-weight levenshtein=1.0 \
--preprocess-mode basic \
--code-language python \
--threshold 0.65 \
--scores-out benchmark_outputs/synthetic_pair_benchmark/results/scored_pairs_cli.csv \
--metrics-out benchmark_outputs/synthetic_pair_benchmark/results/pair_metrics_cli.json \
--reproducibility-out benchmark_outputs/synthetic_pair_benchmark/results/reproducibility_cli.json
The example uses levenshtein=1.0, so it runs offline and does not download embedding models.
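To confirm that the CLI rerun reproduces the demo's scoring pass, one quick check is to compare the two metrics files. A minimal sketch, assuming both files are flat JSON objects of metric values (the exact keys depend on Matheel's output schema):

import json
from pathlib import Path

results = Path("benchmark_outputs/synthetic_pair_benchmark/results")
demo = json.loads((results / "pair_metrics.json").read_text())
cli = json.loads((results / "pair_metrics_cli.json").read_text())

# With the same manifest, weights, preprocessing, and threshold,
# the two runs should agree on every shared metric.
for key in sorted(set(demo) & set(cli)):
    if demo[key] != cli[key]:
        print(f"mismatch on {key}: {demo[key]} != {cli[key]}")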
Adapt the Workflow
For a custom pair dataset:
- Replace raw/pairs.csv with your own table.
- Update dataset_manifest.json so adapter_options point to your column names (see the sketch after this list).
- Keep benchmark_config.json with the run settings you want to audit.
- Keep a fixed resampling seed while comparing configurations.
- Store generated outputs outside source control unless the files are deliberately tiny examples.
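As a concrete illustration of the adapter_options step, the sketch below rewrites the generated manifest to point at different column names. It assumes adapter_options is a JSON mapping inside the manifest; the option keys shown (left_column, right_column, label_column) are hypothetical, so check the manifest the demo generated for the names Matheel actually uses.

import json
from pathlib import Path

manifest_path = Path("benchmark_outputs/synthetic_pair_benchmark/dataset_manifest.json")
manifest = json.loads(manifest_path.read_text())

# Hypothetical option keys; replace with the names found in the
# generated manifest before adapting this to your dataset.
manifest["adapter_options"].update({
    "left_column": "code_a",
    "right_column": "code_b",
    "label_column": "is_clone",
})

manifest_path.write_text(json.dumps(manifest, indent=2))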
For retrieval datasets, use auto_retrieval_tabular and the ranking metrics described in Datasets and evaluation.
Reproducibility Checklist
- Record the Matheel version and optional backend package versions.
- Record the dataset source, adapter, and normalized source fingerprint.
- Record scorer settings, preprocessing, language, threshold, and feature weights.
- Record split method, number of folds or rounds, confidence level, and random seed.
- Keep scored rows alongside summary metrics so threshold decisions can be audited.
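A lightweight way to enforce this checklist is to assert that every recorded artifact exists before archiving a run. A minimal sketch, assuming only the file layout produced by the demo above:

from pathlib import Path

run_dir = Path("benchmark_outputs/synthetic_pair_benchmark")

# Files that together cover the checklist: dataset provenance, scorer
# settings, resampling setup, scored rows, and environment metadata.
required = [
    "dataset_manifest.json",
    "benchmark_config.json",
    "results/scored_pairs.csv",
    "results/pair_metrics.json",
    "results/resample_metrics.csv",
    "results/reproducibility.json",
]

missing = [name for name in required if not (run_dir / name).exists()]
if missing:
    raise SystemExit(f"incomplete run, missing: {missing}")
print("run artifacts complete")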