Datasets and Evaluation

Matheel supports local labeled pair and retrieval datasets, plus an extensible source/preset/adapter workflow for benchmark experiments.

No external datasets are bundled or downloaded by default. Users provide or download datasets according to each source's terms, license, provenance, and access requirements. Use task_type: plagiarism for plagiarism-oriented datasets so they remain easy to separate from future dataset families.

Dataset Loading Workflow

Dataset loading has three explicit steps:

  1. Resolve a source, such as a local directory, GitHub repository, Zenodo record, Hugging Face dataset repository, or Kaggle dataset.
  2. Adapt the raw layout into Matheel manifests with a registered adapter.
  3. Load one or more normalized pair or retrieval datasets.

For example, loading a local pair dataset and a local retrieval dataset from Python:

from matheel.datasets import load_pair_datasets, load_retrieval_datasets

pair_dataset = load_pair_datasets(
    [
        {
            "source": "local",
            "identifier": "./data/raw_pairs",
            "name": "custom_pairs",
            "adapter": "auto_pair_tabular",
            "adapter_options": {
                "pair_table": "pairs.csv",
                "left_text_column": "code_a",
                "right_text_column": "code_b",
                "label_column": "label",
            },
        }
    ]
)

retrieval_dataset = load_retrieval_datasets(
    [
        {
            "source": "local",
            "identifier": "./data/raw_retrieval",
            "name": "custom_retrieval",
            "adapter": "auto_retrieval_tabular",
            "adapter_options": {
                "retrieval_table": "qrels.csv",
                "query_text_column": "query_code",
                "document_text_column": "candidate_code",
                "relevance_column": "relevance",
            },
        }
    ]
)

Source resolvers write to user-provided destinations when supplied. For remote sources without an explicit destination, Matheel uses a temporary/cache location outside the repository.
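
For example, a remote spec can pin both the download destination and the adapted output location. This is a sketch only: it assumes programmatic specs accept the same destination and adapted_destination keys as manifest entries, and the repository name is a placeholder.

from matheel.datasets import load_pair_datasets

pair_dataset = load_pair_datasets(
    [
        {
            "source": "github",
            "identifier": "example-org/example-pair-dataset",  # placeholder owner/repo
            "destination": "./data/downloads/example_pair_dataset",
            "adapted_destination": "./data/normalized_example_pairs",
            "name": "example_pairs",
            "adapter": "auto_pair_tabular",
            "adapter_options": {
                "pair_table": "pairs.csv",
                "left_text_column": "code_a",
                "right_text_column": "code_b",
                "label_column": "label",
            },
        }
    ]
)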

Reproducible Loading Manifests

Use a dataset manifest when you want source, adapter, and output choices captured in a reviewable file. Manifest paths are resolved relative to the manifest file, so the workflow does not depend on the shell's current working directory. For example, a pair manifest such as datasets.pair.json:

{
  "version": 1,
  "task": "pair",
  "datasets": [
    {
      "name": "custom_pairs",
      "source": "local",
      "identifier": "./data/raw_pairs",
      "adapter": "auto_pair_tabular",
      "adapted_destination": "./data/normalized_pairs",
      "adapter_options": {
        "pair_table": "pairs.csv",
        "left_text_column": "code_a",
        "right_text_column": "code_b",
        "label_column": "label"
      }
    }
  ]
}

Adapter-backed manifest specs must set adapted_destination; remote source specs must set destination. Keep credentials out of manifests. Authenticate with external tools or environment-specific configuration instead.

Load a manifest from Python:

from matheel.datasets import load_pair_datasets_from_manifest

dataset = load_pair_datasets_from_manifest("datasets.pair.json")

Or evaluate directly from the CLI:

matheel evaluate-pairs --manifest datasets.pair.json \
  --scores-out scored_pairs.csv \
  --metrics-out pair_metrics.json

Retrieval manifests use task: "retrieval" and the retrieval adapter options:

{
  "version": 1,
  "task": "retrieval",
  "datasets": [
    {
      "name": "custom_retrieval",
      "source": "local",
      "identifier": "./data/raw_retrieval",
      "adapter": "auto_retrieval_tabular",
      "adapted_destination": "./data/normalized_retrieval",
      "adapter_options": {
        "retrieval_table": "retrieval.csv",
        "query_text_column": "query_code",
        "document_text_column": "candidate_code",
        "relevance_column": "relevance"
      }
    }
  ]
}

JSON manifests are supported by default. YAML manifests are supported when PyYAML is installed.

Sources, Presets, and Adapters

Use the registry APIs to inspect and extend dataset loading:

from matheel.datasets import (
    available_dataset_adapters,
    available_dataset_presets,
    available_dataset_presets_by_task,
    available_dataset_sources,
    register_dataset_adapter,
    register_dataset_preset,
    register_dataset_source,
)

print(available_dataset_sources())
print(available_dataset_adapters())
print(available_dataset_presets())
print(available_dataset_presets_by_task("pair"))
print(available_dataset_presets_by_task("retrieval"))

Built-in generic source resolvers:

Source | Purpose | Notes
--- | --- | ---
local | Use a local directory containing raw or normalized data. | Offline and deterministic.
github | Download a repository archive by owner/repo. | Public repositories need no token.
zenodo | Download files for a Zenodo record id. | Archives are extracted with path traversal checks.
huggingface | Resolve a Hugging Face dataset repository. | Requires optional Hugging Face Hub support.
kaggle | Download a Kaggle dataset. | Requires optional Kaggle API or CLI credentials.

Source resolvers only locate or download data; users are responsible for dataset access, credentials, licenses, and source terms.

Built-in adapters:

Adapter | Output kind | Purpose
--- | --- | ---
auto_pair_tabular | pair_classification | Convert CSV/TSV/JSON rows with left/right code text or file paths and a label.
auto_retrieval_tabular | retrieval | Convert CSV/TSV/JSON rows with query/document code text or file paths and relevance.
soco14_retrieval | retrieval | Convert SOCO14 qrel/source layouts.
irplag_pair | pair_classification | Convert IRPlag pair layouts.
irplag_retrieval | retrieval | Convert IRPlag pair layouts into retrieval manifests.
conplag_pair | pair_classification | Convert ConPlag pair layouts.
ipca_pair | pair_classification | Convert IPCA .txt submissions using filename P/NP labels.
student_code_similarity_pair | pair_classification | Convert Student Code Similarity labeled pair tables.
criminal_minds_pair | pair_classification | Convert Criminal Minds original/plagiarized submission layouts.

Approved built-in plagiarism presets:

Preset | Task families | Source | Identifier
--- | --- | --- | ---
soco14 | retrieval | Zenodo | 7433031
irplag | pair, retrieval | GitHub | oscarkarnalim/sourcecodeplagiarismdataset
conplag | pair | Zenodo | 7332790
ipca | pair | GitHub | humsha/IPCA
student_code_similarity | pair | Kaggle | ehsankhani/student-code-similarity-and-plagiarism-labels
criminal_minds | pair | Zenodo | 19115559

Custom projects can register their own source resolvers, presets, and adapters:

from matheel.datasets import register_dataset_adapter, register_dataset_preset, register_dataset_source


def resolve_internal_dataset(identifier, destination=None, revision="main", token=None, split=None):
    return f"./data/{identifier}"


register_dataset_source("internal", resolve_internal_dataset, overwrite=True)
register_dataset_preset(
    "internal_pairs",
    {
        "source": "internal",
        "identifier": "raw_pairs",
        "adapter": "auto_pair_tabular",
        "task_families": ("pair",),
    },
    overwrite=True,
)
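
The registered source can then back an ordinary spec. A minimal sketch, assuming the internal dataset directory follows the same raw tabular layout as the earlier local example:

from matheel.datasets import load_pair_datasets

pair_dataset = load_pair_datasets(
    [
        {
            "source": "internal",
            "identifier": "raw_pairs",
            "name": "internal_pairs",
            "adapter": "auto_pair_tabular",
            "adapter_options": {
                "pair_table": "pairs.csv",
                "left_text_column": "code_a",
                "right_text_column": "code_b",
                "label_column": "label",
            },
        }
    ]
)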

For package-level dataset support, use the contributing dataset support guide. It covers sources, adapters, and presets, fixture expectations, and reproducibility requirements.

Dataset CLI Utilities

Use matheel datasets list to inspect registered sources, adapters, and presets:

matheel datasets list
matheel datasets list --task retrieval --format json

Use matheel datasets validate to check a normalized dataset and print stable counts:

matheel datasets validate tiny_pairs --format json
matheel datasets validate tiny_retrieval --kind retrieval

Add --output-dir when you want a structured validation report with JSON, CSV, and HTML artifacts:

matheel datasets validate tiny_pairs \
  --kind pair \
  --output-dir dataset_validation \
  --format json

The validation report separates blocking errors from warnings and checks manifests, duplicate ids, missing references, empty files, label coverage, qrel coverage, and basic metadata completeness.

Python usage:

from matheel.dataset_validation import write_dataset_validation_report

report, artifacts = write_dataset_validation_report(
    "tiny_pairs",
    "dataset_validation",
    kind="pair",
)

print(report["status"], report["error_count"], report["warning_count"])
print(artifacts["report_html"])

Use matheel datasets adapt to convert a raw local tabular dataset into normalized Matheel manifests. Always provide an explicit output directory so repeated runs write to the same location:

matheel datasets adapt ./data/raw_pairs \
  --kind pair \
  --output ./data/normalized_pairs \
  --dataset-name custom_pairs \
  --adapter-option pair_table=pairs.csv \
  --adapter-option left_text_column=code_a \
  --adapter-option right_text_column=code_b \
  --adapter-option label_column=label \
  --format json

For retrieval tables:

matheel datasets adapt ./data/raw_retrieval \
  --kind retrieval \
  --output ./data/normalized_retrieval \
  --adapter-option retrieval_table=retrieval.csv \
  --adapter-option query_text_column=query_code \
  --adapter-option document_text_column=candidate_code \
  --adapter-option relevance_column=relevance \
  --format json

Dataset Registry

Dataset tracking uses these fields:

Field | Meaning
--- | ---
name | Stable dataset identifier.
task_type | High-level task label. Use plagiarism for plagiarism-oriented datasets.
dataset_kind | Dataset layout, such as pair_classification or retrieval.
languages | Languages covered by the dataset.
license | Dataset license or unknown.
source_url | Public source page.
access | bundled, download, manual, or external.
citation | Citation text or DOI when available.
notes | Short caveats about labels, splits, or preprocessing.

This metadata registry is separate from source/preset loading and starts empty. Register datasets in code only after deciding they should be tracked by Matheel:

from matheel.datasets import register_dataset_entry

register_dataset_entry(
    "example_plagiarism_dataset",
    task_type="plagiarism",
    dataset_kind="pair_classification",
    languages=("python",),
    license="unknown",
    source_url="https://example.org/dataset",
    access="manual",
    notes="Example registry entry only.",
)

Pair Dataset Format

A pair-classification dataset directory contains:

metadata.json
files.csv
pairs.csv
files/

metadata.json should include:

{
  "task_type": "plagiarism",
  "dataset_kind": "pair_classification",
  "name": "tiny_plagiarism_fixture"
}

files.csv must include:

Column | Meaning
--- | ---
file_id | Stable file identifier with no path separators.
file_path | Relative path under the dataset directory.

pairs.csv must include:

Column | Meaning
--- | ---
left_id | File id for the first submission.
right_id | File id for the second submission.
label | Binary label, where 1 means positive/plagiarism match and 0 means negative.

You can write a small dataset programmatically:

import pandas as pd
from matheel.datasets import write_pair_dataset

write_pair_dataset(
    "tiny_pairs",
    files=pd.DataFrame(
        [
            {"file_id": "a", "text": "print(1)", "suffix": ".py"},
            {"file_id": "b", "text": "print(1)", "suffix": ".py"},
            {"file_id": "c", "text": "print(2)", "suffix": ".py"},
        ]
    ),
    pairs=pd.DataFrame(
        [
            {"left_id": "a", "right_id": "b", "label": 1},
            {"left_id": "a", "right_id": "c", "label": 0},
        ]
    ),
    metadata={"name": "tiny_plagiarism_fixture"},
)
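
To spot-check the normalized layout, you can read the written tables back with pandas and join each pair to its file paths. This is an illustrative sketch that uses only the documented files.csv and pairs.csv columns:

import pandas as pd

files = pd.read_csv("tiny_pairs/files.csv")   # file_id, file_path
pairs = pd.read_csv("tiny_pairs/pairs.csv")   # left_id, right_id, label

# Attach the left and right file paths to every labeled pair.
resolved = (
    pairs.merge(
        files.rename(columns={"file_id": "left_id", "file_path": "left_path"}),
        on="left_id",
    ).merge(
        files.rename(columns={"file_id": "right_id", "file_path": "right_path"}),
        on="right_id",
    )
)
print(resolved[["left_path", "right_path", "label"]])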

Pair Evaluation

Use the CLI to score local pair datasets and write both scored rows and metrics:

matheel evaluate-pairs tiny_pairs \
  --feature-weight levenshtein=1.0 \
  --threshold 0.8 \
  --scores-out scored_pairs.csv \
  --metrics-out pair_metrics.json

The command defaults to levenshtein=1.0 so small local evaluations are offline-friendly. Add semantic features explicitly when you want model-backed scoring.

You can also adapt a raw custom pair table directly from the CLI:

matheel evaluate-pairs ./data/raw_pairs \
  --adapter auto_pair_tabular \
  --adapter-option pair_table=pairs.csv \
  --adapter-option left_text_column=code_a \
  --adapter-option right_text_column=code_b \
  --adapter-option label_column=label \
  --scores-out scored_pairs.csv \
  --metrics-out pair_metrics.json

Use --preset NAME for registered presets, or combine --source, --identifier, --destination, --revision, --split, and --path-in-archive when a resolver needs an explicit source spec. Matheel does not require or store credentials in these commands.

Retrieval Dataset Format

A retrieval dataset directory contains:

metadata.json
files.csv
queries.csv
corpus.csv
qrels.csv
files/

metadata.json should include:

{
  "task_type": "plagiarism",
  "dataset_kind": "retrieval",
  "name": "tiny_plagiarism_retrieval_fixture"
}

files.csv uses the same columns as pair datasets. queries.csv maps query ids to files:

Column | Meaning
--- | ---
query_id | Stable query identifier with no path separators.
file_id | File id from files.csv.

corpus.csv maps candidate document ids to files:

Column | Meaning
--- | ---
document_id | Stable document identifier with no path separators.
file_id | File id from files.csv.

qrels.csv stores relevance judgments:

Column | Meaning
--- | ---
query_id | Query id from queries.csv.
document_id | Candidate document id from corpus.csv.
relevance | Non-negative relevance score. Values greater than 0 are treated as relevant.

You can write a small retrieval dataset programmatically:

import pandas as pd
from matheel.datasets import write_retrieval_dataset

write_retrieval_dataset(
    "tiny_retrieval",
    files=pd.DataFrame(
        [
            {"file_id": "query_a", "text": "print(1)", "suffix": ".py"},
            {"file_id": "doc_a", "text": "print(1)", "suffix": ".py"},
            {"file_id": "doc_b", "text": "print(2)", "suffix": ".py"},
        ]
    ),
    queries=pd.DataFrame([{"query_id": "q1", "file_id": "query_a"}]),
    corpus=pd.DataFrame(
        [
            {"document_id": "d1", "file_id": "doc_a"},
            {"document_id": "d2", "file_id": "doc_b"},
        ]
    ),
    qrels=pd.DataFrame([{"query_id": "q1", "document_id": "d1", "relevance": 1}]),
    metadata={"name": "tiny_plagiarism_retrieval_fixture"},
)

Retrieval Evaluation

Use the CLI to score each query against every corpus document and write ranking metrics:

matheel evaluate-retrieval tiny_retrieval \
  --feature-weight levenshtein=1.0 \
  --k 10 \
  --scores-out scored_retrieval.csv \
  --metrics-out retrieval_metrics.json

The metrics include mean average precision, mean reciprocal rank, precision at k, recall at k, and nDCG at k.
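
For intuition about the cutoff metrics, here is an illustrative sketch of precision and recall at k for a single query. It is not Matheel's implementation; it assumes a ranked list of document ids plus the set of documents with relevance greater than 0 in qrels.csv:

def precision_recall_at_k(ranked_ids, relevant_ids, k=10):
    # Count relevant documents among the top-k ranked candidates.
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

print(precision_recall_at_k(["d1", "d2", "d3"], {"d1"}, k=2))  # (0.5, 1.0)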

Raw custom retrieval tables can be adapted the same way:

matheel evaluate-retrieval ./data/raw_retrieval \
  --adapter auto_retrieval_tabular \
  --adapter-option retrieval_table=retrieval.csv \
  --adapter-option query_text_column=query_code \
  --adapter-option document_text_column=candidate_code \
  --adapter-option relevance_column=relevance \
  --k 10 \
  --scores-out scored_retrieval.csv \
  --metrics-out retrieval_metrics.json

Resampling and Uncertainty

Use resampling when you want uncertainty summaries instead of a single point estimate. Matheel provides split generators for single splits, k-fold, repeated k-fold, and bootstrap:

from matheel.resampling import bootstrap_resamples, kfold_splits, single_split

single = single_split(100, train_size=0.7, validation_size=0.1, seed=7)
folds = kfold_splits(100, n_splits=5, seed=7)
bootstraps = bootstrap_resamples(100, n_rounds=100, seed=7)

For stratified k-fold splits, n_splits must not exceed the smallest label count. For grouped k-fold splits, n_splits must not exceed the number of unique groups. This keeps every fold valid and prevents empty test folds.
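
The snippet below illustrates those bounds with hypothetical labels and groups; the values are placeholders, not Matheel API:

from collections import Counter

# Hypothetical labels and submission groups, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
groups = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]

# Stratified k-fold: n_splits can be at most the smallest per-label count.
max_stratified_splits = min(Counter(labels).values())  # 5
# Grouped k-fold: n_splits can be at most the number of unique groups.
max_grouped_splits = len(set(groups))                  # 5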

For pair-classification results, splits are applied to scored pair rows:

from matheel.evaluation import evaluate_pair_resamples
from matheel.resampling import kfold_splits

splits = kfold_splits(len(scored_pairs), n_splits=5, seed=7)
fold_metrics, fold_summary = evaluate_pair_resamples(
    scored_pairs,
    splits,
    threshold=0.8,
)

For retrieval results, splits are applied to query ids, so every selected query keeps its candidate documents:

from matheel.evaluation import evaluate_retrieval_resamples
from matheel.resampling import kfold_splits

query_ids = sorted(scored_retrieval["query_id"].unique())
splits = kfold_splits(query_ids, n_splits=5, seed=7)
fold_metrics, fold_summary = evaluate_retrieval_resamples(
    scored_retrieval,
    splits,
    k=10,
)

fold_summary contains percentile intervals for each metric across the selected resamples. For paired comparisons between two configurations, use compare_metric_samples(...) on matching split-level metric values:

from matheel.resampling import compare_metric_samples

comparison = compare_metric_samples(
    baseline_fold_metrics["f1"],
    candidate_fold_metrics["f1"],
    metric_name="f1",
)

The comparison report includes mean difference, interval bounds, win/loss/tie counts, and a two-sided sign-test p-value.
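
For intuition about the sign-test component, here is an illustrative two-sided sign test over paired per-split metric values. It is a sketch of the standard test, not necessarily Matheel's exact implementation:

from math import comb

def sign_test_p_value(baseline_values, candidate_values):
    # Paired sign test: count splits where the candidate wins or loses; ties drop out.
    diffs = [c - b for b, c in zip(baseline_values, candidate_values)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Two-sided binomial tail probability under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p_value([0.70, 0.72, 0.71], [0.74, 0.73, 0.75]))  # 0.25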

For a complete tiny workflow that writes a manifest, scored rows, metrics, resampling summaries, and reproducibility metadata, see the reproducible benchmark demo.