Datasets and Evaluation

Matheel supports local labeled pair and retrieval datasets, plus an extensible source/preset/adapter workflow for benchmark experiments.

No external datasets are bundled or downloaded by default. Users provide or download datasets according to each source's terms, license, provenance, and access requirements. Use task_type: plagiarism for plagiarism-oriented datasets so they remain easy to separate from future dataset families.

Dataset Loading Workflow

Dataset loading has three explicit steps:

  1. Resolve a source, such as a local directory, GitHub repository, Zenodo record, Hugging Face dataset repository, or Kaggle dataset.
  2. Adapt the raw layout into Matheel manifests with a registered adapter.
  3. Load one or more normalized pair or retrieval datasets.

For example, loading a local pair dataset and a local retrieval dataset from Python:

from matheel.datasets import load_pair_datasets, load_retrieval_datasets

pair_dataset = load_pair_datasets(
    [
        {
            "source": "local",
            "identifier": "./data/raw_pairs",
            "name": "custom_pairs",
            "adapter": "auto_pair_tabular",
            "adapter_options": {
                "pair_table": "pairs.csv",
                "left_text_column": "code_a",
                "right_text_column": "code_b",
                "label_column": "label",
            },
        }
    ]
)

retrieval_dataset = load_retrieval_datasets(
    [
        {
            "source": "local",
            "identifier": "./data/raw_retrieval",
            "name": "custom_retrieval",
            "adapter": "auto_retrieval_tabular",
            "adapter_options": {
                "retrieval_table": "qrels.csv",
                "query_text_column": "query_code",
                "document_text_column": "candidate_code",
                "relevance_column": "relevance",
            },
        }
    ]
)

Source resolvers write to user-provided destinations when supplied. For remote sources without an explicit destination, Matheel uses a temporary/cache location outside the repository.
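
For example, a remote spec can pin both the download destination and the adapted output location. This is a sketch only: it assumes programmatic specs accept the same destination and adapted_destination keys as manifest entries, and the repository name is a placeholder.

from matheel.datasets import load_pair_datasets

pair_dataset = load_pair_datasets(
    [
        {
            "source": "github",
            "identifier": "example-org/example-pair-dataset",  # placeholder owner/repo
            "destination": "./data/downloads/example_pair_dataset",
            "adapted_destination": "./data/normalized_example_pairs",
            "name": "example_pairs",
            "adapter": "auto_pair_tabular",
            "adapter_options": {
                "pair_table": "pairs.csv",
                "left_text_column": "code_a",
                "right_text_column": "code_b",
                "label_column": "label",
            },
        }
    ]
)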

Reproducible Loading Manifests

Use a dataset manifest when you want source, adapter, and output choices captured in a reviewable file. Manifest paths are resolved relative to the manifest file, so the workflow does not depend on the shell's current working directory. For example, a pair manifest such as datasets.pair.json:

{
  "version": 1,
  "task": "pair",
  "datasets": [
    {
      "name": "custom_pairs",
      "source": "local",
      "identifier": "./data/raw_pairs",
      "adapter": "auto_pair_tabular",
      "adapted_destination": "./data/normalized_pairs",
      "adapter_options": {
        "pair_table": "pairs.csv",
        "left_text_column": "code_a",
        "right_text_column": "code_b",
        "label_column": "label"
      }
    }
  ]
}

Adapter-backed manifest specs must set adapted_destination; remote source specs must set destination. Keep credentials out of manifests. Authenticate with external tools or environment-specific configuration instead.

Load a manifest from Python:

from matheel.datasets import load_pair_datasets_from_manifest

dataset = load_pair_datasets_from_manifest("datasets.pair.json")

Or evaluate directly from the CLI:

matheel evaluate-pairs --manifest datasets.pair.json \
  --scores-out scored_pairs.csv \
  --metrics-out pair_metrics.json

Retrieval manifests use task: "retrieval" and the retrieval adapter options:

{
  "version": 1,
  "task": "retrieval",
  "datasets": [
    {
      "name": "custom_retrieval",
      "source": "local",
      "identifier": "./data/raw_retrieval",
      "adapter": "auto_retrieval_tabular",
      "adapted_destination": "./data/normalized_retrieval",
      "adapter_options": {
        "retrieval_table": "retrieval.csv",
        "query_text_column": "query_code",
        "document_text_column": "candidate_code",
        "relevance_column": "relevance"
      }
    }
  ]
}

JSON manifests are supported by default. YAML manifests are supported when PyYAML is installed.

Sources, Presets, and Adapters

Use the registry APIs to inspect and extend dataset loading:

from matheel.datasets import (
    available_dataset_adapters,
    available_dataset_presets,
    available_dataset_presets_by_task,
    available_dataset_sources,
    register_dataset_adapter,
    register_dataset_preset,
    register_dataset_source,
)

print(available_dataset_sources())
print(available_dataset_adapters())
print(available_dataset_presets())
print(available_dataset_presets_by_task("pair"))
print(available_dataset_presets_by_task("retrieval"))

Built-in generic source resolvers:

Source | Purpose | Notes
--- | --- | ---
local | Use a local directory containing raw or normalized data. | Offline and deterministic.
github | Download a repository archive by owner/repo. | Public repositories need no token.
zenodo | Download files for a Zenodo record id. | Archives are extracted with path traversal checks.
huggingface | Resolve a Hugging Face dataset repository. | Requires optional Hugging Face Hub support.
kaggle | Download a Kaggle dataset. | Requires optional Kaggle API or CLI credentials.

Source resolvers only locate or download data; users are responsible for dataset access, credentials, licenses, and source terms.

Built-in adapters:

Adapter | Output kind | Purpose
--- | --- | ---
auto_pair_tabular | pair_classification | Convert CSV/TSV/JSON rows with left/right code text or file paths and a label.
auto_retrieval_tabular | retrieval | Convert CSV/TSV/JSON rows with query/document code text or file paths and relevance.
soco14_retrieval | retrieval | Convert SOCO14 qrel/source layouts.
irplag_pair | pair_classification | Convert IRPlag pair layouts.
irplag_retrieval | retrieval | Convert IRPlag pair layouts into retrieval manifests.
conplag_pair | pair_classification | Convert ConPlag pair layouts.
ipca_pair | pair_classification | Convert IPCA .txt submissions using filename P/NP labels.
student_code_similarity_pair | pair_classification | Convert Student Code Similarity labeled pair tables.
criminal_minds_pair | pair_classification | Convert Criminal Minds original/plagiarized submission layouts.

Approved built-in plagiarism presets:

Preset | Task families | Source | Identifier
--- | --- | --- | ---
soco14 | retrieval | Zenodo | 7433031
irplag | pair, retrieval | GitHub | oscarkarnalim/sourcecodeplagiarismdataset
conplag | pair | Zenodo | 7332790
ipca | pair | GitHub | humsha/IPCA
student_code_similarity | pair | Kaggle | ehsankhani/student-code-similarity-and-plagiarism-labels
criminal_minds | pair | Zenodo | 19115559

Custom projects can register their own source resolvers, presets, and adapters:

from matheel.datasets import register_dataset_adapter, register_dataset_preset, register_dataset_source


def resolve_internal_dataset(identifier, destination=None, revision="main", token=None, split=None):
    return f"./data/{identifier}"


register_dataset_source("internal", resolve_internal_dataset, overwrite=True)
register_dataset_preset(
    "internal_pairs",
    {
        "source": "internal",
        "identifier": "raw_pairs",
        "adapter": "auto_pair_tabular",
        "task_families": ("pair",),
    },
    overwrite=True,
)
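
The registered source can then back an ordinary spec. A minimal sketch, assuming the internal dataset directory follows the same raw tabular layout as the earlier local example:

from matheel.datasets import load_pair_datasets

pair_dataset = load_pair_datasets(
    [
        {
            "source": "internal",
            "identifier": "raw_pairs",
            "name": "internal_pairs",
            "adapter": "auto_pair_tabular",
            "adapter_options": {
                "pair_table": "pairs.csv",
                "left_text_column": "code_a",
                "right_text_column": "code_b",
                "label_column": "label",
            },
        }
    ]
)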

For package-level dataset support, use the contributing dataset support guide. It covers sources, adapters, and presets, fixture expectations, and reproducibility requirements.

Dataset CLI Utilities

Use matheel datasets list to inspect registered sources, adapters, and presets:

matheel datasets list
matheel datasets list --task retrieval --format json

Use matheel datasets validate to check a normalized dataset and print stable counts:

matheel datasets validate tiny_pairs --format json
matheel datasets validate tiny_retrieval --kind retrieval

Add --output-dir when you want a structured validation report with JSON, CSV, and HTML artifacts:

matheel datasets validate tiny_pairs \
  --kind pair \
  --output-dir dataset_validation \
  --format json

The validation report separates blocking errors from warnings and checks manifests, duplicate ids, missing references, empty files, label coverage, qrel coverage, and basic metadata completeness.

Python usage:

from matheel.dataset_validation import write_dataset_validation_report

report, artifacts = write_dataset_validation_report(
    "tiny_pairs",
    "dataset_validation",
    kind="pair",
)

print(report["status"], report["error_count"], report["warning_count"])
print(artifacts["report_html"])

Use matheel datasets adapt to convert a raw local tabular dataset into normalized Matheel manifests. Always provide an explicit output directory so repeated runs write to the same location:

matheel datasets adapt ./data/raw_pairs \
  --kind pair \
  --output ./data/normalized_pairs \
  --dataset-name custom_pairs \
  --adapter-option pair_table=pairs.csv \
  --adapter-option left_text_column=code_a \
  --adapter-option right_text_column=code_b \
  --adapter-option label_column=label \
  --format json

For retrieval tables:

matheel datasets adapt ./data/raw_retrieval \
  --kind retrieval \
  --output ./data/normalized_retrieval \
  --adapter-option retrieval_table=retrieval.csv \
  --adapter-option query_text_column=query_code \
  --adapter-option document_text_column=candidate_code \
  --adapter-option relevance_column=relevance \
  --format json

Dataset Registry

Dataset tracking uses these fields:

Field | Meaning
--- | ---
name | Stable dataset identifier.
task_type | High-level task label. Use plagiarism for plagiarism-oriented datasets.
dataset_kind | Dataset layout, such as pair_classification or retrieval.
languages | Languages covered by the dataset.
license | Dataset license or unknown.
source_url | Public source page.
access | bundled, download, manual, or external.
citation | Citation text or DOI when available.
notes | Short caveats about labels, splits, or preprocessing.

This metadata registry is separate from source/preset loading and starts empty. Register datasets in code only after deciding they should be tracked by Matheel:

from matheel.datasets import register_dataset_entry

register_dataset_entry(
    "example_plagiarism_dataset",
    task_type="plagiarism",
    dataset_kind="pair_classification",
    languages=("python",),
    license="unknown",
    source_url="https://example.org/dataset",
    access="manual",
    notes="Example registry entry only.",
)

Pair Dataset Format

A pair-classification dataset directory contains:

metadata.json
files.csv
pairs.csv
files/

metadata.json should include:

{
  "task_type": "plagiarism",
  "dataset_kind": "pair_classification",
  "name": "tiny_plagiarism_fixture"
}

files.csv must include:

Column | Meaning
--- | ---
file_id | Stable file identifier with no path separators.
file_path | Relative path under the dataset directory.

pairs.csv must include:

Column | Meaning
--- | ---
left_id | File id for the first submission.
right_id | File id for the second submission.
label | Binary label, where 1 means positive/plagiarism match and 0 means negative.

You can write a small dataset programmatically:

import pandas as pd
from matheel.datasets import write_pair_dataset

write_pair_dataset(
    "tiny_pairs",
    files=pd.DataFrame(
        [
            {"file_id": "a", "text": "print(1)", "suffix": ".py"},
            {"file_id": "b", "text": "print(1)", "suffix": ".py"},
            {"file_id": "c", "text": "print(2)", "suffix": ".py"},
        ]
    ),
    pairs=pd.DataFrame(
        [
            {"left_id": "a", "right_id": "b", "label": 1},
            {"left_id": "a", "right_id": "c", "label": 0},
        ]
    ),
    metadata={"name": "tiny_plagiarism_fixture"},
)
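
To spot-check the normalized layout, you can read the written tables back with pandas and join each pair to its file paths. This is an illustrative sketch that uses only the documented files.csv and pairs.csv columns:

import pandas as pd

files = pd.read_csv("tiny_pairs/files.csv")   # file_id, file_path
pairs = pd.read_csv("tiny_pairs/pairs.csv")   # left_id, right_id, label

# Attach the left and right file paths to every labeled pair.
resolved = (
    pairs.merge(
        files.rename(columns={"file_id": "left_id", "file_path": "left_path"}),
        on="left_id",
    ).merge(
        files.rename(columns={"file_id": "right_id", "file_path": "right_path"}),
        on="right_id",
    )
)
print(resolved[["left_path", "right_path", "label"]])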

Pair Evaluation

Use the CLI to score local pair datasets and write both scored rows and metrics:

matheel evaluate-pairs tiny_pairs \
  --feature-weight levenshtein=1.0 \
  --threshold 0.8 \
  --scores-out scored_pairs.csv \
  --metrics-out pair_metrics.json

The command defaults to levenshtein=1.0 so small local evaluations are offline-friendly. Add semantic features explicitly when you want model-backed scoring.

You can also adapt a raw custom pair table directly from the CLI:

matheel evaluate-pairs ./data/raw_pairs \
  --adapter auto_pair_tabular \
  --adapter-option pair_table=pairs.csv \
  --adapter-option left_text_column=code_a \
  --adapter-option right_text_column=code_b \
  --adapter-option label_column=label \
  --scores-out scored_pairs.csv \
  --metrics-out pair_metrics.json

Use --preset NAME for registered presets, or combine --source, --identifier, --destination, --revision, --split, and --path-in-archive when a resolver needs an explicit source spec. Matheel does not require or store credentials in these commands.

Retrieval Dataset Format

A retrieval dataset directory contains:

metadata.json
files.csv
queries.csv
corpus.csv
qrels.csv
files/

metadata.json should include:

{
  "task_type": "plagiarism",
  "dataset_kind": "retrieval",
  "name": "tiny_plagiarism_retrieval_fixture"
}

files.csv uses the same columns as pair datasets. queries.csv maps query ids to files:

Column | Meaning
--- | ---
query_id | Stable query identifier with no path separators.
file_id | File id from files.csv.

corpus.csv maps candidate document ids to files:

Column | Meaning
--- | ---
document_id | Stable document identifier with no path separators.
file_id | File id from files.csv.

qrels.csv stores relevance judgments:

Column | Meaning
--- | ---
query_id | Query id from queries.csv.
document_id | Candidate document id from corpus.csv.
relevance | Non-negative relevance score. Values greater than 0 are treated as relevant.

You can write a small retrieval dataset programmatically:

import pandas as pd
from matheel.datasets import write_retrieval_dataset

write_retrieval_dataset(
    "tiny_retrieval",
    files=pd.DataFrame(
        [
            {"file_id": "query_a", "text": "print(1)", "suffix": ".py"},
            {"file_id": "doc_a", "text": "print(1)", "suffix": ".py"},
            {"file_id": "doc_b", "text": "print(2)", "suffix": ".py"},
        ]
    ),
    queries=pd.DataFrame([{"query_id": "q1", "file_id": "query_a"}]),
    corpus=pd.DataFrame(
        [
            {"document_id": "d1", "file_id": "doc_a"},
            {"document_id": "d2", "file_id": "doc_b"},
        ]
    ),
    qrels=pd.DataFrame([{"query_id": "q1", "document_id": "d1", "relevance": 1}]),
    metadata={"name": "tiny_plagiarism_retrieval_fixture"},
)

Retrieval Evaluation

Use the CLI to score each query against every corpus document and write ranking metrics:

matheel evaluate-retrieval tiny_retrieval \
  --feature-weight levenshtein=1.0 \
  --k 10 \
  --scores-out scored_retrieval.csv \
  --metrics-out retrieval_metrics.json

The metrics include mean average precision, mean reciprocal rank, precision at k, recall at k, and nDCG at k.
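
For intuition about the cutoff metrics, here is an illustrative sketch of precision and recall at k for a single query. It is not Matheel's implementation; it assumes a ranked list of document ids plus the set of documents with relevance greater than 0 in qrels.csv:

def precision_recall_at_k(ranked_ids, relevant_ids, k=10):
    # Count relevant documents among the top-k ranked candidates.
    top_k = ranked_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    precision = hits / k
    recall = hits / len(relevant_ids) if relevant_ids else 0.0
    return precision, recall

print(precision_recall_at_k(["d1", "d2", "d3"], {"d1"}, k=2))  # (0.5, 1.0)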

Raw custom retrieval tables can be adapted the same way:

matheel evaluate-retrieval ./data/raw_retrieval \
  --adapter auto_retrieval_tabular \
  --adapter-option retrieval_table=retrieval.csv \
  --adapter-option query_text_column=query_code \
  --adapter-option document_text_column=candidate_code \
  --adapter-option relevance_column=relevance \
  --k 10 \
  --scores-out scored_retrieval.csv \
  --metrics-out retrieval_metrics.json

Resampling and Uncertainty

Use resampling when you want uncertainty summaries instead of a single point estimate. Matheel provides split generators for single splits, k-fold, repeated k-fold, and bootstrap:

from matheel.resampling import bootstrap_resamples, kfold_splits, single_split

single = single_split(100, train_size=0.7, validation_size=0.1, seed=7)
folds = kfold_splits(100, n_splits=5, seed=7)
bootstraps = bootstrap_resamples(100, n_rounds=100, seed=7)

For stratified k-fold splits, n_splits must not exceed the smallest label count. For grouped k-fold splits, n_splits must not exceed the number of unique groups. This keeps every fold valid and prevents empty test folds.
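
The snippet below illustrates those bounds with hypothetical labels and groups; the values are placeholders, not Matheel API:

from collections import Counter

# Hypothetical labels and submission groups, for illustration only.
labels = [1, 1, 1, 0, 0, 0, 0, 1, 0, 1]
groups = ["s1", "s1", "s2", "s2", "s3", "s3", "s4", "s4", "s5", "s5"]

# Stratified k-fold: n_splits can be at most the smallest per-label count.
max_stratified_splits = min(Counter(labels).values())  # 5
# Grouped k-fold: n_splits can be at most the number of unique groups.
max_grouped_splits = len(set(groups))                  # 5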

For pair-classification results, splits are applied to scored pair rows:

from matheel.evaluation import evaluate_pair_resamples
from matheel.resampling import kfold_splits

splits = kfold_splits(len(scored_pairs), n_splits=5, seed=7)
fold_metrics, fold_summary = evaluate_pair_resamples(
    scored_pairs,
    splits,
    threshold=0.8,
)

For retrieval results, splits are applied to query ids, so every selected query keeps its candidate documents:

from matheel.evaluation import evaluate_retrieval_resamples
from matheel.resampling import kfold_splits

query_ids = sorted(scored_retrieval["query_id"].unique())
splits = kfold_splits(query_ids, n_splits=5, seed=7)
fold_metrics, fold_summary = evaluate_retrieval_resamples(
    scored_retrieval,
    splits,
    k=10,
)

fold_summary contains percentile intervals for each metric across the selected resamples. For paired comparisons between two configurations, use compare_metric_samples(...) on matching split-level metric values:

from matheel.resampling import compare_metric_samples

comparison = compare_metric_samples(
    baseline_fold_metrics["f1"],
    candidate_fold_metrics["f1"],
    metric_name="f1",
)

The comparison report includes mean difference, interval bounds, win/loss/tie counts, and a two-sided sign-test p-value.
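
For intuition about the sign-test component, here is an illustrative two-sided sign test over paired per-split metric values. It is a sketch of the standard test, not necessarily Matheel's exact implementation:

from math import comb

def sign_test_p_value(baseline_values, candidate_values):
    # Paired sign test: count splits where the candidate wins or loses; ties drop out.
    diffs = [c - b for b, c in zip(baseline_values, candidate_values)]
    wins = sum(d > 0 for d in diffs)
    losses = sum(d < 0 for d in diffs)
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Two-sided binomial tail probability under p = 0.5.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(sign_test_p_value([0.70, 0.72, 0.71], [0.74, 0.73, 0.75]))  # 0.25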

For a complete tiny workflow that writes a manifest, scored rows, metrics, resampling summaries, and reproducibility metadata, see the reproducible benchmark demo.