Usage Guide
This page is the quick-start entry point for installing Matheel and running the main CLI and Python workflows. Detailed parameter guides live in the topic pages linked below.
Installation
Matheel supports Python 3.10 to 3.13.
Base install:
pip install matheel
The base package includes the CLI, preprocessing, lexical similarity, and the core comparison workflow. Install optional extras when you need larger semantic backends, chunkers, metric runtimes, or the Gradio app:
pip install "matheel[semantic]"
pip install "matheel[chunking]"
pip install "matheel[metrics]"
pip install "matheel[visualization]"
pip install "matheel[gradio]"
pip install "matheel[all]"
| Extra | Use it for |
|---|---|
| `matheel[semantic]` | Sentence Transformers, Model2Vec, and PyLate semantic scoring backends. |
| `matheel[chunking]` | Chonkie chunkers for splitting code before embedding. |
| `matheel[metrics]` | Optional code metric runtimes such as TSED and CodeBERTScore. |
| `matheel[visualization]` | UMAP projection for dataset visualization. |
| `matheel[gradio]` | Dependencies for running the Gradio web app. |
| `matheel[all]` | All supported optional backends in one install. |
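Extras can also be combined in a single install when more than one optional backend is needed, using standard pip extras syntax, for example:
pip install "matheel[semantic,metrics]"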
Compatibility extras remain available for narrower installs: `sentence_transformers`, `model2vec`, `pylate`, and `chunking_code`.
Examples that use semantic weights assume `matheel[semantic]` or `matheel[all]` is installed. Optional installs can take some time because they may include model and ML runtime dependencies.
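To confirm that a semantic backend is actually importable before running the semantic examples below, a plain Python check is enough. This is a sketch, not a Matheel API, and it assumes the semantic extra provides the `sentence_transformers` package listed in the table above:
import importlib.util
# Assumption: sentence_transformers is installed by matheel[semantic] / matheel[all].
if importlib.util.find_spec("sentence_transformers") is None:
    print('Semantic backend missing; run: pip install "matheel[semantic]"')
else:
    print("Semantic backend available.")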
Quick Checks
Generate the tiny Java sample archive from source strings:
python examples/sample_data.py --output sample_pairs.zip --overwrite

Base CLI check:
matheel compare sample_pairs.zip \
--feature-weight levenshtein=1.0 \
--num 10
Base Python check:
from matheel.similarity import calculate_similarity
score = calculate_similarity(
    "def add(a, b):\n    return a + b\n",
    "def add(x, y):\n    return x + y\n",
    feature_weights={"levenshtein": 1.0},
)
print(round(score, 4))
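The same call works for code read from disk. This is a minimal sketch that only assumes `calculate_similarity` accepts any pair of source strings, as above; the file names are placeholders:
from pathlib import Path
from matheel.similarity import calculate_similarity

# Placeholder paths: point these at any two source files you want to compare.
code_a = Path("Submission1.java").read_text(encoding="utf-8")
code_b = Path("Submission2.java").read_text(encoding="utf-8")
score = calculate_similarity(code_a, code_b, feature_weights={"levenshtein": 1.0})
print(round(score, 4))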
Semantic CLI check:
matheel compare sample_pairs.zip \
--model huggingface/CodeBERTa-small-v1 \
--feature-weight semantic=0.7 \
--feature-weight levenshtein=0.3
Semantic Python check:
from matheel.similarity import get_sim_list
results = get_sim_list(
    "sample_pairs.zip",
    model_name="huggingface/CodeBERTa-small-v1",
    feature_weights={"semantic": 0.7, "levenshtein": 0.3},
)
print(results.head())
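Since `get_sim_list` returns a pandas DataFrame (as the `.head()` call above implies), the results can be persisted with standard pandas I/O; the output file name here is arbitrary:
# Save the full comparison results for later inspection or reporting.
results.to_csv("similarity_results.csv", index=False)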
Common Workflows
- Use a ZIP archive or a directory path with `matheel compare`.
- Use `feature_weights` to combine semantic, lexical, and code-aware scores.
- Use `--normalize-semantic-scores` when blending `dot`, `euclidean`, or `manhattan` semantic scores with other 0-1 metrics.
- Add `--preprocess-mode` when code should be normalized before scoring.
- Add `--chunking-method` when large files should be split before embedding.
- Use `matheel compare-suite` with a JSON config for repeatable multi-run comparisons.
- Use `--algorithm-path` when you need a custom `score_pair` implementation (a minimal sketch follows this list).
- Run the reproducible benchmark demo when you need a small auditable evaluation workflow.
- Run the Gradio app or notebooks when you want an interactive workflow, including normalized dataset evaluation, visualization artifacts, ready leaderboards, and leaderboard inspection.
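As an illustration of the `--algorithm-path` workflow, the sketch below shows what a custom algorithm module might look like. The hook name `score_pair` comes from the list above, but its exact signature and any registration details are assumptions; see the Custom algorithms page for the authoritative contract.
# my_algorithm.py -- hypothetical module to pass via --algorithm-path.
# Assumption: Matheel calls score_pair(code_a, code_b) and expects a float in [0, 1].
def score_pair(code_a: str, code_b: str) -> float:
    """Toy baseline: Jaccard overlap of whitespace-separated tokens."""
    tokens_a, tokens_b = set(code_a.split()), set(code_b.split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)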
Demos and Examples
- Hugging Face Space demo: buelfhood/matheel-framework
- Core workflows Colab notebook: Open in Colab
- Dataset workflows Colab notebook: Open in Colab
- Custom algorithms Colab notebook: Open in Colab
- Gradio Colab notebook: Open in Colab
- Visualization and leaderboard Colab notebook: Open in Colab
- Examples folder: github.com/FahadEbrahim/matheel/tree/main/examples
Documentation Map
- Preprocessing
- Tokenization and preprocessing limits
- Chunking
- Vectors and routing
- Lexical metrics and baselines
- Code metrics
- Scoring and calibration
- Visualization
- Leaderboard
- Reproducible benchmark demo
- Custom algorithms
- Comparison suite
- Contributing algorithms
- Contributing datasets
- Development
Interface Notes
- CLI and Python API accept either a directory or a ZIP archive.
- Gradio remains ZIP-first for dataset uploads, supports ready leaderboards from normalized dataset ZIPs, and supports JSON or ZIP leaderboard artifact inspection.
- `feature_weights` is the canonical scoring input.
- `vector_backend=auto` uses Hugging Face metadata and tag heuristics when available.
- CLI progress bars write to stderr and default to interactive terminals only. Use `--progress` or `--no-progress` to override.
- Python APIs accept `progress=True` for tqdm bars and `progress_callback=...` for structured progress events.
- Collection results include run metadata in `DataFrame.attrs`, including `elapsed_seconds`, `feature_set`, `vector_backend`, `code_metric`, `chunking_method`, and custom algorithm metadata when applicable (see the sketch after this list).
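To make the progress and metadata notes concrete, here is a sketch that reuses the semantic Python check from above. The `attrs` keys come from the list above, while the exact structure of each progress event is not specified here, so it is simply printed:
from matheel.similarity import get_sim_list

results = get_sim_list(
    "sample_pairs.zip",
    model_name="huggingface/CodeBERTa-small-v1",
    feature_weights={"semantic": 0.7, "levenshtein": 0.3},
    progress_callback=print,  # assumption: progress events are printable objects
)
# Run metadata documented above is attached to the DataFrame.
print(results.attrs.get("elapsed_seconds"))
print(results.attrs.get("feature_set"))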