Preprocessing
Preprocessing modifies code text before lexical, semantic, or code-aware scoring runs.
Parameter
preprocess_modecode_languagein comparison APIs, orlanguagewhen callingpreprocess_code(...)directly
Supported Modes
noneKeeps the original code except for trailing-whitespace cleanup and final trimming.normalizeNormalizes newlines, trims trailing whitespace, and removes blank lines.basicRemoves block comments, removes line comments, drops blank lines, and collapses repeated whitespace.advancedAppliesbasic, strips import-like lines, normalizes string/number literals, canonicalizes identifiers, and collapses whitespace.
What It Changes
/* ... */block comments are removed for C-like languages.// ...line comments are removed for C-like languages.- Lua
--[[ ... ]]block comments and-- ...line comments are removed for Lua. # ...comments are removed for the generic path and non-Lua language hints, except common C/C++/C# preprocessor directives such as#include,#define,#region, and#nullable.- Comment markers inside quoted string literals are preserved.
- Import-like lines (
import,from ... import,package,#include,#import,using,use,require,include,library(...),source(...), and Node/Lua-stylerequire(...)assignments) are stripped inadvanced. - String and numeric literals are normalized to placeholders in
advanced. - Non-keyword identifiers are canonicalized (
id1,id2, ...) inadvanced. - Excess whitespace is collapsed in
basicandadvanced.
Language Coverage
Preprocessing is still text-first, but the current regression-tested language set is:
javapythonccppgojavascripttypescriptkotlinscalaswiftsoliditydartphprubyrustcsharpluajuliarobjc
The broader tree-sitter parser inventory makes future language additions realistic, but they should land with matching import/comment heuristics and tests before they are treated as officially supported preprocessing targets.
Order and Tokenization
Preprocessing runs before lexical baselines, semantic embeddings, and code metrics. advanced applies the same cleanup as basic, then strips import-like lines, normalizes literals, and canonicalizes identifiers.
See Tokenization and preprocessing limits for the token streams used by Winnowing, GST, CrystalBLEU, RUBY, parser-backed metrics, and embedding backends.
When To Use It
- Use
nonewhen formatting, comments, or whitespace are meaningful for your task. - Use
normalizewhen you want stable layout without removing comments. - Use
basicfor most code-similarity runs where comments and formatting should not dominate. - Use
advancedwhen you need stronger normalization across identifier renames and literal changes.
Python Example
from matheel.preprocessing import preprocess_code
cleaned = preprocess_code(
"""
def add(a, b): # helper
return a + b
""",
mode="basic",
language="python",
)
print(cleaned)
CLI Example
python examples/sample_data.py --output sample_pairs.zip --overwrite
matheel compare sample_pairs.zip \
--preprocess-mode basic \
--code-language java