Tokenization and Preprocessing Limits

Matheel combines text-first preprocessing, token-level baselines, embedding backends, and optional parser-backed code metrics. These layers are related, but they do not all use the same token stream.

Pipeline Order

For calculate_similarity(...) and get_sim_list(...), the main order is:

  1. Read source text.
  2. Apply preprocess_mode.
  3. Build embeddings when the semantic feature is active.
  4. Tokenize for lexical baselines and token-based code metrics.
  5. Parse code for parser-derived lexical tokens or parser-backed code metrics when those modes are active and their runtimes are available.
  6. Blend active feature scores with normalized feature weights.

Preprocessing happens before lexical, semantic, and code-aware scoring. If you enable advanced, identifiers and literals may already be canonicalized before token-based metrics run.
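The ordering above can be sketched in miniature. The helper names below are illustrative stand-ins, not Matheel's internals, and the lexical score is a simple Jaccard baseline used only to show where each step happens:

```python
import re

def apply_preprocess(text, mode):
    # Step 2 sketch: normalize line endings, trim trailing whitespace,
    # and (for normalize/basic/advanced) drop blank lines.
    lines = [ln.rstrip() for ln in text.replace("\r\n", "\n").split("\n")]
    if mode in ("normalize", "basic", "advanced"):
        lines = [ln for ln in lines if ln]
    return "\n".join(lines)

def token_similarity(a, b):
    # Step 4 sketch: tokenize with a code-like regex, then score with
    # Jaccard overlap as a stand-in lexical baseline.
    tok = lambda s: set(re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", s))
    ta, tb = tok(a), tok(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def compare(src_a, src_b, preprocess_mode="normalize"):
    # Preprocessing runs before any scoring (step 2 before steps 3-5).
    a = apply_preprocess(src_a, preprocess_mode)
    b = apply_preprocess(src_b, preprocess_mode)
    # Step 6 degenerates here: one active feature gets full weight.
    return {"lexical": token_similarity(a, b)}
```

With only line-ending and blank-line differences, the two inputs normalize to the same text and score 1.0, which is the point of running preprocessing first.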

Preprocessing Modes

  • none: Trims trailing whitespace and the text's surrounding whitespace only.
  • normalize: Normalizes line endings, trims trailing whitespace, and drops blank lines.
  • basic: Removes comments with language-aware string handling, drops blank lines, and collapses whitespace.
  • advanced: Runs basic, then removes import-like lines, replaces string and numeric literals with placeholders, canonicalizes non-keyword identifiers, and collapses whitespace.

Preprocessing is intentionally text-first. It uses language-specific comment and import heuristics, but it is not an AST rewrite.
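As a concrete illustration of what advanced might do to Python-like input, here is a naive text-level sketch. The regexes, keyword set, and placeholder names are assumptions for the example; Matheel's actual implementation uses language-aware string handling that this sketch omits:

```python
import re

PY_KEYWORDS = {"def", "return", "if", "else", "for", "while", "import", "from", "in"}

def advanced_preprocess(code):
    # Text-first sketch of "advanced" mode; deliberately not an AST rewrite.
    out_lines = []
    mapping = {}  # consistent identifier canonicalization across the file
    for line in code.splitlines():
        line = re.sub(r"#.*", "", line)             # strip comments (naive)
        if re.match(r"\s*(import|from)\b", line):   # drop import-like lines
            continue
        line = re.sub(r"\".*?\"|'.*?'", "STR", line)    # string literals
        line = re.sub(r"\b\d+(\.\d+)?\b", "NUM", line)  # numeric literals

        def canon(m):
            name = m.group(0)
            if name in PY_KEYWORDS or name in ("STR", "NUM"):
                return name
            mapping.setdefault(name, f"ID{len(mapping)}")
            return mapping[name]

        line = re.sub(r"\b[A-Za-z_]\w*\b", canon, line)  # canonicalize identifiers
        line = " ".join(line.split())                    # collapse whitespace
        if line:
            out_lines.append(line)
    return "\n".join(out_lines)
```

Because identifiers and literals collapse to placeholders, two sources that differ only in naming or literal values preprocess to identical text, which is exactly the aggressiveness the Limitations section warns about.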

Tokenization By Feature

Each feature or metric uses the following tokenization or parsing strategy:

  • levenshtein: Compares the prepared strings directly.
  • jaro_winkler: Compares the prepared strings directly.
  • winnowing: Uses Matheel's code-token regex by default, or tree-sitter leaf node types with lexical_tokenizer="parser".
  • gst: Uses the same lexical tokenizer selection as winnowing.
  • crystalbleu: Uses the same code-token regex, then builds n-grams and discounts frequent n-grams.
  • ruby with ruby_mode=ngram: Uses the same code-token regex.
  • ruby with ruby_mode=string and ruby_tokenizer=regex: Uses the same code-token regex.
  • ruby with ruby_tokenizer=tranx: Splits punctuation and camel-case boundaries for a more granular token edit sequence.
  • CodeBLEU syntax/dataflow: Uses tree-sitter-backed syntax and DFG extraction for supported languages.
  • TSED: Parses syntax trees and compares tree edit distance when optional parser dependencies are installed.
  • CodeBERTScore: Uses the selected transformer tokenizer.
  • Semantic embeddings: Use the selected embedding backend's tokenizer or vectorizer.

The shared code-token regex recognizes identifiers, numbers, and punctuation. It does not split snake_case or camelCase by default. For example, totalCount stays a single token in regex mode, while the ruby metric's tranx tokenizer splits it into total and Count.
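A rough approximation of the two granularities, with illustrative regexes that are not Matheel's exact patterns:

```python
import re

# Approximates the described code-token behavior: identifiers, numbers,
# punctuation; snake_case and camelCase are NOT split.
CODE_TOKEN = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|[^\s\w]")

def regex_tokens(text):
    return CODE_TOKEN.findall(text)

def tranx_like_tokens(text):
    # Approximates the described tranx behavior: also split punctuation,
    # underscores, and camel-case boundaries for a finer edit sequence.
    parts = []
    for tok in regex_tokens(text):
        parts.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]*|[a-z]+|\d+|[^\s\w]|_", tok)
        )
    return parts
```

The finer stream makes token edit distances sensitive to partial identifier overlap (totalCount vs. itemCount share Count), which the coarse stream cannot see.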

Parser-Derived Lexical Tokens

Set lexical_tokenizer="parser" in Python or --lexical-tokenizer parser in the CLI to use parser-derived leaf node types for winnowing and gst. The default is raw, preserving the regex token stream used in earlier Matheel releases.
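Whichever stream is selected, winnowing then fingerprints it. The following is a textbook winnowing sketch over an arbitrary token list, not Matheel's implementation; the k and w defaults are illustrative:

```python
def winnow(tokens, k=4, w=4):
    # Hash every k-gram of the token stream, then keep the minimum hash
    # from each sliding window of w hashes as the fingerprint set.
    grams = [" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]
    hashes = [hash(g) & 0xFFFFFFFF for g in grams]
    if not hashes:
        return set()
    fingerprints = set()
    for i in range(max(len(hashes) - w + 1, 1)):
        fingerprints.add(min(hashes[i:i + w]))
    return fingerprints

def winnow_similarity(a_tokens, b_tokens, k=4, w=4):
    # Jaccard overlap of fingerprint sets as the similarity score.
    fa, fb = winnow(a_tokens, k, w), winnow(b_tokens, k, w)
    return len(fa & fb) / len(fa | fb) if fa | fb else 1.0
```

Feeding this parser-derived token types instead of raw tokens is what makes the score survive identifier renaming: renamed variables hash to the same k-grams when every identifier maps to the same leaf node type.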

Parser-derived lexical tokens make the lexical baselines less sensitive to identifier renaming. They still compare ordered token streams; they are not full AST or data-flow comparisons.
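Why token types blunt renaming can be shown with a minimal stand-in that maps tokens to coarse types; real parser leaf node types are richer than this sketch:

```python
import re

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|[^\s\w]")
KEYWORDS = {"def", "return"}  # tiny illustrative keyword set

def token_types(code):
    # Stand-in for parser leaf node types: non-keyword identifiers all
    # become "identifier", numbers become "number", punctuation stays.
    types = []
    for tok in TOKEN.findall(code):
        if tok in KEYWORDS:
            types.append(tok)
        elif re.fullmatch(r"[A-Za-z_]\w*", tok):
            types.append("identifier")
        elif tok.isdigit():
            types.append("number")
        else:
            types.append(tok)
    return types
```

Two functions that differ only in names produce identical type streams, so winnowing or gst over those streams scores them as identical; the raw token streams still differ. Reordered statements, however, still change the stream, which is why this remains an ordered-token comparison rather than an AST or data-flow comparison.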

Limitations

Keep the following in mind:

  • lexical token baselines see surface token order and punctuation
  • parser-derived lexical tokens and parser-backed metrics depend on parser runtime availability
  • unsupported languages fall back only where a metric explicitly supports fallback behavior
  • preprocessing language hints improve comment/import handling, but they do not guarantee full parsing
  • aggressive advanced preprocessing can hide identifier and literal differences that some workflows may care about

For parser-heavy comparisons, report the active preprocessing mode, lexical_tokenizer, tokenization-sensitive parameters, selected code metric, language, and parser availability alongside the score.
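One lightweight way to keep those fields attached to a score is a small record builder. This is a hypothetical helper for your own reporting code, not part of Matheel's API:

```python
def report(score, *, preprocess_mode, lexical_tokenizer, metric,
           language, parser_available, **params):
    # Bundle the score with every setting the section above says to report,
    # so results stay reproducible and comparable across runs.
    return {
        "score": round(score, 4),
        "preprocess_mode": preprocess_mode,
        "lexical_tokenizer": lexical_tokenizer,
        "metric": metric,
        "language": language,
        "parser_available": parser_available,
        "params": params,  # tokenization-sensitive parameters, e.g. k-gram sizes
    }
```

Example: report(0.8123456, preprocess_mode="advanced", lexical_tokenizer="parser", metric="gst", language="python", parser_available=True, min_match=3) yields a dict that records why two runs with the same metric may legitimately disagree.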