TokenizerChanger package

Submodules

TokenizerChanger.tokenizer_changer module

class TokenizerChanger.tokenizer_changer.TokenizerChanger(tokenizer: TokenizersBackend | None = None, space_sign: str = 'Ġ')

Bases: object

add_merges(merges: Sequence[str | Sequence[str]])

Add merge rules to the tokenizer.

Parameters

merges: Iterable of merges. Each merge may be either a whitespace-separated string (e.g. "a b") or a sequence of pieces (typically length 2).

Notes

If required tokens are missing, this may call add_token_suggestion(), which can prompt the user.

add_token_suggestion(merge: str | Sequence[str])

Prompt the user to add tokens required by a merge.

Parameters

merge: The merge that cannot be added because required pieces are missing.

Returns

bool: True if tokens were added (or permission already granted), otherwise False.

Warnings

This method is interactive: it calls input() and prints to stdout. Avoid using it in non-interactive environments.

add_tokens(tokens: list[str])

Add new tokens to the tokenizer vocabulary.

Parameters

tokens: Token strings to add. Any literal spaces will be replaced by self.space_sign.

Notes

This method mutates the underlying JSON vocab (self.state['model']['vocab']). It does not automatically update merges; call updated_tokenizer() to rebuild a usable PreTrainedTokenizerFast.
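
As a rough illustration of the space replacement described above, the following standalone sketch works on a plain dict instead of self.state. The helper name add_tokens_sketch is hypothetical, and appending after the largest existing id is a simplifying assumption; the library is documented as consulting find_token_id_gap() when choosing ids.

```python
def add_tokens_sketch(vocab: dict[str, int], tokens: list[str],
                      space_sign: str = "Ġ") -> dict[str, int]:
    """Return a copy of vocab with new tokens appended after the largest id."""
    updated = dict(vocab)
    next_id = max(updated.values(), default=-1) + 1
    for token in tokens:
        # Literal spaces are rewritten to the space marker before insertion.
        normalized = token.replace(" ", space_sign)
        if normalized in updated:
            continue  # already present: keep its existing id
        updated[normalized] = next_id
        next_id += 1
    return updated


vocab = {"a": 0, "b": 1, "ab": 2}
new_vocab = add_tokens_sketch(vocab, [" hello", "ab", "new token"])
```

The input dict is left untouched here, which differs from the in-place mutation the real method performs.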

property baseline_tokenizer: TokenizersBackend | None: Alias for original_tokenizer.

delete_inappropriate_merges(vocab: list[str] = [], n_jobs: int = 1)

Delete merges that reference tokens not present in a provided vocab. If vocab is empty, self.state["model"]["vocab"] is used.

Parameters

vocab: A list of allowed tokens. Any merge referencing tokens outside this list will be removed.
n_jobs: Number of worker processes to use. 1 uses a single process.

Notes

Multiprocessing can be unreliable on some platforms/environments (especially when pickling bound methods). If it fails, the implementation falls back to single-process deletion.
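
The filtering rule can be sketched on plain Python data as follows. This is a single-process illustration only; keep_merges_in_vocab is a hypothetical helper, and checking only the two pieces of each merge (not the merged result) is an assumption about what "referencing" means.

```python
def keep_merges_in_vocab(merges: list[tuple[str, ...]],
                         vocab: list[str]) -> list[tuple[str, ...]]:
    """Keep only merges whose pieces all appear in the allowed vocab."""
    allowed = set(vocab)  # set lookup keeps the filter linear in len(merges)
    return [merge for merge in merges
            if all(piece in allowed for piece in merge)]


merges = [("a", "b"), ("ab", "c"), ("x", "y")]
kept = keep_merges_in_vocab(merges, ["a", "b", "ab", "c"])
```

Here ("x", "y") is dropped because neither piece is in the allowed list.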

delete_k_tail_tokens(k: int, exclude: list[str] = [], delete_merges: bool = True, consider_excluded_tokens: bool = False, n_jobs=1)

Delete k tokens selected by find_tail_tokens().

Parameters

k: Number of tokens to delete.
exclude: Tokens to skip during selection.
delete_merges: Whether to also delete merges that reference deleted tokens.
consider_excluded_tokens: Whether to subtract the excluded count from k.
n_jobs: Worker process count used by delete_merges().

Notes

This uses find_tail_tokens(), which selects by vocab ordering. The selection reflects true frequency only if your tokenizer’s vocab is frequency-sorted.

delete_merges(unwanted_tokens: list[str] | None = None, unwanted_merges_set: set = {}, n_jobs=1)

Delete merge rules that contain unwanted tokens.

Parameters

unwanted_tokens: If provided, merges are filtered using these tokens. Otherwise uses the current self.unwanted_tokens.
n_jobs: Number of worker processes to use. 1 uses a single process.

Notes

Multiprocessing can be unreliable on some platforms/environments (especially when pickling bound methods). If it fails, the implementation falls back to single-process deletion.

delete_overlaps(vocab: dict, delete_merges: bool = True)

Delete tokens that overlap with another vocabulary.

Parameters

vocab: A vocabulary mapping (e.g., token->id). Any token present in both vocabularies will be deleted from this tokenizer.
delete_merges: Whether to also remove merges referencing the deleted tokens.

delete_tokens(unwanted_tokens: list[str] = [], include_substrings: bool = True, delete_merges: bool = True, n_jobs: int = 1) → None

Delete tokens from the vocabulary and optionally delete affected merges.

Parameters

unwanted_tokens: Tokens to delete. If empty, deletes the current self.unwanted_tokens.
include_substrings: If True, expands the deletion list by searching for tokens present in the current vocabulary. (Note: this does not search arbitrary substrings; it filters to existing vocab entries.)
delete_merges: If True, also remove merges that reference deleted tokens.
n_jobs: Worker process count used by delete_merges().

Raises

KeyError: If a requested token is not in the vocabulary.

find_tail_tokens(k_least: int, exclude: list[str] = [], consider_excluded_tokens: bool = False)

Select k_least tokens from the tail of the vocabulary ordering.

Parameters

k_least: Number of tokens to select.
exclude: Tokens to ignore.
consider_excluded_tokens: If True, k_least is decremented for each excluded token encountered during selection.

Notes

This method does not compute true token frequencies. It walks the vocabulary in reverse insertion/id order and picks the first k_least tokens. The result is only “least frequent” if your tokenizer’s vocab is ordered by frequency.

Side Effects

Updates self.unwanted_tokens.

find_token_id_gap()

Find the last gap in the sequence of token ids.

Returns

int: The id at the start of the last detected gap. add_tokens() uses this to pick ids that do not collide with existing ones.

find_tokens(unwanted_tokens: list[str])

Add existing tokens from unwanted_tokens to the internal deletion list.

Parameters

unwanted_tokens: Candidate tokens to look up in the tokenizer vocabulary.

Side Effects

Appends found tokens to self.unwanted_tokens.

format_merges()

Normalize merge entries to tuples of strings.

Some tokenizer JSON dumps store merges as whitespace-separated strings. This converts each merge into a tuple (typically length 2).

get_overlapping_merges(merges: list)

Return merges that overlap with another merge list.

Parameters

merges: A list of merges from another tokenizer.

Returns

list: Merges from this tokenizer that appear to overlap with the provided list.

get_overlapping_tokens(vocab: dict)

Return tokens that exist in both vocabularies.

Parameters

vocab: A mapping representing another vocabulary (e.g., token->id).

Returns

list[str]: Tokens that appear in both vocab and this tokenizer’s vocab.
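
Conceptually this is an intersection of the two key sets. The sketch below uses plain dicts and a hypothetical helper name; preserving the order of this tokenizer's vocab is an assumption made for readability.

```python
def overlapping_tokens(own_vocab: dict[str, int],
                       other_vocab: dict[str, int]) -> list[str]:
    """Return tokens present in both mappings, in own_vocab order."""
    return [token for token in own_vocab if token in other_vocab]


own = {"a": 0, "b": 1, "Ġthe": 2}
other = {"Ġthe": 10, "b": 11, "z": 12}
shared = overlapping_tokens(own, other)
```

Only the token strings are compared; the ids on either side play no role in deciding overlap.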

load_tokenizer(tokenizer: TokenizersBackend)

Load a tokenizer and initialize internal JSON state.

Parameters

tokenizer: The tokenizer to load (must be a fast tokenizer).

property none_keys: list[str]: Alias for none_types.

replace_tokens(donor_TC: TokenizerChanger, k: int, ignore_overlaps: bool = False, add_merges: bool = True, n_jobs: int = 1, replaced_idx_file: str | None = 'replaced_idx.txt')

Replace k tokens from this tokenizer with tokens from a donor tokenizer.

Parameters

donor_TC: The TokenizerChanger instance to source replacement tokens from.
k: Number of tokens to replace.
ignore_overlaps: Whether to skip overlapping tokens during replacement.
add_merges: Whether to also add merges referencing replaced tokens.
n_jobs: Worker process count used by delete_merges().
replaced_idx_file: Optional file path to save the indices of replaced tokens.

Notes

This method selects tokens to delete using find_tail_tokens() and then adds new tokens from the donor tokenizer. If the donor has fewer than k tokens, it will add as many as it can.
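
The delete-then-add flow can be pictured with plain dicts. Everything in this sketch is an assumption for illustration: the helper name, taking donor tokens in ascending id order, always skipping donor tokens already present, and reusing the freed ids. Merges and the replaced_idx_file output are not modeled.

```python
def replace_tail_tokens(vocab: dict[str, int], donor_vocab: dict[str, int],
                        k: int) -> tuple[dict[str, int], list[int]]:
    """Drop the k highest-id tokens and reuse their ids for donor tokens."""
    by_id_desc = sorted(vocab.items(), key=lambda item: item[1], reverse=True)
    removed = dict(by_id_desc[:k])
    updated = {t: i for t, i in vocab.items() if t not in removed}
    freed_ids = sorted(removed.values())
    # Candidate donor tokens, lowest donor id first, skipping existing tokens.
    donors = [t for t, _ in sorted(donor_vocab.items(), key=lambda item: item[1])
              if t not in updated]
    replaced_ids = []
    # zip stops at the shorter list, so a small donor fills only some ids.
    for new_id, token in zip(freed_ids, donors):
        updated[token] = new_id
        replaced_ids.append(new_id)
    return updated, replaced_ids


vocab = {"a": 0, "b": 1, "c": 2, "d": 3}
donor = {"x": 0, "a": 1}
new_vocab, replaced_ids = replace_tail_tokens(vocab, donor, k=2)
```

In this example the donor offers only one usable token, so one of the two freed ids stays unused, mirroring the "add as many as it can" behavior noted above.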

save_tokenizer(path: str = 'updated_tokenizer')

Persist the updated tokenizer to disk.

Parameters

path: Output directory passed to tokenizer.save_pretrained.

Notes

This calls updated_tokenizer() first to rebuild the tokenizer instance from the current JSON state.

set_space_marker(space_marker: str) → None: Alias for set_space_sign().

set_space_sign(space_sign: str)

Set the marker used to represent spaces when adding tokens.

Parameters

space_sign: The character (or string) used to replace regular spaces in token strings.

property space_marker: str: Alias for space_sign.

property tokens_to_delete: list[str]: Alias for unwanted_tokens.

updated_tokenizer()

Rebuild and return a tokenizer from the current internal JSON state.

Returns

transformers.PreTrainedTokenizerFast: A new tokenizer instance that reflects the current self.state.

Side Effects

Replaces self.tokenizer and refreshes self.state from the rebuilt tokenizer.

Module contents

TokenizerChanger: utilities for editing Hugging Face fast tokenizers.

The public API is exposed via TokenizerChanger.TokenizerChanger.
