TokenizerChanger package
Submodules
TokenizerChanger.tokenizer_changer module
- class TokenizerChanger.tokenizer_changer.TokenizerChanger(tokenizer: TokenizersBackend | None = None, space_sign: str = 'Ġ')[source]
Bases: object
- add_merges(merges: Sequence[str | Sequence[str]])[source]
Add merge rules to the tokenizer.
Parameters
- merges:
Iterable of merges. Each merge may be either a whitespace-separated string (e.g.
"a b") or a sequence of pieces (typically length 2).
Notes
If required tokens are missing, this may call
add_token_suggestion(), which can prompt the user.
- add_token_suggestion(merge: str | Sequence[str])[source]
Prompt the user to add tokens required by a merge.
Parameters
- merge:
The merge that cannot be added because required pieces are missing.
Returns
- bool
True if tokens were added (or permission already granted), otherwise False.
Warnings
This method is interactive: it calls input() and prints to stdout. Avoid using it in non-interactive environments.
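For non-interactive use, the check behind the prompt can be approximated without input(). The following is a minimal sketch, not the library's implementation: the helper names, the allow_new_tokens flag, and the assumption that a merge requires both pieces and their concatenation to be in the vocab are all hypothetical.

```python
def missing_tokens_for_merge(merge, vocab):
    """List the tokens a merge needs that are absent from vocab (hypothetical helper)."""
    pieces = merge.split() if isinstance(merge, str) else list(merge)
    # Assumption: a merge needs each piece and the merged result in the vocab.
    required = list(dict.fromkeys(pieces + ["".join(pieces)]))
    return [token for token in required if token not in vocab]


def add_merges_sketch(merges, existing_merges, vocab, allow_new_tokens=False):
    """Append merges without prompting; a flag stands in for the user's answer."""
    updated_merges = list(existing_merges)
    updated_vocab = dict(vocab)
    for merge in merges:
        pieces = tuple(merge.split()) if isinstance(merge, str) else tuple(merge)
        missing = missing_tokens_for_merge(pieces, updated_vocab)
        if missing and not allow_new_tokens:
            continue  # skip merges whose tokens are absent
        for token in missing:
            # Assumption: new tokens get the next id after the current maximum.
            updated_vocab[token] = max(updated_vocab.values(), default=-1) + 1
        if pieces not in updated_merges:
            updated_merges.append(pieces)
    return updated_merges, updated_vocab


vocab = {"a": 0, "b": 1, "ab": 2}
merges, new_vocab = add_merges_sketch(["a b", ("b", "c")], [], vocab)
print(merges)  # [('a', 'b')]
```

Passing allow_new_tokens=True in this sketch plays the role of answering "yes" to the prompt: the missing tokens are added first, then the merge.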
- add_tokens(tokens: list[str])[source]
Add new tokens to the tokenizer vocabulary.
Parameters
- tokens:
Token strings to add. Any literal spaces will be replaced by
self.space_sign.
Notes
This method mutates the underlying JSON vocab (self.state['model']['vocab']). It does not automatically update merges; call updated_tokenizer() to rebuild a usable PreTrainedTokenizerFast.
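The effect on the vocab mapping can be pictured with a plain dict. This is an illustrative sketch only: add_tokens_sketch is a hypothetical name, and assigning ids after the current maximum is an assumption (the documented method picks ids via find_token_id_gap()).

```python
def add_tokens_sketch(vocab, tokens, space_sign="Ġ"):
    """Return a copy of vocab with the given tokens added under fresh ids."""
    updated = dict(vocab)
    # Assumption: new ids continue after the largest existing id.
    next_id = max(updated.values(), default=-1) + 1
    for token in tokens:
        # Literal spaces are replaced by the space marker, as documented.
        token = token.replace(" ", space_sign)
        if token in updated:
            continue  # already present, keep its existing id
        updated[token] = next_id
        next_id += 1
    return updated


print(add_tokens_sketch({"a": 0, "b": 1}, [" hello", "b"]))
# {'a': 0, 'b': 1, 'Ġhello': 2}
```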
- property baseline_tokenizer: TokenizersBackend | None
Alias for
original_tokenizer.
- delete_inappropriate_merges(vocab: list[str] = [], n_jobs: int = 1)[source]
Delete merges that reference tokens not present in a provided vocab. If vocab is empty, self.state['model']['vocab'] is used.
Parameters
- vocab:
A list of allowed tokens. Any merge referencing tokens outside this list will be removed.
- n_jobs:
Number of worker processes to use. 1 uses a single process.
Notes
Multiprocessing can be unreliable on some platforms/environments (especially when pickling bound methods). If it fails, the implementation falls back to single-process deletion.
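The filtering rule can be sketched as an allow-list check over plain Python data. The function name is hypothetical, and treating the concatenated result of a merge as a referenced token is an assumption rather than documented behavior.

```python
def keep_merges_within_vocab(merges, allowed_tokens):
    """Keep only merges whose pieces (and merged result) are all allowed."""
    allowed = set(allowed_tokens)
    kept = []
    for merge in merges:
        pieces = merge.split() if isinstance(merge, str) else list(merge)
        merged = "".join(pieces)
        # Assumption: the merged result counts as a referenced token too.
        if all(piece in allowed for piece in pieces) and merged in allowed:
            kept.append(merge)
    return kept


merges = [("a", "b"), ("ab", "c"), ("x", "y")]
allowed = ["a", "b", "c", "ab", "abc"]
print(keep_merges_within_vocab(merges, allowed))  # [('a', 'b'), ('ab', 'c')]
```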
- delete_k_tail_tokens(k: int, exclude: list[str] = [], delete_merges: bool = True, consider_excluded_tokens: bool = False, n_jobs=1)[source]
Delete k tokens selected by find_tail_tokens().
Parameters
- k:
Number of tokens to delete.
- exclude:
Tokens to skip during selection.
- delete_merges:
Whether to also delete merges that reference deleted tokens.
- consider_excluded_tokens:
Whether to subtract the excluded count from k.
- n_jobs:
Worker process count used by
delete_merges().
Notes
This uses
find_tail_tokens(), which is based on vocab ordering rather than true frequency unless your tokenizer’s vocab is frequency-sorted.
- delete_merges(unwanted_tokens: list[str] | None = None, unwanted_merges_set: set = {}, n_jobs=1)[source]
Delete merge rules that contain unwanted tokens.
Parameters
- unwanted_tokens:
If provided, merges are filtered using these tokens. Otherwise uses the current self.unwanted_tokens.
- n_jobs:
Number of worker processes to use. 1 uses a single process.
Notes
Multiprocessing can be unreliable on some platforms/environments (especially when pickling bound methods). If it fails, the implementation falls back to single-process deletion.
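A single-process version of the deny-list filter might look like the sketch below. It is not the library's code: drop_merges_with_tokens is a hypothetical name, and checking the merged result in addition to the two pieces is an assumption.

```python
def drop_merges_with_tokens(merges, unwanted_tokens):
    """Remove every merge that touches an unwanted token."""
    unwanted = set(unwanted_tokens)
    kept = []
    for merge in merges:
        pieces = merge.split() if isinstance(merge, str) else list(merge)
        # Assumption: a merge touches its pieces and the token it produces.
        touched = set(pieces) | {"".join(pieces)}
        if touched.isdisjoint(unwanted):
            kept.append(merge)
    return kept


merges = [("a", "b"), ("ab", "c"), ("c", "d")]
print(drop_merges_with_tokens(merges, ["ab"]))  # [('c', 'd')]
```

Using a set for the unwanted tokens keeps each membership test constant-time, which matters more than parallelism for typical merge lists.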
- delete_overlaps(vocab: dict, delete_merges: bool = True)[source]
Delete tokens that overlap with another vocabulary.
Parameters
- vocab:
A vocabulary mapping (e.g., token->id). Any token present in both vocabularies will be deleted from this tokenizer.
- delete_merges:
Whether to also remove merges referencing the deleted tokens.
- delete_tokens(unwanted_tokens: list[str] = [], include_substrings: bool = True, delete_merges: bool = True, n_jobs: int = 1) -> None[source]
Delete tokens from the vocabulary and optionally delete affected merges.
Parameters
- unwanted_tokens:
Tokens to delete. If empty, deletes the current self.unwanted_tokens.
- include_substrings:
If True, expands the deletion list by searching for tokens present in the current vocabulary. (Note: this does not search arbitrary substrings; it filters to existing vocab entries.)
- delete_merges:
If True, also remove merges that reference deleted tokens.
- n_jobs:
Worker process count used by
delete_merges().
Raises
- KeyError
If a requested token is not in the vocabulary.
- find_tail_tokens(k_least: int, exclude: list[str] = [], consider_excluded_tokens: bool = False)[source]
Select k tokens from the tail of the vocabulary ordering.
Parameters
- k_least:
Number of tokens to select.
- exclude:
Tokens to ignore.
- consider_excluded_tokens:
If True, k_least is decremented for each excluded token encountered during selection.
Notes
Despite the name, this method does not compute true token frequencies. It walks the vocabulary in reverse insertion/id order and picks the first k. This is only “least frequent” if your tokenizer’s vocab is ordered by frequency.
Side Effects
Updates
self.unwanted_tokens.
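The selection described above can be sketched on a token->id dict. This is one reading of the documented behavior under stated assumptions: the function name is hypothetical, and "tail" is taken to mean highest id first.

```python
def find_tail_tokens_sketch(vocab, k_least, exclude=(), consider_excluded_tokens=False):
    """Pick up to k_least tokens, walking from the highest id downwards."""
    excluded = set(exclude)
    selected = []
    remaining = k_least
    by_id_descending = sorted(vocab.items(), key=lambda item: item[1], reverse=True)
    for token, _token_id in by_id_descending:
        if len(selected) >= remaining:
            break
        if token in excluded:
            if consider_excluded_tokens:
                remaining -= 1  # an excluded token uses up one slot
            continue
        selected.append(token)
    return selected


vocab = {"a": 0, "b": 1, "c": 2, "d": 3, "e": 4}
print(find_tail_tokens_sketch(vocab, 2, exclude=["e"]))  # ['d', 'c']
print(find_tail_tokens_sketch(vocab, 2, exclude=["e"], consider_excluded_tokens=True))  # ['d']
```

No frequency information enters the selection, which is why the result is only "least frequent" for a frequency-ordered vocab.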
- find_token_id_gap()[source]
Find the most recent gap in token ids.
Returns
- int
The id at the start of the last detected gap.
add_tokens uses this to pick ids that do not collide with existing ones.
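The description leaves some room for interpretation, so the sketch below shows just one plausible reading: scan the sorted ids and report the first free id of the highest gap. The function name and the fallback of returning max id + 1 when no gap exists are assumptions, not documented behavior.

```python
def find_last_id_gap(vocab):
    """Return the first free id in the highest gap of a token->id mapping."""
    ids = sorted(vocab.values())
    # Assumption: with no gap at all, the next id after the maximum is used.
    gap_start = ids[-1] + 1 if ids else 0
    for previous, current in zip(ids, ids[1:]):
        if current - previous > 1:
            gap_start = previous + 1  # later gaps overwrite earlier ones
    return gap_start


print(find_last_id_gap({"a": 0, "b": 1, "c": 4, "d": 5, "e": 9}))  # 6
```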
- find_tokens(unwanted_tokens: list[str])[source]
Add existing tokens from unwanted_tokens to the internal deletion list.
Parameters
- unwanted_tokens:
Candidate tokens to look up in the tokenizer vocabulary.
Side Effects
Appends found tokens to
self.unwanted_tokens.
- format_merges()[source]
Normalize merge entries to tuples of strings.
Some tokenizer JSON dumps store merges as whitespace-separated strings. This converts each merge into a tuple (typically length 2).
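The normalization amounts to splitting string entries on whitespace and converting everything to tuples. A minimal standalone sketch (format_merges_sketch is a hypothetical name that operates on a plain list rather than the tokenizer state):

```python
def format_merges_sketch(merges):
    """Convert each merge to a tuple of strings."""
    formatted = []
    for merge in merges:
        if isinstance(merge, str):
            # "a b" -> ("a", "b")
            formatted.append(tuple(merge.split()))
        else:
            # ["a", "b"] or ("a", "b") -> ("a", "b")
            formatted.append(tuple(merge))
    return formatted


print(format_merges_sketch(["a b", ["c", "d"], ("e", "f")]))
# [('a', 'b'), ('c', 'd'), ('e', 'f')]
```

Tuples are hashable, so normalized merges can be placed in sets for fast membership tests.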
- get_overlapping_merges(merges: list)[source]
Return merges that overlap with another merge list.
Parameters
- merges:
A list of merges from another tokenizer.
Returns
- list
Merges from this tokenizer that appear to overlap with the provided list.
- get_overlapping_tokens(vocab: dict)[source]
Return tokens that exist in both vocabularies.
Parameters
- vocab:
A mapping representing another vocabulary (e.g., token->id).
Returns
- list[str]
Tokens that appear in both vocab and this tokenizer's vocab.
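On plain dicts, the overlap is a key intersection, and removing it is what delete_overlaps() is documented to do for tokens. The sketch below uses hypothetical function names and keeps this vocabulary's ordering, which is an assumption.

```python
def overlapping_tokens(own_vocab, other_vocab):
    """Tokens present in both mappings, in own_vocab order."""
    return [token for token in own_vocab if token in other_vocab]


def without_overlaps(own_vocab, other_vocab):
    """Copy of own_vocab with the shared tokens removed."""
    shared = set(overlapping_tokens(own_vocab, other_vocab))
    return {token: token_id for token, token_id in own_vocab.items() if token not in shared}


own = {"a": 0, "b": 1, "c": 2}
other = {"b": 10, "c": 11, "d": 12}
print(overlapping_tokens(own, other))  # ['b', 'c']
print(without_overlaps(own, other))    # {'a': 0}
```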
- load_tokenizer(tokenizer: TokenizersBackend)[source]
Load a tokenizer and initialize internal JSON state.
Parameters
- tokenizer:
The tokenizer to load (must be a fast tokenizer).
- property none_keys: list[str]
Alias for
none_types.
- replace_tokens(donor_TC: TokenizerChanger, k: int, ignore_overlaps: bool = False, add_merges: bool = True, n_jobs: int = 1, replaced_idx_file: str | None = 'replaced_idx.txt')[source]
Replace k tokens from this tokenizer with tokens from a donor tokenizer.
Parameters
- donor_TC:
The TokenizerChanger instance to source replacement tokens from.
- k:
Number of tokens to replace.
- ignore_overlaps:
Whether to skip overlapping tokens during replacement.
- add_merges:
Whether to also add merges referencing replaced tokens.
- n_jobs:
Worker process count used by delete_merges().
- replaced_idx_file:
Optional file path to save the indices of replaced tokens.
Notes
This method selects tokens to delete using find_tail_tokens() and then adds new tokens from the donor tokenizer. If the donor has fewer than k tokens, it adds as many as it can.
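The two-step flow (drop tail tokens, then fill in donor tokens) can be illustrated on plain data. This sketch makes several assumptions that the documentation does not confirm: donor tokens reuse the freed ids, tokens already present are always skipped, and the returned id list corresponds to what replaced_idx_file would record. The function name is hypothetical.

```python
def replace_tail_tokens(vocab, donor_tokens, k):
    """Swap the k highest-id tokens for donor tokens; return (vocab, replaced ids)."""
    by_id_descending = sorted(vocab.items(), key=lambda item: item[1], reverse=True)
    removed = dict(by_id_descending[:k])
    updated = {token: i for token, i in vocab.items() if token not in removed}
    # Assumption: donor tokens take over the ids freed by the removed tokens.
    freed_ids = sorted(removed.values())
    # Skip donor tokens that already exist, and drop duplicates in order.
    candidates = [t for t in dict.fromkeys(donor_tokens) if t not in updated]
    replaced_ids = []
    # zip stops at the shorter input, so a short donor list fills fewer slots.
    for token_id, token in zip(freed_ids, candidates):
        updated[token] = token_id
        replaced_ids.append(token_id)
    return updated, replaced_ids


vocab = {"a": 0, "b": 1, "c": 2, "d": 3}
print(replace_tail_tokens(vocab, ["x", "a", "y", "z"], 2))
# ({'a': 0, 'b': 1, 'x': 2, 'y': 3}, [2, 3])
```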
- save_tokenizer(path: str = 'updated_tokenizer')[source]
Persist the updated tokenizer to disk.
Parameters
- path:
Output directory passed to
tokenizer.save_pretrained.
Notes
This calls
updated_tokenizer()first to rebuild the tokenizer instance from the current JSON state.
- set_space_marker(space_marker: str) -> None[source]
Alias for
set_space_sign().
- set_space_sign(space_sign: str)[source]
Set the marker used to represent spaces when adding tokens.
Parameters
- space_sign:
The character (or string) used to replace regular spaces in token strings.
- property space_marker: str
Alias for
space_sign.
- property tokens_to_delete: list[str]
Alias for
unwanted_tokens.
Module contents
TokenizerChanger: utilities for editing Hugging Face fast tokenizers.
The public API is exposed via TokenizerChanger.TokenizerChanger.
- class TokenizerChanger.TokenizerChanger(tokenizer: TokenizersBackend | None = None, space_sign: str = 'Ġ')[source]
Bases: object
Same class and members as TokenizerChanger.tokenizer_changer.TokenizerChanger, documented above.