euroeval.tokenization_utils

Utility functions related to tokenization.
Functions

- get_special_token_metadata — Get the special token metadata for a tokenizer.
- should_prompts_be_stripped — Determine whether the prompts should be stripped for few-shot evaluation.
- should_prefix_space_be_added_to_labels — Determine whether a prefix space should be added to the labels.
- get_bos_token — Get the beginning-of-sequence token from a tokenizer.
- get_eos_token — Get the end-of-sequence token from a tokenizer.
- get_end_of_chat_token_ids — Get the token IDs used to end chats for chat models.
- get_first_label_token_mapping — Get a mapping from labels to their first tokens, or whether the model should output scores.
get_special_token_metadata(tokenizer: PreTrainedTokenizerBase) → dict

Get the special token metadata for a tokenizer.

Parameters
- tokenizer : PreTrainedTokenizerBase — The tokenizer.

Returns
- dict — The special token metadata.
should_prompts_be_stripped(labels_to_be_generated: list[str], tokenizer: PreTrainedTokenizer) → bool

Determine whether the prompts should be stripped for few-shot evaluation.

This is the case if the tokenizer needs to include the space as part of the label token. The strategy is thus to tokenize a label with a preceding colon (as in the prompts), e.g., ": positive", and check whether the tokenization starts with the tokens of ": ". If it does, the tokenizer produces the whitespace token separately, so the prompts should not be stripped.

Parameters
- labels_to_be_generated : list[str] — The labels that are to be generated.
- tokenizer : PreTrainedTokenizer — The tokenizer used to tokenize the labels.

Returns
- bool — Whether the prompts should be stripped.
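The heuristic above can be sketched as follows. The two toy tokenizer classes are assumptions standing in for real tokenizers that either emit whitespace as its own token or fuse it into the following token:

```python
import re


def split_keeping_spaces(text: str) -> list[str]:
    """Split text into words and whitespace runs, dropping empty strings."""
    return [t for t in re.split(r"(\s+)", text) if t]


class SeparateSpaceTokenizer:
    """Toy tokenizer emitting whitespace separately: ': positive' -> [':', ' ', 'positive']."""

    def tokenize(self, text: str) -> list[str]:
        return split_keeping_spaces(text)


class FusedSpaceTokenizer:
    """Toy tokenizer fusing a space into the next word: ': positive' -> [':', ' positive']."""

    def tokenize(self, text: str) -> list[str]:
        parts = split_keeping_spaces(text)
        out: list[str] = []
        i = 0
        while i < len(parts):
            if parts[i].isspace() and i + 1 < len(parts):
                out.append(parts[i] + parts[i + 1])
                i += 2
            else:
                out.append(parts[i])
                i += 1
        return out


def should_prompts_be_stripped(labels: list[str], tokenizer) -> bool:
    """Sketch of the heuristic described above, not the exact implementation."""
    colon_tokens = tokenizer.tokenize(": ")
    for label in labels:
        label_tokens = tokenizer.tokenize(f": {label}")
        # If ': <label>' starts with the tokens of ': ', the whitespace is a
        # separate token, so the prompts should not be stripped.
        if label_tokens[: len(colon_tokens)] == colon_tokens:
            return False
    return True
```

With these stand-ins, `SeparateSpaceTokenizer` keeps the space as its own token (no stripping needed), while `FusedSpaceTokenizer` needs the space attached to the label, so stripping applies.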
should_prefix_space_be_added_to_labels(labels_to_be_generated: list[str], tokenizer: PreTrainedTokenizer) → bool

Determine whether a prefix space should be added to the labels.

This is the case if the prompts are stripped and the tokenizer does not automatically add prefix whitespace to the labels.

Parameters
- labels_to_be_generated : list[str] — The labels that are to be generated.
- tokenizer : PreTrainedTokenizer — The tokenizer used to tokenize the labels.

Returns
- bool — Whether a prefix space should be added to the labels.
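The second condition (whether the tokenizer prepends the space itself) can be sketched like this; the MockTokenizer and its add_prefix_space flag are assumptions for illustration (some Hugging Face fast tokenizers expose a similar flag):

```python
class MockTokenizer:
    """Toy tokenizer whose `add_prefix_space` flag controls whether it
    prepends a space itself (an assumption for this sketch)."""

    def __init__(self, add_prefix_space: bool) -> None:
        self.add_prefix_space = add_prefix_space

    def tokenize(self, text: str) -> list[str]:
        if self.add_prefix_space and not text.startswith(" "):
            text = " " + text
        # Crude word-level split that keeps a leading space on the first token.
        words = text.split()
        if text.startswith(" ") and words:
            words[0] = " " + words[0]
        return words


def should_prefix_space_be_added_to_labels(labels: list[str], tokenizer) -> bool:
    # A prefix space must be added manually only when the tokenizer does
    # not prepend one itself.
    return not all(
        tokenizer.tokenize(label)[0].startswith(" ") for label in labels
    )
```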
get_bos_token(tokenizer: PreTrainedTokenizer) → tuple[str, int]

Get the beginning-of-sequence token from a tokenizer.

Parameters
- tokenizer : PreTrainedTokenizer — The tokenizer.

Returns
- tuple[str, int] — A pair (token, token_id) of the beginning-of-sequence token and its token ID.
get_eos_token(tokenizer: PreTrainedTokenizer) → tuple[str, int]

Get the end-of-sequence token from a tokenizer.

Parameters
- tokenizer : PreTrainedTokenizer — The tokenizer.

Returns
- tuple[str, int] — A pair (token, token_id) of the end-of-sequence token and its token ID.
get_end_of_chat_token_ids(tokenizer: PreTrainedTokenizer) → list[int] | None

Get the token IDs used to end chats for chat models.

This is only relevant for tokenizers with a chat template.

Parameters
- tokenizer : PreTrainedTokenizer — The tokenizer.

Returns
- list[int] | None — The token IDs used to end chats, or None if the tokenizer does not have a chat template.

Raises
- ValueError — If the end-of-chat token could not be located.
get_first_label_token_mapping(dataset_config: DatasetConfig, model_config: ModelConfig, tokenizer: PreTrainedTokenizer | None, generative_type: GenerativeType | None) → dict[str, str] | bool

Get a mapping from labels to their first tokens, or determine whether the model should output scores.

Parameters
- dataset_config : DatasetConfig — The dataset configuration.
- model_config : ModelConfig — The model configuration.
- tokenizer : PreTrainedTokenizer | None — The tokenizer, or None if not available.
- generative_type : GenerativeType | None — The generative type, or None if not available.

Returns
- dict[str, str] | bool — A mapping from labels to the first token in each label, or alternatively a Boolean value indicating whether the model should output scores (if the mapping is output, the model will always output scores).