euroeval.tokenisation_utils
source module euroeval.tokenisation_utils
Utility functions related to tokenisation.
Functions
- get_special_token_metadata — Get the special token metadata for a tokeniser.
- should_prompts_be_stripped — Determine if we should strip the prompts for few-shot evaluation.
- should_prefix_space_be_added_to_labels — Determine if we should add a prefix space to the labels.
- get_bos_token — Get the beginning-of-sequence token from a tokeniser.
- get_eos_token — Get the end-of-sequence token from a tokeniser.
- get_pad_token — Get the padding token from a tokeniser.
- get_end_of_chat_token_ids — Get the end-of-chat token IDs for chat models.
- get_first_label_token_mapping — Get a mapping from each label to its first token, or determine whether the model should output scores.
- has_chat_template — Check if a tokeniser has a chat template.
- apply_chat_template — Apply the chat template to a conversation.
source get_special_token_metadata(tokeniser: PreTrainedTokenizerBase) → dict
Get the special token metadata for a tokeniser.
Parameters
- tokeniser : PreTrainedTokenizerBase — The tokeniser.
Returns
- dict — The special token metadata.
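A minimal usage sketch, assuming the module is importable under the path shown at the top of this page and that any Hugging Face tokeniser can be passed. The keys of the returned dict are not specified here, so the sketch simply prints it; the model ID is an arbitrary illustration.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_special_token_metadata

# Arbitrary model chosen purely for illustration
tokeniser = AutoTokenizer.from_pretrained("gpt2")

# Inspect the special token metadata for this tokeniser
metadata = get_special_token_metadata(tokeniser)
print(metadata)
```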
source should_prompts_be_stripped(labels_to_be_generated: list[str], tokeniser: PreTrainedTokenizer) → bool
Determine if we should strip the prompts for few-shot evaluation.
This is the case if the tokeniser needs to include the space as part of the label token. The strategy is thus to tokenise a label with a preceding colon (as in the prompts), e.g., ": positive", and check if the tokenisation starts with the tokens of ": ". If this is the case, then we should not strip the prompts, since the tokeniser produces the whitespace token separately.
Parameters
- labels_to_be_generated : list[str] — The labels that are to be generated.
- tokeniser : PreTrainedTokenizer — The tokeniser used to tokenise the labels.
Returns
- bool — Whether we should strip the prompts.
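The strategy above can be illustrated directly with a Hugging Face tokeniser. This is a sketch of the described check, not the library's exact implementation; the model ID and label are arbitrary.

```python
from transformers import AutoTokenizer

# Arbitrary tokeniser and label, purely for illustration
tokeniser = AutoTokenizer.from_pretrained("gpt2")
label = "positive"

# Tokenise ": " on its own and the label with a preceding colon
colon_space_ids = tokeniser(": ", add_special_tokens=False).input_ids
label_ids = tokeniser(f": {label}", add_special_tokens=False).input_ids

# If ": positive" starts with the tokens of ": ", the whitespace is produced
# as a separate token, so the prompts should not be stripped
produces_separate_space = label_ids[: len(colon_space_ids)] == colon_space_ids
should_strip = not produces_separate_space
print(should_strip)
```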
source should_prefix_space_be_added_to_labels(labels_to_be_generated: list[str], tokeniser: PreTrainedTokenizer) → bool
Determine if we should add a prefix space to the labels.
This is the case if the prompts are stripped and the tokeniser doesn't automatically add prefix whitespaces to the labels.
Parameters
- labels_to_be_generated : list[str] — The labels that are to be generated.
- tokeniser : PreTrainedTokenizer — The tokeniser used to tokenise the labels.
Returns
- bool — Whether we should add a prefix space to the labels.
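A sketch of how the two helpers might be combined, assuming the import path shown on this page; the model ID and labels are hypothetical.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import (
    should_prefix_space_be_added_to_labels,
    should_prompts_be_stripped,
)

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # arbitrary choice
labels = ["positive", "negative"]  # hypothetical labels

if should_prompts_be_stripped(labels_to_be_generated=labels, tokeniser=tokeniser):
    # The prompts are stripped, so the labels may need their whitespace back
    if should_prefix_space_be_added_to_labels(
        labels_to_be_generated=labels, tokeniser=tokeniser
    ):
        labels = [f" {label}" for label in labels]
```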
source get_bos_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]
Get the beginning-of-sequence token from a tokeniser.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
Returns
- tuple[str, int] | tuple[None, None] — A pair (token, token_id) representing the beginning-of-sequence token and its token ID, or (None, None) if no BOS token is found.
source get_eos_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]
Get the end-of-sequence token from a tokeniser.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
Returns
- tuple[str, int] | tuple[None, None] — A pair (token, token_id) representing the end-of-sequence token and its token ID, or (None, None) if no EOS token is found.
source get_pad_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]
Get the padding token from a tokeniser.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
Returns
- tuple[str, int] | tuple[None, None] — A pair (token, token_id) representing the padding token and its token ID, or (None, None) if no padding token is found.
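The three token getters share the same calling convention, so a single sketch covers them; the model ID is arbitrary and the import path is assumed from this page.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_bos_token, get_eos_token, get_pad_token

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # arbitrary choice

# Each helper returns (token, token_id), or (None, None) if the token is absent
bos_token, bos_token_id = get_bos_token(tokeniser)
eos_token, eos_token_id = get_eos_token(tokeniser)
pad_token, pad_token_id = get_pad_token(tokeniser)

print(f"BOS: {bos_token!r} ({bos_token_id})")
print(f"EOS: {eos_token!r} ({eos_token_id})")
print(f"PAD: {pad_token!r} ({pad_token_id})")
```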
source get_end_of_chat_token_ids(tokeniser: PreTrainedTokenizer, generative_type: GenerativeType | None) → list[int] | None
Get the end-of-chat token IDs for chat models.
This is only relevant for tokenisers with a chat template.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
- generative_type : GenerativeType | None — The generative type, or None if not available.
Returns
- list[int] | None — The token IDs used to end chats, or None if the tokeniser does not have a chat template or if no end-of-chat token could be found.
Raises
- e — The underlying exception encountered while locating the end-of-chat token, re-raised if it cannot be handled.
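A usage sketch, passing None for the generative type as the signature allows; the chat model ID is hypothetical.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_end_of_chat_token_ids

# Hypothetical chat model, chosen purely for illustration
tokeniser = AutoTokenizer.from_pretrained("some-org/some-chat-model")

end_of_chat_ids = get_end_of_chat_token_ids(tokeniser=tokeniser, generative_type=None)
if end_of_chat_ids is None:
    print("No chat template, or no end-of-chat token found")
else:
    print(tokeniser.decode(end_of_chat_ids))
```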
source get_first_label_token_mapping(dataset_config: DatasetConfig, model_config: ModelConfig, tokeniser: PreTrainedTokenizer | None, generative_type: GenerativeType | None, log_metadata: bool) → dict[str, str] | bool
Get a mapping from each label to its first token, or determine whether the model should output scores.
Parameters
- dataset_config : DatasetConfig — The dataset configuration.
- model_config : ModelConfig — The model configuration.
- tokeniser : PreTrainedTokenizer | None — The tokeniser, or None if not available.
- generative_type : GenerativeType | None — The generative type, or None if not available.
- log_metadata : bool — Whether to log metadata.
Returns
- dict[str, str] | bool — A mapping from labels to the first token in each label, or alternatively a Boolean value indicating whether the model should output scores (if the mapping is returned then the model will always output scores).
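A sketch of how the return value might be interpreted. Constructing the DatasetConfig and ModelConfig is out of scope here, so they are taken as given parameters of a hypothetical helper.

```python
from euroeval.tokenisation_utils import get_first_label_token_mapping


def resolve_output_scores(dataset_config, model_config, tokeniser) -> None:
    """Sketch: interpret the return value of get_first_label_token_mapping."""
    mapping_or_flag = get_first_label_token_mapping(
        dataset_config=dataset_config,
        model_config=model_config,
        tokeniser=tokeniser,
        generative_type=None,
        log_metadata=False,
    )
    if isinstance(mapping_or_flag, dict):
        # A label -> first-token mapping was found; the model will output scores
        print(f"First-token mapping: {mapping_or_flag}")
    else:
        # A plain Boolean indicating whether the model should output scores
        print(f"Output scores: {mapping_or_flag}")
```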
source has_chat_template(tokeniser: PreTrainedTokenizer) → bool
Check if a tokeniser has a chat template.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
Returns
- bool — Whether the tokeniser has a chat template.
source apply_chat_template(conversation: list[dict[str, str]], tokeniser: PreTrainedTokenizer, tokenise: bool, add_generation_prompt: bool, enable_thinking: bool, **extra_kwargs) → str | list[int]
Apply the chat template to a conversation.
Parameters
- conversation : list[dict[str, str]] — The conversation to apply the chat template to.
- tokeniser : PreTrainedTokenizer — The tokeniser.
- tokenise : bool — Whether to tokenise the resulting prompt, returning a list of token IDs instead of a string.
- add_generation_prompt : bool — Whether to add a generation prompt at the end of the conversation. This is only relevant for regular Hugging Face tokenisers, as Mistral tokenisers always add a generation prompt.
- enable_thinking : bool — Whether to enable special handling for reasoning models, such as adding special tokens for thinking. This is only relevant for regular Hugging Face tokenisers, as Mistral tokenisers always handle reasoning models.
- **extra_kwargs — Extra keyword arguments to pass to the tokeniser's apply_chat_template method. Only relevant for regular Hugging Face tokenisers.
Returns
- str | list[int] — The prompt with the chat template applied, either as a string or a list of token IDs, depending on the value of tokenise.
Raises
- InvalidModel — If the tokeniser does not have a chat template.
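A sketch combining has_chat_template and apply_chat_template, guarding against the InvalidModel error; the model ID is hypothetical and the conversation follows the usual role/content message format.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import apply_chat_template, has_chat_template

# Hypothetical chat model, chosen purely for illustration
tokeniser = AutoTokenizer.from_pretrained("some-org/some-chat-model")

conversation = [
    {"role": "user", "content": "What is the capital of Denmark?"},
]

# Guard with has_chat_template, since apply_chat_template raises InvalidModel
# when the tokeniser has no chat template
if has_chat_template(tokeniser):
    prompt = apply_chat_template(
        conversation=conversation,
        tokeniser=tokeniser,
        tokenise=False,  # return a string rather than token IDs
        add_generation_prompt=True,
        enable_thinking=False,
    )
    print(prompt)
```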