euroeval.tokenisation_utils
source module euroeval.tokenisation_utils
Utility functions related to tokenisation.
Functions
- get_special_token_metadata — Get the special token metadata for a tokeniser.
- should_prompts_be_stripped — Determine if we should strip the prompts for few-shot evaluation.
- should_prefix_space_be_added_to_labels — Determine if we should add a prefix space to the labels.
- get_bos_token — Get the beginning-of-sequence token from a tokeniser.
- get_eos_token — Get the end-of-sequence token from a tokeniser.
- get_pad_token — Get the padding token from a tokeniser.
- get_end_of_chat_token_ids — Get the end-of-chat token IDs for chat models.
- get_first_label_token_mapping — Get a mapping from labels to their first tokens, or whether the model should output scores.
- has_chat_template — Check if a tokeniser has a chat template.
- apply_chat_template — Apply the chat template to a prompt.
source get_special_token_metadata(tokeniser: PreTrainedTokenizerBase) → dict
Get the special token metadata for a tokeniser.
Parameters
- tokeniser : PreTrainedTokenizerBase — The tokeniser.
Returns
- dict — The special token metadata.
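A minimal usage sketch follows, assuming a standard Hugging Face tokeniser. The model name is illustrative, and the keys of the returned dictionary are not enumerated on this page, so the sketch simply prints whatever comes back:

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_special_token_metadata

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
metadata = get_special_token_metadata(tokeniser=tokeniser)

# The exact keys are not documented here, so we just inspect the dict.
for key, value in metadata.items():
    print(f"{key}: {value}")
```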
source should_prompts_be_stripped(labels_to_be_generated: c.Sequence[str], tokeniser: PreTrainedTokenizer) → bool
Determine if we should strip the prompts for few-shot evaluation.
This is the case if the tokeniser needs to include the space as part of the label token. The strategy is thus to tokenise a label with a preceding colon (as in the prompts), i.e., ": positive", and check if the tokenisation starts with the tokens of ": ". If this is the case, then we should not strip the prompts, since the tokeniser produces the whitespace token separately.
Parameters
- labels_to_be_generated : c.Sequence[str] — The labels that are to be generated.
- tokeniser : PreTrainedTokenizer — The tokeniser used to tokenise the labels.
Returns
- bool — Whether we should strip the prompts.
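The check described above can be reproduced directly with a Hugging Face tokeniser. The following is a sketch of the idea rather than EuroEval's actual implementation; the model and label are illustrative:

```python
from transformers import AutoTokenizer

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

label = "positive"  # hypothetical label
colon_space_ids = tokeniser(": ", add_special_tokens=False).input_ids
label_ids = tokeniser(f": {label}", add_special_tokens=False).input_ids

# If ": positive" starts with the tokens of ": ", the tokeniser produces the
# whitespace token separately, so the prompts should not be stripped.
whitespace_is_separate = label_ids[: len(colon_space_ids)] == colon_space_ids
print(f"Strip prompts: {not whitespace_is_separate}")
```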
source should_prefix_space_be_added_to_labels(labels_to_be_generated: c.Sequence[str], tokeniser: PreTrainedTokenizer) → bool
Determine if we should add a prefix space to the labels.
This is the case if the prompts are stripped and the tokeniser doesn't automatically add a prefix space to the labels.
Parameters
- labels_to_be_generated : c.Sequence[str] — The labels that are to be generated.
- tokeniser : PreTrainedTokenizer — The tokeniser used to tokenise the labels.
Returns
- bool — Whether we should add a prefix space to the labels.
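This helper is naturally paired with should_prompts_be_stripped. A hedged sketch, with an illustrative model and a hypothetical label set:

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import (
    should_prefix_space_be_added_to_labels,
    should_prompts_be_stripped,
)

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice
labels = ["positive", "negative"]  # hypothetical label set

strip_prompts = should_prompts_be_stripped(
    labels_to_be_generated=labels, tokeniser=tokeniser
)
add_prefix_space = should_prefix_space_be_added_to_labels(
    labels_to_be_generated=labels, tokeniser=tokeniser
)

# Prepend a space to each label only when the tokeniser won't do so itself.
prepared_labels = [f" {label}" if add_prefix_space else label for label in labels]
```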
source get_bos_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]
Get the beginning-of-sequence token from a tokeniser.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
Returns
- tuple[str, int] | tuple[None, None] — A pair (token, token_id) representing the beginning-of-sequence token and its token ID, or (None, None) if no BOS token is found.
source get_eos_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]
Get the end-of-sequence token from a tokeniser.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
Returns
- tuple[str, int] | tuple[None, None] — A pair (token, token_id) representing the end-of-sequence token and its token ID, or (None, None) if no EOS token is found.
source get_pad_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]
Get the padding token from a tokeniser.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
Returns
- tuple[str, int] | tuple[None, None] — A pair (token, token_id) representing the padding token and its token ID, or (None, None) if no padding token is found.
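The three getters above share the same (token, token_id) return shape, which makes missing tokens explicit. A minimal sketch with an illustrative model:

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_bos_token, get_eos_token, get_pad_token

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # illustrative model choice

bos_token, bos_token_id = get_bos_token(tokeniser=tokeniser)
eos_token, eos_token_id = get_eos_token(tokeniser=tokeniser)
pad_token, pad_token_id = get_pad_token(tokeniser=tokeniser)

# Each getter returns (None, None) when the token is missing, so the result
# can be checked before use.
if pad_token_id is None:
    print("No padding token found for this tokeniser.")
```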
source get_end_of_chat_token_ids(tokeniser: PreTrainedTokenizer, generative_type: GenerativeType | None) → c.Sequence[int] | None
Get the end-of-chat token IDs for chat models.
This is only relevant for tokenisers with a chat template.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
- generative_type : GenerativeType | None — The generative type, or None if not available.
Returns
- c.Sequence[int] | None — The token IDs used to end chats, or None if the tokeniser does not have a chat template or if no end-of-chat token could be found.
Raises
- e — The underlying exception, if one occurs while locating the end-of-chat token IDs.
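A hedged usage sketch; the model name is illustrative, and passing None for generative_type is permitted per the signature above:

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_end_of_chat_token_ids

# Illustrative chat model; any tokeniser with a chat template would do.
tokeniser = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

end_of_chat_ids = get_end_of_chat_token_ids(tokeniser=tokeniser, generative_type=None)
if end_of_chat_ids is not None:
    # Decode the IDs to see which string ends a chat turn.
    print(tokeniser.decode(list(end_of_chat_ids)))
```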
source get_first_label_token_mapping(dataset_config: DatasetConfig, model_config: ModelConfig, tokeniser: PreTrainedTokenizer | None, generative_type: GenerativeType | None, log_metadata: bool) → dict[str, str] | bool
Get a mapping from labels to their first tokens, or determine whether the model should output scores.
Parameters
- dataset_config : DatasetConfig — The dataset configuration.
- model_config : ModelConfig — The model configuration.
- tokeniser : PreTrainedTokenizer | None — The tokeniser, or None if not available.
- generative_type : GenerativeType | None — The generative type, or None if not available.
- log_metadata : bool — Whether to log metadata.
Returns
- dict[str, str] | bool — A mapping from labels to the first token in each label, or alternatively a Boolean value indicating whether the model should output scores (if the mapping is returned then the model will always output scores).
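Since the function returns either a mapping or a Boolean, callers typically branch on the type. A sketch of collapsing that union into a single flag; the helper name is hypothetical, and constructing the config objects is outside the scope of this page:

```python
from euroeval.tokenisation_utils import get_first_label_token_mapping


def resolve_output_scores(dataset_config, model_config, tokeniser) -> bool:
    """Reduce the dict-or-bool return value to a single flag.

    The arguments are the DatasetConfig, ModelConfig, and tokeniser instances
    described above; building them is not covered by this sketch.
    """
    result = get_first_label_token_mapping(
        dataset_config=dataset_config,
        model_config=model_config,
        tokeniser=tokeniser,
        generative_type=None,
        log_metadata=False,
    )
    # If a mapping is returned, the model will always output scores.
    return True if isinstance(result, dict) else result
```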
source has_chat_template(tokeniser: PreTrainedTokenizer) → bool
Check if a tokeniser has a chat template.
Parameters
- tokeniser : PreTrainedTokenizer — The tokeniser.
Returns
- bool — Whether the tokeniser has a chat template.
source apply_chat_template(conversation: c.Sequence[dict[str, str]], tokeniser: PreTrainedTokenizer, tokenise: bool, add_generation_prompt: bool, **extra_kwargs) → str | c.Sequence[int]
Apply the chat template to a prompt.
Parameters
- conversation : c.Sequence[dict[str, str]] — The conversation to apply the chat template to.
- tokeniser : PreTrainedTokenizer — The tokeniser.
- tokenise : bool — Whether to tokenise the resulting prompt, returning a list of token IDs instead of a string.
- add_generation_prompt : bool — Whether to add a generation prompt at the end of the conversation. This is only relevant for regular Hugging Face tokenisers, as Mistral tokenisers always add a generation prompt.
- **extra_kwargs — Extra keyword arguments to pass to the tokeniser's apply_chat_template method. Only relevant for regular Hugging Face tokenisers.
Returns
- str | c.Sequence[int] — The prompt with the chat template applied, either as a string or a list of token IDs, depending on the value of tokenise.
Raises
- InvalidModel — If the tokeniser does not have a chat template.
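Since apply_chat_template raises InvalidModel when no chat template is present, it is natural to guard it with has_chat_template. A minimal sketch with an illustrative chat model:

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import apply_chat_template, has_chat_template

tokeniser = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # illustrative
conversation = [{"role": "user", "content": "What is the capital of Denmark?"}]

if has_chat_template(tokeniser=tokeniser):
    prompt = apply_chat_template(
        conversation=conversation,
        tokeniser=tokeniser,
        tokenise=False,  # return a string rather than token IDs
        add_generation_prompt=True,
    )
    print(prompt)
```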