euroeval.tokenisation_utils

source module euroeval.tokenisation_utils

Utility functions related to tokenisation.

Functions

source get_special_token_metadata(tokeniser: PreTrainedTokenizerBase) → dict

Get the special token metadata for a tokeniser.

Parameters

  • tokeniser : PreTrainedTokenizerBase The tokeniser.

Returns

  • dict The special token metadata.

source should_prompts_be_stripped(labels_to_be_generated: list[str], tokeniser: PreTrainedTokenizer) → bool

Determine if we should strip the prompts for few-shot evaluation.

This is the case if the tokeniser needs to include the space as part of the label token. The strategy is thus to tokenise a label with a preceding colon (as in the prompts), i.e., ": positive", and check if the tokenisation starts with the tokens of ": ". If this is the case, then we should not strip the prompts, since the tokeniser produces the whitespace token separately.

Parameters

  • labels_to_be_generated : list[str] The labels that are to be generated.

  • tokeniser : PreTrainedTokenizer The tokeniser used to tokenise the labels.

Returns

  • bool Whether we should strip the prompts.
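The strategy above can be sketched with two toy tokenisers — one SentencePiece-style tokeniser that folds a leading space into the next token, and one that emits whitespace as a separate token. Both tokenisers and the `should_strip` helper are hypothetical illustrations, not the library's implementation:

```python
def merging_tokenise(text: str) -> list[str]:
    """Toy tokeniser that folds each space into the following token."""
    parts = text.split(" ")
    tokens = [parts[0]] if parts[0] else []
    for part in parts[1:]:
        tokens.append("▁" + part)
    return tokens


def separating_tokenise(text: str) -> list[str]:
    """Toy tokeniser that emits whitespace as its own token."""
    tokens: list[str] = []
    for part in text.split(" "):
        if part:
            tokens.append(part)
        tokens.append("▁")
    return tokens[:-1]


def should_strip(labels: list[str], tokenise) -> bool:
    """Sketch of the check described above: strip the prompts unless every
    label tokenised as ": <label>" starts with the tokens of ": "."""
    colon_tokens = tokenise(": ")
    return any(
        tokenise(f": {label}")[: len(colon_tokens)] != colon_tokens
        for label in labels
    )
```

With the space-merging tokeniser, ": positive" becomes `[":", "▁positive"]`, which does not start with `[":", "▁"]`, so the prompts should be stripped; the separating tokeniser produces the whitespace token on its own, so they should not.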

source should_prefix_space_be_added_to_labels(labels_to_be_generated: list[str], tokeniser: PreTrainedTokenizer) → bool

Determine if we should add a prefix space to the labels.

This is the case if the prompts are stripped and the tokeniser doesn't automatically add prefix whitespaces to the labels.

Parameters

  • labels_to_be_generated : list[str] The labels that are to be generated.

  • tokeniser : PreTrainedTokenizer The tokeniser used to tokenise the labels.

Returns

  • bool Whether we should add a prefix space to the labels.
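The condition in the docstring can be sketched as follows, with two hypothetical toy tokenisers standing in for the two behaviours (one that adds a prefix space itself, one that does not); this is an illustration of the stated condition, not the library's code:

```python
def add_prefix_tokenise(text: str) -> list[str]:
    """Toy tokeniser that adds a prefix space itself: a bare word
    tokenises the same as the space-prefixed word."""
    return ("▁" + text.lstrip()).split(" ")


def plain_tokenise(text: str) -> list[str]:
    """Toy tokeniser without automatic prefix spaces."""
    return [tok for tok in text.replace(" ", " ▁ ").split(" ") if tok]


def should_add_prefix_space(labels, tokenise, prompts_stripped: bool) -> bool:
    """Sketch: a prefix space is needed when the prompts are stripped and
    tokenising a bare label does not already match tokenising it with a
    leading space (i.e. the tokeniser does not add the space itself)."""
    if not prompts_stripped:
        return False
    return any(tokenise(label) != tokenise(" " + label) for label in labels)
```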

source get_bos_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]

Get the beginning-of-sequence token from a tokeniser.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

Returns

  • tuple[str, int] | tuple[None, None] A pair (token, token_id) representing the beginning-of-sequence token and its token ID, or (None, None) if no BOS token is found.

source get_eos_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]

Get the end-of-sequence token from a tokeniser.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

Returns

  • tuple[str, int] | tuple[None, None] A pair (token, token_id) representing the end-of-sequence token and its token ID, or (None, None) if no EOS token is found.

source get_pad_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]

Get the padding token from a tokeniser.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

Returns

  • tuple[str, int] | tuple[None, None] A pair (token, token_id) representing the padding token and its token ID, or (None, None) if no padding token is found.
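The three getters above share the same return contract, which can be sketched with a hypothetical tokeniser-like object carrying the usual Hugging Face attribute names (`bos_token`, `bos_token_id`, and so on — an assumption for illustration):

```python
class ToyTokeniser:
    """Hypothetical tokeniser-like object, used only to illustrate the
    getters' return contract."""
    bos_token, bos_token_id = "<s>", 1
    eos_token, eos_token_id = "</s>", 2
    pad_token, pad_token_id = None, None


def get_token_pair(token, token_id):
    """All three getters share this shape: a (token, token_id) pair, or
    (None, None) when the tokeniser defines no such token."""
    if token is None or token_id is None:
        return (None, None)
    return (token, token_id)
```

Returning `(None, None)` rather than raising lets callers unpack the pair unconditionally and branch on the token afterwards.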

source get_end_of_chat_token_ids(tokeniser: PreTrainedTokenizer, generative_type: GenerativeType | None) → list[int] | None

Get the end token ID for chat models.

This is only relevant for tokenisers with a chat template.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

  • generative_type : GenerativeType | None The generative type, or None if not available.

Returns

  • list[int] | None The token IDs used to end chats, or None if the tokeniser does not have a chat template or if no end-of-chat token could be found.

Raises

  • e

source get_first_label_token_mapping(dataset_config: DatasetConfig, model_config: ModelConfig, tokeniser: PreTrainedTokenizer | None, generative_type: GenerativeType | None, log_metadata: bool) → dict[str, str] | bool

Check if the model should output scores.

Parameters

  • dataset_config : DatasetConfig The dataset configuration.

  • model_config : ModelConfig The model configuration.

  • tokeniser : PreTrainedTokenizer | None The tokeniser, or None if not available.

  • generative_type : GenerativeType | None The generative type, or None if not available.

  • log_metadata : bool Whether to log metadata.

Returns

  • dict[str, str] | bool A mapping from labels to the first token in each label, or alternatively a Boolean value indicating whether the model should output scores (if the mapping is returned then the model will always output scores).
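The core idea of such a label-to-first-token mapping can be sketched as below. This is a hypothetical simplification, not the library's implementation: it maps each label to the first token of its tokenisation and falls back to a Boolean when two labels collide on their first token, since scores over first tokens could then not distinguish them.

```python
def first_label_token_mapping(labels, tokenise):
    """Sketch: map each label to its first token, or return False when
    two labels share a first token and the mapping would be ambiguous."""
    mapping = {label: tokenise(label)[0] for label in labels}
    if len(set(mapping.values())) < len(mapping):
        return False
    return mapping
```

With a character-level toy tokeniser (`list`), "positive" and "negative" map to "p" and "n", while "positive" and "possible" collide on "p" and trigger the fallback.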

source has_chat_template(tokeniser: PreTrainedTokenizer) → bool

Check if a tokeniser has a chat template.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

Returns

  • bool Whether the tokeniser has a chat template.
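A minimal sketch of this check, assuming only that Hugging Face tokenisers expose their template on a `chat_template` attribute (which is `None` when unset); the toy classes are hypothetical stand-ins:

```python
class ChatTokeniser:
    """Hypothetical tokeniser-like object with a chat template set."""
    chat_template = "{% for message in messages %}...{% endfor %}"


class PlainTokeniser:
    """Hypothetical tokeniser-like object without a chat template."""
    chat_template = None


def has_chat_template_sketch(tokeniser) -> bool:
    # Sketch: a tokeniser has a chat template when the attribute is set.
    return getattr(tokeniser, "chat_template", None) is not None
```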

source apply_chat_template(conversation: list[dict[str, str]], tokeniser: PreTrainedTokenizer, tokenise: bool, add_generation_prompt: bool, enable_thinking: bool, **extra_kwargs) → str | list[int]

Apply the chat template to a prompt.

Parameters

  • conversation : list[dict[str, str]] The conversation to apply the chat template to.

  • tokeniser : PreTrainedTokenizer The tokeniser.

  • tokenise : bool Whether to tokenise the resulting prompt, returning a list of token IDs instead of a string.

  • add_generation_prompt : bool Whether to add a generation prompt at the end of the conversation. This is only relevant for regular Hugging Face tokenisers, as Mistral tokenisers always add a generation prompt.

  • enable_thinking : bool Whether to enable special handling for reasoning models, such as adding special tokens for thinking. This is only relevant for regular Hugging Face tokenisers, as Mistral tokenisers always handle reasoning models.

  • **extra_kwargs Extra keyword arguments to pass to the tokeniser's apply_chat_template method. Only relevant for regular Hugging Face tokenisers.

Returns

  • str | list[int] The prompt with the chat template applied, either as a string or a list of token IDs, depending on the value of tokenise.

Raises

  • InvalidModel If the tokeniser does not have a chat template.
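Conceptually, applying a chat template turns a list of role/content messages into a single prompt string. The sketch below illustrates that transformation with made-up role markers; real tokenisers render their own (usually Jinja) template, so the exact markers and spacing differ per model:

```python
def apply_chat_template_sketch(
    conversation: list[dict[str, str]], add_generation_prompt: bool = True
) -> str:
    """Toy sketch of chat-template application: wrap each message in role
    markers and optionally append a generation prompt so the model
    continues as the assistant. The <|...|> markers are illustrative only."""
    parts = [
        f"<|{message['role']}|>{message['content']}<|end|>"
        for message in conversation
    ]
    if add_generation_prompt:
        parts.append("<|assistant|>")
    return "".join(parts)
```

Setting `add_generation_prompt=False` is useful when the final assistant turn is already part of the conversation, e.g. when scoring a completed answer rather than generating one.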