
euroeval.tokenisation_utils

source module euroeval.tokenisation_utils

Utility functions related to tokenisation.

Functions

source get_special_token_metadata(tokeniser: PreTrainedTokenizerBase) → dict

Get the special token metadata for a tokeniser.

Parameters

  • tokeniser : PreTrainedTokenizerBase The tokeniser.

Returns

  • dict The special token metadata.
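
Example

A minimal usage sketch. The checkpoint name is only an example, and the exact keys of the returned dictionary are not specified here, so the snippet simply inspects whatever metadata comes back.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_special_token_metadata

# Any Hugging Face tokeniser will do; this checkpoint is just an example.
tokeniser = AutoTokenizer.from_pretrained("bert-base-cased")

# Inspect whatever special-token metadata the tokeniser exposes.
metadata = get_special_token_metadata(tokeniser)
for key, value in metadata.items():
    print(f"{key}: {value!r}")
```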

source should_prompts_be_stripped(labels_to_be_generated: c.Sequence[str], tokeniser: PreTrainedTokenizer) → bool

Determine if we should strip the prompts for few-shot evaluation.

This is the case if the tokeniser needs to include the space as part of the label token. The strategy is thus to tokenise a label with a preceding colon (as in the prompts), e.g., ": positive", and check whether the tokenisation starts with the tokens of ": ". If it does, we should not strip the prompts, since the tokeniser produces the whitespace token separately.

Parameters

  • labels_to_be_generated : c.Sequence[str] The labels that are to be generated.

  • tokeniser : PreTrainedTokenizer The tokeniser used to tokenise the labels.

Returns

  • bool Whether we should strip the prompts.
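
Example

The strategy above can be sketched as follows. This is an illustrative re-implementation of the described check, not the library's exact code; the GPT-2 checkpoint is just an example of a tokeniser that merges the leading space into the label token.

```python
from transformers import AutoTokenizer


def prompts_should_be_stripped_sketch(labels: list[str], tokeniser) -> bool:
    """Illustrative version of the check described above."""
    colon_space_ids = tokeniser(": ", add_special_tokens=False).input_ids
    for label in labels:
        label_ids = tokeniser(f": {label}", add_special_tokens=False).input_ids
        # If ": <label>" does not start with the tokens of ": ", the space has
        # been merged into the label token, so the prompts should be stripped.
        if label_ids[: len(colon_space_ids)] != colon_space_ids:
            return True
    return False


tokeniser = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint
print(prompts_should_be_stripped_sketch(["positive", "negative"], tokeniser))  # True
```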

source should_prefix_space_be_added_to_labels(labels_to_be_generated: c.Sequence[str], tokeniser: PreTrainedTokenizer) → bool

Determine if we should add a prefix space to the labels.

This is the case if the prompts are stripped and the tokeniser doesn't automatically add prefix whitespaces to the labels.

Parameters

  • labels_to_be_generated : c.Sequence[str] The labels that are to be generated.

  • tokeniser : PreTrainedTokenizer The tokeniser used to tokenise the labels.

Returns

  • bool Whether we should add a prefix space to the labels.
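
Example

The two checks are used together: the prefix space is only needed when the prompts are stripped and the tokeniser does not add the leading whitespace itself. A minimal sketch, assuming an example checkpoint and example labels:

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import (
    should_prefix_space_be_added_to_labels,
    should_prompts_be_stripped,
)

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint
labels = ["positive", "negative"]  # example labels

strip_prompts = should_prompts_be_stripped(labels, tokeniser)
add_prefix_space = should_prefix_space_be_added_to_labels(labels, tokeniser)

# Manually prefix the labels with a space when the tokeniser will not do it.
if strip_prompts and add_prefix_space:
    labels = [" " + label for label in labels]
```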

source get_bos_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]

Get the beginning-of-sequence token from a tokeniser.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

Returns

  • tuple[str, int] | tuple[None, None] A pair (token, token_id) representing the beginning-of-sequence token and its token ID, or (None, None) if no BOS token is found.

source get_eos_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]

Get the end-of-sequence token from a tokeniser.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

Returns

  • tuple[str, int] | tuple[None, None] A pair (token, token_id) representing the end-of-sequence token and its token ID, or (None, None) if no EOS token is found.

source get_pad_token(tokeniser: PreTrainedTokenizer) → tuple[str, int] | tuple[None, None]

Get the padding token from a tokeniser.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

Returns

  • tuple[str, int] | tuple[None, None] A pair (token, token_id) representing the padding token and its token ID, or (None, None) if no padding token is found.
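
Example

The three getters share the same return convention, so one sketch covers them all; the checkpoint is just an example.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_bos_token, get_eos_token, get_pad_token

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint

# Each getter returns (token, token_id), or (None, None) if the tokeniser
# has no such token.
for name, getter in [("BOS", get_bos_token), ("EOS", get_eos_token), ("PAD", get_pad_token)]:
    token, token_id = getter(tokeniser)
    if token is None:
        print(f"{name}: not found")
    else:
        print(f"{name}: {token!r} (id {token_id})")
```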

source get_end_of_chat_token_ids(tokeniser: PreTrainedTokenizer, generative_type: GenerativeType | None) → c.Sequence[int] | None

Get the end-of-chat token IDs for chat models.

This is only relevant for tokenisers with a chat template.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

  • generative_type : GenerativeType | None The generative type, or None if not available.

Returns

  • c.Sequence[int] | None The token IDs used to end chats, or None if the tokeniser does not have a chat template or if no end-of-chat token could be found.

Raises

  • e The underlying exception, re-raised if the end-of-chat token IDs could not be determined from the chat template.
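
Example

A minimal sketch. The chat-capable checkpoint is an assumption; any tokeniser that ships a chat template will do, and None is returned otherwise.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import get_end_of_chat_token_ids

# Assumes a checkpoint whose tokeniser has a chat template.
tokeniser = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

end_ids = get_end_of_chat_token_ids(tokeniser=tokeniser, generative_type=None)
if end_ids is not None:
    # Decode the IDs to see which string ends a chat turn for this model.
    print(tokeniser.decode(list(end_ids)))
```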

source get_first_label_token_mapping(dataset_config: DatasetConfig, model_config: ModelConfig, tokeniser: PreTrainedTokenizer | None, generative_type: GenerativeType | None, log_metadata: bool) → dict[str, str] | bool

Get a mapping from each label to its first token, used to check whether the model should output scores.

Parameters

  • dataset_config : DatasetConfig The dataset configuration.

  • model_config : ModelConfig The model configuration.

  • tokeniser : PreTrainedTokenizer | None The tokeniser, or None if not available.

  • generative_type : GenerativeType | None The generative type, or None if not available.

  • log_metadata : bool Whether to log metadata.

Returns

  • dict[str, str] | bool A mapping from each label to the first token in that label, or alternatively a Boolean value indicating whether the model should output scores (if the mapping is returned then the model will always output scores).
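
Example

A sketch of how the return value might be consumed. Constructing the configuration objects is out of scope here, so dataset_config, model_config and tokeniser are assumed to have been obtained elsewhere in a benchmark run.

```python
from euroeval.tokenisation_utils import get_first_label_token_mapping

# dataset_config, model_config and tokeniser are assumed to exist already.
result = get_first_label_token_mapping(
    dataset_config=dataset_config,
    model_config=model_config,
    tokeniser=tokeniser,
    generative_type=None,
    log_metadata=False,
)

if isinstance(result, dict):
    # A mapping was returned, so the model will always output scores.
    for label, first_token in result.items():
        print(f"{label} -> {first_token!r}")
else:
    print(f"Model should output scores: {result}")
```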

source has_chat_template(tokeniser: PreTrainedTokenizer) → bool

Check if a tokeniser has a chat template.

Parameters

  • tokeniser : PreTrainedTokenizer The tokeniser.

Returns

  • bool Whether the tokeniser has a chat template.
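
Example

A minimal sketch; plain GPT-2 ships without a chat template, so the call returns False here.

```python
from transformers import AutoTokenizer

from euroeval.tokenisation_utils import has_chat_template

tokeniser = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint
print(has_chat_template(tokeniser))  # False, as GPT-2 has no chat template
```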

source apply_chat_template(conversation: c.Sequence[dict[str, str]], tokeniser: PreTrainedTokenizer, tokenise: bool, add_generation_prompt: bool, **extra_kwargs) → str | c.Sequence[int]

Apply the chat template to a prompt.

Parameters

  • conversation : c.Sequence[dict[str, str]] The conversation to apply the chat template to.

  • tokeniser : PreTrainedTokenizer The tokeniser.

  • tokenise : bool Whether to tokenise the resulting prompt, returning a list of token IDs instead of a string.

  • add_generation_prompt : bool Whether to add a generation prompt at the end of the conversation. This is only relevant for regular Hugging Face tokenisers, as Mistral tokenisers always add a generation prompt.

  • **extra_kwargs Extra keyword arguments to pass to the tokeniser's apply_chat_template method. Only relevant for regular Hugging Face tokenisers.

Returns

  • str | c.Sequence[int] The prompt with the chat template applied, either as a string or a list of token IDs, depending on the value of tokenise.

Raises

  • InvalidModel If the tokeniser does not have a chat template.
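
Example

A minimal sketch, assuming a chat-capable checkpoint and that InvalidModel lives in euroeval.exceptions.

```python
from transformers import AutoTokenizer

from euroeval.exceptions import InvalidModel  # assumed import location
from euroeval.tokenisation_utils import apply_chat_template

tokeniser = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

conversation = [{"role": "user", "content": "What is the capital of Denmark?"}]

try:
    prompt = apply_chat_template(
        conversation=conversation,
        tokeniser=tokeniser,
        tokenise=False,  # return a string rather than token IDs
        add_generation_prompt=True,
    )
    print(prompt)
except InvalidModel:
    print("The tokeniser does not have a chat template.")
```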