euroeval.task_group_utils.sequence_classification

source module euroeval.task_group_utils.sequence_classification

Utility functions related to the sequence-classification task group.

Functions

source compute_metrics(model_outputs_and_labels: tuple[Predictions, Labels] | EvalPrediction, dataset_config: DatasetConfig, dataset: Dataset) → dict[str, float]

Compute the metrics needed for evaluation.

Parameters

  • model_outputs_and_labels : tuple[Predictions, Labels] | EvalPrediction The first sequence contains the model outputs and the second sequence contains the true labels.

  • dataset_config : DatasetConfig The configuration of the dataset.

  • dataset : Dataset The dataset used for evaluation. This is only used in case any additional metadata is used to compute the metrics.

Returns

  • dict[str, float] A dictionary with the names of the metrics as keys and the metric values as values.
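A minimal sketch of the input/output contract follows. The toy_compute_metrics function, the example predictions, and the chosen metrics (Matthews correlation and macro-averaged F1) are assumptions for illustration; the real function reads its metric definitions from the DatasetConfig rather than hard-coding them.

    # Toy stand-in for compute_metrics, showing the contract only: take
    # (predictions, labels) and return a dict mapping metric names to floats.
    # The metrics used here (MCC, macro-F1) are assumptions for the example.
    from sklearn.metrics import f1_score, matthews_corrcoef

    def toy_compute_metrics(
        model_outputs_and_labels: tuple[list[int], list[int]],
    ) -> dict[str, float]:
        predictions, labels = model_outputs_and_labels
        return {
            "mcc": matthews_corrcoef(y_true=labels, y_pred=predictions),
            "macro_f1": f1_score(y_true=labels, y_pred=predictions, average="macro"),
        }

    print(toy_compute_metrics(([0, 2, 1, 1], [0, 2, 2, 1])))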

source extract_labels_from_generation(input_batch: dict[str, list], model_output: GenerativeModelOutput, dataset_config: DatasetConfig, first_label_token_mapping: dict[str, str] | bool) → list[str]

Extract the predicted labels from the generated output.

Parameters

  • input_batch : dict[str, list] The input batch, where the keys are the feature names and the values are lists with the feature values.

  • model_output : GenerativeModelOutput The raw generated output of the model.

  • dataset_config : DatasetConfig The configuration of the dataset.

  • first_label_token_mapping : dict[str, str] | bool A mapping from labels to the first token in each label, or alternatively a Boolean value indicating whether the model should output scores (if a mapping is provided, the model will always output scores).

Returns

  • list[str] The predicted labels.

Raises

  • InvalidBenchmark If the task requires log probabilities but the model did not output them, or if the model output log probabilities but no first label token mapping was provided.
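The sketch below illustrates the text-matching path of label extraction, assuming the model returned plain generated strings rather than scores. The toy_extract_labels function, the candidate labels, and the simple prefix matching are assumptions for illustration; the real function also handles a logprob-based path via the first label token mapping (see get_closest_logprobs_labels below).

    # Illustrative sketch: map raw generated text to candidate labels by
    # normalising and prefix-matching. Candidate labels and the matching
    # strategy here are assumptions, not the library's exact behaviour.
    def toy_extract_labels(generated_texts: list[str], candidate_labels: list[str]) -> list[str]:
        labels = []
        for text in generated_texts:
            text = text.strip().lower()
            # Pick the first candidate label that the generation starts with,
            # falling back to the first candidate if nothing matches.
            match = next(
                (label for label in candidate_labels if text.startswith(label.lower())),
                candidate_labels[0],
            )
            labels.append(match)
        return labels

    print(toy_extract_labels(["Positive, I think", "negative"], ["positive", "negative"]))
    # ['positive', 'negative']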

source get_closest_logprobs_labels(generation_logprobs: list[list[list[tuple[str, float]]]], dataset_config: DatasetConfig, first_label_token_mapping: dict[str, str] | t.Literal[True]) → list[str] | None

Get the labels with the highest predicted logprob value.

In case a candidate label is split into multiple tokens, we only use the first token to compute the logprob value. E.g., if the candidate label "positive" is tokenised as ["pos", "itive"], we only use the logprob value of "pos" to represent the logprob value of the entire label.

Parameters

  • generation_logprobs : list[list[list[tuple[str, float]]]] The logprobs of the generated tokens, for all samples in the batch. Of shape (batch_size, num_tokens, num_logprobs).

  • dataset_config : DatasetConfig The configuration of the dataset.

  • first_label_token_mapping : dict[str, str] | t.Literal[True] A mapping from labels to the first token in each label, or alternatively a True value indicating that the model should output logprobs.

Returns

  • list[str] | None The predicted labels, or None if labels could not be extracted.

Raises

  • InvalidBenchmark If no candidate label can be found for any of the generated labels.
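The following sketch shows the first-token matching described above: for each sample, take the logprobs of the first generated token and pick the candidate label whose first token scores highest. The toy_closest_logprobs_labels function, the example batch, and the label-to-token mapping are assumptions for illustration only.

    # Sketch of first-token logprob matching. generation_logprobs has shape
    # (batch_size, num_tokens, num_logprobs), each entry a (token, logprob) pair.
    def toy_closest_logprobs_labels(
        generation_logprobs: list[list[list[tuple[str, float]]]],
        first_label_token_mapping: dict[str, str],
    ) -> list[str]:
        labels = []
        for sample_logprobs in generation_logprobs:
            # Logprobs of the first generated token, as a token -> logprob dict.
            first_token_logprobs = dict(sample_logprobs[0])
            # Choose the label whose first token has the highest logprob.
            best_label = max(
                first_label_token_mapping,
                key=lambda label: first_token_logprobs.get(
                    first_label_token_mapping[label], float("-inf")
                ),
            )
            labels.append(best_label)
        return labels

    batch_logprobs = [[[("pos", -0.1), ("neg", -2.3)]], [[("neg", -0.4), ("pos", -1.1)]]]
    mapping = {"positive": "pos", "negative": "neg"}
    print(toy_closest_logprobs_labels(batch_logprobs, mapping))
    # ['positive', 'negative']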