euroeval.task_group_utils.sequence_classification

source module euroeval.task_group_utils.sequence_classification

Utility functions related to the sequence-classification task group.

Functions

source compute_metrics(model_outputs_and_labels: tuple[Predictions, Labels] | EvalPrediction, dataset_config: DatasetConfig, dataset: Dataset) → dict[str, float]

Compute the metrics needed for evaluation.

Parameters

  • model_outputs_and_labels : tuple[Predictions, Labels] | EvalPrediction The first sequence contains the model outputs and the second sequence contains the true labels.

  • dataset_config : DatasetConfig The configuration of the dataset.

  • dataset : Dataset The dataset used for evaluation. This is only used in case any additional metadata is used to compute the metrics.

Returns

  • dict[str, float] A dictionary with the names of the metrics as keys and the metric values as values.
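A minimal sketch of the input/output contract follows. The toy_compute_metrics function, the example predictions, and the chosen metrics (Matthews correlation and macro-averaged F1) are assumptions for illustration; the real function reads its metric definitions from the DatasetConfig rather than hard-coding them.

    # Toy stand-in for compute_metrics, showing the contract only: take
    # (predictions, labels) and return a dict mapping metric names to floats.
    # The metrics used here (MCC, macro-F1) are assumptions for the example.
    from sklearn.metrics import f1_score, matthews_corrcoef

    def toy_compute_metrics(
        model_outputs_and_labels: tuple[list[int], list[int]],
    ) -> dict[str, float]:
        predictions, labels = model_outputs_and_labels
        return {
            "mcc": matthews_corrcoef(y_true=labels, y_pred=predictions),
            "macro_f1": f1_score(y_true=labels, y_pred=predictions, average="macro"),
        }

    print(toy_compute_metrics(([0, 2, 1, 1], [0, 2, 2, 1])))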

source extract_labels_from_generation(input_batch: dict[str, list], model_output: GenerativeModelOutput, dataset_config: DatasetConfig, first_label_token_mapping: dict[str, str] | bool) → list[str]

Extract the predicted labels from the generated output.

Parameters

  • input_batch : dict[str, list] The input batch, where the keys are the feature names and the values are lists with the feature values.

  • model_output : GenerativeModelOutput The raw generated output of the model.

  • dataset_config : DatasetConfig The configuration of the dataset.

  • first_label_token_mapping : dict[str, str] | bool A mapping from labels to the first token in each label, or alternatively a Boolean value indicating whether the model should output scores (if a mapping is provided, the model will always output scores).

Returns

  • list[str] The predicted labels.

Raises

  • InvalidBenchmark If the task requires log probabilities but the model did not output them, or if the model output log probabilities but no first label token mapping was provided.
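The sketch below illustrates the text-matching path of label extraction, assuming the model returned plain generated strings rather than scores. The toy_extract_labels function, the candidate labels, and the simple prefix matching are assumptions for illustration; the real function also handles a logprob-based path via the first label token mapping (see get_closest_logprobs_labels below).

    # Illustrative sketch: map raw generated text to candidate labels by
    # normalising and prefix-matching. Candidate labels and the matching
    # strategy here are assumptions, not the library's exact behaviour.
    def toy_extract_labels(generated_texts: list[str], candidate_labels: list[str]) -> list[str]:
        labels = []
        for text in generated_texts:
            text = text.strip().lower()
            # Pick the first candidate label that the generation starts with,
            # falling back to the first candidate if nothing matches.
            match = next(
                (label for label in candidate_labels if text.startswith(label.lower())),
                candidate_labels[0],
            )
            labels.append(match)
        return labels

    print(toy_extract_labels(["Positive, I think", "negative"], ["positive", "negative"]))
    # ['positive', 'negative']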

source get_closest_logprobs_labels(generation_logprobs: list[list[list[tuple[str, float]]]], dataset_config: DatasetConfig, first_label_token_mapping: dict[str, str] | t.Literal[True]) → list[str] | None

Get the labels with the highest predicted logprob value.

In case a candidate label is split into multiple tokens, we only use the first token to compute the logprob value. E.g., if the candidate label "positive" is tokenised as ["pos", "itive"], we only use the logprob value of "pos" to represent the logprob value of the entire label.

Parameters

  • generation_logprobs : list[list[list[tuple[str, float]]]] The logprobs of the generated tokens, for all samples in the batch. Of shape (batch_size, num_tokens, num_logprobs).

  • dataset_config : DatasetConfig The configuration of the dataset.

  • first_label_token_mapping : dict[str, str] | t.Literal[True] A mapping from labels to the first token in each label, or alternatively a True value indicating that the model should output logprobs.

Returns

  • list[str] | None The predicted labels, or None if labels could not be extracted.

Raises

  • InvalidBenchmark If no candidate label can be found for any of the generated labels.
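The following sketch shows the first-token matching described above: for each sample, take the logprobs of the first generated token and pick the candidate label whose first token scores highest. The toy_closest_logprobs_labels function, the example batch, and the label-to-token mapping are assumptions for illustration only.

    # Sketch of first-token logprob matching. generation_logprobs has shape
    # (batch_size, num_tokens, num_logprobs), each entry a (token, logprob) pair.
    def toy_closest_logprobs_labels(
        generation_logprobs: list[list[list[tuple[str, float]]]],
        first_label_token_mapping: dict[str, str],
    ) -> list[str]:
        labels = []
        for sample_logprobs in generation_logprobs:
            # Logprobs of the first generated token, as a token -> logprob dict.
            first_token_logprobs = dict(sample_logprobs[0])
            # Choose the label whose first token has the highest logprob.
            best_label = max(
                first_label_token_mapping,
                key=lambda label: first_token_logprobs.get(
                    first_label_token_mapping[label], float("-inf")
                ),
            )
            labels.append(best_label)
        return labels

    batch_logprobs = [[[("pos", -0.1), ("neg", -2.3)]], [[("neg", -0.4), ("pos", -1.1)]]]
    mapping = {"positive": "pos", "negative": "neg"}
    print(toy_closest_logprobs_labels(batch_logprobs, mapping))
    # ['positive', 'negative']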