euroeval.task_group_utils.token_classification

source module euroeval.task_group_utils.token_classification

Utility functions related to the token-classification task group.

Functions

source compute_metrics(model_outputs_and_labels: tuple[Predictions, Labels] | EvalPrediction, has_misc_tags: bool, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig) → dict[str, float]

Compute the metrics needed for evaluation.

Parameters

  • model_outputs_and_labels : tuple[Predictions, Labels] | EvalPrediction The model outputs and labels: the first element contains the probability predictions and the second contains the true labels.

  • has_misc_tags : bool Whether the dataset has MISC tags.

  • dataset_config : DatasetConfig The configuration of the dataset.

  • benchmark_config : BenchmarkConfig The configuration of the benchmark.

Returns

  • dict[str, float] A dictionary with the names of the metrics as keys and the metric values as values.
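
As a hedged illustration only: EuroEval's token-classification datasets are typically scored with entity-level micro-averaged F1 in the style of seqeval, and the exact metric objects come from dataset_config, so the sketch below uses the Hugging Face evaluate library's seqeval metric on made-up label sequences rather than the actual implementation.

    # Hedged sketch of the kind of scoring compute_metrics performs; the
    # real metrics are configured in dataset_config.
    import evaluate

    seqeval = evaluate.load("seqeval")

    predictions = [["B-PER", "I-PER", "O", "B-MISC"]]
    labels = [["B-PER", "I-PER", "O", "B-ORG"]]

    # Entity-level micro-averaged scores over the whole batch.
    scores = seqeval.compute(predictions=predictions, references=labels)
    print(scores["overall_f1"])

    # When the dataset has no MISC tags (has_misc_tags=False), predicted
    # MISC tags can never be correct, so a MISC-free variant converts them
    # to "O" before scoring.
    no_misc = [["O" if t.endswith("MISC") else t for t in seq] for seq in predictions]
    print(seqeval.compute(predictions=no_misc, references=labels)["overall_f1"])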

source extract_labels_from_generation(input_batch: dict[str, list], model_output: GenerativeModelOutput, dataset_config: DatasetConfig) → list[t.Any]

Extract the predicted labels from the generated output.

Parameters

  • input_batch : dict[str, list] The input batch, where the keys are the feature names and the values are lists with the feature values.

  • model_output : GenerativeModelOutput The raw generated output of the model.

  • dataset_config : DatasetConfig The configuration of the dataset.

Returns

  • list[t.Any] The predicted labels.
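
To make the task concrete, here is a hypothetical sketch of this kind of extraction: it parses a JSON answer from a generative model and aligns the named entities with the input tokens to produce BIO labels. The helper name, the tag mapping, and the assumed output format are all invented for the example; EuroEval's actual parsing is more involved.

    import json

    def parse_ner_output(tokens: list[str], raw_output: str) -> list[str]:
        """Hypothetical: map a JSON answer such as
        {"person": ["Alice"], "location": ["Paris"]} onto BIO labels."""
        labels = ["O"] * len(tokens)
        try:
            entities = json.loads(raw_output)
        except json.JSONDecodeError:
            return labels  # unparsable output scores as all "O"
        tag_map = {"person": "PER", "location": "LOC"}  # assumed mapping
        for key, names in entities.items():
            tag = tag_map.get(key.lower())
            if tag is None:
                continue
            for name in names:
                words = [w.lower() for w in name.split()]
                for i in range(len(tokens) - len(words) + 1):
                    if [t.lower() for t in tokens[i : i + len(words)]] == words:
                        labels[i] = f"B-{tag}"
                        for j in range(i + 1, i + len(words)):
                            labels[j] = f"I-{tag}"
                        break
        return labels

    tokens = ["Alice", "visited", "Paris"]
    print(parse_ner_output(tokens, '{"person": ["Alice"], "location": ["Paris"]}'))
    # ['B-PER', 'O', 'B-LOC']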

source tokenize_and_align_labels(examples: dict, tokenizer: PreTrainedTokenizer, label2id: dict[str, int]) → BatchEncoding

Tokenise all texts and align the labels with them.

Parameters

  • examples : dict The examples to be tokenised.

  • tokenizer : PreTrainedTokenizer A pretrained tokenizer.

  • label2id : dict[str, int] A dictionary that converts NER tags to IDs.

Returns

  • BatchEncoding A dictionary containing the tokenized data as well as labels.
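
The alignment follows the standard Hugging Face recipe for token classification: tokenise the pre-split words, then use word_ids() to copy each word's label onto its first sub-token and mask all other positions with -100 so the loss ignores them. Below is a minimal sketch of that recipe (model name and label set chosen for illustration), not the exact EuroEval implementation.

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    label2id = {"O": 0, "B-PER": 1, "I-PER": 2}

    examples = {
        "tokens": [["Alice", "lives", "here"]],
        "labels": [["B-PER", "O", "O"]],
    }

    encoded = tokenizer(examples["tokens"], is_split_into_words=True, truncation=True)

    all_labels = []
    for i, word_labels in enumerate(examples["labels"]):
        word_ids = encoded.word_ids(batch_index=i)
        previous, ids = None, []
        for word_id in word_ids:
            if word_id is None:
                ids.append(-100)  # special tokens carry no label
            elif word_id != previous:
                ids.append(label2id[word_labels[word_id]])
            else:
                ids.append(-100)  # only the first sub-token of a word is labelled
            previous = word_id
        all_labels.append(ids)

    encoded["labels"] = all_labels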

source handle_unk_tokens(tokenizer: PreTrainedTokenizer, tokens: list[str], words: list[str]) → list[str]

Replace unknown tokens in the token list with the corresponding words.

Parameters

  • tokenizer : PreTrainedTokenizer The tokenizer used to tokenize the words.

  • tokens : list[str] The list of tokens.

  • words : list[str] The list of words.

Returns

  • list[str] The list of tokens with unknown tokens replaced by the corresponding word.
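
A hedged sketch of the idea: find which words the tokeniser maps to its unknown token, then substitute those words back into the token list from left to right. The sketch assumes each such word produces exactly one unknown token; the real implementation has to handle the alignment more carefully.

    from transformers import AutoTokenizer, PreTrainedTokenizer

    def replace_unk_tokens(
        tokenizer: PreTrainedTokenizer, tokens: list[str], words: list[str]
    ) -> list[str]:
        # Words whose tokenisation contains the unknown token, in order.
        unk = tokenizer.unk_token
        unk_words = [w for w in words if unk in tokenizer.tokenize(w)]
        replaced, idx = [], 0
        for token in tokens:
            if token == unk and idx < len(unk_words):
                replaced.append(unk_words[idx])
                idx += 1
            else:
                replaced.append(token)
        return replaced

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    words = ["Alice", "loves", "🦆"]  # the emoji falls outside BERT's vocabulary
    tokens = tokenizer.tokenize(" ".join(words))  # ['Alice', 'loves', '[UNK]']
    print(replace_unk_tokens(tokenizer, tokens, words))  # ['Alice', 'loves', '🦆']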