source module euroeval.task_utils.question_answering

Utility functions related to the question-answering task group.



source class QuestionAnsweringTrainer(model: PreTrainedModel | nn.Module, processing_class: PreTrainedTokenizerBase, args: TrainingArguments, train_dataset: Dataset, eval_dataset: Dataset, compute_metrics: c.Callable[[EvalPrediction], dict[str, float]], callbacks: list[TrainerCallback], data_collator: c.Callable)

Bases : Trainer

Trainer subclass for question answering tasks.

Initialize the trainer.


  • evaluate Evaluate the model on the given dataset.

source method QuestionAnsweringTrainer.evaluate(eval_dataset: Dataset | None = None, orig_eval_dataset: Dataset | None = None, ignore_keys: list[str] | None = None, metric_key_prefix: str = 'eval') -> dict[str, float] | None

Evaluate the model on the given dataset.


  • eval_dataset : Dataset | None

    The dataset to evaluate on. If None, then use the stored evaluation dataset.

  • orig_eval_dataset : Dataset | None

    The original evaluation dataset, before any postprocessing. If None, then use the stored original evaluation dataset.

  • ignore_keys : list[str] | None

    The keys to ignore when computing the metrics.

  • metric_key_prefix : str

    The prefix to use for the metric keys.


  • dict[str, float] | None The metrics computed on the evaluation dataset.

source compute_metrics(model_outputs_and_labels: tuple[Predictions, Labels] | EvalPrediction, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig) -> dict[str, float]

Compute the metrics needed for evaluation.


  • model_outputs_and_labels : tuple[Predictions, Labels] | EvalPrediction

    A pair of sequences: the first contains the model outputs and the second contains the true labels.

  • dataset_config : DatasetConfig

    The configuration of the dataset.

  • benchmark_config : BenchmarkConfig

    The configuration of the benchmark.


  • dict[str, float] A dictionary with the names of the metrics as keys and the metric values as values.
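To illustrate the shape of the input and of the returned dictionary, here is a minimal sketch that unpacks a (predictions, labels) pair and reports two common question-answering metrics. The metric names `em` and `f1`, the plain-string inputs, and the helper `token_f1` are assumptions made for this example; the page does not state which metrics the real function computes, and the `dataset_config` and `benchmark_config` arguments are left out.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between two strings (hypothetical helper)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # If either side is empty, the score is 1 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def compute_metrics_sketch(
    model_outputs_and_labels: tuple[list[str], list[str]],
) -> dict[str, float]:
    """Return metric names mapped to values; illustrative only."""
    # First sequence: model outputs. Second sequence: true labels.
    predictions, labels = model_outputs_and_labels
    exact = [
        float(p.strip().lower() == l.strip().lower())
        for p, l in zip(predictions, labels)
    ]
    f1 = [token_f1(p, l) for p, l in zip(predictions, labels)]
    n = len(labels)
    return {"em": 100 * sum(exact) / n, "f1": 100 * sum(f1) / n}


metrics = compute_metrics_sketch((["Paris", "the red car"], ["paris", "red car"]))
```

In this toy input the first prediction matches exactly and the second only partially, so the two metrics differ.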

source extract_labels_from_generation(input_batch: dict[str, list], model_output: GenerativeModelOutput) -> list[t.Any]

Extract the predicted labels from the generated output.


  • input_batch : dict[str, list]

    The input batch, where the keys are the feature names and the values are lists with the feature values.

  • model_output : GenerativeModelOutput

    The raw generated output of the model.


  • list[t.Any] The predicted labels.

source prepare_train_examples(examples: BatchEncoding, tokenizer: PreTrainedTokenizer) -> BatchEncoding

Prepare the features for training.


  • examples : BatchEncoding

    The examples to prepare.

  • tokenizer : PreTrainedTokenizer

    The tokenizer to use to prepare the examples.


  • BatchEncoding The prepared examples.
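A central step when preparing extractive question-answering features for training is converting the character span of the answer into start and end token positions. The sketch below shows one common way to do this from a tokenizer's offset mapping and sequence IDs. The function name `answer_token_span`, the convention that context tokens have sequence ID 1, and the fallback to the CLS position when the answer is not inside the feature are all assumptions; the actual preparation in this module may differ.

```python
from typing import Optional, Sequence


def answer_token_span(
    offsets: Sequence[tuple[int, int]],
    sequence_ids: Sequence[Optional[int]],
    answer_start: int,
    answer_end: int,
    cls_position: int = 0,
) -> tuple[int, int]:
    """Map a character span in the context to (start_token, end_token)."""
    # Assumption: tokens belonging to the context have sequence ID 1.
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    first, last = context_tokens[0], context_tokens[-1]

    # If the answer is not fully inside this feature, label it with CLS.
    if offsets[first][0] > answer_start or offsets[last][1] < answer_end:
        return cls_position, cls_position

    # Last context token that starts at or before the answer start.
    start_token = max(i for i in context_tokens if offsets[i][0] <= answer_start)
    # First context token that ends at or after the answer end.
    end_token = min(i for i in context_tokens if offsets[i][1] >= answer_end)
    return start_token, end_token


# Toy feature: [CLS] Capital ? [SEP] Paris is nice [SEP]
offsets = [(0, 0), (0, 7), (7, 8), (0, 0), (0, 5), (6, 8), (9, 13), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
span = answer_token_span(offsets, sequence_ids, answer_start=0, answer_end=5)
```

Here the answer "Paris" occupies characters 0 to 5 of the context, which corresponds to the single context token at position 4.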

source prepare_test_examples(examples: BatchEncoding, tokenizer: PreTrainedTokenizer) -> BatchEncoding

Prepare test examples.


  • examples : BatchEncoding

    Dictionary of test examples.

  • tokenizer : PreTrainedTokenizer

    The tokenizer used to preprocess the examples.


  • BatchEncoding The prepared test examples.
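At test time the answer must be mapped back from token positions to characters in the context, so a common preparation step is to blank out the offsets of every token that is not part of the context. The following sketch shows that step only. The function name `mask_non_context_offsets` and the convention that context tokens have sequence ID 1 are assumptions for illustration, not a description of this module's code.

```python
from typing import Optional, Sequence


def mask_non_context_offsets(
    offsets: Sequence[tuple[int, int]],
    sequence_ids: Sequence[Optional[int]],
    context_id: int = 1,
) -> list[Optional[tuple[int, int]]]:
    """Keep offsets for context tokens and replace all others with None."""
    return [
        offset if seq_id == context_id else None
        for offset, seq_id in zip(offsets, sequence_ids)
    ]


# Toy feature: [CLS] Capital ? [SEP] Paris is nice [SEP]
offsets = [(0, 0), (0, 7), (7, 8), (0, 0), (0, 5), (6, 8), (9, 13), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]
masked = mask_non_context_offsets(offsets, sequence_ids)
```

With the offsets masked this way, a later answer search can reject any span that starts or ends on a question or special token simply by checking for `None`.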

source postprocess_predictions_and_labels(predictions: list, dataset: Dataset, prepared_dataset: Dataset, cls_token_index: int) -> tuple[list[dict], list[dict]]

Postprocess the predictions and labels, to allow easier metric computation.


  • predictions : list

    A pair of (start_logits, end_logits) predictions.

  • dataset : Dataset

    The dataset containing the examples.

  • prepared_dataset : Dataset

    The dataset containing the prepared examples.

  • cls_token_index : int

    The index of the CLS token.


  • tuple[list[dict], list[dict]] The postprocessed predictions and labels.
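Because a long context can be split into several prepared features, postprocessing typically starts by grouping feature indices per original example and ends by emitting one prediction and one label dictionary per example. The sketch below illustrates both steps. The SQuAD-style keys (`id`, `prediction_text`, `answers`) and both function names are assumptions for this example; the page only states that two lists of dictionaries are returned.

```python
from collections import defaultdict


def group_feature_indices(feature_example_ids: list[str]) -> dict[str, list[int]]:
    """Map each example ID to the indices of the features derived from it."""
    groups: dict[str, list[int]] = defaultdict(list)
    for feature_index, example_id in enumerate(feature_example_ids):
        groups[example_id].append(feature_index)
    return dict(groups)


def to_predictions_and_labels(
    examples: list[dict], best_answers: dict[str, str]
) -> tuple[list[dict], list[dict]]:
    """Build one prediction dict and one label dict per example."""
    # Assumed SQuAD-style layout; the real keys may differ.
    predictions = [
        {"id": ex["id"], "prediction_text": best_answers[ex["id"]]}
        for ex in examples
    ]
    labels = [{"id": ex["id"], "answers": ex["answers"]} for ex in examples]
    return predictions, labels


# Example "q1" was split into two features, "q2" into one.
groups = group_feature_indices(["q1", "q1", "q2"])
examples = [
    {"id": "q1", "answers": {"text": ["Paris"], "answer_start": [0]}},
    {"id": "q2", "answers": {"text": [], "answer_start": []}},
]
predictions, labels = to_predictions_and_labels(examples, {"q1": "Paris", "q2": ""})
```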

source find_best_answer(all_start_logits: np.ndarray, all_end_logits: np.ndarray, prepared_dataset: Dataset, feature_indices: list[int], context: str, max_answer_length: int, num_best_logits: int, min_null_score: float, cls_token_index: int) -> str

Find the best answer for a given example.


  • all_start_logits : np.ndarray

    The start logits for all the features.

  • all_end_logits : np.ndarray

    The end logits for all the features.

  • prepared_dataset : Dataset

    The dataset containing the prepared examples.

  • feature_indices : list[int]

    The indices of the features associated with the current example.

  • context : str

    The context of the example.

  • max_answer_length : int

    The maximum length of the answer.

  • num_best_logits : int

    The number of best logits to consider.

  • min_null_score : float

    The minimum score an answer can have.

  • cls_token_index : int

    The index of the CLS token.


  • str The best answer for the example.
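The selection across features can be sketched as follows: collect the candidate answers from every feature of the example, take the highest-scoring one, and fall back to an empty string when nothing beats the "no answer" score. Computing that score as the sum of the start and end logits at the CLS position, returning an empty string for "no answer", and the names `null_score_for_feature` and `pick_best_answer` are assumptions made for this sketch.

```python
from typing import Sequence


def null_score_for_feature(
    start_logits: Sequence[float], end_logits: Sequence[float], cls_position: int
) -> float:
    """Score of predicting 'no answer' for one feature (assumed definition)."""
    return float(start_logits[cls_position] + end_logits[cls_position])


def pick_best_answer(
    candidates_per_feature: list[list[dict]], null_score: float
) -> str:
    """Return the text of the best candidate, or '' if none beats null_score."""
    all_candidates = [c for cands in candidates_per_feature for c in cands]
    if not all_candidates:
        return ""
    best = max(all_candidates, key=lambda c: c["score"])
    return best["text"] if best["score"] > null_score else ""


null_score = null_score_for_feature([1.0, 0.5], [0.5, 0.2], cls_position=0)
candidates = [
    [{"text": "Paris", "score": 9.0}],   # candidates from the first feature
    [{"text": "France", "score": 3.0}],  # candidates from the second feature
]
best = pick_best_answer(candidates, null_score)
```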

source find_valid_answers(start_logits: np.ndarray, end_logits: np.ndarray, offset_mapping: list[tuple[int, int]], context: str, max_answer_length: int, num_best_logits: int, min_null_score: float) -> list[dict]

Find the valid answers from the start and end indexes.


  • start_logits : np.ndarray

    The logits for the start of the answer.

  • end_logits : np.ndarray

    The logits for the end of the answer.

  • offset_mapping : list[tuple[int, int]]

    The offset mapping: a list with one pair of integers per token, giving the start and end character indices of that token in the original context.

  • context : str

    The context of the example.

  • max_answer_length : int

    The maximum length of the answer.

  • num_best_logits : int

    The number of best logits to consider. Note that this function will run in O(num_best_logits ^ 2) time.

  • min_null_score : float

    The minimum score an answer can have.


  • list[dict] A list of the valid answers, each being a dictionary with keys "text" and "score", the score being the sum of the start and end logits.
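The quadratic cost noted for `num_best_logits` comes from pairing every top start position with every top end position. The sketch below shows such a search in plain Python, using lists in place of NumPy arrays. Several details are assumptions rather than documented behaviour: `max_answer_length` is counted in tokens, tokens outside the context are marked with `None` in the offset mapping, and candidates scoring at or below `min_null_score` are discarded.

```python
from typing import Optional, Sequence


def find_valid_answers_sketch(
    start_logits: Sequence[float],
    end_logits: Sequence[float],
    offset_mapping: Sequence[Optional[tuple[int, int]]],
    context: str,
    max_answer_length: int,
    num_best_logits: int,
    min_null_score: float,
) -> list[dict]:
    """Illustrative n-best span search; not the module's implementation."""

    def top_indices(logits: Sequence[float]) -> list[int]:
        # Indices of the `num_best_logits` largest logits, best first.
        order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return order[:num_best_logits]

    answers = []
    # Nested loop over the top candidates: O(num_best_logits ** 2) pairs.
    for start in top_indices(start_logits):
        for end in top_indices(end_logits):
            # Skip tokens outside the context (assumed to be marked None).
            if offset_mapping[start] is None or offset_mapping[end] is None:
                continue
            # Skip spans that end before they start or are too long.
            if end < start or end - start + 1 > max_answer_length:
                continue
            score = float(start_logits[start] + end_logits[end])
            # Assumption: the null score acts as a strict lower bound.
            if score <= min_null_score:
                continue
            start_char = offset_mapping[start][0]
            end_char = offset_mapping[end][1]
            answers.append({"text": context[start_char:end_char], "score": score})
    return answers


answers = find_valid_answers_sketch(
    start_logits=[0.1, 5.0, 0.2, 0.3],
    end_logits=[0.1, 4.0, 0.5, 0.2],
    offset_mapping=[None, (0, 5), (6, 8), (9, 12)],
    context="Paris is the capital of France.",
    max_answer_length=2,
    num_best_logits=2,
    min_null_score=0.0,
)
```

In this toy call the two best start positions are paired with the two best end positions, and only the two pairs that form a forward span within the length limit survive.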