euroeval.generation

Functions related to text generation with models.

Functions

generate — Evaluate a model on a dataset through generation.
generate_single_iteration — Evaluate a model on a dataset in a single iteration through generation.
debug_log — Log inputs and outputs for debugging purposes.

generate(model: BenchmarkModule, datasets: list[DatasetDict], model_config: ModelConfig, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig) → list[dict[str, float]]

Evaluate a model on a dataset through generation.

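Parameters

model : BenchmarkModule — The model to evaluate.
datasets : list[DatasetDict] — The datasets to evaluate on.
model_config : ModelConfig — The configuration of the model.
dataset_config : DatasetConfig — The configuration of the dataset.
benchmark_config : BenchmarkConfig — The configuration of the benchmark.

Returns

list[dict[str, float]] — A list of dictionaries containing the scores for each metric.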
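Per the signature, generate returns a list[dict[str, float]]. As a minimal sketch, assuming each dictionary holds the scores of one evaluation iteration keyed by metric name, a caller could average the metrics as below. The helper mean_scores and the example scores are hypothetical and are not part of euroeval.

```python
from statistics import mean


def mean_scores(iteration_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each metric over all score dictionaries (hypothetical helper)."""
    if not iteration_scores:
        return {}
    # Assumes every dictionary contains the same metric names.
    metric_names = iteration_scores[0].keys()
    return {
        name: mean(scores[name] for scores in iteration_scores)
        for name in metric_names
    }


# Made-up scores standing in for the value returned by generate().
example_scores = [
    {"accuracy": 0.5, "f1": 0.25},
    {"accuracy": 1.0, "f1": 0.75},
]
averaged = mean_scores(example_scores)
```

Only the list-of-dictionaries shape is taken from the signature; how the library itself aggregates scores is not described here.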

generate_single_iteration(dataset: Dataset, model: BenchmarkModule, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig, cache: ModelCache) → dict[str, float]

Evaluate a model on a dataset in a single iteration through generation.
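The signature takes a cache argument of type ModelCache, but its behaviour is not documented on this page. The sketch below illustrates the general idea of cache-backed generation under a loud assumption: that the cache maps a prompt to a previously generated output, so only unseen prompts are sent to the model. The function generate_with_cache, the plain dict cache, and fake_generate are hypothetical stand-ins, not the euroeval API.

```python
from typing import Callable


def generate_with_cache(
    prompts: list[str],
    generate_fn: Callable[[list[str]], list[str]],
    cache: dict[str, str],
) -> list[str]:
    """Return one output per prompt, calling generate_fn only for uncached prompts."""
    # dict.fromkeys removes duplicate prompts while keeping their order.
    missing = [p for p in dict.fromkeys(prompts) if p not in cache]
    if missing:
        for prompt, output in zip(missing, generate_fn(missing)):
            cache[prompt] = output
    return [cache[p] for p in prompts]


calls: list[list[str]] = []


def fake_generate(batch: list[str]) -> list[str]:
    """Stand-in for a model: records each call and upper-cases the prompts."""
    calls.append(list(batch))
    return [p.upper() for p in batch]


cache: dict[str, str] = {}
first = generate_with_cache(["a", "b", "a"], fake_generate, cache)
second = generate_with_cache(["b", "c"], fake_generate, cache)
```

In this sketch the second call reaches the stand-in model only for "c", since "b" is already cached.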

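Parameters

dataset : Dataset — The dataset to evaluate on.
model : BenchmarkModule — The model to evaluate.
dataset_config : DatasetConfig — The configuration of the dataset.
benchmark_config : BenchmarkConfig — The configuration of the benchmark.
cache : ModelCache — The model cache.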

Returns

dict[str, float] — A dictionary containing the scores for each metric.

Raises

debug_log(batch: dict[str, t.Any], model_output: GenerativeModelOutput, extracted_labels: list[dict | str | list[str]], dataset_config: DatasetConfig) → None

Log inputs and outputs for debugging purposes.

Parameters

batch : dict[str, t.Any] — The batch of examples to evaluate on.
model_output : GenerativeModelOutput — The output of the model.
extracted_labels : list[dict | str | list[str]] — The extracted labels from the model output.
dataset_config : DatasetConfig — The configuration of the dataset.
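To illustrate the kind of output such a debugging helper might produce, here is a minimal sketch using the standard logging module. It assumes the batch stores its inputs under a "text" key and that the raw model outputs are available as plain strings; both are assumptions, as are the logger name and the helper format_debug_lines. The real layout of the batch and of GenerativeModelOutput is not shown on this page.

```python
import logging

logger = logging.getLogger("generation_debug")  # hypothetical logger name


def format_debug_lines(
    inputs: list[str],
    raw_outputs: list[str],
    extracted_labels: list[str],
) -> list[str]:
    """Build one readable line per example (hypothetical helper)."""
    return [
        f"Input: {text!r} | Raw output: {raw!r} | Extracted label: {label!r}"
        for text, raw, label in zip(inputs, raw_outputs, extracted_labels)
    ]


# Made-up batch; the "text" key is an assumption about the batch layout.
batch = {"text": ["Great film!", "Dull plot."]}
raw_outputs = ["positive.", " Negative"]
labels = ["positive", "negative"]

debug_lines = format_debug_lines(batch["text"], raw_outputs, labels)
for line in debug_lines:
    logger.debug(line)
```

Showing the raw output next to the extracted label makes it easy to spot cases where label extraction, rather than the model, caused a wrong prediction.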

Raises