euroeval.data_models
source module euroeval.data_models
Data models used in EuroEval.
Classes
-
Language — A benchmarkable language.
-
Task — A dataset task.
-
BenchmarkConfig — General benchmarking configuration, across datasets and models.
-
BenchmarkConfigParams — The parameters for the benchmark configuration.
-
BenchmarkResult — A benchmark result.
-
DatasetConfig — Configuration for a dataset.
-
ModelConfig — Configuration for a model.
-
PreparedModelInputs — The inputs to a model.
-
GenerativeModelOutput — The output of a generative model.
-
SingleGenerativeModelOutput — A single output of a generative model.
-
HFModelInfo — Information about a Hugging Face model.
-
PromptConfig — Configuration for task-specific prompting across languages.
source dataclass Language(code: str, name: str, _and_separator: str | None = field(repr=False, default=None), _or_separator: str | None = field(repr=False, default=None))
A benchmarkable language.
Attributes
-
code : str — The ISO 639-1 language code of the language.
-
name : str — The name of the language.
-
and_separator : optional — The word 'and' in the language.
-
or_separator : optional — The word 'or' in the language.
source property Language.and_separator: str
Get the word 'and' in the language.
Returns
-
str — The word 'and' in the language.
Raises
-
NotImplementedError — If and_separator is None.
source property Language.or_separator: str
Get the word 'or' in the language.
Returns
-
str — The word 'or' in the language.
Raises
-
NotImplementedError — If or_separator is None.
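A minimal usage sketch, assuming Language is importable from euroeval.data_models as documented here; the Danish separator words are illustrative values:

```python
from euroeval.data_models import Language

# A language with both separators provided.
danish = Language(code="da", name="Danish", _and_separator="og", _or_separator="eller")
print(danish.and_separator)  # "og"
print(danish.or_separator)   # "eller"

# If a separator was not provided, the corresponding property raises
# NotImplementedError rather than returning None.
bokmaal = Language(code="nb", name="Norwegian Bokmål")
try:
    bokmaal.and_separator
except NotImplementedError:
    print("No 'and' separator registered for this language.")
```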
source dataclass Task(name: str, task_group: TaskGroup, template_dict: dict['Language', 'PromptConfig'], metrics: list[Metric], default_num_few_shot_examples: int, default_max_generated_tokens: int, default_labels: list[str], requires_zero_shot: bool = False, uses_structured_output: bool = False, uses_logprobs: bool = False, requires_logprobs: bool = False, allowed_model_types: list[ModelType] = field(default_factory=lambda: [ModelType.ENCODER, ModelType.GENERATIVE]), allowed_generative_types: list[GenerativeType] = field(default_factory=lambda: [GenerativeType.BASE, GenerativeType.INSTRUCTION_TUNED, GenerativeType.REASONING]))
A dataset task.
Attributes
-
name : str — The name of the task.
-
task_group : TaskGroup — The task group of the task.
-
template_dict : dict['Language', 'PromptConfig'] — The template dictionary for the task, from language to prompt template.
-
metrics : list[Metric] — The metrics used to evaluate the task.
-
default_num_few_shot_examples : int — The default number of examples to use when benchmarking the task using few-shot evaluation. For a classification task, these will be drawn evenly from each label.
-
default_max_generated_tokens : int — The default maximum number of tokens to generate when benchmarking the task using few-shot evaluation.
-
default_labels : list[str] — The default labels for datasets using this task.
-
requires_zero_shot : optional — Whether to only allow zero-shot evaluation for this task. If True, the task will not be evaluated using few-shot examples.
-
uses_structured_output : optional — Whether the task uses structured output. If True, the task will return structured output (e.g., BIO tags for NER). Defaults to False.
-
uses_logprobs : optional — Whether the task uses log probabilities. If True, the task will return log probabilities for the generated tokens. Defaults to False.
-
requires_logprobs : optional — Whether the task requires log probabilities. Implies uses_logprobs.
-
allowed_model_types : optional — A list of model types that are allowed to be evaluated on this task. Defaults to all model types being allowed.
-
allowed_generative_types : optional — A list of generative model types that are allowed to be evaluated on this task. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models.
source dataclass BenchmarkConfig(model_languages: list[Language], dataset_languages: list[Language], tasks: list[Task], datasets: list[str], batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, force: bool, progress_bar: bool, save_results: bool, device: torch.device, verbose: bool, trust_remote_code: bool, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, api_base: str | None, api_version: str | None, gpu_memory_utilization: float, debug: bool, run_with_cli: bool, requires_safetensors: bool)
General benchmarking configuration, across datasets and models.
Attributes
-
model_languages : list[Language] — The languages of the models to benchmark.
-
dataset_languages : list[Language] — The languages of the datasets in the benchmark.
-
tasks : list[Task] — The tasks to benchmark the model(s) on.
-
datasets : list[str] — The datasets to benchmark on.
-
batch_size : int — The batch size to use.
-
raise_errors : bool — Whether to raise errors instead of skipping them.
-
cache_dir : str — Directory to store cached models and datasets.
-
api_key : str | None — The API key to use for a given inference API.
-
force : bool — Whether to force the benchmark to run even if the results are already cached.
-
progress_bar : bool — Whether to show a progress bar.
-
save_results : bool — Whether to save the benchmark results to 'euroeval_benchmark_results.json'.
-
device : torch.device — The device to use for benchmarking.
-
verbose : bool — Whether to print verbose output.
-
trust_remote_code : bool — Whether to trust remote code when loading models from the Hugging Face Hub.
-
clear_model_cache : bool — Whether to clear the model cache after benchmarking each model.
-
evaluate_test_split : bool — Whether to evaluate on the test split.
-
few_shot : bool — Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative.
-
num_iterations : int — The number of iterations each model should be evaluated for.
-
api_base : str | None — The base URL for a given inference API. Only relevant if model refers to a model on an inference API.
-
api_version : str | None — The version of the API to use. Only relevant if model refers to a model on an inference API.
-
gpu_memory_utilization : float — The GPU memory utilization to use for vLLM. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Only relevant if the model is generative.
-
debug : bool — Whether to run the benchmark in debug mode.
-
run_with_cli : bool — Whether the benchmark is being run with the CLI.
-
requires_safetensors : bool — Whether to only allow models that use the safetensors format.
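A minimal construction sketch, assuming the import path documented here. Every field of the dataclass is required, so all of them are spelled out; the values are illustrative only, and the empty tasks list stands in for Task objects that would normally come from EuroEval's task definitions:

```python
import torch

from euroeval.data_models import BenchmarkConfig, Language

config = BenchmarkConfig(
    model_languages=[Language(code="da", name="Danish")],
    dataset_languages=[Language(code="da", name="Danish")],
    tasks=[],                    # placeholder; real Task objects go here
    datasets=["angry-tweets"],   # illustrative dataset name
    batch_size=32,
    raise_errors=False,
    cache_dir=".euroeval_cache",
    api_key=None,
    force=False,
    progress_bar=True,
    save_results=True,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    verbose=False,
    trust_remote_code=False,
    clear_model_cache=False,
    evaluate_test_split=False,
    few_shot=True,
    num_iterations=10,
    api_base=None,
    api_version=None,
    gpu_memory_utilization=0.9,
    debug=False,
    run_with_cli=False,
    requires_safetensors=False,
)
```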
source class BenchmarkConfigParams(**data: Any)
Bases : pydantic.BaseModel
The parameters for the benchmark configuration.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model. self is explicitly positional-only to allow self as a field name.
Attributes
-
model_extra : dict[str, Any] | None — Get extra fields set during validation.
-
model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
source class BenchmarkResult(**data: Any)
Bases : pydantic.BaseModel
A benchmark result.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model. self is explicitly positional-only to allow self as a field name.
Attributes
-
model_config : ClassVar[ConfigDict] — Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
-
model_extra : dict[str, Any] | None — Get extra fields set during validation.
-
model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
Methods
-
from_dict — Create a benchmark result from a dictionary.
-
append_to_results — Append the benchmark result to the results file.
source classmethod BenchmarkResult.from_dict(config: dict) → BenchmarkResult
Create a benchmark result from a dictionary.
Parameters
-
config : dict — The configuration dictionary.
Returns
-
BenchmarkResult — The benchmark result.
source method BenchmarkResult.append_to_results(results_path: pathlib.Path) → None
Append the benchmark result to the results file.
Parameters
-
results_path : pathlib.Path — The path to the results file.
source dataclass DatasetConfig(name: str, pretty_name: str, huggingface_id: str, task: Task, languages: list[Language], _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: list[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, splits: list[str] = field(default_factory=lambda: ['train', 'val', 'test']), bootstrap_samples: bool = True, unofficial: bool = False)
Configuration for a dataset.
Attributes
-
name : str — The name of the dataset. Must be lower case with no spaces.
-
pretty_name : str — A longer, prettier name for the dataset, which may contain capital letters and spaces. Used for logging.
-
huggingface_id : str — The Hugging Face ID of the dataset.
-
task : Task — The task of the dataset.
-
languages : list[Language] — The ISO 639-1 language codes of the entries in the dataset.
-
id2label : dict[int, str] — The mapping from ID to label.
-
label2id : dict[str, int] — The mapping from label to ID.
-
num_labels : int — The number of labels in the dataset.
-
_prompt_prefix : optional — The prefix to use in the few-shot prompt. Defaults to the template for the task and language.
-
_prompt_template : optional — The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
_instruction_prompt : optional — The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.
-
_num_few_shot_examples : optional — The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the template for the task and language.
-
_max_generated_tokens : optional — The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
_labels : optional — The labels in the dataset. Defaults to the template for the task and language.
-
_prompt_label_mapping : optional — A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.
-
splits : optional — The names of the splits in the dataset. If not provided, defaults to ["train", "val", "test"].
-
bootstrap_samples : optional — Whether to bootstrap the dataset samples. Defaults to True.
-
unofficial : optional — Whether the dataset is unofficial. Defaults to False.
-
prompt_prefix : str — The prefix to use in the few-shot prompt.
-
prompt_template : str — The template used during few-shot evaluation.
-
instruction_prompt : str — The prompt to use when evaluating instruction-tuned models.
-
num_few_shot_examples : int — The number of few-shot examples to use.
-
max_generated_tokens : int — The maximum number of tokens to generate when evaluating a model.
-
labels : list[str] — The labels in the dataset.
-
prompt_label_mapping : dict[str, str] — Mapping from English labels to localised labels.
source property DatasetConfig.prompt_prefix: str
The prefix to use in the few-shot prompt.
source property DatasetConfig.prompt_template: str
The template used during few-shot evaluation.
source property DatasetConfig.instruction_prompt: str
The prompt to use when evaluating instruction-tuned models.
source property DatasetConfig.num_few_shot_examples: int
The number of few-shot examples to use.
source property DatasetConfig.max_generated_tokens: int
The maximum number of tokens to generate when evaluating a model.
source property DatasetConfig.labels: list[str]
The labels in the dataset.
source property DatasetConfig.prompt_label_mapping: dict[str, str]
Mapping from English labels to localised labels.
source property DatasetConfig.id2label: dict[int, str]
The mapping from ID to label.
source property DatasetConfig.label2id: dict[str, int]
The mapping from label to ID.
source property DatasetConfig.num_labels: int
The number of labels in the dataset.
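A conceptual sketch of how the derived label properties relate to labels, based on the property names and descriptions above; this is an assumption for illustration, not EuroEval's exact implementation:

```python
labels = ["positive", "neutral", "negative"]  # illustrative labels

id2label = dict(enumerate(labels))            # {0: "positive", 1: "neutral", 2: "negative"}
label2id = {label: idx for idx, label in enumerate(labels)}
num_labels = len(labels)

# With _prompt_label_mapping="auto", each label is simply mapped to itself.
prompt_label_mapping = {label: label for label in labels}
```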
source dataclass ModelConfig(model_id: str, revision: str, task: str, languages: list[Language], inference_backend: InferenceBackend, merge: bool, model_type: ModelType, fresh: bool, model_cache_dir: str, adapter_base_model_id: str | None)
Configuration for a model.
Attributes
-
model_id : str — The ID of the model.
-
revision : str — The revision of the model.
-
task : str — The task that the model was trained on.
-
languages : list[Language] — The languages of the model.
-
inference_backend : InferenceBackend — The backend used to perform inference with the model.
-
merge : bool — Whether the model is a merged model.
-
model_type : ModelType — The type of the model (e.g., encoder, base decoder, instruction tuned).
-
fresh : bool — Whether the model is freshly initialised.
-
model_cache_dir : str — The directory to cache the model in.
-
adapter_base_model_id : str | None — The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
source dataclass PreparedModelInputs(texts: list[str] | None = None, input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None)
The inputs to a model.
Attributes
-
texts : list[str] | None — The texts to input to the model. Can be None if the input IDs and attention mask are provided instead.
-
input_ids : torch.Tensor | None — The input IDs of the texts. Can be None if the texts are provided instead.
-
attention_mask : torch.Tensor | None — The attention mask of the texts. Can be None if the texts are provided instead.
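A minimal sketch, assuming the import path documented here: the inputs can be given either as raw texts or as already-tokenised tensors, with the other fields left as None.

```python
import torch

from euroeval.data_models import PreparedModelInputs

# Variant 1: raw texts only.
text_inputs = PreparedModelInputs(texts=["Et illustrativt eksempel."])

# Variant 2: pre-tokenised tensors only (token IDs are made up for illustration).
tensor_inputs = PreparedModelInputs(
    input_ids=torch.tensor([[101, 2023, 102]]),
    attention_mask=torch.tensor([[1, 1, 1]]),
)
```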
source dataclass GenerativeModelOutput(sequences: list[str], scores: list[list[list[tuple[str, float]]]] | None = None)
The output of a generative model.
Attributes
-
sequences : list[str] — The generated sequences.
-
scores : list[list[list[tuple[str, float]]]] | None — The scores of the sequences. This is an array of shape (batch_size, num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
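A minimal sketch of the nested scores structure described above; the tokens and log probabilities are made up for illustration.

```python
from euroeval.data_models import GenerativeModelOutput

output = GenerativeModelOutput(
    sequences=["positive"],
    scores=[            # batch dimension (one sequence here)
        [               # one entry per generated token
            [("positive", -0.11), ("negative", -2.41)],  # (token, logprob) pairs for the top logprobs
        ]
    ],
)
```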
source dataclass SingleGenerativeModelOutput(sequence: str, scores: list[list[tuple[str, float]]] | None = None)
A single output of a generative model.
Attributes
-
sequence : str — The generated sequence.
-
scores : list[list[tuple[str, float]]] | None — The scores of the sequence. This is an array of shape (num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
source dataclass HFModelInfo(pipeline_tag: str, tags: list[str], adapter_base_model_id: str | None)
Information about a Hugging Face model.
Attributes
-
pipeline_tag : str — The pipeline tag of the model.
-
tags : list[str] — The other tags of the model.
-
adapter_base_model_id : str | None — The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
source dataclass PromptConfig(default_prompt_prefix: str, default_prompt_template: str, default_instruction_prompt: str, default_prompt_label_mapping: dict[str, str] | t.Literal['auto'])
Configuration for task-specific prompting across languages.
Defines the prompt templates needed for evaluating a specific task in a given language.
Attributes
-
default_prompt_prefix : str — The default prefix to use in the few-shot prompt.
-
default_prompt_template : str — The default template for the prompt to use when benchmarking the dataset using few-shot evaluation.
-
default_instruction_prompt : str — The default prompt to use when benchmarking the dataset using instruction-based evaluation.
-
default_prompt_label_mapping : dict[str, str] | t.Literal['auto'] — The default mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If set to "auto", the mapping will be set to a 1:1 mapping between the labels and themselves.
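A minimal sketch, assuming PromptConfig is importable from euroeval.data_models as documented here. The template strings and the {text}/{label} placeholder names are illustrative only; the placeholders actually expected by EuroEval's prompt templates are not listed in this reference.

```python
from euroeval.data_models import PromptConfig

# Explicit label mapping.
sentiment_prompts = PromptConfig(
    default_prompt_prefix="The following are documents and their sentiment.",
    default_prompt_template="Document: {text}\nSentiment: {label}",
    default_instruction_prompt="Document: {text}\n\nClassify the sentiment of the document.",
    default_prompt_label_mapping={"positive": "positive", "negative": "negative"},
)

# Passing "auto" instead of an explicit dictionary maps each label to itself.
auto_prompts = PromptConfig(
    default_prompt_prefix="",
    default_prompt_template="{text}\n{label}",
    default_instruction_prompt="{text}",
    default_prompt_label_mapping="auto",
)
```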