euroeval.data_models
source module euroeval.data_models
Data models used in EuroEval.
Classes
-
Language — A benchmarkable language.
-
Task — A dataset task.
-
BenchmarkConfig — General benchmarking configuration, across datasets and models.
-
BenchmarkConfigParams — The parameters for the benchmark configuration.
-
BenchmarkResult — A benchmark result.
-
DatasetConfig — Configuration for a dataset.
-
ModelConfig — Configuration for a model.
-
PreparedModelInputs — The inputs to a model.
-
GenerativeModelOutput — The output of a generative model.
-
SingleGenerativeModelOutput — A single output of a generative model.
-
HFModelInfo — Information about a Hugging Face model.
-
PromptConfig — Configuration for task-specific prompting across languages.
source dataclass Language(code: str, name: str, _and_separator: str | None = field(repr=False, default=None), _or_separator: str | None = field(repr=False, default=None))
A benchmarkable language.
Attributes
-
code : str — The ISO 639-1 language code of the language.
-
name : str — The name of the language.
-
and_separator : optional — The word 'and' in the language.
-
or_separator : optional — The word 'or' in the language.
source property Language.and_separator: str
Get the word 'and' in the language.
Returns
-
str — The word 'and' in the language.
Raises
-
NotImplementedError — If and_separator is None.
source property Language.or_separator: str
Get the word 'or' in the language.
Returns
-
str — The word 'or' in the language.
Raises
-
NotImplementedError — If or_separator is None.
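A minimal usage sketch, assuming Language is importable from euroeval.data_models as documented here; the Danish separator words are illustrative values:

```python
from euroeval.data_models import Language

# A language with both separators provided.
danish = Language(code="da", name="Danish", _and_separator="og", _or_separator="eller")
print(danish.and_separator)  # "og"
print(danish.or_separator)   # "eller"

# If a separator was not provided, the corresponding property raises
# NotImplementedError rather than returning None.
bokmaal = Language(code="nb", name="Norwegian Bokmål")
try:
    bokmaal.and_separator
except NotImplementedError:
    print("No 'and' separator registered for this language.")
```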
source dataclass Task(name: str, task_group: TaskGroup, template_dict: dict['Language', 'PromptConfig'], metrics: list[Metric], default_num_few_shot_examples: int, default_max_generated_tokens: int, default_labels: list[str], requires_zero_shot: bool = False, uses_structured_output: bool = False, uses_logprobs: bool = False, requires_logprobs: bool = False, allowed_model_types: list[ModelType] = field(default_factory=lambda: [ModelType.ENCODER, ModelType.GENERATIVE]), allowed_generative_types: list[GenerativeType] = field(default_factory=lambda: [GenerativeType.BASE, GenerativeType.INSTRUCTION_TUNED, GenerativeType.REASONING]))
A dataset task.
Attributes
-
name : str — The name of the task.
-
task_group : TaskGroup — The task group of the task.
-
template_dict : dict['Language', 'PromptConfig'] — The template dictionary for the task, from language to prompt template.
-
metrics : list[Metric] — The metrics used to evaluate the task.
-
default_num_few_shot_examples : int — The default number of examples to use when benchmarking the task using few-shot evaluation. For a classification task, these will be drawn evenly from each label.
-
default_max_generated_tokens : int — The default maximum number of tokens to generate when benchmarking the task using few-shot evaluation.
-
default_labels : list[str] — The default labels for datasets using this task.
-
requires_zero_shot : optional — Whether to only allow zero-shot evaluation for this task. If True, the task will not be evaluated using few-shot examples.
-
uses_structured_output : optional — Whether the task uses structured output. If True, the task will return structured output (e.g., BIO tags for NER). Defaults to False.
-
uses_logprobs : optional — Whether the task uses log probabilities. If True, the task will return log probabilities for the generated tokens. Defaults to False.
-
requires_logprobs : optional — Whether the task requires log probabilities. Implies uses_logprobs.
-
allowed_model_types : optional — A list of model types that are allowed to be evaluated on this task. Defaults to all model types being allowed.
-
allowed_generative_types : optional — A list of generative model types that are allowed to be evaluated on this task. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models.
source dataclass BenchmarkConfig(model_languages: list[Language], dataset_languages: list[Language], tasks: list[Task], datasets: list[str], batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, force: bool, progress_bar: bool, save_results: bool, device: torch.device, verbose: bool, trust_remote_code: bool, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, api_base: str | None, api_version: str | None, gpu_memory_utilization: float, debug: bool, run_with_cli: bool, requires_safetensors: bool)
General benchmarking configuration, across datasets and models.
Attributes
-
model_languages : list[Language] — The languages of the models to benchmark.
-
dataset_languages : list[Language] — The languages of the datasets in the benchmark.
-
tasks : list[Task] — The tasks to benchmark the model(s) on.
-
datasets : list[str] — The datasets to benchmark on.
-
batch_size : int — The batch size to use.
-
raise_errors : bool — Whether to raise errors instead of skipping them.
-
cache_dir : str — Directory to store cached models and datasets.
-
api_key : str | None — The API key to use for a given inference API.
-
force : bool — Whether to force the benchmark to run even if the results are already cached.
-
progress_bar : bool — Whether to show a progress bar.
-
save_results : bool — Whether to save the benchmark results to 'euroeval_benchmark_results.json'.
-
device : torch.device — The device to use for benchmarking.
-
verbose : bool — Whether to print verbose output.
-
trust_remote_code : bool — Whether to trust remote code when loading models from the Hugging Face Hub.
-
clear_model_cache : bool — Whether to clear the model cache after benchmarking each model.
-
evaluate_test_split : bool — Whether to evaluate on the test split.
-
few_shot : bool — Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative.
-
num_iterations : int — The number of iterations each model should be evaluated for.
-
api_base : str | None — The base URL for a given inference API. Only relevant if model refers to a model on an inference API.
-
api_version : str | None — The version of the API to use. Only relevant if model refers to a model on an inference API.
-
gpu_memory_utilization : float — The GPU memory utilization to use for vLLM. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Only relevant if the model is generative.
-
debug : bool — Whether to run the benchmark in debug mode.
-
run_with_cli : bool — Whether the benchmark is being run with the CLI.
-
requires_safetensors : bool — Whether to only allow models that use the safetensors format.
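A minimal construction sketch, assuming the import path documented here. Every field of the dataclass is required, so all of them are spelled out; the values are illustrative only, and the empty tasks list stands in for Task objects that would normally come from EuroEval's task definitions:

```python
import torch

from euroeval.data_models import BenchmarkConfig, Language

config = BenchmarkConfig(
    model_languages=[Language(code="da", name="Danish")],
    dataset_languages=[Language(code="da", name="Danish")],
    tasks=[],                    # placeholder; real Task objects go here
    datasets=["angry-tweets"],   # illustrative dataset name
    batch_size=32,
    raise_errors=False,
    cache_dir=".euroeval_cache",
    api_key=None,
    force=False,
    progress_bar=True,
    save_results=True,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    verbose=False,
    trust_remote_code=False,
    clear_model_cache=False,
    evaluate_test_split=False,
    few_shot=True,
    num_iterations=10,
    api_base=None,
    api_version=None,
    gpu_memory_utilization=0.9,
    debug=False,
    run_with_cli=False,
    requires_safetensors=False,
)
```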
source class BenchmarkConfigParams(**data: Any)
Bases : pydantic.BaseModel
The parameters for the benchmark configuration.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model. self is explicitly positional-only to allow self as a field name.
Attributes
-
model_extra : dict[str, Any] | None — Get extra fields set during validation.
-
model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
source class BenchmarkResult(**data: Any)
Bases : pydantic.BaseModel
A benchmark result.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model. self is explicitly positional-only to allow self as a field name.
Attributes
-
model_config : ClassVar[ConfigDict] — Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.
-
model_extra : dict[str, Any] | None — Get extra fields set during validation.
-
model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
Methods
-
from_dict — Create a benchmark result from a dictionary.
-
append_to_results — Append the benchmark result to the results file.
source classmethod BenchmarkResult.from_dict(config: dict) → BenchmarkResult
Create a benchmark result from a dictionary.
Parameters
-
config : dict — The configuration dictionary.
Returns
-
BenchmarkResult — The benchmark result.
source method BenchmarkResult.append_to_results(results_path: pathlib.Path) → None
Append the benchmark result to the results file.
Parameters
-
results_path : pathlib.Path — The path to the results file.
source dataclass DatasetConfig(name: str, pretty_name: str, huggingface_id: str, task: Task, languages: list[Language], _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: list[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, splits: list[str] = field(default_factory=lambda: ['train', 'val', 'test']), bootstrap_samples: bool = True, unofficial: bool = False)
Configuration for a dataset.
Attributes
-
name : str — The name of the dataset. Must be lower case with no spaces.
-
pretty_name : str — A longer, prettier name for the dataset, which may contain capital letters and spaces. Used for logging.
-
huggingface_id : str — The Hugging Face ID of the dataset.
-
task : Task — The task of the dataset.
-
languages : list[Language] — The ISO 639-1 language codes of the entries in the dataset.
-
id2label : dict[int, str] — The mapping from ID to label.
-
label2id : dict[str, int] — The mapping from label to ID.
-
num_labels : int — The number of labels in the dataset.
-
_prompt_prefix : optional — The prefix to use in the few-shot prompt. Defaults to the template for the task and language.
-
_prompt_template : optional — The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
_instruction_prompt : optional — The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.
-
_num_few_shot_examples : optional — The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the template for the task and language.
-
_max_generated_tokens : optional — The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
_labels : optional — The labels in the dataset. Defaults to the template for the task and language.
-
_prompt_label_mapping : optional — A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.
-
splits : optional — The names of the splits in the dataset. If not provided, defaults to ["train", "val", "test"].
-
bootstrap_samples : optional — Whether to bootstrap the dataset samples. Defaults to True.
-
unofficial : optional — Whether the dataset is unofficial. Defaults to False.
-
prompt_prefix : str — The prefix to use in the few-shot prompt.
-
prompt_template : str — The template used during few-shot evaluation.
-
instruction_prompt : str — The prompt to use when evaluating instruction-tuned models.
-
num_few_shot_examples : int — The number of few-shot examples to use.
-
max_generated_tokens : int — The maximum number of tokens to generate when evaluating a model.
-
labels : list[str] — The labels in the dataset.
-
prompt_label_mapping : dict[str, str] — Mapping from English labels to localised labels.
source property DatasetConfig.prompt_prefix: str
The prefix to use in the few-shot prompt.
source property DatasetConfig.prompt_template: str
The template used during few-shot evaluation.
source property DatasetConfig.instruction_prompt: str
The prompt to use when evaluating instruction-tuned models.
source property DatasetConfig.num_few_shot_examples: int
The number of few-shot examples to use.
source property DatasetConfig.max_generated_tokens: int
The maximum number of tokens to generate when evaluating a model.
source property DatasetConfig.labels: list[str]
The labels in the dataset.
source property DatasetConfig.prompt_label_mapping: dict[str, str]
Mapping from English labels to localised labels.
source property DatasetConfig.id2label: dict[int, str]
The mapping from ID to label.
source property DatasetConfig.label2id: dict[str, int]
The mapping from label to ID.
source property DatasetConfig.num_labels: int
The number of labels in the dataset.
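A conceptual sketch of how the derived label properties relate to labels, based on the property names and descriptions above; this is an assumption for illustration, not EuroEval's exact implementation:

```python
labels = ["positive", "neutral", "negative"]  # illustrative labels

id2label = dict(enumerate(labels))            # {0: "positive", 1: "neutral", 2: "negative"}
label2id = {label: idx for idx, label in enumerate(labels)}
num_labels = len(labels)

# With _prompt_label_mapping="auto", each label is simply mapped to itself.
prompt_label_mapping = {label: label for label in labels}
```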
source dataclass ModelConfig(model_id: str, revision: str, task: str, languages: list[Language], inference_backend: InferenceBackend, merge: bool, model_type: ModelType, fresh: bool, model_cache_dir: str, adapter_base_model_id: str | None)
Configuration for a model.
Attributes
-
model_id : str — The ID of the model.
-
revision : str — The revision of the model.
-
task : str — The task that the model was trained on.
-
languages : list[Language] — The languages of the model.
-
inference_backend : InferenceBackend — The backend used to perform inference with the model.
-
merge : bool — Whether the model is a merged model.
-
model_type : ModelType — The type of the model (e.g., encoder, base decoder, instruction tuned).
-
fresh : bool — Whether the model is freshly initialised.
-
model_cache_dir : str — The directory to cache the model in.
-
adapter_base_model_id : str | None — The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
source dataclass PreparedModelInputs(texts: list[str] | None = None, input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None)
The inputs to a model.
Attributes
-
texts : list[str] | None — The texts to input to the model. Can be None if the input IDs and attention mask are provided instead.
-
input_ids : torch.Tensor | None — The input IDs of the texts. Can be None if the texts are provided instead.
-
attention_mask : torch.Tensor | None — The attention mask of the texts. Can be None if the texts are provided instead.
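A minimal sketch, assuming the import path documented here: the inputs can be given either as raw texts or as already-tokenised tensors, with the other fields left as None.

```python
import torch

from euroeval.data_models import PreparedModelInputs

# Variant 1: raw texts only.
text_inputs = PreparedModelInputs(texts=["Et illustrativt eksempel."])

# Variant 2: pre-tokenised tensors only (token IDs are made up for illustration).
tensor_inputs = PreparedModelInputs(
    input_ids=torch.tensor([[101, 2023, 102]]),
    attention_mask=torch.tensor([[1, 1, 1]]),
)
```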
source dataclass GenerativeModelOutput(sequences: list[str], scores: list[list[list[tuple[str, float]]]] | None = None)
The output of a generative model.
Attributes
-
sequences : list[str] — The generated sequences.
-
scores : list[list[list[tuple[str, float]]]] | None — The scores of the sequences. This is an array of shape (batch_size, num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
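A minimal sketch of the nested scores structure described above; the tokens and log probabilities are made up for illustration.

```python
from euroeval.data_models import GenerativeModelOutput

output = GenerativeModelOutput(
    sequences=["positive"],
    scores=[            # batch dimension (one sequence here)
        [               # one entry per generated token
            [("positive", -0.11), ("negative", -2.41)],  # (token, logprob) pairs for the top logprobs
        ]
    ],
)
```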
source dataclass SingleGenerativeModelOutput(sequence: str, scores: list[list[tuple[str, float]]] | None = None)
A single output of a generative model.
Attributes
-
sequence : str — The generated sequence.
-
scores : list[list[tuple[str, float]]] | None — The scores of the sequence. This is an array of shape (num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
source dataclass HFModelInfo(pipeline_tag: str, tags: list[str], adapter_base_model_id: str | None)
Information about a Hugging Face model.
Attributes
-
pipeline_tag : str — The pipeline tag of the model.
-
tags : list[str] — The other tags of the model.
-
adapter_base_model_id : str | None — The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
source dataclass PromptConfig(default_prompt_prefix: str, default_prompt_template: str, default_instruction_prompt: str, default_prompt_label_mapping: dict[str, str] | t.Literal['auto'])
Configuration for task-specific prompting across languages.
Defines the prompt templates needed for evaluating a specific task in a given language.
Attributes
-
default_prompt_prefix : str — The default prefix to use in the few-shot prompt.
-
default_prompt_template : str — The default template for the prompt to use when benchmarking the dataset using few-shot evaluation.
-
default_instruction_prompt : str — The default prompt to use when benchmarking the dataset using instruction-based evaluation.
-
default_prompt_label_mapping : dict[str, str] | t.Literal['auto'] — The default mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If set to "auto", the mapping will be set to a 1:1 mapping between the labels and themselves.
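A minimal sketch, assuming PromptConfig is importable from euroeval.data_models as documented here. The template strings and the {text}/{label} placeholder names are illustrative only; the placeholders actually expected by EuroEval's prompt templates are not listed in this reference.

```python
from euroeval.data_models import PromptConfig

# Explicit label mapping.
sentiment_prompts = PromptConfig(
    default_prompt_prefix="The following are documents and their sentiment.",
    default_prompt_template="Document: {text}\nSentiment: {label}",
    default_instruction_prompt="Document: {text}\n\nClassify the sentiment of the document.",
    default_prompt_label_mapping={"positive": "positive", "negative": "negative"},
)

# Passing "auto" instead of an explicit dictionary maps each label to itself.
auto_prompts = PromptConfig(
    default_prompt_prefix="",
    default_prompt_template="{text}\n{label}",
    default_instruction_prompt="{text}",
    default_prompt_label_mapping="auto",
)
```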