euroeval.data_models
source module euroeval.data_models
Data models used in EuroEval.
Classes
-
MetricConfig — Configuration for a metric.
-
Language — A benchmarkable language.
-
Task — A dataset task.
-
BenchmarkConfig — General benchmarking configuration, across datasets and models.
-
BenchmarkConfigParams — The parameters for the benchmark configuration.
-
BenchmarkResult — A benchmark result.
-
DatasetConfig — Configuration for a dataset.
-
ModelConfig — Configuration for a model.
-
PreparedModelInputs — The inputs to a model.
-
GenerativeModelOutput — The output of a generative model.
-
SingleGenerativeModelOutput — A single output of a generative model.
-
HFModelInfo — Information about a Hugging Face model.
-
PromptConfig — Configuration for task-specific prompting across languages.
source dataclass MetricConfig(name: str, pretty_name: str, huggingface_id: str, results_key: str, compute_kwargs: dict[str, t.Any] = field(default_factory=dict), postprocessing_fn: c.Callable[[float], tuple[float, str]] = field(default_factory=lambda: lambda raw_score: (100 * raw_score, f'{raw_score:.2%}')))
Configuration for a metric.
Attributes
-
name : str — The name of the metric.
-
pretty_name : str — A longer, prettier name for the metric, which may contain uppercase characters and spaces. Used for logging.
-
huggingface_id : str — The Hugging Face ID of the metric.
-
results_key : str — The name of the key used to extract the metric scores from the results dictionary.
-
compute_kwargs : dict[str, t.Any] — Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.
-
postprocessing_fn : c.Callable[[float], tuple[float, str]] — A function to apply to the metric scores after they are computed, taking the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
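For illustration, a minimal sketch of a metric configuration; the field values are hypothetical, and the final lines exercise the default postprocessing_fn:

```python
from euroeval.data_models import MetricConfig

# Hypothetical metric configuration; the IDs and keys are placeholders.
f1_metric = MetricConfig(
    name="f1",
    pretty_name="F1 Score",
    huggingface_id="f1",
    results_key="f1",
)

# The default postprocessing_fn scales the raw score to a percentage and
# renders it as a string.
score, score_str = f1_metric.postprocessing_fn(0.8765)
print(score, score_str)  # 87.65 '87.65%'
```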
source dataclass Language(code: str, name: str, _and_separator: str | None = field(repr=False, default=None), _or_separator: str | None = field(repr=False, default=None))
A benchmarkable language.
Attributes
-
code : str — The ISO 639-1 language code of the language.
-
name : str — The name of the language.
-
and_separator : optional — The word 'and' in the language.
-
or_separator : optional — The word 'or' in the language.
source property Language.and_separator: str
Get the word 'and' in the language.
Returns
-
str — The word 'and' in the language.
Raises
-
NotImplementedError — If and_separator is None.
source property Language.or_separator: str
Get the word 'or' in the language.
Returns
-
str — The word 'or' in the language.
Raises
-
NotImplementedError — If or_separator is None.
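A short sketch of the separator behaviour; the Danish separator values are illustrative:

```python
from euroeval.data_models import Language

# Language with separators supplied (the values are illustrative).
danish = Language(code="da", name="Danish", _and_separator="og", _or_separator="eller")
print(danish.and_separator)  # 'og'

# Without separators, accessing the properties raises NotImplementedError.
unknown = Language(code="xx", name="Example")
try:
    unknown.or_separator
except NotImplementedError:
    print("or_separator is not defined for this language")
```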
source dataclass Task(name: str, task_group: TaskGroup, template_dict: dict['Language', 'PromptConfig'], metrics: list[MetricConfig], default_num_few_shot_examples: int, default_max_generated_tokens: int, default_labels: list[str])
A dataset task.
Attributes
-
name : str — The name of the task.
-
task_group : TaskGroup — The task group of the task.
-
template_dict : dict['Language', 'PromptConfig'] — The template dictionary for the task, from language to prompt template.
-
metrics : list[MetricConfig] — The metrics used to evaluate the task.
-
default_num_few_shot_examples : int — The default number of examples to use when benchmarking the task using few-shot evaluation. For a classification task, these will be drawn evenly from each label.
-
default_max_generated_tokens : int — The default maximum number of tokens to generate when benchmarking the task using few-shot evaluation.
-
default_labels : list[str] — The default labels for datasets using this task.
source dataclass BenchmarkConfig(model_languages: list[Language], dataset_languages: list[Language], tasks: list[Task], datasets: list[str], batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, force: bool, progress_bar: bool, save_results: bool, device: torch.device, verbose: bool, trust_remote_code: bool, use_flash_attention: bool | None, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, api_base: str | None, api_version: str | None, debug: bool, run_with_cli: bool, only_allow_safetensors: bool)
General benchmarking configuration, across datasets and models.
Attributes
-
model_languages : list[Language] — The languages of the models to benchmark.
-
dataset_languages : list[Language] — The languages of the datasets in the benchmark.
-
tasks : list[Task] — The tasks to benchmark the model(s) on.
-
datasets : list[str] — The datasets to benchmark on.
-
batch_size : int — The batch size to use.
-
raise_errors : bool — Whether to raise errors instead of skipping them.
-
cache_dir : str — Directory to store cached models and datasets.
-
api_key : str | None — The API key to use for a given inference API.
-
force : bool — Whether to force the benchmark to run even if the results are already cached.
-
progress_bar : bool — Whether to show a progress bar.
-
save_results : bool — Whether to save the benchmark results to 'euroeval_benchmark_results.json'.
-
device : torch.device — The device to use for benchmarking.
-
verbose : bool — Whether to print verbose output.
-
trust_remote_code : bool — Whether to trust remote code when loading models from the Hugging Face Hub.
-
use_flash_attention : bool | None — Whether to use Flash Attention. If None, Flash Attention will be enabled for generative models.
-
clear_model_cache : bool — Whether to clear the model cache after benchmarking each model.
-
evaluate_test_split : bool — Whether to evaluate on the test split.
-
few_shot : bool — Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative.
-
num_iterations : int — The number of iterations each model should be evaluated for.
-
api_base : str | None — The base URL for a given inference API. Only relevant if model refers to a model on an inference API.
-
api_version : str | None — The version of the API to use. Only relevant if model refers to a model on an inference API.
-
debug : bool — Whether to run the benchmark in debug mode.
-
run_with_cli : bool — Whether the benchmark is being run with the CLI.
-
only_allow_safetensors : bool — Whether to only allow models that use the safetensors format.
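For reference, a minimal sketch instantiating the configuration directly; every value is a placeholder, and in normal use euroeval constructs this object from CLI or keyword arguments rather than you building it by hand:

```python
import torch

from euroeval.data_models import BenchmarkConfig, Language

danish = Language(code="da", name="Danish")

config = BenchmarkConfig(
    model_languages=[danish],
    dataset_languages=[danish],
    tasks=[],                    # list of Task objects; empty for brevity
    datasets=["some-dataset"],   # placeholder dataset name
    batch_size=32,
    raise_errors=False,
    cache_dir=".euroeval_cache",  # placeholder path
    api_key=None,
    force=False,
    progress_bar=True,
    save_results=True,
    device=torch.device("cpu"),
    verbose=False,
    trust_remote_code=False,
    use_flash_attention=None,
    clear_model_cache=False,
    evaluate_test_split=False,
    few_shot=True,
    num_iterations=10,
    api_base=None,
    api_version=None,
    debug=False,
    run_with_cli=False,
    only_allow_safetensors=False,
)
```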
source class BenchmarkConfigParams(**data: Any)
Bases : pydantic.BaseModel
The parameters for the benchmark configuration.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Attributes
-
model_extra : dict[str, Any] | None — Get extra fields set during validation.
-
model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
source class BenchmarkResult(**data: Any)
Bases : pydantic.BaseModel
A benchmark result.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Attributes
-
model_config : ClassVar[ConfigDict] — Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.
-
model_extra : dict[str, Any] | None — Get extra fields set during validation.
-
model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
Methods
-
from_dict — Create a benchmark result from a dictionary.
-
append_to_results — Append the benchmark result to the results file.
source classmethod BenchmarkResult.from_dict(config: dict) → BenchmarkResult
Create a benchmark result from a dictionary.
Parameters
-
config : dict — The configuration dictionary.
Returns
-
BenchmarkResult — The benchmark result.
source method BenchmarkResult.append_to_results(results_path: pathlib.Path) → None
Append the benchmark result to the results file.
Parameters
-
results_path : pathlib.Path — The path to the results file.
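A sketch of the round trip through the two methods; it assumes an existing results file with one JSON object per line (the append-style format implied by append_to_results), and the record schema itself is not reproduced here:

```python
import json
import pathlib

from euroeval.data_models import BenchmarkResult

# Assumed file name and one-JSON-object-per-line layout.
results_path = pathlib.Path("euroeval_benchmark_results.jsonl")
first_record = json.loads(results_path.read_text().splitlines()[0])

result = BenchmarkResult.from_dict(config=first_record)
result.append_to_results(results_path=pathlib.Path("merged_results.jsonl"))
```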
source dataclass DatasetConfig(name: str, pretty_name: str, huggingface_id: str, task: Task, languages: list[Language], _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: list[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, unofficial: bool = False)
Configuration for a dataset.
Attributes
-
name : str — The name of the dataset. Must be lower case with no spaces.
-
pretty_name : str — A longer, prettier name for the dataset, which may contain uppercase characters and spaces. Used for logging.
-
huggingface_id : str — The Hugging Face ID of the dataset.
-
task : Task — The task of the dataset.
-
languages : list[Language] — The languages of the entries in the dataset.
-
id2label : dict[int, str] — The mapping from ID to label.
-
label2id : dict[str, int] — The mapping from label to ID.
-
num_labels : int — The number of labels in the dataset.
-
_prompt_prefix : optional — The prefix to use in the few-shot prompt. Defaults to the template for the task and language.
-
_prompt_template : optional — The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
_instruction_prompt : optional — The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.
-
_num_few_shot_examples : optional — The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the task's default number of few-shot examples.
-
_max_generated_tokens : optional — The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the task's default maximum.
-
_labels : optional — The labels in the dataset. Defaults to the task's default labels.
-
_prompt_label_mapping : optional — A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.
-
unofficial : optional — Whether the dataset is unofficial. Defaults to False.
-
prompt_prefix : str — The prefix to use in the few-shot prompt.
-
prompt_template : str — The template used during few-shot evaluation.
-
instruction_prompt : str — The prompt to use when evaluating instruction-tuned models.
-
num_few_shot_examples : int — The number of few-shot examples to use.
-
max_generated_tokens : int — The maximum number of tokens to generate when evaluating a model.
-
labels : list[str] — The labels in the dataset.
-
prompt_label_mapping : dict[str, str] — Mapping from English labels to localised labels.
source property DatasetConfig.prompt_prefix: str
The prefix to use in the few-shot prompt.
source property DatasetConfig.prompt_template: str
The template used during few-shot evaluation.
source property DatasetConfig.instruction_prompt: str
The prompt to use when evaluating instruction-tuned models.
source property DatasetConfig.num_few_shot_examples: int
The number of few-shot examples to use.
source property DatasetConfig.max_generated_tokens: int
The maximum number of tokens to generate when evaluating a model.
source property DatasetConfig.labels: list[str]
The labels in the dataset.
source property DatasetConfig.prompt_label_mapping: dict[str, str]
Mapping from English labels to localised labels.
source property DatasetConfig.id2label: dict[int, str]
The mapping from ID to label.
source property DatasetConfig.label2id: dict[str, int]
The mapping from label to ID.
source property DatasetConfig.num_labels: int
The number of labels in the dataset.
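A sketch of how the derived label properties follow from the labels; the dataset details are hypothetical, and importing the built-in sentiment task from euroeval.tasks is an assumption (any Task instance would do):

```python
from euroeval.data_models import DatasetConfig, Language
from euroeval.tasks import SENT  # assumed location of the built-in sentiment task

config = DatasetConfig(
    name="my-sentiment-dataset",  # hypothetical dataset
    pretty_name="My Sentiment Dataset",
    huggingface_id="example-org/my-sentiment-dataset",
    task=SENT,
    languages=[Language(code="da", name="Danish")],
    _labels=["negative", "neutral", "positive"],
)

print(config.num_labels)  # 3
print(config.id2label)    # expected: {0: 'negative', 1: 'neutral', 2: 'positive'}
print(config.label2id)    # the inverse mapping, per the docs above
```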
source dataclass ModelConfig(model_id: str, revision: str, task: str, languages: list[Language], inference_backend: InferenceBackend, merge: bool, model_type: ModelType, fresh: bool, model_cache_dir: str, adapter_base_model_id: str | None)
Configuration for a model.
Attributes
-
model_id : str — The ID of the model.
-
revision : str — The revision of the model.
-
task : str — The task that the model was trained on.
-
languages : list[Language] — The languages of the model.
-
inference_backend : InferenceBackend — The backend used to perform inference with the model.
-
merge : bool — Whether the model is a merged model.
-
model_type : ModelType — The type of the model (e.g., encoder, base decoder, instruction tuned).
-
fresh : bool — Whether the model is freshly initialised.
-
model_cache_dir : str — The directory to cache the model in.
-
adapter_base_model_id : str | None — The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
source dataclass PreparedModelInputs(texts: list[str] | None = None, input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None)
The inputs to a model.
Attributes
-
texts : list[str] | None — The texts to input to the model. Can be None if the input IDs and attention mask are provided instead.
-
input_ids : torch.Tensor | None — The input IDs of the texts. Can be None if the texts are provided instead.
-
attention_mask : torch.Tensor | None — The attention mask of the texts. Can be None if the texts are provided instead.
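The two documented ways of populating the inputs, sketched below; the tensors are illustrative values, not a real tokenisation:

```python
import torch

from euroeval.data_models import PreparedModelInputs

# Either raw texts...
text_inputs = PreparedModelInputs(texts=["Hello, world!", "Hej, verden!"])

# ...or pre-tokenised input IDs with an attention mask (illustrative values).
tensor_inputs = PreparedModelInputs(
    input_ids=torch.tensor([[101, 7592, 102]]),
    attention_mask=torch.tensor([[1, 1, 1]]),
)
```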
source dataclass GenerativeModelOutput(sequences: list[str], scores: list[list[list[tuple[str, float]]]] | None = None)
The output of a generative model.
Attributes
-
sequences : list[str] — The generated sequences.
-
scores : list[list[list[tuple[str, float]]]] | None — The scores of the sequences. This is an array of shape (batch_size, num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
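An illustrative output for a batch of one sequence with two generated tokens and two log-probability candidates per token; all tokens and scores are made up:

```python
from euroeval.data_models import GenerativeModelOutput

output = GenerativeModelOutput(
    sequences=["positive"],
    scores=[  # (batch_size=1, num_tokens=2, num_logprobs=2, 2)
        [
            [("pos", -0.11), ("neg", -2.30)],
            [("itive", -0.02), ("ative", -3.91)],
        ]
    ],
)
```

SingleGenerativeModelOutput, below, is the per-sequence analogue: its scores field drops the leading batch dimension.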
source dataclass SingleGenerativeModelOutput(sequence: str, scores: list[list[tuple[str, float]]] | None = None)
A single output of a generative model.
Attributes
-
sequence : str — The generated sequence.
-
scores : list[list[tuple[str, float]]] | None — The scores of the sequence. This is an array of shape (num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
source dataclass HFModelInfo(pipeline_tag: str, tags: list[str], adapter_base_model_id: str | None)
Information about a Hugging Face model.
Attributes
-
pipeline_tag : str — The pipeline tag of the model.
-
tags : list[str] — The other tags of the model.
-
adapter_base_model_id : str | None — The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
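A minimal sketch with illustrative values:

```python
from euroeval.data_models import HFModelInfo

info = HFModelInfo(
    pipeline_tag="text-generation",
    tags=["pytorch", "da"],      # illustrative tags
    adapter_base_model_id=None,  # not an adapter model
)
```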
source dataclass PromptConfig(default_prompt_prefix: str, default_prompt_template: str, default_instruction_prompt: str, default_prompt_label_mapping: dict[str, str] | t.Literal['auto'])
Configuration for task-specific prompting across languages.
Defines the prompt templates needed for evaluating a specific task in a given language.
Attributes
-
default_prompt_prefix : str — The default prefix to use in the few-shot prompt.
-
default_prompt_template : str — The default template for the prompt to use when benchmarking the dataset using few-shot evaluation.
-
default_instruction_prompt : str — The default prompt to use when benchmarking the dataset using instruction-based evaluation.
-
default_prompt_label_mapping : dict[str, str] | t.Literal['auto'] — The default mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If set to "auto", the mapping will be set to a 1:1 mapping between the labels and themselves.
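A hypothetical English sentiment template as a sketch; the {text} and {label} placeholder names are assumptions, not guaranteed by this class:

```python
from euroeval.data_models import PromptConfig

prompt_config = PromptConfig(
    default_prompt_prefix="The following are texts and their sentiment.",
    default_prompt_template="Text: {text}\nSentiment: {label}",
    default_instruction_prompt="Text: {text}\n\nClassify the sentiment of the text.",
    default_prompt_label_mapping="auto",  # map each label to itself
)
```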