euroeval.data_models

source module euroeval.data_models

Data models used in EuroEval.

Classes

source dataclass MetricConfig(name: str, pretty_name: str, huggingface_id: str, results_key: str, compute_kwargs: dict[str, t.Any] = field(default_factory=dict), postprocessing_fn: c.Callable[[float], tuple[float, str]] = field(default_factory=lambda: lambda raw_score: (100 * raw_score, f'{raw_score:.2%}')))

Configuration for a metric.

Attributes

  • name : str The name of the metric.

  • pretty_name : str A longer, prettier name for the metric, which may contain capitalisation and spaces. Used for logging.

  • huggingface_id : str The Hugging Face ID of the metric.

  • results_key : str The name of the key used to extract the metric scores from the results dictionary.

  • compute_kwargs : dict[str, t.Any] Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.

  • postprocessing_fn : c.Callable[[float], tuple[float, str]] A function to apply to the metric score after it is computed, mapping the raw score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
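
A minimal sketch of constructing a MetricConfig. The Hugging Face ID, results key and the custom postprocessing function below are illustrative assumptions, not a metric configuration shipped with EuroEval:

```python
from euroeval.data_models import MetricConfig

# Illustrative accuracy metric; the Hugging Face ID and results key are assumptions.
accuracy_metric = MetricConfig(
    name="accuracy",
    pretty_name="Accuracy",
    huggingface_id="accuracy",
    results_key="accuracy",
    # Report the score as a percentage with one decimal instead of the default two.
    postprocessing_fn=lambda raw_score: (100 * raw_score, f"{raw_score:.1%}"),
)

# The postprocessing function maps a raw score to (postprocessed score, string).
score, score_str = accuracy_metric.postprocessing_fn(0.876)
print(score_str)  # '87.6%'
```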

source dataclass Language(code: str, name: str, _and_separator: str | None = field(repr=False, default=None), _or_separator: str | None = field(repr=False, default=None))

A benchmarkable language.

Attributes

  • code : str The ISO 639-1 language code of the language.

  • name : str The name of the language.

  • and_separator : optional The word 'and' in the language.

  • or_separator : optional The word 'or' in the language.
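
A short sketch of constructing a Language and reading its separator properties (documented below). The Danish values are illustrative:

```python
from euroeval.data_models import Language

# Illustrative values for Danish; the separators are the words 'and'/'or'
# in the language itself.
danish = Language(code="da", name="Danish", _and_separator="og", _or_separator="eller")

print(danish.code)           # 'da'
print(danish.and_separator)  # 'og'
print(danish.or_separator)   # 'eller'
```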

source property Language.and_separator: str

Get the word 'and' in the language.

Returns

  • str The word 'and' in the language.

Raises

source property Language.or_separator: str

Get the word 'or' in the language.

Returns

  • str The word 'or' in the language.

Raises

source dataclass Task(name: str, task_group: TaskGroup, template_dict: dict['Language', 'PromptConfig'], metrics: list[MetricConfig], default_num_few_shot_examples: int, default_max_generated_tokens: int, default_labels: list[str])

A dataset task.

Attributes

  • name : str The name of the task.

  • task_group : TaskGroup The task group of the task.

  • template_dict : dict['Language', 'PromptConfig'] The template dictionary for the task, from language to prompt template.

  • metrics : list[MetricConfig] The metrics used to evaluate the task.

  • default_num_few_shot_examples : int The default number of examples to use when benchmarking the task using few-shot evaluation. For a classification task, these will be drawn evenly from each label.

  • default_max_generated_tokens : int The default maximum number of tokens to generate when benchmarking the task using few-shot evaluation.

  • default_labels : list[str] The default labels for datasets using this task.
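
A sketch of a sentiment-classification Task, assembled from a MetricConfig and a per-language PromptConfig. The prompt texts, the Hugging Face metric ID, the import path euroeval.enums and the TaskGroup member name are assumptions made for illustration:

```python
from euroeval.data_models import Language, MetricConfig, PromptConfig, Task
from euroeval.enums import TaskGroup  # import path is an assumption

danish = Language(code="da", name="Danish")

mcc_metric = MetricConfig(
    name="mcc",
    pretty_name="Matthews Correlation Coefficient",
    huggingface_id="matthews_correlation",  # assumed Hugging Face ID
    results_key="matthews_correlation",
)

danish_prompts = PromptConfig(
    default_prompt_prefix="Følgende er tweets og deres sentiment.",
    default_prompt_template="Tweet: {text}\nSentiment: {label}",
    default_instruction_prompt="Tweet: {text}\n\nKlassificér sentimentet i tweetet.",
    default_prompt_label_mapping={
        "positive": "positiv", "neutral": "neutral", "negative": "negativ"
    },
)

sentiment_task = Task(
    name="sentiment-classification",
    task_group=TaskGroup.SEQUENCE_CLASSIFICATION,  # member name is an assumption
    template_dict={danish: danish_prompts},        # language -> prompt template
    metrics=[mcc_metric],
    default_num_few_shot_examples=12,
    default_max_generated_tokens=5,
    default_labels=["positive", "neutral", "negative"],
)
```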

source dataclass BenchmarkConfig(model_languages: list[Language], dataset_languages: list[Language], tasks: list[Task], datasets: list[str], batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, force: bool, progress_bar: bool, save_results: bool, device: torch.device, verbose: bool, trust_remote_code: bool, use_flash_attention: bool | None, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, api_base: str | None, api_version: str | None, debug: bool, run_with_cli: bool, only_allow_safetensors: bool)

General benchmarking configuration, across datasets and models.

Attributes

  • model_languages : list[Language] The languages of the models to benchmark.

  • dataset_languages : list[Language] The languages of the datasets in the benchmark.

  • tasks : list[Task] The tasks to benchmark the model(s) on.

  • datasets : list[str] The datasets to benchmark on.

  • batch_size : int The batch size to use.

  • raise_errors : bool Whether to raise errors instead of skipping them.

  • cache_dir : str Directory to store cached models and datasets.

  • api_key : str | None The API key to use for a given inference API.

  • force : bool Whether to force the benchmark to run even if the results are already cached.

  • progress_bar : bool Whether to show a progress bar.

  • save_results : bool Whether to save the benchmark results to 'euroeval_benchmark_results.json'.

  • device : torch.device The device to use for benchmarking.

  • verbose : bool Whether to print verbose output.

  • trust_remote_code : bool Whether to trust remote code when loading models from the Hugging Face Hub.

  • use_flash_attention : bool | None Whether to use Flash Attention. If None, Flash Attention will be used for generative models.

  • clear_model_cache : bool Whether to clear the model cache after benchmarking each model.

  • evaluate_test_split : bool Whether to evaluate on the test split.

  • few_shot : bool Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative.

  • num_iterations : int The number of iterations each model should be evaluated for.

  • api_base : str | None The base URL for a given inference API. Only relevant if model refers to a model on an inference API.

  • api_version : str | None The version of the API to use. Only relevant if model refers to a model on an inference API.

  • debug : bool Whether to run the benchmark in debug mode.

  • run_with_cli : bool Whether the benchmark is being run with the CLI.

  • only_allow_safetensors : bool Whether to only allow models that use the safetensors format.
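
A sketch of constructing a BenchmarkConfig by hand. In practice this object is typically assembled for you (e.g. from CLI arguments), and every value below is illustrative; the dataset name and the empty task list are placeholders:

```python
import torch

from euroeval.data_models import BenchmarkConfig, Language

danish = Language(code="da", name="Danish")

config = BenchmarkConfig(
    model_languages=[danish],
    dataset_languages=[danish],
    tasks=[],                   # in practice, a list of Task objects (see Task above)
    datasets=["angry-tweets"],  # dataset name is an assumption
    batch_size=32,
    raise_errors=False,
    cache_dir=".euroeval_cache",
    api_key=None,
    force=False,
    progress_bar=True,
    save_results=True,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    verbose=False,
    trust_remote_code=False,
    use_flash_attention=None,  # let the framework decide (used for generative models)
    clear_model_cache=False,
    evaluate_test_split=False,
    few_shot=True,
    num_iterations=10,
    api_base=None,
    api_version=None,
    debug=False,
    run_with_cli=False,
    only_allow_safetensors=False,
)
```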

source class BenchmarkConfigParams(**data: Any)

Bases : pydantic.BaseModel

The parameters for the benchmark configuration.

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Attributes

  • model_extra : dict[str, Any] | None Get extra fields set during validation.

  • model_fields_set : set[str] Returns the set of fields that have been explicitly set on this model instance.

source class BenchmarkResult(**data: Any)

Bases : pydantic.BaseModel

A benchmark result.

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Attributes

  • model_config : ClassVar[ConfigDict] Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

  • model_extra : dict[str, Any] | None Get extra fields set during validation.

  • model_fields_set : set[str] Returns the set of fields that have been explicitly set on this model instance.

Methods

  • from_dict Create a benchmark result from a dictionary.

  • append_to_results Append the benchmark result to the results file.

source classmethod BenchmarkResult.from_dict(config: dict) → BenchmarkResult

Create a benchmark result from a dictionary.

Parameters

  • config : dict The configuration dictionary.

Returns

  • BenchmarkResult The benchmark result created from the dictionary.

source method BenchmarkResult.append_to_results(results_path: pathlib.Path) → None

Append the benchmark result to the results file.

Parameters

  • results_path : pathlib.Path The path to the results file.

source dataclass DatasetConfig(name: str, pretty_name: str, huggingface_id: str, task: Task, languages: list[Language], _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: list[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, unofficial: bool = False)

Configuration for a dataset.

Attributes

  • name : str The name of the dataset. Must be lower case with no spaces.

  • pretty_name : str A longer, prettier name for the dataset, which may contain capitalisation and spaces. Used for logging.

  • huggingface_id : str The Hugging Face ID of the dataset.

  • task : Task The task of the dataset.

  • languages : list[Language] The languages of the entries in the dataset.

  • id2label : dict[int, str] The mapping from ID to label.

  • label2id : dict[str, int] The mapping from label to ID.

  • num_labels : int The number of labels in the dataset.

  • _prompt_prefix : optional The prefix to use in the few-shot prompt. Defaults to the template for the task and language.

  • _prompt_template : optional The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.

  • _instruction_prompt : optional The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.

  • _num_few_shot_examples : optional The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the task's default number of few-shot examples.

  • _max_generated_tokens : optional The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the task's default maximum number of generated tokens.

  • _labels : optional The labels in the dataset. Defaults to the task's default labels.

  • _prompt_label_mapping : optional A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.

  • unofficial : optional Whether the dataset is unofficial. Defaults to False.

  • prompt_prefix : str The prefix to use in the few-shot prompt.

  • prompt_template : str The template used during few-shot evaluation.

  • instruction_prompt : str The prompt to use when evaluating instruction-tuned models.

  • num_few_shot_examples : int The number of few-shot examples to use.

  • max_generated_tokens : int The maximum number of tokens to generate when evaluating a model.

  • labels : list[str] The labels in the dataset.

  • prompt_label_mapping : dict[str, str] Mapping from English labels to localised labels.

source property DatasetConfig.prompt_prefix: str

The prefix to use in the few-shot prompt.

source property DatasetConfig.prompt_template: str

The template used during few-shot evaluation.

source property DatasetConfig.instruction_prompt: str

The prompt to use when evaluating instruction-tuned models.

source property DatasetConfig.num_few_shot_examples: int

The number of few-shot examples to use.

source property DatasetConfig.max_generated_tokens: int

The maximum number of tokens to generate when evaluating a model.

source property DatasetConfig.labels: list[str]

The labels in the dataset.

source property DatasetConfig.prompt_label_mapping: dict[str, str]

Mapping from English labels to localised labels.

source property DatasetConfig.id2label: dict[int, str]

The mapping from ID to label.

source property DatasetConfig.label2id: dict[str, int]

The mapping from label to ID.

source property DatasetConfig.num_labels: int

The number of labels in the dataset.
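
A compact sketch showing how the derived label properties follow from the configured labels. It reuses the shape of the Task sketch above so that it stays self-contained; the dataset name, Hugging Face ID, prompt texts and TaskGroup member name are all assumptions:

```python
from euroeval.data_models import (
    DatasetConfig, Language, MetricConfig, PromptConfig, Task,
)
from euroeval.enums import TaskGroup  # import path is an assumption

danish = Language(code="da", name="Danish")
prompts = PromptConfig(
    default_prompt_prefix="Følgende er tweets og deres sentiment.",
    default_prompt_template="Tweet: {text}\nSentiment: {label}",
    default_instruction_prompt="Tweet: {text}\n\nKlassificér sentimentet i tweetet.",
    default_prompt_label_mapping="auto",  # map each label to itself
)
task = Task(
    name="sentiment-classification",
    task_group=TaskGroup.SEQUENCE_CLASSIFICATION,  # member name is an assumption
    template_dict={danish: prompts},
    metrics=[
        MetricConfig(
            name="mcc",
            pretty_name="Matthews Correlation Coefficient",
            huggingface_id="matthews_correlation",  # assumed Hugging Face ID
            results_key="matthews_correlation",
        )
    ],
    default_num_few_shot_examples=12,
    default_max_generated_tokens=5,
    default_labels=["positive", "neutral", "negative"],
)

dataset_config = DatasetConfig(
    name="angry-tweets",                     # dataset name is an assumption
    pretty_name="AngryTweets",
    huggingface_id="EuroEval/angry-tweets",  # Hugging Face ID is an assumption
    task=task,
    languages=[danish],
    _labels=["positive", "neutral", "negative"],
)

# The label mappings are derived from the configured labels.
print(dataset_config.num_labels)  # 3
print(dataset_config.id2label)    # maps integer IDs to the three labels above
print(dataset_config.label2id)    # the inverse mapping
```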

source dataclass ModelConfig(model_id: str, revision: str, task: str, languages: list[Language], inference_backend: InferenceBackend, merge: bool, model_type: ModelType, fresh: bool, model_cache_dir: str, adapter_base_model_id: str | None)

Configuration for a model.

Attributes

  • model_id : str The ID of the model.

  • revision : str The revision of the model.

  • task : str The task that the model was trained on.

  • languages : list[Language] The languages of the model.

  • inference_backend : InferenceBackend The backend used to perform inference with the model.

  • merge : bool Whether the model is a merged model.

  • model_type : ModelType The type of the model (e.g., encoder, base decoder, instruction tuned).

  • fresh : bool Whether the model is freshly initialised.

  • model_cache_dir : str The directory to cache the model in.

  • adapter_base_model_id : str | None The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.

source dataclass PreparedModelInputs(texts: list[str] | None = None, input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None)

The inputs to a model.

Attributes

  • texts : list[str] | None The texts to input to the model. Can be None if the input IDs and attention mask are provided instead.

  • input_ids : torch.Tensor | None The input IDs of the texts. Can be None if the texts are provided instead.

  • attention_mask : torch.Tensor | None The attention mask of the texts. Can be None if the texts are provided instead.
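
A short sketch of the two ways of populating PreparedModelInputs: either raw texts, or already-tokenised tensors. The texts and token IDs are placeholder values:

```python
import torch

from euroeval.data_models import PreparedModelInputs

# Inputs can be given as raw texts...
text_inputs = PreparedModelInputs(texts=["En glad tekst.", "En sur tekst."])

# ...or as already-tokenised tensors (toy token IDs for illustration).
tensor_inputs = PreparedModelInputs(
    input_ids=torch.tensor([[101, 2023, 102], [101, 2062, 102]]),
    attention_mask=torch.tensor([[1, 1, 1], [1, 1, 1]]),
)
```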

source dataclass GenerativeModelOutput(sequences: list[str], scores: list[list[list[tuple[str, float]]]] | None = None)

The output of a generative model.

Attributes

  • sequences : list[str] The generated sequences.

  • scores : list[list[list[tuple[str, float]]]] | None The scores of the sequences. This is an array of shape (batch_size, num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.

source dataclass SingleGenerativeModelOutput(sequence: str, scores: list[list[tuple[str, float]]] | None = None)

A single output of a generative model.

Attributes

  • sequence : str The generated sequence.

  • scores : list[list[tuple[str, float]]] | None The scores of the sequence. This is an array of shape (num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
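
A sketch of the nesting used by SingleGenerativeModelOutput and its batched counterpart GenerativeModelOutput: for each generated token, the scores hold a list of (candidate token, logprob) pairs. All tokens and logprobs below are illustrative:

```python
from euroeval.data_models import GenerativeModelOutput, SingleGenerativeModelOutput

# One generated sequence with per-token logprobs (illustrative values).
single = SingleGenerativeModelOutput(
    sequence="positiv",
    scores=[
        [("positiv", -0.1), ("negativ", -2.5)],  # candidates for the first token
    ],
)

# The batched output adds a leading batch dimension over sequences and scores.
batched = GenerativeModelOutput(
    sequences=[single.sequence, "negativ"],
    scores=[single.scores, [[("negativ", -0.2), ("positiv", -1.9)]]],
)
```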

source dataclass HFModelInfo(pipeline_tag: str, tags: list[str], adapter_base_model_id: str | None)

Information about a Hugging Face model.

Attributes

  • pipeline_tag : str The pipeline tag of the model.

  • tags : list[str] The other tags of the model.

  • adapter_base_model_id : str | None The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
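
A brief sketch of an HFModelInfo record; the pipeline tag and tags are illustrative values, not taken from a real Hub model:

```python
from euroeval.data_models import HFModelInfo

info = HFModelInfo(
    pipeline_tag="text-generation",
    tags=["pytorch", "da"],        # illustrative tags
    adapter_base_model_id=None,    # not an adapter model
)
```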

source dataclass PromptConfig(default_prompt_prefix: str, default_prompt_template: str, default_instruction_prompt: str, default_prompt_label_mapping: dict[str, str] | t.Literal['auto'])

Configuration for task-specific prompting across languages.

Defines the prompt templates needed for evaluating a specific task in a given language.

Attributes

  • default_prompt_prefix : str The default prefix to use in the few-shot prompt.

  • default_prompt_template : str The default template for the prompt to use when benchmarking the dataset using few-shot evaluation.

  • default_instruction_prompt : str The default prompt to use when benchmarking the dataset using instruction-based evaluation.

  • default_prompt_label_mapping : dict[str, str] | t.Literal['auto'] The default mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If set to "auto", the mapping will be set to a 1:1 mapping between the labels and themselves.
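
A sketch of a Danish sentiment-classification PromptConfig. All prompt texts and label names are illustrative; passing "auto" instead of the explicit mapping would map each label to itself:

```python
from euroeval.data_models import PromptConfig

danish_sentiment_prompts = PromptConfig(
    default_prompt_prefix="Følgende er tweets og deres sentiment.",
    default_prompt_template="Tweet: {text}\nSentiment: {label}",
    default_instruction_prompt="Tweet: {text}\n\nKlassificér sentimentet i tweetet.",
    default_prompt_label_mapping={
        "positive": "positiv",
        "neutral": "neutral",
        "negative": "negativ",
    },
)
```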