euroeval.data_models

source module euroeval.data_models

Data models used in EuroEval.

Classes

source dataclass Language(code: str, name: str, _and_separator: str | None = field(repr=False, default=None), _or_separator: str | None = field(repr=False, default=None))

A benchmarkable language.

Attributes

  • code : str The ISO 639-1 language code of the language.

  • name : str The name of the language.

  • and_separator : optional The word 'and' in the language.

  • or_separator : optional The word 'or' in the language.
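
For illustration, a Language can be constructed directly and its separator properties read back. This is a minimal sketch; the Danish values below are just example data:

```python
from euroeval.data_models import Language

# Danish, with its 'and'/'or' words supplied explicitly.
danish = Language(code="da", name="Danish", _and_separator="og", _or_separator="eller")

print(danish.code)           # 'da'
print(danish.and_separator)  # 'og'
print(danish.or_separator)   # 'eller'
```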

source property Language.and_separator: str

Get the word 'and' in the language.

Returns

  • str The word 'and' in the language.

Raises

source property Language.or_separator: str

Get the word 'or' in the language.

Returns

  • str The word 'or' in the language.

Raises

source dataclass Task(name: str, task_group: TaskGroup, template_dict: dict['Language', 'PromptConfig'], metrics: list[Metric], default_num_few_shot_examples: int, default_max_generated_tokens: int, default_labels: list[str], requires_zero_shot: bool = False, uses_structured_output: bool = False, uses_logprobs: bool = False, requires_logprobs: bool = False, allowed_model_types: list[ModelType] = field(default_factory=lambda: [ModelType.ENCODER, ModelType.GENERATIVE]), allowed_generative_types: list[GenerativeType] = field(default_factory=lambda: [GenerativeType.BASE, GenerativeType.INSTRUCTION_TUNED, GenerativeType.REASONING]))

A dataset task.

Attributes

  • name : str The name of the task.

  • task_group : TaskGroup The task group of the task.

  • template_dict : dict['Language', 'PromptConfig'] The template dictionary for the task, from language to prompt template.

  • metrics : list[Metric] The metrics used to evaluate the task.

  • default_num_few_shot_examples : int The default number of examples to use when benchmarking the task using few-shot evaluation. For a classification task, these will be drawn evenly from each label.

  • default_max_generated_tokens : int The default maximum number of tokens to generate when benchmarking the task using few-shot evaluation.

  • default_labels : list[str] The default labels for datasets using this task.

  • requires_zero_shot : optional Whether to only allow zero-shot evaluation for this task. If True, the task will not be evaluated using few-shot examples.

  • uses_structured_output : optional Whether the task uses structured output. If True, the task will return structured output (e.g., BIO tags for NER). Defaults to False.

  • uses_logprobs : optional Whether the task uses log probabilities. If True, the task will return log probabilities for the generated tokens. Defaults to False.

  • requires_logprobs : optional Whether the task requires log probabilities. Implies uses_logprobs.

  • allowed_model_types : optional A list of model types that are allowed to be evaluated on this task. Defaults to all model types being allowed.

  • allowed_generative_types : optional A list of generative model types that are allowed to be evaluated on this task. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models.
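
As a rough sketch, a custom Task pairs a PromptConfig with a Language in template_dict. The TaskGroup import path and member name, the prompt strings and the empty metrics list below are assumptions made purely for illustration, not verified values:

```python
from euroeval.data_models import Language, PromptConfig, Task
from euroeval.enums import TaskGroup  # import path and member name are assumptions

ENGLISH = Language(code="en", name="English", _and_separator="and", _or_separator="or")

sentiment_prompts = PromptConfig(
    default_prompt_prefix="The following are texts and their sentiment.",
    default_prompt_template="Text: {text}\nSentiment: {label}",
    default_instruction_prompt="Text: {text}\n\nClassify the sentiment of the text.",
    default_prompt_label_mapping="auto",  # map each label to itself
)

sentiment_task = Task(
    name="sentiment-classification",
    task_group=TaskGroup.SEQUENCE_CLASSIFICATION,  # assumed member name
    template_dict={ENGLISH: sentiment_prompts},
    metrics=[],  # a real task would list euroeval Metric objects here
    default_num_few_shot_examples=12,
    default_max_generated_tokens=5,
    default_labels=["negative", "neutral", "positive"],
)
```

In practice you would typically reuse one of the task definitions that ship with EuroEval rather than define one by hand; the sketch only illustrates the fields.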

source dataclass BenchmarkConfig(model_languages: list[Language], dataset_languages: list[Language], tasks: list[Task], datasets: list[str], batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, force: bool, progress_bar: bool, save_results: bool, device: torch.device, verbose: bool, trust_remote_code: bool, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, api_base: str | None, api_version: str | None, gpu_memory_utilization: float, debug: bool, run_with_cli: bool, requires_safetensors: bool)

General benchmarking configuration, across datasets and models.

Attributes

  • model_languages : list[Language] The languages of the models to benchmark.

  • dataset_languages : list[Language] The languages of the datasets in the benchmark.

  • tasks : list[Task] The tasks to benchmark the model(s) on.

  • datasets : list[str] The datasets to benchmark on.

  • batch_size : int The batch size to use.

  • raise_errors : bool Whether to raise errors instead of skipping them.

  • cache_dir : str Directory to store cached models and datasets.

  • api_key : str | None The API key to use for a given inference API.

  • force : bool Whether to force the benchmark to run even if the results are already cached.

  • progress_bar : bool Whether to show a progress bar.

  • save_results : bool Whether to save the benchmark results to 'euroeval_benchmark_results.json'.

  • device : torch.device The device to use for benchmarking.

  • verbose : bool Whether to print verbose output.

  • trust_remote_code : bool Whether to trust remote code when loading models from the Hugging Face Hub.

  • clear_model_cache : bool Whether to clear the model cache after benchmarking each model.

  • evaluate_test_split : bool Whether to evaluate on the test split.

  • few_shot : bool Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative.

  • num_iterations : int The number of iterations each model should be evaluated for.

  • api_base : str | None The base URL for a given inference API. Only relevant if model refers to a model on an inference API.

  • api_version : str | None The version of the API to use. Only relevant if model refers to a model on an inference API.

  • gpu_memory_utilization : float The GPU memory utilization to use for vLLM. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Only relevant if the model is generative.

  • debug : bool Whether to run the benchmark in debug mode.

  • run_with_cli : bool Whether the benchmark is being run with the CLI.

  • requires_safetensors : bool Whether to only allow models that use the safetensors format.
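
All fields of BenchmarkConfig are required, so constructing one by hand is verbose; in normal use it is presumably assembled for you from CLI or benchmarker options. The sketch below fills every field with placeholder values purely to illustrate the types (the dataset name and cache paths are made up):

```python
import torch
from euroeval.data_models import BenchmarkConfig, Language

english = Language(code="en", name="English")

config = BenchmarkConfig(
    model_languages=[english],
    dataset_languages=[english],
    tasks=[],                      # would hold Task objects
    datasets=["some-dataset"],     # placeholder dataset name
    batch_size=32,
    raise_errors=False,
    cache_dir=".euroeval_cache",
    api_key=None,
    force=False,
    progress_bar=True,
    save_results=True,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    verbose=False,
    trust_remote_code=False,
    clear_model_cache=False,
    evaluate_test_split=False,
    few_shot=True,
    num_iterations=10,
    api_base=None,
    api_version=None,
    gpu_memory_utilization=0.9,
    debug=False,
    run_with_cli=False,
    requires_safetensors=False,
)
```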

source class BenchmarkConfigParams(**data: Any)

Bases : pydantic.BaseModel

The parameters for the benchmark configuration.

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Attributes

  • model_extra : dict[str, Any] | None Get extra fields set during validation.

  • model_fields_set : set[str] Returns the set of fields that have been explicitly set on this model instance.

source class BenchmarkResult(**data: Any)

Bases : pydantic.BaseModel

A benchmark result.

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Attributes

  • model_config : ClassVar[ConfigDict] Configuration for the model, which should be a dictionary conforming to pydantic's ConfigDict.

  • model_extra : dict[str, Any] | None Get extra fields set during validation.

  • model_fields_set : set[str] Returns the set of fields that have been explicitly set on this model instance.

Methods

  • from_dict Create a benchmark result from a dictionary.

  • append_to_results Append the benchmark result to the results file.

source classmethod BenchmarkResult.from_dict(config: dict) → BenchmarkResult

Create a benchmark result from a dictionary.

Parameters

  • config : dict The configuration dictionary.

Returns

  • BenchmarkResult The benchmark result.

source method BenchmarkResult.append_to_results(results_path: pathlib.Path) → None

Append the benchmark result to the results file.

Parameters

  • results_path : pathlib.Path The path to the results file.

source dataclass DatasetConfig(name: str, pretty_name: str, huggingface_id: str, task: Task, languages: list[Language], _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: list[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, splits: list[str] = field(default_factory=lambda: ['train', 'val', 'test']), bootstrap_samples: bool = True, unofficial: bool = False)

Configuration for a dataset.

Attributes

  • name : str The name of the dataset. Must be lower case with no spaces.

  • pretty_name : str A longer, prettier name for the dataset, which may contain capital letters and spaces. Used for logging.

  • huggingface_id : str The Hugging Face ID of the dataset.

  • task : Task The task of the dataset.

  • languages : list[Language] The ISO 639-1 language codes of the entries in the dataset.

  • id2label : dict[int, str] The mapping from ID to label.

  • label2id : dict[str, int] The mapping from label to ID.

  • num_labels : int The number of labels in the dataset.

  • _prompt_prefix : optional The prefix to use in the few-shot prompt. Defaults to the template for the task and language.

  • _prompt_template : optional The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.

  • _instruction_prompt : optional The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.

  • _num_few_shot_examples : optional The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the template for the task and language.

  • _max_generated_tokens : optional The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.

  • _labels : optional The labels in the dataset. Defaults to the template for the task and language.

  • _prompt_label_mapping : optional A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.

  • splits : optional The names of the splits in the dataset. If not provided, defaults to ["train", "val", "test"].

  • bootstrap_samples : optional Whether to bootstrap the dataset samples. Defaults to True.

  • unofficial : optional Whether the dataset is unofficial. Defaults to False.

  • prompt_prefix : str The prefix to use in the few-shot prompt.

  • prompt_template : str The template used during few-shot evaluation.

  • instruction_prompt : str The prompt to use when evaluating instruction-tuned models.

  • num_few_shot_examples : int The number of few-shot examples to use.

  • max_generated_tokens : int The maximum number of tokens to generate when evaluating a model.

  • labels : list[str] The labels in the dataset.

  • prompt_label_mapping : dict[str, str] Mapping from English labels to localised labels.

source property DatasetConfig.prompt_prefix: str

The prefix to use in the few-shot prompt.

source property DatasetConfig.prompt_template: str

The template used during few-shot evaluation.

source property DatasetConfig.instruction_prompt: str

The prompt to use when evaluating instruction-tuned models.

source property DatasetConfig.num_few_shot_examples: int

The number of few-shot examples to use.

source property DatasetConfig.max_generated_tokens: int

The maximum number of tokens to generate when evaluating a model.

source property DatasetConfig.labels: list[str]

The labels in the dataset.

source property DatasetConfig.prompt_label_mapping: dict[str, str]

Mapping from English labels to localised labels.

source property DatasetConfig.id2label: dict[int, str]

The mapping from ID to label.

source property DatasetConfig.label2id: dict[str, int]

The mapping from label to ID.

source property DatasetConfig.num_labels: int

The number of labels in the dataset.
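
To illustrate how the underscore-prefixed overrides and the derived properties interact, here is a sketch that builds a DatasetConfig on top of a hand-made task. The TaskGroup import, the Hugging Face Hub ID and the prompt strings are placeholders and assumptions, not verified values:

```python
from euroeval.data_models import DatasetConfig, Language, PromptConfig, Task
from euroeval.enums import TaskGroup  # import path and member name are assumptions

english = Language(code="en", name="English", _and_separator="and", _or_separator="or")
prompts = PromptConfig(
    default_prompt_prefix="The following are texts and their sentiment.",
    default_prompt_template="Text: {text}\nSentiment: {label}",
    default_instruction_prompt="Text: {text}\n\nClassify the sentiment of the text.",
    default_prompt_label_mapping="auto",
)
sentiment_task = Task(
    name="sentiment-classification",
    task_group=TaskGroup.SEQUENCE_CLASSIFICATION,  # assumed member name
    template_dict={english: prompts},
    metrics=[],
    default_num_few_shot_examples=12,
    default_max_generated_tokens=5,
    default_labels=["negative", "neutral", "positive"],
)

config = DatasetConfig(
    name="my-sentiment-dataset",                   # lower case, no spaces
    pretty_name="My Sentiment Dataset",
    huggingface_id="my-org/my-sentiment-dataset",  # placeholder Hub ID
    task=sentiment_task,
    languages=[english],
    _num_few_shot_examples=4,                      # override the task default of 12
)

print(config.num_few_shot_examples)  # 4 (the override)
print(config.labels)                 # falls back to the task's default_labels
print(config.id2label)               # mapping derived from the labels
```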

source dataclass ModelConfig(model_id: str, revision: str, task: str, languages: list[Language], inference_backend: InferenceBackend, merge: bool, model_type: ModelType, fresh: bool, model_cache_dir: str, adapter_base_model_id: str | None)

Configuration for a model.

Attributes

  • model_id : str The ID of the model.

  • revision : str The revision of the model.

  • task : str The task that the model was trained on.

  • languages : list[Language] The languages of the model.

  • inference_backend : InferenceBackend The backend used to perform inference with the model.

  • merge : bool Whether the model is a merged model.

  • model_type : ModelType The type of the model (e.g., encoder, base decoder, instruction tuned).

  • fresh : bool Whether the model is freshly initialised.

  • model_cache_dir : str The directory to cache the model in.

  • adapter_base_model_id : str | None The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
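
ModelConfig instances are normally produced by EuroEval when it resolves a model ID, but as a sketch the fields can also be filled in directly. The InferenceBackend member name below is an assumption, while the ModelType member is taken from the Task signature above; the model ID and cache path are placeholders:

```python
from euroeval.data_models import Language, ModelConfig
from euroeval.enums import InferenceBackend, ModelType  # import path assumed

model_config = ModelConfig(
    model_id="my-org/my-model",            # placeholder Hugging Face model ID
    revision="main",
    task="text-generation",
    languages=[Language(code="en", name="English")],
    inference_backend=InferenceBackend.TRANSFORMERS,  # assumed member name
    merge=False,
    model_type=ModelType.GENERATIVE,
    fresh=False,
    model_cache_dir=".euroeval_cache/models",
    adapter_base_model_id=None,
)
```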

source dataclass PreparedModelInputs(texts: list[str] | None = None, input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None)

The inputs to a model.

Attributes

  • texts : list[str] | None The texts to input to the model. Can be None if the input IDs and attention mask are provided instead.

  • input_ids : torch.Tensor | None The input IDs of the texts. Can be None if the texts are provided instead.

  • attention_mask : torch.Tensor | None The attention mask of the texts. Can be None if the texts are provided instead.
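
The three fields are alternatives: either raw texts, or pre-tokenised tensors. A small sketch of both forms (the token IDs are arbitrary example values):

```python
import torch
from euroeval.data_models import PreparedModelInputs

# Either pass raw texts...
text_inputs = PreparedModelInputs(texts=["Hello world", "Goodbye world"])

# ...or already-tokenised tensors (arbitrary example IDs).
tensor_inputs = PreparedModelInputs(
    input_ids=torch.tensor([[101, 7592, 2088, 102]]),
    attention_mask=torch.ones(1, 4, dtype=torch.long),
)
```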

source dataclass GenerativeModelOutput(sequences: list[str], scores: list[list[list[tuple[str, float]]]] | None = None)

The output of a generative model.

Attributes

  • sequences : list[str] The generated sequences.

  • scores : list[list[list[tuple[str, float]]]] | None The scores of the sequences. This is an array of shape (batch_size, num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.
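
A sketch of how the scores nest: one list per sequence in the batch, one list per generated token, and for each token a list of (token, logprob) candidates. The values below are made up:

```python
from euroeval.data_models import GenerativeModelOutput

output = GenerativeModelOutput(
    sequences=["positive"],
    scores=[  # batch of 1 sequence
        [  # one entry per generated token
            [("positive", -0.11), ("negative", -2.30)],  # top-2 candidates for token 1
        ]
    ],
)
```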

source dataclass SingleGenerativeModelOutput(sequence: str, scores: list[list[tuple[str, float]]] | None = None)

A single output of a generative model.

Attributes

  • sequence : str The generated sequence.

  • scores : list[list[tuple[str, float]]] | None The scores of the sequence. This is an array of shape (num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available.

source dataclass HFModelInfo(pipeline_tag: str, tags: list[str], adapter_base_model_id: str | None)

Information about a Hugging Face model.

Attributes

  • pipeline_tag : str The pipeline tag of the model.

  • tags : list[str] The other tags of the model.

  • adapter_base_model_id : str | None The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.

source dataclass PromptConfig(default_prompt_prefix: str, default_prompt_template: str, default_instruction_prompt: str, default_prompt_label_mapping: dict[str, str] | t.Literal['auto'])

Configuration for task-specific prompting across languages.

Defines the prompt templates needed for evaluating a specific task in a given language.

Attributes

  • default_prompt_prefix : str The default prefix to use in the few-shot prompt.

  • default_prompt_template : str The default template for the prompt to use when benchmarking the dataset using few-shot evaluation.

  • default_instruction_prompt : str The default prompt to use when benchmarking the dataset using instruction-based evaluation.

  • default_prompt_label_mapping : dict[str, str] | t.Literal['auto'] The default mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If set to "auto", the mapping will be set to a 1:1 mapping between the labels and themselves.
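
As a sketch, a PromptConfig for a classification task in Danish might localise the labels via an explicit mapping instead of "auto". The prompt wording, placeholders and label names are illustrative only:

```python
from euroeval.data_models import PromptConfig

danish_sentiment_prompts = PromptConfig(
    default_prompt_prefix="Følgende er dokumenter og deres sentiment.",
    default_prompt_template="Dokument: {text}\nSentiment: {label}",
    default_instruction_prompt="Dokument: {text}\n\nKlassificér sentimentet i dokumentet.",
    default_prompt_label_mapping={
        "positive": "positiv",
        "neutral": "neutral",
        "negative": "negativ",
    },
)
```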