euroeval.data_models

source module euroeval.data_models

Data models used in EuroEval.

source get_package_version(package_name: str) → str | None

Get the version of a package.

Parameters

  • package_name : str The name of the package.

Returns

  • str | None The version of the package, or None if the package is not installed.

source dataclass PromptConfig(default_prompt_prefix: str, default_prompt_template: str, default_instruction_prompt: str, default_prompt_label_mapping: dict[str, str] | t.Literal['auto'])

Configuration for task-specific prompting across languages.

Defines the prompt templates needed for evaluating a specific task in a given language.

Attributes

  • default_prompt_prefix : str The default prefix to use in the few-shot prompt.

  • default_prompt_template : str The default template for the prompt to use when benchmarking the dataset using few-shot evaluation.

  • default_instruction_prompt : str The default prompt to use when benchmarking the dataset using instruction-based evaluation.

  • default_prompt_label_mapping : dict[str, str] | t.Literal['auto'] The default mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If set to "auto", the mapping will be set to a 1:1 mapping between the labels and themselves.

source dataclass Task(name: str, task_group: TaskGroup, template_dict: dict[Language, PromptConfig] | dict[tuple[Language, Language], PromptConfig], metrics: c.Sequence[Metric], default_num_few_shot_examples: int, default_max_generated_tokens: int, default_labels: c.Sequence[str] | None = tuple(), requires_zero_shot: bool = False, uses_structured_output: bool = False, uses_logprobs: bool = False, requires_logprobs: bool = False, default_allowed_model_types: c.Sequence[ModelType] = field(default_factory=lambda: [ModelType.ENCODER, ModelType.GENERATIVE]), default_allowed_generative_types: c.Sequence[GenerativeType] = field(default_factory=lambda: [GenerativeType.BASE, GenerativeType.INSTRUCTION_TUNED, GenerativeType.REASONING]), default_allow_invalid_model_outputs: bool = True)

A dataset task.

Attributes

  • name : str The name of the task.

  • task_group : TaskGroup The task group of the task.

  • template_dict : dict[Language, PromptConfig] | dict[tuple[Language, Language], PromptConfig] The template dictionary for the task, from language (or language tuples) to prompt template.

  • metrics : c.Sequence[Metric] The metrics used to evaluate the task.

  • default_num_few_shot_examples : int The default number of examples to use when benchmarking the task using few-shot evaluation. For a classification task, these will be drawn evenly from each label.

  • default_max_generated_tokens : int The default maximum number of tokens to generate when benchmarking the task using few-shot evaluation.

  • default_labels : optional The default labels for datasets using this task. Can be None if the labels should be set manually in the dataset configs. Defaults to an empty tuple.

  • requires_zero_shot : optional Whether to only allow zero-shot evaluation for this task. If True, the task will not be evaluated using few-shot examples.

  • uses_structured_output : optional Whether the task uses structured output. If True, the task will return structured output (e.g., BIO tags for NER). Defaults to False.

  • uses_logprobs : optional Whether the task uses log probabilities. If True, the task will return log probabilities for the generated tokens. Defaults to False.

  • requires_logprobs : optional Whether the task requires log probabilities. Implies uses_logprobs.

  • default_allowed_model_types : optional A list of model types that are allowed to be evaluated on this task. Defaults to all model types being allowed.

  • default_allowed_generative_types : optional A list of generative model types that are allowed to be evaluated on this task. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models.

  • default_allow_invalid_model_outputs : optional Whether to allow invalid model outputs. This is only relevant for generative models on classification tasks, where the model may generate an output which is not one of the allowed labels. If True, the model output will be mapped to the closest valid label. If False, the model output will be considered incorrect and the evaluation will be aborted. Defaults to True.

source class DatasetConfig(task: Task, languages: c.Sequence[Language], name: str | None = None, pretty_name: str | None = None, source: str | dict[str, str] | None = None, prompt_prefix: str | None = None, prompt_template: str | None = None, instruction_prompt: str | None = None, num_few_shot_examples: int | None = None, max_generated_tokens: int | None = None, labels: c.Sequence[str] | None = None, prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, allowed_model_types: c.Sequence[ModelType] | None = None, allowed_generative_types: c.Sequence[GenerativeType] | None = None, allow_invalid_model_outputs: bool | None = None, train_split: str | None = 'train', val_split: str | None = 'val', test_split: str = 'test', bootstrap_samples: bool = True, unofficial: bool = False, _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: c.Sequence[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, _allowed_model_types: c.Sequence[ModelType] | None = None, _allowed_generative_types: c.Sequence[GenerativeType] | None = None, _allow_invalid_model_outputs: bool | None = None, _logging_string: str | None = None)

Configuration for a dataset.

Initialise a DatasetConfig object.

Parameters

  • task : Task The task of the dataset.

  • languages : c.Sequence[Language] The ISO 639-1 language codes of the entries in the dataset.

  • name : optional The name of the dataset. Must be lower case with no spaces. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.

  • pretty_name : optional A longer prettier name for the dataset, which allows cases and spaces. Used for logging. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.

  • source : optional The source of the dataset, which can be a Hugging Face ID or a dictionary with keys "train", "val" and "test" mapping to local CSV file paths. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.

  • prompt_prefix : optional The prefix to use in the few-shot prompt. Defaults to the template for the task and language.

  • prompt_template : optional The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.

  • instruction_prompt : optional The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.

  • num_few_shot_examples : optional The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the template for the task and language.

  • max_generated_tokens : optional The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.

  • labels : optional The labels in the dataset. Defaults to the template for the task and language.

  • prompt_label_mapping : optional A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.

  • allowed_model_types : optional A list of model types that are allowed to be evaluated on this dataset. Defaults to the one for the task.

  • allowed_generative_types : optional A list of generative model types that are allowed to be evaluated on this dataset. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models. Defaults to the one for the task.

  • allow_invalid_model_outputs : optional Whether to allow invalid model outputs. This is only relevant for generative models on classification tasks, where the model may generate an output which is not one of the allowed labels. If True, the model output will be mapped to the closest valid label. If False, the model output will be considered incorrect and the evaluation will be aborted. Defaults to the one for the task.

  • train_split : optional The name of the split to use as the training set. Can be None if there is no training split in the dataset. Defaults to "train".

  • val_split : optional The name of the split to use as the validation set. Can be None if there is no validation split in the dataset. Defaults to "val".

  • test_split : optional The name of the split to use as the test set. Defaults to "test".

  • bootstrap_samples : optional Whether to bootstrap the dataset samples. Defaults to True.

  • unofficial : optional Whether the dataset is unofficial. Defaults to False.

  • _prompt_prefix : optional This argument is deprecated. Please use prompt_prefix instead.

  • _prompt_template : optional This argument is deprecated. Please use prompt_template instead.

  • _instruction_prompt : optional This argument is deprecated. Please use instruction_prompt instead.

  • _num_few_shot_examples : optional This argument is deprecated. Please use num_few_shot_examples instead.

  • _max_generated_tokens : optional This argument is deprecated. Please use max_generated_tokens instead.

  • _labels : optional This argument is deprecated. Please use labels instead.

  • _prompt_label_mapping : optional This argument is deprecated. Please use prompt_label_mapping instead.

  • _allowed_model_types : optional This argument is deprecated. Please use allowed_model_types instead.

  • _allowed_generative_types : optional This argument is deprecated. Please use allowed_generative_types instead.

  • _allow_invalid_model_outputs : optional This argument is deprecated. Please use allow_invalid_model_outputs instead.

  • _logging_string : optional This argument is deprecated. Please use logging_string instead.

Attributes

  • name : str The name of the dataset.

  • pretty_name : str The pretty name of the dataset.

  • source : str | dict[str, str] The source of the dataset.

  • logging_string : str The string used to describe evaluation on the dataset in logging.

  • main_language : Language | tuple[Language, Language] The main language of the dataset.

  • id2label : HashableDict The mapping from ID to label.

  • label2id : HashableDict The mapping from label to ID.

  • num_labels : int The number of labels in the dataset.
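The id2label, label2id and num_labels properties are related in the usual way: one is the inverse of the other, and num_labels is their common size. A minimal sketch with plain dicts (the real properties return HashableDict instances):

```python
labels = ["positive", "neutral", "negative"]

id2label = dict(enumerate(labels))                       # {0: 'positive', 1: 'neutral', 2: 'negative'}
label2id = {label: i for i, label in enumerate(labels)}  # inverse mapping
num_labels = len(labels)

# the two mappings are consistent with each other
assert all(label2id[label] == i for i, label in id2label.items())
```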

Methods

  • get_labels_str Converts a set of labels to a natural string in the specified language.

source property DatasetConfig.name: str

The name of the dataset.

Returns

  • str The name of the dataset.

Raises

  • ValueError If the name of the dataset is not set.

source property DatasetConfig.pretty_name: str

The pretty name of the dataset.

Returns

  • str The pretty name of the dataset.

Raises

  • ValueError If the pretty name of the dataset is not set.

source property DatasetConfig.source: str | dict[str, str]

The source of the dataset.

Returns

  • str | dict[str, str] The source of the dataset.

Raises

  • ValueError If the source of the dataset is not set.

source property DatasetConfig.logging_string: str

The string used to describe evaluation on the dataset in logging.

Returns

  • str The logging string.

source property DatasetConfig.main_language: Language | tuple[Language, Language]

Get the main language of the dataset.

Returns

  • Language | tuple[Language, Language] The main language of the dataset.

Raises

  • InvalidBenchmark If the dataset has no languages.

source property DatasetConfig.id2label: HashableDict

The mapping from ID to label.

source property DatasetConfig.label2id: HashableDict

The mapping from label to ID.

source property DatasetConfig.num_labels: int

The number of labels in the dataset.

source method DatasetConfig.get_labels_str(labels: c.Sequence[str] | None = None) → str

Converts a set of labels to a natural string in the specified language.

If the task is NER, we separate using 'and' and use the mapped labels instead of the BIO NER labels.

Parameters

  • labels : optional The labels to convert to a natural string. If None, uses all the labels in the dataset. Defaults to None.

Returns

  • str The natural string representation of the labels in the specified language.
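The natural-string behaviour can be illustrated with an English-only sketch; whether labels are quoted exactly like this is an assumption, and the real method additionally localises the conjunction and, for NER, substitutes the mapped labels:

```python
def labels_to_natural_string(labels: list[str], conjunction: str = "and") -> str:
    """Join labels into a natural enumeration, e.g. "'a', 'b' and 'c'"."""
    quoted = [f"'{label}'" for label in labels]
    if len(quoted) <= 1:
        return "".join(quoted)
    # comma-separate all but the last label, then join with the conjunction
    return ", ".join(quoted[:-1]) + f" {conjunction} " + quoted[-1]


print(labels_to_natural_string(["positive", "neutral", "negative"]))
# → 'positive', 'neutral' and 'negative'
```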

source dataclass BenchmarkConfig(datasets: c.Sequence[DatasetConfig], languages: c.Sequence[Language], finetuning_batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, api_base: str | None, api_version: str | None, progress_bar: bool, save_results: bool, device: torch.device, trust_remote_code: bool, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, gpu_memory_utilization: float, attention_backend: t.Literal[*ATTENTION_BACKENDS,], requires_safetensors: bool, generative_type: GenerativeType | None, download_only: bool, force: bool, verbose: bool, debug: bool, run_with_cli: bool)

General benchmarking configuration, across datasets and models.

Attributes

  • datasets : c.Sequence[DatasetConfig] The datasets to benchmark on.

  • finetuning_batch_size : int The batch size to use for finetuning.

  • raise_errors : bool Whether to raise errors instead of skipping them.

  • cache_dir : str Directory to store cached models and datasets.

  • api_key : str | None The API key to use for a given inference API.

  • api_base : str | None The base URL for a given inference API. Only relevant if model refers to a model on an inference API.

  • api_version : str | None The version of the API to use. Only relevant if model refers to a model on an inference API.

  • progress_bar : bool Whether to show a progress bar.

  • save_results : bool Whether to save the benchmark results to 'euroeval_benchmark_results.json'.

  • device : torch.device The device to use for benchmarking.

  • trust_remote_code : bool Whether to trust remote code when loading models from the Hugging Face Hub.

  • clear_model_cache : bool Whether to clear the model cache after benchmarking each model.

  • evaluate_test_split : bool Whether to evaluate on the test split.

  • few_shot : bool Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative.

  • num_iterations : int The number of iterations each model should be evaluated for.

  • gpu_memory_utilization : float The GPU memory utilization to use for vLLM. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Only relevant if the model is generative.

  • attention_backend : t.Literal[*ATTENTION_BACKENDS,] The attention backend to use for vLLM. Defaults to FLASHINFER. Only relevant if the model is generative.

  • requires_safetensors : bool Whether to only allow models that use the safetensors format.

  • generative_type : GenerativeType | None The type of generative model to benchmark. Only relevant if the model is generative.

  • download_only : bool Whether to only download the models, metrics and datasets without evaluating.

  • force : bool Whether to force the benchmark to run even if the results are already cached.

  • verbose : bool Whether to print verbose output.

  • debug : bool Whether to run the benchmark in debug mode.

  • run_with_cli : bool Whether the benchmark is being run with the CLI.

  • tasks : c.Sequence[Task] The tasks in the benchmark configuration.

source property BenchmarkConfig.tasks: c.Sequence[Task]

Get the tasks in the benchmark configuration.

source class BenchmarkConfigParams(**data: Any)

Bases : pydantic.BaseModel

The parameters for the benchmark configuration.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Attributes

  • model_extra : dict[str, Any] | None Get extra fields set during validation.

  • model_fields_set : set[str] Returns the set of fields that have been explicitly set on this model instance.

source class BenchmarkResult(**data: Any)

Bases : pydantic.BaseModel

A benchmark result.

Create a new model by parsing and validating input data from keyword arguments.

Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Attributes

  • model_config : ClassVar[ConfigDict] Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

  • model_extra : dict[str, Any] | None Get extra fields set during validation.

  • model_fields_set : set[str] Returns the set of fields that have been explicitly set on this model instance.

Methods

  • from_dict Create a benchmark result from a dictionary.

  • append_to_results Append the benchmark result to the results file.

source classmethod BenchmarkResult.from_dict(config: dict) → BenchmarkResult

Create a benchmark result from a dictionary.

Parameters

  • config : dict The configuration dictionary.

Returns

  • BenchmarkResult The benchmark result created from the dictionary.

source method BenchmarkResult.append_to_results(results_path: Path) → None

Append the benchmark result to the results file.

Parameters

  • results_path : Path The path to the results file.
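Appending a result to a shared file is commonly done as one JSON object per line (JSONL). The on-disk format EuroEval actually uses is an assumption here; this sketch only illustrates the append-to-file pattern:

```python
import json
from pathlib import Path


def append_result(result: dict, results_path: Path) -> None:
    """Append a single benchmark result to a JSONL results file."""
    results_path.parent.mkdir(parents=True, exist_ok=True)
    # opening in append mode preserves results from earlier runs
    with results_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(result) + "\n")
```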

source dataclass ModelConfig(model_id: str, revision: str, param: str | None, task: str, languages: c.Sequence[Language], inference_backend: InferenceBackend, merge: bool, model_type: ModelType, fresh: bool, model_cache_dir: str, adapter_base_model_id: str | None, generation_config: GenerationConfig | None = None)

Configuration for a model.

Attributes

  • model_id : str The ID of the model.

  • revision : str The revision of the model.

  • param : str | None The parameter of the model, or None if the model has no parameters.

  • task : str The task that the model was trained on.

  • languages : c.Sequence[Language] The languages of the model.

  • inference_backend : InferenceBackend The backend used to perform inference with the model.

  • merge : bool Whether the model is a merged model.

  • model_type : ModelType The type of the model (e.g., encoder, base decoder, instruction tuned).

  • fresh : bool Whether the model is freshly initialised.

  • model_cache_dir : str The directory to cache the model in.

  • adapter_base_model_id : str | None The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.

  • generation_config : optional The generation configuration for generative models, if specified in the model repository. Defaults to no generation configuration.

source dataclass PreparedModelInputs(texts: c.Sequence[str] | None = None, input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None)

The inputs to a model.

Attributes

  • texts : c.Sequence[str] | None The texts to input to the model. Can be None if the input IDs and attention mask are provided instead.

  • input_ids : torch.Tensor | None The input IDs of the texts. Can be None if the texts are provided instead.

  • attention_mask : torch.Tensor | None The attention mask of the texts. Can be None if the texts are provided instead.

source dataclass GenerativeModelOutput(sequences: c.Sequence[str], predicted_labels: c.Sequence | None = None, scores: c.Sequence[c.Sequence[c.Sequence[tuple[str, float]]]] | None = None, metadatas: list['HashableDict | None'] = field(default_factory=list))

The output of a generative model.

Attributes

  • sequences : c.Sequence[str] The generated sequences.

  • predicted_labels : optional The predicted labels from the sequences and sometimes also scores. Can be None if the labels have not been predicted yet. Defaults to None.

  • scores : optional The scores of the sequences. This is an array of shape (batch_size, num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available. Defaults to None.

  • metadatas : optional All the metadata fields for the samples, including ground truth labels (if applicable). Defaults to an empty list.
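The scores layout described above is nested sequences rather than a tensor, since the innermost elements are (token, logprob) pairs. A hand-built example for a batch of one sequence, two generated tokens, and top-2 logprobs per token (the token strings and values are made up):

```python
# scores[sample][token][k] == (token_string, logprob)
scores = [
    [  # first (and only) sample in the batch
        [("yes", -0.11), ("no", -2.30)],  # top-2 candidates for generated token 0
        [(".", -0.05), ("!", -3.10)],     # top-2 candidates for generated token 1
    ],
]

batch_size, num_tokens, num_logprobs = len(scores), len(scores[0]), len(scores[0][0])
top_token, top_logprob = scores[0][0][0]  # most likely first token
```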

source dataclass SingleGenerativeModelOutput(sequence: str, predicted_label: str | None = None, scores: c.Sequence[c.Sequence[tuple[str, float]]] | None = None, metadata: HashableDict | None = None)

A single output of a generative model.

Attributes

  • sequence : str The generated sequence.

  • predicted_label : optional The predicted label from the sequence and sometimes also scores. Can be None if the label has not been predicted yet. Defaults to None.

  • scores : optional The scores of the sequence. This is an array of shape (num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available. Defaults to None.

  • metadata : optional The metadata fields for the sample, including ground truth labels (if applicable). Can be None if the metadata is not available. Defaults to None.

source dataclass HFModelInfo(pipeline_tag: str, tags: c.Sequence[str], adapter_base_model_id: str | None)

Information about a Hugging Face model.

Attributes

  • pipeline_tag : str The pipeline tag of the model.

  • tags : c.Sequence[str] The other tags of the model.

  • adapter_base_model_id : str | None The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.

source dataclass ModelIdComponents(model_id: str, revision: str, param: str | None)

A model ID split into its components.

Attributes

  • model_id : str The main model ID without revision or parameters.

  • revision : str The revision of the model, if any.

  • param : str | None The parameter of the model, if any.

source class HashableDict()

Bases : dict

A hashable dictionary.
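A dict subclass becomes hashable by defining __hash__, which lets instances serve as dict keys or set members. A minimal sketch consistent with the description; the real implementation may differ, and this version assumes the dictionary's values are themselves hashable:

```python
class HashableDict(dict):
    """A dictionary that can be used as a dict key or set member."""

    def __hash__(self) -> int:
        # frozenset ignores insertion order, so equal dicts hash equally
        return hash(frozenset(self.items()))
```

Equal dictionaries then hash equally regardless of insertion order, which is what makes mappings such as id2label usable in hash-based contexts.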