euroeval.data_models
source module euroeval.data_models
Data models used in EuroEval.
Classes
-
PromptConfig — Configuration for task-specific prompting across languages.
-
Task — A dataset task.
-
DatasetConfig — Configuration for a dataset.
-
BenchmarkConfig — General benchmarking configuration, across datasets and models.
-
BenchmarkConfigParams — The parameters for the benchmark configuration.
-
BenchmarkResult — A benchmark result.
-
ModelConfig — Configuration for a model.
-
PreparedModelInputs — The inputs to a model.
-
GenerativeModelOutput — The output of a generative model.
-
SingleGenerativeModelOutput — A single output of a generative model.
-
HFModelInfo — Information about a Hugging Face model.
-
ModelIdComponents — A model ID split into its components.
-
HashableDict — A hashable dictionary.
Functions
-
get_package_version — Get the version of a package.
source get_package_version(package_name: str) → str | None
Get the version of a package.
Parameters
-
package_name : str — The name of the package.
Returns
-
str | None — The version of the package, or None if the package is not installed.
source dataclass PromptConfig(default_prompt_prefix: str, default_prompt_template: str, default_instruction_prompt: str, default_prompt_label_mapping: dict[str, str] | t.Literal['auto'])
Configuration for task-specific prompting across languages.
Defines the prompt templates needed for evaluating a specific task in a given language.
Attributes
-
default_prompt_prefix : str — The default prefix to use in the few-shot prompt.
-
default_prompt_template : str — The default template for the prompt to use when benchmarking the dataset using few-shot evaluation.
-
default_instruction_prompt : str — The default prompt to use when benchmarking the dataset using instruction-based evaluation.
-
default_prompt_label_mapping : dict[str, str] | t.Literal['auto'] — The default mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If set to "auto", the mapping will be set to a 1:1 mapping between the labels and themselves.
source dataclass Task(name: str, task_group: TaskGroup, template_dict: dict[Language, PromptConfig] | dict[tuple[Language, Language], PromptConfig], metrics: c.Sequence[Metric], default_num_few_shot_examples: int, default_max_generated_tokens: int, default_labels: c.Sequence[str] | None = tuple(), requires_zero_shot: bool = False, uses_structured_output: bool = False, uses_logprobs: bool = False, requires_logprobs: bool = False, default_allowed_model_types: c.Sequence[ModelType] = field(default_factory=lambda: [ModelType.ENCODER, ModelType.GENERATIVE]), default_allowed_generative_types: c.Sequence[GenerativeType] = field(default_factory=lambda: [GenerativeType.BASE, GenerativeType.INSTRUCTION_TUNED, GenerativeType.REASONING]), default_allow_invalid_model_outputs: bool = True)
A dataset task.
Attributes
-
name : str — The name of the task.
-
task_group : TaskGroup — The task group of the task.
-
template_dict : dict[Language, PromptConfig] | dict[tuple[Language, Language], PromptConfig] — The template dictionary for the task, from language (or language tuples) to prompt template.
-
metrics : c.Sequence[Metric] — The metrics used to evaluate the task.
-
default_num_few_shot_examples : int — The default number of examples to use when benchmarking the task using few-shot evaluation. For a classification task, these will be drawn evenly from each label.
-
default_max_generated_tokens : int — The default maximum number of tokens to generate when benchmarking the task using few-shot evaluation.
-
default_labels : optional — The default labels for datasets using this task. Can be None if the labels should be set manually in the dataset configs. Defaults to an empty tuple.
-
requires_zero_shot : optional — Whether to only allow zero-shot evaluation for this task. If True, the task will not be evaluated using few-shot examples.
-
uses_structured_output : optional — Whether the task uses structured output. If True, the task will return structured output (e.g., BIO tags for NER). Defaults to False.
-
uses_logprobs : optional — Whether the task uses log probabilities. If True, the task will return log probabilities for the generated tokens. Defaults to False.
-
requires_logprobs : optional — Whether the task requires log probabilities. Implies uses_logprobs.
-
default_allowed_model_types : optional — A list of model types that are allowed to be evaluated on this task. Defaults to all model types being allowed.
-
default_allowed_generative_types : optional — A list of generative model types that are allowed to be evaluated on this task. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models.
-
default_allow_invalid_model_outputs : optional — Whether to allow invalid model outputs. This is only relevant for generative models on classification tasks, where the model may generate an output which is not one of the allowed labels. If True, the model output will be mapped to the closest valid label. If False, the model output will be considered incorrect and the evaluation will be aborted. Defaults to True.
source class DatasetConfig(task: Task, languages: c.Sequence[Language], name: str | None = None, pretty_name: str | None = None, source: str | dict[str, str] | None = None, prompt_prefix: str | None = None, prompt_template: str | None = None, instruction_prompt: str | None = None, num_few_shot_examples: int | None = None, max_generated_tokens: int | None = None, labels: c.Sequence[str] | None = None, prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, allowed_model_types: c.Sequence[ModelType] | None = None, allowed_generative_types: c.Sequence[GenerativeType] | None = None, allow_invalid_model_outputs: bool | None = None, train_split: str | None = 'train', val_split: str | None = 'val', test_split: str = 'test', bootstrap_samples: bool = True, unofficial: bool = False, _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: c.Sequence[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, _allowed_model_types: c.Sequence[ModelType] | None = None, _allowed_generative_types: c.Sequence[GenerativeType] | None = None, _allow_invalid_model_outputs: bool | None = None, _logging_string: str | None = None)
Configuration for a dataset.
Initialise a DatasetConfig object.
Parameters
-
task : Task — The task of the dataset.
-
languages : c.Sequence[Language] — The ISO 639-1 language codes of the entries in the dataset.
-
name : optional — The name of the dataset. Must be lower case with no spaces. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.
-
pretty_name : optional — A longer, prettier name for the dataset, which may contain capital letters and spaces. Used for logging. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.
-
source : optional — The source of the dataset, which can be a Hugging Face ID or a dictionary with keys "train", "val" and "test" mapping to local CSV file paths. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.
-
prompt_prefix : optional — The prefix to use in the few-shot prompt. Defaults to the template for the task and language.
-
prompt_template : optional — The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
instruction_prompt : optional — The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.
-
num_few_shot_examples : optional — The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the template for the task and language.
-
max_generated_tokens : optional — The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
labels : optional — The labels in the dataset. Defaults to the template for the task and language.
-
prompt_label_mapping : optional — A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.
-
allowed_model_types : optional — A list of model types that are allowed to be evaluated on this dataset. Defaults to the one for the task.
-
allowed_generative_types : optional — A list of generative model types that are allowed to be evaluated on this dataset. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models. Defaults to the one for the task.
-
allow_invalid_model_outputs : optional — Whether to allow invalid model outputs. This is only relevant for generative models on classification tasks, where the model may generate an output which is not one of the allowed labels. If True, the model output will be mapped to the closest valid label. If False, the model output will be considered incorrect and the evaluation will be aborted. Defaults to the one for the task.
-
train_split : optional — The name of the split to use as the training set. Can be None if there is no training split in the dataset. Defaults to "train".
-
val_split : optional — The name of the split to use as the validation set. Can be None if there is no validation split in the dataset. Defaults to "val".
-
test_split : optional — The name of the split to use as the test set. Defaults to "test".
-
bootstrap_samples : optional — Whether to bootstrap the dataset samples. Defaults to True.
-
unofficial : optional — Whether the dataset is unofficial. Defaults to False.
-
_prompt_prefix : optional — This argument is deprecated. Please use prompt_prefix instead.
-
_prompt_template : optional — This argument is deprecated. Please use prompt_template instead.
-
_instruction_prompt : optional — This argument is deprecated. Please use instruction_prompt instead.
-
_num_few_shot_examples : optional — This argument is deprecated. Please use num_few_shot_examples instead.
-
_max_generated_tokens : optional — This argument is deprecated. Please use max_generated_tokens instead.
-
_labels : optional — This argument is deprecated. Please use labels instead.
-
_prompt_label_mapping : optional — This argument is deprecated. Please use prompt_label_mapping instead.
-
_allowed_model_types : optional — This argument is deprecated. Please use allowed_model_types instead.
-
_allowed_generative_types : optional — This argument is deprecated. Please use allowed_generative_types instead.
-
_allow_invalid_model_outputs : optional — This argument is deprecated. Please use allow_invalid_model_outputs instead.
-
_logging_string : optional — This argument is deprecated. Please use logging_string instead.
Attributes
-
name : str — The name of the dataset.
-
pretty_name : str — The pretty name of the dataset.
-
source : str | dict[str, str] — The source of the dataset.
-
logging_string : str — The string used to describe evaluation on the dataset in logging.
-
main_language : Language | tuple[Language, Language] — Get the main language of the dataset.
-
id2label : HashableDict — The mapping from ID to label.
-
label2id : HashableDict — The mapping from label to ID.
-
num_labels : int — The number of labels in the dataset.
Methods
-
get_labels_str — Converts a set of labels to a natural string, in the specified language.
source property DatasetConfig.name: str
The name of the dataset.
Returns
-
str — The name of the dataset.
Raises
-
ValueError — If the name of the dataset is not set.
source property DatasetConfig.pretty_name: str
The pretty name of the dataset.
Returns
-
str — The pretty name of the dataset.
Raises
-
ValueError — If the pretty name of the dataset is not set.
source property DatasetConfig.source: str | dict[str, str]
The source of the dataset.
Returns
-
str | dict[str, str] — The source of the dataset.
Raises
-
ValueError — If the source of the dataset is not set.
source property DatasetConfig.logging_string: str
The string used to describe evaluation on the dataset in logging.
Returns
-
str — The logging string.
source property DatasetConfig.main_language: Language | tuple[Language, Language]
Get the main language of the dataset.
Returns
-
Language | tuple[Language, Language] — The main language of the dataset.
Raises
-
InvalidBenchmark — If the dataset has no languages.
source property DatasetConfig.id2label: HashableDict
The mapping from ID to label.
source property DatasetConfig.label2id: HashableDict
The mapping from label to ID.
source property DatasetConfig.num_labels: int
The number of labels in the dataset.
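The three properties above follow the usual convention for label mappings, sketched here on an invented label set (illustrative only, not the EuroEval source):

```python
labels = ["negative", "neutral", "positive"]

# id2label maps each integer ID to its label; label2id is the inverse mapping.
id2label = {idx: label for idx, label in enumerate(labels)}
label2id = {label: idx for idx, label in enumerate(labels)}
num_labels = len(labels)
```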
source method DatasetConfig.get_labels_str(labels: c.Sequence[str] | None = None) → str
Converts a set of labels to a natural string, in the specified language.
If the task is NER, we separate using 'and' and use the mapped labels instead of the BIO NER labels.
Parameters
-
labels : optional — The labels to convert to a natural string. If None, uses all the labels in the dataset. Defaults to None.
Returns
-
str — The natural string representation of the labels in the specified language.
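The joining behaviour can be sketched for English (the separator word depends on the dataset language, and for NER the real method first maps the BIO labels as described above):

```python
def labels_to_natural_string(labels: list[str]) -> str:
    """Join labels into a natural-language enumeration, e.g. 'a, b and c'."""
    if not labels:
        return ""
    if len(labels) == 1:
        return labels[0]
    # All but the last label are comma-separated; the last is joined with 'and'.
    return f"{', '.join(labels[:-1])} and {labels[-1]}"
```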
source dataclass BenchmarkConfig(datasets: c.Sequence[DatasetConfig], languages: c.Sequence[Language], finetuning_batch_size: int, raise_errors: bool, cache_dir: str, api_key: str | None, api_base: str | None, api_version: str | None, progress_bar: bool, save_results: bool, device: torch.device, trust_remote_code: bool, clear_model_cache: bool, evaluate_test_split: bool, few_shot: bool, num_iterations: int, gpu_memory_utilization: float, attention_backend: t.Literal[*ATTENTION_BACKENDS,], requires_safetensors: bool, generative_type: GenerativeType | None, download_only: bool, force: bool, verbose: bool, debug: bool, run_with_cli: bool)
General benchmarking configuration, across datasets and models.
Attributes
-
datasets : c.Sequence[DatasetConfig] — The datasets to benchmark on.
-
finetuning_batch_size : int — The batch size to use for finetuning.
-
raise_errors : bool — Whether to raise errors instead of skipping them.
-
cache_dir : str — Directory to store cached models and datasets.
-
api_key : str | None — The API key to use for a given inference API.
-
api_base : str | None — The base URL for a given inference API. Only relevant if model refers to a model on an inference API.
-
api_version : str | None — The version of the API to use. Only relevant if model refers to a model on an inference API.
-
progress_bar : bool — Whether to show a progress bar.
-
save_results : bool — Whether to save the benchmark results to 'euroeval_benchmark_results.json'.
-
device : torch.device — The device to use for benchmarking.
-
trust_remote_code : bool — Whether to trust remote code when loading models from the Hugging Face Hub.
-
clear_model_cache : bool — Whether to clear the model cache after benchmarking each model.
-
evaluate_test_split : bool — Whether to evaluate on the test split.
-
few_shot : bool — Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative.
-
num_iterations : int — The number of iterations each model should be evaluated for.
-
gpu_memory_utilization : float — The GPU memory utilization to use for vLLM. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Only relevant if the model is generative.
-
attention_backend : t.Literal[*ATTENTION_BACKENDS,] — The attention backend to use for vLLM. Defaults to FLASHINFER. Only relevant if the model is generative.
-
requires_safetensors : bool — Whether to only allow models that use the safetensors format.
-
generative_type : GenerativeType | None — The type of generative model to benchmark. Only relevant if the model is generative.
-
download_only : bool — Whether to only download the models, metrics and datasets without evaluating.
-
force : bool — Whether to force the benchmark to run even if the results are already cached.
-
verbose : bool — Whether to print verbose output.
-
debug : bool — Whether to run the benchmark in debug mode.
-
run_with_cli : bool — Whether the benchmark is being run with the CLI.
-
tasks : c.Sequence[Task] — Get the tasks in the benchmark configuration.
source property BenchmarkConfig.tasks: c.Sequence[Task]
Get the tasks in the benchmark configuration.
source class BenchmarkConfigParams(**data: Any)
Bases : pydantic.BaseModel
The parameters for the benchmark configuration.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Attributes
-
model_extra : dict[str, Any] | None — Get extra fields set during validation.
-
model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
source class BenchmarkResult(**data: Any)
Bases : pydantic.BaseModel
A benchmark result.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Attributes
-
model_config : ClassVar[ConfigDict] — Configuration for the model, which should be a dictionary conforming to pydantic's ConfigDict.
-
model_extra : dict[str, Any] | None — Get extra fields set during validation.
-
model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
Methods
-
from_dict — Create a benchmark result from a dictionary.
-
append_to_results — Append the benchmark result to the results file.
source classmethod BenchmarkResult.from_dict(config: dict) → BenchmarkResult
Create a benchmark result from a dictionary.
Parameters
-
config : dict — The configuration dictionary.
Returns
-
BenchmarkResult — The benchmark result.
source method BenchmarkResult.append_to_results(results_path: Path) → None
Append the benchmark result to the results file.
Parameters
-
results_path : Path — The path to the results file.
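One plausible shape of such an append, sketched as JSON lines (the actual on-disk format of EuroEval's results file is not specified here, and the record fields are invented for the example):

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory


def append_to_results(record: dict, results_path: Path) -> None:
    """Append one benchmark result as a JSON line to the results file."""
    results_path.parent.mkdir(parents=True, exist_ok=True)
    with results_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


with TemporaryDirectory() as tmp:
    path = Path(tmp) / "euroeval_benchmark_results.jsonl"
    append_to_results({"model": "example-model", "dataset": "example-dataset"}, path)
    append_to_results({"model": "example-model", "dataset": "other-dataset"}, path)
    lines = path.read_text(encoding="utf-8").splitlines()
```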
source dataclass ModelConfig(model_id: str, revision: str, param: str | None, task: str, languages: c.Sequence[Language], inference_backend: InferenceBackend, merge: bool, model_type: ModelType, fresh: bool, model_cache_dir: str, adapter_base_model_id: str | None, generation_config: GenerationConfig | None = None)
Configuration for a model.
Attributes
-
model_id : str — The ID of the model.
-
revision : str — The revision of the model.
-
param : str | None — The parameter of the model, or None if the model has no parameters.
-
task : str — The task that the model was trained on.
-
languages : c.Sequence[Language] — The languages of the model.
-
inference_backend : InferenceBackend — The backend used to perform inference with the model.
-
merge : bool — Whether the model is a merged model.
-
model_type : ModelType — The type of the model (e.g., encoder, base decoder, instruction tuned).
-
fresh : bool — Whether the model is freshly initialised.
-
model_cache_dir : str — The directory to cache the model in.
-
adapter_base_model_id : str | None — The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
-
generation_config : optional — The generation configuration for generative models, if specified in the model repository. Defaults to no generation configuration.
source dataclass PreparedModelInputs(texts: c.Sequence[str] | None = None, input_ids: torch.Tensor | None = None, attention_mask: torch.Tensor | None = None)
The inputs to a model.
Attributes
-
texts : c.Sequence[str] | None — The texts to input to the model. Can be None if the input IDs and attention mask are provided instead.
-
input_ids : torch.Tensor | None — The input IDs of the texts. Can be None if the texts are provided instead.
-
attention_mask : torch.Tensor | None — The attention mask of the texts. Can be None if the texts are provided instead.
source dataclass GenerativeModelOutput(sequences: c.Sequence[str], predicted_labels: c.Sequence | None = None, scores: c.Sequence[c.Sequence[c.Sequence[tuple[str, float]]]] | None = None, metadatas: list['HashableDict | None'] = field(default_factory=list))
The output of a generative model.
Attributes
-
sequences : c.Sequence[str] — The generated sequences.
-
predicted_labels : optional — The predicted labels from the sequences, and sometimes also the scores. Can be None if the labels have not been predicted yet. Defaults to None.
-
scores : optional — The scores of the sequences. This is an array of shape (batch_size, num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available. Defaults to None.
-
metadatas : optional — All the metadata fields for the samples, including ground truth labels (if applicable). Defaults to an empty list.
source dataclass SingleGenerativeModelOutput(sequence: str, predicted_label: str | None = None, scores: c.Sequence[c.Sequence[tuple[str, float]]] | None = None, metadata: HashableDict | None = None)
A single output of a generative model.
Attributes
-
sequence : str — The generated sequence.
-
predicted_label : optional — The predicted label from the sequence, and sometimes also the scores. Can be None if the label has not been predicted yet. Defaults to None.
-
scores : optional — The scores of the sequence. This is an array of shape (num_tokens, num_logprobs, 2), where the last dimension contains the token and its logprob. Can be None if the scores are not available. Defaults to None.
-
metadata : optional — The metadata fields for the sample, including ground truth labels (if applicable). Can be None if the metadata is not available. Defaults to None.
source dataclass HFModelInfo(pipeline_tag: str, tags: c.Sequence[str], adapter_base_model_id: str | None)
Information about a Hugging Face model.
Attributes
-
pipeline_tag : str — The pipeline tag of the model.
-
tags : c.Sequence[str] — The other tags of the model.
-
adapter_base_model_id : str | None — The model ID of the base model if the model is an adapter model. Can be None if the model is not an adapter model.
source dataclass ModelIdComponents(model_id: str, revision: str, param: str | None)
A model ID split into its components.
Attributes
-
model_id : str — The main model ID without revision or parameters.
-
revision : str — The revision of the model, if any.
-
param : str | None — The parameter of the model, if any.
source class HashableDict()
Bases : dict
A hashable dictionary.
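A dict becomes hashable once it defines __hash__; a common sketch of the idea (assuming all values are themselves hashable, which the real class may handle differently):

```python
class HashableDict(dict):
    """A dictionary that can be hashed, e.g. used as a set member or cache key."""

    def __hash__(self) -> int:
        # A frozenset of the items is order-independent, so equal dicts
        # produce equal hashes regardless of insertion order.
        return hash(frozenset(self.items()))


d1 = HashableDict({"a": 1, "b": 2})
d2 = HashableDict({"b": 2, "a": 1})
```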