euroeval

source package euroeval

EuroEval - A benchmarking framework for language models.

Classes

Functions

source class Benchmarker(progress_bar: bool = True, save_results: bool = True, task: str | Task | c.Sequence[str | Task] | None = None, dataset: str | DatasetConfig | c.Sequence[str | DatasetConfig] | None = None, language: str | c.Sequence[str] = 'all', device: Device | None = None, finetuning_batch_size: int = 32, raise_errors: bool = False, cache_dir: str = '.euroeval_cache', api_key: str | None = None, force: bool = False, verbose: bool = False, trust_remote_code: bool = False, clear_model_cache: bool = False, evaluate_test_split: bool = False, few_shot: bool = True, num_iterations: int = 10, api_base: str | None = None, api_version: str | None = None, gpu_memory_utilization: float = 0.8, attention_backend: t.Literal[*ATTENTION_BACKENDS,] = 'FLASHINFER', generative_type: GenerativeType | None = None, custom_datasets_file: Path | str = Path('custom_datasets.py'), debug: bool = False, run_with_cli: bool = False, requires_safetensors: bool = False, download_only: bool = False, model_language: str | c.Sequence[str] | None = None, dataset_language: str | c.Sequence[str] | None = None, batch_size: int | None = None)

A class for benchmarking language models.

Initialise the benchmarker.

Attributes

  • benchmark_config_default_params The default parameters for the benchmark configuration.

  • benchmark_config The benchmark configuration.

  • force Whether to force evaluations of models, even if they have been benchmarked already.

  • results_path The path to the results file.

  • benchmark_results : c.Sequence[BenchmarkResult] The benchmark results.

Parameters

  • progress_bar : bool Whether progress bars should be shown. Defaults to True.

  • save_results : bool Whether to save the benchmark results to 'euroeval_benchmark_results.jsonl'. Defaults to True.

  • task : str | Task | c.Sequence[str | Task] | None The tasks to benchmark the model(s) on. Mutually exclusive with dataset. If both task and dataset are None then all datasets will be benchmarked.

  • dataset : str | DatasetConfig | c.Sequence[str | DatasetConfig] | None The datasets to benchmark on. Mutually exclusive with task. If both task and dataset are None then all datasets will be benchmarked.

  • language : str | c.Sequence[str] The language codes of the languages to include, both for models and datasets. Set this to 'all' if all languages should be considered. Defaults to "all".

  • device : Device | None The device to use for benchmarking. Defaults to None.

  • finetuning_batch_size : int The batch size to use when finetuning. Defaults to 32.

  • raise_errors : bool Whether to raise errors instead of skipping the model evaluation. Defaults to False.

  • cache_dir : str Directory to store cached models. Defaults to '.euroeval_cache'.

  • api_key : str | None The API key to use for a given inference API.

  • force : bool Whether to force evaluations of models, even if they have been benchmarked already. Defaults to False.

  • verbose : bool Whether to produce additional logging output. This is automatically set if debug is True. Defaults to False.

  • trust_remote_code : bool Whether to trust remote code when loading models. Defaults to False.

  • clear_model_cache : bool Whether to clear the model cache after benchmarking each model. Defaults to False.

  • evaluate_test_split : bool Whether to evaluate the test split of the datasets. Defaults to False.

  • few_shot : bool Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative. Defaults to True.

  • num_iterations : int The number of times each model should be evaluated. This is only intended for power users, and scores will not be allowed on the leaderboards if this is changed. Defaults to 10.

  • api_base : str | None The base URL for a given inference API. Only relevant if model refers to a model on an inference API. Defaults to None.

  • api_version : str | None The version of the API to use. Defaults to None.

  • gpu_memory_utilization : float The GPU memory utilization to use for vLLM. Only relevant if the model is generative. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Defaults to 0.8.

  • attention_backend : t.Literal[*ATTENTION_BACKENDS,] The attention backend to use for vLLM. Defaults to FLASHINFER. Only relevant if the model is generative.

  • generative_type : GenerativeType | None The type of generative model to benchmark. Only relevant if the model is generative. If not specified, then the type will be inferred based on the tags of the model. Defaults to None.

  • custom_datasets_file : Path | str Path to a Python file defining custom datasets. Defaults to 'custom_datasets.py'.

  • debug : bool Whether to output debug information. Defaults to False.

  • run_with_cli : bool Whether the benchmarker is being run from the command-line interface. Defaults to False.

  • requires_safetensors : bool Whether to only allow models that use the safetensors format. Defaults to False.

  • download_only : bool Whether to only download models and datasets without performing any benchmarking. Defaults to False.

  • model_language : str | c.Sequence[str] | None Deprecated argument. Please use language instead.

  • dataset_language : str | c.Sequence[str] | None Deprecated argument. Please use language instead.

  • batch_size : int | None Deprecated argument. Please use finetuning_batch_size instead.

Raises

  • ValueError If both task and dataset are specified, or if download_only is True and there is no internet connection.

  • ImportError If hf_transfer is enabled but not installed.

Methods

  • benchmark Benchmarks models on datasets.
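
The sketch below shows how a Benchmarker might be initialised; the language code and the option values are illustrative assumptions rather than recommendations.

```python
# A minimal sketch of initialising a Benchmarker. The language code and option
# values are illustrative assumptions, not recommendations.
from euroeval import Benchmarker

benchmarker = Benchmarker(
    language="da",              # assumed language code: only consider Danish models and datasets
    evaluate_test_split=False,  # evaluate on the validation split
    save_results=True,          # append results to 'euroeval_benchmark_results.jsonl'
    num_iterations=10,          # the default; changing it invalidates leaderboard eligibility
)
```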

source property Benchmarker.benchmark_results: c.Sequence[BenchmarkResult]

The benchmark results.

Returns

  • c.Sequence[BenchmarkResult] The benchmark results.

Raises

  • ValueError If there is an error decoding a line in the results file.
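
As a small illustration (assuming the benchmarker instance from the sketch above and an existing results file at the default path), previously saved results can be read back through this property:

```python
# Read back previously saved benchmark results from the results file.
for result in benchmarker.benchmark_results:
    print(result)
```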

source method Benchmarker.benchmark(model: c.Sequence[str] | str, task: str | Task | c.Sequence[str | Task] | None = None, dataset: str | DatasetConfig | c.Sequence[str | DatasetConfig] | None = None, progress_bar: bool | None = None, save_results: bool | None = None, language: str | c.Sequence[str] | None = None, device: Device | None = None, finetuning_batch_size: int | None = None, raise_errors: bool | None = None, cache_dir: str | None = None, api_key: str | None = None, api_base: str | None = None, api_version: str | None = None, trust_remote_code: bool | None = None, clear_model_cache: bool | None = None, evaluate_test_split: bool | None = None, few_shot: bool | None = None, num_iterations: int | None = None, requires_safetensors: bool | None = None, download_only: bool | None = None, gpu_memory_utilization: float | None = None, generative_type: GenerativeType | None = None, attention_backend: t.Literal[*ATTENTION_BACKENDS,] | None = None, custom_datasets_file: Path | str | None = None, force: bool | None = None, verbose: bool | None = None, debug: bool | None = None, model_language: str | c.Sequence[str] | None = None, dataset_language: str | c.Sequence[str] | None = None, batch_size: int | None = None) -> c.Sequence[BenchmarkResult]

Benchmarks models on datasets.

Parameters

  • model : c.Sequence[str] | str The full Hugging Face Hub path(s) to the pretrained transformer model. The specific model version to use can be added after the suffix '@': "model@v1.0.0". It can be a branch name, a tag name, or a commit id, and defaults to the latest version if not specified.

  • task : str | Task | c.Sequence[str | Task] | None The tasks to benchmark the model(s) on. Mutually exclusive with dataset. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.

  • dataset : str | DatasetConfig | c.Sequence[str | DatasetConfig] | None The datasets to benchmark on. Mutually exclusive with task. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.

  • progress_bar : bool | None Whether progress bars should be shown. Defaults to the value specified when initialising the benchmarker.

  • save_results : bool | None Whether to save the benchmark results to 'euroeval_benchmark_results.jsonl'. Defaults to the value specified when initialising the benchmarker.

  • language : str | c.Sequence[str] | None The language codes of the languages to include, both for models and datasets. Here 'no' means both Bokmål (nb) and Nynorsk (nn). Set this to 'all' if all languages should be considered. Defaults to the value specified when initialising the benchmarker.

  • device : Device | None The device to use for benchmarking. Defaults to the value specified when initialising the benchmarker.

  • finetuning_batch_size : int | None The batch size to use for finetuning. Defaults to the value specified when initialising the benchmarker.

  • raise_errors : bool | None Whether to raise errors instead of skipping the model evaluation. Defaults to the value specified when initialising the benchmarker.

  • cache_dir : str | None Directory to store cached models. Defaults to the value specified when initialising the benchmarker.

  • api_key : str | None The API key to use for a given inference server. Defaults to the value specified when initialising the benchmarker.

  • api_base : str | None The base URL for a given inference API. Only relevant if model refers to a model on an inference API. Defaults to the value specified when initialising the benchmarker.

  • api_version : str | None The version of the API to use. Defaults to the value specified when initialising the benchmarker.

  • trust_remote_code : bool | None Whether to trust remote code when loading models. Defaults to the value specified when initialising the benchmarker.

  • clear_model_cache : bool | None Whether to clear the model cache after benchmarking each model. Defaults to the value specified when initialising the benchmarker.

  • evaluate_test_split : bool | None Whether to evaluate the test split of the datasets. Defaults to the value specified when initialising the benchmarker.

  • few_shot : bool | None Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative. Defaults to the value specified when initialising the benchmarker.

  • num_iterations : int | None The number of times each model should be evaluated. This is only intended for power users, and scores will not be allowed on the leaderboards if this is changed. Defaults to the value specified when initialising the benchmarker.

  • requires_safetensors : bool | None Whether to only allow models that use the safetensors format. Defaults to the value specified when initialising the benchmarker.

  • download_only : bool | None Whether to only download the models without evaluating them. Defaults to the value specified when initialising the benchmarker.

  • gpu_memory_utilization : float | None The GPU memory utilization to use for vLLM. Only relevant if the model is generative. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Defaults to the value specified when initialising the benchmarker.

  • generative_type : GenerativeType | None The type of generative model to benchmark. Only relevant if the model is generative. If not specified, then the type will be inferred based on the tags of the model. Defaults to the value specified when initialising the benchmarker.

  • custom_datasets_file : Path | str | None Path to a Python file defining custom datasets. Defaults to the value specified when initialising the benchmarker.

  • force : bool | None Whether to force evaluations of models, even if they have been benchmarked already. Defaults to the value specified when initialising the benchmarker.

  • verbose : bool | None Whether to produce additional logging output. Defaults to the value specified when initialising the benchmarker.

  • debug : bool | None Whether to output debug information. Defaults to the value specified when initialising the benchmarker.

  • model_language : str | c.Sequence[str] | None Deprecated argument. Please use language instead.

  • dataset_language : str | c.Sequence[str] | None Deprecated argument. Please use language instead.

  • batch_size : int | None Deprecated argument. Please use finetuning_batch_size instead.

Returns

  • c.Sequence[BenchmarkResult] The benchmark results.

Raises

  • ValueError If both task and dataset are specified.

  • InvalidModel If a specified model is invalid.
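
The sketch below shows a hedged example of calling benchmark; the model ID and dataset name are placeholders for illustration, and the '@main' suffix uses the revision syntax described for the model parameter.

```python
# A hedged sketch of running an evaluation; the model ID and dataset name are
# placeholders, not real identifiers.
results = benchmarker.benchmark(
    model="my-org/my-model@main",  # '@main' pins a branch, tag or commit, per the `model` parameter
    dataset="angry-tweets",        # assumed dataset name, for illustration only
)
for result in results:
    print(result)
```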

source class DatasetConfig(task: Task, languages: c.Sequence[Language], name: str | None = None, pretty_name: str | None = None, source: str | dict[str, str] | None = None, prompt_prefix: str | None = None, prompt_template: str | None = None, instruction_prompt: str | None = None, num_few_shot_examples: int | None = None, max_generated_tokens: int | None = None, labels: c.Sequence[str] | None = None, prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, allowed_model_types: c.Sequence[ModelType] | None = None, allowed_generative_types: c.Sequence[GenerativeType] | None = None, allow_invalid_model_outputs: bool | None = None, train_split: str | None = 'train', val_split: str | None = 'val', test_split: str = 'test', bootstrap_samples: bool = True, unofficial: bool = False, _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: c.Sequence[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, _allowed_model_types: c.Sequence[ModelType] | None = None, _allowed_generative_types: c.Sequence[GenerativeType] | None = None, _allow_invalid_model_outputs: bool | None = None, _logging_string: str | None = None)

Configuration for a dataset.

Initialise a DatasetConfig object.

Parameters

  • task : Task The task of the dataset.

  • languages : c.Sequence[Language] The ISO 639-1 language codes of the entries in the dataset.

  • name : optional The name of the dataset. Must be lower case with no spaces. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.

  • pretty_name : optional A longer, prettier name for the dataset, which may contain capital letters and spaces. Used for logging. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.

  • source : optional The source of the dataset, which can be a Hugging Face ID or a dictionary with keys "train", "val" and "test" mapping to local CSV file paths. Can be None if and only if the dataset config resides directly in the Hugging Face dataset repo. Defaults to None.

  • prompt_prefix : optional The prefix to use in the few-shot prompt. Defaults to the template for the task and language.

  • prompt_template : optional The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.

  • instruction_prompt : optional The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.

  • num_few_shot_examples : optional The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the value defined in the template for the task and language.

  • max_generated_tokens : optional The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the value defined in the template for the task and language.

  • labels : optional The labels in the dataset. Defaults to the labels defined in the template for the task and language.

  • prompt_label_mapping : optional A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.

  • allowed_model_types : optional A list of model types that are allowed to be evaluated on this dataset. Defaults to the one for the task.

  • allowed_generative_types : optional A list of generative model types that are allowed to be evaluated on this dataset. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models. Defaults to the one for the task.

  • allow_invalid_model_outputs : optional Whether to allow invalid model outputs. This is only relevant for generative models on classification tasks, where the model may generate an output which is not one of the allowed labels. If True, the model output will be mapped to the closest valid label. If False, the model output will be considered incorrect and the evaluation will be aborted. Defaults to the one for the task.

  • train_split : optional The name of the split to use as the training set. Can be None if there is no training split in the dataset. Defaults to "train".

  • val_split : optional The name of the split to use as the validation set. Can be None if there is no validation split in the dataset. Defaults to "val".

  • test_split : optional The name of the split to use as the test set. Defaults to "test".

  • bootstrap_samples : optional Whether to bootstrap the dataset samples. Defaults to True.

  • unofficial : optional Whether the dataset is unofficial. Defaults to False.

  • _prompt_prefix : optional This argument is deprecated. Please use prompt_prefix instead.

  • _prompt_template : optional This argument is deprecated. Please use prompt_template instead.

  • _instruction_prompt : optional This argument is deprecated. Please use instruction_prompt instead.

  • _num_few_shot_examples : optional This argument is deprecated. Please use num_few_shot_examples instead.

  • _max_generated_tokens : optional This argument is deprecated. Please use max_generated_tokens instead.

  • _labels : optional This argument is deprecated. Please use labels instead.

  • _prompt_label_mapping : optional This argument is deprecated. Please use prompt_label_mapping instead.

  • _allowed_model_types : optional This argument is deprecated. Please use allowed_model_types instead.

  • _allowed_generative_types : optional This argument is deprecated. Please use allowed_generative_types instead.

  • _allow_invalid_model_outputs : optional This argument is deprecated. Please use allow_invalid_model_outputs instead.

  • _logging_string : optional This argument is deprecated. Please use logging_string instead.

Attributes

  • name : str The name of the dataset.

  • pretty_name : str The pretty name of the dataset.

  • source : str | dict[str, str] The source of the dataset.

  • logging_string : str The string used to describe evaluation on the dataset in logging.

  • main_language : Language The main language of the dataset.

  • id2label : HashableDict The mapping from ID to label.

  • label2id : HashableDict The mapping from label to ID.

  • num_labels : int The number of labels in the dataset.

Methods

  • get_labels_str Converts a set of labels to a natural-language string in the specified language.
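
The sketch below illustrates what a custom_datasets.py file defining a DatasetConfig might look like; the import paths of the task and language constants (euroeval.tasks, euroeval.languages) are assumptions and may differ in your installed version.

```python
# A hedged sketch of a custom_datasets.py file. The import locations of the
# task and language constants below are assumptions, not confirmed API.
from euroeval import DatasetConfig
from euroeval.tasks import SENT      # assumed sentiment-classification task constant
from euroeval.languages import DA    # assumed Danish language constant

MY_SENTIMENT_CONFIG = DatasetConfig(
    name="my-sentiment",             # lower case, no spaces
    pretty_name="My Sentiment Dataset",
    source={                         # local CSV files, one per split
        "train": "data/train.csv",
        "val": "data/val.csv",
        "test": "data/test.csv",
    },
    task=SENT,
    languages=[DA],
    labels=["negative", "neutral", "positive"],
    prompt_label_mapping="auto",     # map each label to itself in prompts
)
```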

source property DatasetConfig.name: str

The name of the dataset.

Returns

  • str The name of the dataset.

source property DatasetConfig.pretty_name: str

The pretty name of the dataset.

Returns

  • str The pretty name of the dataset.

source property DatasetConfig.source: str | dict[str, str]

The source of the dataset.

Returns

  • str | dict[str, str] The source of the dataset.

source property DatasetConfig.logging_string: str

The string used to describe evaluation on the dataset in logging.

Returns

  • str The logging string.

source property DatasetConfig.main_language: Language

Get the main language of the dataset.

Returns

  • Language The main language of the dataset.

source property DatasetConfig.id2label: HashableDict

The mapping from ID to label.

source property DatasetConfig.label2id: HashableDict

The mapping from label to ID.

source property DatasetConfig.num_labels: int

The number of labels in the dataset.
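
Continuing the DatasetConfig sketch above, the label-related properties can be inspected directly:

```python
# Inspect the label-related properties of the DatasetConfig sketched above.
print(MY_SENTIMENT_CONFIG.num_labels)  # number of labels in the dataset
print(MY_SENTIMENT_CONFIG.id2label)    # mapping from integer IDs to labels
print(MY_SENTIMENT_CONFIG.label2id)    # inverse mapping from labels to IDs
```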

source method DatasetConfig.get_labels_str(labels: c.Sequence[str] | None = None) -> str

Converts a set of labels to a natural-language string in the specified language.

If the task is NER, the labels are separated with 'and' and the mapped labels are used instead of the BIO NER labels.

Parameters

  • labels : optional The labels to convert to a natural string. If None, uses all the labels in the dataset. Defaults to None.

Returns

  • str The natural-language string representation of the labels in the specified language.
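
For example (continuing the sketch above; the exact wording of the returned string depends on the dataset's language):

```python
# Convert the dataset's labels to a natural-language string.
all_labels_str = MY_SENTIMENT_CONFIG.get_labels_str()
some_labels_str = MY_SENTIMENT_CONFIG.get_labels_str(labels=["negative", "positive"])
print(all_labels_str)
print(some_labels_str)
```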

source block_terminal_output() -> None

Blocks libraries from writing output to the terminal.

This filters warnings from some libraries, sets the logging level to ERROR for some libraries, disables tokeniser progress bars when using Hugging Face tokenisers, and disables most of the logging from the transformers library.
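
A minimal sketch of using it, assuming the function is exported at the package root as this reference suggests:

```python
# Silence third-party library output before running evaluations.
from euroeval import block_terminal_output

block_terminal_output()
```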