euroeval
EuroEval - A benchmarking framework for language models.
Classes
-
Benchmarker — Benchmarks language models.
-
DatasetConfig — Configuration for a dataset.
Functions
-
block_terminal_output — Blocks libraries from writing output to the terminal.
source class Benchmarker(progress_bar: bool = True, save_results: bool = True, task: str | Task | c.Sequence[str | Task] | None = None, dataset: str | DatasetConfig | c.Sequence[str | DatasetConfig] | None = None, language: str | c.Sequence[str] = 'all', device: Device | None = None, finetuning_batch_size: int = 32, raise_errors: bool = False, cache_dir: str = '.euroeval_cache', api_key: str | None = None, force: bool = False, verbose: bool = False, trust_remote_code: bool = False, clear_model_cache: bool = False, evaluate_test_split: bool = False, few_shot: bool = True, num_iterations: int = 10, api_base: str | None = None, api_version: str | None = None, gpu_memory_utilization: float = 0.8, generative_type: GenerativeType | None = None, debug: bool = False, run_with_cli: bool = False, requires_safetensors: bool = False, download_only: bool = False, model_language: str | c.Sequence[str] | None = None, dataset_language: str | c.Sequence[str] | None = None, batch_size: int | None = None)
Benchmarks language models.
Initialise the benchmarker.
Attributes
-
benchmark_config_default_params — The default parameters for the benchmark configuration.
-
benchmark_config — The benchmark configuration.
-
force — Whether to force evaluations of models, even if they have been benchmarked already.
-
results_path — The path to the results file.
-
benchmark_results : c.Sequence[BenchmarkResult] — The benchmark results.
Parameters
-
progress_bar : bool — Whether progress bars should be shown. Defaults to True.
-
save_results : bool — Whether to save the benchmark results to 'euroeval_benchmark_results.jsonl'. Defaults to True.
-
task : str | Task | c.Sequence[str | Task] | None — The tasks to benchmark the model(s) on. Mutually exclusive with dataset. If both task and dataset are None then all datasets will be benchmarked.
-
dataset : str | DatasetConfig | c.Sequence[str | DatasetConfig] | None — The datasets to benchmark on. Mutually exclusive with task. If both task and dataset are None then all datasets will be benchmarked.
-
language : str | c.Sequence[str] — The language codes of the languages to include, both for models and datasets. Set this to 'all' if all languages should be considered. Defaults to "all".
-
device : Device | None — The device to use for benchmarking. Defaults to None.
-
finetuning_batch_size : int — The batch size to use when finetuning. Defaults to 32.
-
raise_errors : bool — Whether to raise errors instead of skipping the model evaluation. Defaults to False.
-
cache_dir : str — Directory to store cached models. Defaults to '.euroeval_cache'.
-
api_key : str | None — The API key to use for a given inference API.
-
force : bool — Whether to force evaluations of models, even if they have been benchmarked already. Defaults to False.
-
verbose : bool — Whether to produce additional output. This is automatically enabled if debug is True. Defaults to False.
-
trust_remote_code : bool — Whether to trust remote code when loading models. Defaults to False.
-
clear_model_cache : bool — Whether to clear the model cache after benchmarking each model. Defaults to False.
-
evaluate_test_split : bool — Whether to evaluate the test split of the datasets. Defaults to False.
-
few_shot : bool — Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative. Defaults to True.
-
num_iterations : int — The number of times each model should be evaluated. This is only meant to be used by power users, and scores will not be allowed on the leaderboards if this is changed. Defaults to 10.
-
api_base : str | None — The base URL for a given inference API. Only relevant if model refers to a model on an inference API. Defaults to None.
-
api_version : str | None — The version of the API to use. Defaults to None.
-
gpu_memory_utilization : float — The GPU memory utilization to use for vLLM. Only relevant if the model is generative. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Defaults to 0.8.
-
generative_type : GenerativeType | None — The type of generative model to benchmark. Only relevant if the model is generative. If not specified, then the type will be inferred based on the tags of the model. Defaults to None.
-
debug : bool — Whether to output debug information. Defaults to False.
-
run_with_cli : bool — Whether the benchmarker is being run from the command-line interface. Defaults to False.
-
requires_safetensors : bool — Whether to only allow models that use the safetensors format. Defaults to False.
-
download_only : bool — Whether to only download models and datasets without performing any benchmarking. Defaults to False.
-
model_language : str | c.Sequence[str] | None — Deprecated argument. Please use language instead.
-
dataset_language : str | c.Sequence[str] | None — Deprecated argument. Please use language instead.
-
batch_size : int | None — Deprecated argument. Please use finetuning_batch_size instead.
Raises
-
ValueError — If both task and dataset are specified, or if download_only is True and there is no internet connection.
-
ImportError — If hf_transfer is enabled but not installed.
Methods
-
benchmark — Benchmarks models on datasets.
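Before the detailed property and method documentation, here is a minimal usage sketch of instantiating the class. It assumes Danish ("da") is among the supported language codes; the remaining keyword arguments simply restate the defaults documented above.

```python
from euroeval import Benchmarker

# Minimal sketch: restrict benchmarking to Danish models and datasets.
# The keyword arguments mirror the constructor parameters documented above;
# "da" is assumed to be a supported language code.
benchmarker = Benchmarker(
    language="da",
    progress_bar=True,            # show progress bars (the default)
    save_results=True,            # write 'euroeval_benchmark_results.jsonl' (the default)
    cache_dir=".euroeval_cache",  # where downloaded models are cached (the default)
)
```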
source property Benchmarker.benchmark_results: c.Sequence[BenchmarkResult]
The benchmark results.
Returns
-
c.Sequence[BenchmarkResult] — A list of benchmark results.
Raises
-
ValueError — If there is an error decoding a line in the results file.
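As a short sketch, previously saved results can be inspected through this property once a benchmarker has been created, provided the results file from an earlier run exists:

```python
# Iterate over the results parsed from 'euroeval_benchmark_results.jsonl'.
for result in benchmarker.benchmark_results:
    print(result)
```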
source method Benchmarker.benchmark(model: c.Sequence[str] | str, task: str | Task | c.Sequence[str | Task] | None = None, dataset: str | DatasetConfig | c.Sequence[str | DatasetConfig] | None = None, progress_bar: bool | None = None, save_results: bool | None = None, language: str | c.Sequence[str] | None = None, device: Device | None = None, finetuning_batch_size: int | None = None, raise_errors: bool | None = None, cache_dir: str | None = None, api_key: str | None = None, api_base: str | None = None, api_version: str | None = None, trust_remote_code: bool | None = None, clear_model_cache: bool | None = None, evaluate_test_split: bool | None = None, few_shot: bool | None = None, num_iterations: int | None = None, requires_safetensors: bool | None = None, download_only: bool | None = None, gpu_memory_utilization: float | None = None, generative_type: GenerativeType | None = None, force: bool | None = None, verbose: bool | None = None, debug: bool | None = None, model_language: str | c.Sequence[str] | None = None, dataset_language: str | c.Sequence[str] | None = None, batch_size: int | None = None) → c.Sequence[BenchmarkResult]
Benchmarks models on datasets.
Parameters
-
model : c.Sequence[str] | str — The full Hugging Face Hub path(s) to the pretrained transformer model. The specific model version to use can be added after the suffix '@': "model@v1.0.0". It can be a branch name, a tag name, or a commit id, and defaults to the latest version if not specified.
-
task : str | Task | c.Sequence[str | Task] | None — The tasks to benchmark the model(s) on. Mutually exclusive with dataset. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.
-
dataset : str | DatasetConfig | c.Sequence[str | DatasetConfig] | None — The datasets to benchmark on. Mutually exclusive with task. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.
-
progress_bar : bool | None — Whether progress bars should be shown. Defaults to the value specified when initialising the benchmarker.
-
save_results : bool | None — Whether to save the benchmark results to 'euroeval_benchmark_results.jsonl'. Defaults to the value specified when initialising the benchmarker.
-
language : str | c.Sequence[str] | None — The language codes of the languages to include, both for models and datasets. Here 'no' means both Bokmål (nb) and Nynorsk (nn). Set this to 'all' if all languages should be considered. Defaults to the value specified when initialising the benchmarker.
-
device : Device | None — The device to use for benchmarking. Defaults to the value specified when initialising the benchmarker.
-
finetuning_batch_size : int | None — The batch size to use for finetuning. Defaults to the value specified when initialising the benchmarker.
-
raise_errors : bool | None — Whether to raise errors instead of skipping the model evaluation. Defaults to the value specified when initialising the benchmarker.
-
cache_dir : str | None — Directory to store cached models. Defaults to the value specified when initialising the benchmarker.
-
api_key : str | None — The API key to use for a given inference server. Defaults to the value specified when initialising the benchmarker.
-
api_base : str | None — The base URL for a given inference API. Only relevant if model refers to a model on an inference API. Defaults to the value specified when initialising the benchmarker.
-
api_version : str | None — The version of the API to use. Defaults to the value specified when initialising the benchmarker.
-
trust_remote_code : bool | None — Whether to trust remote code when loading models. Defaults to the value specified when initialising the benchmarker.
-
clear_model_cache : bool | None — Whether to clear the model cache after benchmarking each model. Defaults to the value specified when initialising the benchmarker.
-
evaluate_test_split : bool | None — Whether to evaluate the test split of the datasets. Defaults to the value specified when initialising the benchmarker.
-
few_shot : bool | None — Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative. Defaults to the value specified when initialising the benchmarker.
-
num_iterations : int | None — The number of times each model should be evaluated. This is only meant to be used by power users, and scores will not be allowed on the leaderboards if this is changed. Defaults to the value specified when initialising the benchmarker.
-
requires_safetensors : bool | None — Whether to only allow models that use the safetensors format. Defaults to the value specified when initialising the benchmarker.
-
download_only : bool | None — Whether to only download the models without evaluating them. Defaults to the value specified when initialising the benchmarker.
-
gpu_memory_utilization : float | None — The GPU memory utilization to use for vLLM. Only relevant if the model is generative. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Defaults to the value specified when initialising the benchmarker.
-
generative_type : GenerativeType | None — The type of generative model to benchmark. Only relevant if the model is generative. If not specified, then the type will be inferred based on the tags of the model. Defaults to the value specified when initialising the benchmarker.
-
force : bool | None — Whether to force evaluations of models, even if they have been benchmarked already. Defaults to the value specified when initialising the benchmarker.
-
verbose : bool | None — Whether to produce additional output. Defaults to the value specified when initialising the benchmarker.
-
debug : bool | None — Whether to output debug information. Defaults to the value specified when initialising the benchmarker.
-
model_language : str | c.Sequence[str] | None — Deprecated argument. Please use language instead.
-
dataset_language : str | c.Sequence[str] | None — Deprecated argument. Please use language instead.
-
batch_size : int | None — Deprecated argument. Please use finetuning_batch_size instead.
Returns
-
c.Sequence[BenchmarkResult] — A list of benchmark results.
Raises
-
ValueError — If both task and dataset are specified.
-
Exception — Any error raised during the evaluation of a model, re-raised when raise_errors is True.
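A hedged usage sketch of the method follows; the model identifier and dataset name below are illustrative placeholders rather than entries guaranteed to exist on the Hugging Face Hub or in EuroEval.

```python
from euroeval import Benchmarker

benchmarker = Benchmarker(language="da")

# Benchmark a single (placeholder) Hub model on a single (placeholder) dataset,
# overriding a couple of the defaults set when the benchmarker was initialised.
results = benchmarker.benchmark(
    model="organisation/model-name@v1.0.0",  # '@' pins a branch, tag or commit id
    dataset="dataset-name",                  # or pass task=... instead (mutually exclusive with dataset)
    evaluate_test_split=False,
    num_iterations=10,
)

for result in results:
    print(result)
```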
source dataclass DatasetConfig(name: str, pretty_name: str, source: str | dict[str, str], task: Task, languages: c.Sequence[Language], _prompt_prefix: str | None = None, _prompt_template: str | None = None, _instruction_prompt: str | None = None, _num_few_shot_examples: int | None = None, _max_generated_tokens: int | None = None, _labels: c.Sequence[str] | None = None, _prompt_label_mapping: dict[str, str] | t.Literal['auto'] | None = None, _allowed_model_types: c.Sequence[ModelType] | None = None, _allowed_generative_types: c.Sequence[GenerativeType] | None = None, _allow_invalid_model_outputs: bool | None = None, _logging_string: str | None = None, splits: c.Sequence[str] = field(default_factory=lambda: ['train', 'val', 'test']), bootstrap_samples: bool = True, unofficial: bool = False)
Configuration for a dataset.
Attributes
-
name : str — The name of the dataset. Must be lower case with no spaces.
-
pretty_name : str — A longer, prettier name for the dataset, which may contain capital letters and spaces. Used for logging.
-
source : str | dict[str, str] — The source of the dataset, which can be a Hugging Face ID or a dictionary with keys "train", "val" and "test" mapping to local CSV file paths.
-
task : Task — The task of the dataset.
-
languages : c.Sequence[Language] — The ISO 639-1 language codes of the entries in the dataset.
-
id2label : HashableDict — The mapping from ID to label.
-
label2id : HashableDict — The mapping from label to ID.
-
num_labels : int — The number of labels in the dataset.
-
_prompt_prefix : optional — The prefix to use in the few-shot prompt. Defaults to the template for the task and language.
-
_prompt_template : optional — The template for the prompt to use when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
_instruction_prompt : optional — The prompt to use when benchmarking the dataset using instruction-based evaluation. Defaults to the template for the task and language.
-
_num_few_shot_examples : optional — The number of examples to use when benchmarking the dataset using few-shot evaluation. For a classification task, these will be drawn evenly from each label. Defaults to the template for the task and language.
-
_max_generated_tokens : optional — The maximum number of tokens to generate when benchmarking the dataset using few-shot evaluation. Defaults to the template for the task and language.
-
_labels : optional — The labels in the dataset. Defaults to the template for the task and language.
-
_prompt_label_mapping : optional — A mapping from the labels to another phrase which is used as a substitute for the label in few-shot evaluation. If "auto" then the mapping will be set to a 1:1 mapping between the labels and themselves. If None then the mapping will be set to the default mapping for the task and language. Defaults to None.
-
_allowed_model_types : optional — A list of model types that are allowed to be evaluated on this dataset. Defaults to the one for the task.
-
_allowed_generative_types : optional — A list of generative model types that are allowed to be evaluated on this dataset. If None, all generative model types are allowed. Only relevant if allowed_model_types includes generative models. Defaults to the one for the task.
-
_allow_invalid_model_outputs : optional — Whether to allow invalid model outputs. This is only relevant for generative models on classification tasks, where the model may generate an output which is not one of the allowed labels. If True, the model output will be mapped to the closest valid label. If False, the model output will be considered incorrect and the evaluation will be aborted. Defaults to the one for the task.
-
_logging_string : optional — The string used to describe evaluation on the dataset in logging. If not provided, a default string will be generated, based on the pretty name. Only use this if the default string is not suitable.
-
splits : optional — The names of the splits in the dataset. If not provided, defaults to ["train", "val", "test"].
-
bootstrap_samples : optional — Whether to bootstrap the dataset samples. Defaults to True.
-
unofficial : optional — Whether the dataset is unofficial. Defaults to False.
-
main_language : Language — The main language of the dataset.
-
logging_string : str — The string used to describe evaluation on the dataset in logging.
-
prompt_prefix : str — The prefix to use in the few-shot prompt.
-
prompt_template : str — The template used during few-shot evaluation.
-
instruction_prompt : str — The prompt to use when evaluating instruction-tuned models.
-
num_few_shot_examples : int — The number of few-shot examples to use.
-
max_generated_tokens : int — The maximum number of tokens to generate when evaluating a model.
-
labels : c.Sequence[str] — The labels in the dataset.
-
prompt_label_mapping : dict[str, str] — Mapping from English labels to localised labels.
-
allowed_model_types : c.Sequence[ModelType] — A list of model types that are allowed to be evaluated on this dataset.
-
allowed_generative_types : c.Sequence[GenerativeType] — A list of generative model types that are allowed on this dataset.
-
allow_invalid_model_outputs : bool — Whether to allow invalid model outputs.
Methods
-
get_labels_str — Converts a set of labels to a natural string, in the specified language.
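A hedged sketch of constructing a custom DatasetConfig from local CSV files, following the constructor signature above. The Task and Language objects must come from EuroEval itself; the import paths and constant names used below (euroeval.tasks.SENT and euroeval.languages.DA) are assumptions for illustration and may differ in the actual package.

```python
from euroeval import DatasetConfig

# NOTE: these two imports are assumptions for illustration only; consult the
# EuroEval source for the actual Task and Language objects to use here.
from euroeval.tasks import SENT        # hypothetical sentiment-classification Task
from euroeval.languages import DA      # hypothetical Danish Language object

my_dataset = DatasetConfig(
    name="my-sentiment-dataset",         # lower case, no spaces
    pretty_name="My Sentiment Dataset",  # used for logging
    source={                             # local CSV files, one per split
        "train": "data/train.csv",
        "val": "data/val.csv",
        "test": "data/test.csv",
    },
    task=SENT,
    languages=[DA],
)
```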
source property DatasetConfig.main_language: Language
The main language of the dataset.
source property DatasetConfig.logging_string: str
The string used to describe evaluation on the dataset in logging.
source property DatasetConfig.prompt_prefix: str
The prefix to use in the few-shot prompt.
source property DatasetConfig.prompt_template: str
The template used during few-shot evaluation.
source property DatasetConfig.instruction_prompt: str
The prompt to use when evaluating instruction-tuned models.
source property DatasetConfig.num_few_shot_examples: int
The number of few-shot examples to use.
source property DatasetConfig.max_generated_tokens: int
The maximum number of tokens to generate when evaluating a model.
source property DatasetConfig.labels: c.Sequence[str]
The labels in the dataset.
source property DatasetConfig.prompt_label_mapping: dict[str, str]
Mapping from English labels to localised labels.
source property DatasetConfig.allowed_model_types: c.Sequence[ModelType]
A list of model types that are allowed to be evaluated on this dataset.
source property DatasetConfig.allowed_generative_types: c.Sequence[GenerativeType]
A list of generative model types that are allowed on this dataset.
source property DatasetConfig.allow_invalid_model_outputs: bool
Whether to allow invalid model outputs.
source property DatasetConfig.id2label: HashableDict
The mapping from ID to label.
source property DatasetConfig.label2id: HashableDict
The mapping from label to ID.
source property DatasetConfig.num_labels: int
The number of labels in the dataset.
source method DatasetConfig.get_labels_str(labels: c.Sequence[str] | None = None) → str
Converts a set of labels to a natural string, in the specified language.
If the task is NER, the labels are separated using 'and' and the mapped labels are used instead of the BIO NER labels.
Parameters
-
labels : optional — The labels to convert to a natural string. If None, uses all the labels in the dataset. Defaults to None.
Returns
-
str — The natural string representation of the labels in the specified language.
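As a brief sketch, assuming config is a DatasetConfig for a classification dataset, the method can be called with or without an explicit label subset; the label names shown are illustrative.

```python
# All labels in the dataset, joined into a natural string in the dataset's language.
labels_str = config.get_labels_str()

# Only a subset of labels (illustrative label names).
subset_str = config.get_labels_str(labels=["positive", "negative"])
```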
source block_terminal_output() → None
Blocks libraries from writing output to the terminal.
This filters warnings from some libraries, sets the logging level to ERROR for some libraries, disables tokeniser progress bars when using Hugging Face tokenisers, and disables most of the logging from the transformers library.
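A minimal sketch of calling the function before any benchmarking, so that third-party libraries stay quiet during the run:

```python
from euroeval import Benchmarker, block_terminal_output

# Silence warnings, tokeniser progress bars and most third-party logging
# before creating the benchmarker and running evaluations.
block_terminal_output()

benchmarker = Benchmarker(language="da")
```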