euroeval.benchmarker

Class that benchmarks language models.

Classes

Benchmarker — Benchmarking all the language models.

Functions

get_record — Get the benchmark record for a given model and dataset.
clear_model_cache_fn — Clear the model cache.
initial_logging — Initial logging at the start of the benchmarking process.

class Benchmarker(progress_bar: bool = True, save_results: bool = True, task: str | Task | c.Sequence[str | Task] | None = None, dataset: str | DatasetConfig | c.Sequence[str | DatasetConfig] | None = None, language: str | c.Sequence[str] = 'all', device: Device | None = None, finetuning_batch_size: int = 32, raise_errors: bool = False, cache_dir: str = '.euroeval_cache', api_key: str | None = None, force: bool = False, verbose: bool = False, trust_remote_code: bool = False, clear_model_cache: bool = False, evaluate_test_split: bool = False, few_shot: bool = True, num_iterations: int = 10, api_base: str | None = None, api_version: str | None = None, gpu_memory_utilization: float = 0.8, attention_backend: t.Literal[*ATTENTION_BACKENDS,] = 'FLASHINFER', generative_type: GenerativeType | None = None, custom_datasets_file: Path | str = Path('custom_datasets.py'), debug: bool = False, run_with_cli: bool = False, requires_safetensors: bool = False, download_only: bool = False, model_language: str | c.Sequence[str] | None = None, dataset_language: str | c.Sequence[str] | None = None, batch_size: int | None = None)

Benchmarking all the language models.

Initialise the benchmarker.

Attributes

benchmark_config_default_params — The default parameters for the benchmark configuration.
benchmark_config — The benchmark configuration.
force — Whether to force evaluations of models, even if they have been benchmarked already.
results_path — The path to the results file.
benchmark_results : c.Sequence[BenchmarkResult] — The benchmark results.

Parameters

progress_bar : bool — Whether progress bars should be shown. Defaults to True.
save_results : bool — Whether to save the benchmark results to 'euroeval_benchmark_results.jsonl'. Defaults to True.
task : str | Task | c.Sequence[str | Task] | None — The tasks to benchmark the model(s) on. Mutually exclusive with dataset. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.
dataset : str | DatasetConfig | c.Sequence[str | DatasetConfig] | None — The datasets to benchmark on. Mutually exclusive with task. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.
language : str | c.Sequence[str] — The language codes of the languages to include, both for models and datasets. Set this to 'all' if all languages should be considered. Defaults to 'all'.
device : Device | None — The device to use for benchmarking. Defaults to None.
finetuning_batch_size : int — The batch size to use when finetuning. Defaults to 32.
raise_errors : bool — Whether to raise errors instead of skipping the model evaluation. Defaults to False.
cache_dir : str — Directory to store cached models. Defaults to '.euroeval_cache'.
api_key : str | None — The API key to use for a given inference API. Defaults to None.
force : bool — Whether to force evaluations of models, even if they have been benchmarked already. Defaults to False.
verbose : bool — Whether to produce additional output. This is automatically set if debug is True. Defaults to False.
trust_remote_code : bool — Whether to trust remote code when loading models. Defaults to False.
clear_model_cache : bool — Whether to clear the model cache after benchmarking each model. Defaults to False.
evaluate_test_split : bool — Whether to evaluate the test split of the datasets. Defaults to False.
few_shot : bool — Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative. Defaults to True.
num_iterations : int — The number of times each model should be evaluated. This is intended for power users only, and scores will not be accepted on the leaderboards if it is changed. Defaults to 10.
api_base : str | None — The base URL for a given inference API. Only relevant if model refers to a model on an inference API. Defaults to None.
api_version : str | None — The version of the API to use. Defaults to None.
gpu_memory_utilization : float — The GPU memory utilization to use for vLLM. Only relevant if the model is generative. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Defaults to 0.8.
attention_backend : t.Literal[*ATTENTION_BACKENDS,] — The attention backend to use for vLLM. Only relevant if the model is generative. Defaults to 'FLASHINFER'.
generative_type : GenerativeType | None — The type of generative model to benchmark. Only relevant if the model is generative. If not specified, then the type will be inferred based on the tags of the model. Defaults to None.
custom_datasets_file : Path | str — Path to a Python file defining custom datasets. Defaults to 'custom_datasets.py'.
debug : bool — Whether to output debug information. Defaults to False.
run_with_cli : bool — Whether the benchmarker is being run from the command-line interface. Defaults to False.
requires_safetensors : bool — Whether to only allow models that use the safetensors format. Defaults to False.
download_only : bool — Whether to only download models and datasets without performing any benchmarking. Defaults to False.
model_language : str | c.Sequence[str] | None — Deprecated argument. Please use language instead.
dataset_language : str | c.Sequence[str] | None — Deprecated argument. Please use language instead.
batch_size : int | None — Deprecated argument. Please use finetuning_batch_size instead.

Raises

ValueError — If both task and dataset are specified, or if download_only is True and we have no internet connection.
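
The mutual-exclusivity rule for task and dataset can be illustrated with a small standalone check. This is only a sketch of the documented behaviour, not EuroEval's own validation code: the helper name `check_task_and_dataset` is hypothetical, and the task and dataset names are placeholder values.

```python
from typing import Optional, Sequence, Union

# Hypothetical alias: either a single name or several names.
Names = Union[str, Sequence[str]]


def check_task_and_dataset(
    task: Optional[Names] = None, dataset: Optional[Names] = None
) -> None:
    """Hypothetical helper: reject calls that set both `task` and `dataset`."""
    if task is not None and dataset is not None:
        raise ValueError("Specify either `task` or `dataset`, not both.")


# Either argument alone is accepted, and so is leaving both unset
# (in which case all datasets would be benchmarked).
check_task_and_dataset(task="sentiment-classification")
check_task_and_dataset(dataset="angry-tweets")
check_task_and_dataset()

# Setting both is rejected with a ValueError.
try:
    check_task_and_dataset(task="sentiment-classification", dataset="angry-tweets")
except ValueError as exc:
    print(f"Rejected: {exc}")
```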

Methods

benchmark — Benchmarks models on datasets.
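
A typical call sequence might look like the sketch below. It assumes the `euroeval` package is installed and uses only parameter names listed on this page. The language code, task name and chosen values are illustrative placeholders, and `run_benchmark` is a hypothetical wrapper rather than part of the library.

```python
# Constructor arguments taken from the parameter list on this page.
# The values are illustrative, not recommendations.
BENCHMARKER_KWARGS = {
    "language": "da",
    "progress_bar": True,
    "save_results": True,
    "raise_errors": False,
}


def run_benchmark(model_id: str):
    """Benchmark one model. Requires the `euroeval` package to be installed."""
    # Imported lazily so that this file can be loaded without euroeval.
    from euroeval.benchmarker import Benchmarker

    benchmarker = Benchmarker(**BENCHMARKER_KWARGS)
    # `task` and `dataset` are mutually exclusive, so only `task` is given.
    # The task name here is a placeholder.
    return benchmarker.benchmark(model=model_id, task="sentiment-classification")
```

Calling `run_benchmark("<model-id>")` with a real Hugging Face Hub model ID would then return the sequence of benchmark results.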

property Benchmarker.benchmark_results: c.Sequence[BenchmarkResult]

The benchmark results.

Returns

c.Sequence[BenchmarkResult] — A list of benchmark results.

Raises

ValueError — If there is an error decoding a line in the results file.
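
The behaviour described here, reading a results file line by line and raising ValueError on a malformed line, can be sketched as follows. This is not the library's implementation. It assumes the results file is in JSON Lines format (as the '.jsonl' file name suggests), returns plain dictionaries instead of BenchmarkResult objects, and treats a missing file as an empty result list. The function name `load_results` is hypothetical.

```python
import json
from pathlib import Path


def load_results(results_path: Path) -> list:
    """Hypothetical loader: parse one JSON record per non-empty line."""
    records = []
    # Assumption: a missing results file simply means no results yet.
    if not results_path.exists():
        return records
    with results_path.open(encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # skip blank lines
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Surface decoding problems as ValueError, as documented above.
                raise ValueError(
                    f"Could not decode line {line_number} of {results_path}"
                ) from exc
    return records
```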

method Benchmarker.benchmark

Benchmarks models on datasets.

Parameters

model : c.Sequence[str] | str — The full Hugging Face Hub path(s) to the pretrained transformer model(s). A specific model version can be appended after an '@', as in "model@v1.0.0". The version can be a branch name, a tag name, or a commit id, and defaults to the latest version if not specified.
task : str | Task | c.Sequence[str | Task] | None — The tasks to benchmark the model(s) on. Mutually exclusive with dataset. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.
dataset : str | DatasetConfig | c.Sequence[str | DatasetConfig] | None — The datasets to benchmark on. Mutually exclusive with task. If both task and dataset are None then all datasets will be benchmarked. Defaults to None.
progress_bar : bool | None — Whether progress bars should be shown. Defaults to the value specified when initialising the benchmarker.
save_results : bool | None — Whether to save the benchmark results to 'euroeval_benchmark_results.jsonl'. Defaults to the value specified when initialising the benchmarker.
language : str | c.Sequence[str] | None — The language codes of the languages to include, both for models and datasets. Here 'no' means both Bokmål (nb) and Nynorsk (nn). Set this to 'all' if all languages should be considered. Defaults to the value specified when initialising the benchmarker.
device : Device | None — The device to use for benchmarking. Defaults to the value specified when initialising the benchmarker.
finetuning_batch_size : int | None — The batch size to use for finetuning. Defaults to the value specified when initialising the benchmarker.
raise_errors : bool | None — Whether to raise errors instead of skipping the model evaluation. Defaults to the value specified when initialising the benchmarker.
cache_dir : str | None — Directory to store cached models. Defaults to the value specified when initialising the benchmarker.
api_key : str | None — The API key to use for a given inference server. Defaults to the value specified when initialising the benchmarker.
api_base : str | None — The base URL for a given inference API. Only relevant if model refers to a model on an inference API. Defaults to the value specified when initialising the benchmarker.
api_version : str | None — The version of the API to use. Defaults to the value specified when initialising the benchmarker.
trust_remote_code : bool | None — Whether to trust remote code when loading models. Defaults to the value specified when initialising the benchmarker.
clear_model_cache : bool | None — Whether to clear the model cache after benchmarking each model. Defaults to the value specified when initialising the benchmarker.
evaluate_test_split : bool | None — Whether to evaluate the test split of the datasets. Defaults to the value specified when initialising the benchmarker.
few_shot : bool | None — Whether to only evaluate the model using few-shot evaluation. Only relevant if the model is generative. Defaults to the value specified when initialising the benchmarker.
num_iterations : int | None — The number of times each model should be evaluated. This is intended for power users only, and scores will not be accepted on the leaderboards if it is changed. Defaults to the value specified when initialising the benchmarker.
requires_safetensors : bool | None — Whether to only allow models that use the safetensors format. Defaults to the value specified when initialising the benchmarker.
download_only : bool | None — Whether to only download the models without evaluating them. Defaults to the value specified when initialising the benchmarker.
gpu_memory_utilization : float | None — The GPU memory utilization to use for vLLM. Only relevant if the model is generative. A larger value will result in faster evaluation, but at the risk of running out of GPU memory. Only reduce this if you are running out of GPU memory. Defaults to the value specified when initialising the benchmarker.
generative_type : GenerativeType | None — The type of generative model to benchmark. Only relevant if the model is generative. If not specified, then the type will be inferred based on the tags of the model. Defaults to the value specified when initialising the benchmarker.
attention_backend : t.Literal[*ATTENTION_BACKENDS,] | None — The attention backend to use for vLLM. Only relevant if the model is generative. Defaults to the value specified when initialising the benchmarker.
custom_datasets_file : Path | str | None — Path to a Python file defining custom datasets. Defaults to the value specified when initialising the benchmarker.
force : bool | None — Whether to force evaluations of models, even if they have been benchmarked already. Defaults to the value specified when initialising the benchmarker.
verbose : bool | None — Whether to produce additional output. Defaults to the value specified when initialising the benchmarker.
debug : bool | None — Whether to output debug information. Defaults to the value specified when initialising the benchmarker.
model_language : str | c.Sequence[str] | None — Deprecated argument. Please use language instead.
dataset_language : str | c.Sequence[str] | None — Deprecated argument. Please use language instead.
batch_size : int | None — Deprecated argument. Please use finetuning_batch_size instead.
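
The "model@revision" convention described for the model parameter can be illustrated with a small parser. How EuroEval splits the string internally is not shown on this page, so the helper `split_model_id` below is a hypothetical sketch of the convention only.

```python
from typing import Optional, Tuple


def split_model_id(model: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split 'model@revision' into its two parts."""
    model_id, separator, revision = model.partition("@")
    # No '@' (or nothing after it) means that no revision was requested.
    return model_id, (revision if separator and revision else None)


print(split_model_id("model@v1.0.0"))  # ('model', 'v1.0.0')
print(split_model_id("model"))         # ('model', None)
```

A result of None for the revision corresponds to the documented default of using the latest version.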

Returns

c.Sequence[BenchmarkResult] — A list of benchmark results.
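
Most parameters of this method accept None and then fall back to the value given when the benchmarker was initialised. A minimal sketch of that fallback pattern is shown below. The `Defaults` class and both helper functions are hypothetical and cover only two of the parameters.

```python
from dataclasses import dataclass


@dataclass
class Defaults:
    """Hypothetical stand-in for the values stored at initialisation."""

    progress_bar: bool = True
    num_iterations: int = 10


def resolve(override, default):
    """Use the per-call override unless it is None."""
    # Comparing against None (rather than using `override or default`)
    # keeps falsy overrides such as False or 0 intact.
    return default if override is None else override


def effective_settings(defaults, progress_bar=None, num_iterations=None):
    """Merge per-call overrides with the initialisation-time defaults."""
    return Defaults(
        progress_bar=resolve(progress_bar, defaults.progress_bar),
        num_iterations=resolve(num_iterations, defaults.num_iterations),
    )


defaults = Defaults()
print(effective_settings(defaults))
print(effective_settings(defaults, progress_bar=False))
```

The explicit `is None` check matters for boolean parameters: passing `progress_bar=False` has to disable the progress bar instead of silently reverting to the default.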

Raises

ValueError — If both task and dataset are specified.
InvalidModel — If we're offline benchmarking an adapter model, or if model loading failed.

get_record(model_config: ModelConfig, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig, benchmark_results: c.Sequence[BenchmarkResult]) → BenchmarkResult | None

Get the benchmark record for a given model and dataset.

Parameters

model_config : ModelConfig — The configuration of the model we are evaluating.
dataset_config : DatasetConfig — The configuration of the dataset we are evaluating on.
benchmark_config : BenchmarkConfig — The general benchmark configuration.
benchmark_results : c.Sequence[BenchmarkResult] — The benchmark results.

Returns

BenchmarkResult | None — The benchmark record, or None if no such record exists.
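
The "record or None" lookup can be sketched as below. The fields that EuroEval actually compares are not listed on this page, so the sketch assumes matching on a model name and a dataset name only. `Record` is a simplified, hypothetical stand-in for BenchmarkResult, and `find_record` is not the library function.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Record:
    """Simplified stand-in for BenchmarkResult (hypothetical fields)."""

    model: str
    dataset: str


def find_record(
    model: str, dataset: str, records: Sequence[Record]
) -> Optional[Record]:
    """Return the first record matching both names, or None if there is none."""
    return next(
        (r for r in records if r.model == model and r.dataset == dataset),
        None,
    )


records = [Record("model-a", "dataset-x"), Record("model-b", "dataset-x")]
print(find_record("model-b", "dataset-x", records))
print(find_record("model-c", "dataset-x", records))
```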

clear_model_cache_fn(cache_dir: str) → None

Clear the model cache.

Note that this will not remove the stored completions.

Parameters

cache_dir : str — The path to the cache directory.

initial_logging(model_config: ModelConfig, dataset_config: DatasetConfig, benchmark_config: BenchmarkConfig, num_finished_benchmarks: int, num_total_benchmarks: int) → None

Initial logging at the start of the benchmarking process.

Parameters

model_config : ModelConfig — The configuration of the model we are evaluating.
dataset_config : DatasetConfig — The configuration of the dataset we are evaluating on.
benchmark_config : BenchmarkConfig — The general benchmark configuration.
num_finished_benchmarks : int — The number of benchmarks that have already been finished.
num_total_benchmarks : int — The total number of benchmarks to be run.