The euroeval Python Package¶
The euroeval Python package is used to evaluate language models in EuroEval. This page
gives a brief overview of the package and how to use it.
You can also check out the full API reference for more details.
Installation¶
To install the package simply write the following command in your favorite terminal:
pip install euroeval[all]
This will install the EuroEval package with all extras. You can also install the
minimal version by leaving out the [all], in which case the package will let you know
when an evaluation requires a certain extra dependency, and how to install it.
Quickstart¶
Benchmarking¶
euroeval supports benchmarking both from a script and from the command line.
The easiest way to benchmark pretrained models is via the command line interface. After having installed the package, you can benchmark your favorite model like so:
euroeval --model <model-id>
Here model is the Hugging Face model ID, which can be found on the Hugging Face
Hub. By default this will benchmark the model on all
the tasks available. If you want to benchmark on a particular task, then use the
--task argument:
euroeval --model <model-id> --task sentiment-classification
We can also narrow down which languages we would like to benchmark on. This can be done
by setting the --language argument. Here we thus benchmark the model on the Danish
sentiment classification task:
euroeval --model <model-id> --task sentiment-classification --language da
Multiple models, datasets and/or languages can be specified by repeating the corresponding argument. Here is an example with two models:
euroeval --model <model-id1> --model <model-id2>
The specific model version/revision to use can also be appended to the model ID after an '@':
euroeval --model <model-id>@<commit>
This can be a branch name, a tag name, or a commit id. It defaults to 'main', i.e. the latest version.
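As a rough illustration of the '@' convention, the sketch below splits a model string into an ID and a revision, falling back to 'main' when no revision is given. The function name split_model_id is hypothetical and this is not how EuroEval itself parses the argument.

```python
def split_model_id(model_id: str, default_revision: str = "main") -> tuple[str, str]:
    """Split "<model-id>@<revision>" into its two parts (illustrative only)."""
    name, separator, revision = model_id.partition("@")
    if separator and revision:
        return (name, revision)
    # No '@' (or nothing after it): fall back to the default revision.
    return (name, default_revision)


print(split_model_id("<model-id>@<commit>"))
print(split_model_id("<model-id>"))
```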
See all the arguments and options available for the euroeval command by typing
euroeval --help
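If you launch the command line interface from another Python program, the repeated-flag pattern above can be assembled programmatically. The following is a minimal sketch; build_euroeval_command is a hypothetical helper that only uses the --model, --task and --language flags described above and does not run anything itself.

```python
import shlex


def build_euroeval_command(models, tasks=(), languages=()):
    """Assemble the argument list for a `euroeval` call (hypothetical helper)."""
    command = ["euroeval"]
    # Each value gets its own flag, mirroring `--model a --model b`.
    for model in models:
        command += ["--model", model]
    for task in tasks:
        command += ["--task", task]
    for language in languages:
        command += ["--language", language]
    return command


command = build_euroeval_command(
    models=["<model-id1>", "<model-id2>"],
    tasks=["sentiment-classification"],
    languages=["da"],
)
# Print a shell-quoted version of the command for inspection.
print(shlex.join(command))
```

With the package installed, the resulting list could be passed to subprocess.run(command, check=True).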
In a script, the syntax is similar to the command line interface. You simply initialise
an object of the Benchmarker class, and call its benchmark method with your favorite
model:
>>> from euroeval import Benchmarker
>>> benchmarker = Benchmarker()
>>> benchmarker.benchmark(model="<model-id>")
To benchmark on a specific task and/or language, you simply specify the task or
language arguments, shown here with the same example as above:
>>> benchmarker.benchmark(
... model="<model-id>",
... task="sentiment-classification",
... language="da",
... )
If you want to benchmark a subset of all the models on the Hugging Face Hub, you can
simply leave out the model argument. In this example, we're benchmarking all Danish
models on the Danish sentiment classification task:
>>> benchmarker.benchmark(task="sentiment-classification", language="da")
A Dockerfile is provided in the repo, which can be downloaded and run without needing to clone the repo and install from source. It can be fetched programmatically by running the following:
wget https://raw.githubusercontent.com/EuroEval/EuroEval/main/Dockerfile
Next, to be able to build the Docker image, first ensure that the NVIDIA Container
Toolkit is installed and configured. Ensure that the CUDA version stated at the top of
the Dockerfile matches the CUDA version installed (which you can check using
nvidia-smi). After that, we build the image as follows:
docker build --pull -t euroeval .
With the Docker image built, we can now evaluate any model as follows:
docker run --rm --gpus 1 euroeval <euroeval-arguments>
Here <euroeval-arguments> consists of the arguments you would otherwise pass to the
euroeval CLI. This could for instance be
--model <model-id> --task sentiment-classification.
Benchmarking custom inference APIs¶
If the model you want to benchmark is hosted by a custom inference provider, such as a vLLM server, then this is also supported in EuroEval.
When benchmarking, you simply have to set the --api-base argument (api_base when
using the Benchmarker API) to the URL of the inference API, and optionally the
--api-key argument (api_key) to the API key, if authentication is required.
If you're benchmarking an Ollama model, then you're urged to add the prefix
ollama_chat/ to the model name, as that will also fetch model metadata and pull
the model from the Ollama model repository before evaluating it, e.g.:
euroeval --model ollama_chat/mymodel --api-base http://localhost:11434
For all other OpenAI-compatible inference APIs, you simply provide the model name as is, e.g.:
euroeval --model my-model --api-base http://localhost:8000
Again, if the inference API requires authentication, you simply add the --api-key
argument:
euroeval --model my-model --api-base http://localhost:8000 --api-key my-secret-key
If your model is a reasoning model, then you need to specify this as follows:
euroeval --model my-reasoning-model --api-base http://localhost:8000 --generative-type reasoning
Likewise, if it is a pretrained decoder model (aka a completion model), then you specify this as follows:
euroeval --model my-base-decoder-model --api-base http://localhost:8000 --generative-type base
When using the Benchmarker API, the same applies. Here is an example of benchmarking
an Ollama model hosted locally:
>>> benchmarker.benchmark(
... model="ollama_chat/mymodel",
... api_base="http://localhost:11434",
... )
Benchmarking in an offline environment¶
If you need to benchmark in an offline environment, you need to download the models, datasets and metrics beforehand. From the command line, this can be done by adding the --download-only argument. For example, to download the model you want and all of the Danish sentiment classification datasets:
euroeval --model <model-id> --task sentiment-classification --language da --download-only
If you are benchmarking from a script, use the download_only argument instead:
benchmarker.benchmark(
model="<model-id>",
task="sentiment-classification",
language="da",
download_only=True,
)
Note
Offline benchmarking of adapter models is not currently supported, meaning that we still require an internet connection during the evaluation of these. If offline support of adapters is important to you, please consider opening an issue.
Overriding model metadata¶
Some models do not have metadata (maximum context length and vocabulary size) specified,
or have it specified incorrectly. This leads to incorrect values on the leaderboards.
To work around this you can manually override these values using the
--max-context-length and --vocabulary-size arguments:
euroeval --model <model-id> --max-context-length 4096 --vocabulary-size 32000
>>> benchmarker.benchmark(
... model="<model-id>",
... max_context_length=4096,
... vocabulary_size=32000,
... )
Benchmarking custom datasets¶
If you want to benchmark models on your own custom dataset, this is also possible. First, you need to set up your dataset to be compatible with EuroEval. This means splitting up your dataset into training, validation and test splits. By default, EuroEval expects these standard column names:
- Text or multiple-choice classification: text and label
- Token classification: tokens and labels
- Reading comprehension: text and answers
- Free-form text generation: text and target_text
If your dataset uses different column names, you can specify the mapping via
input_column, target_column, and choices_column in DatasetConfig (see
Custom column names below) — no need to rename your columns
beforehand.
Text and multiple-choice classification tasks are by far the most common. Next, decide whether your dataset should be accessible locally (good for testing, and good for sensitive datasets), or accessible via the Hugging Face Hub (good for allowing others to benchmark on your dataset).
For a local dataset, you store your three dataset splits as three different CSV files
with the desired two columns, and then you create a file called custom_datasets.py
in which you define the associated DatasetConfig objects for your dataset. Here
is an example of a simple text classification dataset with two classes:
from euroeval import DatasetConfig, TEXT_CLASSIFICATION
from euroeval.languages import ENGLISH
MY_CONFIG = DatasetConfig(
name="my-dataset",
pretty_name="My Dataset",
source=dict(train="train.csv", val="val.csv", test="test.csv"),
task=TEXT_CLASSIFICATION,
languages=[ENGLISH],
labels=["positive", "negative"],
)
You can then benchmark your custom dataset by simply running
euroeval --dataset my-dataset --model <model-id>
You can also run the benchmark from a Python script, by simply providing your custom
dataset configuration directly into the benchmark method:
from euroeval import Benchmarker
benchmarker = Benchmarker()
benchmarker.benchmark(model="<model-id>", dataset=MY_CONFIG)
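For a local dataset, the three CSV files need to exist before benchmarking. The sketch below writes minimal train.csv, val.csv and test.csv files with the standard text and label columns, using only the standard library. The rows are made up, and writing to the current working directory is an assumption about where the relative paths in source are resolved.

```python
import csv
from pathlib import Path

# Made-up example rows; replace these with your own data.
splits = {
    "train": [("I loved this film.", "positive"), ("Terrible service.", "negative")],
    "val": [("Pretty good overall.", "positive")],
    "test": [("Not worth the money.", "negative")],
}


def write_splits(splits: dict, output_dir: Path) -> None:
    """Write one CSV file per split with `text` and `label` columns."""
    output_dir.mkdir(parents=True, exist_ok=True)
    for split_name, rows in splits.items():
        path = output_dir / f"{split_name}.csv"
        with open(path, "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["text", "label"])  # standard classification columns
            writer.writerows(rows)


# Assumption: the files live next to custom_datasets.py.
write_splits(splits, Path("."))
```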
The simplest and most secure way to add a EuroEval configuration to a Hugging Face Hub
dataset is via a YAML file. No Python code is written, so no --trust-remote-code flag
is required.
Create a file called eval.yaml in the root of your dataset repository. The file
follows the Inspect AI eval.yaml format
and works with both Inspect AI and EuroEval:
name: My Dataset
tasks:
  - id: my_dataset
    split: test
    field_spec:
      input: review
      target: sentiment
    solvers:
      - name: generate
    scorers:
      - name: choice
# EuroEval-specific keys (optional; ignored by Inspect AI)
task: classification
languages:
  - en
labels:
  - positive
  - negative
The EuroEval-specific keys (task, languages, labels, and all other
DatasetConfig arguments) are placed at the top level alongside the standard Inspect
AI tasks block. Inspect AI silently ignores keys it does not recognise, so the same
file works for both frameworks.
The value of task must be one of the task names used in EuroEval
(e.g. classification, sentiment-classification,
named-entity-recognition, multiple-choice, etc.). languages is a list of
ISO 639-1 language codes.
All other DatasetConfig arguments are also supported:
name: My Dataset
tasks:
  - id: my_dataset
    split: test
    field_spec:
      input: review
      target: sentiment
    solvers:
      - name: generate
    scorers:
      - name: choice
# EuroEval-specific keys (optional; ignored by Inspect AI)
task: classification
languages:
  - en
labels:
  - positive
  - negative
num_few_shot_examples: 12
max_generated_tokens: 5
prompt_label_mapping:
  positive: positive
  negative: negative
The EuroEval-specific task and languages keys are optional — EuroEval will
infer them automatically when they are absent:
- task is inferred from the Inspect AI tasks block: a solver with name: multiple_choice or a field_spec.choices entry both map to the multiple-choice task.
- languages are read from the Hugging Face Hub repository metadata (the language field in the dataset card). If the language cannot be determined, EuroEval defaults to English and logs a warning.
This means a standard Inspect AI eval.yaml with no EuroEval-specific keys works
out of the box:
# Pure Inspect AI format — no EuroEval keys required
name: My Dataset
description: My dataset description.
tasks:
  - id: my_dataset
    split: test
    field_spec:
      input: question
      target: answer
      choices: options
    solvers:
      - name: multiple_choice
    scorers:
      - name: choice
Column names can also be supplied as flat top-level keys (input_column,
target_column, choices_column) instead of inside the field_spec block;
top-level keys take precedence when both are present. Note that Inspect AI allows
field_spec.target values such as "literal:A" (a hard-coded answer string) and
bare integers (mapped to letters A, B, C … by Inspect AI); EuroEval silently ignores
both forms because they are not column names.
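The precedence rules in the previous paragraph can be sketched as follows. This is a hypothetical re-implementation working on an already-parsed eval.yaml (a plain dict), meant only to illustrate the described behaviour; resolve_columns is not a EuroEval function.

```python
def resolve_columns(config: dict) -> dict:
    """Pick column names from a parsed eval.yaml (illustrative only)."""
    tasks = config.get("tasks") or [{}]
    field_spec = tasks[0].get("field_spec", {})
    resolved = {}
    for flat_key, spec_key in [
        ("input_column", "input"),
        ("target_column", "target"),
        ("choices_column", "choices"),
    ]:
        # Flat top-level keys take precedence over field_spec entries.
        value = config.get(flat_key)
        if value is None:
            value = field_spec.get(spec_key)
            # "literal:A" and bare integers are not column names, so skip them.
            is_literal = isinstance(value, str) and value.startswith("literal:")
            if spec_key == "target" and (is_literal or isinstance(value, int)):
                value = None
        if value is not None:
            resolved[flat_key] = value
    return resolved


example = {
    "tasks": [{"field_spec": {"input": "question", "target": "literal:A", "choices": "options"}}],
    "input_column": "prompt",
}
print(resolve_columns(example))
```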
The standard Inspect AI task keys are also used directly by EuroEval:
- tasks[0].split: the evaluation split to use (e.g. test, validation). EuroEval uses this as the test split, so no separate EuroEval key is needed.
- tasks[0].config: the Hugging Face dataset config/subset name (e.g. main, default). EuroEval automatically appends it when loading the dataset.
You can then benchmark your custom dataset by simply running
euroeval --dataset <org-id>/<repo-id> --model <model-id>
or from a Python script:
from euroeval import Benchmarker
benchmarker = Benchmarker()
benchmarker.benchmark(model="<model-id>", dataset="<org-id>/<repo-id>")
For a dataset that is accessible via the Hugging Face Hub, you can also create a
file called euroeval_config.py in the root of your repository, in which you define
the associated dataset configuration. This gives you full Python flexibility (e.g.
custom preprocessing functions) but requires the --trust-remote-code flag. Note
that you don't need to specify the name, pretty_name or source arguments in this
case, as these are automatically inferred from the repository name. Here is an example
of a simple text classification dataset with two classes:
from euroeval import DatasetConfig, TEXT_CLASSIFICATION
from euroeval.languages import ENGLISH
CONFIG = DatasetConfig(
task=TEXT_CLASSIFICATION,
languages=[ENGLISH],
labels=["positive", "negative"],
)
Note
To benchmark a dataset from the Hugging Face Hub using a Python config, you always
need to set the --trust-remote-code flag (or trust_remote_code=True if using
the Benchmarker), as the dataset configuration is loaded from the remote code.
We advise you to always look at the code of the dataset configuration before running
the benchmark.
You can then benchmark your custom dataset by simply running
euroeval --dataset <org-id>/<repo-id> --model <model-id> --trust-remote-code
You can also run the benchmark from a Python script, by simply providing your repo ID to
the benchmark method:
from euroeval import Benchmarker
benchmarker = Benchmarker()
benchmarker.benchmark(
model="<model-id>", dataset="<org-id>/<repo-id>", trust_remote_code=True
)
You can try it out with our test dataset:
euroeval --dataset EuroEval/test_dataset --model <model-id> --trust-remote-code
Custom column names¶
If your dataset uses column names that differ from EuroEval's expected names, you can
specify a column mapping directly in DatasetConfig using the input_column,
target_column, and choices_column arguments. EuroEval will rename (or merge) the
columns at load time, so you don't need to preprocess your dataset beforehand.
input_column — the name of the column containing the input text. Defaults to
"text" (no rename). If set to a different value, that column is renamed to "text".
DatasetConfig(
name="my-dataset",
...,
input_column="review", # rename "review" → "text"
)
target_column — the name of the column containing the label. If set, the column
is renamed to the task-appropriate standard name ("label" for classification,
"labels" for token classification, "target_text" for text-to-text).
DatasetConfig(
name="my-dataset",
...,
target_column="sentiment", # rename "sentiment" → "label" (for classification)
)
choices_column — for multiple-choice tasks, the column (or list of columns)
containing the answer choices. A single string names a column that holds a list of
choice strings. A list of strings names separate columns, each holding one choice
string. When set, the input text and choices are automatically merged into the
formatted "text" column that EuroEval expects.
# Single column holding a list of choices
DatasetConfig(
name="my-mcq-dataset",
...,
choices_column="choices",
target_column="answer",
)
# Separate columns, one per choice
DatasetConfig(
name="my-mcq-dataset",
...,
input_column="question",
choices_column=["choice_a", "choice_b", "choice_c", "choice_d"],
target_column="answer",
)
preprocessing_func — for full control, you can supply an arbitrary preprocessing
function that receives a DatasetDict and returns a DatasetDict. If this argument
is provided together with any of the column arguments above, preprocessing_func takes
precedence and the column arguments are ignored (a warning is logged in this case).
def my_preprocess(dataset):
    for split_name, split in dataset.items():
        split = split.rename_column("review", "text")
        split = split.rename_column("stars", "label")
        dataset[split_name] = split
    return dataset
DatasetConfig(
name="my-dataset",
...,
preprocessing_func=my_preprocess,
)
We have included three convenience tasks to make it easier to set up custom datasets:
- TEXT_CLASSIFICATION, which is used for text classification tasks. This requires you to set the labels argument in the DatasetConfig, and requires the columns text and label to be present in the dataset.
- MULTIPLE_CHOICE, which is used for multiple-choice classification tasks. This also requires you to set the labels argument in the DatasetConfig. Note that for multiple-choice tasks, you need to set up your text column to also list all the choices, and all the samples should have the same number of choices. This requires the columns text and label to be present in the dataset.
- TOKEN_CLASSIFICATION, which is used when classifying individual tokens in a text. This also requires you to set the labels argument in the DatasetConfig. This requires the columns tokens and labels to be present in the dataset, where tokens is a list of tokens/words in the text, and labels is a list of the corresponding labels for each token (so the two lists have the same length).
On top of these three convenience tasks, there are of course also the tasks that we use in the official benchmark, which you can use if you want to use one of these tasks with your own bespoke dataset:
- LA, for linguistic acceptability datasets.
- NER, for named entity recognition datasets with the standard BIO tagging scheme.
- RC, for reading comprehension datasets in the SQuAD format.
- SENT, for sentiment classification datasets.
- SUMM, for text summarisation datasets.
- KNOW, for multiple-choice knowledge datasets (e.g., MMLU).
- MCRC, for multiple-choice reading comprehension datasets (e.g., Belebele).
- COMMON_SENSE, for multiple-choice common-sense reasoning datasets (e.g., HellaSwag).
These can all be imported from the euroeval.tasks module.
Creating your own custom task¶
You are of course also free to define your own task from scratch, which allows you to
customise the prompts used when evaluating generative models, for instance. When
creating a custom task you need to specify a task_group, which determines the overall
type of task and the required dataset columns. Below are examples for each supported
task group.
The PromptConfig object defines the prompts used for evaluation and accepts the
following arguments:
- default_prompt_prefix: Introductory text shown before the few-shot examples (only required for base decoders).
- default_prompt_template: Template used to format each example in few-shot evaluation (only required for base decoders). Available placeholders depend on the task group (see examples below).
- default_instruction_prompt: Template used for instruction-tuned models (zero-shot or instruction-style evaluation). Available placeholders depend on the task group (see examples below).
- default_prompt_label_mapping: A mapping from label strings to human-readable phrases used in the prompts (e.g., {"b-per": "person"}). Set to "auto" for a 1:1 mapping or to an empty dict() for tasks that don't use labels in prompts.
Task group: TaskGroup.SEQUENCE_CLASSIFICATION
Required dataset columns: text (string), label (string)
The label column should contain the class label as a string. You must provide the
list of possible labels in the DatasetConfig.
Available placeholders in PromptConfig:
- {text}: The input text.
- {label}: The label for the example (empty string for the new sample).
- {labels_str}: A formatted string listing all possible labels.
from euroeval import DatasetConfig
from euroeval.data_models import Task, PromptConfig
from euroeval.enums import TaskGroup
from euroeval.languages import DANISH
from euroeval.metrics import mcc_metric, macro_f1_metric
from euroeval.constants import NUM_GENERATION_TOKENS_FOR_CLASSIFICATION
my_classification_task = Task(
name="my-classification",
task_group=TaskGroup.SEQUENCE_CLASSIFICATION,
template_dict={
DANISH: PromptConfig(
default_prompt_prefix="The following are texts and their categories, which "
"can be {labels_str}.",
default_prompt_template="Text: {text}\nCategory: {label}",
default_instruction_prompt="Text: {text}\n\nClassify the text into one of "
"the categories {labels_str}, and answer with only the category.",
default_prompt_label_mapping="auto",
),
},
metrics=[mcc_metric, macro_f1_metric],
default_num_few_shot_examples=12,
default_max_generated_tokens=NUM_GENERATION_TOKENS_FOR_CLASSIFICATION,
uses_logprobs=True,
)
MY_DATASET = DatasetConfig(
name="my-classification-dataset",
pretty_name="My Classification Dataset",
source=dict(train="train.csv", val="val.csv", test="test.csv"),
task=my_classification_task,
languages=[DANISH],
labels=["sports", "politics", "entertainment"],
)
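To get a feel for how the placeholders are filled in, the sketch below assembles a few-shot prompt from the prefix and template used above with plain str.format. The example texts are made up, and both the comma-separated form of labels_str and the blank line between examples are assumptions; EuroEval's actual formatting may differ.

```python
prompt_prefix = (
    "The following are texts and their categories, which can be {labels_str}."
)
prompt_template = "Text: {text}\nCategory: {label}"

labels = ["sports", "politics", "entertainment"]
# Assumption: labels are rendered as a plain comma-separated list.
labels_str = ", ".join(labels)

few_shot_examples = [
    {"text": "The match ended 2-1.", "label": "sports"},
    {"text": "Parliament passed the bill.", "label": "politics"},
]
# The new sample has an empty label, which the model is expected to fill in.
new_sample = {"text": "The film premieres on Friday.", "label": ""}

sections = [prompt_prefix.format(labels_str=labels_str)]
sections += [prompt_template.format(**example) for example in few_shot_examples]
sections.append(prompt_template.format(**new_sample))

# Assumption: sections are separated by a blank line.
prompt = "\n\n".join(sections)
print(prompt)
```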
Task group: TaskGroup.MULTIPLE_CHOICE_CLASSIFICATION
Required dataset columns: text (string), label (string)
The label column should be the letter of the correct choice (e.g., "a"). The text
column must include both the question and the formatted answer choices. You can either
pre-format the text column yourself, or use choices_column in DatasetConfig to
have EuroEval merge a separate choices column (or per-choice columns) into text
automatically. The merged format is:
<question>
Choices:
a. <choice 0>
b. <choice 1>
...
All samples must have the same number of choices. You must provide the list of possible
label letters in the DatasetConfig (e.g., ["a", "b", "c", "d"]).
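If you prefer to pre-format the text column yourself, the merged format shown above can be produced with a few lines of Python. This is a minimal sketch; merge_choices is a hypothetical helper, and the exact whitespace EuroEval uses when merging is an assumption.

```python
import string


def merge_choices(question: str, choices: list[str]) -> str:
    """Format a question and its choices as `<question>`, `Choices:`, `a. ...`."""
    lines = [question, "Choices:"]
    # Pair each choice with a lowercase letter: a, b, c, ...
    for letter, choice in zip(string.ascii_lowercase, choices):
        lines.append(f"{letter}. {choice}")
    return "\n".join(lines)


print(merge_choices("What is the capital of Denmark?", ["Oslo", "Copenhagen", "Stockholm"]))
```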
Available placeholders in PromptConfig:
- {text}: The full question text including choices.
- {label}: The correct answer letter (empty string for the new sample).
- {labels_str}: A formatted string listing all possible answer letters.
from euroeval import DatasetConfig
from euroeval.data_models import Task, PromptConfig
from euroeval.enums import TaskGroup, ModelType
from euroeval.languages import FRENCH
from euroeval.metrics import mcc_metric, accuracy_metric
from euroeval.constants import NUM_GENERATION_TOKENS_FOR_CLASSIFICATION
my_multiple_choice_task = Task(
name="my-multiple-choice",
task_group=TaskGroup.MULTIPLE_CHOICE_CLASSIFICATION,
template_dict={
FRENCH: PromptConfig(
default_prompt_prefix="The following are multiple-choice questions "
"(with answers).",
default_prompt_template="Question: {text}\nAnswer: {label}",
default_instruction_prompt="Question: {text}\n\nAnswer the question above "
"by replying with {labels_str}, and nothing else.",
default_prompt_label_mapping="auto",
),
},
metrics=[mcc_metric, accuracy_metric],
default_num_few_shot_examples=5,
default_max_generated_tokens=NUM_GENERATION_TOKENS_FOR_CLASSIFICATION,
default_allowed_model_types=[ModelType.GENERATIVE],
uses_logprobs=True,
)
# If your dataset has a single column with a list of choices and a separate answer column:
MY_DATASET = DatasetConfig(
name="my-multiple-choice-dataset",
pretty_name="My Multiple Choice Dataset",
source=dict(train="train.csv", val="val.csv", test="test.csv"),
task=my_multiple_choice_task,
languages=[FRENCH],
labels=["a", "b", "c", "d"],
choices_column="choices", # column containing a list of choice strings
target_column="answer", # column containing the correct answer letter
)
# Or if each choice is in its own column:
MY_DATASET = DatasetConfig(
name="my-multiple-choice-dataset",
pretty_name="My Multiple Choice Dataset",
source=dict(train="train.csv", val="val.csv", test="test.csv"),
task=my_multiple_choice_task,
languages=[FRENCH],
labels=["a", "b", "c", "d"],
input_column="question",
choices_column=["choice_a", "choice_b", "choice_c", "choice_d"],
target_column="answer",
)
Task group: TaskGroup.TOKEN_CLASSIFICATION
Required dataset columns: tokens (list of strings), labels (list of strings)
The tokens column is a list of word tokens in the text, and the labels column is a
list of corresponding BIO tags (e.g., ["o", "b-per", "i-per", "o"]). The two lists
must have the same length. You must provide the full list of possible labels (including
"o") in the DatasetConfig. The default_prompt_label_mapping should map BIO labels
to human-readable category names, and for each entity type both b-X and i-X must map
to the same category string (e.g., {"b-per": "person", "i-per": "person"}).
Available placeholders in PromptConfig:
- {text}: The tokens joined into a string.
- {label}: A JSON dictionary mapping category names to lists of matching spans (empty string for the new sample).
- {labels_str}: A formatted string listing all category names from the label mapping.
from euroeval import DatasetConfig
from euroeval.data_models import Task, PromptConfig
from euroeval.enums import TaskGroup
from euroeval.languages import GERMAN
from euroeval.metrics import micro_f1_metric
my_token_classification_task = Task(
name="my-token-classification",
task_group=TaskGroup.TOKEN_CLASSIFICATION,
template_dict={
GERMAN: PromptConfig(
default_prompt_prefix="Below are texts and JSON dictionaries with the "
"categories that appear in the given text.",
default_prompt_template="Text: {text}\nCategories: {label}",
default_instruction_prompt="Text: {text}\n\nIdentify the categories in "
"the text. Print this as a JSON dictionary with the keys being "
"{labels_str}. The values should be lists of the spans of that category, "
"exactly as they appear in the text.",
default_prompt_label_mapping={
"b-product": "product",
"i-product": "product",
"b-company": "company",
"i-company": "company",
},
),
},
metrics=[micro_f1_metric],
default_num_few_shot_examples=8,
default_max_generated_tokens=128,
uses_structured_output=True,
)
MY_DATASET = DatasetConfig(
name="my-token-classification-dataset",
pretty_name="My Token Classification Dataset",
source=dict(train="train.csv", val="val.csv", test="test.csv"),
task=my_token_classification_task,
languages=[GERMAN],
labels=["o", "b-product", "i-product", "b-company", "i-company"],
)
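The relationship between the tokens and labels columns and the JSON dictionary used for the {label} placeholder can be illustrated as follows. This is a simplified sketch, not EuroEval's implementation: bio_to_spans is a hypothetical helper, the sentence is made up, and joining tokens with a single space is an assumption.

```python
import json


def bio_to_spans(tokens, labels, label_mapping):
    """Group BIO-tagged tokens into {category: [span, ...]} (illustrative only)."""
    spans = {category: [] for category in label_mapping.values()}
    current_category = None
    for token, label in zip(tokens, labels):
        category = label_mapping.get(label)  # None for "o"
        if category is None:
            current_category = None
        elif label.startswith("i-") and category == current_category:
            # Continue the span that is currently open.
            spans[category][-1] += " " + token
        else:
            # A "b-" tag (or a stray "i-" tag) starts a new span.
            spans[category].append(token)
            current_category = category
    return spans


label_mapping = {
    "b-product": "product",
    "i-product": "product",
    "b-company": "company",
    "i-company": "company",
}
tokens = ["Apple", "released", "the", "iPhone", "15", "yesterday"]
labels = ["b-company", "o", "o", "b-product", "i-product", "o"]

print(json.dumps(bio_to_spans(tokens, labels, label_mapping)))
```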
Task group: TaskGroup.QUESTION_ANSWERING
Required dataset columns: context (string), question (string), answers (dict)
The context column is the passage to read, question is the question to answer, and
answers is a dict with "text" (a list of answer strings) and "answer_start" (a
list of character-level start positions of those answers in the context). This follows
the SQuAD format.
Available placeholders in PromptConfig:
- {text}: The context passage.
- {question}: The question.
- {label}: The answer text (empty string for the new sample).
from euroeval import DatasetConfig
from euroeval.data_models import Task, PromptConfig
from euroeval.enums import TaskGroup
from euroeval.languages import SWEDISH
from euroeval.metrics import f1_metric, em_metric
my_qa_task = Task(
name="my-reading-comprehension",
task_group=TaskGroup.QUESTION_ANSWERING,
template_dict={
SWEDISH: PromptConfig(
default_prompt_prefix="Below are texts with questions and answers.",
default_prompt_template="Text: {text}\nQuestion: {question}\nAnswer in "
"max 3 words: {label}",
default_instruction_prompt="Text: {text}\n\nAnswer the following question "
"about the text above in max 3 words.\n\nQuestion: {question}",
default_prompt_label_mapping=dict(),
),
},
metrics=[f1_metric, em_metric],
default_num_few_shot_examples=4,
default_max_generated_tokens=32,
)
MY_DATASET = DatasetConfig(
name="my-reading-comprehension-dataset",
pretty_name="My Reading Comprehension Dataset",
source=dict(train="train.csv", val="val.csv", test="test.csv"),
task=my_qa_task,
languages=[SWEDISH],
)
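When preparing a reading comprehension dataset, the answers dict needs character-level start positions. The sketch below derives them from the context with str.find; make_squad_answers is a hypothetical helper, the passage is made up, and it assumes each answer appears verbatim in the context (only the first occurrence is used).

```python
def make_squad_answers(context: str, answer_texts: list[str]) -> dict:
    """Build a SQuAD-style answers dict with character-level start positions."""
    starts = []
    for answer in answer_texts:
        start = context.find(answer)  # index of the first occurrence, or -1
        if start == -1:
            raise ValueError(f"Answer {answer!r} does not occur in the context.")
        starts.append(start)
    return {"text": list(answer_texts), "answer_start": starts}


context = "EuroEval benchmarks language models in European languages."
answers = make_squad_answers(context, ["language models"])
print(answers)
```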
Task group: TaskGroup.TEXT_TO_TEXT
Required dataset columns: text (string), target_text (string)
The text column is the input to the model, and target_text is the expected output.
This covers tasks such as summarization, translation, simplification, and free-form text
generation.
Available placeholders in PromptConfig:
- {text}: The input text.
- {target_text}: The expected output text (empty string for the new sample).
Here is an example of a custom text-to-text task where the goal is to generate a SQL query from a natural language input:
from euroeval import DatasetConfig
from euroeval.data_models import Task, PromptConfig
from euroeval.enums import TaskGroup, ModelType
from euroeval.languages import ENGLISH
from euroeval.metrics import rouge_l_metric
sql_generation_task = Task(
name="sql-generation",
task_group=TaskGroup.TEXT_TO_TEXT,
template_dict={
ENGLISH: PromptConfig(
default_prompt_prefix="The following are natural language texts and their "
"corresponding SQL queries.",
default_prompt_template="Natural language query: {text}\nSQL query: "
"{target_text}",
default_instruction_prompt="Generate the SQL query for the following "
"natural language query:\n{text!r}",
default_prompt_label_mapping=dict(),
),
},
metrics=[rouge_l_metric],
default_num_few_shot_examples=3,
default_max_generated_tokens=256,
default_allowed_model_types=[ModelType.GENERATIVE],
)
MY_SQL_DATASET = DatasetConfig(
name="my-sql-dataset",
pretty_name="My SQL Dataset",
source=dict(train="train.csv", val="val.csv", test="test.csv"),
task=sql_generation_task,
languages=[ENGLISH],
)
With any of these custom tasks you can then benchmark your dataset by running
euroeval --dataset <dataset-name> --model <model-id>
Output format: Every Eval Ever (EEE)¶
Each entry written to euroeval_benchmark_results.jsonl conforms to the
Every Eval Ever (EEE) JSON schema v0.2.1,
a community-standard format for evaluation reporting. The file can therefore be
consumed directly by any tool that understands the EEE schema.
Every line is a self-contained JSON object with the following top-level structure:
| Field | Description |
|---|---|
| schema_version | EEE schema version ("0.2.1") |
| evaluation_id | Unique run identifier in dataset/model/timestamp format |
| evaluation_timestamp | ISO 8601 timestamp of when the evaluation ran |
| retrieved_timestamp | Unix epoch timestamp of when the record was written |
| source_metadata | EuroEval organisation info and evaluator relationship |
| model_info | Model identifier and EuroEval-specific metadata |
| eval_library | Library name/version and evaluation context |
| evaluation_results | Array of per-metric scores with uncertainty estimates |
Here is an abbreviated example:
{
"schema_version": "0.2.1",
"evaluation_id": "angry-tweets/meta-llama/Llama-3.1-8B-Instruct/1741260000",
"evaluation_timestamp": "2025-03-06T11:00:00+00:00",
"retrieved_timestamp": "1741260000",
"source_metadata": {
"source_name": "EuroEval",
"source_type": "evaluation_run",
"source_organization_name": "EuroEval",
"source_organization_url": "https://euroeval.com",
"evaluator_relationship": "third_party"
},
"model_info": {
"id": "meta-llama/Llama-3.1-8B-Instruct",
"name": "meta-llama/Llama-3.1-8B-Instruct",
"additional_details": {
"num_model_parameters": "8000000000",
"max_sequence_length": "131072",
"vocabulary_size": "128256",
"generative": "true",
"generative_type": "instruction_tuned"
}
},
"eval_library": {
"name": "euroeval",
"version": "16.16.1",
"additional_details": {
"task": "sentiment-classification",
"languages": "[\"da\"]",
"few_shot": "true",
"raw_results": "[{\"test_mcc\": 0.40}, {\"test_mcc\": 0.45}]"
}
},
"evaluation_results": [
{
"evaluation_name": "test_mcc",
"source_data": { "dataset_name": "angry-tweets", "source_type": "hf_dataset" },
"metric_config": {
"lower_is_better": false,
"score_type": "continuous",
"min_score": 0,
"max_score": 100
},
"score_details": {
"score": 42.5,
"details": { "num_failed_instances": "0.0" },
"uncertainty": {
"confidence_interval": {
"lower": 41.3,
"upper": 43.7,
"confidence_level": 0.95,
"method": "bootstrap"
},
"num_samples": 10
}
}
}
]
}
Note
The _se values stored internally are 95 % confidence interval half-widths
(1.96 × SE). They are exposed in the EEE output as a
confidence_interval { lower, upper, confidence_level: 0.95 }, which is the
correct EEE field for this statistic.
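As a small worked example of the arithmetic in the note, the following sketch turns a score and a half-width into the interval fields, using the numbers from the abbreviated example above (score 42.5, half-width 1.2). The function name confidence_interval is hypothetical.

```python
def confidence_interval(score: float, half_width: float) -> dict:
    """Turn a score and a 95 % half-width (1.96 × SE) into interval bounds."""
    return {
        "lower": score - half_width,
        "upper": score + half_width,
        "confidence_level": 0.95,
    }


half_width = 1.2  # the stored `_se` value, i.e. 1.96 × SE
interval = confidence_interval(score=42.5, half_width=half_width)
standard_error = half_width / 1.96  # recover the plain standard error if needed

print(interval)
print(standard_error)
```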
The EEE format can be read back into a BenchmarkResult object without any loss of
information:
import json
from euroeval.data_models import BenchmarkResult
with open("euroeval_benchmark_results.jsonl") as f:
    for line in f:
        if line.strip():
            result = BenchmarkResult.from_dict(json.loads(line))
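Because each line is plain JSON, the scores can also be extracted without euroeval installed. The sketch below uses only the standard library and the field names from the example above; summarise_record is a hypothetical helper, and the sample line is a trimmed copy of that example rather than real output.

```python
import json

# Trimmed copy of the abbreviated example above, serialised as one JSONL line.
sample_line = json.dumps(
    {
        "model_info": {"id": "meta-llama/Llama-3.1-8B-Instruct"},
        "evaluation_results": [
            {
                "evaluation_name": "test_mcc",
                "score_details": {
                    "score": 42.5,
                    "uncertainty": {
                        "confidence_interval": {"lower": 41.3, "upper": 43.7}
                    },
                },
            }
        ],
    }
)


def summarise_record(line: str) -> list[dict]:
    """Return one row per metric with its score and confidence bounds."""
    record = json.loads(line)
    rows = []
    for result in record["evaluation_results"]:
        details = result["score_details"]
        interval = details.get("uncertainty", {}).get("confidence_interval", {})
        rows.append(
            {
                "model": record["model_info"]["id"],
                "metric": result["evaluation_name"],
                "score": details["score"],
                "lower": interval.get("lower"),
                "upper": interval.get("upper"),
            }
        )
    return rows


for row in summarise_record(sample_line):
    print(row)
```

To process a real results file, call summarise_record on each non-empty line of euroeval_benchmark_results.jsonl.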
Analysing the results of generative models¶
Failed instances¶
When evaluating a generative model, some samples may fail silently — for example when
the model's output cannot be parsed as JSON (for NER tasks), or when no valid label can
be matched to the model's output (for classification tasks). These failures are recorded
in euroeval_benchmark_results.jsonl using the EEE format.
The total number of failed instances across all iterations is stored as a string in
evaluation_results[N].score_details.details.num_failed_instances for each metric.
The raw per-iteration scores, including per-iteration failed_instances lists, are
stored as a JSON-encoded string in eval_library.additional_details.raw_results. Each
element in the decoded list is a per-iteration score dictionary that may contain a
failed_instances key with a list of failed samples. Every item in the list has:
- sample_index: the 0-based index of the sample within the bootstrapped batch for that iteration.
- error: a short description of why the sample failed (e.g. "Could not parse JSON from model output" or "No candidate label found in model output").
Here is an abbreviated example (omitting most top-level EEE fields, such as
model_info and source_metadata; see the Output format section for the full
structure):
{
  "schema_version": "0.2.1",
  "evaluation_id": "ner-dataset/my-model/1741260000",
  "eval_library": {
    "name": "euroeval",
    "version": "16.17.0",
    "additional_details": {
      "raw_results": "[{\"micro_f1\": 0.82, \"failed_instances\": [{\"sample_index\": 4, \"error\": \"Could not parse JSON from model output\"}]}]"
    }
  },
  "evaluation_results": [
    {
      "evaluation_name": "test_micro_f1",
      "score_details": {
        "score": 0.82,
        "details": { "num_failed_instances": "3.0" }
      }
    }
  ]
}
If a model never fails (e.g. encoder/fine-tuned models, or a flawless generative run),
num_failed_instances will be "0.0" and failed_instances will be an empty list for
every iteration.
To inspect failed instances programmatically, you can load a result via
BenchmarkResult.from_dict(), which transparently decodes the EEE format (including
the JSON-encoded raw_results string) into a BenchmarkResult object. The decoded
per-iteration scores are then available via result.results["raw"]:
import json

from euroeval.data_models import BenchmarkResult

with open("euroeval_benchmark_results.jsonl") as f:
    for line in f:
        if line.strip():
            # from_dict() decodes the EEE format transparently
            result = BenchmarkResult.from_dict(json.loads(line))
            raw = result.results.get("raw", [])
            for iteration_idx, iteration_scores in enumerate(raw):
                for failure in iteration_scores.get("failed_instances", []):
                    print(
                        f"Iteration {iteration_idx}, "
                        f"sample {failure['sample_index']}: {failure['error']}"
                    )
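To get an overview rather than a line per failure, the failures can be tallied by error message. The sketch below works directly on the JSON-encoded raw_results string using only the standard library; `count_failures_by_error` is a hypothetical helper, and the inline record is made up for illustration, using the error messages mentioned above.

```python
import json
from collections import Counter

# Illustrative record with two iterations; only the fields used below are kept.
record = {
    "eval_library": {
        "additional_details": {
            "raw_results": json.dumps(
                [
                    {
                        "micro_f1": 0.82,
                        "failed_instances": [
                            {
                                "sample_index": 4,
                                "error": "Could not parse JSON from model output",
                            }
                        ],
                    },
                    {
                        "micro_f1": 0.80,
                        "failed_instances": [
                            {
                                "sample_index": 1,
                                "error": "Could not parse JSON from model output",
                            },
                            {
                                "sample_index": 7,
                                "error": "No candidate label found in model output",
                            },
                        ],
                    },
                ]
            )
        }
    }
}


def count_failures_by_error(record: dict) -> Counter:
    """Count failed instances per error message across all iterations."""
    raw_results = json.loads(
        record["eval_library"]["additional_details"]["raw_results"]
    )
    counts: Counter = Counter()
    for iteration_scores in raw_results:
        for failure in iteration_scores.get("failed_instances", []):
            counts[failure["error"]] += 1
    return counts


failure_counts = count_failures_by_error(record)
for error, count in failure_counts.most_common():
    print(f"{count} x {error}")
```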
Detailed model outputs¶
If you're evaluating a generative model and want to analyse the model results in more
depth, you can run your evaluation with the --debug flag (or debug=True if using the
Benchmarker). This outputs all the model outputs and all the dataset metadata (including
the ground truth labels, if present) both to the terminal and to a JSON file in your
current working directory, named
<model-id>-<dataset-name>-model-outputs.json.
The file is a JSON dictionary whose keys are hashes of the input (which you can ignore, as they are only used for caching during generation) and whose values are dictionaries with the following keys:
- index: The row index of the sample in the dataset. This allows you to match up the sample with the corresponding sample in the dataset.
- text/messages: The full input prompt used for generation. If the model is a base decoder then this will be a string stored in text, and if it's an instruction-tuned model then this will be an array of dictionaries stored in messages. This will include all few-shot examples, if any. See prompt below to get the content of the present sample, without any few-shot examples.
- prompt: The actual example, without any few-shot examples. This is not exactly the input to the model (unless you're conducting zero-shot evaluation), but it can be handy to separate out the actual query that the model was asked to answer.
- sequence: The sequence generated by the model.
- predicted_label: The predicted label for the generated sequence, if the task has a label. This allows you to compare directly with the ground truth label, if present.
- scores: An array of shape (num_tokens_generated, num_logprobs_per_token, 2), where the first dimension is the index of the token in the generated sequence, the second dimension is the index of the logprob for that token (ordered from the most likely token to the least likely), and the third dimension is a pair (token, logprob) for the token and its logprob. This will only be present if the task requires logprobs, and will otherwise be null.
- Any metadata for the sample that was present in the dataset, including the ground truth label, if present.
If you sort the rows by index, you will get the samples in the same order as they appear in the dataset, effectively recreating the entire dataset with the additional model output features mentioned above. Here's an example of how you can do this in Python:
>>> import json
>>> import pandas as pd
>>> with open("<model-id>-<dataset-name>-model-outputs.json") as f:
...     model_outputs = json.load(f)
>>> df = pd.DataFrame(model_outputs.values()).set_index('index').sort_index()
>>> df.head()
sequence predicted_label scores corruption_type label messages prompt
index
0 nej incorrect [[[nej, -1.735893965815194e-05], [ja, -11.0000... flip_nogle_nogen incorrect [{'content': 'Sætning: Styrkeforholdet må være... Sætning: Peter Elmegaard med nogen af sine hæs...
0 nej incorrect [[[nej, -3.128163257315464e-07], [ja, -15.125]... flip_nogle_nogen incorrect [{'content': 'Sætning: Ægteparret hævdede, at ... Sætning: Peter Elmegaard med nogen af sine hæs...
0 nej incorrect [[[nej, -0.0009307525469921529], [ja, -7.00093... flip_nogle_nogen incorrect [{'content': 'Sætning: Samtidig lægger hans on... Sætning: Peter Elmegaard med nogen af sine hæs...
0 nej incorrect [[[nej, -4.127333340875339e-06], [ja, -12.5000... flip_nogle_nogen incorrect [{'content': 'Sætning: Hej til Bente som jeg v... Sætning: Peter Elmegaard med nogen af sine hæs...
1 nej incorrect [[[nej, 0.0], [ja, -16.75], [ne, -19.0], [n, -... flip_indefinite_article incorrect [{'content': 'Sætning: Ægteparret hævdede, at ... Sætning: Der blev afprøvet et masse ting.\n\nB...
Note that the index column is not unique, because the model generates multiple
answers for each sample, each with different few-shot examples. You can see these
few-shot examples in the messages column.
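Because several outputs can share the same index, it can be useful to group them per sample, for instance to see how often the model is right across the different few-shot contexts. The following is a minimal sketch using only the standard library; the dictionary keys and values in `model_outputs` are made up for illustration and stand in for the contents of the JSON file, and it assumes the dataset provides a label field as in the example above.

```python
from collections import defaultdict

# Illustrative stand-in for the loaded JSON file; the hash keys are placeholders.
model_outputs = {
    "hash-a": {"index": 0, "predicted_label": "incorrect", "label": "incorrect"},
    "hash-b": {"index": 0, "predicted_label": "correct", "label": "incorrect"},
    "hash-c": {"index": 1, "predicted_label": "incorrect", "label": "incorrect"},
}

# Collect, for each dataset row, whether each generated answer matched the label.
hits_by_index = defaultdict(list)
for output in model_outputs.values():
    hits_by_index[output["index"]].append(
        output["predicted_label"] == output["label"]
    )

# Fraction of correct answers per dataset row, ordered by row index.
accuracy_by_index = {
    index: sum(hits) / len(hits) for index, hits in sorted(hits_by_index.items())
}
print(accuracy_by_index)
```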
Example
Here is a (truncated) example of a model output file:
{
  "cb3f9ea749fec9d2f83ca6d3a8744cce": {
    "index": 181,
    "sequence": "ja",
    "predicted_label": "correct",
    "scores": [
      [
        [
          "ja",
          -0.5232800841331482
        ],
        [
          "nej",
          -0.8982800841331482
        ],
        [
          "ne",
          -8.773280143737793
        ],
        [
          "j",
          -13.710780143737793
        ],
        [
          "n",
          -13.960780143737793
        ],
        [
          "!",
          -100.0
        ],
        [
          "\"",
          -100.0
        ],
        [
          "#",
          -100.0
        ]
      ]
    ],
    "corruption_type": null,
    "label": "correct",
    "messages": [
      {
        "content": "Sætning: Styrkeforholdet må være det afgørene, siger de begge.\n\nBestem om sætningen er grammatisk korrekt eller ej. Svar kun med 'ja' eller 'nej', og intet andet.",
        "role": "user"
      },
      {
        "content": "nej",
        "role": "assistant"
      },
      (...more few-shot examples...)
      {
        "content": "Sætning: Rør peberfrugt i og steg igen et par minutter.\n\nBestem om sætningen er grammatisk korrekt eller ej. Svar kun med 'ja' eller 'nej', og intet andet.",
        "role": "user"
      }
    ],
    "prompt": "Sætning: Rør peberfrugt i og steg igen et par minutter.\n\nBestem om sætningen er grammatisk korrekt eller ej. Svar kun med 'ja' eller 'nej', og intet andet."
  },
  "a8fab2c68e9ec63184636341eaf43f6c": {
    (...)
  },
  (...)
}
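The scores entries hold log-probabilities, so exponentiating them gives the probability the model assigned to each candidate token. The sketch below does this for the first generated token, using the first few candidates from the example file above; `first_token_probabilities` is a hypothetical helper, not part of euroeval.

```python
import math


def first_token_probabilities(scores):
    """Map each candidate for the first generated token to its probability."""
    if not scores:  # scores is null when the task does not require logprobs
        return {}
    return {token: math.exp(logprob) for token, logprob in scores[0]}


# The first three candidates from the example file above.
example_scores = [
    [
        ["ja", -0.5232800841331482],
        ["nej", -0.8982800841331482],
        ["ne", -8.773280143737793],
    ]
]

probabilities = first_token_probabilities(example_scores)
most_likely = max(probabilities, key=probabilities.get)
print(most_likely, probabilities[most_likely])
```

Comparing the probabilities of the top candidates in this way gives a rough sense of how confident the model was in its answer.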