euroeval.human_evaluation

source module euroeval.human_evaluation

Gradio app for conducting human evaluation of the tasks.

Classes

  • HumanEvaluator An app for evaluating human performance on the EuroEval benchmark.

Functions

  • main Start the Gradio app for human evaluation.

source class HumanEvaluator(annotator_id: int, title: str, description: str, dummy_model_id: str = 'mistralai/Mistral-7B-v0.1')

An app for evaluating human performance on the EuroEval benchmark.

Initialize the HumanEvaluator.

Parameters

  • annotator_id : int

    The annotator ID for the evaluation.

  • title : str

    The title of the app.

  • description : str

    The description of the app.

  • dummy_model_id : str

    The model ID to use for generating prompts.
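
A minimal usage sketch of constructing the evaluator, assuming the module is importable as documented above; the annotator ID, title, and description values are placeholders.

```python
from euroeval.human_evaluation import HumanEvaluator

# Instantiate the evaluator for a single annotator; the title and description
# are shown at the top of the Gradio interface (placeholder values here).
evaluator = HumanEvaluator(
    annotator_id=0,
    title="EuroEval Human Evaluation",
    description="Answer each question as accurately as you can.",
)
```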

Methods

source method HumanEvaluator.create_app() → gr.Blocks

Create the Gradio app for human evaluation.

Returns

  • gr.Blocks The Gradio app for human evaluation.
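
A hedged sketch of building and serving the app locally, assuming the returned gr.Blocks can be started with Gradio's standard launch() call; the constructor arguments are placeholders.

```python
import gradio as gr

from euroeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(
    annotator_id=0,
    title="EuroEval Human Evaluation",
    description="Answer each question as accurately as you can.",
)

# Build the Blocks app and serve it on a local port; launch() is Gradio's
# standard entry point for starting an interface.
app: gr.Blocks = evaluator.create_app()
app.launch()
```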

source method HumanEvaluator.update_dataset_choices(language: str | None, task: str | None) → Dropdown

Update the dataset choices based on the selected language and task.

Parameters

  • language : str | None

    The language selected by the user.

  • task : str | None

    The task selected by the user.

Returns

  • Dropdown A Dropdown populated with the dataset names that match the selected language and task.
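
A minimal sketch of refreshing the dataset dropdown, assuming the method can be called directly outside of a Gradio event; the language code and task name are illustrative values only.

```python
from euroeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(annotator_id=0, title="Demo", description="Demo run")

# Refresh the dropdown for an illustrative language/task combination; the
# returned gr.Dropdown carries the matching dataset names as its choices.
dropdown = evaluator.update_dataset_choices(
    language="da", task="named-entity-recognition"
)
print(dropdown.choices)
```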

source method HumanEvaluator.update_dataset(dataset_name: str, iteration: int) → tuple[Markdown, Markdown, Dropdown, Textbox, Button, Button, Textbox, Button]

Update the dataset based on a selected dataset name.

Parameters

  • dataset_name : str

    The dataset name selected by the user.

  • iteration : int

    The iteration index of the datasets to evaluate.

Returns

  • tuple[Markdown, Markdown, Dropdown, Textbox, Button, Button, Textbox, Button] A tuple (task_examples, question, entity_type, entity, entity_add_button, entity_reset_button, answer, submit_button) for the selected dataset.

Raises

  • NotImplementedError
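
A hedged sketch of loading the first iteration of a dataset and unpacking the returned Gradio components; the dataset name is illustrative, and in the running app these components are refreshed through Gradio events rather than called directly.

```python
from euroeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(annotator_id=0, title="Demo", description="Demo run")

# Load the first iteration of an illustrative dataset and unpack the Gradio
# components that the app updates for the annotator.
(
    task_examples,
    question,
    entity_type,
    entity,
    entity_add_button,
    entity_reset_button,
    answer,
    submit_button,
) = evaluator.update_dataset(dataset_name="angry-tweets", iteration=0)
```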

source method HumanEvaluator.add_entity_to_answer(question: str, entity_type: str, entity: str, answer: str) → tuple[Textbox, Textbox]

Add an entity to the answer.

Parameters

  • question : str

    The current question.

  • entity_type : str

    The entity type selected by the user.

  • entity : str

    The entity provided by the user.

  • answer : str

    The current answer.

Returns

  • tuple[Textbox, Textbox] A tuple (entity, answer), where the entity textbox is blanked and the answer textbox contains the updated answer.
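
A minimal sketch of appending a single entity to the running answer in a NER-style task; the question, entity type, and entity are illustrative, and the returned Gradio textboxes are assumed to expose their new contents via the usual value attribute.

```python
from euroeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(annotator_id=0, title="Demo", description="Demo run")

# Append an illustrative person entity to an empty answer. The entity textbox
# comes back blank so the next entity can be typed, while the answer textbox
# accumulates everything added so far.
entity_box, answer_box = evaluator.add_entity_to_answer(
    question="Hans Christian Andersen blev født i Odense.",
    entity_type="person",
    entity="Hans Christian Andersen",
    answer="",
)
print(answer_box.value)
```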

source method HumanEvaluator.reset_entities() → Textbox

Reset the entities in the answer.

Returns

  • Textbox A blank answer.

source method HumanEvaluator.submit_answer(dataset_name: str, question: str, answer: str, annotator_id: int) → tuple[str, str]

Submit an answer to the dataset.

Parameters

  • dataset_name : str

    The name of the dataset.

  • question : str

    The question for the dataset.

  • answer : str

    The answer to the question.

  • annotator_id : int

    The annotator ID for the evaluation.

Returns

  • tuple[str, str] A tuple (question, answer), with question being the next question, and answer being an empty string.
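
A hedged sketch of submitting one answer, assuming the method records the answer for the given annotator and returns the next question; the dataset name, question, and answer are illustrative.

```python
from euroeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(annotator_id=0, title="Demo", description="Demo run")

# Submit an illustrative answer; the return value is the next question to
# show the annotator together with an empty answer string.
next_question, blank_answer = evaluator.submit_answer(
    dataset_name="angry-tweets",
    question="Hvad er sentimentet i dette tweet?",
    answer="negativ",
    annotator_id=0,
)
```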

source method HumanEvaluator.example_to_markdown(example: dict) → tuple[str, str]

Convert an example to a Markdown string.

Parameters

  • example : dict

    The example to convert.

Returns

  • tuple[str, str] A tuple (task_examples, question) for the example.
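
A minimal sketch of rendering a single example as Markdown; the example dict used here is illustrative, and the exact keys the method expects depend on the dataset being evaluated.

```python
from euroeval.human_evaluation import HumanEvaluator

evaluator = HumanEvaluator(annotator_id=0, title="Demo", description="Demo run")

# Convert an illustrative example to the Markdown strings shown in the app;
# the dict keys are an assumption and may differ per dataset.
task_examples, question = evaluator.example_to_markdown(
    example={"text": "Jeg elsker denne film!", "label": "positiv"}
)
print(question)
```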

source method HumanEvaluator.compute_and_log_scores() → None

Compute and log the scores for the dataset.

source main(annotator_id: int) → None

Start the Gradio app for human evaluation.

Raises