euroeval.metrics

source module euroeval.metrics

All the metrics used in EuroEval.

Classes

source class Metric(name: str, pretty_name: str, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)

Bases : abc.ABC

Abstract base class for all metrics.

Initialise the metric.

Parameters

  • name : str The name of the metric in snake_case.

  • pretty_name : str The pretty name of the metric, used for display purposes.

  • postprocessing_fn : t.Callable[[float], tuple[float, str]] | None A function applied to each metric score after it is computed, mapping the raw score to the postprocessed score together with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
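
Since Metric is abstract, it is not instantiated directly; subclasses provide the actual computation. As a minimal sketch, a custom postprocessing_fn matching the documented signature could look as follows (the function name is illustrative):

```python
def as_percentage(score: float) -> tuple[float, str]:
    """Map a raw score in [0, 1] to a 0-100 score plus a display string."""
    # f"{score:.2%}" multiplies by 100 and appends "%" when formatting.
    return 100 * score, f"{score:.2%}"

as_percentage(0.874)  # (87.4, "87.40%")
```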

source class HuggingFaceMetric(name: str, pretty_name: str, huggingface_id: str, results_key: str, compute_kwargs: dict[str, t.Any] | None = None, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)

Bases : Metric

A metric which is implemented in the evaluate package.

Initialise the Hugging Face metric.

Attributes

  • name The name of the metric in snake_case.

  • pretty_name The pretty name of the metric, used for display purposes.

  • huggingface_id The Hugging Face ID of the metric.

  • results_key The name of the key used to extract the metric scores from the results dictionary.

  • compute_kwargs : dict[str, t.Any] Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.

Parameters

  • name : str The name of the metric in snake_case.

  • pretty_name : str The pretty name of the metric, used for display purposes.

  • huggingface_id : str The Hugging Face ID of the metric.

  • results_key : str The name of the key used to extract the metric scores from the results dictionary.

  • compute_kwargs : dict[str, t.Any] | None Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.

  • postprocessing_fn : t.Callable[[float], tuple[float, str]] | None A function applied to each metric score after it is computed, mapping the raw score to the postprocessed score together with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
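
A minimal instantiation sketch, assuming the evaluate package's "f1" metric; the Hugging Face ID, results key and compute keyword arguments below are illustrative, not taken from the EuroEval source:

```python
from euroeval.metrics import HuggingFaceMetric

# Hypothetical metric wrapping `evaluate`'s "f1" metric with macro averaging.
macro_f1_metric = HuggingFaceMetric(
    name="macro_f1",
    pretty_name="Macro-average F1 score",
    huggingface_id="f1",  # ID of the metric in the `evaluate` package
    results_key="f1",  # key holding the score in the results dictionary
    compute_kwargs=dict(average="macro"),
)
```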

source class LLMAsAJudgeMetric(name: str, pretty_name: str, judge_id: str, judge_kwargs: dict[str, t.Any], user_prompt: str, response_format: t.Type[BaseModel], scoring_fn: t.Callable[[BaseModel], float], condition_formatting_fn: t.Callable[[str], str] = lambda x: x, system_prompt: str | None = None)

Bases : Metric

Use an LLM to judge the quality of the predictions.

Initialise the LLM as a judge metric.

Parameters

  • name : str The name of the metric in snake_case.

  • pretty_name : str The pretty name of the metric, used for display purposes.

  • judge_id : str The model ID of the LLM to use as a judge.

  • judge_kwargs : dict[str, t.Any] Generation parameters for the judge model, such as temperature.

  • user_prompt : str The user prompt to use for the judge model. The prompt should contain the format variables prediction and condition, which are replaced by the model prediction and a description of what the prediction should be judged on, respectively. If no condition is needed, it can be omitted from the prompt, but the prediction variable must always be present.

  • response_format : t.Type[BaseModel] The response format to use for the judge model. This should be a Pydantic model that defines the expected structure of the judge's response.

  • scoring_fn : t.Callable[[BaseModel], float] A function that takes the judge's response and returns a score.

  • condition_formatting_fn : optional A function to format the condition string before it is included in the user prompt. Defaults to a no-op function that returns the input unchanged.

  • system_prompt : optional The system prompt to use for the judge model. If not provided, no system prompt will be used.
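
A minimal sketch of a judge metric; the judge model, prompt, response format and scoring function below are all illustrative assumptions, not taken from the EuroEval source:

```python
from pydantic import BaseModel, Field

from euroeval.metrics import LLMAsAJudgeMetric

class Relevance(BaseModel):
    """Hypothetical response format: a relevance rating from 1 to 5."""
    relevance: int = Field(ge=1, le=5)

relevance_metric = LLMAsAJudgeMetric(
    name="relevance",
    pretty_name="Relevance",
    judge_id="gpt-4o-mini",  # hypothetical judge model ID
    judge_kwargs=dict(temperature=0.0),
    user_prompt=(
        "Rate how well the prediction satisfies the condition, "
        "on a scale from 1 to 5.\n\n"
        "Condition: {condition}\n\nPrediction: {prediction}"
    ),
    response_format=Relevance,
    # Normalise the 1-5 rating to a score in [0, 1].
    scoring_fn=lambda response: (response.relevance - 1) / 4,
)
```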

source class SpeedMetric(name: str, pretty_name: str)

Bases : Metric

Speed metric.

Initialise the speed metric.

Parameters

  • name : str The name of the metric in snake_case.

  • pretty_name : str The pretty name of the metric, used for display purposes.
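
A minimal instantiation sketch; the name and pretty name are illustrative:

```python
from euroeval.metrics import SpeedMetric

speed_metric = SpeedMetric(name="speed", pretty_name="Tokens per second")
```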

source class Fluency(**data: Any)

Bases : BaseModel

Response format for the fluency metric.

Create a new model by parsing and validating input data from keyword arguments.

Raises a pydantic_core.ValidationError if the input data cannot be validated to form a valid model.

self is explicitly positional-only to allow self as a field name.

Attributes

  • fluency : t.Annotated[int, Field(ge=1, le=5)] The fluency rating, an integer between 1 and 5.

  • model_config : ClassVar[ConfigDict] Configuration for the model, should be a dictionary conforming to pydantic.config.ConfigDict.

  • model_extra : dict[str, Any] | None The extra fields set during validation, or None if extra fields are not permitted by the model config.

  • model_fields_set : set[str] The set of fields that have been explicitly set on this model instance.
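
A minimal usage sketch, parsing a hypothetical judge response into the model:

```python
from euroeval.metrics import Fluency

rating = Fluency(fluency=4)
print(rating.fluency)  # 4

# Out-of-range values fail validation:
# Fluency(fluency=7)  # raises pydantic_core.ValidationError
```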