euroeval.metrics
All the metrics used in EuroEval.
Classes
- Metric — Abstract base class for all metrics.
- HuggingFaceMetric — A metric implemented in the evaluate package.
- LLMAsAJudgeMetric — Use an LLM to judge the quality of the predictions.
- SpeedMetric — Speed metric.
- Fluency — Response format for the fluency metric.
class Metric(name: str, pretty_name: str, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)
Bases: abc.ABC
Abstract base class for all metrics.
Initialise the metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- postprocessing_fn : t.Callable[[float], tuple[float, str]] | None — A function applied to the metric score after it is computed, mapping the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
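The postprocessing_fn signature can be illustrated with a short, self-contained sketch; the function name here is hypothetical and not part of EuroEval:

```python
# Hypothetical custom postprocessing_fn matching the documented
# signature: it maps the raw metric score to the postprocessed score
# together with its string representation. Unlike the documented
# default, this variant keeps the score on its original scale, e.g.
# for metrics already reported as percentages.
def identity_postprocessing(score: float) -> tuple[float, str]:
    return score, f"{score:.2f}"
```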
class HuggingFaceMetric(name: str, pretty_name: str, huggingface_id: str, results_key: str, compute_kwargs: dict[str, t.Any] | None = None, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)
Bases: Metric
A metric implemented in the evaluate package.
Initialise the Hugging Face metric.
Attributes
- name — The name of the metric in snake_case.
- pretty_name — The pretty name of the metric, used for display purposes.
- huggingface_id — The Hugging Face ID of the metric.
- results_key — The name of the key used to extract the metric scores from the results dictionary.
- compute_kwargs : dict[str, t.Any] — Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- huggingface_id : str — The Hugging Face ID of the metric.
- results_key : str — The name of the key used to extract the metric scores from the results dictionary.
- compute_kwargs : dict[str, t.Any] | None — Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.
- postprocessing_fn : t.Callable[[float], tuple[float, str]] | None — A function applied to the metric score after it is computed, mapping the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
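The compute-and-extract flow that a HuggingFaceMetric wraps can be sketched as follows; the literal results dict stands in for the output of `evaluate.load(huggingface_id).compute(...)` so the sketch is self-contained, and the helper name is illustrative:

```python
# Minimal sketch: extract the score under results_key from a results
# dictionary (as returned by the Hugging Face `evaluate` package) and
# apply the documented default postprocessing.
def extract_and_postprocess(results: dict, results_key: str) -> tuple[float, str]:
    score = results[results_key]
    # Documented default postprocessing: percentage plus display string.
    return 100 * score, f"{score:.2%}"

value, display = extract_and_postprocess({"f1": 0.875}, "f1")
```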
class LLMAsAJudgeMetric(name: str, pretty_name: str, judge_id: str, judge_kwargs: dict[str, t.Any], user_prompt: str, response_format: t.Type[BaseModel], scoring_fn: t.Callable[[BaseModel], float], condition_formatting_fn: t.Callable[[str], str] = lambda x: x, system_prompt: str | None = None)
Bases: Metric
Use an LLM to judge the quality of the predictions.
Initialise the LLM-as-a-judge metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- judge_id : str — The model ID of the LLM to use as a judge.
- judge_kwargs : dict[str, t.Any] — Generation parameters for the judge model, such as temperature.
- user_prompt : str — The user prompt to use for the judge model. The prompt should be formatted with the variables prediction and condition, which hold the model predictions and a description of what the predictions should be judged on, respectively. If the condition is not needed, it can be omitted from the prompt, but the prediction variable must still be present.
- response_format : t.Type[BaseModel] — The response format to use for the judge model. This should be a Pydantic model that defines the expected structure of the judge's response.
- scoring_fn : t.Callable[[BaseModel], float] — A function that takes the judge's response and returns a score.
- condition_formatting_fn : optional — A function to format the condition string before it is included in the user prompt. Defaults to a no-op function that returns the input unchanged.
- system_prompt : optional — The system prompt to use for the judge model. If not provided, no system prompt will be used.
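The pieces an LLMAsAJudgeMetric expects can be sketched together; the rating scale, class, and function names below are assumptions, and a dataclass stands in for the Pydantic response_format to keep the sketch self-contained:

```python
from dataclasses import dataclass

# Stand-in for a Pydantic response_format model (the real one must
# subclass BaseModel); the 1-5 scale is an illustrative assumption.
@dataclass
class JudgeResponse:
    rating: int

# A user_prompt with the documented prediction and condition variables.
user_prompt = (
    "Rate the following prediction on a 1-5 scale.\n"
    "Prediction: {prediction}\n"
    "Criterion: {condition}"
)

# A scoring_fn mapping the judge's response to a score in [0, 1].
def scoring_fn(response: JudgeResponse) -> float:
    return (response.rating - 1) / 4

prompt = user_prompt.format(prediction="The cat sat.", condition="fluency")
score = scoring_fn(JudgeResponse(rating=5))
```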
class SpeedMetric(name: str, pretty_name: str)
Bases: Metric
Speed metric.
Initialise the speed metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
class Fluency(**data: Any)
Bases: BaseModel
Response format for the fluency metric.
Create a new model by parsing and validating input data from keyword arguments.
Raises pydantic_core.ValidationError if the input data cannot be validated to form a valid model. self is explicitly positional-only to allow self as a field name.
Attributes
- fluency : t.Annotated[int, Field(ge=1, le=5)] — The fluency rating, an integer between 1 and 5.
- model_config : ClassVar[ConfigDict] — Configuration for the model, which should be a dictionary conforming to pydantic.config.ConfigDict.
- model_extra : dict[str, Any] | None — Extra fields set during validation.
- model_fields_set : set[str] — The set of fields that have been explicitly set on this model instance.
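The 1-to-5 constraint on the fluency field can be mimicked without Pydantic; the class below is a hypothetical stand-in, not the real Fluency model:

```python
from dataclasses import dataclass

# Stand-in for the Fluency response format: the real class is a
# Pydantic BaseModel whose fluency field is constrained to 1..5 via
# Field(ge=1, le=5); __post_init__ mimics that validation here.
@dataclass
class FluencySketch:
    fluency: int

    def __post_init__(self) -> None:
        if not 1 <= self.fluency <= 5:
            raise ValueError("fluency must be an integer between 1 and 5")
```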