euroeval.metrics
source module euroeval.metrics
All the metrics used in EuroEval.
Classes
- Metric — Abstract base class for all metrics.
- HuggingFaceMetric — A metric which is implemented in the evaluate package.
- PipelineMetric — Load a scikit-learn pipeline and use it to get scores from the predictions.
- LLMAsAJudgeMetric — Use an LLM to judge the quality of the predictions.
- SpeedMetric — Speed metric.
- Fluency — Response format for the fluency metric.
Functions
- european_values_preprocessing_fn — Preprocess the model predictions for the European Values metric.
- european_values_scoring_function — Scoring function for the European Values metric.
source class Metric(name: str, pretty_name: str, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)
Bases : abc.ABC
Abstract base class for all metrics.
Initialise the metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- postprocessing_fn : t.Callable[[float], tuple[float, str]] | None — A function to apply to the metric scores after they are computed, mapping the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
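The default postprocessing behaviour described above can be sketched as a plain function (a minimal illustration written from the documented default, not taken from the EuroEval source):

```python
def default_postprocessing(score: float) -> tuple[float, str]:
    """Scale a raw score in [0, 1] to a percentage and format it for display."""
    return 100 * score, f"{score:.2%}"

print(default_postprocessing(0.5))  # (50.0, '50.00%')
```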
source class HuggingFaceMetric(name: str, pretty_name: str, huggingface_id: str, results_key: str, compute_kwargs: dict[str, t.Any] | None = None, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)
Bases : Metric
A metric which is implemented in the evaluate package.
Initialise the Hugging Face metric.
Attributes
- name — The name of the metric in snake_case.
- pretty_name — The pretty name of the metric, used for display purposes.
- huggingface_id — The Hugging Face ID of the metric.
- results_key — The name of the key used to extract the metric scores from the results dictionary.
- compute_kwargs : dict[str, t.Any] — Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- huggingface_id : str — The Hugging Face ID of the metric.
- results_key : str — The name of the key used to extract the metric scores from the results dictionary.
- compute_kwargs : dict[str, t.Any] | None — Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.
- postprocessing_fn : t.Callable[[float], tuple[float, str]] | None — A function to apply to the metric scores after they are computed, mapping the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
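To illustrate the role of results_key: metrics from the evaluate package return a dictionary from compute(), and results_key selects the relevant entry. A minimal sketch with a hand-written results dictionary (the key name "accuracy" is an illustrative assumption, not tied to any particular EuroEval metric):

```python
# Stand-in for the dictionary returned by a Hugging Face evaluate metric's
# compute() call; the key name "accuracy" is an illustrative assumption.
results = {"accuracy": 0.75}

results_key = "accuracy"
score = results[results_key]

# The documented default postprocessing then scales and formats the score.
postprocessed, as_str = 100 * score, f"{score:.2%}"
print(postprocessed, as_str)  # 75.0 75.00%
```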
source class PipelineMetric(name: str, pretty_name: str, pipeline_repo: str, pipeline_scoring_function: c.Callable[['Pipeline', c.Sequence], float], pipeline_file_name: str = 'pipeline.pkl', preprocessing_fn: c.Callable[[c.Sequence[T]], c.Sequence[T]] = lambda x: x, postprocessing_fn: c.Callable[[float], tuple[float, str]] | None = None)
Bases : Metric
Load a scikit-learn pipeline and use it to get scores from the predictions.
Initialise the pipeline transform metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- pipeline_repo : str — The Hugging Face repository ID of the scikit-learn pipeline to load.
- pipeline_scoring_function : c.Callable[['Pipeline', c.Sequence], float] — The function used to score the predictions with the pipeline. Takes the loaded pipeline and a 1D sequence of predictions and returns a float score.
- pipeline_file_name : optional — The name of the file to download from the Hugging Face repository. Defaults to "pipeline.pkl".
- preprocessing_fn : optional — A function to apply to the predictions before they are passed to the pipeline. This is useful for preprocessing the predictions to match the expected input format of the pipeline. Defaults to a no-op function that returns the input unchanged.
- postprocessing_fn : optional — A function to apply to the metric scores after they are computed, mapping the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
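The expected shape of pipeline_scoring_function can be sketched with a stand-in pipeline object. This is a hypothetical example: the real pipeline is a scikit-learn Pipeline downloaded from the Hugging Face Hub, and the mean-of-predictions scoring rule here is invented for illustration.

```python
from typing import Sequence


class StubPipeline:
    """Stand-in for a scikit-learn Pipeline, exposing only predict()."""

    def predict(self, X: Sequence[int]) -> list[float]:
        # A real pipeline would transform and predict; here we just cast.
        return [float(x) for x in X]


def mean_score(pipeline: StubPipeline, predictions: Sequence[int]) -> float:
    """Hypothetical scoring function: run the pipeline and average its output."""
    outputs = pipeline.predict(predictions)
    return sum(outputs) / len(outputs)


print(mean_score(StubPipeline(), [1, 2, 3]))  # 2.0
```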
source class LLMAsAJudgeMetric(name: str, pretty_name: str, judge_id: str, judge_kwargs: dict[str, t.Any], user_prompt: str, response_format: t.Type[BaseModel], scoring_fn: t.Callable[[BaseModel], float], condition_formatting_fn: t.Callable[[str], str] = lambda x: x, system_prompt: str | None = None)
Bases : Metric
Use an LLM to judge the quality of the predictions.
Initialise the LLM as a judge metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- judge_id : str — The model ID of the LLM to use as a judge.
- judge_kwargs : dict[str, t.Any] — Generation parameters for the judge model, such as temperature.
- user_prompt : str — The user prompt to use for the judge model. The prompt should be formatted with the variables prediction and condition, to include the model predictions and a description of what the prediction should be judged on, respectively. If the condition is not needed, it can be omitted from the prompt, but the prediction variable must still be present.
- response_format : t.Type[BaseModel] — The response format to use for the judge model. This should be a Pydantic model that defines the expected structure of the judge's response.
- scoring_fn : t.Callable[[BaseModel], float] — A function that takes the judge's response and returns a score.
- condition_formatting_fn : optional — A function to format the condition string before it is included in the user prompt. Defaults to a no-op function that returns the input unchanged.
- system_prompt : optional — The system prompt to use for the judge model. If not provided, no system prompt will be used.
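The user_prompt contract above can be illustrated with str.format. The prompt text here is an invented placeholder, not the prompt EuroEval uses; only the {prediction} and {condition} variable names come from the documentation.

```python
# Hypothetical prompt template containing both documented variables.
user_prompt = (
    "Here is a model prediction:\n{prediction}\n\n"
    "Judge it on the following criterion:\n{condition}"
)

filled = user_prompt.format(
    prediction="The cat sat on the mat.",
    condition="The text should be fluent English.",
)
print(filled)
```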
source class SpeedMetric(name: str, pretty_name: str)
Bases : Metric
Speed metric.
Initialise the speed metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
source european_values_preprocessing_fn(predictions: c.Sequence[int]) → c.Sequence[int]
Preprocess the model predictions for the European Values metric.
Parameters
- predictions : c.Sequence[int] — The model predictions, a sequence of integers representing the predicted choices for each question.
Returns
- c.Sequence[int] — The preprocessed model predictions, a sequence of integers representing the final predicted choices for each question after any necessary aggregation and mapping.
Raises
- AssertionError — If the number of predictions is not a multiple of 53, which is required for the European Values metric.
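The length requirement can be sketched as follows. The aggregation and mapping steps are internal to EuroEval and are not documented here, so this example only shows the multiple-of-53 check that raises the documented AssertionError.

```python
from collections.abc import Sequence

# The European Values questionnaire length stated in the docstring above.
NUM_QUESTIONS = 53


def check_predictions_length(predictions: Sequence[int]) -> None:
    """Raise AssertionError unless the predictions cover whole questionnaires."""
    assert len(predictions) % NUM_QUESTIONS == 0, (
        f"Expected a multiple of {NUM_QUESTIONS} predictions, "
        f"got {len(predictions)}"
    )


check_predictions_length([0] * 106)  # two full questionnaires: passes silently
```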
source european_values_scoring_function(pipeline: Pipeline, predictions: c.Sequence[int]) → float
Scoring function for the European Values metric.
source class Fluency(**data: Any)
Bases : BaseModel
Response format for the fluency metric.
Create a new model by parsing and validating input data from keyword arguments.
Raises ValidationError (pydantic_core.ValidationError) if the input data cannot be validated to form a valid model.
self is explicitly positional-only to allow self as a field name.
Attributes
- fluency : t.Annotated[int, Field(ge=1, le=5)] — The fluency rating, an integer between 1 and 5.
- model_config : ClassVar[ConfigDict] — Configuration for the model, should be a dictionary conforming to ConfigDict (pydantic.config.ConfigDict).
- model_extra : dict[str, Any] | None — Get extra fields set during validation.
- model_fields_set : set[str] — Returns the set of fields that have been explicitly set on this model instance.
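The Fluency model constrains its single field to the 1-5 range via pydantic's Field(ge=1, le=5). Below is a dependency-free stand-in with the same validation behaviour, written as a dataclass rather than a pydantic model, so the error raised is a plain ValueError instead of pydantic's ValidationError.

```python
from dataclasses import dataclass


@dataclass
class FluencyStandIn:
    """Mimics the Fluency response format: an integer rating between 1 and 5."""

    fluency: int

    def __post_init__(self) -> None:
        # Equivalent to pydantic's Field(ge=1, le=5) bounds on the real model.
        if not 1 <= self.fluency <= 5:
            raise ValueError(f"fluency must be between 1 and 5, got {self.fluency}")


print(FluencyStandIn(fluency=4).fluency)  # 4
```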