euroeval.metrics
source package euroeval.metrics
All the metrics used in EuroEval.
Classes
- Metric — Abstract base class for all metrics.
- BiasMetric — Bias and accuracy metrics for MBBQ (Neplenbroek et al., 2024).
- HuggingFaceMetric — A metric which is implemented in the evaluate package.
- SourceBasedMetric — Subclass of HuggingFaceMetric for metrics that also require source text as input.
- IFEvalInstructionAccuracy — Metric for instruction-level accuracy using IFEval methodology.
- LanguageDetector — Language detector class.
- LLMAsAJudgeMetric — Use an LLM to judge the quality of the predictions.
- PreprocessingFunction — A protocol for a preprocessing function.
- PipelineMetric — Load a scikit-learn pipeline and use it to get scores from the predictions.
- ChrF — The ChrF metric.
- SpeedMetric — Speed metric.
- ToolCallingAccuracy — Metric for tool calling.
Functions
- create_model_graded_fact_metric — Create a model-graded fact metric that uses a given judge model.
- european_values_preprocessing_fn — Preprocess the model predictions for the European Values metric.
- european_values_scoring_function — Scoring function for the European Values metric.
source class Metric(name: str, pretty_name: str, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)
Bases : abc.ABC
Abstract base class for all metrics.
Initialise the metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- postprocessing_fn : t.Callable[[float], tuple[float, str]] | None — A function to apply to the metric scores after they are computed, taking the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
Methods
- download — Initiates the download of the metric if needed.
source method Metric.download(cache_dir: str) → Metric
Initiates the download of the metric if needed.
Parameters
- cache_dir : str — The directory where the metric will be downloaded to.
Returns
- Metric — The metric object itself.
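The documented default for postprocessing_fn takes a score in [0, 1] to a percentage and its string rendering. A minimal sketch of that default, written out as a standalone function (the name default_postprocessing is illustrative, not part of euroeval):

```python
def default_postprocessing(x: float) -> tuple[float, str]:
    """Scale a [0, 1] score to [0, 100] and render it as a percent string."""
    return 100 * x, f"{x:.2%}"

score, pretty = default_postprocessing(0.8765)
# score ≈ 87.65, pretty == "87.65%"
```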
source class BiasMetric(name: str, pretty_name: str, bias_type: BiasType = 'bias_adjusted_accuracy_ambig')
Bases : Metric
Bias and accuracy metrics for MBBQ (Neplenbroek et al., 2024).
Initialise the bias metric.
Context type
- Ambiguous: the correct answer should be "unknown/not enough information".
Metrics
- Ambiguous bias (bias_ambig): (stereotype picks − counter-stereotype picks) / n_ambiguous
- Ambiguous accuracy (accuracy_ambig): correct "unknown" picks / n_ambiguous
- Bias-adjusted accuracy: accuracy minus |bias|, clamped at zero.
Notes
- "Unknown/not enough info" answers are ignored in bias numerators.
- Returns NaN when the context type is absent.
Parameters
- name : str — Metric identifier.
- pretty_name : str — Human-readable metric name.
- bias_type : BiasType — Metric variant to compute.
Raises
- ValueError — If the bias type is not one of the supported options.
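The ambiguous-context formulas above can be sketched directly on a toy list of model picks. This is an illustration of the arithmetic only, not the euroeval implementation; the labels "stereo", "counter", and "unknown" are stand-ins for what the model chose on each ambiguous question, where "unknown" is always the correct answer:

```python
# Toy picks on five ambiguous questions (labels are illustrative).
picks = ["stereo", "unknown", "counter", "stereo", "unknown"]

n_ambiguous = len(picks)
# bias_ambig: (stereotype picks − counter-stereotype picks) / n_ambiguous
bias_ambig = (picks.count("stereo") - picks.count("counter")) / n_ambiguous
# accuracy_ambig: correct "unknown" picks / n_ambiguous
accuracy_ambig = picks.count("unknown") / n_ambiguous
# Bias-adjusted accuracy: accuracy minus |bias|, clamped at zero.
bias_adjusted_accuracy = max(accuracy_ambig - abs(bias_ambig), 0.0)
# bias_ambig == 0.2, accuracy_ambig == 0.4, bias_adjusted_accuracy == 0.2
```

Note how the clamp ensures a heavily biased but occasionally correct model cannot receive a negative score.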
source class HuggingFaceMetric(name: str, pretty_name: str, huggingface_id: str, results_key: str, compute_kwargs: dict[str, t.Any] | None = None, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)
Bases : Metric
A metric which is implemented in the evaluate package.
Initialise the Hugging Face metric.
Attributes
- name — The name of the metric in snake_case.
- pretty_name — The pretty name of the metric, used for display purposes.
- huggingface_id — The Hugging Face ID of the metric.
- results_key — The name of the key used to extract the metric scores from the results dictionary.
- compute_kwargs : dict[str, t.Any] — Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- huggingface_id : str — The Hugging Face ID of the metric.
- results_key : str — The name of the key used to extract the metric scores from the results dictionary.
- compute_kwargs : dict[str, t.Any] | None — Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.
- postprocessing_fn : t.Callable[[float], tuple[float, str]] | None — A function to apply to the metric scores after they are computed, taking the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
Methods
- download — Initiates the download of the metric if needed.
source method HuggingFaceMetric.download(cache_dir: str) → HuggingFaceMetric
Initiates the download of the metric if needed.
Parameters
- cache_dir : str — The directory where the metric will be downloaded to.
Returns
- HuggingFaceMetric — The metric object itself.
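To illustrate the role of results_key: an evaluate metric's compute() call returns a dictionary, and results_key names the entry holding the score. The sketch below is hypothetical (the extract_score helper and the assumption that a list of per-example scores is averaged are illustrations, not the euroeval implementation):

```python
def extract_score(results: dict, results_key: str) -> float:
    """Pull the score out of a compute() results dict by its key."""
    scores = results[results_key]
    # Some metrics return per-example scores; average those (an assumption
    # made for this sketch, not necessarily what euroeval does).
    if isinstance(scores, list):
        return sum(scores) / len(scores)
    return scores

extract_score({"accuracy": 0.91}, results_key="accuracy")   # 0.91
extract_score({"f1": [0.5, 1.0]}, results_key="f1")          # 0.75
```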
source class SourceBasedMetric(name: str, pretty_name: str, huggingface_id: str, results_key: str, compute_kwargs: dict[str, t.Any] | None = None, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)
Bases : HuggingFaceMetric
Subclass of HuggingFaceMetric for metrics that also require source text as input.
Initialise the Hugging Face metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- huggingface_id : str — The Hugging Face ID of the metric.
- results_key : str — The name of the key used to extract the metric scores from the results dictionary.
- compute_kwargs : dict[str, t.Any] | None — Keyword arguments to pass to the metric's compute function. Defaults to an empty dictionary.
- postprocessing_fn : t.Callable[[float], tuple[float, str]] | None — A function to apply to the metric scores after they are computed, taking the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
source class IFEvalInstructionAccuracy()
Bases : Metric
Metric for instruction-level accuracy using IFEval methodology.
Initialise the metric.
source class LanguageDetector()
Language detector class.
Initialize the language detector.
Methods
- download — Download and initialize the language detection model.
source method LanguageDetector.download() → None
Download and initialize the language detection model.
source class LLMAsAJudgeMetric(name: str, pretty_name: str, judge_id: str, judge_kwargs: dict[str, t.Any], user_prompt: str, response_format: t.Type[BaseModel], scoring_fn: ScoringFunction | None = None, batch_scoring_fn: BatchScoringFunction | None = None, condition_formatting_fn: t.Callable[[str], str] = lambda x: x, system_prompt: str | None = None)
Bases : Metric
Use an LLM to judge the quality of the predictions.
Initialise the LLM as a judge metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- judge_id : str — The model ID of the LLM to use as a judge.
- judge_kwargs : dict[str, t.Any] — Generation parameters for the judge model, such as temperature.
- user_prompt : str — The user prompt to use for the judge model. The prompt should be formatted with the variables prediction and condition, to include the model predictions and a description of what the prediction should be judged on, respectively. If the condition is not needed, it can be omitted from the prompt, but the prediction variable must still be present.
- response_format : t.Type[BaseModel] — The response format to use for the judge model. This should be a Pydantic model that defines the expected structure of the judge's response.
- scoring_fn : ScoringFunction | None — A function that takes the judge's response and returns a score.
- batch_scoring_fn : BatchScoringFunction | None — A function that takes all judge responses and returns a score.
- condition_formatting_fn : optional — A function to format the condition string before it is included in the user prompt. Defaults to a no-op function that returns the input unchanged.
- system_prompt : optional — The system prompt to use for the judge model. If not provided, no system prompt will be used.
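A user_prompt with both placeholders can be filled with ordinary str.format; the template text and example values below are illustrative, not the defaults euroeval ships:

```python
# Hypothetical template using the required {prediction} and optional
# {condition} placeholders described above.
user_prompt = (
    "Prediction: {prediction}\n"
    "Judge the prediction on the following criterion: {condition}"
)
filled = user_prompt.format(
    prediction="The Eiffel Tower is in Paris.",
    condition="factual correctness",
)
```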
source create_model_graded_fact_metric(judge_id: str = 'gpt-5-mini', user_prompt: str = "You are evaluating whether a model's answer is factually correct.\n\nReference answer: {condition}\n\nModel's answer: {prediction}\n\nIs the model's answer factually correct? Output a JSON object with a single key 'correct' (true or false).", system_prompt: str | None = None, temperature: float = 1.0, response_format: type[BaseModel] | None = None, scoring_fn: ScoringFunction | None = None) → LLMAsAJudgeMetric
Create a model-graded fact metric that uses a given judge model.
This corresponds to Inspect AI's model_graded_fact scorer, which checks
whether the model's answer is factually consistent with the reference answer.
Parameters
- judge_id : optional — The model ID of the LLM to use as a judge (e.g. openai/o3-mini). Defaults to gpt-5-mini.
- user_prompt : optional — The user prompt template passed to the judge. Must contain {prediction} and {condition} placeholders. Defaults to a prompt that asks whether the model's answer is factually correct.
- system_prompt : optional — An optional system prompt for the judge. Defaults to None.
- temperature : float — Sampling temperature for the judge. Defaults to 1.0.
- response_format : optional — A Pydantic model class that defines the expected JSON structure of the judge's response. The model must have a single boolean field named correct. If not provided, a model with that shape is created automatically via pydantic.create_model.
- scoring_fn : optional — A function mapping the judge's parsed response to a scalar score in [0, 1]. Defaults to 1.0 when correct is True, 0.0 otherwise.
Returns
- LLMAsAJudgeMetric — An LLMAsAJudgeMetric configured for factual-correctness grading.
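The documented default scoring behaviour is easy to sketch. Here the judge's parsed response is modelled as a plain dict rather than a Pydantic instance, purely for illustration:

```python
def default_scoring_fn(response: dict) -> float:
    """1.0 when the judge's response has correct=True, 0.0 otherwise.

    A sketch of the documented default; the real function receives a
    parsed Pydantic object, not a dict.
    """
    return 1.0 if response.get("correct") else 0.0

default_scoring_fn({"correct": True})   # 1.0
default_scoring_fn({"correct": False})  # 0.0
```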
source class PreprocessingFunction()
Bases : t.Protocol
A protocol for a preprocessing function.
source class PipelineMetric(name: str, pretty_name: str, pipeline_repo: str, pipeline_scoring_function: c.Callable[['Pipeline', c.Sequence], float], pipeline_file_name: str = 'pipeline.pkl', preprocessing_fn: PreprocessingFunction | None = None, postprocessing_fn: c.Callable[[float], tuple[float, str]] | None = None)
Bases : Metric
Load a scikit-learn pipeline and use it to get scores from the predictions.
Initialise the pipeline transform metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- pipeline_repo : str — The Hugging Face repository ID of the scikit-learn pipeline to load.
- pipeline_scoring_function : c.Callable[['Pipeline', c.Sequence], float] — The function to use to score the predictions.
- pipeline_file_name : optional — The name of the file to download from the Hugging Face repository. Defaults to "pipeline.pkl".
- preprocessing_fn : optional — A function to apply to the predictions before they are passed to the pipeline. This is useful for preprocessing the predictions to match the expected input format of the pipeline. Defaults to a no-op function that returns the input unchanged.
- postprocessing_fn : optional — A function to apply to the metric scores after they are computed, taking the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").
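A pipeline_scoring_function receives the loaded pipeline and the (preprocessed) predictions and returns a single float. The sketch below uses a stub object in place of a real scikit-learn Pipeline, and the function name and scoring logic are invented for illustration:

```python
class StubPipeline:
    """Stand-in for a scikit-learn Pipeline, for illustration only."""

    def predict(self, X):
        return [1 for _ in X]

def mean_positive_rate(pipeline, predictions) -> float:
    """Hypothetical scoring function: fraction of positive pipeline outputs."""
    outputs = pipeline.predict(predictions)
    return sum(outputs) / len(outputs)

score = mean_positive_rate(StubPipeline(), [[0.1], [0.9], [0.5]])
# score == 1.0, since the stub labels everything positive
```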
source european_values_preprocessing_fn(predictions: c.Sequence[int], dataset: Dataset) → c.Sequence[int]
Preprocess the model predictions for the European Values metric.
Parameters
- predictions : c.Sequence[int] — The model predictions, a sequence of integers representing the predicted choices for each question.
- dataset : Dataset — The dataset used for evaluation. This is only used in case any additional metadata is needed to compute the metrics.
Returns
- c.Sequence[int] — The preprocessed model predictions, a sequence of integers representing the final predicted choices for each question after any necessary aggregation and mapping.
Raises
- InvalidBenchmark — If a question has no valid choices (all choices were None).
source european_values_scoring_function(pipeline: Pipeline, predictions: c.Sequence[int]) → float
Scoring function for the European Values metric.
Parameters
- pipeline : Pipeline — The pipeline to use for scoring.
- predictions : c.Sequence[int] — The predictions to score.
Returns
- float — The score.
source class ChrF(word_order: int = 0, beta: int = 2, language_detector: LanguageDetector | None = None)
Bases : Metric
The ChrF metric.
Initialise the ChrF metric.
Parameters
- word_order : optional — The word order for the ChrF metric. Defaults to 0, which is the original chrF metric. If set to 2, it is the chrF++ metric.
- beta : optional — The beta parameter for the ChrF metric. Defaults to 2, which is the original chrF (and chrF++) metric.
- language_detector : optional — A LanguageDetector instance. If provided, each per-sentence score is multiplied by a binary language penalty (1.0 if the prediction is in the correct language, 0.0 otherwise) before averaging. Defaults to None, which disables language penalization.
Methods
- download — Download the language detection model if needed.
source method ChrF.download(cache_dir: str) → ChrF
Download the language detection model if needed.
Parameters
- cache_dir : str — The directory where the metric will be downloaded to.
Returns
- ChrF — The metric object itself.
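The binary language penalty described for the language_detector parameter amounts to zeroing out sentences in the wrong language before averaging. A sketch with made-up per-sentence ChrF scores and made-up detector verdicts:

```python
# Illustrative per-sentence ChrF scores and detector verdicts; neither
# comes from a real model run.
sentence_scores = [60.0, 40.0, 80.0]
in_correct_language = [True, True, False]

penalised = [
    score * (1.0 if ok else 0.0)
    for score, ok in zip(sentence_scores, in_correct_language)
]
corpus_score = sum(penalised) / len(penalised)
# corpus_score ≈ 33.33: the third sentence is zeroed out, not dropped
```

The penalised sentence still counts in the denominator, so wrong-language output drags the corpus score down rather than being ignored.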
source class SpeedMetric(name: str, pretty_name: str)
Bases : Metric
Speed metric.
Initialise the speed metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
source class ToolCallingAccuracy(name: str, pretty_name: str, postprocessing_fn: t.Callable[[float], tuple[float, str]] | None = None)
Bases : Metric
Metric for tool calling.
Initialise the metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- postprocessing_fn : t.Callable[[float], tuple[float, str]] | None — A function to apply to the metric scores after they are computed, taking the score to the postprocessed score along with its string representation. Defaults to x -> (100 * x, f"{x:.2%}").