euroeval.scores
source module euroeval.scores
Aggregation of raw scores into the mean and a confidence interval.
Functions
- log_scores — Log the scores.
- aggregate_scores — Helper function to compute the mean with confidence intervals.
source log_scores(dataset_name: str, metrics: c.Sequence['Metric'], scores: c.Sequence[dict[str, float]], model_id: str, model_revision: str, model_param: str | None) → ScoreDict
Log the scores.
Parameters
- dataset_name : str — Name of the dataset.
- metrics : c.Sequence['Metric'] — List of metrics to log.
- scores : c.Sequence[dict[str, float]] — The scores to be logged, given as a list of dictionaries mapping metric names to metric values.
- model_id : str — The model ID of the model that was evaluated.
- model_revision : str — The revision of the model.
- model_param : str | None — The model parameter, if any.
Returns
- ScoreDict — A dictionary with keys 'raw_scores' and 'total', where 'raw_scores' is identical to scores and 'total' is a dictionary with the aggregated scores (means and standard errors).
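A minimal usage sketch follows; the metric object, dataset name, model ID, and score keys are illustrative assumptions rather than values taken from the library.

```python
from euroeval.scores import log_scores

# `f1_metric` stands in for a `Metric` instance; how to construct one is
# library-specific and not shown here.
f1_metric = ...  # a Metric whose values appear under e.g. "test_f1"

# Hypothetical raw scores from two evaluation iterations.
raw_scores = [
    {"test_f1": 0.81},
    {"test_f1": 0.83},
]

score_dict = log_scores(
    dataset_name="example-dataset",        # assumed dataset name
    metrics=[f1_metric],
    scores=raw_scores,
    model_id="example-org/example-model",  # assumed model ID
    model_revision="main",
    model_param=None,
)

# score_dict["raw_scores"] is identical to `raw_scores`, while
# score_dict["total"] holds the aggregated means and standard errors.
```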
source aggregate_scores(scores: c.Sequence[dict[str, float]], metric: Metric) → tuple[float, float]
Helper function to compute the mean with confidence intervals.
Parameters
- scores : c.Sequence[dict[str, float]] — A list of dictionaries with the names of the metrics as keys, of the form "<split>_<metric_name>" such as "val_f1", and the metric values as values.
- metric : Metric — The metric, which is used to collect the correct metric values from scores.
Returns
- tuple[float, float] — A pair of floats containing the mean score and the radius of its 95% confidence interval.
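Continuing the sketch above, aggregate_scores picks one metric out of the raw score dictionaries and reduces it to a mean and an interval radius. (For reference, a 95% confidence interval under a normal approximation has radius about 1.96 · s/√n for sample standard deviation s over n iterations; the module's exact computation may differ.)

```python
from euroeval.scores import aggregate_scores

# Reusing `raw_scores` and `f1_metric` from the sketch above.
mean, ci_radius = aggregate_scores(scores=raw_scores, metric=f1_metric)
print(f"f1: {mean:.3f} ± {ci_radius:.3f} (95% CI)")
```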