euroeval.metrics.llm_as_a_judge

source module euroeval.metrics.llm_as_a_judge

Metrics based on LLM-as-a-judge.

source class LLMAsAJudgeMetric(name: str, pretty_name: str, judge_id: str, judge_kwargs: dict[str, t.Any], user_prompt: str, response_format: t.Type[BaseModel], scoring_fn: ScoringFunction | None = None, batch_scoring_fn: BatchScoringFunction | None = None, condition_formatting_fn: t.Callable[[str], str] = lambda x: x, system_prompt: str | None = None)

Bases: Metric

Use an LLM to judge the quality of the predictions.

Initialise the LLM as a judge metric.

Parameters

  • name : str The name of the metric in snake_case.

  • pretty_name : str The pretty name of the metric, used for display purposes.

  • judge_id : str The model ID of the LLM to use as a judge.

  • judge_kwargs : dict[str, t.Any] Generation parameters for the judge model, such as temperature.

  • user_prompt : str The user prompt to use for the judge model. The prompt should contain the placeholders prediction and condition, which are replaced by the model prediction and a description of what the prediction should be judged on, respectively. The condition placeholder may be omitted if it is not needed, but the prediction placeholder must always be present.

  • response_format : t.Type[BaseModel] The response format to use for the judge model. This should be a Pydantic model that defines the expected structure of the judge's response.

  • scoring_fn : ScoringFunction | None A function that takes the judge's response and returns a score.

  • batch_scoring_fn : BatchScoringFunction | None A function that takes all judge responses and returns a score.

  • condition_formatting_fn : optional A function to format the condition string before it is included in the user prompt. Defaults to a no-op function that returns the input unchanged.

  • system_prompt : optional The system prompt to use for the judge model. If not provided, no system prompt will be used.
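To make the prompt and scoring contract concrete, here is a minimal sketch of the pieces that LLMAsAJudgeMetric wires together: a user prompt template with prediction and condition placeholders, a condition formatting function, and a scoring function. The JudgeResponse class below is a hypothetical stand-in for the parsed judge output; in practice it would be an instance of the Pydantic model passed as response_format.

```python
from dataclasses import dataclass

# User prompt template with the two placeholders the metric expects.
user_prompt = (
    "Rate the fluency of this text on a scale from 1 to 5.\n\n"
    "Text: {prediction}\n\nCriteria: {condition}"
)

# Optional formatting applied to the condition before it is inserted
# into the prompt (corresponds to `condition_formatting_fn`).
def condition_formatting_fn(condition: str) -> str:
    return condition.strip().capitalize()

# Hypothetical stand-in for the parsed judge response; in practice this
# would be an instance of the Pydantic `response_format` model.
@dataclass
class JudgeResponse:
    fluency: int

# Scoring function mapping a judge response to a score in [0, 1].
def scoring_fn(response: JudgeResponse) -> float:
    return (response.fluency - 1) / 4

prompt = user_prompt.format(
    prediction="Det er en fin dag.",
    condition=condition_formatting_fn("  grammatical correctness  "),
)
score = scoring_fn(JudgeResponse(fluency=4))  # 0.75
```

Any of these callables can then be passed to the LLMAsAJudgeMetric constructor alongside the judge model ID and generation parameters.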

source create_model_graded_fact_metric(judge_id: str = 'gpt-5-mini', user_prompt: str = "You are evaluating whether a model's answer is factually correct.\n\nReference answer: {condition}\n\nModel's answer: {prediction}\n\nIs the model's answer factually correct? Output a JSON object with a single key 'correct' (true or false).", system_prompt: str | None = None, temperature: float = 1.0, response_format: type[BaseModel] | None = None, scoring_fn: ScoringFunction | None = None) → LLMAsAJudgeMetric

Create a model-graded fact metric that uses a given judge model.

This corresponds to Inspect AI's model_graded_fact scorer, which checks whether the model's answer is factually consistent with the reference answer.

Parameters

  • judge_id : optional The model ID of the LLM to use as a judge (e.g. openai/o3-mini). Defaults to gpt-5-mini.

  • user_prompt : optional The user prompt template passed to the judge. Must contain {prediction} and {condition} placeholders. Defaults to a prompt that asks whether the model's answer is factually correct.

  • system_prompt : optional An optional system prompt for the judge. Defaults to None.

  • temperature : float Sampling temperature for the judge. Defaults to 1.0.

  • response_format : optional A Pydantic model class that defines the expected JSON structure of the judge's response. The model must have a single boolean field named correct. If not provided, a model with that shape is created automatically via pydantic.create_model.

  • scoring_fn : optional A function mapping the judge's parsed response to a scalar score in [0, 1]. Defaults to 1.0 when correct is True, 0.0 otherwise.

Returns

  • LLMAsAJudgeMetric : The configured model-graded fact metric.
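The default response format and scoring rule described above can be sketched as follows: the judge outputs a JSON object with a single boolean key 'correct', which is scored 1.0 when true and 0.0 otherwise. Plain json.loads stands in here for the Pydantic model that would normally be created via pydantic.create_model.

```python
import json

# The judge is asked to output a JSON object with a single key 'correct'.
raw_judge_output = '{"correct": true}'

# Sketch of the default scoring rule: 1.0 when `correct` is true, else 0.0.
# In practice the response is parsed into a Pydantic model with a single
# boolean field named `correct`; json.loads stands in for that here.
def default_scoring_fn(parsed: dict) -> float:
    return 1.0 if parsed["correct"] else 0.0

score = default_scoring_fn(json.loads(raw_judge_output))  # 1.0
```

Passing a custom scoring_fn overrides this rule, for example to award partial credit based on additional fields in a custom response_format.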