euroeval.metrics.llm_as_a_judge¶
source module euroeval.metrics.llm_as_a_judge
Metrics based on LLM-as-a-judge.
Classes
- LLMAsAJudgeMetric — Use an LLM to judge the quality of the predictions.
Functions
- create_model_graded_fact_metric — Create a model-graded fact metric that uses a given judge model.
source class LLMAsAJudgeMetric(name: str, pretty_name: str, judge_id: str, judge_kwargs: dict[str, t.Any], user_prompt: str, response_format: t.Type[BaseModel], scoring_fn: ScoringFunction | None = None, batch_scoring_fn: BatchScoringFunction | None = None, condition_formatting_fn: t.Callable[[str], str] = lambda x: x, system_prompt: str | None = None)
Bases : Metric
Use an LLM to judge the quality of the predictions.
Initialise the LLM as a judge metric.
Parameters
- name : str — The name of the metric in snake_case.
- pretty_name : str — The pretty name of the metric, used for display purposes.
- judge_id : str — The model ID of the LLM to use as a judge.
- judge_kwargs : dict[str, t.Any] — Generation parameters for the judge model, such as temperature.
- user_prompt : str — The user prompt to use for the judge model. The prompt should be formatted with the variables `prediction` and `condition`, to include the model predictions and a description of what the prediction should be judged on, respectively. If the condition is not needed, it can be omitted from the prompt, but the `prediction` variable must still be present.
- response_format : t.Type[BaseModel] — The response format to use for the judge model. This should be a Pydantic model that defines the expected structure of the judge's response.
- scoring_fn : ScoringFunction | None — A function that takes the judge's response and returns a score.
- batch_scoring_fn : BatchScoringFunction | None — A function that takes all judge responses and returns a score.
- condition_formatting_fn : optional — A function to format the condition string before it is included in the user prompt. Defaults to a no-op function that returns the input unchanged.
- system_prompt : optional — The system prompt to use for the judge model. If not provided, no system prompt will be used.
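The interaction between `user_prompt` and `condition_formatting_fn` can be illustrated with plain string formatting. The prompt text, formatter, and inputs below are hypothetical examples, not values taken from EuroEval itself:

```python
# Hypothetical prompt template with the required {prediction} placeholder
# and the optional {condition} placeholder, as described above.
user_prompt = (
    "Reference answer: {condition}\n\n"
    "Model's answer: {prediction}\n\n"
    "Is the model's answer correct?"
)

# Hypothetical condition formatter (the default is a no-op, lambda x: x).
def condition_formatting_fn(condition: str) -> str:
    return condition.strip()

# The metric would fill the template roughly like this before calling the judge:
filled = user_prompt.format(
    prediction="Paris is the capital of France.",
    condition=condition_formatting_fn("  The capital of France is Paris.  "),
)
print(filled)
```

If the condition is not needed, the template can omit `{condition}` entirely; only `{prediction}` must be present.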
source create_model_graded_fact_metric(judge_id: str = 'gpt-5-mini', user_prompt: str = "You are evaluating whether a model's answer is factually correct.\n\nReference answer: {condition}\n\nModel's answer: {prediction}\n\nIs the model's answer factually correct? Output a JSON object with a single key 'correct' (true or false).", system_prompt: str | None = None, temperature: float = 1.0, response_format: type[BaseModel] | None = None, scoring_fn: ScoringFunction | None = None) → LLMAsAJudgeMetric
Create a model-graded fact metric that uses a given judge model.
This corresponds to Inspect AI's model_graded_fact scorer, which checks
whether the model's answer is factually consistent with the reference answer.
Parameters
- judge_id : optional — The model ID of the LLM to use as a judge (e.g. `openai/o3-mini`). Defaults to `gpt-5-mini`.
- user_prompt : optional — The user prompt template passed to the judge. Must contain `{prediction}` and `{condition}` placeholders. Defaults to a prompt that asks whether the model's answer is factually correct.
- system_prompt : optional — An optional system prompt for the judge. Defaults to None.
- temperature : float — Sampling temperature for the judge. Defaults to 1.0.
- response_format : optional — A Pydantic model class that defines the expected JSON structure of the judge's response. The model must have a single boolean field named `correct`. If not provided, a model with that shape is created automatically via `pydantic.create_model`.
- scoring_fn : optional — A function mapping the judge's parsed response to a scalar score in [0, 1]. Defaults to 1.0 when `correct` is True, 0.0 otherwise.
Returns
- LLMAsAJudgeMetric — An `LLMAsAJudgeMetric` configured for factual-correctness grading.
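The default scoring behaviour can be sketched without EuroEval: the judge is asked for a JSON object with a single boolean key `correct`, and the default scoring function maps that to 1.0 or 0.0. The helper below is a hypothetical stand-in for that logic, not the library's actual implementation:

```python
import json

def default_scoring_fn(judge_response: str) -> float:
    """Hypothetical stand-in for the default scoring function: parse the
    judge's JSON reply and map the boolean 'correct' field to a score."""
    parsed = json.loads(judge_response)
    return 1.0 if parsed["correct"] else 0.0

print(default_scoring_fn('{"correct": true}'))   # 1.0
print(default_scoring_fn('{"correct": false}'))  # 0.0
```

A custom `scoring_fn` could instead return graded values in [0, 1], e.g. partial credit for partially correct answers, as long as it accepts the judge's parsed response.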