Running Evaluations
Execute your configured LLM judges to evaluate your dataset.
Quick Setup
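The exact invocation depends on your project setup; the sketch below is illustrative only, and every name in it (eval_client, run_evaluation, the dataset argument) is an assumption rather than the confirmed API. It simply shows the arguments explained in the sections that follow.

# Illustrative sketch, not the confirmed API: client and function names are assumptions.
from eval_client import run_evaluation  # hypothetical import

results = run_evaluation(
    dataset="my_dataset",                            # dataset to evaluate (assumed argument)
    judge_ids=["anthropic_claude_3_5_haiku_judge"],  # run only these configured judges
    run_id=None,                                     # None: a unique run ID is generated
    save_results=True,                               # persist results to my_project/results/
    results_format="json",                           # assumed format value
    skip_duplicates=True,                            # skip evaluations that already have saved results
    consistency=1,                                   # number of runs per judge per row
)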
Arguments
Control evaluation execution and results handling:
Judge Selection (judge_ids)
Run Identification (run_id)
If a custom run ID is not set, each evaluation gets a unique run ID:
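Judging from the file names under Results Management below, an auto-generated run ID combines a timestamp with a short hex suffix. A minimal sketch of that scheme follows; the exact logic the tool uses internally is an assumption.

import uuid
from datetime import datetime

def make_run_id() -> str:
    # Produces IDs like "run_20250815_110504_15c89e71":
    # a YYYYMMDD_HHMMSS timestamp plus an 8-character hex suffix.
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    return f"run_{timestamp}_{uuid.uuid4().hex[:8]}"

print(make_run_id())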
Results Storage (save_results)
Control whether results are saved to your project directory:
Results Format (results_format)
Specify output format for saved results:
Duplicate Handling (skip_duplicates)
Control re-evaluation of existing results:
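Conceptually, duplicate handling boils down to checking whether results for a given run and judge already exist before evaluating again. The sketch below only illustrates that idea against the file naming shown under Results Management; it is not the tool's actual implementation, and the run- versus row-level granularity of the real check is an assumption.

from pathlib import Path

def has_existing_results(results_dir: Path, run_id: str, judge_id: str) -> bool:
    # Treat the evaluation as a duplicate if a matching *_results.json file exists.
    # The naming pattern is taken from the example files under Results Management.
    return any(results_dir.glob(f"{run_id}_{judge_id}_*_results.json"))

# Usage sketch: skip the judge when skip_duplicates is set and results already exist.
# if skip_duplicates and has_existing_results(Path("my_project/results"), run_id, judge_id):
#     continue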
Consistency Runs (consistency)
Run each judge N times per row and automatically aggregate the results. This reduces the impact of non-deterministic model outputs.
The aggregation strategy depends on the task type:
Classification tasks: outputs are aggregated by majority vote. The label that appears most often across the N runs wins; ties are broken by first occurrence.
Free-form tasks: outputs from all runs are concatenated with labelled run markers, preserving every response in a single string. This lets you inspect the full range of outputs.
Both task types are supported simultaneously — a judge with a mixed schema (some classification, some free-form tasks) will apply the appropriate strategy per task.
Note
With consistency > 1, token counts and call durations in the results are summed across all runs.
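Both strategies can be illustrated in a few lines. This is a standalone sketch of the behaviour described above, not the library's internal code, and the "[run N]" marker format is an assumption.

from collections import Counter

def aggregate_classification(outputs: list[str]) -> str:
    # Majority vote: the most frequent label wins. Counter.most_common orders
    # equal counts by first occurrence, which matches the tie-breaking rule above.
    return Counter(outputs).most_common(1)[0][0]

def aggregate_free_form(outputs: list[str]) -> str:
    # Concatenate every response with a labelled run marker (marker format assumed).
    return "\n".join(f"[run {i + 1}] {text}" for i, text in enumerate(outputs))

print(aggregate_classification(["pass", "fail", "pass"]))  # "pass"
print(aggregate_free_form(["Answer A", "Answer B"]))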
Results Management
Results are saved to your project directory:
my_project/
└── results/
├── run_20250815_110504_15c89e71_anthropic_claude_3_5_haiku_judge_20250815_110521_results.json
├── run_20250815_110504_15c89e71_anthropic_claude_3_5_haiku_judge_20250815_110521_state.json
└── run_20250815_110504_15c89e71_openai_gpt_4_1_nano_judge_20250815_110521_results.json
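To inspect saved results programmatically, you can glob the results directory and load the JSON files. A minimal sketch, assuming the layout above; the internal structure of each results file is not documented here.

import json
from pathlib import Path

results_dir = Path("my_project/results")
run_id = "run_20250815_110504_15c89e71"  # example run ID from the listing above

# Load every *_results.json file written for this run.
for path in sorted(results_dir.glob(f"{run_id}_*_results.json")):
    with path.open() as f:
        data = json.load(f)
    print(path.name, type(data).__name__)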