Running Evaluations
Execute your configured LLM judges to evaluate your dataset.
Quick Setup
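A minimal sketch of a typical run, assuming a hypothetical run_evaluation entry point and package name; substitute the import path, dataset, and judge names your project actually defines:

# Hypothetical import; replace with your project's actual package and entry point.
from evaluation_framework import run_evaluation

# Run every configured judge against the project's dataset and save the results.
results = run_evaluation(
    judge_ids=None,      # None is assumed to mean "run all configured judges"
    save_results=True,   # write results under my_project/results/
)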
Arguments
Control evaluation execution and results handling:
Judge Selection (judge_ids)
Select which of your configured judges run against the dataset:
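For example, judge_ids can be passed as a list of judge names defined in your project; this reuses the hypothetical run_evaluation call from Quick Setup, and the judge names are taken from the result files shown under Results Management:

# Run only the two judges listed; omit judge_ids (or pass None) to run all of them.
run_evaluation(
    judge_ids=[
        "anthropic_claude_3_5_haiku_judge",
        "openai_gpt_4_1_nano_judge",
    ],
)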
Run Identification (run_id)
If a custom run ID is not set, each evaluation run gets a unique, auto-generated run ID:
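The auto-generated IDs follow the pattern visible in the saved result files below (a timestamp plus a short hex suffix); passing an explicit run_id is assumed to override this, as sketched here with the hypothetical run_evaluation call:

# Auto-generated ID, e.g. run_20250815_110504_15c89e71 (timestamp + short hex suffix).
run_evaluation(judge_ids=["anthropic_claude_3_5_haiku_judge"])

# Explicit ID, useful for grouping related evaluations (hypothetical parameter use).
run_evaluation(
    judge_ids=["anthropic_claude_3_5_haiku_judge"],
    run_id="baseline_vs_finetuned_v1",
)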
Results Storage (save_results)
Control whether results are saved to your project directory:
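A sketch of both settings, again using the hypothetical run_evaluation call; with save_results=False the results are assumed to be returned in memory only:

# Persist results to my_project/results/ (assumed default behaviour).
run_evaluation(judge_ids=["openai_gpt_4_1_nano_judge"], save_results=True)

# Keep results in memory only, e.g. for quick experiments.
results = run_evaluation(judge_ids=["openai_gpt_4_1_nano_judge"], save_results=False)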
Results Format (results_format)
Specify output format for saved results:
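The saved files shown under Results Management are JSON, so "json" is the only format value visible here; which other values (if any) results_format accepts depends on your version:

# JSON matches the *_results.json files shown under Results Management.
run_evaluation(judge_ids=["openai_gpt_4_1_nano_judge"], results_format="json")
# Other format values are version-dependent; check your framework's documentation.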
Duplicate Handling (skip_duplicates)
Control re-evaluation of existing results:
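A sketch of both behaviours, assuming skip_duplicates is a boolean on the same hypothetical run_evaluation call:

# Skip items that already have saved results for this judge and run.
run_evaluation(judge_ids=["anthropic_claude_3_5_haiku_judge"], skip_duplicates=True)

# Force re-evaluation even when matching results already exist.
run_evaluation(judge_ids=["anthropic_claude_3_5_haiku_judge"], skip_duplicates=False)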
Results Management
Results are saved to your project directory:
my_project/
└── results/
├── run_20250815_110504_15c89e71_anthropic_claude_3_5_haiku_judge_20250815_110521_results.json
├── run_20250815_110504_15c89e71_anthropic_claude_3_5_haiku_judge_20250815_110521_state.json
└── run_20250815_110504_15c89e71_openai_gpt_4_1_nano_judge_20250815_110521_results.json
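Because the results are plain JSON files, they can be inspected with the standard library alone; a minimal sketch, where the file name is taken from the listing above and the structure of its contents is not documented here:

import json
from pathlib import Path

results_file = Path("my_project/results") / (
    "run_20250815_110504_15c89e71_"
    "anthropic_claude_3_5_haiku_judge_20250815_110521_results.json"
)

with results_file.open() as f:
    results = json.load(f)

# The exact structure depends on your framework version; inspect the top level first.
print(type(results))
if isinstance(results, dict):
    print(list(results.keys()))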