Loading Results
If you used MetaEvaluator to generate judge and human annotation results, great! You may move on to Scoring. Results are automatically loaded when you call `compare_async()`.
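For reference, here is a minimal sketch of that flow, assuming an already-configured `MetaEvaluator` instance named `evaluator` (the name and any extra arguments are illustrative, not prescribed by the library):

```python
import asyncio

async def score_project(evaluator) -> None:
    # `evaluator` is assumed to be a MetaEvaluator whose project directory
    # already contains judge results and human annotations.
    # compare_async() loads those stored results automatically before scoring;
    # any additional arguments your setup requires are omitted here.
    await evaluator.compare_async()

# Usage (illustrative):
# asyncio.run(score_project(evaluator))
```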
External Data Loading
MetaEvaluator supports loading pre-existing judge and human annotation results from external sources. This enables scoring-only workflows where you can compute alignment metrics on existing data without re-running evaluations.
Importing External Judge Results
Use `add_external_judge_results()` to load a single judge's evaluation data from a CSV file.
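For example (a sketch only, assuming an existing `MetaEvaluator` instance `evaluator`; the file name, judge ID, and model details are illustrative values):

```python
# Sketch only: `evaluator` is an existing MetaEvaluator instance; the file
# name, judge_id, and model details below are illustrative values.
evaluator.add_external_judge_results(
    file_path="judge_results.csv",
    judge_id="external_judge_1",
    llm_client="openai",
    model_used="gpt-4",
    # run_id is optional; one is auto-generated if omitted.
)
```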
Arguments:

- `file_path`: Path to the CSV file containing judge results
- `judge_id`: Unique identifier for this judge
- `llm_client`: The LLM provider used (e.g., "openai", "anthropic"). See LiteLLM providers for supported options
- `model_used`: The specific model name used for evaluation (e.g., "gpt-4", "claude-3-sonnet")
- `run_id`: Unique identifier for this evaluation run (optional, will be auto-generated if not provided)
Your CSV file must contain these columns:
- `original_id`: Original identifier from your evaluation data
- Task columns: One column for each task defined in your `EvalTask.task_schemas`
Example judge results CSV:
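A hypothetical example, assuming a single task column named `rejection` in `EvalTask.task_schemas` (your column names and values will differ):

```csv
original_id,rejection
1,reject
2,comply
3,reject
```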
Importing External Annotation Results
Use `add_external_annotation_results()` to load a single annotator's data from a CSV file.
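For example (a sketch only, assuming an existing `MetaEvaluator` instance `evaluator`; the file name and annotator ID are illustrative values):

```python
# Sketch only: `evaluator` is an existing MetaEvaluator instance; the file
# name and annotator_id below are illustrative values.
evaluator.add_external_annotation_results(
    file_path="human_annotations.csv",
    annotator_id="annotator_1",
    # run_id is optional; one is auto-generated if omitted.
)
```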
Arguments:

- `file_path`: Path to the CSV file containing human annotation results
- `annotator_id`: Unique identifier for the annotator(s)
- `run_id`: Unique identifier for this annotation run (optional, will be auto-generated if not provided)
Your CSV file must contain these columns:
- `original_id`: Original identifier from your evaluation data
- Task columns: One column for each task defined in your `EvalTask.task_schemas`
Example human results CSV:
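A hypothetical example, using the same illustrative `rejection` task column as above:

```csv
original_id,rejection
1,reject
2,reject
3,comply
```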
Schema Requirements
Important: The task columns in your external data must match the task schema defined in your `EvalTask`.
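As an illustration (not part of the MetaEvaluator API), a quick check that an external CSV matches a hypothetical `task_schemas` definition before importing it:

```python
import pandas as pd

# Illustrative check (not part of the MetaEvaluator API): verify that an
# external CSV has an "original_id" column plus one column per task in a
# hypothetical task_schemas definition before importing it.
task_schemas = {"rejection": ["reject", "comply"]}  # hypothetical schema

df = pd.read_csv("judge_results.csv")  # illustrative file name
missing = {"original_id", *task_schemas} - set(df.columns)
if missing:
    raise ValueError(f"External CSV is missing required columns: {missing}")
```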
Complete Example
See `examples/rejection/run_scoring_only_async.py` in the GitHub Repository for a complete example that:
- Loads original evaluation data
- Creates mock external judge and human results
- Loads the external results using the methods above
- Runs scoring metrics to compare judge vs human performance
This approach allows you to leverage MetaEvaluator's scoring capabilities on any existing judge/human evaluation data, making it easy to compute alignment metrics without needing to re-run evaluations.
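A compressed sketch of that scoring-only flow, again assuming an already-configured `MetaEvaluator` instance `evaluator` and illustrative file names and identifiers:

```python
import asyncio

async def scoring_only(evaluator) -> None:
    # Load pre-existing judge and human results from external CSV files
    # (file names and identifiers are illustrative).
    evaluator.add_external_judge_results(
        file_path="judge_results.csv",
        judge_id="external_judge_1",
        llm_client="openai",
        model_used="gpt-4",
    )
    evaluator.add_external_annotation_results(
        file_path="human_annotations.csv",
        annotator_id="annotator_1",
    )
    # Compare the external judge results against the human annotations.
    await evaluator.compare_async()

# Usage (illustrative):
# asyncio.run(scoring_only(evaluator))
```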
Advanced: View Results Data Format
More advanced users can load the result files directly for analysis, debugging, or additional operations.
Result Files
Results are stored in your project directory:
my_project/
├── main_state.json # Project configuration
├── data/
│ └── main_state_data.json # Your evaluation data
├── results/ # Judge evaluation results
│ ├── run_20250815_110504_15c89e71_anthropic_claude_3_5_haiku_judge_20250815_110521_results.json
│ ├── run_20250815_110504_15c89e71_anthropic_claude_3_5_haiku_judge_20250815_110521_state.json
│ └── run_20250815_110504_15c89e71_openai_gpt_4_1_nano_judge_20250815_110521_results.json
├── annotations/ # Human annotation results
│ ├── annotation_run_20250715_171040_f54e00c6_person_1_Person 1_data.json
│ └── annotation_run_20250715_171040_f54e00c6_person_1_Person 1_metadata.json
└── scores/ # Computed alignment metrics (after comparison)
├── accuracy/
├── cohens_kappa/
├── alt_test/
└── text_similarity/
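For example, a raw judge results file can be inspected directly with the standard library (a sketch only; the exact JSON structure is internal to MetaEvaluator and may change between versions):

```python
import json
from pathlib import Path

# Sketch only: inspect raw judge result files directly, following the
# project layout shown above.
results_dir = Path("my_project/results")
for path in sorted(results_dir.glob("*_results.json")):
    with path.open() as f:
        payload = json.load(f)
    print(f"{path.name}: top-level keys -> {list(payload)[:10]}")
```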