Scoring and Metrics
Configure and use alignment metrics to compare judge evaluations with human annotations.
Quick Setup
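As a minimal sketch (import paths are omitted because they depend on your installation), setting up metrics amounts to pairing each scorer with the task(s) it should evaluate via `MetricConfig`:

```python
# Minimal sketch: pair each scorer with the task(s) it should evaluate.
# Import MetricConfig and the scorer classes from the package as appropriate
# for your installation; the task name "rejection" is only an example.
metrics = [
    MetricConfig(
        scorer=AccuracyScorer(),
        task_names=["rejection"],
        task_strategy="single",
    ),
    MetricConfig(
        scorer=CohensKappaScorer(),
        task_names=["rejection"],
        task_strategy="single",
    ),
]
```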
Available Scorers
AccuracyScorer
- Purpose: Classification accuracy between judge and human annotations
- Requirements: 1 human annotator minimum
- Output: Accuracy score (0-1)
Sample Results:
CohensKappaScorer
- Purpose: Inter-rater agreement accounting for chance (Cohen's kappa)
- Requirements: 2 human annotators minimum
- Output: Kappa coefficient (-1 to 1)
Kappa Interpretation
| Kappa Range | Interpretation |
|---|---|
| < 0.00 | Poor |
| 0.00-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost Perfect |
Sample Results:
AltTestScorer
- Purpose: Statistical test (Alt-Test) of whether the judge can stand in for a human annotator
- Requirements: 3 human annotators minimum, with at least 30 instances per annotator (for statistical significance)
- Output: Winning rates across different epsilon values, advantage probability, and per-human advantage probabilities
Note
Alt-Test is a leave-one-annotator-out hypothesis test that measures whether an LLM judge agrees with the remaining human consensus at least as well as the left-out human does.
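As a simplified illustration of that idea (leaving out the epsilon margin and the hypothesis-test machinery the real procedure applies on top), the per-annotator advantage probability can be sketched like this:

```python
# Simplified sketch of the leave-one-annotator-out comparison behind the Alt-Test.
# The actual procedure additionally applies an epsilon margin and a statistical test.
from collections import Counter

def majority(labels):
    """Majority label among the given annotations (ties broken arbitrarily)."""
    return Counter(labels).most_common(1)[0][0]

def advantage_probabilities(judge, humans):
    """judge: {instance_id: label}; humans: {annotator_name: {instance_id: label}}."""
    result = {}
    for left_out, annotations in humans.items():
        others = [ann for name, ann in humans.items() if name != left_out]
        wins = []
        for instance_id, human_label in annotations.items():
            other_labels = [ann[instance_id] for ann in others if instance_id in ann]
            if not other_labels:
                continue  # nothing to form a consensus from
            consensus = majority(other_labels)
            judge_hit = judge.get(instance_id) == consensus
            human_hit = human_label == consensus
            # The judge "wins" on an instance when it matches the consensus
            # at least as well as the left-out human does.
            wins.append(judge_hit or not human_hit)
        result[left_out] = sum(wins) / len(wins) if wins else 0.0
    return result
```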
Sample Results:
{
"judge_id": "anthropic_claude_3_5_haiku_judge",
"scorer_name": "alt_test",
"task_strategy": "multilabel",
"task_name": "rejection",
"scores": {
"winning_rate": {
"0.00": 0.0,
"0.05": 0.0,
"0.10": 0.0,
"0.15": 0.0,
"0.20": 0.0,
"0.25": 0.0,
"0.30": 0.0
},
"advantage_probability": 0.9
},
"metadata": {
"human_advantage_probabilities": {
"person_1": [0.9, 1.0],
"person_2": [0.9, 0.8],
"person_3": [0.9, 1.0]
},
"scoring_function": "accuracy",
"epsilon": 0.2,
"multiplicative_epsilon": false,
"min_instances_per_human": 10,
"ground_truth_method": "alt_test_procedure"
}
}
TextSimilarityScorer
- Purpose: String similarity for text responses, using Python's `SequenceMatcher`
- Requirements: 1 human annotator minimum
- Output: Similarity scores (0-1)
Note
SequenceMatcher uses the Ratcliff-Obershelp pattern matching algorithm, which recursively finds the longest contiguous matching subsequences between the two strings and computes the similarity ratio as:
(2 × total_matching_characters) / (len(A) + len(B))
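For reference, the same ratio can be reproduced with Python's built-in difflib (a standalone illustration, not the scorer's internal implementation):

```python
from difflib import SequenceMatcher

judge_text = "The response politely declines the request."
human_text = "The reply politely refuses the request."

# ratio() returns 2*M / (len(a) + len(b)), where M is the total number of matching characters.
similarity = SequenceMatcher(None, judge_text, human_text).ratio()
print(round(similarity, 3))
```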
Sample Results:
SemanticSimilarityScorer
- Purpose: Semantic similarity for text responses using OpenAI embeddings
- Requirements:
    - 1 human annotator minimum
    - OpenAI API key (set the `OPENAI_API_KEY` environment variable)
- Output: Cosine similarity scores (0-1)
Note
Uses OpenAI's text embedding models to compute embeddings and calculate cosine similarity between judge and human text responses. Captures semantic meaning rather than just string matching.
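A rough sketch of that computation is shown below; the embedding model name is an assumption for illustration, not necessarily the one the scorer uses:

```python
# Sketch: cosine similarity between embeddings of the judge's and the human's text.
# The model name below is an assumption for illustration only.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-3-small",
    input=["judge explanation text", "human explanation text"],
)
judge_vec = np.asarray(response.data[0].embedding)
human_vec = np.asarray(response.data[1].embedding)

cosine = float(judge_vec @ human_vec / (np.linalg.norm(judge_vec) * np.linalg.norm(human_vec)))
print(round(cosine, 3))
```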
Sample Results:
{
"judge_id": "claude_judge",
"scorer_name": "semantic_similarity",
"task_strategy": "single",
"task_name": "explanation",
"score": 0.84,
"metadata": {
"mean_similarity": 0.84,
"median_similarity": 0.86,
"std_similarity": 0.08,
"total_comparisons": 100,
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
}
Score Ranges
- Accuracy: 0-1 (higher is better)
- Cohen's Kappa: -1 to 1 (higher is better, accounts for chance)
- Text/Semantic Similarity: 0-1 (higher is better)
- Alt-Test: Winning rates across epsilon values, advantage probabilities (0-1)
Task Configuration Types (`task_strategy`)
The `task_strategy` parameter defines how tasks are processed and must be one of: `"single"`, `"multitask"`, or `"multilabel"`.
Task Count Requirements
"single"
: Use only whentask_names
contains exactly 1 task"multitask"
and"multilabel"
: Use only whentask_names
contains 2 or more tasks
Evaluate one classification task:
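A minimal configuration for a single task might look like this (scorer and task name are illustrative):

```python
MetricConfig(
    scorer=AccuracyScorer(),
    task_names=["rejection"],
    task_strategy="single",
)
```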
Single Task Behavior
- Single score for the specified task
- Required: Exactly 1 task in `task_names` list
- Use this when evaluating one task independently
Apply the same scorer to multiple separate tasks:
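For example (task names are illustrative):

```python
MetricConfig(
    scorer=AccuracyScorer(),
    task_names=["rejection", "safety"],  # the same scorer is applied to each task
    task_strategy="multitask",
)
```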
Multi-Task Behavior
- Required: 2 or more tasks in `task_names` list
- Same scorer applied to each task individually
- Result: Aggregated score across all tasks
- For different scorers on different tasks, create separate `MetricConfig` entries
Treat multiple classification tasks as a single multi-label problem:
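For example (task names are illustrative):

```python
MetricConfig(
    scorer=AltTestScorer(),
    task_names=["hateful", "violent"],  # treated together as one multi-label task
    task_strategy="multilabel",
)
```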
Multi-label Behavior
- Required: 2 or more tasks in `task_names` list
- Each instance can have multiple labels, e.g. `["hateful", "violent"]`, and these are passed to the scorer as a single input.
- How the metric is calculated depends on the scorer: for instance, AltTestScorer compares multi-label sets with Jaccard similarity, while AccuracyScorer requires an exact match.
Example with different scorers per task:
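A sketch of such a setup (task names are illustrative):

```python
metrics = [
    MetricConfig(
        scorer=AccuracyScorer(),
        task_names=["rejection"],
        task_strategy="single",
    ),
    MetricConfig(
        scorer=TextSimilarityScorer(),
        task_names=["explanation"],
        task_strategy="single",
    ),
]
```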
Annotator Aggregation (`annotator_aggregation`)
Control how multiple human annotations are aggregated before comparison with judge results.
Compare judge vs each human separately, then average the scores:
MetricConfig(
scorer=AccuracyScorer(),
task_names=["rejection"],
task_strategy="single",
annotator_aggregation="individual_average" # Default
)
How it works:
- Judge vs Human 1: Calculate metric score
- Judge vs Human 2: Calculate metric score
- Judge vs Human 3: Calculate metric score
- Final Score: Average of all individual scores
Aggregate humans first using majority vote, then compare judge vs consensus:
MetricConfig(
scorer=AccuracyScorer(),
task_names=["rejection"],
task_strategy="single",
annotator_aggregation="majority_vote"
)
How it works:
- Find consensus among human annotators first
- Compare judge predictions with the consensus
- Implementation varies by metric type (see the support table below)
Metric Support Table
| Metric | individual_average | majority_vote |
|---|---|---|
| AccuracyScorer | :material-check: | :material-check: Per-position majority for multilabel, alphabetical tie-breaking |
| TextSimilarityScorer | :material-check: | :material-check: Best-match approach: highest-similarity human per sample |
| SemanticSimilarityScorer | :material-check: | :material-check: Best-match approach: highest-similarity human per sample |
| CohensKappaScorer | :material-check: | :material-close: Logs warning, falls back to individual_average (agreement metric) |
| AltTestScorer | :material-check: | :material-close: Logs warning, falls back to individual_average (agreement metric) |
Why agreement metrics don't support majority_vote: Inter-annotator agreement metrics measure disagreement between individual humans. Majority vote eliminates this disagreement information, making the metrics less meaningful.
Custom Scorer
You may implement your own evaluation metrics.
Here is a concrete example of a custom metric that counts how often the judge's response contains more "A" characters than the human's.
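The sketch below assumes a `BaseScorer` base class and the `BaseScoringResult` field names shown in the comments; check the package source for the exact signatures (the required methods are listed under Custom Scorer Requirements below):

```python
# Minimal sketch of a custom scorer that checks whether the judge's response
# contains more "A" characters than the human's. Names other than the required
# methods documented below are assumptions.

class MoreAsScorer(BaseScorer):  # BaseScorer: assumed name of the scorer base class
    def __init__(self):
        super().__init__(scorer_name="more_as")

    def can_score_task(self, sample_label):
        # Only free-text labels make sense for a character-count comparison.
        return isinstance(sample_label, str)

    async def compute_score_async(self, judge_data, human_data, task_name, judge_id, aggregation_mode):
        # judge_data / human_data are assumed here to be aligned lists of text
        # responses; aggregation_mode handling is omitted for brevity.
        wins = [
            judge_text.count("A") > human_text.count("A")
            for judge_text, human_text in zip(judge_data, human_data)
        ]
        score = sum(wins) / len(wins) if wins else 0.0
        return BaseScoringResult(  # field names here are assumptions
            judge_id=judge_id,
            scorer_name=self.scorer_name,
            task_name=task_name,
            score=score,
            metadata={"total_comparisons": len(wins)},
        )
```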
Custom Scorer Requirements
Required methods to implement:
- `can_score_task(sample_label)` - Check if the scorer can handle the data type
- `compute_score_async(judge_data, human_data, task_name, judge_id, aggregation_mode)` - Core scoring logic
- `aggregate_results(results, scores_dir, unique_name)` - Optional visualization method
Guidelines:
- Call `super().__init__(scorer_name="your_scorer_name")` in the constructor
- Return `BaseScoringResult` from `compute_score_async()`
- Handle edge cases (empty data, mismatched IDs, etc.)
- Add meaningful metadata for debugging and transparency
Results Output
Individual Metric Results
Detailed scores and charts are saved to individual metric directories in your project:
my_project/
└── scores/
├── accuracy/
│ └── accuracy_1tasks_22e76eaf_rejection_accuracy/
│ ├── accuracy_scores.png
│ ├── gpt_4_judge_result.json
│ └── claude_judge_result.json
├── cohens_kappa/
│ └── cohens_kappa_1tasks_22e76eaf_rejection_agreement/
│ ├── cohens_kappa_scores.png
│ ├── gpt_4_judge_result.json
│ └── claude_judge_result.json
├── alt_test/
│ └── alt_test_3tasks_22e76eaf_safety_significance/
│ ├── aggregate_advantage_probabilities.png
│ ├── aggregate_human_vs_llm_advantage.png
│ ├── aggregate_winning_rates.png
│ ├── gpt_4_judge_result.json
│ └── claude_judge_result.json
└── text_similarity/
└── text_similarity_1tasks_74d08617_explanation_quality/
├── text_similarity_scores.png
├── gpt_4_judge_result.json
└── claude_judge_result.json
Summary Reports
You can generate summary reports that aggregate all metrics across all judges in a single view.
Summary reports are saved to the scores directory:
my_project/
└── scores/
├── score_report.html # Interactive HTML table with best score highlighting
├── score_report.csv # CSV format for analysis/Excel
├── accuracy/ # Detailed accuracy results...
├── cohens_kappa/ # Detailed kappa results...
├── alt_test/ # Detailed alt-test results...
└── text_similarity/ # Detailed similarity results...
Console Output:
Score Report:
┌─────────┬─────────────────────┬─────────────────────┬─────────────────────┬─────────────────────┐
│ judge_id ┆ accuracy_1tasks_22e ┆ alt_test_1tasks_22e ┆ alt_test_1tasks_22e ┆ text_similarity_1ta │
│ --- ┆ 76eaf_single ┆ 76eaf_single_winni… ┆ 76eaf_single_advant ┆ sks_74d0861_single │
│ str ┆ --- ┆ --- ┆ --- ┆ --- │
│ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════════╪═════════════════════╪═════════════════════╪═════════════════════╪═════════════════════╡
│ judge1 ┆ 0.87 ┆ 0.6 ┆ 0.75 ┆ 0.85 │
│ judge2 ┆ 0.82 ┆ 0.4 ┆ 0.65 ┆ 0.78 │
└─────────┴─────────────────────┴─────────────────────┴─────────────────────┴─────────────────────┘