Scoring and Metrics
Configure and use alignment metrics to compare judge evaluations with human annotations.
Quick Setup
Available Scorers
ClassificationScorer supports 4 different metrics: accuracy, F1, precision, and recall. Select the metric at initialization.
- Purpose: Classification metrics between judge and human annotations
- Requirements: 1 human annotator minimum
- Output: Metric score (0-1)
- Parameters:
  - metric: "accuracy", "f1", "precision", or "recall"; default="accuracy"
  - pos_label: For F1/precision/recall, the label to treat as positive in binary classification; default=1. See the sklearn.metrics documentation.
  - average: For F1/precision/recall, the averaging strategy; default="binary". See the sklearn.metrics documentation.
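For intuition about what the four metric choices compute, here is a plain-Python sketch over made-up judge/human label lists (see the sklearn.metrics documentation for the canonical implementations; the variable names are illustrative):

```python
# Plain-Python illustration of the four metric choices for binary labels,
# with pos_label marking the positive class.
def classification_metrics(judge, human, pos_label=1):
    tp = sum(1 for j, h in zip(judge, human) if j == pos_label and h == pos_label)
    fp = sum(1 for j, h in zip(judge, human) if j == pos_label and h != pos_label)
    fn = sum(1 for j, h in zip(judge, human) if j != pos_label and h == pos_label)
    accuracy = sum(1 for j, h in zip(judge, human) if j == h) / len(human)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

human = [1, 1, 0, 0, 1]   # hypothetical human annotations
judge = [1, 0, 0, 1, 1]   # hypothetical judge predictions
scores = classification_metrics(judge, human)
```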
Sample Results:
- Purpose: Inter-rater agreement accounting for chance
- Requirements: 2 human annotators minimum
- Output: Kappa coefficient (-1 to 1)
Kappa Interpretation
| Kappa Range | Interpretation |
|---|---|
| < 0.00 | Poor |
| 0.00-0.20 | Slight |
| 0.21-0.40 | Fair |
| 0.41-0.60 | Moderate |
| 0.61-0.80 | Substantial |
| 0.81-1.00 | Almost Perfect |
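For intuition, the kappa coefficient can be computed from two annotators' labels in a few lines of plain Python (a sketch; the annotator data is hypothetical):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Agreement between two annotators, corrected for chance agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n            # observed agreement
    ca, cb = Counter(a), Counter(b)
    labels = set(a) | set(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in labels)   # chance agreement
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0

ann1 = [1, 1, 0, 1, 0, 0, 1, 0]
ann2 = [1, 0, 0, 1, 0, 1, 1, 0]
kappa = cohens_kappa(ann1, ann2)   # 0.75 observed vs 0.5 chance -> 0.5 ("Moderate")
```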
Sample Results:
- Purpose: Alt-Test
- Requirements: 3 human annotators minimum, and at least 30 instances per human (for statistical significance)
- Output: Winning Rates across different epsilon values, Advantage Probability, and Human Advantage Probabilities.
Note
Alt-Test is a leave-one-annotator-out hypothesis test that measures whether an LLM judge agrees with the remaining human consensus at least as well as the left-out human does.
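As a rough illustration of the leave-one-annotator-out idea, here is a simplified sketch that computes per-annotator advantage probabilities with accuracy scoring and an additive epsilon handicap. This is not the library's full statistical procedure, and all data is hypothetical:

```python
def advantage_probabilities(llm, humans, epsilon=0.2):
    """For each human h (left out in turn): the fraction of instances where
    the LLM matches the remaining annotators at least as well as h does,
    after granting the human an additive epsilon handicap."""
    probs = []
    for k, h in enumerate(humans):
        rest = [r for i, r in enumerate(humans) if i != k]
        wins = 0
        for i, pred in enumerate(llm):
            s_llm = sum(r[i] == pred for r in rest) / len(rest)   # LLM vs rest
            s_hum = sum(r[i] == h[i] for r in rest) / len(rest)   # left-out human vs rest
            wins += s_llm >= s_hum - epsilon
        probs.append(wins / len(llm))
    return probs

humans = [[1, 0, 1, 1, 0],
          [1, 0, 1, 0, 0],
          [1, 1, 1, 1, 0]]
llm = [1, 0, 1, 0, 1]
probs = advantage_probabilities(llm, humans)
```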
Sample Results:
{
"judge_id": "anthropic_claude_3_5_haiku_judge",
"scorer_name": "alt_test",
"task_strategy": "multilabel",
"task_name": "rejection",
"scores": {
"winning_rate": {
"0.00": 0.0,
"0.05": 0.0,
"0.10": 0.0,
"0.15": 0.0,
"0.20": 0.0,
"0.25": 0.0,
"0.30": 0.0
},
"advantage_probability": 0.9
},
"metadata": {
"human_advantage_probabilities": {
"person_1": [0.9, 1.0],
"person_2": [0.9, 0.8],
"person_3": [0.9, 1.0]
},
"scoring_function": "accuracy",
"epsilon": 0.2,
"multiplicative_epsilon": false,
"min_instances_per_human": 10,
"ground_truth_method": "alt_test_procedure"
}
}
- Purpose: String similarity for text responses. Uses SequenceMatcher.
- Requirements: 1 human annotator minimum
- Output: Similarity scores (0-1)
Note
SequenceMatcher uses the Ratcliff-Obershelp pattern-matching algorithm, which recursively finds the longest contiguous matching subsequence between the two sequences and calculates the similarity ratio using the formula:
(2 × total_matching_characters) / (len(A) + len(B))
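This is the ratio exposed by Python's difflib; for example:

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
sm = SequenceMatcher(None, a, b)
matches = sum(block.size for block in sm.get_matching_blocks())  # "bcd" -> 3
ratio = sm.ratio()   # (2 * 3) / (4 + 4) = 0.75
```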
Sample Results:
- Purpose: Semantic similarity for text responses using OpenAI embeddings.
- Requirements:
- 1 human annotator minimum
- OpenAI API key (set the OPENAI_API_KEY environment variable)
- Output: Cosine similarity scores (0-1)
Note
Uses OpenAI's text embedding models to compute embeddings and calculate cosine similarity between judge and human text responses. Captures semantic meaning rather than just string matching.
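The cosine-similarity step can be sketched in plain Python. The embedding vectors below are made up for illustration; in practice they come from the embedding model's API:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical low-dimensional embeddings of a judge response and a human response.
judge_vec = [0.1, 0.3, 0.5]
human_vec = [0.2, 0.3, 0.4]
sim = cosine_similarity(judge_vec, human_vec)
```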
Sample Results:
{
"judge_id": "claude_judge",
"scorer_name": "semantic_similarity",
"task_strategy": "single",
"task_name": "explanation",
"score": 0.84,
"metadata": {
"mean_similarity": 0.84,
"median_similarity": 0.86,
"std_similarity": 0.08,
"total_comparisons": 100,
"embedding_model": "sentence-transformers/all-MiniLM-L6-v2"
}
}
Score Ranges
- Classification: 0-1 (higher is better)
- Cohen's Kappa: -1 to 1 (higher is better, accounts for chance)
- Text/Semantic Similarity: 0-1 (higher is better)
- Alt-Test: Winning rates across epsilon values, advantage probabilities (0-1)
Arguments
Task Configuration Types (task_strategy)
The task_strategy parameter defines how tasks are processed and must be one of: "single", "multitask", or "multilabel".
Task Count Requirements
- "single": Use only when task_names contains exactly 1 task
- "multitask" and "multilabel": Use only when task_names contains 2 or more tasks
Evaluate one classification task:
Single Task Behavior
- Single score for the specified task
- Required: Exactly 1 task in task_names list
- Use this when evaluating one task independently
Apply the same scorer to multiple separate tasks:
Multi-Task Behavior
- Required: 2 or more tasks in task_names list
- Same scorer applied to each task individually
- Result: Aggregated score across all tasks
- For different scorers on different tasks, create separate MetricConfig entries
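A sketch of a multitask configuration, following the MetricConfig shape used elsewhere on this page (the task names are illustrative):

```python
# One scorer applied to each listed task separately, then aggregated.
MetricConfig(
    scorer=ClassificationScorer(metric="accuracy"),
    task_names=["rejection", "toxicity"],   # 2 or more tasks required
    task_strategy="multitask",
)
```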
Treat multiple classification tasks as a single multi-label problem:
Multi-label Behavior
- Required: 2 or more tasks in task_names list
- Each instance can have multiple labels (e.g., ["hateful", "violent"]), and these are passed as one input to the Scorer.
- Metric calculation depends on the Scorer: AltTestScorer uses Jaccard similarity for comparison, while ClassificationScorer with metric="accuracy" uses exact match.
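The two multilabel comparison rules mentioned above can be sketched as follows (label values are hypothetical):

```python
def jaccard(judge_labels, human_labels):
    """Per-instance Jaccard similarity on label sets (as described for AltTestScorer)."""
    a, b = set(judge_labels), set(human_labels)
    return len(a & b) / len(a | b) if (a | b) else 1.0

judge = ["hateful", "violent"]
human = ["hateful"]
jac = jaccard(judge, human)        # 1 shared label / 2 total labels -> 0.5
exact = set(judge) == set(human)   # exact match (accuracy-style) -> False
```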
Example with different scorers per task:
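A hedged sketch of per-task scorer configuration, mirroring the MetricConfig examples below (the CohensKappaScorer constructor arguments are assumed, and task names are illustrative):

```python
# Hypothetical: separate MetricConfig entries assign a different scorer to each task.
configs = [
    MetricConfig(
        scorer=ClassificationScorer(metric="f1"),
        task_names=["rejection"],
        task_strategy="single",
    ),
    MetricConfig(
        scorer=CohensKappaScorer(),
        task_names=["toxicity"],
        task_strategy="single",
    ),
]
```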
Annotator Aggregation (annotator_aggregation)
Control how multiple human annotations are aggregated before comparison with judge results.
Compare judge vs each human separately, then average the scores:
MetricConfig(
scorer=ClassificationScorer(metric="accuracy"),
task_names=["rejection"],
task_strategy="single",
annotator_aggregation="individual_average" # Default
)
How it works: - Judge vs Human 1: Calculate metric score - Judge vs Human 2: Calculate metric score - Judge vs Human 3: Calculate metric score - Final Score: Average of all individual scores
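The steps above can be sketched as (a simplified illustration with accuracy as the metric; all data is hypothetical):

```python
def individual_average(judge, humans, metric):
    """Score judge against each human separately, then average the scores."""
    scores = [metric(judge, h) for h in humans]
    return sum(scores) / len(scores)

accuracy = lambda a, b: sum(x == y for x, y in zip(a, b)) / len(a)

judge = [1, 0, 1, 1]
humans = [[1, 0, 1, 0],   # judge vs this human: 0.75
          [1, 1, 1, 1],   # 0.75
          [1, 0, 0, 1]]   # 0.75
score = individual_average(judge, humans, accuracy)   # mean of per-human scores
```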
Aggregate humans first using majority vote, then compare judge vs consensus:
MetricConfig(
scorer=ClassificationScorer(metric="accuracy"),
task_names=["rejection"],
task_strategy="single",
annotator_aggregation="majority_vote"
)
How it works: - Find consensus among human annotators first - Compare judge predictions with consensus - Implementation varies by metric type (see support table below)
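A sketch of the consensus step, using the alphabetical tie-breaking the support table below describes for ClassificationScorer (labels are hypothetical):

```python
from collections import Counter

def majority_vote(labels):
    """Most common label for one instance; ties broken by the smallest
    (alphabetically first) label."""
    counts = Counter(labels)
    top = max(counts.values())
    return min(l for l, c in counts.items() if c == top)

annotations = [["safe", "unsafe", "safe"],       # clear majority -> "safe"
               ["unsafe", "safe", "borderline"]] # three-way tie -> "borderline"
consensus = [majority_vote(inst) for inst in annotations]
```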
Custom Column Names (display_name)
Override auto-generated column names (e.g., classification_accuracy_1tasks_5d6db9a1_single) with custom names:
Metric Support Table
| Metric | individual_average | majority_vote |
|---|---|---|
| ClassificationScorer | Supported | Per-position majority for multilabel, alphabetical tie-breaking |
| TextSimilarityScorer | Supported | Best-match approach: highest similarity human per sample |
| SemanticSimilarityScorer | Supported | Best-match approach: highest similarity human per sample |
| CohensKappaScorer | Supported | Logs warning, falls back to individual_average (agreement metric) |
| AltTestScorer | Supported | Logs warning, falls back to individual_average (agreement metric) |
Why agreement metrics don't support majority_vote: Inter-annotator agreement metrics measure disagreement between individual humans. Majority vote eliminates this disagreement information, making the metrics less meaningful.
Results Output
Individual Metric Results
Detailed scores and charts are saved to individual metric directories in your project:
my_project/
└── scores/
├── classification_accuracy/
│ └── classification_accuracy_1tasks_22e76eaf_rejection_accuracy/
│ ├── accuracy_scores.png
│ ├── judge_1_result.json
│ └── judge_2_result.json
├── cohens_kappa/
│ └── cohens_kappa_1tasks_22e76eaf_rejection_agreement/
│ ├── cohens_kappa_scores.png
│ ├── judge_1_result.json
│ └── judge_2_result.json
├── alt_test/
│ └── alt_test_3tasks_22e76eaf_safety_significance/
│ ├── aggregate_advantage_probabilities.png
│ ├── aggregate_human_vs_llm_advantage.png
│ ├── aggregate_winning_rates.png
│ ├── judge_1_result.json
│ └── judge_2_result.json
├── text_similarity/
│ └── text_similarity_1tasks_74d08617_explanation_quality/
│ ├── text_similarity_scores.png
│ ├── judge_1_result.json
│ └── judge_2_result.json
├── score_report.csv
└── score_report.html
Chart Styling
Shared plots use default matplotlib styling. You can customize the chart theme with Matplotlib's style sheets and rcParams.
Sample Charts:
- accuracy_scores.png (ClassificationScorer)
- aggregate_winning_rates.png (AltTestScorer)
Summary Reports
You can generate summary reports that aggregate all metrics across all judges in a single view.
Summary reports are saved to the scores directory:
my_project/
└── scores/
├── score_report.html # Interactive HTML table with best score highlighting
├── score_report.csv # CSV format for analysis/Excel
├── classification_accuracy/ # Detailed accuracy results...
├── cohens_kappa/ # Detailed kappa results...
├── alt_test/ # Detailed alt-test results...
└── text_similarity/ # Detailed similarity results...
Sample Console Output:

Sample HTML Report:

Custom Scorer
You may implement your own evaluation metrics.
Here is a concrete example of a custom metric that counts how often the judge's response contains more "A"s than the human's.
Custom Scorer Requirements
Required methods to implement:
- can_score_task(sample_label) - Check whether the scorer can handle the data type
- compute_score_async(judge_data, human_data, task_name, judge_id, aggregation_mode) - Core scoring logic
- aggregate_results(results, scores_dir, unique_name) - Optional visualization method
Guidelines:
- Call super().__init__(scorer_name="your_scorer_name") in the constructor
- Return BaseScoringResult from compute_score_async()
- Handle edge cases (empty data, mismatched IDs, etc.)
- Add meaningful metadata for debugging and transparency
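A compressed sketch of the "more A's" scorer following the requirements and guidelines above. BaseScorer and BaseScoringResult are hypothetical stand-ins here so the sketch runs standalone; the library's real base classes, fields, and signatures may differ:

```python
import asyncio
from dataclasses import dataclass, field

# Hypothetical stand-ins for the library's base classes.
@dataclass
class BaseScoringResult:
    scorer_name: str
    judge_id: str
    task_name: str
    score: float
    metadata: dict = field(default_factory=dict)

class BaseScorer:
    def __init__(self, scorer_name):
        self.scorer_name = scorer_name

class MoreAsScorer(BaseScorer):
    """Fraction of samples where the judge's response has more 'A's than the human's."""

    def __init__(self):
        super().__init__(scorer_name="more_as")   # per the guidelines above

    def can_score_task(self, sample_label):
        return isinstance(sample_label, str)      # free-text responses only

    async def compute_score_async(self, judge_data, human_data,
                                  task_name, judge_id, aggregation_mode=None):
        # Edge cases: empty or misaligned data.
        if not judge_data or len(judge_data) != len(human_data):
            raise ValueError("judge and human data must be non-empty and aligned")
        wins = sum(j.lower().count("a") > h.lower().count("a")
                   for j, h in zip(judge_data, human_data))
        return BaseScoringResult(
            scorer_name=self.scorer_name,
            judge_id=judge_id,
            task_name=task_name,
            score=wins / len(judge_data),
            metadata={"total_comparisons": len(judge_data)},  # for transparency
        )

scorer = MoreAsScorer()
result = asyncio.run(scorer.compute_score_async(
    ["A banana", "apple"], ["pear", "Avocado"], "explanation", "demo_judge"))
```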

