# MetaEvaluator
A comprehensive Python framework for evaluating LLM-as-a-Judge systems by comparing judge outputs with human annotations and calculating alignment metrics.
## What is MetaEvaluator?
MetaEvaluator helps you answer the critical question: "How well do LLM judges align with human judgment?"
Given an evaluation task and dataset, MetaEvaluator:
- Runs multiple LLM judges across different providers (OpenAI, Anthropic, Google, AWS, etc.)
- Collects human annotations through a built-in web interface
- Calculates alignment metrics (Accuracy, Cohen's Kappa, Alt-Test, Text Similarity)
- Provides detailed comparison and analysis
## When to Use MetaEvaluator?
- Evaluating the quality of your own LLM-as-a-judge setup
- Researching LLM evaluation capabilities, e.g. comparing performance across different LLMs and system prompts
## Key Features
### 1. Easy LLM Judge Processing
- LiteLLM Integration: Support for 100+ LLM providers through a unified API (see the sketch below)
- Flexible Output Parsing: Structured outputs, Instructor, and XML parsing for automatic extraction of JSON responses
- YAML Configuration: Load multiple judges from simple YAML configurations
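As a rough, standalone illustration of the unified-API idea (not MetaEvaluator's own interface), the sketch below sends the same judging prompt to two providers through LiteLLM and parses the JSON verdicts. The model names and prompt are placeholder assumptions, and the relevant API keys (e.g. `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`) must be set in your environment.

```python
# Minimal LiteLLM sketch (not MetaEvaluator's API): the same judging prompt goes to
# two providers through one function call; model names and prompt are placeholders.
import json

import litellm

JUDGE_PROMPT = (
    "You are a strict grader. Decide whether the answer below is correct.\n"
    'Reply only with JSON of the form {"verdict": "pass" | "fail", "reason": "..."}.\n\n'
    "Question: What is the capital of France?\n"
    "Answer: Paris"
)

for model in ["gpt-4o-mini", "anthropic/claude-3-5-haiku-20241022"]:
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT}],
    )
    # MetaEvaluator automates this parsing step (structured outputs / Instructor / XML);
    # here we simply parse the prompted JSON ourselves.
    verdict = json.loads(response.choices[0].message.content)
    print(model, "->", verdict)
```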
### 2. Built-in Human Annotation Platform
- Streamlit Interface: Clean, intuitive annotation workflow (sketched below)
- Multi-annotator Support: Separate sessions with progress tracking
- Resume Capability: Continue annotation sessions across multiple visits
- Export Options: JSON format for analysis and sharing
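The annotation platform ships with the package. The toy Streamlit script below is not MetaEvaluator's built-in interface; it only sketches the shape of such an annotation loop (progress tracking, resumable state, JSON export) using hard-coded example items.

```python
# Toy annotation loop in Streamlit (not MetaEvaluator's built-in interface).
# Run with: streamlit run annotate_demo.py
import json

import streamlit as st

ITEMS = [  # placeholder items; a real workflow would load these from your dataset
    {"id": 1, "question": "What is 2 + 2?", "answer": "4"},
    {"id": 2, "question": "What is the capital of France?", "answer": "Lyon"},
]

# Session state persists across reruns, which is what makes resuming possible.
if "idx" not in st.session_state:
    st.session_state.idx = 0
    st.session_state.labels = []

st.title("Annotation demo")
st.progress(st.session_state.idx / len(ITEMS))

if st.session_state.idx < len(ITEMS):
    item = ITEMS[st.session_state.idx]
    st.write(f"**Q:** {item['question']}")
    st.write(f"**A:** {item['answer']}")
    verdict = st.radio("Is the answer correct?", ["pass", "fail"], key=f"verdict_{item['id']}")
    if st.button("Submit"):
        st.session_state.labels.append({"id": item["id"], "label": verdict})
        st.session_state.idx += 1
        st.rerun()
else:
    st.success("All items annotated.")
    st.download_button(
        "Download annotations (JSON)",
        data=json.dumps(st.session_state.labels, indent=2),
        file_name="annotations.json",
    )
```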
### 3. Comprehensive Alignment Metrics
- Classification Metrics: Accuracy and Cohen's Kappa for agreement analysis (example below)
- Statistical Testing: Alt-Test for assessing whether an LLM judge can reliably stand in for human annotators
- Text Similarity: Semantic similarity for free-form responses
- Custom Metrics: Extensible framework for your own evaluation methods
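As a standalone illustration of the classification metrics themselves (computed here with scikit-learn rather than through MetaEvaluator's API), toy judge verdicts are compared against human labels:

```python
# Agreement metrics on toy labels, using scikit-learn directly.
from sklearn.metrics import accuracy_score, cohen_kappa_score

human = ["pass", "fail", "pass", "pass", "fail", "pass"]   # human annotations
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]   # one judge's verdicts

print("Accuracy:     ", accuracy_score(human, judge))      # raw agreement rate
print("Cohen's kappa:", cohen_kappa_score(human, judge))   # agreement corrected for chance
```

Cohen's Kappa complements raw accuracy when label distributions are skewed, since it discounts the agreement that would be expected by chance.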
Ready to start evaluating your LLM judges? Head to the Tutorial!