The EvalTask is the central configuration that defines what to evaluate and how to parse responses. It's the most important component to configure correctly as it determines the structure of your entire evaluation.
Overview
EvalTask supports two main evaluation scenarios:
Judge LLM Responses: Evaluate responses from another LLM (prompt + response evaluation)
Judge Text Content: Evaluate arbitrary text content (response-only evaluation)
    from meta_evaluator.eval_task import EvalTask

    # Evaluate chatbot responses for safety and helpfulness
    task = EvalTask(
        task_schemas={
            "safety": ["safe", "unsafe"],               # Classification task
            "helpfulness": ["helpful", "not_helpful"],  # Classification task
            "explanation": None                         # Free-form text
        },
        prompt_columns=["user_prompt"],         # Original prompt to the LLM
        response_columns=["chatbot_response"],  # LLM response to evaluate
        answering_method="structured",          # Use JSON output parsing
        structured_outputs_fallback=True        # Fallback to XML if needed
    )
    # Evaluate text summaries for quality
    task = EvalTask(
        task_schemas={
            "accuracy": ["accurate", "inaccurate"],
            "coherence": ["coherent", "incoherent"],
            "summary": None  # Free-form numeric or text score
        },
        prompt_columns=None,                # No prompt context needed
        response_columns=["summary_text"],  # Just evaluate the summary
        answering_method="structured"
    )
Arguments
Define columns (prompt_columns and response_columns)
Control which columns judges see during evaluation:
    # Scenario 1: Judge sees both prompt and response
    prompt_columns=["user_input", "system_instruction"]  # Context
    response_columns=["llm_output"]                      # What to evaluate

    # Scenario 2: Judge sees only the content to evaluate
    prompt_columns=None                    # No context
    response_columns=["text_to_evaluate"]  # Direct evaluation
Template Variable System:
MetaEvaluator uses a template-based system where your prompt.md files can include placeholders like {column_name} that get automatically replaced with actual data. The available variables correspond to your prompt_columns and response_columns.
CSV Data:
user_input,system_instruction,llm_output
"What is 2+2?","Be helpful","The answer is 4"
Prompt template (prompt.md):
## Instructions:
Evaluate the LLM response for helpfulness.

## Context:
User Input: {user_input}
System Instruction: {system_instruction}

## Response to Evaluate:
{llm_output}
Formatted prompt given to Judge:
## Instructions:
Evaluate the LLM response for helpfulness.
## Context:
User Input: What is 2+2?
System Instruction: Be helpful
## Response to Evaluate:
The answer is 4
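Conceptually, the substitution works like per-row string formatting. The sketch below is illustrative only (it is not MetaEvaluator's actual implementation or API) and shows how the {column_name} placeholders map onto one CSV row:

    # Illustrative sketch: approximates how template placeholders are filled per row.
    template = (
        "## Instructions:\n"
        "Evaluate the LLM response for helpfulness.\n\n"
        "## Context:\n"
        "User Input: {user_input}\n"
        "System Instruction: {system_instruction}\n\n"
        "## Response to Evaluate:\n"
        "{llm_output}"
    )

    row = {
        "user_input": "What is 2+2?",
        "system_instruction": "Be helpful",
        "llm_output": "The answer is 4",
    }

    # Each prompt_columns/response_columns value becomes a template variable.
    print(template.format(**row))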
CSV Data (response-only evaluation):
text_to_evaluate
"This is a summary of the research paper findings."
Required tasks (required_tasks)
By default, classification tasks (those defined with a list of allowed labels) are required, while free-form tasks (defined as None) are not:

    # Default behavior example
    task = EvalTask(
        task_schemas={
            "safety": ["safe", "unsafe"],               # Required by default
            "helpfulness": ["helpful", "not_helpful"],  # Required by default
            "explanation": None,                        # NOT required by default (free-form)
            "notes": None                               # NOT required by default (free-form)
        },
        # required_tasks not specified - uses default behavior
        prompt_columns=["user_prompt"],
        response_columns=["chatbot_response"],
        answering_method="structured"
    )
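To change which tasks are required, pass required_tasks explicitly. The sketch below assumes required_tasks accepts a list of task-schema names; verify the exact format against the EvalTask API reference:

    # Assumption: required_tasks takes a list of task names; check the API reference.
    task = EvalTask(
        task_schemas={
            "safety": ["safe", "unsafe"],
            "explanation": None
        },
        required_tasks=["safety", "explanation"],  # "explanation" is now required too
        prompt_columns=["user_prompt"],
        response_columns=["chatbot_response"],
        answering_method="structured"
    )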
The answering_method argument controls how judge responses are parsed (structured, instructor, or xml). For example, to use instructor-based parsing:

    answering_method="instructor"
    structured_outputs_fallback=True  # Fallback to other methods if unsupported
Pros: Good compatibility, structured validation
Cons: Additional dependency, model-specific implementation
Best for: When you need structured outputs with older models
Fallback sequence (when structured_outputs_fallback=True): instructor → structured → xml
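Putting this together, here is a sketch of an EvalTask that prefers instructor-based parsing and falls back when the model does not support it (the schema and columns are reused from the earlier examples):

    # Prefers instructor; falls back to structured, then xml, if unsupported.
    task = EvalTask(
        task_schemas={
            "safety": ["safe", "unsafe"],
            "explanation": None
        },
        prompt_columns=["user_prompt"],
        response_columns=["chatbot_response"],
        answering_method="instructor",
        structured_outputs_fallback=True
    )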
Skip Function (skip_function)
When you reload a saved project, the skip function resets to its default and must be reassigned:

    # Load existing project
    evaluator = MetaEvaluator(project_dir="my_project", load=True)

    # Reassign the skip function after loading
    evaluator.eval_task.skip_function = skip_empty_responses
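For reference, skip_empty_responses above could be defined along these lines. The exact signature and return contract of a skip function are defined by the library; this sketch assumes it receives one data row and returns True when the row should be skipped, so treat it as illustrative only:

    # Assumption: a skip function takes a row (dict-like) and returns True to skip it.
    # Check the skip_function documentation for the actual contract.
    def skip_empty_responses(row):
        response = row.get("chatbot_response", "")
        return not response or not response.strip()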
Annotation Prompt for Human Interface (annotation_prompt)
The annotation_prompt text supplies the instructions shown to human annotators in the annotation interface:
    moderation_task = EvalTask(
        task_schemas={
            "toxicity": ["toxic", "borderline", "safe"],
            "harassment": ["harassment", "no_harassment"],
            "violence": ["violent", "non_violent"],
            "explanation": None
        },
        prompt_columns=["user_message"],        # Original user input
        response_columns=["content_to_check"],  # Content that might violate policy
        answering_method="structured",
        structured_outputs_fallback=True,
        annotation_prompt="Evaluate this content for policy violations."
    )
    research_task = EvalTask(
        task_schemas={
            "methodology_quality": ["excellent", "good", "fair", "poor"],
            "novelty": ["highly_novel", "somewhat_novel", "incremental", "not_novel"],
            "clarity": ["very_clear", "clear", "unclear", "very_unclear"],
            "detailed_feedback": None,
        },
        prompt_columns=None,  # No prompt needed
        response_columns=["paper_abstract", "methodology_section"],
        answering_method="structured",
        structured_outputs_fallback=True,
        annotation_prompt="Please evaluate this research paper on methodology, novelty, and clarity. Provide detailed feedback."
    )