Creating an Evaluation Set

An evaluation set is the collection of test inputs you'll use to evaluate your target. Kaleidoscope generates diverse, realistic inputs by first creating personas and then generating inputs from each persona's perspective.

Step 1: Create Personas

Personas are realistic user archetypes representing the different types of people who interact with your application.

| Field | Description |
| --- | --- |
| Title | A short label (e.g. "Elderly patient seeking medication info") |
| Info | Background context about this user type |
| Style | How they communicate (formal, casual, terse, detailed) |
| Use case | What they typically ask about |

There are three ways to create personas:

  • Generate with AI - Kaleidoscope uses your target's purpose, target users, and knowledge base to generate relevant personas with an LLM. Choose how many to create.
  • Random Personas - Sample from the Nemotron Personas dataset, a large collection of realistic user profiles. Useful for quickly bootstrapping diverse persona sets.
  • Add Manually - Define personas yourself when you have specific user archetypes in mind.

Changing the Nemotron dataset

By default, Kaleidoscope samples from nvidia/Nemotron-Personas-Singapore. To use a different country variant, set NEMOTRON_PERSONAS_DATASET in your .env file to any nvidia/Nemotron-Personas-* Hugging Face dataset. If the code has no style defaults for the dataset you choose, add them in the persona sampler.
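
As a rough illustration of how such a setting could be resolved, the sketch below reads NEMOTRON_PERSONAS_DATASET from the environment and falls back to the Singapore default. The function name `resolve_persona_dataset` and the prefix check are assumptions made for this example, not Kaleidoscope's actual code; the Japan variant in the comment is only an example value, so check that the dataset you pick exists on the Hub.

```python
import os

# Default named in the documentation above.
DEFAULT_DATASET = "nvidia/Nemotron-Personas-Singapore"
# Assumption: only nvidia/Nemotron-Personas-* datasets are accepted.
REQUIRED_PREFIX = "nvidia/Nemotron-Personas-"


def resolve_persona_dataset(env=None):
    """Return the dataset name to sample personas from (hypothetical helper).

    A matching .env line might look like:
        NEMOTRON_PERSONAS_DATASET=nvidia/Nemotron-Personas-Japan
    """
    env = os.environ if env is None else env
    name = (env.get("NEMOTRON_PERSONAS_DATASET") or "").strip() or DEFAULT_DATASET
    if not name.startswith(REQUIRED_PREFIX) or name == REQUIRED_PREFIX:
        raise ValueError(f"expected a {REQUIRED_PREFIX}* dataset, got {name!r}")
    return name
```

Passing a plain dict instead of `os.environ` makes the helper easy to try without touching your real environment.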

Reviewing Personas

Generated and sampled personas start in pending status. Review each one and:

  • Approve - include in input generation
  • Reject - exclude from input generation
  • Edit - modify and save (status becomes "edited")

Only approved personas are used for input generation.
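
To make the review flow concrete, here is a minimal sketch of a persona record and the filter applied before input generation. The `Persona` class, its attribute names, and `personas_for_generation` are hypothetical and only mirror the fields and statuses described above; they are not Kaleidoscope's data model.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    # Field names follow the persona table above (hypothetical attribute names).
    title: str
    info: str
    style: str
    use_case: str
    # Generated and sampled personas start as "pending".
    status: str = "pending"


def personas_for_generation(personas):
    """Keep only approved personas, following the rule stated above.

    Whether "edited" personas also count as approved is not stated in the
    documentation, so this sketch takes the rule literally and excludes them.
    """
    return [p for p in personas if p.status == "approved"]


personas = [
    Persona("Elderly patient seeking medication info", "Retired, manages several prescriptions",
            "formal", "dosage questions", status="approved"),
    Persona("Busy caregiver", "Looks after a parent", "terse", "appointment logistics"),
    Persona("Pharmacy student", "Second-year student", "detailed", "drug interactions",
            status="rejected"),
]
selected = personas_for_generation(personas)
```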

Step 2: Generate Inputs

Once you have approved personas, generate test inputs from their perspectives. For each batch you configure:

| Setting | Options | Effect |
| --- | --- | --- |
| Count | Any number | Total number of inputs to generate (distributed across personas and languages) |
| Input style | Brief / Regular / Detailed | How much context the LLM prompt includes when generating |
| Model | Any configured model | Which LLM generates the inputs |
| Languages | Any selection (default: English) | Which languages to generate inputs in; the total count is split evenly across selected languages |

Distribution Logic

Inputs are distributed evenly across all approved personas, then within each persona they are allocated across type/scope combinations using weighted ratios.

When a knowledge base is uploaded:

| Type | Scope | Weight |
| --- | --- | --- |
| Typical | In-KB | 70% |
| Typical | Out-of-KB | 10% |
| Edge case | In-KB | 15% |
| Edge case | Out-of-KB | 5% |

When no knowledge base is present:

| Type | Scope | Weight |
| --- | --- | --- |
| Typical | Out-of-KB | 80% |
| Edge case | Out-of-KB | 20% |

  • Typical inputs represent common asks from real users.
  • Edge case inputs are unusual, adversarial, or boundary-pushing.
  • In-KB inputs are answerable from uploaded knowledge base documents.
  • Out-of-KB inputs are outside what the target's knowledge base covers.

For developers

These ratios are configurable in backend/src/common/config.py.
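
The distribution logic above can be sketched as a two-stage split: first evenly across personas, then by weight across type/scope combinations. This is an illustration only. The function names are hypothetical, and the documentation does not say how remainders are rounded, so the sketch assumes largest-remainder rounding and gives leftover inputs to the earliest personas.

```python
# Weights from the tables above, as integer percentages to avoid float rounding.
WEIGHTS_WITH_KB = {
    ("typical", "in_kb"): 70,
    ("typical", "out_of_kb"): 10,
    ("edge_case", "in_kb"): 15,
    ("edge_case", "out_of_kb"): 5,
}
WEIGHTS_WITHOUT_KB = {
    ("typical", "out_of_kb"): 80,
    ("edge_case", "out_of_kb"): 20,
}


def split_evenly(total, keys):
    """Split total across keys; earlier keys absorb any remainder (assumption)."""
    base, extra = divmod(total, len(keys))
    return {key: base + (1 if i < extra else 0) for i, key in enumerate(keys)}


def split_weighted(total, percents):
    """Allocate total by percentage using largest-remainder rounding (assumption)."""
    counts = {key: total * pct // 100 for key, pct in percents.items()}
    leftover = total - sum(counts.values())
    # Hand leftover inputs to the combinations with the largest fractional part.
    by_remainder = sorted(percents, key=lambda k: (total * percents[k]) % 100, reverse=True)
    for key in by_remainder[:leftover]:
        counts[key] += 1
    return counts


def plan_inputs(total, personas, has_kb):
    """Return {persona: {(type, scope): count}} for one generation batch."""
    weights = WEIGHTS_WITH_KB if has_kb else WEIGHTS_WITHOUT_KB
    per_persona = split_evenly(total, personas)
    return {persona: split_weighted(n, weights) for persona, n in per_persona.items()}


plan = plan_inputs(40, ["Persona A", "Persona B"], has_kb=True)
```

With 40 inputs and two approved personas, each persona receives 20 inputs, which the with-KB weights divide into 14, 2, 3, and 1.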

Setting Languages

You may generate test inputs in various languages. When more than one language is selected, the total input count is divided evenly across all selected languages first. The Distribution Logic above then runs independently for each language, and each generated input is tagged with its language.
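
The ordering described here, languages first and then personas within each language, might look like the sketch below. The helper names are hypothetical, the language and persona labels are placeholder values, and the handling of counts that do not divide evenly is an assumption, since the documentation does not specify it.

```python
def split_evenly(total, keys):
    """Split total across keys; earlier keys absorb any remainder (assumption)."""
    if not keys:
        raise ValueError("at least one key is required")
    base, extra = divmod(total, len(keys))
    return {key: base + (1 if i < extra else 0) for i, key in enumerate(keys)}


def plan_by_language(total, languages, personas):
    """Divide the total across languages, then each share across personas.

    Each inner count would then go through the type/scope weighting, and every
    generated input would carry its language as a tag.
    """
    per_language = split_evenly(total, languages)
    return {lang: split_evenly(n, personas) for lang, n in per_language.items()}


plan = plan_by_language(30, ["English", "Malay", "Tamil"], ["Persona A", "Persona B"])
```

Here 30 inputs across three languages give 10 per language, and each of the two personas gets 5 inputs in every language.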

Warning

Make sure your target accepts inputs in the languages you select, because Kaleidoscope sends the generated inputs as-is. Judges always write their reasoning in English (see Scoring and Validation), on the assumption that the judge model understands the source language. You can swap in a different model using Custom Judges if needed.

Uploading Inputs

If you already have test inputs, you can upload them directly (CSV, JSON, or Excel) instead of generating. Uploaded inputs are marked with source "uploaded" and can optionally be assigned to existing personas by title. The language setting only applies to generated inputs — uploaded inputs are stored without language metadata.
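
For a sense of what an upload involves, the sketch below parses a CSV and matches rows to existing personas by title. The column names `input` and `persona`, the record keys, and the function itself are assumptions made for illustration; check the upload dialog for the format Kaleidoscope actually expects.

```python
import csv
import io


def load_uploaded_inputs(csv_text, persona_titles):
    """Parse uploaded CSV text into input records (hypothetical format)."""
    known = set(persona_titles)
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        text = (row.get("input") or "").strip()
        if not text:
            continue  # skip rows without an input
        title = (row.get("persona") or "").strip()
        records.append({
            "text": text,
            "source": "uploaded",  # uploaded inputs are marked with this source
            # Persona assignment is optional and matched by title.
            "persona_title": title if title in known else None,
            # Uploaded inputs carry no language metadata.
            "language": None,
        })
    return records


sample = (
    "input,persona\n"
    "Can I take ibuprofen with my blood pressure pills?,Elderly patient seeking medication info\n"
    "What are your opening hours?,\n"
)
records = load_uploaded_inputs(sample, ["Elderly patient seeking medication info"])
```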

Reviewing Inputs

Generated inputs start as pending. Review and approve the ones you want to include in scoring.

Snapshots

Once you're happy with your approved inputs, create a snapshot - a versioned evaluation run. A snapshot locks in the current set of inputs so you can score, annotate, and compare results with other snapshots.

Annotations

Once a snapshot is created, you can run scoring, annotate outcomes with human labels, and compare results across snapshots to measure improvement over time.