Evaluations¶

HoloDeck grades your agent automatically: you declare metrics in agent.yaml, attach test cases, and run holodeck test run. Each metric scores the agent's response (0.0–1.0) and passes or fails against a threshold.

Quick start¶

Add one deterministic numeric metric and one test case to agent.yaml, then run the suite. This case asserts the agent returns 60.94 within a 1% relative tolerance:

# agent.yaml
evaluations:
  metrics:
    - type: standard
      metric: numeric
      relative_tolerance: 0.01

test_cases:
  - name: "Exercise price"
    input: "What was the weighted average exercise price per share in 2007?"
    ground_truth: "60.94"

holodeck test run agent.yaml

Test: "Exercise price"
Metrics:
  ✓ numeric: 1.00 (threshold: —)
Result: PASS

1 passed, 0 failed

numeric runs locally with no LLM call — fast, free, and reproducible. Swap in metric: equality for exact string answers, or type: geval for LLM-judged quality (see below).

How it works¶

You attach metrics under evaluations and a list of test_cases (each with an input and optional ground_truth). When you run holodeck test run agent.yaml, HoloDeck invokes the agent on each test input, records its response and tool calls, then runs every enabled metric. Metrics fall into four families: deterministic / NLP (type: standard — no LLM, scores token overlap or compares numbers/strings), DeepEval (type: geval and type: rag — LLM-as-a-judge), code graders (type: code — your Python callable with full tool-call access), and legacy Azure AI metrics (deprecated). A metric passes when its score meets its threshold; a test case passes when all its metrics pass. The evaluation LLM (for geval/rag) is provider-agnostic — configure it via a model: block (see Evaluation model configuration).

Choosing a metric¶

Family	`type`	LLM needed?	Use for
Deterministic	`standard` (`equality`, `numeric`)	No	Exact strings, numbers, slot-filling, classification labels
NLP overlap	`standard` (`f1_score`, `bleu`, `rouge`, `meteor`)	No	Token / n-gram similarity to a reference
DeepEval GEval	`geval`	Yes	Custom natural-language quality criteria
DeepEval RAG	`rag`	Yes	Hallucination / retrieval quality in RAG pipelines
Code grader	`code`	No	Tool-call assertions, domain logic over tool results
Legacy (deprecated)	`standard` (`groundedness`, etc.)	Yes	Migrate to `geval`/`rag` instead

Deterministic & NLP metrics (`type: standard`)¶

Standard metrics compare the agent response to ground_truth without an LLM call. They emit a score in 0.0–1.0 and use a metric: discriminator. Implemented in src/holodeck/lib/evaluators/deterministic.py.

Numeric¶

Number compare with absolute and/or relative tolerance. Score is 1.0 if either tolerance passes, 0.0 otherwise. Requires ground_truth (must parse to a number).

- type: standard
  metric: numeric
  absolute_tolerance: 0.5
  relative_tolerance: 0.0
  accept_percent: true
  accept_thousands_separators: true

Flag	Default	Effect
`absolute_tolerance`	`1e-6`	Pass when `abs(actual - expected) <= absolute_tolerance` (inclusive)
`relative_tolerance`	`0.0`	Pass when `abs(actual - expected) <= relative_tolerance * abs(expected)`
`accept_percent`	`false`	Parse a trailing `%` as `/100` (so `"35%"` ↔ `0.35`)
`accept_thousands_separators`	`false`	Strip `,`, `_`, and NBSP before parsing
`response_path`	—	Extract a leaf from a JSON response envelope before parsing (e.g. `answer` for `{"answer": 60.94}`)

When to use: financial / quantitative answers. Common pairing: enable both accept_percent and accept_thousands_separators for filings-style data ("1,234.56", "35.8%"). See sample/financial-assistant/claude/agent.yaml for a working example that grades the answer leaf of a structured-output envelope via response_path.

Equality¶

Strict string equality with optional normalization. Score is 1.0 on match, 0.0 otherwise. Requires ground_truth.

- type: standard
  metric: equality
  case_insensitive: true
  strip_whitespace: true
  strip_punctuation: false

Flag	Effect (all default `false`)
`case_insensitive`	Lowercase both sides before compare
`strip_whitespace`	Collapse whitespace runs and trim before compare
`strip_punctuation`	Remove `string.punctuation` before compare

When to use: closed-form answers where the agent must return an exact token (slot-filling, classification labels, "yes"/"no", enum values). Setting a model override on this metric is a no-op (a warning is logged).

NLP overlap metrics¶

Token / n-gram overlap against ground_truth. All score 0.0–1.0 (higher is better).

`metric`	Measures	Use for
`f1_score`	Token-level precision/recall balance	Exact word matching matters
`bleu`	N-gram similarity; penalizes brevity	Translation, paraphrase
`rouge`	Recall of n-grams; reference coverage	Summarization
`meteor`	N-gram match with synonyms + word order	Translation/paraphrase with synonyms

- type: standard
  metric: f1_score
  threshold: 0.8

DeepEval metrics (`type: geval`, `type: rag`)¶

DeepEval provides LLM-as-a-judge evaluation. It works with any configured provider (Ollama for free local eval, or OpenAI / Anthropic / Azure OpenAI). Set temperature: 0.0 for deterministic scoring.

GEval — custom criteria¶

GEval uses the G-Eval algorithm with chain-of-thought prompting to score responses against natural-language criteria.

- type: geval
  name: "Technical Accuracy"
  criteria: |
    Evaluate whether the response provides accurate technical information
    that correctly addresses the user's question.
  evaluation_steps:              # Optional: auto-generated if omitted
    - "Check if the response directly addresses the user's question"
    - "Verify technical accuracy of any code or commands provided"
  evaluation_params:             # Which test-case fields to feed in
    - actual_output
    - input
  threshold: 0.8
  strict_mode: false             # Binary scoring (1.0 / 0.0) when true
  model:                         # Optional per-metric model override
    provider: openai
    name: gpt-4

GEval config fields:

Field	Type	Required	Description
`type`	string	Yes	`"geval"`
`name`	string	Yes	Metric name (e.g. "Coherence")
`criteria`	string	Yes	Natural-language evaluation criteria
`evaluation_steps`	list	No	Step-by-step instructions (auto-generated if omitted)
`evaluation_params`	list	No	Test-case fields to use (default `["actual_output"]`)
`threshold`	float	No	Minimum passing score (0–1)
`strict_mode`	bool	No	Binary scoring when `true` (default `false`)
`enabled`	bool	No	Default `true`
`fail_on_error`	bool	No	Fail test on evaluation error (default `false`)
`model`	object	No	Per-metric model override

Evaluation parameters:

Parameter	Description	When to use
`actual_output`	Agent's response	Always (required)
`input`	User's query	When relevance to query matters
`expected_output`	Ground-truth answer	When comparing to expected response
`context`	Additional context provided	When evaluating context usage
`retrieval_context`	Retrieved documents	For RAG evaluation

RAG metrics¶

RAG metrics evaluate responses generated from retrieved context.

- type: rag
  metric_type: faithfulness
  threshold: 0.8
  include_reason: true           # Include reasoning in results
  model:                         # Optional per-metric override
    provider: openai
    name: gpt-4

`metric_type`	Detects / measures	Required params
`faithfulness`	Hallucinations (claims unsupported by context)	input, actual_output, retrieval_context
`answer_relevancy`	Response relevance to query	input, actual_output
`contextual_relevancy`	Relevance of retrieved chunks	input, actual_output, retrieval_context
`contextual_precision`	Chunk ranking quality (most-relevant ranked highest)	input, actual_output, expected_output, retrieval_context
`contextual_recall`	Retrieval completeness for the expected answer	input, actual_output, expected_output, retrieval_context

The test runner automatically extracts retrieval_context from vectorstore tool results, or you can supply it per test case.

Code graders (`type: code`)¶

Code graders are user-supplied Python callables that score a turn's behavior with the full per-turn context: input, response, ground truth, all tool invocations (args + results), and a free-form turn_config. They run in-process — free, deterministic, and as fast as your Python.

When to use: verifying tool-call ordering or argument shapes, domain logic over actual tool-returned values (program equivalence, schema-aware diff), or anything that doesn't reduce to a single LLM-judgeable string.

- type: code
  grader: "graders.exact_tool_call:called_subtract_with"
  threshold: 0.7         # Optional — used when grader returns a bare float
  enabled: true          # Optional — default true
  fail_on_error: false   # Optional — true = grader exception fails the case
  name: "Turn Program"   # Optional — defaults to the callable name

The grader is a "module.path:callable" reference resolved at config-load time (a missing module / attribute / non-callable raises ConfigError before any agent runs). The module must be importable from the working directory (e.g. a graders/ package next to agent.yaml).

Grader contract¶

from holodeck.lib.test_runner.code_grader import GraderContext, GraderResult

def my_grader(ctx: GraderContext) -> GraderResult | bool | float:
    ...

GraderContext (frozen dataclass) fields: turn_input, agent_response, ground_truth, tool_invocations (immutable tuple; each exposes name, args, result, bytes, duration_ms, error), retrieval_context, turn_index, test_case_name, turn_config.

GraderResult fields: score (float, normalized to [0,1]), passed (bool | None — None defers to threshold, or >= 0.5 if unset), reason (str | None), details (dict | None).

Shortcut returns: True → score=1.0, passed=True; False → score=0.0, passed=False; bare float/int → score=value, passed=None (caller derives passed from threshold).

Example¶

# graders/exact_tool_call.py
from holodeck.lib.test_runner.code_grader import GraderContext, GraderResult

def called_subtract_with(ctx: GraderContext) -> GraderResult:
    expected_a = ctx.turn_config.get("expected_a")
    expected_b = ctx.turn_config.get("expected_b")
    for inv in ctx.tool_invocations:
        if inv.name == "subtract" and inv.args.get("a") == expected_a and inv.args.get("b") == expected_b:
            return GraderResult(score=1.0, passed=True, reason="exact match")
    return GraderResult(score=0.0, passed=False, reason="no matching subtract call")

A production-style example lives at sample/financial-assistant/claude/graders/turn_program_equivalence.py, which walks ConvFinQA turn_program strings against the actual subtract / divide tool calls.

Test cases¶

Per-test-case metric overrides¶

A test case can specify its own evaluations: list, overriding the agent-wide metrics for that case:

test_cases:
  - name: "Fact check test"
    input: "What's our company's founding date?"
    expected_tools: [search_kb]
    ground_truth: "Founded in 2010"
    evaluations:
      - type: rag
        metric_type: faithfulness
        threshold: 0.85
      - type: rag
        metric_type: answer_relevancy
        threshold: 0.7

Multi-turn test cases¶

A test case is multi-turn when it carries a turns: list instead of a top-level input / ground_truth / expected_tools. Each turn drives one exchange through the same agent session, so retrieval context, history, and tool state carry forward.

test_cases:
  - name: "ConvFinQA · MRO 2007"
    turns:
      - input: "What was the weighted average exercise price per share in 2007?"
        ground_truth: "60.94"
        turn_config:
          turn_program: "60.94"
      - input: "And in 2005?"
        ground_truth: "25.14"
        turn_config:
          turn_program: "25.14"
      - input: "What was the change over the years?"
        ground_truth: "35.8"
        turn_config:
          turn_program: "subtract(60.94, 25.14)"
        expected_tools:
          - name: subtract
            args:
              a: { fuzzy: "60.94" }
              b: { fuzzy: "25.14" }
        evaluations:
          - type: code
            grader: "graders.turn_program_equivalence:turn_program_equivalence"

Per-turn fields (holodeck.models.test_case.Turn):

Field	Type	Notes
`input`	string	Required, non-empty
`ground_truth`	string	Optional expected reply for this turn
`expected_tools`	list[str \| ExpectedTool]	Per-turn tool assertions; bare strings → `ExpectedTool(name=..., count=1)`
`files`	list[FileInput]	Up to 10 multimodal inputs per turn
`retrieval_context`	list[str]	RAG context for this turn
`evaluations`	list[metric]	Per-turn metric overrides (same `type` discriminators)
`turn_config`	dict	Free-form payload passed to code graders as `ctx.turn_config`

Mutual exclusivity: when turns: is present, the top-level input, ground_truth, expected_tools, files, and retrieval_context keys must not be set.

Argument matchers in expected_tools[*].args:

YAML shape	Matcher	Behavior
`key: 42` (scalar/list/dict)	`LiteralMatcher`	Exact equality (int↔float numeric equivalence)
`key: { fuzzy: "60.94" }`	`FuzzyMatcher`	Case / whitespace / separator-tolerant, numeric-aware
`key: { regex: "^\\d+$" }`	`RegexMatcher`	Anchored full-match against `str(actual)`

External test-case files (`test_cases_file`)¶

For large or shared suites, lift test_cases: out of agent.yaml and reference an external file:

# agent.yaml
evaluations:
  metrics:
    - type: standard
      metric: numeric
      relative_tolerance: 0.01

test_cases_file: data/convfinqa_subset.yaml

The referenced file may be either a top-level list of cases or a dict with a test_cases: key. Resolution rules (src/holodeck/config/loader.py:_resolve_test_cases_file):

Resolved relative to the directory containing agent.yaml (absolute paths also work).
Parsed as YAML with the same ${VAR} environment-variable substitution applied to agent.yaml.
Mutually exclusive with an inline test_cases: block — specifying both raises ConfigError at load time.
Missing files, parse errors, and non-list/dict payloads raise ConfigError before any test runs.

When to use: large suites (hundreds of cases), generated fixtures, or sharing one corpus across multiple agent variants.

Evaluation model configuration¶

LLM-backed metrics (geval, rag) need an evaluation model. It is provider-agnostic — configure it with the same model: block shape used elsewhere, at any of three levels (highest priority first):

Per-metric — a model: on the metric itself, used for that metric only.
Evaluation-wide — evaluations.model, the default for all metrics without an override.
Agent model — falls back to the agent's top-level model if neither above is set.

evaluations:
  model:                          # Level 2: default for all metrics
    provider: ollama              # Free local inference
    name: llama3.2:latest
    temperature: 0.0

  metrics:
    - type: rag
      metric_type: faithfulness
      threshold: 0.9
      model:                      # Level 1: override for this critical metric
        provider: openai
        name: gpt-4
    - type: geval
      name: "Coherence"
      criteria: "Evaluate response clarity."
      threshold: 0.75             # Uses the Level 2 model above

Model fields: provider (ollama / openai / azure_openai / anthropic) and name are required; temperature (recommend 0.0), max_tokens, and top_p are optional. Backends route automatically: openai / azure_openai use the OpenAI Agents backend; anthropic / ollama use the Claude backend. The eval model is independent of which backend runs the agent.

Cost optimization¶

Default to free local Ollama; override to a paid model only for critical metrics (per-metric model:).
Prefer deterministic / NLP type: standard metrics where possible — they make no LLM calls at all.
For LLM metrics, a smaller model (e.g. gpt-4o-mini) is often enough and far cheaper than gpt-4.

Metric options reference¶

These fields apply to most metric types:

Field	Type	Default	Purpose
`threshold`	float (0–1)	none	Minimum score to pass. Without it, the metric is informational.
`enabled`	bool	`true`	Temporarily disable a metric without removing it.
`fail_on_error`	bool	`false`	Whether an evaluation error fails the test (soft failure by default).
`model`	object	inherited	Per-metric LLM override (LLM-backed metrics only).

Legacy AI metrics (deprecated)¶

Deprecated

Azure AI-based metrics (groundedness, relevance, coherence, safety) are deprecated and will be removed in a future version. They remain supported for backwards compatibility only.

Legacy metric	Migrate to
`groundedness`	`type: rag`, `metric_type: faithfulness`
`relevance`	`type: rag`, `metric_type: answer_relevancy`
`coherence`	`type: geval` with custom `criteria`
`safety`	`type: geval` with safety `criteria`

# Before (deprecated)            # After
- type: standard                 - type: rag
  metric: groundedness             metric_type: faithfulness
  threshold: 0.8                   threshold: 0.8

Test execution¶

holodeck test run agent.yaml (or the shorthand holodeck test agent.yaml) for each test case:

Executes the agent with the test input.
Records which tools were called and validates them against expected_tools.
Runs each enabled metric and compares against thresholds.
Reports pass/fail per metric, then an overall N passed, M failed summary.

The command exits non-zero if any test fails (CI-friendly). Use holodeck test view to launch the dashboard over past runs — see the Dashboard guide.

Test: "Password Reset"
Input: "How do I reset my password?"
Tools called: [search_kb] ✓
Metrics:
  ✓ Faithfulness: 0.92 (threshold: 0.8)
  ✓ Answer Relevancy: 0.88 (threshold: 0.75)
  ✓ F1 Score: 0.81 (threshold: 0.7)
Result: PASS

Troubleshooting¶

Error: "invalid metric type" — valid type values are geval, rag, standard, code. For standard: f1_score, bleu, rouge, meteor, equality, numeric. For rag: faithfulness, answer_relevancy, contextual_relevancy, contextual_precision, contextual_recall. For code, provide a grader: "module.path:callable".

Metric always fails — check the evaluation model is working; try without a threshold first; for RAG metrics ensure the required parameters are available.

LLM evaluation too slow — use a local Ollama model, a faster model (gpt-4o-mini), or switch to NLP / deterministic metrics (free and fast).

Inconsistent evaluation results — set temperature: 0.0; use a more powerful model for complex criteria; add evaluation_steps to GEval for more stable scoring.

RAG metric missing retrieval_context — ensure your agent uses a vectorstore tool (the runner auto-extracts context from tool results), or supply manual retrieval_context in the test case.

Next steps¶

Optimizer guide — holodeck test optimize automates the tune-and-rerun loop as a coordinate-descent optimizer over your eval suite.
Agent Configuration Guide — full agent.yaml reference.
Dashboard guide — visualize eval-run history.
API Reference — evaluator class details.