Evaluations Guide

This guide explains HoloDeck's evaluation system for measuring agent quality.

Overview

Evaluations measure how well your agent performs. You define metrics in agent.yaml to automatically grade agent responses against test cases.

HoloDeck supports four categories of metrics (in order of recommendation):

  1. DeepEval Metrics (Recommended) - LLM-as-a-judge with custom criteria (GEval) and RAG-specific metrics
  2. Code Graders - User-supplied Python callables that score per-turn behavior with full access to tool calls
  3. NLP & Deterministic Metrics (Standard) - Text comparison algorithms (F1, BLEU, ROUGE, METEOR) and zero-LLM evaluators (equality, numeric)
  4. Legacy AI Metrics (Deprecated) - Azure AI-based metrics (groundedness, relevance, coherence, safety)

Basic Structure

evaluations:
  model:      # Optional: Default LLM for evaluation
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0

  metrics:    # Required: Metrics to compute
    # DeepEval GEval metric (recommended)
    - type: geval
      name: "Response Quality"
      criteria: "Evaluate if the response is helpful and accurate"
      threshold: 0.7

    # DeepEval RAG metric
    - type: rag
      metric_type: answer_relevancy
      threshold: 0.7

Configuration Levels

Model configuration for evaluations works at three levels (priority order):

Level 1: Per-Metric Override (Highest Priority)

Override model for a specific metric:

evaluations:
  metrics:
    - type: geval
      name: "Critical Metric"
      criteria: "..."
      model:                    # Uses this model for this metric only
        provider: openai
        name: gpt-4

Level 2: Evaluation-Wide Model

Default for all metrics without override:

evaluations:
  model:                        # Used for all metrics
    provider: ollama
    name: llama3.2:latest

  metrics:
    - type: geval
      name: "Coherence"
      criteria: "..."
      # Uses evaluation.model above
    - type: rag
      metric_type: faithfulness
      # Also uses evaluation.model above

Level 3: Agent Model (Lowest Priority)

Used if neither Level 1 nor Level 2 is specified:

model:                          # Agent's main model
  provider: openai
  name: gpt-4o

evaluations:
  metrics:
    - type: geval
      name: "Quality"
      criteria: "..."
      # Falls back to agent.model above
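
Putting the three levels together in one config (a hypothetical combination of the settings above): the per-metric override wins, the evaluation-wide model covers the remaining metrics, and anything left falls back to the agent model.

model:                          # Agent's main model (Level 3 fallback)
  provider: openai
  name: gpt-4o

evaluations:
  model:                        # Evaluation-wide default (Level 2)
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0

  metrics:
    - type: geval
      name: "Safety"
      criteria: "Evaluate response safety and appropriateness."
      model:                    # Per-metric override (Level 1) - applies to this metric only
        provider: openai
        name: gpt-4
    - type: rag
      metric_type: answer_relevancy
      # No override here, so the evaluation-wide ollama model is used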

DeepEval Metrics (Recommended)

DeepEval provides powerful LLM-as-a-judge evaluation with two metric types:

  • GEval: Custom criteria evaluation using chain-of-thought prompting
  • RAG Metrics: Specialized metrics for retrieval-augmented generation pipelines

Why DeepEval?

  • Flexible: Define custom evaluation criteria in natural language
  • Local Models: Works with Ollama for free, local evaluation
  • RAG-Focused: Purpose-built metrics for RAG pipeline evaluation
  • Chain-of-Thought: Uses G-Eval algorithm for more accurate scoring

Supported Providers

model:
  provider: ollama        # Free, local inference (recommended for development)
  # provider: openai      # OpenAI API
  # provider: anthropic   # Anthropic API
  # provider: azure_openai # Azure OpenAI
  name: llama3.2:latest
  temperature: 0.0        # Use 0 for deterministic evaluation

GEval Metrics

GEval uses the G-Eval algorithm with chain-of-thought prompting to evaluate responses against custom criteria.

Basic Configuration

- type: geval
  name: "Coherence"
  criteria: "Evaluate whether the response is clear, well-structured, and easy to understand."
  threshold: 0.7

Full Configuration

- type: geval
  name: "Technical Accuracy"
  criteria: |
    Evaluate whether the response provides accurate technical information
    that correctly addresses the user's question.
  evaluation_steps:              # Optional: Auto-generated if omitted
    - "Check if the response directly addresses the user's question"
    - "Verify technical accuracy of any code or commands provided"
    - "Ensure explanations are correct and not misleading"
  evaluation_params:             # Which test case fields to use
    - actual_output              # Required: Agent's response
    - input                      # Optional: User's query
    - expected_output            # Optional: Ground truth
    - context                    # Optional: Additional context
    - retrieval_context          # Optional: Retrieved documents
  threshold: 0.8
  strict_mode: false             # Binary scoring (1.0 or 0.0) when true
  enabled: true
  fail_on_error: false
  model:                         # Optional: Per-metric model override
    provider: openai
    name: gpt-4

GEval Configuration Options

| Field | Type | Required | Description |
|---|---|---|---|
| type | string | Yes | Must be "geval" |
| name | string | Yes | Custom metric name (e.g., "Coherence", "Helpfulness") |
| criteria | string | Yes | Natural language evaluation criteria |
| evaluation_steps | list | No | Step-by-step evaluation instructions (auto-generated if omitted) |
| evaluation_params | list | No | Test case fields to use (default: ["actual_output"]) |
| threshold | float | No | Minimum passing score (0-1) |
| strict_mode | bool | No | Binary scoring when true (default: false) |
| enabled | bool | No | Enable/disable metric (default: true) |
| fail_on_error | bool | No | Fail test on evaluation error (default: false) |
| model | object | No | Per-metric model override |

Evaluation Parameters

| Parameter | Description | When to Use |
|---|---|---|
| actual_output | Agent's response | Always (required for evaluation) |
| input | User's query/question | When relevance to the query matters |
| expected_output | Ground truth answer | When comparing to an expected response |
| context | Additional context provided | When evaluating context usage |
| retrieval_context | Retrieved documents | For RAG pipeline evaluation |

GEval Examples

Coherence Check:

- type: geval
  name: "Coherence"
  criteria: "Evaluate whether the response is clear, well-structured, and easy to understand."
  evaluation_steps:
    - "Evaluate whether the response uses clear and direct language."
    - "Check if the explanation avoids jargon or explains it when used."
    - "Assess whether complex ideas are presented in a way that's easy to follow."
  evaluation_params:
    - actual_output
  threshold: 0.7

Helpfulness Check:

- type: geval
  name: "Helpfulness"
  criteria: |
    Evaluate whether the response provides actionable, practical help
    that addresses the user's needs.
  evaluation_params:
    - actual_output
    - input
  threshold: 0.75

Factual Accuracy:

- type: geval
  name: "Factual Accuracy"
  criteria: |
    Evaluate whether the response is factually accurate when compared
    to the expected answer and provided context.
  evaluation_params:
    - actual_output
    - expected_output
    - context
  threshold: 0.85
  strict_mode: true  # Binary pass/fail


RAG Metrics

RAG (Retrieval-Augmented Generation) metrics evaluate the quality of responses generated using retrieved context.

Available RAG Metrics

| Metric Type | Purpose | Required Parameters |
|---|---|---|
| faithfulness | Detects hallucinations | input, actual_output, retrieval_context |
| answer_relevancy | Response relevance to query | input, actual_output |
| contextual_relevancy | Retrieved chunks relevance | input, actual_output, retrieval_context |
| contextual_precision | Chunk ranking quality | input, actual_output, expected_output, retrieval_context |
| contextual_recall | Retrieval completeness | input, actual_output, expected_output, retrieval_context |

Basic Configuration

- type: rag
  metric_type: faithfulness
  threshold: 0.8

- type: rag
  metric_type: answer_relevancy
  threshold: 0.7

Full Configuration

- type: rag
  metric_type: faithfulness
  threshold: 0.8
  include_reason: true           # Include reasoning in results
  enabled: true
  fail_on_error: false
  model:                         # Optional: Per-metric model override
    provider: openai
    name: gpt-4

RAG Metric Details

Faithfulness - Detects hallucinations by comparing response to retrieval context:

- type: rag
  metric_type: faithfulness
  threshold: 0.8
  include_reason: true

  • What it measures: Whether claims in the response are supported by retrieved documents
  • When to use: Critical for factual accuracy in RAG pipelines
  • Example: The agent says "The product costs $99"; faithfulness checks whether this claim appears in the retrieved context

Answer Relevancy - Measures response relevance to the query:

- type: rag
  metric_type: answer_relevancy
  threshold: 0.7

  • What it measures: How well the response addresses the user's question
  • When to use: General quality assurance for any agent
  • Example: The user asks "How do I reset my password?"; the metric checks whether the response actually explains the password reset

Contextual Relevancy - Measures relevance of retrieved chunks:

- type: rag
  metric_type: contextual_relevancy
  threshold: 0.75

  • What it measures: Whether retrieved documents are relevant to the query
  • When to use: Diagnosing retrieval quality issues

Contextual Precision - Evaluates chunk ranking quality:

- type: rag
  metric_type: contextual_precision
  threshold: 0.8

  • What it measures: Whether the most relevant chunks are ranked highest
  • When to use: Optimizing retrieval ranking algorithms

Contextual Recall - Measures retrieval completeness:

- type: rag
  metric_type: contextual_recall
  threshold: 0.7

  • What it measures: Whether all information needed for the expected answer was retrieved
  • When to use: Ensuring comprehensive retrieval coverage


NLP & Deterministic Metrics (Standard)

Standard metrics compare the agent response to ground_truth (or its tokens) without an LLM call — they're fast, free, and reproducible. They split into two families:

  • NLP metrics — token / n-gram overlap (F1, BLEU, ROUGE, METEOR).
  • Deterministic evaluators — equality (string match with normalization flags) and numeric (tolerance-based number comparison). Implemented in src/holodeck/lib/evaluators/deterministic.py.

All standard metrics use type: standard with a metric: discriminator and emit a score in the 0.0-1.0 range.

F1 Score

Measures precision and recall of token overlap.

- type: standard
  metric: f1_score
  threshold: 0.8

Scale: 0.0-1.0 (higher is better)

What it measures: Token-level overlap with the ground truth, balancing precision and recall.

When to use: When exact word matching is important
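
As a rough worked illustration of token-level F1 (assuming simple whitespace tokenization and exact token matching; the library's tokenizer may differ):

# Hypothetical F1 computation (illustrative only):
#   ground_truth:   "password resets are handled in account settings"   (7 tokens)
#   agent response: "you can reset your password in account settings"   (8 tokens)
#   overlapping tokens: password, in, account, settings  -> 4
#   precision = 4/8 = 0.50, recall = 4/7 ≈ 0.57
#   F1 = 2 * (0.50 * 0.57) / (0.50 + 0.57) ≈ 0.53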

BLEU (Bilingual Evaluation Understudy)

Measures n-gram overlap with reference translation.

- type: standard
  metric: bleu
  threshold: 0.6

Scale: 0.0-1.0 (higher is better)

What it measures: N-gram similarity to the reference; penalizes brevity.

When to use: For translation, paraphrase evaluation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

Measures recall of n-grams with reference.

- type: standard
  metric: rouge
  threshold: 0.7

Scale: 0.0-1.0 (higher is better)

What it measures: Recall of n-grams, i.e., coverage of the reference content.

When to use: For summarization tasks

METEOR (Metric for Evaluation of Translation with Explicit Ordering)

Similar to BLEU but with better handling of synonyms.

- type: standard
  metric: meteor
  threshold: 0.65

Scale: 0.0-1.0 (higher is better)

What it measures: N-gram matching with synonym support, taking word order into account.

When to use: For translation, paraphrase with synonyms

Equality (Deterministic)

Strict string equality with optional normalization. Score is 1.0 on match, 0.0 otherwise.

- type: standard
  metric: equality
  case_insensitive: true
  strip_whitespace: true
  strip_punctuation: false

Optional flags (all default to false):

| Flag | Effect |
|---|---|
| case_insensitive | Lowercase both sides before comparing |
| strip_whitespace | Collapse whitespace runs and trim before comparing |
| strip_punctuation | Remove string.punctuation before comparing |

Required test-case fields: ground_truth.

When to use: Closed-form answers where the agent must return an exact token (slot-filling, classification labels, "yes"/"no", enum values). Setting an LLM model override on this metric is a no-op (a warning is logged).
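
Under the flag semantics above, a hypothetical comparison with the example config (case_insensitive and strip_whitespace enabled, strip_punctuation left off) would behave like this:

# Hypothetical normalization behavior (illustrative only):
#   ground_truth: "Yes"
#   actual: "  yes "   -> both sides normalize to "yes"                    -> score 1.0
#   actual: "yes."     -> trailing period kept (strip_punctuation: false)  -> score 0.0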

Numeric (Deterministic)

Numeric comparison with absolute and/or relative tolerance. Score is 1.0 if either tolerance passes, 0.0 otherwise.

- type: standard
  metric: numeric
  absolute_tolerance: 0.5
  relative_tolerance: 0.0
  accept_percent: true
  accept_thousands_separators: true

Optional flags:

| Flag | Default | Effect |
|---|---|---|
| absolute_tolerance | 1e-6 | Pass when abs(actual - expected) <= absolute_tolerance (inclusive) |
| relative_tolerance | 0.0 | Pass when abs(actual - expected) <= relative_tolerance * abs(expected) |
| accept_percent | false | Parse a trailing % as /100 (so "35%" → 0.35) |
| accept_thousands_separators | false | Strip commas (,), underscores (_), and NBSP characters before parsing |

Required test-case fields: ground_truth (must parse to a number).

When to use: Financial / quantitative answers where the agent's reply is a number. Common pairing: enable both accept_percent and accept_thousands_separators for filings-style data ("1,234.56", "35.8%"). See sample/financial-assistant/claude/agent.yaml for a working example.
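
As a worked illustration of the tolerance rules, using the example config above (absolute_tolerance: 0.5, accept_percent and accept_thousands_separators enabled) and the semantics described in the flag table:

# Hypothetical outcomes (illustrative values only):
#   ground_truth "1,234.56" -> parsed as 1234.56 (separators stripped)
#     agent answer "1234.8" -> abs(1234.8 - 1234.56) = 0.24 <= 0.5 -> pass (score 1.0)
#     agent answer "1236.0" -> abs(1236.0 - 1234.56) = 1.44 >  0.5 -> fail (score 0.0)
#   ground_truth "0.358"
#     agent answer "35.8%"  -> parsed as 0.358 -> difference 0     -> pass (score 1.0)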


Code Graders

Code graders are user-supplied Python callables that score a turn's behavior. Unlike standard / DeepEval metrics — which only see the textual response — a code grader receives the full per-turn context: the input, the agent response, the ground truth, all tool invocations with their args and results, and a free-form turn_config payload from the test case.

Code graders run in-process, so they're free, deterministic, and as fast as the user's Python.

When to use

  • Verifying tool-call ordering or argument shapes (e.g., "did the agent call subtract(60.94, 25.14)?")
  • Domain logic that needs the actual values returned by tools (program equivalence, schema-aware diff, structured equality)
  • Anything that doesn't reduce to a single LLM-judgeable string

YAML schema

- type: code
  grader: "graders.turn_program_equivalence:turn_program_equivalence"
  threshold: 0.7         # Optional — used when grader returns a bare float
  enabled: true          # Optional — default true
  fail_on_error: false   # Optional — true = grader exception fails the case
  name: "Turn Program"   # Optional — defaults to the callable name

| Field | Type | Required | Notes |
|---|---|---|---|
| type | "code" | Yes | Discriminator |
| grader | string | Yes | "module.path:callable_name". Resolved at config-load time via importlib.import_module + getattr; a missing module, attribute, or non-callable raises ConfigError before any agent runs |
| threshold | float | No | Applied when the grader returns a bare float without passed=... |
| enabled | bool | No | Default true |
| fail_on_error | bool | No | Default false; if true, an unhandled exception inside the grader fails the whole test case |
| name | string | No | Display name in reports / dashboard; defaults to the callable name |

The grader path's module must be importable from the working directory when the test is run (e.g., a graders/ package next to agent.yaml).

Grader contract

A grader is any callable with this signature:

from holodeck.lib.test_runner.code_grader import GraderContext, GraderResult

def my_grader(ctx: GraderContext) -> GraderResult | bool | float:
    ...

GraderContext (frozen dataclass — holodeck.lib.test_runner.code_grader.GraderContext):

| Field | Type | Notes |
|---|---|---|
| turn_input | str | The user message for this turn |
| agent_response | str | What the agent said |
| ground_truth | str \| None | From the turn's ground_truth |
| tool_invocations | tuple[ToolInvocation, ...] | Every tool call this turn made, in order. Immutable; each entry exposes name, args, result, bytes, duration_ms, error |
| retrieval_context | tuple[str, ...] \| None | Carried-over RAG context |
| turn_index | int | 0-based position in the conversation |
| test_case_name | str \| None | The enclosing test case's name |
| turn_config | dict[str, Any] | The turn's turn_config: payload — your scratchpad for grader-specific keys (e.g., turn_program, expected counts) |

GraderResult (frozen dataclass):

| Field | Type | Notes |
|---|---|---|
| score | float | Normalized to [0.0, 1.0] |
| passed | bool \| None | Optional; None defers to threshold (or >= 0.5 if no threshold set) |
| reason | str \| None | Surfaced in the report / dashboard |
| details | dict[str, Any] \| None | Free-form payload for the dashboard's grader drilldown |

Shortcut returns (the runner normalizes them):

  • True → GraderResult(score=1.0, passed=True)
  • False → GraderResult(score=0.0, passed=False)
  • bare float / int → GraderResult(score=value, passed=None) — the caller derives passed from the threshold

Minimal example

# graders/exact_tool_call.py
from holodeck.lib.test_runner.code_grader import GraderContext, GraderResult

def called_subtract_with(ctx: GraderContext) -> GraderResult:
    expected_a = ctx.turn_config.get("expected_a")
    expected_b = ctx.turn_config.get("expected_b")
    for inv in ctx.tool_invocations:
        if inv.name == "subtract" and inv.args.get("a") == expected_a and inv.args.get("b") == expected_b:
            return GraderResult(score=1.0, passed=True, reason="exact match")
    return GraderResult(
        score=0.0,
        passed=False,
        reason=f"no subtract({expected_a}, {expected_b}) call observed",
    )

# in the test case
turn_config:
  expected_a: "60.94"
  expected_b: "25.14"
evaluations:
  - type: code
    grader: "graders.exact_tool_call:called_subtract_with"

A complete production-style example lives at sample/financial-assistant/claude/graders/turn_program_equivalence.py, which walks ConvFinQA turn_program strings against the actual subtract / divide tool calls.


Legacy AI Metrics (Deprecated)

DEPRECATED: Azure AI-based metrics (groundedness, relevance, coherence, safety) are deprecated and will be removed in a future version.

Migration Guide:

  • For custom criteria evaluation → Use type: geval with natural language criteria
  • For RAG hallucination detection → Use type: rag with metric_type: faithfulness
  • For relevance checking → Use type: rag with metric_type: answer_relevancy

These legacy metrics are still supported for backwards compatibility:

Groundedness (Deprecated)

# DEPRECATED - Use type: rag with metric_type: faithfulness instead
- type: standard
  metric: groundedness
  threshold: 0.8

Migration: Replace with:

- type: rag
  metric_type: faithfulness
  threshold: 0.8

Relevance (Deprecated)

# DEPRECATED - Use type: rag with metric_type: answer_relevancy instead
- type: standard
  metric: relevance
  threshold: 0.75

Migration: Replace with:

- type: rag
  metric_type: answer_relevancy
  threshold: 0.75

Coherence (Deprecated)

# DEPRECATED - Use type: geval with custom criteria instead
- type: standard
  metric: coherence
  threshold: 0.7

Migration: Replace with:

- type: geval
  name: "Coherence"
  criteria: "Evaluate whether the response is clear, well-structured, and easy to understand."
  threshold: 0.7

Safety (Deprecated)

# DEPRECATED - Use type: geval with custom criteria instead
- type: standard
  metric: safety
  threshold: 0.9

Migration: Replace with:

- type: geval
  name: "Safety"
  criteria: |
    Evaluate whether the response is safe and appropriate:
    - No harmful or dangerous content
    - No PII leakage
    - Appropriate tone and language
  threshold: 0.9


Metric Configuration Options

Threshold

  • Type: Float
  • Purpose: Minimum score for test to pass
  • Scale: 0-1 for all metrics
  • Optional: Yes (default: no threshold, metric is informational)

- type: geval
  name: "Quality"
  criteria: "..."
  threshold: 0.8

Enabled

  • Type: Boolean
  • Default: true
  • Purpose: Temporarily disable metric without removing it

- type: rag
  metric_type: answer_relevancy
  enabled: false  # Temporarily disabled; metric is not evaluated

Fail on Error

  • Type: Boolean
  • Default: false (soft failure)
  • Purpose: Whether to fail test if evaluation errors

- type: geval
  name: "Quality"
  criteria: "..."
  fail_on_error: false  # Continues even if LLM evaluation fails

Complete Examples

Basic DeepEval Setup

evaluations:
  model:
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0

  metrics:
    - type: geval
      name: "Coherence"
      criteria: "Evaluate whether the response is clear and well-structured."
      threshold: 0.7

    - type: rag
      metric_type: answer_relevancy
      threshold: 0.7

RAG Pipeline Evaluation

evaluations:
  model:
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0

  metrics:
    # Detect hallucinations
    - type: rag
      metric_type: faithfulness
      threshold: 0.85
      include_reason: true

    # Check response relevance
    - type: rag
      metric_type: answer_relevancy
      threshold: 0.75

    # Evaluate retrieval quality
    - type: rag
      metric_type: contextual_relevancy
      threshold: 0.7

    # Check retrieval completeness
    - type: rag
      metric_type: contextual_recall
      threshold: 0.7

Mixed Metrics (DeepEval + NLP)

evaluations:
  model:
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0

  metrics:
    # DeepEval metrics (primary)
    - type: geval
      name: "Response Quality"
      criteria: "Evaluate if the response is helpful and accurate."
      threshold: 0.75

    - type: rag
      metric_type: faithfulness
      threshold: 0.8

    # NLP metrics (secondary)
    - type: standard
      metric: f1_score
      threshold: 0.7

    - type: standard
      metric: rouge
      threshold: 0.6

Enterprise Setup with Model Overrides

evaluations:
  model:
    provider: ollama
    name: llama3.2:latest  # Default: free, local
    temperature: 0.0

  metrics:
    # Critical metrics - use powerful model
    - type: rag
      metric_type: faithfulness
      threshold: 0.9
      model:
        provider: openai
        name: gpt-4

    - type: geval
      name: "Safety"
      criteria: "Evaluate response safety and appropriateness."
      threshold: 0.95
      model:
        provider: openai
        name: gpt-4

    # Standard metrics - use default local model
    - type: geval
      name: "Coherence"
      criteria: "Evaluate response clarity."
      threshold: 0.75

    - type: rag
      metric_type: answer_relevancy
      threshold: 0.7

    # NLP metrics - no LLM needed
    - type: standard
      metric: f1_score
      threshold: 0.7

Per-Test Case Evaluation

Test cases can specify which metrics to run:

test_cases:
  - name: "Fact check test"
    input: "What's our company's founding date?"
    expected_tools: [search_kb]
    ground_truth: "Founded in 2010"
    evaluations:
      - type: rag
        metric_type: faithfulness
        threshold: 0.85
      - type: rag
        metric_type: answer_relevancy
        threshold: 0.7

  - name: "Creative task"
    input: "Generate a company tagline"
    evaluations:
      - type: geval
        name: "Creativity"
        criteria: "Evaluate if the tagline is creative and memorable."
        threshold: 0.7
      # Skip faithfulness since no retrieval context

Multi-Turn Test Cases

A test case is multi-turn when it carries a turns: list instead of a top-level input / ground_truth / expected_tools. Each turn drives one exchange through the same agent session, so retrieval context, conversation history, and tool state carry forward as the user would expect.

test_cases:
  - name: "ConvFinQA · MRO 2007"
    turns:
      - input: "What was the weighted average exercise price per share in 2007?"
        ground_truth: "60.94"
        turn_config:
          turn_program: "60.94"
      - input: "And in 2005?"
        ground_truth: "25.14"
        turn_config:
          turn_program: "25.14"
      - input: "What was the change over the years?"
        ground_truth: "35.8"
        turn_config:
          turn_program: "subtract(60.94, 25.14)"
        expected_tools:
          - name: subtract
            args:
              a: { fuzzy: "60.94" }
              b: { fuzzy: "25.14" }
        evaluations:
          - type: code
            grader: "graders.turn_program_equivalence:turn_program_equivalence"

Per-turn fields (from holodeck.models.test_case.Turn):

| Field | Type | Notes |
|---|---|---|
| input | string | Required, non-empty |
| ground_truth | string | Optional expected reply for this turn |
| expected_tools | list[str \| ExpectedTool] | Per-turn tool assertions; bare strings are promoted to ExpectedTool(name=..., count=1) |
| files | list[FileInput] | Up to 10 multimodal inputs per turn |
| retrieval_context | list[str] | RAG context for this turn |
| evaluations | list[metric] | Per-turn metric overrides — same type discriminators as top-level metrics (standard / geval / rag / code) |
| turn_config | dict | Free-form payload passed to code graders as ctx.turn_config |

Mutual exclusivity: when turns: is present, the top-level input, ground_truth, expected_tools, files, and retrieval_context keys must not be set. Pick one shape per case.

Argument matchers in expected_tools[*].args use a shape-discriminated mapping:

| YAML shape | Matcher | Behavior |
|---|---|---|
| key: 42 (scalar/list/dict) | LiteralMatcher | Exact equality (int↔float numeric equivalence) |
| key: { fuzzy: "60.94" } | FuzzyMatcher | Case / whitespace / separator-tolerant, numeric-aware |
| key: { regex: "^\\d+$" } | RegexMatcher | Anchored full-match against str(actual) |

A bare string in expected_tools (e.g., expected_tools: [subtract, divide]) is shorthand for ExpectedTool(name="…", count=1) with no argument matchers.
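
For example, one expected tool call could mix all three matcher shapes; the lookup_price tool and its argument names below are hypothetical:

expected_tools:
  - name: lookup_price
    args:
      ticker: "MRO"                     # scalar value: LiteralMatcher (exact equality)
      year: { fuzzy: "2007" }           # FuzzyMatcher (case/whitespace/separator tolerant, numeric-aware)
      currency: { regex: "^USD?$" }     # RegexMatcher (anchored full match against str(actual))
  - subtract                            # bare string shorthand: ExpectedTool(name="subtract", count=1)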

The dashboard renders multi-turn results with per-turn input / response / tool-call / metric blocks; see the Explorer view for a working example.

External Test-Case Files (test_cases_file)

For large or shared test suites, lift test_cases: out of agent.yaml and reference an external file with test_cases_file: instead.

# agent.yaml
name: financial-assistant
model: { provider: anthropic, name: claude-sonnet-4-6 }

evaluations:
  metrics:
    - type: standard
      metric: numeric
      absolute_tolerance: 0.5
      accept_percent: true
      accept_thousands_separators: true

test_cases_file: data/convfinqa_subset.yaml

# data/convfinqa_subset.yaml — top-level list of cases
- name: Single_MRO/2007/page_134.pdf-1
  turns:
    - input: "..."
      ground_truth: "60.94"
      turn_config: { turn_program: "60.94" }

Or, for tooling that wants a wrapping document:

# data/convfinqa_subset.yaml — wrapped form
test_cases:
  - name: case-1
    input: "..."
    ground_truth: "..."

Resolution rules (src/holodeck/config/loader.py:_resolve_test_cases_file):

  • The path is resolved relative to the directory containing agent.yaml (absolute paths also work).
  • The referenced file is parsed as YAML with environment-variable substitution applied to its raw text — the same ${VAR} expansion that runs on agent.yaml itself (see the sketch after this list).
  • The file's payload may be either a top-level list of test cases or a dict with a test_cases: key — both shapes are accepted.
  • test_cases_file is mutually exclusive with an inline test_cases: block on the same agent. Specifying both raises a ConfigError at load time.
  • Missing files, parse errors, and non-list payloads also raise ConfigError before any test runs.
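
A minimal sketch of how that ${VAR} expansion can be used in an external file (file name, variable, and content are hypothetical):

# data/smoke_tests.yaml (hypothetical) - ${SUPPORT_EMAIL} is substituted from the
# environment before the YAML is parsed, just as it would be in agent.yaml
- name: escalation-contact
  input: "Who do I contact to escalate a billing issue?"
  ground_truth: "Email ${SUPPORT_EMAIL} to escalate billing issues."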

When to use:

  • Large suites (hundreds of cases) you don't want to scroll past in agent.yaml.
  • Generated fixtures (e.g., scripts/convert_convfinqa.py writing data/convfinqa_subset.yaml).
  • Sharing one corpus across multiple agent variants — point each variant's agent.yaml at the same file.

Test Execution

When running tests, HoloDeck:

  1. Executes agent with test input
  2. Records which tools were called
  3. Validates tool usage (expected_tools)
  4. Runs each enabled metric
  5. Compares results against thresholds
  6. Reports pass/fail per metric

Example Output

Test: "Password Reset"
Input: "How do I reset my password?"
Tools called: [search_kb] ✓
Metrics:
  ✓ Faithfulness: 0.92 (threshold: 0.8)
  ✓ Answer Relevancy: 0.88 (threshold: 0.75)
  ✓ Coherence: 0.85 (threshold: 0.7)
  ✓ F1 Score: 0.81 (threshold: 0.7)
Result: PASS

Cost Optimization

Use Local Models for Development

evaluations:
  model:
    provider: ollama           # Free, local
    name: llama3.2:latest

Use Paid Models Only for Critical Metrics

evaluations:
  model:
    provider: ollama           # Default: free
    name: llama3.2:latest

  metrics:
    - type: rag
      metric_type: faithfulness
      model:
        provider: openai
        name: gpt-4            # Expensive override only for critical metric

Use NLP Metrics When Possible

NLP metrics are free (no LLM calls):

- type: standard
  metric: f1_score  # No LLM cost
- type: standard
  metric: rouge     # No LLM cost

Model Configuration Details

When specifying a model for evaluation:

model:
  provider: ollama|openai|azure_openai|anthropic  # Required
  name: model-identifier                          # Required
  temperature: 0.0-2.0                            # Optional (recommend 0.0 for evaluation)
  max_tokens: integer                             # Optional
  top_p: 0.0-1.0                                  # Optional

Provider-Specific Models

Ollama (Recommended for Development)
  • llama3.2:latest - Fast, capable
  • llama3.1:latest - More capable
  • Any model available in your Ollama installation

OpenAI
  • gpt-4o - Latest, best quality
  • gpt-4o-mini - Fast, cheap
  • gpt-4-turbo - Previous generation

Azure OpenAI
  • gpt-4 - Standard
  • gpt-4-32k - Extended context

Anthropic
  • claude-3-opus - Most capable
  • claude-3-sonnet - Balanced
  • claude-3-haiku - Fast, cheap

model:
  provider: ollama
  name: llama3.2:latest
  temperature: 0.0  # Deterministic for consistency

Troubleshooting

Error: "invalid metric type"

  • Check metric type is valid
  • Valid types: geval, rag, standard, code
  • For standard metrics: f1_score, bleu, rouge, meteor, equality, numeric
  • For RAG metrics: faithfulness, answer_relevancy, contextual_relevancy, contextual_precision, contextual_recall
  • For code metrics: provide a grader: "module.path:callable" reference

Metric always fails

  • Check evaluation model is working
  • Try without threshold first
  • Test evaluation model manually
  • For RAG metrics, ensure required parameters are available

LLM evaluation too slow

  • Use local Ollama model instead of API
  • Use faster model: gpt-4o-mini instead of gpt-4
  • Use NLP metrics instead (free and fast)

Inconsistent evaluation results

  • Set temperature to 0.0 for deterministic results
  • Use more powerful model for complex evaluations
  • Add evaluation_steps to GEval for better consistency

RAG metric missing retrieval_context

  • Ensure your agent uses a vectorstore tool
  • The test runner automatically extracts retrieval_context from tool results
  • Or provide retrieval_context manually in the test case (see the sketch below)
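
A minimal sketch of the manual option (hypothetical content):

test_cases:
  - name: "Password reset (manual context)"
    input: "How do I reset my password?"
    retrieval_context:
      - "Users can reset their password from Settings > Security > Reset Password."
    evaluations:
      - type: rag
        metric_type: faithfulness
        threshold: 0.8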

Best Practices

  1. Start with DeepEval: Use GEval and RAG metrics as primary evaluation
  2. Use Local Models: Start with Ollama for free development, upgrade for production
  3. Mix Metric Types: Combine DeepEval (semantic) with NLP (keyword-based)
  4. Cost-Aware: Use cheaper/local models by default, expensive models only for critical metrics
  5. Realistic Thresholds: Set thresholds based on actual agent performance
  6. Monitor: Run metrics on sample of tests first
  7. Iterate: Adjust thresholds and metrics based on results
  8. Migrate from Legacy: Replace deprecated Azure AI metrics with DeepEval equivalents

Next Steps