Evaluations Guide¶
This guide explains HoloDeck's evaluation system for measuring agent quality.
Overview¶
Evaluations measure how well your agent performs. You define metrics in agent.yaml to automatically grade agent responses against test cases.
HoloDeck supports three categories of metrics (in order of recommendation):
- DeepEval Metrics (Recommended) - LLM-as-a-judge with custom criteria (GEval) and RAG-specific metrics
- NLP Metrics (Standard) - Text comparison algorithms (F1, BLEU, ROUGE, METEOR)
- Legacy AI Metrics (Deprecated) - Azure AI-based metrics (groundedness, relevance, coherence, safety)
Basic Structure¶
evaluations:
  model:                     # Optional: Default LLM for evaluation
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0
  metrics:                   # Required: Metrics to compute
    # DeepEval GEval metric (recommended)
    - type: geval
      name: "Response Quality"
      criteria: "Evaluate if the response is helpful and accurate"
      threshold: 0.7
    # DeepEval RAG metric
    - type: rag
      metric_type: answer_relevancy
      threshold: 0.7
Configuration Levels¶
Model configuration for evaluations can be set at three levels, listed here from highest to lowest priority:
Level 1: Per-Metric Override (Highest Priority)¶
Override model for a specific metric:
evaluations:
  metrics:
    - type: geval
      name: "Critical Metric"
      criteria: "..."
      model:                 # Uses this model for this metric only
        provider: openai
        name: gpt-4
Level 2: Evaluation-Wide Model¶
Sets the default for all metrics that do not define a per-metric override:
evaluations:
  model:                     # Used for all metrics
    provider: ollama
    name: llama3.2:latest
  metrics:
    - type: geval
      name: "Coherence"
      criteria: "..."
      # Uses evaluations.model above
    - type: rag
      metric_type: faithfulness
      # Also uses evaluations.model above
Level 3: Agent Model (Lowest Priority)¶
Used if neither Level 1 nor Level 2 is specified:
model:                       # Agent's main model
  provider: openai
  name: gpt-4o

evaluations:
  metrics:
    - type: geval
      name: "Quality"
      criteria: "..."
      # Falls back to the agent's model above
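In other words, the model used for a given metric is resolved by a simple fallback chain. The sketch below is purely illustrative; the function name and dictionary shapes are hypothetical, not HoloDeck internals:

```python
# Hypothetical illustration of the precedence rules above; not HoloDeck's actual code.
def resolve_eval_model(metric_model, evaluation_model, agent_model):
    """Per-metric override wins, then the evaluation-wide model, then the agent model."""
    return metric_model or evaluation_model or agent_model

agent_model = {"provider": "openai", "name": "gpt-4o"}
eval_model = {"provider": "ollama", "name": "llama3.2:latest"}

print(resolve_eval_model(None, eval_model, agent_model))  # -> ollama (Level 2)
print(resolve_eval_model(None, None, agent_model))        # -> openai/gpt-4o (Level 3)
```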
DeepEval Metrics (Recommended)¶
DeepEval provides powerful LLM-as-a-judge evaluation with two metric types:
- GEval: Custom criteria evaluation using chain-of-thought prompting
- RAG Metrics: Specialized metrics for retrieval-augmented generation pipelines
Why DeepEval?¶
- Flexible: Define custom evaluation criteria in natural language
- Local Models: Works with Ollama for free, local evaluation
- RAG-Focused: Purpose-built metrics for RAG pipeline evaluation
- Chain-of-Thought: Uses G-Eval algorithm for more accurate scoring
Supported Providers¶
model:
  provider: ollama           # Free, local inference (recommended for development)
  # provider: openai         # OpenAI API
  # provider: anthropic      # Anthropic API
  # provider: azure_openai   # Azure OpenAI
  name: llama3.2:latest
  temperature: 0.0           # Use 0 for deterministic evaluation
GEval Metrics¶
GEval uses the G-Eval algorithm with chain-of-thought prompting to evaluate responses against custom criteria.
Basic Configuration¶
- type: geval
  name: "Coherence"
  criteria: "Evaluate whether the response is clear, well-structured, and easy to understand."
  threshold: 0.7
Full Configuration¶
- type: geval
  name: "Technical Accuracy"
  criteria: |
    Evaluate whether the response provides accurate technical information
    that correctly addresses the user's question.
  evaluation_steps:          # Optional: Auto-generated if omitted
    - "Check if the response directly addresses the user's question"
    - "Verify technical accuracy of any code or commands provided"
    - "Ensure explanations are correct and not misleading"
  evaluation_params:         # Which test case fields to use
    - actual_output          # Required: Agent's response
    - input                  # Optional: User's query
    - expected_output        # Optional: Ground truth
    - context                # Optional: Additional context
    - retrieval_context      # Optional: Retrieved documents
  threshold: 0.8
  strict_mode: false         # Binary scoring (1.0 or 0.0) when true
  enabled: true
  fail_on_error: false
  model:                     # Optional: Per-metric model override
    provider: openai
    name: gpt-4
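For orientation, the snippet below shows roughly what such a metric looks like when built directly with the deepeval library (assuming `deepeval` is installed and a judge model is configured for it); HoloDeck constructs the equivalent metric from the YAML above for you.

```python
# Rough, illustrative equivalent of the YAML above, built directly with deepeval.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

metric = GEval(
    name="Technical Accuracy",
    criteria=(
        "Evaluate whether the response provides accurate technical information "
        "that correctly addresses the user's question."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    threshold=0.8,
)

test_case = LLMTestCase(
    input="How do I reset my password?",
    actual_output="Open Settings > Security, choose 'Reset password', and follow the email link.",
)

metric.measure(test_case)            # LLM-as-a-judge scoring with chain-of-thought steps
print(metric.score, metric.reason)   # score in [0, 1] plus the judge's explanation
```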
GEval Configuration Options¶
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | Yes | Must be `"geval"` |
| `name` | string | Yes | Custom metric name (e.g., "Coherence", "Helpfulness") |
| `criteria` | string | Yes | Natural language evaluation criteria |
| `evaluation_steps` | list | No | Step-by-step evaluation instructions (auto-generated if omitted) |
| `evaluation_params` | list | No | Test case fields to use (default: `["actual_output"]`) |
| `threshold` | float | No | Minimum passing score (0-1) |
| `strict_mode` | bool | No | Binary scoring when true (default: false) |
| `enabled` | bool | No | Enable/disable metric (default: true) |
| `fail_on_error` | bool | No | Fail test on evaluation error (default: false) |
| `model` | object | No | Per-metric model override |
Evaluation Parameters¶
| Parameter | Description | When to Use |
|---|---|---|
| `actual_output` | Agent's response | Always (required for evaluation) |
| `input` | User's query/question | When relevance to the query matters |
| `expected_output` | Ground truth answer | When comparing to an expected response |
| `context` | Additional context provided | When evaluating context usage |
| `retrieval_context` | Retrieved documents | For RAG pipeline evaluation |
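These parameter names mirror the fields of deepeval's `LLMTestCase`. A small illustrative example, assuming the deepeval package (all values are made up):

```python
# Illustrative only: the five evaluation_params map onto deepeval's LLMTestCase fields.
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the refund window?",                                   # input
    actual_output="Refunds are accepted within 30 days of purchase.",     # actual_output
    expected_output="Customers can request a refund for 30 days.",        # expected_output
    context=["Customer support policy, 2024 edition."],                   # context
    retrieval_context=["Refund policy: 30 days from date of purchase."],  # retrieval_context
)
```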
GEval Examples¶
Coherence Check:
- type: geval
  name: "Coherence"
  criteria: "Evaluate whether the response is clear, well-structured, and easy to understand."
  evaluation_steps:
    - "Evaluate whether the response uses clear and direct language."
    - "Check if the explanation avoids jargon or explains it when used."
    - "Assess whether complex ideas are presented in a way that's easy to follow."
  evaluation_params:
    - actual_output
  threshold: 0.7
Helpfulness Check:
- type: geval
  name: "Helpfulness"
  criteria: |
    Evaluate whether the response provides actionable, practical help
    that addresses the user's needs.
  evaluation_params:
    - actual_output
    - input
  threshold: 0.75
Factual Accuracy:
- type: geval
  name: "Factual Accuracy"
  criteria: |
    Evaluate whether the response is factually accurate when compared
    to the expected answer and provided context.
  evaluation_params:
    - actual_output
    - expected_output
    - context
  threshold: 0.85
  strict_mode: true          # Binary pass/fail
RAG Metrics¶
RAG (Retrieval-Augmented Generation) metrics evaluate the quality of responses generated using retrieved context.
Available RAG Metrics¶
| Metric Type | Purpose | Required Parameters |
|---|---|---|
| `faithfulness` | Detects hallucinations | input, actual_output, retrieval_context |
| `answer_relevancy` | Response relevance to query | input, actual_output |
| `contextual_relevancy` | Retrieved chunks relevance | input, actual_output, retrieval_context |
| `contextual_precision` | Chunk ranking quality | input, actual_output, expected_output, retrieval_context |
| `contextual_recall` | Retrieval completeness | input, actual_output, expected_output, retrieval_context |
Basic Configuration¶
- type: rag
  metric_type: faithfulness
  threshold: 0.8
- type: rag
  metric_type: answer_relevancy
  threshold: 0.7
Full Configuration¶
- type: rag
  metric_type: faithfulness
  threshold: 0.8
  include_reason: true       # Include reasoning in results
  enabled: true
  fail_on_error: false
  model:                     # Optional: Per-metric model override
    provider: openai
    name: gpt-4
RAG Metric Details¶
Faithfulness - Detects hallucinations by comparing response to retrieval context:
- type: rag
  metric_type: faithfulness
  threshold: 0.8
  include_reason: true
Answer Relevancy - Measures response relevance to the query:
- type: rag
  metric_type: answer_relevancy
  threshold: 0.7
Contextual Relevancy - Measures relevance of retrieved chunks:
- type: rag
  metric_type: contextual_relevancy
  threshold: 0.75
Contextual Precision - Evaluates chunk ranking quality:
- type: rag
  metric_type: contextual_precision
  threshold: 0.8
Contextual Recall - Measures retrieval completeness:
- type: rag
  metric_type: contextual_recall
  threshold: 0.7
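As with GEval, these metrics map onto deepeval metric classes. A minimal faithfulness check is sketched below standalone for clarity (assumes `deepeval` is installed; HoloDeck builds the metric and test case from your configuration and tool results):

```python
# Minimal standalone faithfulness check using deepeval (illustrative).
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

metric = FaithfulnessMetric(threshold=0.8, include_reason=True)

test_case = LLMTestCase(
    input="How long is the hardware warranty?",
    actual_output="All hardware comes with a two-year warranty.",
    retrieval_context=["Hardware products include a two-year limited warranty."],
)

metric.measure(test_case)
print(metric.score, metric.reason)  # low scores flag claims unsupported by the retrieved context
```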
NLP Metrics (Standard)¶
NLP metrics compare the response text to the expected output using classical text-comparison algorithms. They're fast, free (no LLM calls), and deterministic.
F1 Score¶
Measures precision and recall of token overlap.
- type: standard
  metric: f1_score
  threshold: 0.8
Scale: 0.0-1.0 (higher is better)
What it measures:
- Token-level match with ground truth
- Balanced precision/recall
When to use: When exact word matching is important
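For intuition, token-level F1 can be computed in a few lines. The sketch below uses naive whitespace tokenization; HoloDeck's implementation may normalize text differently (casing, punctuation, stemming):

```python
# Naive token-level F1 for illustration; real implementations normalize more carefully.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("founded in 2010", "the company was founded in 2010"), 2))  # 0.67
```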
BLEU (Bilingual Evaluation Understudy)¶
Measures n-gram overlap with reference translation.
- type: standard
  metric: bleu
  threshold: 0.6
Scale: 0.0-1.0 (higher is better)
What it measures:
- N-gram similarity to reference
- Penalizes brevity
When to use: For translation, paraphrase evaluation
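For reference, a sentence-level BLEU score can be computed with NLTK as shown below (assumes `nltk` is installed; HoloDeck's internal scorer may differ in tokenization and smoothing):

```python
# Illustrative sentence-level BLEU with NLTK; smoothing avoids zero scores on short texts.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "you can reset the password from the settings page".split()
candidate = "the password can be reset from the settings page".split()

score = sentence_bleu(
    [reference],                                      # one or more tokenized references
    candidate,
    smoothing_function=SmoothingFunction().method1,
)
print(round(score, 3))
```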
ROUGE (Recall-Oriented Understudy for Gisting Evaluation)¶
Measures n-gram recall against a reference text.
- type: standard
  metric: rouge
  threshold: 0.7
Scale: 0.0-1.0 (higher is better)
What it measures:
- Recall of n-grams
- Coverage of reference content
When to use: For summarization tasks
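A ROUGE score can be computed with the `rouge-score` package, for example (assumes `pip install rouge-score`; HoloDeck may use a different ROUGE variant internally):

```python
# Illustrative ROUGE-1 / ROUGE-L computation with the rouge-score package.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(
    "The company was founded in 2010 in Berlin.",        # reference (expected output)
    "Founded in 2010, the company is based in Berlin.",  # candidate (agent response)
)
print(scores["rougeL"].fmeasure)
```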
METEOR (Metric for Evaluation of Translation with Explicit Ordering)¶
Similar to BLEU but with better handling of synonyms.
- type: standard
  metric: meteor
  threshold: 0.65
Scale: 0.0-1.0 (higher is better)
What it measures:
- N-gram match with synonyms
- Word order
When to use: For translation, paraphrase with synonyms
Legacy AI Metrics (Deprecated)¶
DEPRECATED: Azure AI-based metrics (groundedness, relevance, coherence, safety) are deprecated and will be removed in a future version.
Migration Guide:
- For custom criteria evaluation → use `type: geval` with natural language criteria
- For RAG hallucination detection → use `type: rag` with `metric_type: faithfulness`
- For relevance checking → use `type: rag` with `metric_type: answer_relevancy`
These legacy metrics are still supported for backwards compatibility:
Groundedness (Deprecated)¶
# DEPRECATED - Use type: rag with metric_type: faithfulness instead
- type: standard
  metric: groundedness
  threshold: 0.8
Migration: Replace with:
- type: rag
  metric_type: faithfulness
  threshold: 0.8
Relevance (Deprecated)¶
# DEPRECATED - Use type: rag with metric_type: answer_relevancy instead
- type: standard
  metric: relevance
  threshold: 0.75
Migration: Replace with:
- type: rag
  metric_type: answer_relevancy
  threshold: 0.75
Coherence (Deprecated)¶
# DEPRECATED - Use type: geval with custom criteria instead
- type: standard
  metric: coherence
  threshold: 0.7
Migration: Replace with:
- type: geval
  name: "Coherence"
  criteria: "Evaluate whether the response is clear, well-structured, and easy to understand."
  threshold: 0.7
Safety (Deprecated)¶
# DEPRECATED - Use type: geval with custom criteria instead
- type: standard
  metric: safety
  threshold: 0.9
Migration: Replace with:
- type: geval
  name: "Safety"
  criteria: |
    Evaluate whether the response is safe and appropriate:
    - No harmful or dangerous content
    - No PII leakage
    - Appropriate tone and language
  threshold: 0.9
Metric Configuration Options¶
Threshold¶
- Type: Float
- Purpose: Minimum score for test to pass
- Scale: 0-1 for all metrics
- Optional: Yes (default: no threshold, metric is informational)
- type: geval
  name: "Quality"
  criteria: "..."
  threshold: 0.8
Enabled¶
- Type: Boolean
- Default: `true`
- Purpose: Temporarily disable a metric without removing it
- type: rag
  metric_type: answer_relevancy
  enabled: false             # Metric is skipped without deleting its config
Fail on Error¶
- Type: Boolean
- Default: `false` (soft failure)
- Purpose: Whether to fail the test if evaluation errors occur
- type: geval
  name: "Quality"
  criteria: "..."
  fail_on_error: false       # Continues even if LLM evaluation fails
Complete Examples¶
Basic DeepEval Setup¶
evaluations:
  model:
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0
  metrics:
    - type: geval
      name: "Coherence"
      criteria: "Evaluate whether the response is clear and well-structured."
      threshold: 0.7
    - type: rag
      metric_type: answer_relevancy
      threshold: 0.7
RAG Pipeline Evaluation¶
evaluations:
  model:
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0
  metrics:
    # Detect hallucinations
    - type: rag
      metric_type: faithfulness
      threshold: 0.85
      include_reason: true
    # Check response relevance
    - type: rag
      metric_type: answer_relevancy
      threshold: 0.75
    # Evaluate retrieval quality
    - type: rag
      metric_type: contextual_relevancy
      threshold: 0.7
    # Check retrieval completeness
    - type: rag
      metric_type: contextual_recall
      threshold: 0.7
Mixed Metrics (DeepEval + NLP)¶
evaluations:
  model:
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0
  metrics:
    # DeepEval metrics (primary)
    - type: geval
      name: "Response Quality"
      criteria: "Evaluate if the response is helpful and accurate."
      threshold: 0.75
    - type: rag
      metric_type: faithfulness
      threshold: 0.8
    # NLP metrics (secondary)
    - type: standard
      metric: f1_score
      threshold: 0.7
    - type: standard
      metric: rouge
      threshold: 0.6
Enterprise Setup with Model Overrides¶
evaluations:
  model:
    provider: ollama
    name: llama3.2:latest    # Default: free, local
    temperature: 0.0
  metrics:
    # Critical metrics - use a powerful model
    - type: rag
      metric_type: faithfulness
      threshold: 0.9
      model:
        provider: openai
        name: gpt-4
    - type: geval
      name: "Safety"
      criteria: "Evaluate response safety and appropriateness."
      threshold: 0.95
      model:
        provider: openai
        name: gpt-4
    # Standard metrics - use the default local model
    - type: geval
      name: "Coherence"
      criteria: "Evaluate response clarity."
      threshold: 0.75
    - type: rag
      metric_type: answer_relevancy
      threshold: 0.7
    # NLP metrics - no LLM needed
    - type: standard
      metric: f1_score
      threshold: 0.7
Per-Test Case Evaluation¶
Test cases can specify which metrics to run:
test_cases:
  - name: "Fact check test"
    input: "What's our company's founding date?"
    expected_tools: [search_kb]
    ground_truth: "Founded in 2010"
    evaluations:
      - type: rag
        metric_type: faithfulness
        threshold: 0.85
      - type: rag
        metric_type: answer_relevancy
        threshold: 0.7
  - name: "Creative task"
    input: "Generate a company tagline"
    evaluations:
      - type: geval
        name: "Creativity"
        criteria: "Evaluate if the tagline is creative and memorable."
        threshold: 0.7
      # Skip faithfulness since there is no retrieval context
Test Execution¶
When running tests, HoloDeck performs the following steps (sketched in pseudocode after the list):
- Executes agent with test input
- Records which tools were called
- Validates tool usage (`expected_tools`)
- Runs each enabled metric
- Compares results against thresholds
- Reports pass/fail per metric
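The pseudocode below illustrates those steps; all names are hypothetical and not part of HoloDeck's API:

```python
# Hypothetical pseudocode of the test flow above; not HoloDeck's actual implementation.
def run_test(agent, test_case, metrics):
    response, tools_called = agent.run(test_case.input)

    # Tool validation: every expected tool must have been called.
    tools_ok = set(test_case.expected_tools) <= set(tools_called)

    results = []
    for metric in (m for m in metrics if m.enabled):
        score = metric.evaluate(test_case, response)
        passed = metric.threshold is None or score >= metric.threshold
        results.append((metric.name, score, passed))

    return tools_ok and all(p for _, _, p in results), results
```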
Example Output¶
Test: "Password Reset"
Input: "How do I reset my password?"
Tools called: [search_kb] ✓
Metrics:
✓ Faithfulness: 0.92 (threshold: 0.8)
✓ Answer Relevancy: 0.88 (threshold: 0.75)
✓ Coherence: 0.85 (threshold: 0.7)
✓ F1 Score: 0.81 (threshold: 0.7)
Result: PASS
Cost Optimization¶
Use Local Models for Development¶
evaluations:
  model:
    provider: ollama         # Free, local
    name: llama3.2:latest
Use Paid Models Only for Critical Metrics¶
evaluations:
  model:
    provider: ollama         # Default: free
    name: llama3.2:latest
  metrics:
    - type: rag
      metric_type: faithfulness
      model:
        provider: openai
        name: gpt-4          # Expensive override only for critical metric
Use NLP Metrics When Possible¶
NLP metrics are free (no LLM calls):
- type: standard
  metric: f1_score           # No LLM cost
- type: standard
  metric: rouge              # No LLM cost
Model Configuration Details¶
When specifying a model for evaluation:
model:
  provider: ollama|openai|azure_openai|anthropic   # Required
  name: model-identifier     # Required
  temperature: 0.0-2.0       # Optional (recommend 0.0 for evaluation)
  max_tokens: integer        # Optional
  top_p: 0.0-1.0             # Optional
Provider-Specific Models¶
Ollama (Recommended for Development)
- llama3.2:latest - Fast, capable
- llama3.1:latest - More capable
- Any model available in your Ollama installation
OpenAI
- gpt-4o - Latest, best quality
- gpt-4o-mini - Fast, cheap
- gpt-4-turbo - Previous generation
Azure OpenAI
- gpt-4 - Standard
- gpt-4-32k - Extended context
Anthropic
- claude-3-opus - Most capable
- claude-3-sonnet - Balanced
- claude-3-haiku - Fast, cheap
Recommended Settings for Evaluation¶
model:
  provider: ollama
  name: llama3.2:latest
  temperature: 0.0           # Deterministic for consistency
Troubleshooting¶
Error: "invalid metric type"¶
- Check metric type is valid
- Valid types: `geval`, `rag`, `standard`
- For standard metrics: `f1_score`, `bleu`, `rouge`, `meteor`
- For RAG metrics: `faithfulness`, `answer_relevancy`, `contextual_relevancy`, `contextual_precision`, `contextual_recall`
Metric always fails¶
- Check evaluation model is working
- Try without threshold first
- Test evaluation model manually
- For RAG metrics, ensure required parameters are available
LLM evaluation too slow¶
- Use local Ollama model instead of API
- Use a faster model: `gpt-4o-mini` instead of `gpt-4`
- Use NLP metrics instead (free and fast)
Inconsistent evaluation results¶
- Set temperature to 0.0 for deterministic results
- Use more powerful model for complex evaluations
- Add `evaluation_steps` to GEval for better consistency
RAG metric missing retrieval_context¶
- Ensure your agent uses a vectorstore tool
- The test runner automatically extracts retrieval_context from tool results
- Or provide manual retrieval_context in test case
Best Practices¶
- Start with DeepEval: Use GEval and RAG metrics as primary evaluation
- Use Local Models: Start with Ollama for free development, upgrade for production
- Mix Metric Types: Combine DeepEval (semantic) with NLP (keyword-based)
- Cost-Aware: Use cheaper/local models by default, expensive models only for critical metrics
- Realistic Thresholds: Set thresholds based on actual agent performance
- Monitor: Run metrics on a sample of tests first
- Iterate: Adjust thresholds and metrics based on results
- Migrate from Legacy: Replace deprecated Azure AI metrics with DeepEval equivalents
Next Steps¶
- See Agent Configuration Guide for how to set up evaluations
- See Examples for complete evaluation configurations
- See Global Configuration for shared settings
- See API Reference for evaluator class details