Evaluation Framework API¶
HoloDeck provides a flexible evaluation framework for measuring agent response quality. The framework supports three tiers of metrics:
- DeepEval Metrics (Recommended) - LLM-as-a-judge with GEval and RAG metrics
- NLP Metrics (Standard) - Algorithmic text comparison
- Legacy AI Metrics (Deprecated) - Azure AI-based metrics
Evaluation Configuration Models¶
EvaluationConfig
¶
Bases: BaseModel
Evaluation framework configuration.
Container for evaluation metrics with optional default model configuration. Supports standard EvaluationMetric, GEvalMetric (custom criteria), and RAGMetric (RAG pipeline evaluation).
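For orientation, here is a minimal sketch that assembles an EvaluationConfig holding one metric from each tier. The import path holodeck.models.evaluation and the location of the RAGMetricType enum are assumptions based on the source file referenced below.
from holodeck.models.evaluation import (  # import path assumed
    EvaluationConfig,
    EvaluationMetric,
    GEvalMetric,
    RAGMetric,
    RAGMetricType,
)

config = EvaluationConfig(
    metrics=[
        # LLM-as-a-judge with custom criteria
        GEvalMetric(
            name="Professionalism",
            criteria="Evaluate if the response uses professional language",
            threshold=0.7,
        ),
        # RAG pipeline metric
        RAGMetric(metric_type=RAGMetricType.FAITHFULNESS, threshold=0.8),
        # Algorithmic NLP comparison
        EvaluationMetric(metric="f1_score", threshold=0.8),
    ]
)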
validate_metrics(v)
classmethod
¶
Validate metrics list is not empty.
Source code in src/holodeck/models/evaluation.py, lines 314-322
MetricType = Annotated[EvaluationMetric | GEvalMetric | RAGMetric, Field(discriminator='type')]
module-attribute
¶
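Because MetricType is discriminated on the type field, a plain dict (for example, a parsed YAML metric entry) validates into the matching model class. A brief sketch using Pydantic v2's TypeAdapter; the import path is an assumption.
from pydantic import TypeAdapter
from holodeck.models.evaluation import MetricType  # import path assumed

adapter = TypeAdapter(MetricType)
geval = adapter.validate_python(
    {"type": "geval", "name": "Coherence", "criteria": "Is the answer clear?", "threshold": 0.7}
)
rag = adapter.validate_python({"type": "rag", "metric_type": "faithfulness", "threshold": 0.8})
print(type(geval).__name__)  # GEvalMetric
print(type(rag).__name__)    # RAGMetric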
GEvalMetric
¶
Bases: BaseModel
G-Eval custom criteria metric configuration.
Uses discriminator pattern with type="geval" to distinguish from standard EvaluationMetric instances in a discriminated union.
G-Eval enables custom evaluation criteria defined in natural language, using chain-of-thought prompting with LLM-based scoring.
Example
metric = GEvalMetric(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7
)
validate_criteria(v)
classmethod
¶
Validate criteria is not empty.
Source code in src/holodeck/models/evaluation.py, lines 198-204
validate_evaluation_params(v)
classmethod
¶
Validate evaluation_params contains valid values.
Source code in src/holodeck/models/evaluation.py, lines 206-218
validate_name(v)
classmethod
¶
Validate name is not empty.
Source code in src/holodeck/models/evaluation.py, lines 190-196
validate_threshold(v)
classmethod
¶
Validate threshold is in valid range.
Source code in src/holodeck/models/evaluation.py, lines 220-226
RAGMetric
¶
Bases: BaseModel
RAG pipeline evaluation metric configuration.
Uses discriminator pattern with type="rag" to distinguish from standard EvaluationMetric and GEvalMetric instances in a discriminated union.
RAG metrics evaluate the quality of retrieval-augmented generation pipelines:
- Faithfulness: Detects hallucinations by comparing response to context
- AnswerRelevancy: Measures how relevant response statements are to the query
- ContextualRelevancy: Measures relevance of retrieved chunks to query
- ContextualPrecision: Evaluates ranking quality of retrieved chunks
- ContextualRecall: Measures retrieval completeness against expected output
Example
metric = RAGMetric(
    metric_type=RAGMetricType.FAITHFULNESS,
    threshold=0.8
)
validate_threshold(v)
classmethod
¶
Validate threshold is in valid range.
Source code in src/holodeck/models/evaluation.py, lines 281-287
EvaluationMetric
¶
Bases: BaseModel
Evaluation metric configuration.
Represents a single evaluation metric with flexible model configuration, including per-metric LLM model overrides.
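A minimal sketch of a standard metric entry follows; the field names shown are inferred from the validators documented below and from the YAML examples later on this page, so treat them as illustrative rather than an exhaustive schema.
from holodeck.models.evaluation import EvaluationMetric  # import path assumed

metric = EvaluationMetric(
    metric="f1_score",      # must be non-empty (validate_metric)
    threshold=0.8,          # numeric if provided (validate_threshold)
    enabled=True,           # boolean (validate_enabled)
    fail_on_error=False,    # boolean (validate_fail_on_error)
    timeout_ms=30000,       # positive (validate_timeout_ms)
)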
validate_custom_prompt(v)
classmethod
¶
Validate custom_prompt is not empty if provided.
Source code in src/holodeck/models/evaluation.py, lines 121-127
validate_enabled(v)
classmethod
¶
Validate enabled is boolean.
Source code in src/holodeck/models/evaluation.py, lines 81-87
validate_fail_on_error(v)
classmethod
¶
Validate fail_on_error is boolean.
Source code in src/holodeck/models/evaluation.py, lines 89-95
validate_metric(v)
classmethod
¶
Validate metric is not empty.
Source code in src/holodeck/models/evaluation.py, lines 65-71
validate_retry_on_failure(v)
classmethod
¶
Validate retry_on_failure is in valid range.
Source code in src/holodeck/models/evaluation.py, lines 97-103
validate_scale(v)
classmethod
¶
Validate scale is positive.
Source code in src/holodeck/models/evaluation.py, lines 113-119
validate_threshold(v)
classmethod
¶
Validate threshold is numeric if provided.
Source code in src/holodeck/models/evaluation.py, lines 73-79
validate_timeout_ms(v)
classmethod
¶
Validate timeout_ms is positive.
Source code in src/holodeck/models/evaluation.py, lines 105-111
DeepEval Evaluators (Recommended)¶
These evaluators provide LLM-as-a-judge scoring built on the DeepEval library.
Base Classes¶
DeepEvalBaseEvaluator(model_config=None, threshold=0.5, timeout=60.0, retry_config=None)
¶
Bases: BaseEvaluator
Abstract base class for DeepEval-based evaluators.
This class extends BaseEvaluator to provide DeepEval-specific functionality:
- Model configuration and initialization
- LLMTestCase construction from evaluation inputs
- Result normalization and logging
Subclasses must implement _create_metric() to return the specific DeepEval metric instance.
Note: DeepEval uses different parameter names than Azure AI/NLP:
- input (not query)
- actual_output (not response)
- expected_output (not ground_truth)
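In practice this means mapping HoloDeck-style names onto DeepEval-style names before calling evaluate() on a DeepEval evaluator. A hedged sketch of that mapping (the evaluator instance comes from the Example below):
# query -> input, response -> actual_output, ground_truth -> expected_output
holodeck_output = {
    "query": "What is Python?",
    "response": "Python is a programming language.",
    "ground_truth": "Python is a programming language.",
}
result = await evaluator.evaluate(
    input=holodeck_output["query"],
    actual_output=holodeck_output["response"],
    expected_output=holodeck_output["ground_truth"],
)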
Attributes:
| Name | Description |
|---|---|
| `model_config` | Configuration for the evaluation LLM |
| `threshold` | Score threshold for pass/fail determination |
| `model` | The initialized DeepEval model instance |
Example
class MyMetricEvaluator(DeepEvalBaseEvaluator):
    def _create_metric(self):
        return SomeDeepEvalMetric(
            threshold=self._threshold,
            model=self._model
        )

evaluator = MyMetricEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is Python?",
    actual_output="Python is a programming language."
)
Initialize DeepEval base evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | Configuration for the evaluation model. Defaults to Ollama with gpt-oss:20b. | `None` |
| `threshold` | `float` | Score threshold for pass/fail (0.0-1.0, default: 0.5) | `0.5` |
| `timeout` | `float \| None` | Evaluation timeout in seconds (default: 60.0) | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/base.py, lines 70-96
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
DeepEvalModelConfig
¶
Bases: BaseModel
Configuration adapter for DeepEval model classes.
This class bridges HoloDeck's LLMProvider configuration to DeepEval's native model classes (GPTModel, AzureOpenAIModel, AnthropicModel, OllamaModel).
The default configuration uses Ollama with gpt-oss:20b for local evaluation without requiring API keys.
Attributes:
| Name | Type | Description |
|---|---|---|
| `provider` | `ProviderEnum` | LLM provider to use (defaults to Ollama) |
| `model_name` | `str` | Name of the model (defaults to gpt-oss:20b) |
| `api_key` | `str \| None` | API key for cloud providers (not required for Ollama) |
| `endpoint` | `str \| None` | API endpoint URL (required for Azure, optional for Ollama) |
| `api_version` | `str \| None` | Azure OpenAI API version |
| `deployment_name` | `str \| None` | Azure OpenAI deployment name |
| `temperature` | `float` | Temperature for generation (defaults to 0.0 for determinism) |
API Key Behavior
- OpenAI: API key can be provided via the `api_key` field or the `OPENAI_API_KEY` environment variable. If neither is set, DeepEval's GPTModel will raise an error at runtime.
- Anthropic: API key can be provided via the `api_key` field or the `ANTHROPIC_API_KEY` environment variable. If neither is set, DeepEval's AnthropicModel will raise an error at runtime.
- Azure OpenAI: The `api_key` field is required and validated at configuration time (no environment variable fallback).
- Ollama: No API key required (local inference).
Example
config = DeepEvalModelConfig()  # Default Ollama
model = config.to_deepeval_model()

openai_config = DeepEvalModelConfig(
    provider=ProviderEnum.OPENAI,
    model_name="gpt-4o",
    api_key="sk-..."  # Or set OPENAI_API_KEY env var
)
to_deepeval_model()
¶
Convert configuration to native DeepEval model class.
Returns the appropriate DeepEval model class instance based on the configured provider.
Returns:
| Type | Description |
|---|---|
| `DeepEvalModel` | DeepEval model instance (GPTModel, AzureOpenAIModel, AnthropicModel, or OllamaModel) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If provider is not supported |
Source code in src/holodeck/lib/evaluators/deepeval/config.py, lines 111-168
validate_provider_requirements()
¶
Validate that required fields are present for each provider.
Raises:
| Type | Description |
|---|---|
| `ValueError` | If required fields are missing for the provider |
Source code in src/holodeck/lib/evaluators/deepeval/config.py, lines 93-109
GEval Evaluator¶
The GEval evaluator uses the G-Eval algorithm with chain-of-thought prompting for custom criteria evaluation.
GEvalEvaluator(name, criteria, evaluation_params=None, evaluation_steps=None, model_config=None, threshold=0.5, strict_mode=False, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
G-Eval custom criteria evaluator.
Evaluates LLM outputs against user-defined criteria using the G-Eval algorithm, which combines chain-of-thought prompting with token probability scoring.
G-Eval works in two phases (see the sketch below):
1. Step Generation: Auto-generates evaluation steps from the criteria
2. Scoring: Uses the steps to score the test case on a 1-5 scale (normalized to 0-1)
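To make the two phases concrete, the sketch below contrasts auto-generated steps (criteria only) with explicit evaluation_steps, which skip step generation; constructor arguments are as documented in this section.
from holodeck.lib.evaluators.deepeval import GEvalEvaluator

# Phase 1 performed by G-Eval: steps are auto-generated from the criteria
auto = GEvalEvaluator(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct.",
)

# Explicit steps: step generation is skipped and scoring follows these steps
explicit = GEvalEvaluator(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct.",
    evaluation_steps=[
        "Identify the factual claims in the actual output.",
        "Check each claim against the expected output.",
    ],
    evaluation_params=["actual_output", "expected_output"],
)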
Attributes:
| Name | Description |
|---|---|
| `_metric_name` | Custom name for this evaluation metric |
| `_criteria` | Natural language criteria for evaluation |
| `_evaluation_params` | Test case fields to include in evaluation |
| `_evaluation_steps` | Optional explicit evaluation steps |
| `_strict_mode` | Whether to use binary scoring (1.0 or 0.0) |
Example
evaluator = GEvalEvaluator(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7
)
result = await evaluator.evaluate(
    input="Write me an email",
    actual_output="Dear Sir/Madam, ..."
)
print(result["score"])   # 0.85
print(result["passed"])  # True
Initialize G-Eval evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Metric identifier (e.g., "Correctness", "Helpfulness") | required |
| `criteria` | `str` | Natural language evaluation criteria | required |
| `evaluation_params` | `list[str] \| None` | Test case fields to include in evaluation. Valid options: `["input", "actual_output", "expected_output", "context", "retrieval_context"]`. Default: `["actual_output"]` | `None` |
| `evaluation_steps` | `list[str] \| None` | Explicit evaluation steps. If None, G-Eval auto-generates steps from the criteria. | `None` |
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `strict_mode` | `bool` | If True, scores are binary (1.0 or 0.0). Default: False. | `False` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If invalid evaluation_params are provided. |
Source code in src/holodeck/lib/evaluators/deepeval/geval.py, lines 63-125
name
property
¶
Return the custom metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
RAG Evaluators¶
RAG evaluators measure retrieval-augmented generation pipeline quality.
FaithfulnessEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Faithfulness evaluator for detecting hallucinations.
Detects hallucinations by comparing agent response to retrieval context. Returns a low score if the response contains information not found in the retrieval context (hallucination detected).
Required inputs
- input: User query
- actual_output: Agent response
- retrieval_context: List of retrieved text chunks
Example
evaluator = FaithfulnessEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="What are the store hours?",
    actual_output="Store is open 24/7.",
    retrieval_context=["Store hours: Mon-Fri 9am-5pm"]
)
print(result["score"])  # Low score (hallucination detected)
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
Initialize Faithfulness evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/faithfulness.py, lines 53-84
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
AnswerRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, strict_mode=False, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Answer Relevancy evaluator - measures statement relevance to input.
Evaluates how relevant the response statements are to the input query. Unlike other RAG metrics, this does NOT require retrieval_context.
Required inputs
- input: User query
- actual_output: Agent response
Example
evaluator = AnswerRelevancyEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is the return policy?",
    actual_output="We offer a 30-day full refund at no extra cost."
)
print(result["score"])  # High score if relevant
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
| `_strict_mode` | Whether to use binary scoring (1.0 or 0.0). |
Initialize Answer Relevancy evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `strict_mode` | `bool` | Binary scoring mode (1.0 or 0.0 only). Default: False. | `False` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/answer_relevancy.py, lines 49-84
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
ContextualRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Relevancy evaluator for RAG pipelines.
Measures the relevance of retrieved context to the user query. Returns the proportion of chunks that are relevant to the query.
Required inputs
- input: User query
- actual_output: Agent response
- retrieval_context: List of retrieved text chunks
Example
evaluator = ContextualRelevancyEvaluator(threshold=0.6)
result = await evaluator.evaluate(
    input="What is the pricing?",
    actual_output="Basic plan is $10/month.",
    retrieval_context=[
        "Pricing: Basic $10, Pro $25",  # Relevant
        "Company founded in 2020",      # Irrelevant
    ]
)
print(result["score"])  # 0.5 (1 of 2 chunks relevant)
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
Initialize Contextual Relevancy evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_relevancy.py, lines 55-86
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
ContextualPrecisionEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Precision evaluator for RAG pipelines.
Evaluates the ranking quality of retrieved chunks. Measures whether relevant chunks appear before irrelevant ones.
Required inputs
- input: User query
- actual_output: Agent response
- expected_output: Ground truth answer
- retrieval_context: List of retrieved text chunks (order matters)
Example
evaluator = ContextualPrecisionEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is X?",
    actual_output="X is...",
    expected_output="X is the correct definition.",
    retrieval_context=[
        "Irrelevant info",      # Bad: irrelevant first
        "X is the definition",  # Good: relevant
    ]
)
print(result["score"])  # Lower due to poor ranking
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
Initialize Contextual Precision evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_precision.py, lines 61-92
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
ContextualRecallEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Recall evaluator for RAG pipelines.
Measures retrieval completeness against expected output. Evaluates whether retrieval context contains all facts needed to produce the expected output.
Required inputs
- input: User query
- actual_output: Agent response
- expected_output: Ground truth answer
- retrieval_context: List of retrieved text chunks
Example
evaluator = ContextualRecallEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="List all features",
    actual_output="Features are A and B",
    expected_output="Features are A, B, and C",
    retrieval_context=["Feature A: ...", "Feature B: ..."]
)
print(result["score"])  # ~0.67 (missing Feature C)
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
Initialize Contextual Recall evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_recall.py, lines 59-90
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
Usage Examples¶
DeepEval GEval Metrics¶
from holodeck.lib.evaluators.deepeval import GEvalEvaluator, DeepEvalModelConfig
from holodeck.models.llm import LLMProvider
# Configure model
model_config = DeepEvalModelConfig(
provider=LLMProvider.OLLAMA,
name="llama3.2:latest",
temperature=0.0
)
# Create evaluator with custom criteria
evaluator = GEvalEvaluator(
name="Coherence",
criteria="Evaluate whether the response is clear and well-structured.",
evaluation_steps=[
"Check if the response uses clear language.",
"Assess if the explanation is easy to follow."
],
evaluation_params=["actual_output"],
model_config=model_config,
threshold=0.7
)
# Evaluate
result = await evaluator.evaluate(
actual_output="The password can be reset by clicking 'Forgot Password' on the login page.",
input="How do I reset my password?"
)
print(f"Score: {result.score}")
print(f"Passed: {result.passed}")
print(f"Reason: {result.reason}")
DeepEval RAG Metrics¶
from holodeck.lib.evaluators.deepeval import (
FaithfulnessEvaluator,
AnswerRelevancyEvaluator,
DeepEvalModelConfig
)
from holodeck.models.llm import LLMProvider
# Configure model
model_config = DeepEvalModelConfig(
provider=LLMProvider.OLLAMA,
name="llama3.2:latest",
temperature=0.0
)
# Faithfulness - detect hallucinations
faithfulness = FaithfulnessEvaluator(
model_config=model_config,
threshold=0.8,
include_reason=True
)
result = await faithfulness.evaluate(
input="What is our return policy?",
actual_output="You can return items within 30 days for a full refund.",
retrieval_context=[
"Our return policy allows returns within 30 days of purchase.",
"Full refunds are provided for items in original condition."
]
)
# Answer Relevancy - check response addresses query
relevancy = AnswerRelevancyEvaluator(
model_config=model_config,
threshold=0.7
)
result = await relevancy.evaluate(
input="How do I reset my password?",
actual_output="Click 'Forgot Password' on the login page and follow the email instructions."
)
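Since every evaluator exposes an async evaluate(), independent metrics can be scored concurrently with standard asyncio. A minimal sketch reusing the evaluators above:
import asyncio

async def evaluate_both():
    # Run both judgments concurrently against the same test output
    return await asyncio.gather(
        faithfulness.evaluate(
            input="What is our return policy?",
            actual_output="You can return items within 30 days for a full refund.",
            retrieval_context=["Our return policy allows returns within 30 days of purchase."],
        ),
        relevancy.evaluate(
            input="What is our return policy?",
            actual_output="You can return items within 30 days for a full refund.",
        ),
    )

faithfulness_result, relevancy_result = asyncio.run(evaluate_both())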
NLP Metrics¶
from holodeck.lib.evaluators.nlp_metrics import compute_f1_score, compute_rouge
# Compute F1 score
prediction = "the cat is on the mat"
reference = "a cat is on the mat"
f1 = compute_f1_score(prediction, reference)
print(f"F1 Score: {f1}")
# Compute ROUGE scores
scores = compute_rouge(prediction, reference)
print(f"ROUGE-1: {scores['rouge1']}")
print(f"ROUGE-2: {scores['rouge2']}")
print(f"ROUGE-L: {scores['rougeL']}")
Metric Configuration in YAML¶
DeepEval GEval Metric¶
evaluations:
model:
provider: ollama
name: llama3.2:latest
temperature: 0.0
metrics:
- type: geval
name: "Coherence"
criteria: "Evaluate whether the response is clear and well-structured."
evaluation_steps:
- "Check if the response uses clear language."
- "Assess if the explanation is easy to follow."
evaluation_params:
- actual_output
- input
threshold: 0.7
enabled: true
fail_on_error: false
DeepEval RAG Metrics¶
evaluations:
model:
provider: ollama
name: llama3.2:latest
temperature: 0.0
metrics:
# Faithfulness - hallucination detection
- type: rag
metric_type: faithfulness
threshold: 0.8
include_reason: true
# Answer Relevancy
- type: rag
metric_type: answer_relevancy
threshold: 0.7
# Contextual Relevancy
- type: rag
metric_type: contextual_relevancy
threshold: 0.75
# Contextual Precision
- type: rag
metric_type: contextual_precision
threshold: 0.8
# Contextual Recall
- type: rag
metric_type: contextual_recall
threshold: 0.7
NLP Metrics¶
evaluations:
metrics:
- type: standard
metric: f1_score
threshold: 0.8
- type: standard
metric: bleu
threshold: 0.6
- type: standard
metric: rouge
threshold: 0.7
- type: standard
metric: meteor
threshold: 0.65
Per-Metric Model Override¶
evaluations:
model:
provider: ollama
name: llama3.2:latest # Default: free, local
metrics:
- type: rag
metric_type: faithfulness
threshold: 0.9
model: # Override for critical metric
provider: openai
name: gpt-4
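The equivalent override in Python is to hand the critical evaluator its own model_config while the other evaluators keep the default. This sketch mirrors the imports used in the usage examples above; the LLMProvider.OPENAI member is assumed from the provider: openai value in the YAML, and an OpenAI API key is required.
from holodeck.lib.evaluators.deepeval import FaithfulnessEvaluator, DeepEvalModelConfig
from holodeck.models.llm import LLMProvider

# Stronger judge for the critical metric only; other metrics keep the default Ollama model
openai_judge = DeepEvalModelConfig(
    provider=LLMProvider.OPENAI,  # member name assumed
    model_name="gpt-4",
    api_key="sk-..."  # or set OPENAI_API_KEY
)

faithfulness = FaithfulnessEvaluator(model_config=openai_judge, threshold=0.9)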
Legacy AI Metrics (Deprecated)¶
DEPRECATED: Azure AI-based metrics are deprecated and will be removed in a future version. Migrate to DeepEval metrics for better flexibility and local model support.
Migration Guide¶
| Legacy Metric | Recommended Replacement |
|---|---|
| `groundedness` | `type: rag`, `metric_type: faithfulness` |
| `relevance` | `type: rag`, `metric_type: answer_relevancy` |
| `coherence` | `type: geval` with custom criteria |
| `safety` | `type: geval` with custom criteria |
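As a code-level sketch of the first row of this table, the deprecated groundedness check maps onto FaithfulnessEvaluator, with the grounding text supplied as retrieval_context (parameter names as documented above):
from holodeck.lib.evaluators.deepeval import FaithfulnessEvaluator

# Replaces AzureAIEvaluator.evaluate_groundedness(response=..., context=...)
evaluator = FaithfulnessEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France",
    retrieval_context=["France's capital city is known for the Eiffel Tower"],
)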
Legacy Usage (Not Recommended)¶
# DEPRECATED - Use DeepEval evaluators instead
from holodeck.lib.evaluators.azure_ai import AzureAIEvaluator
evaluator = AzureAIEvaluator(model="gpt-4", api_key="your-key")
result = await evaluator.evaluate_groundedness(
response="Paris is the capital of France",
context="France's capital city is known for the Eiffel Tower",
)
Legacy YAML Configuration (Not Recommended)¶
# DEPRECATED - Use type: geval or type: rag instead
evaluations:
metrics:
- type: standard
metric: groundedness # Deprecated
threshold: 0.8
- type: standard
metric: relevance # Deprecated
threshold: 0.75
- type: standard
metric: coherence # Deprecated
threshold: 0.7
- type: standard
metric: safety # Deprecated
threshold: 0.9
Integration with Test Runner¶
The test runner automatically:
- Loads evaluation configuration from agent YAML
- Creates appropriate evaluators based on metric type
- Invokes evaluators on test outputs
- Extracts retrieval_context from tool results (for RAG metrics)
- Collects metric scores
- Compares against thresholds
- Includes results in test report
Test Runner Evaluator Creation¶
# Internal test runner logic (simplified)
def _create_evaluators(self, metrics: list[MetricType]) -> dict:
evaluators = {}
for metric in metrics:
if metric.type == "geval":
evaluators[metric.name] = GEvalEvaluator(
name=metric.name,
criteria=metric.criteria,
evaluation_steps=metric.evaluation_steps,
evaluation_params=metric.evaluation_params,
model_config=self._get_model_config(metric),
threshold=metric.threshold,
strict_mode=metric.strict_mode
)
elif metric.type == "rag":
evaluator_class = RAG_EVALUATOR_MAP[metric.metric_type]
evaluators[metric.metric_type] = evaluator_class(
model_config=self._get_model_config(metric),
threshold=metric.threshold,
include_reason=metric.include_reason
)
elif metric.type == "standard":
# NLP or legacy metrics
evaluators[metric.metric] = self._create_standard_evaluator(metric)
return evaluators
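After creation, the runner invokes each evaluator against the test output, passing retrieval_context extracted from tool results for RAG metrics. The sketch below is illustrative only; the method and attribute names are not the runner's actual API.
# Illustrative invocation loop (method and attribute names hypothetical)
async def _run_evaluators(self, evaluators: dict, output) -> dict:
    results = {}
    for name, evaluator in evaluators.items():
        results[name] = await evaluator.evaluate(
            input=output.query,
            actual_output=output.response,
            expected_output=output.ground_truth,
            retrieval_context=output.retrieval_context,  # chunks taken from tool results
        )
    return results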
Error Handling¶
DeepEval Errors¶
from holodeck.lib.evaluators.deepeval.errors import (
DeepEvalError,
ProviderNotSupportedError
)
try:
result = await evaluator.evaluate(actual_output="...")
except DeepEvalError as e:
print(f"Evaluation failed: {e.message}")
print(f"Metric: {e.metric_name}")
print(f"Test case: {e.test_case_summary}")
except ProviderNotSupportedError as e:
print(f"Provider not supported: {e}")
Soft vs Hard Failures¶
metrics:
# Soft failure - continues on error
- type: geval
name: "Quality"
criteria: "..."
fail_on_error: false # Default
# Hard failure - stops test on error
- type: rag
metric_type: faithfulness
fail_on_error: true
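Conceptually, fail_on_error decides whether an evaluator exception aborts the test case or is recorded as a skipped metric; the sketch below is illustrative and not the runner's actual implementation.
# Illustrative only -- how fail_on_error is conceptually applied
try:
    result = await evaluator.evaluate(**eval_kwargs)
except Exception as exc:
    if metric.fail_on_error:
        raise  # hard failure: stop the test
    result = {"score": None, "error": str(exc)}  # soft failure: record and continue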
Related Documentation¶
- Evaluations Guide: Configuration and usage guide
- Test Runner: Test execution framework
- Data Models: EvaluationConfig and MetricConfig models
- Configuration Loading: Loading evaluation configs