Evaluation Framework API¶
HoloDeck provides a flexible evaluation framework for measuring agent response quality. The framework supports three tiers of metrics:
- Standard NLP Metrics -- Traditional text-comparison metrics (BLEU, ROUGE, METEOR) that require no LLM
- Azure AI Metrics -- AI-assisted quality metrics via Azure AI Evaluation SDK (groundedness, relevance, coherence, fluency, similarity)
- DeepEval Metrics -- LLM-as-a-judge evaluation with multi-provider support (G-Eval custom criteria, RAG pipeline metrics)
All evaluators share a common base class with retry logic, timeout handling, and a unified parameter specification system.
Architecture Overview¶
BaseEvaluator (base.py)
├── BLEUEvaluator (nlp_metrics.py)
├── ROUGEEvaluator (nlp_metrics.py)
├── METEOREvaluator (nlp_metrics.py)
├── AzureAIEvaluator (azure_ai.py)
│   ├── GroundednessEvaluator
│   ├── RelevanceEvaluator
│   ├── CoherenceEvaluator
│   ├── FluencyEvaluator
│   └── SimilarityEvaluator
└── DeepEvalBaseEvaluator (deepeval/base.py)
    ├── GEvalEvaluator (deepeval/geval.py)
    ├── FaithfulnessEvaluator (deepeval/faithfulness.py)
    ├── AnswerRelevancyEvaluator (deepeval/answer_relevancy.py)
    ├── ContextualRelevancyEvaluator (deepeval/contextual_relevancy.py)
    ├── ContextualPrecisionEvaluator (deepeval/contextual_precision.py)
    └── ContextualRecallEvaluator (deepeval/contextual_recall.py)
Configuration Models¶
Evaluation metrics are configured in agent.yaml using Pydantic models from holodeck.models.evaluation. The metrics list uses a discriminated union on the type field (standard, geval, or rag).
YAML Configuration Example¶
evaluations:
  model:  # Default LLM for all LLM-based metrics
    provider: openai
    name: gpt-4o
    temperature: 0.0
  metrics:
    # Standard NLP metric (no LLM required)
    - type: standard
      metric: bleu
      threshold: 0.4

    # G-Eval custom criteria (LLM-as-judge)
    - type: geval
      name: Helpfulness
      criteria: "Evaluate if the response provides actionable information"
      evaluation_params: [actual_output, input]
      threshold: 0.7

    # RAG pipeline metric
    - type: rag
      metric_type: faithfulness
      threshold: 0.8
      include_reason: true
EvaluationConfig¶
EvaluationConfig
¶
Bases: BaseModel
Evaluation framework configuration.
Container for evaluation metrics with optional default model configuration. Supports standard EvaluationMetric, GEvalMetric (custom criteria), and RAGMetric (RAG pipeline evaluation).
validate_metrics(v)
classmethod
¶
Validate metrics list is not empty.
Source code in src/holodeck/models/evaluation.py
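Beyond YAML, the same configuration can be built directly from the Pydantic models; a minimal sketch, assuming EvaluationConfig exposes a metrics list (plus an optional default model) mirroring the agent.yaml keys shown earlier:

from holodeck.models.evaluation import EvaluationConfig, EvaluationMetric, GEvalMetric

# Hypothetical construction; field names mirror the agent.yaml example above.
config = EvaluationConfig(
    metrics=[
        EvaluationMetric(type="standard", metric="bleu", threshold=0.4),
        GEvalMetric(
            type="geval",
            name="Helpfulness",
            criteria="Evaluate if the response provides actionable information",
            threshold=0.7,
        ),
    ],
)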
MetricType¶
The discriminated union that routes to the correct metric model based on the type field:
MetricType = Annotated[
EvaluationMetric | GEvalMetric | RAGMetric,
Field(discriminator="type"),
]
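The discriminator means a plain dictionary is routed to the correct metric model by its type key during validation; a small sketch, assuming Pydantic v2's TypeAdapter is available:

from pydantic import TypeAdapter

adapter = TypeAdapter(MetricType)
metric = adapter.validate_python(
    {"type": "rag", "metric_type": "faithfulness", "threshold": 0.8}
)
print(type(metric).__name__)  # RAGMetric -- selected via the "type" discriminator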
EvaluationMetric¶
Standard metric configuration (type: standard).
EvaluationMetric
¶
Bases: BaseModel
Evaluation metric configuration.
Represents a single evaluation metric with flexible model configuration, including per-metric LLM model overrides.
validate_custom_prompt(v)
classmethod
¶
Validate custom_prompt is not empty if provided.
Source code in src/holodeck/models/evaluation.py
validate_enabled(v)
classmethod
¶
Validate enabled is boolean.
Source code in src/holodeck/models/evaluation.py
validate_fail_on_error(v)
classmethod
¶
Validate fail_on_error is boolean.
Source code in src/holodeck/models/evaluation.py
validate_retry_on_failure(v)
classmethod
¶
Validate retry_on_failure is in valid range.
Source code in src/holodeck/models/evaluation.py
validate_scale(v)
classmethod
¶
Validate scale is positive.
Source code in src/holodeck/models/evaluation.py
validate_threshold(v)
classmethod
¶
Validate threshold is numeric if provided.
Source code in src/holodeck/models/evaluation.py
validate_timeout_ms(v)
classmethod
¶
Validate timeout_ms is positive.
Source code in src/holodeck/models/evaluation.py
GEvalMetric¶
G-Eval custom criteria configuration (type: geval).
GEvalMetric
¶
Bases: BaseModel
G-Eval custom criteria metric configuration.
Uses discriminator pattern with type="geval" to distinguish from standard EvaluationMetric instances in a discriminated union.
G-Eval enables custom evaluation criteria defined in natural language, using chain-of-thought prompting with LLM-based scoring.
Example
metric = GEvalMetric(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7,
)
validate_criteria(v)
classmethod
¶
Validate criteria is not empty.
Source code in src/holodeck/models/evaluation.py
validate_evaluation_params(v)
classmethod
¶
Validate evaluation_params contains valid values.
Source code in src/holodeck/models/evaluation.py
validate_name(v)
classmethod
¶
Validate name is not empty.
Source code in src/holodeck/models/evaluation.py
validate_threshold(v)
classmethod
¶
Validate threshold is in valid range.
Source code in src/holodeck/models/evaluation.py
RAGMetric¶
RAG pipeline metric configuration (type: rag).
RAGMetric
¶
Bases: BaseModel
RAG pipeline evaluation metric configuration.
Uses discriminator pattern with type="rag" to distinguish from standard EvaluationMetric and GEvalMetric instances in a discriminated union.
RAG metrics evaluate the quality of retrieval-augmented generation pipelines:

- Faithfulness: Detects hallucinations by comparing response to context
- ContextualRelevancy: Measures relevance of retrieved chunks to query
- ContextualPrecision: Evaluates ranking quality of retrieved chunks
- ContextualRecall: Measures retrieval completeness against expected output
Example
metric = RAGMetric(
    metric_type=RAGMetricType.FAITHFULNESS,
    threshold=0.8,
)
validate_threshold(v)
classmethod
¶
Validate threshold is in valid range.
Source code in src/holodeck/models/evaluation.py
RAGMetricType¶
RAGMetricType
¶
Bases: str, Enum
RAG pipeline evaluation metric types.
These metrics evaluate the quality of Retrieval-Augmented Generation (RAG) pipelines by assessing various aspects of retrieval and response generation.
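As an orientation sketch, the enum's members presumably mirror the RAG metrics listed under RAGMetric and the metric_type values used in YAML; the member names and string values below are assumptions, not the verbatim source:

from enum import Enum

class RAGMetricType(str, Enum):          # illustrative sketch only
    FAITHFULNESS = "faithfulness"        # grounded by the YAML example above
    CONTEXTUAL_RELEVANCY = "contextual_relevancy"
    CONTEXTUAL_PRECISION = "contextual_precision"
    CONTEXTUAL_RECALL = "contextual_recall"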
Base Framework¶
All evaluators inherit from BaseEvaluator, which provides retry logic with exponential backoff, timeout handling, and a parameter specification system.
BaseEvaluator¶
BaseEvaluator(timeout=60.0, retry_config=None)
¶
Bases: ABC
Abstract base class for all evaluation metrics.
This class provides retry logic, timeout handling, and a common interface for all evaluators (AI-assisted and NLP metrics).
Attributes:
| Name | Type | Description |
|---|---|---|
| timeout | | Timeout in seconds for evaluation (default: 60s, None for no timeout) |
| retry_config | | Configuration for retry logic with exponential backoff |
| name | str | Evaluator name (defaults to class name) |
| PARAM_SPEC | ParamSpec | Class attribute declaring required/optional parameters |
Example
class MyEvaluator(BaseEvaluator):
    PARAM_SPEC = ParamSpec(
        required=frozenset({EvalParam.RESPONSE, EvalParam.QUERY})
    )

    async def _evaluate_impl(self, **kwargs):
        return {"score": 0.85, "passed": True}

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(query="test", response="answer")
Initialize base evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| timeout | float \| None | Timeout in seconds (None for no timeout) | 60.0 |
| retry_config | RetryConfig \| None | Retry configuration (uses defaults if not provided) | None |
Source code in src/holodeck/lib/evaluators/base.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Evaluation parameters (query, response, context, ground_truth, etc.) | {} |

Returns:

| Type | Description |
|---|---|
| dict[str, Any] | Evaluation result dictionary |

Raises:

| Type | Description |
|---|---|
| TimeoutError | If evaluation exceeds timeout |
| EvaluationError | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| ParamSpec | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
RetryConfig¶
RetryConfig
¶
Bases: BaseModel
Configuration for retry logic with exponential backoff.
Attributes:
| Name | Type | Description |
|---|---|---|
| max_retries | int | Maximum number of retry attempts (default: 3) |
| base_delay | float | Base delay in seconds for exponential backoff (default: 2.0) |
| max_delay | float | Maximum delay between retries in seconds (default: 60.0) |
| exponential_base | float | Exponential base for backoff calculation (default: 2.0) |
Example
config = RetryConfig(max_retries=3, base_delay=2.0)
# Delays will be: 2.0s, 4.0s, 8.0s
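Those delays follow a capped exponential-backoff pattern; a sketch of the presumed calculation (the exact formula inside BaseEvaluator is not shown here, so treat this as an approximation):

config = RetryConfig(max_retries=3, base_delay=2.0)
for attempt in range(config.max_retries):
    # Assumed: delay grows by exponential_base each attempt, capped at max_delay
    delay = min(config.base_delay * config.exponential_base ** attempt, config.max_delay)
    print(delay)  # 2.0, 4.0, 8.0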
validate_delays(v)
classmethod
¶
Validate delays are positive.
Source code in src/holodeck/lib/evaluators/base.py
validate_max_retries(v)
classmethod
¶
Validate max_retries is non-negative.
Source code in src/holodeck/lib/evaluators/base.py
EvaluationError¶
EvaluationError
¶
Bases: Exception
Exception raised when evaluation fails after all retry attempts.
Parameter Specification¶
The param_spec module defines a standard way for evaluators to declare their required and optional inputs. This enables the test runner to validate inputs before calling the evaluator.
EvalParam¶
EvalParam
¶
Bases: str, Enum
Standard evaluation parameter names.
Two naming conventions are supported:

- Azure AI / NLP: RESPONSE, QUERY, GROUND_TRUTH
- DeepEval: ACTUAL_OUTPUT, INPUT, EXPECTED_OUTPUT
Both conventions share CONTEXT and RETRIEVAL_CONTEXT.
ParamSpec¶
ParamSpec
¶
Bases: NamedTuple
Parameter specification for an evaluator.
Declares which parameters an evaluator requires and optionally accepts, plus flags for special context handling.
Attributes:
| Name | Type | Description |
|---|---|---|
| required | frozenset[EvalParam] | Parameters that must be provided for evaluation. |
| optional | frozenset[EvalParam] | Parameters that may be provided but aren't required. |
| uses_context | bool | Whether file content should be passed as context. |
| uses_retrieval_context | bool | Whether retrieval context from tools is needed. |
Example
spec = ParamSpec(
    required=frozenset({EvalParam.RESPONSE, EvalParam.QUERY}),
    optional=frozenset({EvalParam.CONTEXT}),
    uses_context=True,
)
uses_deepeval_params()
¶
Check if this spec uses DeepEval parameter naming convention.
Returns:
| Type | Description |
|---|---|
| bool | True if any required or optional param is a DeepEval param. |
Source code in src/holodeck/lib/evaluators/param_spec.py
DEEPEVAL_PARAMS¶
A frozenset of DeepEval-specific parameter names (INPUT, ACTUAL_OUTPUT, EXPECTED_OUTPUT) used by ParamSpec.uses_deepeval_params() to detect the DeepEval naming convention.
DEEPEVAL_PARAMS = frozenset(
{EvalParam.INPUT, EvalParam.ACTUAL_OUTPUT, EvalParam.EXPECTED_OUTPUT}
)
Two naming conventions exist side-by-side:
| Convention | Query | Response | Reference |
|---|---|---|---|
| Azure AI / NLP | query | response | ground_truth |
| DeepEval | input | actual_output | expected_output |
Both conventions share context and retrieval_context.
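A minimal sketch of translating between the two conventions; the mapping follows the table above, but the helper itself is illustrative and not part of the HoloDeck API:

AZURE_TO_DEEPEVAL = {
    "query": "input",
    "response": "actual_output",
    "ground_truth": "expected_output",
}

def to_deepeval_kwargs(**kwargs):
    """Rename Azure/NLP-style parameters to DeepEval-style names (hypothetical helper)."""
    return {AZURE_TO_DEEPEVAL.get(key, key): value for key, value in kwargs.items()}

print(to_deepeval_kwargs(query="What is ML?", response="ML is machine learning."))
# {'input': 'What is ML?', 'actual_output': 'ML is machine learning.'}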
Standard NLP Metrics¶
Traditional text-comparison metrics that do not require an LLM. All NLP evaluators require response and ground_truth parameters.
BLEUEvaluator¶
Uses SacreBLEU with exponential smoothing. Scores are normalized from SacreBLEU's 0--100 scale to 0.0--1.0.
BLEUEvaluator(threshold=None, timeout=60.0, **kwargs)
¶
Bases: BaseEvaluator
BLEU score evaluator using SacreBLEU with smoothing.
BLEU (Bilingual Evaluation Understudy) measures precision of n-gram matches between prediction and reference text. Uses SacreBLEU with exponential smoothing to handle short sentences and avoid zero scores when there are no 4-gram matches.
Score Range: 0.0-1.0 (normalized from SacreBLEU's 0-100 scale) Higher scores indicate better match to reference text.
Attributes:
| Name | Type | Description |
|---|---|---|
threshold |
Minimum passing score (0.0-1.0) |
|
timeout |
Timeout in seconds for evaluation |
|
retry_config |
Retry configuration for transient failures |
Example
evaluator = BLEUEvaluator(threshold=0.5)
result = await evaluator.evaluate(
    response="The cat sat on the mat",
    ground_truth="The cat is on the mat",
)
print(result["bleu"])    # 0.0-1.0
print(result["passed"])  # True if >= threshold
Initialize BLEU evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
threshold
|
float | None
|
Minimum passing score (0.0-1.0) |
None
|
timeout
|
float | None
|
Timeout in seconds |
60.0
|
**kwargs
|
Any
|
Additional arguments passed to BaseEvaluator |
{}
|
Source code in src/holodeck/lib/evaluators/nlp_metrics.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
ROUGEEvaluator¶
Returns all three ROUGE variants (rouge1, rouge2, rougeL). The variant parameter controls which variant is used for the threshold check.
ROUGEEvaluator(threshold=None, variant='rougeL', timeout=60.0, **kwargs)
¶
Bases: BaseEvaluator
ROUGE score evaluator.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures recall of n-gram overlaps between prediction and reference. Commonly used for summarization evaluation.
Variants:

- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence
Score Range: 0.0-1.0 (F1 score) Higher scores indicate better recall of reference text.
Attributes:
| Name | Type | Description |
|---|---|---|
threshold |
Minimum passing score (0.0-1.0) |
|
variant |
ROUGE variant to use for threshold check ("rouge1", "rouge2", "rougeL") |
|
timeout |
Timeout in seconds for evaluation |
|
retry_config |
Retry configuration for transient failures |
Example
evaluator = ROUGEEvaluator(threshold=0.6, variant="rougeL")
result = await evaluator.evaluate(
    response="The cat sat on the mat",
    ground_truth="The cat is on the mat",
)
print(result["rouge1"])  # 0.0-1.0
print(result["rouge2"])  # 0.0-1.0
print(result["rougeL"])  # 0.0-1.0
print(result["passed"])  # True if rougeL >= threshold
Initialize ROUGE evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
threshold
|
float | None
|
Minimum passing score (0.0-1.0) |
None
|
variant
|
str
|
ROUGE variant for threshold check ("rouge1", "rouge2", "rougeL") |
'rougeL'
|
timeout
|
float | None
|
Timeout in seconds |
60.0
|
**kwargs
|
Any
|
Additional arguments passed to BaseEvaluator |
{}
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If variant is not valid |
Source code in src/holodeck/lib/evaluators/nlp_metrics.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
METEOREvaluator¶
Uses synonym-aware matching with stemming for better correlation with human judgment than pure n-gram overlap.
METEOREvaluator(threshold=None, timeout=60.0, **kwargs)
¶
Bases: BaseEvaluator
METEOR score evaluator.
METEOR (Metric for Evaluation of Translation with Explicit ORdering) measures translation quality using synonym matching, stemming, and paraphrase detection. Provides better correlation with human judgment than BLEU.
Score Range: 0.0-1.0 Higher scores indicate better semantic match to reference text.
Attributes:
| Name | Type | Description |
|---|---|---|
threshold |
Minimum passing score (0.0-1.0) |
|
timeout |
Timeout in seconds for evaluation |
|
retry_config |
Retry configuration for transient failures |
Example
evaluator = METEOREvaluator(threshold=0.7)
result = await evaluator.evaluate(
    response="The automobile is red",
    ground_truth="The car is red",
)
print(result["meteor"])  # Higher than BLEU due to synonym handling
print(result["passed"])  # True if >= threshold
Initialize METEOR evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
threshold
|
float | None
|
Minimum passing score (0.0-1.0) |
None
|
timeout
|
float | None
|
Timeout in seconds |
60.0
|
**kwargs
|
Any
|
Additional arguments passed to BaseEvaluator |
{}
|
Source code in src/holodeck/lib/evaluators/nlp_metrics.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
NLPMetricsError¶
NLPMetricsError
¶
Bases: EvaluationError
Exception raised when NLP metric computation fails.
NLP Metrics Usage¶
from holodeck.lib.evaluators.nlp_metrics import BLEUEvaluator, ROUGEEvaluator
bleu = BLEUEvaluator(threshold=0.5)
result = await bleu.evaluate(
response="The cat sat on the mat",
ground_truth="The cat is on the mat",
)
print(result["bleu"]) # 0.0-1.0
print(result["passed"]) # True if >= 0.5
rouge = ROUGEEvaluator(threshold=0.6, variant="rougeL")
result = await rouge.evaluate(
response="The cat sat on the mat",
ground_truth="The cat is on the mat",
)
print(result["rouge1"], result["rouge2"], result["rougeL"])
NLP Metrics Summary¶
| Metric | Score Key | Score Range | Use Case |
|---|---|---|---|
| BLEUEvaluator | bleu | 0.0--1.0 | Precision-focused n-gram matching |
| ROUGEEvaluator | rouge1, rouge2, rougeL | 0.0--1.0 | Recall-focused overlap (summarization) |
| METEOREvaluator | meteor | 0.0--1.0 | Synonym-aware semantic similarity |
Azure AI Metrics¶
AI-assisted quality metrics powered by the Azure AI Evaluation SDK. All Azure evaluators normalize scores from a 1--5 scale to 0.0--1.0.
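The 1--5 to 0.0--1.0 normalization is presumably a linear rescaling; a sketch of the assumed mapping (verify against AzureAIEvaluator before relying on exact values):

def normalize_azure_score(raw: float) -> float:
    """Map a raw 1-5 Azure AI score onto 0.0-1.0 (assumed linear scaling)."""
    return (raw - 1.0) / 4.0

print(normalize_azure_score(5))  # 1.0
print(normalize_azure_score(3))  # 0.5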
ModelConfig¶
ModelConfig
¶
Bases: BaseModel
Azure OpenAI model configuration for evaluators.
Attributes:
| Name | Type | Description |
|---|---|---|
| azure_endpoint | str | Azure OpenAI endpoint URL |
| api_key | str | Azure OpenAI API key |
| azure_deployment | str | Azure deployment name (e.g., "gpt-4o", "gpt-4o-mini") |
| api_version | str | Azure OpenAI API version (default: "2024-02-15-preview") |
Example
config = ModelConfig(
    azure_endpoint="https://my-resource.openai.azure.com/",
    api_key="my-api-key",
    azure_deployment="gpt-4o",
)
AzureAIEvaluator¶
AzureAIEvaluator(model_config, timeout=60.0, retry_config=None)
¶
Bases: BaseEvaluator
Base class for Azure AI Evaluation SDK evaluators.
Provides common functionality for all Azure AI evaluators:

- Model configuration
- Retry logic with exponential backoff
- Timeout handling
- Score normalization (5-point scale to 0-1)
Attributes:
| Name | Type | Description |
|---|---|---|
model_config |
Azure OpenAI model configuration |
|
timeout |
Timeout in seconds (default: 60s) |
|
retry_config |
Retry configuration with exponential backoff |
Example
config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o-mini",
)
evaluator = RelevanceEvaluator(model_config=config)
result = await evaluator.evaluate(query="test", response="answer")
Initialize Azure AI evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_config
|
ModelConfig
|
Azure OpenAI model configuration |
required |
timeout
|
float | None
|
Timeout in seconds (default: 60s, None for no timeout) |
60.0
|
retry_config
|
RetryConfig | None
|
Retry configuration (uses defaults if not provided) |
None
|
Source code in src/holodeck/lib/evaluators/azure_ai.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
GroundednessEvaluator¶
Assesses whether all claims in the response are supported by the provided context. Use an expensive model (e.g., gpt-4o) for this critical metric.
GroundednessEvaluator(model_config, timeout=60.0, retry_config=None)
¶
Bases: AzureAIEvaluator
Groundedness evaluator using Azure AI Evaluation SDK.
Assesses correspondence between claims in AI-generated answers and source context. Measures factual accuracy by verifying that all claims in the response are supported by the provided context.
Query parameter is optional but recommended for better accuracy.
Scale: 1-5 (normalized to 0.0-1.0)
Example
config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o",  # Use expensive model for critical metric
)
evaluator = GroundednessEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="What is the capital?",
    response="The capital is Paris.",
    context="France's capital is Paris.",
)
print(result["score"])  # 0.0-1.0, e.g. 0.95
Initialize Azure AI evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_config
|
ModelConfig
|
Azure OpenAI model configuration |
required |
timeout
|
float | None
|
Timeout in seconds (default: 60s, None for no timeout) |
60.0
|
retry_config
|
RetryConfig | None
|
Retry configuration (uses defaults if not provided) |
None
|
Source code in src/holodeck/lib/evaluators/azure_ai.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
RelevanceEvaluator¶
Measures whether the response directly addresses the user's question.
RelevanceEvaluator(model_config, timeout=60.0, retry_config=None)
¶
Bases: AzureAIEvaluator
Relevance evaluator using Azure AI Evaluation SDK.
Measures relevance of response to query. Assesses whether the response directly addresses the user's question or request.
Scale: 1-5 (normalized to 0.0-1.0)
Example
config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o",  # Critical metric
)
evaluator = RelevanceEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="What is ML?",
    response="ML is machine learning, a subset of AI.",
)
Initialize Azure AI evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_config
|
ModelConfig
|
Azure OpenAI model configuration |
required |
timeout
|
float | None
|
Timeout in seconds (default: 60s, None for no timeout) |
60.0
|
retry_config
|
RetryConfig | None
|
Retry configuration (uses defaults if not provided) |
None
|
Source code in src/holodeck/lib/evaluators/azure_ai.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
CoherenceEvaluator¶
Evaluates logical flow and readability of the response.
CoherenceEvaluator(model_config, timeout=60.0, retry_config=None)
¶
Bases: AzureAIEvaluator
Coherence evaluator using Azure AI Evaluation SDK.
Evaluates logical flow and readability. Measures how well the response is organized and whether ideas connect logically.
Scale: 1-5 (normalized to 0.0-1.0)
Example
config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o-mini",  # Less critical metric
)
evaluator = CoherenceEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="Explain X",
    response="X is... Furthermore... In conclusion...",
)
Initialize Azure AI evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_config
|
ModelConfig
|
Azure OpenAI model configuration |
required |
timeout
|
float | None
|
Timeout in seconds (default: 60s, None for no timeout) |
60.0
|
retry_config
|
RetryConfig | None
|
Retry configuration (uses defaults if not provided) |
None
|
Source code in src/holodeck/lib/evaluators/azure_ai.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
FluencyEvaluator¶
Assesses grammar, spelling, punctuation, word choice, and sentence structure.
FluencyEvaluator(model_config, timeout=60.0, retry_config=None)
¶
Bases: AzureAIEvaluator
Fluency evaluator using Azure AI Evaluation SDK.
Assesses language quality. Measures grammar, spelling, punctuation, word choice, and sentence structure.
Scale: 1-5 (normalized to 0.0-1.0)
Example
config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o-mini",  # Less critical metric
)
evaluator = FluencyEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="Test",
    response="This is a well-written response.",
)
Initialize Azure AI evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_config
|
ModelConfig
|
Azure OpenAI model configuration |
required |
timeout
|
float | None
|
Timeout in seconds (default: 60s, None for no timeout) |
60.0
|
retry_config
|
RetryConfig | None
|
Retry configuration (uses defaults if not provided) |
None
|
Source code in src/holodeck/lib/evaluators/azure_ai.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
SimilarityEvaluator¶
Compares semantic similarity between response and ground truth.
SimilarityEvaluator(model_config, timeout=60.0, retry_config=None)
¶
Bases: AzureAIEvaluator
Similarity evaluator using Azure AI Evaluation SDK.
Compares semantic similarity between response and ground truth. Measures how closely the response matches the expected answer.
Requires ground_truth parameter.
Scale: 1-5 (normalized to 0.0-1.0)
Example
config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o-mini",
)
evaluator = SimilarityEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="What is 2+2?",
    response="The answer is 4.",
    ground_truth="2+2 equals 4.",
)
Initialize Azure AI evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_config
|
ModelConfig
|
Azure OpenAI model configuration |
required |
timeout
|
float | None
|
Timeout in seconds (default: 60s, None for no timeout) |
60.0
|
retry_config
|
RetryConfig | None
|
Retry configuration (uses defaults if not provided) |
None
|
Source code in src/holodeck/lib/evaluators/azure_ai.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
Azure AI Usage¶
from holodeck.lib.evaluators.azure_ai import (
ModelConfig,
GroundednessEvaluator,
RelevanceEvaluator,
)
config = ModelConfig(
azure_endpoint="https://my-resource.openai.azure.com/",
api_key="my-api-key",
azure_deployment="gpt-4o",
)
groundedness = GroundednessEvaluator(model_config=config)
result = await groundedness.evaluate(
query="What is the capital of France?",
response="The capital of France is Paris.",
context="France is a country in Europe. Its capital is Paris.",
)
print(result["score"]) # 0.0-1.0 (normalized from 1-5)
print(result["groundedness"]) # Raw 1-5 score
print(result["reasoning"]) # LLM explanation
Azure AI Metrics Summary¶
| Evaluator | Required Params | Optional Params | Score Key |
|---|---|---|---|
| GroundednessEvaluator | response, context | query | groundedness |
| RelevanceEvaluator | response, query | context | relevance |
| CoherenceEvaluator | response, query | -- | coherence |
| FluencyEvaluator | response, query | -- | fluency |
| SimilarityEvaluator | response, query, ground_truth | -- | similarity |
DeepEval Metrics¶
LLM-as-a-judge evaluation with multi-provider support (OpenAI, Azure OpenAI, Anthropic, Ollama). DeepEval metrics use a different parameter naming convention (input, actual_output, expected_output) but HoloDeck's DeepEvalBaseEvaluator also accepts Azure/NLP aliases (query, response, ground_truth).
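Because of that aliasing, the same call can be written with either naming convention; a sketch, assuming the alias handling behaves as described:

from holodeck.lib.evaluators.deepeval.answer_relevancy import AnswerRelevancyEvaluator

evaluator = AnswerRelevancyEvaluator(threshold=0.7)

# DeepEval-style parameter names
result = await evaluator.evaluate(input="What is ML?", actual_output="ML is machine learning.")

# Azure/NLP-style aliases accepted by DeepEvalBaseEvaluator
result = await evaluator.evaluate(query="What is ML?", response="ML is machine learning.")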
DeepEvalModelConfig¶
DeepEvalModelConfig
¶
Bases: BaseModel
Configuration adapter for DeepEval model classes.
This class bridges HoloDeck's LLMProvider configuration to DeepEval's native model classes (GPTModel, AzureOpenAIModel, AnthropicModel, OllamaModel).
The default configuration uses Ollama with gpt-oss:20b for local evaluation without requiring API keys.
Attributes:
| Name | Type | Description |
|---|---|---|
| provider | ProviderEnum | LLM provider to use (defaults to Ollama) |
| model_name | str | Name of the model (defaults to gpt-oss:20b) |
| api_key | str \| None | API key for cloud providers (not required for Ollama) |
| endpoint | str \| None | API endpoint URL (required for Azure, optional for Ollama) |
| api_version | str \| None | Azure OpenAI API version |
| deployment_name | str \| None | Azure OpenAI deployment name |
| temperature | float | Temperature for generation (defaults to 0.0 for determinism) |
API Key Behavior
- OpenAI: API key can be provided via the api_key field or the OPENAI_API_KEY environment variable. If neither is set, DeepEval's GPTModel will raise an error at runtime.
- Anthropic: API key can be provided via the api_key field or the ANTHROPIC_API_KEY environment variable. If neither is set, DeepEval's AnthropicModel will raise an error at runtime.
- Azure OpenAI: The api_key field is required and validated at configuration time (no environment variable fallback).
- Ollama: No API key required (local inference).
Example
config = DeepEvalModelConfig()  # Default Ollama
model = config.to_deepeval_model()

openai_config = DeepEvalModelConfig(
    provider=ProviderEnum.OPENAI,
    model_name="gpt-4o",
    api_key="sk-...",  # Or set OPENAI_API_KEY env var
)
to_deepeval_model()
¶
Convert configuration to native DeepEval model class.
Returns the appropriate DeepEval model class instance based on the configured provider.
Returns:
| Type | Description |
|---|---|
DeepEvalModel
|
DeepEval model instance (GPTModel, AzureOpenAIModel, |
DeepEvalModel
|
AnthropicModel, or OllamaModel) |
Raises:
| Type | Description |
|---|---|
ValueError
|
If provider is not supported |
Source code in src/holodeck/lib/evaluators/deepeval/config.py
validate_provider_requirements()
¶
Validate that required fields are present for each provider.
Raises:
| Type | Description |
|---|---|
ValueError
|
If required fields are missing for the provider |
Source code in src/holodeck/lib/evaluators/deepeval/config.py
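For Azure OpenAI, the provider-specific fields listed in the attributes table (endpoint, deployment_name, api_version, and a required api_key) must all be set; a sketch with placeholder values, where the ProviderEnum member name is an assumption:

azure_config = DeepEvalModelConfig(
    provider=ProviderEnum.AZURE_OPENAI,  # assumed member name; check ProviderEnum
    model_name="gpt-4o",
    api_key="...",                       # required for Azure; no env-var fallback
    endpoint="https://my-resource.openai.azure.com/",
    deployment_name="gpt-4o",
    api_version="2024-02-15-preview",    # placeholder version string
)
model = azure_config.to_deepeval_model()  # returns an AzureOpenAIModel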
DeepEvalBaseEvaluator¶
DeepEvalBaseEvaluator(model_config=None, threshold=0.5, timeout=60.0, retry_config=None, observability_config=None)
¶
Bases: BaseEvaluator
Abstract base class for DeepEval-based evaluators.
This class extends BaseEvaluator to provide DeepEval-specific functionality:

- Model configuration and initialization
- LLMTestCase construction from evaluation inputs
- Result normalization and logging
Subclasses must implement _create_metric() to return the specific DeepEval metric instance.
Note: DeepEval uses different parameter names than Azure AI/NLP:

- input (not query)
- actual_output (not response)
- expected_output (not ground_truth)
Attributes:
| Name | Type | Description |
|---|---|---|
model_config |
Configuration for the evaluation LLM |
|
threshold |
Score threshold for pass/fail determination |
|
model |
The initialized DeepEval model instance |
Example
class MyMetricEvaluator(DeepEvalBaseEvaluator):
    def _create_metric(self):
        return SomeDeepEvalMetric(
            threshold=self._threshold,
            model=self._model,
        )

evaluator = MyMetricEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is Python?",
    actual_output="Python is a programming language.",
)
Initialize DeepEval base evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
model_config
|
DeepEvalModelConfig | None
|
Configuration for the evaluation model. Defaults to Ollama with gpt-oss:20b. |
None
|
threshold
|
float
|
Score threshold for pass/fail (0.0-1.0, default: 0.5) |
0.5
|
timeout
|
float | None
|
Evaluation timeout in seconds (default: 60.0) |
60.0
|
retry_config
|
RetryConfig | None
|
Retry configuration for transient failures |
None
|
observability_config
|
TracingConfig | None
|
Tracing configuration for span instrumentation. If None, no spans are created. |
None
|
Source code in src/holodeck/lib/evaluators/deepeval/base.py
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
DeepEvalError¶
DeepEvalError(message, metric_name, original_error=None, test_case_summary=None)
¶
Bases: EvaluationError
Wraps errors from the DeepEval library with additional context.
This exception provides debugging information when DeepEval metrics fail, including the metric name and a summary of the test case that triggered the error.
Attributes:
| Name | Type | Description |
|---|---|---|
metric_name |
Name of the DeepEval metric that failed |
|
original_error |
The underlying exception from DeepEval |
|
test_case_summary |
Truncated input/output data for debugging |
Initialize DeepEvalError with context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
message
|
str
|
Human-readable error message |
required |
metric_name
|
str
|
Name of the metric that failed |
required |
original_error
|
Exception | None
|
The underlying exception from DeepEval |
None
|
test_case_summary
|
dict[str, Any] | None
|
Dictionary with truncated test case fields |
None
|
Source code in src/holodeck/lib/evaluators/deepeval/errors.py
ProviderNotSupportedError¶
ProviderNotSupportedError(message, evaluator_type, configured_provider, supported_providers)
¶
Bases: EvaluationError
Raised when an evaluator is used with an incompatible LLM provider.
This error is raised early during evaluator initialization to prevent confusing runtime errors when users misconfigure provider settings.
Attributes:
| Name | Type | Description |
|---|---|---|
evaluator_type |
The type of evaluator that requires specific providers |
|
configured_provider |
The provider that was incorrectly configured |
|
supported_providers |
List of providers that are supported |
Initialize ProviderNotSupportedError with context.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
message
|
str
|
Human-readable error message |
required |
evaluator_type
|
str
|
The evaluator class that raised the error |
required |
configured_provider
|
str
|
The provider that was configured |
required |
supported_providers
|
list[str]
|
List of valid provider names |
required |
Source code in src/holodeck/lib/evaluators/deepeval/errors.py
G-Eval: Custom Criteria¶
GEvalEvaluator¶
GEvalEvaluator(name, criteria, evaluation_params=None, evaluation_steps=None, model_config=None, threshold=0.5, strict_mode=False, timeout=60.0, retry_config=None, observability_config=None)
¶
Bases: DeepEvalBaseEvaluator
G-Eval custom criteria evaluator.
Evaluates LLM outputs against user-defined criteria using the G-Eval algorithm, which combines chain-of-thought prompting with token probability scoring.
G-Eval works in two phases:

1. Step Generation: Auto-generates evaluation steps from the criteria
2. Scoring: Uses the steps to score the test case on a 1-5 scale (normalized to 0-1)
Attributes:
| Name | Type | Description |
|---|---|---|
_metric_name |
Custom name for this evaluation metric |
|
_criteria |
Natural language criteria for evaluation |
|
_evaluation_params |
Test case fields to include in evaluation |
|
_evaluation_steps |
Optional explicit evaluation steps |
|
_strict_mode |
Whether to use binary scoring (1.0 or 0.0) |
Example
evaluator = GEvalEvaluator(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7,
)
result = await evaluator.evaluate(
    input="Write me an email",
    actual_output="Dear Sir/Madam, ...",
)
print(result["score"])   # 0.85
print(result["passed"])  # True
Initialize G-Eval evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name
|
str
|
Metric identifier (e.g., "Correctness", "Helpfulness") |
required |
criteria
|
str
|
Natural language evaluation criteria |
required |
evaluation_params
|
list[str] | None
|
Test case fields to include in evaluation. Valid options: ["input", "actual_output", "expected_output", "context", "retrieval_context"] Default: ["actual_output"] |
None
|
evaluation_steps
|
list[str] | None
|
Explicit evaluation steps. If None, G-Eval auto-generates steps from the criteria. |
None
|
model_config
|
DeepEvalModelConfig | None
|
LLM judge configuration. Defaults to Ollama gpt-oss:20b. |
None
|
threshold
|
float
|
Pass/fail score threshold (0.0-1.0). Default: 0.5. |
0.5
|
strict_mode
|
bool
|
If True, scores are binary (1.0 or 0.0). Default: False. |
False
|
timeout
|
float | None
|
Evaluation timeout in seconds. Default: 60.0. |
60.0
|
retry_config
|
RetryConfig | None
|
Retry configuration for transient failures. |
None
|
observability_config
|
TracingConfig | None
|
Tracing configuration for span instrumentation. If None, no spans are created. |
None
|
Raises:
| Type | Description |
|---|---|
ValueError
|
If invalid evaluation_params are provided. |
Source code in src/holodeck/lib/evaluators/deepeval/geval.py
name
property
¶
Return the custom metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
**kwargs
|
Any
|
Evaluation parameters (query, response, context, ground_truth, etc.) |
{}
|
Returns:
| Type | Description |
|---|---|
dict[str, Any]
|
Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
TimeoutError
|
If evaluation exceeds timeout |
EvaluationError
|
If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95
Source code in src/holodeck/lib/evaluators/base.py
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
ParamSpec
|
ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py
G-Eval Usage¶
from holodeck.lib.evaluators.deepeval import GEvalEvaluator, DeepEvalModelConfig
from holodeck.models.llm import ProviderEnum
config = DeepEvalModelConfig(
provider=ProviderEnum.OPENAI,
model_name="gpt-4o",
api_key="sk-...",
)
evaluator = GEvalEvaluator(
name="Professionalism",
criteria="Evaluate if the response uses professional language and avoids slang.",
evaluation_params=["actual_output", "input"],
evaluation_steps=[
"Check if the language is formal and professional",
"Verify no slang or casual expressions are used",
],
model_config=config,
threshold=0.7,
strict_mode=False,
)
result = await evaluator.evaluate(
input="Write a business email",
actual_output="Dear Sir/Madam, I am writing to inquire about...",
)
print(result["score"]) # 0.0-1.0
print(result["passed"]) # True if >= 0.7
print(result["reasoning"]) # LLM-generated explanation
G-Eval YAML Configuration¶
evaluations:
  model:
    provider: openai
    name: gpt-4o
    temperature: 0.0
  metrics:
    - type: geval
      name: Professionalism
      criteria: |
        Evaluate if the response uses professional language,
        avoids slang, and maintains a respectful tone.
      evaluation_steps:
        - "Check if the language is formal and professional"
        - "Verify no slang or casual expressions are used"
        - "Assess the overall respectful tone"
      evaluation_params:
        - actual_output
        - input
      threshold: 0.7
      strict_mode: false
Valid evaluation_params values: input, actual_output, expected_output, context, retrieval_context.
RAG Pipeline Metrics¶
RAG evaluators measure retrieval-augmented generation quality. All RAG evaluators (except AnswerRelevancyEvaluator) require retrieval_context.
FaithfulnessEvaluator¶
Detects hallucinations by checking whether the response is supported by the retrieval context.
FaithfulnessEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None, observability_config=None)
¶
Bases: DeepEvalBaseEvaluator
Faithfulness evaluator for detecting hallucinations.
Detects hallucinations by comparing agent response to retrieval context. Returns a low score if the response contains information not found in the retrieval context (hallucination detected).
Required inputs
- input: User query
- actual_output: Agent response
- retrieval_context: List of retrieved text chunks
Example
evaluator = FaithfulnessEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="What are the store hours?",
    actual_output="Store is open 24/7.",
    retrieval_context=["Store hours: Mon-Fri 9am-5pm"],
)
print(result["score"])  # Low score (hallucination detected)
Attributes:
| Name | Type | Description |
|---|---|---|
_include_reason |
Whether to include reasoning in results. |
Initialize Faithfulness evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_config | DeepEvalModelConfig \| None | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | None |
| threshold | float | Pass/fail score threshold (0.0-1.0). Default: 0.5. | 0.5 |
| include_reason | bool | Whether to include reasoning in results. Default: True. | True |
| timeout | float \| None | Evaluation timeout in seconds. Default: 60.0. | 60.0 |
| retry_config | RetryConfig \| None | Retry configuration for transient failures. | None |
| observability_config | TracingConfig \| None | Tracing configuration for span instrumentation. If None, no spans are created. | None |
Source code in src/holodeck/lib/evaluators/deepeval/faithfulness.py (lines 58-93)
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Evaluation parameters (query, response, context, ground_truth, etc.) | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| TimeoutError | If evaluation exceeds timeout |
| EvaluationError | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95
Source code in src/holodeck/lib/evaluators/base.py (lines 262-305)
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| ParamSpec | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py (lines 121-128)
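Before assembling inputs for a test case, the parameter specification can be inspected programmatically via the classmethod shown above; a minimal sketch (how the returned ParamSpec renders when printed depends on its definition in base.py):
from holodeck.lib.evaluators.deepeval import FaithfulnessEvaluator
spec = FaithfulnessEvaluator.get_param_spec()
print(spec)  # ParamSpec declaring required/optional parameters and context flags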
AnswerRelevancyEvaluator¶
Measures whether response statements are relevant to the input query. Does not require retrieval_context.
AnswerRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, strict_mode=False, timeout=60.0, retry_config=None, observability_config=None)
¶
Bases: DeepEvalBaseEvaluator
Answer Relevancy evaluator - measures statement relevance to input.
Evaluates how relevant the response statements are to the input query. Unlike other RAG metrics, this does NOT require retrieval_context.
Required inputs
- input: User query
- actual_output: Agent response
Example
evaluator = AnswerRelevancyEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is the return policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)
print(result["score"])  # High score if relevant
Attributes:
| Name | Type | Description |
|---|---|---|
| _include_reason |  | Whether to include reasoning in results. |
| _strict_mode |  | Whether to use binary scoring (1.0 or 0.0). |
Initialize Answer Relevancy evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_config | DeepEvalModelConfig \| None | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | None |
| threshold | float | Pass/fail score threshold (0.0-1.0). Default: 0.5. | 0.5 |
| include_reason | bool | Whether to include reasoning in results. Default: True. | True |
| strict_mode | bool | Binary scoring mode (1.0 or 0.0 only). Default: False. | False |
| timeout | float \| None | Evaluation timeout in seconds. Default: 60.0. | 60.0 |
| retry_config | RetryConfig \| None | Retry configuration for transient failures. | None |
| observability_config | TracingConfig \| None | Tracing configuration for span instrumentation. If None, no spans are created. | None |
Source code in src/holodeck/lib/evaluators/deepeval/answer_relevancy.py (lines 54-93)
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Evaluation parameters (query, response, context, ground_truth, etc.) | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| TimeoutError | If evaluation exceeds timeout |
| EvaluationError | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95
Source code in src/holodeck/lib/evaluators/base.py (lines 262-305)
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| ParamSpec | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py (lines 121-128)
ContextualRelevancyEvaluator¶
Measures the proportion of retrieved chunks that are relevant to the query.
ContextualRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None, observability_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Relevancy evaluator for RAG pipelines.
Measures the relevance of retrieved context to the user query. Returns the proportion of chunks that are relevant to the query.
Required inputs
- input: User query
- actual_output: Agent response
- retrieval_context: List of retrieved text chunks
Example
evaluator = ContextualRelevancyEvaluator(threshold=0.6)
result = await evaluator.evaluate(
    input="What is the pricing?",
    actual_output="Basic plan is $10/month.",
    retrieval_context=[
        "Pricing: Basic $10, Pro $25",  # Relevant
        "Company founded in 2020",  # Irrelevant
    ],
)
print(result["score"])  # 0.5 (1 of 2 chunks relevant)
Attributes:
| Name | Type | Description |
|---|---|---|
| _include_reason |  | Whether to include reasoning in results. |
Initialize Contextual Relevancy evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_config | DeepEvalModelConfig \| None | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | None |
| threshold | float | Pass/fail score threshold (0.0-1.0). Default: 0.5. | 0.5 |
| include_reason | bool | Whether to include reasoning in results. Default: True. | True |
| timeout | float \| None | Evaluation timeout in seconds. Default: 60.0. | 60.0 |
| retry_config | RetryConfig \| None | Retry configuration for transient failures. | None |
| observability_config | TracingConfig \| None | Tracing configuration for span instrumentation. If None, no spans are created. | None |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_relevancy.py (lines 60-95)
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Evaluation parameters (query, response, context, ground_truth, etc.) | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| TimeoutError | If evaluation exceeds timeout |
| EvaluationError | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95
Source code in src/holodeck/lib/evaluators/base.py (lines 262-305)
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| ParamSpec | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py (lines 121-128)
ContextualPrecisionEvaluator¶
Evaluates ranking quality -- whether relevant chunks appear before irrelevant ones.
ContextualPrecisionEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None, observability_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Precision evaluator for RAG pipelines.
Evaluates the ranking quality of retrieved chunks. Measures whether relevant chunks appear before irrelevant ones.
Required inputs
- input: User query
- actual_output: Agent response
- expected_output: Ground truth answer
- retrieval_context: List of retrieved text chunks (order matters)
Example
evaluator = ContextualPrecisionEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is X?",
    actual_output="X is...",
    expected_output="X is the correct definition.",
    retrieval_context=[
        "Irrelevant info",  # Bad: irrelevant first
        "X is the definition",  # Good: relevant
    ],
)
print(result["score"])  # Lower due to poor ranking
Attributes:
| Name | Type | Description |
|---|---|---|
| _include_reason |  | Whether to include reasoning in results. |
Initialize Contextual Precision evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_config | DeepEvalModelConfig \| None | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | None |
| threshold | float | Pass/fail score threshold (0.0-1.0). Default: 0.5. | 0.5 |
| include_reason | bool | Whether to include reasoning in results. Default: True. | True |
| timeout | float \| None | Evaluation timeout in seconds. Default: 60.0. | 60.0 |
| retry_config | RetryConfig \| None | Retry configuration for transient failures. | None |
| observability_config | TracingConfig \| None | Tracing configuration for span instrumentation. If None, no spans are created. | None |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_precision.py (lines 66-101)
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Evaluation parameters (query, response, context, ground_truth, etc.) | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| TimeoutError | If evaluation exceeds timeout |
| EvaluationError | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95
Source code in src/holodeck/lib/evaluators/base.py (lines 262-305)
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| ParamSpec | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py (lines 121-128)
ContextualRecallEvaluator¶
Measures retrieval completeness -- whether the context contains all facts needed to produce the expected output.
ContextualRecallEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None, observability_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Recall evaluator for RAG pipelines.
Measures retrieval completeness against expected output. Evaluates whether retrieval context contains all facts needed to produce the expected output.
Required inputs
- input: User query
- actual_output: Agent response
- expected_output: Ground truth answer
- retrieval_context: List of retrieved text chunks
Example
evaluator = ContextualRecallEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="List all features",
    actual_output="Features are A and B",
    expected_output="Features are A, B, and C",
    retrieval_context=["Feature A: ...", "Feature B: ..."],
)
print(result["score"])  # ~0.67 (missing Feature C)
Attributes:
| Name | Type | Description |
|---|---|---|
| _include_reason |  | Whether to include reasoning in results. |
Initialize Contextual Recall evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_config | DeepEvalModelConfig \| None | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | None |
| threshold | float | Pass/fail score threshold (0.0-1.0). Default: 0.5. | 0.5 |
| include_reason | bool | Whether to include reasoning in results. Default: True. | True |
| timeout | float \| None | Evaluation timeout in seconds. Default: 60.0. | 60.0 |
| retry_config | RetryConfig \| None | Retry configuration for transient failures. | None |
| observability_config | TracingConfig \| None | Tracing configuration for span instrumentation. If None, no spans are created. | None |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_recall.py (lines 64-99)
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| **kwargs | Any | Evaluation parameters (query, response, context, ground_truth, etc.) | {} |
Returns:
| Type | Description |
|---|---|
| dict[str, Any] | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| TimeoutError | If evaluation exceeds timeout |
| EvaluationError | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95
Source code in src/holodeck/lib/evaluators/base.py (lines 262-305)
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| ParamSpec | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py (lines 121-128)
RAG Metrics Usage¶
from holodeck.lib.evaluators.deepeval import (
FaithfulnessEvaluator,
AnswerRelevancyEvaluator,
ContextualRelevancyEvaluator,
ContextualPrecisionEvaluator,
ContextualRecallEvaluator,
DeepEvalModelConfig,
)
config = DeepEvalModelConfig() # Default: Ollama with gpt-oss:20b
# Faithfulness (hallucination detection)
faithfulness = FaithfulnessEvaluator(model_config=config, threshold=0.8)
result = await faithfulness.evaluate(
input="What are the store hours?",
actual_output="Store is open 24/7.",
retrieval_context=["Store hours: Mon-Fri 9am-5pm"],
)
print(result["score"]) # Low score -- hallucination detected
# Answer Relevancy (no retrieval_context needed)
relevancy = AnswerRelevancyEvaluator(model_config=config, threshold=0.7)
result = await relevancy.evaluate(
input="What is the return policy?",
actual_output="We offer 30-day returns at no extra cost.",
)
# Contextual Precision (ranking quality)
precision = ContextualPrecisionEvaluator(model_config=config, threshold=0.7)
result = await precision.evaluate(
input="What is X?",
actual_output="X is a programming concept.",
expected_output="X is a well-known programming paradigm.",
retrieval_context=["X is a programming paradigm.", "Unrelated info"],
)
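The remaining two RAG evaluators follow the same pattern; a short sketch mirroring the docstring examples shown earlier in this section:
# Contextual Relevancy (proportion of relevant chunks)
ctx_relevancy = ContextualRelevancyEvaluator(model_config=config, threshold=0.6)
result = await ctx_relevancy.evaluate(
    input="What is the pricing?",
    actual_output="Basic plan is $10/month.",
    retrieval_context=["Pricing: Basic $10, Pro $25", "Company founded in 2020"],
)
# Contextual Recall (retrieval completeness against expected_output)
ctx_recall = ContextualRecallEvaluator(model_config=config, threshold=0.6)
result = await ctx_recall.evaluate(
    input="List all features",
    actual_output="Features are A and B",
    expected_output="Features are A, B, and C",
    retrieval_context=["Feature A: ...", "Feature B: ..."],
)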
RAG YAML Configuration¶
evaluations:
model:
provider: openai
name: gpt-4o
metrics:
- type: rag
metric_type: faithfulness
threshold: 0.8
include_reason: true
- type: rag
metric_type: answer_relevancy
threshold: 0.7
- type: rag
metric_type: contextual_relevancy
threshold: 0.6
- type: rag
metric_type: contextual_precision
threshold: 0.7
- type: rag
metric_type: contextual_recall
threshold: 0.6
RAG Metrics Summary¶
| Evaluator | Required Params | Requires retrieval_context | Measures |
|---|---|---|---|
| FaithfulnessEvaluator | input, actual_output, retrieval_context | Yes | Hallucination detection |
| AnswerRelevancyEvaluator | input, actual_output | No | Response relevance to query |
| ContextualRelevancyEvaluator | input, actual_output, retrieval_context | Yes | Chunk relevance to query |
| ContextualPrecisionEvaluator | input, actual_output, expected_output, retrieval_context | Yes | Ranking quality of chunks |
| ContextualRecallEvaluator | input, actual_output, expected_output, retrieval_context | Yes | Retrieval completeness |
Complete Agent Configuration Example¶
name: customer-support-agent
model:
provider: openai
name: gpt-4o
evaluations:
model:
provider: openai
name: gpt-4o
temperature: 0.0
metrics:
# Standard NLP metrics (no LLM required)
- type: standard
metric: bleu
threshold: 0.4
- type: standard
metric: rouge
threshold: 0.5
# Custom G-Eval criteria
- type: geval
name: Helpfulness
criteria: "Evaluate if the response provides actionable, helpful information"
evaluation_params: [actual_output, input]
threshold: 0.7
# RAG evaluation
- type: rag
metric_type: faithfulness
threshold: 0.8
include_reason: true
test_cases:
- name: "Refund policy question"
input: "What is your refund policy?"
ground_truth: "We offer a 30-day money-back guarantee on all products."
retrieval_context:
- "Refund Policy: All products come with a 30-day money-back guarantee."
- "Returns must be initiated within 30 days of purchase."
- name: "Product recommendation"
input: "I need a laptop for video editing"
expected_tools: [search_products, get_specifications]
evaluations:
- type: geval
name: TechnicalAccuracy
criteria: "Verify the response contains accurate technical specifications"
threshold: 0.8
Run tests with:
holodeck test agent.yaml --verbose --output report.md --format markdown