
Evaluation Framework API

HoloDeck provides a flexible evaluation framework for measuring agent response quality. The framework supports three tiers of metrics:

  1. Standard NLP Metrics -- Traditional text-comparison metrics (BLEU, ROUGE, METEOR) that require no LLM
  2. Azure AI Metrics -- AI-assisted quality metrics via Azure AI Evaluation SDK (groundedness, relevance, coherence, fluency, similarity)
  3. DeepEval Metrics -- LLM-as-a-judge evaluation with multi-provider support (G-Eval custom criteria, RAG pipeline metrics)

All evaluators share a common base class with retry logic, timeout handling, and a unified parameter specification system.


Architecture Overview

BaseEvaluator (base.py)
├── BLEUEvaluator (nlp_metrics.py)
├── ROUGEEvaluator (nlp_metrics.py)
├── METEOREvaluator (nlp_metrics.py)
├── AzureAIEvaluator (azure_ai.py)
│   ├── GroundednessEvaluator
│   ├── RelevanceEvaluator
│   ├── CoherenceEvaluator
│   ├── FluencyEvaluator
│   └── SimilarityEvaluator
└── DeepEvalBaseEvaluator (deepeval/base.py)
    ├── GEvalEvaluator (deepeval/geval.py)
    ├── FaithfulnessEvaluator (deepeval/faithfulness.py)
    ├── AnswerRelevancyEvaluator (deepeval/answer_relevancy.py)
    ├── ContextualRelevancyEvaluator (deepeval/contextual_relevancy.py)
    ├── ContextualPrecisionEvaluator (deepeval/contextual_precision.py)
    └── ContextualRecallEvaluator (deepeval/contextual_recall.py)

Configuration Models

Evaluation metrics are configured in agent.yaml using Pydantic models from holodeck.models.evaluation. The metrics list uses a discriminated union on the type field (standard, geval, or rag).

YAML Configuration Example

evaluations:
  model:                          # Default LLM for all LLM-based metrics
    provider: openai
    name: gpt-4o
    temperature: 0.0
  metrics:
    # Standard NLP metric (no LLM required)
    - type: standard
      metric: bleu
      threshold: 0.4

    # G-Eval custom criteria (LLM-as-judge)
    - type: geval
      name: Helpfulness
      criteria: "Evaluate if the response provides actionable information"
      evaluation_params: [actual_output, input]
      threshold: 0.7

    # RAG pipeline metric
    - type: rag
      metric_type: faithfulness
      threshold: 0.8
      include_reason: true

EvaluationConfig

EvaluationConfig

Bases: BaseModel

Evaluation framework configuration.

Container for evaluation metrics with optional default model configuration. Supports standard EvaluationMetric, GEvalMetric (custom criteria), and RAGMetric (RAG pipeline evaluation).

validate_metrics(v) classmethod

Validate metrics list is not empty.

Source code in src/holodeck/models/evaluation.py
@field_validator("metrics")
@classmethod
def validate_metrics(
    cls, v: list[EvaluationMetric | GEvalMetric | RAGMetric | CodeMetric]
) -> list[EvaluationMetric | GEvalMetric | RAGMetric | CodeMetric]:
    """Validate metrics list is not empty."""
    if not v:
        raise ValueError("metrics must have at least one metric")
    return v

MetricType

The discriminated union that routes to the correct metric model based on the type field:

MetricType = Annotated[
    EvaluationMetric | GEvalMetric | RAGMetric,
    Field(discriminator="type"),
]
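
As a rough sketch of how this union behaves (field names taken from the YAML example above; assuming the models expose Pydantic v2's model_validate), the type field alone decides which concrete model each metrics entry becomes:

from holodeck.models.evaluation import EvaluationConfig

config = EvaluationConfig.model_validate({
    "metrics": [
        {"type": "standard", "metric": "bleu", "threshold": 0.4},
        {
            "type": "geval",
            "name": "Helpfulness",
            "criteria": "Evaluate if the response provides actionable information",
            "evaluation_params": ["actual_output", "input"],
            "threshold": 0.7,
        },
        {"type": "rag", "metric_type": "faithfulness", "threshold": 0.8},
    ]
})

for metric in config.metrics:
    # Discriminator routing: EvaluationMetric, GEvalMetric, RAGMetric
    print(type(metric).__name__)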

EvaluationMetric

Standard metric configuration (type: standard).

EvaluationMetric

Bases: BaseModel

Evaluation metric configuration.

Represents a single evaluation metric with flexible model configuration, including per-metric LLM model overrides.

validate_custom_prompt(v) classmethod

Validate custom_prompt is not empty if provided.

Source code in src/holodeck/models/evaluation.py
@field_validator("custom_prompt")
@classmethod
def validate_custom_prompt(cls, v: str | None) -> str | None:
    """Validate custom_prompt is not empty if provided."""
    if v is not None and (not v or not v.strip()):
        raise ValueError("custom_prompt must be non-empty if provided")
    return v

validate_enabled(v) classmethod

Validate enabled is boolean.

Source code in src/holodeck/models/evaluation.py
@field_validator("enabled")
@classmethod
def validate_enabled(cls, v: bool) -> bool:
    """Validate enabled is boolean."""
    if not isinstance(v, bool):
        raise ValueError("enabled must be boolean")
    return v

validate_fail_on_error(v) classmethod

Validate fail_on_error is boolean.

Source code in src/holodeck/models/evaluation.py
@field_validator("fail_on_error")
@classmethod
def validate_fail_on_error(cls, v: bool) -> bool:
    """Validate fail_on_error is boolean."""
    if not isinstance(v, bool):
        raise ValueError("fail_on_error must be boolean")
    return v

validate_retry_on_failure(v) classmethod

Validate retry_on_failure is in valid range.

Source code in src/holodeck/models/evaluation.py
@field_validator("retry_on_failure")
@classmethod
def validate_retry_on_failure(cls, v: int | None) -> int | None:
    """Validate retry_on_failure is in valid range."""
    if v is not None and (v < 1 or v > 3):
        raise ValueError("retry_on_failure must be between 1 and 3")
    return v

validate_scale(v) classmethod

Validate scale is positive.

Source code in src/holodeck/models/evaluation.py
@field_validator("scale")
@classmethod
def validate_scale(cls, v: int | None) -> int | None:
    """Validate scale is positive."""
    if v is not None and v <= 0:
        raise ValueError("scale must be positive")
    return v

validate_threshold(v) classmethod

Validate threshold is numeric if provided.

Source code in src/holodeck/models/evaluation.py
@field_validator("threshold")
@classmethod
def validate_threshold(cls, v: float | None) -> float | None:
    """Validate threshold is numeric if provided."""
    if v is not None and not isinstance(v, int | float):
        raise ValueError("threshold must be numeric")
    return v

validate_timeout_ms(v) classmethod

Validate timeout_ms is positive.

Source code in src/holodeck/models/evaluation.py
@field_validator("timeout_ms")
@classmethod
def validate_timeout_ms(cls, v: int | None) -> int | None:
    """Validate timeout_ms is positive."""
    if v is not None and v <= 0:
        raise ValueError("timeout_ms must be positive")
    return v

GEvalMetric

G-Eval custom criteria configuration (type: geval).

GEvalMetric

Bases: BaseModel

G-Eval custom criteria metric configuration.

Uses discriminator pattern with type="geval" to distinguish from standard EvaluationMetric instances in a discriminated union.

G-Eval enables custom evaluation criteria defined in natural language, using chain-of-thought prompting with LLM-based scoring.

Example

metric = GEvalMetric(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7
)

validate_criteria(v) classmethod

Validate criteria is not empty.

Source code in src/holodeck/models/evaluation.py
@field_validator("criteria")
@classmethod
def validate_criteria(cls, v: str) -> str:
    """Validate criteria is not empty."""
    if not v or not v.strip():
        raise ValueError("criteria must be a non-empty string")
    return v

validate_evaluation_params(v) classmethod

Validate evaluation_params contains valid values.

Source code in src/holodeck/models/evaluation.py
@field_validator("evaluation_params")
@classmethod
def validate_evaluation_params(cls, v: list[str]) -> list[str]:
    """Validate evaluation_params contains valid values."""
    if not v:
        raise ValueError("evaluation_params must not be empty")
    invalid_params = set(v) - VALID_EVALUATION_PARAMS
    if invalid_params:
        raise ValueError(
            f"Invalid evaluation_params: {sorted(invalid_params)}. "
            f"Valid options: {sorted(VALID_EVALUATION_PARAMS)}"
        )
    return v

validate_name(v) classmethod

Validate name is not empty.

Source code in src/holodeck/models/evaluation.py
@field_validator("name")
@classmethod
def validate_name(cls, v: str) -> str:
    """Validate name is not empty."""
    if not v or not v.strip():
        raise ValueError("name must be a non-empty string")
    return v

validate_threshold(v) classmethod

Validate threshold is in valid range.

Source code in src/holodeck/models/evaluation.py
@field_validator("threshold")
@classmethod
def validate_threshold(cls, v: float | None) -> float | None:
    """Validate threshold is in valid range."""
    if v is not None and (v < 0.0 or v > 1.0):
        raise ValueError("threshold must be between 0.0 and 1.0")
    return v

RAGMetric

RAG pipeline metric configuration (type: rag).

RAGMetric

Bases: BaseModel

RAG pipeline evaluation metric configuration.

Uses discriminator pattern with type="rag" to distinguish from standard EvaluationMetric and GEvalMetric instances in a discriminated union.

RAG metrics evaluate the quality of retrieval-augmented generation pipelines:

- Faithfulness: Detects hallucinations by comparing response to context
- ContextualRelevancy: Measures relevance of retrieved chunks to query
- ContextualPrecision: Evaluates ranking quality of retrieved chunks
- ContextualRecall: Measures retrieval completeness against expected output

Example

metric = RAGMetric(
    metric_type=RAGMetricType.FAITHFULNESS,
    threshold=0.8
)

validate_threshold(v) classmethod

Validate threshold is in valid range.

Source code in src/holodeck/models/evaluation.py
@field_validator("threshold")
@classmethod
def validate_threshold(cls, v: float) -> float:
    """Validate threshold is in valid range."""
    if v < 0.0 or v > 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return v

RAGMetricType

RAGMetricType

Bases: str, Enum

RAG pipeline evaluation metric types.

These metrics evaluate the quality of Retrieval-Augmented Generation (RAG) pipelines by assessing various aspects of retrieval and response generation.


Base Framework

All evaluators inherit from BaseEvaluator, which provides retry logic with exponential backoff, timeout handling, and a parameter specification system.

BaseEvaluator

BaseEvaluator(timeout=60.0, retry_config=None)

Bases: ABC

Abstract base class for all evaluation metrics.

This class provides retry logic, timeout handling, and a common interface for all evaluators (AI-assisted and NLP metrics).

Attributes:

Name Type Description
timeout

Timeout in seconds for evaluation (default: 60s, None for no timeout)

retry_config

Configuration for retry logic with exponential backoff

name str

Evaluator name (defaults to class name)

PARAM_SPEC ParamSpec

Class attribute declaring required/optional parameters

Example

class MyEvaluator(BaseEvaluator):
    PARAM_SPEC = ParamSpec(
        required=frozenset({EvalParam.RESPONSE, EvalParam.QUERY})
    )

    async def _evaluate_impl(self, **kwargs):
        return {"score": 0.85, "passed": True}

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(query="test", response="answer")

Initialize base evaluator.

Parameters:

Name Type Description Default
timeout float | None

Timeout in seconds (None for no timeout)

60.0
retry_config RetryConfig | None

Retry configuration (uses defaults if not provided)

None
Source code in src/holodeck/lib/evaluators/base.py
def __init__(
    self,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize base evaluator.

    Args:
        timeout: Timeout in seconds (None for no timeout)
        retry_config: Retry configuration (uses defaults if not provided)
    """
    self.timeout = timeout
    self.retry_config = retry_config or RetryConfig()

    logger.debug(
        f"Evaluator initialized: {self.name}, timeout={timeout}s, "
        f"max_retries={self.retry_config.max_retries}"
    )

name property

Return evaluator name (class name by default).

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

Name Type Description Default
**kwargs Any

Evaluation parameters (query, response, context, ground_truth, etc.)

{}

Returns:

Type Description
dict[str, Any]

Evaluation result dictionary

Raises:

Type Description
TimeoutError

If evaluation exceeds timeout

EvaluationError

If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

Type Description
ParamSpec

ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

RetryConfig

RetryConfig

Bases: BaseModel

Configuration for retry logic with exponential backoff.

Attributes:

Name Type Description
max_retries int

Maximum number of retry attempts (default: 3)

base_delay float

Base delay in seconds for exponential backoff (default: 2.0)

max_delay float

Maximum delay between retries in seconds (default: 60.0)

exponential_base float

Exponential base for backoff calculation (default: 2.0)

Example

config = RetryConfig(max_retries=3, base_delay=2.0)

Delays will be: 2.0s, 4.0s, 8.0s
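
The delay sequence above follows a capped exponential-backoff formula; a minimal sketch of that calculation (an illustration of the documented defaults, not the library's internal code):

def backoff_delay(
    attempt: int,
    base_delay: float = 2.0,
    exponential_base: float = 2.0,
    max_delay: float = 60.0,
) -> float:
    """Delay in seconds before retry number `attempt` (1-based), capped at max_delay."""
    return min(base_delay * exponential_base ** (attempt - 1), max_delay)

print([backoff_delay(n) for n in (1, 2, 3)])  # [2.0, 4.0, 8.0]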

validate_delays(v) classmethod

Validate delays are positive.

Source code in src/holodeck/lib/evaluators/base.py
@field_validator("base_delay", "max_delay")
@classmethod
def validate_delays(cls, v: float) -> float:
    """Validate delays are positive."""
    if v <= 0:
        raise ValueError("Delays must be positive")
    return v

validate_max_retries(v) classmethod

Validate max_retries is non-negative.

Source code in src/holodeck/lib/evaluators/base.py
@field_validator("max_retries")
@classmethod
def validate_max_retries(cls, v: int) -> int:
    """Validate max_retries is non-negative."""
    if v < 0:
        raise ValueError("max_retries must be non-negative")
    return v

EvaluationError

EvaluationError

Bases: Exception

Exception raised when evaluation fails after all retry attempts.


Parameter Specification

The param_spec module defines a standard way for evaluators to declare their required and optional inputs. This enables the test runner to validate inputs before calling the evaluator.

EvalParam

EvalParam

Bases: str, Enum

Standard evaluation parameter names.

Two naming conventions are supported:

- Azure AI / NLP: RESPONSE, QUERY, GROUND_TRUTH
- DeepEval: ACTUAL_OUTPUT, INPUT, EXPECTED_OUTPUT

Both conventions share CONTEXT and RETRIEVAL_CONTEXT.

ParamSpec

ParamSpec

Bases: NamedTuple

Parameter specification for an evaluator.

Declares which parameters an evaluator requires and optionally accepts, plus flags for special context handling.

Attributes:

Name Type Description
required frozenset[EvalParam]

Parameters that must be provided for evaluation.

optional frozenset[EvalParam]

Parameters that may be provided but aren't required.

uses_context bool

Whether file content should be passed as context.

uses_retrieval_context bool

Whether retrieval context from tools is needed.

Example

spec = ParamSpec(
    required=frozenset({EvalParam.RESPONSE, EvalParam.QUERY}),
    optional=frozenset({EvalParam.CONTEXT}),
    uses_context=True,
)

uses_deepeval_params()

Check if this spec uses DeepEval parameter naming convention.

Returns:

Type Description
bool

True if any required or optional param is a DeepEval param.

Source code in src/holodeck/lib/evaluators/param_spec.py
def uses_deepeval_params(self) -> bool:
    """Check if this spec uses DeepEval parameter naming convention.

    Returns:
        True if any required or optional param is a DeepEval param.
    """
    all_params = self.required | self.optional
    return bool(all_params & DEEPEVAL_PARAMS)

DEEPEVAL_PARAMS

A frozenset of DeepEval-specific parameter names (INPUT, ACTUAL_OUTPUT, EXPECTED_OUTPUT) used by ParamSpec.uses_deepeval_params() to detect the DeepEval naming convention.

DEEPEVAL_PARAMS = frozenset(
    {EvalParam.INPUT, EvalParam.ACTUAL_OUTPUT, EvalParam.EXPECTED_OUTPUT}
)

Two naming conventions exist side-by-side:

Convention Query Response Reference
Azure AI / NLP query response ground_truth
DeepEval input actual_output expected_output

Both conventions share context and retrieval_context.
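
As a sketch of how a caller such as the test runner might use these specs to validate inputs before invoking an evaluator (the check_params helper below is hypothetical, not part of the framework, and it assumes EvalParam values match the lowercase parameter names above):

from holodeck.lib.evaluators.base import BaseEvaluator
from holodeck.lib.evaluators.nlp_metrics import BLEUEvaluator

def check_params(evaluator_cls: type[BaseEvaluator], **kwargs) -> None:
    """Raise if any parameter required by the evaluator's ParamSpec is missing."""
    spec = evaluator_cls.get_param_spec()
    missing = {param.value for param in spec.required} - kwargs.keys()
    if missing:
        raise ValueError(f"Missing required parameters: {sorted(missing)}")

# NLP evaluators declare response and ground_truth as required
check_params(BLEUEvaluator, response="answer", ground_truth="reference")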


Standard NLP Metrics

Traditional text-comparison metrics that do not require an LLM. All NLP evaluators require response and ground_truth parameters.

BLEUEvaluator

Uses SacreBLEU with exponential smoothing. Scores are normalized from SacreBLEU's 0--100 scale to 0.0--1.0.

BLEUEvaluator(threshold=None, timeout=60.0, **kwargs)

Bases: BaseEvaluator

BLEU score evaluator using SacreBLEU with smoothing.

BLEU (Bilingual Evaluation Understudy) measures precision of n-gram matches between prediction and reference text. Uses SacreBLEU with exponential smoothing to handle short sentences and avoid zero scores when there are no 4-gram matches.

Score Range: 0.0-1.0 (normalized from SacreBLEU's 0-100 scale) Higher scores indicate better match to reference text.

Attributes:

Name Type Description
threshold

Minimum passing score (0.0-1.0)

timeout

Timeout in seconds for evaluation

retry_config

Retry configuration for transient failures

Example

evaluator = BLEUEvaluator(threshold=0.5)
result = await evaluator.evaluate(
    response="The cat sat on the mat",
    ground_truth="The cat is on the mat"
)
print(result["bleu"])    # 0.0-1.0
print(result["passed"])  # True if >= threshold

Initialize BLEU evaluator.

Parameters:

Name Type Description Default
threshold float | None

Minimum passing score (0.0-1.0)

None
timeout float | None

Timeout in seconds

60.0
**kwargs Any

Additional arguments passed to BaseEvaluator

{}
Source code in src/holodeck/lib/evaluators/nlp_metrics.py
def __init__(
    self,
    threshold: float | None = None,
    timeout: float | None = 60.0,
    **kwargs: Any,
) -> None:
    """Initialize BLEU evaluator.

    Args:
        threshold: Minimum passing score (0.0-1.0)
        timeout: Timeout in seconds
        **kwargs: Additional arguments passed to BaseEvaluator
    """
    super().__init__(timeout=timeout, **kwargs)
    self.threshold = threshold

The name property, evaluate(), and get_param_spec() are inherited from BaseEvaluator; see Base Framework above.

ROUGEEvaluator

Returns all three ROUGE variants (rouge1, rouge2, rougeL). The variant parameter controls which variant is used for the threshold check.

ROUGEEvaluator(threshold=None, variant='rougeL', timeout=60.0, **kwargs)

Bases: BaseEvaluator

ROUGE score evaluator.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures recall of n-gram overlaps between prediction and reference. Commonly used for summarization evaluation.

Variants:

- ROUGE-1: Unigram overlap
- ROUGE-2: Bigram overlap
- ROUGE-L: Longest common subsequence

Score Range: 0.0-1.0 (F1 score) Higher scores indicate better recall of reference text.

Attributes:

Name Type Description
threshold

Minimum passing score (0.0-1.0)

variant

ROUGE variant to use for threshold check ("rouge1", "rouge2", "rougeL")

timeout

Timeout in seconds for evaluation

retry_config

Retry configuration for transient failures

Example

evaluator = ROUGEEvaluator(threshold=0.6, variant="rougeL")
result = await evaluator.evaluate(
    response="The cat sat on the mat",
    ground_truth="The cat is on the mat"
)
print(result["rouge1"])  # 0.0-1.0
print(result["rouge2"])  # 0.0-1.0
print(result["rougeL"])  # 0.0-1.0
print(result["passed"])  # True if rougeL >= threshold

Initialize ROUGE evaluator.

Parameters:

Name Type Description Default
threshold float | None

Minimum passing score (0.0-1.0)

None
variant str

ROUGE variant for threshold check ("rouge1", "rouge2", "rougeL")

'rougeL'
timeout float | None

Timeout in seconds

60.0
**kwargs Any

Additional arguments passed to BaseEvaluator

{}

Raises:

Type Description
ValueError

If variant is not valid

Source code in src/holodeck/lib/evaluators/nlp_metrics.py
def __init__(
    self,
    threshold: float | None = None,
    variant: str = "rougeL",
    timeout: float | None = 60.0,
    **kwargs: Any,
) -> None:
    """Initialize ROUGE evaluator.

    Args:
        threshold: Minimum passing score (0.0-1.0)
        variant: ROUGE variant for threshold check
            ("rouge1", "rouge2", "rougeL")
        timeout: Timeout in seconds
        **kwargs: Additional arguments passed to BaseEvaluator

    Raises:
        ValueError: If variant is not valid
    """
    super().__init__(timeout=timeout, **kwargs)
    self.threshold = threshold

    valid_variants = {"rouge1", "rouge2", "rougeL"}
    if variant not in valid_variants:
        raise ValueError(f"variant must be one of {valid_variants}, got: {variant}")
    self.variant = variant
    self._metric = None  # Lazy loaded

The name property, evaluate(), and get_param_spec() are inherited from BaseEvaluator; see Base Framework above.

METEOREvaluator

Uses synonym-aware matching and stemming for better correlation with human judgment.

METEOREvaluator(threshold=None, timeout=60.0, **kwargs)

Bases: BaseEvaluator

METEOR score evaluator.

METEOR (Metric for Evaluation of Translation with Explicit ORdering) measures translation quality using synonym matching, stemming, and paraphrase detection. Provides better correlation with human judgment than BLEU.

Score Range: 0.0-1.0 Higher scores indicate better semantic match to reference text.

Attributes:

Name Type Description
threshold

Minimum passing score (0.0-1.0)

timeout

Timeout in seconds for evaluation

retry_config

Retry configuration for transient failures

Example

evaluator = METEOREvaluator(threshold=0.7)
result = await evaluator.evaluate(
    response="The automobile is red",
    ground_truth="The car is red"
)
print(result["meteor"])  # Higher than BLEU due to synonym handling
print(result["passed"])  # True if >= threshold

Initialize METEOR evaluator.

Parameters:

Name Type Description Default
threshold float | None

Minimum passing score (0.0-1.0)

None
timeout float | None

Timeout in seconds

60.0
**kwargs Any

Additional arguments passed to BaseEvaluator

{}
Source code in src/holodeck/lib/evaluators/nlp_metrics.py
def __init__(
    self,
    threshold: float | None = None,
    timeout: float | None = 60.0,
    **kwargs: Any,
) -> None:
    """Initialize METEOR evaluator.

    Args:
        threshold: Minimum passing score (0.0-1.0)
        timeout: Timeout in seconds
        **kwargs: Additional arguments passed to BaseEvaluator
    """
    super().__init__(timeout=timeout, **kwargs)
    self.threshold = threshold
    self._metric = None  # Lazy loaded

The name property, evaluate(), and get_param_spec() are inherited from BaseEvaluator; see Base Framework above.

NLPMetricsError

NLPMetricsError

Bases: EvaluationError

Exception raised when NLP metric computation fails.

NLP Metrics Usage

from holodeck.lib.evaluators.nlp_metrics import BLEUEvaluator, ROUGEEvaluator

bleu = BLEUEvaluator(threshold=0.5)
result = await bleu.evaluate(
    response="The cat sat on the mat",
    ground_truth="The cat is on the mat",
)
print(result["bleu"])    # 0.0-1.0
print(result["passed"])  # True if >= 0.5

rouge = ROUGEEvaluator(threshold=0.6, variant="rougeL")
result = await rouge.evaluate(
    response="The cat sat on the mat",
    ground_truth="The cat is on the mat",
)
print(result["rouge1"], result["rouge2"], result["rougeL"])

NLP Metrics Summary

Metric Score Key Score Range Use Case
BLEUEvaluator bleu 0.0--1.0 Precision-focused n-gram matching
ROUGEEvaluator rouge1, rouge2, rougeL 0.0--1.0 Recall-focused overlap (summarization)
METEOREvaluator meteor 0.0--1.0 Synonym-aware semantic similarity

Azure AI Metrics

AI-assisted quality metrics powered by the Azure AI Evaluation SDK. All Azure evaluators normalize scores from a 1--5 scale to 0.0--1.0.
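
A minimal sketch of that rescaling, assuming the straightforward linear mapping implied by the documented ranges (1 maps to 0.0, 5 maps to 1.0); it is not a quotation of the SDK's internal code:

def normalize_five_point(raw: float) -> float:
    """Map a 1-5 rating onto 0.0-1.0."""
    return (raw - 1.0) / 4.0

print(normalize_five_point(5))  # 1.0
print(normalize_five_point(3))  # 0.5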

ModelConfig

ModelConfig

Bases: BaseModel

Azure OpenAI model configuration for evaluators.

Attributes:

Name Type Description
azure_endpoint str

Azure OpenAI endpoint URL

api_key str

Azure OpenAI API key

azure_deployment str

Azure deployment name (e.g., "gpt-4o", "gpt-4o-mini")

api_version str

Azure OpenAI API version (default: "2024-02-15-preview")

Example

config = ModelConfig(
    azure_endpoint="https://my-resource.openai.azure.com/",
    api_key="my-api-key",
    azure_deployment="gpt-4o"
)
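
In practice the endpoint and key usually come from the environment; a hedged sketch (the import path and the environment-variable names are assumptions, not part of the documented API):

import os

from holodeck.lib.evaluators.azure_ai import ModelConfig  # assumed module path

config = ModelConfig(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-4o-mini"),
)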

AzureAIEvaluator

AzureAIEvaluator(model_config, timeout=60.0, retry_config=None)

Bases: BaseEvaluator

Base class for Azure AI Evaluation SDK evaluators.

Provides common functionality for all Azure AI evaluators:

- Model configuration
- Retry logic with exponential backoff
- Timeout handling
- Score normalization (5-point scale to 0-1)

Attributes:

Name Type Description
model_config

Azure OpenAI model configuration

timeout

Timeout in seconds (default: 60s)

retry_config

Retry configuration with exponential backoff

Example

config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o-mini"
)
evaluator = RelevanceEvaluator(model_config=config)
result = await evaluator.evaluate(query="test", response="answer")

Initialize Azure AI evaluator.

Parameters:

Name Type Description Default
model_config ModelConfig

Azure OpenAI model configuration

required
timeout float | None

Timeout in seconds (default: 60s, None for no timeout)

60.0
retry_config RetryConfig | None

Retry configuration (uses defaults if not provided)

None
Source code in src/holodeck/lib/evaluators/azure_ai.py
def __init__(
    self,
    model_config: ModelConfig,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize Azure AI evaluator.

    Args:
        model_config: Azure OpenAI model configuration
        timeout: Timeout in seconds (default: 60s, None for no timeout)
        retry_config: Retry configuration (uses defaults if not provided)
    """
    super().__init__(timeout=timeout, retry_config=retry_config)
    self.model_config = model_config

The name property, evaluate(), and get_param_spec() are inherited from BaseEvaluator; see Base Framework above.

GroundednessEvaluator

Assesses whether all claims in the response are supported by the provided context. A more capable model (e.g., gpt-4o) is recommended for this critical metric.

GroundednessEvaluator(model_config, timeout=60.0, retry_config=None)

Bases: AzureAIEvaluator

Groundedness evaluator using Azure AI Evaluation SDK.

Assesses correspondence between claims in AI-generated answers and source context. Measures factual accuracy by verifying that all claims in the response are supported by the provided context.

Query parameter is optional but recommended for better accuracy.

Scale: 1-5 (normalized to 0.0-1.0)

Example

config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o"  # Use expensive model for critical metric
)
evaluator = GroundednessEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="What is the capital?",
    response="The capital is Paris.",
    context="France's capital is Paris."
)
print(result["score"])  # 0.0-1.0 (e.g., 0.95)

Initialization, the name property, evaluate(), and get_param_spec() are inherited from AzureAIEvaluator and BaseEvaluator; see the sections above.

RelevanceEvaluator

Measures whether the response directly addresses the user's question.

RelevanceEvaluator(model_config, timeout=60.0, retry_config=None)

Bases: AzureAIEvaluator

Relevance evaluator using Azure AI Evaluation SDK.

Measures relevance of response to query. Assesses whether the response directly addresses the user's question or request.

Scale: 1-5 (normalized to 0.0-1.0)

Example

config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o"  # Critical metric
)
evaluator = RelevanceEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="What is ML?",
    response="ML is machine learning, a subset of AI."
)

Initialization, the name property, evaluate(), and get_param_spec() are inherited from AzureAIEvaluator and BaseEvaluator; see the sections above.

CoherenceEvaluator

Evaluates logical flow and readability of the response.

CoherenceEvaluator(model_config, timeout=60.0, retry_config=None)

Bases: AzureAIEvaluator

Coherence evaluator using Azure AI Evaluation SDK.

Evaluates logical flow and readability. Measures how well the response is organized and whether ideas connect logically.

Scale: 1-5 (normalized to 0.0-1.0)

Example

config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o-mini"  # Less critical metric
)
evaluator = CoherenceEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="Explain X",
    response="X is... Furthermore... In conclusion..."
)

Initialization, the name property, evaluate(), and get_param_spec() are inherited from AzureAIEvaluator and BaseEvaluator; see the sections above.

FluencyEvaluator

Assesses grammar, spelling, punctuation, word choice, and sentence structure.

FluencyEvaluator(model_config, timeout=60.0, retry_config=None)

Bases: AzureAIEvaluator

Fluency evaluator using Azure AI Evaluation SDK.

Assesses language quality. Measures grammar, spelling, punctuation, word choice, and sentence structure.

Scale: 1-5 (normalized to 0.0-1.0)

Example

config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o-mini"  # Less critical metric
)
evaluator = FluencyEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="Test",
    response="This is a well-written response."
)

Initialize Azure AI evaluator.

Parameters:

- model_config (ModelConfig) -- Azure OpenAI model configuration. Required.
- timeout (float | None) -- Timeout in seconds (default: 60s, None for no timeout). Default: 60.0
- retry_config (RetryConfig | None) -- Retry configuration (uses defaults if not provided). Default: None
Source code in src/holodeck/lib/evaluators/azure_ai.py
def __init__(
    self,
    model_config: ModelConfig,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize Azure AI evaluator.

    Args:
        model_config: Azure OpenAI model configuration
        timeout: Timeout in seconds (default: 60s, None for no timeout)
        retry_config: Retry configuration (uses defaults if not provided)
    """
    super().__init__(timeout=timeout, retry_config=retry_config)
    self.model_config = model_config

name property

Return evaluator name (class name by default).

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

- **kwargs (Any) -- Evaluation parameters (query, response, context, ground_truth, etc.). Default: {}

Returns:

- dict[str, Any] -- Evaluation result dictionary

Raises:

- TimeoutError -- If evaluation exceeds timeout
- EvaluationError -- If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

- ParamSpec -- ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

SimilarityEvaluator

Compares semantic similarity between response and ground truth.

SimilarityEvaluator(model_config, timeout=60.0, retry_config=None)

Bases: AzureAIEvaluator

Similarity evaluator using Azure AI Evaluation SDK.

Compares semantic similarity between response and ground truth. Measures how closely the response matches the expected answer.

Requires ground_truth parameter.

Scale: 1-5 (normalized to 0.0-1.0)

Example

config = ModelConfig(
    azure_endpoint="https://test.openai.azure.com/",
    api_key="key",
    azure_deployment="gpt-4o-mini",
)
evaluator = SimilarityEvaluator(model_config=config)
result = await evaluator.evaluate(
    query="What is 2+2?",
    response="The answer is 4.",
    ground_truth="2+2 equals 4.",
)

Initialize Azure AI evaluator.

Parameters:

- model_config (ModelConfig) -- Azure OpenAI model configuration. Required.
- timeout (float | None) -- Timeout in seconds (default: 60s, None for no timeout). Default: 60.0
- retry_config (RetryConfig | None) -- Retry configuration (uses defaults if not provided). Default: None
Source code in src/holodeck/lib/evaluators/azure_ai.py
def __init__(
    self,
    model_config: ModelConfig,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize Azure AI evaluator.

    Args:
        model_config: Azure OpenAI model configuration
        timeout: Timeout in seconds (default: 60s, None for no timeout)
        retry_config: Retry configuration (uses defaults if not provided)
    """
    super().__init__(timeout=timeout, retry_config=retry_config)
    self.model_config = model_config

name property

Return evaluator name (class name by default).

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

- **kwargs (Any) -- Evaluation parameters (query, response, context, ground_truth, etc.). Default: {}

Returns:

- dict[str, Any] -- Evaluation result dictionary

Raises:

- TimeoutError -- If evaluation exceeds timeout
- EvaluationError -- If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

- ParamSpec -- ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

Azure AI Usage

from holodeck.lib.evaluators.azure_ai import (
    ModelConfig,
    GroundednessEvaluator,
    RelevanceEvaluator,
)

config = ModelConfig(
    azure_endpoint="https://my-resource.openai.azure.com/",
    api_key="my-api-key",
    azure_deployment="gpt-4o",
)

groundedness = GroundednessEvaluator(model_config=config)
result = await groundedness.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe. Its capital is Paris.",
)
print(result["score"])          # 0.0-1.0 (normalized from 1-5)
print(result["groundedness"])   # Raw 1-5 score
print(result["reasoning"])      # LLM explanation

Azure AI Metrics Summary

| Evaluator | Required Params | Optional Params | Score Key |
| --- | --- | --- | --- |
| GroundednessEvaluator | response, context | query | groundedness |
| RelevanceEvaluator | response, query | context | relevance |
| CoherenceEvaluator | response, query | -- | coherence |
| FluencyEvaluator | response, query | -- | fluency |
| SimilarityEvaluator | response, query, ground_truth | -- | similarity |
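Every Azure AI evaluator shares the constructor and evaluate() interface shown above. As a minimal sketch (the endpoint, key, and deployment values are placeholders), CoherenceEvaluator needs only query and response:

from holodeck.lib.evaluators.azure_ai import ModelConfig, CoherenceEvaluator

config = ModelConfig(
    azure_endpoint="https://my-resource.openai.azure.com/",  # placeholder
    api_key="my-api-key",                                    # placeholder
    azure_deployment="gpt-4o",
)

coherence = CoherenceEvaluator(model_config=config)
result = await coherence.evaluate(
    query="Explain the return policy.",
    response="You can return items within 30 days for a full refund.",
)
print(result["score"])      # 0.0-1.0 (normalized from 1-5)
print(result["coherence"])  # Raw 1-5 score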

DeepEval Metrics

LLM-as-a-judge evaluation with multi-provider support (OpenAI, Azure OpenAI, Anthropic, Ollama). DeepEval metrics use a different parameter naming convention (input, actual_output, expected_output) but HoloDeck's DeepEvalBaseEvaluator also accepts Azure/NLP aliases (query, response, ground_truth).
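As a minimal sketch of the alias support (assuming AnswerRelevancyEvaluator is exported from the same holodeck.lib.evaluators.deepeval package as GEvalEvaluator, and using the default Ollama judge), the two calls below are equivalent:

from holodeck.lib.evaluators.deepeval import AnswerRelevancyEvaluator

evaluator = AnswerRelevancyEvaluator(threshold=0.7)

# DeepEval-native parameter names
result = await evaluator.evaluate(
    input="What is the return policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)

# Equivalent call using the Azure/NLP aliases
result = await evaluator.evaluate(
    query="What is the return policy?",
    response="We offer a 30-day full refund at no extra cost.",
)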

DeepEvalModelConfig

DeepEvalModelConfig

Bases: BaseModel

Configuration adapter for DeepEval model classes.

This class bridges HoloDeck's LLMProvider configuration to DeepEval's native model classes (GPTModel, AzureOpenAIModel, AnthropicModel, OllamaModel).

The default configuration uses Ollama with gpt-oss:20b for local evaluation without requiring API keys.

Attributes:

- provider (ProviderEnum) -- LLM provider to use (defaults to Ollama)
- model_name (str) -- Name of the model (defaults to gpt-oss:20b)
- api_key (str | None) -- API key for cloud providers (not required for Ollama)
- endpoint (str | None) -- API endpoint URL (required for Azure, optional for Ollama)
- api_version (str | None) -- Azure OpenAI API version
- deployment_name (str | None) -- Azure OpenAI deployment name
- temperature (float) -- Temperature for generation (defaults to 0.0 for determinism)

API Key Behavior
  • OpenAI: API key can be provided via api_key field or the OPENAI_API_KEY environment variable. If neither is set, DeepEval's GPTModel will raise an error at runtime.
  • Anthropic: API key can be provided via api_key field or the ANTHROPIC_API_KEY environment variable. If neither is set, DeepEval's AnthropicModel will raise an error at runtime.
  • Azure OpenAI: The api_key field is required and validated at configuration time (no environment variable fallback).
  • Ollama: No API key required (local inference).
Example

config = DeepEvalModelConfig()  # Default Ollama
model = config.to_deepeval_model()

openai_config = DeepEvalModelConfig(
    provider=ProviderEnum.OPENAI,
    model_name="gpt-4o",
    api_key="sk-...",  # Or set OPENAI_API_KEY env var
)
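For Azure OpenAI, endpoint, deployment_name, and api_key must be set explicitly (see validate_provider_requirements below). A sketch with placeholder values:

azure_config = DeepEvalModelConfig(
    provider=ProviderEnum.AZURE_OPENAI,
    model_name="gpt-4o",
    endpoint="https://my-resource.openai.azure.com/",  # required for Azure
    deployment_name="gpt-4o",                          # required for Azure
    api_version="2024-06-01",                          # placeholder version string
    api_key="azure-key",                               # required for Azure (no env var fallback)
)
model = azure_config.to_deepeval_model()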

to_deepeval_model()

Convert configuration to native DeepEval model class.

Returns the appropriate DeepEval model class instance based on the configured provider.

Returns:

- DeepEvalModel -- DeepEval model instance (GPTModel, AzureOpenAIModel, AnthropicModel, or OllamaModel)

Raises:

- ValueError -- If provider is not supported

Source code in src/holodeck/lib/evaluators/deepeval/config.py
def to_deepeval_model(self) -> DeepEvalModel:
    """Convert configuration to native DeepEval model class.

    Returns the appropriate DeepEval model class instance based on
    the configured provider.

    Returns:
        DeepEval model instance (GPTModel, AzureOpenAIModel,
        AnthropicModel, or OllamaModel)

    Raises:
        ValueError: If provider is not supported
    """
    if self.provider == ProviderEnum.OPENAI:
        from deepeval.models import GPTModel

        kwargs: dict[str, Any] = {
            "model": self.model_name,
            "temperature": self.temperature,
        }
        if self.api_key:
            kwargs["api_key"] = self.api_key
        return GPTModel(**kwargs)

    elif self.provider == ProviderEnum.AZURE_OPENAI:
        from deepeval.models import AzureOpenAIModel

        # DeepEval 3.7.x renamed the constructor kwargs:
        #   model_name → model (hard-required; not aliased)
        #   azure_endpoint → base_url (aliased, deprecation warning)
        #   openai_api_version → api_version (not aliased — must rename)
        #   azure_openai_api_key → api_key (aliased, deprecation warning)
        # We use the new names directly to avoid the warnings.
        return AzureOpenAIModel(
            model=self.model_name,
            deployment_name=self.deployment_name,
            base_url=self.endpoint,
            api_version=self.api_version,
            api_key=self.api_key,
            temperature=1.0,  # reasoning models require temperature=1.0
        )

    elif self.provider == ProviderEnum.ANTHROPIC:
        from deepeval.models import AnthropicModel

        kwargs = {
            "model": self.model_name,
            "temperature": self.temperature,
        }
        if self.api_key:
            kwargs["api_key"] = self.api_key
        return AnthropicModel(**kwargs)

    elif self.provider == ProviderEnum.OLLAMA:
        from deepeval.models import OllamaModel

        return OllamaModel(
            model=self.model_name,
            base_url=self.endpoint or "http://localhost:11434",
            temperature=self.temperature,
        )

    else:
        raise ValueError(f"Unsupported provider: {self.provider}")

validate_provider_requirements()

Validate that required fields are present for each provider.

Raises:

- ValueError -- If required fields are missing for the provider

Source code in src/holodeck/lib/evaluators/deepeval/config.py
@model_validator(mode="after")
def validate_provider_requirements(self) -> "DeepEvalModelConfig":
    """Validate that required fields are present for each provider.

    Raises:
        ValueError: If required fields are missing for the provider
    """
    if self.provider == ProviderEnum.AZURE_OPENAI:
        if not self.endpoint:
            raise ValueError("endpoint is required for Azure OpenAI provider")
        if not self.deployment_name:
            raise ValueError(
                "deployment_name is required for Azure OpenAI provider"
            )
        if not self.api_key:
            raise ValueError("api_key is required for Azure OpenAI provider")
    return self
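Because this runs as a Pydantic model validator, a misconfigured provider fails at construction time. A minimal sketch, assuming Pydantic v2 surfaces the ValueError as a ValidationError:

from pydantic import ValidationError

try:
    DeepEvalModelConfig(provider=ProviderEnum.AZURE_OPENAI, model_name="gpt-4o")
except ValidationError as exc:
    print(exc)  # reports: endpoint is required for Azure OpenAI provider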

DeepEvalBaseEvaluator

DeepEvalBaseEvaluator(model_config=None, threshold=0.5, timeout=60.0, retry_config=None, observability_config=None)

Bases: BaseEvaluator

Abstract base class for DeepEval-based evaluators.

This class extends BaseEvaluator to provide DeepEval-specific functionality:

- Model configuration and initialization
- LLMTestCase construction from evaluation inputs
- Result normalization and logging

Subclasses must implement _create_metric() to return the specific DeepEval metric instance.

Note: DeepEval uses different parameter names than Azure AI/NLP:

- input (not query)
- actual_output (not response)
- expected_output (not ground_truth)

Attributes:

- model_config -- Configuration for the evaluation LLM
- threshold -- Score threshold for pass/fail determination
- model -- The initialized DeepEval model instance

Example

class MyMetricEvaluator(DeepEvalBaseEvaluator):
    def _create_metric(self):
        return SomeDeepEvalMetric(
            threshold=self._threshold,
            model=self._model,
        )

evaluator = MyMetricEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is Python?",
    actual_output="Python is a programming language.",
)

Initialize DeepEval base evaluator.

Parameters:

- model_config (DeepEvalModelConfig | None) -- Configuration for the evaluation model. Defaults to Ollama with gpt-oss:20b. Default: None
- threshold (float) -- Score threshold for pass/fail (0.0-1.0). Default: 0.5
- timeout (float | None) -- Evaluation timeout in seconds. Default: 60.0
- retry_config (RetryConfig | None) -- Retry configuration for transient failures. Default: None
- observability_config (TracingConfig | None) -- Tracing configuration for span instrumentation. If None, no spans are created. Default: None
Source code in src/holodeck/lib/evaluators/deepeval/base.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
    observability_config: TracingConfig | None = None,
) -> None:
    """Initialize DeepEval base evaluator.

    Args:
        model_config: Configuration for the evaluation model.
                     Defaults to Ollama with gpt-oss:20b.
        threshold: Score threshold for pass/fail (0.0-1.0, default: 0.5)
        timeout: Evaluation timeout in seconds (default: 60.0)
        retry_config: Retry configuration for transient failures
        observability_config: Tracing configuration for span instrumentation.
                             If None, no spans are created.
    """
    super().__init__(timeout=timeout, retry_config=retry_config)
    self._model_config = model_config or DeepEvalModelConfig()
    self._model = self._model_config.to_deepeval_model()
    self._threshold = threshold
    self._observability_config = observability_config

    logger.debug(
        f"DeepEval evaluator initialized: {self.name}, "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}"
    )

name property

Return evaluator name (class name by default).

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

- **kwargs (Any) -- Evaluation parameters (query, response, context, ground_truth, etc.). Default: {}

Returns:

- dict[str, Any] -- Evaluation result dictionary

Raises:

- TimeoutError -- If evaluation exceeds timeout
- EvaluationError -- If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

- ParamSpec -- ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

DeepEvalError

DeepEvalError(message, metric_name, original_error=None, test_case_summary=None)

Bases: EvaluationError

Wraps errors from the DeepEval library with additional context.

This exception provides debugging information when DeepEval metrics fail, including the metric name and a summary of the test case that triggered the error.

Attributes:

- metric_name -- Name of the DeepEval metric that failed
- original_error -- The underlying exception from DeepEval
- test_case_summary -- Truncated input/output data for debugging

Initialize DeepEvalError with context.

Parameters:

- message (str) -- Human-readable error message. Required.
- metric_name (str) -- Name of the metric that failed. Required.
- original_error (Exception | None) -- The underlying exception from DeepEval. Default: None
- test_case_summary (dict[str, Any] | None) -- Dictionary with truncated test case fields. Default: None
Source code in src/holodeck/lib/evaluators/deepeval/errors.py
def __init__(
    self,
    message: str,
    metric_name: str,
    original_error: Exception | None = None,
    test_case_summary: dict[str, Any] | None = None,
) -> None:
    """Initialize DeepEvalError with context.

    Args:
        message: Human-readable error message
        metric_name: Name of the metric that failed
        original_error: The underlying exception from DeepEval
        test_case_summary: Dictionary with truncated test case fields
    """
    super().__init__(message)
    self.metric_name = metric_name
    self.original_error = original_error
    self.test_case_summary = test_case_summary or {}

ProviderNotSupportedError

ProviderNotSupportedError(message, evaluator_type, configured_provider, supported_providers)

Bases: EvaluationError

Raised when an evaluator is used with an incompatible LLM provider.

This error is raised early during evaluator initialization to prevent confusing runtime errors when users misconfigure provider settings.

Attributes:

- evaluator_type -- The type of evaluator that requires specific providers
- configured_provider -- The provider that was incorrectly configured
- supported_providers -- List of providers that are supported

Initialize ProviderNotSupportedError with context.

Parameters:

- message (str) -- Human-readable error message. Required.
- evaluator_type (str) -- The evaluator class that raised the error. Required.
- configured_provider (str) -- The provider that was configured. Required.
- supported_providers (list[str]) -- List of valid provider names. Required.
Source code in src/holodeck/lib/evaluators/deepeval/errors.py
def __init__(
    self,
    message: str,
    evaluator_type: str,
    configured_provider: str,
    supported_providers: list[str],
) -> None:
    """Initialize ProviderNotSupportedError with context.

    Args:
        message: Human-readable error message
        evaluator_type: The evaluator class that raised the error
        configured_provider: The provider that was configured
        supported_providers: List of valid provider names
    """
    super().__init__(message)
    self.evaluator_type = evaluator_type
    self.configured_provider = configured_provider
    self.supported_providers = supported_providers

G-Eval: Custom Criteria

GEvalEvaluator

GEvalEvaluator(name, criteria, evaluation_params=None, evaluation_steps=None, model_config=None, threshold=0.5, strict_mode=False, timeout=60.0, retry_config=None, observability_config=None)

Bases: DeepEvalBaseEvaluator

G-Eval custom criteria evaluator.

Evaluates LLM outputs against user-defined criteria using the G-Eval algorithm, which combines chain-of-thought prompting with token probability scoring.

G-Eval works in two phases:

1. Step Generation -- Auto-generates evaluation steps from the criteria
2. Scoring -- Uses the steps to score the test case on a 1-5 scale (normalized to 0-1)

Attributes:

- _metric_name -- Custom name for this evaluation metric
- _criteria -- Natural language criteria for evaluation
- _evaluation_params -- Test case fields to include in evaluation
- _evaluation_steps -- Optional explicit evaluation steps
- _strict_mode -- Whether to use binary scoring (1.0 or 0.0)

Example

evaluator = GEvalEvaluator(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7,
)
result = await evaluator.evaluate(
    input="Write me an email",
    actual_output="Dear Sir/Madam, ...",
)
print(result["score"])   # 0.85
print(result["passed"])  # True

Initialize G-Eval evaluator.

Parameters:

- name (str) -- Metric identifier (e.g., "Correctness", "Helpfulness"). Required.
- criteria (str) -- Natural language evaluation criteria. Required.
- evaluation_params (list[str] | None) -- Test case fields to include in evaluation. Valid options: ["input", "actual_output", "expected_output", "context", "retrieval_context"]. Default: ["actual_output"]
- evaluation_steps (list[str] | None) -- Explicit evaluation steps. If None, G-Eval auto-generates steps from the criteria. Default: None
- model_config (DeepEvalModelConfig | None) -- LLM judge configuration. Defaults to Ollama gpt-oss:20b. Default: None
- threshold (float) -- Pass/fail score threshold (0.0-1.0). Default: 0.5
- strict_mode (bool) -- If True, scores are binary (1.0 or 0.0). Default: False
- timeout (float | None) -- Evaluation timeout in seconds. Default: 60.0
- retry_config (RetryConfig | None) -- Retry configuration for transient failures. Default: None
- observability_config (TracingConfig | None) -- Tracing configuration for span instrumentation. If None, no spans are created. Default: None

Raises:

- ValueError -- If invalid evaluation_params are provided.

Source code in src/holodeck/lib/evaluators/deepeval/geval.py
def __init__(
    self,
    name: str,
    criteria: str,
    evaluation_params: list[str] | None = None,
    evaluation_steps: list[str] | None = None,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    strict_mode: bool = False,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
    observability_config: TracingConfig | None = None,
) -> None:
    """Initialize G-Eval evaluator.

    Args:
        name: Metric identifier (e.g., "Correctness", "Helpfulness")
        criteria: Natural language evaluation criteria
        evaluation_params: Test case fields to include in evaluation.
            Valid options: ["input", "actual_output", "expected_output",
                          "context", "retrieval_context"]
            Default: ["actual_output"]
        evaluation_steps: Explicit evaluation steps. If None, G-Eval
            auto-generates steps from the criteria.
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        strict_mode: If True, scores are binary (1.0 or 0.0). Default: False.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
        observability_config: Tracing configuration for span instrumentation.
                             If None, no spans are created.

    Raises:
        ValueError: If invalid evaluation_params are provided.
    """
    # Validate and set evaluation params before calling super().__init__
    if evaluation_params is None:
        evaluation_params = ["actual_output"]

    # Validate evaluation params
    for param in evaluation_params:
        if param not in VALID_EVALUATION_PARAMS:
            raise ValueError(
                f"Invalid evaluation_param: '{param}'. "
                f"Valid options: {sorted(VALID_EVALUATION_PARAMS)}"
            )

    self._metric_name = name
    self._criteria = criteria
    self._evaluation_params = evaluation_params
    self._evaluation_steps = evaluation_steps
    self._strict_mode = strict_mode

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
        observability_config=observability_config,
    )

    logger.debug(
        f"GEvalEvaluator initialized: name={name}, "
        f"criteria_len={len(criteria)}, "
        f"evaluation_params={evaluation_params}, "
        f"strict_mode={strict_mode}"
    )

name property

Return the custom metric name.

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

- **kwargs (Any) -- Evaluation parameters (query, response, context, ground_truth, etc.). Default: {}

Returns:

- dict[str, Any] -- Evaluation result dictionary

Raises:

- TimeoutError -- If evaluation exceeds timeout
- EvaluationError -- If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

- ParamSpec -- ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

G-Eval Usage

from holodeck.lib.evaluators.deepeval import GEvalEvaluator, DeepEvalModelConfig
from holodeck.models.llm import ProviderEnum

config = DeepEvalModelConfig(
    provider=ProviderEnum.OPENAI,
    model_name="gpt-4o",
    api_key="sk-...",
)

evaluator = GEvalEvaluator(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language and avoids slang.",
    evaluation_params=["actual_output", "input"],
    evaluation_steps=[
        "Check if the language is formal and professional",
        "Verify no slang or casual expressions are used",
    ],
    model_config=config,
    threshold=0.7,
    strict_mode=False,
)

result = await evaluator.evaluate(
    input="Write a business email",
    actual_output="Dear Sir/Madam, I am writing to inquire about...",
)
print(result["score"])      # 0.0-1.0
print(result["passed"])     # True if >= 0.7
print(result["reasoning"])  # LLM-generated explanation

G-Eval YAML Configuration

evaluations:
  model:
    provider: openai
    name: gpt-4o
    temperature: 0.0
  metrics:
    - type: geval
      name: Professionalism
      criteria: |
        Evaluate if the response uses professional language,
        avoids slang, and maintains a respectful tone.
      evaluation_steps:
        - "Check if the language is formal and professional"
        - "Verify no slang or casual expressions are used"
        - "Assess the overall respectful tone"
      evaluation_params:
        - actual_output
        - input
      threshold: 0.7
      strict_mode: false

Valid evaluation_params values: input, actual_output, expected_output, context, retrieval_context.


RAG Pipeline Metrics

RAG evaluators measure retrieval-augmented generation quality. All RAG evaluators (except AnswerRelevancyEvaluator) require retrieval_context.
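A combined run over a single retrieval trace might look like the sketch below; it assumes these evaluators are exported from the same holodeck.lib.evaluators.deepeval package as GEvalEvaluator and uses the default Ollama judge:

import asyncio

from holodeck.lib.evaluators.deepeval import (
    AnswerRelevancyEvaluator,
    ContextualRelevancyEvaluator,
    FaithfulnessEvaluator,
)

# One retrieval trace shared by the context-based metrics
trace = dict(
    input="What are the store hours?",
    actual_output="The store is open Mon-Fri from 9am to 5pm.",
    retrieval_context=["Store hours: Mon-Fri 9am-5pm", "Founded in 2020"],
)

faithfulness = FaithfulnessEvaluator(threshold=0.8)
ctx_relevancy = ContextualRelevancyEvaluator(threshold=0.6)
answer = AnswerRelevancyEvaluator(threshold=0.7)

results = await asyncio.gather(
    faithfulness.evaluate(**trace),
    ctx_relevancy.evaluate(**trace),
    # AnswerRelevancy needs only input and actual_output
    answer.evaluate(input=trace["input"], actual_output=trace["actual_output"]),
)
for result in results:
    print(result["score"])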

FaithfulnessEvaluator

Detects hallucinations by checking whether the response is supported by the retrieval context.

FaithfulnessEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None, observability_config=None)

Bases: DeepEvalBaseEvaluator

Faithfulness evaluator for detecting hallucinations.

Detects hallucinations by comparing agent response to retrieval context. Returns a low score if the response contains information not found in the retrieval context (hallucination detected).

Required inputs
  • input: User query
  • actual_output: Agent response
  • retrieval_context: List of retrieved text chunks
Example

evaluator = FaithfulnessEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="What are the store hours?",
    actual_output="Store is open 24/7.",
    retrieval_context=["Store hours: Mon-Fri 9am-5pm"],
)
print(result["score"])  # Low score (hallucination detected)

Attributes:

- _include_reason -- Whether to include reasoning in results.

Initialize Faithfulness evaluator.

Parameters:

- model_config (DeepEvalModelConfig | None) -- LLM judge configuration. Defaults to Ollama gpt-oss:20b. Default: None
- threshold (float) -- Pass/fail score threshold (0.0-1.0). Default: 0.5
- include_reason (bool) -- Whether to include reasoning in results. Default: True
- timeout (float | None) -- Evaluation timeout in seconds. Default: 60.0
- retry_config (RetryConfig | None) -- Retry configuration for transient failures. Default: None
- observability_config (TracingConfig | None) -- Tracing configuration for span instrumentation. If None, no spans are created. Default: None
Source code in src/holodeck/lib/evaluators/deepeval/faithfulness.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
    observability_config: TracingConfig | None = None,
) -> None:
    """Initialize Faithfulness evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
        observability_config: Tracing configuration for span instrumentation.
                             If None, no spans are created.
    """
    self._include_reason = include_reason

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
        observability_config=observability_config,
    )

    logger.debug(
        f"FaithfulnessEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}"
    )

name property

Return the metric name.

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

- **kwargs (Any) -- Evaluation parameters (query, response, context, ground_truth, etc.). Default: {}

Returns:

- dict[str, Any] -- Evaluation result dictionary

Raises:

- TimeoutError -- If evaluation exceeds timeout
- EvaluationError -- If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

- ParamSpec -- ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

AnswerRelevancyEvaluator

Measures whether response statements are relevant to the input query. Does not require retrieval_context.

AnswerRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, strict_mode=False, timeout=60.0, retry_config=None, observability_config=None)

Bases: DeepEvalBaseEvaluator

Answer Relevancy evaluator - measures statement relevance to input.

Evaluates how relevant the response statements are to the input query. Unlike other RAG metrics, this does NOT require retrieval_context.

Required inputs
  • input: User query
  • actual_output: Agent response
Example

evaluator = AnswerRelevancyEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is the return policy?",
    actual_output="We offer a 30-day full refund at no extra cost.",
)
print(result["score"])  # High score if relevant

Attributes:

- _include_reason -- Whether to include reasoning in results.
- _strict_mode -- Whether to use binary scoring (1.0 or 0.0).

Initialize Answer Relevancy evaluator.

Parameters:

- model_config (DeepEvalModelConfig | None) -- LLM judge configuration. Defaults to Ollama gpt-oss:20b. Default: None
- threshold (float) -- Pass/fail score threshold (0.0-1.0). Default: 0.5
- include_reason (bool) -- Whether to include reasoning in results. Default: True
- strict_mode (bool) -- Binary scoring mode (1.0 or 0.0 only). Default: False
- timeout (float | None) -- Evaluation timeout in seconds. Default: 60.0
- retry_config (RetryConfig | None) -- Retry configuration for transient failures. Default: None
- observability_config (TracingConfig | None) -- Tracing configuration for span instrumentation. If None, no spans are created. Default: None
Source code in src/holodeck/lib/evaluators/deepeval/answer_relevancy.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    strict_mode: bool = False,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
    observability_config: TracingConfig | None = None,
) -> None:
    """Initialize Answer Relevancy evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        strict_mode: Binary scoring mode (1.0 or 0.0 only). Default: False.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
        observability_config: Tracing configuration for span instrumentation.
                             If None, no spans are created.
    """
    self._include_reason = include_reason
    self._strict_mode = strict_mode

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
        observability_config=observability_config,
    )

    logger.debug(
        f"AnswerRelevancyEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}, "
        f"strict_mode={strict_mode}"
    )

name property

Return the metric name.

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

- **kwargs (Any) -- Evaluation parameters (query, response, context, ground_truth, etc.). Default: {}

Returns:

- dict[str, Any] -- Evaluation result dictionary

Raises:

- TimeoutError -- If evaluation exceeds timeout
- EvaluationError -- If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

- ParamSpec -- ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

ContextualRelevancyEvaluator

Measures the proportion of retrieved chunks that are relevant to the query.

ContextualRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None, observability_config=None)

Bases: DeepEvalBaseEvaluator

Contextual Relevancy evaluator for RAG pipelines.

Measures the relevance of retrieved context to the user query. Returns the proportion of chunks that are relevant to the query.

Required inputs
  • input: User query
  • actual_output: Agent response
  • retrieval_context: List of retrieved text chunks
Example

evaluator = ContextualRelevancyEvaluator(threshold=0.6)
result = await evaluator.evaluate(
    input="What is the pricing?",
    actual_output="Basic plan is $10/month.",
    retrieval_context=[
        "Pricing: Basic $10, Pro $25",  # Relevant
        "Company founded in 2020",      # Irrelevant
    ],
)
print(result["score"])  # 0.5 (1 of 2 chunks relevant)

Attributes:

- _include_reason -- Whether to include reasoning in results.

Initialize Contextual Relevancy evaluator.

Parameters:

- model_config (DeepEvalModelConfig | None) -- LLM judge configuration. Defaults to Ollama gpt-oss:20b. Default: None
- threshold (float) -- Pass/fail score threshold (0.0-1.0). Default: 0.5
- include_reason (bool) -- Whether to include reasoning in results. Default: True
- timeout (float | None) -- Evaluation timeout in seconds. Default: 60.0
- retry_config (RetryConfig | None) -- Retry configuration for transient failures. Default: None
- observability_config (TracingConfig | None) -- Tracing configuration for span instrumentation. If None, no spans are created. Default: None
Source code in src/holodeck/lib/evaluators/deepeval/contextual_relevancy.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
    observability_config: TracingConfig | None = None,
) -> None:
    """Initialize Contextual Relevancy evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
        observability_config: Tracing configuration for span instrumentation.
                             If None, no spans are created.
    """
    self._include_reason = include_reason

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
        observability_config=observability_config,
    )

    logger.debug(
        f"ContextualRelevancyEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}"
    )

name property

Return the metric name.

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

- **kwargs (Any) -- Evaluation parameters (query, response, context, ground_truth, etc.). Default: {}

Returns:

- dict[str, Any] -- Evaluation result dictionary

Raises:

- TimeoutError -- If evaluation exceeds timeout
- EvaluationError -- If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris",
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

- ParamSpec -- ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

ContextualPrecisionEvaluator

Evaluates ranking quality -- whether relevant chunks appear before irrelevant ones.

ContextualPrecisionEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None, observability_config=None)

Bases: DeepEvalBaseEvaluator

Contextual Precision evaluator for RAG pipelines.

Evaluates the ranking quality of retrieved chunks. Measures whether relevant chunks appear before irrelevant ones.

Required inputs
  • input: User query
  • actual_output: Agent response
  • expected_output: Ground truth answer
  • retrieval_context: List of retrieved text chunks (order matters)
Example

evaluator = ContextualPrecisionEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is X?",
    actual_output="X is...",
    expected_output="X is the correct definition.",
    retrieval_context=[
        "Irrelevant info",      # Bad: irrelevant first
        "X is the definition",  # Good: relevant
    ],
)
print(result["score"])  # Lower due to poor ranking

Attributes:

- _include_reason -- Whether to include reasoning in results.

Initialize Contextual Precision evaluator.

Parameters:

- model_config (DeepEvalModelConfig | None) -- LLM judge configuration. Defaults to Ollama gpt-oss:20b. Default: None
- threshold (float) -- Pass/fail score threshold (0.0-1.0). Default: 0.5
- include_reason (bool) -- Whether to include reasoning in results. Default: True
- timeout (float | None) -- Evaluation timeout in seconds. Default: 60.0
- retry_config (RetryConfig | None) -- Retry configuration for transient failures. Default: None
- observability_config (TracingConfig | None) -- Tracing configuration for span instrumentation. If None, no spans are created. Default: None
Source code in src/holodeck/lib/evaluators/deepeval/contextual_precision.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
    observability_config: TracingConfig | None = None,
) -> None:
    """Initialize Contextual Precision evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
        observability_config: Tracing configuration for span instrumentation.
                             If None, no spans are created.
    """
    self._include_reason = include_reason

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
        observability_config=observability_config,
    )

    logger.debug(
        f"ContextualPrecisionEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}"
    )

name property

Return the metric name.

evaluate(**kwargs) async

Inherited from BaseEvaluator (base.py); see the evaluate() documentation above for arguments, return value, timeout/retry behavior, and an example.

get_param_spec() classmethod

Inherited from BaseEvaluator (base.py); see the get_param_spec() documentation above.

ContextualRecallEvaluator

Measures retrieval completeness -- whether the context contains all facts needed to produce the expected output.

ContextualRecallEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None, observability_config=None)

Bases: DeepEvalBaseEvaluator

Contextual Recall evaluator for RAG pipelines.

Measures retrieval completeness against expected output. Evaluates whether retrieval context contains all facts needed to produce the expected output.

Required inputs
  • input: User query
  • actual_output: Agent response
  • expected_output: Ground truth answer
  • retrieval_context: List of retrieved text chunks
Example

>>> evaluator = ContextualRecallEvaluator(threshold=0.8)
>>> result = await evaluator.evaluate(
...     input="List all features",
...     actual_output="Features are A and B",
...     expected_output="Features are A, B, and C",
...     retrieval_context=["Feature A: ...", "Feature B: ..."]
... )
>>> print(result["score"])  # ~0.67 (missing Feature C)

Attributes:

| Name | Description |
|------|-------------|
| _include_reason | Whether to include reasoning in results. |

Initialize Contextual Recall evaluator.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| model_config | DeepEvalModelConfig \| None | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | None |
| threshold | float | Pass/fail score threshold (0.0-1.0). | 0.5 |
| include_reason | bool | Whether to include reasoning in results. | True |
| timeout | float \| None | Evaluation timeout in seconds. | 60.0 |
| retry_config | RetryConfig \| None | Retry configuration for transient failures. | None |
| observability_config | TracingConfig \| None | Tracing configuration for span instrumentation. If None, no spans are created. | None |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_recall.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
    observability_config: TracingConfig | None = None,
) -> None:
    """Initialize Contextual Recall evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
        observability_config: Tracing configuration for span instrumentation.
                             If None, no spans are created.
    """
    self._include_reason = include_reason

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
        observability_config=observability_config,
    )

    logger.debug(
        f"ContextualRecallEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}"
    )

name property

Return the metric name.

evaluate(**kwargs) async

Inherited from BaseEvaluator (base.py); see the evaluate() documentation above for arguments, return value, timeout/retry behavior, and an example.

get_param_spec() classmethod

Inherited from BaseEvaluator (base.py); see the get_param_spec() documentation above.

RAG Metrics Usage

from holodeck.lib.evaluators.deepeval import (
    FaithfulnessEvaluator,
    AnswerRelevancyEvaluator,
    ContextualRelevancyEvaluator,
    ContextualPrecisionEvaluator,
    ContextualRecallEvaluator,
    DeepEvalModelConfig,
)

config = DeepEvalModelConfig()  # Default: Ollama with gpt-oss:20b

# Faithfulness (hallucination detection)
faithfulness = FaithfulnessEvaluator(model_config=config, threshold=0.8)
result = await faithfulness.evaluate(
    input="What are the store hours?",
    actual_output="Store is open 24/7.",
    retrieval_context=["Store hours: Mon-Fri 9am-5pm"],
)
print(result["score"])  # Low score -- hallucination detected

# Answer Relevancy (no retrieval_context needed)
relevancy = AnswerRelevancyEvaluator(model_config=config, threshold=0.7)
result = await relevancy.evaluate(
    input="What is the return policy?",
    actual_output="We offer 30-day returns at no extra cost.",
)

# Contextual Precision (ranking quality)
precision = ContextualPrecisionEvaluator(model_config=config, threshold=0.7)
result = await precision.evaluate(
    input="What is X?",
    actual_output="X is a programming concept.",
    expected_output="X is a well-known programming paradigm.",
    retrieval_context=["X is a programming paradigm.", "Unrelated info"],
)
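
The two remaining RAG evaluators follow the same call pattern. A short sketch based on the required inputs documented above; the query and context strings are illustrative:

# Contextual Recall (retrieval completeness vs. expected output)
recall = ContextualRecallEvaluator(model_config=config, threshold=0.6)
result = await recall.evaluate(
    input="List all features",
    actual_output="Features are A and B",
    expected_output="Features are A, B, and C",
    retrieval_context=["Feature A: ...", "Feature B: ..."],
)

# Contextual Relevancy (chunk relevance to the query)
ctx_relevancy = ContextualRelevancyEvaluator(model_config=config, threshold=0.6)
result = await ctx_relevancy.evaluate(
    input="What is the return policy?",
    actual_output="We offer 30-day returns at no extra cost.",
    retrieval_context=["Refund Policy: All products come with a 30-day money-back guarantee."],
)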

RAG YAML Configuration

evaluations:
  model:
    provider: openai
    name: gpt-4o
  metrics:
    - type: rag
      metric_type: faithfulness
      threshold: 0.8
      include_reason: true

    - type: rag
      metric_type: answer_relevancy
      threshold: 0.7

    - type: rag
      metric_type: contextual_relevancy
      threshold: 0.6

    - type: rag
      metric_type: contextual_precision
      threshold: 0.7

    - type: rag
      metric_type: contextual_recall
      threshold: 0.6

RAG Metrics Summary

| Evaluator | Required Params | Requires retrieval_context | Measures |
|-----------|-----------------|----------------------------|----------|
| FaithfulnessEvaluator | input, actual_output, retrieval_context | Yes | Hallucination detection |
| AnswerRelevancyEvaluator | input, actual_output | No | Response relevance to query |
| ContextualRelevancyEvaluator | input, actual_output, retrieval_context | Yes | Chunk relevance to query |
| ContextualPrecisionEvaluator | input, actual_output, expected_output, retrieval_context | Yes | Ranking quality of chunks |
| ContextualRecallEvaluator | input, actual_output, expected_output, retrieval_context | Yes | Retrieval completeness |
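
Because each evaluate() call is a coroutine, independent metrics for the same test case can run concurrently. A minimal sketch using asyncio.gather; the case dictionary keys are illustrative, and evaluator construction follows the usage section above:

import asyncio

from holodeck.lib.evaluators.deepeval import (
    AnswerRelevancyEvaluator,
    DeepEvalModelConfig,
    FaithfulnessEvaluator,
)

async def evaluate_rag_case(case: dict) -> dict:
    """Run faithfulness and answer relevancy for one test case concurrently."""
    config = DeepEvalModelConfig()
    faithfulness = FaithfulnessEvaluator(model_config=config, threshold=0.8)
    answer_relevancy = AnswerRelevancyEvaluator(model_config=config, threshold=0.7)

    faithfulness_result, relevancy_result = await asyncio.gather(
        faithfulness.evaluate(
            input=case["input"],
            actual_output=case["actual_output"],
            retrieval_context=case["retrieval_context"],
        ),
        answer_relevancy.evaluate(
            input=case["input"],
            actual_output=case["actual_output"],
        ),
    )
    return {
        "faithfulness": faithfulness_result["score"],
        "answer_relevancy": relevancy_result["score"],
    }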

Complete Agent Configuration Example

name: customer-support-agent
model:
  provider: openai
  name: gpt-4o

evaluations:
  model:
    provider: openai
    name: gpt-4o
    temperature: 0.0
  metrics:
    # Standard NLP metrics (no LLM required)
    - type: standard
      metric: bleu
      threshold: 0.4
    - type: standard
      metric: rouge
      threshold: 0.5

    # Custom G-Eval criteria
    - type: geval
      name: Helpfulness
      criteria: "Evaluate if the response provides actionable, helpful information"
      evaluation_params: [actual_output, input]
      threshold: 0.7

    # RAG evaluation
    - type: rag
      metric_type: faithfulness
      threshold: 0.8
      include_reason: true

test_cases:
  - name: "Refund policy question"
    input: "What is your refund policy?"
    ground_truth: "We offer a 30-day money-back guarantee on all products."
    retrieval_context:
      - "Refund Policy: All products come with a 30-day money-back guarantee."
      - "Returns must be initiated within 30 days of purchase."

  - name: "Product recommendation"
    input: "I need a laptop for video editing"
    expected_tools: [search_products, get_specifications]
    evaluations:
      - type: geval
        name: TechnicalAccuracy
        criteria: "Verify the response contains accurate technical specifications"
        threshold: 0.8

Run tests with:

holodeck test agent.yaml --verbose --output report.md --format markdown