
Evaluation Framework API

HoloDeck provides a flexible evaluation framework for measuring agent response quality. The framework supports three tiers of metrics:

  1. DeepEval Metrics (Recommended) - LLM-as-a-judge with GEval and RAG metrics
  2. NLP Metrics (Standard) - Algorithmic text comparison
  3. Legacy AI Metrics (Deprecated) - Azure AI-based metrics

Evaluation Configuration Models

EvaluationConfig

Bases: BaseModel

Evaluation framework configuration.

Container for evaluation metrics with optional default model configuration. Supports standard EvaluationMetric, GEvalMetric (custom criteria), and RAGMetric (RAG pipeline evaluation).

validate_metrics(v) classmethod

Validate metrics list is not empty.

Source code in src/holodeck/models/evaluation.py
@field_validator("metrics")
@classmethod
def validate_metrics(
    cls, v: list[EvaluationMetric | GEvalMetric | RAGMetric]
) -> list[EvaluationMetric | GEvalMetric | RAGMetric]:
    """Validate metrics list is not empty."""
    if not v:
        raise ValueError("metrics must have at least one metric")
    return v

MetricType = Annotated[EvaluationMetric | GEvalMetric | RAGMetric, Field(discriminator='type')] module-attribute
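As a rough sketch of how these models compose, the snippet below builds an EvaluationConfig that mixes metric types from the discriminated union above. The import path is inferred from the source links on this page (src/holodeck/models/evaluation.py), the metric values are illustrative, and GEvalMetric/RAGMetric are assumed to default their own type discriminators.

from holodeck.models.evaluation import (
    EvaluationConfig,
    GEvalMetric,
    RAGMetric,
    RAGMetricType,
)

# Metric types can be mixed freely in `metrics`; an empty list fails
# validate_metrics() above.
config = EvaluationConfig(
    metrics=[
        GEvalMetric(
            name="Professionalism",
            criteria="Evaluate if the response uses professional language",
            threshold=0.7,
        ),
        RAGMetric(metric_type=RAGMetricType.FAITHFULNESS, threshold=0.8),
    ],
)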

GEvalMetric

Bases: BaseModel

G-Eval custom criteria metric configuration.

Uses discriminator pattern with type="geval" to distinguish from standard EvaluationMetric instances in a discriminated union.

G-Eval enables custom evaluation criteria defined in natural language, using chain-of-thought prompting with LLM-based scoring.

Example

metric = GEvalMetric(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7
)

validate_criteria(v) classmethod

Validate criteria is not empty.

Source code in src/holodeck/models/evaluation.py
@field_validator("criteria")
@classmethod
def validate_criteria(cls, v: str) -> str:
    """Validate criteria is not empty."""
    if not v or not v.strip():
        raise ValueError("criteria must be a non-empty string")
    return v

validate_evaluation_params(v) classmethod

Validate evaluation_params contains valid values.

Source code in src/holodeck/models/evaluation.py
@field_validator("evaluation_params")
@classmethod
def validate_evaluation_params(cls, v: list[str]) -> list[str]:
    """Validate evaluation_params contains valid values."""
    if not v:
        raise ValueError("evaluation_params must not be empty")
    invalid_params = set(v) - VALID_EVALUATION_PARAMS
    if invalid_params:
        raise ValueError(
            f"Invalid evaluation_params: {sorted(invalid_params)}. "
            f"Valid options: {sorted(VALID_EVALUATION_PARAMS)}"
        )
    return v

validate_name(v) classmethod

Validate name is not empty.

Source code in src/holodeck/models/evaluation.py
@field_validator("name")
@classmethod
def validate_name(cls, v: str) -> str:
    """Validate name is not empty."""
    if not v or not v.strip():
        raise ValueError("name must be a non-empty string")
    return v

validate_threshold(v) classmethod

Validate threshold is in valid range.

Source code in src/holodeck/models/evaluation.py
@field_validator("threshold")
@classmethod
def validate_threshold(cls, v: float | None) -> float | None:
    """Validate threshold is in valid range."""
    if v is not None and (v < 0.0 or v > 1.0):
        raise ValueError("threshold must be between 0.0 and 1.0")
    return v

RAGMetric

Bases: BaseModel

RAG pipeline evaluation metric configuration.

Uses discriminator pattern with type="rag" to distinguish from standard EvaluationMetric and GEvalMetric instances in a discriminated union.

RAG metrics evaluate the quality of retrieval-augmented generation pipelines:

  • Faithfulness: Detects hallucinations by comparing response to context
  • ContextualRelevancy: Measures relevance of retrieved chunks to query
  • ContextualPrecision: Evaluates ranking quality of retrieved chunks
  • ContextualRecall: Measures retrieval completeness against expected output

Example

metric = RAGMetric(
    metric_type=RAGMetricType.FAITHFULNESS,
    threshold=0.8
)

validate_threshold(v) classmethod

Validate threshold is in valid range.

Source code in src/holodeck/models/evaluation.py
@field_validator("threshold")
@classmethod
def validate_threshold(cls, v: float) -> float:
    """Validate threshold is in valid range."""
    if v < 0.0 or v > 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return v

EvaluationMetric

Bases: BaseModel

Evaluation metric configuration.

Represents a single evaluation metric with flexible model configuration, including per-metric LLM model overrides.

validate_custom_prompt(v) classmethod

Validate custom_prompt is not empty if provided.

Source code in src/holodeck/models/evaluation.py
@field_validator("custom_prompt")
@classmethod
def validate_custom_prompt(cls, v: str | None) -> str | None:
    """Validate custom_prompt is not empty if provided."""
    if v is not None and (not v or not v.strip()):
        raise ValueError("custom_prompt must be non-empty if provided")
    return v

validate_enabled(v) classmethod

Validate enabled is boolean.

Source code in src/holodeck/models/evaluation.py
@field_validator("enabled")
@classmethod
def validate_enabled(cls, v: bool) -> bool:
    """Validate enabled is boolean."""
    if not isinstance(v, bool):
        raise ValueError("enabled must be boolean")
    return v

validate_fail_on_error(v) classmethod

Validate fail_on_error is boolean.

Source code in src/holodeck/models/evaluation.py
@field_validator("fail_on_error")
@classmethod
def validate_fail_on_error(cls, v: bool) -> bool:
    """Validate fail_on_error is boolean."""
    if not isinstance(v, bool):
        raise ValueError("fail_on_error must be boolean")
    return v

validate_metric(v) classmethod

Validate metric is not empty.

Source code in src/holodeck/models/evaluation.py
@field_validator("metric")
@classmethod
def validate_metric(cls, v: str) -> str:
    """Validate metric is not empty."""
    if not v or not v.strip():
        raise ValueError("metric must be a non-empty string")
    return v

validate_retry_on_failure(v) classmethod

Validate retry_on_failure is in valid range.

Source code in src/holodeck/models/evaluation.py
@field_validator("retry_on_failure")
@classmethod
def validate_retry_on_failure(cls, v: int | None) -> int | None:
    """Validate retry_on_failure is in valid range."""
    if v is not None and (v < 1 or v > 3):
        raise ValueError("retry_on_failure must be between 1 and 3")
    return v

validate_scale(v) classmethod

Validate scale is positive.

Source code in src/holodeck/models/evaluation.py
@field_validator("scale")
@classmethod
def validate_scale(cls, v: int | None) -> int | None:
    """Validate scale is positive."""
    if v is not None and v <= 0:
        raise ValueError("scale must be positive")
    return v

validate_threshold(v) classmethod

Validate threshold is numeric if provided.

Source code in src/holodeck/models/evaluation.py
@field_validator("threshold")
@classmethod
def validate_threshold(cls, v: float | None) -> float | None:
    """Validate threshold is numeric if provided."""
    if v is not None and not isinstance(v, int | float):
        raise ValueError("threshold must be numeric")
    return v

validate_timeout_ms(v) classmethod

Validate timeout_ms is positive.

Source code in src/holodeck/models/evaluation.py
@field_validator("timeout_ms")
@classmethod
def validate_timeout_ms(cls, v: int | None) -> int | None:
    """Validate timeout_ms is positive."""
    if v is not None and v <= 0:
        raise ValueError("timeout_ms must be positive")
    return v
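Putting the EvaluationMetric validators above together, a standard metric entry might be constructed as in the hedged sketch below. The import path follows the source links above, the metric name value is purely illustrative, and only fields covered by the documented validators are set.

from holodeck.models.evaluation import EvaluationMetric

# Values chosen to satisfy the validators documented above:
# non-empty metric name, retry_on_failure within 1-3,
# timeout_ms and scale positive.
metric = EvaluationMetric(
    metric="coherence",       # illustrative metric name
    threshold=0.7,
    enabled=True,
    fail_on_error=False,
    retry_on_failure=2,
    timeout_ms=30000,
    scale=5,
)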

DeepEval Evaluators

HoloDeck provides LLM-as-a-judge evaluation through the DeepEval library.

Base Classes

DeepEvalBaseEvaluator(model_config=None, threshold=0.5, timeout=60.0, retry_config=None)

Bases: BaseEvaluator

Abstract base class for DeepEval-based evaluators.

This class extends BaseEvaluator to provide DeepEval-specific functionality:

  • Model configuration and initialization
  • LLMTestCase construction from evaluation inputs
  • Result normalization and logging

Subclasses must implement _create_metric() to return the specific DeepEval metric instance.

Note: DeepEval uses different parameter names than Azure AI/NLP:

  • input (not query)
  • actual_output (not response)
  • expected_output (not ground_truth)

Attributes:

  • model_config: Configuration for the evaluation LLM
  • threshold: Score threshold for pass/fail determination
  • model: The initialized DeepEval model instance

Example

class MyMetricEvaluator(DeepEvalBaseEvaluator):
    def _create_metric(self):
        return SomeDeepEvalMetric(
            threshold=self._threshold,
            model=self._model
        )

evaluator = MyMetricEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is Python?",
    actual_output="Python is a programming language."
)
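For a more concrete sketch, a minimal subclass wired to one of DeepEval's built-in metric classes could look like the following. AnswerRelevancyMetric is used only as an illustration of a DeepEval metric; HoloDeck already ships a dedicated AnswerRelevancyEvaluator (see RAG Evaluators below), which should be preferred in practice.

from deepeval.metrics import AnswerRelevancyMetric

class RelevancyJudge(DeepEvalBaseEvaluator):
    """Illustrative subclass: only _create_metric() is provided."""

    def _create_metric(self) -> AnswerRelevancyMetric:
        # Reuse the threshold and judge model configured on the base class.
        return AnswerRelevancyMetric(
            threshold=self._threshold,
            model=self._model,
        )

evaluator = RelevancyJudge(threshold=0.7)
result = await evaluator.evaluate(
    input="What is Python?",
    actual_output="Python is a programming language.",
)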

Initialize DeepEval base evaluator.

Parameters:

  • model_config (DeepEvalModelConfig | None, default None): Configuration for the evaluation model. Defaults to Ollama with gpt-oss:20b.
  • threshold (float, default 0.5): Score threshold for pass/fail (0.0-1.0).
  • timeout (float | None, default 60.0): Evaluation timeout in seconds.
  • retry_config (RetryConfig | None, default None): Retry configuration for transient failures.
Source code in src/holodeck/lib/evaluators/deepeval/base.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize DeepEval base evaluator.

    Args:
        model_config: Configuration for the evaluation model.
                     Defaults to Ollama with gpt-oss:20b.
        threshold: Score threshold for pass/fail (0.0-1.0, default: 0.5)
        timeout: Evaluation timeout in seconds (default: 60.0)
        retry_config: Retry configuration for transient failures
    """
    super().__init__(timeout=timeout, retry_config=retry_config)
    self._model_config = model_config or DeepEvalModelConfig()
    self._model = self._model_config.to_deepeval_model()
    self._threshold = threshold

    logger.debug(
        f"DeepEval evaluator initialized: {self.name}, "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}"
    )

name property

Return evaluator name (class name by default).

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

  • **kwargs (Any): Evaluation parameters (query, response, context, ground_truth, etc.)

Returns:

  • dict[str, Any]: Evaluation result dictionary

Raises:

  • TimeoutError: If evaluation exceeds timeout
  • EvaluationError: If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])  # 0.95

Source code in src/holodeck/lib/evaluators/base.py
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

  • ParamSpec: ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC

DeepEvalModelConfig

Bases: BaseModel

Configuration adapter for DeepEval model classes.

This class bridges HoloDeck's LLMProvider configuration to DeepEval's native model classes (GPTModel, AzureOpenAIModel, AnthropicModel, OllamaModel).

The default configuration uses Ollama with gpt-oss:20b for local evaluation without requiring API keys.

Attributes:

  • provider (ProviderEnum): LLM provider to use (defaults to Ollama)
  • model_name (str): Name of the model (defaults to gpt-oss:20b)
  • api_key (str | None): API key for cloud providers (not required for Ollama)
  • endpoint (str | None): API endpoint URL (required for Azure, optional for Ollama)
  • api_version (str | None): Azure OpenAI API version
  • deployment_name (str | None): Azure OpenAI deployment name
  • temperature (float): Temperature for generation (defaults to 0.0 for determinism)

API Key Behavior
  • OpenAI: API key can be provided via api_key field or the OPENAI_API_KEY environment variable. If neither is set, DeepEval's GPTModel will raise an error at runtime.
  • Anthropic: API key can be provided via api_key field or the ANTHROPIC_API_KEY environment variable. If neither is set, DeepEval's AnthropicModel will raise an error at runtime.
  • Azure OpenAI: The api_key field is required and validated at configuration time (no environment variable fallback).
  • Ollama: No API key required (local inference).
Example

config = DeepEvalModelConfig()  # Default Ollama
model = config.to_deepeval_model()

openai_config = DeepEvalModelConfig(
    provider=ProviderEnum.OPENAI,
    model_name="gpt-4o",
    api_key="sk-..."  # Or set OPENAI_API_KEY env var
)
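As a further sketch grounded in validate_provider_requirements below, an Azure OpenAI configuration must supply endpoint, deployment_name, and api_key explicitly; the concrete values here are placeholders.

azure_config = DeepEvalModelConfig(
    provider=ProviderEnum.AZURE_OPENAI,
    model_name="gpt-4o",                    # placeholder model name
    deployment_name="my-gpt4o-deployment",  # placeholder deployment
    endpoint="https://my-resource.openai.azure.com/",
    api_version="2024-02-01",               # placeholder API version
    api_key="...",                          # required; no env var fallback
)
azure_model = azure_config.to_deepeval_model()  # returns AzureOpenAIModel

Note that to_deepeval_model() passes temperature=1.0 for Azure regardless of the configured value (see the source below).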

to_deepeval_model()

Convert configuration to native DeepEval model class.

Returns the appropriate DeepEval model class instance based on the configured provider.

Returns:

  • DeepEvalModel: DeepEval model instance (GPTModel, AzureOpenAIModel, AnthropicModel, or OllamaModel)

Raises:

  • ValueError: If provider is not supported

Source code in src/holodeck/lib/evaluators/deepeval/config.py
def to_deepeval_model(self) -> DeepEvalModel:
    """Convert configuration to native DeepEval model class.

    Returns the appropriate DeepEval model class instance based on
    the configured provider.

    Returns:
        DeepEval model instance (GPTModel, AzureOpenAIModel,
        AnthropicModel, or OllamaModel)

    Raises:
        ValueError: If provider is not supported
    """
    if self.provider == ProviderEnum.OPENAI:
        from deepeval.models import GPTModel

        kwargs: dict[str, Any] = {
            "model": self.model_name,
            "temperature": self.temperature,
        }
        if self.api_key:
            kwargs["api_key"] = self.api_key
        return GPTModel(**kwargs)

    elif self.provider == ProviderEnum.AZURE_OPENAI:
        from deepeval.models import AzureOpenAIModel

        return AzureOpenAIModel(
            model_name=self.model_name,
            deployment_name=self.deployment_name,
            azure_endpoint=self.endpoint,
            openai_api_version=self.api_version,
            azure_openai_api_key=self.api_key,
            temperature=1.0,  # reasoning models require temperature=1.0
        )

    elif self.provider == ProviderEnum.ANTHROPIC:
        from deepeval.models import AnthropicModel

        kwargs = {
            "model": self.model_name,
            "temperature": self.temperature,
        }
        if self.api_key:
            kwargs["api_key"] = self.api_key
        return AnthropicModel(**kwargs)

    elif self.provider == ProviderEnum.OLLAMA:
        from deepeval.models import OllamaModel

        return OllamaModel(
            model=self.model_name,
            base_url=self.endpoint or "http://localhost:11434",
            temperature=self.temperature,
        )

    else:
        raise ValueError(f"Unsupported provider: {self.provider}")

validate_provider_requirements()

Validate that required fields are present for each provider.

Raises:

  • ValueError: If required fields are missing for the provider

Source code in src/holodeck/lib/evaluators/deepeval/config.py
@model_validator(mode="after")
def validate_provider_requirements(self) -> "DeepEvalModelConfig":
    """Validate that required fields are present for each provider.

    Raises:
        ValueError: If required fields are missing for the provider
    """
    if self.provider == ProviderEnum.AZURE_OPENAI:
        if not self.endpoint:
            raise ValueError("endpoint is required for Azure OpenAI provider")
        if not self.deployment_name:
            raise ValueError(
                "deployment_name is required for Azure OpenAI provider"
            )
        if not self.api_key:
            raise ValueError("api_key is required for Azure OpenAI provider")
    return self

GEval Evaluator

The GEval evaluator uses the G-Eval algorithm with chain-of-thought prompting for custom criteria evaluation.

GEvalEvaluator(name, criteria, evaluation_params=None, evaluation_steps=None, model_config=None, threshold=0.5, strict_mode=False, timeout=60.0, retry_config=None)

Bases: DeepEvalBaseEvaluator

G-Eval custom criteria evaluator.

Evaluates LLM outputs against user-defined criteria using the G-Eval algorithm, which combines chain-of-thought prompting with token probability scoring.

G-Eval works in two phases:

  1. Step Generation: Auto-generates evaluation steps from the criteria
  2. Scoring: Uses the steps to score the test case on a 1-5 scale (normalized to 0-1)

Attributes:

  • _metric_name: Custom name for this evaluation metric
  • _criteria: Natural language criteria for evaluation
  • _evaluation_params: Test case fields to include in evaluation
  • _evaluation_steps: Optional explicit evaluation steps
  • _strict_mode: Whether to use binary scoring (1.0 or 0.0)

Example

evaluator = GEvalEvaluator(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7
)
result = await evaluator.evaluate(
    input="Write me an email",
    actual_output="Dear Sir/Madam, ..."
)
print(result["score"])   # 0.85
print(result["passed"])  # True

Initialize G-Eval evaluator.

Parameters:

  • name (str, required): Metric identifier (e.g., "Correctness", "Helpfulness")
  • criteria (str, required): Natural language evaluation criteria
  • evaluation_params (list[str] | None, default None): Test case fields to include in evaluation. Valid options: ["input", "actual_output", "expected_output", "context", "retrieval_context"]. Default: ["actual_output"].
  • evaluation_steps (list[str] | None, default None): Explicit evaluation steps. If None, G-Eval auto-generates steps from the criteria.
  • model_config (DeepEvalModelConfig | None, default None): LLM judge configuration. Defaults to Ollama gpt-oss:20b.
  • threshold (float, default 0.5): Pass/fail score threshold (0.0-1.0).
  • strict_mode (bool, default False): If True, scores are binary (1.0 or 0.0).
  • timeout (float | None, default 60.0): Evaluation timeout in seconds.
  • retry_config (RetryConfig | None, default None): Retry configuration for transient failures.

Raises:

  • ValueError: If invalid evaluation_params are provided.

Source code in src/holodeck/lib/evaluators/deepeval/geval.py
def __init__(
    self,
    name: str,
    criteria: str,
    evaluation_params: list[str] | None = None,
    evaluation_steps: list[str] | None = None,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    strict_mode: bool = False,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize G-Eval evaluator.

    Args:
        name: Metric identifier (e.g., "Correctness", "Helpfulness")
        criteria: Natural language evaluation criteria
        evaluation_params: Test case fields to include in evaluation.
            Valid options: ["input", "actual_output", "expected_output",
                          "context", "retrieval_context"]
            Default: ["actual_output"]
        evaluation_steps: Explicit evaluation steps. If None, G-Eval
            auto-generates steps from the criteria.
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        strict_mode: If True, scores are binary (1.0 or 0.0). Default: False.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.

    Raises:
        ValueError: If invalid evaluation_params are provided.
    """
    # Validate and set evaluation params before calling super().__init__
    if evaluation_params is None:
        evaluation_params = ["actual_output"]

    # Validate evaluation params
    for param in evaluation_params:
        if param not in VALID_EVALUATION_PARAMS:
            raise ValueError(
                f"Invalid evaluation_param: '{param}'. "
                f"Valid options: {sorted(VALID_EVALUATION_PARAMS)}"
            )

    self._metric_name = name
    self._criteria = criteria
    self._evaluation_params = evaluation_params
    self._evaluation_steps = evaluation_steps
    self._strict_mode = strict_mode

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
    )

    logger.debug(
        f"GEvalEvaluator initialized: name={name}, "
        f"criteria_len={len(criteria)}, "
        f"evaluation_params={evaluation_params}, "
        f"strict_mode={strict_mode}"
    )

name property

Return the custom metric name.

evaluate(**kwargs) async and get_param_spec() classmethod

Inherited unchanged from DeepEvalBaseEvaluator; see Base Classes above for their full documentation and source.

RAG Evaluators

RAG evaluators measure retrieval-augmented generation pipeline quality.

FaithfulnessEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)

Bases: DeepEvalBaseEvaluator

Faithfulness evaluator for detecting hallucinations.

Detects hallucinations by comparing agent response to retrieval context. Returns a low score if the response contains information not found in the retrieval context (hallucination detected).

Required inputs
  • input: User query
  • actual_output: Agent response
  • retrieval_context: List of retrieved text chunks
Example

evaluator = FaithfulnessEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="What are the store hours?",
    actual_output="Store is open 24/7.",
    retrieval_context=["Store hours: Mon-Fri 9am-5pm"]
)
print(result["score"])  # Low score (hallucination detected)

Attributes:

  • _include_reason: Whether to include reasoning in results.

Initialize Faithfulness evaluator.

Parameters:

  • model_config (DeepEvalModelConfig | None, default None): LLM judge configuration. Defaults to Ollama gpt-oss:20b.
  • threshold (float, default 0.5): Pass/fail score threshold (0.0-1.0).
  • include_reason (bool, default True): Whether to include reasoning in results.
  • timeout (float | None, default 60.0): Evaluation timeout in seconds.
  • retry_config (RetryConfig | None, default None): Retry configuration for transient failures.
Source code in src/holodeck/lib/evaluators/deepeval/faithfulness.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize Faithfulness evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
    """
    self._include_reason = include_reason

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
    )

    logger.debug(
        f"FaithfulnessEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}"
    )

name property

Return the metric name.

evaluate(**kwargs) async and get_param_spec() classmethod

Inherited unchanged from DeepEvalBaseEvaluator; see Base Classes above for their full documentation and source.

AnswerRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, strict_mode=False, timeout=60.0, retry_config=None)

Bases: DeepEvalBaseEvaluator

Answer Relevancy evaluator - measures statement relevance to input.

Evaluates how relevant the response statements are to the input query. Unlike other RAG metrics, this does NOT require retrieval_context.

Required inputs
  • input: User query
  • actual_output: Agent response
Example

evaluator = AnswerRelevancyEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is the return policy?",
    actual_output="We offer a 30-day full refund at no extra cost."
)
print(result["score"])  # High score if relevant

Attributes:

  • _include_reason: Whether to include reasoning in results.
  • _strict_mode: Whether to use binary scoring (1.0 or 0.0).

Initialize Answer Relevancy evaluator.

Parameters:

  • model_config (DeepEvalModelConfig | None, default None): LLM judge configuration. Defaults to Ollama gpt-oss:20b.
  • threshold (float, default 0.5): Pass/fail score threshold (0.0-1.0).
  • include_reason (bool, default True): Whether to include reasoning in results.
  • strict_mode (bool, default False): Binary scoring mode (1.0 or 0.0 only).
  • timeout (float | None, default 60.0): Evaluation timeout in seconds.
  • retry_config (RetryConfig | None, default None): Retry configuration for transient failures.
Source code in src/holodeck/lib/evaluators/deepeval/answer_relevancy.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    strict_mode: bool = False,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize Answer Relevancy evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        strict_mode: Binary scoring mode (1.0 or 0.0 only). Default: False.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
    """
    self._include_reason = include_reason
    self._strict_mode = strict_mode

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
    )

    logger.debug(
        f"AnswerRelevancyEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}, "
        f"strict_mode={strict_mode}"
    )

name property

Return the metric name.

evaluate(**kwargs) async and get_param_spec() classmethod

Inherited unchanged from DeepEvalBaseEvaluator; see Base Classes above for their full documentation and source.

ContextualRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)

Bases: DeepEvalBaseEvaluator

Contextual Relevancy evaluator for RAG pipelines.

Measures the relevance of retrieved context to the user query. Returns the proportion of chunks that are relevant to the query.

Required inputs
  • input: User query
  • actual_output: Agent response
  • retrieval_context: List of retrieved text chunks
Example

evaluator = ContextualRelevancyEvaluator(threshold=0.6)
result = await evaluator.evaluate(
    input="What is the pricing?",
    actual_output="Basic plan is $10/month.",
    retrieval_context=[
        "Pricing: Basic $10, Pro $25",  # Relevant
        "Company founded in 2020",      # Irrelevant
    ]
)
print(result["score"])  # 0.5 (1 of 2 chunks relevant)

Attributes:

  • _include_reason: Whether to include reasoning in results.

Initialize Contextual Relevancy evaluator.

Parameters:

  • model_config (DeepEvalModelConfig | None, default None): LLM judge configuration. Defaults to Ollama gpt-oss:20b.
  • threshold (float, default 0.5): Pass/fail score threshold (0.0-1.0).
  • include_reason (bool, default True): Whether to include reasoning in results.
  • timeout (float | None, default 60.0): Evaluation timeout in seconds.
  • retry_config (RetryConfig | None, default None): Retry configuration for transient failures.
Source code in src/holodeck/lib/evaluators/deepeval/contextual_relevancy.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize Contextual Relevancy evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
    """
    self._include_reason = include_reason

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
    )

    logger.debug(
        f"ContextualRelevancyEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}"
    )

name property

Return the metric name.

evaluate(**kwargs) async and get_param_spec() classmethod

Inherited unchanged from DeepEvalBaseEvaluator; see Base Classes above for their full documentation and source.

ContextualPrecisionEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)

Bases: DeepEvalBaseEvaluator

Contextual Precision evaluator for RAG pipelines.

Evaluates the ranking quality of retrieved chunks. Measures whether relevant chunks appear before irrelevant ones.

Required inputs
  • input: User query
  • actual_output: Agent response
  • expected_output: Ground truth answer
  • retrieval_context: List of retrieved text chunks (order matters)
Example

evaluator = ContextualPrecisionEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is X?",
    actual_output="X is...",
    expected_output="X is the correct definition.",
    retrieval_context=[
        "Irrelevant info",      # Bad: irrelevant first
        "X is the definition",  # Good: relevant
    ]
)
print(result["score"])  # Lower due to poor ranking

Attributes:

  • _include_reason: Whether to include reasoning in results.

Initialize Contextual Precision evaluator.

Parameters:

  • model_config (DeepEvalModelConfig | None, default None): LLM judge configuration. Defaults to Ollama gpt-oss:20b.
  • threshold (float, default 0.5): Pass/fail score threshold (0.0-1.0).
  • include_reason (bool, default True): Whether to include reasoning in results.
  • timeout (float | None, default 60.0): Evaluation timeout in seconds.
  • retry_config (RetryConfig | None, default None): Retry configuration for transient failures.
Source code in src/holodeck/lib/evaluators/deepeval/contextual_precision.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize Contextual Precision evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
    """
    self._include_reason = include_reason

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
    )

    logger.debug(
        f"ContextualPrecisionEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}"
    )

name property

Return the metric name.

evaluate(**kwargs) async and get_param_spec() classmethod

Inherited unchanged from DeepEvalBaseEvaluator; see Base Classes above for their full documentation and source.

ContextualRecallEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)

Bases: DeepEvalBaseEvaluator

Contextual Recall evaluator for RAG pipelines.

Measures retrieval completeness against expected output. Evaluates whether retrieval context contains all facts needed to produce the expected output.

Required inputs
  • input: User query
  • actual_output: Agent response
  • expected_output: Ground truth answer
  • retrieval_context: List of retrieved text chunks
Example

evaluator = ContextualRecallEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="List all features",
    actual_output="Features are A and B",
    expected_output="Features are A, B, and C",
    retrieval_context=["Feature A: ...", "Feature B: ..."]
)
print(result["score"])  # ~0.67 (missing Feature C)

Attributes:

  • _include_reason: Whether to include reasoning in results.

Initialize Contextual Recall evaluator.

Parameters:

  • model_config (DeepEvalModelConfig | None, default None): LLM judge configuration. Defaults to Ollama gpt-oss:20b.
  • threshold (float, default 0.5): Pass/fail score threshold (0.0-1.0).
  • include_reason (bool, default True): Whether to include reasoning in results.
  • timeout (float | None, default 60.0): Evaluation timeout in seconds.
  • retry_config (RetryConfig | None, default None): Retry configuration for transient failures.
Source code in src/holodeck/lib/evaluators/deepeval/contextual_recall.py
def __init__(
    self,
    model_config: DeepEvalModelConfig | None = None,
    threshold: float = 0.5,
    include_reason: bool = True,
    timeout: float | None = 60.0,
    retry_config: RetryConfig | None = None,
) -> None:
    """Initialize Contextual Recall evaluator.

    Args:
        model_config: LLM judge configuration. Defaults to Ollama gpt-oss:20b.
        threshold: Pass/fail score threshold (0.0-1.0). Default: 0.5.
        include_reason: Whether to include reasoning in results. Default: True.
        timeout: Evaluation timeout in seconds. Default: 60.0.
        retry_config: Retry configuration for transient failures.
    """
    self._include_reason = include_reason

    super().__init__(
        model_config=model_config,
        threshold=threshold,
        timeout=timeout,
        retry_config=retry_config,
    )

    logger.debug(
        f"ContextualRecallEvaluator initialized: "
        f"provider={self._model_config.provider.value}, "
        f"model={self._model_config.model_name}, "
        f"threshold={threshold}, include_reason={include_reason}"
    )

name property

Return the metric name.

evaluate(**kwargs) async

Evaluate with timeout and retry logic.

This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.

Parameters:

Name Type Description Default
**kwargs Any

Evaluation parameters (query, response, context, ground_truth, etc.)

{}

Returns:

Type Description
dict[str, Any]

Evaluation result dictionary

Raises:

Type Description
TimeoutError

If evaluation exceeds timeout

EvaluationError

If evaluation fails after retries

Example

evaluator = MyEvaluator(timeout=30.0) result = await evaluator.evaluate( ... query="What is the capital of France?", ... response="The capital of France is Paris.", ... context="France is a country in Europe.", ... ground_truth="Paris" ... ) print(result["score"]) 0.95

Source code in src/holodeck/lib/evaluators/base.py
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
async def evaluate(self, **kwargs: Any) -> dict[str, Any]:
    """Evaluate with timeout and retry logic.

    This is the main public interface for evaluation. It wraps the
    implementation with timeout and retry handling.

    Args:
        **kwargs: Evaluation parameters
            (query, response, context, ground_truth, etc.)

    Returns:
        Evaluation result dictionary

    Raises:
        asyncio.TimeoutError: If evaluation exceeds timeout
        EvaluationError: If evaluation fails after retries

    Example:
        >>> evaluator = MyEvaluator(timeout=30.0)
        >>> result = await evaluator.evaluate(
        ...     query="What is the capital of France?",
        ...     response="The capital of France is Paris.",
        ...     context="France is a country in Europe.",
        ...     ground_truth="Paris"
        ... )
        >>> print(result["score"])
        0.95
    """
    logger.debug(f"Starting evaluation: {self.name} (timeout={self.timeout}s)")

    if self.timeout is None:
        # No timeout - evaluate directly with retry
        logger.debug(f"Evaluation {self.name}: no timeout")
        return await self._evaluate_with_retry(**kwargs)

    # Apply timeout using asyncio.wait_for
    try:
        logger.debug(f"Evaluation {self.name}: applying timeout of {self.timeout}s")
        return await asyncio.wait_for(
            self._evaluate_with_retry(**kwargs), timeout=self.timeout
        )
    except TimeoutError:
        logger.error(f"Evaluation {self.name} exceeded timeout of {self.timeout}s")
        raise  # Re-raise timeout error as-is

get_param_spec() classmethod

Get the parameter specification for this evaluator.

Returns:

Type Description
ParamSpec

ParamSpec declaring required/optional parameters and context flags.

Source code in src/holodeck/lib/evaluators/base.py
121
122
123
124
125
126
127
128
@classmethod
def get_param_spec(cls) -> ParamSpec:
    """Get the parameter specification for this evaluator.

    Returns:
        ParamSpec declaring required/optional parameters and context flags.
    """
    return cls.PARAM_SPEC
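
A quick way to see what inputs an evaluator expects is to print its parameter specification. The snippet below is a minimal sketch and assumes the concrete DeepEval evaluators inherit get_param_spec() from this base class.

from holodeck.lib.evaluators.deepeval import FaithfulnessEvaluator

# Minimal sketch: inspect the evaluator's declared parameters before calling it.
# Assumes FaithfulnessEvaluator inherits get_param_spec() from the base class.
spec = FaithfulnessEvaluator.get_param_spec()
print(spec)  # ParamSpec declaring required/optional parameters and context flags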

Usage Examples

DeepEval GEval Metrics

from holodeck.lib.evaluators.deepeval import GEvalEvaluator, DeepEvalModelConfig
from holodeck.models.llm import LLMProvider

# Configure model
model_config = DeepEvalModelConfig(
    provider=LLMProvider.OLLAMA,
    name="llama3.2:latest",
    temperature=0.0
)

# Create evaluator with custom criteria
evaluator = GEvalEvaluator(
    name="Coherence",
    criteria="Evaluate whether the response is clear and well-structured.",
    evaluation_steps=[
        "Check if the response uses clear language.",
        "Assess if the explanation is easy to follow."
    ],
    evaluation_params=["actual_output"],
    model_config=model_config,
    threshold=0.7
)

# Evaluate
result = await evaluator.evaluate(
    actual_output="The password can be reset by clicking 'Forgot Password' on the login page.",
    input="How do I reset my password?"
)

print(f"Score: {result.score}")
print(f"Passed: {result.passed}")
print(f"Reason: {result.reason}")

DeepEval RAG Metrics

from holodeck.lib.evaluators.deepeval import (
    FaithfulnessEvaluator,
    AnswerRelevancyEvaluator,
    DeepEvalModelConfig
)
from holodeck.models.llm import LLMProvider

# Configure model
model_config = DeepEvalModelConfig(
    provider=LLMProvider.OLLAMA,
    name="llama3.2:latest",
    temperature=0.0
)

# Faithfulness - detect hallucinations
faithfulness = FaithfulnessEvaluator(
    model_config=model_config,
    threshold=0.8,
    include_reason=True
)

result = await faithfulness.evaluate(
    input="What is our return policy?",
    actual_output="You can return items within 30 days for a full refund.",
    retrieval_context=[
        "Our return policy allows returns within 30 days of purchase.",
        "Full refunds are provided for items in original condition."
    ]
)

# Answer Relevancy - check response addresses query
relevancy = AnswerRelevancyEvaluator(
    model_config=model_config,
    threshold=0.7
)

result = await relevancy.evaluate(
    input="How do I reset my password?",
    actual_output="Click 'Forgot Password' on the login page and follow the email instructions."
)
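
The contextual metrics follow the same pattern. The example below uses ContextualRecallEvaluator (documented above); its import path mirrors the other RAG evaluators, and the evaluate() keyword names (input, actual_output, expected_output, retrieval_context) are assumptions based on DeepEval's contextual recall metric rather than a confirmed signature.

from holodeck.lib.evaluators.deepeval import ContextualRecallEvaluator

# Contextual Recall - check the retrieved context covers the expected answer
recall = ContextualRecallEvaluator(
    model_config=model_config,
    threshold=0.7,
    include_reason=True
)

# Keyword names here are assumptions based on DeepEval's contextual recall metric.
result = await recall.evaluate(
    input="What is our return policy?",
    actual_output="You can return items within 30 days for a full refund.",
    expected_output="Items can be returned within 30 days of purchase for a refund.",
    retrieval_context=[
        "Our return policy allows returns within 30 days of purchase.",
        "Full refunds are provided for items in original condition."
    ]
)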

NLP Metrics

from holodeck.lib.evaluators.nlp_metrics import compute_f1_score, compute_rouge

# Compute F1 score
prediction = "the cat is on the mat"
reference = "a cat is on the mat"
f1 = compute_f1_score(prediction, reference)
print(f"F1 Score: {f1}")

# Compute ROUGE scores
scores = compute_rouge(prediction, reference)
print(f"ROUGE-1: {scores['rouge1']}")
print(f"ROUGE-2: {scores['rouge2']}")
print(f"ROUGE-L: {scores['rougeL']}")

Metric Configuration in YAML

DeepEval GEval Metric

evaluations:
  model:
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0

  metrics:
    - type: geval
      name: "Coherence"
      criteria: "Evaluate whether the response is clear and well-structured."
      evaluation_steps:
        - "Check if the response uses clear language."
        - "Assess if the explanation is easy to follow."
      evaluation_params:
        - actual_output
        - input
      threshold: 0.7
      enabled: true
      fail_on_error: false

DeepEval RAG Metrics

evaluations:
  model:
    provider: ollama
    name: llama3.2:latest
    temperature: 0.0

  metrics:
    # Faithfulness - hallucination detection
    - type: rag
      metric_type: faithfulness
      threshold: 0.8
      include_reason: true

    # Answer Relevancy
    - type: rag
      metric_type: answer_relevancy
      threshold: 0.7

    # Contextual Relevancy
    - type: rag
      metric_type: contextual_relevancy
      threshold: 0.75

    # Contextual Precision
    - type: rag
      metric_type: contextual_precision
      threshold: 0.8

    # Contextual Recall
    - type: rag
      metric_type: contextual_recall
      threshold: 0.7

NLP Metrics

evaluations:
  metrics:
    - type: standard
      metric: f1_score
      threshold: 0.8

    - type: standard
      metric: bleu
      threshold: 0.6

    - type: standard
      metric: rouge
      threshold: 0.7

    - type: standard
      metric: meteor
      threshold: 0.65

Per-Metric Model Override

evaluations:
  model:
    provider: ollama
    name: llama3.2:latest  # Default: free, local

  metrics:
    - type: rag
      metric_type: faithfulness
      threshold: 0.9
      model:  # Override for critical metric
        provider: openai
        name: gpt-4

Legacy AI Metrics (Deprecated)

DEPRECATED: Azure AI-based metrics are deprecated and will be removed in a future version. Migrate to DeepEval metrics for better flexibility and local model support.

Migration Guide

Legacy Metric    Recommended Replacement
groundedness     type: rag, metric_type: faithfulness
relevance        type: rag, metric_type: answer_relevancy
coherence        type: geval with custom criteria
safety           type: geval with custom criteria

# DEPRECATED - Use DeepEval evaluators instead
from holodeck.lib.evaluators.azure_ai import AzureAIEvaluator

evaluator = AzureAIEvaluator(model="gpt-4", api_key="your-key")

result = await evaluator.evaluate_groundedness(
    response="Paris is the capital of France",
    context="France's capital city is known for the Eiffel Tower",
)

# DEPRECATED - Use type: geval or type: rag instead
evaluations:
  metrics:
    - type: standard
      metric: groundedness  # Deprecated
      threshold: 0.8

    - type: standard
      metric: relevance     # Deprecated
      threshold: 0.75

    - type: standard
      metric: coherence     # Deprecated
      threshold: 0.7

    - type: standard
      metric: safety        # Deprecated
      threshold: 0.9
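
Applying the migration table above, the same four metrics can be expressed with the current metric types. The geval criteria strings below are illustrative placeholders; thresholds are carried over unchanged.

evaluations:
  model:
    provider: ollama
    name: llama3.2:latest

  metrics:
    - type: rag
      metric_type: faithfulness        # replaces groundedness
      threshold: 0.8

    - type: rag
      metric_type: answer_relevancy    # replaces relevance
      threshold: 0.75

    - type: geval
      name: "Coherence"                # replaces coherence
      criteria: "Evaluate whether the response is clear and well-structured."
      evaluation_params:
        - actual_output
      threshold: 0.7

    - type: geval
      name: "Safety"                   # replaces safety
      criteria: "Evaluate whether the response avoids harmful or unsafe content."
      evaluation_params:
        - actual_output
      threshold: 0.9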

Integration with Test Runner

The test runner automatically:

  1. Loads evaluation configuration from agent YAML
  2. Creates appropriate evaluators based on metric type
  3. Invokes evaluators on test outputs
  4. Extracts retrieval_context from tool results (for RAG metrics)
  5. Collects metric scores
  6. Compares scores against thresholds (sketched below, after the evaluator-creation example)
  7. Includes results in test report

Test Runner Evaluator Creation

# Internal test runner logic (simplified)
def _create_evaluators(self, metrics: list[MetricType]) -> dict:
    evaluators = {}

    for metric in metrics:
        if metric.type == "geval":
            evaluators[metric.name] = GEvalEvaluator(
                name=metric.name,
                criteria=metric.criteria,
                evaluation_steps=metric.evaluation_steps,
                evaluation_params=metric.evaluation_params,
                model_config=self._get_model_config(metric),
                threshold=metric.threshold,
                strict_mode=metric.strict_mode
            )
        elif metric.type == "rag":
            evaluator_class = RAG_EVALUATOR_MAP[metric.metric_type]
            evaluators[metric.metric_type] = evaluator_class(
                model_config=self._get_model_config(metric),
                threshold=metric.threshold,
                include_reason=metric.include_reason
            )
        elif metric.type == "standard":
            # NLP or legacy metrics
            evaluators[metric.metric] = self._create_standard_evaluator(metric)

    return evaluators
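
Steps 5-7 (collecting scores, comparing against thresholds, and reporting) are sketched below. The method name and result shape are illustrative assumptions, not the actual runner API; scores are read from the evaluation result dictionary as in the evaluate() example above.

# Hedged sketch of score collection and threshold comparison (steps 5-7);
# names and result shape are illustrative, not the actual runner code.
async def _score_outputs(self, evaluators: dict, **eval_kwargs) -> list[dict]:
    results = []
    for name, evaluator in evaluators.items():
        result = await evaluator.evaluate(**eval_kwargs)
        score = result["score"]
        passed = score >= evaluator.threshold  # assumes the configured threshold is exposed
        results.append({"metric": name, "score": score, "passed": passed})
    return results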

Error Handling

DeepEval Errors

from holodeck.lib.evaluators.deepeval.errors import (
    DeepEvalError,
    ProviderNotSupportedError
)

try:
    result = await evaluator.evaluate(actual_output="...")
except ProviderNotSupportedError as e:
    # Handle the provider error first in case it subclasses DeepEvalError.
    print(f"Provider not supported: {e}")
except DeepEvalError as e:
    print(f"Evaluation failed: {e.message}")
    print(f"Metric: {e.metric_name}")
    print(f"Test case: {e.test_case_summary}")

Soft vs Hard Failures

metrics:
  # Soft failure - continues on error
  - type: geval
    name: "Quality"
    criteria: "..."
    fail_on_error: false  # Default

  # Hard failure - stops test on error
  - type: rag
    metric_type: faithfulness
    fail_on_error: true
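
At evaluation time, the distinction might be honored as sketched below; this is illustrative, not the actual runner logic, and the helper name and result shape are assumptions.

# Hedged sketch: a soft failure (fail_on_error: false, the default) is recorded
# and the run continues; a hard failure (fail_on_error: true) re-raises the
# error and stops the test.
async def run_metric(evaluator, metric, metric_name, results, **eval_kwargs):
    try:
        result = await evaluator.evaluate(**eval_kwargs)
        results.append({"metric": metric_name, "score": result["score"]})
    except Exception as error:
        if metric.fail_on_error:
            raise  # hard failure: abort the test
        # soft failure: record the error and continue with remaining metrics
        results.append({"metric": metric_name, "error": str(error), "passed": False})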