Evaluation Framework API¶
HoloDeck provides a flexible evaluation framework for measuring agent response quality. The framework supports three tiers of metrics:
- DeepEval Metrics (Recommended) - LLM-as-a-judge with GEval and RAG metrics
- NLP Metrics (Standard) - Algorithmic text comparison
- Legacy AI Metrics (Deprecated) - Azure AI-based metrics
Evaluation Configuration Models¶
EvaluationConfig
¶
Bases: BaseModel
Evaluation framework configuration.
Container for evaluation metrics with optional default model configuration. Supports standard EvaluationMetric, GEvalMetric (custom criteria), and RAGMetric (RAG pipeline evaluation).
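For orientation, here is a minimal sketch that assembles an EvaluationConfig holding one metric from each tier. The import path holodeck.models.evaluation and the location of the RAGMetricType enum are assumptions based on the source file referenced below.
from holodeck.models.evaluation import (  # import path assumed
    EvaluationConfig,
    EvaluationMetric,
    GEvalMetric,
    RAGMetric,
    RAGMetricType,
)

config = EvaluationConfig(
    metrics=[
        # LLM-as-a-judge with custom criteria
        GEvalMetric(
            name="Professionalism",
            criteria="Evaluate if the response uses professional language",
            threshold=0.7,
        ),
        # RAG pipeline metric
        RAGMetric(metric_type=RAGMetricType.FAITHFULNESS, threshold=0.8),
        # Algorithmic NLP comparison
        EvaluationMetric(metric="f1_score", threshold=0.8),
    ]
)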
validate_metrics(v)
classmethod
¶
Validate metrics list is not empty.
Source code in src/holodeck/models/evaluation.py, lines 314-322
MetricType = Annotated[EvaluationMetric | GEvalMetric | RAGMetric, Field(discriminator='type')]
module-attribute
¶
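Because MetricType is discriminated on the type field, a plain dict (for example, a parsed YAML metric entry) validates into the matching model class. A brief sketch using Pydantic v2's TypeAdapter; the import path is an assumption.
from pydantic import TypeAdapter
from holodeck.models.evaluation import MetricType  # import path assumed

adapter = TypeAdapter(MetricType)
geval = adapter.validate_python(
    {"type": "geval", "name": "Coherence", "criteria": "Is the answer clear?", "threshold": 0.7}
)
rag = adapter.validate_python({"type": "rag", "metric_type": "faithfulness", "threshold": 0.8})
print(type(geval).__name__)  # GEvalMetric
print(type(rag).__name__)    # RAGMetric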
GEvalMetric
¶
Bases: BaseModel
G-Eval custom criteria metric configuration.
Uses discriminator pattern with type="geval" to distinguish from standard EvaluationMetric instances in a discriminated union.
G-Eval enables custom evaluation criteria defined in natural language, using chain-of-thought prompting with LLM-based scoring.
Example
metric = GEvalMetric(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7
)
validate_criteria(v)
classmethod
¶
Validate criteria is not empty.
Source code in src/holodeck/models/evaluation.py, lines 198-204
validate_evaluation_params(v)
classmethod
¶
Validate evaluation_params contains valid values.
Source code in src/holodeck/models/evaluation.py, lines 206-218
validate_name(v)
classmethod
¶
Validate name is not empty.
Source code in src/holodeck/models/evaluation.py, lines 190-196
validate_threshold(v)
classmethod
¶
Validate threshold is in valid range.
Source code in src/holodeck/models/evaluation.py, lines 220-226
RAGMetric
¶
Bases: BaseModel
RAG pipeline evaluation metric configuration.
Uses discriminator pattern with type="rag" to distinguish from standard EvaluationMetric and GEvalMetric instances in a discriminated union.
RAG metrics evaluate the quality of retrieval-augmented generation pipelines:
- Faithfulness: Detects hallucinations by comparing response to context
- AnswerRelevancy: Measures how relevant response statements are to the query
- ContextualRelevancy: Measures relevance of retrieved chunks to query
- ContextualPrecision: Evaluates ranking quality of retrieved chunks
- ContextualRecall: Measures retrieval completeness against expected output
Example
metric = RAGMetric(
    metric_type=RAGMetricType.FAITHFULNESS,
    threshold=0.8
)
validate_threshold(v)
classmethod
¶
Validate threshold is in valid range.
Source code in src/holodeck/models/evaluation.py, lines 281-287
EvaluationMetric
¶
Bases: BaseModel
Evaluation metric configuration.
Represents a single evaluation metric with flexible model configuration, including per-metric LLM model overrides.
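A minimal sketch of a standard metric entry follows; the field names shown are inferred from the validators documented below and from the YAML examples later on this page, so treat them as illustrative rather than an exhaustive schema.
from holodeck.models.evaluation import EvaluationMetric  # import path assumed

metric = EvaluationMetric(
    metric="f1_score",      # must be non-empty (validate_metric)
    threshold=0.8,          # numeric if provided (validate_threshold)
    enabled=True,           # boolean (validate_enabled)
    fail_on_error=False,    # boolean (validate_fail_on_error)
    timeout_ms=30000,       # positive (validate_timeout_ms)
)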
validate_custom_prompt(v)
classmethod
¶
Validate custom_prompt is not empty if provided.
Source code in src/holodeck/models/evaluation.py, lines 121-127
validate_enabled(v)
classmethod
¶
Validate enabled is boolean.
Source code in src/holodeck/models/evaluation.py, lines 81-87
validate_fail_on_error(v)
classmethod
¶
Validate fail_on_error is boolean.
Source code in src/holodeck/models/evaluation.py, lines 89-95
validate_metric(v)
classmethod
¶
Validate metric is not empty.
Source code in src/holodeck/models/evaluation.py, lines 65-71
validate_retry_on_failure(v)
classmethod
¶
Validate retry_on_failure is in valid range.
Source code in src/holodeck/models/evaluation.py, lines 97-103
validate_scale(v)
classmethod
¶
Validate scale is positive.
Source code in src/holodeck/models/evaluation.py, lines 113-119
validate_threshold(v)
classmethod
¶
Validate threshold is numeric if provided.
Source code in src/holodeck/models/evaluation.py, lines 73-79
validate_timeout_ms(v)
classmethod
¶
Validate timeout_ms is positive.
Source code in src/holodeck/models/evaluation.py, lines 105-111
DeepEval Evaluators (Recommended)¶
These evaluators provide LLM-as-a-judge scoring built on the DeepEval library.
Base Classes¶
DeepEvalBaseEvaluator(model_config=None, threshold=0.5, timeout=60.0, retry_config=None)
¶
Bases: BaseEvaluator
Abstract base class for DeepEval-based evaluators.
This class extends BaseEvaluator to provide DeepEval-specific functionality:
- Model configuration and initialization
- LLMTestCase construction from evaluation inputs
- Result normalization and logging
Subclasses must implement _create_metric() to return the specific DeepEval metric instance.
Note: DeepEval uses different parameter names than Azure AI/NLP:
- input (not query)
- actual_output (not response)
- expected_output (not ground_truth)
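In practice this means mapping HoloDeck-style names onto DeepEval-style names before calling evaluate() on a DeepEval evaluator. A hedged sketch of that mapping (the evaluator instance comes from the Example below):
# query -> input, response -> actual_output, ground_truth -> expected_output
holodeck_output = {
    "query": "What is Python?",
    "response": "Python is a programming language.",
    "ground_truth": "Python is a programming language.",
}
result = await evaluator.evaluate(
    input=holodeck_output["query"],
    actual_output=holodeck_output["response"],
    expected_output=holodeck_output["ground_truth"],
)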
Attributes:
| Name | Description |
|---|---|
| `model_config` | Configuration for the evaluation LLM |
| `threshold` | Score threshold for pass/fail determination |
| `model` | The initialized DeepEval model instance |
Example
class MyMetricEvaluator(DeepEvalBaseEvaluator):
    def _create_metric(self):
        return SomeDeepEvalMetric(
            threshold=self._threshold,
            model=self._model
        )

evaluator = MyMetricEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is Python?",
    actual_output="Python is a programming language."
)
Initialize DeepEval base evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | Configuration for the evaluation model. Defaults to Ollama with gpt-oss:20b. | `None` |
| `threshold` | `float` | Score threshold for pass/fail (0.0-1.0, default: 0.5) | `0.5` |
| `timeout` | `float \| None` | Evaluation timeout in seconds (default: 60.0) | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/base.py, lines 70-96
name
property
¶
Return evaluator name (class name by default).
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
DeepEvalModelConfig
¶
Bases: BaseModel
Configuration adapter for DeepEval model classes.
This class bridges HoloDeck's LLMProvider configuration to DeepEval's native model classes (GPTModel, AzureOpenAIModel, AnthropicModel, OllamaModel).
The default configuration uses Ollama with gpt-oss:20b for local evaluation without requiring API keys.
Attributes:
| Name | Type | Description |
|---|---|---|
| `provider` | `ProviderEnum` | LLM provider to use (defaults to Ollama) |
| `model_name` | `str` | Name of the model (defaults to gpt-oss:20b) |
| `api_key` | `str \| None` | API key for cloud providers (not required for Ollama) |
| `endpoint` | `str \| None` | API endpoint URL (required for Azure, optional for Ollama) |
| `api_version` | `str \| None` | Azure OpenAI API version |
| `deployment_name` | `str \| None` | Azure OpenAI deployment name |
| `temperature` | `float` | Temperature for generation (defaults to 0.0 for determinism) |
API Key Behavior
- OpenAI: API key can be provided via the `api_key` field or the `OPENAI_API_KEY` environment variable. If neither is set, DeepEval's GPTModel will raise an error at runtime.
- Anthropic: API key can be provided via the `api_key` field or the `ANTHROPIC_API_KEY` environment variable. If neither is set, DeepEval's AnthropicModel will raise an error at runtime.
- Azure OpenAI: The `api_key` field is required and validated at configuration time (no environment variable fallback).
- Ollama: No API key required (local inference).
Example
config = DeepEvalModelConfig()  # Default Ollama
model = config.to_deepeval_model()

openai_config = DeepEvalModelConfig(
    provider=ProviderEnum.OPENAI,
    model_name="gpt-4o",
    api_key="sk-..."  # Or set OPENAI_API_KEY env var
)
to_deepeval_model()
¶
Convert configuration to native DeepEval model class.
Returns the appropriate DeepEval model class instance based on the configured provider.
Returns:
| Type | Description |
|---|---|
| `DeepEvalModel` | DeepEval model instance (GPTModel, AzureOpenAIModel, AnthropicModel, or OllamaModel) |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If provider is not supported |
Source code in src/holodeck/lib/evaluators/deepeval/config.py, lines 111-168
validate_provider_requirements()
¶
Validate that required fields are present for each provider.
Raises:
| Type | Description |
|---|---|
| `ValueError` | If required fields are missing for the provider |
Source code in src/holodeck/lib/evaluators/deepeval/config.py, lines 93-109
GEval Evaluator¶
The GEval evaluator uses the G-Eval algorithm with chain-of-thought prompting for custom criteria evaluation.
GEvalEvaluator(name, criteria, evaluation_params=None, evaluation_steps=None, model_config=None, threshold=0.5, strict_mode=False, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
G-Eval custom criteria evaluator.
Evaluates LLM outputs against user-defined criteria using the G-Eval algorithm, which combines chain-of-thought prompting with token probability scoring.
G-Eval works in two phases (see the sketch below):
1. Step Generation: Auto-generates evaluation steps from the criteria
2. Scoring: Uses the steps to score the test case on a 1-5 scale (normalized to 0-1)
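To make the two phases concrete, the sketch below contrasts auto-generated steps (criteria only) with explicit evaluation_steps, which skip step generation; constructor arguments are as documented in this section.
from holodeck.lib.evaluators.deepeval import GEvalEvaluator

# Phase 1 performed by G-Eval: steps are auto-generated from the criteria
auto = GEvalEvaluator(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct.",
)

# Explicit steps: step generation is skipped and scoring follows these steps
explicit = GEvalEvaluator(
    name="Correctness",
    criteria="Determine whether the actual output is factually correct.",
    evaluation_steps=[
        "Identify the factual claims in the actual output.",
        "Check each claim against the expected output.",
    ],
    evaluation_params=["actual_output", "expected_output"],
)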
Attributes:
| Name | Description |
|---|---|
| `_metric_name` | Custom name for this evaluation metric |
| `_criteria` | Natural language criteria for evaluation |
| `_evaluation_params` | Test case fields to include in evaluation |
| `_evaluation_steps` | Optional explicit evaluation steps |
| `_strict_mode` | Whether to use binary scoring (1.0 or 0.0) |
Example
evaluator = GEvalEvaluator(
    name="Professionalism",
    criteria="Evaluate if the response uses professional language",
    threshold=0.7
)
result = await evaluator.evaluate(
    input="Write me an email",
    actual_output="Dear Sir/Madam, ..."
)
print(result["score"])   # 0.85
print(result["passed"])  # True
Initialize G-Eval evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Metric identifier (e.g., "Correctness", "Helpfulness") | required |
| `criteria` | `str` | Natural language evaluation criteria | required |
| `evaluation_params` | `list[str] \| None` | Test case fields to include in evaluation. Valid options: `["input", "actual_output", "expected_output", "context", "retrieval_context"]`. Default: `["actual_output"]` | `None` |
| `evaluation_steps` | `list[str] \| None` | Explicit evaluation steps. If None, G-Eval auto-generates steps from the criteria. | `None` |
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `strict_mode` | `bool` | If True, scores are binary (1.0 or 0.0). Default: False. | `False` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Raises:
| Type | Description |
|---|---|
| `ValueError` | If invalid evaluation_params are provided. |
Source code in src/holodeck/lib/evaluators/deepeval/geval.py, lines 63-125
name
property
¶
Return the custom metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
RAG Evaluators¶
RAG evaluators measure retrieval-augmented generation pipeline quality.
FaithfulnessEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Faithfulness evaluator for detecting hallucinations.
Detects hallucinations by comparing agent response to retrieval context. Returns a low score if the response contains information not found in the retrieval context (hallucination detected).
Required inputs
- input: User query
- actual_output: Agent response
- retrieval_context: List of retrieved text chunks
Example
evaluator = FaithfulnessEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="What are the store hours?",
    actual_output="Store is open 24/7.",
    retrieval_context=["Store hours: Mon-Fri 9am-5pm"]
)
print(result["score"])  # Low score (hallucination detected)
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
Initialize Faithfulness evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/faithfulness.py, lines 53-84
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
AnswerRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, strict_mode=False, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Answer Relevancy evaluator - measures statement relevance to input.
Evaluates how relevant the response statements are to the input query. Unlike other RAG metrics, this does NOT require retrieval_context.
Required inputs
- input: User query
- actual_output: Agent response
Example
evaluator = AnswerRelevancyEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is the return policy?",
    actual_output="We offer a 30-day full refund at no extra cost."
)
print(result["score"])  # High score if relevant
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
| `_strict_mode` | Whether to use binary scoring (1.0 or 0.0). |
Initialize Answer Relevancy evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `strict_mode` | `bool` | Binary scoring mode (1.0 or 0.0 only). Default: False. | `False` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/answer_relevancy.py, lines 49-84
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
ContextualRelevancyEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Relevancy evaluator for RAG pipelines.
Measures the relevance of retrieved context to the user query. Returns the proportion of chunks that are relevant to the query.
Required inputs
- input: User query
- actual_output: Agent response
- retrieval_context: List of retrieved text chunks
Example
evaluator = ContextualRelevancyEvaluator(threshold=0.6)
result = await evaluator.evaluate(
    input="What is the pricing?",
    actual_output="Basic plan is $10/month.",
    retrieval_context=[
        "Pricing: Basic $10, Pro $25",  # Relevant
        "Company founded in 2020",      # Irrelevant
    ]
)
print(result["score"])  # 0.5 (1 of 2 chunks relevant)
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
Initialize Contextual Relevancy evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_relevancy.py, lines 55-86
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
ContextualPrecisionEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Precision evaluator for RAG pipelines.
Evaluates the ranking quality of retrieved chunks. Measures whether relevant chunks appear before irrelevant ones.
Required inputs
- input: User query
- actual_output: Agent response
- expected_output: Ground truth answer
- retrieval_context: List of retrieved text chunks (order matters)
Example
evaluator = ContextualPrecisionEvaluator(threshold=0.7)
result = await evaluator.evaluate(
    input="What is X?",
    actual_output="X is...",
    expected_output="X is the correct definition.",
    retrieval_context=[
        "Irrelevant info",      # Bad: irrelevant first
        "X is the definition",  # Good: relevant
    ]
)
print(result["score"])  # Lower due to poor ranking
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
Initialize Contextual Precision evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_precision.py, lines 61-92
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
ContextualRecallEvaluator(model_config=None, threshold=0.5, include_reason=True, timeout=60.0, retry_config=None)
¶
Bases: DeepEvalBaseEvaluator
Contextual Recall evaluator for RAG pipelines.
Measures retrieval completeness against expected output. Evaluates whether retrieval context contains all facts needed to produce the expected output.
Required inputs
- input: User query
- actual_output: Agent response
- expected_output: Ground truth answer
- retrieval_context: List of retrieved text chunks
Example
evaluator = ContextualRecallEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="List all features",
    actual_output="Features are A and B",
    expected_output="Features are A, B, and C",
    retrieval_context=["Feature A: ...", "Feature B: ..."]
)
print(result["score"])  # ~0.67 (missing Feature C)
Attributes:
| Name | Description |
|---|---|
| `_include_reason` | Whether to include reasoning in results. |
Initialize Contextual Recall evaluator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `model_config` | `DeepEvalModelConfig \| None` | LLM judge configuration. Defaults to Ollama gpt-oss:20b. | `None` |
| `threshold` | `float` | Pass/fail score threshold (0.0-1.0). Default: 0.5. | `0.5` |
| `include_reason` | `bool` | Whether to include reasoning in results. Default: True. | `True` |
| `timeout` | `float \| None` | Evaluation timeout in seconds. Default: 60.0. | `60.0` |
| `retry_config` | `RetryConfig \| None` | Retry configuration for transient failures. | `None` |
Source code in src/holodeck/lib/evaluators/deepeval/contextual_recall.py, lines 59-90
name
property
¶
Return the metric name.
evaluate(**kwargs)
async
¶
Evaluate with timeout and retry logic.
This is the main public interface for evaluation. It wraps the implementation with timeout and retry handling.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `**kwargs` | `Any` | Evaluation parameters (query, response, context, ground_truth, etc.) | `{}` |
Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Evaluation result dictionary |
Raises:
| Type | Description |
|---|---|
| `TimeoutError` | If evaluation exceeds timeout |
| `EvaluationError` | If evaluation fails after retries |
Example
evaluator = MyEvaluator(timeout=30.0)
result = await evaluator.evaluate(
    query="What is the capital of France?",
    response="The capital of France is Paris.",
    context="France is a country in Europe.",
    ground_truth="Paris"
)
print(result["score"])
# 0.95
Source code in src/holodeck/lib/evaluators/base.py, lines 262-305
get_param_spec()
classmethod
¶
Get the parameter specification for this evaluator.
Returns:
| Type | Description |
|---|---|
| `ParamSpec` | ParamSpec declaring required/optional parameters and context flags. |
Source code in src/holodeck/lib/evaluators/base.py, lines 121-128
Usage Examples¶
DeepEval GEval Metrics¶
from holodeck.lib.evaluators.deepeval import GEvalEvaluator, DeepEvalModelConfig
from holodeck.models.llm import LLMProvider
# Configure model
model_config = DeepEvalModelConfig(
provider=LLMProvider.OLLAMA,
name="llama3.2:latest",
temperature=0.0
)
# Create evaluator with custom criteria
evaluator = GEvalEvaluator(
name="Coherence",
criteria="Evaluate whether the response is clear and well-structured.",
evaluation_steps=[
"Check if the response uses clear language.",
"Assess if the explanation is easy to follow."
],
evaluation_params=["actual_output"],
model_config=model_config,
threshold=0.7
)
# Evaluate
result = await evaluator.evaluate(
actual_output="The password can be reset by clicking 'Forgot Password' on the login page.",
input="How do I reset my password?"
)
print(f"Score: {result.score}")
print(f"Passed: {result.passed}")
print(f"Reason: {result.reason}")
DeepEval RAG Metrics¶
from holodeck.lib.evaluators.deepeval import (
FaithfulnessEvaluator,
AnswerRelevancyEvaluator,
DeepEvalModelConfig
)
from holodeck.models.llm import LLMProvider
# Configure model
model_config = DeepEvalModelConfig(
provider=LLMProvider.OLLAMA,
name="llama3.2:latest",
temperature=0.0
)
# Faithfulness - detect hallucinations
faithfulness = FaithfulnessEvaluator(
model_config=model_config,
threshold=0.8,
include_reason=True
)
result = await faithfulness.evaluate(
input="What is our return policy?",
actual_output="You can return items within 30 days for a full refund.",
retrieval_context=[
"Our return policy allows returns within 30 days of purchase.",
"Full refunds are provided for items in original condition."
]
)
# Answer Relevancy - check response addresses query
relevancy = AnswerRelevancyEvaluator(
model_config=model_config,
threshold=0.7
)
result = await relevancy.evaluate(
input="How do I reset my password?",
actual_output="Click 'Forgot Password' on the login page and follow the email instructions."
)
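Since every evaluator exposes an async evaluate(), independent metrics can be scored concurrently with standard asyncio. A minimal sketch reusing the evaluators above:
import asyncio

async def evaluate_both():
    # Run both judgments concurrently against the same test output
    return await asyncio.gather(
        faithfulness.evaluate(
            input="What is our return policy?",
            actual_output="You can return items within 30 days for a full refund.",
            retrieval_context=["Our return policy allows returns within 30 days of purchase."],
        ),
        relevancy.evaluate(
            input="What is our return policy?",
            actual_output="You can return items within 30 days for a full refund.",
        ),
    )

faithfulness_result, relevancy_result = asyncio.run(evaluate_both())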
NLP Metrics¶
from holodeck.lib.evaluators.nlp_metrics import compute_f1_score, compute_rouge
# Compute F1 score
prediction = "the cat is on the mat"
reference = "a cat is on the mat"
f1 = compute_f1_score(prediction, reference)
print(f"F1 Score: {f1}")
# Compute ROUGE scores
scores = compute_rouge(prediction, reference)
print(f"ROUGE-1: {scores['rouge1']}")
print(f"ROUGE-2: {scores['rouge2']}")
print(f"ROUGE-L: {scores['rougeL']}")
Metric Configuration in YAML¶
DeepEval GEval Metric¶
evaluations:
model:
provider: ollama
name: llama3.2:latest
temperature: 0.0
metrics:
- type: geval
name: "Coherence"
criteria: "Evaluate whether the response is clear and well-structured."
evaluation_steps:
- "Check if the response uses clear language."
- "Assess if the explanation is easy to follow."
evaluation_params:
- actual_output
- input
threshold: 0.7
enabled: true
fail_on_error: false
DeepEval RAG Metrics¶
evaluations:
model:
provider: ollama
name: llama3.2:latest
temperature: 0.0
metrics:
# Faithfulness - hallucination detection
- type: rag
metric_type: faithfulness
threshold: 0.8
include_reason: true
# Answer Relevancy
- type: rag
metric_type: answer_relevancy
threshold: 0.7
# Contextual Relevancy
- type: rag
metric_type: contextual_relevancy
threshold: 0.75
# Contextual Precision
- type: rag
metric_type: contextual_precision
threshold: 0.8
# Contextual Recall
- type: rag
metric_type: contextual_recall
threshold: 0.7
NLP Metrics¶
evaluations:
metrics:
- type: standard
metric: f1_score
threshold: 0.8
- type: standard
metric: bleu
threshold: 0.6
- type: standard
metric: rouge
threshold: 0.7
- type: standard
metric: meteor
threshold: 0.65
Per-Metric Model Override¶
evaluations:
model:
provider: ollama
name: llama3.2:latest # Default: free, local
metrics:
- type: rag
metric_type: faithfulness
threshold: 0.9
model: # Override for critical metric
provider: openai
name: gpt-4
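The equivalent override in Python is to hand the critical evaluator its own model_config while the other evaluators keep the default. This sketch mirrors the imports used in the usage examples above; the LLMProvider.OPENAI member is assumed from the provider: openai value in the YAML, and an OpenAI API key is required.
from holodeck.lib.evaluators.deepeval import FaithfulnessEvaluator, DeepEvalModelConfig
from holodeck.models.llm import LLMProvider

# Stronger judge for the critical metric only; other metrics keep the default Ollama model
openai_judge = DeepEvalModelConfig(
    provider=LLMProvider.OPENAI,  # member name assumed
    model_name="gpt-4",
    api_key="sk-..."  # or set OPENAI_API_KEY
)

faithfulness = FaithfulnessEvaluator(model_config=openai_judge, threshold=0.9)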
Legacy AI Metrics (Deprecated)¶
DEPRECATED: Azure AI-based metrics are deprecated and will be removed in a future version. Migrate to DeepEval metrics for better flexibility and local model support.
Migration Guide¶
| Legacy Metric | Recommended Replacement |
|---|---|
| `groundedness` | `type: rag`, `metric_type: faithfulness` |
| `relevance` | `type: rag`, `metric_type: answer_relevancy` |
| `coherence` | `type: geval` with custom criteria |
| `safety` | `type: geval` with custom criteria |
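As a code-level sketch of the first row of this table, the deprecated groundedness check maps onto FaithfulnessEvaluator, with the grounding text supplied as retrieval_context (parameter names as documented above):
from holodeck.lib.evaluators.deepeval import FaithfulnessEvaluator

# Replaces AzureAIEvaluator.evaluate_groundedness(response=..., context=...)
evaluator = FaithfulnessEvaluator(threshold=0.8)
result = await evaluator.evaluate(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France",
    retrieval_context=["France's capital city is known for the Eiffel Tower"],
)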
Legacy Usage (Not Recommended)¶
# DEPRECATED - Use DeepEval evaluators instead
from holodeck.lib.evaluators.azure_ai import AzureAIEvaluator
evaluator = AzureAIEvaluator(model="gpt-4", api_key="your-key")
result = await evaluator.evaluate_groundedness(
response="Paris is the capital of France",
context="France's capital city is known for the Eiffel Tower",
)
Legacy YAML Configuration (Not Recommended)¶
# DEPRECATED - Use type: geval or type: rag instead
evaluations:
metrics:
- type: standard
metric: groundedness # Deprecated
threshold: 0.8
- type: standard
metric: relevance # Deprecated
threshold: 0.75
- type: standard
metric: coherence # Deprecated
threshold: 0.7
- type: standard
metric: safety # Deprecated
threshold: 0.9
Integration with Test Runner¶
The test runner automatically:
- Loads evaluation configuration from agent YAML
- Creates appropriate evaluators based on metric type
- Invokes evaluators on test outputs
- Extracts retrieval_context from tool results (for RAG metrics)
- Collects metric scores
- Compares against thresholds
- Includes results in test report
Test Runner Evaluator Creation¶
# Internal test runner logic (simplified)
def _create_evaluators(self, metrics: list[MetricType]) -> dict:
evaluators = {}
for metric in metrics:
if metric.type == "geval":
evaluators[metric.name] = GEvalEvaluator(
name=metric.name,
criteria=metric.criteria,
evaluation_steps=metric.evaluation_steps,
evaluation_params=metric.evaluation_params,
model_config=self._get_model_config(metric),
threshold=metric.threshold,
strict_mode=metric.strict_mode
)
elif metric.type == "rag":
evaluator_class = RAG_EVALUATOR_MAP[metric.metric_type]
evaluators[metric.metric_type] = evaluator_class(
model_config=self._get_model_config(metric),
threshold=metric.threshold,
include_reason=metric.include_reason
)
elif metric.type == "standard":
# NLP or legacy metrics
evaluators[metric.metric] = self._create_standard_evaluator(metric)
return evaluators
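After creation, the runner invokes each evaluator against the test output, passing retrieval_context extracted from tool results for RAG metrics. The sketch below is illustrative only; the method and attribute names are not the runner's actual API.
# Illustrative invocation loop (method and attribute names hypothetical)
async def _run_evaluators(self, evaluators: dict, output) -> dict:
    results = {}
    for name, evaluator in evaluators.items():
        results[name] = await evaluator.evaluate(
            input=output.query,
            actual_output=output.response,
            expected_output=output.ground_truth,
            retrieval_context=output.retrieval_context,  # chunks taken from tool results
        )
    return results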
Error Handling¶
DeepEval Errors¶
from holodeck.lib.evaluators.deepeval.errors import (
DeepEvalError,
ProviderNotSupportedError
)
try:
result = await evaluator.evaluate(actual_output="...")
except DeepEvalError as e:
print(f"Evaluation failed: {e.message}")
print(f"Metric: {e.metric_name}")
print(f"Test case: {e.test_case_summary}")
except ProviderNotSupportedError as e:
print(f"Provider not supported: {e}")
Soft vs Hard Failures¶
metrics:
# Soft failure - continues on error
- type: geval
name: "Quality"
criteria: "..."
fail_on_error: false # Default
# Hard failure - stops test on error
- type: rag
metric_type: faithfulness
fail_on_error: true
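Conceptually, fail_on_error decides whether an evaluator exception aborts the test case or is recorded as a skipped metric; the sketch below is illustrative and not the runner's actual implementation.
# Illustrative only -- how fail_on_error is conceptually applied
try:
    result = await evaluator.evaluate(**eval_kwargs)
except Exception as exc:
    if metric.fail_on_error:
        raise  # hard failure: stop the test
    result = {"score": None, "error": str(exc)}  # soft failure: record and continue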
Related Documentation¶
- Evaluations Guide: Configuration and usage guide
- Test Runner: Test execution framework
- Data Models: EvaluationConfig and MetricConfig models
- Configuration Loading: Loading evaluation configs