Test Execution Framework API¶
The test runner orchestrates the complete test execution pipeline for HoloDeck agents, from configuration resolution through agent invocation, evaluation, and result reporting.
The framework follows a sequential flow:
- Load agent configuration from YAML
- Resolve execution configuration (CLI > YAML > env > defaults)
- Initialize components (FileProcessor, AgentFactory/Backend, Evaluators)
- Execute each test case (file processing, agent invocation, tool validation, evaluation)
- Generate a TestReport with summary statistics
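The resolution step can be sketched as a first-non-None precedence chain. This is a minimal illustration of the CLI > YAML > env > defaults ordering; `resolve_option` is a hypothetical helper, not part of the HoloDeck API:

```python
def resolve_option(cli_value, yaml_value, env_value, default):
    """Return the first explicitly-set value: CLI > YAML > env > default."""
    for value in (cli_value, yaml_value, env_value):
        if value is not None:
            return value
    return default

# A CLI flag wins over every other source
timeout = resolve_option(30, 60, 120, 300)   # -> 30
# With no CLI or YAML value, the environment value applies
retries = resolve_option(None, None, 5, 3)   # -> 5
```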
Executor¶
The executor module coordinates all stages of test execution. It owns configuration resolution, evaluator creation, the agent invocation dispatch (backend or legacy factory), and report generation.
TestExecutor¶
TestExecutor(agent_config_path, execution_config=None, file_processor=None, agent_factory=None, evaluators=None, config_loader=None, progress_callback=None, on_test_start=None, force_ingest=False, agent_config=None, resolved_execution_config=None, backend=None, allow_side_effects=False)¶
Executor for running agent test cases.
Orchestrates the complete test execution flow:
1. Loads agent configuration from YAML file
2. Resolves execution configuration (CLI > YAML > env > defaults)
3. Initializes components (FileProcessor, AgentFactory, Evaluators)
4. Executes test cases sequentially
5. Generates test report with results and summary
Attributes:
| Name | Type | Description |
|---|---|---|
| `agent_config_path` | | Path to agent configuration YAML file |
| `cli_config` | | Execution config from CLI flags (optional) |
| `agent_config` | | Loaded agent configuration |
| `config` | | Resolved execution configuration |
| `file_processor` | | FileProcessor instance |
| `agent_factory` | `AgentFactory \| None` | AgentFactory instance |
| `evaluators` | | Dictionary of evaluator instances by metric name |
| `config_loader` | | ConfigLoader instance |
| `progress_callback` | | Optional callback function for progress reporting |
Initialize test executor with optional dependency injection.
Follows dependency injection pattern for testability. Dependencies can be:
- Injected explicitly (for testing with mocks)
- Created automatically using factory methods (for normal usage)
When backend is provided, the executor uses the provider-agnostic
AgentBackend.invoke_once() path and skips AgentFactory creation.
When neither backend nor agent_factory is provided, the executor
can auto-select a backend via BackendSelector at execution time.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent_config_path` | `str` | Path to agent configuration file | *required* |
| `execution_config` | `ExecutionConfig \| None` | Optional execution config from CLI flags | `None` |
| `file_processor` | `FileProcessor \| None` | Optional FileProcessor instance (auto-created if None) | `None` |
| `agent_factory` | `AgentFactory \| None` | Optional AgentFactory instance (auto-created if None) | `None` |
| `evaluators` | `dict[str, BaseEvaluator] \| None` | Optional dict of evaluator instances (auto-created if None) | `None` |
| `config_loader` | `ConfigLoader \| None` | Optional ConfigLoader instance (auto-created if None) | `None` |
| `progress_callback` | `Callable[[TestResult], None] \| None` | Optional callback function called after each test. Called with TestResult instance. Use for progress display. | `None` |
| `force_ingest` | `bool` | Force re-ingestion of vector store source files. | `False` |
| `agent_config` | `Agent \| None` | Optional pre-loaded Agent config (auto-loaded if None) | `None` |
| `resolved_execution_config` | `ExecutionConfig \| None` | Optional pre-resolved execution config (auto-resolved if None) | `None` |
| `backend` | `AgentBackend \| None` | Optional AgentBackend instance. When provided, the executor uses invoke_once() instead of AgentFactory. | `None` |
| `allow_side_effects` | `bool` | Allow bash/file_system.write in test mode (passed to BackendSelector when auto-selecting). | `False` |
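The backend/factory dispatch described above can be sketched as a simple priority check. This is illustrative pseudologic under stated assumptions, not the actual executor source:

```python
def select_invocation_path(backend, agent_factory):
    """Sketch of the documented dispatch priority for agent invocation."""
    if backend is not None:
        return "backend"   # provider-agnostic AgentBackend.invoke_once() path
    if agent_factory is not None:
        return "factory"   # legacy AgentFactory path
    return "auto"          # BackendSelector auto-selects at execution time

# An explicit backend wins even when a factory is also supplied
print(select_invocation_path("backend", "factory"))  # -> backend
```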
Source code in `src/holodeck/lib/test_runner/executor.py`
execute_tests() async¶
Execute all test cases and generate report.
Returns:
| Type | Description |
|---|---|
| `TestReport` | TestReport with all results and summary statistics |
Source code in `src/holodeck/lib/test_runner/executor.py`
shutdown() async¶
Shutdown executor and cleanup resources.
Must be called from the same task context where the executor was used to properly cleanup MCP plugins and other async resources.
Source code in `src/holodeck/lib/test_runner/executor.py`
validate_tool_calls¶
Standalone helper that checks actual tool calls against expected tool names using
substring matching. Returns True, False, or None (when validation is skipped).
validate_tool_calls(actual, expected)¶
Validate actual tool calls against expected tools.
Tool call validation checks that each expected tool name is found within at least one actual tool call. This uses substring matching: if any actual tool name contains the expected tool name, it is considered a match.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `actual` | `list[str]` | List of tool names actually called by agent | *required* |
| `expected` | `list[str] \| None` | List of expected tool names from test case (None = skip validation) | *required* |

Returns:
| Type | Description |
|---|---|
| `bool \| None` | `True` if all expected tools are found (substring match) in actual |
| `bool \| None` | `False` if any expected tool is not found in any actual tool |
| `bool \| None` | `None` if expected is None (validation skipped) |
Examples:
- expected=["search"], actual=["vectorstore-search"] -> True
- expected=["search", "fetch"], actual=["search_tool", "fetch_data"] -> True
- expected=["search"], actual=["fetch"] -> False
Source code in `src/holodeck/lib/test_runner/executor.py`
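The substring-matching rule above can be reproduced in a few lines. This is a behavioral sketch that follows the documented semantics, not the library source:

```python
def validate_tool_calls_sketch(actual, expected):
    """Return True/False for substring-match validation, or None to skip."""
    if expected is None:
        return None  # validation skipped
    # Every expected name must appear as a substring of at least one actual call
    return all(any(exp in act for act in actual) for exp in expected)

print(validate_tool_calls_sketch(["vectorstore-search"], ["search"]))  # True
print(validate_tool_calls_sketch(["fetch"], ["search"]))               # False
print(validate_tool_calls_sketch(["anything"], None))                  # None
```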
RAGEvaluatorConstructor¶
Protocol that defines the common constructor signature shared by all RAG evaluator
classes (FaithfulnessEvaluator, ContextualRelevancyEvaluator, etc.). Used as the
value type in RAG_EVALUATOR_MAP.
RAGEvaluatorConstructor¶
Bases: Protocol
Protocol for RAG evaluator constructors with full type safety.
Defines the common constructor signature for all RAG evaluators. The actual evaluators may have additional parameters with defaults (timeout, retry_config) but this Protocol captures what we use.
RAG_EVALUATOR_MAP¶
Module-level dictionary mapping RAGMetricType enum members to their evaluator
constructor. Eliminates repetitive if/elif chains when creating RAG evaluators.
RAG_EVALUATOR_MAP = {RAGMetricType.FAITHFULNESS: FaithfulnessEvaluator, RAGMetricType.CONTEXTUAL_RELEVANCY: ContextualRelevancyEvaluator, RAGMetricType.CONTEXTUAL_PRECISION: ContextualPrecisionEvaluator, RAGMetricType.CONTEXTUAL_RECALL: ContextualRecallEvaluator, RAGMetricType.ANSWER_RELEVANCY: AnswerRelevancyEvaluator}
module-attribute¶
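The same dispatch-map pattern can be shown with stand-in stubs. The enum values and evaluator classes below are simplified placeholders for the real HoloDeck types, kept only to make the lookup runnable:

```python
from enum import Enum

class RAGMetricType(Enum):           # stand-in for holodeck's enum
    FAITHFULNESS = "faithfulness"
    ANSWER_RELEVANCY = "answer_relevancy"

class FaithfulnessEvaluator: ...     # stand-in evaluator classes
class AnswerRelevancyEvaluator: ...

RAG_EVALUATOR_MAP = {
    RAGMetricType.FAITHFULNESS: FaithfulnessEvaluator,
    RAGMetricType.ANSWER_RELEVANCY: AnswerRelevancyEvaluator,
}

# One dict lookup replaces an if/elif chain over metric types
evaluator_cls = RAG_EVALUATOR_MAP[RAGMetricType.FAITHFULNESS]
evaluator = evaluator_cls()
print(type(evaluator).__name__)  # FaithfulnessEvaluator
```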
Agent Factory¶
The agent factory module provides Semantic Kernel-based agent creation, invocation with timeout/retry logic, and response/tool-call extraction.
AgentFactory¶
AgentFactory(agent_config, max_retries=DEFAULT_MAX_RETRIES, retry_delay=DEFAULT_RETRY_DELAY_SECONDS, retry_exponential_base=DEFAULT_RETRY_EXPONENTIAL_BASE, force_ingest=False, execution_config=None)¶
Factory for creating and executing agents using Semantic Kernel.
Handles Kernel creation, agent invocation, response extraction, and tool call handling with support for multiple LLM providers.
Initialize agent factory with Semantic Kernel.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent_config` | `Agent` | Agent configuration with model and instructions | *required* |
| `max_retries` | `int` | Maximum number of retry attempts for transient failures | `DEFAULT_MAX_RETRIES` |
| `retry_delay` | `float` | Base delay in seconds for exponential backoff | `DEFAULT_RETRY_DELAY_SECONDS` |
| `retry_exponential_base` | `float` | Exponential base for backoff calculation | `DEFAULT_RETRY_EXPONENTIAL_BASE` |
| `force_ingest` | `bool` | Force re-ingestion of vector store source files | `False` |
| `execution_config` | `ExecutionConfig \| None` | Execution configuration for timeouts and file processing | `None` |

Raises:
| Type | Description |
|---|---|
| `AgentFactoryError` | If kernel initialization fails |
Source code in `src/holodeck/lib/test_runner/agent_factory.py`
create_thread_run() async¶
Create a new isolated agent thread run.
Each thread run has its own ChatHistory, suitable for:
- Individual test case execution
- Isolated conversation sessions

This method ensures tools are initialized before creating the run.
Returns:
| Type | Description |
|---|---|
| `AgentThreadRun` | A new AgentThreadRun instance with fresh chat history. |
Source code in `src/holodeck/lib/test_runner/agent_factory.py`
shutdown() async¶
Shutdown all MCP plugins and release resources.
Must be called from the same task context where the factory was used. Properly exits all MCP plugin async context managers to avoid 'Attempted to exit cancel scope in a different task' errors.
Source code in `src/holodeck/lib/test_runner/agent_factory.py`
AgentThreadRun¶
Encapsulates a single agent execution thread with an isolated ChatHistory.
Created by AgentFactory.create_thread_run() to ensure test-case isolation.
AgentThreadRun(agent, kernel, kernel_arguments, timeout=None, max_retries=DEFAULT_MAX_RETRIES, retry_delay=DEFAULT_RETRY_DELAY_SECONDS, retry_exponential_base=DEFAULT_RETRY_EXPONENTIAL_BASE, observability_enabled=False, tool_filter_manager=None)¶
Encapsulates a single agent execution thread with isolated chat history.
Each instance maintains its own ChatHistory, ensuring test case isolation. Created by AgentFactory.create_thread_run().
This class owns the invocation logic and response extraction methods, providing complete isolation between different test cases or chat sessions.
Initialize an agent thread run with isolated chat history.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent` | `Agent` | Semantic Kernel agent instance. | *required* |
| `kernel` | `Kernel` | Configured Kernel instance. | *required* |
| `kernel_arguments` | `KernelArguments` | KernelArguments for agent invocation. | *required* |
| `timeout` | `float \| None` | Timeout in seconds for agent invocation. | `None` |
| `max_retries` | `int` | Maximum retry attempts for transient failures. | `DEFAULT_MAX_RETRIES` |
| `retry_delay` | `float` | Base delay in seconds for exponential backoff. | `DEFAULT_RETRY_DELAY_SECONDS` |
| `retry_exponential_base` | `float` | Exponential base for backoff calculation. | `DEFAULT_RETRY_EXPONENTIAL_BASE` |
| `observability_enabled` | `bool` | Whether OTel tracing is enabled. | `False` |
| `tool_filter_manager` | `ToolFilterManager \| None` | Optional manager for filtering tools per request. | `None` |
Source code in `src/holodeck/lib/test_runner/agent_factory.py`
invoke(user_input) async¶
Invoke agent with user input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `user_input` | `str` | User's input message. | *required* |

Returns:
| Type | Description |
|---|---|
| `AgentExecutionResult` | AgentExecutionResult with tool_calls and complete chat_history. |

Raises:
| Type | Description |
|---|---|
| `AgentFactoryError` | If invocation fails after retries. |
Source code in `src/holodeck/lib/test_runner/agent_factory.py`
AgentExecutionResult¶
Dataclass returned by AgentThreadRun.invoke() containing tool calls, tool results,
the full conversation history, optional token usage, and the extracted response text.
AgentExecutionResult(tool_calls, tool_results, chat_history, token_usage=None, response='') dataclass¶
Result of agent execution containing tool calls and conversation history.
Attributes:
| Name | Type | Description |
|---|---|---|
| `tool_calls` | `list[dict[str, Any]]` | List of tool calls made by the agent during execution. Each dict contains 'name' and 'arguments' keys. |
| `tool_results` | `list[dict[str, Any]]` | List of tool execution results for retrieval context. Each dict contains 'name' (tool name) and 'result' (execution output). |
| `chat_history` | `ChatHistory` | Complete conversation history including user inputs and agent responses |
| `token_usage` | `TokenUsage \| None` | Token usage metadata if provided by LLM provider |
Reporter¶
Generates comprehensive Markdown reports from TestReport objects, including summary
tables, per-test sections, metric details, tool-usage validation, and file metadata.
generate_markdown_report¶
generate_markdown_report(report)¶
Generate a comprehensive markdown report from test results.
Creates a formatted markdown document containing:
- Report header with agent name and metadata
- Summary statistics table
- Detailed test result sections with all fields

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `report` | `TestReport` | The TestReport containing all test results and summary data. | *required* |

Returns:
| Type | Description |
|---|---|
| `str` | A formatted markdown string ready for display or file output. |
Source code in `src/holodeck/lib/test_runner/reporter.py`
Progress¶
Real-time progress display with TTY detection. Interactive terminals get colored symbols and spinners; CI/CD environments get plain-text output compatible with log aggregation systems.
ProgressIndicator¶
ProgressIndicator(total_tests, quiet=False, verbose=False)¶
Bases: SpinnerMixin
Displays progress during test execution with TTY-aware formatting.
Detects whether stdout is a terminal (TTY) and adjusts output accordingly:
- TTY (interactive): Colored symbols, spinners, ANSI formatting
- Non-TTY (CI/CD): Plain text, compatible with log aggregation systems

Inherits spinner animation from SpinnerMixin.
Attributes:
| Name | Type | Description |
|---|---|---|
| `total_tests` | | Total number of tests to execute |
| `current_test` | | Number of tests completed so far |
| `passed` | | Number of tests that passed |
| `failed` | | Number of tests that failed |
| `quiet` | | Suppress progress output (only show summary) |
| `verbose` | | Show detailed output including timing |

Initialize progress indicator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `total_tests` | `int` | Total number of tests to execute | *required* |
| `quiet` | `bool` | If True, suppress progress output (only show summary) | `False` |
| `verbose` | `bool` | If True, show detailed output with timing information | `False` |
Source code in `src/holodeck/lib/test_runner/progress.py`
get_progress_line()¶
Get current progress display line.
Returns:
| Type | Description |
|---|---|
| `str` | Progress string showing current test count and status |
| `str` | Empty string if quiet mode is enabled |
Source code in `src/holodeck/lib/test_runner/progress.py`
get_spinner_char()¶
Get current spinner character and advance rotation.
Returns:
| Type | Description |
|---|---|
| `str` | Current spinner character from the braille sequence. |
Source code in `src/holodeck/lib/ui/spinner.py`
get_spinner_line()¶
Get current spinner line for running test.
Returns:
| Type | Description |
|---|---|
| `str` | Formatted spinner string (e.g. "⠋ Test 1/5: Running...") |
Source code in `src/holodeck/lib/test_runner/progress.py`
get_summary()¶
Get summary statistics for all completed tests.
Returns:
| Type | Description |
|---|---|
| `str` | Formatted summary string with pass/fail counts and rate |
Source code in `src/holodeck/lib/test_runner/progress.py`
start_test(test_name)¶
Mark a test as started.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `test_name` | `str` | Name of the test starting | *required* |
Source code in `src/holodeck/lib/test_runner/progress.py`
update(result)¶
Update progress with a completed test result.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `result` | `TestResult` | TestResult instance from a completed test | *required* |
Source code in `src/holodeck/lib/test_runner/progress.py`
Eval Kwargs Builder¶
Type-safe construction of evaluation keyword arguments based on each evaluator's
ParamSpec. Handles the parameter-name divergence between evaluator families
(Azure AI/NLP use response/query; DeepEval uses actual_output/input).
EvalKwargsBuilder¶
EvalKwargsBuilder(agent_response, input_query=None, ground_truth=None, file_content=None, retrieval_context=None)¶
Builder for evaluation kwargs based on evaluator specifications.
Constructs eval_kwargs dictionaries based on:
1. Evaluator's ParamSpec (required/optional parameters)
2. Available data (test case inputs, file content, tool results)
3. Evaluator type (DeepEval vs Azure AI/NLP param names)

Example:
builder = EvalKwargsBuilder(
    input_query="What is X?",
    agent_response="X is...",
    ground_truth="X is the answer",
    file_content="Context from files...",
    retrieval_context=["chunk1", "chunk2"],
)
kwargs = builder.build_for(evaluator)
result = await evaluator.evaluate(**kwargs)
Initialize the kwargs builder.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `agent_response` | `str` | Agent's response text (always required). | *required* |
| `input_query` | `str \| None` | User's input query. | `None` |
| `ground_truth` | `str \| None` | Expected ground truth answer. | `None` |
| `file_content` | `str \| None` | Combined content from processed files. | `None` |
| `retrieval_context` | `list[str] \| None` | List of retrieved text chunks for RAG metrics. | `None` |
Source code in `src/holodeck/lib/test_runner/eval_kwargs_builder.py`
build_for(evaluator)¶
Build eval_kwargs for a specific evaluator.
The method:
1. Gets the evaluator's PARAM_SPEC
2. Determines whether it uses DeepEval param names (input/actual_output) or standard names (query/response)
3. Builds kwargs with the appropriate keys

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `evaluator` | `BaseEvaluator` | The evaluator instance to build kwargs for. | *required* |

Returns:
| Type | Description |
|---|---|
| `dict[str, Any]` | Dictionary of kwargs ready for evaluator.evaluate(). |
Source code in `src/holodeck/lib/test_runner/eval_kwargs_builder.py`
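The param-name divergence that build_for handles can be illustrated with a minimal key-renaming sketch. This only conveys the idea; the real builder consults each evaluator's PARAM_SPEC rather than a hard-coded map:

```python
# Standard (Azure AI / NLP) names -> DeepEval names, per the divergence above
DEEPEVAL_NAME_MAP = {"response": "actual_output", "query": "input"}

def build_kwargs_sketch(data, use_deepeval_names):
    """Rename keys for DeepEval-style evaluators; pass through otherwise."""
    if not use_deepeval_names:
        return dict(data)
    return {DEEPEVAL_NAME_MAP.get(key, key): value for key, value in data.items()}

kwargs = build_kwargs_sketch({"response": "X is...", "query": "What is X?"}, True)
print(kwargs)  # {'actual_output': 'X is...', 'input': 'What is X?'}
```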
build_retrieval_context_from_tools¶
Extracts retrieval context strings from tool results, filtering to only those tools marked as retrieval tools.
build_retrieval_context_from_tools(tool_results, retrieval_tool_names)¶
Extract retrieval context from tool results.
Only includes results from tools marked as retrieval tools.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `tool_results` | `list[dict[str, Any]]` | List of tool result dicts with 'name' and 'result' keys. The 'result' value can be a string, list of strings, or other types. | *required* |
| `retrieval_tool_names` | `set[str]` | Set of tool names that provide retrieval context. | *required* |

Returns:
| Type | Description |
|---|---|
| `list[str] \| None` | List of retrieval context strings, or None if none found. |
Source code in `src/holodeck/lib/test_runner/eval_kwargs_builder.py`
Example Usage¶
from holodeck.lib.test_runner.executor import TestExecutor
from holodeck.lib.test_runner.reporter import generate_markdown_report
from holodeck.lib.test_runner.progress import ProgressIndicator
from holodeck.config.loader import ConfigLoader
# Load agent configuration
loader = ConfigLoader()
agent = loader.load_agent_yaml("agent.yaml")
# Set up progress tracking
progress = ProgressIndicator(total_tests=len(agent.test_cases or []))
# Create executor with progress callback
executor = TestExecutor(
    agent_config_path="agent.yaml",
    progress_callback=progress.update,
    on_test_start=lambda tc: progress.start_test(tc.name or "unnamed"),
)
# Run all test cases
report = await executor.execute_tests()
# Display summary
print(progress.get_summary())
# Generate markdown report
markdown = generate_markdown_report(report)
with open("report.md", "w") as f:
    f.write(markdown)
Using EvalKwargsBuilder directly¶
from holodeck.lib.test_runner.eval_kwargs_builder import (
EvalKwargsBuilder,
build_retrieval_context_from_tools,
)
# Build retrieval context from tool results
retrieval_ctx = build_retrieval_context_from_tools(
    tool_results=[
        {"name": "search_kb", "result": "Refund policy allows 30-day returns."},
        {"name": "get_user", "result": "User: Alice"},
    ],
    retrieval_tool_names={"search_kb"},
)
# Build kwargs for an evaluator
builder = EvalKwargsBuilder(
    agent_response="We offer 30-day returns.",
    input_query="What is your refund policy?",
    ground_truth="30-day money-back guarantee on all products.",
    retrieval_context=retrieval_ctx,
)
kwargs = builder.build_for(evaluator)
result = await evaluator.evaluate(**kwargs)
Related Documentation¶
- Data Models -- Test case and result Pydantic models
- Evaluation Framework -- Metrics, evaluators, and `ParamSpec`
- Configuration Loading -- `ConfigLoader` and resolution hierarchy
- Backend Abstraction -- `AgentBackend`, `BackendSelector`, and `ExecutionResult`