Skip to content

Tools

The holodeck.tools package provides tool implementations that extend agent capabilities with semantic search, hierarchical document retrieval, and MCP server integration.

Module Overview

Module Description
holodeck.tools Package exports for tools, mixins, and utilities
holodeck.tools.base_tool Mixin classes for embedding and database configuration
holodeck.tools.common Shared constants and pure utility functions
holodeck.tools.vectorstore_tool Semantic search over unstructured and structured data
holodeck.tools.hierarchical_document_tool Structure-aware document retrieval with hybrid search

holodeck.tools.base_tool

Mixin classes that encapsulate common functionality shared between VectorStoreTool and HierarchicalDocumentTool. Mixins are used instead of inheritance because the tools have fundamentally different record types and core behaviors.

EmbeddingServiceMixin

EmbeddingServiceMixin

Mixin for embedding service injection.

Provides the set_embedding_service method used by AgentFactory to inject a Semantic Kernel TextEmbedding service for generating real embeddings.

Required instance attributes (set by subclass init): _embedding_service: Any - stores the injected service config: Any - tool configuration with a .name attribute

set_embedding_service(service)

Set the embedding service for generating embeddings.

This method allows AgentFactory to inject a Semantic Kernel TextEmbedding service for generating real embeddings instead of placeholder zeros.

Parameters:

Name Type Description Default
service Any

Semantic Kernel TextEmbedding service instance (OpenAITextEmbedding or AzureTextEmbedding).

required
Source code in src/holodeck/tools/base_tool.py
50
51
52
53
54
55
56
57
58
59
60
61
62
def set_embedding_service(self, service: Any) -> None:
    """Set the embedding service for generating embeddings.

    This method allows AgentFactory to inject a Semantic Kernel TextEmbedding
    service for generating real embeddings instead of placeholder zeros.

    Args:
        service: Semantic Kernel TextEmbedding service instance
            (OpenAITextEmbedding or AzureTextEmbedding).
    """
    self._embedding_service = service
    tool_name = getattr(getattr(self, "config", None), "name", "unknown")
    logger.debug(f"Embedding service set for tool: {tool_name}")

DatabaseConfigMixin

DatabaseConfigMixin

Mixin for database configuration resolution and collection creation.

Provides methods for resolving database configuration from various formats (None, string reference, DatabaseConfig object) and creating vector store collections with automatic fallback to in-memory storage.

Required instance attributes (set by subclass init): config: Any - tool configuration with .name and .database attributes _provider: str - stores the resolved provider name _collection: Any - stores the created collection instance _embedding_dimensions: int | None - embedding dimensions

_resolve_database_config(database)

Resolve database configuration to provider and connection kwargs.

Handles three types of database configuration: 1. None - use in-memory storage 2. String reference - unresolved reference, warn and use in-memory 3. DatabaseConfig object - extract provider and connection parameters

Parameters:

Name Type Description Default
database DatabaseConfig | str | None

Database configuration (from tool config)

required

Returns:

Type Description
tuple[str, dict[str, Any]]

Tuple of (provider_name, connection_kwargs)

Example

provider, kwargs = self._resolve_database_config(None) provider 'in-memory' kwargs {}

Source code in src/holodeck/tools/base_tool.py
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
def _resolve_database_config(
    self, database: DatabaseConfig | str | None
) -> tuple[str, dict[str, Any]]:
    """Resolve database configuration to provider and connection kwargs.

    Handles three types of database configuration:
    1. None - use in-memory storage
    2. String reference - unresolved reference, warn and use in-memory
    3. DatabaseConfig object - extract provider and connection parameters

    Args:
        database: Database configuration (from tool config)

    Returns:
        Tuple of (provider_name, connection_kwargs)

    Example:
        >>> provider, kwargs = self._resolve_database_config(None)
        >>> provider
        'in-memory'
        >>> kwargs
        {}
    """
    tool_name = getattr(getattr(self, "config", None), "name", "unknown")

    if isinstance(database, str):
        # Unresolved string reference - this shouldn't happen if merge_configs
        # was called, but fall back to in-memory with a warning
        logger.warning(
            f"Tool '{tool_name}' has unresolved database "
            f"reference '{database}'. Falling back to in-memory storage."
        )
        return "in-memory", {}

    if database is not None:
        # DatabaseConfig object - use its settings
        provider = database.provider
        connection_kwargs: dict[str, Any] = {}
        if database.connection_string:
            connection_kwargs["connection_string"] = (
                database.connection_string.get_secret_value()
            )
        # Add extra fields from DatabaseConfig (extra="allow")
        if hasattr(database, "model_extra"):
            extra_fields = database.model_extra or {}
            connection_kwargs.update(extra_fields)
        return provider, connection_kwargs

    # None - use in-memory
    return "in-memory", {}

_create_collection_with_fallback(provider, dimensions, connection_kwargs, record_class=None, definition=None)

Create a vector store collection with fallback to in-memory.

Attempts to create a collection with the specified provider. If creation fails (e.g., database unreachable), falls back to in-memory storage.

Parameters:

Name Type Description Default
provider str

Vector store provider name

required
dimensions int

Embedding dimensions for the collection

required
connection_kwargs dict[str, Any]

Provider-specific connection parameters

required
record_class type[Any] | None

Optional custom record class for the collection

None
definition Any | None

Optional VectorStoreCollectionDefinition for structured data

None

Returns:

Type Description
Any

Created collection instance

Raises:

Type Description
Exception

If in-memory provider also fails (shouldn't happen)

Source code in src/holodeck/tools/base_tool.py
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
def _create_collection_with_fallback(
    self,
    provider: str,
    dimensions: int,
    connection_kwargs: dict[str, Any],
    record_class: type[Any] | None = None,
    definition: Any | None = None,
) -> Any:
    """Create a vector store collection with fallback to in-memory.

    Attempts to create a collection with the specified provider. If creation
    fails (e.g., database unreachable), falls back to in-memory storage.

    Args:
        provider: Vector store provider name
        dimensions: Embedding dimensions for the collection
        connection_kwargs: Provider-specific connection parameters
        record_class: Optional custom record class for the collection
        definition: Optional VectorStoreCollectionDefinition for structured data

    Returns:
        Created collection instance

    Raises:
        Exception: If in-memory provider also fails (shouldn't happen)
    """
    from holodeck.lib.vector_store import get_collection_factory

    try:
        factory = get_collection_factory(
            provider=provider,
            dimensions=dimensions,
            record_class=record_class,
            definition=definition,
            **connection_kwargs,
        )
        collection = factory()
        logger.info(
            f"Vector store connected: provider={provider}, "
            f"dimensions={dimensions}"
        )
        self._provider = provider
        return collection

    except (ImportError, ConnectionError, Exception) as e:
        # Fall back to in-memory storage for non-in-memory providers
        if provider != "in-memory":
            logger.warning(
                f"Failed to connect to {provider}: {e}. "
                "Falling back to in-memory storage."
            )
            factory = get_collection_factory(
                provider="in-memory",
                dimensions=dimensions,
                record_class=record_class,
                definition=definition,
            )
            collection = factory()
            logger.info("Using in-memory vector storage (fallback)")
            self._provider = "in-memory"
            return collection
        else:
            # Don't catch errors for in-memory provider
            raise

holodeck.tools.common

Shared constants and pure utility functions used by VectorStoreTool, HierarchicalDocumentTool, and other tool implementations. Follows the DRY principle for file handling, path resolution, and embedding generation.

Constants

SUPPORTED_EXTENSIONS

SUPPORTED_EXTENSIONS = frozenset({'.txt', '.md', '.pdf', '.csv', '.json'}) module-attribute

FILE_TYPE_MAPPING

FILE_TYPE_MAPPING = {'.txt': 'text', '.md': 'text', '.pdf': 'pdf', '.csv': 'csv', '.json': 'text'} module-attribute

Functions

get_file_type

get_file_type(path)

Get FileInput type from file extension.

Maps file extensions to the appropriate type value for FileProcessor. Defaults to "text" for unknown extensions.

Parameters:

Name Type Description Default
path str | Path

File path (string or Path object)

required

Returns:

Type Description
str

FileInput type string ("text", "pdf", "csv", etc.)

Example

get_file_type("document.pdf") 'pdf' get_file_type(Path("data/file.csv")) 'csv' get_file_type("unknown.xyz") 'text'

Source code in src/holodeck/tools/common.py
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
def get_file_type(path: str | Path) -> str:
    """Get FileInput type from file extension.

    Maps file extensions to the appropriate type value for FileProcessor.
    Defaults to "text" for unknown extensions.

    Args:
        path: File path (string or Path object)

    Returns:
        FileInput type string ("text", "pdf", "csv", etc.)

    Example:
        >>> get_file_type("document.pdf")
        'pdf'
        >>> get_file_type(Path("data/file.csv"))
        'csv'
        >>> get_file_type("unknown.xyz")
        'text'
    """
    if isinstance(path, str):
        path = Path(path)
    extension = path.suffix.lower()
    return FILE_TYPE_MAPPING.get(extension, "text")

resolve_source_path

resolve_source_path(source, base_dir=None)

Resolve a source path relative to a base directory.

This function handles path resolution in priority order: 1. If source is absolute, return as-is 2. If base_dir is provided, resolve relative to base_dir 3. Try agent_base_dir context variable 4. Fall back to current working directory

Parameters:

Name Type Description Default
source str

Source path to resolve (from tool config)

required
base_dir str | None

Optional base directory for relative path resolution

None

Returns:

Type Description
Path

Resolved absolute Path to the source

Example

resolve_source_path("/absolute/path/file.txt") PosixPath('/absolute/path/file.txt') resolve_source_path("relative/file.txt", "/base") PosixPath('/base/relative/file.txt')

Source code in src/holodeck/tools/common.py
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
def resolve_source_path(source: str, base_dir: str | None = None) -> Path:
    """Resolve a source path relative to a base directory.

    This function handles path resolution in priority order:
    1. If source is absolute, return as-is
    2. If base_dir is provided, resolve relative to base_dir
    3. Try agent_base_dir context variable
    4. Fall back to current working directory

    Args:
        source: Source path to resolve (from tool config)
        base_dir: Optional base directory for relative path resolution

    Returns:
        Resolved absolute Path to the source

    Example:
        >>> resolve_source_path("/absolute/path/file.txt")
        PosixPath('/absolute/path/file.txt')
        >>> resolve_source_path("relative/file.txt", "/base")
        PosixPath('/base/relative/file.txt')
    """
    source_path = Path(source)

    # If path is absolute, use it directly
    if source_path.is_absolute():
        return source_path

    # Resolve relative to base directory
    # Priority: explicit base_dir > context var > cwd
    effective_base = base_dir
    if effective_base is None:
        # Try to get from context variable
        from holodeck.config.context import agent_base_dir

        effective_base = agent_base_dir.get()

    if effective_base:
        return (Path(effective_base) / source).resolve()

    return source_path.resolve()

discover_files

discover_files(source_path, extensions=None)

Discover files to ingest from a source path.

Recursively traverses directories and filters by supported extensions. For single files, validates the extension is supported.

Parameters:

Name Type Description Default
source_path Path

Resolved path (file or directory) to discover from

required
extensions frozenset[str] | None

Set of supported extensions (default: SUPPORTED_EXTENSIONS)

None

Returns:

Type Description
list[Path]

List of Path objects for files to process, sorted for deterministic order

Note

This function does not validate file existence - that should be checked before calling this function.

Example

discover_files(Path("/docs")) [PosixPath('/docs/file1.md'), PosixPath('/docs/subdir/file2.txt')]

Source code in src/holodeck/tools/common.py
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
def discover_files(
    source_path: Path,
    extensions: frozenset[str] | None = None,
) -> list[Path]:
    """Discover files to ingest from a source path.

    Recursively traverses directories and filters by supported extensions.
    For single files, validates the extension is supported.

    Args:
        source_path: Resolved path (file or directory) to discover from
        extensions: Set of supported extensions (default: SUPPORTED_EXTENSIONS)

    Returns:
        List of Path objects for files to process, sorted for deterministic order

    Note:
        This function does not validate file existence - that should be
        checked before calling this function.

    Example:
        >>> discover_files(Path("/docs"))
        [PosixPath('/docs/file1.md'), PosixPath('/docs/subdir/file2.txt')]
    """
    if extensions is None:
        extensions = SUPPORTED_EXTENSIONS

    if source_path.is_file():
        # Single file - check if supported
        if source_path.suffix.lower() in extensions:
            return [source_path]
        logger.warning(
            f"File {source_path} has unsupported extension "
            f"{source_path.suffix}. Supported: {extensions}"
        )
        return []

    if source_path.is_dir():
        # Directory - recursively find all supported files
        discovered: list[Path] = []
        for file_path in source_path.rglob("*"):
            if file_path.is_file():
                if file_path.suffix.lower() in extensions:
                    discovered.append(file_path)
                else:
                    logger.debug(
                        f"Skipping unsupported file: {file_path} "
                        f"(extension: {file_path.suffix})"
                    )
        # Sort for deterministic ordering
        return sorted(discovered)

    # Path doesn't exist - return empty list
    return []

generate_placeholder_embeddings

generate_placeholder_embeddings(count, dimensions=1536)

Generate placeholder embedding vectors for testing.

Creates zero-valued embedding vectors when no embedding service is available. Useful for testing and development without LLM API access.

Parameters:

Name Type Description Default
count int

Number of embeddings to generate

required
dimensions int

Embedding vector dimensions (default: 1536)

1536

Returns:

Type Description
list[list[float]]

List of zero-valued embedding vectors

Example

embeddings = generate_placeholder_embeddings(3, 768) len(embeddings) 3 len(embeddings[0]) 768

Source code in src/holodeck/tools/common.py
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
def generate_placeholder_embeddings(
    count: int,
    dimensions: int = 1536,
) -> list[list[float]]:
    """Generate placeholder embedding vectors for testing.

    Creates zero-valued embedding vectors when no embedding service is available.
    Useful for testing and development without LLM API access.

    Args:
        count: Number of embeddings to generate
        dimensions: Embedding vector dimensions (default: 1536)

    Returns:
        List of zero-valued embedding vectors

    Example:
        >>> embeddings = generate_placeholder_embeddings(3, 768)
        >>> len(embeddings)
        3
        >>> len(embeddings[0])
        768
    """
    logger.debug(f"Generated {count} placeholder embeddings, dim={dimensions}")
    return [[0.0] * dimensions for _ in range(count)]

holodeck.tools.vectorstore_tool

Provides semantic search over files and directories containing text data or structured data (CSV, JSON, JSONL files with field mapping). Supports automatic file discovery, text chunking, embedding generation, vector storage, and modification-time tracking for incremental ingestion.

VectorStoreTool

VectorStoreTool(config, base_dir=None, execution_config=None)

Bases: EmbeddingServiceMixin, DatabaseConfigMixin

Vectorstore tool for semantic search over unstructured data.

This tool enables agents to perform semantic search over documents by: 1. Discovering files from configured source (file or directory) 2. Converting files to markdown using FileProcessor 3. Chunking text for optimal embedding generation 4. Generating embeddings via Semantic Kernel services 5. Storing document chunks in a vector database 6. Performing similarity search on queries

Inherits from

EmbeddingServiceMixin: Provides set_embedding_service() method DatabaseConfigMixin: Provides database config resolution and collection creation

Attributes:

Name Type Description
config

Tool configuration from agent.yaml

is_initialized bool

Whether the tool has been initialized

document_count int

Number of document chunks stored

last_ingest_time datetime | None

Timestamp of last ingestion

Example

config = VectorstoreTool( ... name="knowledge_base", ... description="Search product docs", ... source="data/docs/" ... ) tool = VectorStoreTool(config) await tool.initialize() results = await tool.search("How do I authenticate?")

Initialize VectorStoreTool with configuration.

Parameters:

Name Type Description Default
config VectorstoreTool

VectorstoreTool configuration from agent.yaml containing: - name: Tool identifier - description: Tool description - source: File or directory path to ingest - embedding_model: Optional custom embedding model - database: Optional database configuration - top_k: Number of results to return (default: 5) - min_similarity_score: Minimum score threshold (optional) - chunk_size: Text chunk size in tokens (optional) - chunk_overlap: Chunk overlap in tokens (optional)

required
base_dir str | None

Base directory for resolving relative source paths. If None, source paths are resolved relative to current working directory.

None
execution_config ExecutionConfig | None

Execution configuration for file processing timeouts and caching. If None, default FileProcessor settings are used.

None
Source code in src/holodeck/tools/vectorstore_tool.py
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
def __init__(
    self,
    config: VectorstoreToolConfig,
    base_dir: str | None = None,
    execution_config: ExecutionConfig | None = None,
) -> None:
    """Initialize VectorStoreTool with configuration.

    Args:
        config: VectorstoreTool configuration from agent.yaml containing:
            - name: Tool identifier
            - description: Tool description
            - source: File or directory path to ingest
            - embedding_model: Optional custom embedding model
            - database: Optional database configuration
            - top_k: Number of results to return (default: 5)
            - min_similarity_score: Minimum score threshold (optional)
            - chunk_size: Text chunk size in tokens (optional)
            - chunk_overlap: Chunk overlap in tokens (optional)
        base_dir: Base directory for resolving relative source paths.
            If None, source paths are resolved relative to current
            working directory.
        execution_config: Execution configuration for file processing
            timeouts and caching. If None, default FileProcessor
            settings are used.
    """
    self.config = config
    self._base_dir = base_dir
    self._execution_config = execution_config

    # State tracking
    self.is_initialized: bool = False
    self.document_count: int = 0
    self.last_ingest_time: datetime | None = None

    # Embedding dimensions (resolved during initialization)
    self._embedding_dimensions: int | None = None

    # Initialize components (lazy initialization for some)
    chunk_size = config.chunk_size or TextChunker.DEFAULT_CHUNK_SIZE
    chunk_overlap = config.chunk_overlap or TextChunker.DEFAULT_CHUNK_OVERLAP
    self._text_chunker = TextChunker(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    self._file_processor: FileProcessor | None = None

    # Embedding service (initialized lazily by AgentFactory)
    self._embedding_service: Any = None

    # Persistent collection instance for vector store operations
    self._collection: Any = None
    self._provider: str = "in-memory"

    # Source context for stable record keys (remote sources)
    self._source_root: Path | None = None
    self._is_remote: bool = False

    logger.debug(
        f"VectorStoreTool initialized: name={config.name}, "
        f"source={config.source}, base_dir={base_dir}, top_k={config.top_k}"
    )

initialize(force_ingest=False, provider_type=None, progress_callback=None) async

Initialize tool and ingest source files.

Discovers files from the configured source, processes them into chunks, generates embeddings, and stores them in the vector database. Source path is resolved relative to base_dir if set.

For structured data mode (when vector_field is configured), loads structured data from CSV/JSON/JSONL files with field mapping.

Parameters:

Name Type Description Default
force_ingest bool

If True, re-ingest all files regardless of modification time.

False
provider_type str | None

LLM provider for dimension auto-detection (defaults to "openai" if not specified)

None
progress_callback Callable[[int, int | None], None] | None

Optional callback invoked after each file is processed (or skipped). Called as callback(current, total) where current is the 1-based file index and total is the total number of discovered files.

None

Raises:

Type Description
FileNotFoundError

If the source path doesn't exist.

RuntimeError

If no supported files are found in source.

ConfigError

If configured fields don't exist in source (structured mode).

Source code in src/holodeck/tools/vectorstore_tool.py
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
async def initialize(
    self,
    force_ingest: bool = False,
    provider_type: str | None = None,
    progress_callback: Callable[[int, int | None], None] | None = None,
) -> None:
    """Initialize tool and ingest source files.

    Discovers files from the configured source, processes them into chunks,
    generates embeddings, and stores them in the vector database.
    Source path is resolved relative to base_dir if set.

    For structured data mode (when vector_field is configured), loads
    structured data from CSV/JSON/JSONL files with field mapping.

    Args:
        force_ingest: If True, re-ingest all files regardless of modification time.
        provider_type: LLM provider for dimension auto-detection
            (defaults to "openai" if not specified)
        progress_callback: Optional callback invoked after each file is
            processed (or skipped). Called as ``callback(current, total)``
            where *current* is the 1-based file index and *total* is the
            total number of discovered files.

    Raises:
        FileNotFoundError: If the source path doesn't exist.
        RuntimeError: If no supported files are found in source.
        ConfigError: If configured fields don't exist in source (structured mode).
    """
    # Default to openai if not specified
    if provider_type is None:
        provider_type = "openai"
        logger.debug(
            f"Defaulting to '{provider_type}' for dimension auto-detection"
        )

    # Branch to structured mode if vector_field is configured
    if self._is_structured_mode():
        await self._initialize_structured(provider_type)
        return

    source_path = self._resolve_source_path()

    # Validate source exists (T035)
    if not source_path.exists():
        # Get effective base_dir for error message
        from holodeck.config.context import agent_base_dir

        effective_base_dir = self._base_dir or agent_base_dir.get()
        raise FileNotFoundError(
            f"Source path does not exist: {source_path} "
            f"(configured source: {self.config.source}, "
            f"base_dir: {effective_base_dir})"
        )

    # Set up collection instance with provider type before processing files
    self._setup_collection(provider_type)

    # Discover files
    files = self._discover_files()

    if not files:
        logger.warning(
            f"No supported files found in source: {self.config.source}. "
            f"Supported extensions: {SUPPORTED_EXTENSIONS}"
        )
        # Still mark as initialized even with no files
        self.is_initialized = True
        self.document_count = 0
        self.last_ingest_time = datetime.now()
        return

    logger.info(f"Discovered {len(files)} files for ingestion")

    # Process each file with mtime checking
    total_chunks = 0
    skipped_files = 0
    processed_count = 0
    total_files = len(files)

    for file_path in files:
        # Check if file needs re-ingestion (unless force_ingest)
        if not force_ingest:
            needs_reingest = await self._needs_reingest(file_path)
            if not needs_reingest:
                logger.debug(f"Skipping unchanged file: {file_path}")
                skipped_files += 1
                processed_count += 1
                if progress_callback is not None:
                    progress_callback(processed_count, total_files)
                continue
        else:
            # Force ingest: delete existing records first
            await self._delete_file_records(file_path)

        source_file = await self._process_file(file_path)
        if source_file is None:
            processed_count += 1
            if progress_callback is not None:
                progress_callback(processed_count, total_files)
            continue

        # Generate embeddings
        embeddings = await self._embed_chunks(source_file.chunks)

        # Store chunks
        chunks_stored = await self._store_chunks(source_file, embeddings)
        total_chunks += chunks_stored

        processed_count += 1
        if progress_callback is not None:
            progress_callback(processed_count, total_files)

    self.document_count = total_chunks
    self.is_initialized = True
    self.last_ingest_time = datetime.now()

    logger.info(
        f"VectorStoreTool initialized: {len(files)} files "
        f"({skipped_files} skipped, up-to-date), {total_chunks} chunks indexed"
    )

search(query) async

Execute semantic search and return formatted results.

Parameters:

Name Type Description Default
query str

Natural language search query.

required

Returns:

Type Description
str

Formatted string with search results including scores and sources.

Raises:

Type Description
RuntimeError

If tool not initialized.

ValueError

If query is empty.

Source code in src/holodeck/tools/vectorstore_tool.py
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
async def search(self, query: str) -> str:
    """Execute semantic search and return formatted results.

    Args:
        query: Natural language search query.

    Returns:
        Formatted string with search results including scores and sources.

    Raises:
        RuntimeError: If tool not initialized.
        ValueError: If query is empty.
    """
    # Validation
    if not self.is_initialized:
        raise RuntimeError(
            "VectorStoreTool must be initialized before search. "
            "Call initialize() first."
        )

    if not query or not query.strip():
        raise ValueError("Search query cannot be empty")

    # Generate query embedding
    query_embeddings = await self._embed_chunks([query])
    query_embedding = query_embeddings[0]

    # Branch to structured search if in structured mode
    if self._is_structured_mode():
        structured_results = await self._search_structured(query_embedding)

        # Apply min_similarity_score filter
        if self.config.min_similarity_score is not None:
            structured_results = [
                r
                for r in structured_results
                if r.score >= self.config.min_similarity_score
            ]

        # Apply top_k limit
        structured_results = structured_results[: self.config.top_k]

        return self._format_structured_results(structured_results, query)

    # Unstructured mode: search document chunks
    doc_results = await self._search_documents(query_embedding)

    # Apply min_similarity_score filter
    if self.config.min_similarity_score is not None:
        doc_results = [
            r for r in doc_results if r.score >= self.config.min_similarity_score
        ]

    # Apply top_k limit (T037)
    doc_results = doc_results[: self.config.top_k]

    # Format results (T034)
    return self._format_results(doc_results, query)

set_embedding_service(service)

Set the embedding service for generating embeddings.

This method allows AgentFactory to inject a Semantic Kernel TextEmbedding service for generating real embeddings instead of placeholder zeros.

Parameters:

Name Type Description Default
service Any

Semantic Kernel TextEmbedding service instance (OpenAITextEmbedding or AzureTextEmbedding).

required
Source code in src/holodeck/tools/base_tool.py
50
51
52
53
54
55
56
57
58
59
60
61
62
def set_embedding_service(self, service: Any) -> None:
    """Set the embedding service for generating embeddings.

    This method allows AgentFactory to inject a Semantic Kernel TextEmbedding
    service for generating real embeddings instead of placeholder zeros.

    Args:
        service: Semantic Kernel TextEmbedding service instance
            (OpenAITextEmbedding or AzureTextEmbedding).
    """
    self._embedding_service = service
    tool_name = getattr(getattr(self, "config", None), "name", "unknown")
    logger.debug(f"Embedding service set for tool: {tool_name}")

holodeck.tools.hierarchical_document_tool

Provides intelligent document search that understands document structure, extracts definitions, and generates optimized context for LLM consumption. Supports semantic, keyword (BM25), and hybrid search modes with configurable weights.

HierarchicalDocumentTool

HierarchicalDocumentTool(config, base_dir=None, execution_config=None)

Bases: EmbeddingServiceMixin, DatabaseConfigMixin

Semantic Kernel tool for hierarchical document retrieval.

This tool provides intelligent document search that understands document structure, extracts definitions, and generates optimized context for LLM consumption.

Inherits from

EmbeddingServiceMixin: Provides set_embedding_service() method DatabaseConfigMixin: Provides database config resolution and collection creation

Attributes:

Name Type Description
config

Tool configuration from HierarchicalDocumentToolConfig.

chunks

Indexed document chunks.

Example

from holodeck.models.tool import HierarchicalDocumentToolConfig config = HierarchicalDocumentToolConfig( ... name="doc_search", ... description="Search policy documents", ... source="./docs/policy.md" ... ) tool = HierarchicalDocumentTool(config) await tool.initialize() results = await tool.search("What are the reporting requirements?")

Initialize the hierarchical document tool.

Parameters:

Name Type Description Default
config HierarchicalDocumentToolConfig

Tool configuration from HierarchicalDocumentToolConfig.

required
base_dir str | None

Optional base directory for resolving relative source paths. If None, source paths are resolved relative to current working directory or agent_base_dir context variable.

None
execution_config ExecutionConfig | None

Execution configuration for file processing timeouts and caching. When provided, the lazy FileProcessor used by _convert_to_markdown is built via FileProcessor.from_execution_config so that execution.file_timeout (and download_timeout / cache_dir) from agent.yaml are honored. When None, FileProcessor's hardcoded defaults (30s timeouts) apply.

None
Source code in src/holodeck/tools/hierarchical_document_tool.py
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
def __init__(
    self,
    config: HierarchicalDocumentToolConfig,
    base_dir: str | None = None,
    execution_config: ExecutionConfig | None = None,
) -> None:
    """Initialize the hierarchical document tool.

    Args:
        config: Tool configuration from HierarchicalDocumentToolConfig.
        base_dir: Optional base directory for resolving relative source paths.
            If None, source paths are resolved relative to current working
            directory or agent_base_dir context variable.
        execution_config: Execution configuration for file processing
            timeouts and caching. When provided, the lazy FileProcessor
            used by ``_convert_to_markdown`` is built via
            ``FileProcessor.from_execution_config`` so that
            ``execution.file_timeout`` (and download_timeout / cache_dir)
            from agent.yaml are honored. When None, FileProcessor's
            hardcoded defaults (30s timeouts) apply.
    """
    self.config = config
    self._base_dir = base_dir
    self._execution_config = execution_config
    self._chunker: StructuredChunker | None = None
    self._searcher: Any = None  # HybridSearcher (future)
    self._context_generator: ContextGenerator | None = None
    self._glossary: dict[str, DefinitionEntry] | None = None
    self._chunks: list[DocumentChunk] = []
    self._initialized = False

    # Service injection attributes
    self._embedding_service: Any = None
    self._collection: Any = None
    self._provider: str = "in-memory"
    self._embedding_dimensions: int | None = None
    self._qdrant_indexes_ensured: bool = False

    # Lazily constructed FileProcessor (see _get_file_processor)
    self._file_processor: FileProcessor | None = None

    # Hybrid search executor (initialized during ingestion)
    self._hybrid_executor: HybridSearchExecutor | None = None

    # Source context for stable record keys (remote sources)
    self._source_root: Path | None = None
    self._is_remote: bool = False

initialize(force_ingest=True, provider_type=None, progress_callback=None) async

Initialize the tool by processing all configured documents.

This method should be called before any search operations. It loads documents, chunks them, extracts definitions, and indexes content for search.

Uses mtime-based incremental ingestion to skip unchanged files. Files are only re-ingested if their modification time is newer than the stored record's mtime.

Parameters:

Name Type Description Default
force_ingest bool

If True, re-ingest all files regardless of modification time. Existing records will be deleted before re-ingestion.

True
provider_type str | None

LLM provider for dimension auto-detection (defaults to "openai" if not specified).

None
progress_callback Callable[[int, int | None], None] | None

Optional callback invoked after each file is processed (or skipped). Called as callback(current, total) where current is the 1-based file index and total is the total number of discovered files.

None

Raises:

Type Description
FileNotFoundError

If a document file is not found.

Source code in src/holodeck/tools/hierarchical_document_tool.py
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
async def initialize(
    self,
    force_ingest: bool = True,
    provider_type: str | None = None,
    progress_callback: Callable[[int, int | None], None] | None = None,
) -> None:
    """Initialize the tool by processing all configured documents.

    This method should be called before any search operations.
    It loads documents, chunks them, extracts definitions, and
    indexes content for search.

    Uses mtime-based incremental ingestion to skip unchanged files.
    Files are only re-ingested if their modification time is newer
    than the stored record's mtime.

    Args:
        force_ingest: If True, re-ingest all files regardless of
            modification time. Existing records will be deleted
            before re-ingestion.
        provider_type: LLM provider for dimension auto-detection
            (defaults to "openai" if not specified).
        progress_callback: Optional callback invoked after each file is
            processed (or skipped). Called as ``callback(current, total)``
            where *current* is the 1-based file index and *total* is the
            total number of discovered files.

    Raises:
        FileNotFoundError: If a document file is not found.
    """
    # Default to openai if not specified
    if provider_type is None:
        provider_type = "openai"
        logger.debug(
            f"Defaulting to '{provider_type}' for dimension auto-detection"
        )

    # Set up collection with provider type for dimension resolution
    self._setup_collection(provider_type)

    # Ingest all documents (with incremental check)
    skipped_files = await self._ingest_documents(
        force_ingest=force_ingest, progress_callback=progress_callback
    )

    # Always set up the hybrid executor — search routes through it.
    self._setup_hybrid_executor()

    # Decide whether to reload the corpus and rebuild the in-process
    # keyword index. Only the in-process BM25 fallback actually needs
    # this: it's ephemeral and dies with the process. For native-hybrid
    # providers (qdrant), keyword retrieval happens inside the store —
    # the in-process index is never read, so loading every chunk into
    # Python on every restart is pure waste. For in-memory providers,
    # the collection itself is reseeded by _ingest_documents (nothing
    # is "skipped"), so there's no reload to do.
    from holodeck.lib.keyword_search import NATIVE_HYBRID_PROVIDERS

    is_native_hybrid = self._provider in NATIVE_HYBRID_PROVIDERS
    needs_corpus_reload = (
        skipped_files > 0 and self._provider != "in-memory" and not is_native_hybrid
    )

    if needs_corpus_reload:
        stored_chunks = await self._load_chunks_from_store()
        if stored_chunks:
            self._chunks = stored_chunks
            await self._build_hybrid_indices()
    elif is_native_hybrid:
        logger.debug(
            "Skipped startup corpus reload for native-hybrid provider "
            f"'{self._provider}'; lazy chunk fetch on miss."
        )

    self._initialized = True
    logger.info(
        f"HierarchicalDocumentTool initialized: "
        f"{len(self._chunks)} chunks from {self.config.source}"
    )

search(query, top_k=None) async

Search documents for content relevant to query.

Parameters:

Name Type Description Default
query str

Search query string.

required
top_k int | None

Override configured top_k.

None

Returns:

Type Description
list[SearchResult]

List of SearchResult objects.

Raises:

Type Description
RuntimeError

If tool is not initialized.

ValueError

If query is empty.

Source code in src/holodeck/tools/hierarchical_document_tool.py
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
async def search(
    self,
    query: str,
    top_k: int | None = None,
) -> list[SearchResult]:
    """Search documents for content relevant to query.

    Args:
        query: Search query string.
        top_k: Override configured top_k.

    Returns:
        List of SearchResult objects.

    Raises:
        RuntimeError: If tool is not initialized.
        ValueError: If query is empty.
    """
    if not self._initialized:
        raise RuntimeError(
            "HierarchicalDocumentTool must be initialized before search. "
            "Call initialize() first."
        )

    if not query or not query.strip():
        raise ValueError("Search query cannot be empty")

    effective_top_k = top_k or self.config.top_k

    # Generate query embedding (needed for semantic and hybrid modes)
    query_embedding = await self._embed_query(query)

    # Route to appropriate search mode
    if self.config.search_mode == SearchMode.KEYWORD:
        results = await self._keyword_search(query, effective_top_k)
    elif self.config.search_mode == SearchMode.HYBRID:
        results = await self._hybrid_search(query, query_embedding, effective_top_k)
    else:
        # SEMANTIC mode (default)
        results = await self._semantic_search(query_embedding, effective_top_k)

    # Filter by min_score if configured
    if self.config.min_score is not None:
        results = [r for r in results if r.fused_score >= self.config.min_score]

    return results

get_context(query, max_tokens=None) async

Get LLM-ready context for a query.

This is a convenience method that searches and formats results into a single context string suitable for LLM prompts.

Parameters:

Name Type Description Default
query str

Query to get context for.

required
max_tokens int | None

Maximum tokens for context (currently unused).

None

Returns:

Type Description
str

Formatted context string.

Raises:

Type Description
RuntimeError

If tool is not initialized.

Source code in src/holodeck/tools/hierarchical_document_tool.py
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
async def get_context(self, query: str, max_tokens: int | None = None) -> str:
    """Get LLM-ready context for a query.

    This is a convenience method that searches and formats results
    into a single context string suitable for LLM prompts.

    Args:
        query: Query to get context for.
        max_tokens: Maximum tokens for context (currently unused).

    Returns:
        Formatted context string.

    Raises:
        RuntimeError: If tool is not initialized.
    """
    results = await self.search(query)

    if not results:
        return f"No relevant context found for: {query}"

    lines = [f"Context for query: {query}", ""]
    for i, result in enumerate(results, 1):
        lines.append(f"[{i}] {result.format()}")
        lines.append("")

    return "\n".join(lines).rstrip()

get_definition(term)

Look up a term's definition.

Parameters:

Name Type Description Default
term str

Term to look up.

required

Returns:

Type Description
dict[str, str] | None

Dictionary with 'term' and 'definition' keys, or None.

Source code in src/holodeck/tools/hierarchical_document_tool.py
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
def get_definition(self, term: str) -> dict[str, str] | None:
    """Look up a term's definition.

    Args:
        term: Term to look up.

    Returns:
        Dictionary with 'term' and 'definition' keys, or None.
    """
    if self._glossary is None:
        return None

    normalized = term.lower().replace(" ", "_")
    entry = self._glossary.get(normalized)
    if entry:
        return {"term": entry.term, "definition": entry.definition_text}
    return None

set_embedding_service(service)

Set the embedding service for generating embeddings.

This method allows AgentFactory to inject a Semantic Kernel TextEmbedding service for generating real embeddings instead of placeholder zeros.

Parameters:

Name Type Description Default
service Any

Semantic Kernel TextEmbedding service instance (OpenAITextEmbedding or AzureTextEmbedding).

required
Source code in src/holodeck/tools/base_tool.py
50
51
52
53
54
55
56
57
58
59
60
61
62
def set_embedding_service(self, service: Any) -> None:
    """Set the embedding service for generating embeddings.

    This method allows AgentFactory to inject a Semantic Kernel TextEmbedding
    service for generating real embeddings instead of placeholder zeros.

    Args:
        service: Semantic Kernel TextEmbedding service instance
            (OpenAITextEmbedding or AzureTextEmbedding).
    """
    self._embedding_service = service
    tool_name = getattr(getattr(self, "config", None), "name", "unknown")
    logger.debug(f"Embedding service set for tool: {tool_name}")

set_context_generator(generator)

Set the context generator for contextual embeddings.

Accepts any ContextGenerator protocol implementation (LLMContextGenerator, ClaudeSDKContextGenerator, etc.).

Parameters:

Name Type Description Default
generator Any

A ContextGenerator protocol implementation.

required
Source code in src/holodeck/tools/hierarchical_document_tool.py
180
181
182
183
184
185
186
187
188
189
def set_context_generator(self, generator: Any) -> None:
    """Set the context generator for contextual embeddings.

    Accepts any ContextGenerator protocol implementation
    (LLMContextGenerator, ClaudeSDKContextGenerator, etc.).

    Args:
        generator: A ContextGenerator protocol implementation.
    """
    self._context_generator = generator