PDF Processor
The holodeck.lib.pdf_processor package provides PDF-specific processing
utilities for text extraction with heading detection and page extraction.
These operations are used by FileProcessor to deliver enhanced PDF handling
beyond what markitdown offers natively.
Overview
The package exposes two public functions:
| Function | Purpose |
|---|---|
| extract_pdf_with_headings | Extract text from a PDF while detecting headings via bookmarks or font sizes |
| extract_pdf_pages | Extract a subset of pages from a PDF into a temporary file |
Heading Detection Strategies
extract_pdf_with_headings supports two strategies, applied in priority order:
- Bookmark-based detection (preferred) -- Uses PDF outline/bookmark entries to identify headings and their hierarchy levels. More reliable when the PDF contains bookmarks.
- Font-size-based detection (fallback) -- Analyzes font sizes against configurable thresholds. Used automatically when bookmarks are absent or when bookmark matching falls below a 30% match rate.
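The two rules in this list can be pictured with a small sketch. It is not the library's implementation: `heading_level_for_size` and `bookmarks_usable` are hypothetical helpers. The sketch assumes a threshold applies to any font size at or above it (as the "14pt+ = h1" wording in the parameter reference suggests) and that the match rate is matched bookmarks divided by total bookmarks.

```python
from __future__ import annotations

# Default mapping quoted on this page: 14pt+ = h1, 12pt+ = h2.
DEFAULT_THRESHOLDS: dict[float, int] = {14.0: 1, 12.0: 2}


def heading_level_for_size(
    font_size: float, thresholds: dict[float, int] | None = None
) -> int | None:
    """Return the heading level for a font size, or None for body text."""
    thresholds = DEFAULT_THRESHOLDS if thresholds is None else thresholds
    # Check the largest threshold first so 16pt maps to h1 rather than h2.
    for min_size in sorted(thresholds, reverse=True):
        if font_size >= min_size:
            return thresholds[min_size]
    return None


def bookmarks_usable(matched: int, total: int, min_rate: float = 0.30) -> bool:
    """Assumed rule: keep bookmark detection only at or above the match rate."""
    if total == 0:
        return False  # no bookmarks at all, so fall back to font sizes
    return matched / total >= min_rate
```

Sorting the thresholds in descending order matters: without it, a 16pt line could match the 12pt entry first and be demoted to a second-level heading.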
The output is Markdown text with heading markers (#, ##, etc.) suitable for
downstream processing by tools such as StructuredChunker.
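Because the result is plain Markdown, heading lines can be recovered with ordinary string handling. The following is a minimal sketch of heading-based splitting, not the StructuredChunker implementation; `split_by_headings` is a hypothetical helper and the sample text is made up for illustration.

```python
from __future__ import annotations

import re

# One to six '#' characters, whitespace, then the heading title.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_by_headings(markdown: str) -> list[tuple[int, str, str]]:
    """Split Markdown into (level, title, body) sections.

    Text before the first heading is returned with level 0 and an empty title.
    """
    sections: list[tuple[int, str, list[str]]] = [(0, "", [])]
    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            sections.append((len(match.group(1)), match.group(2), []))
        else:
            sections[-1][2].append(line)
    # Drop the leading placeholder section when it holds no text.
    return [
        (level, title, "\n".join(body).strip())
        for level, title, body in sections
        if title or any(part.strip() for part in body)
    ]


sample = "# Report\nIntro text.\n## Methods\nDetails here."
for level, title, body in split_by_headings(sample):
    print(level, title, "->", body)
```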
Package Exports
PDF processing utilities for HoloDeck.
This package provides PDF-specific file processing operations:
- Heading Extraction: Text extraction using pdfminer that produces Markdown with proper heading markers (##). Supports two strategies:
  - Bookmark-based detection (preferred): Uses PDF outline/bookmark entries to identify headings and their hierarchy. More reliable when bookmarks are present.
  - Font-size-based detection (fallback): Analyzes font sizes against configurable thresholds. Used when bookmarks are absent or disabled.
- Page Extraction: Extract specific pages from PDF files using pypdf, producing temporary PDF files with the selected pages.
These operations are used by FileProcessor to provide enhanced PDF handling beyond what markitdown offers natively.
Example
from pathlib import Path
from holodeck.lib.pdf_processor import extract_pdf_with_headings, extract_pdf_pages
# Extract text with heading detection (bookmarks preferred, font-size fallback)
markdown = extract_pdf_with_headings(Path("document.pdf"))
# Force font-size-only detection
markdown = extract_pdf_with_headings(Path("document.pdf"), use_bookmarks=False)
# Extract specific pages (0-indexed)
temp_path = extract_pdf_pages(Path("document.pdf"), pages=[0, 1, 2])
Functions:
| Name | Description |
|---|---|
| extract_pdf_with_headings | Extract PDF text with heading detection |
| extract_pdf_pages | Extract specific pages from a PDF file |
extract_pdf_with_headings(file_path, heading_thresholds=None, use_bookmarks=True)
Extract PDF text with heading detection.
Prefers bookmark-based heading detection when bookmarks are available, falling back to font-size-based detection otherwise.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to PDF file | required |
| heading_thresholds | dict[float, int] \| None | Font size -> heading level mapping. Default: {14.0: 1, 12.0: 2} (14pt+ = h1, 12pt+ = h2) | None |
| use_bookmarks | bool | Whether to attempt bookmark-based detection first. Default: True | True |
Returns:
| Type | Description |
|---|---|
| str | Markdown text with heading markers based on detected headings. |
Raises:
| Type | Description |
|---|---|
| Exception | If PDF parsing fails (caller should handle fallback). |
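Since a parsing failure surfaces as a generic Exception, the caller decides what to fall back to. Below is a minimal sketch of that pattern, with the extractors passed in as callables so the snippet runs without HoloDeck installed. `extract_with_fallback` and the `fallback` callable are hypothetical; in real use `primary` would be `extract_pdf_with_headings`, and the choice of fallback extractor is up to the caller.

```python
from __future__ import annotations

import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)

# Any function that turns a PDF path into text.
Extractor = Callable[[Path], str]


def extract_with_fallback(path: Path, primary: Extractor, fallback: Extractor) -> str:
    """Try the heading-aware extractor, then a plainer one if parsing fails."""
    try:
        return primary(path)
    except Exception as exc:  # broad on purpose: only Exception is documented
        logger.warning("Heading extraction failed for %s: %s", path, exc)
        return fallback(path)
```

Logging the original error before falling back keeps the failure visible, which helps when a PDF silently loses its heading structure.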
Source code in src/holodeck/lib/pdf_processor/heading_extractor.py
extract_pdf_pages(file_path, pages)
Extract specific pages from PDF into temporary file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| file_path | Path | Path to original PDF file | required |
| pages | list[int] | List of page numbers to extract (0-indexed) | required |
Returns:
| Type | Description |
|---|---|
| Path | Path to temporary PDF file with extracted pages |
Raises:
| Type | Description |
|---|---|
| ImportError | If pypdf is not installed |
| ValueError | If page numbers are invalid or out of range |
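The 0-indexed convention and the ValueError contract can be illustrated with a pre-flight check. This sketch is not the library's validation logic: `validate_pages` and `from_one_based` are hypothetical helpers, and the exact conditions that `extract_pdf_pages` treats as invalid (for example, an empty list) are an assumption.

```python
from __future__ import annotations


def validate_pages(pages: list[int], page_count: int) -> list[int]:
    """Return pages unchanged if every index is valid for a 0-indexed PDF."""
    if not pages:
        raise ValueError("pages must not be empty")
    for page in pages:
        if not 0 <= page < page_count:
            raise ValueError(
                f"page {page} is out of range for a {page_count}-page PDF "
                f"(valid indices: 0 to {page_count - 1})"
            )
    return pages


def from_one_based(pages: list[int]) -> list[int]:
    """Convert human-readable page numbers (1, 2, 3) to 0-indexed ones."""
    return [page - 1 for page in pages]
```

A conversion helper like `from_one_based` is useful when page numbers come from users or a viewer's page counter, both of which usually start at 1.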
Source code in src/holodeck/lib/pdf_processor/page_extractor.py
Usage Examples
Extract text with heading detection
from pathlib import Path
from holodeck.lib.pdf_processor import extract_pdf_with_headings
# Bookmark-preferred extraction (default)
markdown = extract_pdf_with_headings(Path("document.pdf"))
# Force font-size-only detection
markdown = extract_pdf_with_headings(Path("document.pdf"), use_bookmarks=False)
# Custom font-size thresholds (18pt+ = h1, 14pt+ = h2, 12pt+ = h3)
markdown = extract_pdf_with_headings(
    Path("document.pdf"),
    heading_thresholds={18.0: 1, 14.0: 2, 12.0: 3},
)
Extract specific pages
from pathlib import Path
from holodeck.lib.pdf_processor import extract_pdf_pages
# Extract pages 0, 1, and 4 (0-indexed) into a temporary PDF
temp_path = extract_pdf_pages(Path("large-report.pdf"), pages=[0, 1, 4])
# Use the temporary file for further processing
print(f"Extracted pages saved to: {temp_path}")
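This page does not say who deletes the temporary file. Assuming cleanup is left to the caller, a small context manager keeps it tidy. `cleanup_after` is a hypothetical helper, and the demo uses a stand-in temporary file instead of a real `extract_pdf_pages` result.

```python
from __future__ import annotations

import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def cleanup_after(temp_path: Path) -> Iterator[Path]:
    """Yield the path, then delete the file even if processing raised."""
    try:
        yield temp_path
    finally:
        temp_path.unlink(missing_ok=True)


# Stand-in for the Path returned by extract_pdf_pages(...).
with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as handle:
    demo_path = Path(handle.name)

with cleanup_after(demo_path) as pdf_path:
    size = pdf_path.stat().st_size  # process the extracted pages here

print(demo_path.exists())
```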
Dependencies
| Dependency | Used By | Purpose |
|---|---|---|
| pdfminer.six | heading_extractor | PDF text and font-size extraction, bookmark/outline parsing |
| pypdf | page_extractor | PDF page-level read/write operations |
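Because `extract_pdf_pages` raises ImportError when pypdf is missing, it can help to check for both dependencies up front. This is a sketch using only the standard library; `missing_modules` and `REQUIRED_MODULES` are hypothetical names, and the check covers importability only, not versions.

```python
from __future__ import annotations

from importlib.util import find_spec

# Import names: the pdfminer.six distribution is imported as "pdfminer".
REQUIRED_MODULES = {"pdfminer": "heading_extractor", "pypdf": "page_extractor"}


def missing_modules(modules: list[str]) -> list[str]:
    """Return the top-level module names that cannot be imported."""
    return [name for name in modules if find_spec(name) is None]


for name in missing_modules(list(REQUIRED_MODULES)):
    print(f"{name} is missing; {REQUIRED_MODULES[name]} will not work")
```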