Cost & Latency-Aware Optimization¶
Problem Statement¶
How might we let the optimizer trade agent quality against real cost and latency
— so "best" means "good enough, cheaply enough" — without breaking the
loss = 1 − weighted_mean objective (loss ∈ [0,1]) or the compounding
single-best_loss accept rule that drives the coordinate-descent loop?
Recommended Direction¶
Treat cost and latency as a hard budget (feasibility gate), not as terms in
the loss. Quality remains the sole objective the optimizer minimizes; a
candidate is accepted only if it (a) beats the current best loss by min_delta
AND (b) stays within the declared cost and latency ceilings.
This keeps scalarize and the [0,1] loss invariant completely intact, leaves
Optuna's direction="minimize" and the compounding loop unchanged, and answers
the founding concern — runaway model cost — with a ceiling expressed in real
dollars (which the absolute-price-table + priced-token choices make natural).
The accept rule gains one conjunct; nothing else in the objective moves.
Cost basis is priced tokens: Σ over test cases (tokens × per-model rate),
sourced from a price table in the optimizer config. Local models (Ollama) are
priced at $0, so for them the cost ceiling is vacuous and the latency ceiling
carries the economic signal — which is exactly the regime worth caring about.
Key Assumptions to Validate¶
-
token_usageis populated by BOTH backends. Cost is uncomputable if the SK/Ollama path returnsNone. → Run one trial each on Claude and an Ollama agent; assertreport.results[*].token_usageis non-null. This is the #1 kill risk — if SK doesn't fill it, the cost gate can't fire. - Latency is stable enough to gate on. Wall-clock on a shared/dev box
with cold starts and network jitter is high-variance; a flaky gate would
reject good candidates nondeterministically. → Measure run-to-run variance
of
execution_time_msfor a fixed agent; if noisy, ship latency as recorded-only first and gate on cost alone. - A maintainable price table is acceptable. Rates drift. → Ship a small default map for common models + YAML override; unknown model → error (not silent $0).
MVP Scope¶
In:
- optimizer/cost.py: a PriceTable (model_id → $/1M prompt+completion[+cache]
rates) and report_cost(report, prices) -> float, report_latency(report) ->
float (mean of execution_time_ms across test cases).
- OptimizerConfig.budget: optional { cost_max: $, latency_max_ms: ms,
prices: {model_id: rates} }. Absent budget ⇒ current behaviour, unchanged.
- Accept rule in loop.py: feasible = within(cost_max) and within(latency_max);
accepted = feasible and (best_loss − loss > min_delta).
- TrialRecord.cost / TrialRecord.latency_ms + columns in report.md.
- Infeasible-baseline handling: if the original agent already exceeds budget,
accept the first feasible candidate regardless of loss, then resume the
normal rule (so the run can't dead-end against an unreachable baseline).
- Tests: cost computation from a known token_usage + price table; gate rejects
an over-budget improvement; gate accepts an in-budget improvement; infeasible
baseline path.
Out: soft penalty / λ weights, Pareto, percentile (p95) latency, automatic price fetching.
Not Doing (and Why)¶
- Penalty term in the loss — forces λ to fuse unit-conversion with preference and breaks the [0,1] invariant. Revisit only if the binary budget proves too coarse (the "won't economize within budget" gap).
- Pareto / multi-objective — no single
best_loss; that's the spine of the compounding accept rule. It's a rewrite, not an extension. - Latency gating by default if it proves noisy — fall back to recording it and gating cost only, rather than shipping a flaky gate.
Open Questions¶
- Latency aggregate: mean (simple, less noisy to explain) vs p95 (SLA-meaningful but needs more runs)? MVP leans mean.
- Cache-token pricing: fold
cache_read/cache_creationat their own rates, or approximate into prompt rate for v1? - Where does the price table live — inline in
agent.yamlunderevaluations.optimizer.budget.prices, or a shared~/.holodeckprice file referenced by path?