Cost attribution, reporting & management
How pond measures what a run costs, attributes it to the dimensions an enterprise bills on (team / tenant / cost-center / env), reports it, and enforces budgets — across any provider and harness, never tailored to one.
Status: M1–M5 shipped. Built incrementally (see Build sequence). M1: the labels spine (runs.labels / telemetry_events.labels, JSONB+GIN, migration 0005), the required-label policy (POND_REQUIRED_RUN_LABELS), the per-provider usage extractor registry (swarm/src/sandbox.py), and the exposed rollup (GET /v1/usage, GET /v1/admin/usage, pond usage). M2: the rate-card pricing primitive + pluggable pricer (app/pricing.py, model pricing JSONB, migration 0006), cached-meter extraction, usage_source / cost_source markers, and usage folded into the signed attestation (a :v2 canonical-message bump in app/crypto/attestation.py + swarm/src/security.py). M3: the usage_rollups daily rollup table + idempotent recompute job (app/usage_rollup.py, migration 0007), a rollup-backed daily report, and a line-item CSV/JSON export. M4: budgets (migration 0008) with admission control at submit (block → 422, warn → log), budget.alert threshold telemetry, and a scope-aware watchdog that cancels over-budget runs mid-flight (app/budgets.py). M5: provider-invoice reconciliation (app/reconciliation.py, cost_reconciliations, migration 0009), an actual-vs-estimated true-up per (provider, period) that yields a reconciliation factor: the chargeback-grade number, kept distinct from the estimate. The invariants below must hold at every commit.
Code today: app/telemetry.py (emit), app/agent_usage.py (rollups + pricing),
app/models.py (telemetry_events, agent_models), app/routers/orch.py
(/orch/jobs/{jid}/telemetry ingest), app/run_watchdog.py (budget cancel),
swarm/src/sandbox.py (broker usage capture), app/obs.py (OTel metrics).
The shape of the problem
“Cost” is three jobs with different owners. They stack — each needs the one below.
- Measurement — what did this cost? Tokens in/out per model call, priced, attributed. The factual spine.
- Reporting — who spent what, over what period? Rollups, slices, exports.
- Management — don’t exceed X. Budgets, admission control, alerts, kills.
Primary audience: enterprise FinOps / eng-lead — chargeback/showback by team/tenant/cost-center. That target drives every decision below.
What exists today (the foundation)
Pond already has a real cost spine; the gaps are attribution breadth, exposure, pricing fidelity, and cross-run control.
| Capability | State | Where |
|---|---|---|
| Append-only event store, server-stamped attribution | built | telemetry_events (models.py), orch.py ingest |
| Token + cost capture at the credential broker | built | swarm/src/sandbox.py _extract_usage() |
| Provider-reported cost when present, else per-Mtok compute | built | agent_usage.py _cost() |
| Read-time rollups (run / project / model / harness / window) | built, mostly unexposed | agent_usage.py aggregate() |
| Per-run cost ceiling + wall-clock, preemptive cancel | built | run_watchdog.py |
| OTel cost/token/call metrics | built | obs.py |
Two facts shape the whole design:
- Attribution is trustworthy; magnitude is not (on BYO). Pond stamps run_id / project_id / stage_key / job_id server-side at ingest, so a worker cannot spoof what a cost belongs to. But the token counts originate in the worker’s broker, so an untrusted/BYO pool could under- or over-report. This is the load-bearing constraint (see Trust of the number).
- aggregate() already computes the FinOps rollup and has no endpoint. Exposing what’s already written is the single highest-leverage step.
Decisions
Attribution model → generic labels + required-label policy
Pond stays product-neutral, so attribution is an open labels: dict[str,str]
map on run submission — not baked-in tenant/team columns. Stored on runs
(JSONB, GIN-indexed) and denormalized onto each telemetry_event at ingest,
stamped server-side from the run exactly like project_id — so labels inherit
the no-spoof property and rollups stay a single-table group-by (no join).
project_id / workflow_id remain first-class soft refs; labels are the open
extension. The FinOps-critical part: pond can enforce required label keys per
consumer/org — reject a run missing costCenter/team. Chargeback is
impossible if any spend can go untagged. (AWS cost-allocation-tags / GCP labels
pattern.) Conventional keys — tenant, costCenter, team, env, user — are
documented for portable reports; the map stays open.
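A minimal sketch of the two mechanisms described above: the required-label check at submission and the server-side stamping of labels onto each event. The function names and the comma-separated format for the policy value are assumptions made for illustration; only the POND_REQUIRED_RUN_LABELS name and the stamping behaviour come from this document.

```python
from typing import Dict, List, Set


def required_label_keys(raw: str) -> Set[str]:
    """Parse a policy value such as "costCenter,team" (format is an assumption)."""
    return {key.strip() for key in raw.split(",") if key.strip()}


def missing_labels(labels: Dict[str, str], required: Set[str]) -> List[str]:
    """Return required keys that are absent or blank, sorted for stable errors."""
    return sorted(k for k in required if not labels.get(k, "").strip())


def stamp_event(event: dict, run: dict) -> dict:
    """Copy attribution from the run onto the event at ingest.

    Whatever the worker sent for these fields is overwritten, which is
    what gives labels the same no-spoof property as project_id.
    """
    stamped = dict(event)
    stamped["labels"] = dict(run["labels"])
    stamped["project_id"] = run["project_id"]
    return stamped
```

A run would be refused when `missing_labels(...)` is non-empty; because labels are copied onto each event, a rollup can group on them without joining back to runs.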
Pricing → a rate-card primitive, not fixed columns
Fixed pricing columns (input_cost_per_mtok, cached_input_cost_per_mtok,
per_request_cost, …) are an if/else ladder wearing a schema: the moment a
provider invents a meter (reasoning tokens, image tokens, a conditional
discount) you’re editing the model table. So pricing is the same shape as
usage — both open meter maps, and cost is their dot product.
- Usage is already an open bag: {tokens_in, tokens_out, cached_tokens_in, requests: 1, …}, filled by the extractor registry. New meter = new key.
- Price mirrors it as a rate card: a list of {meter, unit_price, per} line items, with a currency (default USD):

      {"currency": "USD", "rates": [
        {"meter": "tokens_in", "unit_price": 3.0, "per": 1000000},
        {"meter": "tokens_out", "unit_price": 15.0, "per": 1000000},
        {"meter": "cached_tokens_in", "unit_price": 0.30, "per": 1000000},
        {"meter": "requests", "unit_price": 0.001, "per": 1}
      ]}

- Cost = Σ usage[meter] / per × unit_price.
So cached / cache-write / per-request stop being special — they’re rows.
input/output are rows. The only contract is that the extractor and the rate
card agree on meter names (the shared, open meter namespace — see
neutrality). Back-compat: the existing
*_cost_per_mtok columns read as a 2-line rate card, so current config keeps
working unchanged.
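The dot product above can be sketched in a few lines. This is an illustration under assumptions, not the code in app/pricing.py: the function names are hypothetical, `output_cost_per_mtok` is an assumed name for the legacy output column, and clamping the total at zero is one way to satisfy the non-negative-cost invariant.

```python
from decimal import Decimal


def rate_card_cost(usage: dict, card: dict) -> Decimal:
    """Sum usage[meter] / per * unit_price over the card's line items."""
    total = Decimal(0)
    for line in card["rates"]:
        qty = Decimal(str(usage.get(line["meter"], 0)))  # absent meter = 0
        total += qty / Decimal(str(line["per"])) * Decimal(str(line["unit_price"]))
    return max(total, Decimal(0))  # a negative rate can never yield negative cost


def legacy_to_card(input_cost_per_mtok: float, output_cost_per_mtok: float) -> dict:
    """Read the legacy per-Mtok columns as a 2-line rate card."""
    return {
        "currency": "USD",
        "rates": [
            {"meter": "tokens_in", "unit_price": input_cost_per_mtok, "per": 1_000_000},
            {"meter": "tokens_out", "unit_price": output_cost_per_mtok, "per": 1_000_000},
        ],
    }
```

With the example card above, 200,000 input tokens, 10,000 output tokens, 1,000,000 cached input tokens and one request price as 0.60 + 0.15 + 0.30 + 0.001 = 1.051 USD.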
Pricing is applied by a pluggable pricer — cost(usage, pricing) → money,
a registry keyed like the extractors, default "rate_card". This is the escape
hatch for genuinely non-linear estimate logic (volume tiers, conditional
discounts) without ever putting an eval’d expression DSL in the trusted control
plane: a customer registers a pricer (or a consumer-side stage_hooks.py hook)
instead. We commit only to the linear rate card now; tiers/predicates are
deferred until a real provider needs them.
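To make the pricer registry concrete, here is a hedged sketch of a registry keyed by name with "rate_card" as the default. The decorator, the `pricer` key on the pricing dict, and the `volume_tier` example (with its `tier_after` / `tier_unit_price` fields) are all hypothetical; the document commits only to the linear rate card.

```python
from typing import Callable, Dict

Pricer = Callable[[dict, dict], float]
PRICERS: Dict[str, Pricer] = {}


def register_pricer(name: str):
    """Register a pricer under a name, mirroring the extractor registry."""
    def decorator(fn: Pricer) -> Pricer:
        PRICERS[name] = fn
        return fn
    return decorator


@register_pricer("rate_card")
def rate_card(usage: dict, pricing: dict) -> float:
    cost = sum(usage.get(r["meter"], 0) / r["per"] * r["unit_price"]
               for r in pricing["rates"])
    return max(cost, 0.0)


@register_pricer("volume_tier")  # hypothetical customer-registered pricer
def volume_tier(usage: dict, pricing: dict) -> float:
    qty = usage.get(pricing["meter"], 0)
    base = min(qty, pricing["tier_after"])        # billed at the base rate
    extra = max(qty - pricing["tier_after"], 0)   # billed at the tier rate
    cost = (base * pricing["unit_price"]
            + extra * pricing["tier_unit_price"]) / pricing["per"]
    return max(cost, 0.0)


def price(usage: dict, pricing: dict) -> float:
    """Dispatch on the pricing's pricer name; an unknown name raises KeyError."""
    return PRICERS[pricing.get("pricer", "rate_card")](usage, pricing)
```

The non-linear logic lives in ordinary registered code, so nothing in the control plane has to evaluate an expression supplied as data.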
Every cost record carries cost_source: provider_reported | computed, stored
as estimated_cost_usd explicitly so a later reconciled_cost_usd lands
without rewriting history. The layering, weakest-claim to strongest:
rate card (covers ~all real pricing, extensible over meters) → provider-
reported cost (self-pricing providers like OpenRouter — used verbatim today) →
custom pricer hook (bespoke estimate logic) → invoice reconciliation
(financial truth). Correctness on weird pricing comes from the provider/invoice,
not from out-predicting it.
Trust of the number → estimate vs. reconciled
Do not try to make an untrusted worker’s number financially binding — that is unwinnable. Instead, two numbers with different jobs:
- Worker-reported = real-time estimate. Drives UX, dashboards, and budget enforcement. Folded into the worker’s signed attestation (already signed — near-free) for non-repudiation/audit: proves the worker asserted it.
- Provider invoice = chargeback truth. The reconciled figure (reconciled_cost_usd) comes from the provider’s own accounting per period, keyed by API key/metadata. On a trusted pool, estimate ≈ truth and reconciliation is a true-up; on BYO you bill from the provider, not the tenant’s self-report. Honest, rather than pretending a signature makes a number true.
Enforcement → layered, building on the watchdog
Per-run ceiling exists. Add, in order: cross-run/time budgets, admission control (pre-flight estimate vs remaining budget), soft alerts (thresholds, separate from the hard kill), and a scope-aware watchdog (project budget exhausted → cancel running runs per policy).
Provider & harness neutrality
The non-negotiable: cost must never become codex-shaped or Anthropic-shaped. This rests on one invariant, two registries, and a shared meter namespace.
Invariant — capture at the broker, never the harness. Usage is read at the
credential broker, an HTTP proxy on the agent’s model-API path
(swarm/src/sandbox.py). It is below whatever CLI runs upstream — codex, claude,
aider, a custom harness are all invisible to it. We measure tokens by watching
API traffic, never by parsing a CLI’s stdout. The tempting shortcut (parse
codex’s usage line) would silently couple us to codex and is forbidden. Harnesses
remain a registry (agent_harnesses, harness.* capability) — adding one is a
row, not a code change; the cost layer must not regress that.
Registry — per-provider usage extractors. Response shapes differ by
provider, so _extract_usage becomes a registry keyed by the model’s provider,
not an if/else ladder:
| Provider | Tokens | Cached | Cost |
|---|---|---|---|
| anthropic | usage.input_tokens / output_tokens | cache_read_input_tokens / cache_creation_input_tokens | computed |
| openai (+ Azure, compatible) | usage.prompt_tokens / completion_tokens | prompt_tokens_details.cached_tokens | computed |
| openrouter | OpenAI shape | OpenAI shape | inline cost |
| bedrock / vertex / custom | provider shape | if present | computed |
| default | OpenAI-compatible | if present | computed |
Adding a provider = registering an extractor (same pattern as harnesses / executors). An unknown provider falls back to the OpenAI-compatible default and degrades gracefully rather than recording zero.
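The registry with its OpenAI-compatible fallback might look like the sketch below. The field paths are the ones in the table above; the decorator and function names are hypothetical, and cache_creation_input_tokens is left unmapped because this document does not name a meter for it.

```python
from typing import Callable, Dict

Extractor = Callable[[dict], dict]
EXTRACTORS: Dict[str, Extractor] = {}


def register_extractor(*providers: str):
    """Register one extractor under one or more provider keys."""
    def decorator(fn: Extractor) -> Extractor:
        for provider in providers:
            EXTRACTORS[provider] = fn
        return fn
    return decorator


@register_extractor("openai", "openrouter")
def openai_compatible(body: dict) -> dict:
    u = body.get("usage") or {}
    usage = {"tokens_in": u.get("prompt_tokens", 0),
             "tokens_out": u.get("completion_tokens", 0),
             "requests": 1}
    details = u.get("prompt_tokens_details") or {}
    if "cached_tokens" in details:
        usage["cached_tokens_in"] = details["cached_tokens"]
    return usage


@register_extractor("anthropic")
def anthropic(body: dict) -> dict:
    u = body.get("usage") or {}
    usage = {"tokens_in": u.get("input_tokens", 0),
             "tokens_out": u.get("output_tokens", 0),
             "requests": 1}
    if "cache_read_input_tokens" in u:
        usage["cached_tokens_in"] = u["cache_read_input_tokens"]
    return usage


def extract_usage(provider: str, body: dict) -> dict:
    """Unknown providers fall back to the OpenAI-compatible extractor."""
    return EXTRACTORS.get(provider, openai_compatible)(body)
```

Both extractors emit the same meter names, which is what lets one rate card shape price either provider.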
Failure mode to design for now. Some provider/transport combos return usage
only in a terminal streaming event, or not at all. The broker must parse
streaming usage events (not just JSON bodies), record a
usage_source: provider_body | stream_event | estimated_from_chars | unavailable marker so reports are honest about confidence, and fall back to a
tokenizer approximation (flagged low-confidence, never silently dropped) when
a provider gives nothing. Low-confidence estimates are exactly why chargeback
truth comes from provider reconciliation, which is provider-native.
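As a rough illustration of the streaming path and the confidence marker, the sketch below scans server-sent-event `data:` lines for the last object carrying a `usage` dict, then falls back to a character-based estimate. The function names, the inputs to `capture`, and the 4-characters-per-token ratio are assumptions for the example; a real tokenizer approximation would replace that ratio.

```python
import json
from typing import Optional


def usage_from_sse(raw: str) -> Optional[dict]:
    """Return the last 'usage' object found in an SSE stream, if any."""
    found = None
    for line in raw.splitlines():
        if not line.startswith("data:"):
            continue
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue
        try:
            event = json.loads(payload)
        except json.JSONDecodeError:
            continue  # tolerate non-JSON keep-alives
        if isinstance(event, dict) and isinstance(event.get("usage"), dict):
            found = event["usage"]  # later events win (terminal usage)
    return found


def capture(raw_stream: str, prompt_chars: int, completion_chars: int) -> dict:
    usage = usage_from_sse(raw_stream)
    if usage is not None:
        return {"usage": usage, "usage_source": "stream_event"}
    if prompt_chars or completion_chars:
        # Assumed heuristic: about 4 characters per token, rounded up.
        approx = {"tokens_in": -(-prompt_chars // 4),
                  "tokens_out": -(-completion_chars // 4)}
        return {"usage": approx, "usage_source": "estimated_from_chars"}
    return {"usage": {}, "usage_source": "unavailable"}
```

Every branch returns a record with a `usage_source`, so a report can separate measured spend from estimated spend instead of losing the call.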
Shared meter namespace — the usage⋅price contract. The extractor writes a
usage map {meter: quantity}; the rate card
prices a map {meter: rate}; cost is their dot product. Both sides are open: a
new provider meter is a new key on both, no schema change. The only coupling is
that extractor and rate card agree on meter names — so the meter vocabulary
(tokens_in, tokens_out, cached_tokens_in, requests, …) is the contract to
keep stable, documented next to the extractor registry.
Net per layer: pricing is keyed by (provider, model_id) with org/project
scope, so a mixed Anthropic + OpenAI + Bedrock fleet prices and reports side by
side; rollups group by provider/model/harness as plain data dimensions;
budgets are denominated in normalized cost (USD via the rate card’s currency),
so one budget spans providers uniformly.
The three layers
Layer 1 — Measurement & attribution
labels (JSONB+GIN) on runs → stamped onto telemetry at ingest; required-label
policy; per-provider extractor registry; the rate-card pricing primitive +
pluggable pricer (over the shared meter namespace); cost_source +
usage_source markers; usage folded into the signed attestation.
Layer 2 — Reporting
Expose aggregate(): GET /v1/usage (consumer-scoped) + GET /v1/admin/usage
(org-wide), with group_by (model | harness | project | stage | label key |
time-bucket), a time range, and filters. Add line-item exports (CSV/JSON,
per-run or per-model.call) — FinOps tools (Cloudability / CloudHealth / Apptio)
and warehouses ingest raw rows, not just rollups. For scale, add a daily rollup
table (scope × label-set × model × day) from a rollup job; serve reports from
it, keep raw telemetry for drill-down.
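The group_by behaviour can be pictured as a single pass over label-stamped events. This is a sketch, not aggregate() itself: the `label:<key>` syntax for a label dimension, the placeholder bucket names, and the flat event shape are assumptions.

```python
from collections import defaultdict


def dimension(event: dict, group_by: str) -> str:
    """Resolve a group_by value: a plain field, or 'label:<key>' for a label."""
    if group_by.startswith("label:"):
        return event.get("labels", {}).get(group_by[len("label:"):], "(untagged)")
    return str(event.get(group_by, "(unknown)"))


def rollup(events: list, group_by: str) -> dict:
    """Sum cost, tokens and call count per dimension value."""
    totals = defaultdict(
        lambda: {"cost_micros": 0, "tokens_in": 0, "tokens_out": 0, "calls": 0})
    for ev in events:
        row = totals[dimension(ev, group_by)]
        row["cost_micros"] += ev.get("cost_micros", 0)
        row["tokens_in"] += ev.get("tokens_in", 0)
        row["tokens_out"] += ev.get("tokens_out", 0)
        row["calls"] += 1
    return dict(totals)
```

Because labels are denormalized onto each event, grouping by team or costCenter needs no join; a daily rollup table would store the same sums keyed additionally by day.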
Layer 3 — Management
A budgets table: scope (org | project | label-match), period (day | month |
rolling-N), limit, action (warn | block). Admission control at
POST /v1/runs checks remaining budget against a pre-submission estimate
(historical avg for the workflow, or max-tokens × price). Soft alerts at
50/80/100% emit telemetry events + a webhook, separate from the hard kill.
Scope-aware watchdog extends the per-run ceiling to budget scopes.
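A small sketch of the two checks in this layer, using integer micro-units to avoid float drift. The function names, the return shape, and the exact comparison (spend plus estimate against the limit) are assumptions; the 50/80/100% thresholds and the warn/block actions come from the text.

```python
from typing import List, Tuple

THRESHOLDS = (50, 80, 100)  # percent of the budget limit


def admit(spent_micros: int, estimate_micros: int, limit_micros: int,
          action: str) -> Tuple[bool, str]:
    """Decide admission for a run against one applicable budget."""
    if spent_micros + estimate_micros <= limit_micros:
        return True, "ok"
    if action == "block":
        return False, "over_budget"  # the API layer would reject the submit
    return True, "warn"              # admitted, but logged


def crossed_thresholds(before_micros: int, after_micros: int,
                       limit_micros: int) -> List[int]:
    """Thresholds passed when spend moves from before to after."""
    return [t for t in THRESHOLDS
            if before_micros * 100 < t * limit_micros <= after_micros * 100]
```

Reporting only thresholds crossed between two readings is one way to keep soft alerts from re-firing on every check; the hard kill stays a separate decision.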
Build sequence
Each step independently shippable.
| # | Delivers | Contents |
|---|---|---|
| M1 ✅ | showback + neutrality from day one | labels spine + required-label policy; per-provider extractor registry (so we’re never codex/Anthropic-shaped); expose aggregate() as /v1/usage + /v1/admin/usage with label group-by |
| M2 ✅ | pricing fidelity | rate-card pricing ({meter, unit_price, per} summed) + pluggable pricer, replacing fixed cost columns (legacy columns read as a 2-line card); broker emits extra meters (cached) + streaming usage; cost_source/usage_source; usage folded into the signed attestation (a :v2 canonical-message bump on both codepaths — worker signs its reported meter map for audit/non-repudiation; the billed numbers travel the separate unsigned telemetry path, see Known limitations). Deferred (minor): the estimated_from_chars unavailable path |
| M3 ✅ | scale + FinOps ingestion | usage_rollups table (migration 0007) + idempotent recompute-day job (app/usage_rollup.py, POST /v1/admin/usage/rollup); rollup-backed daily report (GET /v1/admin/usage/daily); line-item CSV/JSON export (GET /v1/admin/usage/export); pond usage rollup/daily/export |
| M4 ✅ | control | budgets table (migration 0008, scope org/project/label × period day/month/rolling, action warn/block) + CRUD; admission control at submit/preflight with a recent-mean estimate (app/budgets.py); budget.alert threshold telemetry (50/80/100%, deduped); scope-aware watchdog cancels over-budget runs mid-flight; pond budget create/list/status/delete |
| M5 ✅ | chargeback-grade truth | provider-invoice import (app/reconciliation.py, cost_reconciliations, migration 0009); actual-vs-estimated true-up → a reconciliation factor per (provider, period); POST/GET /v1/admin/reconciliations, pond reconcile import/list |
M1 = showback. M1–M4 = showback + control. M5 = true chargeback. The extractor registry is in M1 on purpose — neutrality is a foundation, not a retrofit.
Invariants
- Usage is captured at the broker, never the harness. No cost code parses any agent CLI’s output; it reads model-API traffic only.
- Adding a provider is a registry entry, not a branch; an unknown provider falls back to the OpenAI-compatible extractor and degrades gracefully.
- Pricing is a rate card over open meters, not fixed columns. A new meter is a new line item on both the usage map and the rate card; cost is their dot product. Non-linear/bespoke pricing goes through a pluggable pricer, provider-reported cost, or reconciliation, never an eval’d expression in core.
- Attribution is server-stamped and unspoofable: labels, project_id, run_id, stage_key, job_id are set by pond at ingest, not by the worker.
- No untagged spend under a required-label policy: a run missing a required label key is refused.
- The worker signs its reported usage (v2 attestation): the canonical message binds a digest of the reported meter map, so a relaying orchestrator can’t alter that signed map undetected. Audit/non-repudiation only — not the billed number (which travels the unsigned telemetry path) and not financial truth (that’s reconciliation). See Known limitations.
- Cost is clamped at ≥ 0 — a mis-authored negative rate or a negative provider-reported cost can never produce negative cost, which would otherwise silently defeat the budget kill-switch and under-count spend.
- Two cost numbers stay distinct: worker-reported estimated_cost_usd (real-time, drives budgets) and provider-reconciled reconciled_cost_usd (chargeback truth). The worker’s number is never treated as financially binding on an untrusted pool.
- Usage is never silently dropped: when a provider reports nothing, the record carries usage_source = unavailable | estimated_from_chars, flagged low-confidence.
- Budgets are provider-neutral: denominated in normalized cost, spanning a mixed-provider fleet uniformly.
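The signed-usage invariant says the canonical message binds a digest of the reported meter map. The sketch below shows one way such a binding could work; the message layout, field order, and helper names are hypothetical and are not the format used in app/crypto/attestation.py.

```python
import hashlib
import json


def usage_digest(meters: dict) -> str:
    """SHA-256 over a canonical JSON encoding (sorted keys, no whitespace)."""
    canonical = json.dumps(meters, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def canonical_message(job_id: str, result_digest: str, meters: dict) -> str:
    """Hypothetical layout: the signature would cover this whole string."""
    return ":".join([job_id, result_digest, usage_digest(meters), "v2"])
```

Because the digest is part of the signed string, changing any meter after signing changes the message and invalidates the signature. That proves what the worker asserted, not that the assertion is correct.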
Known limitations & follow-ups
Surfaced by the M1–M5 verification audit; recorded here rather than hidden.
- Deploy order (attestation v1→v2). A current worker always signs :v2; a pre-M2 pond verifier only checks :v1 and would mark such results attestation_verified=False. Upgrade pond (the verifier) before workers. The reverse (new pond, old :v1 worker) verifies fine.
- Signed usage ≠ billed usage. The attestation signs the worker’s aggregate usage (non-repudiation). Budgets/rollups/reconciliation bill from the separate, unsigned /orch/.../telemetry path. Nothing yet cross-checks the two; a reconcile step (compare the attested digest to the telemetry sum, flag divergence) is a follow-up. Don’t treat signed usage as the tamper-proof bill.
- Budget spend is a live telemetry scan. spend_micros prices every event in the period in Python, per applicable budget, on the submit/preflight/watchdog paths: O(period volume). Fine at modest scale; the follow-up is rollup-backed spend (sum usage_rollups.cost_micros over closed days + a bounded current-day live scan).
- Admission is enforced at the /v1 boundary, not in runservice.submit_run. A future in-process caller bypasses admission + required-labels; the watchdog still backstops block budgets mid-flight, but warn budgets and the pre-submission estimate are not backstopped.
- Alerts piggyback on the run watchdog. Threshold alerts fire while a matching run is live; a crossing with no applicable run live (e.g. between runs) lags until the next one. Dedup is best-effort (a per-(budget, threshold, period) query, not an atomic constraint) so concurrent watchdogs can rarely double-emit a soft alert. An independent sweep is the follow-up.
- Reconciliation is a factor true-up. Spend with no registered model and no provider attr is bucketed (unknown); because the factor scales priced spend, large unpriced spend inflates the factor, so register models for an accurate true-up. The window must align to a reconciliation period for an exact number (cross-period is approximate).
- No FX. Costs sum in the rate card’s stated currency with no conversion; use one currency per deployment (USD default), since a mixed-currency fleet would sum incorrectly. Multi-currency is a follow-up.
- Budget periods reset at UTC midnight (no per-budget timezone / business calendar). rolling has no fixed period, so its alert dedup window slides.
- GIN index migrations build non-concurrently (0005, 0007+): an ACCESS EXCLUSIVE lock for the build on a large telemetry_events. Build CONCURRENTLY or use a maintenance window on a busy deployment.
- Response casing is inconsistent. /v1/admin/usage (an AgentUsageOut model) is camelCase, but /usage/daily and /usage/reconciled return raw dicts → snake_case (per_label, label_key). Harmless but a consumer wart; wrap them in aliased models to unify. (Surfaced by the live e2e shakedown.)
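The "Reconciliation is a factor true-up" limitation is easier to see with numbers. The sketch assumes the factor is invoice total divided by priced estimated spend for a (provider, period); that definition and the function names are assumptions, not the code in app/reconciliation.py.

```python
from typing import Optional


def reconciliation_factor(invoice_micros: int,
                          estimated_micros: int) -> Optional[float]:
    """Actual over estimated for one (provider, period); None if nothing priced."""
    if estimated_micros <= 0:
        return None
    return invoice_micros / estimated_micros


def reconciled(estimate_micros: int, factor: float) -> int:
    """Scale one slice's estimate by the period factor."""
    return round(estimate_micros * factor)
```

Suppose a provider invoices 100 USD for a period in which pond priced 80 USD of spend and another 20 USD of real spend came from unregistered models with no estimate. The factor is 100 / 80 = 1.25, so a team whose estimate was 40 USD is reconciled to 50 USD even though its own estimate was accurate: the unpriced spend has been spread over the priced spend.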