---
title: Rocky — System Redesign
status: Draft for review
date: 2026-05-02
author: Alex (devarno) + Claude
supersedes: docs/specs/2026-05-02-rocky-design.md (in scope, not in deletion — that doc remains the SS-07 RALPH spec)
---
Rocky — System Redesign
Purpose
Pivot rocky-hq from a single-feature Python prompt-queue tool into a superproject that delivers per-workspace operations & modelling. The existing Python tool becomes one subsystem (SS-07 RALPH) inside a multi-subsystem product whose canonical surface is the existing Next.js console previously developed at devarno-cloud/rocky (@rocky/ops).
Rocky's long-term role: every PETROVA-onboarded project gets exactly one Rocky instance, alongside one GRACE and one TRACEO instance. Rocky owns operations, modelling, prompt automation, and per-workspace knowledge-stack provisioning (1×CAIRNET + 1×LORE per workspace, tiered).
Goals
- Lift the existing Next.js `@rocky/ops` console (7 shipped subsystems, all VTM Pass) into the new structure without rewrites.
- Wrap the current Python `rocky-hq` CLI as a console subsystem (SS-07 RALPH) while keeping the standalone CLI usable.
- Add SS-08 HEARTH — a tiered per-workspace CAIRNET+LORE provisioner.
- Keep the system Airlock-tenanted: every action resolves to an authenticated Airlock session and a HATCH audit event.
- Ship as OSS by default, with cloud-only Polar.sh entitlement gates layered on top — never leaking commercial paths into self-host installs.
- Match `devarno-cloud`'s superproject pattern so PETROVA already knows how to govern Rocky.
Non-goals (this redesign)
- Rewriting any of the 7 existing console subsystems.
- Building CAIRNET, LORE, SHIELD, HATCH, or AIRLOCK — they already exist as separate services under `devarno-cloud/`. HEARTH is a deployer, not the services.
- Designing the PETROVA verb that calls HEARTH (that extension lives in `petrova-hq`).
- Designing the project-specific algo packages themselves (e.g. devarno-cloud financial forecasting). They surface through the existing SS-03 PLUGINS seam; no new subsystem needed.
- Migration of in-flight devarno-cloud tenants. We assume a clean cut where devarno-cloud's `rocky` submodule URL is repointed at `rocky-hq/console`.
Big design decisions
| Decision | Choice | Why |
|---|---|---|
| Repo shape | Superproject + submodules (mirror of devarno-cloud) | Reuses PETROVA governance, AGENTS.xml conventions, and push-down policy 0004 from devarno-cloud |
| Console source | Lift devarno-cloud/rocky → rocky-hq/console submodule, no rewrite | 7 subsystems already shipped & VTM Pass; rewriting throws away working code |
| RALPH integration | Keep Python; add `ralph serve` HTTP/worker mode; console wraps it | claude-agent-sdk Python is the mature client; standalone CLI use case stays |
| Polyglot | TS in console, Python in RALPH, Go in HEARTH | One language per subsystem, never mixed inside one repo. Go chosen for HEARTH because the k8s/Docker/kustomize ecosystem is Go-native and a single static binary is the right OSS distribution shape (rationale below) |
| Auth | Keep Airlock BetterAuth on .devarno.cloud cross-subdomain cookies | Already working in console/src/lib/auth.ts; tenancy invariant is non-negotiable |
| Audit | Every state change → HATCH event via SS-05 RELAY | "I want evidence of every Rocky use" (operator constraint) |
| Licensing | OSS (MIT or Apache-2 — TBD per submodule) | "Always design for open-source" |
| Monetisation | Polar.sh entitlements gating cloud-hosted HEARTH tiers above solo | "Always Polar.sh"; OSS solo tier always free |
| HEARTH tier model | Config-driven ProvisioningProfiles, three drivers (LocalDocker / Kustomize / DevarnoCloud) | Tier ≠ code path; drivers ≠ tiers; OSS parity requires LocalDocker |
| Failure handling for provisioning | No auto-retry; loud failures; admin re-run | Provisioning errors usually mean credential or capacity drift — silent retry hides the real problem |
Architecture
Repository layout
```
rocky-hq/                        # this repo, the superproject
  console/     → submodule → @rocky/console   (Next.js 16 + React 19, lifted from devarno-cloud/rocky)
  ralph/       → submodule → @rocky/ralph     (Python 3, current rocky-hq codebase moved here)
  hearth/      → submodule → @rocky/hearth    (NEW; Go — see "Why Go for HEARTH")
  algo/        → submodule → @rocky/algo      (project-specific modelling packages, plugged via SS-03 PLUGINS)
  contracts/   → submodule → @rocky/contracts (zod + JSON schema + proto for cross-submodule interfaces)
  docs/           (in-tree)
  registry.yaml   (in-tree; lists deployed workspaces + tiers)
  AGENTS.xml      (in-tree; PETROVA-compatible verb declarations)
  CLAUDE.md       (in-tree; MR-12 projection only)
  pyproject.toml + package.json   (minimal scaffolds for superproject-level scripts)
```
Per devarno-cloud decision 0004 (push-down policy): the parent CLAUDE.md is projection-only. Build/test/runtime instructions live in each submodule's own CLAUDE.md.
Subsystem map
| ID | Name | Lives in | State today |
|---|---|---|---|
| SS-01 | WORKBENCH | console/src/lib/workbench/ | Existing, VTM Pass |
| SS-02 | DASHBOARDS | console/src/lib/dashboards/ | Existing, VTM Pass |
| SS-03 | PLUGINS | console/src/lib/plugins/ | Existing, VTM Pass; becomes seam for ALGO packages |
| SS-04 | WORKSPACE | console/src/lib/workspace/ | Existing, VTM Pass |
| SS-05 | RELAY | console/src/lib/relay/ | Existing, VTM Pass; gains /api/relay/ralph and /api/relay/polar |
| SS-06 | VAULT | console/src/lib/vault/ | Existing, VTM Pass; stores per-workspace HEARTH credentials |
| SS-07 | RALPH | ralph/ + console/src/lib/ralph/ | Tool exists in rocky-hq today; console wrapper is NEW |
| SS-08 | HEARTH | hearth/ + console/src/lib/hearth/ | NEW |
| — | ALGO | algo/ (per-project Python packages) | Plugged through SS-03 PLUGINS; not a subsystem |
Tenancy invariant (applies to every subsystem)
No Rocky route, worker call, or driver action executes without:
1. A valid Airlock session (admin / operator / observer / denied).
2. A corresponding HATCH audit event written before the action takes effect.
3. (Cloud build only) A Polar.sh entitlement check on provisioning, upgrade, and decommission of any tier > `solo`. Per-request runtime checks are not required — the tier is enforced through the `DeploymentRef` row plus the `tier_downgrade_active_data` policy below.

Self-host (`ROCKY_BILLING=disabled`, typically with `ROCKY_AUTH=local`) skips (3) and resolves (1) via the LocalAuth adapter; (2) is non-negotiable in all builds (HATCH defaults to a local JSONL sink in self-host mode).
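A minimal sketch of how the invariant could look as a console route guard — the helper names (`getAirlockSession`, `writeHatchEvent`, `checkEntitlement`) and the `withTenancy` wrapper are illustrative assumptions, not the shipped `console/src/lib/auth.ts` API:

```ts
// Sketch only — helper names are assumed, not the shipped console API.
type Role = "admin" | "operator" | "observer" | "denied";
const RANK: Record<Role, number> = { denied: 0, observer: 1, operator: 2, admin: 3 };

interface AirlockSession { userId: string; workspaceSlug: string; role: Role }

declare function getAirlockSession(req: Request): Promise<AirlockSession | null>;
declare function writeHatchEvent(e: { kind: string; workspace: string; actor: string }): Promise<void>;
declare function checkEntitlement(workspace: string, tier: string): Promise<boolean>;

export async function withTenancy(
  req: Request,
  action: { kind: string; workspace: string; minRole: Role; tier?: string },
  run: () => Promise<Response>,
): Promise<Response> {
  const session = await getAirlockSession(req);
  if (!session || RANK[session.role] < RANK[action.minRole]) {
    // Denied requests are still audited (see the error-handling table below).
    await writeHatchEvent({ kind: "auth_denied", workspace: action.workspace, actor: session?.userId ?? "anonymous" });
    return new Response("auth_denied", { status: 403 });           // (1) Airlock session
  }
  if (process.env.ROCKY_BILLING !== "disabled" && action.tier && action.tier !== "solo") {
    const entitled = await checkEntitlement(action.workspace, action.tier);
    if (!entitled) return new Response("entitlement_missing", { status: 402 }); // (3) cloud-only gate
  }
  await writeHatchEvent({ kind: action.kind, workspace: action.workspace, actor: session.userId }); // (2) audit before effect
  return run();
}
```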
SS-07 RALPH — wrapping the Python tool
Inside ralph/ (the submodule)
The current rocky-hq Python code moves verbatim: cli.py, runner.py, agent.py, judge.py, merge_driver.py, journal.py, workspace.py, loader.py, plus tests and the existing design spec at docs/specs/2026-05-02-rocky-design.md.
One additive change: a new `ralph serve` mode.
- HTTP server (FastAPI) accepts `RunConfig` + `prompts.yaml` payloads.
- Background worker (RQ or arq — pick at implementation time; both work) executes the existing pipeline.
- Streams `run.jsonl` events as SSE.
- Auth: shared HMAC token from VAULT, never exposed to the browser; the console's API route is the only thing that talks to the worker.

The standalone `rocky run` CLI continues to work unchanged for self-hosters who don't want the console.
Inside console/src/lib/ralph/ (NEW)
```ts
// console/src/lib/ralph/client.ts
submitRun(workspace_slug: string, prompts_yaml: string, config: RunConfig): Promise<{ run_id: string }>
streamEvents(run_id: string): AsyncIterable<JournalEvent>
cancelRun(run_id: string): Promise<void>   // admin-only
```
- Artifacts (plan.md, deviation_pre/post.md, diff.patch, transcripts) → object store already used by SS-04 WORKSPACE.
- New SS-02 dashboard panel `ralph.runs`.
- New SS-05 RELAY route `/api/relay/ralph` so deviation/major events reach HATCH in addition to the per-action HATCH events.
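As an illustration, `streamEvents` could be implemented over the SSE stream via a console proxy route — the route path and the `JournalEvent` fields shown here are assumptions, not the shipped contract:

```ts
// Illustrative sketch; the proxy route name and JournalEvent shape are assumptions.
export interface JournalEvent {
  run_id: string;
  type: string;      // e.g. "plan" | "execute" | "audit" | "deviation" — assumed
  payload: unknown;
}

export async function* streamEvents(run_id: string): AsyncIterable<JournalEvent> {
  // The browser talks only to the console; the console API route holds the HMAC token.
  const res = await fetch(`/api/ralph/runs/${run_id}/events`);
  if (!res.ok || !res.body) throw new Error(`stream failed: ${res.status}`);

  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = "";
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    buffer += value;
    // SSE frames are separated by a blank line; each frame carries "data: <json>".
    const frames = buffer.split("\n\n");
    buffer = frames.pop() ?? "";
    for (const frame of frames) {
      const data = frame
        .split("\n")
        .filter((l) => l.startsWith("data:"))
        .map((l) => l.slice(5).trim())
        .join("\n");
      if (data) yield JSON.parse(data) as JournalEvent;
    }
  }
}
```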
UI surface
- New top-level page `/ralph` with `/runs`, `/runs/[id]`, `/prompts/[file]` sub-routes.
- No new design system; reuses Tailwind + radix-ui.
- Permissions: `operator` submits, `observer` views, `admin` cancels mid-run.
SS-08 HEARTH — per-workspace CAIRNET+LORE provisioning
Tiers
| Tier | Audience | CAIRNET shape | LORE shape | Polar.sh gate |
|---|---|---|---|---|
| `solo` | OSS self-host, single dev | SQLite + local FAISS, 100MB cap | Single-user, 30-day retention | none — always free |
| `team` | Small team, hosted | Postgres + pgvector, 5GB cap, 5 seats | Multi-user, 90-day retention, RBAC | rocky-team subscription |
| `studio` | Larger orgs, hosted | Postgres + pgvector, 100GB cap, unlimited seats | Multi-user, 1y retention, audit export | rocky-studio subscription |
| `bespoke` | Per-deal | Driver-defined | Driver-defined | Out-of-band invoice via Polar.sh |
A tier is a YAML row resolving to a ProvisioningProfile (resource caps + driver flags). Adding a tier never adds code paths.
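A sketch of what that looks like as data, with illustrative field names (the caps and flags mirror the table above but are not a final schema):

```ts
// Illustrative only — field names and values mirror the tier table above, not a final schema.
export interface ProvisioningProfile {
  cairnet: { store: "sqlite-faiss" | "postgres-pgvector"; storageCap: string };
  lore: { multiUser: boolean; retentionDays: number; rbac: boolean };
  seats: number | "unlimited";
  polarProduct: string | null;   // null ⇒ no entitlement gate
}

export const TIERS: Record<string, ProvisioningProfile> = {
  solo: {
    cairnet: { store: "sqlite-faiss", storageCap: "100MB" },
    lore: { multiUser: false, retentionDays: 30, rbac: false },
    seats: 1,
    polarProduct: null, // always free
  },
  team: {
    cairnet: { store: "postgres-pgvector", storageCap: "5GB" },
    lore: { multiUser: true, retentionDays: 90, rbac: true },
    seats: 5,
    polarProduct: "rocky-team",
  },
  studio: {
    cairnet: { store: "postgres-pgvector", storageCap: "100GB" },
    lore: { multiUser: true, retentionDays: 365, rbac: true },
    seats: "unlimited",
    polarProduct: "rocky-studio",
  },
  // bespoke is driver-defined per deal, so it carries no static profile row here.
};
```

Adding a row here (and its YAML counterpart) is the whole cost of a new tier; the drivers stay untouched.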
Inside hearth/ (NEW submodule, Go)
```go
// hearth/internal/driver/driver.go
type Driver interface {
	Provision(ctx context.Context, slug string, profile ProvisioningProfile) (DeploymentRef, error)
	Status(ctx context.Context, ref DeploymentRef) (Status, error)
	Upgrade(ctx context.Context, ref DeploymentRef, profile ProvisioningProfile) (DeploymentRef, error)
	Teardown(ctx context.Context, ref DeploymentRef) error
}
```
Cross-language types come from contracts/ via codegen — ProvisioningProfile, DeploymentRef, Status are defined once (zod or proto, TBD in the contracts spec) and emitted to both Go (via quicktype or buf if proto) and TypeScript.
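A sketch of the contracts, assuming the zod option is chosen (field names follow the DeploymentRef row described under "State store" below; the Status values are placeholders):

```ts
// contracts/ sketch assuming the zod option (proto is still on the table).
// Status values are placeholders; DeploymentRef fields follow the State-store description below.
import { z } from "zod";

export const Status = z.enum(["provisioning", "ready", "failed", "decommissioned"]);

export const DeploymentRef = z.object({
  workspace_slug: z.string(),
  tier: z.enum(["solo", "team", "studio", "bespoke"]),
  driver: z.enum(["LocalDocker", "Kustomize", "DevarnoCloud"]),
  endpoint: z.string().url(),
  secrets_vault_path: z.string(),
  created: z.string().datetime(),
  last_status: Status,
});

export type DeploymentRef = z.infer<typeof DeploymentRef>;
// The Go side would consume the emitted JSON Schema (e.g. via quicktype), per the codegen note above.
```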
Why Go for HEARTH
HEARTH is a coordinator: it calls Docker, k8s, and cloud APIs and writes a Postgres row. The decision is dominated by ecosystem fit and distribution shape, not by language semantics.
- Ecosystem fit. `client-go`, `docker/cli`, `kustomize`, `helm`, and `controller-runtime` (if HEARTH later evolves into an operator) are all Go-native and importable as libraries with zero shelling out. The TypeScript equivalents (`@kubernetes/client-node`, `dockerode`) work but lag in completeness and battle-testing.
- Distribution. A single static binary is the right shape for a self-hostable OSS provisioner. Node-based distribution drags in npm trees and runtime version coupling.
- Concurrency model. Goroutines fit parallel provisioning naturally; no async function colouring or worker-pool ceremony.
Rejected alternatives:
- TypeScript — would let us share the runtime with the console, but the console doesn't actually need to embed HEARTH; it talks to it over a small RPC surface. That's the right place for polyglot. Loses the ecosystem and distribution wins above.
- Rust — compile times slow iteration on a tool that'll get edited a lot. `kube-rs` and `bollard` work but are behind `client-go` / `docker/cli`. The borrow checker buys little here — the work is I/O-bound API plumbing, not the bug class Rust prevents.
- Zig — pre-1.0, std and ABI still churning. No production-grade Docker or k8s client. Zig shines where explicit memory control and no hidden allocation matter; none of HEARTH's hot paths qualify.
Drivers shipped:
- `LocalDocker` — dev/self-host. Spins up CAIRNET + LORE containers on the operator's box. Required for OSS parity.
- `Kustomize` — emits manifests for self-hosters running their own k8s; expects them to apply.
- `DevarnoCloud` — managed driver, deploys into existing devarno-cloud infrastructure (mirrors what `devarno-cloud/cairnet` and `devarno-cloud/lore` already use).
State store: DeploymentRef rows in the console's existing Postgres — {workspace_slug, tier, driver, endpoint, secrets_vault_path, created, last_status}. Rocky never stores CAIRNET/LORE contents — only deployment metadata. Driver-issued credentials land in SS-06 VAULT keyed by workspace_slug.
Inside console/src/lib/hearth/ (NEW)
```ts
provisionWorkspace(slug: string, tier: Tier): Promise<DeploymentRef>   // admin-only
getWorkspaceDeployment(slug: string): Promise<DeploymentRef | null>    // any role with workspace access
decommissionWorkspace(slug: string): Promise<void>                     // admin-only
```
getWorkspaceDeployment is the discovery seam every other subsystem uses to resolve a workspace's CAIRNET/LORE endpoints.
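For example, a consuming subsystem would resolve endpoints like this (the `queryCairnet` call and import path are placeholders for whatever client the caller actually uses):

```ts
// Illustrative consumer of the discovery seam; queryCairnet is a placeholder client.
import { getWorkspaceDeployment } from "@/lib/hearth";

declare function queryCairnet(endpoint: string, query: string): Promise<unknown>;

export async function searchWorkspaceKnowledge(slug: string, query: string) {
  const deployment = await getWorkspaceDeployment(slug);
  if (!deployment) throw new Error(`workspace ${slug} has no HEARTH deployment`);
  // Credentials come from SS-06 VAULT, keyed by workspace_slug — never from the DeploymentRef itself.
  return queryCairnet(deployment.endpoint, query);
}
```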
Polar.sh integration
- `console/src/lib/polar/entitlements.ts` is the only place Polar.sh code lives.
- Webhook → SS-05 RELAY route `/api/relay/polar` → mutates HEARTH tier on `subscription.active|canceled`.
- Self-host build sets `ROCKY_BILLING=disabled` and forces `tier=solo`. The Polar adapter compiles to a no-op when disabled. No Polar code paths leak into self-host installs.
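A sketch of the adapter shape — the real Polar API calls are deliberately elided because the SDK surface isn't pinned in this spec; only the disabled/no-op branch is shown concretely, and the cloud module name is hypothetical:

```ts
// Shape sketch only; the Polar-backed implementation and "./polar-cloud" module are hypothetical.
export interface EntitlementsAdapter {
  hasActiveSubscription(workspaceSlug: string, product: string): Promise<boolean>;
}

const disabledAdapter: EntitlementsAdapter = {
  // ROCKY_BILLING=disabled: never grants a paid tier and never touches the network.
  // Self-host forces tier=solo anyway, so this branch is only a safety net.
  async hasActiveSubscription() {
    return false;
  },
};

export async function getEntitlements(): Promise<EntitlementsAdapter> {
  if (process.env.ROCKY_BILLING === "disabled") return disabledAdapter;
  // Cloud build: load the Polar-backed adapter lazily so self-host bundles never include it.
  const { cloudAdapter } = await import("./polar-cloud"); // hypothetical module
  return cloudAdapter as EntitlementsAdapter;
}
```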
UI surface
- New admin-only page `/hearth` listing all deployments and statuses.
- Observers see only their own workspace's deployment row.
End-to-end data flow (workspace onboarding → first RALPH run)
```
1. Human edits petrova-hq/registry.yaml — adds entry with rocky_tier=team
2. PETROVA verb provision_rocky(slug, tier) → console.hearth.provisionWorkspace(slug, "team")
3. SS-08 HEARTH:
   a. Airlock check: caller has admin role on workspace
   b. Polar.sh check: workspace has active "rocky-team" entitlement (cloud build)
   c. Driver=DevarnoCloud → deploys 1×CAIRNET + 1×LORE
   d. Credentials → SS-06 VAULT
   e. DeploymentRef row → console DB
   f. SS-05 RELAY → HATCH "hearth.provisioned"
4. Operator opens /ralph/runs/new:
   a. Selects workspace (HEARTH resolves CAIRNET/LORE endpoints)
   b. Uploads prompts.yaml
   c. submitRun() POSTs to ralph worker with HMAC + workspace context
5. RALPH worker (existing pipeline):
   plan → judge → execute → audit → merge per prompt
   Events stream SSE → console → browser; mirrored to HATCH
   Artifacts → SS-04 object store
6. Run finishes: ralph.runs panel updates; deviation reports linkable from /ralph/runs/[id]
```
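The same flow condensed into a console-side sketch, using the lib signatures from the SS-07/SS-08 sections (the import paths and the `@rocky/contracts` package name are assumptions; error handling from the table below is omitted):

```ts
// Condensed sketch of steps 2–5; import paths and package names are assumptions.
import type { RunConfig } from "@rocky/contracts";
import { provisionWorkspace, getWorkspaceDeployment } from "@/lib/hearth";
import { submitRun, streamEvents } from "@/lib/ralph";

export async function onboardAndRun(slug: string, promptsYaml: string, config: RunConfig) {
  // Steps 2–3: PETROVA's provision_rocky verb ends up here; Airlock/Polar/HATCH checks happen inside.
  await provisionWorkspace(slug, "team");

  // Step 4a: resolve the workspace's CAIRNET/LORE endpoints via the discovery seam.
  const deployment = await getWorkspaceDeployment(slug);
  if (!deployment) throw new Error("provisioning did not yield a deployment");

  // Steps 4c–5: submit the run and mirror events to the UI (and, via SS-05 RELAY, to HATCH).
  const { run_id } = await submitRun(slug, promptsYaml, config);
  for await (const event of streamEvents(run_id)) {
    console.log(event);
  }
}
```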
Error handling
| Class | Where | Behaviour |
|---|---|---|
| `auth_denied` | Airlock check fails anywhere | 403, HATCH audit, no further side effects |
| `entitlement_missing` | Polar.sh check fails (cloud only) | 402-equivalent, surface upgrade CTA, HATCH audit |
| `driver_failure` | HEARTH driver errors mid-provision | Roll back partial state, mark DeploymentRef failed, retain logs, no auto-retry |
| `worker_unreachable` | Console can't reach RALPH worker | "worker offline" banner, queue submission 60s, then fail |
| `pipeline_deviation` | RALPH judge marks major | Existing on-failure policy (stop \| skip \| realign) — unchanged |
| `tier_downgrade_active_data` | Polar cancels, workspace exceeds new caps | Read-only mode 7 days, then HATCH alert + admin decommission flow — never silent data loss |
Every error class produces a HATCH event. Provisioning never auto-retries — failures are loud, admin re-runs explicitly.
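As a sketch, the classes could be modelled as one discriminated union so RELAY maps them to HATCH events uniformly (field names beyond the class identifiers, and the HATCH payload shape, are assumptions):

```ts
// Sketch: error classes as a discriminated union; fields beyond `class` are assumptions.
export type RockyError =
  | { class: "auth_denied"; workspace: string }
  | { class: "entitlement_missing"; workspace: string; requiredTier: string }
  | { class: "driver_failure"; workspace: string; driver: string; logsRef: string }
  | { class: "worker_unreachable"; workspace: string; queuedForSeconds: number }
  | { class: "pipeline_deviation"; workspace: string; runId: string; severity: "major" }
  | { class: "tier_downgrade_active_data"; workspace: string; readOnlyUntil: string };

export function toHatchEvent(err: RockyError) {
  // Every error class produces a HATCH event; provisioning errors are never auto-retried.
  return { kind: `rocky.error.${err.class}`, workspace: err.workspace, detail: err };
}
```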
Testing strategy
Three tiers (mirrors RALPH's existing pattern)
1. Unit (per submodule, fast, no network)
   - `console`: existing vitest suite stays. New unit tests for `console/src/lib/ralph/`, `console/src/lib/hearth/`, `console/src/lib/polar/` (a vitest sketch of the Polar no-op path follows this list).
   - `ralph` (Python): existing pytest suite stays. Add tests for the `ralph serve` HTTP layer with FastAPI's `TestClient`.
   - `hearth`: driver protocol contract tests against a `FakeDriver` that records calls.
2. Integration (per submodule, real local services)
   - `console`: existing 21+ cross-subsystem integration tests stay.
   - `hearth`: `LocalDocker` driver against testcontainers — actually spins up CAIRNET + LORE images, exercises `provision`/`status`/`teardown`. Gated on `ROCKY_HEARTH_INTEGRATION=1`.
   - Superproject e2e: `tests/e2e/test_provision_then_ralph.py` provisions a `solo` workspace via LocalDocker, submits the existing 4-prompt mock RALPH run, asserts the full event chain reaches a fake HATCH sink.
3. Smoke (manual, gated, real cloud)
   - `ROCKY_SMOKE=1` runs one provision against the `DevarnoCloud` driver in a sandbox tenant, then tears down. Catches driver/credential drift.
   - Existing RALPH smoke (`ROCKY_SMOKE=1` for the SDK) keeps running.
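For instance, a vitest unit test guarding the self-host no-op path might look like this (it leans on the `getEntitlements` sketch from the Polar.sh section; the import alias and env handling are simplified assumptions):

```ts
// Vitest sketch: the self-host build must never touch Polar.sh.
import { describe, it, expect, vi } from "vitest";
import { getEntitlements } from "@/lib/polar/entitlements"; // path per the Polar.sh section

describe("polar entitlements (self-host)", () => {
  it("compiles to a no-op when ROCKY_BILLING=disabled", async () => {
    vi.stubEnv("ROCKY_BILLING", "disabled");
    const fetchSpy = vi.spyOn(globalThis, "fetch");

    const entitlements = await getEntitlements();
    await entitlements.hasActiveSubscription("some-workspace", "rocky-team");

    // No network call, no Polar module loaded — the OSS parity invariant.
    expect(fetchSpy).not.toHaveBeenCalled();
    vi.unstubAllEnvs();
  });
});
```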
CI shape
Mirrors devarno-cloud:
- Parent `rocky-hq` CI runs a `git submodule status` consistency check + the e2e test only.
- Each submodule has its own CI for its own unit + integration tests.
OSS test parity invariant
The solo tier + LocalDocker driver + LocalAuth adapter path must pass the full e2e test with no Polar.sh network calls and no devarno-cloud-tenant credentials. If a self-hoster's CI breaks on the e2e test, we broke OSS.
Licensing
- Per-submodule licensing:
  - `console`, `ralph`, `hearth`, `algo` → MIT (matches the lightweight, infrastructure-tool ethos).
  - `contracts` → Apache-2.0 (explicit patent grant for redistributable schemas/IDL).
- The cloud-hosted rocky.erid.tech deployment is the only Rocky surface that requires Polar.sh entitlements. The OSS build is fully featured, just without billing.
PETROVA hookup (out of scope for this spec, sketched only)
petrova-hq/registry.yaml schema gains:
```yaml
integrations_applicability:
  ...existing...
  rocky: required | optional | not_applicable
  rocky_tier: solo | team | studio | bespoke   # only when rocky != not_applicable
```
PETROVA gains a provision_rocky verb that calls console.hearth.provisionWorkspace with the registry's declared tier. Verb design lives in a separate PETROVA spec.
KAHN integration (operator-facing observability)
KAHN (kahn-hq) is the operator's existing fleet-observability product (PRODUCT.md: "agent-fleet observability tool"). It already ships a vendored producer→consumer contract at kahn-hq/contracts/: transitions.schema.json, graph.schema.json, and a stdlib-only kahn_emit.py (additive-compatible; sister repos vendor it directly per contracts/README.md). Rocky does not duplicate this surface.
- Phase 3 (lift-ralph) commitment: Rocky RALPH adopts KAHN's `transitions.schema.json` as its on-disk event format. Vendor `kahn_emit.py` into `ralph/` per KAHN's recommended consumption path. Each Rocky run becomes one KAHN run; each prompt becomes a node; plan/execute/audit are node attempts; deviation severity maps onto KAHN's `Outcome` enum (clean | clean_with_flake | partial | stuck | catastrophic) — a hedged mapping sketch follows this list. Rocky's deviation-detection metadata rides as KAHN's documented "unknown fields pass through" extension.
- Operator UX: Rocky runs surface in KAHN Scope alongside Choco/STRATT/Traceo runs; KAHN Scope is the canonical fleet dashboard. Rocky's console SS-02 DASHBOARDS continues to own the ops-and-modelling views (workbench history, plugin registry, vault audit) that KAHN intentionally does not cover.
- What we explicitly DO NOT do here: wrap KAHN's orchestrator as Rocky's runtime. Rocky RALPH stays a sequential prompt-queue with plan/judge/execute/audit/merge; KAHN's prompt-DAG orchestrator solves an adjacent problem (fleet-wide convergence). Re-evaluate post-Phase-3 once we have real Rocky-on-KAHN-emit telemetry; do not unify on speculation.
- Decision record: the rationale and the rename of SS-08 (FLEET → HEARTH) to avoid the namespace collision are recorded in `docs/decisions/0002-rename-fleet-to-hearth.md`.
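A hedged sketch of the severity → `Outcome` mapping referenced in the Phase-3 bullet — the RALPH deviation levels and the mapping itself are assumptions to be checked against `kahn-hq/contracts/` during Phase 3:

```ts
// Hedged sketch only: RALPH deviation levels are assumptions (only "major" is named in this spec),
// and anything beyond the Outcome values quoted above must be checked against kahn-hq/contracts/.
type KahnOutcome = "clean" | "clean_with_flake" | "partial" | "stuck" | "catastrophic";

type DeviationSeverity = "none" | "minor" | "major" | "aborted"; // assumed RALPH-side levels

export function toKahnOutcome(severity: DeviationSeverity, merged: boolean): KahnOutcome {
  if (severity === "none") return "clean";
  if (severity === "minor") return merged ? "clean_with_flake" : "partial";
  if (severity === "major") return merged ? "partial" : "stuck";
  return "catastrophic"; // aborted runs
}
```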
Phasing (rough order, fully detailed in the implementation plan)
- Stand up the superproject scaffold: empty submodule pointers, `AGENTS.xml`, the `CLAUDE.md` projection, `registry.yaml`.
- Lift `devarno-cloud/rocky` → `rocky-hq/console` (new repo; history preserved via `git subtree split` — see Resolved decisions below).
- Move the current `rocky-hq` Python code → `rocky-hq/ralph`. Add `ralph serve`. Wire console SS-07. Adopt KAHN's `transitions.schema.json` + vendor `kahn_emit.py` as the RALPH on-disk event format.
- Stand up `rocky-hq/contracts` with the cross-submodule schemas needed by SS-07.
- Build `rocky-hq/hearth` with the `LocalDocker` driver only. Wire console SS-08. Ship the e2e test.
- Add the `Kustomize` and `DevarnoCloud` drivers.
- Wire Polar.sh entitlements + `/api/relay/polar`.
- (PETROVA, separate spec) Extend the registry schema and add the `provision_rocky` verb.
Resolved decisions (formerly open questions)
- Submodule history → `git subtree split`. Preserves commit history and the VTM Pass lineage. Slightly slower lift, materially more credible provenance for the 7 already-shipped subsystems.
- HEARTH implementation language → Go. Rationale in "Why Go for HEARTH" above.
- `contracts/` license → Apache-2.0. All other submodules are MIT. `contracts/` is the only repo whose artifacts (zod schemas, proto, generated clients) get redistributed downstream by consumers, where the explicit patent grant in Apache-2 is the standard-of-care choice for schema/IDL libraries.
- `algo/` shape → one repo with namespaced packages: `@rocky/algo-devarno-finance`, `@rocky/algo-foo`, etc. PETROVA governs one slug; the package boundary is the per-project unit. Re-evaluate at ≥3 consumers — split if individual project teams need independent release cadence.
- OSS solo identity → `LocalAuth` adapter. Self-host with `ROCKY_AUTH=local` resolves the Airlock session to a local single-user identity (file-backed). The tenancy invariant still holds; the HATCH event sink defaults to a local JSONL file in this mode. Cloud build (`ROCKY_AUTH=airlock`) is unchanged.