title: Phase 3e — CRDT WS transport + SS-04 ralph run-history
date: 2026-05-04
status: Accepted
phase: 3e
spec: docs/specs/2026-05-02-rocky-system-redesign.md §SS-07, §Phasing-3
predecessor: docs/specs/2026-05-03-rocky-console-ralph-wrapper-phase-3d.md
Phase 3e — CRDT WS transport + SS-04 ralph run-history
Console-only design (no ralph/ changes; no parent code beyond a submodule pointer bump and the MILESTONES close-row).
1. Goal
Close the two carry-over items deferred from Phase 3d so Phase 3 can be declared complete:
- Replace the 501 stub at `console/src/app/api/ralph/prompts/crdt/[...path]/route.ts:26` with real bidirectional Yjs sync, satisfying gate G3 (CRDT publish) end-to-end.
- Replace the hardcoded-zero `getRalphRunsStub` at `console/src/lib/dashboards/data.ts:54-65` with a live SS-04 read-through aggregator over `RalphClient.listRuns()`, so the `dash/ralph-runs` panel renders true 24-hour activity.
No spec scope beyond these two items remains in §Phasing-3 (KAHN adoption + SS-07 wiring shipped in 3b/3c/3d).
2. Locked decisions
| # | Decision | Rationale |
|---|---|---|
| D1 | CRDT transport: separate Node sidecar. New process `console/scripts/crdt-server.mjs` running y-websocket's `setupWSConnection` on `127.0.0.1:${RALPH_CRDT_PORT ?? 8766}`, supervised by the Next dev process the same way `RALPH_TRANSPORT=sidecar` supervises `ralph serve`. | Next.js 16 still has no stable Route-Handler WebSocket primitive. Coupling the phase close to upstream framework cadence is open-ended; a bounded ~200-line Node sidecar is the cheapest path. The sidecar is retire-able later (no schema change); the `crdt.ts` registry contract is the seam. |
| D2 | Transport env knob: `RALPH_CRDT_TRANSPORT=sidecar\|remote\|disabled`. `disabled` keeps the current 501 and the editor renders an "offline" banner. | Mirrors the SS-07 `RALPH_TRANSPORT` shape from 3d. `disabled` is the cloud-default until a managed CRDT service is provisioned in Phase 6+. |
| D3 | Yjs persistence: per-prompt `.crdt` sidecar next to `.fp`. Path: `<workspace>/prompts/<file>.yaml.crdt`. Debounced idle-write (5 s after last update). | Survives sidecar restarts without a database. `.crdt` joins `.yaml` and `.yaml.fp` as a third sidecar in the prompts tree; gitignored by default (binary, transient). |
| D4 | WS auth: bearer over `Sec-WebSocket-Protocol`. Reuse `mintBearer(session, workspace_slug)` from `console/src/lib/ralph/auth.ts`; same migration seam as the HTTP routes. | y-websocket supports subprotocol-as-bearer natively. No new auth surface. |
| D5 | Role gate: operator+admin only on WS handshake. Observer denied at the upgrade. | Matches the §SS-07 role table from 3d. Editing a CRDT doc is editing a prompt. |
| D6 | Run-history aggregator: SS-04 read-through, 30 s in-memory TTL. Lives at `console/src/lib/workspace/ralph-runs.ts`. No DB, no persistence; `ralph serve`'s journal is the source of truth. | Per redesign §Subsystem map: SS-02 owns dashboard definitions and rendering; SS-04 owns workspace-scoped read models. Mirrors how `council.ts` (SS-04) feeds `data.ts` (SS-02) for `council:debates:recent`. |
| D7 | Outcome fold: KAHN `Outcome` enum (`clean\|clean_with_flake\|partial\|stuck\|catastrophic`) → pass (first two) / fail (last three) for `pass_rate` math. KILN's optional `convergence_score` and `early_stop_reason` pass through as no-op slots when absent (forward-compat invariant from 3d D3). | Pass-rate is a binary metric; the five-valued enum needs an explicit fold, and `clean_with_flake` is "pass with a known retry", so it is counted as pass. |
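
The D7 fold is small enough to sketch as a pure function. The `Outcome` values are the five listed in D7; the helper names `foldOutcome` and `passRate`, and the choice to report an empty window as 0, are illustrative assumptions rather than names from the codebase.

```typescript
// Outcome values come from D7; the helper names are hypothetical.
type Outcome = "clean" | "clean_with_flake" | "partial" | "stuck" | "catastrophic";

// D7: the first two outcomes count as pass, the last three as fail.
function foldOutcome(outcome: Outcome): "pass" | "fail" {
  return outcome === "clean" || outcome === "clean_with_flake" ? "pass" : "fail";
}

// Pass rate in the range 0..1. Reporting 0 for an empty window is an assumption.
function passRate(outcomes: Outcome[]): number {
  if (outcomes.length === 0) return 0;
  const passes = outcomes.filter((o) => foldOutcome(o) === "pass").length;
  return passes / outcomes.length;
}
```

For example, a window containing one `clean`, one `clean_with_flake`, one `partial` and one `stuck` run folds to two passes out of four.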
3. Architecture
Browser (operator/admin session, Airlock cookie)
        │
        │ HTTP                              WebSocket
        │ /api/ralph/prompts/...            ws://localhost:8766/<slug>/<path>
        ▼                                           ▼
┌──────────────────────────┐         ┌──────────────────────────────┐
│ Next.js console          │         │ CRDT sidecar (Node)          │
│ (existing routes from    │ ←──→    │ console/scripts/crdt-server  │
│ Phase 3d)                │ shared  │ • y-websocket setupWSConn    │
│                          │ Y.Doc   │ • role check on handshake    │
│ /api/.../crdt/[...path]  │ registry│ • bearer verify              │
│ → 307 to ws://           │ (mem)   │ • debounced .crdt persist    │
│ or 501 if disabled       │         │ • HATCH session events       │
└──────────────────────────┘         └──────────────────────────────┘
                                                    │
                                                    ▼
                                ┌──────────────────────────────────────┐
                                │ <workspace>/prompts/<file>.yaml      │
                                │ <workspace>/prompts/<file>.yaml.fp   │
                                │ <workspace>/prompts/<file>.yaml.crdt │ ← new in 3e
                                └──────────────────────────────────────┘
Browser (any role)
        │
        │ /dashboards/ralph-runs
        ▼
┌──────────────────────────┐       ┌──────────────────────────┐
│ SS-02 dashboards/data.ts │ ───── │ SS-04 workspace/         │
│ fetchPanelData()         │ calls │ ralph-runs.ts            │
│ ralph:runs:* →           │       │ aggregateRuns24h(slug)   │
│ getRalphRunsLive()       │       │ • 30s TTL cache          │
│ (was getStub)            │       │ • outcome fold (D7)      │
└──────────────────────────┘       └────────────┬─────────────┘
                                                │
                                                ▼ RalphClient.listRuns()
                                       ┌──────────────────┐
                                       │ ralph serve      │
                                       │ (unchanged 3c)   │
                                       └──────────────────┘
4. CRDT sidecar surface
Process. node console/scripts/crdt-server.mjs. Reads RALPH_CRDT_PORT, RALPH_CRDT_TRANSPORT, and the same VAULT-resolved bearer the HTTP path uses. Single-process, single-port; no clustering (cluster mode is a Phase-6 cloud concern).
Endpoint shape. ws://<host>:<port>/<workspace_slug>/<encodeURIComponent(prompt_path)>. The path-shape is parsed inside setupWSConnection's docName extractor; no Express, no router.
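
As a rough illustration of the endpoint shape, the doc-name extraction could look like the sketch below. `parseDocTarget` and `DocTarget` are hypothetical names, and returning `null` for malformed input is an assumption about how the sidecar would reject bad paths.

```typescript
// Hypothetical helper: splits "/<workspace_slug>/<encoded prompt_path>" into its parts.
interface DocTarget {
  slug: string;
  promptPath: string;
}

function parseDocTarget(requestUrl: string): DocTarget | null {
  // Drop any query string, then the leading slash.
  const queryStart = requestUrl.indexOf("?");
  const rawPath = queryStart === -1 ? requestUrl : requestUrl.slice(0, queryStart);
  const pathname = rawPath.replace(/^\//, "");

  // The slug is everything before the first "/"; the rest is the encoded prompt path.
  const separator = pathname.indexOf("/");
  if (separator <= 0 || separator === pathname.length - 1) return null;

  const slug = pathname.slice(0, separator);
  const encodedPath = pathname.slice(separator + 1);
  try {
    return { slug, promptPath: decodeURIComponent(encodedPath) };
  } catch {
    return null; // malformed percent-encoding
  }
}
```

Because the client applies `encodeURIComponent` to the prompt path, any `/` inside it arrives as `%2F`, so the first literal `/` after the slug is an unambiguous separator.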
Handshake.
- Client sends `Sec-WebSocket-Protocol: bearer.<token>` (subprotocol form so y-websocket's client can pass it through).
- Server verifies bearer against the static HMAC from VAULT (identical check to HTTP routes).
- Server resolves Airlock session → role; rejects on `observer` with a 1008 close code and a sub-protocol response.
- On accept, server fetches the Y.Doc from the shared registry (`getDoc(slug, prompt_path)` from `console/src/lib/ralph/crdt.ts`) and binds the WS to it.
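
The handshake checks can be sketched as a pure function over the header value, the expected bearer and the resolved role. All names here (`extractBearer`, `constantTimeEqual`, `checkHandshake`) are hypothetical. Closing with 1008 on a bad bearer is an assumption; the text above only specifies 1008 for the observer case.

```typescript
type Role = "observer" | "operator" | "admin";

type HandshakeResult =
  | { ok: true; protocol: string }
  | { ok: false; closeCode: 1008; reason: string };

const PREFIX = "bearer.";

// Pull the token out of a Sec-WebSocket-Protocol value such as "bearer.<token>".
// The header may carry several comma-separated subprotocols.
function extractBearer(protocolHeader: string | undefined): string | null {
  if (!protocolHeader) return null;
  for (const candidate of protocolHeader.split(",")) {
    const protocol = candidate.trim();
    if (protocol.startsWith(PREFIX) && protocol.length > PREFIX.length) {
      return protocol.slice(PREFIX.length);
    }
  }
  return null;
}

// Comparison that does not exit early on the first differing character.
// In Node, crypto.timingSafeEqual is the usual choice; this keeps the sketch dependency-free.
function constantTimeEqual(a: string, b: string): boolean {
  if (a.length !== b.length) return false;
  let diff = 0;
  for (let i = 0; i < a.length; i++) diff |= a.charCodeAt(i) ^ b.charCodeAt(i);
  return diff === 0;
}

function checkHandshake(
  protocolHeader: string | undefined,
  expectedBearer: string,
  role: Role,
): HandshakeResult {
  const token = extractBearer(protocolHeader);
  if (token === null || !constantTimeEqual(token, expectedBearer)) {
    // Assumption: bearer failures use the same policy-violation code as role failures.
    return { ok: false, closeCode: 1008, reason: "invalid bearer" };
  }
  if (role === "observer") {
    return { ok: false, closeCode: 1008, reason: "observer role cannot edit prompts" };
  }
  // Echo the accepted subprotocol back so the client handshake completes.
  return { ok: true, protocol: `${PREFIX}${token}` };
}
```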
Persistence. Every Y.Doc maintains an update event handler that schedules a debounced (5 s) persist(slug, prompt_path) call:
// console/src/lib/ralph/crdt.ts (extended)
export declare function persist(slug: string, promptPath: string): Promise<void>;
Implementation: Y.encodeStateAsUpdate(doc) → <workspace>/prompts/<file>.yaml.crdt via temp-file + rename. On sidecar boot, getDoc() first attempts to seed from the .crdt file via Y.applyUpdate(doc, fs.readFileSync(...)).
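
A minimal sketch of the persistence mechanics, under stated assumptions: `writeAtomic`, `readSeed`, `debounce` and `makePersister` are hypothetical helpers, and `encode` stands in for a call such as `() => Y.encodeStateAsUpdate(doc)` so the example does not depend on Yjs itself.

```typescript
import { existsSync, readFileSync, renameSync, writeFileSync } from "node:fs";

// Write to a temp file next to the target, then rename over it, so a reader
// never observes a half-written .crdt file.
function writeAtomic(target: string, bytes: Uint8Array): void {
  const tmp = `${target}.${process.pid}.tmp`;
  writeFileSync(tmp, bytes);
  renameSync(tmp, target);
}

// Returns the stored update bytes, or null when no .crdt file exists yet.
// The caller would pass a non-null result to Y.applyUpdate(doc, bytes).
function readSeed(target: string): Uint8Array | null {
  return existsSync(target) ? new Uint8Array(readFileSync(target)) : null;
}

// Each call restarts the timer; `flush` runs once `idleMs` pass without another call.
function debounce(flush: () => void, idleMs: number): () => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return () => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(flush, idleMs);
  };
}

// Wiring: call the returned function from the Y.Doc "update" handler.
// 5000 ms matches the idle-write window in D3.
function makePersister(target: string, encode: () => Uint8Array, idleMs = 5000): () => void {
  return debounce(() => writeAtomic(target, encode()), idleMs);
}
```

Keeping the temp file in the same directory as the target matters because a rename is only atomic within one filesystem.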
Lifecycle.
- The Next.js dev process supervises the sidecar (parallel to the existing `ralph serve` supervisor): spawn on boot, restart on crash with exponential backoff capped at 30 s, kill on SIGTERM.
- Production deploy uses an external supervisor (systemd/PM2/k8s); this is out of scope for 3e and documented as a deployment note.
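
The restart schedule can be illustrated with a one-line delay function. Only the 30 s cap comes from the text above; the 500 ms base delay and the name `restartDelayMs` are assumptions made for the sketch.

```typescript
// Exponential backoff capped at 30 s. The 500 ms base delay is an assumption.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 30000;

// consecutiveCrashes is 1 for the first crash since the last healthy run.
function restartDelayMs(consecutiveCrashes: number): number {
  const exponent = Math.max(0, consecutiveCrashes - 1);
  return Math.min(MAX_DELAY_MS, BASE_DELAY_MS * 2 ** exponent);
}
```

With these constants the delays run 500 ms, 1 s, 2 s, 4 s, 8 s, 16 s and then stay at 30 s; a supervisor would typically reset the crash counter once the sidecar has stayed up for a while.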
5. CRDT route changes
console/src/app/api/ralph/prompts/crdt/[...path]/route.ts:
- `RALPH_CRDT_TRANSPORT === "disabled"` → return current 501 unchanged.
- `RALPH_CRDT_TRANSPORT === "sidecar"` → return `307` redirect to `ws://127.0.0.1:${RALPH_CRDT_PORT ?? 8766}/<slug>/<encoded path>` with original query preserved.
- `RALPH_CRDT_TRANSPORT === "remote"` → return `307` redirect to `${RALPH_CRDT_URL}/<slug>/<encoded path>`.
The route still does the role gate up front (observer rejected with 403) so we never leak a redirect to an unauthorised client.
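
The branching above, with the role gate applied first, might be expressed as a pure decision function that the route handler then turns into a response. `decideCrdtRoute`, `RouteInput` and `RouteDecision` are hypothetical names. Two further assumptions: observers receive 403 even when the transport is `disabled`, and a `remote` transport with no `RALPH_CRDT_URL` set falls back to 501.

```typescript
type Transport = "sidecar" | "remote" | "disabled";
type Role = "observer" | "operator" | "admin";

interface RouteInput {
  transport: Transport;
  role: Role;
  slug: string;
  promptPath: string;
  query: string; // original query string including "?", or "" when absent
  port?: string; // value of RALPH_CRDT_PORT, if set
  remoteUrl?: string; // value of RALPH_CRDT_URL, if set
}

type RouteDecision = { status: 403 | 501 } | { status: 307; location: string };

function decideCrdtRoute(input: RouteInput): RouteDecision {
  // Role gate first, so an unauthorised client never learns the redirect target.
  if (input.role === "observer") return { status: 403 };
  if (input.transport === "disabled") return { status: 501 };

  const docPath = `${input.slug}/${encodeURIComponent(input.promptPath)}`;
  if (input.transport === "sidecar") {
    const port = input.port ?? "8766";
    return { status: 307, location: `ws://127.0.0.1:${port}/${docPath}${input.query}` };
  }

  // transport === "remote". Assumption: a missing URL is treated like "disabled".
  if (!input.remoteUrl) return { status: 501 };
  return { status: 307, location: `${input.remoteUrl}/${docPath}` };
}
```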
6. Editor wiring
console/src/lib/ralph/prompts.ts:
import { WebsocketProvider } from "y-websocket";

const transport = process.env.NEXT_PUBLIC_RALPH_CRDT_TRANSPORT ?? "sidecar";
if (transport === "disabled") {
  // render offline banner; editor stays editable but uncoordinated
} else {
  const url = transport === "remote"
    ? process.env.NEXT_PUBLIC_RALPH_CRDT_URL!
    : `ws://${location.hostname}:${process.env.NEXT_PUBLIC_RALPH_CRDT_PORT ?? 8766}`;
  new WebsocketProvider(url, `${slug}/${encodeURIComponent(promptPath)}`, doc, {
    protocols: [`bearer.${bearer}`],
  });
}
Bearer is fetched once at editor mount via a new GET /api/ralph/crdt/token route that returns the static HMAC for operator/admin and 403 otherwise (mirrors the 3d pattern of never embedding the bearer in client bundles).
The "offline" banner ("Collaborative editing offline — your changes will save but not sync") is a small <Alert variant="warning"> rendered when the provider is null.
7. SS-04 run-history aggregator
console/src/lib/workspace/ralph-runs.ts (new):
export interface RalphRunsAgg24h {
  total: number;
  passRate: number; // 0–1
  meanAttempts: number;
  trailingPoints: { x: string; y: number }[]; // hourly buckets
  // KILN-optional, surfaced only when present:
  meanConvergenceScore?: number;
  earlyStopBreakdown?: Record<string, number>;
}

export declare function aggregateRuns24h(slug: string): Promise<RalphRunsAgg24h>;
Implementation.
- Cache lookup keyed by `slug`. If hit < 30 s old, return.
- `getRalphClient().listRuns(slug, { since: now - 24h })`.
- Fold per D7. KILN fields are summed/averaged only over runs that carry them.
- Bucket runs into hour-of-day for `trailingPoints`.
- Cache + return.
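
The cache and bucketing steps above can be sketched as follows. `TtlCache` and `bucketByHour` are hypothetical names. The sketch assumes that `y` is the run count per bucket and that `x` is the ISO timestamp of the bucket's UTC hour start; the source does not specify either. The clock is injected so that expiry can be tested without waiting.

```typescript
// In-memory TTL cache keyed by workspace slug.
class TtlCache<T> {
  private readonly entries = new Map<string, { value: T; storedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined when absent or older than the TTL.
  get(key: string): T | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (this.now() - hit.storedAt >= this.ttlMs) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, storedAt: this.now() });
  }
}

const HOUR_MS = 3600000;

// Counts runs per UTC hour and returns the buckets in chronological order.
function bucketByHour(startedAtMs: number[]): { x: string; y: number }[] {
  const counts = new Map<number, number>();
  for (const t of startedAtMs) {
    const bucketStart = Math.floor(t / HOUR_MS) * HOUR_MS;
    counts.set(bucketStart, (counts.get(bucketStart) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => a[0] - b[0])
    .map(([bucketStart, count]) => ({ x: new Date(bucketStart).toISOString(), y: count }));
}
```

An aggregator built on these would call `cache.get(slug)` first, and only on a miss call `listRuns`, fold and bucket the result, then `cache.set(slug, result)`.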
Failure mode. If getRalphClient() is unreachable (sidecar down at boot), aggregateRuns24h throws RalphUnavailable. getRalphRunsLive in data.ts catches it and falls back to the existing getRalphRunsStub so dev-without-sidecar still renders.
console/src/lib/dashboards/data.ts:
- `getRalphRunsStub` is renamed to `getRalphRunsFallback` and kept.
- `getLiveData` calls `getRalphRunsLive(panel)`, which wraps `aggregateRuns24h` and shapes the result to the panel's data type (`NumberData` for the three current panels).
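
The live-then-fallback behaviour can be sketched generically. `liveOrFallback`, `isRalphUnavailable` and the `PanelValue` shape are hypothetical stand-ins for `getRalphRunsLive`, its error check and `NumberData`. Rethrowing errors other than `RalphUnavailable` is an assumption.

```typescript
class RalphUnavailable extends Error {
  constructor(message: string) {
    super(message);
    this.name = "RalphUnavailable";
  }
}

// Matching on the error name keeps the check working even if the class
// is loaded from more than one module instance.
function isRalphUnavailable(err: unknown): boolean {
  return err instanceof Error && err.name === "RalphUnavailable";
}

// Hypothetical stand-in for the panel's NumberData shape.
interface PanelValue {
  value: number;
}

async function liveOrFallback(
  live: () => Promise<PanelValue>,
  fallback: () => PanelValue,
): Promise<PanelValue> {
  try {
    return await live();
  } catch (err) {
    if (isRalphUnavailable(err)) return fallback();
    throw err; // assumption: unrelated failures should still surface
  }
}
```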
8. HATCH events emitted
| Event | When | Payload |
|---|---|---|
| `prompt.crdt.session.opened` | After WS handshake accept | actor, workspace, prompt_path, session_id |
| `prompt.crdt.session.closed` | On WS close (any reason) | actor, workspace, prompt_path, session_id, reason, duration_s |
Body content is never logged (3d invariant). The existing prompt.edited event continues to fire on the HTTP save (which is now the CRDT publish snapshot per gate G3); no change there.
Both new events route through /api/relay/ralph (the SS-05 RELAY route from 3d). The sidecar reaches the relay by HTTP loopback to the Next.js process, with the same bearer.
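
For illustration, the two payloads could be typed as below, with the closed event derived from the opened one. The field names come from the table above. The `event` discriminator field, the helper name `toClosedEvent` and rounding `duration_s` to whole seconds are assumptions.

```typescript
interface SessionOpened {
  event: "prompt.crdt.session.opened";
  actor: string;
  workspace: string;
  prompt_path: string;
  session_id: string;
}

interface SessionClosed {
  event: "prompt.crdt.session.closed";
  actor: string;
  workspace: string;
  prompt_path: string;
  session_id: string;
  reason: string;
  duration_s: number;
}

// Builds the closed payload from the opened one. No document body is carried,
// in line with the invariant that body content is never logged.
function toClosedEvent(
  opened: SessionOpened,
  openedAtMs: number,
  closedAtMs: number,
  reason: string,
): SessionClosed {
  return {
    event: "prompt.crdt.session.closed",
    actor: opened.actor,
    workspace: opened.workspace,
    prompt_path: opened.prompt_path,
    session_id: opened.session_id,
    reason,
    // Assumption: whole seconds, rounded to nearest.
    duration_s: Math.round((closedAtMs - openedAtMs) / 1000),
  };
}
```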
9. Testing
- Vitest unit (registry + persistence): `crdt.ts` round-trip: `getDoc` → mutate → `persist` → re-`getDoc` from a fresh registry → state matches.
- Vitest unit (aggregator): fixture KAHN transitions covering all five `Outcome` values; verify pass-rate fold per D7; verify KILN-on / KILN-off shapes both produce a valid `RalphRunsAgg24h`.
- Vitest unit (cache): cache TTL deduplicates within 30 s; expires after; spy on `listRuns` confirms call count.
- Playwright e2e (CRDT convergence): two browser contexts open the same prompt; type concurrently; both observe convergence within 500 ms; saved fingerprint reflects converged state. Extends `console/tests/e2e/ralph-prompts.spec.ts`.
- Playwright e2e (run history): seed three runs via the sidecar with mixed outcomes; visit `/dashboards/ralph-runs`; assert the three panels render non-zero values matching the seed. New file `console/tests/e2e/ralph-runs.spec.ts`.
- Manual: kill the CRDT sidecar mid-edit, confirm editor surfaces "offline" banner and recovers on restart; restart Next.js, confirm prompt edits survive (Yjs `.crdt` re-seed).
10. Deferred / out-of-scope
| Item | Phase |
|---|---|
| Cluster-mode CRDT sidecar (multi-instance + shared persistence) | Phase 6 (Hearth Kustomize). Single-instance is fine for self-host and current cloud scale. |
| Per-workspace scoped JWT for CRDT WS bearer | Auth-follow-up phase; same migration seam as 3d (mintBearer). |
| Server-side run-history aggregation across workspaces (cross-tenant ops view) | Phase 4+ once contracts stabilise. |
| `next lint` deprecation fix | Tracked separately; pre-existing, not a 3e regression. |
11. AUTH continuity
No change from 3d. Static HMAC from VAULT secrets/ralph/serve_token is the bearer for both HTTP and WS surfaces. The CRDT sidecar reads VAULT identically to the Next.js process.
12. Observability
- CRDT sidecar logs to `RALPH_CRDT_LOG_PATH` (defaults to `.rocky/crdt-sidecar.log`; `.rocky/` already gitignored).
- Counters surfaced in `/api/healthz` (additive to the existing health route): `crdt_active_sessions`, `crdt_docs_loaded`, `ralph_runs_cache_hits`, `ralph_runs_cache_misses`.
13. Acceptance criteria
A Phase 3e PR is complete when:
- Two operator sessions editing the same prompt converge within 500 ms; observer role refused at WS handshake with a clear close-code reason.
- CRDT sidecar boots from `npm run dev`, dies gracefully on Next.js shutdown, leaves no orphan process (verified via `ps` after `^C`).
- Prompt edits survive a CRDT-sidecar restart (Yjs persistence to `.crdt` sidecar verified by killing and restarting the sidecar mid-edit).
- HATCH emits `prompt.crdt.session.opened` and `prompt.crdt.session.closed`; body content not logged in either.
- `dash/ralph-runs` panels render real 24h aggregates against fixture runs; both KILN-on and KILN-off event shapes render without panel error.
- Aggregator cache deduplicates within 30 s; cache miss after expiry verified by spy.
- `RALPH_CRDT_TRANSPORT=disabled` returns the existing 501 and the editor renders the "Collaborative editing offline" banner without crashing.
- `RALPH_CRDT_TRANSPORT=sidecar` and `RALPH_CRDT_TRANSPORT=remote` both pass the convergence e2e.
- `npm run typecheck` clean; `npm run test:run` green; `npm run e2e` green with sidecar + browsers installed.
- Submodule pointer bump merges to parent `main`; MILESTONES.md flips the 3e row to closed; the active-phase block records "None — between phases. Phase 3 closed YYYY-MM-DD."
14. References
- Predecessor: `docs/specs/2026-05-03-rocky-console-ralph-wrapper-phase-3d.md` (the wrapper this phase closes out).
- Parent spec: `docs/specs/2026-05-02-rocky-system-redesign.md` §SS-07, §Phasing-3.
- Carry-over markers: `console/src/app/api/ralph/prompts/crdt/[...path]/route.ts:4-7`, `console/src/lib/dashboards/data.ts:34`.
- Decision 0001, push-down policy (CRDT sidecar lives in `console/`, not parent root).
- y-websocket: https://github.com/yjs/y-websocket (already at `^3.0.0` in `console/package.json`).