

title: Phase 3e — CRDT WS transport + SS-04 ralph run-history
date: 2026-05-04
status: Accepted
phase: 3e
spec: docs/specs/2026-05-02-rocky-system-redesign.md §SS-07, §Phasing-3
predecessor: docs/specs/2026-05-03-rocky-console-ralph-wrapper-phase-3d.md

Phase 3e — CRDT WS transport + SS-04 ralph run-history

Console-only design (no ralph/ changes; no parent code beyond a submodule pointer bump and the MILESTONES close-row).

1. Goal

Close the two carry-over items deferred from Phase 3d so Phase 3 can be declared complete:

  1. Replace the 501 stub at console/src/app/api/ralph/prompts/crdt/[...path]/route.ts:26 with real bidirectional Yjs sync, satisfying gate G3 (CRDT publish) end-to-end.
  2. Replace the hardcoded-zero getRalphRunsStub at console/src/lib/dashboards/data.ts:54-65 with a live SS-04 read-through aggregator over RalphClient.listRuns(), so the dash/ralph-runs panel renders true 24-hour activity.

No spec scope beyond these two items remains in §Phasing-3 (KAHN adoption + SS-07 wiring shipped in 3b/3c/3d).

2. Locked decisions

D1  Decision: CRDT transport: separate Node sidecar. New process console/scripts/crdt-server.mjs running y-websocket's setupWSConnection on 127.0.0.1:${RALPH_CRDT_PORT ?? 8766}, supervised by the Next dev process the same way RALPH_TRANSPORT=sidecar supervises ralph serve.
    Rationale: Next.js 16 still has no stable Route-Handler WebSocket primitive. Coupling the phase close to upstream framework cadence is open-ended; a bounded ~200-line Node sidecar is the cheapest path. The sidecar is retire-able later (no schema change); the crdt.ts registry contract is the seam.

D2  Decision: Transport env knob: RALPH_CRDT_TRANSPORT=sidecar|remote|disabled. disabled keeps the current 501 and the editor renders an "offline" banner.
    Rationale: Mirrors the SS-07 RALPH_TRANSPORT shape from 3d. disabled is the cloud-default until a managed CRDT service is provisioned in Phase 6+.

D3  Decision: Yjs persistence: per-prompt .crdt sidecar next to .fp. Path: <workspace>/prompts/<file>.yaml.crdt. Debounced idle-write (5 s after last update).
    Rationale: Survives sidecar restarts without a database. .crdt joins .yaml and .yaml.fp as a third sidecar in the prompts tree; gitignored by default (binary, transient).

D4  Decision: WS auth: bearer over Sec-WebSocket-Protocol. Reuse mintBearer(session, workspace_slug) from console/src/lib/ralph/auth.ts; same migration seam as the HTTP routes.
    Rationale: y-websocket supports subprotocol-as-bearer natively. No new auth surface.

D5  Decision: Role gate: operator+admin only on WS handshake. Observer denied at the upgrade.
    Rationale: Matches the §SS-07 role table from 3d. Editing a CRDT doc is editing a prompt.

D6  Decision: Run-history aggregator: SS-04 read-through, 30 s in-memory TTL. Lives at console/src/lib/workspace/ralph-runs.ts. No DB, no persistence: ralph serve's journal is the source of truth.
    Rationale: Per redesign §Subsystem map: SS-02 owns dashboard definitions and rendering; SS-04 owns workspace-scoped read models. Mirrors how council.ts (SS-04) feeds data.ts (SS-02) for council:debates:recent.

D7  Decision: Outcome fold: KAHN Outcome enum (clean|clean_with_flake|partial|stuck|catastrophic) → pass (first two) / fail (last three) for pass_rate math. KILN's optional convergence_score and early_stop_reason pass through as no-op slots when absent (forward-compat invariant from 3d D3).
    Rationale: Pass-rate is a binary metric; the five-valued enum needs an explicit fold, and clean_with_flake is "pass with a known retry", so it is counted as pass.
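The D7 fold can be sketched as follows. This is a minimal illustration rather than the shipped code: RunRecord, foldOutcome, passRate, and meanConvergenceScore are hypothetical names, and reporting 0 for an empty window is an assumption.

```typescript
// Hypothetical shapes for illustration; the real run record comes from RalphClient.listRuns().
type Outcome = "clean" | "clean_with_flake" | "partial" | "stuck" | "catastrophic";

interface RunRecord {
  outcome: Outcome;
  convergence_score?: number; // KILN-optional
  early_stop_reason?: string; // KILN-optional
}

// D7: the first two enum values count as pass, the last three as fail.
function foldOutcome(outcome: Outcome): "pass" | "fail" {
  return outcome === "clean" || outcome === "clean_with_flake" ? "pass" : "fail";
}

function passRate(runs: RunRecord[]): number {
  if (runs.length === 0) return 0; // assumption: an empty window reports 0
  const passes = runs.filter((r) => foldOutcome(r.outcome) === "pass").length;
  return passes / runs.length;
}

// Averages only over runs that carry the KILN field; undefined when none do.
function meanConvergenceScore(runs: RunRecord[]): number | undefined {
  const scores = runs
    .map((r) => r.convergence_score)
    .filter((s): s is number => s !== undefined);
  if (scores.length === 0) return undefined;
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}
```

Returning undefined for the KILN mean keeps the slot a no-op when no run carries the field, which is one way to honour the forward-compat invariant.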

3. Architecture

                Browser (operator/admin session, Airlock cookie)
                 │
                 │  HTTP                          WebSocket
                 │  /api/ralph/prompts/...        ws://localhost:8766/<slug>/<path>
                 ▼                                ▼
   ┌──────────────────────────┐         ┌──────────────────────────────┐
   │ Next.js console          │         │ CRDT sidecar (Node)          │
   │  (existing routes from   │  ←──→   │  console/scripts/crdt-server │
   │   Phase 3d)              │ shared  │  • y-websocket setupWSConn   │
   │                          │ Y.Doc   │  • role check on handshake   │
   │  /api/.../crdt/[...path] │ registry│  • bearer verify             │
   │   → 307 to ws://         │  (mem)  │  • debounced .crdt persist   │
   │     or 501 if disabled   │         │  • HATCH session events      │
   └──────────────────────────┘         └──────────────────────────────┘
                                                     │
                                                     ▼
                             ┌────────────────────────────────────┐
                             │ <workspace>/prompts/<file>.yaml    │
                             │ <workspace>/prompts/<file>.yaml.fp │
                             │ <workspace>/prompts/<file>.yaml.crdt│ ← new in 3e
                             └────────────────────────────────────┘


                Browser (any role)
                 │
                 │  /dashboards/ralph-runs
                 ▼
   ┌──────────────────────────┐       ┌──────────────────────────┐
   │ SS-02 dashboards/data.ts │ ───── │ SS-04 workspace/         │
   │  fetchPanelData()        │ calls │  ralph-runs.ts           │
   │   ralph:runs:* →         │       │   aggregateRuns24h(slug) │
   │    getRalphRunsLive()    │       │   • 30s TTL cache        │
   │     (was getStub)        │       │   • outcome fold (D7)    │
   └──────────────────────────┘       └────────────┬─────────────┘
                                                   │
                                                   ▼  RalphClient.listRuns()
                                          ┌──────────────────┐
                                          │ ralph serve      │
                                          │ (unchanged 3c)   │
                                          └──────────────────┘

4. CRDT sidecar surface

Process. node console/scripts/crdt-server.mjs. Reads RALPH_CRDT_PORT, RALPH_CRDT_TRANSPORT, and the same VAULT-resolved bearer the HTTP path uses. Single-process, single-port; no clustering (cluster mode is a Phase-6 cloud concern).
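One way the dev process might supervise the sidecar is sketched below. The document does not specify the supervision mechanism, so superviseSidecar, its signal handling, and the exit behaviour are all assumptions for illustration.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Hypothetical supervisor: starts the sidecar and stops it when the parent goes away.
function superviseSidecar(
  command: string,
  args: string[],
  extraEnv: Record<string, string> = {},
): { child: ChildProcess; stop: () => void } {
  const child = spawn(command, args, {
    stdio: "inherit",
    env: { ...process.env, ...extraEnv },
  });
  child.on("error", (err) => {
    console.error("crdt sidecar failed to start:", err.message);
  });

  // Send SIGTERM only if the child is still running and has not been signalled yet.
  const stop = (): void => {
    if (child.exitCode === null && !child.killed) child.kill("SIGTERM");
  };

  // 'exit' does not fire when the parent dies from an unhandled signal,
  // so the common termination signals are handled explicitly as well.
  process.once("exit", stop);
  for (const signal of ["SIGINT", "SIGTERM"] as const) {
    process.once(signal, () => {
      stop();
      process.exit();
    });
  }
  return { child, stop };
}
```

Under these assumptions the dev entry point would call something like superviseSidecar("node", ["console/scripts/crdt-server.mjs"]) when RALPH_CRDT_TRANSPORT is sidecar.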

Endpoint shape. ws://<host>:<port>/<workspace_slug>/<encodeURIComponent(prompt_path)>. The path-shape is parsed inside setupWSConnection's docName extractor; no Express, no router.
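A sketch of how the path could be split into its two parts before it is handed over as the document name. parseDocTarget is a hypothetical helper, and the rejection of absolute or ".." paths is an assumption, not something the spec states.

```typescript
// Hypothetical helper: splits "/<workspace_slug>/<encoded prompt_path>" into its parts.
interface DocTarget {
  slug: string;
  promptPath: string;
}

function parseDocTarget(requestUrl: string): DocTarget | null {
  // The request URL on an upgrade is path plus optional query; drop the query.
  const path = requestUrl.split("?")[0] ?? "";
  const segments = path.split("/").filter((s) => s.length > 0);
  // Exactly two segments: the slug and one percent-encoded prompt path.
  if (segments.length !== 2) return null;
  const slug = segments[0] as string;
  const encoded = segments[1] as string;

  let promptPath: string;
  try {
    promptPath = decodeURIComponent(encoded);
  } catch {
    return null; // malformed percent-encoding
  }
  // Assumption: prompt paths are relative and must stay inside the prompts tree.
  if (promptPath.startsWith("/") || promptPath.split("/").includes("..")) return null;
  return { slug, promptPath };
}
```

Because the prompt path is encoded into a single segment, a nested path such as agents/planner.yaml travels as agents%2Fplanner.yaml and cannot be confused with the slug boundary.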

Handshake.

  1. Client sends Sec-WebSocket-Protocol: bearer.<token> (subprotocol form so y-websocket's client can pass it through).
  2. Server verifies bearer against the static HMAC from VAULT (identical check to HTTP routes).
  3. Server resolves the Airlock session to a role and rejects observer sessions with a 1008 (policy violation) close code and a sub-protocol response.
  4. On accept, server fetches the Y.Doc from the shared registry (getDoc(slug, prompt_path) from console/src/lib/ralph/crdt.ts) and binds the WS to it.
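Steps 1 to 3 of the handshake can be sketched as a pure check. The names extractBearer, bearerMatches, and checkHandshake are hypothetical, and using close code 1008 for a bad bearer (as opposed to the observer case, which the spec does name) is an assumption.

```typescript
import { createHash, timingSafeEqual } from "node:crypto";

type Role = "observer" | "operator" | "admin";
type HandshakeResult = { ok: true } | { ok: false; closeCode: number; reason: string };

// Pulls the token out of a Sec-WebSocket-Protocol header such as "bearer.abc123".
function extractBearer(protocolHeader: string | undefined): string | null {
  if (!protocolHeader) return null;
  const prefix = "bearer.";
  for (const proto of protocolHeader.split(",").map((p) => p.trim())) {
    if (proto.startsWith(prefix) && proto.length > prefix.length) {
      return proto.slice(prefix.length);
    }
  }
  return null;
}

// Constant-time comparison; hashing both sides first yields equal-length buffers.
function bearerMatches(presented: string, expected: string): boolean {
  const a = createHash("sha256").update(presented).digest();
  const b = createHash("sha256").update(expected).digest();
  return timingSafeEqual(a, b);
}

function checkHandshake(
  protocolHeader: string | undefined,
  expectedBearer: string,
  role: Role,
): HandshakeResult {
  const token = extractBearer(protocolHeader);
  if (token === null || !bearerMatches(token, expectedBearer)) {
    return { ok: false, closeCode: 1008, reason: "invalid bearer" }; // close code is an assumption here
  }
  if (role === "observer") {
    return { ok: false, closeCode: 1008, reason: "observer role cannot edit prompts" };
  }
  return { ok: true };
}
```

Keeping the check free of socket objects makes it straightforward to unit-test separately from the WebSocket server.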

Persistence. Every Y.Doc maintains an update event handler that schedules a debounced (5 s) persist(slug, prompt_path) call:

// console/src/lib/ralph/crdt.ts (extended)
export function persist(slug: string, promptPath: string): Promise<void>

Implementation: the bytes from Y.encodeStateAsUpdate(doc) are written to <workspace>/prompts/<file>.yaml.crdt via temp-file + rename. On sidecar boot, getDoc() first attempts to seed from the .crdt file via Y.applyUpdate(doc, fs.readFileSync(...)).
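The debounce and the temp-file + rename write might look like the sketch below. It is synchronous for brevity (the real persist returns a Promise), it leaves Yjs out so the block stays self-contained, and writeAtomic, createDebouncedPersist, and the flush() method are hypothetical.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Writes to a sibling temp file, then renames over the target so a reader
// never observes a half-written .crdt file.
function writeAtomic(target: string, bytes: Uint8Array): void {
  const tmp = path.join(
    path.dirname(target),
    `.${path.basename(target)}.${process.pid}.tmp`,
  );
  fs.writeFileSync(tmp, bytes);
  fs.renameSync(tmp, target);
}

// Each schedule() call restarts the idle timer; write() runs once the timer
// expires without a further call. flush() writes immediately if a write is pending.
function createDebouncedPersist(write: () => void, idleMs = 5000) {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const flush = (): void => {
    if (timer === undefined) return; // nothing pending
    clearTimeout(timer);
    timer = undefined;
    write();
  };

  const schedule = (): void => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => {
      timer = undefined;
      write();
    }, idleMs);
  };

  return { schedule, flush };
}
```

In the sidecar, write would plausibly be () => writeAtomic(crdtPath, Y.encodeStateAsUpdate(doc)) with schedule attached through doc.on("update", schedule). Calling flush() on shutdown is one way to avoid losing the last five seconds of edits.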

Lifecycle.

5. CRDT route changes

console/src/app/api/ralph/prompts/crdt/[...path]/route.ts:

The route still does the role gate up front (observer rejected with 403) so we never leak a redirect to an unauthorised client.
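The ordering of checks in the route can be sketched as a pure decision function. decideCrdtRoute and RouteDecision are hypothetical names; the 307 and 501 statuses follow the architecture diagram, and the redirect target format follows the endpoint shape in section 4.

```typescript
type Role = "observer" | "operator" | "admin";
type CrdtTransport = "sidecar" | "remote" | "disabled";

interface RouteDecision {
  status: 307 | 403 | 501;
  location?: string; // only set for the 307 redirect
}

function decideCrdtRoute(
  role: Role,
  transport: CrdtTransport,
  wsBase: string, // e.g. "ws://localhost:8766"
  slug: string,
  promptPath: string,
): RouteDecision {
  // Role gate first, so an observer never learns the WebSocket URL.
  if (role === "observer") return { status: 403 };
  // Disabled transport keeps the pre-3e stub behaviour.
  if (transport === "disabled") return { status: 501 };
  return {
    status: 307,
    location: `${wsBase}/${slug}/${encodeURIComponent(promptPath)}`,
  };
}
```

A Route Handler would then translate the decision into a Response, which keeps the gate logic testable without a running Next.js server.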

6. Editor wiring

console/src/lib/ralph/prompts.ts:

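import { WebsocketProvider } from "y-websocket";

// slug, promptPath, doc (the prompt's Y.Doc), and bearer come from the editor's mount scope.
let provider: WebsocketProvider | null = null;

const transport = process.env.NEXT_PUBLIC_RALPH_CRDT_TRANSPORT ?? "sidecar";
if (transport !== "disabled") {
  const url = transport === "remote"
    ? process.env.NEXT_PUBLIC_RALPH_CRDT_URL!
    : `ws://${location.hostname}:${process.env.NEXT_PUBLIC_RALPH_CRDT_PORT ?? 8766}`;
  // The room name becomes the URL path: <url>/<slug>/<encoded prompt path>.
  provider = new WebsocketProvider(url, `${slug}/${encodeURIComponent(promptPath)}`, doc, {
    protocols: [`bearer.${bearer}`], // sent as Sec-WebSocket-Protocol
  });
}
// provider === null: render the offline banner; the editor stays editable but uncoordinated.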

Bearer is fetched once at editor mount via a new GET /api/ralph/crdt/token route that returns the static HMAC for operator/admin and 403 otherwise (mirrors the 3d pattern of never embedding the bearer in client bundles).

The "offline" banner ("Collaborative editing offline — your changes will save but not sync") is a small <Alert variant="warning"> rendered when the provider is null.

7. SS-04 run-history aggregator

console/src/lib/workspace/ralph-runs.ts (new):

export interface RalphRunsAgg24h {
  total: number;
  passRate: number;             // 0–1
  meanAttempts: number;
  trailingPoints: { x: string; y: number }[];   // hourly buckets
  // KILN-optional, surfaced only when present:
  meanConvergenceScore?: number;
  earlyStopBreakdown?: Record<string, number>;
}

export async function aggregateRuns24h(slug: string): Promise<RalphRunsAgg24h>;

Implementation.

  1. Cache lookup keyed by slug. If hit < 30 s old, return.
  2. getRalphClient().listRuns(slug, { since: now - 24h }).
  3. Fold per D7. KILN fields are summed/averaged only over runs that carry them.
  4. Bucket runs into hour-of-day for trailingPoints.
  5. Cache + return.
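Steps 1, 4, and 5 above can be sketched as below. createTtlCache and bucketHourly are hypothetical, the startedAt field name is an assumption, and so is the choice of run count as the y value (the interface does not say what y measures).

```typescript
// Per-key TTL cache with an injectable clock so expiry can be tested deterministically.
function createTtlCache<T>(ttlMs: number, now: () => number = Date.now) {
  const entries = new Map<string, { value: T; storedAt: number }>();
  return {
    get(key: string): T | undefined {
      const entry = entries.get(key);
      if (entry === undefined) return undefined;
      if (now() - entry.storedAt >= ttlMs) {
        entries.delete(key); // expired: treat as a miss
        return undefined;
      }
      return entry.value;
    },
    set(key: string, value: T): void {
      entries.set(key, { value, storedAt: now() });
    },
  };
}

const HOUR_MS = 60 * 60 * 1000;

// Builds 24 hour-aligned buckets ending with the current hour and counts runs per bucket.
// Runs with an unparseable timestamp, or older than the first bucket, are skipped.
function bucketHourly(
  runs: { startedAt: string }[], // ISO-8601; field name is an assumption
  nowMs: number,
): { x: string; y: number }[] {
  const currentHour = Math.floor(nowMs / HOUR_MS) * HOUR_MS;
  const firstHour = currentHour - 23 * HOUR_MS;
  const counts: number[] = new Array(24).fill(0);
  for (const run of runs) {
    const idx = Math.floor((Date.parse(run.startedAt) - firstHour) / HOUR_MS);
    if (idx >= 0 && idx < 24) counts[idx] = (counts[idx] ?? 0) + 1;
  }
  return counts.map((y, i) => ({ x: new Date(firstHour + i * HOUR_MS).toISOString(), y }));
}
```

With hour-aligned buckets, runs in the partial hour at the far edge of the 24-hour window fall outside the first bucket; whether that matters depends on how the panel labels its axis.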

Failure mode. If ralph serve is unreachable (sidecar down at boot), aggregateRuns24h throws RalphUnavailable. getRalphRunsLive in data.ts catches it and falls back to the existing getRalphRunsStub, so dev-without-sidecar still renders.
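The fallback might be structured as in this sketch. The RalphUnavailable class body and the withStubFallback helper are hypothetical, and rethrowing every other error is an assumption: the spec only says that RalphUnavailable triggers the stub.

```typescript
// Hypothetical definition; the real class lives wherever RalphClient is defined.
class RalphUnavailable extends Error {
  constructor(message = "ralph serve is unreachable") {
    super(message);
    this.name = "RalphUnavailable";
    // Keeps instanceof working when compiled to older targets.
    Object.setPrototypeOf(this, RalphUnavailable.prototype);
  }
}

// Runs the live loader; falls back to the stub only for RalphUnavailable.
async function withStubFallback<T>(live: () => Promise<T>, stub: () => T): Promise<T> {
  try {
    return await live();
  } catch (err) {
    if (err instanceof RalphUnavailable) return stub();
    throw err; // assumption: other failures should surface, not be masked by zeros
  }
}
```

Under these assumptions getRalphRunsLive would reduce to a call such as withStubFallback(() => aggregateRuns24h(slug), getRalphRunsStub), plus whatever mapping the panel shape needs.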

console/src/lib/dashboards/data.ts:

8. HATCH events emitted

Event | When | Payload
prompt.crdt.session.opened | After WS handshake accept | actor, workspace, prompt_path, session_id
prompt.crdt.session.closed | On WS close (any reason) | actor, workspace, prompt_path, session_id, reason, duration_s

Body content is never logged (3d invariant). The existing prompt.edited event continues to fire on the HTTP save (which is now the CRDT publish snapshot per gate G3); no change there.

Both new events route through /api/relay/ralph (the SS-05 RELAY route from 3d). The sidecar reaches the relay by HTTP loopback to the Next.js process, with the same bearer.
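The two payloads could be typed roughly as follows. The field names come from the table above; the event discriminator field, the string types, and the buildSessionClosed helper are assumptions for illustration.

```typescript
interface SessionOpened {
  event: "prompt.crdt.session.opened";
  actor: string;
  workspace: string;
  prompt_path: string;
  session_id: string;
}

interface SessionClosed extends Omit<SessionOpened, "event"> {
  event: "prompt.crdt.session.closed";
  reason: string;
  duration_s: number;
}

// Derives the closed event from the opened one; no document body is ever included.
function buildSessionClosed(
  opened: SessionOpened,
  openedAtMs: number,
  closedAtMs: number,
  reason: string,
): SessionClosed {
  return {
    event: "prompt.crdt.session.closed",
    actor: opened.actor,
    workspace: opened.workspace,
    prompt_path: opened.prompt_path,
    session_id: opened.session_id,
    reason,
    // Clamp at zero in case the clock moved backwards between open and close.
    duration_s: Math.max(0, (closedAtMs - openedAtMs) / 1000),
  };
}
```

Building the closed payload from the opened one keeps session_id and actor consistent across the pair, which makes the two events easy to join downstream.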

9. Testing

10. Deferred / out-of-scope

Item | Phase
Cluster-mode CRDT sidecar (multi-instance + shared persistence) | Phase 6 (Hearth Kustomize). Single-instance is fine for self-host and current cloud scale.
Per-workspace scoped JWT for CRDT WS bearer | Auth-follow-up phase; same migration seam as 3d (mintBearer).
Server-side run-history aggregation across workspaces (cross-tenant ops view) | Phase 4+ once contracts stabilise.
next lint deprecation fix | Tracked separately; pre-existing, not 3e regression.

11. AUTH continuity

No change from 3d. Static HMAC from VAULT secrets/ralph/serve_token is the bearer for both HTTP and WS surfaces. The CRDT sidecar reads VAULT identically to the Next.js process.

12. Observability

13. Acceptance criteria

A Phase 3e PR is complete when:

  1. Two operator sessions editing the same prompt converge within 500 ms; observer role refused at WS handshake with a clear close-code reason.
  2. CRDT sidecar boots from npm run dev, dies gracefully on Next.js shutdown, leaves no orphan process (verified via ps after ^C).
  3. Prompt edits survive a CRDT-sidecar restart (Yjs persistence to .crdt sidecar verified by killing and restarting the sidecar mid-edit).
  4. HATCH emits prompt.crdt.session.opened and prompt.crdt.session.closed; body content not logged in either.
  5. dash/ralph-runs panels render real 24h aggregates against fixture runs; both KILN-on and KILN-off event shapes render without panel error.
  6. Aggregator cache deduplicates within 30 s; cache miss after expiry verified by spy.
  7. RALPH_CRDT_TRANSPORT=disabled returns the existing 501 and the editor renders the "Collaborative editing offline" banner without crashing.
  8. RALPH_CRDT_TRANSPORT=sidecar and RALPH_CRDT_TRANSPORT=remote both pass the convergence e2e.
  9. npm run typecheck clean; npm run test:run green; npm run e2e green with sidecar + browsers installed.
  10. Submodule pointer bump merges to parent main; MILESTONES.md flips the 3e row to closed; the active-phase block records "None — between phases. Phase 3 closed YYYY-MM-DD."

14. References