diff --git a/.beads/issues.jsonl b/.beads/issues.jsonl index c6b5525..2b12057 100644 --- a/.beads/issues.jsonl +++ b/.beads/issues.jsonl @@ -1,4 +1,4 @@ -{"_type":"issue","id":"islandflow-thp","title":"stabilize live api memory and reduce internal cache churn","description":"The native VPS deployment is repeatedly OOM-killing islandflow-api.service during live operation. The API live cache is retaining oversized channel histories and rewriting large Redis lists on every flush, which drives multi-GB Bun RSS and heavy loopback traffic between the API, Redis, NATS, and ClickHouse. Implement an emergency VPS mitigation plus repo hardening so unsafe env values, reconnect snapshots, and Redis persistence patterns cannot push the live API back into OOM.","acceptance_criteria":"1. VPS live cache env values are reduced to safe defaults and live redis state is cleared before restart. 2. services/api/src/live.ts enforces server-side live cache caps and clamps snapshot_limit accordingly. 3. Hot generic feed Redis persistence no longer rewrites entire lists on every flush. 4. Metrics/logging expose subscription counts, snapshot sizes, redis flush volume, and API memory trend. 5. Relevant tests pass and the deployment is restarted successfully.","notes":"Implemented local hardening for API live-state limits, incremental generic Redis persistence, live subscription/memory metrics, and safer client/env defaults. 
Targeted API live tests and the web production build both passed.","status":"in_progress","priority":1,"issue_type":"bug","assignee":"dirtydishes","owner":"dishes@dpdrm.com","created_at":"2026-05-23T01:30:43Z","created_by":"dirtydishes","updated_at":"2026-05-23T01:39:57Z","started_at":"2026-05-23T01:30:52Z","dependency_count":0,"dependent_count":0,"comment_count":0} +{"_type":"issue","id":"islandflow-thp","title":"stabilize live api memory and reduce internal cache churn","description":"The native VPS deployment is repeatedly OOM-killing islandflow-api.service during live operation. The API live cache is retaining oversized channel histories and rewriting large Redis lists on every flush, which drives multi-GB Bun RSS and heavy loopback traffic between the API, Redis, NATS, and ClickHouse. Implement an emergency VPS mitigation plus repo hardening so unsafe env values, reconnect snapshots, and Redis persistence patterns cannot push the live API back into OOM.","acceptance_criteria":"1. VPS live cache env values are reduced to safe defaults and live redis state is cleared before restart. 2. services/api/src/live.ts enforces server-side live cache caps and clamps snapshot_limit accordingly. 3. Hot generic feed Redis persistence no longer rewrites entire lists on every flush. 4. Metrics/logging expose subscription counts, snapshot sizes, redis flush volume, and API memory trend. 5. Relevant tests pass and the deployment is restarted successfully.","notes":"Implemented and deployed the live-state hardening to the VPS. 
Final validation after restart showed the API around 120 MB RSS with capped live cache depths and clean systemd restarts.","status":"in_progress","priority":1,"issue_type":"bug","assignee":"dirtydishes","owner":"dishes@dpdrm.com","created_at":"2026-05-23T01:30:43Z","created_by":"dirtydishes","updated_at":"2026-05-23T01:50:29Z","started_at":"2026-05-23T01:30:52Z","dependency_count":0,"dependent_count":0,"comment_count":0} {"_type":"issue","id":"islandflow-sc6","title":"fix electron codex bridge preload loading","description":"Electron settings showed the browser-only Desktop Required fallback because the renderer did not see the native islandflowDesktop preload bridge or an Electron user-agent marker. Fix the desktop launch path so ChatGPT/Codex subscription controls are available inside Islandflow Desktop again.","notes":"Reopened after live Electron still showed the browser-only fallback. Follow-up fix adds an explicit preload runtime marker and web runtime detection for that marker so Electron is recognized even when the bridge is not ready and the user agent lacks an Electron token.","status":"closed","priority":1,"issue_type":"bug","owner":"dishes@dpdrm.com","created_at":"2026-05-20T23:42:58Z","created_by":"dirtydishes","updated_at":"2026-05-20T23:51:43Z","closed_at":"2026-05-20T23:51:43Z","close_reason":"Follow-up fix added an explicit islandflowDesktopRuntime preload marker and taught the web runtime to recognize that marker plus IslandflowDesktop user-agent tokens, so Electron no longer falls into the browser-only fallback when the AI bridge is delayed or unavailable. 
Desktop build and focused desktop/web tests pass; full web build still blocked by islandflow-c8f.","dependency_count":0,"dependent_count":0,"comment_count":0} {"_type":"issue","id":"islandflow-hj3","title":"Fix Electron preload for desktop AI bridge","description":"## Why\\nThe desktop settings page reports the native AI bridge as unavailable because Electron fails to load the preload script in local dev.\\n\\n## What\\nUpdate the desktop preload implementation/build so Electron can execute it, restore window.islandflowDesktop, and verify the Copilot settings panel detects the bridge again.\\n\\n## Acceptance Criteria\\n- Electron no longer logs a preload syntax error\\n- window.islandflowDesktop is available in the desktop renderer\\n- The settings page no longer shows bridge unavailable solely because preload failed\\n- Relevant desktop/web tests pass","status":"closed","priority":1,"issue_type":"bug","assignee":"dirtydishes","owner":"dishes@dpdrm.com","created_at":"2026-05-20T23:16:39Z","created_by":"dirtydishes","updated_at":"2026-05-20T23:20:20Z","started_at":"2026-05-20T23:16:48Z","closed_at":"2026-05-20T23:20:20Z","close_reason":"Closed","dependency_count":0,"dependent_count":0,"comment_count":0} {"_type":"issue","id":"islandflow-199","title":"fix desktop copilot fallback inside electron","description":"## Why\\nThe settings page can render the browser-only fallback even when Islandflow is running inside the Electron desktop shell.\\n\\n## What\\nSeparate desktop-shell detection from desktop AI transport state, make the provider recover if the bridge appears late or initial state loading fails, and cover the regression with tests.\\n\\n## Acceptance Criteria\\n- The desktop shell no longer shows the browser-only fallback solely because initial bridge state failed or arrived late\\n- Desktop-only actions can distinguish between missing Electron bridge and transport/auth problems\\n- Automated tests cover the recovery 
behavior","status":"closed","priority":1,"issue_type":"bug","assignee":"dirtydishes","owner":"dishes@dpdrm.com","created_at":"2026-05-20T22:30:16Z","created_by":"dirtydishes","updated_at":"2026-05-20T22:37:21Z","started_at":"2026-05-20T22:30:23Z","closed_at":"2026-05-20T22:37:21Z","close_reason":"Fixed desktop-shell Copilot fallback handling, added bridge recovery logic, updated desktop-vs-bridge UI messaging, and added regression tests. Follow-up tracked in islandflow-c8f for unrelated web build blocker.","dependency_count":0,"dependent_count":0,"comment_count":0} diff --git a/docs/turns/2026-05-22-stabilize-live-api-memory.html b/docs/turns/2026-05-22-stabilize-live-api-memory.html new file mode 100644 index 0000000..d2b48e2 --- /dev/null +++ b/docs/turns/2026-05-22-stabilize-live-api-memory.html @@ -0,0 +1,810 @@ + + + + + + Turn Record: Stabilize Live API Memory + + + +
+
+ Turn Record · May 22, 2026 +

Stabilize Live API Memory and Internal Traffic

+

+ The Islandflow live API was repeatedly getting OOM-killed on the VPS because the hot live + cache could retain oversized channel windows and rewrite whole Redis lists at high + frequency. This turn applied an immediate server-side mitigation, hardened the API cache + path in code, and rolled the changes onto the native systemd deployment. +

+
+
+ Branch + stabilize-live-api-memory +
+
+ Beads + islandflow-thp +
+
+ Deployment + Native systemd user services on the VPS +
+
+ Primary Outcome + API RSS returned to roughly 115-130 MB after rollout +
+
+
+ +
+
+

Summary

+

+ The live API is now bounded in three layers instead of trusting environment values and + reconnect behavior. First, the VPS .env was reset to safer live-window + values and the oversized Redis hot-cache keys were cleared. Second, the API now clamps + generic live cache limits per channel in code. Third, generic live feed persistence now + appends deltas into Redis instead of cloning and rewriting entire lists on every flush. +

+
+ Observed on the VPS after rollout: + the API stayed healthy through restart, minute metrics showed much smaller cache depths, + and the kernel did not log any new Bun OOM kill after the hardened restart. +
+
+ +
+

Changes Made

+
    +
  • + Added channel-specific hard caps in + services/api/src/live.ts so oversized + LIVE_LIMIT_* values are clamped before use. +
  • +
  • + Changed generic live Redis persistence from full-list rewrite behavior to append-plus-trim, + with rewrite fallback only when the in-memory ordering has to be rebuilt. +
  • +
  • + Serialized Redis flushes during shutdown so service restarts do not race with a closing + Redis client. +
  • +
  • + Added API minute-log visibility for live subscription counts, Redis flush deltas, + payload bytes, snapshot sizes, and process memory usage. +
  • +
  • + Tightened the browser-exposed live window caps in + apps/web/app/terminal.tsx and aligned the tracked env examples with the safer + production defaults, including LIVE_LIMIT_NEWS. +
  • +
  • + Applied the emergency mitigation directly on the VPS: + updated /home/delta/islandflow/.env, created + /home/delta/islandflow/.env.backup-2026-05-22-2131, deleted stale + live:* Redis keys, rebuilt the web app, and restarted + islandflow-api.service and islandflow-web.service. +
  • +
+
+ +
+

Context

+

+ The VPS was killing islandflow-api.service several times on May 22, 2026. + Kernel logs showed Bun reaching roughly 8-9 GiB RSS inside the API service cgroup before + the OOM killer stepped in. The API minute logs also showed channel depths pinned at + 10000 for multiple feeds, plus massive cumulative Redis rewrite churn. +

+

+ Most of the “huge bandwidth” in btop was local loopback traffic: Bun talking + to Redis, NATS, and ClickHouse on 127.0.0.1. That meant the problem was not a + public-edge flood; it was the live cache architecture multiplying internal work on the box. +

+
+ +
+

Important Implementation Details

+
+
+

API hardening

+
    +
  • + Hard caps now bound generic channel windows even if env values drift upward. +
  • +
  • + snapshot_limit is still honored, but only up to the lower of the request, + the configured limit, and the safe channel cap. +
  • +
  • + Generic feeds use incremental Redis appends; scoped candle and overlay caches still + use full rewrites because they are much smaller and keyed differently. +
  • +
+
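The clamping rule above (a snapshot is served up to the lowest of the request, the configured limit, and the channel cap) can be sketched as follows. This is a minimal illustration, not the code in services/api/src/live.ts: the names EXAMPLE_CAPS, clampLimit, and effectiveSnapshotLimit and the cap values themselves are assumptions made for the example.

```typescript
// Hypothetical channel list and cap values, chosen only for illustration.
type Channel = "options" | "flow" | "alerts" | "news";

const EXAMPLE_CAPS: Record<Channel, number> = {
  options: 1000,
  flow: 1000,
  alerts: 1000,
  news: 500,
};

// Clamp a raw value into [1, cap]; missing or invalid input falls back to min(fallback, cap).
function clampLimit(raw: number | undefined, cap: number, fallback: number): number {
  if (raw === undefined || !Number.isFinite(raw) || raw < 1) {
    return Math.min(fallback, cap);
  }
  return Math.min(Math.floor(raw), cap);
}

// The served snapshot size is the lowest of the request, the configured limit, and the channel cap.
function effectiveSnapshotLimit(
  requested: number | undefined,
  configured: number,
  channel: Channel,
): number {
  const cap = EXAMPLE_CAPS[channel];
  const configuredClamped = clampLimit(configured, cap, cap);
  return clampLimit(requested, configuredClamped, configuredClamped);
}
```

With these assumed caps, a client asking for 10000 items on a channel configured at 500 gets 500, and an env value that drifts to 100000 is still held at the cap of 1000.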
+
+

Operational changes

+
    +
  • + The VPS now runs with a much smaller hot live footprint: + options 100, flow 500, alerts 300, + news 100. +
  • +
  • + Old Redis hot-cache keys were deleted so the API did not rehydrate oversized lists on boot. +
  • +
  • + The web app was rebuilt on the VPS checkout after switching that checkout onto + stabilize-live-api-memory. +
  • +
+
+
+
+ +
+

Relevant Diff Snippets

+

+ These snippets are rendered with the Diffs library from + diffs.com, with a plain-text fallback kept inline in the file. +

+
+
+

services/api/src/live.ts: hard caps and append-based generic Redis flushes

+
+
+ Plain-text fallback +
Added LIVE_GENERIC_LIMIT_CAPS, clamped env/configured limits, changed generic writes from
+queueRedisWrite(items:[...items]) to queueGenericRedisWrite(item, items, forceRewrite), and split
+Redis persistence into rewrite and append paths with shutdown-safe flush serialization.
+
+
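The split between an append path and a rewrite path can be sketched as below. This is a hedged illustration under stated assumptions, not the project's implementation: ListStore, MemoryListStore, persistDelta, and persistRewrite are hypothetical names, the store is a synchronous in-memory stand-in modelled on the Redis RPUSH, LTRIM, and DEL commands, and it assumes newest items are pushed at the tail and that limit is at least 1.

```typescript
// Minimal list-store shape modelled on the Redis RPUSH, LTRIM, and DEL commands (hypothetical interface).
interface ListStore {
  rpush(key: string, values: string[]): void;
  ltrim(key: string, start: number, stop: number): void;
  del(key: string): void;
}

// In-memory stand-in so the sketch runs without a Redis server.
class MemoryListStore implements ListStore {
  readonly lists = new Map<string, string[]>();

  rpush(key: string, values: string[]): void {
    const list = this.lists.get(key) ?? [];
    list.push(...values);
    this.lists.set(key, list);
  }

  // Keep only the inclusive range [start, stop]; negative indexes count from the tail, as in LTRIM.
  ltrim(key: string, start: number, stop: number): void {
    const list = this.lists.get(key) ?? [];
    const from = start < 0 ? Math.max(list.length + start, 0) : start;
    const to = stop < 0 ? list.length + stop : Math.min(stop, list.length - 1);
    this.lists.set(key, from > to ? [] : list.slice(from, to + 1));
  }

  del(key: string): void {
    this.lists.delete(key);
  }
}

// Append path: write only the new items, then trim the list back to the newest `limit` entries.
function persistDelta(store: ListStore, key: string, newItems: string[], limit: number): void {
  if (newItems.length === 0) return;
  store.rpush(key, newItems);
  store.ltrim(key, -limit, -1);
}

// Rewrite path: used only when in-memory ordering was rebuilt, so the stored list is replaced wholesale.
function persistRewrite(store: ListStore, key: string, allItems: string[], limit: number): void {
  store.del(key);
  const kept = allItems.slice(-limit);
  if (kept.length > 0) store.rpush(key, kept);
}
```

The append path writes a payload proportional to the delta rather than to the whole window, which is the point of the change. Against a real Redis client these calls would be asynchronous and would likely be grouped in a pipeline or transaction so that readers never observe an untrimmed list.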
+ +
+

services/api/src/index.ts: minute metrics now include memory and live subscription visibility

+
+
+ Plain-text fallback +
Added buildLiveSubscriptionMetrics(), previous snapshot tracking, flush delta logging,
+memory snapshots, and gauges for RSS, heap used, active sockets, and per-channel subscriptions.
+
+
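A per-minute delta report of the kind described above could be derived roughly like this. The sketch is an assumption about shape only: FlushCounters, MemoryReading, MinuteReport, and buildMinuteReport are hypothetical names and do not reflect the actual fields logged by services/api/src/index.ts.

```typescript
// Cumulative counters as they might be tracked since process start (hypothetical shape).
interface FlushCounters {
  items: number;
  bytes: number;
}

// Subset of a memory reading; in Node and Bun, process.memoryUsage() returns an object with these fields.
interface MemoryReading {
  rss: number;
  heapUsed: number;
}

interface MinuteReport {
  flushItemsDelta: number;
  flushBytesDelta: number;
  rssMb: number;
  heapUsedMb: number;
}

// Convert bytes to mebibytes, rounded to one decimal place.
function toMb(bytes: number): number {
  return Math.round((bytes / (1024 * 1024)) * 10) / 10;
}

// Compare the previous and current cumulative counters and attach a memory snapshot.
function buildMinuteReport(
  previous: FlushCounters,
  current: FlushCounters,
  memory: MemoryReading,
): MinuteReport {
  return {
    // Guard against negative deltas if the counters were reset between snapshots.
    flushItemsDelta: Math.max(current.items - previous.items, 0),
    flushBytesDelta: Math.max(current.bytes - previous.bytes, 0),
    rssMb: toMb(memory.rss),
    heapUsedMb: toMb(memory.heapUsed),
  };
}
```

In a service, the caller would keep the last counters in a variable, pass process.memoryUsage() as the memory argument once a minute, and log the returned object. Keeping the function pure makes the delta arithmetic easy to unit test.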
+ +
+

.env.example and apps/web/app/terminal.tsx: safer default windows

+
+
+ Plain-text fallback +
Reduced LIVE_LIMIT_OPTIONS in tracked examples to 100, added LIVE_LIMIT_NEWS=100,
+and lowered the client-exposed maximum live hot windows from 100000 to 2000.
+
+
+
+
+ +
+

Expected Impact for End-Users

+
    +
  • + The hosted app should stop disappearing behind API restarts caused by the kernel OOM killer. +
  • +
  • + Live feeds should still feel current, but the server will retain a tighter hot window instead of + hoarding oversized in-memory histories. +
  • +
  • + The operator experience on the VPS should improve because internal loopback churn is materially lower. +
  • +
+
+ +
+

Validation

+
    +
  • + Local API test gate passed: + bun test services/api/tests/live.test.ts +
  • +
  • + Local web production build passed: + bun --cwd=apps/web run build +
  • +
  • + VPS mitigation applied successfully. Redis reported 1524 live keys removed before restart. +
  • +
  • + After mitigation restart, systemctl --user status islandflow-api.service showed the + API at about 84 MB RSS instead of multi-GB startup drift. +
  • +
  • + After rolling the hardened branch onto the VPS, the API minute log at + 2026-05-22 21:44:11 EDT showed: +
  • +
+
+
+ 119.6 MB + API RSS from the minute memory snapshot +
+
+ 100 + live:options depth +
+
+ 500 + live:flow, live:alerts, and live:equity-quotes caps held +
+
+ 34,559 + Redis flush items in that minute delta +
+
+ 9.18 MB + Redis flush payload bytes in that minute delta +
+
+ No new OOM + Kernel logs after the hardened restart +
+
+
+ +
+

Issues, Limitations, and Mitigations

+
    +
  • + The new minute metrics report cumulative totals alongside per-minute deltas. That is much more useful than the old + absolute counters alone, but the cumulative totals still reset on process restart. +
  • +
  • + snapshotItemsByChannel remains empty when no live websocket clients are connected. + That is expected because snapshots are only recorded when a snapshot is actually served. +
  • +
  • + Quiet feeds such as news and inferred-dark can still show very old freshness ages in logs. + That reflects inactivity, not a broken hot path. +
  • +
  • + The append-based Redis path deliberately falls back to a rewrite when out-of-order live events + require the in-memory ordering to be rebuilt. That keeps correctness ahead of theoretical write minimization. +
  • +
+
+ +
+

Follow-up Work

+
    +
  • + Add explicit alerting for repeated API RSS growth and for minute-level flush deltas that jump far above the new baseline. +
  • +
  • + Decide whether quiet-channel freshness logs should suppress extremely stale values for feeds like news to reduce operator noise. +
  • +
  • + Consider moving the live cache metrics into a dashboard view so operators do not need to parse journal lines manually. +
  • +
+
+
+
+ + + +