Turn Record · May 22, 2026

Stabilize Live API Memory and Internal Traffic

The Islandflow live API was repeatedly OOM-killed on the VPS because the hot live cache could retain oversized channel windows and rewrite whole Redis lists at high frequency. This turn applied an immediate server-side mitigation, hardened the API cache path in code, and rolled the changes onto the native systemd deployment.

Branch: stabilize-live-api-memory
Beads: islandflow-thp
Deployment: Native systemd user services on the VPS
Primary Outcome: API RSS returned to roughly 115-130 MB after rollout

Summary

The live API is now bounded in three layers instead of trusting environment values and reconnect behavior. First, the VPS .env was reset to safer live-window values and the oversized Redis hot-cache keys were cleared. Second, the API now clamps generic live cache limits per channel in code. Third, generic live feed persistence now appends deltas into Redis instead of cloning and rewriting entire lists on every flush.

Observed on the VPS after rollout: the API stayed healthy through restart, minute metrics showed much smaller cache depths, and the kernel did not log any new Bun OOM kill after the hardened restart.

Changes Made

  • Added channel-specific hard caps in services/api/src/live.ts so oversized LIVE_LIMIT_* values are clamped before use.
  • Changed generic live Redis persistence from full-list rewrite behavior to append-plus-trim, with rewrite fallback only when the in-memory ordering has to be rebuilt.
  • Serialized Redis flushes during shutdown so service restarts do not race with a closing Redis client.
  • Added API minute-log visibility for live subscription counts, Redis flush deltas, payload bytes, snapshot sizes, and process memory usage.
  • Tightened the browser-exposed live window caps in apps/web/app/terminal.tsx and aligned the tracked env examples with the safer production defaults, including LIVE_LIMIT_NEWS.
  • Applied the emergency mitigation directly on the VPS: updated /home/delta/islandflow/.env, created /home/delta/islandflow/.env.backup-2026-05-22-2131, deleted stale live:* Redis keys, rebuilt the web app, and restarted islandflow-api.service and islandflow-web.service.
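The append-plus-trim change listed above can be pictured with a minimal sketch. This is not the code in services/api/src/live.ts: ListStore, InMemoryListStore, and flushGeneric are hypothetical names, and the in-memory store only mimics the semantics of the Redis RPUSH, LTRIM, and DEL commands so the sketch runs without a server.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the real live.ts API.
interface ListStore {
  rpush(key: string, values: string[]): void; // append to the tail, like Redis RPUSH
  ltrim(key: string, start: number, stop: number): void; // keep an inclusive range, like LTRIM
  del(key: string): void; // drop the key, like DEL
  lrange(key: string): string[]; // read the whole list
}

class InMemoryListStore implements ListStore {
  private lists = new Map<string, string[]>();
  writes = 0; // values written so far, to compare the two paths

  rpush(key: string, values: string[]): void {
    const list = this.lists.get(key) ?? [];
    list.push(...values);
    this.lists.set(key, list);
    this.writes += values.length;
  }

  ltrim(key: string, start: number, stop: number): void {
    const list = this.lists.get(key) ?? [];
    const len = list.length;
    // Negative indexes count from the tail, as in Redis.
    const from = start < 0 ? Math.max(len + start, 0) : start;
    const to = stop < 0 ? len + stop : Math.min(stop, len - 1);
    this.lists.set(key, from > to ? [] : list.slice(from, to + 1));
  }

  del(key: string): void {
    this.lists.delete(key);
  }

  lrange(key: string): string[] {
    return [...(this.lists.get(key) ?? [])];
  }
}

// Append only the new items and trim to the window, unless a rewrite is forced.
function flushGeneric(
  store: ListStore,
  key: string,
  pending: string[],
  fullWindow: string[],
  limit: number,
  forceRewrite: boolean,
): void {
  if (forceRewrite) {
    // Rewrite path: replace the stored list with the current in-memory window.
    store.del(key);
    const window = fullWindow.slice(-limit);
    if (window.length > 0) store.rpush(key, window);
    return;
  }
  if (pending.length === 0) return;
  // Append path: the write cost scales with the delta, not with the window size.
  store.rpush(key, pending);
  store.ltrim(key, -limit, -1);
}
```

Under these assumptions, a flush that carries one new item writes one value regardless of how deep the window is, which is the property that reduces rewrite churn.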

Context

The kernel OOM killer on the VPS killed islandflow-api.service several times on May 22, 2026. Kernel logs showed Bun reaching roughly 8-9 GiB RSS inside the API service cgroup before the OOM killer stepped in. The API minute logs also showed channel depths pinned at 10000 for multiple feeds, plus massive cumulative Redis rewrite churn.

Most of the “huge bandwidth” in btop was local loopback traffic: Bun talking to Redis, NATS, and ClickHouse on 127.0.0.1. The problem was therefore not a public-edge flood. It was the live cache architecture multiplying internal work on the box.

Important Implementation Details

API hardening

  • Hard caps now bound generic channel windows even if env values drift upward.
  • snapshot_limit is still honored, but only up to the lower of the request, the configured limit, and the safe channel cap.
  • Generic feeds use incremental Redis appends; scoped candle and overlay caches still use full rewrites because they are much smaller and keyed differently.
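The three-way bound on snapshot_limit described above can be sketched as follows. The cap values and the names EXAMPLE_CAPS, DEFAULT_CAP, and effectiveLimit are hypothetical; this record does not list the actual contents of LIVE_GENERIC_LIMIT_CAPS, so the numbers are placeholders only.

```typescript
// Hypothetical caps for illustration; the real per-channel values are not shown in this record.
const EXAMPLE_CAPS: Record<string, number> = { options: 1000, flow: 2000 };
const DEFAULT_CAP = 500; // assumed fallback for channels without an explicit cap

// Treat missing, non-finite, or non-positive inputs as "not set" and use the fallback.
function toPositiveInt(value: number | undefined, fallback: number): number {
  return value !== undefined && Number.isFinite(value) && value >= 1
    ? Math.floor(value)
    : fallback;
}

// Effective limit = min(requested, configured, channel cap).
function effectiveLimit(channel: string, configured: number, requested?: number): number {
  const cap = EXAMPLE_CAPS[channel] ?? DEFAULT_CAP;
  const bounded = Math.min(toPositiveInt(configured, cap), cap);
  return Math.min(toPositiveInt(requested, bounded), bounded);
}
```

With these placeholder caps, effectiveLimit("options", 10000) yields 1000 because the configured value is clamped to the cap, while effectiveLimit("options", 100, 50) yields 50 because a smaller request is still honored.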

Operational changes

  • The VPS now runs with a much smaller hot live footprint: options 100, flow 500, alerts 300, news 100.
  • Old Redis hot-cache keys were deleted so the API did not rehydrate oversized lists on boot.
  • The web app was rebuilt on the VPS checkout after switching that checkout onto stabilize-live-api-memory.

Relevant Diff Snippets

These snippets are rendered with the Diffs library from diffs.com, with a plain-text fallback kept inline in the file.

services/api/src/live.ts: hard caps and append-based generic Redis flushes

Plain-text fallback
Added LIVE_GENERIC_LIMIT_CAPS, clamped env/configured limits, changed generic writes from
queueRedisWrite(items:[...items]) to queueGenericRedisWrite(item, items, forceRewrite), and split
Redis persistence into rewrite and append paths with shutdown-safe flush serialization.

services/api/src/index.ts: minute metrics now include memory and live subscription visibility

Plain-text fallback
Added buildLiveSubscriptionMetrics(), previous snapshot tracking, flush delta logging,
memory snapshots, and gauges for RSS, heap used, active sockets, and per-channel subscriptions.
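The delta logging described above can be illustrated with a small sketch of how a per-minute delta might be derived from cumulative counters alongside a memory reading. The FlushCounters, MemoryReading, and MinuteMetrics types and the buildMinuteMetrics function are assumptions for illustration and do not reproduce the real code in services/api/src/index.ts.

```typescript
// Hypothetical shapes; the real metric field names are not shown in this record.
interface FlushCounters {
  items: number; // cumulative Redis flush items since process start
  payloadBytes: number; // cumulative Redis flush payload bytes since process start
}

interface MemoryReading {
  rss: number; // bytes
  heapUsed: number; // bytes
}

interface MinuteMetrics {
  cumulative: FlushCounters;
  delta: FlushCounters;
  rssMb: number;
  heapUsedMb: number;
}

function buildMinuteMetrics(
  current: FlushCounters,
  previous: FlushCounters | null,
  memory: MemoryReading,
): MinuteMetrics {
  // With no previous snapshot (for example right after a restart), the delta equals the cumulative value.
  const base = previous ?? { items: 0, payloadBytes: 0 };
  const toMb = (bytes: number): number => Math.round((bytes / (1024 * 1024)) * 10) / 10;
  return {
    cumulative: { ...current },
    delta: {
      items: current.items - base.items,
      payloadBytes: current.payloadBytes - base.payloadBytes,
    },
    rssMb: toMb(memory.rss),
    heapUsedMb: toMb(memory.heapUsed),
  };
}
```

In Bun or Node, the memory argument could be taken from process.memoryUsage(), which reports rss and heapUsed in bytes. The caller would keep the last FlushCounters value between ticks and pass it back in as previous.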

.env.example and apps/web/app/terminal.tsx: safer default windows

Plain-text fallback
Reduced LIVE_LIMIT_OPTIONS in tracked examples to 100, added LIVE_LIMIT_NEWS=100,
and lowered the client-exposed maximum live hot windows from 100000 to 2000.

Expected Impact for End-Users

  • The hosted app should stop disappearing behind API restarts caused by the kernel OOM killer.
  • Live feeds should still feel current, but the server will retain a tighter hot window instead of hoarding oversized in-memory histories.
  • The operator experience on the VPS should improve because internal loopback churn is materially lower.

Validation

  • Local API test gate passed: bun test services/api/tests/live.test.ts
  • Local web production build passed: bun --cwd=apps/web run build
  • VPS mitigation applied successfully. Redis reported 1524 live keys removed before restart.
  • After mitigation restart, systemctl --user status islandflow-api.service showed the API at about 84 MB RSS instead of multi-GB startup drift.
  • After rolling the hardened branch onto the VPS, the API minute log at 2026-05-22 21:44:11 EDT showed:
      • API RSS from the minute memory snapshot: 119.6 MB
      • live:options depth: 100
      • live:flow, live:alerts, and live:equity-quotes: caps held at 500
      • Redis flush items in that minute delta: 34,559
      • Redis flush payload bytes in that minute delta: 9.18 MB
  • The kernel logged no new OOM kills after the hardened restart.

Issues, Limitations, and Mitigations

  • The new minute metrics report both cumulative counters and per-minute deltas. They are much more useful than the old absolute counters, but the cumulative values still reset on process restart.
  • snapshotItemsByChannel remains empty when no live websocket clients are connected. That is expected because snapshots are only recorded when a snapshot is actually served.
  • Quiet feeds such as news and inferred-dark can still show very old freshness ages in logs. That reflects inactivity, not a broken hot path.
  • The append-based Redis path deliberately falls back to a rewrite when out-of-order live events require the in-memory ordering to be rebuilt. That keeps correctness ahead of theoretical write minimization.
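The rewrite fallback for out-of-order events can be sketched as a decision made at insert time. This sketch assumes items are ordered oldest-first by a numeric timestamp; the actual ordering key, the LiveItem shape, and the insertLive function are hypothetical and are not taken from live.ts.

```typescript
// Hypothetical item shape; the real ordering key is an assumption.
interface LiveItem {
  ts: number; // event timestamp used for ordering
  id: string;
}

// Inserts an item into an oldest-first window and reports whether the
// persisted list can be extended by append or has to be rewritten.
function insertLive(items: LiveItem[], item: LiveItem, limit: number): { forceRewrite: boolean } {
  const last = items[items.length - 1];
  let forceRewrite = false;

  if (last === undefined || item.ts >= last.ts) {
    // In-order arrival: the stored list only needs the new tail element.
    items.push(item);
  } else {
    // Out-of-order arrival: the item belongs in the middle, so the stored
    // list no longer matches a simple append of the in-memory window.
    const index = items.findIndex((existing) => existing.ts > item.ts);
    items.splice(index, 0, item);
    forceRewrite = true;
  }

  // Keep only the newest `limit` items.
  if (items.length > limit) items.splice(0, items.length - limit);
  return { forceRewrite };
}
```

The design choice mirrors the trade-off stated above: the common in-order case stays cheap, and the rarer out-of-order case pays for a rewrite so the stored list never diverges from the in-memory ordering.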

Follow-up Work

  • Add explicit alerting for repeated API RSS growth and for minute-level flush deltas that jump far above the new baseline.
  • Decide whether quiet-channel freshness logs should suppress extremely stale values for feeds like news to reduce operator noise.
  • Consider moving the live cache metrics into a dashboard view so operators do not need to parse journal lines manually.