Turn Record · May 22, 2026
Stabilize Live API Memory and Internal Traffic
The Islandflow live API was repeatedly getting OOM-killed on the VPS because the hot live
cache could retain oversized channel windows and rewrite whole Redis lists at high
frequency. This turn applied an immediate server-side mitigation, hardened the API cache
path in code, and rolled the changes onto the native systemd deployment.
Summary
The live API is now bounded in three layers instead of trusting environment values and
reconnect behavior. First, the VPS .env was reset to safer live-window
values and the oversized Redis hot-cache keys were cleared. Second, the API now clamps
generic live cache limits per channel in code. Third, generic live feed persistence now
appends deltas into Redis instead of cloning and rewriting entire lists on every flush.
Observed on the VPS after rollout: the API stayed healthy through restart, minute metrics
showed much smaller cache depths, and the kernel did not log any new Bun OOM kill after the
hardened restart.
Changes Made
- Added channel-specific hard caps in services/api/src/live.ts so oversized LIVE_LIMIT_*
  values are clamped before use.
- Changed generic live Redis persistence from full-list rewrite behavior to append-plus-trim,
  with rewrite fallback only when the in-memory ordering has to be rebuilt.
- Serialized Redis flushes during shutdown so service restarts do not race with a closing
  Redis client.
- Added API minute-log visibility for live subscription counts, Redis flush deltas,
  payload bytes, snapshot sizes, and process memory usage.
- Tightened the browser-exposed live window caps in apps/web/app/terminal.tsx and aligned
  the tracked env examples with the safer production defaults, including LIVE_LIMIT_NEWS.
- Applied the emergency mitigation directly on the VPS: updated /home/delta/islandflow/.env,
  created /home/delta/islandflow/.env.backup-2026-05-22-2131, deleted stale live:* Redis keys,
  rebuilt the web app, and restarted islandflow-api.service and islandflow-web.service.
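One way to picture the serialized shutdown flush from the list above is a promise chain that every flush joins, with shutdown waiting for the chain to drain before it closes the client. This is a minimal sketch, not the code in services/api/src/live.ts: the class name SerializedFlusher and its write and closeClient callbacks are hypothetical stand-ins for the real Redis calls, and dropping writes that arrive after shutdown begins is an assumed policy.

```typescript
// Hypothetical callback types standing in for the real Redis client calls.
type WriteFn = (batch: string[]) => Promise<void>;
type CloseFn = () => Promise<void>;

class SerializedFlusher {
  // Every flush is chained onto this promise, so flushes never overlap.
  private tail: Promise<void> = Promise.resolve();
  private closing = false;
  private readonly write: WriteFn;
  private readonly closeClient: CloseFn;

  constructor(write: WriteFn, closeClient: CloseFn) {
    this.write = write;
    this.closeClient = closeClient;
  }

  enqueue(batch: string[]): Promise<void> {
    // Assumed policy: ignore writes that arrive once shutdown has started.
    if (this.closing) return Promise.resolve();
    const run = this.tail.then(() => this.write(batch));
    // Keep the chain usable even if one flush rejects.
    this.tail = run.catch(() => undefined);
    return run;
  }

  async shutdown(): Promise<void> {
    this.closing = true;
    await this.tail; // drain queued flushes first
    await this.closeClient(); // only then close the client
  }
}
```

Because shutdown awaits the tail of the chain before calling closeClient, a flush that is already queued cannot run against a client that is being closed.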
Context
The VPS killed islandflow-api.service several times on May 22, 2026.
Kernel logs showed Bun reaching roughly 8-9 GiB RSS inside the API service cgroup before
the OOM killer stepped in. The API minute logs also showed channel depths pinned at
10000 for multiple feeds, plus massive cumulative Redis rewrite churn.
Most of the “huge bandwidth” in btop was local loopback traffic: Bun talking
to Redis, NATS, and ClickHouse on 127.0.0.1. The problem was therefore not a
public-edge flood; the live cache architecture was multiplying internal work on the box.
Important Implementation Details
API hardening
- Hard caps now bound generic channel windows even if env values drift upward.
- snapshot_limit is still honored, but only up to the lower of the request,
  the configured limit, and the safe channel cap.
- Generic feeds use incremental Redis appends; scoped candle and overlay caches still
  use full rewrites because they are much smaller and keyed differently.
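The clamping rules above reduce to two small min() computations. The sketch below is illustrative only: the function names are hypothetical, and the fallback and cap values used in the usage lines are assumptions rather than the values in LIVE_GENERIC_LIMIT_CAPS.

```typescript
// Parse an env-style limit and clamp it to the channel's hard cap.
// Unset, non-numeric, or non-positive input falls back to the default.
function clampConfiguredLimit(
  raw: string | undefined,
  fallback: number,
  cap: number,
): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  const value = Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
  return Math.min(value, cap);
}

// Honor a requested snapshot_limit only up to the lowest of the request,
// the configured limit, and the channel cap.
function effectiveSnapshotLimit(
  requested: number | undefined,
  configured: number,
  cap: number,
): number {
  const ceiling = Math.min(configured, cap);
  if (requested === undefined || !Number.isFinite(requested) || requested <= 0) {
    return ceiling;
  }
  return Math.min(Math.floor(requested), ceiling);
}

// Usage with assumed numbers: an env value of 10000 against a cap of 500.
const configuredFlow = clampConfiguredLimit("10000", 500, 500);
const servedFlow = effectiveSnapshotLimit(100000, configuredFlow, 500);
```

With these assumed inputs, both configuredFlow and servedFlow end up at 500, so neither a drifting env value nor an oversized request can widen the window past the cap.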
Operational changes
- The VPS now runs with a much smaller hot live footprint: options 100, flow 500,
  alerts 300, news 100.
- Old Redis hot-cache keys were deleted so the API did not rehydrate oversized lists on boot.
- The web app was rebuilt on the VPS checkout after switching that checkout onto
  stabilize-live-api-memory.
Relevant Diff Snippets
These snippets are rendered with the Diffs library from
diffs.com, with a plain-text fallback kept inline in the file.
services/api/src/live.ts: hard caps and append-based generic Redis flushes
Plain-text fallback
Added LIVE_GENERIC_LIMIT_CAPS, clamped env/configured limits, changed generic writes from
queueRedisWrite(items:[...items]) to queueGenericRedisWrite(item, items, forceRewrite), and split
Redis persistence into rewrite and append paths with shutdown-safe flush serialization.
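The difference between the rewrite and append paths can be sketched against an in-memory stand-in for a Redis list, so the example runs without a server. Everything here is hypothetical: FakeListStore, PendingFlush, and flushChannel are illustrative names, the store methods only mirror the behavior of RPUSH and LTRIM, and the sketch assumes the newest item sits at the tail of the list and that limit is at least 1.

```typescript
// In-memory stand-in for a Redis list store (not a real client).
class FakeListStore {
  private lists = new Map<string, string[]>();
  writes = 0; // elements written, to compare the cost of the two paths

  // Mirrors RPUSH: append values at the tail.
  rpush(key: string, values: string[]): void {
    const list = this.lists.get(key) ?? [];
    list.push(...values);
    this.lists.set(key, list);
    this.writes += values.length;
  }

  // Mirrors LTRIM with negative indexes: keep only the last `limit` entries.
  keepLast(key: string, limit: number): void {
    const list = this.lists.get(key) ?? [];
    this.lists.set(key, list.slice(-limit));
  }

  // Mirrors deleting the key and pushing the whole window again.
  rewrite(key: string, values: string[]): void {
    this.lists.set(key, [...values]);
    this.writes += values.length;
  }

  read(key: string): string[] {
    return [...(this.lists.get(key) ?? [])];
  }
}

interface PendingFlush {
  appended: string[]; // items added since the last flush
  forceRewrite: boolean; // set when the in-memory ordering was rebuilt
}

function flushChannel(
  store: FakeListStore,
  key: string,
  window: string[],
  pending: PendingFlush,
  limit: number,
): void {
  if (pending.forceRewrite) {
    // Fallback path: the stored order is no longer trustworthy.
    store.rewrite(key, window.slice(-limit));
  } else if (pending.appended.length > 0) {
    // Hot path: write only the delta, then trim to the window size.
    store.rpush(key, pending.appended);
    store.keepLast(key, limit);
  }
  pending.appended = [];
  pending.forceRewrite = false;
}
```

On the append path the write cost scales with the number of new items per flush, while the rewrite path always pays for the full window. That is why the rewrite is reserved for the case where ordering has to be rebuilt.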
services/api/src/index.ts: minute metrics now include memory and live subscription visibility
Plain-text fallback
Added buildLiveSubscriptionMetrics(), previous snapshot tracking, flush delta logging,
memory snapshots, and gauges for RSS, heap used, active sockets, and per-channel subscriptions.
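The previous-snapshot tracking behind the flush deltas amounts to subtracting the last cumulative reading from the current one. The sketch below is an assumption about the shape of that calculation, not the code in services/api/src/index.ts: FlushCounters and flushDelta are hypothetical names, and the counter values in the usage lines are made up.

```typescript
// Cumulative counters as they might be sampled once per minute.
interface FlushCounters {
  items: number;
  payloadBytes: number;
}

// Return the change since the previous sample. With no previous sample
// (for example right after a process restart) the delta is the full count.
function flushDelta(
  previous: FlushCounters | undefined,
  current: FlushCounters,
): FlushCounters {
  if (previous === undefined) {
    return { items: current.items, payloadBytes: current.payloadBytes };
  }
  return {
    items: current.items - previous.items,
    payloadBytes: current.payloadBytes - previous.payloadBytes,
  };
}

// Usage with made-up cumulative readings from two consecutive minutes.
const firstMinute: FlushCounters = { items: 1000, payloadBytes: 250000 };
const secondMinute: FlushCounters = { items: 1600, payloadBytes: 410000 };
const delta = flushDelta(firstMinute, secondMinute);
```

Here delta describes only the second minute's work (600 items and 160000 bytes), which is easier to compare against a baseline than an ever-growing total.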
.env.example and apps/web/app/terminal.tsx: safer default windows
Plain-text fallback
Reduced LIVE_LIMIT_OPTIONS in tracked examples to 100, added LIVE_LIMIT_NEWS=100,
and lowered the client-exposed maximum live hot windows from 100000 to 2000.
Expected Impact for End-Users
- The hosted app should stop disappearing behind API restarts caused by the kernel OOM killer.
- Live feeds should still feel current, but the server will retain a tighter hot window instead of
  hoarding oversized in-memory histories.
- The operator experience on the VPS should improve because internal loopback churn is materially lower.
Validation
- Local API test gate passed: bun test services/api/tests/live.test.ts
- Local web production build passed: bun --cwd=apps/web run build
- VPS mitigation applied successfully. Redis reported 1524 live keys removed before restart.
- After mitigation restart, systemctl --user status islandflow-api.service showed the
  API at about 84 MB RSS instead of multi-GB startup drift.
- After rolling the hardened branch onto the VPS, the API minute log at
  2026-05-22 21:44:11 EDT showed:
  API RSS from the minute memory snapshot: 119.6 MB
  live:options depth: 100
  live:flow, live:alerts, and live:equity-quotes caps held: 500
  Redis flush items in that minute delta: 34,559
  Redis flush payload bytes in that minute delta: 9.18 MB
  Kernel logs after the hardened restart: no new OOM
Issues, Limitations, and Mitigations
- The new minute metrics are cumulative plus delta-based. They are much more useful than the old
  absolute counters, but they still reset on process restart.
- snapshotItemsByChannel remains empty when no live websocket clients are connected.
  That is expected because snapshots are only recorded when a snapshot is actually served.
- Quiet feeds such as news and inferred-dark can still show very old freshness ages in logs.
  That reflects inactivity, not a broken hot path.
- The append-based Redis path deliberately falls back to a rewrite when out-of-order live events
  require the in-memory ordering to be rebuilt. That keeps correctness ahead of theoretical write minimization.
Follow-up Work
- Add explicit alerting for repeated API RSS growth and for minute-level flush deltas that jump far above the new baseline.
- Decide whether quiet-channel freshness logs should suppress extremely stale values for feeds like news to reduce operator noise.
- Consider moving the live cache metrics into a dashboard view so operators do not need to parse journal lines manually.