Plan Document

Synthetic Market-Data Architecture Review

A plan-mode architecture review for making synthetic market data deterministic, reusable, and useful across fixtures, replay, tests, demos, and load profiles without replacing the working Islandflow event pipeline.

Source: markdown review | Mode: Plan | Recommendation: Option B

Review Sections

Summary

  • Target file: docs/plans/synthetic-market-data-architecture-review.md. No files were changed in the Plan Mode pass.
  • Recommendation: Option B (Refactor). The conservative option would leave determinism trapped inside the ingest adapters; a full redesign is premature.
  • Core direction: build a no-history, seeded, manifest-driven synthetic event engine with canonical real event types, separate labels and manifests, deterministic replay, fixture generation, load profiles, and demo scenarios.

Direct Answers

  1. Synthetic generation should be a combination: a reusable @islandflow/synthetic-market package, a CLI for fixture and run generation, replay-source integration, test fixture helpers, and demo presets. A service should be only a thin live or demo emitter.

  2. Synthetic events should map to existing canonical event types: OptionPrint, OptionNBBO, EquityPrint, and EquityQuote. Do not create parallel synthetic-only market event types for the main pipeline.

  3. Use metadata plus isolation, not permanent separate business schemas. Add provenance such as source_kind, run_id, parameter_snapshot_hash, and optional scenario_id; use run-scoped subjects and databases for tests and load runs when isolation matters.

  4. Ground-truth labels should be separate label records keyed by run_id, scenario_id, event IDs or trace IDs, expected class, expected direction, confidence band, required or forbidden evidence, and false-positive penalties. Do not expose hidden labels on emitted market events.

  5. Expected-output manifests should be versioned JSON or YAML artifacts produced by the CLI. They should pin seed bundle, generator version, parameter snapshot hash, generated event hashes, replay ordering, expected derived events, alert or no-alert expectations, and evidence requirements.

  6. Deterministic replay should consume either generated fixture files directly or materialized ClickHouse rows through the same replay ordering: event time, ingest time, sequence, stable event ID. Replay should support a synthetic source and run selector.

  7. Tests should use synthetic data at three levels: pure package invariants, small golden manifests through compute batch logic, and optional infra-backed NATS and ClickHouse integration tests. bun test should not require Docker.

  8. Demos should use named demo runs and scenarios, not ambient live randomness. Keep the hosted synthetic control drawer for live demo tuning, but add deterministic demo run selection and replay.

  9. First-class domain objects: SyntheticRun, SeedBundle, ParameterSnapshot, SymbolProfile, LiquidityProfile, VolatilityRegime, OptionChainProfile, ScenarioInjection, GroundTruthLabel, ExpectedOutputManifest, GeneratedEventBatch, ReplayPlan, LoadProfile, and DemoProfile.

  10. Implementation details include PRNG algorithm internals, sampling formulas, placement heuristics, adapter timers, NATS consumer names, Redis rolling windows, ClickHouse loader mechanics, UI labels, and cache policy.
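The split between provenance metadata on events and separate ground-truth label records (answers 3 and 4) might look roughly like the sketch below. Only source_kind, run_id, parameter_snapshot_hash, and scenario_id come from this review; the SourceKind values, the OptionPrintLike stand-in, the label field names, and labelsForEvent are hypothetical and chosen for illustration.

```typescript
// Hypothetical shapes for illustration. Only source_kind, run_id,
// parameter_snapshot_hash, and scenario_id are named in the plan; the value
// sets and every other field name are assumptions.
type SourceKind = "live" | "synthetic" | "replay";

interface Provenance {
  source_kind: SourceKind;
  run_id: string;
  parameter_snapshot_hash: string;
  scenario_id?: string;
}

// Simplified stand-in for the canonical OptionPrint event (real fields unknown).
interface OptionPrintLike {
  event_id: string;
  ts: number;
  symbol: string;
  price: number;
  size: number;
  provenance?: Provenance;
}

// Labels live in their own records and point at events by ID.
interface GroundTruthLabel {
  run_id: string;
  scenario_id: string;
  event_ids: string[];
  expected_class: string;
  expected_direction: "bullish" | "bearish" | "neutral";
  confidence_band: [number, number];
  required_evidence: string[];
  forbidden_evidence: string[];
  false_positive_penalty: number;
}

// Look up labels for an event without ever attaching them to the event itself.
function labelsForEvent(
  labels: GroundTruthLabel[],
  event: OptionPrintLike
): GroundTruthLabel[] {
  const runId = event.provenance?.run_id;
  if (runId === undefined) return []; // events without provenance have no labels
  return labels.filter(
    (l) => l.run_id === runId && l.event_ids.indexOf(event.event_id) !== -1
  );
}
```

Because the join happens by run_id and event ID, the pipeline under test sees the same event shape whether or not a label exists for it.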

Area Classification

Existing replay architecture

Refactor

Keep event-time merge and stream publishing; add generated-stream sources, run IDs, manifests, and deterministic output comparison.

Event schemas

Refactor

Keep canonical raw and derived event shapes; add provenance metadata and separate label and manifest schemas.

Service boundaries

Refactor

Move generator logic out of ingest adapters into a package; adapters become thin emitters.

Test structure

Redesign

Current tests are unit-heavy and adapter-local; add fixture manifests, golden outputs, and batch replay checks.

ClickHouse fixture strategy

Refactor

Keep storage helpers; add run-scoped fixture loaders and optional run metadata, not permanent synthetic clone tables.

NATS and JetStream

Keep and Refactor

Keep canonical subjects for production behavior; support isolated subject prefixes or disposable streams for tests and load.

Redis baseline interaction

Refactor

Keep Redis for live rolling state; golden tests should use in-memory or resettable baselines.

UI and demo needs

Refactor

Keep replay UI and synthetic admin rail; add named deterministic demo modes and scenario selectors.

CI feasibility

Keep and Refactor

Keep fast Bun CI; make synthetic package and golden tests infra-free and defer Docker integration to a separate job.
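The isolated subject prefixes and run-scoped databases mentioned under NATS and JetStream and under ClickHouse fixture strategy could be derived from a run ID along these lines. This is a minimal sketch: the `synthetic.` prefix, the `_run_` database suffix, the run ID alphabet, and the example subject names are all assumptions, not the project's actual naming.

```typescript
// Hypothetical naming scheme; the plan asks for isolated subject prefixes and
// run-scoped databases but does not specify their format.
const RUN_ID_PATTERN = /^[A-Za-z0-9_-]+$/;

function assertRunId(runId: string): void {
  // NATS subject tokens are separated by "." and may not contain whitespace or
  // the "*" and ">" wildcards, so the run ID is restricted to a safe alphabet.
  if (!RUN_ID_PATTERN.test(runId)) {
    throw new Error(`invalid run_id: ${runId}`);
  }
}

// Production callers pass no run ID and get the canonical subject unchanged.
function scopedSubject(canonical: string, runId?: string): string {
  if (runId === undefined) return canonical;
  assertRunId(runId);
  return `synthetic.${runId}.${canonical}`;
}

// Derive a disposable ClickHouse database name for a test or load run.
// Hyphens are mapped to underscores so the name needs no quoting.
function scopedDatabase(base: string, runId: string): string {
  assertRunId(runId);
  return `${base}_run_${runId.replace(/-/g, "_")}`;
}
```

For example, `scopedSubject("md.options.print", "demo-01")` yields `synthetic.demo-01.md.options.print`, while the same call without a run ID returns the canonical subject, so production behavior is untouched.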

Options

Option A: Conservative

Wrap current synthetic ingest adapters with minimal metadata, a small fixture CLI, and a few golden tests.

  • Pros

    Fastest, least migration, preserves current demos.

  • Cons

    Determinism remains mixed with wall-clock timers and live adapter behavior; labels and manifests stay bolted on.

  • Complexity

    Low to medium.

  • Migration Risk

    Low.

  • PR Sequence

    Add metadata schemas; add CLI wrapper; add fixture files; add basic replay filters; add initial golden tests.

Option C: Redesign

Rebuild around a unified deterministic event-log architecture where generation, replay, live demo, storage, and tests all consume run-partitioned event logs.

  • Pros

    Cleanest long-term model; excellent determinism, provenance, and replay semantics.

  • Cons

    Too much rebuild for pre-alpha; delays product learning.

  • Complexity

    High.

  • Migration Risk

    High.

  • PR Sequence

    Define event log and envelope; implement generator; rebuild replay; rebuild storage materialization; port compute; port API and UI; retire old ingest paths.

What Gets Better Or Worse

A: Conservative

  • Better

    Quick smoke fixtures, basic provenance, modest replay demos.

  • Worse

    Long-term generator quality, test reliability, scenario authoring.

  • Kept

    Current ingest adapters, bus, storage, API, and web mostly unchanged.

  • Rewritten

    Small parts of synthetic adapters and tests.

  • Deleted Or Deferred

    Deep replay refactor, new package boundary, batch harness.

B: Refactor

  • Better

    Seeded runs, profiles, labels, manifests, replay, golden tests, load profiles.

  • Worse

    Short-term churn and some duplicated paths during migration.

  • Kept

    Canonical event schemas, NATS subjects, ClickHouse helpers, compute classifiers, API replay endpoints, web replay shell.

  • Rewritten

    Synthetic options and equities adapters, synthetic control state, replay source abstraction, tests around synthetic scenarios.

  • Deleted Or Deferred

    Adapter-local scenario catalog after migration; full LOB, agent, or ML simulation.

C: Redesign

  • Better

    Architecture purity, reproducible environments, run isolation.

  • Worse

    Delivery speed, disruption, operational risk.

  • Kept

    Some compute, classifier, and domain logic plus UI concepts.

  • Rewritten

    Replay, ingest, storage partitioning, bus topology, fixture and test harness.

  • Deleted Or Deferred

    Current synthetic adapters, current replay service shape, much of current live and demo plumbing.

Recommendation

Choose Option B. Option A is a patch, and it will keep producing impressive-looking but untrustworthy demos. Option C is architecture vanity for a pre-alpha product.

Option B is the grown-up move: extract the generator into a deterministic package, keep the useful event pipeline, and make replay, tests, and demos consume the same generated runs.

First-Class Domain Objects

  • SyntheticRun
  • SeedBundle
  • ParameterSnapshot
  • SymbolProfile
  • LiquidityProfile
  • VolatilityRegime
  • OptionChainProfile
  • ScenarioInjection
  • GroundTruthLabel
  • ExpectedOutputManifest
  • GeneratedEventBatch
  • ReplayPlan
  • LoadProfile
  • DemoProfile
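As one illustration of how a few of these objects could relate, the sketch below derives a parameter_snapshot_hash for a SyntheticRun from a ParameterSnapshot. The field names, the key-sorted serialization, and the FNV-1a hash are assumptions made for the example; a real implementation would more likely use a cryptographic hash such as SHA-256.

```typescript
// Hypothetical shapes; the plan names these objects but not their fields.
type Json = string | number | boolean | null | Json[] | { [key: string]: Json };

interface SeedBundle { master: number; streams: { [stream: string]: number }; }
interface ParameterSnapshot { generator_version: string; params: Json; }
interface SyntheticRun {
  run_id: string;
  seed_bundle: SeedBundle;
  parameter_snapshot_hash: string;
}

// Serialize with sorted object keys so equal snapshots always produce equal text.
function stableStringify(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map(stableStringify).join(",") + "]";
  const keys = Object.keys(value).sort();
  return (
    "{" +
    keys.map((k) => JSON.stringify(k) + ":" + stableStringify(value[k] as Json)).join(",") +
    "}"
  );
}

// 32-bit FNV-1a over UTF-16 code units; a small stand-in for a real content hash.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return ("00000000" + (h >>> 0).toString(16)).slice(-8);
}

function snapshotHash(snapshot: ParameterSnapshot): string {
  return fnv1a(
    stableStringify({
      generator_version: snapshot.generator_version,
      params: snapshot.params,
    })
  );
}

function createRun(runId: string, seeds: SeedBundle, snapshot: ParameterSnapshot): SyntheticRun {
  return { run_id: runId, seed_bundle: seeds, parameter_snapshot_hash: snapshotHash(snapshot) };
}
```

Sorting keys before hashing means two snapshots that differ only in property order map to the same hash, which keeps manifests stable across serializers.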

Test Plan

Unit

PRNG determinism, profile normalization, tick validity, quote and trade invariants, option chain sparsity, label and manifest schema parsing.
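A unit-level determinism check could be as small as the sketch below. The review treats PRNG internals as an implementation detail and does not name an algorithm, so mulberry32 is used here purely as a well-known stand-in; sampleTicks, its starting price, and its step range are hypothetical.

```typescript
// mulberry32: a small, well-known 32-bit seeded PRNG, used here only as a
// stand-in; the plan does not name a PRNG algorithm.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Random walk expressed in integer ticks, so every price is tick-valid by
// construction and free of floating-point rounding drift.
function sampleTicks(seed: number, n: number): number[] {
  const rand = mulberry32(seed);
  const out: number[] = [];
  let priceInTicks = 10000; // hypothetical starting price, e.g. 100.00 at a 0.01 tick
  for (let i = 0; i < n; i++) {
    priceInTicks += Math.floor(rand() * 5) - 2; // step in [-2, 2] ticks
    out.push(priceInTicks);
  }
  return out;
}

// Two generators built from the same seed must agree on every draw.
function sameSeedSameOutput(seed: number, n: number): boolean {
  const a = sampleTicks(seed, n);
  const b = sampleTicks(seed, n);
  return a.length === b.length && a.every((v, i) => v === b[i]);
}
```

The generator takes its seed as an argument and never reads the wall clock or Math.random, which is the property that lets this run under plain bun test.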

Golden

A fixed seed plus manifest produces byte-stable or hash-stable raw events and stable smart-money and alert signatures.
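A golden check of this kind can be sketched as: regenerate events from the seed pinned in the manifest, hash them in a fixed field order, and compare against the pinned hash. Everything named below (RawEvent, ManifestSlice, generateEvents, the LCG, and the FNV-1a hash) is a hypothetical stand-in; the real manifest pins more fields than this slice shows.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface RawEvent { event_id: string; ts: number; symbol: string; price_cents: number; size: number; }
interface ManifestSlice { manifest_version: number; seed: number; event_count: number; events_hash: string; }

// 32-bit FNV-1a over UTF-16 code units; a stand-in for a real content hash.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return ("00000000" + (h >>> 0).toString(16)).slice(-8);
}

// Toy generator driven by a linear congruential generator; no wall clock.
function generateEvents(seed: number, count: number): RawEvent[] {
  let state = seed >>> 0;
  const next = (): number => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state;
  };
  const events: RawEvent[] = [];
  for (let i = 0; i < count; i++) {
    events.push({
      event_id: `evt-${seed}-${i}`,
      ts: i * 10, // logical event time
      symbol: "TEST",
      price_cents: 10000 + (next() % 200),
      size: 1 + (next() % 50),
    });
  }
  return events;
}

// Serialize fields in a fixed order so the hash ignores object key order.
function hashEvents(events: RawEvent[]): string {
  const lines = events.map((e) =>
    [e.event_id, e.ts, e.symbol, e.price_cents, e.size].join("|")
  );
  return fnv1a(lines.join("\n"));
}

function buildManifest(seed: number, count: number): ManifestSlice {
  const events = generateEvents(seed, count);
  return { manifest_version: 1, seed, event_count: count, events_hash: hashEvents(events) };
}

// Golden check: regenerate from the pinned inputs and compare hashes.
function verifyManifest(manifest: ManifestSlice): boolean {
  const events = generateEvents(manifest.seed, manifest.event_count);
  return hashEvents(events) === manifest.events_hash;
}
```

Prices are kept as integer cents in this sketch so that serialization cannot vary with floating-point formatting.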

Replay

Synthetic source ordering matches manifest; derived outputs match expected-output manifest.
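The replay ordering named in the direct answers (event time, ingest time, sequence, stable event ID) can be expressed as a single total-order comparator, sketched below. The field names event_ts, ingest_ts, seq, and event_id are assumptions; the tie-break on event ID uses plain code-unit comparison rather than locale-aware collation so the order does not depend on the runtime's locale.

```typescript
// Hypothetical key fields; the plan names the ordering, not the field names.
interface ReplayKey {
  event_ts: number;
  ingest_ts: number;
  seq: number;
  event_id: string;
}

// Total order: event time, then ingest time, then sequence, then event ID.
function compareReplay(a: ReplayKey, b: ReplayKey): number {
  return (
    a.event_ts - b.event_ts ||
    a.ingest_ts - b.ingest_ts ||
    a.seq - b.seq ||
    (a.event_id < b.event_id ? -1 : a.event_id > b.event_id ? 1 : 0)
  );
}

// Sort a copy so callers can keep the original batch for comparison.
function replayOrder<T extends ReplayKey>(events: T[]): T[] {
  return [...events].sort(compareReplay);
}
```

Because the comparator only returns 0 for events with identical keys, fixture files and ClickHouse rows that carry the same keys replay in the same order regardless of how they were stored.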

Integration

Optional NATS and ClickHouse run-scoped fixture test behind a non-default CI job.

Demo

Named demo profiles render in replay UI; load profile scales rates without changing event semantics.
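One possible reading of "scales rates without changing event semantics" is that a load profile compresses or stretches the emit schedule while every event keeps its ID, event time, and payload. The sketch below illustrates that reading only; rate_multiplier, ScheduledEmit, and scheduleEmits are hypothetical names, and the input is assumed to be sorted by event time already.

```typescript
// Hypothetical shapes; the plan does not define LoadProfile fields.
interface GeneratedEvent { event_id: string; ts: number; payload: string; }
interface LoadProfile { name: string; rate_multiplier: number; }
interface ScheduledEmit { event: GeneratedEvent; emit_offset_ms: number; }

// Compute when each event should be emitted relative to the start of the run.
// A multiplier of 4 emits the same events four times faster.
function scheduleEmits(events: GeneratedEvent[], profile: LoadProfile): ScheduledEmit[] {
  if (!(profile.rate_multiplier > 0)) {
    throw new Error("rate_multiplier must be positive");
  }
  if (events.length === 0) return [];
  const start = events[0].ts; // assumes events are already in event-time order
  return events.map((event) => ({
    event, // passed through untouched: same ID, same event time, same payload
    emit_offset_ms: (event.ts - start) / profile.rate_multiplier,
  }));
}
```

Keeping the schedule separate from the event means downstream hashes and manifests stay valid under any load profile.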

Assumptions

MVP remains no-history-first.

Canonical real event schemas remain the pipeline contract.

Hidden labels are never embedded directly in market events.

Infra-backed tests are useful, but the first synthetic quality gate must pass in plain bun test.