document research basis for phase plans

dirtydishes 2026-06-16 13:53:54 -04:00
parent eaa22de302
commit 412c8b8af9
19 changed files with 332 additions and 4 deletions

@@ -2,6 +2,14 @@
This roadmap breaks `docs/plans/synthetic-market-data-architecture-review.md` into implementation-sized phases. The recommended direction is still Option B: extract deterministic synthetic generation into a first-class reusable engine while keeping the useful NATS, ClickHouse, compute, API, replay, and web stack.
## Source Documents
- Architecture plan: [`docs/plans/synthetic-market-data-architecture-review.md`](../../plans/synthetic-market-data-architecture-review.md)
- Research report: [`docs/research-docs/synthetic-market-data-generation.md`](../../research-docs/synthetic-market-data-generation.md)
- Research architecture review copy: [`docs/research-docs/synthetic-data-architecture-review.md`](../../research-docs/synthetic-data-architecture-review.md)
The research documents are background and rationale only. Scope comes from the Beads issue and the phase document.
## Core Constraints
- Emit canonical market event types: `OptionPrint`, `OptionNBBO`, `EquityPrint`, and `EquityQuote`.

@@ -8,6 +8,23 @@ Create the reusable deterministic foundation for synthetic market data. This pha
Everything else depends on reproducible raw events. Manifests, labels, replay, demos, and smart-flow tests are only trustworthy if the same seed/profile bundle produces the same canonical market event stream every time.
## Source documents
- Architecture plan: [`docs/plans/synthetic-market-data-architecture-review.md`](../../plans/synthetic-market-data-architecture-review.md)
- Research report: [`docs/research-docs/synthetic-market-data-generation.md`](../../research-docs/synthetic-market-data-generation.md)
These documents are rationale, not added scope. This phase implements only the deterministic spine described below.
## Research basis
- The research recommends a no-history-first, transparent, deterministic generator rather than historical replay as an MVP prerequisite.
- The generator needs core market realism handles from the start: discrete ticks, varying spreads, clustered arrivals, heterogeneous sizes, quote/trade separation, and options-chain sparsity.
- Full agent-based, limit-order-book, and generative-ML simulation are too heavy for the first foundation.
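
As a rough illustration of the deterministic spine, the sketch below derives every value from one seeded RNG and never reads the wall clock or global random state. The field names, the `generate` function, the `SYN` symbol, and all probabilities are assumptions made for this example, not the project's actual schema or profile format, and options-chain sparsity is left out for brevity.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EquityQuote:
    ts_ns: int
    symbol: str
    bid_cents: int  # integer cents keep prices on a discrete tick grid
    ask_cents: int


@dataclass(frozen=True)
class EquityPrint:
    ts_ns: int
    symbol: str
    price_cents: int
    size: int


def generate(seed: int, symbol: str = "SYN", steps: int = 200) -> list:
    """Return a reproducible quote/print stream for one hypothetical symbol."""
    rng = random.Random(seed)  # private RNG: no global state, no wall clock
    ts_ns, mid, burst = 0, 10_000, False
    events = []
    for _ in range(steps):
        # Clustered arrivals: flip between a calm and a bursty regime.
        if rng.random() < 0.05:
            burst = not burst
        mean_gap_ms = 2.0 if burst else 150.0
        ts_ns += 1 + int(rng.expovariate(1.0 / mean_gap_ms) * 1_000_000)
        # Discrete ticks and a spread that varies between 1 and 4 cents.
        mid += rng.choice((-1, 0, 0, 1))
        spread = rng.choice((1, 1, 2, 4))
        bid, ask = mid, mid + spread
        events.append(EquityQuote(ts_ns, symbol, bid, ask))
        # Quote/trade separation: only some quote updates lead to a print.
        if rng.random() < 0.3:
            # Heterogeneous sizes: odd lots, round lots, occasional blocks.
            size = rng.choice((rng.randint(1, 99), 100, 100, 500, 10_000))
            ts_ns += 1
            events.append(EquityPrint(ts_ns, symbol, rng.choice((bid, ask)), size))
    return events
```

Calling `generate(42)` twice yields equal lists, which is the property the later manifest and replay phases rely on.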
## Deferred research ideas
- Full LOB simulation, agent-based simulation, generative ML, and empirical calibration stay out of this phase.
## Dependencies on earlier phases
None. This is the first synthetic phase.

@@ -8,6 +8,23 @@ Turn the deterministic generator into reusable artifacts: fixture files, run man
The deterministic spine gives the repo stable raw events. The next step is to make those events durable and addressable so downstream phases can reference exact generated runs instead of recreating ad hoc local randomness.
## Source documents
- Architecture plan: [`docs/plans/synthetic-market-data-architecture-review.md`](../../plans/synthetic-market-data-architecture-review.md)
- Research report: [`docs/research-docs/synthetic-market-data-generation.md`](../../research-docs/synthetic-market-data-generation.md)
These documents are rationale, not added scope. This phase implements only manifests, fixtures, and CLI support.
## Research basis
- Deterministic replay and reviewable artifacts are necessary for synthetic data to be useful as validation data, not just demo data.
- Expected-output manifests should pin seed, profile, generator version, event hashes, and replay ordering.
- Hidden labels must stay separate from market events so tests do not leak ground truth into production-like paths.
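
One way such a manifest could pin a run is sketched below. The key names (`seed`, `profile`, `generator_version`, `event_hashes`, `stream_hash`) and the helper functions are hypothetical, chosen to mirror the bullets above rather than any existing file format. Labels are deliberately absent: only market events are hashed.

```python
import hashlib
import json


def event_hash(event: dict) -> str:
    """Hash a canonical JSON encoding so dict key order cannot change the digest."""
    canonical = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_manifest(seed: int, profile: str, generator_version: str, events: list) -> dict:
    """Pin the inputs and the exact ordered output of one generated run."""
    hashes = [event_hash(e) for e in events]  # list order is the replay order
    stream_hash = hashlib.sha256("\n".join(hashes).encode("utf-8")).hexdigest()
    return {
        "seed": seed,
        "profile": profile,
        "generator_version": generator_version,
        "event_count": len(events),
        "event_hashes": hashes,
        "stream_hash": stream_hash,
    }


def matches(manifest: dict, events: list) -> bool:
    """True only if the events reproduce the pinned hashes in the same order."""
    return manifest["event_hashes"] == [event_hash(e) for e in events]
```

Because the per-event hashes are stored in order, a regenerated run that contains the same events in a different sequence fails `matches`, which covers the replay-ordering requirement as well as content drift.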
## Deferred research ideas
- Empirical residual resampling and historical-window bootstrapping are future artifact sources, not this CLI's first requirement.
## Dependencies on earlier phases
- `islandflow-259.1` - Synthetic deterministic spine

@@ -8,6 +8,24 @@ Author named deterministic scenarios, separate ground-truth labels, and expected
The generator and manifest layers should exist before scenario authoring. Smart-flow evidence clustering should also define enough vocabulary for expected outputs to describe evidence requirements without leaking labels into emitted market events.
## Source documents
- Architecture plan: [`docs/plans/synthetic-market-data-architecture-review.md`](../../plans/synthetic-market-data-architecture-review.md)
- Research report: [`docs/research-docs/synthetic-market-data-generation.md`](../../research-docs/synthetic-market-data-generation.md)
- Smart-flow research report: [`docs/research-docs/smart-flow-market-mechanics.md`](../../research-docs/smart-flow-market-mechanics.md)
These documents are rationale, not added scope. This phase implements only named scenarios, separate labels, and expected-output contracts.
## Research basis
- Scenario injection into a realistic synthetic background is mandatory for labeled, replayable alert tests.
- Negative, noisy, stale, wide-market, and event-context cases matter as much as positive "should detect" scenarios.
- Labels and expected outputs need required evidence, forbidden evidence, confidence bands, and false-positive penalties.
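
An expected-output contract of this shape might look like the following sketch. The `ExpectedOutput` fields and the `evaluate` helper are illustrative assumptions, and the evidence names used with them are made up for the example; the real vocabulary would come from the smart-flow evidence work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExpectedOutput:
    scenario: str
    should_alert: bool
    required_evidence: frozenset
    forbidden_evidence: frozenset
    confidence_band: tuple  # inclusive (low, high)
    false_positive_penalty: float


def evaluate(expected: ExpectedOutput, alerted: bool, evidence, confidence: float):
    """Return (passed, penalty, reasons) for one detector result."""
    reasons, penalty = [], 0.0
    evidence = frozenset(evidence)
    if alerted and not expected.should_alert:
        # Negative scenarios count as much as positives: charge the penalty.
        reasons.append("false positive")
        penalty += expected.false_positive_penalty
    elif expected.should_alert and not alerted:
        reasons.append("missed alert")
    elif alerted:
        missing = expected.required_evidence - evidence
        if missing:
            reasons.append("missing evidence: " + ", ".join(sorted(missing)))
        low, high = expected.confidence_band
        if not low <= confidence <= high:
            reasons.append("confidence outside band")
    leaked = expected.forbidden_evidence & evidence
    if alerted and leaked:
        reasons.append("forbidden evidence: " + ", ".join(sorted(leaked)))
    return (not reasons, penalty, reasons)
```

The contract lives beside the scenario, not inside the emitted events, so a detector under test only ever sees market data.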
## Deferred research ideas
- Empirical tuning of scenario frequencies, full historical replay-plus-mutation, and learned scenario generation belong after the MVP scenario catalog is stable.
## Dependencies on earlier phases
- `islandflow-259.1` - Synthetic deterministic spine

@@ -8,6 +8,23 @@ Make replay consume synthetic runs deterministically, either directly from gener
Replay should not be wired to synthetic data until the generator, manifests, labels, and smart-flow hypothesis pipeline have stable semantics. At this point, replay can become a serious acceptance gate instead of a demo convenience.
## Source documents
- Architecture plan: [`docs/plans/synthetic-market-data-architecture-review.md`](../../plans/synthetic-market-data-architecture-review.md)
- Research report: [`docs/research-docs/synthetic-market-data-generation.md`](../../research-docs/synthetic-market-data-generation.md)
These documents are rationale, not added scope. This phase implements only deterministic synthetic replay integration.
## Research basis
- Replay must preserve event-time ordering and deterministic run identity to prove derived behavior.
- Synthetic runs should be selectable by source and run metadata rather than ambient randomness.
- Optional ClickHouse/NATS materialization can exist later, but fast validation should remain infra-free.
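
The points above can be sketched without any infrastructure. In this minimal example the run records, the `source`, `run_id`, `ts_ns`, and `seq` fields, and both function names are assumptions for illustration; the idea is only that a run is chosen by explicit metadata and replayed in a total order.

```python
def select_run(runs: list, source: str, run_id: str) -> dict:
    """Pick exactly one run by explicit metadata, never by ambient state."""
    found = [r for r in runs if r["source"] == source and r["run_id"] == run_id]
    if len(found) != 1:
        raise LookupError(f"expected one run for {source}/{run_id}, found {len(found)}")
    return found[0]


def replay(run: dict):
    """Yield events in event-time order, using seq to break timestamp ties."""
    yield from sorted(run["events"], key=lambda e: (e["ts_ns"], e["seq"]))
```

The `(ts_ns, seq)` key gives a total order as long as `seq` is unique within a run, so two replays of the same run emit identical sequences even when several events share a timestamp.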
## Deferred research ideas
- Historical replay-plus-mutation and calibrated replay benchmarks are future layers after synthetic replay semantics are stable.
## Dependencies on earlier phases
- `islandflow-259.1` - Synthetic deterministic spine

@@ -8,6 +8,23 @@ Expose deterministic synthetic runs as named demo and load profiles after the ge
Demos are useful only after the underlying data can be trusted. This phase deliberately waits until replay and golden evaluation prove the event semantics, so hosted controls do not become a front door to ambient randomness.
## Source documents
- Architecture plan: [`docs/plans/synthetic-market-data-architecture-review.md`](../../plans/synthetic-market-data-architecture-review.md)
- Research report: [`docs/research-docs/synthetic-market-data-generation.md`](../../research-docs/synthetic-market-data-generation.md)
These documents are rationale, not added scope. This phase implements only named deterministic demo and load profiles.
## Research basis
- Demo streams should use named, seeded profiles so product behavior is reproducible.
- Load profiles should scale rate or volume without changing event semantics.
- Realism should come from the generator and scenarios, not hidden UI knobs or wall-clock randomness.
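
To make "scale rate without changing event semantics" concrete, here is one possible approach: compress inter-event gaps by an integer factor and leave every other field alone. The `scale_rate` name and the `ts_ns` field are assumptions for this sketch, and the input is assumed to already be in replay order.

```python
def scale_rate(events: list, factor: int) -> list:
    """Compress event-time gaps by `factor`; payloads and order are untouched."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    if not events:
        return []
    origin = events[0]["ts_ns"]
    # Integer arithmetic keeps the result identical on every machine.
    return [{**e, "ts_ns": origin + (e["ts_ns"] - origin) // factor} for e in events]
```

With `factor=2`, events at 1000, 3000, and 7000 ns move to 1000, 2000, and 4000 ns, doubling the rate while the prices, sizes, and sequence stay the same.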
## Deferred research ideas
- Historically bootstrapped demo streams, learned realism upgrades, and full LOB-style demos stay future work.
## Dependencies on earlier phases
- `islandflow-259.1` - Synthetic deterministic spine

@@ -8,6 +8,23 @@ Plan future calibration of synthetic generator parameters from historical market
Naming this future work now lets early designs keep calibration hooks in mind. The calibration work itself should not start before deterministic generation, manifests, scenarios, replay, or demo profiles.
## Source documents
- Architecture plan: [`docs/plans/synthetic-market-data-architecture-review.md`](../../plans/synthetic-market-data-architecture-review.md)
- Research report: [`docs/research-docs/synthetic-market-data-generation.md`](../../research-docs/synthetic-market-data-generation.md)
These documents are rationale, not added scope. This future phase is the place to turn research ideas into scoped calibration work after MVP.
## Research basis
- Once historical data exists, calibration should fit arrival curves, spread states, size mixtures, venue shares, and options-chain activity weights.
- Replay-plus-mutation can improve realism while preserving deterministic test intent.
- Calibration should layer onto the deterministic engine rather than replace it wholesale.
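
As a hedged illustration of what "fitting" could mean at its simplest, the sketch below estimates a piecewise-constant arrival curve and category shares by counting. Both function names and the input shapes are assumptions; a scoped calibration phase might well choose different estimators.

```python
from collections import Counter


def fit_arrival_curve(timestamps_s, bucket_s: float, n_buckets: int) -> list:
    """Estimate events per second in each fixed-width time bucket."""
    counts = [0] * n_buckets
    for t in timestamps_s:
        i = int(t // bucket_s)
        if 0 <= i < n_buckets:  # ignore observations outside the window
            counts[i] += 1
    return [c / bucket_s for c in counts]


def fit_shares(labels) -> dict:
    """Estimate mixture weights (e.g. venue shares) as observed frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {k: counts[k] / total for k in sorted(counts)}
```

Outputs like these could feed the deterministic engine as profile parameters, which keeps calibration layered on top of the generator instead of replacing it.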
## Deferred research ideas
- Generative ML, learned LOB simulators, and agent-based models remain later research tracks unless a future Beads issue scopes them explicitly.
## Dependencies on earlier phases
- `islandflow-259.5` - Synthetic demo and load profiles