plan synthetic and smart-flow phases

dirtydishes 2026-06-16 13:46:08 -04:00
parent d1fac6c7ec
commit eaa22de302
19 changed files with 1198 additions and 1 deletions

@@ -0,0 +1,65 @@
# Smart-Flow Phase 99: Future Calibration
## Purpose
Plan future calibration of smart-flow confidence, policy thresholds, penalties, and abstention behavior after the MVP evidence/hypothesis pipeline is working and replay-validated.
## Why this phase comes now
The architecture should leave room for calibration, but calibration should not block the MVP. The system first needs clean facts, evidence, hypotheses, and replayable evaluation before tuning can be meaningful.
## Dependencies on earlier phases
- `islandflow-zxh.5` - Smart-flow API/UI explainability
- `islandflow-259.6` - Future synthetic historical calibration
## Likely files/modules touched
- Future calibration tooling in `services/compute/` or a research package
- Policy/model version registry
- Evaluation reports or benchmark datasets
- Storage/query helpers for historical derived outputs
- Documentation for metrics and calibration governance
## In-scope work
- Define calibration datasets and evaluation metrics.
- Specify how confidence, conviction, penalties, abstention, and alternatives are tuned.
- Preserve policy/model versioning and replayability.
- Document what makes a calibration dataset acceptable.
- Keep user-facing confidence semantics auditable.
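One way to make "acceptable dataset" checkable is a small manifest validated before a dataset is used for calibration. The sketch below is illustrative only: `DatasetManifest`, its fields, and the `min_regimes` threshold are hypothetical names and values, not part of any existing islandflow module. The checks mirror concerns this document already names (provenance, restricted data, representative market regimes).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DatasetManifest:
    """Hypothetical description of a candidate calibration dataset."""
    name: str
    provenance: str                      # where the data came from
    contains_restricted_data: bool       # private or licensed content
    approved_for_commit: bool            # explicit approval recorded
    market_regimes: Tuple[str, ...] = () # regimes the dataset covers


def acceptance_problems(manifest: DatasetManifest, min_regimes: int = 2) -> List[str]:
    """Return human-readable reasons the dataset is not acceptable.

    An empty list means no problem was found by these (assumed) checks.
    """
    problems = []
    if not manifest.provenance.strip():
        problems.append("missing provenance")
    if manifest.contains_restricted_data and not manifest.approved_for_commit:
        problems.append("restricted data without approval")
    if len(set(manifest.market_regimes)) < min_regimes:
        problems.append("too few market regimes")
    return problems


demo = DatasetManifest(
    name="demo-only",
    provenance="",
    contains_restricted_data=True,
    approved_for_commit=False,
    market_regimes=("low-vol",),
)
print(acceptance_problems(demo))
```

Returning a list of reasons rather than a boolean keeps the rejection auditable, in line with the provenance documentation this phase requires.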
## Explicitly out-of-scope work
- MVP contracts and scoring foundations.
- API/UI explainability for the initial pipeline.
- Treating historical calibration as proof of participant identity.
- Using private or licensed data in committed fixtures without approval.
## Acceptance criteria
- Calibration remains outside the MVP blocker chain.
- Dataset provenance, metrics, and policy versioning are documented before implementation.
- Confidence and abstention semantics remain explainable after tuning.
- Replay can compare calibrated policy versions without losing auditability.
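The last criterion can be pictured as a diff between two policy versions replayed over the same events. This is a minimal sketch under stated assumptions: `PolicyOutput` and `compare_versions` are hypothetical names, and it assumes replay outputs are keyed by an event identifier. Both sides of every difference are retained so the comparison itself stays auditable.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PolicyOutput:
    """Hypothetical replayed output for one event under one policy version."""
    event_id: str
    policy_version: str
    confidence: float
    abstained: bool


def compare_versions(baseline: List[PolicyOutput],
                     candidate: List[PolicyOutput]) -> List[Dict]:
    """List every event whose outcome differs between two policy versions."""
    base_by_id = {o.event_id: o for o in baseline}
    cand_ids = {o.event_id for o in candidate}
    changes = []
    for cand in candidate:
        base = base_by_id.get(cand.event_id)
        if base is None:
            # Event only exists in the candidate replay.
            changes.append({"event_id": cand.event_id, "kind": "missing_in_baseline",
                            "baseline": None, "candidate": cand})
        elif (base.abstained, base.confidence) != (cand.abstained, cand.confidence):
            changes.append({"event_id": cand.event_id, "kind": "changed",
                            "baseline": base, "candidate": cand})
    for base in baseline:
        if base.event_id not in cand_ids:
            # Event only exists in the baseline replay.
            changes.append({"event_id": base.event_id, "kind": "missing_in_candidate",
                            "baseline": base, "candidate": None})
    return changes


v1 = [PolicyOutput("e1", "v1", 0.6, False), PolicyOutput("e2", "v1", 0.4, True)]
v2 = [PolicyOutput("e1", "v2", 0.7, False), PolicyOutput("e2", "v2", 0.4, True),
      PolicyOutput("e3", "v2", 0.5, False)]
for change in compare_versions(v1, v2):
    print(change["event_id"], change["kind"])
```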
## Test strategy
Once this phase is implemented, run replayed benchmark datasets against versioned policy outputs. Track false positives, abstention rates, precision-like metrics, and scenario-specific regressions for each policy version. Keep calibration tests separate from the early deterministic fixture tests.
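The metrics named above could be computed per policy version roughly as follows. This is a sketch, not a proposed contract: `ReplayOutcome`, its fields, and `summarize` are hypothetical, and the `label` field stands for a benchmark label rather than proof of participant identity. Abstentions are counted separately and excluded from precision.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ReplayOutcome:
    """Hypothetical record: one replayed hypothesis with a benchmark label."""
    policy_version: str
    scenario: str
    predicted: Optional[bool]  # None means the policy abstained
    label: bool                # benchmark label, not ground-truth identity


def summarize(outcomes: List[ReplayOutcome]) -> Dict:
    """Count abstentions, false positives, and precision over emitted positives."""
    emitted = [o for o in outcomes if o.predicted is not None]
    positives = [o for o in emitted if o.predicted]
    true_pos = sum(1 for o in positives if o.label)
    abstentions = len(outcomes) - len(emitted)
    return {
        "total": len(outcomes),
        "abstentions": abstentions,
        "abstention_rate": abstentions / len(outcomes) if outcomes else 0.0,
        "false_positives": len(positives) - true_pos,
        # Precision is undefined when the policy emitted no positives.
        "precision": true_pos / len(positives) if positives else None,
    }


sample = [
    ReplayOutcome("v1", "sweep", True, True),
    ReplayOutcome("v1", "sweep", True, False),
    ReplayOutcome("v1", "block", None, True),
    ReplayOutcome("v1", "block", False, False),
]
print(summarize(sample))
```

Grouping `sample` by `scenario` before calling `summarize` would give the scenario-specific view needed to spot regressions between versions.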
## Risks / design traps
- Treating calibrated confidence as objective truth.
- Tuning to demos instead of representative market regimes.
- Losing policy version lineage.
- Committing restricted data or large generated benchmark artifacts.
## Suggested future Codex implementation prompt
```text
Implement docs/implementation/smart-money/99-future-calibration.md for Beads issue islandflow-zxh.6 only after the MVP smart-flow phases are complete. Define calibration datasets, metrics, policy versioning, and replay comparison. Do not make calibration a prerequisite for earlier evidence, scoring, or UI work.
```
## Matching Beads issue title/id
- `islandflow-zxh.6` - Future smart-flow phase 99: calibration