From df35d4978405a68504b4282e005f30aa16b3a4be Mon Sep 17 00:00:00 2001 From: dirtydishes Date: Mon, 4 May 2026 18:00:50 -0400 Subject: [PATCH] Add smart money rebuild plan and taxonomy - Replace tape overhaul notes with the new smart-money classification plan - Add the taxonomy reference doc for the rules-first rebuild --- smart-money-rebuild-phase-01.md | 122 ++++++++ smartmoney.md | 534 ++++++++++++++++++++++++++++++++ tape-overhaul-phase1-1.md | 170 ---------- tape-overhaul-phase1.md | 320 ------------------- 4 files changed, 656 insertions(+), 490 deletions(-) create mode 100644 smart-money-rebuild-phase-01.md create mode 100644 smartmoney.md delete mode 100644 tape-overhaul-phase1-1.md delete mode 100644 tape-overhaul-phase1.md diff --git a/smart-money-rebuild-phase-01.md b/smart-money-rebuild-phase-01.md new file mode 100644 index 0000000..55b839b --- /dev/null +++ b/smart-money-rebuild-phase-01.md @@ -0,0 +1,122 @@ +# Smart Money Rebuild Plan + +## Summary +Rebuild the current packet-threshold classifier into a `rules-first`, parent-event, multi-profile system driven by the taxonomy in [smartmoney.md](/Users/kell/Cloud/dev/islandflow/smartmoney.md). The first milestone will ship a new event model, feature pipeline, profile rule engine, event-calendar enrichment, deterministic synthetic scenarios, and a compatibility bridge to current alerts/UI. We will explicitly ignore anything that requires owner/account identity, supervised model training, anomaly detection, or speculative profile claims we cannot support from public-tape-style data. 
+ +## Scope In +- Core 6 primary profiles: `institutional_directional`, `retail_whale`, `event_driven`, `vol_seller`, `arbitrage`, `hedge_reactive` +- Parent-event reconstruction from child prints, NBBO context, structure context, and underlying context +- Probabilistic rule scores with reason codes and abstentions +- External corporate-event calendar support via `services/refdata` +- Scenario-driven synthetic options/equity/quote generation for tests, replay, and demos +- Compat bridge from new profile model back to current `ClassifierHitEvent` and `AlertEvent` + +## Scope Out +- Supervised model training/inference in v1 +- Unsupervised anomaly detection in v1 +- `prop/professional customer` as a first-class output +- Claims about beneficial owner, account class, or illegal intent +- Real-time use of next-day open interest +- Rule 606/CAT/private broker data integrations + +## Phase 0: Planning Artifact +- Create `SMART_MONEY_REBUILD_PLAN.md` at repo root as the living implementation document. +- Copy this phased plan into that file and add per-phase checklists, acceptance criteria, and migration notes. +- Treat that file as the session handoff and implementation tracker, while still using `bd` for issue tracking. + +## Phase 1: Contracts and Storage +- Add a new event contract in `packages/types` for `SmartMoneyEvent` with: + - `event_id`, `packet_ids`, `member_print_ids`, `underlying_id`, `event_kind`, `event_window_ms` + - `features` as structured typed fields, not only loose string/number maps + - `profile_scores: { profile_id, probability, confidence_band, direction, reasons[] }[]` + - `primary_profile_id`, `primary_direction`, `abstained`, `suppressed_reasons[]` +- Keep `FlowPacket` during bridge, but stop treating it as the final semantic unit. +- Keep `ClassifierHitEvent`, but derive it from `SmartMoneyEvent.primary_profile_id` plus legacy mapping. +- Add storage support in `packages/storage` for `smart_money_events`. 
+- Extend `AlertEvent` with optional `primary_profile_id` and `profile_scores` while preserving current fields. + +## Phase 2: Parent-Event Reconstruction +- Add `services/compute/src/parent-events.ts` to group child prints into parent events. +- Reconstruction key should use: contract, direction proxy, burst gap, venue burst context, and structure linkage. +- Preserve special-print flags from conditions so auctions/crosses/complex-like prints can be suppressed or downweighted. +- Allow two parent paths: + - `single_leg_event` + - `multi_leg_event` +- Reuse current structure logic where useful, but move the semantic output to parent events instead of direct classifier hits. +- Emit deterministic event IDs so batch replay and live scoring agree. + +## Phase 3: Feature Engineering +- Add typed feature builders for: + - aggressor mix, spread position, quote age, venue count, inter-fill timing, strike concentration + - DTE, moneyness, ATM proximity, synthetic IV shock, spread widening, underlying move linkage + - structure markers, same-size leg symmetry, net directional bias proxies + - event alignment: days-to-event, expiry-after-event, pre-event concentration +- Build event-calendar ingestion in `services/refdata` for earnings/corporate events from a simple external feed or static importable provider layer. +- Live scoring may use only timestamp-available data; any later validation fields must be batch-only. + +## Phase 4: Rules Engine +- Replace `services/compute/src/classifiers.ts` with profile rules centered on the six primary profiles. +- Each rule returns probability, direction, reason codes, suppression reasons, and a confidence band. 
+- Add explicit false-positive guards from the research doc: + - special/complex/auction suppression for directional labels + - retail-frenzy guard on short-dated OTM call bursts + - hedge-reactive preference for 0-2 DTE ATM/high-gamma/reactive-underlier cases + - arbitrage requirement for matched-leg symmetry and near-flat directional exposure +- Keep existing structure-specific ideas like straddle/vertical/roll as evidence and reasons, not top-level end states. + +## Phase 5: Synthetic Market Redesign +- Rework `services/ingest-options/src/adapters/synthetic.ts` around labeled parent-event templates instead of loose burst presets. +- Add deterministic synthetic scenario families matching the core 6 profiles plus neutral background noise. +- Each scenario must emit a coherent bundle: + - child option prints + - contemporaneous NBBO evolution + - underlying quote path + - IV response pattern + - realistic conditions/venues/structure markers +- Add two operating modes: + - `test`: seeded, deterministic, low-noise, exact expected labels + - `demo`: seeded, realistic background with controlled noise ratios +- Keep synthetic hidden labels internal to tests/replay harnesses, not public production payloads. + +## Phase 6: Compute, API, and UI Rollout +- In `services/compute`, emit `SmartMoneyEvent` first, then derive compat `ClassifierHitEvent` and `AlertEvent`. +- In `services/api`, add read/stream endpoints for `SmartMoneyEvent` while preserving existing endpoints. +- In `apps/web/app/terminal.tsx`, migrate rendering to profile-aware displays: + - primary profile + - probability ladder + - reason codes + - suppression/abstention state +- During the bridge, old UI elements should continue working from mapped legacy hits. + +## Phase 7: Evaluation and Replay +- Add deterministic rule tests per profile and per major false-positive case. +- Add replay-style integration tests for live-vs-batch consistency. 
+- Add synthetic scenario acceptance tests proving: + - the intended profile wins + - nearby wrong profiles stay below a threshold + - noisy background does not overwhelm expected results +- Add evaluation utilities for parent-event precision/recall, calibration, abstention rate, and economic sanity checks. + +## Important API and Type Changes +- New primary stream/table/type: `SmartMoneyEvent` +- `ClassifierHitEvent` becomes a legacy-derived compatibility surface +- `AlertEvent` gains optional profile metadata but keeps existing shape +- `FlowPacket` remains during migration, but becomes an intermediate artifact rather than the final semantic alert object + +## Test Cases and Scenarios +- Institutional directional: aggressive concentrated call/put burst with catalyst-aligned expiry +- Retail whale: short-dated OTM attention-name chase with IV pop +- Event-driven: pre-earnings aligned expiry and widening spreads +- Vol seller: sell-side dominant overwrite/put-write/short-vol structure +- Arbitrage: matched multi-leg parity-style event with low net directional bias +- Hedge reactive: short-dated ATM burst tied to underlying move and gamma-sensitive conditions +- False positives: auctions, complex prints, late/stale quote context, illiquid wide spreads, retail frenzy misread as institution, structure trades misread as direction + +## Assumptions and Defaults +- Rollout mode: `Compat Bridge` +- First milestone: `Rules-first` +- Primary outputs: `Core 6` +- Event-driven flow uses real external event-calendar enrichment in v1 +- `prop/professional customer` remains supporting evidence only +- Existing rule labels like `vertical_spread` and `zero_dte_gamma_punch` become evidence/reason codes, not final business-facing profile IDs +- Synthetic generation is optimized for deterministic realism, not maximum randomness diff --git a/smartmoney.md b/smartmoney.md new file mode 100644 index 0000000..f53ebeb --- /dev/null +++ b/smartmoney.md @@ -0,0 +1,534 @@ +# Smart Money 
Options Flow Classifier Playbook + +## Executive Summary + +A usable options-flow classifier should start from one hard truth: there is no single “smart money” footprint. The same aggressive print can come from an informed institutional buyer, a dealer hedging inventory, a retail stampede, a facilitation auction, or a parity trade embedded in a complex order. The public tape is informative, but it is noisy, and the literature is mixed: some studies find that options volume and order imbalance predict future stock returns and volatility, while other work shows that a meaningful share of apparent pre-event options activity is speculative and retail-driven rather than informed. + +The practical implication is that you should not train one monolithic `smart_money = 1` label. Train a taxonomy. First reconstruct parent events from child prints, quotes, condition codes, venue information, and Greeks. Then classify each parent event into one or more participant-style hypotheses such as directional institutional block buy, dealer hedge, professional/prop burst, retail whale chase, event-driven informed flow, volatility seller, or arbitrage desk. That hierarchy matches both the official options market structure and the academic evidence on demand pressure, volatility-information trading, dealer hedging, retail demand, and event-time option selection. + +Your best baseline is an ensemble. Use rules to produce interpretable weak labels, supervised models to learn non-linear interactions, and unsupervised anomaly detection to catch new regimes. Keep the final output probabilistic and multi-label, not categorical and overconfident. That is the only sane way to handle a market where options sometimes lead stocks, but options are also noisier than equities because of wider spreads, temporary price pressure, legging, and event-driven speculation. + +## Market Structure and Data Foundation + +Start with public U.S. listed-options data from the Options Price Reporting Authority (OPRA). 
The OPRA specification gives you participant ID, last-sale message types, quote messages, best-bid/best-offer appendages, and end-of-day open interest. Participant IDs identify the exchange that originated the message, and OPRA quote appendages identify the exchange posting the best bid or offer. OPRA also carries important last-sale condition detail including ISO executions, auctions, crosses, multi-leg complex trades, stock-option trades, compression trades, late prints, cancels, and out-of-sequence messages. Those fields are the backbone of any classifier worth building. + +Venue and structure matter because the rules explicitly allow complex and special-order handling to look different from plain single-leg urgency. The options order protection plan defines an ISO as a limit order routed together with additional ISOs to satisfy better-priced protected quotations, and it treats complex trades as a specific trade-through exception. Exchange rulebooks also define complex-order books, complex-order auctions, synthetic BBOs for strategies, and legging into simple books. In plain English: a print that looks “too aggressive” versus the simple-leg NBBO may be perfectly normal for a complex strategy or auction. + +Use official strategy definitions from The Options Clearing Corporation to anchor the arbitrage and overwrite classes. OCC materials and related exchange methodology documents give canonical descriptions of covered-call or buy-write structures, put/call parity, conversions, reversals, and box spreads. Those are not trivia; they let you design deterministic detectors for some of the cleanest non-directional “smart money” profiles on the tape. + +Routing data is the next layer. Public routing disclosure under U.S. Securities and Exchange Commission Rule 606 and the FINRA 606 portal can tell you where non-directed listed-options orders are routed and whether payment for order flow or other venue economics may be shaping execution. 
That is not a per-trade participant flag, but it becomes useful as a prior when you have broker-level logs, customer requests, or controlled execution datasets. + +For quote alignment, use the latest valid NBBO snapshot at or before the trade timestamp, but maintain a correction pipeline because OPRA explicitly documents late trades, out-of-sequence trades, cancels, and sequence resets. For volatility features, derive IV, delta, gamma, and vega from a surface built from contemporaneous quotes; for variance-risk features, use an option-strip estimate of risk-neutral variance and compare it with subsequent realized variance. That is the cleanest way to separate directional demand from pure vol-selling or vol-buying pressure. + +### Data sources and what each is good for + +| Source | Core fields | Best use in classifier | Main limitations | +|---|---|---|---| +| OPRA time & sales / tape | last sale, condition code, exchange participant ID, contract identifiers | trade reconstruction, urgency, venue, complex/auction/special print filters | no beneficial owner, no true account class | +| OPRA quotes / NBBO | bid, ask, sizes, best-bid/best-offer appendages | aggressor-side inference, spread position, quote pressure | quote-trade desync, temporary noise | +| Vendor-normalized NBBO & trades such as products from Nasdaq | nanosecond timestamps, OPRA-derived NBBO/trade fields, appendages | production-grade replay, lower engineering friction | still usually lacks owner identity | +| Exchange rulebooks and venue specs such as Cboe Options Exchange materials | complex-order logic, auction mechanics, strategy definitions | false-positive mitigation for crosses, auctions, legging | descriptive, not participant labels | +| Open interest | end-of-day OI by contract | weak confirmation of opening flow for training and backtests | not real-time | +| Broker routing / Rule 606 | venue distribution, non-directed-routing stats, PFOF economics | priors on retail/wholesaler routing, venue 
fingerprints | not per-trade in public reports | +| IV surface and underlying prices | IV, skew, term structure, delta/gamma/vega, realized vol | participant-style separation, especially vol sellers, hedgers, arbs | model choice matters | +| Event calendars | earnings, M&A, dividends, corporate actions | event-driven informed-flow labeling | external datasets required | +| Broker/account/CAT-like audit data if available | account type, origin, open/close, route chain | strongest labels for retail vs professional vs institutional | usually not publicly available | + +Source basis for the table: OPRA field definitions and condition codes, vendor OPRA-derived feeds, exchange complex-order rules, OCC strategy definitions, SEC Rule 606, and FINRA’s 606 reporting portal. + +The core research takeaway is that the tape can contain real information. Easley, O’Hara, and Srinivas show that signed positive and negative option volume contains information about future stock prices; Pan and Poteshman show that open-buy put/call ratios predict subsequent stock returns; Ni, Pan, and Poteshman show that non-market-maker net demand for volatility predicts future realized volatility beyond implied volatility; and later price-discovery work finds that options reflect new information before stocks roughly one-quarter of the time on average, especially around information events. But equally important, other work shows strongly mixed evidence, including papers arguing that much earnings-related options activity is dominated by speculative retail trading and differences of opinion rather than pure information. Your classifier must be built around that ambiguity, not around internet folklore. + +## Taxonomy of Smart Money Profiles + +The table below is intentionally pragmatic. Where the market does not provide a canonical cutoff, the threshold is marked **unspecified** and I add a **seed** value in parentheses that is meant only as a starting hyperparameter. 
Tune every seed by symbol liquidity bucket, option price level, spread regime, and event context. + +| Profile | Economic motive | Strongest tape signature | Highest-value measurable features | Suggested thresholds or ranges | +|---|---|---|---|---| +| Institutional block buyers | Directional or convexity exposure around a thesis or catalyst | Large parent order, mostly aggressive, concentrated in one strike or a tight strike cluster, expiration usually aligned with a catalyst horizon | ask-lift share, spread position, parent notional, strike concentration, DTE, absolute delta, next-day OI change, IV percentile | ask-lift share **unspecified** (seed `> 0.60`); parent notional **unspecified** (seed `>$250k` single names, `>$1m` indexes); same-strike notional share **unspecified** (seed `>0.70`) | +| Market makers hedging | Inventory and gamma risk management | Activity is most visible as reactive cross-asset flow, especially in short-dated ATM contracts and the underlying/futures, often reversing with price changes | DTE, ATM proximity, dollar gamma, hedge-link to stock/futures, intraday sign reversals, two-sided prints, quote widening | DTE `0–2` days for strongest signatures; abs(delta) **unspecified** (seed `0.35–0.65`); high dollar gamma **unspecified** (seed `>95th percentile by symbol/DTE`) | +| Prop firms / professional customers | Intraday alpha, microstructure taking, liquidity seeking, statistical edge | Rapid child-order bursts across venues or strikes, ISO/sweep-like urgency, low dwell time, often many small or medium clips rather than one giant block | inter-fill milliseconds, venue count, ISO flag, distinct strikes in burst, burst entropy, lot-size dispersion, routing pattern | official “professional customer” threshold is `>390 orders/day` if origin data exists; public burst proxy **unspecified** (seed `>=5` child prints in `<=2s`, `>=2` venues) | +| Retail whales | Leveraged speculation in attention-heavy names | Large prints by retail standards, 
short-dated and often OTM, heavily call-biased in favored names, rising IV, often occurs in the same contracts retail prefers generally | DTE, moneyness, call/put bias, IV shock, venue prior from routing data, concentration in high-attention symbols | DTE **unspecified** (seed `<=7` for single names, often `0DTE/1DTE` in indexes); abs(delta) **unspecified** (seed `0.10–0.35`); notional threshold is account-dependent and thus **unspecified** | +| Corporate-event informed flow | Exploit private or superior information about timing and direction of a known upcoming event; do **not** equate this with illegal insider trading | Expiration chosen to land just after the event, high leverage via OTM or near-ATM contracts, unusual pre-event volume, IV and spreads often rise before announcement | event-distance days, expiry alignment, moneyness, spread widening, IV term-structure change, OI growth, low-priced leverage preference | event window **unspecified** (seed `1–30d` before event); expiry alignment **unspecified** (seed “first listed expiry after event”); abs(delta) **unspecified** (seed `0.15–0.40` for directional calls/puts; `0.40–0.60` for straddles/strangles) | +| Volatility sellers | Harvest premium, overwrite stock, or short rich implied volatility | Prints are often on the sell side near bid or midpoint, repeated rolled positions, multi-leg short-vol structures, or covered-call / buy-write linkage to stock | signed vega, IV-minus-HV, realized-vs-implied variance spread, roll cadence, covered-call stock ratio, multi-leg flags | sell-side dominance **unspecified** (seed `>0.60` of parent contracts); IV-RV richness **unspecified** (seed z-score `>1.5`); covered-call stock/contract ratio **unspecified** (seed `80–120` shares per contract equivalent) | +| Arbitrage desks | Enforce parity, finance/carry trades, or exploit mispricings | Conversions, reversals, boxes, jelly-roll-like structures, same-size matched legs, near-zero net delta, often same expiry/paired 
strikes, may appear as complex trades | parity residual, matched-leg timing, same-size legs, net delta near zero, net vega near zero, complex flag, European/cash-settled box eligibility | abs(net delta) **unspecified** (seed `<0.05` of equivalent shares after scaling); parity residual **unspecified** (seed `> fees + slippage`); same-size matched legs `exact or within 5%` | + +Evidence anchors for the taxonomy: signed option volume predicts future stock prices, volatility demand predicts future realized volatility, dealer hedging changes spreads and underlier trading, retail demand clusters in short-dated OTM calls and affects IV, informed traders time maturities around earnings/news, professional-customer status begins above 390 orders per day, and arbitrage/overwrite structures are formally defined by OCC and exchange methodology. + +Two profiles deserve special caution. First, market-maker hedging is often better observed in the underlying than in the options tape itself; a dealer can be the passive counterparty in options and the aggressive actor in stock or futures because of delta rebalancing, especially in high-gamma short-dated regimes. Second, “corporate-event informed flow” should be treated as a market-behavior label, not as an accusation of illegal insider trading. The academic and regulatory evidence shows suspicious pre-event patterns can exist, but the public tape is not enough to prove intent or legal status. + +## Feature Engineering and Weak Labeling + +The right unit of analysis is almost never the raw print. It is the reconstructed parent event. Sessionize child prints by contract, side, and time gap; align them to the most recent valid quote; compute whether the parent traded at the ask, at the bid, or through a special mechanism; then aggregate size, notional, strike dispersion, expiry alignment, venue footprint, and Greeks. If you skip parent reconstruction, your classifier will overfit to child-print fragmentation and venue noise. 
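The sessionization step described above can be sketched roughly as follows. This is a minimal illustration, not the plan's actual `parent-events.ts` logic: the `Print` type, the contract key string, and the `max_gap_ms` default of 500 ms are all assumptions chosen for the example, and side inference from the NBBO is assumed to have happened upstream.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Print:
    """Hypothetical child print; field names are illustrative only."""
    ts_ns: int       # trade timestamp in nanoseconds
    contract: str    # assumed opaque contract key, e.g. "XYZ|2026-06-19|100|C"
    side: str        # "buy", "sell", or "mid", inferred upstream from the NBBO
    contracts: int


def sessionize(prints: List[Print], max_gap_ms: float = 500.0) -> List[List[Print]]:
    """Group child prints into parents by (contract, side) and time gap.

    A print joins the open parent for its key when it arrives within
    max_gap_ms of that parent's last print; otherwise it opens a new parent.
    The 500 ms default is an assumed seed, not a calibrated value.
    """
    open_parent: Dict[Tuple[str, str], List[Print]] = {}
    parents: List[List[Print]] = []
    for p in sorted(prints, key=lambda x: x.ts_ns):
        key = (p.contract, p.side)
        current = open_parent.get(key)
        if current is not None and (p.ts_ns - current[-1].ts_ns) / 1e6 <= max_gap_ms:
            current.append(p)
        else:
            current = [p]
            open_parent[key] = current
            parents.append(current)
    return parents
```

Sorting by timestamp before grouping keeps the output deterministic for a given input set, which is the property the rebuild plan asks for so that batch replay and live scoring agree. Structure linkage and venue context would need additional key components that this sketch leaves out.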
+ +### Feature library + +| Feature | How to compute | Data required | Suggested threshold / range | +|---|---|---|---| +| `order_side_score` | classify child print as buy if `price >= ask - eps`, sell if `price <= bid + eps`, else midpoint/unknown; aggregate parent as ask-lift share or bid-hit share | trades + contemporaneous NBBO | `eps` **unspecified** (seed `0.01` or `0.1 * spread`); buy aggression **unspecified** (seed ask-lift share `>0.60`) | +| `spread_position` | `(price - bid) / max(ask - bid, tick)` clipped to `[0,1]` | trades + NBBO | buy-like `>=0.80`, sell-like `<=0.20` | +| `inter_fill_ms` | median and max milliseconds between child prints in same parent | trades | urgent burst **unspecified** (seed median `<=500ms`); sweep-like **unspecified** (seed child gap `<=50ms`) | +| `parent_notional_usd` | `sum(size * contract_multiplier * trade_price)` over parent | trades + contract multiplier | **unspecified** (seed rank `>=99th pct` by symbol; or absolute `>$250k` single names / `>$1m` indexes) | +| `strike_concentration` | largest strike notional share within parent or same-day cluster | trades | **unspecified** (seed `>0.70`) | +| `maturity_alignment` | days from trade date to expiry; also distance from expiry to event date | contract metadata + event calendar | directional event flow often `expiry just after event`; hedge flow strongest at `0DTE/1DTE/2DTE`; exact cutoff **unspecified** | +| `abs_delta` and `dollar_gamma` | compute from IV surface and spot; scale gamma by contracts and spot | quotes + underlying + surface | event-driven directional flow often abs(delta) **unspecified** (seed `0.15–0.40`); hedge-sensitive flow often `0.35–0.65` | +| `iv_minus_hv` / `vrp_signal` | compare contemporaneous IV or synthetic risk-neutral variance with trailing HV or future RV | quotes + underlying history | **unspecified** (seed z-score `>1.5` for “rich IV”, `<-1.5` for “cheap IV”) | +| `complex_flag` | true if OPRA condition indicates multi-leg, cross, 
auction, stock-option, or compression trade | OPRA condition codes | exact by condition code; no threshold | +| `venue_count` and `venue_entropy` | count distinct exchanges in parent burst and entropy of prints by exchange | participant ID / exchange | **unspecified** (seed `>=2` venues in `<=1s` = urgency prior) | +| `iso_or_sweep_flag` | true if OPRA ISO condition present or if multi-venue ask-lifting occurs in one burst | trade conditions + participant ID + NBBO | ISO is deterministic when flagged; burst-sweep proxy **unspecified** | +| `routing_prior` | broker-level probability vector from Rule 606, broker execution logs, or account-specific data | Rule 606 / broker logs | public per-trade threshold is **unspecified** | +| `oi_confirmation` | `next_day_OI - prior_day_OI`, optionally scaled by burst size | open interest + trades | **unspecified** (seed `OI delta > 0` or `>=25%` of parent size) | +| `underlying_link` | stock/futures buy-sell imbalance in a short window around option parent; for buy-write detect stock buy near call sale | signed stock/futures trades + options | **unspecified** (seed `±5s` window; share/contract ratio `80–120`) | + +Source basis for the feature library: OPRA quote/trade fields, OPRA condition codes, open interest, the options order protection plan, complex-order rules, variance-risk-premium construction from option prices, and academic evidence linking directional, volatility, retail, and event-time flow to specific contracts and maturities. 
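As one small worked case from the feature library, `venue_count` and `venue_entropy` might be computed along these lines. The function name `venue_stats` and the choice of base-2 Shannon entropy are assumptions for illustration; the table only says "entropy of prints by exchange" and does not fix a base or a size weighting.

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def venue_stats(exchanges: Iterable[str]) -> Tuple[int, float]:
    """Return (venue_count, venue_entropy) for one parent burst.

    exchanges holds one exchange code per child print. Entropy is
    Shannon entropy in bits over the share of prints per exchange
    (an assumed definition; weighting by contracts is also plausible).
    """
    counts = Counter(exchanges)
    total = sum(counts.values())
    if total == 0:
        return 0, 0.0
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    # max() normalizes the -0.0 that a single-venue burst would produce
    return len(counts), max(0.0, entropy)
```

A burst split evenly across two venues scores one bit, while a single-venue block scores zero, so the pair separates "many clips across venues" from "one block on one venue" without any notional threshold.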
+ +### Weak-label seeds for training + +| Profile | Positive seed label | Hard exclusions / downweights | +|---|---|---| +| Institutional block buyer | large parent notional, ask-lift dominant, concentrated in one strike or narrow cluster, not tagged complex/auction, expiry consistent with thesis horizon, next-day OI rises | complex/auction/cross flags, parity-like matched opposite legs, obvious covered-call stock link | +| Market-maker hedge | high dollar gamma in `0DTE–2DTE` near ATM, option parent followed by opposite-direction stock/futures hedge, repeated intraday sign flips, two-sided inventory management | single giant concentrated directional bet with no underlier hedge | +| Prop / professional | many child prints fast, multiple venues, ISO or sweep-like urgency, multiple strikes or expiries, high daily order count if origin exists | one slow resting limit order, one-venue block, obvious overwrite/arbitrage | +| Retail whale | short-dated OTM call-heavy flow in retail-favored symbol, IV shock, broker/venue prior consistent with retail routing if available | complex parity structures, low-delta institutional put hedges, calm overwrite roll | +| Corporate-event informed flow | event within next `1–30d`, expiry just after event, unusual OTM directional exposure or ATM vol exposure, rising IV/spreads, OI expansion | contracts far beyond event horizon, obvious retail-meme chase, special-order cross conditions | +| Vol seller | sell-side dominant, short vega, repeated monthly roll or overwrite pattern, IV rich to HV/RV, stock buy link for covered call | strong ask-lifting call/put buys, long-vol straddles, event-aligned convexity buys | +| Arbitrage desk | same-size opposite legs, same expiry and parity-linked strikes, near-zero net delta, complex flag or matched-leg timestamps | highly concentrated one-way exposure with large residual delta | + +These labels are intentionally “silver,” not “gold.” They are for weak supervision, self-training, and human review 
queues. Public OPRA-style data does not identify beneficial owner, open/close intent in real time, or customer/professional/institutional status directly, so hard participant labels require richer private data. + +## Classifier Design and Evaluation + +A strong design is a three-layer ensemble. Layer one is an interpretable rule engine that reconstructs parents, filters special prints, and emits weak labels plus reason codes. Layer two is a supervised event-level model, usually gradient-boosted trees as the baseline and a sequence model only if you truly need temporal microstructure context. Layer three is an unsupervised anomaly detector by symbol and regime to catch novel bursts that the rules and labels miss. Calibrate the final probabilities so downstream systems can set risk-sensitive thresholds instead of blindly trusting raw scores. That structure matches the mixed research evidence and the market’s obvious non-stationarity. + +```mermaid +flowchart LR + A[Raw options trades and quotes] --> B[Quote alignment and correction handling] + B --> C[Parent-order reconstruction] + C --> D[Feature engineering] + D --> E[Rule engine and weak labels] + D --> F[Supervised event model] + D --> G[Unsupervised anomaly model] + E --> H[Ensemble and calibration] + F --> H + G --> H + H --> I[Profile probabilities plus reason codes] + I --> J[Real-time alerts, batch review, backtests] +``` + +Evaluation should happen at the parent-event level, not the child-print level. Use macro and micro F1, precision, recall, AUROC, AUPRC, Matthews correlation, Brier score, and expected calibration error. Then add profile-specific economic validation. For directional institutional or event-driven flow, test post-signal stock return, IV change, and spread-adjusted PnL. For vol sellers, test realized-versus-implied variance, theta capture, and post-event IV compression. For arbitrage, test parity convergence net of fees and slippage. 
For market-maker hedges, test whether predicted hedge-linked flow lines up with same-session stock or futures rebalancing. + +Backtests should be walk-forward and purged. Split by time first, then by symbol clusters or sectors if you can, and embargo around the same catalyst so nearly identical event windows do not leak into train and test. Use only information available at the event timestamp in the live feature set. End-of-day open interest can validate labels during offline training, but it must never leak into real-time scoring. Re-run batch labels after cancels, late reports, and sequence repairs. That last step is not optional; the OPRA spec explicitly says those records exist. + +### Common false positives and how to kill them + +| False positive | Why it happens | Mitigation | +|---|---|---| +| Aggressive buy at ask that is actually an auction or facilitation print | single-leg or complex auctions can print “aggressively” without informed urgency | downweight or exclude auction/cross/complex condition codes before directional classification | +| Print above simple-leg NBBO that is actually a complex trade | complex trades have protection exceptions and can leg or price off strategy economics | require `complex_flag = false` for simple directional labels | +| Retail frenzy mistaken for institutional conviction | retail demand is heavily concentrated in short-dated OTM calls and can move IV | layer in retail priors, attention proxies, and avoid treating short-dated OTM call bursts as automatically informed | +| Dealer re-hedging mistaken for end-user direction | dealer inventory management can create reactive stock/futures flow and two-sided options activity | use stock/futures linkage and estimated dollar gamma | +| Parity trades mistaken for “smart bullish calls” | conversions/reversals/boxes include calls, puts, and sometimes stock in matched quantities | net Greeks and parity residual checks | +| Late, canceled, or out-of-sequence prints | tape 
corrections can invert apparent urgency | correction-aware replay and batch relabeling | +| Wide-spread illiquid contracts | midpoint and ask/bid inference is unreliable when spreads are huge | liquidity filters, spread normalization, larger `eps`, contract-level confidence score | +| Pre-earnings speculation mistaken for information | some literature finds earnings-related options activity is mainly speculative and retail-driven | use event alignment plus cross-checks: pre-event stock returns, spreads, OI, and profile probabilities rather than raw volume alone | + +This table is built directly from OPRA condition rules, exchange complex-order mechanics, retail-flow evidence, dealer-hedging evidence, and the mixed academic results on information versus speculation in options. + +## Implementation and Detection Examples + +For production, keep three stores: raw append-only packet or normalized message storage; a corrected trade-and-quote warehouse; and an event-level feature store keyed by parent ID. Real-time scoring should use the best-available aligned quote state and the current IV surface, while batch scoring should replay the full day once corrections and open interest have landed. Storage should be columnar for history and ring-buffered in memory for current-day scoring. Latency matters mostly for parent reconstruction and hedge linkage; model inference itself is cheap compared with quote alignment and surface updates.
+ +```mermaid +timeline + title Sample trade sequence for one inferred parent event + 09:31:02.100: NBBO 1.00 x 1.05 + 09:31:02.120: 1,500 calls print at 1.05 on one venue + 09:31:02.310: 2,000 calls print at 1.05 on second venue + 09:31:02.360: 1,200 calls print at 1.06 after ask lifts + 09:31:02.900: stock/futures buy program starts + 09:31:03.400: implied vol up and spread widens + 09:31:04.000: parent burst closes and event is scored +``` + +### Example pseudocode + +```python +from dataclasses import dataclass, field +from typing import List, Dict, Optional +import statistics + +@dataclass +class ChildTrade: + ts_ns: int + underlying: str + expiry: str + strike: float + cp: str + price: float + contracts: int + exchange: str + cond: str + bid: float + ask: float + spot: float + iv: Optional[float] = None + delta: Optional[float] = None + gamma: Optional[float] = None + +@dataclass +class ParentEvent: + children: List[ChildTrade] = field(default_factory=list) + + def add(self, t: ChildTrade) -> None: + self.children.append(t) + + def features(self) -> Dict[str, float]: + if not self.children: + return {} + + prices = [c.price for c in self.children] + sizes = [c.contracts for c in self.children] + notionals = [c.price * c.contracts * 100.0 for c in self.children] + spreads = [max(c.ask - c.bid, 0.01) for c in self.children] + + ask_lifts = [ + 1.0 if c.price >= c.ask - min(0.01, 0.1 * s) else 0.0 + for c, s in zip(self.children, spreads) + ] + bid_hits = [ + 1.0 if c.price <= c.bid + min(0.01, 0.1 * s) else 0.0 + for c, s in zip(self.children, spreads) + ] + spread_pos = [ + max(0.0, min(1.0, (c.price - c.bid) / s)) + for c, s in zip(self.children, spreads) + ] + + ts = sorted(c.ts_ns for c in self.children) + gaps_ms = [(ts[i] - ts[i - 1]) / 1e6 for i in range(1, len(ts))] + gap_med_ms = statistics.median(gaps_ms) if gaps_ms else 0.0 + + strike_notional: Dict[tuple, float] = {} + venue_set = set() + complex_flag = 0 + iso_flag = 0 + + gamma_dollar = 0.0 +
delta_equiv_shares = 0.0 + + for c, n in zip(self.children, notionals): + strike_notional[(c.expiry, c.strike, c.cp)] = strike_notional.get((c.expiry, c.strike, c.cp), 0.0) + n + venue_set.add(c.exchange) + if c.cond in set("abcdefghijklmnopqrstuvwxyz"): + complex_flag = 1 + if c.cond == "S": + iso_flag = 1 + if c.gamma is not None: + gamma_dollar += c.gamma * c.contracts * 100.0 * (c.spot ** 2) * 0.01 + if c.delta is not None: + delta_equiv_shares += c.delta * c.contracts * 100.0 + + top_cluster_share = max(strike_notional.values()) / max(sum(notionals), 1.0) + + return { + "contracts_total": float(sum(sizes)), + "notional_total_usd": float(sum(notionals)), + "avg_price": sum(prices) / len(prices), + "ask_lift_share": sum(ask_lifts) / len(ask_lifts), + "bid_hit_share": sum(bid_hits) / len(bid_hits), + "spread_pos_mean": sum(spread_pos) / len(spread_pos), + "inter_fill_median_ms": gap_med_ms, + "top_strike_cluster_share": top_cluster_share, + "venue_count": float(len(venue_set)), + "complex_flag": float(complex_flag), + "iso_flag": float(iso_flag), + "gamma_dollar_per_1pct_move": gamma_dollar, + "delta_equiv_shares": delta_equiv_shares, + } + +def same_parent(prev: ChildTrade, cur: ChildTrade, max_gap_ms: int = 2000) -> bool: + if (cur.ts_ns - prev.ts_ns) / 1e6 > max_gap_ms: + return False + return ( + prev.underlying == cur.underlying + and prev.cp == cur.cp + and prev.expiry == cur.expiry + and abs(prev.strike - cur.strike) < 1e-9 + ) + +def score_profile(x: Dict[str, float]) -> Dict[str, float]: + # Interpretable weak-score layer. Replace with calibrated model outputs later. 
+ scores = { + "institutional_block_buy": 0.0, + "market_maker_hedge": 0.0, + "prop_professional": 0.0, + "retail_whale": 0.0, + "corporate_event_informed": 0.0, + "vol_seller": 0.0, + "arbitrage_desk": 0.0, + } + + if x["complex_flag"] == 0 and x["ask_lift_share"] > 0.60 and x["top_strike_cluster_share"] > 0.70: + scores["institutional_block_buy"] += 0.6 + if x["inter_fill_median_ms"] <= 500 and x["venue_count"] >= 2: + scores["prop_professional"] += 0.5 + if x["iso_flag"] == 1: + scores["prop_professional"] += 0.2 + if x["complex_flag"] == 1 and abs(x["delta_equiv_shares"]) < 0.05 * max(x["contracts_total"] * 100.0, 1.0): + scores["arbitrage_desk"] += 0.5 + if x["bid_hit_share"] > 0.60: + scores["vol_seller"] += 0.4 + + # Market-maker-hedge score benefits from separate stock/futures linkage features not shown here. + return scores +``` + +### SQL schema assumption + +The SQL below assumes PostgreSQL and these normalized tables: + +- `option_trades(ts, trade_id, underlying, expiry, strike, cp, price, contracts, exchange, cond)` +- `option_nbbo(ts, underlying, expiry, strike, cp, bid, ask, bid_exch, ask_exch)` +- `stock_trades_signed(ts, symbol, shares, price, side_est)` where `side_est` is `1` for aggressive buy and `-1` for aggressive sell + +### SQL prework: enrich options prints with the latest NBBO + +```sql +CREATE MATERIALIZED VIEW enriched_option_trades AS +SELECT + t.*, + q.bid, + q.ask, + q.bid_exch, + q.ask_exch, + CASE + WHEN t.price >= q.ask - LEAST(0.01, 0.10 * GREATEST(q.ask - q.bid, 0.01)) THEN 1 + WHEN t.price <= q.bid + LEAST(0.01, 0.10 * GREATEST(q.ask - q.bid, 0.01)) THEN -1 + ELSE 0 + END AS side_est, + CASE + WHEN q.ask > q.bid THEN (t.price - q.bid) / (q.ask - q.bid) + ELSE NULL + END AS spread_pos, + (t.price * t.contracts * 100.0) AS notional_usd, + (t.cond = 'S')::int AS iso_flag +FROM option_trades t +JOIN LATERAL ( + SELECT q.* + FROM option_nbbo q + WHERE q.underlying = t.underlying + AND q.expiry = t.expiry + AND q.strike = t.strike + 
AND q.cp = t.cp + AND q.ts <= t.ts + ORDER BY q.ts DESC + LIMIT 1 +) q ON TRUE; +``` + +### SQL query: buys at or above the ask within seconds + +```sql +WITH tagged AS ( + SELECT + *, + CASE + WHEN LAG(ts) OVER w IS NULL THEN 1 + WHEN EXTRACT(EPOCH FROM (ts - LAG(ts) OVER w)) > 2 THEN 1 + ELSE 0 + END AS new_parent + FROM enriched_option_trades + WHERE side_est = 1 + AND cond NOT IN ('a','b','c','d','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v') + WINDOW w AS ( + PARTITION BY underlying, expiry, strike, cp + ORDER BY ts + ) +), +parents AS ( + SELECT + *, + SUM(new_parent) OVER ( + PARTITION BY underlying, expiry, strike, cp + ORDER BY ts + ROWS UNBOUNDED PRECEDING + ) AS parent_id + FROM tagged +) +SELECT + underlying, expiry, strike, cp, parent_id, + MIN(ts) AS start_ts, + MAX(ts) AS end_ts, + SUM(contracts) AS contracts_total, + SUM(notional_usd) AS notional_total_usd, + AVG(spread_pos) AS mean_spread_pos, + COUNT(*) AS child_prints +FROM parents +GROUP BY underlying, expiry, strike, cp, parent_id +HAVING SUM(notional_usd) > 250000 + AND AVG(spread_pos) >= 0.80 +ORDER BY start_ts; +``` + +### SQL query: large notional concentrated in a single strike + +```sql +WITH daily AS ( + SELECT + DATE(ts) AS trade_date, + underlying, + expiry, + strike, + cp, + SUM(notional_usd) AS strike_notional, + SUM(SUM(notional_usd)) OVER ( + PARTITION BY DATE(ts), underlying + ) AS total_notional_underlying + FROM enriched_option_trades + WHERE side_est = 1 + GROUP BY DATE(ts), underlying, expiry, strike, cp +) +SELECT + trade_date, + underlying, + expiry, + strike, + cp, + strike_notional, + total_notional_underlying, + strike_notional / NULLIF(total_notional_underlying, 0) AS strike_share +FROM daily +WHERE strike_notional >= 250000 + AND strike_notional / NULLIF(total_notional_underlying, 0) >= 0.70 +ORDER BY trade_date, underlying, strike_share DESC; +``` + +### SQL query: rapid repeated buys across strikes + +```sql +WITH bursts AS ( + SELECT + *, + CASE 
+ WHEN LAG(ts) OVER w IS NULL THEN 1 + WHEN EXTRACT(EPOCH FROM (ts - LAG(ts) OVER w)) > 2 THEN 1 + ELSE 0 + END AS new_burst + FROM enriched_option_trades + WHERE side_est = 1 + AND cond NOT IN ('a','b','c','d','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v') + WINDOW w AS ( + PARTITION BY underlying, expiry, cp + ORDER BY ts + ) +), +clustered AS ( + SELECT + *, + SUM(new_burst) OVER ( + PARTITION BY underlying, expiry, cp + ORDER BY ts + ROWS UNBOUNDED PRECEDING + ) AS burst_id + FROM bursts +) +SELECT + underlying, + expiry, + cp, + burst_id, + MIN(ts) AS start_ts, + MAX(ts) AS end_ts, + COUNT(*) AS child_prints, + COUNT(DISTINCT strike) AS strikes_hit, + COUNT(DISTINCT exchange) AS venues_hit, + SUM(notional_usd) AS burst_notional_usd +FROM clustered +GROUP BY underlying, expiry, cp, burst_id +HAVING COUNT(DISTINCT strike) >= 3 + AND COUNT(DISTINCT exchange) >= 2 + AND SUM(notional_usd) > 250000 +ORDER BY start_ts; +``` + +### SQL query: sweeps + +```sql +SELECT + underlying, + expiry, + strike, + cp, + MIN(ts) AS start_ts, + MAX(ts) AS end_ts, + COUNT(*) AS child_prints, + COUNT(DISTINCT exchange) AS venues_hit, + SUM(notional_usd) AS notional_total_usd, + MAX(iso_flag) AS has_iso_flag +FROM enriched_option_trades +WHERE side_est = 1 +GROUP BY underlying, expiry, strike, cp, DATE_TRUNC('second', ts) +HAVING MAX(iso_flag) = 1 + OR ( + COUNT(DISTINCT exchange) >= 2 + AND COUNT(*) >= 3 + AND SUM(notional_usd) > 150000 + ) +ORDER BY start_ts; +``` + +### SQL query: probabilistic buy-write / covered-call indicator + +```sql +WITH option_sales AS ( + SELECT + ts, + underlying, + expiry, + strike, + cp, + contracts, + notional_usd + FROM enriched_option_trades + WHERE cp = 'C' + AND side_est = -1 + AND cond NOT IN ('a','b','c','d','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v') +), +paired AS ( + SELECT + o.ts AS option_ts, + s.ts AS stock_ts, + o.underlying, + o.expiry, + o.strike, + o.contracts, + s.shares, + ABS(s.shares - 
o.contracts * 100) AS ratio_error + FROM option_sales o + JOIN stock_trades_signed s + ON s.symbol = o.underlying + AND s.side_est = 1 + AND s.ts BETWEEN o.ts - INTERVAL '5 seconds' AND o.ts + INTERVAL '5 seconds' +) +SELECT + option_ts, + stock_ts, + underlying, + expiry, + strike, + contracts, + shares, + ratio_error +FROM paired +WHERE ratio_error <= contracts * 20 +ORDER BY option_ts; +``` + +These SQL patterns are deliberately conservative. They are best used to populate candidate event sets for downstream scoring, not as final labels by themselves. That is especially true for sweeps, retail whales, and buy-write detection, where account-level or route-level data can dramatically improve precision. + +## Open Questions and Limitations + +The biggest limitation is identity. Public OPRA-style options data gives venue, timestamps, quotes, trade conditions, and open interest, but not the beneficial owner, true account class, or reliable open/close intent in real time. Exchange and broker systems may carry professional/customer origin codes or route-chain data, but the public tape generally does not. That means participant-style labels from public data are inferential, not definitive. + +The second limitation is that the literature is not unanimous. Some papers find strong informational content in options flow and meaningful options-led price discovery; other papers find that pre-earnings options activity is often speculative and retail-dominated. Treat that disagreement as a feature, not a bug: it is exactly why your production system should output calibrated probabilities, reason codes, and low-confidence abstentions instead of pretending every urgent call buy is “smart money.” + +The last limitation is regime drift. The SEC’s recent options market-structure work and the newer 0DTE literature both show that short-dated and expiration-day activity has become a much larger share of the market, especially in index products and select equities. 
Thresholds that worked before widespread 0DTE activity can age badly. Refit by liquidity regime, by DTE bucket, and by event context, or the model will quietly rot. diff --git a/tape-overhaul-phase1-1.md b/tape-overhaul-phase1-1.md deleted file mode 100644 index c2a1016..0000000 --- a/tape-overhaul-phase1-1.md +++ /dev/null @@ -1,170 +0,0 @@ -# Server-Backed Persistent History - -## Summary - -Make live mode server-authoritative across refreshes, sessions, and devices. The browser will not own data persistence. On load, the app will hydrate from ClickHouse-backed server history, then layer live WebSocket updates on top. Users will immediately see a substantial recent persisted window, with older records available through history pagination. - -## Chosen Defaults - -- Source of truth: ClickHouse on the server. -- Browser persistence: UI preferences only, no market-data cache. -- Initial load: recent persisted window per active channel. -- Older data: fetched on demand using cursor pagination. -- Scope: every channel the server handles, including options, NBBO, equities, equity quotes, equity joins, flow packets, classifier hits, alerts, inferred dark events, candles, and chart overlays. -- Freshness: freshness affects status labels only; it must not hide persisted history from a refreshed browser. - -## Current State To Change - -- `LiveStateManager` hydrates from Redis or ClickHouse, but freshness gates currently suppress stale options, NBBO, equities, and flow snapshots. -- The unified `/ws/live` protocol supports snapshots and `next_before`, but the frontend does not retain/use per-channel history cursors for live-mode pagination. -- Some channels have REST history endpoints, but `equity-quotes` is not fully represented in the unified live protocol/history API. -- Charts already query ClickHouse for candle and overlay ranges, but should be treated as part of the same server-history model. 
- -## Public Interfaces And Types - -Update `packages/types/src/live.ts`: - -- Add `"equity-quotes"` to: - - `LiveGenericChannelSchema` - - `LiveChannelSchema` - - `LiveSubscriptionSchema` - - `livePayloadSchemas` -- Preserve existing `FeedSnapshot` shape: - - `items` - - `watermark` - - `next_before` - -Update API routes in `services/api/src/index.ts`: - -- Add `GET /history/equity-quotes?before_ts=&before_seq=&limit=`. -- Include `equity-quotes` in `/ws/live` subscriptions and fanout. -- Keep existing recent/replay endpoints compatible. - -Update storage in `packages/storage/src/clickhouse.ts`: - -- Add `fetchEquityQuotesBefore`. -- Reuse existing `(ts, seq)` cursor ordering. -- Keep limits clamped consistently with other history endpoints. - -## Server Implementation - -In `services/api/src/live.ts`: - -1. Add generic config for `equity-quotes`: - - Redis key: `live:equity-quotes` - - cursor field: `equity-quotes` - - parser: `EquityQuoteSchema` - - cursor: `{ ts, seq }` - - fetchRecent: `fetchRecentEquityQuotes` -2. Stop filtering historical snapshots by freshness: - - Remove `filterFreshGenericItems` from snapshot construction. - - Keep `isLiveItemFresh` available for UI status/fanout behavior if needed. - - Do not reject persisted ClickHouse rows just because market timestamps are older than 15s/30s. -3. Stop rejecting stale ingests inside `LiveStateManager.ingest`. - - The manager should store valid events it receives. - - Event fanout can still choose how to label status, but should not silently lose durable cache state. -4. Preserve Redis as a hot cache: - - Redis remains an optimization. - - ClickHouse remains the fallback and source of truth. - - API startup should hydrate from Redis if present, otherwise from ClickHouse. - -In `services/api/src/index.ts`: - -1. Include `equity-quotes` in `consumerBindings`. -2. Pump `EquityQuoteSchema` payloads into: - - legacy `/ws/equity-quotes` - - unified `/ws/live` - - `LiveStateManager` -3. 
Add `/history/equity-quotes`. -4. Keep durable consumer defaults unchanged unless a test proves old events are skipped in a live-running API scenario. ClickHouse hydration handles restart and refresh persistence. - -## Frontend Implementation - -In `apps/web/app/terminal.tsx`: - -1. Extend `LiveSessionState` with: - - per-subscription `next_before` cursors - - per-subscription loading/error state for older history - - equity quotes if exposed in UI state -2. When handling `snapshot` messages: - - Replace the channel's current items with snapshot items when non-empty. - - Store `snapshot.next_before`. - - Do not discard stale-but-persisted rows. - - Continue deduping by `trace_id/seq` or `id`. -3. Add a generic live-history loader: - - Map subscription channel to history endpoint: - - `options` -> `/history/options` - - `nbbo` -> `/history/nbbo` - - `equities` -> `/history/equities` - - `equity-quotes` -> `/history/equity-quotes` - - `equity-joins` -> `/history/equity-joins` - - `flow` -> `/history/flow` - - `classifier-hits` -> `/history/classifier-hits` - - `alerts` -> `/history/alerts` - - `inferred-dark` -> `/history/inferred-dark` - - Carry option/flow filters into options history queries. - - Merge older results into existing channel state. - - Advance `next_before` from the response. - - Stop when `next_before` is null or the response is empty. -4. UI behavior: - - Add a compact "Load older" control at the bottom of each applicable tape/list. - - Disable it while loading. - - Hide it when no more history exists. - - Keep existing pause/jump controls unchanged. - - Do not add browser market-data storage. -5. Chart behavior: - - Keep candles loading from `/candles/equities`. - - Keep overlay loading from `/prints/equities/range`. - - Ensure refresh and device changes show the same server data for the same ticker/window. - -## Config And Deployment - -Update `.env.example`: - -- Add `LIVE_LIMIT_EQUITY_QUOTES=10000`. 
-- Document that `LIVE_LIMIT_*` controls initial server snapshot/hot-cache depth, not total persisted history. - -Update README if needed: - -- Clarify persistence model: - - ClickHouse is durable history. - - Redis is hot cache. - - Browser is not a market-data database. - - All devices connected to the same API see the same server-seen data. - -Docker volumes already persist ClickHouse/Redis/NATS data locally and in deployment compose, so no migration is needed for volume-backed persistence. - -## Tests - -API tests in `services/api/tests/live.test.ts`: - -- Snapshot hydration returns stale historical options/NBBO/equities/flow instead of filtering them out. -- `LiveStateManager.ingest` stores older valid events. -- `equity-quotes` hydrates from Redis. -- `equity-quotes` hydrates from ClickHouse when Redis is empty. -- `next_before` is set from the oldest item in returned snapshot. -- Redis hot cache persists hydrated ClickHouse data. - -Storage tests: - -- Add `fetchEquityQuotesBefore` coverage using the existing storage test style. - -Frontend tests in `apps/web/app/terminal.test.ts`: - -- Live snapshot with older persisted rows populates visible rows. -- Empty snapshot does not wipe existing visible rows only when preserving an already visible channel during reconnect. -- Older-history merge dedupes existing items. -- History cursor advances after loading older rows. -- "No more history" state is reached when `next_before` is null. -- Live status can be stale while items remain visible. - -## Acceptance Criteria - -- Refreshing the app shows persisted data immediately, even when no new live events arrive after page load. -- Opening the app on another device connected to the same API shows the same server-backed recent history. -- Stale market timestamps do not cause persisted history to disappear. -- Users can load older data beyond the initial recent window. -- Live WebSocket updates still appear without requiring refresh. 
-- Redis loss does not lose history; API falls back to ClickHouse. -- Browser cache deletion does not lose market data. -- `bun test services/api/tests/live.test.ts apps/web/app/terminal.test.ts packages/storage/tests/*.test.ts` passes, or any unavailable test target is documented. diff --git a/tape-overhaul-phase1.md b/tape-overhaul-phase1.md deleted file mode 100644 index ead0bd6..0000000 --- a/tape-overhaul-phase1.md +++ /dev/null @@ -1,320 +0,0 @@ -# Options Overhaul Phase 1: Snapshot Tape Table - -## Summary - -Convert the Options tape into a dense table where every row is an individual option print with preserved execution context. The print itself becomes the authoritative record for what was known around that trade at the moment it printed: option NBBO, underlying spot, IV, notional, side/classification metadata, and classifier-derived row coloring. - -This phase includes backend enrichment, storage/type changes, synthetic IV behavior, and the frontend table redesign together. - -## Core Principle - -Do not treat NBBO, spot, or IV as live lookups in the table once the print has been recorded. - -Each option print should carry a snapshot of its execution context. The UI should prefer those preserved fields and only fall back to current side maps for legacy rows that predate the migration. - -## Public Type Changes - -Extend `OptionPrintSchema` / `OptionPrint` in `packages/types/src/events.ts`. 
- -Add optional flat fields: - -```ts -execution_nbbo_bid?: number; -execution_nbbo_ask?: number; -execution_nbbo_mid?: number; -execution_nbbo_spread?: number; -execution_nbbo_bid_size?: number; -execution_nbbo_ask_size?: number; -execution_nbbo_ts?: number; -execution_nbbo_age_ms?: number; -execution_nbbo_side?: OptionNbboSide; - -execution_underlying_spot?: number; -execution_underlying_bid?: number; -execution_underlying_ask?: number; -execution_underlying_mid?: number; -execution_underlying_spread?: number; -execution_underlying_ts?: number; -execution_underlying_age_ms?: number; -execution_underlying_source?: "equity_quote_mid"; - -execution_iv?: number; -execution_iv_source?: "provider" | "synthetic_pressure_model"; -``` - -Keep existing fields for compatibility: - -- `nbbo_side` -- `notional` -- `underlying_id` -- `option_type` -- `signal_*` - -Set `nbbo_side` to match `execution_nbbo_side` for new prints so existing filters continue working. - -## Storage Changes - -Update `packages/storage/src/option-prints.ts`. - -Add ClickHouse columns: - -```sql -execution_nbbo_bid Nullable(Float64), -execution_nbbo_ask Nullable(Float64), -execution_nbbo_mid Nullable(Float64), -execution_nbbo_spread Nullable(Float64), -execution_nbbo_bid_size Nullable(UInt32), -execution_nbbo_ask_size Nullable(UInt32), -execution_nbbo_ts Nullable(UInt64), -execution_nbbo_age_ms Nullable(Float64), -execution_nbbo_side Nullable(String), - -execution_underlying_spot Nullable(Float64), -execution_underlying_bid Nullable(Float64), -execution_underlying_ask Nullable(Float64), -execution_underlying_mid Nullable(Float64), -execution_underlying_spread Nullable(Float64), -execution_underlying_ts Nullable(UInt64), -execution_underlying_age_ms Nullable(Float64), -execution_underlying_source Nullable(String), - -execution_iv Nullable(Float64), -execution_iv_source Nullable(String) -``` - -Add `ALTER TABLE ... ADD COLUMN IF NOT EXISTS` migrations for all fields. 
- -Update row normalization so missing legacy values parse as `undefined`. - -## Ingest Enrichment - -Update `services/ingest-options/src/index.ts`. - -Maintain caches: - -- latest option NBBO by contract -- latest equity quote by underlying -- synthetic/adapter-provided IV by contract when available - -When an option trade arrives: - -1. Parse raw print. -2. Derive underlying, option type, notional, ETF flag as today. -3. Select latest option NBBO for the contract at or before `print.ts`. -4. Attach preserved NBBO fields: - - bid, ask, mid, spread - - bid/ask sizes - - quote timestamp - - quote age - - execution NBBO side -5. Select latest equity quote for the underlying at or before `print.ts`. -6. Attach preserved underlying fields: - - bid, ask, mid - - spread - - quote timestamp - - quote age - - `execution_underlying_spot = mid` - - `execution_underlying_source = "equity_quote_mid"` -7. Attach IV if available. -8. Evaluate signal filters using preserved execution fields. -9. Persist and publish the enriched print. - -Important behavior: - -- Do not mark these preserved fields stale in the UI. -- Age fields are still stored for auditability. -- If no at-or-before quote exists, leave that context unset. -- Never use a quote after the option print timestamp for preserved execution context. - -## Synthetic IV Model - -Update `services/ingest-options/src/adapters/synthetic.ts`. - -Add persistent contract-level IV state: - -```ts -type SyntheticContractIvState = { - iv: number; - pressure: number; - lastTs: number; -}; -``` - -Behavior: - -- Initialize IV from a plausible baseline based on DTE and moneyness. -- Maintain IV per contract across bursts. -- Repeated aggressive buying of the same contract raises pressure and IV. -- Aggressive buying means synthetic placement `A` or `AA`. -- `MID` has small/no pressure. -- `B` or `BB` reduces pressure slightly. -- Pressure decays over time after inactivity. -- IV is clamped to a plausible range. 
- -Recommended defaults: - -- Baseline IV: `0.18` to `0.65` -- 0DTE contracts start higher than far-dated contracts. -- Out-of-the-money contracts start slightly higher than near-the-money contracts. -- Ask/above-ask print pressure increment: proportional to size and notional. -- Decay half-life: roughly 30-90 seconds in synthetic time. -- Clamp IV to `0.05..2.5`. - -Each synthetic `OptionPrint` should include: - -```ts -execution_iv -execution_iv_source: "synthetic_pressure_model" -``` - -Synthetic NBBO and trade price generation should remain coherent: - -- As IV rises, option mid/ask should drift higher for that contract. -- Rapid same-contract buying should visibly increase both print price and IV over subsequent prints. -- Bid/ask spread may widen mildly with higher IV. - -## Real Adapter IV Behavior - -For Alpaca, Databento, and IBKR in Phase 1: - -- Preserve NBBO and underlying spot context through ingest enrichment. -- Leave `execution_iv` unset unless the adapter already provides a reliable IV value. -- Do not invent IV for real feeds in Phase 1. - -Synthetic is the only source that must generate IV in this phase. - -## Frontend Table Redesign - -Update `apps/web/app/terminal.tsx` and `apps/web/app/globals.css`. - -Each Options row remains an `OptionPrint`. - -Default columns: - -- `TIME` -- `SYM` -- `EXP` -- `STRIKE` -- `C/P` -- `SPOT` -- `DETAILS` -- `TYPE` -- `VALUE` -- `SIDE` -- `IV` -- `CLASSIFIER` - -Column sources: - -- `SPOT`: `execution_underlying_spot`, fallback `--` -- `SIDE`: `execution_nbbo_side ?? nbbo_side` -- `IV`: `execution_iv`, formatted as percent, fallback `--` -- `DETAILS`: `{size}@{price}_{side}` -- `VALUE`: `notional ?? price * size * 100` - -For legacy rows only: - -- If preserved NBBO is missing, fallback to existing frontend NBBO map. -- If preserved spot/IV is missing, render `--`. 
- -## Classifier Row Coloring - -Add derived indexes in `TerminalProvider`: - -- `classifierHitsByPacketId` -- `packetIdByOptionTraceId` -- `classifierDecorByOptionTraceId` - -A print inherits classifier color if its trace ID belongs to a flow packet that produced classifier hits. - -Primary hit selection: - -1. Highest confidence -2. Newest `source_ts` -3. Highest `seq` - -Classifier families: - -- `large_bullish_call_sweep`: green -- `large_bearish_put_sweep`: red -- `unusual_contract_spike`: amber -- `large_call_sell_overwrite`: copper -- `large_put_sell_write`: copper -- `straddle` / `strangle`: blue -- `vertical_spread`: teal -- `ladder_accumulation`: yellow-green -- `roll_up_down_out`: violet -- `far_dated_conviction`: cyan -- `zero_dte_gamma_punch`: magenta -- unknown: neutral - -Confidence controls row intensity. - -## Interaction - -Classified rows: - -- Click opens existing classifier/alert drawer behavior through `state.openFromClassifierHit(primaryHit)`. -- Keyboard Enter/Space does the same. -- Row remains compact and table-like. - -Unclassified rows: - -- Hover only. -- No drawer action. - -## Live Manifest - -Update `/tape` live subscriptions to include classifier hits: - -```ts -[ - { channel: "options", filters: flowFilters }, - { channel: "nbbo" }, - { channel: "equities" }, - { channel: "flow", filters: flowFilters }, - { channel: "classifier-hits" } -] -``` - -The table uses preserved execution context from options first, not these side feeds. - -## Tests - -Add/update tests for: - -- `OptionPrintSchema` accepts preserved execution context fields. -- ClickHouse option print normalization handles missing legacy context fields. -- Ingest enrichment attaches preserved NBBO context. -- Ingest enrichment attaches preserved underlying quote mid as spot. -- Enrichment never uses quotes after the option print timestamp. -- `nbbo_side` mirrors `execution_nbbo_side` for new enriched prints. 
-- Synthetic IV increases under repeated same-contract ask/above-ask buying. -- Synthetic IV decays after inactivity. -- Synthetic IV remains within clamps. -- Options table renders SPOT from `execution_underlying_spot`. -- Options table renders IV from `execution_iv`. -- Legacy rows render `--` for missing SPOT/IV. -- Classifier family mapping and primary hit selection work. -- Classified row opens existing classifier/alert drawer path. - -## Acceptance Criteria - -- The Options tape is a dense table, not card rows. -- Every new option print stores preserved execution NBBO context. -- Every new option print stores preserved execution underlying spot when an at-or-before equity quote exists. -- Synthetic option prints store dynamic IV. -- Synthetic repeated buying of the same contract visibly increases IV. -- The table reads NBBO, SPOT, and IV from preserved print fields first. -- Classifier-hit rows are color-coded by classifier family. -- Existing live/replay filters and tape controls still work. -- No context field is visually treated as stale after being attached to the print. -- Legacy data remains readable with graceful fallbacks. - -## Assumptions - -- Phase 1 uses flat fields for queryability and simple table rendering. -- Underlying spot means equity quote mid at or before the option print timestamp. -- NBBO context means option quote at or before the option print timestamp. -- Preserved age fields are audit metadata, not UI freshness warnings. -- Real-feed IV can remain absent until a reliable provider value is available.