dirtydishes 82861408e4 Track project docs

Remove docs from .gitignore and add AGENTS.md, PLAN.md, CODING_STYLE.md, and RESEARCH.md to the repository.

2025-12-29 22:10:26 -05:00

2.4 KiB


RESEARCH.md — Signal Evaluation & Backtesting Discipline

This document defines how research is conducted and interpreted in this repository.

Its purpose is to prevent self-deception, not to slow exploration.

Core Research Principles

No hindsight
No intent claims
No performance without context
No conclusions without uncertainty

All results are provisional unless explicitly validated.

Fact vs Inference

Raw market data = fact
Classifiers, signals, and labels = inference

Inference must always:

reference its evidence
expose confidence
be stored separately from raw data
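
As a minimal sketch of the three requirements above, an inference can be modeled as its own record type that refuses to exist without evidence or a confidence value. The class name, field names, label, and the two in-memory stores are all hypothetical and stand in for whatever storage this repository actually uses.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Inference:
    """A derived claim about raw data; never written into the raw store."""
    label: str                      # hypothetical label name
    confidence: float               # exposed explicitly, in [0.0, 1.0]
    evidence_ids: tuple[str, ...]   # keys of the raw records backing the label
    created_at: datetime

    def __post_init__(self) -> None:
        if not self.evidence_ids:
            raise ValueError("inference must reference its evidence")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")


# Raw facts and inferences live in separate stores (plain containers here).
raw_store: dict[str, dict] = {"trade-001": {"price": 1.25, "size": 40}}
inference_store: list[Inference] = []

inference_store.append(
    Inference(
        label="vol_expansion_candidate",
        confidence=0.62,
        evidence_ids=("trade-001",),
        created_at=datetime(2025, 1, 2, 14, 30, tzinfo=timezone.utc),
    )
)
```

Because the record is frozen and validated on construction, an inference without evidence fails loudly instead of being stored.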

Labeling Rules

Every evaluation must specify:

Time horizon (e.g. 5m, 15m, 60m)
Metric (return, vol expansion, maximum adverse / favorable excursion (MAE/MFE), etc.)
Directional vs volatility outcome
Entry definition (event time, next tick, next candle)

No implicit labels. Ever.
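
One way to enforce this, sketched under assumptions, is a label specification with no default values, so an evaluation cannot be constructed until all four choices are written down. `LabelSpec`, `Outcome`, and `Entry` are illustrative names, not existing types in this repository.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    DIRECTIONAL = "directional"
    VOLATILITY = "volatility"


class Entry(Enum):
    EVENT_TIME = "event_time"
    NEXT_TICK = "next_tick"
    NEXT_CANDLE = "next_candle"


@dataclass(frozen=True)
class LabelSpec:
    """All four fields are required; there are no defaults to fall back on."""
    horizon_minutes: int
    metric: str
    outcome: Outcome
    entry: Entry

    def __post_init__(self) -> None:
        if self.horizon_minutes <= 0:
            raise ValueError("horizon must be positive")
        if not self.metric:
            raise ValueError("metric must be named")


spec = LabelSpec(
    horizon_minutes=15,
    metric="return",
    outcome=Outcome.DIRECTIONAL,
    entry=Entry.NEXT_CANDLE,
)
```

Omitting any field raises a `TypeError` at construction time, which is the point: the label is explicit or it does not exist.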

Backtesting Constraints

No lookahead bias
No survivorship bias
Use only data available at the event timestamp
Open interest (OI) snapshots must match the event date

Replay pipelines must mirror live pipelines exactly.
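
The "data available at the event timestamp" rule can be sketched as a point-in-time filter. This assumes each record carries an `available_at` field recording when it became visible; that field name, the record ids, and the sample records are made up for illustration.

```python
from datetime import datetime, timezone


def visible_at(records: list[dict], event_time: datetime) -> list[dict]:
    """Keep only records that were available at or before event_time."""
    return [r for r in records if r["available_at"] <= event_time]


event_time = datetime(2025, 1, 3, 15, 0, tzinfo=timezone.utc)

# Made-up records: the second one only became available after the event.
records = [
    {"id": "q1", "available_at": datetime(2025, 1, 3, 14, 59, tzinfo=timezone.utc)},
    {"id": "q2", "available_at": datetime(2025, 1, 3, 15, 1, tzinfo=timezone.utc)},
]

usable = visible_at(records, event_time)
```

Calling the same filter from both the replay path and the live path is one possible way to keep the two from drifting apart.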

Evaluation Metrics (minimum set)

Report at least one of:

Precision / recall
Hit rate vs baseline
Forward return distribution
Realized vs implied volatility
Calibration (probability vs outcome)

Single “win rate” numbers are insufficient.
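
To illustrate why a lone win rate says little, the sketch below computes precision and recall for a signal and compares its hit rate against the base rate of the outcome. Function names and the sample flags are hypothetical; the numbers are toy inputs, not results.

```python
def precision_recall(flagged: list[bool], outcome: list[bool]) -> tuple[float, float]:
    """Precision and recall of a boolean signal against boolean outcomes."""
    pairs = list(zip(flagged, outcome))
    tp = sum(1 for f, o in pairs if f and o)
    fp = sum(1 for f, o in pairs if f and not o)
    fn = sum(1 for f, o in pairs if not f and o)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def base_rate(outcome: list[bool]) -> float:
    """How often the outcome occurs regardless of the signal."""
    return sum(outcome) / len(outcome) if outcome else 0.0


# Toy inputs for illustration only.
flagged = [True, True, False, False, True, False]
outcome = [True, False, False, True, True, False]

precision, recall = precision_recall(flagged, outcome)
lift_over_baseline = precision - base_rate(outcome)
```

Here the hit rate among flagged events (precision) is read against how often the outcome happens anyway, so a high number on its own cannot pass as an edge.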

Regime Awareness

Results must be contextualized by:

Market regime (trend / chop / high vol)
Time of day
DTE (days to expiration) bucket
Liquidity conditions

If a signal only works sometimes, that’s still information.
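
A rough sketch of contextualizing results: group outcomes by a context key and report the sample count next to the hit rate, so thin buckets are visible. The `regime` and `hit` field names and the rows are invented for illustration.

```python
from collections import defaultdict


def hit_rate_by_bucket(results: list[dict], key: str) -> dict[str, tuple[int, float]]:
    """Map each bucket value to (sample count, hit rate)."""
    buckets: dict[str, list[bool]] = defaultdict(list)
    for row in results:
        buckets[row[key]].append(row["hit"])
    return {name: (len(hits), sum(hits) / len(hits)) for name, hits in buckets.items()}


# Invented rows, not real results.
results = [
    {"regime": "trend", "hit": True},
    {"regime": "trend", "hit": True},
    {"regime": "trend", "hit": False},
    {"regime": "chop", "hit": False},
    {"regime": "chop", "hit": False},
]

by_regime = hit_rate_by_bucket(results, "regime")
```

The same function could be called with a time-of-day, DTE, or liquidity key, assuming those fields are recorded per event.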

Threshold Tuning Rules

Thresholds may be tuned only on a defined training window.
Validation must occur on disjoint data.
Never tune on the same period you report.

Document when thresholds change and why.
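
The disjointness rule can be checked mechanically. The sketch below assumes half-open date windows (start inclusive, end exclusive); `Window` and `assert_disjoint` are hypothetical helpers and the dates are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Window:
    start: date  # inclusive
    end: date    # exclusive

    def __post_init__(self) -> None:
        if self.start >= self.end:
            raise ValueError("window must have positive length")


def overlaps(a: Window, b: Window) -> bool:
    """True if the two half-open windows share at least one day."""
    return a.start < b.end and b.start < a.end


def assert_disjoint(train: Window, report: Window) -> None:
    """Refuse to report on any period that was used for tuning."""
    if overlaps(train, report):
        raise ValueError("tuning window overlaps the reported window")


train = Window(date(2025, 1, 1), date(2025, 4, 1))
validation = Window(date(2025, 4, 1), date(2025, 7, 1))
assert_disjoint(train, validation)  # adjacent windows do not overlap
```

Storing the training window alongside each threshold change would also give a record of when the threshold changed and what data justified it.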

Language Discipline

Allowed:

“suggests”
“consistent with”
“correlated with”
“higher likelihood”

Disallowed:

“smart money”
“institutional intent”
“guaranteed”
“predicts”
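
The disallowed list lends itself to a simple check on written notes or reports. This is a minimal sketch with a hypothetical function name; it matches whole words and phrases case-insensitively and makes no attempt to judge context.

```python
import re

DISALLOWED = ("smart money", "institutional intent", "guaranteed", "predicts")


def find_disallowed(text: str) -> list[str]:
    """Return the disallowed terms that appear in text, in list order."""
    lowered = text.lower()
    return [
        term
        for term in DISALLOWED
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]


flagged_terms = find_disallowed("This flow predicts a rally; Smart Money is buying.")
clean_terms = find_disallowed("Flow is consistent with a higher likelihood of expansion.")
```

A check like this could run over result write-ups before they are committed, assuming such a step fits the workflow.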

Recording Results

For each classifier or hypothesis, record:

What was tested
What failed
What partially worked
What conditions mattered

Failed ideas are assets. Keep them.

Final Reminder

This system is for understanding behavior, not for proving superiority.

If a result cannot survive replay, uncertainty, and explanation, it does not belong in production logic.
