AI / ML · 2026

Gold Fair Value — Probabilistic Quant Trading System

This Ospina case study documents how Carlos Rico-Ospina approached a quantitative research problem, finding a gold trading edge that holds up in production, and what was built to address it.

Probabilistic gold trading system where every signal must survive an 11-gate falsification framework, Holm-corrected significance testing, and walk-forward validation before it reaches production.

Volterra Kernels · Nonlinear Cointegration · Conformal Calibration · CRPS · Regime Detection · Live Trading

The Problem

Most quantitative approaches to gold produce impressive backtests that collapse in production. The failure modes are well-known but rarely addressed: future information leaks into features through publication lags, multiple hypothesis testing inflates apparent significance, transaction costs and regime shifts go unmodeled, and walk-forward discipline gets replaced by convenient cross-validation. Daily gold prices behave close to a random walk, so any claimed edge demands extraordinary proof, not an optimistic Sharpe ratio from a single backtest window.
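To make the publication-lag leak concrete, here is a minimal sketch of a point-in-time lookup. It is an illustration under stated assumptions, not the system's actual code: the function name `available_value`, the monthly series, and the 30-day lag are all hypothetical, chosen only to show that a value becomes usable on its release date rather than on the date it describes.

```python
from bisect import bisect_right
from datetime import date, timedelta
from typing import List, Optional, Tuple


def available_value(
    observations: List[Tuple[date, float]],
    lag_days: int,
    as_of: date,
) -> Optional[float]:
    """Return the latest value already published on `as_of`.

    `observations` holds (reference_date, value) pairs sorted by
    reference_date. Each value is assumed to be published `lag_days`
    after its reference date (a simplifying assumption; real sources
    have irregular release calendars).
    """
    release_dates = [ref + timedelta(days=lag_days) for ref, _ in observations]
    # Count how many releases happened on or before `as_of`.
    i = bisect_right(release_dates, as_of)
    return observations[i - 1][1] if i > 0 else None


# Hypothetical monthly series with a 30-day publication lag.
series = [(date(2024, 1, 31), 100.0), (date(2024, 2, 29), 110.0)]
mid_feb = available_value(series, 30, date(2024, 2, 15))    # nothing released yet
mid_mar = available_value(series, 30, date(2024, 3, 15))    # January figure only
```

A naive join on the reference date would have handed the January figure to a model on 31 January, a month before anyone could have seen it.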

The Insight

The edge, if it exists, isn't in the model—it's in the validation architecture around the model. A system that enforces point-in-time feature engineering with explicit publication lags, subjects every hypothesis to pre-registered falsification gates (Harvey t-stat ≥ 3.0, Holm-corrected p-values, stationary bootstrap, lag-placebo controls), and outputs calibrated probability distributions instead of point forecasts can distinguish real signal from noise—and has the discipline to reject candidates that don't survive.
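As a rough illustration of how a Holm-corrected gate combines with a t-stat threshold, the sketch below computes Holm step-down adjusted p-values and applies a joint pass/fail rule. The 3.0 t-stat threshold comes from the text above; the function names, the 0.05 significance level, and the candidate numbers are assumptions made for the example.

```python
from typing import List


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, ...
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted p-values never decrease.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


def passes_gate(t_stat: float, p_adj: float,
                t_min: float = 3.0, alpha: float = 0.05) -> bool:
    """A candidate passes only if both conditions hold (alpha is assumed)."""
    return abs(t_stat) >= t_min and p_adj <= alpha


# Hypothetical candidates: (t-stat, raw p-value).
candidates = [(3.4, 0.001), (2.4, 0.02), (2.1, 0.04)]
p_adj = holm_adjust([p for _, p in candidates])
decisions = [passes_gate(t, p) for (t, _), p in zip(candidates, p_adj)]
```

In this made-up family, the second and third candidates are significant at the conventional 5% level on their raw p-values but fail the gate, which is the behavior the falsification framework is meant to enforce.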

What I Built

  • Built Volterra kernel fair value model convolving long-memory supply/demand components (mine production, central bank buying, ETF flows, jewelry, recycled gold) with kernel-specific decay profiles—power-law for persistent structural signals, exponential for fast-decaying demand categories
  • Implemented nonlinear cointegration framework with threshold ECM (TAR/M-TAR), Gregory-Hansen structural break detection, and Holm-corrected multiple hypothesis testing—candidates must survive linear, threshold, and nonlinear test families before promotion
  • Engineered 11-gate edge viability falsification framework: OOS decay detection, execution friction frontier (27-scenario cost grid), half-life and latency viability curves, stationary bootstrap (Politis-Romano), signal-alignment permutation test, lag-placebo RC-style max test, forward holdout with locked thresholds, and capacity scaling at 3× notional
  • Built probabilistic ensemble pipeline (OLS quantile, Quantile Random Forest, XGBoost quantile, GJR-GARCH, Markov regime-switching) evaluated by continuous ranked probability score (CRPS), with conformal quantile calibration (distribution-free coverage guarantees) and strict walk-forward temporal validation
  • Designed live trading cockpit with DTC WebSocket + SCID dual-source ingestion, real-time VPIN/GARCH/regime computation, traffic-light state machine, policy committee governance, and daily 9-gate scorecards driving automated promotion/monitoring/demotion
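The kernel decay profiles in the first bullet can be sketched as follows. This is a simplified illustration covering only the first-order (linear) term of a Volterra expansion, with hypothetical function names and parameter values; it is not the production model. It shows the two kernel shapes and a causal convolution that uses only current and past observations.

```python
from typing import List


def exponential_kernel(length: int, half_life: float) -> List[float]:
    """Weights that halve every `half_life` steps, normalized to sum to 1."""
    w = [0.5 ** (k / half_life) for k in range(length)]
    total = sum(w)
    return [x / total for x in w]


def power_law_kernel(length: int, exponent: float) -> List[float]:
    """Weights proportional to (k + 1) ** -exponent, normalized to sum to 1."""
    w = [(k + 1) ** (-exponent) for k in range(length)]
    total = sum(w)
    return [x / total for x in w]


def causal_convolve(series: List[float], kernel: List[float]) -> List[float]:
    """out[t] = sum_k kernel[k] * series[t - k], using no future values."""
    out = []
    for t in range(len(series)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k < 0:
                break  # no history before the start of the series
            acc += w * series[t - k]
        out.append(acc)
    return out


# Illustrative inputs only: a made-up flow series and assumed parameters.
etf_flows = [1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]
fast = causal_convolve(etf_flows, exponential_kernel(8, half_life=2.0))
slow = causal_convolve(etf_flows, power_law_kernel(8, exponent=0.8))
```

The power-law kernel keeps meaningful weight at long lags, which is why it suits persistent structural signals, while the exponential kernel forgets quickly. Higher-order Volterra terms, which capture interactions between lagged inputs, are omitted here.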

Outcomes

  • One spread candidate promoted (Sharpe ~2.17, HAC t-stat ~2.94, Holm-corrected p < 0.03) through the full statistical battery—alongside documented near-misses that failed at specific gates, showing the thresholds are real, not arbitrary
  • Point-in-time feature engineering with publication-lag modeling (14–35 day lags per source) prevents look-ahead bias across all 6 model families and 50+ features—validated by manual audit and 32+ edge-case tests
  • Harvey-standard discovery gates (t-stat ≥ 3.0) reject improvements that show conventional statistical significance but fail robustness—including improvements with t-stats of 2.2–2.6 that most systems would ship
  • Conformal quantile calibration provides distribution-free finite-sample coverage guarantees with adaptive miscoverage tracking; expected calibration error (ECE) < 0.05 is enforced, not just measured
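For readers unfamiliar with conformal quantile calibration, here is a minimal sketch of the split-conformal variant (often called conformalized quantile regression). It assumes a held-out calibration set and exchangeable data; the function names and toy numbers are hypothetical, and the adaptive miscoverage tracking mentioned above is not shown.

```python
import math
from typing import List, Tuple


def conformal_margin(
    lower: List[float],
    upper: List[float],
    y: List[float],
    alpha: float,
) -> float:
    """Margin to add to each side of a predicted interval.

    The score for each calibration point is how far the realized value
    fell outside [lower, upper] (negative when it landed inside).
    """
    scores = sorted(max(lo - yi, yi - hi) for lo, hi, yi in zip(lower, upper, y))
    n = len(scores)
    # Finite-sample corrected rank: ceil((n + 1) * (1 - alpha)).
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration points for this alpha
    return scores[rank - 1]


def calibrate(lo: float, hi: float, margin: float) -> Tuple[float, float]:
    """Widen (or shrink, if margin is negative) a new predicted interval."""
    return lo - margin, hi + margin


# Toy calibration set: model always predicts [0, 1] as its 80% interval.
cal_lower = [0.0] * 9
cal_upper = [1.0] * 9
cal_y = [0.5, 1.2, -0.1, 0.9, 1.5, 0.2, 0.7, 1.1, 0.3]
margin = conformal_margin(cal_lower, cal_upper, cal_y, alpha=0.2)
new_interval = calibrate(0.0, 1.0, margin)
```

Because the margin is a rank statistic of the calibration scores, the coverage argument needs no distributional assumption beyond exchangeability. Financial time series can violate that assumption, which is one reason to monitor realized miscoverage over time rather than rely on the guarantee alone.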

Why It Matters

Most quant portfolios showcase impressive Sharpe ratios. This system is defined by what it rejects—pre-registered falsification gates, Holm correction, and execution friction stress tests kill attractive-looking signals before they reach production. The promoted candidate survived; most didn't.

Demonstrates the full lifecycle from research (cointegration discovery, model ensemble, calibration) through governance (daily scorecards, quality gates) to live execution (DTC streaming, state machine, kill-switch)—not a backtest paper, but a production system with the discipline to say no.
