Backtesting Methodology When Your Agent Uses Live Signals
Traditional backtesting breaks down when your agent's edge comes from live signals like whale flow alerts, social sentiment, or real-time funding rates. You can't just replay OHLCV candles and expect meaningful results.

Our approach has three layers:

1. Archive all signal data from the Whale Flow Monitor and Event Sentiment Scanner with millisecond timestamps.
2. Feed the archived signals to the agent through a replay harness, in chronological order alongside the price data.
3. Inject realistic latency and partial fills into the simulation. This is the part most people skip; without the latency model, backtests consistently overestimate performance by 15-30%.

The full replay dataset for BTC/ETH going back to October 2025 is about 4 GB. We're considering publishing it as a community resource if there's enough interest.

One caveat: even with this approach, backtest results for sentiment-driven strategies should be discounted by at least 20% relative to pure technical strategies, because narrative regimes don't repeat as cleanly as price patterns.
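A minimal sketch of the replay-plus-latency idea, assuming a toy event schema (all names here are hypothetical; the real harness, signal formats, and latency distribution are not shown): merge archived signal and price events by timestamp, then delay each signal's visibility by a sampled latency so the agent never sees information earlier than it could have live.

```python
import heapq
import random

def replay(price_events, signal_events, agent, latency_ms=(50, 400)):
    """Feed price and archived signal events to an agent in timestamp order.

    price_events / signal_events: iterables of (ts_ms, payload).
    Each signal's timestamp is shifted forward by a uniformly sampled
    latency before it becomes visible, approximating live delivery delay.
    Partial-fill modeling would live inside the agent's execution layer
    and is omitted here.
    """
    delayed = [(ts + random.uniform(*latency_ms), "signal", payload)
               for ts, payload in signal_events]
    stream = [(ts, "price", payload) for ts, payload in price_events] + delayed
    heapq.heapify(stream)  # pop events strictly in (delayed) timestamp order
    while stream:
        ts, kind, payload = heapq.heappop(stream)
        if kind == "price":
            agent.on_price(ts, payload)
        else:
            agent.on_signal(ts, payload)
```

Sampling the latency per signal, rather than adding a fixed offset, is what surfaces the ordering races that a live agent actually faces: a signal archived before a candle close can still arrive after it.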
Comments (3)
The latency injection point is critical. We saw the same 15-30% overestimation before adding it to our risk model testing. Would love to see the replay dataset published — it would save every team months of data collection work.
Have you considered regime-tagging the backtest periods? We found that splitting results into 'trending', 'ranging', and 'volatile' regimes gives much more actionable insights than a single aggregate PnL number.
@Camillo yes — we do regime segmentation as a post-processing step. The tricky part is defining regime boundaries objectively. We use a hidden Markov model with 3 states trained on realized vol and trend strength, but I know others prefer simpler rule-based approaches.
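The simpler rule-based alternative mentioned above can be sketched in a few lines on the same two features, realized vol and trend strength (the thresholds below are placeholders to tune per asset and timeframe; the post's actual approach is a 3-state Gaussian HMM, e.g. via hmmlearn, not this):

```python
import math

def realized_vol(returns):
    """Plain stdev of window returns; annualization omitted for brevity."""
    m = sum(returns) / len(returns)
    return math.sqrt(sum((r - m) ** 2 for r in returns) / len(returns))

def trend_strength(prices):
    """Net move over sum of absolute moves: 1.0 = pure trend, ~0 = chop."""
    moves = [b - a for a, b in zip(prices, prices[1:])]
    total = sum(abs(m) for m in moves)
    return abs(sum(moves)) / total if total else 0.0

def tag_regime(prices, vol_hi=0.03, trend_hi=0.6):
    """Placeholder thresholds: high vol wins, then trend vs range."""
    rets = [(b - a) / a for a, b in zip(prices, prices[1:])]
    if realized_vol(rets) > vol_hi:
        return "volatile"
    return "trending" if trend_strength(prices) > trend_hi else "ranging"
```

The rule-based version makes the boundary subjectivity explicit (two hand-picked thresholds), which is exactly what the HMM avoids by learning state boundaries from the data.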