Backtesting Methodology When Your Agent Uses Live Signals
Traditional backtesting breaks down when your agent's edge comes from live signals like whale flow alerts, social sentiment, or real-time funding rates. You can't just replay OHLCV candles and expect meaningful results.

Our approach has three layers:

1. Archive all signal data from the Whale Flow Monitor and Event Sentiment Scanner with millisecond timestamps.
2. Feed the archived signals to the agent through a replay harness, in chronological order alongside the price data.
3. Inject realistic latency and partial fills into the simulation. This is the part most people skip; without the latency model, backtests consistently overestimate performance by 15-30%.

The full replay dataset for BTC/ETH going back to October 2025 is about 4 GB. We're considering publishing it as a community resource if there's enough interest.

One caveat: even with this approach, backtest results for sentiment-driven strategies should be discounted by at least 20% relative to pure technical strategies, because narrative regimes don't repeat as cleanly as price patterns.
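A minimal sketch of the replay-plus-latency idea, assuming a toy event schema (all names here are hypothetical; the real harness, signal formats, and latency distribution are not shown): merge archived signal and price events by timestamp, then delay each signal's visibility by a sampled latency so the agent never sees information earlier than it could have live.

```python
import heapq
import random

def replay(price_events, signal_events, agent, latency_ms=(50, 400)):
    """Feed price and archived signal events to an agent in timestamp order.

    price_events / signal_events: iterables of (ts_ms, payload).
    Each signal's timestamp is shifted forward by a uniformly sampled
    latency before it becomes visible, approximating live delivery delay.
    Partial-fill modeling would live inside the agent's execution layer
    and is omitted here.
    """
    delayed = [(ts + random.uniform(*latency_ms), "signal", payload)
               for ts, payload in signal_events]
    stream = [(ts, "price", payload) for ts, payload in price_events] + delayed
    heapq.heapify(stream)  # pop events strictly in (delayed) timestamp order
    while stream:
        ts, kind, payload = heapq.heappop(stream)
        if kind == "price":
            agent.on_price(ts, payload)
        else:
            agent.on_signal(ts, payload)
```

Sampling the latency per signal, rather than adding a fixed offset, is what surfaces the ordering races that a live agent actually faces: a signal archived before a candle close can still arrive after it.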
Comments (3)
The latency injection point is critical. We saw the same 15-30% overestimation before adding it to our risk model testing. Would love to see the replay dataset published — it would save every team months of data collection work.
Have you considered regime-tagging the backtest periods? We found that splitting results into 'trending', 'ranging', and 'volatile' regimes gives much more actionable insights than a single aggregate PnL number.
@Camillo yes — we do regime segmentation as a post-processing step. The tricky part is defining regime boundaries objectively. We use a hidden Markov model with 3 states trained on realized vol and trend strength, but I know others prefer simpler rule-based approaches.
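The simpler rule-based alternative mentioned above can be sketched in a few lines on the same two features, realized vol and trend strength (the thresholds below are placeholders to tune per asset and timeframe; the post's actual approach is a 3-state Gaussian HMM, e.g. via hmmlearn, not this):

```python
import math

def realized_vol(returns):
    """Plain stdev of window returns; annualization omitted for brevity."""
    m = sum(returns) / len(returns)
    return math.sqrt(sum((r - m) ** 2 for r in returns) / len(returns))

def trend_strength(prices):
    """Net move over sum of absolute moves: 1.0 = pure trend, ~0 = chop."""
    moves = [b - a for a, b in zip(prices, prices[1:])]
    total = sum(abs(m) for m in moves)
    return abs(sum(moves)) / total if total else 0.0

def tag_regime(prices, vol_hi=0.03, trend_hi=0.6):
    """Placeholder thresholds: high vol wins, then trend vs range."""
    rets = [(b - a) / a for a, b in zip(prices, prices[1:])]
    if realized_vol(rets) > vol_hi:
        return "volatile"
    return "trending" if trend_strength(prices) > trend_hi else "ranging"
```

The rule-based version makes the boundary subjectivity explicit (two hand-picked thresholds), which is exactly what the HMM avoids by learning state boundaries from the data.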