Pass Lab Methodology: How We Calculate Propfirm Pass Probability with 95% Confidence Intervals

May 1, 2026 · Methodology · 9 min read

By · Algorithmic Forex Trader · Founder, SteadyFlowFX

Last updated: 2026-05-01

Verdict is a research/analytics tool that answers a single question with audit-grade rigor: given an EA portfolio, which prop firm shows the highest backtest pass rate against historical simulations, and with what 95% confidence interval? This post documents how the engine computes that — every step is open, deterministic (same input + seed = bit-identical report), and statistically defensible. Verdict is not financial advice; it presents backtest results, not predictions of future trading outcomes.

The methodology principle up front: Verdict surfaces the firm with the highest CI lower bound — not the highest point estimate — as the primary backtest match. A firm with a 95% CI of [55%, 90%] is the primary match over one at [40%, 95%]: the former is more stable in the backtest data. This is what makes Verdict a research tool rather than a marketing tool.

Why standard pass-rate claims are gameable

If you've shopped propfirm calculators or read prop-firm-promoter content, you've seen claims like "95% pass rate at FTMO with our portfolio". Two problems with those numbers:

  1. Path-dependence is ignored. Backtests report MAX drawdown post-hoc, on the one trade ordering that actually happened. Propfirm rules check DAILY drawdown on whatever order trades land in on a live account. The same trades in a different intraday order can pass or fail daily-DD rules differently. A backtest with an 8% max-DD can still lose 6% within a single bad day, and FTMO's 5% daily limit fires.
  2. Sample size is invisible. A "95% pass rate" claim sourced from one walk-forward window means nothing. From 24 windows, it's a real signal. Without a confidence interval, you can't tell which it is.

Verdict fixes both with a four-stage methodology: walk-forward windows, per-window Monte Carlo trade-shuffling, bootstrap confidence intervals, and a sample-size guard.

Stage 1: Walk-forward windows

For each firm, slide a window matching the firm's evaluation period (e.g. 30 days for FTMO 2-step Phase 1) across the user's backtest with stride = window/2. This yields N overlapping windows.

Window generation
window_length = firm.evaluation_days
stride = window_length / 2
N = floor((backtest_span - window_length) / stride) + 1

Concrete example: a 1-year backtest against FTMO 2-step Phase 1 (30-day window, 15-day stride) yields ~24 windows. A 2-year backtest yields ~48. For "no time limit" challenges (FundedNext, FundingPips, FXIFY, Goat), we cap window length at 60 days — long enough for portfolios to compound, short enough that the bootstrap has enough windows.

Window inclusion rule: a trade counts in window W if its entry_time ≥ W.start and exit_time ≤ W.end. Trades that span window boundaries are excluded entirely. This is conservative — cutting trades at boundaries would require window-local mark-to-market with candle data, which complicates the per-window Monte Carlo without changing the conclusion.
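
To make the mechanics concrete, here is a minimal Rust sketch of Stage 1. It is illustrative only: the Trade and Window types and their fields are stand-ins, not the engine's actual data model, and timestamps are assumed to be Unix seconds.

Stage 1 sketch (Rust, illustrative)
#[derive(Clone)]
struct Trade {
    entry_time: i64, // Unix seconds (assumed representation)
    exit_time: i64,
}

struct Window {
    start: i64,
    end: i64,
}

/// Slide a window of `window_days` across `span_days` of backtest with
/// stride = window / 2, matching the pseudocode above.
fn build_windows(backtest_start: i64, span_days: i64, window_days: i64) -> Vec<Window> {
    const DAY: i64 = 86_400;
    let window_len = window_days * DAY;
    let stride = window_len / 2;
    let span = span_days * DAY;
    let n = (span - window_len) / stride + 1; // floor division
    (0..n)
        .map(|i| {
            let start = backtest_start + i * stride;
            Window { start, end: start + window_len }
        })
        .collect()
}

/// Conservative inclusion rule: a trade belongs to a window only if it
/// opens AND closes inside it; boundary-spanning trades are dropped.
fn trades_in_window<'a>(trades: &'a [Trade], w: &Window) -> Vec<&'a Trade> {
    trades
        .iter()
        .filter(|t| t.entry_time >= w.start && t.exit_time <= w.end)
        .collect()
}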

Stage 2: Per-window Monte Carlo trade-shuffling

For each window, replay the trades 1,000 times with within-day trade order shuffled. Within-day shuffling preserves daily clustering (so daily DD remains realistic) but captures path-dependence: the same trades in different intraday order can pass or fail daily-DD rules differently.

Per-window MC iteration
For i in 1..1000:
  For each trading day in the window:
    Shuffle the day's trade order (seeded RNG)
  Run the firm's rule evaluator over the shuffled trades
  Record Pass / Fail
window_pass_rate = passes / 1000

The shuffle uses a deterministic RNG (ChaCha8) seeded by master_seed XOR firm_index XOR window_index. This means: same input + same seed = same shuffled iterations across firms and windows. Reproducibility is core to the audit-grade brand commitment.
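
The seed derivation and within-day shuffle look roughly like this in Rust. Assumptions: the rand 0.8-era API and the rand_chacha crate, and that the grouping of a window's trades into per-day Vecs happens upstream.

Seeded within-day shuffle (Rust, illustrative)
use rand::seq::SliceRandom;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

/// Per-(firm, window) RNG, derived exactly as described:
/// master_seed XOR firm_index XOR window_index.
fn window_rng(master_seed: u64, firm_index: u64, window_index: u64) -> ChaCha8Rng {
    ChaCha8Rng::seed_from_u64(master_seed ^ firm_index ^ window_index)
}

/// One shuffle pass for one MC iteration: reorder trades within each
/// day but keep the day-to-day sequence, preserving daily clustering.
fn shuffle_within_days<T>(days: &mut [Vec<T>], rng: &mut ChaCha8Rng) {
    for day in days.iter_mut() {
        day.shuffle(rng);
    }
}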

The rule evaluator checks every firm rule per trade: profit target, daily DD, total DD (static / trailing-EoD / trailing-intraday), min trading days, weekend holding, and consistency rules like FTMO 1-step's 50% Best Day rule. Each breach-style rule fires the moment it's violated; an iteration counts as "passed" only if no rule fires on any trade and the end-of-window targets (profit target, min trading days) are met.
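
A simplified stand-in for that evaluator, checking just two of the rules listed above. Assumptions flagged here and in the comments: drawdown is measured against starting balance (firms differ on balance-vs-equity basis), and the Trade type carries only a realized P&L.

Rule evaluator sketch (Rust, illustrative)
/// Minimal trade representation for this sketch; the real evaluator
/// sees full trade records.
struct Trade {
    pnl: f64, // realized P&L in account currency
}

struct FirmRules {
    daily_dd_limit: f64, // e.g. 0.05 for FTMO's 5% daily limit
    profit_target: f64,  // as a fraction of starting balance
}

enum Outcome {
    Pass,
    Fail(&'static str), // which rule fired
}

/// Breach-style rules fire mid-replay; target-style rules are checked
/// once the window has been replayed in full.
fn evaluate(days: &[Vec<Trade>], balance: f64, rules: &FirmRules) -> Outcome {
    let mut equity = balance;
    for day in days {
        let day_start = equity;
        for trade in day {
            equity += trade.pnl;
            // Daily DD measured against starting balance here (assumption).
            if (day_start - equity) / balance > rules.daily_dd_limit {
                return Outcome::Fail("daily drawdown");
            }
        }
    }
    if (equity - balance) / balance >= rules.profit_target {
        Outcome::Pass
    } else {
        Outcome::Fail("profit target not met")
    }
}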

Stage 3: Bootstrap confidence interval across windows

The per-window pass-rates form a sample. Bootstrap that sample to estimate uncertainty: resample N windows with replacement K times (default K=5,000), compute the mean of each resample, and take the 2.5th and 97.5th percentiles of the resampled means.

Bootstrap CI
For k in 1..5000:
  Sample N values from window_pass_rates with replacement
  means[k] = mean(sample)
sort(means)
CI_low = means[ceil(0.025 * 5000) - 1]
CI_high = means[ceil(0.975 * 5000) - 1]
point_estimate = mean(window_pass_rates)

We use nearest-rank R-1 percentile (ceil(q × N) - 1) for both the CI bounds and the auxiliary percentile_spread calls (P25/P50/P75 in pass-time). This is the standard convention; previous off-by-one variants biased CI-low slightly upward, which is the wrong direction for a lower-bound-based tool.
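
A sketch of the bootstrap with the nearest-rank percentile, again assuming the rand 0.8-era API; the engine's actual implementation may differ in structure, but the arithmetic is as described in the pseudocode above.

Bootstrap CI sketch (Rust, illustrative)
use rand::Rng;
use rand_chacha::ChaCha8Rng;

/// Nearest-rank (R type 1) percentile on sorted data: ceil(q * n) - 1.
fn nearest_rank(sorted: &[f64], q: f64) -> f64 {
    let idx = ((q * sorted.len() as f64).ceil() as usize).saturating_sub(1);
    sorted[idx]
}

/// K bootstrap resamples (with replacement) of the window pass-rates;
/// the CI is the 2.5th / 97.5th percentile of the resampled means.
fn bootstrap_ci(pass_rates: &[f64], k: usize, rng: &mut ChaCha8Rng) -> (f64, f64) {
    let n = pass_rates.len();
    let mut means: Vec<f64> = (0..k)
        .map(|_| {
            let sum: f64 = (0..n).map(|_| pass_rates[rng.gen_range(0..n)]).sum();
            sum / n as f64
        })
        .collect();
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    (nearest_rank(&means, 0.025), nearest_rank(&means, 0.975))
}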

Why bootstrap rather than a parametric CI? Three reasons:

  1. Window pass-rates are bounded on [0%, 100%] and often skewed or bimodal; a normal-approximation CI can spill past those bounds and misstate the tails.
  2. N is small (typically 12-48 windows), exactly where CLT-based approximations are least reliable.
  3. The bootstrap makes no distributional assumption and reuses the same machinery for any statistic: the mean here, the P25/P50/P75 pass-time spread elsewhere.

Stage 4: Sample-size guard

Below 12 walk-forward windows the bootstrap CI is too unstable to be useful — too few unique resamples to estimate the tails. The engine returns NoneViable in that case rather than a misleading high-but-uncertain estimate, with a warning that explains why.

This is a feature, not a bug. A 1-month backtest will produce NoneViable for every firm because there's not enough data. The honest answer is "we can't tell"; pretending otherwise would make Verdict a marketing tool.
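
The guard itself is a few lines. GuardResult and the warning text below are illustrative, not the engine's actual types.

Sample-size guard sketch (Rust, illustrative)
const MIN_WINDOWS: usize = 12;

enum GuardResult {
    Proceed,
    NoneViable { warning: String },
}

/// Refuse to bootstrap when there are too few walk-forward windows to
/// estimate the tails; an honest "we can't tell" beats a shaky CI.
fn sample_size_guard(n_windows: usize) -> GuardResult {
    if n_windows < MIN_WINDOWS {
        GuardResult::NoneViable {
            warning: format!(
                "only {n_windows} walk-forward windows; at least {MIN_WINDOWS} are needed for a stable bootstrap CI"
            ),
        }
    } else {
        GuardResult::Proceed
    }
}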

Primary-match selection: highest CI-LOW

Most "propfirm comparison" tools rank by point estimate. That's gameable (cherry-pick favorable backtests, claim 95% pass rate) and brittle (small sample sizes inflate point estimates).

Verdict ranks viable firms (≥12 windows) by their CI lower bound, not their point estimate. The backtest match strength tier is determined by that lower bound:

Backtest match strength tiers
CI-low ≥ 70% → Strong Backtest Match
CI-low ≥ 50% → Moderate Backtest Match
CI-low ≥ 30% → Weak Backtest Match
CI-low < 30% → NoneViable (no firm meets backtest threshold)

Concrete example: if Firm A has CI [40%, 95%] and Firm B has CI [55%, 90%], Verdict surfaces B as the primary backtest match even when A's point estimate comes out higher. What matters is that B's worst plausible backtest pass rate (55%) beats A's (40%): B's lower bound is more stable in the backtest data than A's.

This is the methodology principle in code form: sort(firms, key=ci_low, descending). Anyone who runs Verdict on the same input + same seed gets the same primary-match selection. No room for bias, no room for cherry-picking.
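
Selection reduces to a filter and a sort. FirmResult is a stand-in type; the thresholds mirror the tier table above.

Primary-match selection sketch (Rust, illustrative)
struct FirmResult {
    name: String,
    ci_low: f64, // lower bound of the 95% CI, 0.0..=1.0
    n_windows: usize,
}

/// Map a CI lower bound to its backtest match strength tier.
fn tier(ci_low: f64) -> &'static str {
    if ci_low >= 0.70 {
        "Strong Backtest Match"
    } else if ci_low >= 0.50 {
        "Moderate Backtest Match"
    } else if ci_low >= 0.30 {
        "Weak Backtest Match"
    } else {
        "NoneViable"
    }
}

/// Rank viable firms by CI lower bound, descending; the head of the
/// list is the primary backtest match.
fn primary_match(mut firms: Vec<FirmResult>) -> Option<(FirmResult, &'static str)> {
    firms.retain(|f| f.n_windows >= 12 && f.ci_low >= 0.30);
    firms.sort_by(|a, b| b.ci_low.partial_cmp(&a.ci_low).unwrap());
    firms.into_iter().next().map(|f| {
        let t = tier(f.ci_low);
        (f, t)
    })
}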

Failure attribution

Knowing your portfolio fails 60% of the time is useful. Knowing WHY it fails 60% of the time is actionable. For each firm, Verdict surfaces:

Top failure modes

Across every failed iteration in every failed window, what share of failures was driven by daily-DD breach, total-DD breach, profit-target-not-met, min-trading-days, consistency, weekend-holding, or news-trading? Verdict returns the top three, ranked by share. If daily-DD breach accounts for 80% of failures, the user knows the fix is reducing daily DD exposure (smaller lots, more diversification).
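
A sketch of the tally, assuming each failed iteration is represented by the name of the rule that ended it (the engine's actual failure records are richer).

Top failure modes sketch (Rust, illustrative)
use std::collections::HashMap;

/// Tally which rule ended each failed iteration and return the top
/// three failure modes with their share of all failures.
fn top_failure_modes(failures: &[&str]) -> Vec<(String, f64)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &rule in failures {
        *counts.entry(rule).or_insert(0) += 1;
    }
    let total = failures.len() as f64;
    let mut ranked: Vec<(String, f64)> = counts
        .into_iter()
        .map(|(rule, c)| (rule.to_string(), c as f64 / total))
        .collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked.truncate(3);
    ranked
}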

Worst-EA contributor

Across all failed iterations, which EA was the top negative contributor on the failure-causing day? Verdict uses absolute negative P&L contribution as the primary key (with top-loser-count as tiebreaker), so the answer matches the user's mental model of "which EA dropped the most dollars when things went wrong" rather than "which EA happened to be the largest loser by 1¢ in 51% of iterations".
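
A sketch of that aggregation, assuming each failed iteration exposes per-EA P&L for its failure-causing day; FailureDay is a stand-in type.

Worst-EA contributor sketch (Rust, illustrative)
use std::collections::HashMap;

/// One failed iteration's per-EA P&L on the failure-causing day
/// (illustrative representation).
struct FailureDay {
    ea_pnl: HashMap<String, f64>,
}

/// Aggregate absolute negative P&L per EA across failed iterations;
/// tiebreak by how often the EA was the day's single largest loser.
fn worst_ea(failures: &[FailureDay]) -> Option<String> {
    let mut neg_pnl: HashMap<String, f64> = HashMap::new();
    let mut top_loser: HashMap<String, usize> = HashMap::new();
    for day in failures {
        for (ea, pnl) in &day.ea_pnl {
            if *pnl < 0.0 {
                *neg_pnl.entry(ea.clone()).or_insert(0.0) += pnl.abs();
            }
        }
        // The day's single largest loser, counted for the tiebreaker.
        if let Some((ea, _)) = day.ea_pnl.iter().min_by(|a, b| a.1.partial_cmp(b.1).unwrap()) {
            *top_loser.entry(ea.clone()).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(String, f64)> = neg_pnl.into_iter().collect();
    ranked.sort_by(|a, b| {
        b.1.partial_cmp(&a.1).unwrap().then_with(|| {
            top_loser.get(&b.0).unwrap_or(&0).cmp(top_loser.get(&a.0).unwrap_or(&0))
        })
    });
    ranked.into_iter().next().map(|(ea, _)| ea)
}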

Rule mismatch detection

If a single firm-side rule (weekend holding, news trading) accounts for ≥70% of failures, Verdict surfaces it as "Rule mismatch · Weekend holding" instead of a misleading 0% CI-low. This separates "your portfolio is too risky for this firm" from "your portfolio is fine but this firm bans something you do". Concretely: an EA that holds positions over weekends can never pass The5ers Hyper Growth — that's a rule-mismatch, not a risk problem.
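
In sketch form, consuming the (rule, share) pairs from the failure-mode tally above; the is_firm_side predicate is a stand-in for the firm catalog's rule metadata.

Rule-mismatch detection sketch (Rust, illustrative)
/// If one firm-side rule drives >= 70% of failures, surface it as a
/// rule mismatch rather than a risk problem.
fn rule_mismatch(modes: &[(String, f64)], is_firm_side: impl Fn(&str) -> bool) -> Option<String> {
    modes
        .iter()
        .find(|(rule, share)| *share >= 0.70 && is_firm_side(rule))
        .map(|(rule, _)| format!("Rule mismatch · {rule}"))
}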

Strategy classification (path-dependent EA detection)

Verdict's Monte Carlo trade-shuffle is statistically calibrated for independent-signal EAs — strategies where each trade is driven by its own signal with a hard stop-loss, and the order of trades doesn't materially change the outcome distribution. Path-dependent strategies (grid, martingale, hedging, recovery) violate this exchangeability assumption: their P&L sequence is structurally coupled, so reshuffling produces an unrealistic resampled distribution.

As of session F (2026-05-02), Verdict classifies every uploaded portfolio into one of five categories: independent-signal, grid, martingale, hedging, or recovery.

Path-dependent portfolios are excluded from primary-match selection — Verdict still computes CI bounds against compatible firms, but flags them as "not used in ranking" with the classification reason surfaced. Calibration validation (session H phase 6) confirms 0% false-positive and 0% false-negative rates across 19 path-dependent and 12 independent-signal fixtures, with redundancy fixtures locking the boundary thresholds on both sides.

Pre-flight rule-mismatch (session H phase 3)

The Rust engine evaluates per-trade rule violations (DD, profit target, min trading days, news, weekend, consistency, leverage caps). It cannot see portfolio-level qualitative restrictions — patterns like "martingale prohibited" or "trades held under 60 seconds banned". Verdict bridges these via a TypeScript pre-flight check: before invoking the engine, the strategy classifier output and per-trade stats are compared against each firm's rule flags. If a portfolio violates a firm's explicit ban, that (portfolio, firm) pair is skipped before the engine runs.

Pre-flight currently catches: martingale ban, grid ban, same-account hedging ban (Goat-unique in V1), HFT ban (>200 trades/day), tick-scalping ban (≥10% trades held under 60 sec), and per-asset leverage caps (forex/metals/indices computed against each trade's notional USD value).
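
The production pre-flight is TypeScript; for consistency with the other examples, here is the same logic sketched in Rust for two of the checks. The stats fields and flag names are illustrative.

Pre-flight sketch (Rust, illustrative)
/// Per-portfolio stats computed before invoking the engine
/// (illustrative subset of the real pre-flight inputs).
struct PortfolioStats {
    max_trades_per_day: u32,
    share_held_under_60s: f64, // 0.0..=1.0
}

struct FirmFlags {
    bans_hft: bool,
    bans_tick_scalping: bool,
}

/// Returns the first ban the portfolio violates, or None if the
/// (portfolio, firm) pair may proceed to the MC engine.
fn preflight(stats: &PortfolioStats, firm: &FirmFlags) -> Option<&'static str> {
    if firm.bans_hft && stats.max_trades_per_day > 200 {
        return Some("HFT ban (>200 trades/day)");
    }
    if firm.bans_tick_scalping && stats.share_held_under_60s >= 0.10 {
        return Some("tick-scalping ban (>=10% trades held under 60s)");
    }
    None
}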

The engine ALSO enforces leverage caps per-trade for trades that pass pre-flight (defense-in-depth). Pre-flight is the fast path that surfaces "this firm bans your strategy class entirely" before paying for MC compute.

Calibration validation (session H phase 6)

Verdict's detector and pre-flight system are validated against synthetic fixtures with known ground truth. The current scoreboard: 0% false positives and 0% false negatives for the strategy classifier across the 19 path-dependent and 12 independent-signal fixtures described above.

Full calibration validation report: app/docs/passlab-calibration-validation-2026-05-02.md (640 lines, 9 sections covering fixture-by-fixture margins, detector × pre-flight overlap, accepted gaps, and refresh recommendations).

Verdict Free vs Verdict Solo

FXOptimize ships Verdict in two surfaces. The split is about scope, not engine quality — both surfaces share the identical Rust + WASM engine, identical firm catalog, identical Monte Carlo methodology, identical seed defaults. A parity-guard test (verdictFreeProParity.test.ts) asserts byte-identical reports for identical inputs across the two code paths so the surfaces don't drift over time.

The single-line tier rule: Free does one portfolio. Solo does your portfolio universe. Both honor the same calibration thresholds, news-rule semantics, and per-firm rule modeling.

Reproducibility (and why it matters)

Verdict is fully deterministic. Same input + same seed = bit-identical report. The default seed is 42; advanced users can override.

This matters for two reasons:

  1. Trust: if I claim my portfolio scored 71% CI-low at FTMO 2-step, anyone can re-run with the same backtests + seed 42 and verify. No hidden randomness, no sample-of-the-day cherry-picking.
  2. Auditability: changes to engine math leave a clear paper trail. The Verified Badge generator embeds the seed + issue date, and the public read-only badge view re-displays them, so visitors can independently reproduce the report.

Performance + scale

For a typical 5-EA, 2-year backtest against 8 firms with default settings (1,000 MC × 5,000 bootstrap), Verdict completes in ~13 seconds on M1 native, ~25-40 seconds in browser WASM. Mobile devices auto-detect via navigator.hardwareConcurrency; low-spec phones get a Fast Mode with 250 MC iterations + 2,000 bootstrap samples that completes in ~10-20 seconds. CIs widen modestly in Fast Mode but the primary-match selection on CI-low remains conservative.

Engine: Rust → WebAssembly. ~300 KB bundle. 95+ unit tests + 7 mandatory integration tests covering determinism, sample-size guard, CI sanity, failure attribution, primary-match selection logic, cross-firm consistency, and edge cases.

Validation: what we tested

Before public launch, we validated the engine on real user portfolios, and the findings from that validation shaped the V1 design.

What Verdict does NOT model

For audit-grade honesty, here's what V1 doesn't capture:

  1. Slippage and broker execution latency.
  2. Broker-specific symbol restrictions.
  3. News-window trade suspensions.

Important: Verdict tells you what the engine's statistical model predicts based on your historical backtests. Past performance does not guarantee future results. Real-world execution adds friction that V1 does not fully model. Use Verdict as one signal among many, including a manual reading of each firm's published trading objectives.

Open methodology, open feedback

The complete engine source is documented in the build prompt and reflected in the firm catalog with source URLs and verification dates. Everything you read above is what runs when you hit "Run Verdict" in your browser.

If you find a methodological concern (a rule we model wrong, a firm whose published rules drifted, a statistical critique), the contact is [email protected]. Substantive feedback shapes V2.

Run Verdict on your portfolio

Upload your MT4/MT5 backtests and see your 95% CI pass-probability across 8 prop firms. Free. Runs locally in your browser. No signup.

Open Verdict →