Pass Lab Methodology: How We Calculate Propfirm Pass Probability with 95% Confidence Intervals

May 1, 2026 · Methodology · 9 min read

By · Algorithmic Forex Trader · Founder, SteadyFlowFX

Last updated: 2026-05-01

Verdict is a research/analytics tool that answers a single question with audit-grade rigor: given an EA portfolio, which prop firm shows the highest backtest pass rate against historical simulations, and with what 95% confidence interval? This post documents how the engine computes that — every step is open, deterministic (same input + seed = bit-identical report), and statistically defensible. Verdict is not financial advice; it presents backtest results, not predictions of future trading outcomes.

The methodology principle up front: Verdict surfaces the firm with the highest CI lower bound — not the highest point estimate — as the primary backtest match. A firm with a 95% CI of [55%, 90%] is the primary match over one at [40%, 95%]: the former is more stable in the backtest data. This is what makes Verdict a research tool rather than a marketing tool.

Why standard pass-rate claims are gameable

If you've shopped propfirm calculators or read prop-firm-promoter content, you've seen claims like "95% pass rate at FTMO with our portfolio". Two problems with those numbers:

  1. Path-dependence is ignored. Backtests report MAX drawdown post-hoc, on the one trade ordering that actually happened. Propfirm rules check DAILY drawdown on whatever order trades land in on a live account. The same trades in a different intraday order can pass or fail daily-DD rules differently. A backtest with an 8% max-DD can still lose 6% within a single bad day, and FTMO's 5% daily limit fires.
  2. Sample size is invisible. A "95% pass rate" claim sourced from one walk-forward window means nothing. From 24 windows, it's a real signal. Without a confidence interval, you can't tell which it is.

Verdict fixes both with a four-stage methodology: walk-forward windows, per-window Monte Carlo trade-shuffling, bootstrap confidence intervals, and a sample-size guard.

Stage 1: Walk-forward windows

For each firm, slide a window matching the firm's evaluation period (e.g. 30 days for FTMO 2-step Phase 1) across the user's backtest with stride = window/2. This yields N overlapping windows.

Window generation
window_length = firm.evaluation_days
stride = window_length / 2
N = floor((backtest_span - window_length) / stride) + 1

Concrete example: a 1-year backtest against FTMO 2-step Phase 1 (30-day window, 15-day stride) yields ~24 windows. A 2-year backtest yields ~48. For "no time limit" challenges (FundedNext, FundingPips, FXIFY, Goat), we cap window length at 60 days — long enough for portfolios to compound, short enough that the bootstrap has enough windows.

Window inclusion rule: a trade counts in window W if its entry_time ≥ W.start and exit_time ≤ W.end. Trades that span window boundaries are excluded entirely. This is conservative — cutting trades at boundaries would require window-local mark-to-market with candle data, which complicates the per-window Monte Carlo without changing the conclusion.
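
To make the mechanics concrete, here is a minimal Rust sketch of Stage 1. It is illustrative only: the Trade and Window types and their fields are stand-ins, not the engine's actual data model, and timestamps are assumed to be Unix seconds.

Stage 1 sketch (Rust, illustrative)
#[derive(Clone)]
struct Trade {
    entry_time: i64, // Unix seconds (assumed representation)
    exit_time: i64,
}

struct Window {
    start: i64,
    end: i64,
}

/// Slide a window of `window_days` across `span_days` of backtest with
/// stride = window / 2, matching the pseudocode above.
fn build_windows(backtest_start: i64, span_days: i64, window_days: i64) -> Vec<Window> {
    const DAY: i64 = 86_400;
    let window_len = window_days * DAY;
    let stride = window_len / 2;
    let span = span_days * DAY;
    let n = (span - window_len) / stride + 1; // floor division
    (0..n)
        .map(|i| {
            let start = backtest_start + i * stride;
            Window { start, end: start + window_len }
        })
        .collect()
}

/// Conservative inclusion rule: a trade belongs to a window only if it
/// opens AND closes inside it; boundary-spanning trades are dropped.
fn trades_in_window<'a>(trades: &'a [Trade], w: &Window) -> Vec<&'a Trade> {
    trades
        .iter()
        .filter(|t| t.entry_time >= w.start && t.exit_time <= w.end)
        .collect()
}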

Stage 2: Per-window Monte Carlo trade-shuffling

For each window, replay the trades 1,000 times with within-day trade order shuffled. Within-day shuffling preserves daily clustering (so daily DD remains realistic) but captures path-dependence: the same trades in different intraday order can pass or fail daily-DD rules differently.

Per-window MC iteration
For i in 1..1000:
  For each trading day in the window:
    Shuffle the day's trade order (seeded RNG)
  Run the firm's rule evaluator over the shuffled trades
  Record Pass / Fail
window_pass_rate = passes / 1000

The shuffle uses a deterministic RNG (ChaCha8) seeded by master_seed XOR firm_index XOR window_index. This means: same input + same seed = same shuffled iterations across firms and windows. Reproducibility is core to the audit-grade brand commitment.
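
The seed derivation and within-day shuffle look roughly like this in Rust. Assumptions: the rand 0.8-era API and the rand_chacha crate, and that the grouping of a window's trades into per-day Vecs happens upstream.

Seeded within-day shuffle (Rust, illustrative)
use rand::seq::SliceRandom;
use rand::SeedableRng;
use rand_chacha::ChaCha8Rng;

/// Per-(firm, window) RNG, derived exactly as described:
/// master_seed XOR firm_index XOR window_index.
fn window_rng(master_seed: u64, firm_index: u64, window_index: u64) -> ChaCha8Rng {
    ChaCha8Rng::seed_from_u64(master_seed ^ firm_index ^ window_index)
}

/// One shuffle pass for one MC iteration: reorder trades within each
/// day but keep the day-to-day sequence, preserving daily clustering.
fn shuffle_within_days<T>(days: &mut [Vec<T>], rng: &mut ChaCha8Rng) {
    for day in days.iter_mut() {
        day.shuffle(rng);
    }
}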

The rule evaluator checks every firm rule per trade: profit target, daily DD, total DD (static / trailing-EoD / trailing-intraday), min trading days, weekend holding, and consistency rules like FTMO 1-step's 50% Best Day rule. Each breach-style rule fires the moment it's violated; an iteration counts as "passed" only if no rule fires on any trade and the end-of-window targets (profit target, min trading days) are met.
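
A simplified stand-in for that evaluator, checking just two of the rules listed above. Assumptions flagged here and in the comments: drawdown is measured against starting balance (firms differ on balance-vs-equity basis), and the Trade type carries only a realized P&L.

Rule evaluator sketch (Rust, illustrative)
/// Minimal trade representation for this sketch; the real evaluator
/// sees full trade records.
struct Trade {
    pnl: f64, // realized P&L in account currency
}

struct FirmRules {
    daily_dd_limit: f64, // e.g. 0.05 for FTMO's 5% daily limit
    profit_target: f64,  // as a fraction of starting balance
}

enum Outcome {
    Pass,
    Fail(&'static str), // which rule fired
}

/// Breach-style rules fire mid-replay; target-style rules are checked
/// once the window has been replayed in full.
fn evaluate(days: &[Vec<Trade>], balance: f64, rules: &FirmRules) -> Outcome {
    let mut equity = balance;
    for day in days {
        let day_start = equity;
        for trade in day {
            equity += trade.pnl;
            // Daily DD measured against starting balance here (assumption).
            if (day_start - equity) / balance > rules.daily_dd_limit {
                return Outcome::Fail("daily drawdown");
            }
        }
    }
    if (equity - balance) / balance >= rules.profit_target {
        Outcome::Pass
    } else {
        Outcome::Fail("profit target not met")
    }
}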

Stage 3: Bootstrap confidence interval across windows

The per-window pass-rates form a sample. Bootstrap that sample to estimate uncertainty: resample N windows with replacement K times (default K=5,000), compute the mean of each resample, and take the 2.5th and 97.5th percentiles of the resampled means.

Bootstrap CI
For k in 1..5000:
  Sample N values from window_pass_rates with replacement
  means[k] = mean(sample)
sort(means)
CI_low = means[ceil(0.025 * 5000) - 1]
CI_high = means[ceil(0.975 * 5000) - 1]
point_estimate = mean(window_pass_rates)

We use nearest-rank R-1 percentile (ceil(q × N) - 1) for both the CI bounds and the auxiliary percentile_spread calls (P25/P50/P75 in pass-time). This is the standard convention; previous off-by-one variants biased CI-low slightly upward, which is the wrong direction for a lower-bound-based tool.
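
A sketch of the bootstrap with the nearest-rank percentile, again assuming the rand 0.8-era API; the engine's actual implementation may differ in structure, but the arithmetic is as described in the pseudocode above.

Bootstrap CI sketch (Rust, illustrative)
use rand::Rng;
use rand_chacha::ChaCha8Rng;

/// Nearest-rank (R type 1) percentile on sorted data: ceil(q * n) - 1.
fn nearest_rank(sorted: &[f64], q: f64) -> f64 {
    let idx = ((q * sorted.len() as f64).ceil() as usize).saturating_sub(1);
    sorted[idx]
}

/// K bootstrap resamples (with replacement) of the window pass-rates;
/// the CI is the 2.5th / 97.5th percentile of the resampled means.
fn bootstrap_ci(pass_rates: &[f64], k: usize, rng: &mut ChaCha8Rng) -> (f64, f64) {
    let n = pass_rates.len();
    let mut means: Vec<f64> = (0..k)
        .map(|_| {
            let sum: f64 = (0..n).map(|_| pass_rates[rng.gen_range(0..n)]).sum();
            sum / n as f64
        })
        .collect();
    means.sort_by(|a, b| a.partial_cmp(b).unwrap());
    (nearest_rank(&means, 0.025), nearest_rank(&means, 0.975))
}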

Why bootstrap rather than a parametric CI? Three reasons:

  1. Window pass-rates are bounded on [0%, 100%] and often skewed or bimodal; a normal-approximation CI can spill past those bounds and misstate the tails.
  2. N is small (typically 12-48 windows), exactly where CLT-based approximations are least reliable.
  3. The bootstrap makes no distributional assumption and reuses the same machinery for any statistic: the mean here, the P25/P50/P75 pass-time spread elsewhere.

Stage 4: Sample-size guard

Below 12 walk-forward windows the bootstrap CI is too unstable to be useful — too few unique resamples to estimate the tails. The engine returns NoneViable in that case rather than a misleading high-but-uncertain estimate, with a warning that explains why.

This is a feature, not a bug. A 1-month backtest will produce NoneViable for every firm because there's not enough data. The honest answer is "we can't tell"; pretending otherwise would make Verdict a marketing tool.
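
The guard itself is a few lines. GuardResult and the warning text below are illustrative, not the engine's actual types.

Sample-size guard sketch (Rust, illustrative)
const MIN_WINDOWS: usize = 12;

enum GuardResult {
    Proceed,
    NoneViable { warning: String },
}

/// Refuse to bootstrap when there are too few walk-forward windows to
/// estimate the tails; an honest "we can't tell" beats a shaky CI.
fn sample_size_guard(n_windows: usize) -> GuardResult {
    if n_windows < MIN_WINDOWS {
        GuardResult::NoneViable {
            warning: format!(
                "only {n_windows} walk-forward windows; at least {MIN_WINDOWS} are needed for a stable bootstrap CI"
            ),
        }
    } else {
        GuardResult::Proceed
    }
}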

Primary-match selection: highest CI-LOW

Most "propfirm comparison" tools rank by point estimate. That's gameable (cherry-pick favorable backtests, claim 95% pass rate) and brittle (small sample sizes inflate point estimates).

Verdict ranks viable firms (≥12 windows) by their CI lower bound, not their point estimate. The backtest match strength tier is determined by that lower bound:

Backtest match strength tiers
CI-low ≥ 70% → Strong Backtest Match
CI-low ≥ 50% → Moderate Backtest Match
CI-low ≥ 30% → Weak Backtest Match
CI-low < 30% → NoneViable (no firm meets backtest threshold)

Concrete example: if Firm A has CI [40%, 95%] and Firm B has CI [55%, 90%], Verdict surfaces B as the primary backtest match even when A's point estimate comes out higher. What matters is that B's worst plausible backtest pass rate (55%) beats A's (40%): B's lower bound is more stable in the backtest data than A's.

This is the methodology principle in code form: sort(firms, key=ci_low, descending). Anyone who runs Verdict on the same input + same seed gets the same primary-match selection. No room for bias, no room for cherry-picking.
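
Selection reduces to a filter and a sort. FirmResult is a stand-in type; the thresholds mirror the tier table above.

Primary-match selection sketch (Rust, illustrative)
struct FirmResult {
    name: String,
    ci_low: f64, // lower bound of the 95% CI, 0.0..=1.0
    n_windows: usize,
}

/// Map a CI lower bound to its backtest match strength tier.
fn tier(ci_low: f64) -> &'static str {
    if ci_low >= 0.70 {
        "Strong Backtest Match"
    } else if ci_low >= 0.50 {
        "Moderate Backtest Match"
    } else if ci_low >= 0.30 {
        "Weak Backtest Match"
    } else {
        "NoneViable"
    }
}

/// Rank viable firms by CI lower bound, descending; the head of the
/// list is the primary backtest match.
fn primary_match(mut firms: Vec<FirmResult>) -> Option<(FirmResult, &'static str)> {
    firms.retain(|f| f.n_windows >= 12 && f.ci_low >= 0.30);
    firms.sort_by(|a, b| b.ci_low.partial_cmp(&a.ci_low).unwrap());
    firms.into_iter().next().map(|f| {
        let t = tier(f.ci_low);
        (f, t)
    })
}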

Failure attribution

Knowing your portfolio fails 60% of the time is useful. Knowing WHY it fails 60% of the time is actionable. For each firm, Verdict surfaces:

Top failure modes

Across every failed iteration in every failed window, what share of failures was driven by daily-DD breach, total-DD breach, profit-target-not-met, min-trading-days, consistency, weekend-holding, or news-trading? Verdict returns the top three, ranked by share. If daily-DD breach accounts for 80% of failures, the user knows the fix is reducing daily DD exposure (smaller lots, more diversification).
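
A sketch of the tally, assuming each failed iteration is represented by the name of the rule that ended it (the engine's actual failure records are richer).

Top failure modes sketch (Rust, illustrative)
use std::collections::HashMap;

/// Tally which rule ended each failed iteration and return the top
/// three failure modes with their share of all failures.
fn top_failure_modes(failures: &[&str]) -> Vec<(String, f64)> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &rule in failures {
        *counts.entry(rule).or_insert(0) += 1;
    }
    let total = failures.len() as f64;
    let mut ranked: Vec<(String, f64)> = counts
        .into_iter()
        .map(|(rule, c)| (rule.to_string(), c as f64 / total))
        .collect();
    ranked.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    ranked.truncate(3);
    ranked
}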

Worst-EA contributor

Across all failed iterations, which EA was the top negative contributor on the failure-causing day? Verdict uses absolute negative P&L contribution as the primary key (with top-loser-count as tiebreaker), so the answer matches the user's mental model of "which EA dropped the most dollars when things went wrong" rather than "which EA happened to be the largest loser by 1¢ in 51% of iterations".
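
A sketch of that aggregation, assuming each failed iteration exposes per-EA P&L for its failure-causing day; FailureDay is a stand-in type.

Worst-EA contributor sketch (Rust, illustrative)
use std::collections::HashMap;

/// One failed iteration's per-EA P&L on the failure-causing day
/// (illustrative representation).
struct FailureDay {
    ea_pnl: HashMap<String, f64>,
}

/// Aggregate absolute negative P&L per EA across failed iterations;
/// tiebreak by how often the EA was the day's single largest loser.
fn worst_ea(failures: &[FailureDay]) -> Option<String> {
    let mut neg_pnl: HashMap<String, f64> = HashMap::new();
    let mut top_loser: HashMap<String, usize> = HashMap::new();
    for day in failures {
        for (ea, pnl) in &day.ea_pnl {
            if *pnl < 0.0 {
                *neg_pnl.entry(ea.clone()).or_insert(0.0) += pnl.abs();
            }
        }
        // The day's single largest loser, counted for the tiebreaker.
        if let Some((ea, _)) = day.ea_pnl.iter().min_by(|a, b| a.1.partial_cmp(b.1).unwrap()) {
            *top_loser.entry(ea.clone()).or_insert(0) += 1;
        }
    }
    let mut ranked: Vec<(String, f64)> = neg_pnl.into_iter().collect();
    ranked.sort_by(|a, b| {
        b.1.partial_cmp(&a.1).unwrap().then_with(|| {
            top_loser.get(&b.0).unwrap_or(&0).cmp(top_loser.get(&a.0).unwrap_or(&0))
        })
    });
    ranked.into_iter().next().map(|(ea, _)| ea)
}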

Rule mismatch detection

If a single firm-side rule (weekend holding, news trading) accounts for ≥70% of failures, Verdict surfaces it as "Rule mismatch · Weekend holding" instead of a misleading 0% CI-low. This separates "your portfolio is too risky for this firm" from "your portfolio is fine but this firm bans something you do". Concretely: an EA that holds positions over weekends can never pass The5ers Hyper Growth — that's a rule-mismatch, not a risk problem.
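
In sketch form, consuming the (rule, share) pairs from the failure-mode tally above; the is_firm_side predicate is a stand-in for the firm catalog's rule metadata.

Rule-mismatch detection sketch (Rust, illustrative)
/// If one firm-side rule drives >= 70% of failures, surface it as a
/// rule mismatch rather than a risk problem.
fn rule_mismatch(modes: &[(String, f64)], is_firm_side: impl Fn(&str) -> bool) -> Option<String> {
    modes
        .iter()
        .find(|(rule, share)| *share >= 0.70 && is_firm_side(rule))
        .map(|(rule, _)| format!("Rule mismatch · {rule}"))
}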

Strategy classification (path-dependent EA detection)

Verdict's Monte Carlo trade-shuffle is statistically calibrated for independent-signal EAs — strategies where each trade is driven by its own signal with a hard stop-loss, and the order of trades doesn't materially change the outcome distribution. Path-dependent strategies (grid, martingale, hedging, recovery) violate this exchangeability assumption: their P&L sequence is structurally coupled, so reshuffling produces an unrealistic resampled distribution.

As of session F (2026-05-02), Verdict classifies every uploaded portfolio into one of five categories: independent-signal, grid, martingale, hedging, or recovery.

Path-dependent portfolios are excluded from primary-match selection — Verdict still computes CI bounds against compatible firms, but flags them as "not used in ranking" with the classification reason surfaced. Calibration validation (session H phase 6) confirms 0% false-positive and 0% false-negative rates across 19 path-dependent and 12 independent-signal fixtures, with redundancy fixtures locking the boundary thresholds on both sides.

Pre-flight rule-mismatch (session H phase 3)

The Rust engine evaluates per-trade rule violations (DD, profit target, min trading days, news, weekend, consistency, leverage caps). It cannot see portfolio-level qualitative restrictions — patterns like "martingale prohibited" or "trades held under 60 seconds banned". Verdict bridges these via a TypeScript pre-flight check: before invoking the engine, the strategy classifier output and per-trade stats are compared against each firm's rule flags. If a portfolio violates a firm's explicit ban, that (portfolio, firm) pair is skipped before the engine runs.

Pre-flight currently catches: martingale ban, grid ban, same-account hedging ban (Goat-unique in V1), HFT ban (>200 trades/day), tick-scalping ban (≥10% trades held under 60 sec), and per-asset leverage caps (forex/metals/indices computed against each trade's notional USD value).
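
The production pre-flight is TypeScript; for consistency with the other examples, here is the same logic sketched in Rust for two of the checks. The stats fields and flag names are illustrative.

Pre-flight sketch (Rust, illustrative)
/// Per-portfolio stats computed before invoking the engine
/// (illustrative subset of the real pre-flight inputs).
struct PortfolioStats {
    max_trades_per_day: u32,
    share_held_under_60s: f64, // 0.0..=1.0
}

struct FirmFlags {
    bans_hft: bool,
    bans_tick_scalping: bool,
}

/// Returns the first ban the portfolio violates, or None if the
/// (portfolio, firm) pair may proceed to the MC engine.
fn preflight(stats: &PortfolioStats, firm: &FirmFlags) -> Option<&'static str> {
    if firm.bans_hft && stats.max_trades_per_day > 200 {
        return Some("HFT ban (>200 trades/day)");
    }
    if firm.bans_tick_scalping && stats.share_held_under_60s >= 0.10 {
        return Some("tick-scalping ban (>=10% trades held under 60s)");
    }
    None
}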

The engine ALSO enforces leverage caps per-trade for trades that pass pre-flight (defense-in-depth). Pre-flight is the fast path that surfaces "this firm bans your strategy class entirely" before paying for MC compute.

Calibration validation (session H phase 6)

Verdict's detector and pre-flight system are validated against synthetic fixtures with known ground truth. The current scoreboard: 0% false positives and 0% false negatives for the strategy classifier across the 19 path-dependent and 12 independent-signal fixtures described above.

Full calibration validation report: app/docs/passlab-calibration-validation-2026-05-02.md (640 lines, 9 sections covering fixture-by-fixture margins, detector × pre-flight overlap, accepted gaps, and refresh recommendations).

Verdict Free vs Verdict Solo

FXOptimize ships Verdict in two surfaces. The split is about scope, not engine quality — both surfaces share the identical Rust + WASM engine, identical firm catalog, identical Monte Carlo methodology, identical seed defaults. A parity-guard test (verdictFreeProParity.test.ts) asserts byte-identical reports for identical inputs across the two code paths so the surfaces don't drift over time.

The single-line tier rule: Free does one portfolio. Solo does your portfolio universe. Both honor the same calibration thresholds, news-rule semantics, and per-firm rule modeling.

Reproducibility (and why it matters)

Verdict is fully deterministic. Same input + same seed = bit-identical report. The default seed is 42; advanced users can override.

This matters for two reasons:

  1. Trust: if I claim my portfolio scored 71% CI-low at FTMO 2-step, anyone can re-run with the same backtests + seed 42 and verify. No hidden randomness, no sample-of-the-day cherry-picking.
  2. Auditability: changes to engine math leave a clear paper trail. The Verified Badge generator embeds the seed + issue date, and the public read-only badge view re-displays them, so visitors can independently reproduce the report.

Performance + scale

For a typical 5-EA, 2-year backtest against 8 firms with default settings (1,000 MC × 5,000 bootstrap), Verdict completes in ~13 seconds on M1 native, ~25-40 seconds in browser WASM. Mobile devices auto-detect via navigator.hardwareConcurrency; low-spec phones get a Fast Mode with 250 MC iterations + 2,000 bootstrap samples that completes in ~10-20 seconds. CIs widen modestly in Fast Mode but the primary-match selection on CI-low remains conservative.

Engine: Rust → WebAssembly. ~300 KB bundle. 95+ unit tests + 7 mandatory integration tests covering determinism, sample-size guard, CI sanity, failure attribution, primary-match selection logic, cross-firm consistency, and edge cases.

Validation: what we tested

Before public launch, we validated the engine on real user portfolios, and the findings from that validation shaped the V1 design.

What Verdict does NOT model

For audit-grade honesty, here's what V1 doesn't capture:

  1. Slippage and broker execution latency.
  2. Broker-specific symbol restrictions.
  3. News-window trade suspensions.

Important: Verdict tells you what the engine's statistical model predicts based on your historical backtests. Past performance does not guarantee future results. Real-world execution adds friction that V1 does not fully model. Use Verdict as one signal among many, including a manual reading of each firm's published trading objectives.

Open methodology, open feedback

The complete engine source is documented in the build prompt and reflected in the firm catalog with source URLs and verification dates. Everything you read above is what runs when you hit "Run Verdict" in your browser.

If you find a methodological concern (a rule we model wrong, a firm whose published rules drifted, a statistical critique), the contact is [email protected]. Substantive feedback shapes V2.

Run Verdict on your portfolio

Upload your MT4/MT5 backtests and see your 95% CI pass-probability across 8 prop firms. Free. Runs locally in your browser. No signup.

Open Verdict →