Verdict is a research/analytics tool that answers a single question with audit-grade rigor: given an EA portfolio, which prop firm shows the highest backtest pass rate against historical simulations, and with what 95% confidence interval? This post documents how the engine computes that — every step is open, deterministic (same input + seed = bit-identical report), and statistically defensible. Verdict is not financial advice; it presents backtest results, not predictions of future trading outcomes.
The methodology principle up front: Verdict surfaces the firm with the highest CI lower bound — not the highest point estimate — as the primary backtest match. A firm with a 95% CI of [55%, 90%] is the primary match over one at [40%, 95%]: the former is more stable in the backtest data. This is what makes Verdict a research tool rather than a marketing tool.
If you've shopped propfirm calculators or read prop-firm-promoter content, you've seen claims like "95% pass rate at FTMO with our portfolio". Two problems with those numbers: they are bare point estimates with no uncertainty attached, and they come from a single historical trade sequence, so cherry-picked periods and small samples inflate them.
Verdict fixes both with a four-stage methodology: walk-forward windows, per-window Monte Carlo trade-shuffling, bootstrap confidence intervals, and a sample-size guard.
For each firm, Verdict slides a window matching the firm's evaluation period (e.g. 30 days for FTMO 2-step Phase 1) across the user's backtest, with stride = window/2. This yields N overlapping windows.
Concrete example: a 1-year backtest against FTMO 2-step Phase 1 (30-day window, 15-day stride) yields ~24 windows. A 2-year backtest yields ~48. For "no time limit" challenges (FundedNext, FundingPips, FXIFY, Goat), we cap window length at 60 days — long enough for portfolios to compound, short enough that the bootstrap has enough windows.
Window inclusion rule: a trade counts in window W if its entry_time ≥ W.start and exit_time ≤ W.end. Trades that span window boundaries are excluded entirely. This is conservative — cutting trades at boundaries would require window-local mark-to-market with candle data, which complicates the per-window Monte Carlo without changing the conclusion.
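The slicing and the inclusion rule can be sketched like this (TypeScript for illustration; the actual engine is Rust, and `Trade`, `Window`, and `makeWindows` are illustrative names, not the engine's API):

```typescript
// Illustrative sketch of walk-forward window slicing. Times are epoch ms.
const DAY = 86_400_000;

interface Trade {
  entryTime: number;
  exitTime: number;
  pnl: number;
}

interface Window {
  start: number;
  end: number;
  trades: Trade[];
}

// Slide a window of `windowDays` across [backtestStart, backtestEnd] with
// stride = window/2. A trade belongs to a window only if it opens AND
// closes inside it; boundary-spanning trades are dropped entirely.
function makeWindows(
  trades: Trade[],
  backtestStart: number,
  backtestEnd: number,
  windowDays: number,
): Window[] {
  const len = windowDays * DAY;
  const stride = len / 2;
  const windows: Window[] = [];
  for (let start = backtestStart; start + len <= backtestEnd; start += stride) {
    const end = start + len;
    windows.push({
      start,
      end,
      trades: trades.filter(t => t.entryTime >= start && t.exitTime <= end),
    });
  }
  return windows;
}
```

A 365-day span with a 30-day window and 15-day stride produces 23 windows under this sketch, consistent with the "~24 windows" figure above.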
For each window, replay the trades 1,000 times with within-day trade order shuffled. Within-day shuffling preserves daily clustering (so daily DD remains realistic) but captures path-dependence: the same trades in different intraday order can pass or fail daily-DD rules differently.
The shuffle uses a deterministic RNG (ChaCha8) seeded by master_seed XOR firm_index XOR window_index. This means: same input + same seed = same shuffled iterations across firms and windows. Reproducibility is core to the audit-grade brand commitment.
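The seed derivation and the within-day shuffle can be sketched as follows. The XOR scheme matches the description above; `mulberry32` is a small deterministic stand-in for the engine's ChaCha8, used here only so the sketch is self-contained:

```typescript
// Deterministic stand-in RNG (the engine uses ChaCha8, not mulberry32).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Shuffle trade order within each day, preserving the order of days, so
// daily clustering (and therefore daily-DD realism) survives the shuffle.
function shuffleWithinDays<T extends { day: number }>(
  trades: T[],
  masterSeed: number,
  firmIndex: number,
  windowIndex: number,
): T[] {
  const rng = mulberry32(masterSeed ^ firmIndex ^ windowIndex);
  const byDay = new Map<number, T[]>(); // Map preserves insertion (day) order
  for (const t of trades) {
    if (!byDay.has(t.day)) byDay.set(t.day, []);
    byDay.get(t.day)!.push(t);
  }
  const out: T[] = [];
  for (const group of byDay.values()) {
    for (let i = group.length - 1; i > 0; i--) { // Fisher-Yates within the day
      const j = Math.floor(rng() * (i + 1));
      [group[i], group[j]] = [group[j], group[i]];
    }
    out.push(...group);
  }
  return out;
}
```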
The rule evaluator checks every firm rule per trade: profit target, daily DD, total DD (static / trailing-EoD / trailing-intraday), min trading days, weekend holding, and consistency rules like FTMO 1-step's 50% Best Day rule. Each rule fires the moment it's breached — the iteration's outcome is "passed" only if all rules survive every trade.
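A minimal sketch of that fail-fast loop, again in TypeScript for illustration. Only three of the rules are modeled here (daily DD, static total DD, profit target); `SimTrade`, `Rules`, and the thresholds are illustrative, not the engine's types:

```typescript
interface SimTrade { day: number; pnl: number }

interface Rules {
  profitTargetPct: number; // e.g. 10 for +10%
  dailyDdPct: number;      // e.g. 5
  totalDdPct: number;      // e.g. 10, static from initial balance
}

// Replay one shuffled iteration; each rule fires the moment it is breached.
function evaluateIteration(
  trades: SimTrade[],
  initialBalance: number,
  rules: Rules,
): "passed" | "failed" {
  let equity = initialBalance;
  let dayPnl = 0;
  let currentDay = trades.length ? trades[0].day : 0;
  for (const t of trades) {
    if (t.day !== currentDay) { // new trading day resets the daily P&L
      currentDay = t.day;
      dayPnl = 0;
    }
    equity += t.pnl;
    dayPnl += t.pnl;
    if (-dayPnl > (initialBalance * rules.dailyDdPct) / 100) return "failed";
    if (initialBalance - equity > (initialBalance * rules.totalDdPct) / 100) return "failed";
  }
  const hitTarget =
    equity - initialBalance >= (initialBalance * rules.profitTargetPct) / 100;
  return hitTarget ? "passed" : "failed";
}
```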
The per-window pass-rates form a sample. Bootstrap that sample to estimate uncertainty: resample N windows with replacement K times (default K=5,000), compute the mean of each resample, and take the 2.5th and 97.5th percentiles of the resampled means.
We use nearest-rank R-1 percentile (ceil(q × N) - 1) for both the CI bounds and the auxiliary percentile_spread calls (P25/P50/P75 in pass-time). This is the standard convention; previous off-by-one variants biased CI-low slightly upward, which is the wrong direction for a lower-bound-based tool.
Why bootstrap rather than a parametric CI? Three reasons: window pass-rates are bounded in [0, 1] and often heavily skewed, so a normal-approximation interval can spill past those bounds; the window sample is small enough that central-limit arguments are shaky; and the bootstrap imposes no parametric form on the pass-rate distribution at all.
Below 12 walk-forward windows the bootstrap CI is too unstable to be useful — too few unique resamples to estimate the tails. The engine returns NoneViable in that case rather than a misleading high-but-uncertain estimate, with a warning that explains why.
This is a feature, not a bug. A 1-month backtest will produce NoneViable for every firm because there's not enough data. The honest answer is "we can't tell"; pretending otherwise would make Verdict a marketing tool.
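The bootstrap, the nearest-rank percentile, and the 12-window guard fit in a short sketch. `rng` is any deterministic [0, 1) generator (the engine uses ChaCha8); names are illustrative:

```typescript
const MIN_WINDOWS = 12;

// Nearest-rank (R type-1) percentile: index = ceil(q * N) - 1, clamped.
function nearestRank(sorted: number[], q: number): number {
  const i = Math.max(0, Math.ceil(q * sorted.length) - 1);
  return sorted[Math.min(i, sorted.length - 1)];
}

function bootstrapCi(
  windowPassRates: number[],
  rng: () => number,
  k = 5000,
): { ciLow: number; ciHigh: number } | "NoneViable" {
  const n = windowPassRates.length;
  if (n < MIN_WINDOWS) return "NoneViable"; // sample-size guard
  const means: number[] = [];
  for (let b = 0; b < k; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) {
      sum += windowPassRates[Math.floor(rng() * n)]; // resample w/ replacement
    }
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  return { ciLow: nearestRank(means, 0.025), ciHigh: nearestRank(means, 0.975) };
}
```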
Most "propfirm comparison" tools rank by point estimate. That's gameable (cherry-pick favorable backtests, claim 95% pass rate) and brittle (small sample sizes inflate point estimates).
Verdict ranks viable firms (≥12 windows) by their CI lower bound, not their point estimate, and the backtest match strength tier is determined by that same lower bound.
Concrete example: if Firm A has CI [40%, 95%] and Firm B has CI [55%, 90%], Verdict surfaces B as the primary backtest match even though A's upper bound (95%) is higher than B's (90%). B's lower bound (55%) beats A's (40%): B's worst plausible pass-rate is better supported by the backtest data.
This is the methodology principle in code form: sort(firms, key=ci_low, descending). Anyone who runs Verdict on the same input + same seed gets the same primary-match selection. No room for bias, no room for cherry-picking.
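That one-liner expands to roughly the following sketch (`FirmResult` and `primaryMatch` are illustrative names):

```typescript
interface FirmResult {
  name: string;
  ciLow: number;   // e.g. 0.55 for 55%
  ciHigh: number;
  viable: boolean; // >= 12 walk-forward windows
}

// Rank viable firms by CI lower bound, descending; the top entry is the
// primary backtest match. No point estimates are consulted.
function primaryMatch(firms: FirmResult[]): FirmResult | "NoneViable" {
  const ranked = firms
    .filter(f => f.viable)
    .sort((a, b) => b.ciLow - a.ciLow);
  return ranked.length ? ranked[0] : "NoneViable";
}
```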
Knowing your portfolio fails 60% of the time is useful. Knowing WHY it fails 60% of the time is actionable. For each firm, Verdict surfaces:
Across all failed iterations × failed windows, what share of failures was driven by daily-DD breach, total-DD breach, profit-target-not-met, min-trading-days, consistency, weekend-holding, or news-trading? Verdict returns the top 3, ranked by share. If "daily DD breach" is 80% of failures, the user knows the fix is reducing daily DD exposure (smaller lots, more diversification).
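The tally itself is simple; a sketch, with illustrative names:

```typescript
type FailureCause =
  | "daily_dd" | "total_dd" | "profit_target" | "min_trading_days"
  | "consistency" | "weekend_holding" | "news_trading";

// Count each failed iteration's breach reason and return the top N by share.
function topFailureCauses(
  failures: FailureCause[],
  topN = 3,
): { cause: FailureCause; share: number }[] {
  const counts = new Map<FailureCause, number>();
  for (const c of failures) counts.set(c, (counts.get(c) ?? 0) + 1);
  return [...counts.entries()]
    .map(([cause, n]) => ({ cause, share: n / failures.length }))
    .sort((a, b) => b.share - a.share)
    .slice(0, topN);
}
```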
Across all failed iterations, which EA was the top negative contributor on the failure-causing day? Verdict uses absolute negative P&L contribution as the primary key (with top-loser-count as tiebreaker), so the answer matches the user's mental model of "which EA dropped the most dollars when things went wrong" rather than "which EA happened to be the largest loser by 1¢ in 51% of iterations".
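A sketch of that attribution rule, with the primary key (total absolute negative contribution) and the tiebreaker (top-loser count) made explicit; shapes and names are illustrative:

```typescript
// One inner array per failed iteration: each EA's negative P&L on the
// failure-causing day of that iteration (negativePnl <= 0).
interface FailureDayLoss { ea: string; negativePnl: number }

function worstEa(failureDayLosses: FailureDayLoss[][]): string {
  const totalLoss = new Map<string, number>();
  const topLoserCount = new Map<string, number>();
  for (const iteration of failureDayLosses) {
    let worst: FailureDayLoss | null = null;
    for (const l of iteration) {
      totalLoss.set(l.ea, (totalLoss.get(l.ea) ?? 0) + Math.abs(l.negativePnl));
      if (!worst || Math.abs(l.negativePnl) > Math.abs(worst.negativePnl)) worst = l;
    }
    if (worst) topLoserCount.set(worst.ea, (topLoserCount.get(worst.ea) ?? 0) + 1);
  }
  // Primary key: dollars lost. Tiebreaker: how often it was the top loser.
  return [...totalLoss.keys()].sort((a, b) =>
    (totalLoss.get(b)! - totalLoss.get(a)!) ||
    ((topLoserCount.get(b) ?? 0) - (topLoserCount.get(a) ?? 0)),
  )[0];
}
```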
If a single firm-side rule (weekend holding, news trading) accounts for ≥70% of failures, Verdict surfaces it as "Rule mismatch · Weekend holding" instead of a misleading 0% CI-low. This separates "your portfolio is too risky for this firm" from "your portfolio is fine but this firm bans something you do". Concretely: an EA that holds positions over weekends can never pass The5ers Hyper Growth — that's a rule-mismatch, not a risk problem.
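The ≥70% threshold check is a few lines; the 70% figure and the firm-side rule set come from the text above, while the function name and the string keys are illustrative:

```typescript
const FIRM_SIDE_RULES = new Set(["weekend_holding", "news_trading"]);
const MISMATCH_SHARE = 0.7;

// Return the mismatched firm-side rule if one explains >= 70% of failures,
// otherwise null (meaning: report the CI as usual).
function ruleMismatch(failureCounts: Map<string, number>): string | null {
  const total = [...failureCounts.values()].reduce((a, b) => a + b, 0);
  if (total === 0) return null;
  for (const [rule, n] of failureCounts) {
    if (FIRM_SIDE_RULES.has(rule) && n / total >= MISMATCH_SHARE) return rule;
  }
  return null;
}
```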
Verdict's Monte Carlo trade-shuffle is statistically calibrated for independent-signal EAs — strategies where each trade is driven by its own signal with a hard stop-loss, and the order of trades doesn't materially change the outcome distribution. Path-dependent strategies (grid, martingale, hedging, recovery) violate this exchangeability assumption: their P&L sequence is structurally coupled, so reshuffling produces an unrealistic resampled distribution.
Per session F (2026-05-02), Verdict classifies every uploaded portfolio into one of five categories: independent-signal, grid, martingale, hedging, or recovery.
Path-dependent portfolios are excluded from primary-match selection — Verdict still computes CI bounds against compatible firms, but flags them as "not used in ranking" with the classification reason surfaced. Calibration validation (session H, phase 6) confirms 0% false-positive and 0% false-negative rates across 19 path-dep + 12 independent-signal fixtures, with redundancy fixtures locking the boundary thresholds on both sides.
The Rust engine evaluates per-trade rule violations (DD, profit target, min trading days, news, weekend, consistency, leverage caps). It cannot see portfolio-level qualitative restrictions — patterns like "martingale prohibited" or "trades held under 60 seconds banned". Verdict bridges these via a TypeScript pre-flight check: before invoking the engine, the strategy classifier output and per-trade stats are compared against each firm's rule flags. If a portfolio violates a firm's explicit ban, that (portfolio, firm) pair is skipped before the engine runs.
Pre-flight currently catches: martingale ban, grid ban, same-account hedging ban (Goat-unique in V1), HFT ban (>200 trades/day), tick-scalping ban (≥10% trades held under 60 sec), and per-asset leverage caps (forex/metals/indices computed against each trade's notional USD value).
The engine ALSO enforces leverage caps per-trade for trades that pass pre-flight (defense-in-depth). Pre-flight is the fast path that surfaces "this firm bans your strategy class entirely" before paying for MC compute.
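Two of the pre-flight checks fit in a short sketch. The >200 trades/day and ≥10%-under-60s thresholds come from the text; `TradeStat` and the function names are illustrative:

```typescript
interface TradeStat { day: number; holdSeconds: number }

// HFT ban: any single day with more than 200 trades.
function violatesHftBan(trades: TradeStat[]): boolean {
  const perDay = new Map<number, number>();
  for (const t of trades) perDay.set(t.day, (perDay.get(t.day) ?? 0) + 1);
  return [...perDay.values()].some(n => n > 200);
}

// Tick-scalping ban: >= 10% of trades held under 60 seconds.
function violatesTickScalpBan(trades: TradeStat[]): boolean {
  if (trades.length === 0) return false;
  const under60 = trades.filter(t => t.holdSeconds < 60).length;
  return under60 / trades.length >= 0.1;
}
```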
Verdict's detector and pre-flight system are validated against synthetic fixtures with known ground truth. One guardrail locked by that calibration: MARTINGALE_MEAN_RATIO_MIN is not to drop below 1.25 without measured real-EA distribution data. The full calibration validation report is at app/docs/passlab-calibration-validation-2026-05-02.md (640 lines, 9 sections covering fixture-by-fixture margins, detector × pre-flight overlap, accepted gaps, and refresh recommendations).
FXOptimize ships Verdict in two surfaces. The split is about scope, not engine quality — both surfaces share the identical Rust + WASM engine, identical firm catalog, identical Monte Carlo methodology, identical seed defaults. A parity-guard test (verdictFreeProParity.test.ts) asserts byte-identical reports for identical inputs across the two code paths so the surfaces don't drift over time.
Free (/verdict) — single-portfolio analysis. Upload one EA-set, get pass-rates across all 8 firms. Includes the full transparency stack: path-dependent classifier, pre-flight rule-mismatch, FirmRestrictions panel, and the news-trading rule (opt-in advanced setting). Free, no signup, runs locally in the browser.

The single-line tier rule: Free does one portfolio. Solo does your portfolio universe. Both honor the same calibration thresholds, news-rule semantics, and per-firm rule modeling.
Verdict is fully deterministic. Same input + same seed = bit-identical report. The default seed is 42; advanced users can override.
This matters for two reasons: reports are auditable (anyone re-running the same input and seed can verify the output bit-for-bit), and regressions are testable (the integration and parity suites can assert byte-identical reports).
For a typical 5-EA, 2-year backtest against 8 firms with default settings (1,000 MC × 5,000 bootstrap), Verdict completes in ~13 seconds on M1 native, ~25-40 seconds in browser WASM. Mobile devices auto-detect via navigator.hardwareConcurrency; low-spec phones get a Fast Mode with 250 MC iterations + 2,000 bootstrap samples that completes in ~10-20 seconds. CIs widen modestly in Fast Mode but the primary-match selection on CI-low remains conservative.
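A sketch of that auto-detection. The 250/2,000 and 1,000/5,000 pairs come from the text; the 4-core cutoff and the function name are illustrative assumptions, not the shipped heuristic:

```typescript
interface McConfig { mcIterations: number; bootstrapSamples: number }

// Pick iteration counts from navigator.hardwareConcurrency. Browsers may
// omit the value, so fall back to a mid-range default. The <= 4 cutoff is
// an assumption for illustration.
function pickConfig(hardwareConcurrency: number | undefined): McConfig {
  const cores = hardwareConcurrency ?? 4;
  return cores <= 4
    ? { mcIterations: 250, bootstrapSamples: 2000 }   // Fast Mode
    : { mcIterations: 1000, bootstrapSamples: 5000 }; // full run
}
```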
Engine: Rust → WebAssembly. ~300 KB bundle. 95+ unit tests + 7 mandatory integration tests covering determinism, sample-size guard, CI sanity, failure attribution, primary-match selection logic, cross-firm consistency, and edge cases.
Before public launch, we validated the engine on real user portfolios. Key findings that influenced V1 design:
For audit-grade honesty, here's what V1 doesn't capture:
The news-trading rule: the engine raises news_trading_violation for firms with newsTradingBanned: true on trades that overlap a news event ± 5 min. V1 catalog: Goat Funded Trader only — verified against the firm's help center as the single V1 firm whose news rule applies on Phase 1 evaluation. The other 7 firms allow news trading on evaluation per their published rules. V1 treats Goat's news window as a hard breach; the real rule is a 5-minute profit cap at 1% of initial balance, so V1 slightly under-rates Goat's pass-rate for news-active EAs. The profit-cap variant is V2 work.

Important: Verdict tells you what the engine's statistical model predicts based on your historical backtests. Past performance does not guarantee future results. Real-world execution adds slippage, broker latency, broker-specific symbol restrictions, and news-window suspensions that V1 does not fully model. Use Verdict as one signal among many — including manual reading of each firm's published trading objectives.
The complete engine source is documented in the build prompt and reflected in the firm catalog with source URLs and verification dates. Everything you read above is what runs when you hit "Run Verdict" in your browser.
If you find a methodological concern (a rule we model wrong, a firm whose published rules drifted, a statistical critique), the contact is [email protected]. Substantive feedback shapes V2.
Upload your MT4/MT5 backtests and see your 95% CI pass-probability across 8 prop firms. Free. Runs locally in your browser. No signup.
Open Verdict →