Bench Coach
2.6 million plate appearances. 75-state lead-aware Markov chain. 358 personalized batter profiles. Zero black boxes.
Architecture
Pre-game and in-game predictions answer different questions, so they use different math. Before first pitch, no game state exists — no inning, no outs, no runners, no score. The pre-game model has to estimate game outcomes from priors alone: team strength, starting-pitcher quality, lineup composition, park factor, home-field advantage. Once first pitch is thrown, state becomes observable and the question pivots: given exactly this state right now, what is the probability of the home team winning? Different questions, different mathematical structures.
The pre-game model takes nine inputs into a regression-style assembly. Pythagorean expectation gives base team strength; FIP scales for the starting pitchers' innings; bullpen quality, lineup wOBA, home-field advantage, and Marcel-inspired recent-form weighting each layer in their own adjustment; Log5 (the head-to-head probability formula) converts two team strengths into a single win probability for the matchup. Park factor and umpire effects are the eighth and ninth inputs, but they affect expected run totals only — applying a symmetric park factor to both teams' expected runs cancels in the Pythagorean win-probability ratio, and umpire bias is added to the total only after WP is finalized. So seven inputs affect WP; nine inputs affect total runs. The output is one number per game, fixed at first pitch — it updates only when an input changes (lineup posted, weather shifts, injury reported, rotation moved, umpire assigned).
The in-game model is structurally different: a 75-state lead-aware Markov chain — 25 states per lead bucket (24 base-out states plus an absorbing “inning over” state), replicated across three lead buckets (trailing, tied, leading by 2+) for 75 states total. Each bucket carries its own transition matrix capturing how game flow differs by lead context. The chain simulates the rest of the game forward from the current state. Each at-bat is a state transition — every possible plate-appearance outcome carries a probability, read from a transition matrix that has been personalized for the specific batter–pitcher matchup using Bayesian-blended Statcast profiles. 5,000 Monte Carlo simulations run per query; the reported win probability is the fraction of those simulations in which the home team is ahead at the end.
First pitch is the handoff. The pre-game probability and the in-game probability at game start (top of the 1st, no outs, bases empty, 0–0) should land within a few percentage points of each other — they are answering the same question with the same information. They diverge the moment the game produces information the pre-game model could not have known: a leadoff single, an early bullpen call, a runner thrown out trying to stretch a double. The in-game model updates with every state change; the pre-game probability stays fixed. From first pitch forward, displayed win probability is the in-game number — the pre-game probability stays available as a reference value matching what the dashboard showed before first pitch, but the live number on every game surface is always the Markov chain output for the current state.
Nine inputs into the team-strength model: seven affect win probability; park factor and umpire effects (the eighth and ninth) affect expected run totals only — they don't flow into the WP calculation.
Used for scheduled games before first pitch.
75-state lead-aware Markov chain (3-bucket: trailing, tied, leading by 2+) with Monte Carlo simulation. Per-at-bat personalization using Bayesian-blended batter and pitcher profiles applied to each bucket matrix.
Active from first pitch through final out.
Pre-Game Model
Each layer adds signal. The model starts with base team strength and progressively adjusts for today's specific matchup conditions.
Estimates true team strength from runs scored and runs allowed — more predictive than win-loss record alone.
W% = RS^1.83 / (RS^1.83 + RA^1.83)
Early season: blended with prior year to prevent small-sample noise.
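As a hedged sketch of this step (the function name and sample inputs below are illustrative, not the production module):

```python
# Pythagorean expectation with the 1.83 exponent described above.
PYTHAG_EXP = 1.83

def pythag_win_pct(runs_scored: float, runs_allowed: float) -> float:
    """Estimated true win% from runs scored and allowed."""
    rs = runs_scored ** PYTHAG_EXP
    ra = runs_allowed ** PYTHAG_EXP
    return rs / (rs + ra)

# A team outscoring opponents 750-650 projects well above .500:
print(round(pythag_win_pct(750, 650), 3))
```

Early-season blending with the prior year (step 9) is applied on top of this base estimate.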
Replaces team-average pitching for the starter's innings with their individual quality, measured by Fielding Independent Pitching.
pitcher_RA9 = team_RA × (pitcher_FIP / LEAGUE_AVG_FIP)
SP covers ~5.5 IP (61%), bullpen covers ~3.5 IP (39%).
Adjusts expected runs allowed for the relief innings based on each team's bullpen FIP.
bullpen_RA9 = team_RA × (team_BP_FIP / LEAGUE_AVG_BP_FIP)
Source: MLB Stats API reliever splits, 30 teams.
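Steps 2 and 3 combine as a weighted average over the starter's and bullpen's innings shares. A minimal sketch, assuming an illustrative league-average FIP constant:

```python
# Blend starter-scaled and bullpen-scaled runs allowed by innings share.
LEAGUE_AVG_FIP = 4.20        # assumed constant for this sketch
SP_SHARE, BP_SHARE = 0.61, 0.39   # ~5.5 IP starter, ~3.5 IP bullpen

def expected_ra9(team_ra9: float, sp_fip: float, bp_fip: float,
                 league_fip: float = LEAGUE_AVG_FIP) -> float:
    sp_ra9 = team_ra9 * (sp_fip / league_fip)   # starter's innings
    bp_ra9 = team_ra9 * (bp_fip / league_fip)   # relief innings
    return SP_SHARE * sp_ra9 + BP_SHARE * bp_ra9
```

With a league-average staff the blend returns the team rate unchanged; an ace starter pulls the expected runs allowed down for his ~61% share.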
Compares today's actual starting lineup offensive quality to the team's season average using weighted on-base average.
run_adj = (lineup_wOBA - team_avg_wOBA) × 12.0
Requires 5+ matched batters with 30+ PA each. 579 batters in cache.
Adjusts expected total runs based on the venue's historical run environment.
all_runs × park_factor
Statcast 3-year rolling average. Coors Field: 1.27×. T-Mobile Park: 0.79×.
Home teams score approximately 2.5% more runs historically. Applied to run scoring, not run prevention.
home_RS × 1.025 | away_RS × 0.975
One of the most stable effects in baseball analytics.
Bill James's formula for converting two team win percentages into a head-to-head probability.
P(A) = (pA − pA×pB) / (pA + pB − 2×pA×pB)
Output clamped to [0.15, 0.85] to prevent overconfident pre-game predictions.
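A direct transcription of the formula plus the output clamp (function names are illustrative, not the production module):

```python
def log5(p_a: float, p_b: float) -> float:
    """Bill James Log5: P(A beats B) from each team's win percentage."""
    return (p_a - p_a * p_b) / (p_a + p_b - 2 * p_a * p_b)

def clamp_wp(p: float, lo: float = 0.15, hi: float = 0.85) -> float:
    """Cap pre-game win probability to avoid overconfidence."""
    return max(lo, min(hi, p))
```

Two .500 teams come out at exactly 50%; a .600 team over a .400 team lands near 69% before clamping.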
Home plate umpires shift run totals via strike zone size and consistency. Today's assigned umpire's career run-per-game tendency is applied as an additive adjustment to the expected total.
expected_total += umpire_runs_per_game_bias (clamped ±0.4 R/G)
Source: MLB Stats API boxscore aggregated across 2024–2025 seasons. 88 active umpires tracked. Monthly refresh.
Smooth regression toward a 3-year weighted prior replaces the old 30-game step function. Team RS/RA blends current season with prior years proportionally to sample size — the same approach used by ZiPS, Steamer, and Marcel projection systems.
weight_current = games_played / (games_played + 50)
RS_blended = weight × RS_current + (1 − weight) × RS_3yr_prior
Weights: 2026 × 1.0 + 2025 × 0.5 + 2024 × 0.25, normalized. Eliminates discontinuity at games_played=30.
Pre-Game Model
Home plate umpires have measurable, persistent effects on run scoring. A strict-zone umpire generates more strikeouts and fewer walks, suppressing runs. A loose-zone umpire does the reverse. Per SABR Analytics research, umpire-to-umpire variance within a single season is 0.1–0.3 runs per game. At a typical 8.5 / 9.0 O/U line, a 0.2-run shift moves the over probability from ~52% to ~57%.
Bench Coach tracks home plate umpire assignments via the MLB Stats API (assigned ~24 hours before first pitch) and compares each umpire's career runs-per-game to the league average. The signed bias is applied additively to the expected total — not to win probability — and is clamped to ±0.4 R/G regardless of cache value. This is the same signal that EV Analytics and Action Network include in their projections.
# Step 7 in pregame_model.predict()
umpire_bias = umpire_cache[umpire_id].runs_per_game_bias
umpire_bias = clamp(umpire_bias, min=-0.4, max=+0.4)
expected_total += umpire_bias
# Cache schema (data/umpire_cache.json)
{
"427164": {
"name": "Andy Fletcher",
"games": 39,
"avg_runs_per_game": 9.18,
"runs_per_game_bias": +0.38, # vs league avg 8.80
"data_source": "mlb_stats_api_boxscore"
}
}

Data source decision: Bench Coach uses MLB Stats API boxscore aggregation as the primary umpire signal source — 88 umpires, ≥10 games threshold, refreshed monthly. UmpScorecards (umpscorecards.com) offers a more sophisticated expected-stats methodology and would be a stronger primary signal if their API access were public; the cache loader (scripts/build_umpire_cache.py) is built to accept richer data without model changes if access becomes available.
Pre-Game Model
The original model used a step function: regress team strength 30% toward .500 for the first 30 games, then stop regressing entirely. This creates an artificial discontinuity — a team's 29th game and its 30th game should not be treated fundamentally differently.
The replacement approach aligns with how Marcel, ZiPS, and Steamer project players: blend current-season data with a multi-year weighted prior, with the current season's weight increasing proportionally to sample size. At 0 games, the model is fully on prior; at 162 games, it's 76% on current season.
# Smooth regression weight (replaces games_played < 30 step)
REGRESSION_PRIOR_PA_EQUIV = 50 # tune empirically; 50 ≈ 1/3 season
weight_current = games_played / (games_played + REGRESSION_PRIOR_PA_EQUIV)
# At 0 games: weight = 0.0 (full prior)
# At 50 games: weight = 0.5 (half prior, half current)
# At 162 games: weight = 0.76 (mostly current)
# 3-year Marcel-weighted prior (data/team_stats_3yr.json)
RS_prior = (1.0 × RS_2026_to_date + 0.5 × RS_2025_full + 0.25 × RS_2024_full)
/ (1.0 + 0.5 + 0.25)
# Blended estimate used in pregame_model.predict()
RS_blended = weight_current × RS_current + (1 - weight_current) × RS_prior

The weights here (1.0 / 0.5 / 0.25) are inspired by — not identical to — Marcel's published 5/4/3 system. Marcel weights the most recent season most heavily; Bench Coach applies the same principle with steeper recent-year emphasis. Both approaches improve on a hard step function by smoothing the regression-to-prior boundary.
The same 3-year weighting is applied to pitcher FIP (pitcher_fip_3yr.json) and batter wOBA (batter_woba_3yr.json), with caches built from the MLB Stats API and committed to the repo. Railway loads them at startup — no Statcast calls required in production.
In-Game Model
Every baseball half-inning can be described by two variables: how many outs there are (0, 1, or 2) and which bases are occupied (8 configurations). That gives 24 active states plus one absorbing state (3 outs — inning over). This is our Markov chain.
24 active states = 3 outs × 8 base configs
1 absorbing state = 3 outs (inning over)
25 total states

Base encoding (bitmask): 1B = 1, 2B = 2, 3B = 4
Empty = 0, Loaded = 7, 1st & 3rd = 5
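The indexing above can be sketched in a few lines (names are illustrative):

```python
# Base-out state indexing: bases as a bitmask, outs 0-2.
FIRST, SECOND, THIRD = 1, 2, 4
ABSORBING = 24  # three outs, inning over

def state_index(outs: int, bases: int) -> int:
    """Map (outs 0-2, bases bitmask 0-7) to an index in 0..23."""
    return outs * 8 + bases
```

So bases-empty, no outs is state 0; bases loaded with two outs is state 23; runners on 1st and 3rd with one out is state 13.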
A 25×25 matrix T where T[i][j] is the probability of moving from state i to state j on a single plate appearance. This is the underlying baseline; production uses three lead buckets (trailing, tied, leading by 2+) for a 75-state lead-aware chain. Trained from 2.6 million plate appearances across 15 MLB seasons (Retrosheet, 2010–2024).
Expected runs from any base-out state through the end of the inning, computed via the fundamental matrix of absorbing Markov chains.
Q = T[0:24, 0:24]    # transient submatrix
N = (I − Q)⁻¹        # fundamental matrix
RE = N × r           # expected runs vector
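An illustrative sketch of the fundamental-matrix computation. The 3-state toy chain and per-transition run vector below are made up for readability; the production chain uses the 24 transient base-out states:

```python
import numpy as np

# Toy absorbing chain: 3 transient states + 1 absorbing "inning over" state.
# Rows are from-states, columns are to-states; each row sums to 1.
T = np.array([
    [0.2, 0.5, 0.0, 0.3],
    [0.0, 0.1, 0.6, 0.3],
    [0.0, 0.0, 0.2, 0.8],
    [0.0, 0.0, 0.0, 1.0],   # absorbing state
])
r = np.array([0.4, 0.7, 0.9])  # made-up expected runs per transition

Q = T[:3, :3]                     # transient submatrix
N = np.linalg.inv(np.eye(3) - Q)  # fundamental matrix (I - Q)^-1
RE = N @ r                        # expected runs through end of inning
```

For state 2 (self-loop probability 0.2) this resolves analytically to 0.9 / 0.8 = 1.125 expected runs, which the matrix computation reproduces.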
Win probability is estimated by simulating the remainder of the game 5,000 times from the current state. Each simulation walks through the Markov chain, sampling transitions and accumulating runs until the game ends. Walkoff logic, extra innings, and the Manfred runner rule are all modeled.
home_wp = home_wins / 5,000
SE ≈ 0.7% at p = 0.50 (worst case)
SE ≈ 0.6% at p = 0.80
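The quoted standard errors follow from the binomial SE of a Monte Carlo proportion, assuming independent simulations:

```python
import math

def mc_standard_error(p: float, n: int = 5000) -> float:
    """Standard error of a Monte Carlo win-probability estimate."""
    return math.sqrt(p * (1.0 - p) / n)
```

At p = 0.50 this gives ~0.0071 (0.7%), the worst case; the SE shrinks as the estimate moves toward either extreme.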
The transition matrix is personalized for the current batter-pitcher matchup using Bayesian-blended Statcast profiles. 358 batters and 354 pitchers have pre-computed matrices. Players with more plate appearances get more weight on their personal data; small samples regress toward league average.
T_personalized = T_batter + T_pitcher − T_league

where:
  T_batter  = w_b × raw_batter  + (1 − w_b) × league
  T_pitcher = w_p × raw_pitcher + (1 − w_p) × league
  w_b = PA / (PA + 200)
  w_p = PA / (PA + 200) × 0.5    # pitcher gets half influence

Example: Aaron Judge (751 PA) → 79% personal weight
Example: 100-PA rookie → 33% personal, 67% league
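The blending arithmetic can be sketched with scalars; the real blend applies the same arithmetic elementwise to 25×25 transition matrices (function names here are illustrative):

```python
# Bayesian-style shrinkage toward league average, weighted by sample size.
PRIOR_PA = 200

def batter_weight(pa: int) -> float:
    return pa / (pa + PRIOR_PA)

def pitcher_weight(pa: int) -> float:
    return pa / (pa + PRIOR_PA) * 0.5   # pitcher gets half influence

def blend(raw: float, league: float, weight: float) -> float:
    return weight * raw + (1.0 - weight) * league
```

A 751-PA season (the Aaron Judge example above) yields 751 / 951 ≈ 0.79 personal weight; a 100-PA rookie gets 100 / 300 ≈ 0.33.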
Out-of-Sample Validation
The model has been backtested out-of-sample against nine distinct MLB seasons spanning four rule eras — pre-pitch-clock traditional rules (2017–2019), COVID-shortened with universal DH and ghost runner debut (2020), post-COVID with ghost-runner retained and NL pitchers returning to bat in 2021 before the universal DH was made permanent in 2022 (2021–2022), and post-pitch-clock with larger bases and shift restrictions (2023 transition year, 2024–2025 settled). Each test year uses a holdout model trained only on prior years — the 2017 backtest uses a model trained on 2010–2016, the 2025 backtest uses a model trained on 2010–2024, and so on. No cherry-picking, no overfitting. Every regular-season inning-start across nine seasons — 361,519 predictions across 20,325 games, run on the post-audit 3-bucket lead-aware production model.
| Year | Era | Brier | 95% CI | Confident Acc. | N |
|---|---|---|---|---|---|
| Pre-pitch-clock (2017–2019) | | | | | |
| 2017 | pre-pitch-clock | 0.1617 | [0.1600, 0.1635] | 82.9% | 43,328 |
| 2018 | pre-pitch-clock | 0.1635 | — | 83.1% | 43,556 |
| 2019 | pre-pitch-clock | 0.1608 | — | 83.3% | 43,513 |
| COVID-shortened (2020) | | | | | |
| 2020 | COVID + DH/ghost runner | 0.1624 | — | 83.3% | 15,516 |
| Post-COVID, pre-clock (2021–2022) | | | | | |
| 2021 | post-COVID, pre-clock | 0.1651 | — | 83.1% | 42,740 |
| 2022 | post-COVID, pre-clock | 0.1598 | — | 84.8% | 43,207 |
| Clock-transition + post-clock (2023–2025) | | | | | |
| 2023 | clock-transition | 0.1677 | — | 82.1% | 43,204 |
| 2024 | post-pitch-clock | 0.1636 | [0.1618, 0.1653] | 83.3% | 43,236 |
| 2025 | post-pitch-clock | 0.1624 | [0.1607, 0.1640] | 83.5% | 43,219 |
Brier range across nine seasons: 0.1598–0.1677. The 79-basis-point spread is the cost of cross-era generalization. The model performs best on 2022 (mid-training-era 2010–2021, settled rules) at 0.1598 and worst on 2023 (the first full season with pitch clock, larger bases, and the shift ban, tested against a model trained mostly on the old rules) at 0.1677. The 2024 and 2025 numbers (0.1636, 0.1624) sit between, as the new-rules training data accumulates and the chain adapts.
Era-aggregate means tell the same story without averaging over structurally different rule sets: pre-pitch-clock 0.1620 across three traditional-rules seasons (closest to the rules the bulk of training data captures); COVID-shortened 0.1624 on a single year of pandemic-era ball; post-COVID, pre-clock 0.1625 across two seasons where the ghost runner persisted, NL pitchers returned to bat in 2021 before the universal DH was made permanent in 2022, but the pitch clock had not yet arrived; clock-transition + post-clock 0.1646 across three seasons of the current rule set, with the 2023 transition pulling the aggregate up. 26-basis-point spread across four genuinely different rule regimes is the cross-era stability signal — the chain captures structural baseball rather than memorizing era-specific patterns.
A note on bootstrap confidence intervals: 2017, 2024, and 2025 have full bootstrap 95% CIs from 10,000 resamples (script: scripts/rigor_pass.py). The 2018–2023 backtests captured headline metrics only; a follow-up rigor pass on those six seasons is queued and does not block the headline aggregate. The three years with CIs span the rule-era spectrum (pre-pitch-clock, post-pitch-clock years 2 and 3); their CI widths land at 33–35 basis points, suggesting the unbanded years sit in the same precision regime.
The 2020 row is the deliberate stress test. Model trained on 2010–2019 (no DH for the National League, no extra-innings ghost runner) tested against 2020 (universal DH first deployed, ghost runner active). Brier 0.1624 — within 4 basis points of the era mean despite a complete rule overhaul and a 60-game regular season (N=15,516 vs ~43K elsewhere). That is structural-baseball signal in the chain, not era-specific memorization.
A consistent calibration pattern appears in all nine seasons: the model is systematically conservative at high-confidence levels (positive deltas at the 70–80% and 80–90% bins). The original 25-state chain ran +5.12pp at the 80–90% bin. The current 75-state 3-bucket lead-aware chain (Phase 6.9 plus its audit response) closes that to a +1.46pp mean across all 9 seasons, a ~72% closure of the conservatism gap. The net effect for users is still favorable — the model undersells its own edge by ~1pp at the high-confidence range.
Phase 6.9 lead-state expansion + the 2026-04-29 four-pass full-repo audit together produced these numbers. Pre-audit Brier (0.158 range) included a self-cancelling pair of parser bugs in the training pipeline. Post-audit numbers reflect true model calibration.
Tested both 3-bucket and 7-bucket lead-aware designs across 9 seasons (2017–2025, 4 rule eras). The 7-bucket design produced a mean 0.14pp marginal improvement at the 80–90% target bin and a mean 0.32pp marginal improvement at the 40–50% mid-bin. Mean Brier identical to four decimal places (0.1630 both). The marginal improvements did not justify 100 additional states, sample-sturdiness loss in tail cells, or the methodology complexity of distinguishing seven lead buckets vs three. 3-bucket parsimony shipped.
One residual is disclosed honestly: the 40–50% mid-confidence bin shows a persistent +5.67pp positive bias across all 9 seasons (range +3.76 to +7.65pp). Phase 6.9 was scoped to high-confidence conservatism specifically; the mid-bin variance is a separate phenomenon affecting predictions at the 40–50% confidence boundary, where mid-confidence predictions concentrate in late-game close-score states and leverage variables (bullpen quality, lineup turnover, pitcher fatigue) appear to matter more than the lead-state matrix can capture. Investigation scheduled for the next phase. Model ships at the residual disclosed here while that work proceeds.
2024 Detail
The model was trained on 2010–2023 data and tested against the full 2024 MLB season — data it had never seen. No cherry-picking, no overfitting. Every regular-season inning-start across all 2,428 games — 43,236 predictions, with bootstrap 95% confidence intervals computed via 10,000 resamples. Numbers below reflect the post-audit 3-bucket lead-aware production model.
A well-calibrated model produces actual win rates that match its predicted probabilities. Each row below shows predicted probability range, number of predictions in that range, and what actually happened. Delta near zero = calibrated.
| Predicted | N | Actual | Delta |
|---|---|---|---|
| 0–10% | 5,644 | 3.8% | -1.2% |
| 10–20% | 3,236 | 15.9% | +0.9% |
| 20–30% | 2,901 | 26.5% | +1.5% |
| 30–40% | 3,083 | 34.8% | -0.2% |
| 40–50% | 6,308 | 48.8% | +3.8% |
| 50–60% | 7,671 | 55.8% | +0.8% |
| 60–70% | 2,902 | 65.9% | +0.9% |
| 70–80% | 2,829 | 77.0% | +2.0% |
| 80–90% | 3,214 | 85.7% | +0.7% |
| 90–100% | 5,448 | 96.3% | +1.3% |
Note: the 80–90% bin closes to +0.7% on the post-audit 3-bucket lead-aware model — well within the ≥80%-of-gain ship threshold the design spec specified. The 70–80% bin sits at +2.0% and 90–100% at +1.3%. The visible residual is the 40–50% mid-confidence bin (+3.8% on 2024, +5.7pp mean across 9 seasons) — a separate phenomenon affecting predictions where the model expresses near-coin-flip uncertainty, scheduled for follow-on investigation in the next phase. Model ships at the residual disclosed here.
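The binning arithmetic behind a calibration table like the one above can be sketched as follows (illustrative only; the production backtest script is not shown here):

```python
# Bin predictions by decile; compare mean outcome in each bin to the
# bin midpoint. Delta near zero = calibrated.
def calibration_table(preds, outcomes, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        i = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the top bin
        bins[i].append(y)
    rows = []
    for i, ys in enumerate(bins):
        if not ys:
            continue
        midpoint = (i + 0.5) / n_bins
        actual = sum(ys) / len(ys)
        rows.append((i, len(ys), actual, actual - midpoint))
    return rows
```

Each returned row mirrors a table row: bin index, N, actual win rate, and the delta against the bin midpoint.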
We publish the pre-game Brier score honestly. A league-average model without player-specific in-game state has no edge over 50/50. Our pre-game model adds value through run total predictions, matchup context, and park-adjusted expectations — not by pretending to beat the market on moneyline before first pitch.
2025 Validation
A second consecutive out-of-sample validation. The model was trained on 2010–2024 data and tested against the full 2025 MLB season — data it had never seen. Two consecutive seasons of out-of-sample holdout testing reduce the risk that 2024 was simply a lucky year.
Same calibration methodology as 2024. Delta near zero = calibrated. Two-season consistency confirms the model is not overfit to any single year.
| Predicted | N | Actual | Delta |
|---|---|---|---|
| 0–10% | 5,371 | 3.7% | -1.3% |
| 10–20% | 3,171 | 15.3% | +0.3% |
| 20–30% | 2,882 | 26.8% | +1.8% |
| 30–40% | 3,016 | 38.4% | +3.4% |
| 40–50% | 6,422 | 52.3% | +7.3% |
| 50–60% | 7,651 | 58.3% | +3.3% |
| 60–70% | 3,235 | 67.3% | +2.3% |
| 70–80% | 2,788 | 77.7% | +2.6% |
| 80–90% | 3,230 | 88.1% | +3.1% |
| 90–100% | 5,453 | 97.6% | +2.6% |
CLV Validation
Brier score measures accuracy against game outcomes — but outcomes are noisy. A 65% favorite loses 35% of the time; that doesn't make the model wrong. The sharper validation metric is closing-line value (CLV): how does the model's prediction compare to Pinnacle's closing line?
Pinnacle is the world's sharpest sportsbook. Their closing odds correlate above 99% with actual outcomes on large samples — sharper than any public model's outcome predictions. When Bench Coach's probability estimate is closer to the game outcome than Pinnacle's closing line, that is evidence the model carries independent signal beyond what the market has already priced.
Most public MLB models don't publish CLV. Bench Coach does — capture began 2026-04-27 (live as of this page revision); published results land once N ≥ 30 resolved games per market segment.
At the moment of first pitch on every game, the prediction-capture loop records two things simultaneously: Bench Coach's current inning-start win probability, and Pinnacle's live moneyline (which at first pitch equals the effective closing line — Pinnacle makes minimal adjustments post-lock).
At first pitch (game lock):
bench_coach_prob = Markov engine output (home win prob)
pinnacle_close = Pinnacle's live moneyline de-vigged to true prob
CLV-Brier = mean((bench_coach_prob − actual_outcome)²)
where actual_outcome ∈ {0, 1}
CLV-beat-rate = % of games where bench_coach_prob
was closer to actual_outcome than pinnacle_close was

Status: Live capture began 2026-04-27. As of launch, N = 0 completed game captures. Published CLV results will appear here once we reach N ≥ 30 resolved games per market segment — the minimum for statistically meaningful reporting. No results are manufactured or back-filled; every data point is a real first-pitch capture from a 2026 regular-season game.
We chose Pinnacle as the benchmark because they accept sharp action without restriction, post the smallest vig in the market, and their closing lines are the consensus best estimate of true probability available anywhere. Beating Pinnacle's close — even occasionally — demonstrates the model carries signal beyond what the sharpest market participants have priced.
Betting Edge
When our model disagrees with the sportsbooks, we flag it. Here's how.
Sportsbook odds include a house edge (vig). We strip it to get true implied probabilities.
implied = |odds| / (|odds| + 100)   if favorite
implied = 100 / (odds + 100)        if underdog
devigged = implied / (home_imp + away_imp)
How much a $1 bet is worth given our model's probability vs the market price.
EV = (model_prob × payoff) − ((1 − model_prob) × stake)
Optimal bet sizing for long-term bankroll growth, based on edge magnitude.
kelly = (p × (odds − 1) − (1 − p)) / (odds − 1)
Monte Carlo run distribution tells us the probability of the game going over or under the book's total line.
over_prob = Σ P(total > line)
under_prob = Σ P(total < line)
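The de-vig and Kelly formulas above can be combined into one sketch (American-odds conventions for the de-vig step, decimal odds for Kelly; function names are illustrative, not the production module):

```python
def implied_prob(american: int) -> float:
    """Raw implied probability from American odds (includes vig)."""
    if american < 0:                        # favorite, e.g. -150
        return -american / (-american + 100)
    return 100 / (american + 100)           # underdog, e.g. +130

def devig(home_odds: int, away_odds: int):
    """Strip the vig by normalizing the two implied probabilities."""
    h, a = implied_prob(home_odds), implied_prob(away_odds)
    total = h + a                            # > 1.0 due to house edge
    return h / total, a / total

def kelly_fraction(p: float, decimal_odds: float) -> float:
    """Kelly criterion stake fraction from model prob and decimal odds."""
    b = decimal_odds - 1.0                   # net payoff per unit staked
    return (p * b - (1.0 - p)) / b
```

For a -150 / +130 pair the raw implied probabilities sum to ~1.035; de-vigging renormalizes them to a true-probability pair summing to exactly 1.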
Data
| Source | What | Volume | Refresh |
|---|---|---|---|
| Retrosheet | Historical play-by-play | 2.6M PA (2010–2024) | Annual |
| MLB Stats API | Live game state (GUMBO feed) | Real-time | 15s polling |
| MLB Stats API | Rosters, standings, schedules | 30 teams | Daily |
| Baseball Savant | Statcast batter/pitcher profiles | 358 batters, 354 pitchers | Weekly |
| The Odds API | Live sportsbook lines | DK, FD, BetMGM + | 5-min cache |
Disclosure
Bench Coach’s injury report defaults to surfacing only transitions that meaningfully affect game predictions. The full criterion is published below; the toggle to show all transactions is visible on the dashboard injury report.
These are the inputs Bench Coach’s lineup and rotation models read when generating predictions. A reviewer or user asking “did you incorporate today’s IL move?” can verify against the surfaced feed.
The “Show all” toggle is one click away. The criterion is auditable: the filter logic lives in injury_scanner.py and matches the events currently surfaced.
Bench Coach surfaces IL placements / activations and rotation moves today. Trade announcements, suspensions, role changes, surgery announcements, and arbitration / contract-status changes are part of the published criterion above; they ship to the surfaced feed as Bench Coach’s upstream data sources expand to cover them. The criterion-vs-code alignment will hold at every step: anything in the surfaced feed matches a “Material” entry on this list, and any “Material” entry not yet surfaced is named here explicitly.
Every game. Every pitch. Every edge. The model is running right now.
Sign Up Free