Predicting Haulers, Not Averages: An ML Pipeline for Fantasy Football

What 174 features taught me about FPL prediction, and where ML still fails

~10 min read


tl;dr - Key Takeaways

If you don’t want the technical details:

Early signals for GW22+: Bruno Guimarães, Watkins, Guéhi, Vicario (see the full list below).

  1. Value efficiency beats star names (see “Magnificent Seven”): The model’s #1 signal is points-per-pound, not raw talent. A £6m midfielder averaging 5 pts/GW is often better than a £12m premium averaging 7.

  2. Recent form matters more than season averages: The last 2 gameweeks are weighted 2x more than games 4-5 weeks ago. Chase momentum, not reputation.

  3. Premium players are fixture-proof (see SHAP finding 1): Elite players (£10m+) only lose ~20% of their expected points in tough fixtures. Don’t bench Salah against City.

  4. Budget players need fixture rotation (see SHAP finding 3): Unlike premiums, budget picks show 2.5x swings based on opponent. Rotate your £5-7m players aggressively.

  5. Caveat: The model ranks players well but can’t reliably pick the single best captain each week (0% accuracy on holdout).

Want the technical details? Read on.


The Problem: No Differentiation

Standard regression models trained on Fantasy Premier League data have a curious failure mode: they predict everyone will score 5-7 points with no meaningful differentiation.

This makes sense from an optimisation perspective. Most FPL players score between 2 and 6 points per gameweek, so a model that predicts 5.2 for everyone achieves a respectable Mean Absolute Error (MAE) while being useless in practice - it tells you nothing about which players to pick.

FPL players don’t need perfect point predictions. They need to identify haulers: the explosive 10+ point performances that separate top managers from the rest. Or, at minimum, reliably rank players so the best options rise to the top. Analysis of top 1% FPL managers reveals the stakes:

| Metric | Top 1% | Average | Difference |
|---|---|---|---|
| Captain Points per GW | 18.7 | 15.0 | +24% |
| Haulers per GW | 3.0 | 2.4 | +25% |

Top managers don’t just get hauls; they tend to be on the right ones more often.

This post describes an ML pipeline I built that aims to address both sides of the problem: maintaining reasonable point-prediction accuracy while deliberately optimising for differentiation. In practice, this means trading some average error for better separation between players - particularly in the upper tail, where haulers live. For reference, the model achieves 1.80 MAE under cross-validation and 1.31 MAE on a GW20–21 hold-out, outperforming a simple rule-based baseline, but those figures are a sanity check rather than the goal. The core design choices focus on domain-aware feature engineering, position-specific modelling, and loss functions that penalise missed explosive performances.

A Note on Design Philosophy

This approach extrapolates from historical data; it assumes past patterns carry forward. Like any such model, it has a ceiling: top 1% managers may incorporate information the model can’t see (injury whispers, press conferences, watching matches). The model prioritises consistency over that potential upside.


Section 1: The Feature Engineering Story

174 Features Sounds Like Bloat. It (Mostly) Isn’t.

The model uses 174 engineered features across 8 development phases. Each phase solves a specific blind spot that off-the-shelf models miss.

/insights/2026-01-11-fpl-ml-pipeline/feature_network_graph.png 174 features across 13 categories. Green = high importance features, orange = medium importance.

What makes this feature set effective isn’t the count; it’s the systematic iteration. Each phase targeted a known predictive gap.

The Magnificent Seven: Features That Actually Matter

I ran three independent importance analyses: Mean Decrease Impurity (MDI), Permutation Importance, and SHAP values. Seven features consistently appeared in the top 10 across all three methods, with strong agreement on relative importance:

| Rank | Feature | Signal Strength | Why It Matters |
|---|---|---|---|
| 1 | points_per_pound | High | Value efficiency beats raw talent. A £6m player with 5 xP > £12m player with 8 xP for squad building. |
| 2 | ewm_5gw_points | Medium-High | Exponential weighted form. Last 2 games matter 2x more than games 4-5 weeks ago. |
| 3 | value_vs_position | Medium-High | Positional context. A 4-point defender means something different than a 4-point forward. |
| 4 | clean_sheet_potential | Medium | Defensive security matters as much as attacking returns for DEF/GKP. |
| 5 | cs_x_minutes | Medium | Clean sheet probability × minutes = defensive ceiling. |
| 6 | rolling_5gw_minutes | Medium | Playing time is destiny. No minutes, no points. |
| 7 | rolling_5gw_ict_index | Medium | ICT captures overall involvement: goals, assists, threat, creativity. |

Key insight: These features dominate across all three methods. If I had to rebuild with a minimal set, these would be my starting point.

The top 3 features alone account for 39% of total feature importance. Notice what dominates: value metrics and recent form. The model cares mostly about value efficiency and momentum.
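
To make the top two concrete, here is a minimal sketch of how value-efficiency and exponentially weighted form features can be derived. The dataframe schema (player_id, gw, total_points, now_cost in £m) and the halflife value are illustrative assumptions, not the pipeline’s actual code.

```python
import pandas as pd

def add_form_and_value_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of ewm_5gw_points and points_per_pound (schema is assumed)."""
    df = df.sort_values(["player_id", "gw"]).copy()

    # Exponentially weighted form: shift(1) keeps the current gameweek out of
    # its own feature (no leakage); halflife=3 gives the most recent games
    # roughly twice the weight of games three gameweeks older.
    df["ewm_5gw_points"] = (
        df.groupby("player_id")["total_points"]
          .transform(lambda s: s.shift(1).ewm(halflife=3, min_periods=1).mean())
    )

    # Value efficiency: recent expected output per £1m of price.
    df["points_per_pound"] = df["ewm_5gw_points"] / df["now_cost"]
    return df
```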

What Surprised Me: The Underperformers

Several feature categories I expected to be important… weren’t:

| Feature Category | Expected Rank | Actual Rank | What Happened |
|---|---|---|---|
| Betting Odds | Top 20 | #40-70 | Model learns fixture quality from base stats; doesn’t need market odds |
| Haul Rates | Top 20 | #100-120 | Current form beats historical haul patterns |
| Elite Interactions | Top 20 | #44-138 | Premium players don’t respond uniquely to fixtures; they just scale with price |
| Opponent Vulnerability vs Top 6 | Top 20 | #63-92 | Generic fixture difficulty works just as well |

Even more annoyingly: 35 features had 0.0 importance across all three methods. These include all venue-specific features, all ranking features, and all fixture run projections (3GW, 5GW lookahead).

Practical implication: I can likely remove ~20% of features (174 → 139) with minimal impact, but I still need to confirm via ablation.

SHAP Reveals the Non-Linear Truth

Some interesting insights come from SHAP dependence plots, which reveal relationships that linear models and simple rules can’t capture.
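
For readers who want to reproduce this kind of plot, the sketch below shows the general shape of the analysis with the shap library. The feature names (fixture_difficulty, now_cost) and the fitted model/feature matrix are stand-ins, not the pipeline’s actual identifiers.

```python
import shap
import matplotlib.pyplot as plt

# `model` is a fitted tree ensemble; `X` is the engineered feature matrix.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Dependence plot: SHAP contribution of fixture difficulty, coloured by price,
# which is what exposes the elite-vs-budget split discussed below.
shap.dependence_plot(
    "fixture_difficulty", shap_values, X,
    interaction_index="now_cost", show=False,
)
plt.tight_layout()
plt.savefig("shap_fixture_dependence.png", dpi=150)
```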

Finding 1: Elite Players Are Fixture-Proof

/insights/2026-01-11-fpl-ml-pipeline/shap_elite_fixture.png SHAP dependence plot showing elite player (£10m+) performance across fixture difficulty levels.

Conventional FPL wisdom treats tough fixtures as near-automatic blanks, with managers often benching players facing top-6 opposition. The SHAP analysis reveals a more nuanced reality:

  • Premium players (£10m+) lose only 20% of their ceiling in hard fixtures
  • Elite players maintain SHAP values of +0.1 to +0.3 even against top defences
  • The penalty is steepest at mid-range difficulty (2.5-3.5), not at extremes

Example: a premium striker facing a top-6 defence might be projected at 8.2 xP versus 10.1 xP in an easy fixture - a 19% drop, not the dramatic reduction that conventional wisdom suggests.

Strategic implication: Hold premium assets through 2-3 tough fixtures; their floor remains high.

Finding 2: Form Amplifies Fixture Quality

/insights/2026-01-11-fpl-ml-pipeline/shap_form_opponent.png SHAP dependence plot showing recent form × opponent vulnerability interaction.

The relationship between form and fixture isn’t additive; it’s multiplicative:

| Form (5GW points) | Weak Opponent | Strong Opponent | Ratio |
|---|---|---|---|
| 35+ pts | +2.0 to +3.0 SHAP | +0.5 to +1.0 SHAP | 3x boost |
| 25-35 pts | +0.5 to +1.0 SHAP | -0.5 to +0.5 SHAP | 2x boost |
| <25 pts | -0.5 to +0.5 SHAP | -1.5 to -0.5 SHAP | 1.5x penalty |

A player with 35+ points in 5 gameweeks gets a 2-3x boost from weak opponents versus strong ones.

Strategic implication: The upside is real, but it’s rarely “free”. When form is obvious, prices and ownership move quickly, so the edge is often in spotting early momentum, not buying after the crowd.

Finding 3: Budget Players Are Fixture-Dependent

/insights/2026-01-11-fpl-ml-pipeline/shap_opponent_vulnerability.png SHAP dependence plot showing opponent vulnerability × price tier interaction.

The SHAP spread tells the story:

  • Budget players (£4m-£7m): SHAP ranges from -1.0 to +1.5 (2.5 unit swing)
  • Elite players (£10m+): SHAP ranges from +0.1 to +1.0 (0.9 unit swing, always positive)

Elite players maintain positive SHAP even against the best defences. Budget players go negative.

Strategic implication: Rotate budget picks based on fixture runs; set-and-forget premiums.


Section 2: The Training Pipeline

Two Phases: Evaluate, Then Retrain

The training pipeline follows a disciplined two-phase approach:

/insights/2026-01-11-fpl-ml-pipeline/full_pipeline_flowchart.png Full training pipeline from data split to production. Phase 1 evaluates all models in parallel using walk-forward validation. Phase 2 retrains the selected hybrid architecture on the full dataset.

Phase 1: Evaluation (GW1-19 training, GW20-21 holdout)

  • Train 4 unified models in parallel (RandomForest, LightGBM, XGBoost, GradientBoosting)
  • Train 4 position-specific models per position (16 total)
  • Evaluate on holdout using walk-forward validation
  • Decision: Which positions benefit from specialisation?

Phase 2: Production (GW1-21 full training)

  • Retrain selected models on all available data
  • Build HybridPositionModel router
  • Deploy with position-based prediction routing

Note on architecture selection: The pipeline re-evaluates every gameweek. The hybrid configuration (GKP position-specific, others unified) shown here is from this specific training run; next week’s optimal configuration may differ. Treat this as one valid configuration, not the optimal one.

Why GKP Gets Its Own Model (This Week)

/insights/2026-01-11-fpl-ml-pipeline/position_comparison_heatmap.png Position-specific vs unified model performance. Only GKP exceeds the 2% improvement threshold, justifying a hybrid architecture with one position-specific model.

| Position | Unified MAE | Specific MAE | Improvement | Decision |
|---|---|---|---|---|
| GKP | 1.789 | 1.584 | +11.48% | Use Specific |
| DEF | 1.679 | 1.712 | -1.96% | Use Unified |
| MID | 1.809 | 1.791 | +0.98% | Use Unified |
| FWD | 2.188 | 2.313 | -5.72% | Use Unified |

Only goalkeepers show significant improvement with position-specific modelling. Why?

GKP scoring is fundamentally different:

  • Clean sheets: 4 points (major contributor)
  • Saves: 1 point per 3 saves (variable across GKs)
  • Other positions: Goals/assists dominate

The unified model tries to learn one pattern across all positions; however, GKPs are outliers. Position-specific features like saves_x_opposition_xG and clean_sheet_potential capture nuances the unified model misses.

Threshold decision: I require ≥2% improvement to justify the complexity of a specialised model. Only GKP meets this bar.
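
The routing logic itself is simple. Below is a minimal sketch of the idea behind the HybridPositionModel; the attribute names and the assumption that `position` is a column in the feature frame are mine, not necessarily the pipeline’s.

```python
from dataclasses import dataclass, field
import pandas as pd

@dataclass
class HybridPositionModel:
    """Route predictions to a position-specific model where one exists."""
    unified_model: object                                  # fitted default regressor
    position_models: dict = field(default_factory=dict)   # e.g. {"GKP": gkp_model}

    def predict(self, X: pd.DataFrame) -> pd.Series:
        features = X.drop(columns=["position"])
        preds = pd.Series(self.unified_model.predict(features), index=X.index)
        for pos, model in self.position_models.items():
            mask = X["position"] == pos
            if mask.any():
                preds.loc[mask] = model.predict(features.loc[mask])
        return preds
```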

Model Selection: RandomForest Wins (This Week)

/insights/2026-01-11-fpl-ml-pipeline/model_comparison.png RandomForest selected for production with 1.800 MAE (≈33% lower than a rule-based baseline at ~2.7 MAE). On the GW20–21 holdout, the same baseline comparison is ≈51%.

| Rank | Regressor | MAE | vs RandomForest |
|---|---|---|---|
| 1 | RandomForest | 1.800 | — (selected) |
| 2 | LightGBM | 1.876 | +4.2% |
| 3 | XGBoost | 1.932 | +7.3% |
| 4 | GradientBoosting | 2.041 | +13.5% |
| — | Rule-based baseline | ~2.7 | +~50% |

What is the rule-based baseline? Before building the ML pipeline, I used a simpler heuristic model: it weights recent form (last 5 gameweeks) at 70% and season averages at 30%, multiplied by team strength ratings and fixture difficulty. No learned parameters—just hand-tuned coefficients based on FPL domain knowledge. It serves as a sanity check: if the ML model can’t beat simple heuristics, the added complexity isn’t justified.
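
In code, the heuristic boils down to something like the sketch below. The function signature and the convention that the team-strength and fixture multipliers sit around 1.0 are my assumptions; the 70/30 weighting is from the description above.

```python
def baseline_expected_points(form_5gw_avg: float, season_avg: float,
                             team_strength: float = 1.0,
                             fixture_difficulty: float = 1.0) -> float:
    """Rule-based baseline: 70% recent form + 30% season average,
    scaled by team strength and fixture difficulty multipliers (~1.0)."""
    return (0.7 * form_5gw_avg + 0.3 * season_avg) * team_strength * fixture_difficulty

# e.g. a player averaging 6.0 pts over the last 5 GWs and 4.5 over the season,
# in a slightly favourable fixture:
# baseline_expected_points(6.0, 4.5, team_strength=1.05, fixture_difficulty=1.1)  # ≈ 6.4
```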

RandomForest provides the best MAE and, via ensemble variance, a useful dispersion signal for riskier calls; this matters for captain selection, where I want to identify high-ceiling players.

Validation Methodology: The Numbers Behind the Numbers

Walk-forward validation simulates real-world deployment: train on past data and predict future gameweeks. This prevents look-ahead bias that inflates metrics in standard cross-validation.
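
A condensed sketch of the evaluation loop is below; the long-format dataframe with a `gw` column, the `make_model` factory, and the column names are placeholders rather than the pipeline’s actual API.

```python
from sklearn.metrics import mean_absolute_error

def walk_forward_mae(df, features, target, make_model, test_gws):
    """Train only on strictly earlier gameweeks, then score each test GW."""
    maes = []
    for gw in test_gws:
        train, test = df[df["gw"] < gw], df[df["gw"] == gw]
        model = make_model().fit(train[features], train[target])
        maes.append(mean_absolute_error(test[target], model.predict(test[features])))
    return sum(maes) / len(maes)

# e.g. walk_forward_mae(df, features, "total_points",
#                       lambda: RandomForestRegressor(max_depth=4), [20, 21])
```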

/insights/2026-01-11-fpl-ml-pipeline/learning_curve.png Learning curve showing MAE on GW20-21 holdout as training data increases. The model stabilises around GW14-15 with ~1.5 MAE. Variability across training sizes reflects different hyperparameter configurations per experiment.

| Training Data | Holdout MAE | Spearman | Notes |
|---|---|---|---|
| GW1-10 | 2.04 | 0.60 | Insufficient data |
| GW1-12 | 1.55 | 0.64 | Improvement with more data |
| GW1-14 | 1.88 | 0.62 | Variance in HPO |
| GW1-17 | 1.58 | 0.60 | Stabilising |
| GW1-19 | 1.31 | 0.62 | Production model |
| GW1-21 | 1.53 | 0.65 | Full season data |

Key observations:

  • MAE variance across training sizes: ±0.25 (not stable)
  • The GW1-19 model shows the best holdout performance, possibly due to favourable hyperparameter search
  • More training data doesn’t always improve performance; model configuration matters more

This variability is important context for the headline numbers. The 1.31 MAE isn’t guaranteed; it’s one point in a distribution of possible outcomes.


Section 3: Why Tree Models Win

FPL prediction is a tree-friendly problem. Here’s why gradient boosting and random forests dominate:

1. Non-Linear Interactions Everywhere

/insights/2026-01-11-fpl-ml-pipeline/tree_decision_path.png Example decision path for a hauler prediction. RandomForest checks elite status, fixture difficulty, and recent form to predict 9.2 xP for Haaland against a weak opponent.

Trees naturally capture interactions that linear models miss:

  • Elite × Fixture: Premium players maintain value in tough fixtures
  • Form × Opponent: Recent form amplifies weak-opponent opportunity
  • Minutes × xG: 0 minutes equals 0 points (sharp threshold, not gradual)

2. Categorical Handling

FPL data is inherently categorical:

  • Position (GKP/DEF/MID/FWD) has fundamentally different scoring
  • Team strength (20 teams, non-ordinal relationships)
  • Opponent matchups (categorical fixture interactions)

Trees handle this natively. Linear models need extensive one-hot encoding and lose interaction effects.
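
As a small illustration of the native-handling point (column names are assumed): with pandas category dtypes, LightGBM will split on these columns directly, whereas a linear model would first need one-hot expansion and would still miss the interactions.

```python
import lightgbm as lgb

# `df` is the long-format training frame; column names are illustrative.
X = df[["position", "team", "opponent", "ewm_5gw_points", "points_per_pound"]].copy()
for col in ["position", "team", "opponent"]:
    X[col] = X[col].astype("category")   # LightGBM treats these as categorical splits

model = lgb.LGBMRegressor(n_estimators=500).fit(X, df["total_points"])
```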

3. Custom Objectives for Hauler Identification

Standard MAE treats all errors equally. However, in FPL, missing a 15-point haul costs more than overestimating a 2-point blank.

I implemented an asymmetric loss function for LightGBM: 2x penalty for under-predicting haulers (10+ points). This teaches the model to prioritise capturing explosive performances.
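
The post doesn’t spell out the exact functional form beyond the 2x penalty, but one straightforward way to express the idea as a LightGBM custom objective is a weighted squared-error surrogate; the squared-error form and hyperparameters below are my assumptions.

```python
import numpy as np
import lightgbm as lgb

HAUL_THRESHOLD = 10.0   # points defining a "haul"
HAUL_WEIGHT = 2.0       # extra penalty when a hauler is under-predicted

def asymmetric_haul_objective(y_true, y_pred):
    """Gradient and hessian of a weighted squared error that doubles
    the penalty whenever the model under-predicts a 10+ point score."""
    residual = y_pred - y_true
    weight = np.where((y_true >= HAUL_THRESHOLD) & (residual < 0), HAUL_WEIGHT, 1.0)
    return 2.0 * weight * residual, 2.0 * weight

model = lgb.LGBMRegressor(objective=asymmetric_haul_objective, n_estimators=500)
```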


Section 4: Results & Key Takeaways

Production Performance

Evaluated on GW20-21 holdout (n=1,585 predictions, model trained on GW1-19):

| Metric | ML Model | Baselines | Context |
|---|---|---|---|
| MAE | 1.31 | Rule-based: ~2.7 | 51% lower error |
| Spearman ρ | 0.62 | — | Moderate rank-order |
| Captain accuracy | 0/2 GWs (0%) | Random: 6.7%, Highest-owned: ~25% | See limitations below |
| Hauler precision@15 | 20% | Aspirational: 50-70% | Gap addressed in limitations |

Note: These holdout metrics are more conservative than cross-validation metrics. The model excels at ranking but struggles with identifying the absolute top performer.

Per-Gameweek Breakdown

/insights/2026-01-11-fpl-ml-pipeline/per_gw_mae.png MAE by gameweek on holdout set. GW21 (MAE 1.21) outperformed GW20 (MAE 1.41), showing ~15% variance between gameweeks.

| Gameweek | MAE | Spearman | Captain Correct | Haulers Found |
|---|---|---|---|---|
| GW20 | 1.41 | 0.59 | No | 3/15 (20%) |
| GW21 | 1.21 | 0.64 | No | 2/12 (17%) |

The model missed the top captain pick in both holdout gameweeks. This is the critical failure mode: it ranks the pack well but performs poorly at identifying the leader.

/insights/2026-01-11-fpl-ml-pipeline/calibration_plot.png Calibration plot showing predicted vs actual points. Binned means (orange) track the 45° line reasonably well, indicating predictions are calibrated across the range.

/insights/2026-01-11-fpl-ml-pipeline/residual_distribution.png Residual distribution showing prediction errors are approximately symmetric around zero, with slight right skew (model under-predicts more often than over-predicts).

Key Takeaways

  1. Value efficiency trumps raw talent: points_per_pound is the #1 feature. Build squads around value, not star names.

  2. Recent form with exponential weighting: ewm_5gw_points (#2 feature) captures momentum. The FPL app’s simple “Form” average misses this.

  3. Premium players are fixture-proof: Elite players lose only 20% of ceiling in tough fixtures; hold through bad runs.

  4. Form amplifies fixtures: Target in-form players (35+ pts/5GW) entering easy fixture runs for maximum upside.

  5. 35 features can likely be removed: Zero-importance features (venue strength, rankings, fixture runs) add complexity without improving accuracy.

  6. Custom objectives beat MAE: Asymmetric loss functions that penalise missing haulers align model training with FPL objectives.

  7. Hybrid architecture justified by data: Only GKP benefits from position-specific modelling (11% improvement). DEF/MID/FWD share enough patterns that unified models work better.


Limitations & Where the Model Fails

Not every metric tells a success story. Transparency about failures is as important as reporting wins.

The Honest Numbers

On the GW20-21 holdout set:

| Metric | Aspirational | Actual | Gap |
|---|---|---|---|
| Top-15 overlap | 12-13/15 (80%) | 1.7/15 (11%) | -69% |
| Captain accuracy | 40%+ | 0/2 (0%) | -40% |
| Hauler precision@15 | 50-70% | 20% | -30 to -50% |

The model excels at ranking (Spearman 0.62-0.65) but fails at identifying the absolute top performers. This is a critical distinction: it is good at “Ekitiké will outscore Thiago” but bad at “Ekitiké will be THE top scorer this week.”

Why the Gap?

The model systematically under-predicts high scorers by 5-12 points. Analysis suggests:

  1. Conservative hyperparameters: max_depth=4 prioritises stability over capturing explosive patterns
  2. MAE-based training: Even with custom hauler objectives, the loss function doesn’t sufficiently reward capturing 15+ point hauls
  3. Rare event problem: Haulers (10+ points) are ~8% of observations; the model learns the majority pattern

Note on sample sizes: Evaluation metrics are computed over player-game rows (e.g., thousands of predictions). However, when training position-specific models, the effective diversity is bounded by the number of unique players in that position. Goalkeepers are a small and behaviourally distinct group, which makes specialised modelling more fragile and easier to overfit.

The 35 Zero-Importance Features

I identified 35 features with 0.0 importance across all three methods (MDI, Permutation, SHAP). These include all venue-specific features, all ranking features, and all fixture run projections (3GW, 5GW lookahead).

Why haven’t I removed them?

  1. Ablation study not yet completed to confirm no interaction effects
  2. Backward compatibility with saved models
  3. Some may become predictive with more training data as the season progresses

This is technical debt that I acknowledge.

What I’d Do Differently

  1. Quantile regression: Train separate models for 10th, 50th, 90th percentiles to better capture haul ceiling (see the sketch after this list)
  2. Larger hauler penalty: Increase the asymmetric loss weight from 2x to 5x for missing haulers
  3. Ensemble approach: Combine ML predictions with ownership-weighted baseline
  4. Feature reduction: Remove the 35 zero-importance features before the next retraining
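
A sketch of the quantile-regression idea from point 1, using LightGBM’s built-in quantile objective; the hyperparameters and variable names are placeholders.

```python
import lightgbm as lgb

# One model per quantile; the 0.9 model approximates a player's haul ceiling.
quantile_models = {
    q: lgb.LGBMRegressor(objective="quantile", alpha=q, n_estimators=500)
    for q in (0.1, 0.5, 0.9)
}
# for q, m in quantile_models.items():
#     m.fit(X_train, y_train)
# The gap between the 0.9 and 0.5 predictions is a crude "explosiveness" score.
```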

Players to Watch

Based on ML predictions from the GW1-21 model as of January 11, 2026. These are model suggestions, not guarantees.

The model identifies the following players as high-value picks for the upcoming gameweeks. Remember: the model excels at relative ranking, not predicting the absolute top scorer.

Forwards

| Player | Team | Price | 1GW xP | 3GW xP | Why the Model Likes Them |
|---|---|---|---|---|---|
| Watkins | Aston Villa | £8.7m | 6.8 | 16.2 | Strong form momentum + value efficiency |
| Haaland | Man City | £15.1m | 6.2 | 19.0 | Elite status, fixture-proof, 74% owned template |
| Thiago | Brighton | £7.1m | 5.1 | 15.7 | Excellent 3GW projection, differential |
| Mateta | Crystal Palace | £7.6m | 4.9 | 12.6 | Budget premium alternative |

Midfielders

| Player | Team | Price | 1GW xP | 3GW xP | Why the Model Likes Them |
|---|---|---|---|---|---|
| Bruno Guimarães | Newcastle | £7.2m | 7.1 | 15.3 | Top 1GW pick, underpriced for output |
| Rogers | Aston Villa | £7.6m | 5.8 | 14.1 | Form momentum, favourable fixtures |
| Wirtz | Liverpool | £8.2m | 5.7 | 14.4 | Strong 3GW projection |
| Foden | Man City | £8.7m | 5.7 | 15.1 | Elite × fixture interaction positive |
| Bruno Fernandes | Man Utd | £9.1m | 5.6 | 14.1 | Set-piece threat, consistent floor |
| Bowen | West Ham | £7.7m | 5.1 | 15.4 | Good value at price point |

Defenders

| Player | Team | Price | 1GW xP | 3GW xP | Why the Model Likes Them |
|---|---|---|---|---|---|
| Guéhi | Crystal Palace | £5.3m | 5.2 | 12.1 | High clean sheet potential, 40% owned |
| Gabriel | Arsenal | £6.7m | 5.1 | 13.6 | Set-piece threat + clean sheets |
| Van de Ven | Spurs | £4.6m | 4.9 | 13.1 | Excellent value, 28% owned |
| Cash | Aston Villa | £4.8m | 4.7 | 13.2 | Attacking returns potential |
| J.Timber | Arsenal | £6.3m | 4.6 | 12.6 | Premium defence, rotation risk |

Goalkeepers

| Player | Team | Price | 1GW xP | 3GW xP | Why the Model Likes Them |
|---|---|---|---|---|---|
| A.Becker | Liverpool | £5.4m | 4.5 | 11.6 | Top 1GW projection |
| Vicario | Spurs | £4.8m | 4.3 | 12.8 | Strong 3GW projection |
| Verbruggen | Brighton | £4.5m | 4.0 | 11.9 | Budget with upside |
| Raya | Arsenal | £5.9m | 4.0 | 12.8 | Premium defence, 34% owned |

Disclaimer: Use with caution 😉.


Conclusion

Building an ML system for FPL meant moving away from “predict the average well” and towards something closer to the decision problem: separating likely hauls from the pack. I’m not there yet, but the pipeline is now producing signals that are directionally useful, and it’s made the failure modes very obvious.

This required three design choices:

  1. Domain-aware features (174 total) to capture things the FPL UI flattens away (exponential form weighting, fixture interactions, market signals). The nice surprise: 35 features show zero importance, which is a clear simplification/pruning opportunity.

  2. Custom asymmetric loss functions to penalise missing hauls more than overestimating blanks. This helps, but it still isn’t strong enough: the model continues to under-predict the true upper tail.

  3. A hybrid architecture that treats positions differently. The lift is real, but the constraint is equally real: small samples (such as 58 for GKP) make anything position-specific fragile.

What worked:

  • Ranking quality (Spearman ~0.62): good enough for selection/portfolio construction

  • Clear value signals (points_per_pound dominates)

  • Much better clarity on what matters, via a rigorous three-method importance analysis

What didn’t (yet):

  • Captain choice (0% on holdout)

  • Top-15 identification (11% overlap vs ~80% target)

  • Hauler capture (20% precision vs 50–70% target)

The model is useful for relative ranking, but still weak at calling the single highest-ceiling outcome in a given week, which is exactly where FPL points are won.

What’s next?

Even with a better model, picking an FPL team is still an optimisation problem under hard constraints. This post stops at ranking and signal generation by design. The next step is the optimiser: using simulated annealing to turn noisy model outputs into squad and captaincy decisions under real FPL constraints (budget, positions, and risk appetite).



Model: GW1-21 Hybrid (RandomForest unified + GKP position-specific) · Season: 2025-26 FPL · Generated: January 2026