What 174 features taught me about FPL prediction, and where ML still fails
~10 min read
tl;dr - Key Takeaways
If you don’t want the technical details:
Early signals for GW22+: Bruno Guimarães, Watkins, Guéhi, Vicario (see the full list below).
Value efficiency beats star names (see “Magnificent Seven”): The model’s #1 signal is points-per-pound, not raw talent. A £6m midfielder averaging 5 pts/GW is often better than a £12m premium averaging 7.
Recent form matters more than season averages: The last 2 gameweeks are weighted 2x more than games 4-5 weeks ago. Chase momentum, not reputation.
Premium players are fixture-proof (see SHAP finding 1): Elite players (£10m+) only lose ~20% of their expected points in tough fixtures. Don’t bench Salah against City.
Budget players need fixture rotation (see SHAP finding 3): Unlike premiums, budget picks show 2.5x swings based on opponent. Rotate your £5-7m players aggressively.
Caveat: The model ranks players well but can’t reliably pick the single best captain each week (0% accuracy on holdout).
Want the technical details? Read on.
The Problem: No Differentiation
Standard regression models trained on Fantasy Premier League data have a curious failure mode: they predict everyone will score 5-7 points with no meaningful differentiation.
This makes sense from an optimisation perspective. Most FPL players score between 2 and 6 points per gameweek, so a model that predicts 5.2 for everyone achieves a respectable Mean Absolute Error (MAE) while being useless in practice - it tells you nothing about which players to pick.
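To make that failure mode concrete, here is a minimal toy sketch (synthetic scores, not real FPL data) showing that a flat prediction earns a passable error score while carrying zero ranking information:

```python
import numpy as np

# Toy illustration: most players return 1-6 points, a handful haul (10+).
# These numbers are made up purely to demonstrate the failure mode.
rng = np.random.default_rng(42)
actual = np.concatenate([
    rng.integers(1, 7, size=460),   # the bulk: modest 1-6 point returns
    rng.integers(10, 18, size=40),  # the rare haulers (10+ points)
]).astype(float)

constant_pred = np.full_like(actual, 5.2)   # "everyone scores 5.2"
mae = np.mean(np.abs(actual - constant_pred))

# A flat prediction gets a respectable MAE, but it cannot rank players at all.
print(f"MAE of a flat 5.2 prediction: {mae:.2f}")
```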
FPL players don’t need perfect point predictions. They need to identify haulers: the explosive 10+ point performances that separate top managers from the rest. Or, at minimum, reliably rank players so the best options rise to the top. Analysis of top 1% FPL managers reveals the stakes:
| Metric | Top 1% | Average | Difference |
|---|---|---|---|
| Captain Points per GW | 18.7 | 15.0 | +24% |
| Haulers per GW | 3.0 | 2.4 | +25% |
Top managers don’t just find more haulers; they captain the right players more often.
This post describes an ML pipeline I built that aims to address both sides of the problem: maintaining reasonable point-prediction accuracy while deliberately optimising for differentiation. In practice, this means trading some average error for better separation between players - particularly in the upper tail, where haulers live. For reference, the model achieves 1.80 MAE under cross-validation and 1.31 MAE on a GW20–21 hold-out, outperforming a simple rule-based baseline, but those figures are a sanity check rather than the goal. The core design choices focus on domain-aware feature engineering, position-specific modelling, and loss functions that penalise missed explosive performances.
A Note on Design Philosophy
This approach extrapolates from historical data; it assumes past patterns carry forward. Like any such model, it has a ceiling: top 1% managers may incorporate information the model can’t see (injury whispers, press conferences, watching matches). The model trades that potential upside for consistency.
Section 1: The Feature Engineering Story
174 Features Sounds Like Bloat. It (Mostly) Isn’t.
The model uses 174 engineered features across 8 development phases. Each phase solves a specific blind spot that off-the-shelf models miss.
174 features across 13 categories. Green = high importance features, orange = medium importance.
What makes this feature set effective isn’t the count; it’s the systematic iteration. Each phase targeted a known predictive gap.
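For a flavour of what these features look like, here is a minimal sketch of three of them. The DataFrame columns, the exponential halflife, and the function name are illustrative guesses, not the pipeline’s actual schema or parameters:

```python
import pandas as pd

def add_form_features(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of three engineered features on a per-player, per-gameweek frame
    with 'player_id', 'gw', 'total_points', 'minutes' and 'price' columns."""
    df = df.sort_values(["player_id", "gw"]).copy()
    grouped = df.groupby("player_id")

    # Exponentially weighted form: recent gameweeks dominate older ones.
    # shift(1) keeps the feature strictly backward-looking (no leakage).
    df["ewm_5gw_points"] = grouped["total_points"].transform(
        lambda s: s.shift(1).ewm(halflife=2, min_periods=1).mean()
    )

    # Rolling minutes: playing time is destiny.
    df["rolling_5gw_minutes"] = grouped["minutes"].transform(
        lambda s: s.shift(1).rolling(5, min_periods=1).mean()
    )

    # Value efficiency: expected form per pound of price.
    df["points_per_pound"] = df["ewm_5gw_points"] / df["price"]
    return df
```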
The Magnificent Seven: Features That Actually Matter
I ran three independent importance analyses: Mean Decrease Impurity (MDI), Permutation Importance, and SHAP values. Seven features consistently appeared in the top 10 across all three methods, with strong agreement on relative importance:
| Rank | Feature | Signal Strength | Why It Matters |
|---|---|---|---|
| 1 | points_per_pound | High | Value efficiency beats raw talent. A £6m player with 5 xP > £12m player with 8 xP for squad building. |
| 2 | ewm_5gw_points | Medium-High | Exponential weighted form. Last 2 games matter 2x more than games 4-5 weeks ago. |
| 3 | value_vs_position | Medium-High | Positional context. A 4-point defender means something different than a 4-point forward. |
| 4 | clean_sheet_potential | Medium | Defensive security matters as much as attacking returns for DEF/GKP. |
| 5 | cs_x_minutes | Medium | Clean sheet probability × minutes = defensive ceiling. |
| 6 | rolling_5gw_minutes | Medium | Playing time is destiny. No minutes, no points. |
| 7 | rolling_5gw_ict_index | Medium | ICT captures overall involvement: goals, assists, threat, creativity. |
Key insight: These features dominate across all three methods. If I had to rebuild with a minimal set, these would be my starting point.
The top 3 features account for 39% of total feature importance. Notice what dominates: value metrics and recent form. The model cares mostly about value efficiency and momentum.
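For readers who want to reproduce this kind of analysis, here is a minimal sketch of the three importance methods. The toy data and model stand in for the real pipeline; only the method calls are the point:

```python
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Toy data: random features, only the first two actually drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# 1. Mean Decrease Impurity (MDI): built into sklearn tree ensembles.
mdi = model.feature_importances_

# 2. Permutation importance: score drop when a feature is shuffled.
perm = permutation_importance(
    model, X, y, scoring="neg_mean_absolute_error", n_repeats=10, random_state=0
).importances_mean

# 3. SHAP: mean absolute contribution of each feature to each prediction.
shap_vals = shap.TreeExplainer(model).shap_values(X)
shap_imp = np.abs(shap_vals).mean(axis=0)

# Features that rank near the top under all three methods are the robust signals.
print(np.argsort(mdi)[::-1], np.argsort(perm)[::-1], np.argsort(shap_imp)[::-1])
```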
What Surprised Me: The Underperformers
Several feature categories I expected to be important… weren’t:
| Feature Category | Expected Rank | Actual Rank | What Happened |
|---|---|---|---|
| Betting Odds | Top 20 | #40-70 | Model learns fixture quality from base stats; doesn’t need market odds |
| Haul Rates | Top 20 | #100-120 | Current form beats historical haul patterns |
| Elite Interactions | Top 20 | #44-138 | Premium players don’t respond uniquely to fixtures; they just scale with price |
| Opponent Vulnerability vs Top 6 | Top 20 | #63-92 | Generic fixture difficulty works just as well |
Even more annoyingly: 35 features had 0.0 importance across all three methods. These include all venue-specific features, all ranking features, and all fixture run projections (3GW, 5GW lookahead).
Practical implication: I can likely remove ~20% of features (174 → 139) with minimal impact, but I still need to confirm via ablation.
SHAP Reveals the Non-Linear Truth
Some interesting insights come from SHAP dependence plots, which reveal relationships that linear models and simple rules can’t capture.
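The plots in the findings below are the kind produced by the shap library’s dependence plot. As a rough sketch, reusing the toy model and SHAP values from the importance example above (in the real pipeline the axes would be features like opponent difficulty, coloured by price tier):

```python
import shap

# Continuing the toy example: how one feature's SHAP contribution varies with
# its value, coloured by a second, interacting feature.
shap.dependence_plot(
    0,                    # feature shown on the x-axis (column index here)
    shap_vals,            # SHAP values computed earlier
    X,                    # the feature matrix
    interaction_index=1,  # colour points by a second feature
)
```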
Finding 1: Elite Players Are Fixture-Proof
SHAP dependence plot showing elite player (£10m+) performance across fixture difficulty levels.
Conventional FPL wisdom treats tough fixtures as near-automatic blanks, with managers often benching players facing top-6 opposition. The SHAP analysis reveals a more nuanced reality:
- Premium players (£10m+) lose only 20% of their ceiling in hard fixtures
- Elite players maintain SHAP values of +0.1 to +0.3 even against top defences
- The penalty is steepest at mid-range difficulty (2.5-3.5), not at extremes
Example: A premium striker facing a top-6 defence might be projected at 8.2 xP versus 10.1 xP in an easy fixture - a 19% drop, not the dramatic reduction that conventional wisdom suggests.
Strategic implication: Hold premium assets through 2-3 tough fixtures; their floor remains high.
Finding 2: Form Amplifies Fixture Quality
SHAP dependence plot showing recent form × opponent vulnerability interaction.
The relationship between form and fixture isn’t additive; it’s multiplicative:
| Form (5GW points) | Weak Opponent | Strong Opponent | Ratio |
|---|---|---|---|
| 35+ pts | +2.0 to +3.0 SHAP | +0.5 to +1.0 SHAP | 3x boost |
| 25-35 pts | +0.5 to +1.0 SHAP | -0.5 to +0.5 SHAP | 2x boost |
| <25 pts | -0.5 to +0.5 SHAP | -1.5 to -0.5 SHAP | 1.5x penalty |
A player with 35+ points in 5 gameweeks gets a 2-3x boost from weak opponents versus strong ones.
Strategic implication: The upside is real, but it’s rarely “free”. When form is obvious, prices and ownership move quickly, so the edge is often in spotting early momentum, not buying after the crowd.
Finding 3: Budget Players Are Fixture-Dependent
SHAP dependence plot showing opponent vulnerability × price tier interaction.
The SHAP spread tells the story:
- Budget players (£4m-£7m): SHAP ranges from -1.0 to +1.5 (2.5 unit swing)
- Elite players (£10m+): SHAP ranges from +0.1 to +1.0 (0.9 unit swing, always positive)
Elite players maintain positive SHAP even against the best defences. Budget players go negative.
Strategic implication: Rotate budget picks based on fixture runs; set-and-forget premiums.
Section 2: The Training Pipeline
Two Phases: Evaluate, Then Retrain
The training pipeline follows a disciplined two-phase approach:

Full training pipeline from data split to production. Phase 1 evaluates all models in parallel using walk-forward validation. Phase 2 retrains the selected hybrid architecture on the full dataset.
Phase 1: Evaluation (GW1-19 training, GW20-21 holdout)
- Train 4 unified models in parallel (RandomForest, LightGBM, XGBoost, GradientBoosting)
- Train 4 position-specific models per position (16 total)
- Evaluate on holdout using walk-forward validation
- Decision: Which positions benefit from specialisation?
Phase 2: Production (GW1-21 full training)
- Retrain selected models on all available data
- Build HybridPositionModel router
- Deploy with position-based prediction routing
Note on architecture selection: The pipeline re-evaluates every gameweek. The hybrid configuration (GKP position-specific, others unified) shown here is from this specific training run; next week’s optimal configuration may differ. Treat this as one valid configuration, not the optimal one.
Why GKP Gets Its Own Model (This Week)
Position-specific vs unified model performance. Only GKP exceeds the 2% improvement threshold, justifying a hybrid architecture with one position-specific model.
| Position | Unified MAE | Specific MAE | Improvement | Decision |
|---|---|---|---|---|
| GKP | 1.789 | 1.584 | +11.48% | Use Specific |
| DEF | 1.679 | 1.712 | -1.96% | Use Unified |
| MID | 1.809 | 1.791 | +0.98% | Use Unified |
| FWD | 2.188 | 2.313 | -5.72% | Use Unified |
Only goalkeepers show significant improvement with position-specific modelling. Why?
GKP scoring is fundamentally different:
- Clean sheets: 4 points (major contributor)
- Saves: 1 point per 3 saves (variable across GKs)
- Other positions: Goals/assists dominate
The unified model tries to learn one pattern across all positions; however, GKPs are outliers. Position-specific features like saves_x_opposition_xG and clean_sheet_potential capture nuances the unified model misses.
Threshold decision: I require ≥2% improvement to justify the complexity of a specialised model. Only GKP meets this bar.
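The selection rule and the routing it produces are simple enough to sketch. The class and function names below are illustrative, not the pipeline’s actual implementation:

```python
from dataclasses import dataclass
import numpy as np

IMPROVEMENT_THRESHOLD = 0.02  # a specific model must beat the unified MAE by >= 2%

def positions_worth_specialising(unified_mae: dict, specific_mae: dict) -> list:
    """Return the positions whose position-specific model clears the 2% bar."""
    return [
        pos for pos, u in unified_mae.items()
        if (u - specific_mae[pos]) / u >= IMPROVEMENT_THRESHOLD
    ]

@dataclass
class HybridPositionModel:
    unified_model: object   # fitted regressor covering all positions
    position_models: dict   # e.g. {"GKP": fitted_gkp_regressor}

    def predict(self, X, positions) -> np.ndarray:
        """Route each row to its position-specific model if one exists."""
        preds = np.asarray(self.unified_model.predict(X), dtype=float)
        for pos, model in self.position_models.items():
            mask = np.asarray(positions) == pos
            if mask.any():
                preds[mask] = model.predict(X[mask])
        return preds

# With the MAE table above, only GKP clears the threshold:
print(positions_worth_specialising(
    {"GKP": 1.789, "DEF": 1.679, "MID": 1.809, "FWD": 2.188},
    {"GKP": 1.584, "DEF": 1.712, "MID": 1.791, "FWD": 2.313},
))  # -> ['GKP']
```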
Model Selection: RandomForest Wins (This Week)
RandomForest selected for production with 1.800 MAE (≈33% lower than a rule-based baseline at ~2.7 MAE). On the GW20–21 holdout, the same baseline comparison is ≈51%.
| Rank | Regressor | MAE | vs RandomForest |
|---|---|---|---|
| 1 | RandomForest | 1.800 | — (selected) |
| 2 | LightGBM | 1.876 | +4.2% |
| 3 | XGBoost | 1.932 | +7.3% |
| 4 | GradientBoosting | 2.041 | +13.5% |
| — | Rule-based baseline | ~2.7 | +50% |
What is the rule-based baseline? Before building the ML pipeline, I used a simpler heuristic model: it weights recent form (last 5 gameweeks) at 70% and season averages at 30%, multiplied by team strength ratings and fixture difficulty. No learned parameters—just hand-tuned coefficients based on FPL domain knowledge. It serves as a sanity check: if the ML model can’t beat simple heuristics, the added complexity isn’t justified.
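In code, the baseline amounts to a one-line formula. The 70/30 weights come from the description above; the scaling of team strength and fixture difficulty into multipliers is an assumption for illustration:

```python
def baseline_expected_points(
    last5_avg_points: float,    # mean points over the last 5 gameweeks
    season_avg_points: float,   # mean points over the season so far
    team_strength: float,       # e.g. 0.8 (weak side) to 1.2 (strong side)
    fixture_difficulty: float,  # e.g. 0.8 (hard fixture) to 1.2 (easy fixture)
) -> float:
    """Hand-tuned heuristic: blended form scaled by team and fixture context."""
    form_blend = 0.7 * last5_avg_points + 0.3 * season_avg_points
    return form_blend * team_strength * fixture_difficulty

# Example: an in-form player in a strong team with an easy fixture.
print(baseline_expected_points(6.0, 4.5, 1.1, 1.15))  # ~7.0 expected points
```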
RandomForest provides the best MAE and, via ensemble variance, a useful dispersion signal: the spread of per-tree predictions helps flag high-ceiling players, which matters for riskier calls like captaincy.
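A minimal sketch of that dispersion signal, assuming `forest` is a fitted sklearn RandomForestRegressor and `X_candidates` holds the feature rows for captaincy candidates (both names are placeholders):

```python
import numpy as np

def mean_and_spread(forest, X_candidates):
    """Mean and standard deviation of the per-tree predictions."""
    per_tree = np.stack([tree.predict(X_candidates) for tree in forest.estimators_])
    return per_tree.mean(axis=0), per_tree.std(axis=0)

# High mean + wide spread  -> high-ceiling but risky captain pick.
# High mean + narrow spread -> a safer, lower-variance choice.
```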
Validation Methodology: The Numbers Behind the Numbers
Walk-forward validation simulates real-world deployment: train on past data and predict future gameweeks. This prevents look-ahead bias that inflates metrics in standard cross-validation.
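A sketch of the splitting logic, assuming one row per player per gameweek with a `gw` column (names and fold boundaries are illustrative):

```python
def walk_forward_splits(df, first_test_gw=10, last_test_gw=21, horizon=2):
    """Yield (train, test) folds where training data always precedes the test gameweeks."""
    for test_start in range(first_test_gw, last_test_gw + 1, horizon):
        train = df[df["gw"] < test_start]
        test = df[(df["gw"] >= test_start) & (df["gw"] < test_start + horizon)]
        yield train, test

# The GW20-21 holdout corresponds to the fold with test_start == 20,
# i.e. training on GW1-19 and evaluating on GW20-21.
```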
Learning curve showing MAE on GW20-21 holdout as training data increases. The model stabilises around GW14-15 with ~1.5 MAE. Variability across training sizes reflects different hyperparameter configurations per experiment.
| Training Data | Holdout MAE | Spearman | Notes |
|---|---|---|---|
| GW1-10 | 2.04 | 0.60 | Insufficient data |
| GW1-12 | 1.55 | 0.64 | Improvement with more data |
| GW1-14 | 1.88 | 0.62 | Variance in HPO |
| GW1-17 | 1.58 | 0.60 | Stabilising |
| GW1-19 | 1.31 | 0.62 | Production model |
| GW1-21 | 1.53 | 0.65 | Full season data |
Key observations:
- MAE variance across training sizes: ±0.25 (not stable)
- The GW1-19 model shows the best holdout performance, possibly due to favourable hyperparameter search
- More training data doesn’t always improve performance; model configuration matters more
This variability is important context for the headline numbers. The 1.31 MAE isn’t guaranteed; it’s one point in a distribution of possible outcomes.
Section 3: Why Tree Models Win
FPL prediction is a tree-friendly problem. Here’s why gradient boosting and random forests dominate:
1. Non-Linear Interactions Everywhere
Example decision path for a hauler prediction. RandomForest checks elite status, fixture difficulty, and recent form to predict 9.2 xP for Haaland against a weak opponent.
Trees naturally capture interactions that linear models miss:
- Elite × Fixture: Premium players maintain value in tough fixtures
- Form × Opponent: Recent form amplifies weak-opponent opportunity
- Minutes × xG: 0 minutes equals 0 points (sharp threshold, not gradual)
2. Categorical Handling
FPL data is inherently categorical:
- Position (GKP/DEF/MID/FWD) has fundamentally different scoring
- Team strength (20 teams, non-ordinal relationships)
- Opponent matchups (categorical fixture interactions)
Trees handle this natively. Linear models need extensive one-hot encoding and lose interaction effects.
3. Custom Objectives for Hauler Identification
Standard MAE treats all errors equally. However, in FPL, missing a 15-point haul costs more than overestimating a 2-point blank.
I implemented an asymmetric loss function for LightGBM: 2x penalty for under-predicting haulers (10+ points). This teaches the model to prioritise capturing explosive performances.
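One way to express this as a LightGBM custom objective is a weighted squared error where under-predicted haulers count double. This is a sketch of the idea, not the pipeline’s exact loss:

```python
import numpy as np
from lightgbm import LGBMRegressor

HAUL_THRESHOLD = 10.0    # points defining a "haul"
UNDER_HAUL_WEIGHT = 2.0  # extra penalty for under-predicting a hauler

def asymmetric_haul_objective(y_true, y_pred):
    """Weighted squared error: under-predictions of 10+ point hauls count 2x."""
    residual = y_pred - y_true
    weight = np.where((y_true >= HAUL_THRESHOLD) & (residual < 0),
                      UNDER_HAUL_WEIGHT, 1.0)
    grad = 2.0 * weight * residual
    hess = 2.0 * weight
    return grad, hess

model = LGBMRegressor(objective=asymmetric_haul_objective, n_estimators=500)
# model.fit(X_train, y_train)  # X_train / y_train as produced by the pipeline
```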
Section 4: Results & Key Takeaways
Production Performance
Evaluated on GW20-21 holdout (n=1,585 predictions, model trained on GW1-19):
| Metric | ML Model | Baselines | Context |
|---|---|---|---|
| MAE | 1.31 | Rule-based: ~2.7 | 51% lower error |
| Spearman ρ | 0.62 | — | Moderate rank-order |
| Captain accuracy | 0/2 GWs (0%) | Random: 6.7%, Highest-owned: ~25% | See limitations below |
| Hauler precision@15 | 20% | Aspirational: 50-70% | Gap addressed in limitations |
Note: These holdout metrics are more conservative than cross-validation metrics. The model excels at ranking but struggles with identifying the absolute top performer.
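For reference, the per-gameweek metrics reported here can be computed from the predicted and actual point arrays for a single gameweek. A sketch, with thresholds matching the definitions used in this post (haul = 10+ points, top-K = 15):

```python
import numpy as np
from scipy.stats import spearmanr

def gameweek_metrics(pred, actual, k=15, haul=10):
    """MAE, rank correlation, top-K overlap, hauler precision@K, captain hit."""
    pred, actual = np.asarray(pred, dtype=float), np.asarray(actual, dtype=float)

    mae = np.mean(np.abs(pred - actual))
    rho, _ = spearmanr(pred, actual)

    top_k_pred = np.argsort(pred)[::-1][:k]
    top_k_actual = np.argsort(actual)[::-1][:k]
    overlap = len(set(top_k_pred) & set(top_k_actual)) / k

    precision_at_k = np.mean(actual[top_k_pred] >= haul)    # hauler precision@K
    captain_correct = np.argmax(pred) == np.argmax(actual)  # single best pick

    return mae, rho, overlap, precision_at_k, captain_correct
```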
Per-Gameweek Breakdown
MAE by gameweek on holdout set. GW21 (MAE 1.21) outperformed GW20 (MAE 1.41), showing ~15% variance between gameweeks.
| Gameweek | MAE | Spearman | Captain Correct | Haulers Found |
|---|---|---|---|---|
| GW20 | 1.41 | 0.59 | ❌ | 3/15 (20%) |
| GW21 | 1.21 | 0.64 | ❌ | 2/12 (17%) |
The model missed the top captain pick in both holdout gameweeks. This is the critical failure mode: it ranks the pack well but performs poorly at identifying the leader.
Calibration plot showing predicted vs actual points. Binned means (orange) track the 45° line reasonably well, indicating predictions are calibrated across the range.
Residual distribution showing prediction errors are approximately symmetric around zero, with slight right skew (model under-predicts more often than over-predicts).
Key Takeaways
- Value efficiency trumps raw talent: points_per_pound is the #1 feature. Build squads around value, not star names.
- Recent form with exponential weighting: ewm_5gw_points (#2 feature) captures momentum. The FPL app’s simple “Form” average misses this.
- Premium players are fixture-proof: Elite players lose only 20% of ceiling in tough fixtures; hold through bad runs.
- Form amplifies fixtures: Target in-form players (35+ pts/5GW) entering easy fixture runs for maximum upside.
- 35 features can be removed: Zero-importance features (venue strength, rankings, fixture runs) add complexity without improving accuracy.
- Custom objectives beat MAE: Asymmetric loss functions that penalise missing haulers align model training with FPL objectives.
- Hybrid architecture justified by data: Only GKP benefits from position-specific modelling (11% improvement). DEF/MID/FWD share enough patterns that unified models work better.
Limitations & Where the Model Fails
Not every metric tells a success story. Transparency about failures is as important as reporting wins.
The Honest Numbers
On the GW20-21 holdout set:
| Metric | Aspirational | Actual | Gap |
|---|---|---|---|
| Top-15 overlap | 12-13/15 (80%) | 1.7/15 (11%) | -69% |
| Captain accuracy | 40%+ | 0/2 (0%) | -40% |
| Hauler precision@15 | 50-70% | 20% | -30 to -50% |
The model excels at ranking (Spearman 0.62-0.65) but fails at identifying the absolute top performers. This is a critical distinction: it is good at “Ekitiké will outscore Thiago” but bad at “Ekitiké will be THE top scorer this week.”
Why the Gap?
The model systematically under-predicts high scorers by 5-12 points. Analysis suggests:
- Conservative hyperparameters: max_depth=4 prioritises stability over capturing explosive patterns
- MAE-based training: Even with custom hauler objectives, the loss function doesn’t sufficiently reward capturing 15+ point hauls
- Rare event problem: Haulers (10+ points) are ~8% of observations; the model learns the majority pattern
Note on sample sizes: Evaluation metrics are computed over player-game rows (e.g., thousands of predictions). However, when training position-specific models, the effective diversity is bounded by the number of unique players in that position. Goalkeepers are a small and behaviourally distinct group, which makes specialised modelling more fragile and easier to overfit.
The 35 Zero-Importance Features
I identified 35 features with 0.0 importance across all three methods (MDI, Permutation, SHAP). These include all venue-specific features, all ranking features, and all fixture run projections (3GW, 5GW lookahead).
Why haven’t I removed them?
- Ablation study not yet completed to confirm no interaction effects
- Backward compatibility with saved models
- Some may become predictive with more training data as the season progresses
This is technical debt that I acknowledge.
What I’d Do Differently
- Quantile regression: Train separate models for the 10th, 50th, and 90th percentiles to better capture haul ceiling (sketched after this list)
- Larger hauler penalty: Increase the asymmetric loss weight from 2x to 5x for missing haulers
- Ensemble approach: Combine ML predictions with ownership-weighted baseline
- Feature reduction: Remove the 35 zero-importance features before the next retraining
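As a rough illustration of the first item, quantile regression is directly supported by LightGBM’s built-in objective; the hyperparameters below are placeholders, not tuned values:

```python
from lightgbm import LGBMRegressor

# One model per percentile: the 0.90 model estimates each player's haul ceiling.
quantile_models = {
    q: LGBMRegressor(objective="quantile", alpha=q, n_estimators=500)
    for q in (0.10, 0.50, 0.90)
}
# for q, m in quantile_models.items():
#     m.fit(X_train, y_train)                       # pipeline training data
# ceiling = quantile_models[0.90].predict(X_next)   # per-player haul ceiling
```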
Players to Watch
Based on ML predictions from the GW1-21 model as of January 11, 2026. These are model suggestions, not guarantees.
The model identifies the following players as high-value picks for the upcoming gameweeks. Remember: the model excels at relative ranking, not predicting the absolute top scorer.
Forwards
| Player | Team | Price | 1GW xP | 3GW xP | Why the Model Likes Them |
|---|---|---|---|---|---|
| Watkins | Aston Villa | £8.7m | 6.8 | 16.2 | Strong form momentum + value efficiency |
| Haaland | Man City | £15.1m | 6.2 | 19.0 | Elite status, fixture-proof, 74% owned template |
| Thiago | Brighton | £7.1m | 5.1 | 15.7 | Excellent 3GW projection, differential |
| Mateta | Crystal Palace | £7.6m | 4.9 | 12.6 | Budget premium alternative |
Midfielders
| Player | Team | Price | 1GW xP | 3GW xP | Why the Model Likes Them |
|---|---|---|---|---|---|
| Bruno Guimarães | Newcastle | £7.2m | 7.1 | 15.3 | Top 1GW pick, underpriced for output |
| Rogers | Aston Villa | £7.6m | 5.8 | 14.1 | Form momentum, favourable fixtures |
| Wirtz | Liverpool | £8.2m | 5.7 | 14.4 | Strong 3GW projection |
| Foden | Man City | £8.7m | 5.7 | 15.1 | Elite x fixture interaction positive |
| Bruno Fernandes | Man Utd | £9.1m | 5.6 | 14.1 | Set-piece threat, consistent floor |
| Bowen | West Ham | £7.7m | 5.1 | 15.4 | Good value at price point |
Defenders
| Player | Team | Price | 1GW xP | 3GW xP | Why the Model Likes Them |
|---|---|---|---|---|---|
| Guéhi | Crystal Palace | £5.3m | 5.2 | 12.1 | High clean sheet potential, 40% owned |
| Gabriel | Arsenal | £6.7m | 5.1 | 13.6 | Set-piece threat + clean sheets |
| Van de Ven | Spurs | £4.6m | 4.9 | 13.1 | Excellent value, 28% owned |
| Cash | Aston Villa | £4.8m | 4.7 | 13.2 | Attacking returns potential |
| J.Timber | Arsenal | £6.3m | 4.6 | 12.6 | Premium defence, rotation risk |
Goalkeepers
| Player | Team | Price | 1GW xP | 3GW xP | Why the Model Likes Them |
|---|---|---|---|---|---|
| A.Becker | Liverpool | £5.4m | 4.5 | 11.6 | Top 1GW projection |
| Vicario | Spurs | £4.8m | 4.3 | 12.8 | Strong 3GW projection |
| Verbruggen | Brighton | £4.5m | 4.0 | 11.9 | Budget with upside |
| Raya | Arsenal | £5.9m | 4.0 | 12.8 | Premium defence, 34% owned |
Disclaimer: Use with caution 😉.
Conclusion
Building an ML system for FPL meant moving away from “predict the average well” and towards something closer to the decision problem: separating likely hauls from the pack. I’m not there yet, but the pipeline is now producing signals that are directionally useful, and it’s made the failure modes very obvious.
This required three design choices:
Domain-aware features (174 total) to capture things the FPL UI flattens away (exponential form weighting, fixture interactions, market signals). The nice surprise: 35 features show zero importance, which is a clear simplification/pruning opportunity.
Custom asymmetric loss functions to penalise missing hauls more than overestimating blanks. This helps, but it still isn’t strong enough: the model continues to under-predict the true upper tail.
A hybrid architecture that treats positions differently. The lift is real, but the constraint is equally real: small samples (such as 58 for GKP) make anything position-specific fragile.
What worked:
- Ranking quality (Spearman ~0.62): good enough for selection/portfolio construction
- Clear value signals (points_per_pound dominates)
- Much better clarity on what matters, via a rigorous three-method importance analysis
What didn’t (yet):
- Captain choice (0% on holdout)
- Top-15 identification (11% overlap vs ~80% target)
- Hauler capture (20% precision vs 50–70% target)
The model is useful for relative ranking, but still weak at calling the single highest-ceiling outcome in a given week, which is exactly where FPL points are won.
What’s next?
Even with a better model, picking an FPL team is still an optimisation problem under hard constraints. This post stops at ranking and signal generation by design. The next step is the optimiser: using simulated annealing to turn noisy model outputs into squad and captaincy decisions under real FPL constraints (budget, positions, and risk appetite).
Model: GW1-21 Hybrid (RandomForest unified + GKP position-specific)
Season: 2025-26 FPL
Generated: January 2026