
2026 / Independent ML project

March Madness Prediction Pipeline

Top 1% Kaggle-style tournament model with leakage-safe features, calibrated ensembles, and Stage 2 submissions

ML / Sports Analytics / CatBoost / LightGBM / Forecasting / Kaggle
March Madness prediction cover

Overview

Built an end-to-end March Machine Learning Mania pipeline for men's and women's NCAA tournament prediction, combining point-in-time feature engineering, calibrated model sweeps, strict walk-forward validation, and final Stage 2 submission generation. The project reached a top 1% result while keeping the modeling workflow auditable and reproducible.

Problem

Tournament prediction is a small-sample, high-variance forecasting problem where leakage is easy to introduce: seeds, ratings, injuries, and late-season signals must be available only as of the prediction date. The goal was to produce well-calibrated win probabilities without letting future tournament outcomes contaminate training.

Role

Individual project: data pipeline, feature engineering, model selection, validation, submission strategy, and reporting

Timeline

2026

Tools

Python / pandas / CatBoost / LightGBM / XGBoost / scikit-learn / pytest

Data

  • Kaggle men's and women's NCAA regular-season, tournament, seed, slot, conference, coach, city, and Massey ordinal files
  • Stage 2 sample submission with 132,133 matchup rows validated against exact sample IDs
  • Point-in-time feature snapshots cached by division, season, and cutoff day
  • Optional external 2026 injury, prospect, and bracket projection snapshots for inference experiments

Approach

  • Created leakage-safe team-season and matchup features from regular-season results only, including Elo, Glicko-like ratings, rating uncertainty, ORtg, DRtg, NetRtg, pace, eFG, turnover, rebounding, free-throw, 3PA, opponent-adjusted margin, conference strength, seed priors, and Massey aggregates
  • Ran multi-family sweeps across logistic elastic net, HistGB, CatBoost, LightGBM, XGBoost, and OOF stacking
  • Compared Platt, isotonic, and beta calibration under walk-forward validation
  • Built four distinct final strategies (balanced, chalk-leaning, upset-leaning, and uncertainty-robust), then generated Stage 2 A/B submissions

Evaluation

  • Strict validation folds: 2022, 2023, 2024, and 2025 with train seasons strictly earlier than the validation season
  • Primary metric: Brier score; secondary metrics: LogLoss and expected calibration error
  • Best stable Stage 2 candidate: CatBoost rating-focused model with mean Brier 0.164668, mean LogLoss 0.494439, and mean ECE 0.032421 across four folds
  • Final strategy audit reported robust strategy mean Brier 0.164916 and balanced strategy mean Brier 0.165139 under four-fold validation
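The walk-forward protocol and the two headline metrics can be sketched as follows. The fold seasons match those listed above; the helper names and the 10-bin ECE variant are illustrative choices, not the project's exact code.

```python
import numpy as np

def brier(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    return float(np.mean((probs - outcomes) ** 2))

def ece(probs, outcomes, n_bins=10):
    """Expected calibration error: frequency-weighted |confidence - accuracy| per bin."""
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if mask.any():
            total += mask.mean() * abs(probs[mask].mean() - outcomes[mask].mean())
    return total

def walk_forward_folds(seasons, val_seasons):
    """Yield (train_seasons, val_season) with training strictly earlier."""
    for val in val_seasons:
        yield [s for s in seasons if s < val], val

# Guardrail: no fold may see the validation season or anything after it.
for train, val in walk_forward_folds(range(2010, 2026), [2022, 2023, 2024, 2025]):
    assert max(train) < val
```

Scoring each fold separately and averaging, rather than pooling predictions, keeps a single anomalous season from dominating the headline numbers.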

Results

  • Achieved top 1% performance with a fully reproducible prediction workflow
  • Produced final Stage 2 submissions A and B plus four strategy submissions for balanced, chalk, upset, and robust risk profiles
  • Built automated reports, figures, validation audits, submission checks, and tests for parsing, leakage guardrails, and output format

Deployment

  • One-command training and report pipeline via Python modules
  • Generated CSV submissions, model summaries, experiment leaderboards, calibration figures, and PDF reports
  • Validation checks ensure exact sample ID alignment and probability bounds before submission
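The pre-submission checks can be sketched as below, assuming pandas DataFrames with the Kaggle-convention ID and Pred columns (column names per the competition's sample file, not verified against the project code).

```python
import pandas as pd

def validate_submission(sub: pd.DataFrame, sample: pd.DataFrame) -> None:
    """Fail fast if the submission would be rejected or silently misaligned."""
    # Exact sample ID alignment: same IDs, same order, same row count.
    if not sub["ID"].reset_index(drop=True).equals(sample["ID"].reset_index(drop=True)):
        raise ValueError("submission IDs do not match the sample file exactly")
    # Probability bounds: every prediction must be a finite value in [0, 1].
    preds = sub["Pred"]
    if preds.isna().any() or not preds.between(0.0, 1.0).all():
        raise ValueError("predictions must be finite probabilities in [0, 1]")

sample = pd.DataFrame({"ID": ["2026_1101_1102", "2026_1101_1103"], "Pred": [0.5, 0.5]})
sub = sample.assign(Pred=[0.62, 0.41])
validate_submission(sub, sample)  # passes silently when all checks hold
```

Running this before every upload turns a late-night formatting mistake into an immediate, explainable error instead of a lost Stage 2 slot.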

Limitations

  • Tournament sample size remains limited and season-to-season variance is high
  • External injury mapping is noisy, especially for women's coverage
  • Some strategy-level metrics include additional calibration-selection layers and are treated directionally rather than as direct single-run comparisons

Evidence

Final strategy metrics and comparison
Model-family results across experiment sweeps
Calibration curve for the balanced strategy
Sensitivity analysis for injury signal experiments

Repro Steps

  • Install project requirements and place Kaggle competition CSVs under data/raw
  • Run python -m src.experiments.stage2_finalize --asof 2026-02-21 --budget 35 --rebuild_features
  • Validate submissions against SampleSubmissionStage2.csv and inspect outputs/reports

Next Steps

  • Add minute-level player availability priors mapped to possession-level impact
  • Expand uncertainty modeling with bootstrap and fold variance
  • Increase sweep budget with early stopping and run-time pruning
  • Add explicit monotonic constraints for seed features in LightGBM variants
View repository ->