nfl_batchgineering

An end-to-end NFL game prediction system targeting point spread and totals betting. The pipeline ingests play-by-play and schedule data from nflverse, transforms it through a layered dbt project in Snowflake, trains a hybrid ensemble ML model, and generates weekly predictions — all orchestrated by Dagster.

Primary Goal: Achieve >52.4% ATS (against-the-spread) accuracy to overcome the standard -110 juice: risking 110 to win 100 puts the break-even win rate at 110/210 ≈ 52.38%.


Tech Stack

Layer                   Technology
Data Warehouse          Snowflake
Data Transformations    dbt (dbt-snowflake)
Data Ingestion          Python · Polars · nflverse parquet releases
ML Models               XGBoost · PyMC (Bayesian) · PyTorch (Neural Net) · scikit-learn
Orchestration           Dagster (assets, jobs, schedules)
Code Quality            ruff (lint + format)
Logging                 loguru
Auth                    Snowflake key-pair authentication
Runtime                 Python 3.11+

Repository Structure

nfl_batchgineering/
├── dagster_project/               # Dagster orchestration definitions
│   ├── __init__.py                # Definitions entry point (assets, jobs, schedules)
│   ├── constants.py               # Shared path constants
│   ├── assets/
│   │   ├── ingestion.py           # nflverse → Snowflake raw assets
│   │   ├── dbt_assets.py          # dbt model assets (auto-generated from manifest)
│   │   ├── ml_training.py         # XGBoost training asset
│   │   └── predictions.py         # Weekly predictions asset
│   ├── jobs/
│   │   └── weekly_pipeline.py     # Full pipeline job definition
│   ├── resources/
│   │   └── dbt_resource.py        # dbt CLI resource configuration
│   └── schedules/
│       └── weekly_schedule.py     # Tuesday 8AM NFL-season schedule
│
├── dbt_project/                   # dbt transformations
│   ├── dbt_project.yml            # Project config (lookback: 10 seasons)
│   ├── models/
│   │   ├── 1_staging/nflverse/    # Cleaned views over raw Snowflake tables
│   │   │   ├── stgnv_play_by_play.sql
│   │   │   ├── stgnv_player_summary_stats.sql
│   │   │   ├── stgnv_team_summary_stats.sql
│   │   │   ├── stgnv_rosters.sql
│   │   │   ├── stgnv_schedules.sql
│   │   │   ├── stgnv_injuries.sql
│   │   │   └── stgnv_play_by_play_participation.sql
│   │   ├── 2_intermediate/        # Feature engineering (views + materialized tables)
│   │   │   ├── int_plays_cleaned.sql
│   │   │   ├── int_team_offensive_metrics.sql     # Rolling 4-week EPA/play
│   │   │   ├── int_team_defensive_strength.sql    # Rolling defensive EPA allowed
│   │   │   ├── int_situational_efficiency.sql     # Red zone, 3rd down, etc.
│   │   │   ├── int_game_vegas_lines.sql           # Opening lines, line movement
│   │   │   └── int_upcoming_games.sql             # Future schedule with context
│   │   └── 3_marts/               # Final ML-ready tables
│   │       ├── mart_predictive_features.sql       # Per-team, per-week feature set
│   │       ├── mart_game_prediction_features.sql  # Historical game rows for training
│   │       ├── mart_upcoming_game_predictions.sql # Upcoming games for inference
│   │       ├── mart_weather_impact_metrics.sql    # Weather-adjusted efficiency
│   │       └── mart_model_validation.sql          # Walk-forward validation dataset
│   ├── macros/
│   └── tests/
│
├── src/
│   ├── ingestion/
│   │   └── training_data_loader.py   # Bulk parquet → Snowflake loader (Polars)
│   └── ml/
│       ├── base.py                    # Abstract BasePredictor interface
│       ├── models/
│       │   ├── spread_predictor.py    # XGBoost spread model
│       │   ├── elo_model.py           # Elo rating baseline
│       │   ├── bayesian.py            # Bayesian state-space model (PyMC)
│       │   ├── neural.py              # Shallow neural network (PyTorch)
│       │   └── ensemble.py            # Ridge stacking meta-learner
│       ├── utils/
│       │   ├── feature_engineering.py # Feature selection and preparation
│       │   ├── evaluation.py          # ATS accuracy, Brier score, RMSE
│       │   └── validation.py          # Walk-forward CV helpers
│       ├── train_spread_model.py      # XGBoost training entry point
│       ├── train_elo_model.py         # Elo model training
│       ├── train_ensemble.py          # Full ensemble training
│       └── predict.py                 # Inference: load models → generate predictions
│
├── models/                        # Serialized model artifacts (.pkl, .json)
├── data/                          # Prediction CSV outputs
├── notebooks/                     # Exploratory analysis
├── snowflake_sql/                 # Ad-hoc SQL scripts
├── logs/                          # Rotating application logs
├── pyproject.toml                 # Project metadata, uv/ruff config
├── requirements.txt               # Python dependencies
├── TESTING_GUIDE.md               # dbt layer validation walkthrough
└── CLAUDE.md                      # AI assistant project instructions

Data Pipeline Overview

nflverse GitHub Releases (parquet)
         │
         ▼
 TrainingDataLoader (Polars + Snowflake connector)
         │  bulk COPY INTO
         ▼
 Snowflake RAW.NFLVERSE  ──────────────────────────────────────┐
         │                                                       │
         ▼  dbt run                                             │
 1_staging (views)                                              │
  · stgnv_play_by_play                                          │
  · stgnv_schedules, stgnv_rosters                              │
  · stgnv_player/team_summary_stats                             │
  · stgnv_injuries, stgnv_pbp_participation                     │
         │                                                       │
         ▼  dbt run                                             │
 2_intermediate (views + tables)                                │
  · int_plays_cleaned                                           │
  · int_team_offensive_metrics  (rolling EPA/play)              │
  · int_team_defensive_strength (rolling EPA allowed)           │
  · int_situational_efficiency                                  │
  · int_game_vegas_lines                                        │
  · int_upcoming_games                                          │
         │                                                       │
         ▼  dbt run                                             │
 3_marts (tables)                                               │
  · mart_game_prediction_features  ← training data             │
  · mart_predictive_features       ← per-team feature store    │
  · mart_upcoming_game_predictions ← inference input           │
  · mart_model_validation                                       │
  · mart_weather_impact_metrics                                 │
         │                                                       │
         ▼  Python                                              │
 ML Training (Snowflake → Polars → sklearn/XGBoost/PyMC)       │
  · SpreadPredictor (XGBoost)                                   │
  · EloModel (baseline)                                         │
  · BayesianStateSpace (PyMC AR(1))                             │
  · NeuralNetPredictor (PyTorch)                                │
  · StackingEnsemble (Ridge meta-learner)                       │
         │                                                       │
         ▼  Python                                              │
 Weekly Predictions → Snowflake ML schema + CSV output ────────┘

ML Model Architecture

The system uses a stacking ensemble of four base models, combined by a Ridge regression meta-learner:

┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐
│   XGBoost   │  │  Bayesian   │  │   Neural    │  │    Elo      │
│  Regressor  │  │ State-Space │  │   Network   │  │  Baseline   │
│(max_depth=4)│  │  (PyMC AR1) │  │ (PyTorch,   │  │  (Glickman) │
│ (lr=0.05)   │  │             │  │  ≤3 layers) │  │             │
└──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘
       └────────────────┴────────────────┴────────────────┘
                              │
                   ┌──────────┴──────────┐
                   │   Ridge Regression  │
                   │   Meta-Learner      │
                   │   (+ uncertainty)   │
                   └──────────┬──────────┘
                              │
                   ┌──────────┴──────────┐
                   │  Betting Decision   │
                    │  (edge ≥ 3 pts →    │
                    │   0.25x Kelly bet)  │
                   └─────────────────────┘
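
The stacking step itself is simple: each base model's out-of-sample spread prediction becomes one input column for the Ridge meta-learner. The toy sketch below shows the idea on synthetic data; the names and numbers are illustrative, not the repo's actual code.

import numpy as np
from sklearn.linear_model import Ridge

# Toy data: actual home margins plus four noisy "base model" predictions
# standing in for XGBoost / Bayesian / neural net / Elo.
rng = np.random.default_rng(0)
n_games = 200
actual_margin = rng.normal(0, 13, n_games)
base_preds = np.column_stack(
    [actual_margin + rng.normal(0, sigma, n_games) for sigma in (9, 10, 11, 12)]
)

# Ridge regression learns one blending weight per base model.
meta_learner = Ridge(alpha=1.0)
meta_learner.fit(base_preds, actual_margin)
blended_spread = meta_learner.predict(base_preds)
print(meta_learner.coef_)   # relative weight given to each base model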

All models implement the BasePredictor abstract interface (src/ml/base.py) providing fit(), predict(), predict_proba(), save_model(), and load_model().
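
A minimal sketch of that contract, using the method names listed above (signatures and docstrings are assumptions, not the actual src/ml/base.py):

from abc import ABC, abstractmethod
import numpy as np

class BasePredictor(ABC):
    """Shared interface for all spread models (sketch only)."""

    @abstractmethod
    def fit(self, X: np.ndarray, y: np.ndarray) -> "BasePredictor": ...

    @abstractmethod
    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predicted home-team margin for each game."""

    @abstractmethod
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """Probability of covering the spread for each game."""

    @abstractmethod
    def save_model(self, path: str) -> None: ...

    @abstractmethod
    def load_model(self, path: str) -> "BasePredictor": ...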

Validation Strategy

Walk-forward cross-validation is required — random splits are never used:

  • Train on seasons [t-n ... t-1], test on season t
  • Repeat sliding the window forward for each available test season
  • Target metrics: ATS Accuracy (>52.4%), Brier Score, RMSE vs Vegas line
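
A compact sketch of that splitting scheme (the season range and window size are illustrative):

def walk_forward_splits(seasons, train_window=5):
    """Yield (train_seasons, test_season) pairs, sliding the window forward
    one season at a time so the test season is always strictly in the future."""
    for i in range(train_window, len(seasons)):
        yield seasons[i - train_window:i], seasons[i]

# Example: train on five prior seasons, test on the next one.
for train, test in walk_forward_splits(list(range(2015, 2025))):
    print(f"train {train[0]}-{train[-1]} -> test {test}")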

Dagster Orchestration

The Dagster project (dagster_project/) defines the full pipeline as software-defined assets:

Asset                   Description
raw_nflverse_data       Loads current-season parquet datasets to Snowflake
raw_schedules           Loads full schedules file (all seasons, full replace)
nfl_dbt_assets          Runs dbt models (staging → intermediate → marts)
trained_xgboost_model   Trains XGBoost spread predictor, saves to models/
weekly_predictions      Loads upcoming games mart, runs ensemble inference, writes to Snowflake + CSV

Schedule: Every Tuesday at 8:00 AM during the NFL season (September–February). The schedule auto-derives the current NFL week and season from the execution date.
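
A hedged sketch of such a schedule in Dagster; the job name, timezone, and season gating are assumptions here, and the real definition lives in dagster_project/schedules/weekly_schedule.py:

from dagster import DefaultScheduleStatus, ScheduleDefinition, define_asset_job

# Illustrative job over all assets; the repo's weekly pipeline job may differ.
weekly_pipeline_job = define_asset_job("weekly_pipeline_job")

# Cron "0 8 * * 2" fires every Tuesday at 08:00. The September-February season
# check and the NFL week/season derivation happen inside the actual schedule code.
weekly_schedule = ScheduleDefinition(
    job=weekly_pipeline_job,
    cron_schedule="0 8 * * 2",
    execution_timezone="America/New_York",   # assumed; not stated in the repo
    default_status=DefaultScheduleStatus.STOPPED,
)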

To launch the Dagster UI locally:

dagster dev -m dagster_project

Snowflake Schema Structure

NFL_ANALYTICS.RAW.NFLVERSE       -- Bulk-loaded parquet tables
NFL_ANALYTICS.STAGING            -- dbt views (stgnv_*)
NFL_ANALYTICS.INTERMEDIATE       -- dbt views/tables (int_*)
NFL_ANALYTICS.MARTS              -- dbt materialized tables (mart_*)
NFL_ANALYTICS.ML                 -- Model artifacts and predictions

Setup

Prerequisites

  • Python 3.11+
  • Snowflake account with key-pair authentication configured
  • dbt CLI (installed via dbt-snowflake)

Install

# Clone the repository and create a virtual environment
git clone https://github.com/jassassin68/nfl_batchgineering.git
cd nfl_batchgineering
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate

pip install -r requirements.txt

Environment Variables

Create a .env file at the project root:

SNOWFLAKE_ACCOUNT=your_account_identifier
SNOWFLAKE_USER=your_username
SNOWFLAKE_WAREHOUSE=NFL_ANALYTICS_WH
SNOWFLAKE_DATABASE=NFL_ANALYTICS
SNOWFLAKE_SCHEMA=RAW
SNOWFLAKE_ROLE=your_role

# Key-pair authentication
SNOWFLAKE_KEYPAIR_PRIVATE_KEY="-----BEGIN ENCRYPTED PRIVATE KEY-----\n...\n-----END ENCRYPTED PRIVATE KEY-----"
SNOWFLAKE_KEYPAIR_PASSPHRASE=your_key_passphrase
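
These variables are consumed roughly as follows when opening a connection with snowflake-connector-python; this is a sketch of the standard key-pair flow, not necessarily the repo's exact helper:

import os

import snowflake.connector
from cryptography.hazmat.primitives import serialization

# Decrypt the PEM key from the environment and convert it to DER/PKCS8 bytes,
# the form the Snowflake connector expects for key-pair authentication.
pem = os.environ["SNOWFLAKE_KEYPAIR_PRIVATE_KEY"].replace("\\n", "\n").encode()
private_key = serialization.load_pem_private_key(
    pem, password=os.environ["SNOWFLAKE_KEYPAIR_PASSPHRASE"].encode()
)
private_key_der = private_key.private_bytes(
    encoding=serialization.Encoding.DER,
    format=serialization.PrivateFormat.PKCS8,
    encryption_algorithm=serialization.NoEncryption(),
)

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    private_key=private_key_der,
    warehouse=os.environ["SNOWFLAKE_WAREHOUSE"],
    database=os.environ["SNOWFLAKE_DATABASE"],
    schema=os.environ["SNOWFLAKE_SCHEMA"],
    role=os.environ["SNOWFLAKE_ROLE"],
)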

dbt Profile

Add the following to ~/.dbt/profiles.yml:

nfl_batchgineering:
  target: dev
  outputs:
    dev:
      type: snowflake
      account: "{{ env_var('SNOWFLAKE_ACCOUNT') }}"
      user: "{{ env_var('SNOWFLAKE_USER') }}"
      private_key: "{{ env_var('SNOWFLAKE_KEYPAIR_PRIVATE_KEY') }}"
      private_key_passphrase: "{{ env_var('SNOWFLAKE_KEYPAIR_PASSPHRASE') }}"
      role: "{{ env_var('SNOWFLAKE_ROLE') }}"
      database: NFL_ANALYTICS
      warehouse: NFL_ANALYTICS_WH
      schema: STAGING
      threads: 4

Running the Pipeline

1. Load raw data to Snowflake

python src/ingestion/training_data_loader.py

Loads the following nflverse datasets for the current season (historical years are a one-time load):

  • play_by_play — full play-by-play with EPA
  • rosters — active and historical rosters
  • player_summary_stats — per-player season/weekly stats
  • team_summary_stats — per-team summary stats
  • play_by_play_participation — player participation per play
  • injuries — weekly injury reports
  • schedules — full schedule with Vegas lines (all seasons, single file)
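
At its core, such a loader reads each parquet release into Polars and bulk-writes it into the RAW schema. The sketch below uses write_pandas, which stages the data and runs COPY INTO behind the scenes; the function name, URL pattern, and table name are illustrative rather than the repo's actual implementation:

import io
import urllib.request

import polars as pl
from snowflake.connector.pandas_tools import write_pandas

def load_nflverse_parquet(conn, dataset_url: str, table_name: str) -> int:
    """Download one nflverse parquet release asset and bulk-load it into RAW."""
    raw_bytes = urllib.request.urlopen(dataset_url).read()
    df = pl.read_parquet(io.BytesIO(raw_bytes))
    _, _, n_rows, _ = write_pandas(
        conn,
        df.to_pandas(),
        table_name=table_name,
        database="NFL_ANALYTICS",
        schema="RAW",
        auto_create_table=True,
    )
    return n_rows

# Example call (URL pattern assumed from nflverse-data release naming):
# load_nflverse_parquet(conn, "https://github.com/nflverse/nflverse-data/"
#                             "releases/download/pbp/play_by_play_2024.parquet",
#                       "PLAY_BY_PLAY")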

2. Run dbt transformations

cd dbt_project
dbt run        # Build all models
dbt test       # Run data quality tests

Or layer by layer:

dbt run --select 1_staging
dbt run --select 2_intermediate
dbt run --select 3_marts

See TESTING_GUIDE.md for validation queries and expected row counts.

3. Train models

# XGBoost spread model (primary)
python src/ml/train_spread_model.py

# Elo baseline model
python src/ml/train_elo_model.py

# Full ensemble (requires base models trained first)
python src/ml/train_ensemble.py

Trained artifacts are serialized to the models/ directory.

4. Generate predictions

python src/ml/predict.py --week 15 --season 2025 --output predictions_week15.csv

Or run the full pipeline via Dagster:

dagster dev -m dagster_project

Key Design Decisions

No Look-Ahead Bias

All rolling features use point-in-time joins. Window functions use ROWS BETWEEN N PRECEDING AND 1 PRECEDING — never including the current game. This is enforced in dbt intermediate models and validated via mart_model_validation.
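
The same point-in-time rule, expressed in Polars for illustration (the repo enforces it in dbt SQL): shift by one game before taking the rolling mean, so a game's own EPA can never leak into its features.

import polars as pl

games = pl.DataFrame({
    "team": ["KC"] * 6,
    "week": [1, 2, 3, 4, 5, 6],
    "epa_per_play": [0.10, 0.05, 0.20, -0.02, 0.15, 0.08],
})

features = games.sort("week").with_columns(
    pl.col("epa_per_play")
    .shift(1)                        # exclude the current game
    .rolling_mean(window_size=4)     # prior four games only (null until enough history)
    .over("team")
    .alias("rolling_epa_prior_4")
)
print(features)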

Sample Size Constraints

NFL seasons have only ~270 games, so ~20 years of data amounts to roughly 5,000 usable samples. The architecture deliberately avoids deep learning (LSTM, transformers) in favor of shallow models with strong regularization.

Calibration Over Accuracy

Models are optimized for Brier Score. Well-calibrated probabilities produce higher ROI than raw directional accuracy (Walsh & Joshi, 2024).
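
Brier score is simply the mean squared error between the predicted cover probability and the 0/1 outcome; an uninformative 50% prediction scores 0.25, and lower is better. A tiny worked example with made-up numbers:

import numpy as np

p_cover = np.array([0.58, 0.45, 0.70, 0.51])   # predicted probability of covering
covered = np.array([1, 0, 1, 1])               # what actually happened

brier = np.mean((p_cover - covered) ** 2)
print(brier)   # ≈ 0.177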

Fractional Kelly Sizing

All bet sizing uses 0.25x Kelly to reduce variance. A minimum edge threshold of 3–4 points against the Vegas line is required before any bet is recommended.
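
A rough illustration of that sizing rule at standard -110 pricing (how the model turns a point edge into a cover probability is not shown here, and the helper name is made up):

def quarter_kelly_stake(p_win: float, bankroll: float, price: int = -110) -> float:
    """0.25x Kelly stake for an American-odds bet (sketch, not the repo's code)."""
    b = 100 / abs(price)                       # net odds: win 100 per 110 risked at -110
    full_kelly = (b * p_win - (1 - p_win)) / b
    return max(0.0, 0.25 * full_kelly) * bankroll

# e.g. a 55% cover probability on a 1,000-unit bankroll:
print(round(quarter_kelly_stake(0.55, 1_000), 2))   # ≈ 13.75 units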


Code Quality

# Lint and format with ruff
ruff check .
ruff format .

Configuration in pyproject.toml: line length 88, Python 3.11 target, isort integrated.


License

MIT
