
Pedagogy: add 15_optimization/038 — optimization on a histogram surrogate (classification example) #425

@kwlee2025cpp

Description

Summary

Build 15_optimization/038_optimization_on_histogram_surrogate.ipynb as a companion to the existing 030_Classification_Optimization.ipynb. The new notebook teaches a general numerical-methods pattern: when iterative parameter optimization is expensive over $N$ raw data points, replace the per-point sum in the cost function with a per-cell sum over a multi-dimensional histogram of the feature space, optimize on the surrogate, then verify the final result on the full dataset.
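
As a one-glance contrast between the two cost evaluations, here is a minimal sketch (the function and argument names are illustrative placeholders, not the notebook's final API):

```python
import numpy as np

def cost_pointwise(theta, X, y, point_loss):
    # O(N) per evaluation: every raw data point enters the sum, so a
    # multi-start optimizer pays this full scan at every iteration.
    return point_loss(theta, X, y).mean()

def cost_surrogate(theta, centroids, counts, labels, point_loss):
    # O(C) per evaluation, C = number of occupied histogram cells: each
    # cell stands in for `counts` identical points at its centroid, so
    # the per-point sum collapses to a count-weighted per-cell sum.
    w = counts / counts.sum()
    return np.sum(w * point_loss(theta, centroids, labels))
```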

The pattern was used in Ch. 6 of:

Lee, K. (2004). Longitudinal Driver Model and Collision Warning and Avoidance Algorithms Based on Human Driving Databases. Ph.D. dissertation, Mechanical Engineering, University of Michigan.

There, two 5D histograms reduced 3.4 M data points (~400 MB) to ~1.9 MB while still approximating the confusion matrix that fmincon evaluated from 1000 random initial conditions. The same pattern transplants cleanly to a 2D classification example for teaching.

Pedagogical arc

  • Motivate the problem: iterative optimization (multi-start, hyperparameter sweeps) makes the per-point cost-function scan the bottleneck
  • Build per-class histograms from a 2D training set (two overlapping Gaussians, ~$10^6$ points); visualize side-by-side heatmaps analogous to the dissertation's Fig. 6.2 "Threatening" / "Safe" panels
  • Surrogate confusion matrix: $\widehat{TP}, \widehat{FP}, \widehat{TN}, \widehat{FN}$ as weighted sums over cell centroids (the algebra of Eqs. 38–39 in Ch. 6, restated for classification); both steps are sketched in the first code block after this list
  • Train a logistic regression model two ways, on raw data vs. on the surrogate, and overlay the decision boundaries (second sketch after this list)
  • Bias-vs-speed tradeoff via an ipywidgets slider on grid resolution $n \times n$, with a live Pareto plot
  • Verify-on-truth step: re-evaluate the surrogate-trained model point-by-point on the full dataset; tabulate the confusion matrix both ways
  • Train/evaluation split as an overfitting check (mirrors the split-half analysis in Ch. 6 §6.1)
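
A minimal sketch of the first two steps, assuming NumPy and placeholder choices (the Gaussian parameters, the 64×64 grid, and the linear decision rule stand in for whatever 038 finally uses):

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_class = 500_000
# Two overlapping 2D Gaussians as a stand-in training set.
X0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(n_per_class, 2))  # class 0, "Safe"
X1 = rng.normal(loc=[+1.0, 0.0], scale=1.0, size=(n_per_class, 2))  # class 1, "Threatening"

# Shared bin edges so both per-class histograms live on the same grid.
n_bins = 64
lo = np.minimum(X0.min(axis=0), X1.min(axis=0))
hi = np.maximum(X0.max(axis=0), X1.max(axis=0))
edges = [np.linspace(lo[d], hi[d], n_bins + 1) for d in range(2)]
H0, _, _ = np.histogram2d(X0[:, 0], X0[:, 1], bins=edges)
H1, _, _ = np.histogram2d(X1[:, 0], X1[:, 1], bins=edges)

# Cell centroids (bin midpoints).
cx = 0.5 * (edges[0][:-1] + edges[0][1:])
cy = 0.5 * (edges[1][:-1] + edges[1][1:])
CX, CY = np.meshgrid(cx, cy, indexing="ij")

def surrogate_confusion(w, b):
    # Approximate confusion matrix for the linear rule w[0]*x + w[1]*y + b > 0:
    # every point in a cell is treated as sitting at the cell centroid,
    # so each entry is a count-weighted sum over cells.
    pred_pos = (w[0] * CX + w[1] * CY + b) > 0
    TP, FN = H1[pred_pos].sum(), H1[~pred_pos].sum()
    FP, TN = H0[pred_pos].sum(), H0[~pred_pos].sum()
    return TP, FP, TN, FN
```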
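
And a sketch of the surrogate-side training plus the verify-on-truth step, continuing from the arrays above (sklearn's `sample_weight` carries the cell counts; again illustrative, not the final notebook code):

```python
from sklearn.linear_model import LogisticRegression

# Flatten occupied cells into a weighted training set: one row per
# occupied cell per class, weighted by that cell's count.
occ0, occ1 = H0 > 0, H1 > 0
Xs = np.vstack([np.column_stack([CX[occ0], CY[occ0]]),
                np.column_stack([CX[occ1], CY[occ1]])])
ys = np.concatenate([np.zeros(occ0.sum()), np.ones(occ1.sum())])
ws = np.concatenate([H0[occ0], H1[occ1]])

clf = LogisticRegression()
clf.fit(Xs, ys, sample_weight=ws)  # weighted log-loss on cell centroids

# Verify on truth: score the surrogate-trained model point-by-point
# on the full raw dataset, not on the histogram.
X_all = np.vstack([X0, X1])
y_all = np.concatenate([np.zeros(len(X0)), np.ones(len(X1))])
acc_raw = clf.score(X_all, y_all)
```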

Scope decisions

  • Classifier: likely logistic regression, since weighted log-loss drops cleanly onto the surrogate; an SVM is more visual but heavier.
  • Grid: uniform for the first notebook. Log-spaced (per dimension) and quantile-spaced are mentioned as alternatives but not implemented in 038.
  • Cost function: choose one of (a) the geometric mean of TP rate and precision, mirroring Ch. 6 Eq. 40, or (b) sklearn-style weighted log-loss; signpost the other in the closing. Both are sketched below.
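
Both candidates written against the surrogate, reusing the hypothetical `surrogate_confusion` helper and the weighted cell arrays (`Xs`, `ys`, `ws`) from the sketches above; reading Eq. 40's "TP" as the TP rate is an assumption here:

```python
import numpy as np

def objective_geomean(w, b):
    # (a) geometric mean of TP rate and precision, to be maximized
    # (assumes Ch. 6 Eq. 40 uses rates, not raw counts).
    TP, FP, TN, FN = surrogate_confusion(w, b)
    return np.sqrt((TP / (TP + FN)) * (TP / (TP + FP)))

def objective_logloss(w, b, eps=1e-12):
    # (b) count-weighted log-loss over occupied cells, to be minimized.
    p = 1.0 / (1.0 + np.exp(-(Xs @ np.asarray(w) + b)))
    return -np.average(ys * np.log(p + eps) + (1 - ys) * np.log(1 - p + eps),
                       weights=ws)
```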

Curricular hooks

  • Bridges 20_probability/ histograms (per-class joint densities) and 15_optimization/ (optimization on a surrogate)
  • Direct follow-on to 030_Classification_Optimization.ipynb
  • The train-on-surrogate, verify-on-truth pattern is a general engineering numerical-methods lesson, reusable elsewhere

Out of scope (parked as future threads)

  • Adaptive $C$-groups (k-means coreset) — extends the rectangular grid to data-adaptive cells, the move that escapes the curse of dimensionality. Would live as a sibling notebook (~039) once 038 lands.
  • Soft-assignment / Gaussian-mixture surrogate
  • Boundary-aware adaptive refinement (cluster misclassified points after a first pass)
  • Sufficient-statistics-per-cell extension (Bradley, Fayyad, Reina 1998), the high-D successor to histograms

Metadata

Labels: pedagogy (Notebook content / teaching changes)