# End-to-End ML Case Study: Early Breast Cancer Diagnosis (Classification)
Teaching-grade example for **PAU 3102 Research Methods**: problem framing → data →
methodology → experiments & evaluation → deployment → reproducibility.
## Quickstart
```bash
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
python -m src.train model.name=gb # trains & saves to app/model.joblib (Hydra-driven)
# Examples of config overrides with Hydra:
# python -m src.train model.name=rf training.seed=123 data.test_size=0.25
python src/evaluate.py # prints metrics & saves reports/last_eval.json
python src/predict.py --json examples/one.json # runs a sample prediction
uvicorn app.main:app --reload # run API locally
```
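Under the hood, `src.train` and `src/predict.py` wrap a standard scikit-learn workflow on the built-in breast cancer dataset. The following is a minimal, self-contained sketch of that flow (the Hydra wiring and exact script internals are omitted; the model choice mirrors the `model.name=gb` default, and the metadata fields are illustrative):

```python
import json
from pathlib import Path

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Load the built-in dataset (this project uses no external raw data)
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train a gradient-boosting classifier (the 'gb' default in the config)
model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Persist the model and its metadata, as training does
out = Path("app")
out.mkdir(exist_ok=True)
joblib.dump(model, out / "model.joblib")
(out / "model_meta.json").write_text(
    json.dumps({"features": list(X.columns), "model": "gb"})
)

# Reload and predict on one sample, as prediction does
reloaded = joblib.load(out / "model.joblib")
print(reloaded.predict(X_test.iloc[[0]]))
```

The saved `model_meta.json` is what the prediction and demo layers read to know which feature columns the model expects.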
## MLflow UI
```bash
mlflow ui  # view run results with MLflow at http://localhost:5000
```
## Data versioning with DVC
This repository includes a minimal DVC pipeline to version model artefacts and make runs reproducible.
Files added for DVC integration:
- `dvc.yaml` - defines a `train` stage that runs the Hydra-driven training command and declares outputs (`app/model.joblib`, `app/model_meta.json`).
- `params.yaml` - parameters that DVC can track across runs (model name, seed, test split, cv folds, output path).
- `.dvcignore` - ignore rules for DVC/git.
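To make the two files concrete, here is a rough sketch of what they might contain. The stage command, outputs, and parameter names follow the description above, but the exact layout in the repository may differ:

```yaml
# dvc.yaml -- a single 'train' stage wrapping the Hydra-driven training command
stages:
  train:
    cmd: python -m src.train
    params:
      - model.name
      - training.seed
      - data.test_size
    outs:
      - app/model.joblib
      - app/model_meta.json

# params.yaml -- values DVC tracks across runs
model:
  name: gb
training:
  seed: 42
data:
  test_size: 0.2
```

When any tracked parameter changes, `dvc repro` detects it and re-runs the stage; otherwise the cached outputs are reused.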
How to use DVC locally (steps 1–3 are one-time setup):
1. Install DVC (you can also install via pip using the project requirements):
```powershell
pip install dvc
```
2. Initialize DVC in your repo (run once):
```powershell
dvc init
```
3. Configure remote storage for model/artifact pushes (for example S3, GCS, SSH, or a local directory):
```powershell
# example local remote
dvc remote add -d myremote C:\path\to\dvc_remote
# or S3: dvc remote add -d myremote s3://my-bucket/path
```
4. Reproduce the `train` stage (runs the command and creates the outputs listed in `dvc.yaml`):
```powershell
dvc repro
```
5. Track and push artifacts to the remote:
```powershell
dvc push
```
Notes:
- Because this project uses the sklearn built-in dataset, there is no large external raw dataset to `dvc add` by default. DVC is useful here to version the artefacts produced by training: `app/model.joblib` and `app/model_meta.json`.
- To pin artifacts to commits, use git to commit the generated `.dvc` metafiles and the `dvc.lock` file (created after `dvc repro`) so each git commit references the exact artifact versions.
- If you add external data later, run `dvc add <path/to/data>` and commit the generated `<file>.dvc` metafile to version it.
## Web demo (Streamlit)
A lightweight Streamlit app is included to inspect evaluation metrics and run predictions locally.
Run the app after installing dependencies:
```powershell
# install deps (if not already)
pip install -r requirements.txt
# run streamlit UI
streamlit run app/streamlit_app.py
```
The sidebar allows uploading a CSV for batch predictions. On the right you can enter a single sample using the features saved in `app/model_meta.json`.
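Batch prediction in a UI like this boils down to aligning the uploaded CSV's columns with the feature list stored in `app/model_meta.json` and scoring with the saved model. A minimal sketch of that logic, assuming the metadata layout shown earlier (the function and variable names here are illustrative, not the app's actual code):

```python
import joblib
import pandas as pd


def predict_batch(model_path: str, features: list[str], csv_df: pd.DataFrame) -> pd.DataFrame:
    """Score an uploaded CSV with the saved model, using only the expected features."""
    model = joblib.load(model_path)
    X = csv_df[features]  # select and reorder columns to match training
    out = csv_df.copy()
    out["prediction"] = model.predict(X)
    return out
```

Selecting `csv_df[features]` also raises a clear `KeyError` if the upload is missing a required column, which is usually the failure mode you want to surface in the UI.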