Predict the software components impacted by a user story or requirement description using a multi-label DistilBERT classifier. The project covers data preparation, fine-tuning, batch inference, and serving a FastAPI endpoint.
```
component_identifier/
├── app/
│   └── main.py                          # FastAPI service
├── data/
│   ├── generate_synthetic_baseline.py   # Script to rebuild the demo dataset
│   └── train.csv                        # Synthetic baseline training set (3,000 stories)
├── models/
│   └── distilbert_component_classifier/ # Saved weights/tokenizer after training
├── src/
│   ├── predict.py
│   ├── train.py
│   └── utils.py
├── requirements.txt
└── README.md
```
Input CSV columns:

- `text`: user story or requirement.
- `labels`: impacted components separated by `|` (multi-label, 20 components in the baseline taxonomy).
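Before training, the pipe-separated `labels` column is typically expanded into multi-hot vectors. A minimal stdlib sketch of that step (the `to_multi_hot` helper and two-row sample are illustrative, not the repo's actual code):

```python
# Build a label vocabulary from pipe-separated label strings and
# encode each row as a multi-hot vector, mirroring the CSV schema.
def to_multi_hot(label_strings):
    vocab = sorted({lab for s in label_strings for lab in s.split("|")})
    index = {lab: i for i, lab in enumerate(vocab)}
    vectors = []
    for s in label_strings:
        vec = [0.0] * len(vocab)
        for lab in s.split("|"):
            vec[index[lab]] = 1.0
        vectors.append(vec)
    return vocab, vectors

# Two made-up rows in the train.csv "labels" format.
vocab, vectors = to_multi_hot(["auth|support", "payments"])
```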
- `data/train.csv` contains 3,000 synthetic user stories derived from 25 domain scenarios. Each scenario maps deterministically to the impacted components to provide a balanced starting point for fine-tuning.
- Regenerate the demo data anytime:

```bash
python data/generate_synthetic_baseline.py
```

- For production scenarios, swap in your own labeled requirements before training, but keep the same schema and ensure every component has dozens of examples for stable learning.
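One quick way to check that every component really has enough examples is a frequency count over the `labels` column; a stdlib sketch (the sample rows and the threshold of 2 are made up):

```python
from collections import Counter

def label_counts(label_strings):
    """Count how often each component appears across all rows."""
    counts = Counter()
    for s in label_strings:
        counts.update(s.split("|"))
    return counts

rows = ["auth|support", "auth|payments", "payments"]
counts = label_counts(rows)
# Flag components that fall below a minimum example count.
rare = [lab for lab, n in counts.items() if n < 2]
```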
```bash
python -m venv .venv
source .venv/bin/activate  # On Windows use: .venv\Scripts\activate
pip install --upgrade pip
pip install -r requirements.txt
```

```bash
python src/train.py \
  --train_path data/train.csv \
  --output_dir models/distilbert_component_classifier \
  --batch_size 4 \
  --num_epochs 4 \
  --max_length 256
```

Key features:
- Uses `DistilBertTokenizerFast` and `DistilBertForSequenceClassification`.
- Multi-label setup with sigmoid activation and `BCEWithLogitsLoss`.
- Train/validation split (default 90/10), micro precision/recall/F1 metrics, and Hugging Face `Trainer`.
- Model + tokenizer + label map saved via `save_pretrained()`.
- Reference run (distilbert-base-uncased, synthetic baseline dataset, batch size 4, 4 epochs) previously achieved micro Precision 1.00, Recall 0.96, F1 0.98 on the held-out validation split. Expect metrics to differ once you introduce real data.
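For intuition, `BCEWithLogitsLoss` treats each label as an independent binary decision: it applies a sigmoid to each logit and averages binary cross-entropy across labels. A stdlib-only re-derivation (this is an illustration of the math, not the repo's training code):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent per-label logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        p = sigmoid(z)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(logits)

# Confident, correct logits contribute little loss.
loss = bce_with_logits([4.0, -3.0], [1.0, 0.0])
```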
```bash
python src/predict.py \
  --model_dir models/distilbert_component_classifier \
  --text "As an admin I need to monitor feature rollout to stay compliant."
```

The script prints component predictions with confidence scores. Adjust `--threshold` (default 0.5) for stricter or more permissive outputs.
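The thresholding can be pictured as applying a sigmoid to each per-label logit and keeping labels whose score clears `--threshold`; a minimal sketch (the label names and logit values are made up, and this is not the repo's predict.py internals):

```python
import math

def decode(logits, labels, threshold=0.5):
    """Map per-label logits to (label, score) pairs above the threshold."""
    picked = []
    for z, lab in zip(logits, labels):
        score = 1.0 / (1.0 + math.exp(-z))  # sigmoid per label
        if score >= threshold:
            picked.append((lab, round(score, 2)))
    return sorted(picked, key=lambda p: -p[1])

preds = decode([2.3, -0.8, 0.2], ["auth", "billing", "support"], threshold=0.5)
# Lowering the threshold admits more tentative labels.
```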
- Ensure the fine-tuned artifacts exist in `models/distilbert_component_classifier/`.
- Start the API:

```bash
uvicorn app.main:app --reload --port 8000
```

- Send a request:

```bash
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{ "text": "As a customer I need to reset my password to reduce support calls." }'
```

Response example:
```json
{
  "components": [
    {"label": "auth", "score": 0.91},
    {"label": "support", "score": 0.54}
  ],
  "threshold": 0.5
}
```

- Reproducible training: `src/train.py` now accepts `--seed` and fixes RNGs. After training it writes `metrics.json` and `run_info.json` into the output model folder with args, data hash, and git commit (if available) for traceability.
- Health/CORS/metrics: FastAPI exposes `/healthz` and `/readyz`. CORS is enabled for React by default. Optional Prometheus metrics (set `ENABLE_METRICS=1`) if `prometheus-fastapi-instrumentator` is installed.
- Tests: unit and API tests live under `tests/`, with dev dependencies in `dev-requirements.txt`. CI via GitHub Actions runs the tests on pushes/PRs (free for public repos).
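Client code often re-filters the `/predict` response by score before displaying it; a minimal sketch against the JSON response shape shown above (the `components_above` helper name is an assumption, not part of the repo):

```python
def components_above(response, min_score):
    """Keep predicted component labels whose score clears min_score."""
    return [c["label"] for c in response["components"] if c["score"] >= min_score]

# A response dict matching the example payload returned by /predict.
resp = {
    "components": [
        {"label": "auth", "score": 0.91},
        {"label": "support", "score": 0.54},
    ],
    "threshold": 0.5,
}
strict = components_above(resp, 0.8)
```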
- By default all origins are allowed. To restrict:

```bash
export ALLOW_ORIGINS=http://localhost:5173,http://localhost:3000  # Windows PowerShell: $env:ALLOW_ORIGINS="http://..."
uvicorn app.main:app --reload --port 8000
```

- To enable the optional Prometheus metrics:

```bash
export ENABLE_METRICS=1  # Windows PowerShell: $env:ENABLE_METRICS=1
uvicorn app.main:app --reload --port 8000
```

- To run the tests:

```bash
pip install -r requirements.txt -r dev-requirements.txt
pytest -q
```

Share visuals and plain-English talking points with stakeholders in one command:
```bash
pip install -r requirements.txt -r dev-requirements.txt
python reports/generate_report.py \
  --model_dir models/distilbert_component_classifier \
  --train_path data/train.csv \
  --report_dir reports/latest
```

You will get:

- `reports/latest/report_summary.md`: non-technical explanation of precision, recall, F1, exact-match accuracy, and loss trends.
- `reports/latest/report_data.json`: structured metrics and per-label stats ready for slide tables or dashboards.
- `reports/latest/figures/*.png`: validation F1/loss curves plus a top-component coverage bar chart for quick storytelling.

Run `python reports/generate_report.py -h` to tweak thresholds, validation split, or the destination folder.
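The micro precision/recall/F1 and exact-match accuracy that appear in the summary can be computed directly from multi-hot truth and prediction vectors; a stdlib sketch of the math (not the report script itself, and the two-row example is made up):

```python
def micro_prf(y_true, y_pred):
    """Micro-averaged precision/recall/F1 plus exact-match accuracy
    over lists of multi-hot label vectors."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            tp += t and p              # predicted 1, truth 1
            fp += (not t) and p        # predicted 1, truth 0
            fn += t and (not p)        # predicted 0, truth 1
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    exact = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return prec, rec, f1, exact

prec, rec, f1, exact = micro_prf([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 0]])
```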
When you need to show strengths and improvement areas, run the curated scenario harness:
```bash
python reports/run_manual_eval.py \
  --cases_path reports/manual_eval_cases_100.jsonl \
  --model_dir models/distilbert_component_classifier \
  --output_dir reports/manual_eval \
  --top_k_fallback 3
```

Artifacts:

- `reports/manual_eval/manual_eval_results.{json,csv}`: per-scenario expectations vs. predictions, true/false positives, misses.
- `reports/manual_eval/manual_eval_summary.md`: bullet-point narrative calling out gaps for non-technical leads.
- `reports/manual_eval/figures/*.png`: outcome distribution, per-case recall, and top missed components for slide-ready visuals.
Use `--top_k_fallback` (default 0) to add the best-scoring labels even when the sigmoid score is below the threshold, which is handy for exploratory edge-case analysis. Edit `reports/manual_eval_cases_100.jsonl` directly or regenerate it with:

```bash
python reports/build_manual_cases.py \
  --limit 100 \
  --output_path reports/manual_eval_cases_100.jsonl
```

The generator spans authentication, lending, collections, KYC, payments, reporting, disputes, core integration, omni-channel comms, and regulatory scenarios so each component in the taxonomy appears multiple times.
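Each line of a JSONL cases file is a standalone JSON object, so the file is easy to edit or generate with the stdlib. A sketch of reading and writing that format (the `text`/`expected_components` field names are assumptions about the schema, not confirmed from the repo):

```python
import io
import json

# Hypothetical case records; the field names are illustrative.
cases = [
    {"text": "As a customer I want to reset my password.",
     "expected_components": ["auth"]},
    {"text": "Flag overdue loans for the collections queue.",
     "expected_components": ["lending", "collections"]},
]

# Write one JSON object per line (the JSONL convention).
buf = io.StringIO()
for case in cases:
    buf.write(json.dumps(case) + "\n")

# Reading the JSONL back yields the same records.
loaded = [json.loads(line) for line in buf.getvalue().splitlines()]
```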
- After training, check your output dir (e.g., `models/distilbert_component_classifier/`) for:
  - `metrics.json`: evaluation metrics from the Trainer
  - `run_info.json`: training args, data SHA256, git commit
  - `label2id.json`: label mapping used at inference
These files are simple, portable, and versionable in git or any storage.
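Because the artifacts are plain JSON, they can be inspected, diffed, or loaded with a few lines of stdlib code; a sketch (the demo uses a throwaway directory standing in for the model folder):

```python
import json
import tempfile
from pathlib import Path

def load_run_artifacts(model_dir):
    """Read whichever JSON artifacts the training run left next to the weights."""
    out = {}
    for name in ("metrics.json", "run_info.json", "label2id.json"):
        path = Path(model_dir) / name
        if path.exists():
            out[name] = json.loads(path.read_text())
    return out

# Demo against a temporary directory instead of a real training output dir.
with tempfile.TemporaryDirectory() as d:
    (Path(d) / "label2id.json").write_text(json.dumps({"auth": 0, "support": 1}))
    artifacts = load_run_artifacts(d)
```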
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Build and run:

```bash
docker build -t component-identifier .
docker run -p 8000:8000 component-identifier
```

- Training defaults target CPU-friendly settings (batch size 4, max length 256, 3–5 epochs). Adjust `--num_epochs`, `--learning_rate`, and other CLI flags as needed.
- The provided synthetic dataset is only for demonstration. Replace it with real, labeled production data for meaningful predictions.
- Point your React app to the FastAPI endpoint:

```typescript
// Example using fetch
async function predictComponents(text: string, threshold = 0.5) {
  const resp = await fetch("http://localhost:8000/predict", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, threshold }),
  });
  if (!resp.ok) throw new Error("Prediction failed");
  return await resp.json();
}
```

- Ensure the backend has CORS configured (default is permissive) or set `ALLOW_ORIGINS` accordingly in production.
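A comma-separated `ALLOW_ORIGINS` value is typically split into a list before being handed to the CORS middleware; an illustrative parser (a sketch of the idea, not the repo's actual app/main.py code):

```python
import os

def parse_allow_origins(default="*"):
    """Split ALLOW_ORIGINS into a clean list, falling back to allow-all."""
    raw = os.environ.get("ALLOW_ORIGINS", default)
    return [o.strip() for o in raw.split(",") if o.strip()]

# Demo value; in production this comes from the deployment environment.
os.environ["ALLOW_ORIGINS"] = "http://localhost:5173, http://localhost:3000"
origins = parse_allow_origins()
```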