NERDD

NER pipeline for the Disque Denúncia context, organized into training, calibration, and pseudolabelling subpipelines.

Current Structure

  • src/: main source code.
  • src/base_model_training/: base model training and evaluation with nested cross-validation.
  • src/pseudolabelling/: pseudolabel generation, score-based split, and refit.
  • src/calibration/: fit and apply reusable probability calibrators to the base model's scores.
  • src/tools/: auxiliary utilities.
  • docs/: operational and architectural documentation.
  • data/: training, test, and calibration datasets.
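
As a rough illustration of what a score-based split in the pseudolabelling step might look like, the sketch below partitions predictions by a confidence threshold. The record layout, the `split_by_score` helper, the toy records, and the 0.9 threshold are all assumptions made for illustration; they are not taken from the repository's code.

```python
def split_by_score(predictions, threshold=0.9):
    """Partition predictions into (kept, held_out) by confidence score.

    Hypothetical helper: each prediction is assumed to be a dict with a
    float "score" key in [0, 1].
    """
    kept, held_out = [], []
    for pred in predictions:
        # Predictions at or above the threshold are kept as pseudolabels.
        (kept if pred["score"] >= threshold else held_out).append(pred)
    return kept, held_out


# Toy records, invented for this example only.
predictions = [
    {"span": "entity-1", "label": "LOC", "score": 0.97},
    {"span": "entity-2", "label": "PER", "score": 0.62},
    {"span": "entity-3", "label": "LOC", "score": 0.91},
]

kept, held_out = split_by_score(predictions, threshold=0.9)
print(len(kept), len(held_out))
```

In a setup like this, the kept subset would feed the refit, while the held-out subset could be discarded or reviewed manually.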

Prerequisites

  • Git
  • Python 3.11+
  • pip
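
A quick way to confirm the interpreter meets the Python 3.11+ requirement before installing dependencies is sketched below. The `meets_minimum` helper is hypothetical and not part of the repository.

```python
import sys


def meets_minimum(version_info, minimum=(3, 11)):
    """Return True if the (major, minor) version is at least `minimum`."""
    return tuple(version_info[:2]) >= minimum


if meets_minimum(sys.version_info):
    print("Python version OK")
else:
    print("Python 3.11 or newer is required")
```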

Quick Setup

git clone https://github.com/MLRG-CEFET-RJ/nerdd.git
cd nerdd/src
pip install -r requirements.txt

Next Steps

  • Detailed installation: docs/INSTALL.md
  • Runbook: docs/RUNBOOK.md
  • Pipeline overview: docs/PIPELINE_OVERVIEW.md
  • Architecture: docs/ARCHITECTURE.md
  • Architectural decisions: docs/ARCHITECTURAL_DECISIONS.md
  • Migration: docs/MIGRATION.md

Canonical Flow

  1. Train the base model in src/base_model_training/.
  2. Build a labeled calibration subset and fit a reusable calibrator artifact in src/calibration/.
  3. Run large-corpus prediction in src/pseudolabelling/, optionally applying the calibrator during inference.
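
Steps 2 and 3 can be pictured with the minimal sketch below: fit a calibrator on a small labeled subset, persist it as a reusable artifact, then reload and apply it at inference time. It uses histogram binning as a stand-in technique and JSON as the artifact format. Both choices, along with every function name and the toy data, are assumptions for illustration and do not reflect how src/calibration/ is actually implemented.

```python
import json
import tempfile
from pathlib import Path


def fit_histogram_calibrator(scores, labels, n_bins=5):
    """Fit a histogram-binning calibrator on scores in [0, 1].

    Each bin stores the empirical fraction of correct predictions
    (labels are 1 for correct, 0 for incorrect).
    """
    hits = [0] * n_bins
    counts = [0] * n_bins
    for score, label in zip(scores, labels):
        b = min(int(score * n_bins), n_bins - 1)
        counts[b] += 1
        hits[b] += label
    # Empty bins fall back to the bin midpoint, leaving scores roughly unchanged.
    bin_probs = [
        hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
        for b in range(n_bins)
    ]
    return {"n_bins": n_bins, "bin_probs": bin_probs}


def apply_calibrator(calibrator, scores):
    """Map raw scores to the calibrated probability of their bin."""
    n = calibrator["n_bins"]
    return [calibrator["bin_probs"][min(int(s * n), n - 1)] for s in scores]


# Step 2 (sketch): fit on a toy labeled calibration subset and save the artifact.
cal_scores = [0.95, 0.92, 0.85, 0.55, 0.45, 0.15]
cal_labels = [1, 1, 0, 1, 0, 0]
calibrator = fit_histogram_calibrator(cal_scores, cal_labels)

artifact_path = Path(tempfile.mkdtemp()) / "calibrator.json"  # hypothetical file name
artifact_path.write_text(json.dumps(calibrator))

# Step 3 (sketch): reload the artifact and apply it to new raw scores.
loaded = json.loads(artifact_path.read_text())
calibrated = apply_calibrator(loaded, [0.9, 0.5, 0.1])
print(calibrated)
```

Keeping the calibrator as a separate artifact means it can be refit on a new labeled subset without retraining the base model, and inference can run with or without it.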

Contributing

To contribute fixes or improvements, open an issue or a pull request.