YAML-driven framework for mapping research data to PCGL data model.
# Clone the repository
git clone https://github.com/Pan-Canadian-Genome-Library/pcgl-data-mapper.git
cd pcgl-data-mapper
# Install dependencies
pip install -r requirements.txtInput (Two Modes):
-
Single-file mode :
- One data file with all data (CSV, TSV, or TXT format)
- Example:
data/source/EXAMPLE.csv
-
Multi-file mode :
- Directory with multiple data files (CSV, TSV, or TXT)
- Split by domain: e.g,
demographics.csv,clinical.tsv,lab_results.txt - Automatically joins files per entity configuration
- Example:
data/source/MyStudy/
Supported File Formats:
.csv- Comma-separated values.tsv- Tab-separated values.txt- Text files (delimiter auto-detected)
Output:
- Mapped CSV files - One file per entity, written to specified output directory
- Default location:
data/mapped/{StudyName}/ - Files:
participant.csv,diagnosis.csv,comorbidity.csv, etc. - Each file contains data in PCGL schema format
- Default location:
- Mapping summary -
data/mapped/{StudyName}/mapping_summary.txtwith processing statistics and validation results - Log file -
prototype_mapper.login the root directory (overwrites each run)
python prototype_mapper.py --study_id EXAMPLE --input_csv data/source/EXAMPLE.csv --output_dir data/mapped/EXAMPLE/cp -r studies/_TEMPLATE studies/MyStudyCustomize entity YAML files to match your source data fields and target PCGL schema.
Single-file mode (one CSV with all data):
python prototype_mapper.py --study_id MyStudy --input_csv data/source/your_data.csv --output_dir data/mapped/MyStudy/Multi-file mode (directory with multiple files):
python prototype_mapper.py --study_id MyStudy --input_dir data/source/MyStudy/ --output_dir data/mapped/MyStudy/- Mapped CSV files:
data/mapped/MyStudy/*.csv - Mapping summary:
data/mapped/MyStudy/mapping_summary.txt - Detailed logs:
prototype_mapper.log
data_mapper/
├── studies/
│ ├── EXAMPLE/ # Simple working example for testing
│ ├── _TEMPLATE/ # Starting point for new studies
│ ├── HostSeq/ # Production example (15 entities)
│ └── YourStudy/
│ ├── config/ # YAML entity configs (required)
│ ├── participant.yaml
│ ├── diagnosis.yaml
│ └── ...
├── core/ # Framework code (generic, reusable)
├── data/
│ ├── source/ # Input CSV files
│ │ └── EXAMPLE.csv # Example input data
│ └── mapped/ # Output directory
│ └── EXAMPLE/ # Example output
└── prototype_mapper.py # Main script
The mapper runs 5 steps for each entity:
- Preprocess - Apply filters, merge baseline fields, then clean/generate fields (YAML:
filters,preprocessing) - Map - Transform fields (YAML:
mappings,configs, pattern:direct/expansion) - Post-process - Type conversion, cleanup (YAML:
post_processing) - Validate - Quality checks (YAML:
validations) - Output - Export CSV, log stats
See core/README.md for pipeline flowchart and details.
Each entity has a YAML file with up to 7 sections. Most entities only need entity and mappings.
Minimal example:
entity:
name: Participant
pattern: direct # Options: 'direct' or 'expansion'
fields:
base: [submitter_participant_id, study_id]
mappings:
# Direct copy
- source_field: id
target_field: submitter_participant_id
# Constant value
- target_field: study_id
default_value: "MyStudy"
# Value mapping
- source_field: vital_status_code
target_field: vital_status
source_type: radio
value_mappings:
0: "Alive"
1: "Deceased"
# Age calculation
- target_field: age_at_enrollment
target_type: age
params:
birth_date_field: dob
event_date_field: enrollment_dateComplete config sections:
entity- Name, schema, fields, pattern (required)filters- Participant eligibility check and row filtering (optional)preprocessing- Data cleaning and generation applied to source data before mapping (optional)mappings- Field transforms (required)configs- Expansion configs for checkboxes (expansion pattern only)post_processing- Type conversion, cleanup (optional)validations- Quality checks (optional)
Mapping patterns:
# Direct (1:1)
entity:
pattern: direct
# Expansion (1:N checkboxes)
entity:
pattern: expansion
params:
selection_type: radio
skip_values: [0, -1]See studies/README.md for complete YAML documentation and all supported options.
Empty output
- Check
source_fieldnames match CSV columns exactly (case-sensitive) - Verify filters aren't excluding all rows
- Check required field validations
Age calculation fails
- Verify birth date and event date fields exist in source
- Check date formats are parseable
- Use
age_fallback_fieldif that's applicable to your use case
Records filtered out
- Check logics defined in
filters - Verify
validationssettings - Review
post_processconditions
Run with verbose logging: check prototype_mapper.log for details.
EXAMPLE study - Simple working example for testing and learning
- 10 entity configs with straightforward mappings
- Uses
data/source/EXAMPLE.csvas input - Direct mapping entities: participant, demographic, specimen, sample, diagnosis, sociodemographic
- Expansion mapping entities: comorbidity, medication, treatment, measurement
- See
studies/EXAMPLE/config/*.yamlfor basic configuration patterns - Run with:
python prototype_mapper.py --study_id EXAMPLE --input_csv data/source/EXAMPLE.csv --output_dir data/mapped/EXAMPLE/
HostSeq study - Complete production example with 15 entities (13 base + 2 extension)
- Direct mapping entities: participant, demographic, specimen, sample, diagnosis, sociodemographic, demographic, hla
- Expansion mapping entities: comorbidity, medication, treatment, phenotype, procedure, measurement
- See
studies/HostSeq/config/*.yamlfor advanced working configs
_TEMPLATE - Starting point for new studies with pre-configured YAML files and inline documentation.
-
studies/README.md - YAML configuration guide
- All 7 config sections with examples
- Supported preprocessing/filter/validation types
- Pattern configuration (direct, expansion, custom)
- Study-specific customization
-
core/README.md - Framework internals
- Mapping pipeline flowchart
- Key classes (
EntityMapper,StudyDataMapper,MappingConfig) - Record transforms and utilities
- Usage examples