Add WHO Tuberculosis Preventive Treatment Import#2000
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
There was a problem hiding this comment.
Code Review
This pull request introduces a new data import for WHO Tuberculosis preventive treatment statistics, including a Python download script, a manifest for automation, and necessary mapping and metadata files. The review identified several critical issues: filename mismatches in the manifest script, an incorrect column count in the metadata configuration, and a logic error in the PV mapping where a redundant scaling factor would corrupt the percentage values. Additionally, the README requires updates to fix a header typo and correct the processing instructions, and the manifest's provenance URL needs formatting corrections.
| "provenance_description": "Tuberculosis: Percentage of household contacts (or all close contacts) who were started on TB preventive treatment out of those eligible", | ||
| "scripts": [ | ||
| "tb_data_download_who.py", | ||
| "../../../tools/statvar_importer/stat_var_processor.py --input_data=input_files/Tuberculosis_preventive_treatment.csv --pv_map=tuberculosis_preventive_pvmap.csv --config_file=metadata.csv --output_path=output/tuberculosis_preventive_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf" |
There was a problem hiding this comment.
There are filename mismatches in the script command which will cause the import to fail:
--pv_mapis set totuberculosis_preventive_pvmap.csv, but the actual filename istuberculosis_PreventiveTreatment_pv_mapping.csv.--config_fileis set tometadata.csv, but the actual filename istuberculosis_PreventiveTreatment_metadata.csv.
Please correct these filenames to match the files in the repository. Also, ensure that file paths are not quoted as per repository rules.
| "../../../tools/statvar_importer/stat_var_processor.py --input_data=input_files/Tuberculosis_preventive_treatment.csv --pv_map=tuberculosis_preventive_pvmap.csv --config_file=metadata.csv --output_path=output/tuberculosis_preventive_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf" | |
| "../../../tools/statvar_importer/stat_var_processor.py --input_data=input_files/Tuberculosis_preventive_treatment.csv --pv_map=tuberculosis_PreventiveTreatment_pv_mapping.csv --config_file=tuberculosis_PreventiveTreatment_metadata.csv --output_path=output/tuberculosis_preventive_output --existing_statvar_mcf=gs://unresolved_mcf/scripts/statvar/stat_vars.mcf" |
References
- Do not quote arguments that represent file paths in manifest.json scripts if they are not strictly required for the command to function correctly.
| populationType,HouseholdContact | ||
| measuredProperty,count | ||
| header_rows,1 | ||
| mapped_columns,6 |
There was a problem hiding this comment.
| iso3,observationAbout,dcid:country/{Data},,,,,,,,,,,,,, | ||
| YEAR,observationDate,@Data,,,,,,,,,,,,,, | ||
| DISAGGR_1:0-4 years,age,dcid:Years0To4,,,,,,,,,,,,,, | ||
| VALUE,value,@Data,populationType,dcid:Person,measuredProperty,dcid:count,measurementDenominator,dcid:Count_Person_Tuberculosis_EligibleForPreventiveTreatment,medicalCondition,dcid:Tuberculosis,medicalStatus,dcid:StartedOnTBPreventiveTreatment,unit,dcid:Percent,scalingFactor,100 No newline at end of file |
There was a problem hiding this comment.
The input data values are already percentages (e.g., 46.0). Using scalingFactor,100 will incorrectly multiply these values by 100, resulting in values like 4600%. Since the unit is already dcid:Percent, the scalingFactor is not needed and should be removed. The mapping file should be curated to handle the data as-is.
VALUE,value,@Data,populationType,dcid:Person,measuredProperty,dcid:count,measurementDenominator,dcid:Count_Person_Tuberculosis_EligibleForPreventiveTreatment,medicalCondition,dcid:Tuberculosis,medicalStatus,dcid:StartedOnTBPreventiveTreatment,unit,dcid:Percent
References
- Input data files, and test data files that mirror them, should not be modified to fix typos or inconsistencies. The corresponding mapping file (e.g., pvmap.csv) must be curated to handle the data as-is.
| @@ -0,0 +1,54 @@ | |||
| # WHO Tuberculosis: Percentage of household contacts (or all close contacts) who were started on TB preventive treatment out of those eligible | |||
There was a problem hiding this comment.
There appears to be a typo in the main header. The # should be #.
| # WHO Tuberculosis: Percentage of household contacts (or all close contacts) who were started on TB preventive treatment out of those eligible | |
| # WHO Tuberculosis: Percentage of household contacts (or all close contacts) who were started on TB preventive treatment out of those eligible |
| **For Test Data Run** | ||
| ```bash | ||
| python3 tools/statvar_importer/stat_var_processor.py \ | ||
| --input_data="statvar_imports/tuberculosis_preventive_treatment/source_files/Tuberculosis_preventive_treatment.csv" \ | ||
| --pv_map="statvar_imports/tuberculosis_preventive_treatment/tuberculosis_PreventiveTreatment_pv_mapping.csv" \ | ||
| --output_path="statvar_imports/tuberculosis_preventive_treatment/output_files/tuberculosis_PreventiveTreatment" \ | ||
| --config_file="statvar_imports/tuberculosis_preventive_treatment/tuberculosis_PreventiveTreatment_metadata.csv" \ | ||
| --existing_statvar_mcf="gs://unresolved_mcf/scripts/statvar/stat_vars.mcf" | ||
| ``` | ||
|
|
||
| **For Main data run** | ||
| ```bash | ||
| python3 tools/statvar_importer/stat_var_processor.py \ | ||
| --input_data="statvar_imports/tuberculosis_preventive_treatment/source_files/Tuberculosis_preventive_treatment.csv" \ | ||
| --pv_map="statvar_imports/tuberculosis_preventive_treatment/tuberculosis_PreventiveTreatment_pv_mapping.csv" \ | ||
| --output_path="statvar_imports/tuberculosis_preventive_treatment/output_files/tuberculosis_PreventiveTreatment" \ | ||
| --config_file="statvar_imports/tuberculosis_preventive_treatment/tuberculosis_PreventiveTreatment_metadata.csv" \ | ||
| --existing_statvar_mcf="gs://unresolved_mcf/scripts/statvar/stat_vars.mcf" | ||
| ``` |
There was a problem hiding this comment.
The processing instructions have several issues that could confuse users and prevent the commands from running correctly:
- The commands for 'Test Data Run' and 'Main data run' are identical, which is redundant and the test command does not use test data.
- The
--input_datapath points to.../source_files/..., but the download script saves to.../input_files/.... For testing, it should point to the test data in the.../testdata/directory. - The main data run command is also incorrect and should point to
input_files.
Please correct the test command to use the test data and update the main data run command to use the correct path.
| "curator_emails": [ | ||
| "support@datacommons.org" | ||
| ], | ||
| "provenance_url": "<https://data.who.int/indicators/i/45274BD/F5556F8>", |
There was a problem hiding this comment.
The provenance_url is enclosed in angle brackets (< >). These are typically not part of the URL and should be removed to ensure the link is valid.
| "provenance_url": "<https://data.who.int/indicators/i/45274BD/F5556F8>", | |
| "provenance_url": "https://data.who.int/indicators/i/45274BD/F5556F8", |
Please find the CL/PR Checklist:[https://docs.google.com/spreadsheets/d/1fmOgPpbf3zao7ouz8elEKxTtWmOxj0OrOEkUQzk92yI/edit?usp=drive_link&resourcekey=0--OoqVHRDwzvT84pVUGSFMA]