-
Notifications
You must be signed in to change notification settings - Fork 36
1655 dataset builder from define xml #1677
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
DmitryMK
wants to merge
8
commits into
main
Choose a base branch
from
1655-dataset-builder-from-define-xml
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+241
−33
Open
Changes from all commits
Commits
Show all changes
8 commits
Select commit
Hold shift + click to select a range
1a1dbb4
Using outer join for define.xml metadata. Making dataset+define build…
DmitryMK ec3d65b
Updating the test, as now contents is read when building a dataset wi…
DmitryMK 9ebe4c6
Formatting with black
DmitryMK 508efb8
Merging the main branch. Resolving merge conflict
DmitryMK 6033b20
Updated the new test according to the changes in the main branch
DmitryMK 9cbe5b9
Merge branch 'main' into 1655-dataset-builder-from-define-xml
gerrycampion 62a41e0
Merge branch 'main' into 1655-dataset-builder-from-define-xml
gerrycampion 9aaeb8a
Merge branch 'main' into 1655-dataset-builder-from-define-xml
gerrycampion File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
148 changes: 148 additions & 0 deletions
148
tests/unit/test_dataset_builders/test_variables_metadata_with_define_dataset_builder.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,148 @@ | ||
| from unittest.mock import MagicMock, patch | ||
| from cdisc_rules_engine.models.dataset.pandas_dataset import PandasDataset | ||
| from cdisc_rules_engine.models.sdtm_dataset_metadata import SDTMDatasetMetadata | ||
| from cdisc_rules_engine.services.cache.in_memory_cache_service import ( | ||
| InMemoryCacheService, | ||
| ) | ||
| from cdisc_rules_engine.services.data_services import LocalDataService | ||
| from pathlib import Path | ||
| import pandas as pd | ||
| from cdisc_rules_engine.dataset_builders.variables_metadata_with_define_dataset_builder import ( | ||
| VariablesMetadataWithDefineDatasetBuilder, | ||
| ) | ||
|
|
||
| resources_path: Path = Path(__file__).parent.parent.parent.joinpath("resources") | ||
| test_define_file_path: Path = resources_path.joinpath("test_defineV21-SDTM.xml") | ||
|
|
||
|
|
||
| @patch("cdisc_rules_engine.services.data_services.LocalDataService.get_dataset") | ||
| @patch( | ||
| "cdisc_rules_engine.services.data_services.LocalDataService.get_variables_metadata" | ||
| ) | ||
| @patch( | ||
| "cdisc_rules_engine.services.data_services.LocalDataService.get_define_xml_contents" | ||
| ) | ||
| def test_build_combined_metadata( | ||
| mock_get_define_xml, | ||
| mock_get_variables_metadata, | ||
| mock_get_dataset, | ||
| ): | ||
| """Test that the builder correctly combines two metadata sources""" | ||
| # Load actual define XML content | ||
| with open(test_define_file_path, "rb") as f: | ||
| define_data = f.read() | ||
| mock_get_define_xml.return_value = define_data | ||
|
|
||
| # Setup variables metadata | ||
| mock_get_variables_metadata.return_value = pd.DataFrame.from_dict( | ||
| { | ||
| "variable_name": ["STUDYID", "USUBJID", "AETERM"], | ||
| "variable_label": ["Study ID", "Subject ID", "AE Term"], | ||
| "variable_size": [16, 16, 200], | ||
| "variable_order_number": [1, 2, 9], | ||
| "variable_data_type": ["Char", "Char", "Char"], | ||
| } | ||
| ) | ||
|
|
||
| # Setup dataset contents | ||
| mock_get_dataset.return_value = PandasDataset( | ||
| pd.DataFrame.from_dict( | ||
| { | ||
| "STUDYID": ["STUDY1", "STUDY1", "STUDY1"], | ||
| "USUBJID": ["SUBJ1", "", "SUBJ3"], | ||
| "AETERM": ["Headache", "Nausea", ""], | ||
| } | ||
| ) | ||
| ) | ||
|
|
||
| # Create builder | ||
| builder = VariablesMetadataWithDefineDatasetBuilder( | ||
| rule=None, | ||
| data_service=LocalDataService(MagicMock(), MagicMock(), MagicMock()), | ||
| cache_service=InMemoryCacheService(), | ||
| rule_processor=None, | ||
| data_processor=None, | ||
| dataset_metadata=SDTMDatasetMetadata( | ||
| name="AE", | ||
| first_record={"DOMAIN": "AE"}, | ||
| full_path=str(test_define_file_path), | ||
| ), | ||
| define_xml_path=str(test_define_file_path), | ||
| standard="sdtmig", | ||
| standard_version="3-4", | ||
| standard_substandard=None, | ||
| ) | ||
|
|
||
| result = builder.build() | ||
|
|
||
| expected_columns = { | ||
| "variable_name", | ||
| "variable_label", | ||
| "variable_size", | ||
| "variable_order_number", | ||
| "variable_data_type", | ||
| "define_variable_name", | ||
| "define_variable_label", | ||
| "define_variable_data_type", | ||
| "define_variable_is_collected", | ||
| "define_variable_role", | ||
| "define_variable_size", | ||
| "define_variable_ccode", | ||
| "define_variable_format", | ||
| "define_variable_allowed_terms", | ||
| "define_variable_origin_type", | ||
| "define_variable_has_no_data", | ||
| "define_variable_order_number", | ||
| "define_variable_length", | ||
| "define_variable_has_codelist", | ||
| "define_variable_codelist_coded_values", | ||
| "define_variable_codelist_coded_codes", | ||
| "define_variable_mandatory", | ||
| "define_variable_has_comment", | ||
| "define_variable_has_method", | ||
| "variable_has_empty_values", | ||
| "variable_is_empty", | ||
| } | ||
| assert set(result.columns.tolist()) == expected_columns | ||
|
|
||
| core_variables = ["STUDYID", "USUBJID", "AETERM"] | ||
| for var in core_variables: | ||
| row = result[result["variable_name"] == var].iloc[0] | ||
| assert row["define_variable_name"] == var | ||
| assert row["variable_has_empty_values"] == (var in ["USUBJID", "AETERM"]) | ||
|
|
||
| studyid_row = result[result["variable_name"] == "STUDYID"].iloc[0] | ||
| assert studyid_row["variable_size"] == 16.0 | ||
| assert studyid_row["variable_order_number"] == 1.0 | ||
| assert studyid_row["variable_data_type"] == "Char" | ||
| assert studyid_row["define_variable_role"] == "Identifier" | ||
| assert not studyid_row["variable_has_empty_values"] | ||
| assert not studyid_row["variable_is_empty"] | ||
|
|
||
| usubjid_row = result[result["variable_name"] == "USUBJID"].iloc[0] | ||
| assert usubjid_row["variable_size"] == 16.0 | ||
| assert usubjid_row["variable_order_number"] == 2.0 | ||
| assert usubjid_row["variable_data_type"] == "Char" | ||
| assert usubjid_row["define_variable_role"] == "Identifier" | ||
| assert usubjid_row["variable_has_empty_values"] | ||
| assert not usubjid_row["variable_is_empty"] | ||
|
|
||
| aeterm_row = result[result["variable_name"] == "AETERM"].iloc[0] | ||
| assert aeterm_row["variable_size"] == 200.0 | ||
| assert aeterm_row["variable_order_number"] == 9.0 | ||
| assert aeterm_row["variable_data_type"] == "Char" | ||
| assert aeterm_row["define_variable_role"] == "Topic" | ||
| assert aeterm_row["variable_has_empty_values"] | ||
| assert not aeterm_row["variable_is_empty"] | ||
|
|
||
| # We need to check that the rest of the variables are coming from define.xml (variable_name is NaN) | ||
| assert ( | ||
| len( | ||
| result[ | ||
| (result["variable_name"] == "") & result["define_variable_name"].notna() | ||
| ] | ||
| ) | ||
| == 34 | ||
| ) | ||
|
|
||
| assert len(result) == 37 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we make the Nan's a empty string "" we lose information if the columns was missing or it is just empty. I think better would be to handle this in a different way. Please let me know if I am not understanding the real intent here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Ramil, I did it for consistency with metadata+define+library dataset builder, as it had fillna("") in it.
What is the difference between missing and empty column? In my understanding it either comes from dataset(metadata) or define. So the missing values logically appear only when it is present in define only. I might be wrong here.