Skip to content

702: Add regex support for target sorting in target_is_sorted_by operator#1705

Open
alexfurmenkov wants to merge 14 commits into
mainfrom
702-target-is-sorted-by-regex
Open

702: Add regex support for target sorting in target_is_sorted_by operator#1705
alexfurmenkov wants to merge 14 commits into
mainfrom
702-target-is-sorted-by-regex

Conversation

@alexfurmenkov
Copy link
Copy Markdown
Collaborator

No description provided.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated schema has not been merged with markdown descriptions. Please run the "Merge Schema with Markdown Descriptions" workflow to update the merged schema files.

@alexfurmenkov alexfurmenkov changed the title Add regex support for target sorting in target_is_sorted_by operator 702: Add regex support for target sorting in target_is_sorted_by operator Apr 23, 2026
@alexfurmenkov alexfurmenkov linked an issue Apr 23, 2026 that may be closed by this pull request
Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated schema has not been merged with markdown descriptions. Please run the "Merge Schema with Markdown Descriptions" workflow to update the merged schema files.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated schema has not been merged with markdown descriptions. Please run the "Merge Schema with Markdown Descriptions" workflow to update the merged schema files.

Copy link
Copy Markdown

@github-actions github-actions Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated schema has not been merged with markdown descriptions. Please run the "Merge Schema with Markdown Descriptions" workflow to update the merged schema files.

target_for_sorting = f"{target}_extracted"
# Sort by within columns only, preserve original order within groups
sorted_df = working_df.sort_values(
by=within_columns,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we would need to sort by within_columns and extracted target not preserve the order for regex branch.

"markdownDescription": "\nTrue if the values in name are ordered according to the values specified by value\nin ascending/descending order, grouped by the values in within. Each value entry\nrequires a variable name, a sort_order of asc or desc, and an optional\nnull_position of first or last (defaults to last) which controls where null/empty\ncomparator values are placed in the expected ordering. Within accepts either a\nsingle column or an ordered list of columns. Columns can be either number or Char\nDates in ISO8601 YYYY-MM-DD format. Date value(s) with different precisions that\noverlap (e.g. 2005-10, 2005-10-3 and 2005-10-08) are all flagged as not sorted as\ntheir order cannot be inferred.\n\nOptionally supports a `regex` parameter that extracts a portion of the target\nvalue for sorting. The regex must contain at least one capturing group. The first\ncaptured group is extracted and converted to numeric if possible, allowing proper\nsorting of sequence numbers (e.g., \"MIDS1\", \"MIDS2\", ..., \"MIDS10\" with regex\n`.*?(\\\\d+)$`). This is particularly useful for variables that end with sequence\nnumbers that may or may not be zero-padded.\n\n```yaml\nCheck:\n all:\n - name: --SEQ\n within:\n - USUBJID\n - MIDSTYPE\n operator: target_is_sorted_by\n value:\n - name: --STDTC\n sort_order: asc\n null_position: last\n```\n\nExample with regex for extracting sequence numbers:\n\n```yaml\nCheck:\n all:\n - name: MIDS\n operator: target_is_sorted_by\n regex: \".*?(\\\\d+)$\" # Extract trailing digits, convert to numeric\n value:\n - name: SMSTDTC\n sort_order: asc\n within:\n - USUBJID\n - MIDSTYPE\n```\n"
}
},
"required": ["operator", "value", "within"],
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should add the new regex property here.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s not very visible here, but I actually added information about the regex. It’s easier to view it in the editor.

Copy link
Copy Markdown
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I executed the rule CG0546 in dev editor. I used the dataset from folder CG0545 in sharepiont as there was no dataset for CG0546. I made a change and added the suffix to the MIDSTYPE column records in the SM dataset. The updated dataset.is attached.

I get the following error

{
  "SM": [
    {
      "executionStatus": "execution error",
      "dataset": "SM",
      "domain": "SM",
      "variables": [],
      "message": "rule evaluation error - operation failed",
      "errors": [
        {
          "dataset": "SM",
          "error": "Error occurred during operation execution",
          "message": "Failed to execute rule operation. Operation: record_count, Target: None, Domain: SM, Error: single positional indexer is out-of-bounds"
        }
      ]
    }
  ],
  "TM": [
    {
      "executionStatus": "skipped",
      "dataset": "TM",
      "domain": "TM",
      "variables": [],
      "message": "Rule skipped - doesn't apply to domain for rule id=CDISC.SDTMIG.CG0546, dataset=TM",
      "errors": [
        {
          "dataset": "TM",
          "error": "Outside scope",
          "message": "Rule skipped - doesn't apply to domain for rule id=CDISC.SDTMIG.CG0546, dataset=TM"
        }
      ]
    }
  ]
}

unit-test-coreid-CG0545-negative 1.xlsx

Please let me know if I updated the dataset incorrectly.

Copy link
Copy Markdown
Collaborator

@gerrycampion gerrycampion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit tests look good to me, by still waiting on @alexfurmenkov to address @RamilCDISC's test results

@alexfurmenkov
Copy link
Copy Markdown
Collaborator Author

Hi @RamilCDISC ,
The initial test dataset unit-test-coreid-CG0545-negative.xlsx contained errors that prevented rule CG0546 from executing (status SKIPPED): the MIDSTYPE field in the SM dataset had values "DIAGNOSIS1" and "DIAGNOSIS2" instead of "DIAGNOSIS" (as in TM), which blocked dataset merging, and the MIDS field contained "DIAG" without trailing digits, failing the regex validation. After fixing these basic issues (changed MIDSTYPE to "DIAGNOSIS" and MIDS to "DIAG1"/"DIAG2"), two correct datasets were created: a positive test with proper order (DIAG1→2005-10, DIAG2→2007-10) and a negative test with violated order (DIAG2→2005-10, DIAG1→2007-10).

When testing the target_is_sorted_by operator in isolation, it worked correctly: for positive data it returned [True, True] (data is valid), for negative data it returned [False, False] (data is invalid). However, during full validation through the engine, a complete logical inversion was discovered: the positive dataset with correct data returned status "ISSUE REPORTED" with 2 errors (should be SUCCESS with 0 errors), while the negative dataset with incorrect order returned status "SUCCESS" with 0 errors (should be ISSUE REPORTED with 2 errors). This means the rule works completely backwards: valid data is marked as invalid (false positives), and invalid data is passed through (false negatives).

The problem is that the rule uses the operator's result directly for error generation: when target_is_sorted_by returns True (which semantically means "data is properly sorted"), the engine generates an error, and when it returns False (data is improperly sorted) - it does not generate an error.

The quickest fix is to use the target_is_NOT_sorted_by operator instead of target_is_sorted_by in the rule, which inverts the logic: the operator will return True when order is violated, and the rule will correctly generate errors only for incorrect data.

I attach correct datasets
unit-test-coreid-CG0545-negative-FIXED.xlsx
unit-test-coreid-CG0545-positive.xlsx

Copy link
Copy Markdown
Collaborator

@RamilCDISC RamilCDISC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR adds regex support for target_is _sorted_by_operator. The validation was done by:

  1. Reviewing the PR for any unwanted code or comments.
  2. Reviewing the PR logic in accordance with AC.
  3. Ensuring all unit and regression testing pass.
  4. Ensuring relevant testing is updated.
  5. Ensuring documentation is updated.
  6. Ensuring the regex support is properly implemented.
  7. Verifying the regex implementation in the operator will handle edge cases.
  8. Verifying the regex implementation is similar to other operators.
  9. Running manual validation using dev editor using negative dataset.
  10. Running manual validation using dev editor using positive dataset.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rule blocked: CORERULES-246

4 participants