Refactor SQL queries and add auto-subsampling feature #208

Merged
singjc merged 5 commits into PyProphet:master from singjc:master
May 6, 2026

singjc (Contributor) commented May 6, 2026

This pull request introduces an automatic subsampling feature for semi-supervised learning in the scoring workflow, aiming to improve efficiency when handling large datasets with many runs. Additionally, it makes several improvements and corrections to alignment feature queries and adds a utility function for determining the number of runs in various file formats.

Automatic subsampling and scoring improvements:

  • Added logic in score.py to automatically set the subsample_ratio to 1/N (where N is the number of runs) when the default value is used and the input contains more than 20 runs, optimizing performance for large datasets. Users can disable this auto-subsampling by setting --subsample_ratio -1. The CLI help text was updated to document this behavior.
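
The rule described above can be sketched roughly as follows. This is an illustration only, not the code in score.py: the function name `resolve_subsample_ratio` is hypothetical, the default ratio of 1.0 is an assumption, and so is the choice to treat the -1 opt-out as "use all data".

```python
DEFAULT_RATIO = 1.0      # ASSUMPTION: CLI default value, not confirmed by this PR text
AUTO_RUN_THRESHOLD = 20  # from the PR text: auto-subsample only above 20 runs
DISABLE_SENTINEL = -1.0  # from the PR text: --subsample_ratio -1 opts out


def resolve_subsample_ratio(requested: float, num_runs: int) -> float:
    """Hypothetical helper: pick the ratio used for semi-supervised learning."""
    if requested == DISABLE_SENTINEL:
        # ASSUMPTION: opting out means learning on the full dataset.
        return 1.0
    if requested == DEFAULT_RATIO and num_runs > AUTO_RUN_THRESHOLD:
        # Default left untouched and many runs: use 1/N.
        return 1.0 / num_runs
    # An explicit user value, or a small input, is passed through unchanged.
    return requested


ratio_large = resolve_subsample_ratio(1.0, 40)      # auto-subsampled to 1/40
ratio_small = resolve_subsample_ratio(1.0, 20)      # exactly 20 runs: unchanged
ratio_opt_out = resolve_subsample_ratio(-1.0, 40)   # opt-out
ratio_explicit = resolve_subsample_ratio(0.5, 40)   # explicit value is kept
```

Under these assumptions, a user who passes the default value explicitly is indistinguishable from one who omitted the flag, which is presumably why a separate sentinel is offered as the opt-out.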

Alignment feature query corrections:

  • Changed a LEFT JOIN to an INNER JOIN and made the PEP threshold exclusive in the DuckDB alignment feature query, ensuring only features with PEP strictly less than the threshold are included.
  • Refactored the SQLite alignment feature query to correctly select and join features, handle groupings, and apply the PEP threshold, improving data accuracy and compatibility.
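
The join and threshold semantics described above can be illustrated with a small in-memory SQLite example. The table and column names (`feature`, `alignment`, `pep`) are hypothetical and much simpler than the actual OSW schema; the sketch only shows the effect of an INNER JOIN combined with a strict `<` comparison.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    -- Hypothetical toy schema, not the real OSW tables.
    CREATE TABLE feature (id INTEGER PRIMARY KEY, run_id INTEGER);
    CREATE TABLE alignment (feature_id INTEGER, pep REAL);
    INSERT INTO feature VALUES (1, 10), (2, 10), (3, 11), (4, 11);
    -- Feature 4 deliberately has no alignment row.
    INSERT INTO alignment VALUES (1, 0.01), (2, 0.05), (3, 0.20);
    """
)


def aligned_feature_ids(connection, pep_threshold):
    """Return ids of features whose alignment PEP is strictly below the threshold."""
    rows = connection.execute(
        """
        SELECT f.id
        FROM feature AS f
        INNER JOIN alignment AS a ON a.feature_id = f.id
        WHERE a.pep < ?
        ORDER BY f.id
        """,
        (pep_threshold,),
    ).fetchall()
    return [row[0] for row in rows]


ids = aligned_feature_ids(con, 0.05)
con.close()
```

Here feature 2 sits exactly at the threshold and is excluded by the strict comparison, and feature 4 is dropped by the INNER JOIN because it has no alignment row.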

Utility enhancements:

  • Added a new get_num_runs function in io/util.py to robustly determine the number of runs in input files across supported formats (OSW, Parquet, split Parquet, and TSV), supporting the new auto-subsampling logic.
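
A minimal sketch of format-dependent run counting is given below. It is not the get_num_runs implementation from io/util.py: the name `count_runs` is hypothetical, only the OSW (SQLite) and TSV cases are covered to stay within the standard library, and the `RUN` table and `run_id` column names are assumptions about the input layout.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def count_runs(path, tsv_run_column="run_id"):
    """Hypothetical helper: count distinct runs in an OSW or TSV file."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".osw":
        # ASSUMPTION: the OSW file is a SQLite database with one row per run in RUN.
        con = sqlite3.connect(path)
        try:
            return con.execute("SELECT COUNT(*) FROM RUN").fetchone()[0]
        finally:
            con.close()
    if suffix == ".tsv":
        # ASSUMPTION: each row carries a run identifier in `tsv_run_column`.
        with path.open(newline="") as handle:
            reader = csv.DictReader(handle, delimiter="\t")
            return len({row[tsv_run_column] for row in reader})
    # Parquet and split-Parquet inputs are omitted from this sketch.
    raise ValueError(f"unsupported input format: {path.name}")


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    tsv_path = Path(tmp) / "demo.tsv"
    tsv_path.write_text("run_id\tpeptide\n1\tAAA\n1\tBBB\n2\tAAA\n")
    tsv_runs = count_runs(tsv_path)

    osw_path = Path(tmp) / "demo.osw"
    con = sqlite3.connect(osw_path)
    con.execute("CREATE TABLE RUN (ID INTEGER PRIMARY KEY, FILENAME TEXT)")
    con.executemany(
        "INSERT INTO RUN VALUES (?, ?)",
        [(1, "a.mzML"), (2, "b.mzML"), (3, "c.mzML")],
    )
    con.commit()
    con.close()
    osw_runs = count_runs(osw_path)
```

The returned count is the N that the 1/N auto-subsampling rule would consume.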

singjc and others added 3 commits May 5, 2026 20:05
…tion and ensure correct PEP threshold comparison

Co-authored-by: Copilot <copilot@github.com>
…t_num_runs utility function

Co-authored-by: Copilot <copilot@github.com>
Copilot AI review requested due to automatic review settings May 6, 2026 00:20
@singjc singjc enabled auto-merge May 6, 2026 00:20
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds automatic subsampling for semi-supervised learning in the score CLI to improve performance on large multi-run inputs, and corrects/refactors alignment-feature queries (DuckDB + SQLite) used in the IPF workflow. It also introduces a utility helper to determine the number of runs across supported input formats.

Changes:

  • Add get_num_runs() to infer run counts for OSW/Parquet/split-Parquet/TSV inputs.
  • Add auto-subsampling in pyprophet score when --subsample_ratio is left at the default and the input has >20 runs, with --subsample_ratio -1.0 as an opt-out.
  • Fix/refactor alignment feature SQL to ensure only features with PEP strictly below the threshold are included, and align SQLite logic more closely with DuckDB behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| pyprophet/io/util.py | Adds get_num_runs() helper used by scoring to decide whether to auto-subsample. |
| pyprophet/io/ipf/osw.py | Adjusts DuckDB/SQLite alignment-feature queries (join type and PEP threshold semantics; refactors SQLite feature selection/join). |
| pyprophet/cli/score.py | Implements auto-subsampling logic and updates CLI help text to document default behavior and opt-out. |

Comment thread pyprophet/io/util.py
Comment thread pyprophet/io/util.py Outdated
Comment thread pyprophet/cli/score.py
Comment thread pyprophet/cli/score.py
singjc and others added 2 commits May 5, 2026 20:33
@singjc singjc merged commit 7ef3f36 into PyProphet:master May 6, 2026
1 check passed