Refactor SQL queries and add auto-subsampling feature by singjc · Pull Request #208 · PyProphet/pyprophet

singjc · 2026-05-06T00:20:04Z

This pull request introduces an automatic subsampling feature for semi-supervised learning in the scoring workflow, aiming to improve efficiency when handling large datasets with many runs. Additionally, it makes several improvements and corrections to alignment feature queries and adds a utility function for determining the number of runs in various file formats.

Automatic subsampling and scoring improvements:

Added logic in score.py to automatically set the subsample_ratio to 1/N (where N is the number of runs) when the default value is used and the input contains more than 20 runs, optimizing performance for large datasets. Users can disable this auto-subsampling by setting --subsample_ratio -1. The CLI help text was updated to document this behavior. [1] [2] [3]
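As a rough illustration of the rule described above, here is a minimal sketch. The helper name `resolve_subsample_ratio`, the default value of 1.0, and the choice to map the `-1` opt-out to "use all data" are assumptions made for this example; they are not taken from score.py.

```python
def resolve_subsample_ratio(
    subsample_ratio: float,
    num_runs: int,
    default_ratio: float = 1.0,  # assumed CLI default, not confirmed by the PR text
    run_threshold: int = 20,     # "more than 20 runs" per the description
) -> float:
    """Hypothetical helper mirroring the auto-subsampling rule."""
    # Opt-out: -1 disables auto-subsampling (assumed here to mean "keep all data").
    if subsample_ratio == -1:
        return 1.0
    # Auto mode: only when the user left the default and the input is large.
    if subsample_ratio == default_ratio and num_runs > run_threshold:
        return 1.0 / num_runs
    # Any explicitly chosen ratio is respected as-is.
    return subsample_ratio


if __name__ == "__main__":
    for ratio, runs in [(1.0, 40), (1.0, 20), (-1, 40), (0.5, 40)]:
        print(ratio, runs, resolve_subsample_ratio(ratio, runs))
```

With 40 runs and the default left untouched, the sketch yields a ratio of 1/40; with exactly 20 runs nothing changes, since the condition is "more than 20".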

Alignment feature query corrections:

Changed a LEFT JOIN to an INNER JOIN and made the PEP threshold exclusive in the DuckDB alignment feature query, ensuring only features with PEP strictly less than the threshold are included.
Refactored the SQLite alignment feature query to correctly select and join features, handle groupings, and apply the PEP threshold, improving data accuracy and compatibility.
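To make the effect of the join type and the exclusive threshold concrete, here is a small self-contained sketch using an in-memory SQLite database. The table and column names (`feature`, `alignment`, `pep`) are hypothetical and are not the actual OSW schema or the queries in osw.py; the "loose" query is shown purely as an illustrative contrast.

```python
import sqlite3


def strict_feature_ids(conn: sqlite3.Connection, pep_threshold: float) -> list:
    """INNER JOIN with an exclusive threshold: only PEP < threshold survives."""
    rows = conn.execute(
        """
        SELECT f.feature_id
        FROM feature AS f
        INNER JOIN alignment AS a ON a.feature_id = f.feature_id
        WHERE a.pep < ?
        ORDER BY f.feature_id
        """,
        (pep_threshold,),
    ).fetchall()
    return [row[0] for row in rows]


def loose_feature_ids(conn: sqlite3.Connection, pep_threshold: float) -> list:
    """LEFT JOIN with an inclusive threshold in the ON clause, for contrast."""
    rows = conn.execute(
        """
        SELECT f.feature_id
        FROM feature AS f
        LEFT JOIN alignment AS a
          ON a.feature_id = f.feature_id AND a.pep <= ?
        ORDER BY f.feature_id
        """,
        (pep_threshold,),
    ).fetchall()
    return [row[0] for row in rows]


def make_demo_db() -> sqlite3.Connection:
    """Build a toy database; feature 4 has no alignment row at all."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE feature (feature_id INTEGER PRIMARY KEY);
        CREATE TABLE alignment (feature_id INTEGER, pep REAL);
        INSERT INTO feature VALUES (1), (2), (3), (4);
        INSERT INTO alignment VALUES (1, 0.10), (2, 0.50), (3, 0.90);
        """
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    print(strict_feature_ids(conn, 0.5))  # [1]
    print(loose_feature_ids(conn, 0.5))   # [1, 2, 3, 4]
```

In this toy setup the LEFT JOIN keeps every feature, including ones with no qualifying alignment row, while the INNER JOIN with `<` keeps only the feature whose PEP is strictly below the threshold.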

Utility enhancements:

Added a new get_num_runs function in io/util.py to robustly determine the number of runs in input files across supported formats (OSW, Parquet, split Parquet, and TSV), supporting the new auto-subsampling logic.

Copilot

Pull request overview

This PR adds automatic subsampling for semi-supervised learning in the score CLI to improve performance on large multi-run inputs, and corrects/refactors alignment-feature queries (DuckDB + SQLite) used in the IPF workflow. It also introduces a utility helper to determine the number of runs across supported input formats.

Changes:

Add get_num_runs() to infer run counts for OSW/Parquet/split-Parquet/TSV inputs.
Add auto-subsampling in pyprophet score when --subsample_ratio is left at the default and the input has >20 runs, with --subsample_ratio -1.0 as an opt-out.
Fix/refactor alignment feature SQL to ensure only features with PEP strictly below the threshold are included, and align SQLite logic more closely with DuckDB behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `pyprophet/io/util.py` | Adds `get_num_runs()` helper used by scoring to decide whether to auto-subsample. |
| `pyprophet/io/ipf/osw.py` | Adjusts DuckDB/SQLite alignment-feature queries (join type and PEP threshold semantics; refactors SQLite feature selection/join). |
| `pyprophet/cli/score.py` | Implements auto-subsampling logic and updates CLI help text to document default behavior and opt-out. |

singjc and others added 3 commits May 5, 2026 20:05

Refactor SQL queries in OSWReader to improve alignment feature extraction and ensure correct PEP threshold comparison
Co-authored-by: Copilot <copilot@github.com>

e8e1974

Add auto-subsampling feature based on number of runs and implement get_num_runs utility function
Co-authored-by: Copilot <copilot@github.com>

16915fb

Merge branch 'PyProphet:master' into master

6fee8bf

Copilot AI review requested due to automatic review settings May 6, 2026 00:20

Copilot started reviewing on behalf of singjc May 6, 2026 00:20

singjc enabled auto-merge May 6, 2026 00:20

Copilot AI reviewed May 6, 2026

Comment thread pyprophet/io/util.py

Comment thread pyprophet/io/util.py Outdated

Comment thread pyprophet/cli/score.py

Comment thread pyprophet/cli/score.py

singjc and others added 2 commits May 5, 2026 20:33

Fix subsample_ratio reference in score function and optimize get_num_runs for parquet files using DuckDB
Co-authored-by: Copilot <copilot@github.com>

58e7a6b

Merge branch 'master' of github.com:singjc/pyprophet

4604549

singjc merged commit 7ef3f36 into PyProphet:master May 6, 2026
1 check passed

singjc merged 5 commits into PyProphet:master from singjc:master
