Refactor SQL queries and add auto-subsampling feature #208
Merged
singjc merged 5 commits into PyProphet:master on May 6, 2026
…tion and ensure correct PEP threshold comparison Co-authored-by: Copilot <copilot@github.com>
…t_num_runs utility function Co-authored-by: Copilot <copilot@github.com>
Pull request overview
This PR adds automatic subsampling for semi-supervised learning in the score CLI to improve performance on large multi-run inputs, and corrects/refactors alignment-feature queries (DuckDB + SQLite) used in the IPF workflow. It also introduces a utility helper to determine the number of runs across supported input formats.
Changes:
- Add `get_num_runs()` to infer run counts for OSW/Parquet/split-Parquet/TSV inputs.
- Add auto-subsampling in `pyprophet score` when `--subsample_ratio` is left at the default and the input has >20 runs, with `--subsample_ratio -1.0` as an opt-out.
- Fix/refactor alignment feature SQL to ensure only features with PEP strictly below the threshold are included, and align SQLite logic more closely with DuckDB behavior.
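The auto-subsampling rule described in the changes above can be sketched as a small decision function. This is only an illustration, not the code in `pyprophet/cli/score.py`: the helper name `resolve_subsample_ratio`, the assumed default of `1.0` for `--subsample_ratio`, and the assumption that the `-1.0` opt-out falls back to using all data are not taken from the PR.

```python
DEFAULT_RATIO = 1.0   # ASSUMPTION: default value of --subsample_ratio
OPT_OUT = -1.0        # opt-out value named in the PR
RUN_THRESHOLD = 20    # auto-subsample only when the input has more runs than this


def resolve_subsample_ratio(requested: float, num_runs: int) -> float:
    """Hypothetical helper: return the ratio that scoring would end up using."""
    if requested == OPT_OUT:
        # Explicit opt-out. ASSUMPTION: this means "no subsampling".
        return 1.0
    if requested == DEFAULT_RATIO and num_runs > RUN_THRESHOLD:
        # Default left untouched on a large input: use 1/N of the data.
        return 1.0 / num_runs
    # Any other user-supplied ratio is respected as given.
    return requested
```

Under these assumptions, a 50-run input with the default ratio resolves to `1/50`, while a 20-run input or an explicit `0.5` is left unchanged.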
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `pyprophet/io/util.py` | Adds `get_num_runs()` helper used by scoring to decide whether to auto-subsample. |
| `pyprophet/io/ipf/osw.py` | Adjusts DuckDB/SQLite alignment-feature queries (join type and PEP threshold semantics; refactors SQLite feature selection/join). |
| `pyprophet/cli/score.py` | Implements auto-subsampling logic and updates CLI help text to document default behavior and opt-out. |
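For the OSW case, counting runs might look roughly like the sketch below. It is not the `get_num_runs()` implementation from this PR: the helper name `count_runs_sqlite` is hypothetical, and it assumes that an OSW file is a SQLite database with one row per run in a table named `RUN`. The demo builds an in-memory stand-in rather than reading a real file.

```python
import sqlite3


def count_runs_sqlite(con: sqlite3.Connection) -> int:
    """Hypothetical helper: count rows in the RUN table (ASSUMED OSW layout)."""
    (n,) = con.execute("SELECT COUNT(*) FROM RUN").fetchone()
    return n


# Demo on an in-memory database standing in for an OSW file.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE RUN (ID INTEGER PRIMARY KEY, FILENAME TEXT)")
demo.executemany(
    "INSERT INTO RUN (ID, FILENAME) VALUES (?, ?)",
    [(1, "run_a.mzML"), (2, "run_b.mzML"), (3, "run_c.mzML")],
)
n_runs = count_runs_sqlite(demo)
demo.close()
```

The Parquet, split-Parquet, and TSV formats would need their own counting logic, which this sketch does not attempt.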
…runs for parquet files using DuckDB Co-authored-by: Copilot <copilot@github.com>
This pull request introduces an automatic subsampling feature for semi-supervised learning in the scoring workflow, aiming to improve efficiency when handling large datasets with many runs. Additionally, it makes several improvements and corrections to alignment feature queries and adds a utility function for determining the number of runs in various file formats.
Automatic subsampling and scoring improvements:
- `score.py` now automatically sets `subsample_ratio` to `1/N` (where N is the number of runs) when the default value is used and the input contains more than 20 runs, which reduces the cost of semi-supervised learning on large datasets. Users can disable auto-subsampling by setting `--subsample_ratio -1`. The CLI help text was updated to document this behavior.

Alignment feature query corrections:
- Changed a `LEFT JOIN` to an `INNER JOIN` and made the PEP threshold exclusive in the DuckDB alignment feature query, ensuring only features with PEP strictly less than the threshold are included.

Utility enhancements:
- Added a `get_num_runs` function in `io/util.py` to determine the number of runs in input files across supported formats (OSW, Parquet, split Parquet, and TSV), supporting the new auto-subsampling logic.
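The effect of the join and threshold change can be illustrated with a toy example. The tables `feature` and `alignment`, their columns, and both queries are hypothetical and are not the queries in `pyprophet/io/ipf/osw.py`; the sketch uses Python's built-in `sqlite3` module only to show how a `LEFT JOIN` with an inclusive comparison differs from an `INNER JOIN` with a strict one.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE feature (id INTEGER PRIMARY KEY);
    CREATE TABLE alignment (feature_id INTEGER, pep REAL);
    INSERT INTO feature VALUES (1), (2), (3);
    -- Feature 2 sits exactly on the threshold; feature 3 has no alignment row.
    INSERT INTO alignment VALUES (1, 0.1), (2, 0.5);
    """
)
threshold = 0.5

# LEFT JOIN with an inclusive comparison: unmatched features survive with NULL.
left_rows = con.execute(
    "SELECT f.id, a.pep FROM feature f "
    "LEFT JOIN alignment a ON a.feature_id = f.id AND a.pep <= ? "
    "ORDER BY f.id",
    (threshold,),
).fetchall()

# INNER JOIN with a strict comparison: only features with PEP < threshold remain.
inner_rows = con.execute(
    "SELECT f.id, a.pep FROM feature f "
    "INNER JOIN alignment a ON a.feature_id = f.id "
    "WHERE a.pep < ? ORDER BY f.id",
    (threshold,),
).fetchall()
con.close()
```

In this toy data the first query returns all three features, including the one at exactly the threshold and the one without an alignment, while the second returns only feature 1.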