Batch Ensembl CDS, stabilize BioMart downloads, and improve PAE handling #80
Open
St3451 wants to merge 58 commits into master from dev/fix_build_datasets
+1,070
−243
31bac06
fix: enhance download_biomart_metadata function with retry logic and …
St3451 b268d2a
feat: improve get_ref_dna_from_ensembl_mp function with error handlin…
St3451 86796c0
refactor: streamline get_ref_dna_from_ensembl function by removing un…
St3451 fea7d2f
logs: enhance get_ref_dna_from_ensembl function with improved error h…
St3451 a402cde
docs: update organism specification in README to include full scienti…
St3451 d961730
refactor: simplify multiprocessing in get_ref_dna_from_ensembl_mp fun…
St3451 8c2ce07
fix: enhance download_biomart_metadata function with improved error h…
St3451 dbf0cd2
logs: improve error handling and logging in download_biomart_metadata…
St3451 519d92c
fix: improve error handling by removing partial BioMart metadata file…
St3451 2db7336
feat: add batch retrieval for Ensembl CDS DNA sequences and improve r…
St3451 3753b2c
feat: handle rate limiting in get_ref_dna_from_ensembl_batch function
St3451 826c239
increase max_attempts in download_biomart_metadata to 5
St3451 2cf9865
fix: add logging import to seq_for_mut_prob.py
St3451 2c006e3
remove outdated download_biomart_metadata function
St3451 08eb1bb
fix: handle empty dataframes in process_seq_df and process_seq_df_mane
St3451 40b3432
logs: enhance logging for BioMart download failures and improve error…
St3451 62e8df5
logs: add warning log for exceeding max attempts in Ensembl CDS batch…
St3451 a29831d
fix: add SSL verification for download_single_file in download_biomar…
St3451 5c2a26f
logs: update headers for Ensembl REST API and add wget option to disa…
St3451 4276b47
feat: add SSL option to download_single_file for secure downloads
St3451 b1f5722
refactor: remove unused datasets_dir parameter from process_seq_df fu…
St3451 6c7d4a4
refactor: update process_seq_df_mane to only download biomart metadat…
St3451 2029a9b
logs: add debug logging for BioMart download attempts in download_bio…
St3451 c01dd5c
feat: initialize Tri_context with NaN and apply transformation only t…
St3451 cd65b60
refactor: return NaN instead of empty string for DNA sequences in Ens…
St3451 5839863
feat: add support for custom PAE directory and update README
St3451 b6f60cb
docs: update README and main.py to clarify AlphaFold DB versioning fo…
St3451 2812f00
docs: correct typos and improve clarity in README.md
St3451 dfc5243
feat: enhance dataset building process with improved logging and stru…
St3451 d79d679
fix: enable directory cleaning in dataset build process
St3451 26fc3d5
fix: improve logging for custom PAE directory handling in dataset bui…
St3451 6456014
lint: update main execution to enforce CLI usage for dataset building
St3451 5194584
fix: pass af_version to merge_af_fragments for improved dataset merging
St3451 4e447dd
logs: enhance logging for duplicate gene removal and Ensembl CDS fail…
St3451 cfc4e6b
fix: update custom MANE PDB directory option to require --mane_only f…
St3451 0d9412e
lint: simplify debug logging
St3451 44a8644
refactor: enforce CLI usage for seq_for_mut_prob module execution
St3451 19f3c0c
refactor: increase probe size and consecutive missing threshold for P…
St3451 94f8f2e
fix: remove existing PAE output directory before copying custom PAE d…
St3451 91e22f4
docs: update process_seq_df docstring to include canonical transcript…
St3451 eaa0b9e
feat: add function to load custom gene symbol mappings from samplesheet
St3451 f7853af
refactor: update samplesheet tool build_metadata_map() to accept a p…
St3451 d8ea655
logs: enhance PDB copying process with detailed logging and summary o…
St3451 3c3f055
fix: ensure REPO_ROOT is added to sys.path for module imports in prep…
St3451 50498ac
limit number of connections in download_single_file to a maximum of 10
St3451 0ed4e61
reduce maximum number of connections in download_single_file from 10 …
St3451 7967d1b
feat: add retry logic for missing entries; increase max attempts and…
St3451 3a348b5
cap Ensembl CDS batch workers to a maximum number of cores
St3451 2c5c109
feat: implement bounded parallelism for retrying missing Ensembl CDS …
St3451 ea63130
fix: handle consecutive missing PAE downloads correctly
St3451 cdf9f6a
logs: enhance logging for Ensembl CDS retrieval with sequence count
St3451 2209b0e
fix: prevent duplicate SEQRES records in PDB files and log skipped in…
St3451 d7bf403
fix: handle ENSP IDs in get_exons_coord function and return NaN for m…
St3451 6eeadf3
update symbol assignment to use pd.NA for missing values in build_sym…
St3451 9170592
fix: update add_seqres_to_pdb to return bool and skip insertion if SE…
St3451 3d8d4e9
fix: reduce batch size for Backtranseq API calls to improve performan…
St3451 44255a5
logs: enhance error handling and logging in backtranseq function for …
St3451 b836ccf
feat: enhance backtranseq function with retry logic and timeout handl…
St3451
The docstring says “Add the SEQREF records”, but the code is inserting SEQRES records. Update the docstring to match the PDB record type to avoid confusion.
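For illustration, here is a minimal sketch of what the corrected docstring could look like, combined with the "return bool and skip insertion if SEQRES is already present" behaviour that the commit titles describe. It is not the PR's code: the function names `format_seqres` and `add_seqres_to_lines`, the in-memory list of lines, and the choice to insert before the first coordinate record are all assumptions made for the example.

```python
from typing import List


def format_seqres(residues: List[str], chain_id: str = "A") -> List[str]:
    """Format three-letter residue names as PDB SEQRES records (13 per line)."""
    records = []
    for start in range(0, len(residues), 13):
        chunk = residues[start:start + 13]
        serial = start // 13 + 1
        # Columns: "SEQRES", serial number, chain ID, total residue count, names.
        records.append(
            f"SEQRES {serial:>3} {chain_id} {len(residues):>4}  " + " ".join(chunk)
        )
    return records


def add_seqres_to_lines(
    pdb_lines: List[str], residues: List[str], chain_id: str = "A"
) -> bool:
    """Add the SEQRES records to a PDB file held as a list of lines.

    Returns True if records were inserted, and False if the lines already
    contain SEQRES records (in which case nothing is changed).
    Hypothetical helper: the name and signature are assumptions.
    """
    if any(line.startswith("SEQRES") for line in pdb_lines):
        return False  # avoid duplicating existing SEQRES records

    # Assumption: insert just before the first coordinate record,
    # or append at the end if the lines contain none.
    insert_at = len(pdb_lines)
    for index, line in enumerate(pdb_lines):
        if line.startswith(("MODEL", "ATOM", "HETATM")):
            insert_at = index
            break

    pdb_lines[insert_at:insert_at] = format_seqres(residues, chain_id)
    return True
```

Returning a bool lets the caller count and log skipped files instead of silently rewriting them, which appears to match the intent of the "log skipped" commit.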