feat: Implementing parallelization for unzipping files #1114
Roaimkhan wants to merge 2 commits into google-deepmind:main
Conversation
Vincy1230 left a comment
I ran into the same issue: when only a single CPU core is used, some servers can take ten or more hours for the initial download stage. Deeply grateful for your pull request.
Vincy1230 left a comment
I noticed that there is still room for optimization in your changes: you added `parallel` so that `gzip` runs concurrently, but doing so loses the ability of `find ... {} +` to pass as many files as possible to a single invocation, so the new script incurs the overhead of starting and stopping `gzip` frequently.
How about adding a `--xargs` option as well, so we can get the best of both approaches?
Co-authored-by: 史雲昔 (Vincy SHI) <vincy@vincy1230.net>
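To make the suggestion concrete, here is a minimal sketch of the two variants. The directory, file names, and `JOBS` value are illustrative, not taken from the actual script in this repository; GNU Parallel's `--xargs` flag packs as many file names as the command line permits into each `gzip` invocation, keeping the low process-startup overhead of `find ... -exec gzip -d {} +` while still decompressing concurrently. Where GNU Parallel is unavailable, `xargs -P` gives a similar batched-parallel effect.

```shell
set -eu

# Demo directory with a few gzipped files (for illustration only).
BATCH_DIR="$(mktemp -d)"
for i in 1 2 3 4; do
  printf 'entry %s\n' "$i" > "${BATCH_DIR}/entry${i}.cif"
  gzip "${BATCH_DIR}/entry${i}.cif"
done

JOBS=2  # illustrative; a real script would derive this from the core count

if command -v parallel > /dev/null 2>&1; then
  # --xargs: batch as many file names as possible per gzip invocation,
  # running JOBS such batches concurrently.
  find "${BATCH_DIR}" -name '*.cif.gz' -print0 \
    | parallel -0 -j "${JOBS}" --xargs gzip -d
else
  # Fallback with the same batched-parallel idea: -n caps the batch size,
  # -P runs that many gzip processes at once.
  find "${BATCH_DIR}" -name '*.cif.gz' -print0 \
    | xargs -0 -P "${JOBS}" -n 64 gzip -d
fi
```

Either way, each `gzip` process handles many files instead of one, which matters at the 200,000-file scale this PR targets.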
@Vincy1230 Appreciate the suggestion... it definitely is a worthy upgrade!
@Vincy1230 Hey, could you please sign the Google CLA so the CI checks can pass?
I did sign it a long time ago; there might be a problem with the CI process. Could you please take a look? @Augustin-Zidek
Description
This PR implements GNU Parallel-based unzipping for 200,000+ *.cif.gz files in the AlphaFold pipeline.
The Problem:
The current unzipping process is strictly serial, which is extremely slow for datasets of this size and delays all downstream processing.
The Fix:
Added a check for GNU Parallel availability.
Automatically detects the number of CPU cores, leaving one core free for I/O-bound tasks and using the remaining cores for parallel unzipping.
Falls back to the existing serial method if GNU Parallel is not installed.
Updated README.md to reflect the new parallelization option and usage instructions.
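The steps above can be sketched as follows. This is a hedged approximation of the described logic, not the exact code in the PR; the directory, file names, and variable names are hypothetical.

```shell
set -eu

# Demo directory standing in for the *.cif.gz download target (illustration only).
DEMO_DIR="$(mktemp -d)"
for i in 1 2 3; do
  printf 'data %s\n' "$i" > "${DEMO_DIR}/file${i}.cif"
  gzip "${DEMO_DIR}/file${i}.cif"
done

# Detect the number of CPU cores, leaving one free for I/O-bound work.
NUM_CORES="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
JOBS=$(( NUM_CORES > 1 ? NUM_CORES - 1 : 1 ))

if command -v parallel > /dev/null 2>&1; then
  # GNU Parallel path: decompress up to JOBS files concurrently.
  find "${DEMO_DIR}" -name '*.cif.gz' -print0 \
    | parallel -0 -j "${JOBS}" gzip -d
else
  # Fall back to the existing serial behaviour when GNU Parallel is absent.
  find "${DEMO_DIR}" -name '*.cif.gz' -exec gzip -d {} +
fi
```

Because the fallback branch reproduces the original serial command, the change is a strict opt-in speedup: environments without GNU Parallel behave exactly as before.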
Fixes: #1075