feat: Implementing parallelization for unzipping files #1114

Open
Roaimkhan wants to merge 2 commits into google-deepmind:main from Roaimkhan:features/bugs

@Roaimkhan

Description

This PR implements GNU Parallel-based unzipping for 200,000+ *.cif.gz files in the AlphaFold pipeline.

The Problem:

The current unzipping step runs strictly serially, which is very slow for a dataset of this size and delays all downstream processing.

The Fix:

Added a check for GNU Parallel availability.

Automatically detects the number of CPU cores and uses all but one of them for parallel unzipping, leaving one core free for I/O-bound tasks.

Falls back to the existing serial method if GNU Parallel is not installed.

Updated README.md to reflect the new parallelization option and usage instructions.
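
For illustration, the steps listed above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the PR's actual diff: the function name `decompress_cif_tree`, the use of `getconf _NPROCESSORS_ONLN` to count cores, and the use of `gunzip` as the decompressor are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; decompress_cif_tree is a hypothetical name,
# not a function taken from the PR.
decompress_cif_tree() {
  local root_dir="$1"
  local cores jobs

  # Count online cores; treat a missing or non-numeric answer as 1.
  cores="$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)"
  case "${cores}" in
    ''|*[!0-9]*) cores=1 ;;
  esac

  # Leave one core free, but never drop below one job.
  jobs=$(( cores > 1 ? cores - 1 : 1 ))

  if command -v parallel >/dev/null 2>&1 \
      && parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # GNU Parallel found: one gunzip process per file, ${jobs} at a time.
    find "${root_dir}" -type f -name '*.cif.gz' -print0 \
      | parallel -0 -j "${jobs}" gunzip {}
  else
    # Serial fallback: find passes many files to each gunzip call.
    find "${root_dir}" -type f -name '*.cif.gz' -exec gunzip {} +
  fi
}
```

The `--version` check is there because some systems ship a different tool under the name `parallel`; only GNU Parallel is assumed to understand the options used here.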

Fixes: #1075

@Vincy1230 left a comment

I ran into the same issue: on some servers the initial download stage can take ten hours or more when only one CPU core is used.

Deeply grateful for your pull request.

@Vincy1230 left a comment

I noticed that there is still room for optimization in your changes. You added parallel so that gzip can run concurrently, but this gives up the ability of find ... {} + to pass as many files as possible to each invocation. As a result, the new script starts one gzip process per file and pays the process start-up and shutdown overhead every time.

How about adding a --xargs option as well, so we can get the best of both approaches?
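
To make the idea concrete, here is a rough sketch of what batching plus concurrency could look like. It is only an illustration under assumptions: the function name `decompress_batched` and its default values are hypothetical, and the `xargs` branch is a substitute added for systems without GNU Parallel, not something proposed in this PR.

```shell
#!/usr/bin/env bash
# Illustrative sketch only; decompress_batched and its defaults are
# hypothetical, not part of the PR.
decompress_batched() {
  local root_dir="$1"
  local jobs="${2:-4}"      # number of concurrent gunzip processes
  local batch="${3:-500}"   # files per call, used by the xargs branch only

  if command -v parallel >/dev/null 2>&1 \
      && parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # --xargs packs as many file names as fit onto each gunzip command
    # line, so far fewer processes are started than with one per file.
    find "${root_dir}" -type f -name '*.cif.gz' -print0 \
      | parallel -0 --xargs -j "${jobs}" gunzip
  else
    # Assumes an xargs that supports -0, -r, -n and -P
    # (GNU and BSD xargs both do).
    find "${root_dir}" -type f -name '*.cif.gz' -print0 \
      | xargs -0 -r -n "${batch}" -P "${jobs}" gunzip
  fi
}
```

Called as `decompress_batched path/to/raw 7`, this would run up to seven gunzip processes at once, each handling a batch of files rather than a single one.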

Co-authored-by: 史雲昔 (Vincy SHI) <vincy@vincy1230.net>
@Roaimkhan
Author

@Vincy1230 Appreciate the suggestion... it definitely is a worthy upgrade!

@Roaimkhan
Author

@Vincy1230 Hey could you please sign the Google CLA so the CI checks can pass?

@Vincy1230

I signed it a long time ago, so there might be a problem with the CI process. Could you please take a look? @Augustin-Zidek

Development

Successfully merging this pull request may close these issues.

parallelization opportunity
