feat: Implementing parallelization for unzipping files #1114
Roaimkhan wants to merge 2 commits into google-deepmind:main
Conversation
Vincy1230 left a comment
I ran into the same issue: when only a single CPU core is used, some servers can take ten or more hours for the initial download stage. Deeply grateful for your pull request.
Vincy1230 left a comment
I noticed that there is still room for optimization in your changes: you added `parallel` so that `gzip` runs concurrently, but doing so loses the ability of `find ... {} +` to pass as many files as possible to a single invocation, so the new script incurs the overhead of starting and stopping `gzip` frequently.
How about adding a `--xargs` option as well, so we can get the best of both approaches?
Co-authored-by: 史雲昔 (Vincy SHI) <vincy@vincy1230.net>
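To make the suggestion concrete, here is a minimal sketch of the two variants. The directory, file names, and `JOBS` value are illustrative, not taken from the actual script in this repository; GNU Parallel's `--xargs` flag packs as many file names as the command line permits into each `gzip` invocation, keeping the low process-startup overhead of `find ... -exec gzip -d {} +` while still decompressing concurrently. Where GNU Parallel is unavailable, `xargs -P` gives a similar batched-parallel effect.

```shell
set -eu

# Demo directory with a few gzipped files (for illustration only).
BATCH_DIR="$(mktemp -d)"
for i in 1 2 3 4; do
  printf 'entry %s\n' "$i" > "${BATCH_DIR}/entry${i}.cif"
  gzip "${BATCH_DIR}/entry${i}.cif"
done

JOBS=2  # illustrative; a real script would derive this from the core count

if command -v parallel > /dev/null 2>&1; then
  # --xargs: batch as many file names as possible per gzip invocation,
  # running JOBS such batches concurrently.
  find "${BATCH_DIR}" -name '*.cif.gz' -print0 \
    | parallel -0 -j "${JOBS}" --xargs gzip -d
else
  # Fallback with the same batched-parallel idea: -n caps the batch size,
  # -P runs that many gzip processes at once.
  find "${BATCH_DIR}" -name '*.cif.gz' -print0 \
    | xargs -0 -P "${JOBS}" -n 64 gzip -d
fi
```

Either way, each `gzip` process handles many files instead of one, which matters at the 200,000-file scale this PR targets.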
@Vincy1230 Appreciate the suggestion... it definitely is a worthy upgrade!
@Vincy1230 Hey, could you please sign the Google CLA so the CI checks can pass?
I did sign it a long time ago; there might be a problem with the CI process. Could you please take a look? @Augustin-Zidek
Description
This PR implements GNU Parallel-based unzipping for 200,000+ *.cif.gz files in the AlphaFold pipeline.
The Problem:
The current unzipping process is strictly serial, which is extremely slow for datasets of this size and delays all downstream processing.
The Fix:
Added a check for GNU Parallel availability.
Automatically detects the number of CPU cores, leaving one core free for I/O-bound tasks and using the remaining cores for parallel unzipping.
Falls back to the existing serial method if GNU Parallel is not installed.
Updated README.md to reflect the new parallelization option and usage instructions.
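The steps above can be sketched as follows. This is a hedged approximation of the described logic, not the exact code in the PR; the directory, file names, and variable names are hypothetical.

```shell
set -eu

# Demo directory standing in for the *.cif.gz download target (illustration only).
DEMO_DIR="$(mktemp -d)"
for i in 1 2 3; do
  printf 'data %s\n' "$i" > "${DEMO_DIR}/file${i}.cif"
  gzip "${DEMO_DIR}/file${i}.cif"
done

# Detect the number of CPU cores, leaving one free for I/O-bound work.
NUM_CORES="$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)"
JOBS=$(( NUM_CORES > 1 ? NUM_CORES - 1 : 1 ))

if command -v parallel > /dev/null 2>&1; then
  # GNU Parallel path: decompress up to JOBS files concurrently.
  find "${DEMO_DIR}" -name '*.cif.gz' -print0 \
    | parallel -0 -j "${JOBS}" gzip -d
else
  # Fall back to the existing serial behaviour when GNU Parallel is absent.
  find "${DEMO_DIR}" -name '*.cif.gz' -exec gzip -d {} +
fi
```

Because the fallback branch reproduces the original serial command, the change is a strict opt-in speedup: environments without GNU Parallel behave exactly as before.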
Fixes: #1075