
Adding an option to change the compression type of the parquet file #2611

Merged
erikvansebille merged 3 commits into Parcels-code:main from erikvansebille:parquet_compression on May 7, 2026

Conversation

@erikvansebille
Member

Description

This PR adds an option to set the compression type of the parquet file output. It's mostly so that we can explore what the best type of compression is for Lagrangian particles, but it might also be useful for other users?
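For illustration, a usage sketch; the `compression` keyword name and the accepted values are assumptions based on this PR's description, not taken from the diff:

import numpy as np

from parcels import ParticleFile

# Hypothetical usage -- the `compression` keyword is an assumption for
# illustration; see the diff for the implemented signature.
pfile = ParticleFile(
    "particles.parquet",
    outputdt=np.timedelta64(30, "m"),
    compression="brotli",  # parquet codecs include "snappy", "gzip", "zstd", "brotli", or None
)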

Checklist

  • Closes #xxxx
  • Tests added
  • This PR targets the correct branch (main for normal development, v3-support for v3 support)

AI Disclosure

  • This PR contains AI-generated content.
    • I have tested any AI-generated content in my PR.
    • I take responsibility for any AI-generated content in my PR.
    • Describe how you used it (e.g., by pasting your prompt):

@VeckoTheGecko
Contributor

Changes here look good.

I wonder if there's a limit to how much changing the compression type here will help.

Running this test code:

import numpy as np
import pandas as pd

from parcels import AdvectionRK4, FieldSet, ParticleFile, ParticleSet

# `simple_UV_dataset`, `assert_cftime_like_particlefile`, and the
# `tmp_parquet` fixture are helpers from the Parcels test suite.


def test_advection_zonal_with_particlefile(tmp_parquet):
    """Zonal advection on a flat mesh, with output written to a parquet ParticleFile."""
    npart = 10
    ds = simple_UV_dataset(mesh="flat")
    ds["U"].data[:] = 1.0
    fieldset = FieldSet.from_sgrid_conventions(ds, mesh="flat")
    pset = ParticleSet(fieldset, lon=np.zeros(npart) + 20.0, lat=np.linspace(0, 80, npart))
    pfile = ParticleFile(tmp_parquet, outputdt=np.timedelta64(30, "m"))
    pset.execute(AdvectionRK4, runtime=np.timedelta64(2, "h"), dt=np.timedelta64(15, "m"), output_file=pfile)
    assert (np.diff(pset.lon) < 1.0e-4).all()
    df = pd.read_parquet(tmp_parquet)
    final_time = df["time"].max()
    np.testing.assert_allclose(df[df["time"] == final_time]["lon"].values, pset.lon, atol=1e-5)
    assert_cftime_like_particlefile(tmp_parquet)

This results in a Parquet file with 5 row groups (one per write-out to the particle file), as you can see below the fold.

Details

In [1]: import pyarrow.parquet as pq

In [2]: parquet_file = pq.ParquetFile('tmp.parquet')

In [3]: print(parquet_file.num_row_groups)
5

In [4]: parquet_file.read_row_group(0)
Out[4]: 
pyarrow.Table
lon: float
lat: float
z: float
time: double
particle_id: int64
----
lon: [[20,20,20,20,20,20,20,20,20,20]]
lat: [[0,8.888889,17.777779,26.666666,35.555557,44.444443,53.333332,62.22222,71.111115,80]]
z: [[0,0,0,0,0,0,0,0,0,0]]
time: [[0,0,0,0,0,0,0,0,0,0]]
particle_id: [[0,1,2,3,4,5,6,7,8,9]]

In [5]: parquet_file.read_row_group(1)
Out[5]: 
pyarrow.Table
lon: float
lat: float
z: float
time: double
particle_id: int64
----
lon: [[1820,1820,1820,1820,1820,1820,1820,1820,1820,1820]]
lat: [[0,8.888889,17.777779,26.666666,35.555557,44.444443,53.333332,62.22222,71.111115,80]]
z: [[0,0,0,0,0,0,0,0,0,0]]
time: [[1800,1800,1800,1800,1800,1800,1800,1800,1800,1800]]
particle_id: [[0,1,2,3,4,5,6,7,8,9]]

In [6]: parquet_file.read_row_group(2)
Out[6]: 
pyarrow.Table
lon: float
lat: float
z: float
time: double
particle_id: int64
----
lon: [[3620,3620,3620,3620,3620,3620,3620,3620,3620,3620]]
lat: [[0,8.888889,17.777779,26.666666,35.555557,44.444443,53.333332,62.22222,71.111115,80]]
z: [[0,0,0,0,0,0,0,0,0,0]]
time: [[3600,3600,3600,3600,3600,3600,3600,3600,3600,3600]]
particle_id: [[0,1,2,3,4,5,6,7,8,9]]


Glossary of relevant terminology, from https://parquet.apache.org/docs/concepts/:
Block (HDFS block): This means a block in HDFS and the meaning is unchanged for describing this file format. The file format is designed to work well on top of HDFS.

File: A HDFS file that must include the metadata for the file. It does not need to actually contain the data.

Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.

Column chunk: A chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file.

Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.

Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages.

Given that we have very short column chunks (hence also very short pages), I am wondering whether the compression will be able to have much effect.
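One way to check this empirically is to compare the compressed and uncompressed bytes of each column chunk via pyarrow's metadata API (a sketch, assuming tmp.parquet is the output of the test above):

import pyarrow.parquet as pq

# Compare compressed vs. uncompressed size per column chunk, to see how
# much the codec actually achieves on these short row groups.
md = pq.ParquetFile("tmp.parquet").metadata
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        chunk = md.row_group(rg).column(col)
        print(
            f"row group {rg}, {chunk.path_in_schema}: "
            f"{chunk.total_uncompressed_size} -> {chunk.total_compressed_size} bytes "
            f"({chunk.compression})"
        )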

@VeckoTheGecko left a comment
Contributor

Happy to merge this into main, just wanted to flag the item about row groups perhaps being the main culprit. Needs further investigation. @erikvansebille, looking at the Sargassum notebooks, how many row groups are there?

@erikvansebille
Member Author

I ran the different compression types on the Sargassum simulation in https://github.com/Parcels-code/Sargassum_growth_model/blob/main/Manuscript_Figures/satellite_simulation.py, and I get 361 row groups, which is the same as the number of unique times. So each row group is one time slice.

Below are the file sizes for each type of compression. There are 103,614 particles that are all written 361 times.

compression  file size
brotli       2.3G
gzip         2.4G
zstd         2.5G
snappy       2.9G
None         3.0G

So brotli gives the smallest files, but 'only' about 20% smaller than no compression at all. So the compression type doesn't seem to matter that much...
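For reference, a minimal sketch of how such a codec comparison can be reproduced with pyarrow (file names are placeholders; note that round-tripping through pyarrow coalesces row groups, so the sizes won't exactly match the per-timestep layout that Parcels writes):

import os

import pyarrow.parquet as pq

# Re-encode the same table with each codec and compare file sizes.
table = pq.read_table("particles.parquet")  # placeholder input file
for codec in ["brotli", "gzip", "zstd", "snappy", "none"]:
    out = f"particles_{codec}.parquet"
    pq.write_table(table, out, compression=codec)
    print(f"{codec}: {os.path.getsize(out) / 1e9:.2f} GB")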

@erikvansebille enabled auto-merge (squash) May 7, 2026 06:32
@erikvansebille merged commit bc20889 into Parcels-code:main May 7, 2026
12 of 15 checks passed
github-project-automation bot moved this from Backlog to Done in Parcels development May 7, 2026
@VeckoTheGecko
Contributor

@erikvansebille Can you test what happens if you try to read in one of these parquet files and write it out again? I think it will write it out as a single row group and you will get a much smaller file size.
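A sketch of that experiment with pyarrow (file names are placeholders; row_group_size is set explicitly so the whole table lands in a single row group):

import pyarrow.parquet as pq

# Read the many-row-group file and write it back out as one row group.
table = pq.read_table("sargassum.parquet")  # placeholder file name
pq.write_table(
    table,
    "sargassum_rewritten.parquet",
    compression="brotli",
    row_group_size=table.num_rows,  # force a single row group
)

print(pq.ParquetFile("sargassum.parquet").metadata.num_row_groups)  # 361
print(pq.ParquetFile("sargassum_rewritten.parquet").metadata.num_row_groups)  # 1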
