
Adding an option to change the compression type of the parquet file #2611

Merged
erikvansebille merged 3 commits into Parcels-code:main from erikvansebille:parquet_compression on May 7, 2026

Conversation

@erikvansebille
Member

Description

This PR adds an option to set the compression type of the parquet file output. It's mostly so that we can explore what the best type of compression is for Lagrangian particles, but it might also be useful for other users?
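For illustration, a usage sketch; the `compression` keyword name and the accepted values are assumptions based on this PR's description, not taken from the diff:

import numpy as np

from parcels import ParticleFile

# Hypothetical usage -- the `compression` keyword is an assumption for
# illustration; see the diff for the implemented signature.
pfile = ParticleFile(
    "particles.parquet",
    outputdt=np.timedelta64(30, "m"),
    compression="brotli",  # parquet codecs include "snappy", "gzip", "zstd", "brotli", or None
)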

Checklist

  • Closes #xxxx
  • Tests added
  • This PR targets the correct branch (main for normal development, v3-support for v3 support)

AI Disclosure

  • This PR contains AI-generated content.
    • I have tested any AI-generated content in my PR.
    • I take responsibility for any AI-generated content in my PR.
    • Describe how you used it (e.g., by pasting your prompt):

@VeckoTheGecko
Contributor

Changes here look good.

I wonder if there's a limit to how much changing the compression type here will help.

Running this test code:

import numpy as np
import pandas as pd

from parcels import AdvectionRK4, FieldSet, ParticleFile, ParticleSet

# `simple_UV_dataset`, `assert_cftime_like_particlefile`, and the
# `tmp_parquet` fixture are helpers from the Parcels test suite.


def test_advection_zonal_with_particlefile(tmp_parquet):
    """Zonal advection on a flat mesh, with output written to a parquet ParticleFile."""
    npart = 10
    ds = simple_UV_dataset(mesh="flat")
    ds["U"].data[:] = 1.0
    fieldset = FieldSet.from_sgrid_conventions(ds, mesh="flat")
    pset = ParticleSet(fieldset, lon=np.zeros(npart) + 20.0, lat=np.linspace(0, 80, npart))
    pfile = ParticleFile(tmp_parquet, outputdt=np.timedelta64(30, "m"))
    pset.execute(AdvectionRK4, runtime=np.timedelta64(2, "h"), dt=np.timedelta64(15, "m"), output_file=pfile)
    assert (np.diff(pset.lon) < 1.0e-4).all()
    df = pd.read_parquet(tmp_parquet)
    final_time = df["time"].max()
    np.testing.assert_allclose(df[df["time"] == final_time]["lon"].values, pset.lon, atol=1e-5)
    assert_cftime_like_particlefile(tmp_parquet)

This results in a Parquet file with 5 row groups (one per write-out to the particle file), as you can see below the fold.

Details

In [1]: import pyarrow.parquet as pq

In [2]: parquet_file = pq.ParquetFile('tmp.parquet')

In [3]: print(parquet_file.num_row_groups)
5

In [4]: parquet_file.read_row_group(0)
Out[4]: 
pyarrow.Table
lon: float
lat: float
z: float
time: double
particle_id: int64
----
lon: [[20,20,20,20,20,20,20,20,20,20]]
lat: [[0,8.888889,17.777779,26.666666,35.555557,44.444443,53.333332,62.22222,71.111115,80]]
z: [[0,0,0,0,0,0,0,0,0,0]]
time: [[0,0,0,0,0,0,0,0,0,0]]
particle_id: [[0,1,2,3,4,5,6,7,8,9]]

In [5]: parquet_file.read_row_group(1)
Out[5]: 
pyarrow.Table
lon: float
lat: float
z: float
time: double
particle_id: int64
----
lon: [[1820,1820,1820,1820,1820,1820,1820,1820,1820,1820]]
lat: [[0,8.888889,17.777779,26.666666,35.555557,44.444443,53.333332,62.22222,71.111115,80]]
z: [[0,0,0,0,0,0,0,0,0,0]]
time: [[1800,1800,1800,1800,1800,1800,1800,1800,1800,1800]]
particle_id: [[0,1,2,3,4,5,6,7,8,9]]

In [6]: parquet_file.read_row_group(2)
Out[6]: 
pyarrow.Table
lon: float
lat: float
z: float
time: double
particle_id: int64
----
lon: [[3620,3620,3620,3620,3620,3620,3620,3620,3620,3620]]
lat: [[0,8.888889,17.777779,26.666666,35.555557,44.444443,53.333332,62.22222,71.111115,80]]
z: [[0,0,0,0,0,0,0,0,0,0]]
time: [[3600,3600,3600,3600,3600,3600,3600,3600,3600,3600]]
particle_id: [[0,1,2,3,4,5,6,7,8,9]]


Glossary of relevant terminology, from https://parquet.apache.org/docs/concepts/:
Block (HDFS block): This means a block in HDFS and the meaning is unchanged for describing this file format. The file format is designed to work well on top of HDFS.

File: A HDFS file that must include the metadata for the file. It does not need to actually contain the data.

Row group: A logical horizontal partitioning of the data into rows. There is no physical structure that is guaranteed for a row group. A row group consists of a column chunk for each column in the dataset.

Column chunk: A chunk of the data for a particular column. They live in a particular row group and are guaranteed to be contiguous in the file.

Page: Column chunks are divided up into pages. A page is conceptually an indivisible unit (in terms of compression and encoding). There can be multiple page types which are interleaved in a column chunk.

Hierarchically, a file consists of one or more row groups. A row group contains exactly one column chunk per column. Column chunks contain one or more pages.

Given that we have very short column chunks (hence also very short pages), I am wondering whether the compression will be able to have much effect.
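One way to check this empirically is to compare the compressed and uncompressed bytes of each column chunk via pyarrow's metadata API (a sketch, assuming tmp.parquet is the output of the test above):

import pyarrow.parquet as pq

# Compare compressed vs. uncompressed size per column chunk, to see how
# much the codec actually achieves on these short row groups.
md = pq.ParquetFile("tmp.parquet").metadata
for rg in range(md.num_row_groups):
    for col in range(md.num_columns):
        chunk = md.row_group(rg).column(col)
        print(
            f"row group {rg}, {chunk.path_in_schema}: "
            f"{chunk.total_uncompressed_size} -> {chunk.total_compressed_size} bytes "
            f"({chunk.compression})"
        )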

@VeckoTheGecko left a comment
Contributor

Happy to merge this into main, just wanted to flag the item about row groups perhaps being the main culprit. Needs further investigation. @erikvansebille, looking at the Sargassum notebooks, how many row groups are there?

@erikvansebille
Member Author

I ran the different compression types on the Sargassum simulation in https://github.com/Parcels-code/Sargassum_growth_model/blob/main/Manuscript_Figures/satellite_simulation.py, and I get 361 row groups, which is the same as the number of unique times. So each row group is one time slice.

Below are the file sizes for each type of compression. There are 103,614 particles that are all written 361 times.

compression  file size
brotli       2.3G
gzip         2.4G
zstd         2.5G
snappy       2.9G
None         3.0G

So brotli gives the smallest files, but 'only' about 20% smaller than no compression at all. So the compression type doesn't seem to matter that much...
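For reference, a minimal sketch of how such a codec comparison can be reproduced with pyarrow (file names are placeholders; note that round-tripping through pyarrow coalesces row groups, so the sizes won't exactly match the per-timestep layout that Parcels writes):

import os

import pyarrow.parquet as pq

# Re-encode the same table with each codec and compare file sizes.
table = pq.read_table("particles.parquet")  # placeholder input file
for codec in ["brotli", "gzip", "zstd", "snappy", "none"]:
    out = f"particles_{codec}.parquet"
    pq.write_table(table, out, compression=codec)
    print(f"{codec}: {os.path.getsize(out) / 1e9:.2f} GB")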

@erikvansebille enabled auto-merge (squash) May 7, 2026 06:32
@erikvansebille merged commit bc20889 into Parcels-code:main May 7, 2026
12 of 15 checks passed
github-project-automation bot moved this from Backlog to Done in Parcels development May 7, 2026
@VeckoTheGecko
Contributor

@erikvansebille Can you test what happens if you try to read in one of these parquet files and write it out again? I think it will write it out as a single row group and you will get a much smaller file size.
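A sketch of that experiment with pyarrow (file names are placeholders; row_group_size is set explicitly so the whole table lands in a single row group):

import pyarrow.parquet as pq

# Read the many-row-group file and write it back out as one row group.
table = pq.read_table("sargassum.parquet")  # placeholder file name
pq.write_table(
    table,
    "sargassum_rewritten.parquet",
    compression="brotli",
    row_group_size=table.num_rows,  # force a single row group
)

print(pq.ParquetFile("sargassum.parquet").metadata.num_row_groups)  # 361
print(pq.ParquetFile("sargassum_rewritten.parquet").metadata.num_row_groups)  # 1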
