Adding an option to change the compression type of the parquet file #2611
Conversation
Changes here look good. I wonder if there's a limit to how much changing the compression type here will help. Running the code from Parcels/tests/test_advection.py (lines 64 to 79 at e79ca8f) results in a Parquet file that has 5 row groups (one per write to the particle file).
Given we have very short column chunks (and hence also very short pages), I am wondering whether the compression will be able to have much effect.
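For reference, here is a minimal sketch (assuming pyarrow is installed; the filename is hypothetical) of how to inspect the row groups and per-column-chunk compression ratios of a particle file, to check how much effect the codec actually has on these short column chunks:

```python
import pyarrow.parquet as pq

# Inspect the layout and compression of an existing particle file.
meta = pq.ParquetFile("out.parquet").metadata  # hypothetical path
print(f"{meta.num_row_groups} row groups, {meta.num_rows} rows total")

for i in range(meta.num_row_groups):
    rg = meta.row_group(i)
    for j in range(rg.num_columns):
        col = rg.column(j)
        ratio = col.total_uncompressed_size / max(col.total_compressed_size, 1)
        print(f"row group {i}, {col.path_in_schema}: {col.compression}, ratio {ratio:.2f}")
```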
VeckoTheGecko left a comment:
Happy to merge this into main, just wanted to flag the item about row groups perhaps being the main culprit. Needs further investigation. @erikvansebille looking at the Sargassum notebooks, how many row groups are there?
I ran the different compression types on the Sargassum notebook in https://github.com/Parcels-code/Sargassum_growth_model/blob/main/Manuscript_Figures/satellite_simulation.py, and I get 361 row groups, which is the same as the number of unique times. So each row group is one time slice. Below are the file sizes for each type of compression. There are 103,614 particles that are each written 361 times.
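For anyone wanting to reproduce a comparison like this outside of Parcels, here is a minimal sketch (assuming pyarrow; the filenames are hypothetical): read the table once, rewrite it with each codec, and compare the on-disk sizes. Note that rewriting this way also repacks the data into far fewer row groups, so the sizes will not match a file written time slice by time slice.

```python
import os
import pyarrow.parquet as pq

table = pq.read_table("sargassum.parquet")  # hypothetical input file
for codec in ["none", "snappy", "gzip", "zstd"]:
    out = f"sargassum_{codec}.parquet"
    pq.write_table(table, out, compression=codec)
    print(f"{codec:>7}: {os.path.getsize(out) / 1e6:.1f} MB")
```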
@erikvansebille Can you test what happens if you read in one of these parquet files and write it out again? I think it will write it out as a single row group and you will get a much smaller file.
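Something like the following sketch could be used for that test (assuming pyarrow; the filenames are hypothetical). Reading the whole table and writing it back lets pyarrow pack all rows into one large row group, where the codec has much more data per column chunk to work with:

```python
import pyarrow.parquet as pq

table = pq.read_table("sargassum_zstd.parquet")  # hypothetical input file
pq.write_table(
    table,
    "sargassum_rewritten.parquet",
    compression="zstd",
    row_group_size=table.num_rows,  # force a single row group
)
print(pq.ParquetFile("sargassum_rewritten.parquet").metadata.num_row_groups)
```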
Description
This PR adds an option to set the compression type of the parquet file output. For now it is mostly so that we can explore which compression type works best for Lagrangian particle output, but it might also be useful for other users.
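As a rough illustration (not the actual implementation in this PR), the option essentially boils down to passing a codec name through to the parquet writer, for example when one write of particle output is assembled as a pandas DataFrame:

```python
import pandas as pd

# Hypothetical example data standing in for one write of particle output.
df = pd.DataFrame({"trajectory": [0, 1], "time": [0.0, 0.0], "lon": [0.1, 0.2], "lat": [0.0, 0.0]})

# The compression keyword is forwarded to the underlying parquet engine;
# common choices are "snappy" (the default), "gzip", "zstd", or None.
df.to_parquet("particles.parquet", compression="zstd")
```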
Checklist
- Opened this PR against the correct branch (main for normal development, v3-support for v3 support)