
Conversation

@lrobion (Contributor) commented Jan 15, 2026

Summary

Addresses #89. This PR makes the transport step in APCEMM faster. Improvements to total runtime are on the order of 25% for a single thread and 40% for 8 threads.

Tested correctness on randomly perturbed fields (scaled + Gaussian perturbation) from the issl_rhi140/ example for 4 hours. The results are bitwise identical for a single-threaded run. I could not test this for multi-threaded setups due to #19.

User-facing changes:

  • New compile option -DENABLE_TIMING to report profiling information around key parts of APCEMM (see the sketch below)
  • Warning if the number of OMP threads is > 4 (see the benchmark section below; I might remove this...)
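
For illustration, this is a minimal sketch of how a compile-time-guarded scoped timer can be wired up under such a flag; the macro and class names below (`APCEMM_TIME_SCOPE`, `ScopedTimer`) are placeholders, not APCEMM's actual instrumentation.

```cpp
// Minimal sketch of a timer guarded by -DENABLE_TIMING. The macro and
// class names are placeholders, not APCEMM's actual profiling code.
#include <chrono>
#include <iostream>
#include <string>
#include <utility>

#ifdef ENABLE_TIMING
class ScopedTimer {
public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        const double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start_).count();
        std::cout << "[timing] " << label_ << ": " << elapsed << " s\n";
    }
private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};
#define APCEMM_TIME_SCOPE(label) ScopedTimer apcemmScopedTimer{label}
#else
#define APCEMM_TIME_SCOPE(label)
#endif

void transportStep() {
    APCEMM_TIME_SCOPE("ice aerosol transport");  // prints elapsed time on scope exit
    // ... transport work ...
}
```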

Code changes

There are many things that could still be improved, but they would require a larger refactoring; this PR is a first step. Most of the changes focus on making the ice aerosol transport step faster.

  • The FVM_Solver constructor gets coords data by reference instead of by value
  • Reduced the number of conversions from Vector_2D to Eigen::VectorXd
  • Cached the diffusion matrix used in operatorSplitSolve across all 38 aerosol size bins instead of recomputing it every time
  • Generate a pool of FVM_Solver instances so that threads do not create a new solver each iteration

The last two points make by far the largest difference; a rough sketch of the idea is shown below.
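
To make those two ideas concrete, here is a rough, self-contained sketch (not APCEMM's actual code): the diffusion matrix is assembled once and shared across all size bins, and each OpenMP thread reuses one pre-built solver instead of constructing a new one per bin and iteration. The `Solver` stand-in and its `advance` method are hypothetical; the real `FVM_Solver` interface differs.

```cpp
// Rough sketch of the caching + pooling idea; "Solver" is a stand-in with a
// hypothetical interface, not APCEMM's real FVM_Solver.
#include <omp.h>
#include <utility>
#include <vector>
#include <Eigen/Sparse>

struct Solver {
    // Shared, read-only diffusion operator, assembled once for all 38 bins.
    const Eigen::SparseMatrix<double>* diffusion;
    explicit Solver(const Eigen::SparseMatrix<double>& D) : diffusion(&D) {}

    void advance(Eigen::VectorXd& phi) const {
        // Placeholder for one operator-split transport step on one size bin.
        Eigen::VectorXd next = (*diffusion) * phi;
        phi = std::move(next);
    }
};

void transportAllBins(const Eigen::SparseMatrix<double>& diffusion,
                      std::vector<Eigen::VectorXd>& bins) {
    // One solver per thread, built once instead of once per bin per iteration.
    std::vector<Solver> pool(omp_get_max_threads(), Solver(diffusion));

    #pragma omp parallel for
    for (long bin = 0; bin < static_cast<long>(bins.size()); ++bin)
        pool[omp_get_thread_num()].advance(bins[bin]);
}
```

Splitting the solver into a shared immutable operator and per-thread mutable state, as in this sketch, is also what the second future improvement below would formalize.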

Benchmark

The benchmark is run on the examples/issl_rhi140/ data for 4 h. Overall, APCEMM scales badly beyond 2 CPUs. On main I think this is because of a lot of repeated work; in this PR it is likely due to expensive copies when initializing the solver pool.

Speedup varies with the number of CPUs and is heavily dependent on the machine. I ran this both on a fully reserved node (c041 on Hex) and locally, and got very different speedups, ranging from x1.25 to x2. I suspect this is cache-size related.

Your mileage may vary, but across all tests on different nodes this resulted in a net speedup, apart from one case using 8 CPUs. There I sometimes found slight performance regressions (0.95-1x) compared to 8 CPUs on main, which is why I added the warning to test this yourself. However, with the speedup from this PR, it might be worth running two separate APCEMM simulations in parallel with 4 CPUs each rather than running one after the other with 8 CPUs.


The breakdown of time spent in each part of APCEMM still shows that transport dominates, but it might be worth looking into the ice growth mechanisms too.

Future improvements

I see two large avenues for further improvements, but they would require larger refactors:

  1. Change the representation of the most compute-intensive variables to Eigen::VectorXd under the hood, with a light wrapper around it to allow 2D or 3D indexing (see the sketch after this list). This would avoid the (many) back-and-forth conversions throughout the code, and working with contiguous memory blocks would likely boost cache locality. It would also help with saving the 3D PSD distribution by eliminating the need to copy the data back to a contiguous format.

  2. Refactor the solvers to be lighter and copyable/movable: eliminate the unique pointers for the Points, get rid of the Eigen solvers that we do not use, and separate the solver instances into a shared immutable state (diffusion vectors, diffusion matrix coefficients, ...) and a mutable state. All of this would make the solvers much faster to initialize (fewer redundant computations, fewer copies), which would make the solver pool scale much better and potentially bring performance improvements at 8 threads.
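
As an illustration of item 1, a thin wrapper such as the `Field2D` class sketched below could provide 2D indexing over a single contiguous `Eigen::VectorXd`, so solvers and output code operate on the flat storage directly. Neither the class name nor the interface exists in APCEMM today; this is only a sketch of the idea.

```cpp
// Hypothetical sketch for item 1: a thin 2D view over contiguous storage.
// Neither the class name Field2D nor its interface exist in APCEMM today.
#include <Eigen/Dense>
#include <cassert>

class Field2D {
public:
    Field2D(Eigen::Index ny, Eigen::Index nx)
        : ny_(ny), nx_(nx), data_(Eigen::VectorXd::Zero(ny * nx)) {}

    // Row-major 2D indexing into the flat vector.
    double& operator()(Eigen::Index j, Eigen::Index i) {
        assert(j >= 0 && j < ny_ && i >= 0 && i < nx_);
        return data_(j * nx_ + i);
    }
    double operator()(Eigen::Index j, Eigen::Index i) const {
        assert(j >= 0 && j < ny_ && i >= 0 && i < nx_);
        return data_(j * nx_ + i);
    }

    // Expose the underlying contiguous vector so solvers (and output code
    // writing the PSD) can use it directly, with no copy back and forth.
    Eigen::VectorXd& flat() { return data_; }
    const Eigen::VectorXd& flat() const { return data_; }

private:
    Eigen::Index ny_, nx_;
    Eigen::VectorXd data_;
};
```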

Let me know what you think.

@lrobion lrobion requested review from ian-ross and sdeastham January 15, 2026 22:19
@ian-ross (Member) left a comment


I think this looks good, and, as you say, going further in this direction is going to take more significant refactoring (and it's not certain exactly what performance benefits would result). I would bank the gains you've made with these changes by merging this PR, and you can come back to it later, now that you have a better idea what's going on and what needs to be done. The transport part is still the best target for optimization since you know that it's doing unnecessary work and it's still the most time-consuming part.

I didn't know about #19. It would certainly be good to fix that so that you can do proper testing of multi-threaded optimizations.
