
Conversation

@lrobion (Contributor) commented Jan 15, 2026

Summary

Addresses #89. This PR makes the transport step in APCEMM faster. Improvements to total runtime are on the order of 25% for a single thread and 40% for 8 threads.

Tested correctness on randomly perturbed fields (scaled + Gaussian perturbation) from the issl_rhi140/ example for 4 hours. The results are bitwise identical for a single-threaded run. I could not test this for multi-threaded setups due to #19.

User-facing changes:

  • New compile option -DENABLE_TIMING to report profiling information around key parts of APCEMM (see the sketch below)
  • Warning if the number of OMP threads is > 4 (see the benchmark section below; I might remove this...)
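
For illustration, this is a minimal sketch of how a compile-time-guarded scoped timer can be wired up under such a flag; the macro and class names below (`APCEMM_TIME_SCOPE`, `ScopedTimer`) are placeholders, not APCEMM's actual instrumentation.

```cpp
// Minimal sketch of a timer guarded by -DENABLE_TIMING. The macro and
// class names are placeholders, not APCEMM's actual profiling code.
#include <chrono>
#include <iostream>
#include <string>
#include <utility>

#ifdef ENABLE_TIMING
class ScopedTimer {
public:
    explicit ScopedTimer(std::string label)
        : label_(std::move(label)), start_(std::chrono::steady_clock::now()) {}
    ~ScopedTimer() {
        const double elapsed = std::chrono::duration<double>(
            std::chrono::steady_clock::now() - start_).count();
        std::cout << "[timing] " << label_ << ": " << elapsed << " s\n";
    }
private:
    std::string label_;
    std::chrono::steady_clock::time_point start_;
};
#define APCEMM_TIME_SCOPE(label) ScopedTimer apcemmScopedTimer{label}
#else
#define APCEMM_TIME_SCOPE(label)
#endif

void transportStep() {
    APCEMM_TIME_SCOPE("ice aerosol transport");  // prints elapsed time on scope exit
    // ... transport work ...
}
```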

Code changes

There are many things that could still be improved, but they would require a larger refactoring; this PR is a first step. Most of the changes focus on making the ice aerosol transport step faster.

  • The FVM_Solver constructor gets coords data by reference instead of by value
  • Reduced the number of conversions from Vector_2D to Eigen::VectorXd
  • Cached the diffusion matrix used in operatorSplitSolve across all 38 aerosol size bins instead of recomputing it every time
  • Generate a pool of FVM_Solver instances so that threads do not create a new solver each iteration

The last two points make by far the largest difference; a rough sketch of the idea is shown below.
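
To make those two ideas concrete, here is a rough, self-contained sketch (not APCEMM's actual code): the diffusion matrix is assembled once and shared across all size bins, and each OpenMP thread reuses one pre-built solver instead of constructing a new one per bin and iteration. The `Solver` stand-in and its `advance` method are hypothetical; the real `FVM_Solver` interface differs.

```cpp
// Rough sketch of the caching + pooling idea; "Solver" is a stand-in with a
// hypothetical interface, not APCEMM's real FVM_Solver.
#include <omp.h>
#include <utility>
#include <vector>
#include <Eigen/Sparse>

struct Solver {
    // Shared, read-only diffusion operator, assembled once for all 38 bins.
    const Eigen::SparseMatrix<double>* diffusion;
    explicit Solver(const Eigen::SparseMatrix<double>& D) : diffusion(&D) {}

    void advance(Eigen::VectorXd& phi) const {
        // Placeholder for one operator-split transport step on one size bin.
        Eigen::VectorXd next = (*diffusion) * phi;
        phi = std::move(next);
    }
};

void transportAllBins(const Eigen::SparseMatrix<double>& diffusion,
                      std::vector<Eigen::VectorXd>& bins) {
    // One solver per thread, built once instead of once per bin per iteration.
    std::vector<Solver> pool(omp_get_max_threads(), Solver(diffusion));

    #pragma omp parallel for
    for (long bin = 0; bin < static_cast<long>(bins.size()); ++bin)
        pool[omp_get_thread_num()].advance(bins[bin]);
}
```

Splitting the solver into a shared immutable operator and per-thread mutable state, as in this sketch, is also what the second future improvement below would formalize.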

Benchmark

The benchmark is run on the examples/issl_rhi140/ data for 4 h. Overall, APCEMM scales badly beyond 2 CPUs. On main I think this is because of a lot of repeated work; in this PR it is likely due to expensive copies when initializing the solver pool.

Speedup varies with the number of CPUs and is heavily dependent on the machine. I ran this both on a fully reserved node (c041 on Hex) and locally, and got very different speedups, ranging from x1.25 to x2. I suspect this is cache-size related.

Your mileage may vary, but across all tests on different nodes this resulted in a net speedup, apart from one case using 8 CPUs. There I sometimes found slight performance regressions (0.95-1x) compared to 8 CPUs on main, which is why I added the warning to test this yourself. However, with the speedup from this PR, it might be worth running two separate APCEMM simulations in parallel with 4 CPUs each rather than running one after the other with 8 CPUs.


The breakdown of time spent in each part of APCEMM still shows that transport dominates, but it might be worth looking into the ice growth mechanisms too.

Future improvements

I see two large avenues for further improvements, but they would require larger refactors:

  1. Change the representation of the most compute-intensive variables to Eigen::VectorXd under the hood, with a light wrapper around it to allow 2D or 3D indexing (see the sketch after this list). This would avoid the (many) back-and-forth conversions throughout the code, and working with contiguous memory blocks would likely boost cache locality. It would also help with saving the 3D PSD distribution by eliminating the need to copy the data back to a contiguous format.

  2. Refactor the solvers to be lighter and copyable/movable: eliminate the unique pointers for the Points, get rid of the Eigen solvers that we do not use, and separate the solver instances into a shared immutable state (diffusion vectors, diffusion matrix coefficients, ...) and a mutable state. All of this would make the solvers much faster to initialize (fewer redundant computations, fewer copies), which would make the solver pool scale much better and potentially bring performance improvements at 8 threads.
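
As an illustration of item 1, a thin wrapper such as the `Field2D` class sketched below could provide 2D indexing over a single contiguous `Eigen::VectorXd`, so solvers and output code operate on the flat storage directly. Neither the class name nor the interface exists in APCEMM today; this is only a sketch of the idea.

```cpp
// Hypothetical sketch for item 1: a thin 2D view over contiguous storage.
// Neither the class name Field2D nor its interface exist in APCEMM today.
#include <Eigen/Dense>
#include <cassert>

class Field2D {
public:
    Field2D(Eigen::Index ny, Eigen::Index nx)
        : ny_(ny), nx_(nx), data_(Eigen::VectorXd::Zero(ny * nx)) {}

    // Row-major 2D indexing into the flat vector.
    double& operator()(Eigen::Index j, Eigen::Index i) {
        assert(j >= 0 && j < ny_ && i >= 0 && i < nx_);
        return data_(j * nx_ + i);
    }
    double operator()(Eigen::Index j, Eigen::Index i) const {
        assert(j >= 0 && j < ny_ && i >= 0 && i < nx_);
        return data_(j * nx_ + i);
    }

    // Expose the underlying contiguous vector so solvers (and output code
    // writing the PSD) can use it directly, with no copy back and forth.
    Eigen::VectorXd& flat() { return data_; }
    const Eigen::VectorXd& flat() const { return data_; }

private:
    Eigen::Index ny_, nx_;
    Eigen::VectorXd data_;
};
```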

Let me know what you think.

@lrobion lrobion requested review from ian-ross and sdeastham January 15, 2026 22:19
@ian-ross (Member) left a comment


I think this looks good, and, as you say, going further in this direction is going to take more significant refactoring (and it's not certain exactly what performance benefits would result). I would bank the gains you've made with these changes by merging this PR, and you can come back to it later, now that you have a better idea what's going on and what needs to be done. The transport part is still the best target for optimization since you know that it's doing unnecessary work and it's still the most time-consuming part.

I didn't know about #19. It would certainly be good to fix that so that you can do proper testing of multi-threaded optimizations.
