Summary
Addresses #89. This PR makes the transport step in APCEMM faster. Total runtime improves by roughly 25% with a single thread and roughly 40% with 8 threads.
Tested correctness on randomly perturbed fields (scaled + Gaussian perturbation) from the `issl_rhi140` example for 4 hours. The results are bitwise identical for a single-threaded run. I could not test this for multi-threaded setups due to #19.
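To make the perturbation concrete, a scaled + Gaussian perturbation of a 2D field can be generated along these lines (a minimal sketch; the function name, scale factor, and noise level are illustrative, not the exact values used for the test):

```cpp
#include <random>
#include <Eigen/Dense>

// Apply a uniform scaling plus additive Gaussian noise to a 2D field.
// The scale factor, noise level, and seed below are placeholders.
Eigen::MatrixXd perturbField( const Eigen::MatrixXd& field,
                              double scale = 1.05,
                              double sigma = 1.0e-3,
                              unsigned seed = 42 ) {
    std::mt19937 gen( seed );
    std::normal_distribution<double> noise( 0.0, sigma );
    Eigen::MatrixXd out = scale * field;
    for ( Eigen::Index j = 0; j < out.cols(); ++j )
        for ( Eigen::Index i = 0; i < out.rows(); ++i )
            out( i, j ) += noise( gen );
    return out;
}
```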
User facing changes:
- Added a `-DENABLE_TIMING` build option to report profiling information around key parts of APCEMM (see the sketch below).
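As a rough illustration of what such a flag typically guards (the helper below is an assumption for illustration, not the actual timing code added in this PR), timing a scope could look like:

```cpp
#include <chrono>
#include <cstdio>

#ifdef ENABLE_TIMING
// Hypothetical RAII helper: prints the wall-clock time spent in a scope.
struct ScopedTimer {
    explicit ScopedTimer( const char* label )
        : label_( label ), start_( std::chrono::steady_clock::now() ) {}
    ~ScopedTimer() {
        const std::chrono::duration<double, std::milli> dt =
            std::chrono::steady_clock::now() - start_;
        std::printf( "[timing] %s: %.3f ms\n", label_, dt.count() );
    }
  private:
    const char* label_;
    std::chrono::steady_clock::time_point start_;
};
#endif

void transportStep() {
#ifdef ENABLE_TIMING
    ScopedTimer timer( "ice aerosol transport" );
#endif
    // ... actual transport work ...
}
```

Assuming the option is wired through CMake as a compile definition of the same name, it would be enabled at configure time with `-DENABLE_TIMING=ON`.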
Code changes
There are many things which could still be improved but would require a larger refactoring. This is a first step at improving the transport step. Most of the changes focus on making the ice aerosol transport step faster:
- The `FVM_Solver::FVM_Solver` constructor gets coords data by reference instead of by value.
- The `Vector_2D` to `Eigen::VectorXd` conversion in `operatorSplitSolve` is reused across all 38 aerosol size bins instead of being recomputed every time.
- A pool of `FVM_Solver` instances is introduced so that threads do not call the `FVM_Solver::FVM_Solver` constructor each iteration (sketched below).

The last two points are the ones which make the largest difference by far.
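To illustrate the pool idea (a simplified sketch; `Solver`, `SolverPool`, and the OpenMP wiring are stand-ins, not the actual `FVM_Solver` interface), each thread reuses a solver that was constructed once up front:

```cpp
#include <vector>
#include <omp.h>

// Stand-in for the real solver: construction is the expensive part
// (allocations, matrix setup), so it should happen once per thread.
struct Solver {
    explicit Solver( const std::vector<double>& coords ) { /* expensive setup */ }
    void solve( int bin ) { /* advance one aerosol size bin */ }
};

// One pre-built solver per thread, reused across iterations and size bins.
class SolverPool {
  public:
    SolverPool( int nThreads, const std::vector<double>& coords ) {
        solvers_.reserve( nThreads );
        for ( int i = 0; i < nThreads; ++i )
            solvers_.emplace_back( coords );   // pay the construction cost once
    }
    Solver& local() { return solvers_[ omp_get_thread_num() ]; }
  private:
    std::vector<Solver> solvers_;
};

void transportAllBins( SolverPool& pool, int nBins ) {
    #pragma omp parallel for
    for ( int bin = 0; bin < nBins; ++bin )
        pool.local().solve( bin );   // no per-iteration solver construction
}
```

The catch, as noted in the benchmark section below, is that building the pool itself involves expensive copies, which is likely what limits scaling beyond a few threads.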
Benchmark
The benchmark is run on the `examples/issl_rhi140/` data for 4 hours. Overall, APCEMM scales badly with more than 2 CPUs. On `main` I think this is because of a lot of repeated work; in this PR it is likely due to expensive copies when initializing the solver pool.

Speedup varies as a function of the number of CPUs and is heavily dependent on the machine. I ran this both on a fully reserved node (`c041` on Hex) and locally, and got very different speedups, ranging from x1.25 to x2. I suspect this is related to cache size.

Your mileage may vary, but across all tests on different nodes this resulted in a net speedup, except for one case using 8 CPUs. There I sometimes found slight performance regressions (0.95-1x) compared to 8 CPUs on `main`, which is why I added a warning to test this yourself. However, with the speedup from this PR it might be worth starting two separate APCEMM simulations in parallel with 4 CPUs each rather than running them sequentially with 8 CPUs.
The overall time spent in each part of APCEMM still shows that transport dominates, but it might be worth looking into the ice growth mechanisms too.

Future improvements
I see two large avenues for further improvements, but they would require larger refactors:
1. Change the representation of the most compute-intensive variables to `Eigen::VectorXd` under the hood, with a light wrapper around it to allow 2D or 3D indexing (see the sketch at the end). This would avoid the (many) back-and-forth conversions throughout the code, and working with contiguous memory blocks will likely improve cache locality. It would also help with saving the 3D PSD distribution by eliminating the need to copy the data back to a contiguous format.
2. Refactor the solvers to be lighter and copyable/movable: eliminate the unique pointers for the `Point`s, get rid of the Eigen solvers that we do not use, and separate the solver instances into a shared immutable state (diffusion vectors, diffusion matrix coefficients, ...) and a mutable state. All of this would make the solvers much faster to initialize (fewer redundant computations, fewer copies), which would make the solver pool scale much better and potentially bring performance improvements at 8 threads.

Let me know what you think.
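For the first avenue, the light wrapper could be as simple as the following minimal sketch (the `Field2D` name and interface are illustrative assumptions, not code from this PR):

```cpp
#include <Eigen/Dense>

// Minimal 2D view over a contiguous Eigen vector: row-major (j, i) indexing
// on top of a single allocation, so solvers and I/O can use the flat data
// directly without converting between Vector_2D and Eigen types.
class Field2D {
  public:
    Field2D( Eigen::Index ny, Eigen::Index nx )
        : ny_( ny ), nx_( nx ), data_( Eigen::VectorXd::Zero( ny * nx ) ) {}

    double& operator()( Eigen::Index j, Eigen::Index i )       { return data_[ j * nx_ + i ]; }
    double  operator()( Eigen::Index j, Eigen::Index i ) const { return data_[ j * nx_ + i ]; }

    // Expose the contiguous buffer for Eigen solvers or for writing output.
    Eigen::VectorXd&       flat()       { return data_; }
    const Eigen::VectorXd& flat() const { return data_; }

    Eigen::Index rows() const { return ny_; }
    Eigen::Index cols() const { return nx_; }

  private:
    Eigen::Index ny_, nx_;
    Eigen::VectorXd data_;
};
```

A 3D variant for the PSD would just add one more stride; since the underlying buffer is already contiguous, writing it out would no longer require a copy.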