Try sharded atmosphere on TPU #2370

Open
Pangoraw wants to merge 25 commits into main from pb/atmosphere
@Pangoraw
Collaborator

@Pangoraw Pangoraw commented Apr 1, 2026

@Pangoraw Pangoraw closed this Apr 1, 2026
@Pangoraw Pangoraw reopened this Apr 1, 2026
@Pangoraw Pangoraw linked an issue Apr 6, 2026 that may be closed by this pull request
@dkytezab
Collaborator

dkytezab commented Apr 6, 2026

@Pangoraw you may have to bump the Reactant_jll compat inside of the Reactant branch you're using

@wsmoses
Member

wsmoses commented Apr 6, 2026

@Pangoraw can you separate powsimplify into a different PR so we can get it merged and jll'd concurrently

@Pangoraw
Collaborator Author

Pangoraw commented Apr 6, 2026

> @Pangoraw can you separate powsimplify into a different PR so we can get it merged and jll'd concurrently

#2398

@Pangoraw
Collaborator Author

Pangoraw commented Apr 6, 2026

> @Pangoraw you may have to bump the Reactant_jll compat inside of the Reactant branch you're using

this is reactant main, can you open a pr? EnzymeAD/Reactant.jl#2776

@Pangoraw Pangoraw closed this Apr 6, 2026
@Pangoraw Pangoraw reopened this Apr 6, 2026
@Pangoraw Pangoraw closed this Apr 7, 2026
@Pangoraw Pangoraw reopened this Apr 7, 2026
@Pangoraw Pangoraw closed this Apr 8, 2026
@Pangoraw Pangoraw reopened this Apr 8, 2026
@Pangoraw Pangoraw closed this Apr 8, 2026
@Pangoraw Pangoraw reopened this Apr 8, 2026
@Pangoraw Pangoraw closed this Apr 8, 2026
@Pangoraw Pangoraw reopened this Apr 8, 2026
@Pangoraw Pangoraw closed this Apr 13, 2026
@Pangoraw Pangoraw reopened this Apr 13, 2026
@giordano
Member

https://github.com/EnzymeAD/Enzyme-JAX/actions/runs/24361180145/job/71141242221?pr=2370#step:19:61

┌ Info: [0] allocations
│   GordonBell25.allocatorstats() =
│    AllocatorStats
│    --------------
│    Num Allocs: 2
│    In Use: 32.000 KiB
│    Peak In Use: 32.000 KiB
│    Largest Alloc Size: 30.500 KiB
│    Limit: 31.246 GiB
│    Reserved: 0 bytes
│    Peak Reserved: 0 bytes
│    Reservable Limit: 31.246 GiB
│    Largest Free Block: 31.246 GiB
│    Pool: nothing
│    Peak Pool: nothing
└    
┌ Warning: [0] IC file not found at /__w/Enzyme-JAX/Enzyme-JAX/GB-25/simulations/initial_conditions/atmosphere_no_microphysics_1deg_14day.jld2 — using analytic IC
└ @ Main /__w/Enzyme-JAX/Enzyme-JAX/GB-25/sharding/sharded_atmosphere_simulation_run.jl:101
┌ Info: [0] Generating atmosphere model (Nλ=6136, Nφ=3064, Nz=64, Δt=0.5s)...
└   now(UTC) = 2026-04-13T19:04:28.375
ERROR: LoadError: RESOURCE_EXHAUSTED: E0100: RuntimeBufferAllocationFailure:
Error allocating device buffer: Attempting to allocate 1.27G. That was not possible. There are 819.90M free.; (1x1x0_HBM0)
See https://openxla.org/xla/errors/error_0100 for more details.: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

Looks like there should be a lot more than 819 MB available (unless there have been other allocations before?)

@Pangoraw
Collaborator Author

@giordano that size might be too big

> RuntimeBufferAllocationFailure:
> Error allocating device buffer: Attempting to allocate 1.27G.
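
For a rough sense of scale, the size of one dense 3D field on the grid from the failing log (Nλ=6136, Nφ=3064, Nz=64) can be estimated as below. This is only a back-of-the-envelope sketch, not the model's actual memory layout: it assumes a single unsharded field with no halo cells, the helper name `field_bytes` is made up for illustration, and the log does not say which float type that run used, so both widths are shown.

```python
GIB = 1024 ** 3

def field_bytes(nx, ny, nz, itemsize):
    """Bytes for one dense nx*ny*nz array (assumes no halos and no sharding)."""
    return nx * ny * nz * itemsize

# Grid reported in the failing run's log.
nx, ny, nz = 6136, 3064, 64

big_f32 = field_bytes(nx, ny, nz, 4)  # 4 bytes per Float32 value
big_f64 = field_bytes(nx, ny, nz, 8)  # 8 bytes per Float64 value

# "Limit" value from the allocator stats printed before the failure.
device_limit_gib = 31.246

print(f"one Float32 field: {big_f32 / GIB:.3f} GiB")
print(f"one Float64 field: {big_f64 / GIB:.3f} GiB")
print(f"Float32 fields per device limit: {device_limit_gib / (big_f32 / GIB):.2f}")
```

Under those assumptions a single Float32 field is already about 4.48 GiB, so fewer than seven of them would fit under the reported 31.246 GiB limit. How this relates to the 1.27G request in the error depends on sharding and halo sizes that the log does not show.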

@giordano
Member

--float-type=Float32 --grid-x 544 --grid-y 272 --grid-z 64 works on v6: https://github.com/EnzymeAD/Enzyme-JAX/actions/runs/24363829566/job/71150315422?pr=2370#step:19:210


┌ Info: [0] allocations
│   GordonBell25.allocatorstats() =
│    AllocatorStats
│    --------------
│    Num Allocs: 70
│    In Use: 2.496 GiB
│    Peak In Use: 2.851 GiB
│    Largest Alloc Size: 201.307 MiB
│    Limit: 31.246 GiB
│    Reserved: 13.965 GiB
│    Peak Reserved: 13.965 GiB
│    Reservable Limit: 28.392 GiB
│    Largest Free Block: 14.427 GiB
│    Pool: nothing
│    Peak Pool: nothing
└    

If we can trust "Reservable Limit" + "Largest Free Block", then we should be around half of capacity. 768x384x64 should only just fit; it failed before, probably by a small margin. I'm currently trying two grids slightly smaller than 768x384x64.
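
The estimate above can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers from the 544x272x64 allocator stats and assumes, purely for illustration, that reserved memory scales linearly with the number of grid cells; nothing in the thread confirms that scaling law, and the helper name `cells` is hypothetical.

```python
# Values in GiB, copied from the 544x272x64 Float32 allocator stats.
reserved = 13.965
largest_free = 14.427
reservable_limit = 28.392

def cells(nx, ny, nz):
    """Total number of grid cells."""
    return nx * ny * nz

# Fraction of the reservable limit that is already reserved.
fraction = reserved / reservable_limit

# ASSUMPTION: reserved memory grows linearly with the cell count.
scale = cells(768, 384, 64) / cells(544, 272, 64)
projected = reserved * scale
headroom = reservable_limit - projected

print(f"reserved + largest free block = {reserved + largest_free:.3f} GiB")
print(f"reserved fraction             = {fraction:.1%}")
print(f"cell-count ratio              = {scale:.3f}")
print(f"projected reserved at 768x384 = {projected:.2f} GiB")
print(f"projected headroom            = {headroom:.2f} GiB")
```

In these stats "Reserved" plus "Largest Free Block" adds up to "Reservable Limit", and the reserved share is just under half. With roughly twice as many cells, the linear projection for 768x384x64 lands at about 27.83 GiB, a little over half a GiB below the limit, which is consistent with "should only just fit".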

@giordano
Member

--float-type=Float32 --grid-x 728 --grid-y 364 --grid-z 64: https://github.com/EnzymeAD/Enzyme-JAX/actions/runs/24365876944/job/71157384196?pr=2370#step:19:210

┌ Info: [0] allocations
│   GordonBell25.allocatorstats() =
│    AllocatorStats
│    --------------
│    Num Allocs: 70
│    In Use: 3.846 GiB
│    Peak In Use: 4.578 GiB
│    Largest Alloc Size: 252.137 MiB
│    Limit: 31.246 GiB
│    Reserved: 22.801 GiB
│    Peak Reserved: 22.801 GiB
│    Reservable Limit: 26.663 GiB
│    Largest Free Block: 3.862 GiB
│    Pool: nothing
│    Peak Pool: nothing
└    
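
To see how memory grew between the two successful runs, the reported numbers can be compared directly. This is a small sketch over the two sets of allocator stats posted in this thread; the dictionary layout and variable names are made up for illustration, and two data points are not enough to establish a scaling law.

```python
# Allocator stats in GiB, copied from the two Float32 runs in this thread.
runs = {
    (544, 272, 64): {"in_use": 2.496, "reserved": 13.965,
                     "largest_free": 14.427, "reservable_limit": 28.392},
    (728, 364, 64): {"in_use": 3.846, "reserved": 22.801,
                     "largest_free": 3.862, "reservable_limit": 26.663},
}

small, large = (544, 272, 64), (728, 364, 64)

def cells(grid):
    """Total number of grid cells for an (nx, ny, nz) tuple."""
    nx, ny, nz = grid
    return nx * ny * nz

cell_ratio = cells(large) / cells(small)
reserved_ratio = runs[large]["reserved"] / runs[small]["reserved"]
in_use_ratio = runs[large]["in_use"] / runs[small]["in_use"]

for grid, s in runs.items():
    total = s["reserved"] + s["largest_free"]
    share = s["reserved"] / s["reservable_limit"]
    print(f"{grid}: reserved + largest free = {total:.3f} GiB, "
          f"reserved share = {share:.1%}")

print(f"cell ratio     = {cell_ratio:.2f}x")
print(f"reserved ratio = {reserved_ratio:.2f}x")
print(f"in-use ratio   = {in_use_ratio:.2f}x")
```

In both runs "Reserved" plus "Largest Free Block" equals "Reservable Limit". Between these two grids the cell count grew by about 1.79x while reserved memory grew by about 1.63x, so memory did not scale strictly linearly with cell count here; the reservable limit itself also differs between the runs (28.392 GiB vs 26.663 GiB).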

Development

Successfully merging this pull request may close these issues.

Slow compile times on sharded run

4 participants