- Only allows allocations on a per-warp basis and apparently cannot free memory
- The CUDA allocator cannot be resized once its size has been set and allocations have happened!
- Probably needs a context reset for it to work
- This also means that only one instance of it can run at a time
- Not sure if this is a major use case, but nonetheless
- There also seems to be no difference between iterations
- So the first round behaves the same as subsequent iterations
- There are two variants of it now: the one from GitHub in mallocMC (currently in use) and the base version found in the RegEff code
- Should probably get the original at some point
- Currently it only works correctly in sync-mode
- The masks are quite hard to translate to modern CUDA
- Currently it works with Sync Build
- Simply increments an offset for each new allocation: `return d_heap_base + offset`
- Can be used together with coalescing on a warp basis
- Has no de-allocation and no re-use
- Can only increase its offset, hence it runs out of memory as soon as the offset reaches the end of the allocated memory
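The offset-increment scheme described above can be sketched as a host-side C++ model. This is only an illustration under assumptions: `BumpAllocator`, `heap_base`, `heap_size` and `deallocate` are hypothetical names, and `std::atomic::fetch_add` stands in for what would presumably be an `atomicAdd` on the device.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of the atomic (bump) allocator.
// Names are illustrative and not taken from the surveyed code.
struct BumpAllocator {
    std::uint8_t* heap_base;             // start of the pre-allocated region
    std::size_t heap_size;               // size of that region in bytes
    std::atomic<std::size_t> offset{0};  // next free byte, only ever grows

    BumpAllocator(std::uint8_t* base, std::size_t size)
        : heap_base(base), heap_size(size) {
        assert(base != nullptr);
    }

    // Returns `bytes` bytes from the region, or nullptr once it is exhausted.
    void* allocate(std::size_t bytes) {
        if (bytes > heap_size) return nullptr;
        // fetch_add plays the role of atomicAdd: reserve [old, old + bytes).
        std::size_t old = offset.fetch_add(bytes);
        if (old > heap_size - bytes) return nullptr;  // ran past the end
        return heap_base + old;
    }

    // No de-allocation and no re-use: freeing is intentionally a no-op.
    void deallocate(void*) {}
};
```

Note that in this sketch a failed request still advances the offset, so once the end has been reached every later request fails as well.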
- Works the same as the basic atomic allocator; the only difference is what happens once it reaches the end of the allocated memory
- In this case it will try to wrap around to the beginning using successive `atomicCAS` operations
- So it will simply start overwriting data from the front
- Has no de-allocation and no re-use
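The wrap-around behaviour can be sketched as a host-side C++ model as well. Again this is an assumption-laden illustration, not the surveyed implementation: `WrapAllocator` and its members are hypothetical names, and `compare_exchange_weak` stands in for the `atomicCAS` loop on the device.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of the wrap-around variant.
// Names are illustrative and not taken from the surveyed code.
struct WrapAllocator {
    std::uint8_t* heap_base;             // start of the pre-allocated region
    std::size_t heap_size;               // size of that region in bytes
    std::atomic<std::size_t> offset{0};  // next free byte, wraps to 0

    WrapAllocator(std::uint8_t* base, std::size_t size)
        : heap_base(base), heap_size(size) {
        assert(base != nullptr);
    }

    // Never fails for requests that fit the region, but may hand out
    // memory that is still in use by an earlier allocation.
    void* allocate(std::size_t bytes) {
        if (bytes == 0 || bytes > heap_size) return nullptr;
        std::size_t old = offset.load();
        for (;;) {
            // If the request does not fit behind `old`, restart at the front.
            std::size_t start = (old > heap_size - bytes) ? 0 : old;
            // Plays the role of atomicCAS; on failure `old` is reloaded
            // with the current offset and the loop retries.
            if (offset.compare_exchange_weak(old, start + bytes)) {
                return heap_base + start;
            }
        }
    }
};
```

Since nothing tracks which blocks are still live, the first allocation after a wrap aliases the oldest data at the front of the region.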
These methods get slower over time during allocation if nothing is freed
- Only works in sync-mode
- Allocates from the cudaHeap, hence cannot reallocate unfortunately
- Problem for the `mixed_allocation` testcase for `100.000` allocations with range `512-8192`
  - `10.000` works without a problem
- Can only allocate objects implemented in their specific format
- Even with a hack it will not work for a general purpose memory allocator
- Performance
  - `10.000`
    - `Oro - C - VA` fails with increasing likelihood for the larger allocation sizes
    - `Oro - C - VL` seems to work, but is quite slow
  - `100.000`
    - `Reg-Eff-CF` failed at `8192`
    - `Reg-Eff-CFM` fails a few times after `7376`
    - `Reg-Eff-CM` fails a few times after `6768`
- Mixed Performance
  - `10.000`
    - `Reg-Eff-C` fails in between for sizes `32`, `64`, `256`
    - `Oro - P - VL` fails after `32`
  - `100.000`
    - `Reg-Eff-C` fails after `16`
    - `Reg-Eff-CM` fails after `1024`
    - `Reg-Eff-CFM` fails after `4096`
    - `Oro - C - VA` fails after `2048` -> got manual results with fewer iterations
    - `Oro - P - VL` fails after `32`
- `Oro - C - S` fails for the largest two sizes `4096` and `8192` at the largest two thread counts `500.000` and `1.000.000`
- Fragmentation
  - Still missing for `Reg-Eff-CF`, `Reg-Eff-CM` and `Reg-Eff-CFM`
- OOM
  - `Oro - C - VA` and `Oro - C - VL` become really slow after a few hundred iterations, probably not moving the front correctly
  - `Reg-Eff-A*` also align to 16 bytes internally, hence they don't reach the maximum in the beginning
  - `Reg-Eff-C*` are painfully slow, hence they are typically reined in by the timeout
    - Also get slower with every passing iteration
- Graph Stats captured ✔️
- Init
  - `Oro - P - V*`: not everything works, `VA` dies sometimes with `died in freePage`
    - `VA` is missing `333SP`, `hugetric` and `adaptive`
    - `VL` is missing `caidaRouterLevel`, `delaunay_n20`
- Update
  - `Reg-Eff` does not return 16-byte aligned memory, hence vectorized copying of data does not work
- Workload
  - `Oro - P - VL` fails after `1024`
  - `Reg-Eff-C` fails after `8192`
- Could also test how write performance to that memory region is, not only the allocation speed
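Regarding the 16-byte alignment notes above: one possible caller-side workaround, sketched here purely as an assumption and not as something any of the tested allocators does, is to request `alignment - 1` extra bytes and round the returned pointer up. `align_up`, `is_aligned` and `align_pointer` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Rounds `value` up to the next multiple of `alignment` (a power of two).
constexpr std::size_t align_up(std::size_t value, std::size_t alignment) {
    return (value + alignment - 1) & ~(alignment - 1);
}

// True if `ptr` is a multiple of `alignment`, e.g. usable for 16-byte
// vectorized loads and stores when alignment == 16.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Shifts `raw` forward to the next `alignment` boundary. The caller must
// have requested at least `alignment - 1` extra bytes from the allocator.
inline void* align_pointer(void* raw, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(raw);
    return reinterpret_cast<void*>((addr + mask) & ~mask);
}
```

The shifted pointer is no longer the one the allocator handed out, so the original pointer would have to be kept around for any later free.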