Some Notes on the Allocators

Ouroboros


FDGMalloc


  • Only allows allocations on a per-warp basis and, it seems, cannot really free memory

CUDA


  • The CUDA allocator cannot be resized once its size has been set and allocations have happened
    • Probably needs a context reset for a resize to work
    • This also means that only one instance of it can run at a time
      • Not sure if this is a major use case, but nonetheless
  • There also seems to be no difference between iterations
    • So the first round behaves the same as subsequent iterations

ScatterAlloc


  • There are two variants now: the one from GitHub in mallocMC (currently in use) and the base version found in the Reg-Eff code
    • Should probably get the original version at some point
  • Currently it only works correctly in sync mode

Halloc


  • The masks are quite hard to translate to modern CUDA
    • Currently it works with the sync build

Reg-Eff


Atomic

  • Simply increments an offset for each new allocation and returns d_heap_base + offset
    • Can be used together with coalescing on a warp basis
  • Has no de-allocation and no re-use
  • Can only increase its offset, so it runs out of memory as soon as the offset reaches the end of the allocated memory
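The behaviour described above can be sketched as a host-side C++ model. This is not the benchmarked device code: on the GPU the `fetch_add` would be an `atomicAdd` on a device-side offset, and the names `BumpAllocator`, `heap_base` and `heap_size` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Host-side model of the "Atomic" allocator (hypothetical names, not the
// benchmarked implementation). heap_base plays the role of d_heap_base.
struct BumpAllocator {
    char* heap_base;                     // start of the pre-allocated heap
    std::size_t heap_size;               // total number of bytes in the heap
    std::atomic<std::size_t> offset{0};  // next free byte, only ever grows

    BumpAllocator(char* base, std::size_t size)
        : heap_base(base), heap_size(size) {
        assert(base != nullptr && size > 0);
    }

    // Reserve `bytes` by atomically advancing the shared offset.
    // Returns nullptr once the request no longer fits into the heap.
    void* allocate(std::size_t bytes) {
        std::size_t old = offset.fetch_add(bytes, std::memory_order_relaxed);
        if (old > heap_size || bytes > heap_size - old) return nullptr;
        return heap_base + old;
    }

    // No de-allocation and no re-use: freeing is a no-op.
    void deallocate(void*) {}
};
```

With a 64-byte heap, requests of 16 and 32 bytes return `heap_base` and `heap_base + 16`, and a further 32-byte request returns `nullptr` because only 16 bytes remain.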

Atomic Wrap

  • Works the same as the basic atomic allocator; the only difference is what happens once it reaches the end of the allocated memory
    • In this case it tries to wrap around to the beginning using successive atomicCAS operations
      • So it simply starts overwriting data from the front
  • Has no de-allocation and no re-use
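One possible shape of the wrap-around is a compare-and-swap retry loop, sketched here as a host-side C++ model. It assumes that a request which does not fit before the end of the heap restarts at offset 0; the actual implementation may handle this differently. `compare_exchange_weak` stands in for `atomicCAS`, and all names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Host-side model of the "Atomic Wrap" allocator (hypothetical names).
struct WrappingBumpAllocator {
    char* heap_base;
    std::size_t heap_size;
    std::atomic<std::size_t> offset{0};  // invariant: offset <= heap_size

    WrappingBumpAllocator(char* base, std::size_t size)
        : heap_base(base), heap_size(size) {
        assert(base != nullptr && size > 0);
    }

    void* allocate(std::size_t bytes) {
        if (bytes > heap_size) return nullptr;  // can never fit
        std::size_t old = offset.load(std::memory_order_relaxed);
        for (;;) {
            // Wrap to the front if the request does not fit before the end.
            std::size_t start = (bytes > heap_size - old) ? 0 : old;
            if (offset.compare_exchange_weak(old, start + bytes,
                                             std::memory_order_relaxed)) {
                // Memory handed out earlier at this position is overwritten.
                return heap_base + start;
            }
            // On failure `old` now holds the current offset; retry with it.
        }
    }
};
```

With a 64-byte heap, a 48-byte request followed by a 32-byte request returns `heap_base` both times, so the second allocation overlaps the first.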

These methods get slower over time during allocation if no free happens

Circular

Circular Fused

Text

Circular Multi

Circular Fused Multi

Text

XMalloc


  • Only works in sync mode
  • Allocates from the CUDA heap and hence, unfortunately, cannot be reallocated
  • Problems in the mixed_allocation test case with 100.000 allocations and size range 512-8192
    • 10.000 works without a problem

DynaSOAr


  • Can only allocate objects implemented in their specific format
    • Even with a hack it will not work for a general purpose memory allocator

Notes Performance

  • Performance

    • 10.000
      • Oro - C - VA fails with increasing likelihood for the larger allocation sizes
      • Oro - C - VL seems to work, but is quite slow
    • 100.000
      • Reg-Eff-CF failed at 8192
      • Reg-Eff-CFM fails a few times after 7376
      • Reg-Eff-CM fails a few times after 6768
  • Mixed Performance

    • 10.000
      • Reg-Eff-C fails in between for sizes 32,64,256
      • Oro - P - VL fails after 32
    • 100.000
      • Reg-Eff-C fails after 16
      • Reg-Eff-CM fails after 1024
      • Reg-Eff-CFM fails after 4096
      • Oro - C - VA fails after 2048 -> got manual results with fewer iterations
      • Oro - P - VL fails after 32

Notes Scaling

  • Oro - C - S fails for the two largest sizes (4096 and 8192) at the two largest thread counts (500.000 and 1.000.000)

Notes Mixed

Notes Fragmentation

  • Fragmentation
    • Missing still for Reg-Eff-CF, Reg-Eff-CM and Reg-Eff-CFM
  • OOM
    • Oro - C - VA and Oro - C - VL become really slow after a few hundred iterations, probably because the front is not moved correctly
    • Reg-Eff-A* also align to 16 bytes internally, hence they do not reach the maximum in the beginning
    • Reg-Eff-C* are painfully slow, hence they are typically reined in by the timeout
      • Also get slower with every passing iteration
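The effect of internal 16-byte alignment on the number of allocations that fit can be sketched with a little arithmetic. The sketch assumes that each request is simply rounded up to the next multiple of 16 bytes; how Reg-Eff actually pads its allocations is not stated in these notes, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Round `bytes` up to the next multiple of `alignment` (a power of two).
inline std::size_t align_up(std::size_t bytes, std::size_t alignment = 16) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Number of allocations of `bytes` that fit into `heap_size` once every
// request is padded to 16 bytes (assumption: padding is the only overhead).
inline std::size_t max_allocations(std::size_t heap_size, std::size_t bytes) {
    assert(bytes > 0);
    return heap_size / align_up(bytes);
}
```

Under this assumption a 1024-byte heap holds 64 padded 4-byte allocations instead of the 256 that would fit without padding, while sizes that are already multiples of 16 are unaffected.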

Notes Dynamic Graph

  • Graph Stats captured ✔️
  • Init
    • Oro - P - V*: not everything works; VA sometimes dies with the message "died in freePage"
      • VA is missing 333SP, hugetric and adaptive
      • VL is missing caidaRouterLevel, delaunay_n20
  • Update
    • Reg-Eff does not return 16-byte aligned memory, hence vectorized copying of the data does not work
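Vectorized 16-byte loads and stores (for example via `uint4` in CUDA) require 16-byte aligned addresses, so a copy routine could check the returned pointers first and fall back to a narrower copy otherwise. A minimal host-side sketch of such a check, with hypothetical helper names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True if `p` is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment = 16) {
    assert(alignment != 0);
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// A 16-byte vectorized copy is only valid if both pointers are 16-byte
// aligned and the size is a multiple of 16; otherwise use a narrower copy.
inline bool can_copy_vectorized(const void* dst, const void* src,
                                std::size_t bytes) {
    return is_aligned(dst) && is_aligned(src) && bytes % 16 == 0;
}
```

A pointer that is offset by 8 bytes from an aligned buffer fails the check, which matches the situation described for memory returned by Reg-Eff.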

Notes Synthetic

  • Workload
    • Oro - P - VL fails after 1024
    • Reg-Eff-C fails after 8192
  • Could also test write performance to the allocated memory region, not only the allocation speed