__ __ __ ______
/\ "-./ \ /\ \ /\ __ \
\ \ \-./\ \ \ \ \ \ \ \/\ \
\ \_\ \ \_\ \ \_\ \ \_____\
\/_/ \/_/ \/_/ \/_____/ -- (CXL) Memory Latency Distribution Measurement Tool
It measures the average latency of every n pointer-chasing memory accesses (n is set by -I)
and writes the latency values to log files, which together form the distribution of the memory access latency.
The number of pointer-chasing threads can be set by -t when -T is set to 0.
It can also record the latency values while sequential reads or writes run in the background.
The background read/write intensity can be tuned by setting the number of threads with -t.
The paper shows that the (loaded) latency distribution on our CXL-DRAM devices is quite different from that of regular DRAM.
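
The measurement idea described above, averaging the latency over every n dependent accesses, can be sketched in Python. This is only an illustration under stated assumptions: `build_chain` and `chase` are hypothetical names, timing uses `time.perf_counter_ns` rather than the TSC, and interpreter overhead dominates the numbers, so the values say nothing about real memory latency. The tool itself relies on assembly routines for the actual accesses.

```python
import random
import time


def build_chain(num_slots, randomize=False, seed=0):
    """Return next_idx, where next_idx[i] is the slot visited after slot i.

    The chain is a single cycle covering every slot, so a chase never
    gets trapped in a short loop. (Hypothetical helper, not the tool's code.)
    """
    order = list(range(num_slots))
    if randomize:
        random.Random(seed).shuffle(order)
    next_idx = [0] * num_slots
    for pos, slot in enumerate(order):
        next_idx[slot] = order[(pos + 1) % num_slots]
    return next_idx


def chase(next_idx, total_hops, group_size):
    """Follow the chain; return one average latency (ns) per group of hops."""
    samples = []
    cur = 0
    for _ in range(total_hops // group_size):
        start = time.perf_counter_ns()
        for _ in range(group_size):
            cur = next_idx[cur]  # each access depends on the previous one
        elapsed = time.perf_counter_ns() - start
        samples.append(elapsed / group_size)
    return samples, cur


if __name__ == "__main__":
    chain = build_chain(1 << 16, randomize=True)
    # group_size plays the role of n (the -I value): one sample per 8 hops.
    samples, _ = chase(chain, total_hops=8000, group_size=8)
    print(len(samples), "samples; first few (ns):", samples[:4])
```

Each entry of `samples` corresponds to one line of a latency log: a single averaged value per group of n accesses, and the collection of such values forms the distribution.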
- Compile the files: run `make` in `src`. The compiled file is named `bench`.
- All-in-one run: `./run.sh [nodes_list]`, for example `./run.sh 0,1`. `nodes_list` is a list separated by `,`. Each number corresponds to a NUMA node.
- For each type of memory access:
  - `pc` means 1-n threads pointer-chasing. `-T` is set to `0` for `bench`.
  - `rd` means 1-thread pointer-chasing with 0-(n-1) threads of sequential reads. `-T` is set to `1` for `bench`.
  - `wr` means 1-thread pointer-chasing with 0-(n-1) threads of sequential writes. `-T` is set to `2` for `bench`.
- The output files will be in `pc`, `rd`, and `wr`. Each line denotes a latency in nanoseconds.
- `TSC_FREQ_GHZ` should be set to the machine's CPU frequency.
- `-I` is set to 8 to measure the average latency of every 8 accesses.
- Tune the number of threads in `runx.sh` in each folder for the different access types to suit the physical machine. For example, `pc` currently uses `1,2,4,8` threads.
- The data buffer size is per thread. The data buffer accessed by each thread is independent. The size can be set with `-m` in MB.
- `-i` sets the number of iterations.
- `-r` sets which NUMA node is accessed.
- Use `-R` to enable random pointer-chasing. The default setting (without `-R`) uses non-random pointer-chasing with prefetchers off for measuring latency. The shuffling "window" and iterations can also be tuned in the code.
- By default, the program pins each thread it uses to a core, starting from core `0`. For example, with `-t` set to 8, the program uses cores `0` to `7`. The starting core (`starting_core` in the code) can be specified with `-c`.
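
Since each line of an output file is one latency sample in nanoseconds, the distribution can be summarized with a few lines of Python. The following is a minimal sketch and not part of the tool: `parse_latencies`, `percentile`, and `summarize` are hypothetical helpers, and the sample values in the demo are made up purely to exercise the code.

```python
import math


def parse_latencies(lines):
    """Convert an iterable of text lines into floats, skipping blank lines."""
    values = []
    for line in lines:
        line = line.strip()
        if line:
            values.append(float(line))
    return values


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, with pct in (0, 100]."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(values):
    """Return basic statistics describing a latency distribution (ns)."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "mean": sum(ordered) / len(ordered),
        "p50": percentile(ordered, 50),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


if __name__ == "__main__":
    # Made-up sample lines standing in for the contents of a log file.
    demo_lines = ["100\n", "110\n", "\n", "120\n", "130\n", "500\n"]
    print(summarize(parse_latencies(demo_lines)))
```

For a real log, something like `summarize(parse_latencies(open(path)))` would work, where `path` is a placeholder for one of the files under `pc`, `rd`, or `wr`. Comparing `p50` against `p99` and `max` is one simple way to see how heavy the tail of a distribution is.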
The assembly code in op_ptr_chase, op_ld, and op_st is
adapted from the source code in
"Demystifying CXL Memory with Genuine CXL-Ready Systems and Devices" [MICRO'23]