This file contains various notes and lessons learned concerning performance
of the Homa Linux kernel module. The notes are in reverse chronological
order.
68. (January 2025) Performance snapshot with and without pacer, using
c6620 CloudLab nodes, "-w w4 -b 80 -s 20 -n 6". cp_vs_tcp is used unless
cp_both is indicated.
AvgSlow: "avg slowdown" from cp_vs_tcp log output
Min: "min" from cp_vs_tcp "avg slowdown" line
P50: "P50" from cp_vs_tcp "avg slowdown" line
P99: "P99" from cp_vs_tcp "avg slowdown" line
P99L: P99 for 1 MB messages, from *_w4.data file
MaxT: Throughput under "-b100"
AvgSlow Min P50 P99 P99L MaxT
Homa (old pacer) 3.33 22.1 50.9 98.5 3284 96.6
Homa (homa_qdisc) 3.31 21.1 50.6 90.7 3698 94.6
Homa (cp_both, old pacer) 4.56 23.6 56.4 379.6 4182
Homa (cp_both, homa_qdisc) 3.72 23.5 53.6 124.6 4021
TCP (no homa_qdisc) 11.81 32.9 180.4 1271.6 5235 94.8
TCP (homa_qdisc) 10.80 32.6 157.1 832.6 4627 95.7
TCP (cp_both, old pacer) 9.22 31.4 151.6 839.2 3062
TCP (cp_both, homa_qdisc) 9.13 32.9 136.5 762.5 4127
Summary:
* Without homa_qdisc, Homa P99 suffers a lot under cp_both; with homa_qdisc
it improves 3x, but is still 30% slower than running without TCP.
* homa_qdisc improves TCP performance even when running without Homa.
* TCP performance is better running with Homa than standalone.
* Homa_qdisc reduces Homa's maximum throughput slightly, increases TCP's
maximum throughput slightly.
67. (January 2025) Performance variation over reboots. On the c6620 CloudLab
cluster, both Homa and TCP performance seem to vary from reboot to reboot,
while remaining relatively consistent within a single boot. However, after
observing this phenomenon one day, it completely disappeared the next day
(reboots consistently resulted in "fast" behavior). There was a CloudLab datacenter
shutdown overnight... perhaps that somehow changed the behavior?
Each line below represents one reboot of a c6620 cluster running this command:
cp_vs_tcp -w w4 -b 80 -s 20 -l /ouster/logs/test -n 6 --skip 0 --tcp yes
--port-threads 3 --port-receivers 3 --client-ports 5 --server-ports 2
--tcp-client-ports 10 --tcp-server-ports 20
The command was run once with the old pacer and once with homa_qdisc
enabled. In addition, for the "Both" measurements, cp_both was used
to run Homa and TCP simultaneously (with homa_qdisc enabled):
cp_both -w w4 -b 80 -s 20 -l /ouster/logs/test -n 6 --skip 0 --homa-gbps 40
--port-threads 3 --port-receivers 3 --client-ports 5 --server-ports 2
--tcp-client-ports 10 --tcp-server-ports 20
Each measurement includes average slowdown and P99 short-message latency,
as printed by cp_vs_tcp on the "avg slowdown" line.
Homa no qdisc TCP no qdisc Homa Qdisc Tcp Qdisc Homa Both TCP Both
Avg P99 Avg P99 Avg P99 Avg P99 Avg P99 Avg P99
----------------------------------------------------------------------------
3.30 97 11.68 1258 3.30 90 10.91 850 3.67 125 9.09 766
3.30 97 11.73 1277 3.29 90 10.72 816 3.69 125 9.17 768
3.30 97 11.62 1273 3.31 91 10.77 837 3.68 125 9.15 784
3.30 98 11.71 1268 3.30 90 10.75 828 3.68 124 9.15 769
3.31 99 11.64 1268 3.60 97 11.54 906 4.38 144 11.60 892
3.34 101 11.71 1253 3.35 94 10.92 860 3.71 125 9.19 774
3.85 135 12.40 1501 4.08 117 12.04 961 4.77 158 13.19 1003
3.94 143 12.53 1555 3.92 107 11.87 961 5.12 204 14.28 1126
4.20 255 12.86 1694 4.20 133 12.71 1053
The following experiments were run repeatedly without rebooting the nodes
(two different reboots separated by a blank line):
Homa no qdisc TCP no qdisc Homa Qdisc Tcp Qdisc Homa Both TCP Both
Avg P99 Avg P99 Avg P99 Avg P99 Avg P99 Avg P99
----------------------------------------------------------------------------
3.30 97 11.73 1277 3.29 90 10.72 816 3.69 125 9.17 768
3.30 97 11.74 1270 3.31 90 10.72 823 3.67 125 9.05 764
3.30 97 11.68 1259 3.30 90 10.92 852 3.67 124 9.11 762
3.30 97 11.68 1267 3.30 90 10.79 831 3.69 125 9.21 766
3.29 97 11.74 1291 3.30 90 10.79 837 3.68 125 9.12 767
3.29 97 11.75 1276 3.30 90 10.68 814 3.68 125 9.16 773
3.97 138 12.72 1599 4.06 110 12.33 1002 4.91 173 13.78 1083
4.07 146 12.74 1632 4.03 112 12.04 966 4.72 162 13.49 1020
4.09 150 12.63 1577 4.06 114 12.43 1013 5.05 181 14.41 1145
4.08 148 12.59 1557 4.11 114 12.33 1000 5.02 177 14.39 1093
4.17 180 12.65 1603 3.94 106 12.20 988 4.95 178 14.30 1130
3.99 132 12.74 1629 4.05 111 12.25 989 5.01 172 14.01 1085
4.02 143 12.53 1558 4.09 113 11.91 959 5.10 191 15.18 1208
3.89 126 12.54 1590 4.28 120 12.08 981 5.14 182 14.07 1111
66. (January 2025) Evaluated benchmarking parameters for 100 Gbps networks
(c6620 CloudLab cluster). Overall, for W4 the best parameters for Homa are:
--port-threads 3 --port-receivers 3 --client-ports 5 --server-ports 2
('--client-ports 4 --server-ports 2' and '--client-ports 3 --server-ports 3'
are about the same)
and for TCP:
--tcp-client-ports 10 --tcp-server-ports 20
Here are more detailed measurements:
Thr: --port-threads and --port-receivers
CPorts: --client-ports
SPorts: --server-ports
TcpCP: --tcp-client-ports
TcpSP: --tcp-server-ports
HomaS: Average slowdown for Homa
HomaP99: P99 latency for short messages for Homa (usecs)
TcpS: Average slowdown for TCP
TcpP99: P99 latency for short messages for TCP (usecs)
Homa under cp_vs_tcp with homa_qdisc (c6620, 6 nodes):
Note: these measurements were taken with a "good" boot configuration
-w -b Thr CPorts SPorts HomaS HomaP99
-------------------------------------------
w3 34 2 3 3 1.98 142
w3 34 2 4 4 1.77 98
w3 34 2 5 5 1.67 76
w3 34 2 6 6 1.68 72
w3 34 2 7 7 1.69 69 max tput (47.4 Gbps)
w3 34 2 8 8 1.70 69
w4 80 3 5 1 3.44 170
w4 80 3 4 2 3.36 94
w4 80 3 5 2 3.35 94
w4 80 3 3 3 3.41 96
w4 80 3 4 4 3.51 99
w4 80 3 5 5 3.53 100
w4 80 3 3 5 3.58 104
w4 80 3 5 3 3.43 94
w4 80 2 4 4 3.43 109
w4 80 2 5 5 3.43 103
w4 80 2 6 6 3.47 102
w4 80 2 7 7 3.51 104
w5 80 2 6 4 8.04 177
w5 80 3 4 2 7.74 136
w5 80 3 3 3 8.25 141
w5 80 3 4 4 8.42 141
TCP under cp_vs_tcp with homa_qdisc (c6620, 6 nodes):
Note: these measurements were taken with a "good" boot configuration
-w -b TcpCP TcpSP TcpS TcpP99
------------------------------------
w3 34 4 8 3.10 445
w3 34 5 10 3.72 516
w3 34 6 12 3.50 430
w3 34 7 14 3.53 390
w3 34 8 16 3.63 368 max tput (42.7 Gbps)
w3 34 9 18 3.80 361
w4 80 2 4 25.31 4040
w4 80 3 6 13.83 1790
w4 80 4 8 12.42 1536
w4 80 5 10 12.23 1461
w4 80 6 12 12.28 1342
w4 80 7 14 11.68 1105
w4 80 8 16 11.40 980
w4 80 9 18 10.87 872
w4 80 10 20 10.79 843
w4 80 12 24 11.41 821
w4 80 15 30 15.74 915
w5 80 6 12 16.00 1927
w5 80 8 16 15.80 1866
w5 80 10 20 15.27 1636
w5 80 12 24 15.38 1478
Explored configuration for cp_both (c6620 cluster, -w w4 -b 80):
Note: these measurements were taken with a "good" boot configuration
(HGbps is the --homa-gbps parameter)
HGbps Thr CPorts SPorts TcpCP TcpSP HomaS HomaP99 TcpS TcpP99
----------------------------------------------------------------------
5 3 1 1 8 16 5.02 268 10.83 929
5 3 2 2 10 20 4.54 205 10.50 842
5 3 3 3 12 24 4.45 179 11.07 818
5 3 4 2 10 15 4.23 173 10.23 892
5 3 5 2 12 12 3.95 136 10.18 932
5 3 6 2 8 24 4.36 181 10.30 830
5 2 4 2 16 16 4.02 148 10.52 856
5 2 5 2 16 20 4.32 182 11.40 844
20 3 4 2 10 20 3.93 154 10.14 828
20 3 5 2 10 20 3.86 146 10.06 815
20 3 5 3 10 20 3.89 145 10.13 824
40 3 5 2 10 20 3.71 125 9.19 774
40 3 5 3 10 20 3.71 124 9.21 763
40 3 5 3 8 16 3.72 126 8.99 794
60 3 4 2 10 20 3.48 106 7.60 643
60 3 5 2 10 20 3.47 104 7.62 647
60 3 5 3 10 20 3.48 103 7.60 635
75 3 4 2 10 20 3.27 94 7.38 560
75 3 5 2 10 20 3.24 93 7.36 553
75 3 5 3 3 6 3.42 93 7.04 602
75 3 5 3 8 16 3.31 91 7.21 565
65. (December 2025) The pacer does not prevent NIC queue buildup. Under
"-w w4 -b 80" on c6620 machines (Intel NICs) it is not unusual to see
periods of 1ms or longer with more than 500 Kbytes of packet data queued
in the NIC. This happens because the NIC cannot always sustain 100 Gbps of
output. Even with large amounts of queued data, the NIC completion rate varies
between 85 and 100 Gbps. Since the pacer will queue data at almost 100 Gbps,
the NIC queue builds when there is a large backlog of data; over time, the
queue for data tends to move from the pacer to the NIC. The pacer limit rate
would have to be reduced considerably to eliminate this problem (e.g., 85 Gbps
instead of 99 Gbps?) but that would waste a lot of NIC bandwidth since there
are many times when the NIC can transmit at nearly 100 Gbps.
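The queue-growth rate implied by these numbers can be sketched with simple
arithmetic (the 99 and 85 Gbps rates are taken from this note; the exact
values vary from moment to moment):

```python
# Back-of-envelope for NIC queue growth when the pacer feeds the NIC
# faster than the NIC can drain it (rates taken from the note above).
pacer_gbps = 99   # pacer releases data at almost full link rate
nic_gbps = 85     # worst observed NIC completion rate

growth_gbps = pacer_gbps - nic_gbps          # net rate of queue growth
bytes_per_ms = growth_gbps * 1e9 / 8 / 1e3   # bytes queued per millisecond
print(f"queue grows ~{bytes_per_ms / 1e3:.0f} KB per ms")
```

So a 500 KB NIC backlog can accumulate in well under a millisecond whenever
the NIC drops to its slower completion rate.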
64. (November 2025) Separating pacer traffic from non-paced traffic in
homa_qdisc (use tx queue 0 for paced traffic; non-paced traffic is spread
across other queues, using default queues except that traffic for queue 0
goes to queue 1 instead). In comparison to the old pacer (measurements
with w4 and w5 on c6620 cluster at 80 Gbps load; see log book for graphs):
* P99 for messages shorter than defer_min_bytes is 20-30% faster with separation
* P99 for messages between defer_min_bytes and unsched_limit is about 2x
slower with separation
* P99 for messages longer than unsched_limit starts off 40-50% slower with
separation, but gradually converges.
* Increasing defer_min_bytes provides upside with no apparent downside.
* Average slowdowns are better with the old pacer: 3.45 vs. 3.77 for W4,
9.40 vs. 7.72 for W5 (W5 has no messages shorter than defer_min_bytes).
* It appears that Intel NICs cannot always transmit at full link bandwidth,
so some queuing occurs in the NIC even with Homa's output
pacing.
* When packets build up in the NIC, it appears to use some sort of fair
sharing mechanism between the queues. By placing a disproportionate
share of outgoing bytes in a single queue, those bytes effectively get
lower priority and bytes in other queues get higher priority, which
explains the behaviors observed above.
* Overall, it appears that placing pacer traffic in a dedicated queue is
not a good idea.
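For concreteness, the queue mapping described above might look like the
following sketch (a hypothetical function, not Homa's actual homa_qdisc code):

```python
# Hypothetical sketch of the tx-queue separation described above:
# paced traffic is pinned to queue 0, and non-paced traffic keeps its
# default queue, except that non-paced traffic which would have used
# queue 0 is shifted to queue 1.
def select_tx_queue(paced: bool, default_queue: int) -> int:
    if paced:
        return 0
    return 1 if default_queue == 0 else default_queue
```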
63. (September 2025) Compared CPU utilization against TCP. Measured with
top, running cp_vs_tcp -w w4 -b20 on a 6-node xl170 cluster (20 cores):
Homa TCP Homa no polling
us (user) 9.8 15.7 11.0
sy (system) 31.5 11.8 17.3
ni (nice) 0.0 0.0 0.0
id (idle) 38.0 49.2 51.9
wa (iowait) 0.0 0.0 0.0
hi (hardware interrupts) 0.0 0.0 0.0
si (software interrupts) 19.3 22.2 19.5
st (hypervisor steal) 0.0 0.0 0.0
Without polling, Homa's CPU utilization is slightly lower than TCP's.
Polling costs an extra 2-3 cores for Homa.
62. (August 2025) Using ktime_get_ns (rdtscp) instead of get_cycles (rdtsc)
in homa_clock (Linux reviewers won't allow get_cycles for upstreaming).
rdtscp takes about 14 ns per call, vs. 8 ns for rdtsc. Running "w4 -b20"
on xl170s, homa_clock is invoked about 21 M times/sec, so expect about 0.12
additional cores to be used. Measurements on xl170 cluster (25 Gbps network)
using "w4 -b20" (average across 6 nodes in experiment, then average over 5 runs):
rdtsc rdtscp Ratio
Gbps/core: 6.46 6.22 0.954
Total core utilization: 6.20 6.44 1.038
Same experiment but in overload ("w4 -b40"):
rdtsc rdtscp Ratio
Gbps/core: 5.44 5.32 0.980
Total core utilization: 8.08 8.05 0.997
Maximum throughput (Gbps): 21.95 21.42 0.976
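The 0.12-core estimate above follows from simple arithmetic:

```python
# Arithmetic behind the ~0.12-core estimate above.
calls_per_sec = 21e6   # homa_clock invocations/sec under "w4 -b20"
extra_ns = 14 - 8      # rdtscp cost minus rdtsc cost, per call (ns)
extra_cores = calls_per_sec * extra_ns * 1e-9
print(f"{extra_cores:.3f} additional cores")   # ~0.126
```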
61. (July 2025) Client responses could starve server requests. This came
about because a server request that wakes up after waiting for buffer space
has 0 received bytes. In contrast, a new client response will have received
unscheduled bytes. As a result, the client responses always got priority for
new grants and server requests could starve. The solution was to grant server
requests an amount equal to the unscheduled bytes when they wake up after
waiting for buffer space.
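A minimal sketch of that fix (hypothetical field names; the real logic lives
in Homa's grant code):

```python
# When a server RPC wakes after waiting for buffer space, it has 0
# received bytes and would always lose grant priority to client
# responses (which arrive with unscheduled bytes already received).
# The fix: credit the woken RPC with the unscheduled allotment.
def on_buffer_space_available(rpc, unsched_bytes):
    rpc["granted"] = max(rpc["granted"], unsched_bytes)
    return rpc
```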
60. (July 2025) Measured impact of new FIFO grant mechanism on xl170
cluster using "-w starve -b 40 -s 30 -n 6" (priorities were not enabled).
Slowdowns as a function of message length:
grant_fifo_fraction = 0 grant_fifo_fraction = 50 grant_fifo_fraction = 100
# length s50 s99 s999 s50 s99 s999 s50 s99 s999
100000 13.7 25.5 86.8 13.3 21.7 31.9 13.2 22.2 32.7
200000 13.0 32.2 75.7 12.7 21.2 29.0 12.6 21.5 30.7
300000 13.4 30.2 64.5 13.1 22.1 28.2 13.0 22.7 30.2
400000 14.3 30.9 60.1 14.0 24.5 30.6 14.1 25.9 33.3
500000 16.1 35.0 83.0 15.9 30.5 37.4 16.4 32.8 41.6
600000 19.0 49.3 185.7 19.5 41.2 53.1 20.8 47.7 62.2
700000 24.1 70.1 222.0 26.7 67.8 91.4 30.8 88.6 122.2
800000 34.8 121.2 282.6 47.5 178.9 268.4 67.9 315.6 470.3
900000 72.6 307.5 470.5 1155.3 2139.8 2314.1 1477.2 1746.0 1823.7
1000000 3093.4 12063.2 13050.8 1982.2 2354.0 2482.9 1467.0 1647.1 1709.4
Even shorter messages seem to benefit from the FIFO mechanism (not sure why...).
Increasing the FIFO fraction from 5% to 10% doesn't make much difference and
starts to penalize smaller messages more.
FIFO also helps even when the cluster isn't overloaded: slowdown at
"-w starve -b 20 -s 30 -n 6":
grant_fifo_fraction = 0 grant_fifo_fraction = 50
# length s50 s99 s999 s50 s99 s999
100000 11.0 20.0 27.4 11.1 20.0 28.1
200000 10.5 19.5 26.0 10.5 19.6 26.1
300000 10.6 20.4 25.7 10.7 20.8 25.8
400000 11.1 22.0 26.8 11.2 22.7 28.5
500000 11.8 24.6 31.3 12.1 26.9 35.4
600000 13.0 30.3 39.2 13.4 33.5 45.9
700000 14.6 39.4 53.5 15.2 46.9 67.8
800000 16.9 55.3 82.7 17.6 63.8 92.7
900000 20.2 93.6 147.9 20.8 80.5 112.2
1000000 23.4 155.6 250.4 23.3 93.3 128.7
When the cluster isn't overloaded, short messages get a bit worse when FIFO
is enabled.
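The FIFO grant mechanism can be illustrated with a minimal sketch (this is
not Homa's actual code; the selection policy and fifo_every parameter are
illustrative stand-ins for grant_fifo_fraction):

```python
# Minimal sketch of mixing SRPT grants with a FIFO fraction: most
# grants go to the message with the fewest remaining bytes (SRPT),
# but every Nth grant goes to the oldest message so that large
# messages cannot starve indefinitely under overload.
def pick_grantee(messages, grant_index, fifo_every=10):
    """messages: list of (arrival_order, bytes_remaining) tuples."""
    if grant_index % fifo_every == 0:
        return min(messages, key=lambda m: m[0])   # oldest message
    return min(messages, key=lambda m: m[1])       # shortest message
```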
59. (May 2025) Measured overhead to read various clocks on 2.4 GHz
Xeon E5-2640 (note: measured when CPU is active, hence running in fastest
mode):
Function Units Overhead
-----------------------------------------------
rdtsc cycles 8 ns
rdtscp cycles 14 ns
sched_clock ns 9 ns
ktime_get_mono_fast_ns ns 24 ns
ktime_get_raw_fast_ns ns 24 ns
58. (September 2024): Interference between Homa and TCP when both run
concurrently on the same nodes (no special kernel code to mitigate
interference)
Experiment on xl170 cluster:
cp_both -n 9 --skip 0 -w w4 -b 20 -s 30
HomaGbps: Gbps generated by Homa (20 - HomaGbps generated by TCP)
HAvg: Average slowdown for Homa
HP50: Median RTT for Homa short messages
HP99: P99 RTT for Homa short messages
TAvg: Average slowdown for TCP
TP50: Median RTT for TCP short messages
TP99: P99 RTT for TCP short messages
HomaGbps HAvg HP50 HP99 TAvg TP50 TP99
0 63.4 797 6089
2 8.1 66 335 80.5 1012 10131
4 8.6 65 507 80.0 1021 9315
6 9.9 66 765 80.8 1022 9328
8 12.1 68 1065 79.8 1042 8309
10 14.3 70 1324 76.7 993 6881
12 15.1 72 1394 73.4 971 5866
14 14.8 75 1305 73.1 927 6076
16 12.9 75 1077 70.2 816 6564
18 10.0 70 755 69.7 748 7387
20 4.4 44 119
Overall observations:
* Short messages:
* Homa: 2x increase for P50, 10x increase for P99
* TCP: 25% increase for P50, 10% increase for P99
* The TCP degradation is caused by Homa using priorities. If the
experiment is run without priorities for Homa, TCP's short-message
latencies are significantly better than TCP by itself: 571 us for P50,
3835 us for P99.
* Long messages:
* TCP P50 and P99 latency drop by up to 40% as Homa traffic share
increases (perhaps because Homa throttles itself to link speed?)
* Running Homa without priorities improves TCP even more (2x gain for TCP
P50 and P99 under even traffic split, relative to TCP alone)
* Homa latency not much affected
* Other workloads:
* W5 similar to W4
* W3 and W2 show less Homa degradation, more TCP degradation
* Estimated NIC queue lengths have gotten much longer (e.g. P99 queueing
delay of 235-750 us now, vs. < 10 us when Homa runs alone)
* Homa packets are experiencing even longer delays than this because
packets aren't distributed evenly across tx queues, while the NIC serves
queues evenly.
57. (August 2024): Best known parameters for c6525-100g cluster:
Homa:
hijack_tcp=1 .unsched_bytes=20000 window=0 max_incoming=1000000
gro_policy=0xe2 throttle_min_bytes=1000
--client-ports 4 --port-receivers 6 --server-ports 4 --port-threads 6
TCP:
--tcp-client-ports 4 --tcp-server-ports 6
56. (August 2024): Performance challenges with c6525-100g cluster (AMD CPUs,
100 Gbps links):
* The highest achievable throughput for Homa with W4 is 72-75 Gbps.
* TCP can get 78-79 Gbps with W4.
* The bottleneck is NIC packet transmission: 1 MB or more of data can
accumulate in NIC queues, and data can be queued in the NIC for 1 ms
or more.
* Memory bandwidth appears to be the limiting factor (not, say,
per-packet overheads for mapping addresses). For example, W2 can
transmit more packets than W4 without any problem.
* NIC queue buildup is not even across output queues. The queue used by
the pacer has significantly more buildup than the other queues. This
suggests that the NIC services queues in round-robin order. The pacer
queue gets a large fraction of all outbound traffic but it receives
only a 1/Nth share of the NIC's output bandwidth, so when the NIC can't
keep up, packets accumulate primarily in this one queue.
* Priorities don't make a significant difference in latency! It appears
that the NIC queuing issue is the primary contributor to P99 latency
even for short messages (too short to use the pacer). This is evident
because not only do P99 packets take a long time to reach the receiver's
GRO, they also take a long time to get returned to the sender to be
freed; this suggests that they are waiting a long time to get
transmitted. Perhaps the P99 packets are using the same output queue
as the pacer?
* Even at relatively low throughputs (e.g. 40 Gbps), P99 latency still
seems to be caused by slow NIC transmission, not incast queueing.
* Increasing throttle_min_bytes improves latency significantly, because
packets transmitted by the pacer are much more likely to experience
high NIC delays.
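The round-robin starvation effect above can be quantified with assumed
numbers (the queue count and pacer traffic share below are illustrative,
not measured):

```python
# Why a single hot queue backs up under round-robin NIC service.
link_gbps = 100
n_queues = 8                 # assumed number of active tx queues
pacer_traffic_share = 0.5    # assumed: pacer queue carries half the bytes

pacer_offered = link_gbps * pacer_traffic_share   # Gbps entering pacer queue
pacer_drained = link_gbps / n_queues              # Gbps the NIC drains from it
print(pacer_offered, pacer_drained)   # the queue grows by the difference
```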
55. (June/July 2024): Reworked retry mechanism to retry more aggressively.
Introduced ooo_window_usecs sysctl parameter with an initial value of
100 us; retry gaps once they reach this age. However, this increased the
number of resent packets by 20x and reduced throughput as well.
Hypothesis: many packets suffer quite long delays but eventually get
through; with fast retries, these get resent unnecessarily. Tried
increasing the value of ooo_window_usecs, and this helped a bit, but
performance is best if retries only happen when homa_timer hits its
resend_ticks value. So, backed out support for ooo_window_usecs.
54. (June 2024): New sk_buff allocation mechanism. Up until now, Homa
allocated an entire tx sk_buff with alloc_skb: both the packet header
and the packet data were allocated in the head. However, this resulted
in high overheads for sk_buff allocation. Introduced a new mechanism
(in homa_skb.c) for tx sk_buffs, where only the packet header is in the
head. The data for data packets is allocated using frags and high-order
pages (currently 64 KB). In addition, when sk_buffs are freed, Homa
saves the pages in pools (one per NUMA node) to eliminate the overhead
of page allocation. Here are before/after measurements taken with the
W4 workload on a 9-node c6525-100g cluster:
Before After
Avg. time to allocate sk_buff 7-9 us 0.85 us
Cores spent in sk_buff alloc 3.6-4.5 0.4-0.5
Cores spent in kfree_skb 1.1-1.3 0.3-0.4
Goodput/core 5.9-7.2 Gbps 8.4-10 Gbps
Time to allocate page 12 us
Cores spent allocating pages 0.04-0.08
53. (May 2024; superseded by #56) Strange NIC behavior (observed with Mellanox
ConnectX5 NICs on the c6525-100g CloudLab cluster, using W4 with offered
load 80 Gbps and actual throughput more like 60 Gbps).
* The NIC is not returning tx packets to the host promptly after
transmission. In one set of traces (W4 at 80% offered load), 20% of
all packets weren't freed until at least 50 us after the packets had
been received by the target GRO; P99 delay was 400 us, and some packets
were delayed more than 1 ms. Note: other traces are not as bad, but
still show significant delays (15-20% of delays are at least 50 usec,
worst delays range from 250 us - 1100 us).
* Long delays in returning tx packets cause Linux to stop the tx queue
(it has a limit on outstanding bytes on a given channel), which slows
down transmission.
* The NIC doesn't seem to be able to transmit packets at 100 Gbps.
Many packets seem not to be transmitted for long periods of time (up to
1-2 ms) after they are added to a NIC queue: both the time until GRO
receipt and time until packet free are very long. Different tx queues
experience different delays: the delays for one queue can be short at
the same time that delays for another queue are very long. These problems
occur when Homa is passing packets to the NIC at < 100 Gbps.
* The NIC is not transmitting packets from different tx queues in a FIFO
order; it seems to be favoring some tx queues (perhaps it is
round-robining so queues with more traffic get treated badly?).
52. (February 2024) Impact of core allocation. Benchmark setup: 2 nodes,
c6525-100g cluster (100 Gbps network, 48 hyperthreads, 24 cores, 3 cores
per chiplet?):
cp_node server --pin N
cp_node client --workload 500000 --one-way --client-max 1
window=0 max_incoming=2500000 gro_policy=16 unsched_bytes=50000
Measured RPC throughput and copy_to_user throughput:
--pin Gbps Copy
0 17.7 33.4
3 18.9 32.2
6 19.0 34.3
8 18.8 34.1
9 22.2 54.2
10 25.7 53.2
11 26.3 55.1
12 17.9 31.7
13 18.2 31.6
15 17.9 31.5
18 18.2 32.3
21 18.1 32.4
32 18.6 34.0
33 24.8 54.0
34 25.9 54.5
35 26.3 54.5
36 17.7 31.5
51. (February 2024) RPC lock preemption. When SoftIRQ is processing a large
batch of packets for a single RPC, it was holding the RPC lock continuously.
This prevented homa_copy_to_user from acquiring the lock to extract the
next batch of packets to copy. Since homa_copy_to_user is the bottleneck
for large messages on 100 Gbps networks, this can potentially affect
throughput. Fixed by introducing APP_NEEDS_LOCK for RPCs, so that
SoftIRQ releases the lock temporarily if homa_copy_to_user needs it.
This may have improved throughput for W4 on c6525-100g cluster by 10%,
but it's very difficult to measure accurately.
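The APP_NEEDS_LOCK idea generalizes to any batch processor holding a
contended lock; here is a user-space sketch of the pattern (hypothetical
names, not Homa's kernel code):

```python
import threading

# Sketch of the lock hand-off pattern described above: a batch
# processor checks a "needed" flag after each item and briefly
# releases the lock so a waiting high-priority thread can run.
class YieldingLock:
    def __init__(self):
        self.lock = threading.Lock()
        self.needed = False    # analog of APP_NEEDS_LOCK

    def acquire_urgent(self):
        self.needed = True
        self.lock.acquire()
        self.needed = False

    def release(self):
        self.lock.release()

    def process_batch(self, items, handle):
        self.lock.acquire()
        try:
            for item in items:
                handle(item)
                if self.needed:          # a waiter wants the lock:
                    self.lock.release()  # yield it briefly, then
                    self.lock.acquire()  # continue with the batch
        finally:
            self.lock.release()
```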
50. (February 2024) Don't queue IPIs. Discovered that when homa_gro_receive
invokes netif_receive_skb (intending to push a batch of packets through
to SoftIRQ ASAP), Linux doesn't immediately send an interprocessor
interrupt (IPI). It just queues the pending IPI until all NAPI processing
is finished, then issues all of the queued IPIs. This results in
significant delay for the first batch when NAPI has lots of additional
packets to process. Fixed this by writing homa_send_ipis and invoking it
in homa_gro_receive after calling netif_receive_skb. In 2-node tests
with "cp_node client --workload 500000 --client-max 1 --one-way"
(c6525-100g cluster), this improved latency from RPC start to beginning
copy to user space from 79 us to 46 us, resulting in 10-20% improvement
in throughput. W4 throughput appears to have improved about 10% (but a bit
hard to measure precisely).
49. (November 2023) Implemented "Gen3" load balancing scheme, renamed the
old scheme "Gen2". For details on load balancing, see balance.txt.
Gen3 seems to significantly reduce tail latency for cross-core handoffs;
here are a few samples (us):
--Gen2 P50- ---Gen2 P99--- --Gen3 P50- --Gen3 P99-
GRO -> SoftIRQ 2.7 2.8 3.0 71.1 43.6 71.3 2.8 2.6 2.7 8.7 5.4 8.3
SoftIRQ -> App 0.3 0.3 0.3 20.5 21.7 19.9 0.3 0.3 0.3 7.2 6.8 9.0
However, this doesn't seem to translate into better overall performance:
standard slowdown graphs look about the same with Gen2 and Gen3 (Gen2 has
better P99 latency for W2 and W3; Gen3 is better for W5). This needs more
analysis.
48. (August 2023) Unexpected packet loss on c6525-100g cluster (AMD processors,
100 Gbps links). Under some conditions (such as "cp_node client --one-way
--workload 1000000" with dynamic_windows=1 and unsched_bytes=50000)
messages suffer packet losses starting around offset 700000 and
continuing intermittently until the end of the message. I was unable
to identify a cause, but increasing the size of the Mellanox driver's
page cache (MLX5E_CACHE_SIZE, see item 46 below) seems to make the problem
go away. Slight configuration changes, such as unsched_bytes=200000 also
make the problem go away.
47. (July 2023) Intel vs. AMD processors. 100B roundtrips under best-case
conditions are about 8.7 us slower on AMD processors than Intel:
xl170: 14.5 us
c6525-100g: 23.2 us
Places where c6525-100g is slower (each way):
Packet prep (Homa): 1.2 us
IP stack and driver : 0.9 us
Network (interrupts?): 1.7 us
Thread wakeup: 0.6 us
TCP is also slower on AMD: 38.7 us vs. 23.3 us
Note: results on AMD are particularly sensitive to core placement of
various components.
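As a quick consistency check, the per-direction components above roughly
account for the measured round-trip gap:

```python
# Doubling the one-way components should approximate the observed
# 8.7 us round-trip difference between AMD and Intel.
components_us = [1.2, 0.9, 1.7, 0.6]  # packet prep, IP/driver, network, wakeup
round_trip_gap = 2 * sum(components_us)
print(round_trip_gap)   # roughly 8.8 us
```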
46. (July 2023) MLX buffer issues on c6525-100g cluster. The Mellanox
driver is configured with 256 pages (1 MB) of receive buffer space
for each channel. With a 100 Gbps network, this is about 80 us of
time. However, a single thread can copy data from buffers to user space
at only about 40 Gbps, which means that with longer messages, the
copy gets behind and packet lifetimes increase: with 1 MB messages,
median lifetime is 77 us and P90 lifetime (i.e. the later packets in
messages) are 115 us. With multiple messages from one host to another,
the buffer cache runs dry. When this happens, the Mellanox driver
allocates (and eventually frees) additional buffers, which adds
significant overhead. Bottom line: it's essential to use multiple
channels to keep up with a 100 Gbps network (this provides a larger
total buffer pool, plus more threads to copy to user space).
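The arithmetic behind those figures (a 4 KB page size is assumed):

```python
# 256 pages of driver receive buffers is ~1 MB, which covers only
# ~84 us of arriving data at 100 Gbps; a single copy thread at
# 40 Gbps cannot keep the pool from draining on long messages.
buffer_bytes = 256 * 4096   # 256 pages x 4 KB
link_gbps = 100

buffer_usecs = buffer_bytes * 8 / (link_gbps * 1e9) * 1e6
print(f"{buffer_usecs:.0f} us of buffering")
```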
45. (January 2023) Up until now, output messages had to be completely copied
into sk_buffs before transmission could begin. Modified Homa to pipeline
the copy from user space with packet transmission. This makes a significant
difference in performance. For cp_node client --one-way --workload 500000
with MTU 1500, goodput increased from 11 Gbps (see #43 below) to 17-19
Gbps. For comparison, TCP is about 18.5 Gbps.
44. (January 2023) Until now Homa has held an RPC's lock while transmitting
packets for that RPC. This isn't a problem if ip_queue_xmit returns
quickly. However, in some configurations (such as Intel xl170 NICs) the
driver is very slow, and if the NIC can't do TSO for Homa then the packets
passed to the NIC aren't very large. In these situations, Homa will be
transmitting packets almost 100% of the time for large messages, which
means the RPC lock will be held continuously. This locks out other
activities on the RPC, such as processing grants, which causes additional
performance problems. To fix this, Homa releases the RPC lock while
transmitting data packets (ip_queue_xmit or ip6_xmit). This helps a lot
with bad NICs, and even seems to help a little with good NICs (5-10%
increase in throughput for single-flow benchmarks).
43. (December 2022) 2-host throughput measurements (Gbps). Configuration:
* Single message: cp_node client --one-way --workload 500000
Server: one thread, pinned on a "good" core (avoid GRO/SoftIRQ conflicts)
* Multiple messages: client adds "--ports 2 --client-max 8"
Server doesn't pin, adds "--port-threads 2" (single port)
* All measurements used rtt_bytes=150000
1.01 2.0 Buf + Short Bypass
---------------------------------------------------------------------
Single message (MTU 1500) 9 11
Single message (MTU 3000) 10-11 13
Multiple messages (MTU 1500) 20-21 21-22
Multiple messages (MTU 3000) 22-23 22-23
Conclusions:
* The new buffering mechanism helps single-message throughput about 20%,
but not much impact when there are many concurrent messages.
* Homa 1.01 seems to be able to hide most of the overhead of
page pool thrashing (#35 below).
42. (December 2022) New cluster measurements with "bench n10_mtu3000" (10
nodes, MTU 3000B) on the following configurations:
Jun 22: Previous measurements from June of 2022
1.01: Last commit before implementing new Homa-allocated buffers
2.0 Buf: Homa-allocated buffers
Grants: 2.0 Buf plus GRO_FAST_GRANTS (incoming grants processed
entirely during GRO)
Short Bypass: 2.0 Buf plus GRO_SHORT_BYPASS (all packets < 1400 bytes
processed entirely during GRO)
Short-message latencies in usecs (fastest short messages taken from
homa_w*.data files, W4NL data taken from unloaded_w4.data):
Jun 22 1.01 2.0 Buf Grants Short Bypass
P50 P99 P50 P99 P50 P99 P50 P99 P50 P99
---------- ---------- ---------- ---------- ----------
W2 38.2 100 37.1 84.7 38.3 87.1 38.9 89.7 27.1 70.5
W3 54.8 269 53.0 263 51.8 211 51.0 216 39.2 216
W4 55.8 189 56.0 207 53.0 113 54.0 128 44.6 106
W5 65.3 223 66.2 232 61.9 133 62.2 154 61.5 150
W4NL 16.6 32.4 15.2 30.1 16.2 30.6 16.2 31.5 13.7 27.1
Best of 5 runs from "bench basic_n10_mtu3000":
1.01 2.0 Buf Grants Short Bypass
------------------------------------------------------------------------
Short-message RTT (usec) 16.1 16.1 16.1 13.5
Single-message throughput (Gbps) 10.2 12.2 12.7 12.5
Client RPC throughput (Mops/s) 1.46 1.51 1.52 1.75
Server RPC throughput (Mops/s) 1.52 1.66 1.63 1.73
Client throughput (Gbps) 23.6 23.7 23.6 23.7
Server throughput (Gbps) 23.6 23.7 23.7 23.7
Conclusions:
* New buffering reduces tail latency >40% for W4 and W5 (perhaps by
eliminating all-at-once message copies that occupy cores for long
periods?). Latency improves by 20-30% (both at P50 and P99) for
all message lengths in W4.
* New buffering improves single-message throughput by 20% (25% when
combined with fast grants)
* Short bypass appears to be a win overall: a bit worse P99 for W5,
but better everywhere else and a significant improvement for short
messages at low load
41. (December 2022) More analysis of SMIs. Wrote smi.cc to gather
data on events that cause all cores to stop simultaneously. Found 3 distinct
kinds of gaps on xl170 (Intel) CPUs:
* 2.5 usec gaps every 4 ms
* 17 usec gaps every 10 ms (however, these don't seem to be consistent:
they appear for a while at the start of each experiment, then stop)
* 170 usec gaps every 250 ms
I don't know for sure that these are all caused by SMIs (e.g., could the
gaps every 4 ms be scheduler wakeups?)
40. (December 2022) NAPI can't process incoming jumbo frames at line rate
on a 100 Gbps network (AMD CPUs): it takes about 850 ns to process each
packet (median), but packets are arriving every 700 ns.
Most of the time is spent in __alloc_skb in two places:
kmalloc_reserve for data: 370 ns
prefetchw for last word of data: 140 ns
These times depend on core placements of threads; the above times
are for an "unfortunate" (but typical) placement; with an ideal placement,
the times drop to 100 ns for kmalloc_reserve and essentially 0 for the
prefetch.
Intel CPUs don't seem to have this problem: on the xl170 cluster, NAPI
processes 1500B packets in about 300 ns, and 9000B packets in about
450 ns.
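The arrival-rate arithmetic is easy to check (this sketch ignores
Ethernet framing overhead, so real inter-arrival times are slightly
longer than it computes):

```c
#include <assert.h>

/* Wire time of one frame in nanoseconds: bytes * 8 bits, divided by
 * the link rate in Gbit/s (1 Gbit/s == 1 bit/ns).  Preamble and
 * inter-frame gap are ignored, so this is a lower bound. */
static long frame_time_ns(long frame_bytes, long gbps)
{
    return frame_bytes * 8 / gbps;
}
```

frame_time_ns(9000, 100) is 720 ns, so an 850 ns per-packet NAPI cost
cannot keep up at 100 Gbps; at 25 Gbps a 1500B frame takes 480 ns,
which a 300 ns cost handles easily.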
39. (December 2022) One-way throughput for 1M messages varies from 18-27 Gbps
for Homa on the c6525-100g cluster, whereas TCP throughput is relatively
constant at 24 Gbps. Homa's variance comes from core placement: performance
is best if all of NAPI, GRO, and app are in the same group of 3 cores
(3N..3N+2) or their hypertwins. If they aren't, there are significant
cache miss costs as skbs get recycled from the app core back to the NAPI
core. TCP uses RFS to make sure that NAPI and GRO processing happen on
the same core as the application.
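The placement rule can be written as a simple predicate (a sketch: the
group computation follows the 3N..3N+2 description above; hypertwins
are ignored for simplicity):

```c
#include <assert.h>

/* Cores are grouped in threes: group N holds cores 3N, 3N+1, 3N+2. */
static int core_group(int core)
{
    return core / 3;
}

/* Best-throughput placement: NAPI, GRO, and the application thread
 * all land in a single group (hypertwins not modeled here). */
static int same_group(int napi, int gro, int app)
{
    return core_group(napi) == core_group(gro) &&
           core_group(gro) == core_group(app);
}
```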
38. (December 2022) Restructured the receive buffer mechanism to mitigate
the page_pool_alloc_pages_slow problem (see August 2022 below); packets
can now be copied to user space and their buffers released without waiting
for the entire message to be received. This has a significant impact on
throughput. For "cp_node --one-way --client-max 4 --ports 1 --server-ports 1
--port-threads 8" on the c6525-100g cluster:
* Throughput increased from 21.5 Gbps to 42-45 Gbps
* Page allocations still happen with the new code, but they only consume
0.07 core now, vs. 0.6 core before
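A toy model (not Homa's code) shows why per-packet release matters:
it bounds the number of buffers held by a partially received message
at a constant, instead of growing with message size.

```c
#include <assert.h>

/* Returns the peak number of packet buffers held while receiving a
 * message of npackets packets.  With per-packet release (the new
 * scheme), each buffer is freed as soon as its bytes are copied to
 * user space; with whole-message release (the old scheme), every
 * buffer is held until the message is complete. */
static int peak_buffers_held(int npackets, int per_packet_release)
{
    int held = 0, peak = 0;
    for (int i = 0; i < npackets; i++) {
        held++;                  /* packet arrives, buffer allocated */
        if (held > peak)
            peak = held;
        if (per_packet_release)
            held--;              /* copied to user space, freed now */
    }
    return peak;
}
```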
37. (November 2022) Software GSO is very slow (17 usec on AMD EPYC processors,
breaking 64K into 9K jumbo frames). The main problem appears to be sk_buff
allocation, which takes multiple usecs because the packet buffers are too
large to be cached in the slab allocator.
36. (November 2022) Intel vs. AMD CPUs. Compared
"cp_node client --workload 500000" performance on c6525-100g cluster
(24-core AMD 7402P processors @ 2.8 Ghz, 100 Gbps networking) vs. xl170
cluster (10-core Intel E5-2640v4 @ 2.4 Ghz, 25 Gbps networking), priorities
not enabled on either cluster:
Intel/25Gbps AMD/100Gbps
-----------------------------------------------------------------------
Packet size 1500B 9000B
Overall throughput (each direction) 3.4 Gbps 6.7-7.5 Gbps
Stats from ttrpcs.py:
Xmit/receive tput 11 Gbps 30-50 Gbps
Copy to/from user space 36-54 Gbps 30-110 Gbps
RTT for first grant 28-32 us 56-70 us
Stats from ttpktdelay.py:
SoftIRQ Wakeup (P50/P90) 6/30 us 14/23 us
Minimum network RTT 5.5 us 8 us
RTT with 100B messages 17 us 28 us
35. (August 2022) Found problem with Mellanox driver that explains the
page_pool_alloc_pages_slow delays in the item below.
* The driver keeps a cache of "free" pages, organized as a FIFO
queue with a size limit.
* The page for a packet buffer gets added to the queue when the
packet is received, but with a nonzero reference count.
* The reference count is decremented when the skbuff is released.
* If the page gets to the front of the queue with a nonzero reference
count, it can't be allocated. Instead, a new page is allocated,
which is slower. Furthermore, this will result in excess pages,
eventually causing the queue to overflow; at that point, the excess
pages will be freed back to Linux, which is slow.
* Homa likes to keep large numbers of buffers around for
significant time periods; as a result, it triggers the slow path
frequently, especially for large messages.
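The cache behavior can be captured in a toy model (illustrative only;
struct page_fifo is not the driver's real data structure): a page is
recycled only if it reaches the head of the queue with a zero
reference count; otherwise a fresh page must be allocated.

```c
#include <assert.h>
#include <string.h>

#define FIFO_MAX 64

/* Toy model of the driver's free-page cache. */
struct page_fifo {
    int refcnt[FIFO_MAX];   /* refcount of each cached page, in order */
    int head, count;
    int slow_allocs;        /* allocations that missed the cache */
};

/* A received packet's page enters the FIFO, possibly still referenced
 * (e.g., Homa is holding it for a partially received message). */
static void fifo_put(struct page_fifo *f, int refs)
{
    if (f->count < FIFO_MAX)
        f->refcnt[(f->head + f->count++) % FIFO_MAX] = refs;
    /* On overflow the page would be freed back to Linux (also slow). */
}

/* Allocate a page for a new packet buffer. */
static void fifo_alloc(struct page_fifo *f)
{
    if (f->count > 0 && f->refcnt[f->head] == 0) {
        f->head = (f->head + 1) % FIFO_MAX;   /* fast: recycle page */
        f->count--;
    } else {
        f->slow_allocs++;   /* head still referenced: new page needed */
    }
}
```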
34. (August 2022) 2-node performance is problematic. Ran experiments with
the following client cp_node command:
cp_node client --ports 3 --server-ports 3 --client-max 10 --workload 500000
With max_window = rtt_bytes = 60000, throughput is only about 10 Gbps
on xl170 nodes. ttpktdelay output shows one-way times commonly 30us or
more, which means Homa can't keep enough grants outstanding for full
bandwidth. The overheads are spread across many places:
IP: IP stack, from calling ip_queue_xmit to NIC wakeup
Net: Additional time until homa_gro_receive gets packet
GRO Other: Time until end of GRO batch
GRO Gap: Delay after GRO packet processing until SoftIRQ handoff
Wakeup: Delay until homa_softirq starts
SoftIRQ: Time in homa_softirq until packet is processed
Total: End-to-end time from calling ip_queue_xmit to homa_softirq
handler for packet
Data packet lifetime (us), client -> server:
Pctile IP Net GRO Other GRO Gap Wakeup SoftIRQ Total
0 0.5 4.6 0.0 0.2 1.0 0.1 7.3
10 0.6 10.3 0.0 5.7 2.0 0.2 21.0
30 0.7 12.4 0.4 6.3 2.1 1.9 27.0
50 0.7 15.3 1.0 6.6 2.2 3.3 32.2
70 0.8 18.2 2.0 8.1 2.3 3.8 45.3
90 1.0 33.9 4.9 31.3 2.5 4.8 62.8
99 1.4 56.5 20.7 48.5 17.7 17.5 85.6
100 16.0 74.3 31.0 61.9 28.3 24.4 111.0
Grant lifetime (us), client -> server:
Pctile IP Net GRO Other GRO Gap Wakeup SoftIRQ Total
0 1.7 2.6 0.0 0.3 1.0 0.0 7.6
10 2.4 5.3 0.0 0.5 1.5 0.1 12.1
30 2.5 10.3 0.0 6.1 2.1 0.1 23.3
50 2.6 12.7 0.5 6.5 2.2 0.2 28.1
70 2.8 16.5 1.1 7.2 2.3 0.3 38.1
90 3.4 31.7 3.5 22.6 2.5 3.1 56.2
99 4.6 54.1 17.7 48.4 17.5 4.3 78.5
100 54.9 67.5 28.4 61.9 28.3 21.9 98.3
Additional client-side statistics:
Pre NAPI: usecs from interrupt entry to NAPI handler
GRO Total: usecs from NAPI handler entry to last homa_gro_receive
Batch: number of packets processed in one interrupt
Gap: usecs from last homa_gro_receive call to SoftIRQ handoff
Pctile Pre NAPI GRO Batch Gap
0 0.7 0.4 0 0.2
10 0.7 0.6 0 0.3
30 0.8 0.7 1 0.4
50 0.8 1.5 2 6.6
70 1.0 2.6 3 7.0
90 2.7 4.9 4 7.5
99 6.4 8.0 7 34.2
100 21.7 23.9 12 48.2
In looking over samples of long delays, there are two common issues that
affect multiple metrics:
* page_pool_alloc_pages_slow; affects:
P90/99 Net, P90/99 GRO Gap, P99 SoftIRQ wakeup
* unidentified 14-17 us gaps in homa_xmit_data, homa_gro_receive,
homa_data_pkt, and other places:
affects P99 GRO Other, P99 SoftIRQ, P99 GRO
In addition, I found the following smaller problems:
* unknown gaps before homa_gro_complete of 20-30 us, affects:
P90 SoftIRQ wakeup
Is this related to the "unidentified 14-17 us gaps" above?
* net_rx_action sometimes slow to start; affects:
Wakeup
* large batch size affects:
P90 SoftIRQ
33. (June 2022) Short-message timelines (xl170 clusters, "cp_node client
--workload 100 --port-receivers 0"). All times are ns (data excludes
client-side recv->send turnaround time). Most of the difference
seems to be in kernel call time and NIC->NIC time. Also, note that
the 5.4.80 times have improved considerably from January 2021; there
appears to be at least 1 us variation in RTT from machine to machine.
5.17.7 5.4.80
Server Client Server Client
----------------------------------------------------------
Send:
homa_send/reply 461 588 468 534
IP/Driver 514 548 508 522
Total 975 1136 1475 1056
Receive:
Interrupt->Homa GRO 923 1003 789 815
GRO 200 227 193 201
Wakeup SoftIRQ 601 480 355 347
IP SoftIRQ 361 441 400 361
Homa SoftIRQ 702 469 588 388
Wakeup App 94 106 87 53
homa_recv 447 562 441 588
Total 3328 3288 2853 2753
Recv -> send kcall 682 220
NIC->NIC (round-trip) 6361 5261
RTT Total 15770 13618
32. (January 2021) Best-case short-message timelines (xl170 cluster).
Linux 4.15.18 numbers were measured in September 2020. All times are ns.
5.4.80 4.15.18 Ratio
Server Client
---------------------------------------------------------
Send:
System call 360 360 240 1.50
homa_send/reply 620 870 420 1.77
IP/Driver 495 480 420 1.16
Total 1475 1710 1080 1.47
Receive:
Interrupt->NAPI 560 500 530 1.00
NAPI 560 675 420 1.47
Wakeup SoftIRQ 480 470 360 1.32
IP SoftIRQ 305 335 320 1.00
Homa SoftIRQ 455 190 240 1.34
Wakeup App 80 100 270 0.33
homa_recv 420 450 300 1.45
System Call 360 360 240 1.50
Total 3220 3080 2680 1.18
NIC->NIC (1-way) 2805 2805 2540 1.10
RTT Total 15100 15100 12600 1.20
31. (January 2021) Small-message latencies (usec) for different workloads and
protocols (xl170 cluster, 40 nodes, high load, MTU 3000, Linux 5.4.80):
W2 W3 W4 W5
Homa P50 30.9 41.9 46.8 55.4
P99 57.7 98.5 109.3 139.0
DCTCP P50 106.7 (3.5x) 160.4 (3.8x) 159.1 (3.4x) 151.8 (2.7x)
P99 4812.1 (83x) 6361.7 (65x) 881.1 (8.1x) 991.2 (7.1x)
TCP P50 108.8 (3.5x) 192.7 (4.6x) 353.1 (7.5x) 385.7 (6.9x)
P99 4151.5 (72x) 5092.7 (52x) 2113.1 (19x) 4360.7 (31x)
30. (January 2021) Analyzed effects of various configuration parameters,
running on 40-node xl170 cluster with MTU 3000:
duty_cycle: Reducing to 40% improves small message latency 25% in W4
and 40% in W5
fifo_fraction: No impact on small message P99 except W3 (10% degradation);
previous measurements showed 2x improvement in P99 for
largest messages with modified W4 workload.
gro_policy: NORMAL always better; others 10-25% worse for short P99
max_gro_skbs: Larger is better; reducing to 5 hurts short P99 10-15%.
However, anecdotal experience suggests that very large
values can cause long delays for things like sending
grants, so perhaps 10 is best?
max_gso_size: 10K looks best; not much difference above that, 10-20%
degradation of short P99 at 5K
nic_queue_ns: 5-10x degradation in short P99 when there is no limit;
no clear winner for short P99 in 1-10 us range; however,
shorter is better for P50 (1us slightly better than 2us)
poll_usecs: 0-50us all equal for W4 and W5; 50us better for W2 and W3
(10-20% better short P99 than 0us).
ports: Not much sensitivity: 3 server and 3 client looks good.
client threads: Need 3 ports: W2 can't keep up with 1-2 ports, W3 can't
keep up with 1 port. With 3 ports, 2 receivers has 1.5-2x
lower short P99 for W2 and W3 than 4 receivers, but for
W5 3 receivers is 10% better than 2. Best choice: 3p2r?
rtt_bytes: 60K is best, but not much sensitivity: 40K is < 10% worse
throttle_bytes: Almost no noticeable difference from 100-2000; 100 is
probably best because it includes more traffic in the
computation of NIC queue length, reducing the probability
of queue buildup
29. (October 2020) Polling performance impact. In isolation, polling saves
about 4 us RTT per RPC. In the workloads, it reduces short-message P50
up to 10 us, and P99 up to 25 us (the impact is greater with light-tailed
workloads like W1 and W2). For W2, polling also improved throughput
by about 3%.
28. (October 2020) Polling problem: some workloads (like W5 with 30 MB
messages) need a lot of receiving threads for occasional bursts where
several threads are tied up receiving very large messages. However,
this same number of receivers results in poor performance in W3,
because these additional threads spend a lot of time polling, which
wastes enough CPU time to impact the threads that actually have
work to do. One possibility: limit the number of polling threads per
socket? Right now it appears hard to configure polling for all
workloads.
27. (October 2020) Experimented with new GRO policy HOMA_GRO_NO_TASKS,
which attempts to avoid cores with active threads when picking cores
for SoftIRQ processing. This made almost no visible difference in
performance, and also depends on modifying the Linux kernel to
export a previously unexported function, so I removed it. It's
still available in repo commits, though.
26. (October 2020) Receive queue order. Experimented with ordering the
hsk->ready_requests and @hsk->ready_responses list to return short
messages first. Not clear that this provided any major benefits, and
it reduced throughput in some cases because of overheads in inserting
ready messages into the queues.
25. (October 2020) NIC queue estimation. Experimented with how much to
underestimate network bandwidth. Answer: not much! The existing 5% margin
of safety leaves bandwidth on the table, which impacts tail latency for
large messages. Reduced it to 1%, which helps large messages a lot (up to
2x reduction in latency). Impact on small messages is mixed (more get worse
than better), but the impact isn't large in either case.
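The bandwidth at stake is simple arithmetic; assuming the 25 Gbps
xl170 links, the pacer's assumed rate caps achievable throughput:

```c
#include <assert.h>

/* Rate the pacer assumes when it underestimates the link by
 * margin_pct percent; this is an upper bound on throughput. */
static double paced_gbps(double link_gbps, double margin_pct)
{
    return link_gbps * (1.0 - margin_pct / 100.0);
}
```

With a 5% margin the pacer never drives a 25 Gbps link above 23.75
Gbps; at 1% the cap rises to 24.75 Gbps.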
24. (July 2020) P10 under load. Although Homa can provide 13.5 us RTTs under
best-case conditions, this almost never occurs in practice. Even at low
loads, the "best case" (P10) is more like 25-30 us. I analyzed a bunch
of 25-30 us message traces and found the following sources of additional
delay:
* Network delays (from passing packet to NIC until interrupt received)
account for 5-10 us of the additional delay (most likely packet queuing
in the NIC). There could also be delays in running the interrupt handler.
* Every stage of software runs slower, typically taking about 2x as long
(7.1 us becomes 12-23 us in my samples, with median 14.6 us)
* Occasional other glitches, such as having to wake up a receiving
user thread, or interference due to NAPI/SoftIRQ processing of other
messages.
23. (July 2020) Adaptive polling. A longer polling interval (e.g. 500 usecs)
lowers tail latency for heavy-tailed workloads such as W4, but it hurts
other workloads (P999 tail latency gets much worse for W1 because polling
threads create contention for cores; P99 tail latency for large messages
suffers in W3). I attempted an adaptive approach to polling, where a thread
stops polling if it is no longer first in line, and gets woken up later to
resume polling if it becomes first in line again. The hope was that this
would allow a longer polling interval without negatively impacting other
workloads. It did help, but only a bit, and it added a lot of complexity,
so I removed it.
22. (July 2020) Best-case timetraces for short messages on xl170 CloudLab cluster.
Clients: Cum.
Event Median
--------------------------------------------------------------------------
[C?] homa_ioc_send starting, target ?:?, id ?, pid ? 0
[C?] mlx nic notified 939
[C?] Entering IRQ 9589
[C?] homa_gro_receive got packet from ? id ?, offset ?, priority ? 10491
[C?] enqueue_to_backlog complete, cpu ?, id ?, peer ? 10644
[C?] homa_softirq: first packet from ?:?, id ?, type ? 11300
[C?] incoming data packet, id ?, peer ?, offset ?/? 11416
[C?] homa_rpc_ready handed off id ? 11560
[C?] received message while polling, id ? 11811
[C?] Freeing rpc id ?, socket ?, dead_skbs ? 11864
[C?] homa_ioc_recv finished, id ?, peer ?, length ?, pid ? 11987
Servers: Cum.
Event Median
--------------------------------------------------------------------------
[C?] Entering IRQ 0
[C?] homa_gro_receive got packet from ? id ?, offset ?, priority ? 762
[C?] homa_softirq: first packet from ?:?, id ?, type ? 1566
[C?] incoming data packet, id ?, peer ?, offset ?/? 1767
[C?] homa_rpc_ready handed off id ? 2012
[C?] received message while polling, id ? 2071
[C?] homa_ioc_recv finished, id ?, peer ?, length ?, pid ? 2459
[C?] homa_ioc_reply starting, id ?, port ?, pid ? 2940
[C?] mlx nic notified 3685
21. (July 2020) SMI impact on tail latency. I observed gaps of 200-300 us where
a core appears to be doing nothing. These occur in a variety of places
in the code including in the middle of straight-line code or just
before an interrupt occurs. Furthermore, when these happen, *every* core
in the processor appears to stop at the same time (different cores are in
different places). The gaps do not appear to be related to interrupts (I
instrumented every __irq_entry in the Linux kernel sources), context
switches, or c-states (which I disabled). It appears that the gaps are
caused by System Management Interrupts (SMIs); they appear to account
for about half of the P99 traces I examined in W4.
20. (July 2020) RSS configuration. Noticed that tail latency most often occurs