topology-aware: add l3Cache topology/pool nodes. #635
Conversation
Force-pushed e931356 to e92e4d9.
@wongchar Thank you! We are definitely interested in getting this merged. But I think it would be good to add e2e test cases for verification. Would you be able to provide a sample sysfs dump, or the relevant subset of it, for such HW? It would help check how easily we could add such tests. We probably cannot emulate this directly with qemu, but this is not the only such feature, and we already have some environment-based overrides in the sysfs/detection code specifically to be able to fake and test things which can't be emulated properly.

Agreed, attached is the sysfs of an EPYC processor (8534P) with two L3 caches per die (single NUMA, single socket, SMT enabled). Happy to help add e2e tests based on what is most cohesive with the existing codebase :)
@wongchar Here is a minimal e2e test case added to your tree. The test case verifies confining unlimited burstability to an L3 cache cluster, which IIUC is the primary motivation for this PR. I think for getting this merged, we still need to

With the documentation update, we should first wait for #628 to get merged (it's really intrusive), then rebase this PR on main/HEAD, then update the docs.
Force-pushed e92e4d9 to b827509.
@wongchar Oh, and it looks like your sign-off is incorrect. AFAICT the problem is that you used your personal e-mail address as the commit author, but then signed it off with a different e-mail address, apparently your work one. |
Force-pushed a3921bc to ae6eb05.
Added some more test cases. Will address docs next (on my to-do).
Force-pushed 63050e8 to 961e17d.
Added a few more test cases and squashed them into the previous commit. Updated the documentation in a separate commit. Appreciate your review and feedback, thanks!
```go
} else {
	// Single NUMA node per die (or no NUMA subdivision).
	// Check for L3 cache groups within this die.
	if l3CacheIDs := p.sys.Package(socketID).L3CacheIDs(); len(l3CacheIDs) > 1 {
```
Until now, the topology tree has been constructed so that there are no nodes with only one child.
This change would make that no longer hold. For instance, two dies per socket, both dies having their own L3 cache, and no sub-NUMA nodes within dies, results in a tree where each die node has one child: its own L3 cache node. The same happens if there are multiple NUMA nodes that have their own L3 caches, too.
@klihub, what do you think, should we try to stick with a minimum-depth tree where the resources of a child node are always a proper subset of the resources of its parent?
If so, then buildL3CachePool() should be called only if there is more than one L3 pool in a die/node scope, instead of more than one L3 cache in a package.
Yes, I think it's good to adhere to that principle. We should not have non-branching subtrees because they do not bring in any new topological information.
I guess the easiest fix here is to do a final filtering at the end of each iteration in buildNumaNodePool(), and if we end up with a single sole child, just remove it.
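A minimal sketch of what such a final filtering pass could look like. The `node` type and the splicing rule here are hypothetical stand-ins for illustration, not the plugin's actual `Node` interface:

```go
package main

import "fmt"

// node is a toy stand-in for the policy's pool Node type.
type node struct {
	name     string
	children []*node
}

// pruneSoleChildren walks the tree bottom-up; whenever a node ends up with
// a single sole child, that child is spliced out and its own children are
// adopted by the parent, so no non-branching subtrees remain.
func pruneSoleChildren(n *node) {
	for _, c := range n.children {
		pruneSoleChildren(c)
	}
	if len(n.children) == 1 {
		n.children = n.children[0].children
	}
}

func main() {
	// die #0 -> single L3 cache node -> two leaf nodes
	die := &node{name: "die #0", children: []*node{
		{name: "L3 cache #0", children: []*node{
			{name: "cpu0"}, {name: "cpu1"},
		}},
	}}
	pruneSoleChildren(die)
	for _, c := range die.children {
		fmt.Println(c.name) // the sole L3 node is gone; leaves attach to the die
	}
}
```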
+1 to @klihub, nodes that have only one child do not bring much value to the algorithm.
The only exception might be a scenario where we have very distinct groups of cores, for example on hybrid CPUs, groups of E or LPE cores grouped by cache cluster.
L3 cache nodes are not present in the topology tree, for instance when L3 does not split NUMA nodes, as is the case in our topology-aware/n4c16 tests. If a pod is scheduled on such a node, and it is annotated with
I think the same problem is already present, the most typical candidate being burstability limited to a (non-existent) die (topology level/node). And we don't try to do anything very intelligent in that case. We simply always prefer an exactly matched topology level over unmatched ones. But at the moment we do not, for instance, consider for a target topology T a pool with level L1 > T better than another pool with level L2 > T, when L1 < L2 (where for the sake of discussion > means higher in the tree, so lower in terms of L.Value()). IOW, if the target cannot be satisfied, we do not consider a tighter fit better than a looser one, instead considering them equally good/bad and letting other comparison criteria take precedence. And I need to check, but I have some vague recollection that I also found very recently some bug related to limited burstability when there are target nodes in the pool tree but the container does not fit into any of them... I need to check that one.
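For the record, a "tighter fit" preference for an unsatisfiable target could be as simple as picking the largest level value still above the target. This is a hypothetical sketch, not the plugin's actual scoring code, using the convention above where deeper levels have larger values:

```go
package main

import "fmt"

// tightestAbove picks, among candidate level values, the one closest to
// the target from above in the tree: the largest value still smaller than
// the target (smaller values are higher in the tree). It returns false if
// no candidate lies above the target at all.
func tightestAbove(target int, candidates []int) (int, bool) {
	best, found := -1, false
	for _, v := range candidates {
		if v < target && v > best {
			best, found = v, true
		}
	}
	return best, found
}

func main() {
	// Target level 5 (l3cache) is missing; candidate pools exist at
	// levels 3 (die) and 4 (numa). The numa pool is the tighter fit.
	fmt.Println(tightestAbove(5, []int{3, 4}))
}
```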
Pull request overview
This PR adds L3 cache as a new topology level to the NRI topology-aware resource policy. This is motivated by AMD EPYC CPUs where a single die can contain multiple L3 caches (Core Complexes / CCXs), enabling finer-grained resource affinity than the existing die-level support.
Changes:
- Adds `CPUTopologyLevelL3Cache` as a new topology level with value 5 (between NUMA=4 and L2Cache=6), updating the CRD validation and config to allow `"l3cache"` as a valid `unlimitedBurstable` option.
- Extends `pkg/sysfs` to discover L3 cache groupings per package (`L3CacheIDs()`, `L3CacheCPUSet()`), and adds `buildL3CachePool` to the topology pool builder that creates L3 cache pools under NUMA nodes, dies, or sockets depending on the detected hierarchy.
- Adds an `l3cachenode` implementation in `node.go` with `HintScore`, `GetPhysicalNodeIDs`, and `GetMemset` methods, along with a comprehensive e2e test suite using cache overrides to simulate AMD CCX topology.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| `pkg/sysfs/system.go` | Adds `l3CacheCPUs` field and discovery logic to `cpuPackage`, and exposes `L3CacheIDs()`/`L3CacheCPUSet()` on the `CPUPackage` interface |
| `pkg/apis/config/v1alpha1/resmgr/policy/config.go` | Adds `CPUTopologyLevelL3Cache` constant with value 5 and updates level ordering |
| `pkg/apis/config/v1alpha1/resmgr/policy/topologyaware/config.go` | Re-exports the new level constant and adds `l3cache` to the kubebuilder validation enum |
| `cmd/plugins/topology-aware/policy/node.go` | Adds the `L3CacheNode` kind, the `l3cachenode` struct, and its `Node` interface implementations |
| `cmd/plugins/topology-aware/policy/pools.go` | Adds `buildL3CachePool` and wires L3 cache pool creation into the topology builder |
| `cmd/plugins/topology-aware/policy/mocks_test.go` | Adds stub implementations of the two new `CPUPackage` interface methods |
| `config/crd/bases/config.nri_topologyawarepolicies.yaml` | Adds `l3cache` to the CRD `unlimitedBurstable` enum |
| `deployment/helm/topology-aware/crds/config.nri_topologyawarepolicies.yaml` | Same CRD update for the Helm chart |
| `docs/resource-policy/policy/topology-aware.md` | Documents the new `l3cache` topology level and updates the pool hierarchy description |
| `test/e2e/policies.test-suite/topology-aware/n4c128/topology.var.json` | New test topology for the n4c128 test variant |
| `test/e2e/policies.test-suite/topology-aware/n4c128/test19-cacheclusters/` | New e2e test for L3 cache pool placement using `OVERRIDE_SYS_CACHES` |
```go
func (p *policy) NewL3CacheNode(id idset.ID, cpus cpuset.CPUSet, parent Node) *l3cachenode {
	n := &l3cachenode{}
	n.self.node = n
	n.init(p, fmt.Sprintf("L3 cache #%v", id), L3CacheNode, parent)
```
The NewL3CacheNode function uses fmt.Sprintf("L3 cache #%v", id) to generate the pool name, where id is the sysfs cache ID. On real hardware, cache IDs may not be globally unique across different sockets — they can repeat per-socket (e.g., socket 0 and socket 1 may both have an L3 cache with id=0). This means that on a multi-socket system where L3 cache IDs are not globally unique, multiple L3 cache pool nodes would map to the same key in p.nodes, with later entries silently overwriting earlier ones. When UpdateResources is called and tries to look up the pool by the previously assigned node's name, it may select the wrong pool.
Compare with NewDieNode, which uses fmt.Sprintf("die #%v/%v", pkg.id, id) to include the parent socket ID for uniqueness. The L3 cache node name should similarly include information that makes it globally unique, such as a combination of the parent node's name and the cache ID (e.g., fmt.Sprintf("%s/L3 cache #%v", parent.Name(), id)).
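A hedged sketch of the suggested naming scheme. The helper name and parent names below are illustrative, not the actual plugin code:

```go
package main

import "fmt"

// l3CacheNodeName builds a globally unique pool name by prefixing the
// sysfs cache ID with the parent node's name, mirroring how die nodes
// embed the package ID in "die #pkg/id".
func l3CacheNodeName(parentName string, id int) string {
	return fmt.Sprintf("%s/L3 cache #%v", parentName, id)
}

func main() {
	// Even if two sockets both reported an L3 cache with id 0, the
	// parent prefix keeps the pool names distinct.
	fmt.Println(l3CacheNodeName("socket #0", 0))
	fmt.Println(l3CacheNodeName("socket #1", 0))
}
```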
@wongchar I wonder whether it is really true that cache IDs are not unique across sockets. For die IDs I know it is true, typically on our HW, but AFAICT not on yours. Anyway, if it is true for cache IDs, then we should name them with a disambiguating prefix as suggested. If it is not true, then there is no need for it.
Cache IDs are unique, but I agree with this feedback. More detail to make it obvious how to map back to the parent ID is helpful. I meant to add this but it slipped my mind. I will update soon.
```go
for _, l3CacheID := range l3CacheIDs {
	l3CacheCPUs := p.sys.Package(socketID).L3CacheCPUSet(l3CacheID)
	// Only create L3 pool if its CPUs are within this die
	if !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) {
```
The condition !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) computes l3CacheCPUs.Intersection(cpus) twice. This same pattern appears in both buildDiePool and buildNumaNodePool. The result should be stored in a local variable and reused to avoid the duplicate computation.
I leveraged this in `getL3CacheIDsForCPUs()` of `cmd/plugins/topology-aware/policy/pools.go`.
Will simplify `!l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs)` to just `cpus.Intersection(l3CacheCPUs).Equals(l3CacheCPUs)`.
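The simplified condition is really a subset test: a set whose intersection with `cpus` equals itself is contained in `cpus`. A stand-in sketch, using a plain int set in place of `cpuset.CPUSet`:

```go
package main

import "fmt"

// cpuSet is a toy stand-in for cpuset.CPUSet, just for illustration.
type cpuSet map[int]bool

// isSubsetOf reports whether every CPU in s is also in other, which is
// equivalent to other.Intersection(s).Equals(s) without computing the
// intersection at all (an empty set is trivially a subset).
func (s cpuSet) isSubsetOf(other cpuSet) bool {
	for cpu := range s {
		if !other[cpu] {
			return false
		}
	}
	return true
}

func main() {
	die := cpuSet{0: true, 1: true, 2: true, 3: true}
	l3 := cpuSet{0: true, 1: true}
	fmt.Println(l3.isSubsetOf(die)) // true: this L3 group fits in the die
}
```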
```go
for _, l3CacheID := range l3CacheIDs {
	l3CacheCPUs := p.sys.Package(socketID).L3CacheCPUSet(l3CacheID)
	// Only create L3 pool if its CPUs are within this NUMA node
	if !l3CacheCPUs.Intersection(cpus).IsEmpty() && l3CacheCPUs.Intersection(cpus).Equals(l3CacheCPUs) {
```
Same double-computation of l3CacheCPUs.Intersection(cpus) as in buildDiePool. The intersection should be stored in a local variable to avoid the redundant computation.
I see. Just as another proposal in the recent commit: thoughts on a helper function to determine whether there are multiple L3 caches within the current scope (die/NUMA node)? The helper function takes the determined cpus as input to narrow down to the right scope. This would prevent creating nodes with only one child. In terms of setting l3cache as unlimitedBurstable when no l3cache pools are created, I understand it defaults to the deepest pool available. To prevent silent fallback, should this be a configuration error, or require a loud warning at startup that the configuration was not honored?
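One possible shape for such a helper. All names here are hypothetical, and `cpuSet` is a stand-in for `cpuset.CPUSet`:

```go
package main

import "fmt"

// cpuSet is a toy stand-in for cpuset.CPUSet.
type cpuSet map[int]bool

// contained reports whether every CPU in group is also in scope.
func contained(group, scope cpuSet) bool {
	for cpu := range group {
		if !scope[cpu] {
			return false
		}
	}
	return true
}

// countL3GroupsWithin counts the L3 cache groups that fall entirely inside
// the given scope (a die or NUMA node CPU set); the caller would create L3
// pools only when this returns more than one, avoiding single-child nodes.
func countL3GroupsWithin(scope cpuSet, groups []cpuSet) int {
	n := 0
	for _, g := range groups {
		if len(g) > 0 && contained(g, scope) {
			n++
		}
	}
	return n
}

func main() {
	scope := cpuSet{0: true, 1: true, 2: true, 3: true}
	groups := []cpuSet{{0: true, 1: true}, {2: true, 3: true}, {4: true, 5: true}}
	fmt.Println(countL3GroupsWithin(scope, groups)) // two groups fit in scope
}
```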
I think that's reasonable.
A warning sounds like the better choice to me here.
Agreed, I will double-check that it's called out in the logs at a minimum.
Never mind, I see you already handle the warning. Added one more commit to address the items highlighted by Copilot. Let me know if I need to squash any commits for simplicity. Thanks!
Yes, I think it would be good to squash the last two fix-commits into the original one. Otherwise, this now LGTM.
- Extend sysfs package to discover L3 cache topology from the system
- Add L3 cache pool node type that groups CPUs sharing the same L3 cache
- Only create L3 cache pools when multiple L3 caches exist within the
current scope to avoid single-child parent nodes
Signed-off-by: Charles Wong <charles.wong2@amd.com>
Add a topology-aware test case to verify that burstability can be limited to an L3 cache cluster, in a new n4c128 topology setup with 8 physical cores in an L3 cluster.

Signed-off-by: Krisztian Litkey <krisztian.litkey@intel.com>
Signed-off-by: Charles Wong <charles.wong2@amd.com>
Signed-off-by: Charles Wong <charles.wong2@amd.com>
Force-pushed b44cea1 to a25233f.
done!
Hello!
I would like to propose L3 cache restriction/affinity in the NRI Topology Aware resource policy.
The existing die node level supports resource optimization for AMD EPYC SKUs where the die contains a single L3 cache.
However, there are some SKUs of AMD EPYC that will contain up to two L3 caches on a single die.
AMD refers to these L3 cache core groupings as a Core Complex (CCX) with up to two CCXs on a Core Complex Die (CCD).
The other motivation is to extend the unlimitedBurstable feature to the L3 Cache level as well.
Please let me know your thoughts!