
Conversation

@jmdeal (Member) commented Oct 2, 2025

Fixes #N/A

Description

Adds an RFC proposing an extension to the NodeOverlay CRD to support DRA. First step to addressing #2523.

How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmdeal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 2, 2025
@jmdeal jmdeal changed the title docs: add DRA NodeOverlay extension RFC [WIP] docs: add DRA NodeOverlay extension RFC Oct 2, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2025
@jmdeal jmdeal marked this pull request as draft October 2, 2025 21:46
@jmdeal jmdeal changed the title [WIP] docs: add DRA NodeOverlay extension RFC docs: add DRA NodeOverlay extension RFC Oct 2, 2025

@coveralls commented Oct 2, 2025

Pull Request Test Coverage Report for Build 18579020644

Details

  - 0 of 0 changed or added relevant lines in 0 files are covered.
  - 2 unchanged lines in 1 file lost coverage.
  - Overall coverage increased (+0.06%) to 81.671%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/static/provisioning/controller.go | 2 | 58.54% |

Totals Coverage Status
Change from base Build 18201406291: 0.06%
Covered Lines: 11576
Relevant Lines: 14174

💛 - Coveralls

@alimaazamat alimaazamat moved this to In Progress in Karpenter + DRA Oct 2, 2025
There is some nuance to the type of the `resourceSlices` field. There are three options:
- A list of unstructured objects
- The upstream v1 schema
- A custom `ResourceSlice` schema

Contributor:

@towca what were some of the considerations that went into how CAS represents ResourceSlice data in its simulator flow?

Member Author:

IIUC CAS uses the upstream scheduler implementation directly for its simulation, so I assume it uses the upstream schema. If we used the upstream scheduler, that's what we'd do as well.
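
For concreteness, option two (embedding the upstream `v1` schema) could look roughly like this on a `NodeOverlay`. The API group/version, the `resourceSlices` field placement, and the `gpu.example.com` driver are illustrative assumptions, not part of the proposal text:

```yaml
# Sketch only: a NodeOverlay embedding ResourceSlice data using the upstream
# resource.k8s.io/v1 field shapes (driver, pool, devices, attributes, capacity).
# All names and values are placeholders.
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: example-gpu-overlay
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["example-gpu-instance"]
  resourceSlices:
    - driver: gpu.example.com
      pool:
        name: example-pool          # required upstream, but redundant for Karpenter (see later discussion)
        generation: 1
        resourceSliceCount: 1
      devices:
        - name: gpu-0
          attributes:
            model:
              string: "example-model"
          capacity:
            memory:
              value: 80Gi
```

The later threads about redundant fields (`pool`, `nodeName`) are what motivate the trimmed-down `ResourceSliceTemplate` option.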

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 16, 2025
@jmdeal jmdeal marked this pull request as ready for review October 17, 2025 00:54
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 17, 2025

**Note:** `NodeOverlays` are an advanced Karpenter feature and are not intended to be the primary way for users to use
DRA. The expectation is that cloudproviders will build integrations for common use-cases and that the `NodeOverlay`
extension will serve advanced users with specific requirements. `NodeOverlays` are being used for initial implementation

Contributor:

Is there a reason to build on top of overlays if we plan on changing the implementation later?

Member Author:

I should clarify, just because NodeOverlays are being used for the initial implementation doesn't mean that NodeOverlay based support will go away long-term. Users will still be able to use NodeOverlays for drivers that aren't natively supported by their cloudproviders.

the onboarding process by allowing existing manifests to be copied into the `NodeOverlay` directly. However, this also
comes with some notable drawbacks:

- Karpenter won't support all of the fields that are present in the upstream schema, for example those backed by alpha

Contributor:

Maybe another drawback is that we can't validate unstructured objects, so there's a greater possibility of misconfiguration?

Member Author:

We can still validate unstructured objects; it would just be at runtime rather than at admission time. You're right, though, that this is another drawback - the self-documenting nature of a concrete schema is useful.

Member Author:

This section is talking about using the v1 schema directly though, it wouldn't be unstructured. We would ship with the version of the v1 schema that is available during that Karpenter release cycle.


### Non-goals

- Support for describing non-node-local devices

Reviewer:

Is an EBS volume a node-local device? Those are usually only attached to a single instance, but they have a lifetime that's distinct, and the storage size of an EBS volume (and a filesystem on top of that volume) can change over time. I'm guessing that an S3 Bucket or an EFS Filesystem is a non-node-local device, and I'm curious how we think about EBS volumes here.

Member Author:

Yes, but EBS volumes have their own system for lifecycle management in k8s: persistent volumes and persistent volume claims. I'm not aware of a desire to migrate from that framework to DRA since it has been purpose-built for storage. If this does happen in the future, it would be a non-node-local device. The rationale for not supporting that at this time, though, is that Karpenter is not responsible for provisioning anything other than nodes.

Reviewer:

Could you imagine a use case where customers used DRA to allocate a fixed percentage of an EBS volume to a pod that needs that, like 100GiB of storage? I would expect to be able to do something like that with DRA, but I also understand if the current implementation makes that tricky and it needs to be out-of-scope for now.

Member Author:

If it's a "node-local" EBS volume I would expect users to just use ephemeral storage requests rather than DRA. For non-node-local storage (i.e. EBS volumes that are provisioned at runtime rather than as part of the instance launch) it would be the responsibility of the EBS CSI driver to parse that ResourceClaim and determine the volume it needs to provision to satisfy those requests. It would also presumably create the ResourceSlice, and Karpenter's scheduling simulation would account for it.

Take the exact order of operations I laid out with a grain of salt - I'm speculating - but I don't think it would end up being part of Karpenter's purview. Requirements will probably evolve over time, though; the DRA ecosystem is seeing a lot of development right now and I expect our requirements to evolve with it.

configure these values would be confusing.

For these reasons, this proposal recommends the third option, `ResourceSliceTemplates`. The following principles would
be used to determine if fields should be included in Karpenter's `ResourceSliceTemplate` schema:

Contributor:

So we maintain a copy of the common ResourceSlice fields that we support? What if there is a cloud-provider-specific resource?
https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/

Member Author:

What do you mean by a cloudprovider specific resource? ResourceSlices shouldn't have any cloudprovider specific fields since it's an upstream manifest definition. The fields we support will be a direct reflection of the functionality supported in our scheduler.


We face the same challenge with DRA: CloudProviders will need to be made aware of the `ResourceSlices` that will be
published when a given instance type registers with the cluster as a node. This RFC proposes extending the existing
`NodeOverlay` API to support specifying `ResourceSlices` in addition to conventional extended resources.

Reviewer:

Can we link to definition of ResourceSlices, ideally in an RFC?

Member Author:

The definition in the RFC is quite out of date IIRC; the only source of truth I found for it was the generated swagger file in the upstream k8s repo. This website consumes and displays that doc, but it doesn't have the v1 version yet. I can see if I can get this site updated for 1.34 and link out to it from here: https://www.manifests.io/kubernetes/1.33/io.k8s.api.resource.v1beta2.ResourceSlice

Reviewer:

Got it. If you add that context into your doc, I think that would help people (like me) catch up more quickly.

Member Author:

For sure, if I can't get that site updated quickly I'll see if I can just copy the manifest definition into the doc.

Today `NodeOverlays` support two types of overrides, both of which have their own semantic for conflict resolution.
Conflict resolution for price is handled based on the weight of the `NodeOverlays`. If two `NodeOverlays` apply to the
same instance type and have different weights, the higher weighted overlay is used. If their weights are the same,
they're marked as conflicting. The semantic for capacity is similar, but it isn't treated as an atomic unit. Instead,

Reviewer:

What's the implication of conflicting NodeOverlays? Is that just an error message with random behavior? Or do we fail to launch nodes? Or maybe something else happens?

@ryan-mist (Contributor) commented Oct 17, 2025:

There is a Ready status condition on the NodeClaim that goes false and the overlay is not applied (https://karpenter.sh/docs/concepts/nodeoverlays/#common-status-conditions)

Member Author:

It would be the same behavior we have for conflicting NodeOverlays today, those NodeOverlays would be marked as conflicting and wouldn't be applied.
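
To make the weight semantics above concrete, here is a minimal sketch using the documented `NodeOverlay` fields (`weight`, `requirements`, `price`); the apiVersion and values are placeholders. Because the weights differ, the higher-weighted overlay's price wins; if both had the same weight they would be marked as conflicting and neither would be applied, as described in the thread above.

```yaml
# Two overlays matching the same instance types. Because their weights differ,
# "negotiated-price" (weight 20) takes precedence for the price override.
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: default-price
spec:
  weight: 10
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  price: "5.00"
---
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: negotiated-price
spec:
  weight: 20
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  price: "4.25"
```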

advantage in those cases because Karpenter passes those fields unmutated to other processes, e.g. Kubelet. Since
Karpenter's version is not coupled to those processes' versions, coupling the configuration schema can cause
unnecessary compatibility issues. On the other hand, Karpenter won't be passing the `ResourceSlices` embedded in
`NodeOverlays` to any other process. Additionally, Karpenter is coupled to the version of the schema supported by its

Contributor:

I could see a process passing ResourceSlices to NodeOverlays though

Member Author:

Potentially, but probably not directly. There are a number of fields you wouldn't want to include in the NodeOverlay definition (e.g. concrete attribute values that are specific to the individual device).

- Fields which don't influence scheduling should be excluded
- Fields backing feature gates unsupported by Karpenter should be excluded

Based on these principles, the following fields from the `v1` API in Kubernetes 1.34 would be **excluded**:

Contributor:

Link to the API spec would be helpful here :D

Member Author:

Yeah, alluded to this in a different comment but it doesn't exist as far as I can tell. The best source of truth would be to spin up a 1.34 cluster and describe the resource. There is a site I use for manifest definitions, I'm going to see if I can get it updated to 1.34 so I can link out to it.

The motivating example for this feature was very similar: users need a way to express constraints based on a pod's
template hash, but that hash won't be resolved until the pod is created. Therefore, the user can't express this directly
in the pod template. This is the same issue we face with DRA: we know the topology we want to express the constraint on
(zone) but won't know the domain until the node is provisioned. This RFC proposes adding `matchLabelKeys` to

Contributor:

To be clear, we're talking about the host node here?

Member Author:

Yes, I can update this to clarify.

spec:
  resourceSliceTemplates:
    - spec:
        nodeSelector:

Contributor:

I think we'll want to flatten some of this, unless there are other fields besides nodeSelectorTerms we want in the spec?

Member Author:

I'm only including the fields that are relevant to each section. There are other fields at each level that make sense to include.

unnecessary compatibility issues. On the other hand, Karpenter won't be passing the `ResourceSlices` embedded in
`NodeOverlays` to any other process. Additionally, Karpenter is coupled to the version of the schema supported by its
scheduler. Including a concrete schema in the `NodeOverlay` CRD definition self-documents the supported subset of DRA's
functionality.

Contributor:

I'm not sure this self-documentation is worth the overhead of creating a new CRD. If the goal is to get end users to be able to use karpenter+DRA before we can do the driver+CP specific integrations for the common ones, we might want to pick the approach with the least amount of work to get to that place.

If we find that driver+CP specific integrations are actually a lot more complicated than we think, maybe then we can revisit a ResourceSliceTemplate approach

Member Author:

That's only one of the advantages, the other advantage is that it gives us the flexibility to add or mutate existing fields when necessary. Also, creating our own definition is less work than ingesting the upstream CRD definition in my opinion, not more. We will need to dynamically update some validation rules, so it's not as simple as embedding the upstream struct into NodeOverlay.

- "topology.kubernetes.io/zone"
```

**Note:** This RFC does **not** propose extending the upstream `ResourceSlice` schema with `matchLabelKeys`. Although we

Contributor:

So the user now has to plan for having ResourceSlices on the cluster for kube-scheduler and for configuring overlays to make it work with Karpenter? Can't we directly use the CR the driver creates and read specific fields instead of maintaining one here?

Member Author:

The reason we're adding ResourceSlices to NodeOverlays is because they haven't been created on the cluster yet. We need to be able to anticipate what ResourceSlices will be generated for an instance type before we launch it to understand if that instance type will be able to support the pod's requests.

Member Author:

For nodes that have already been created, Karpenter will use the existing ResourceSlices.
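
Pulling the fragments above together, the zonal example would look roughly like this. The exact nesting of `matchLabelKeys` under the template's `nodeSelector` is my reading of the truncated hunks above, and the apiVersion, driver, and device names are placeholders:

```yaml
# Sketch: the node's resolved topology.kubernetes.io/zone label is copied into
# the generated ResourceSlice's node selector at launch time, instead of the
# user creating one NodeOverlay per zone.
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: zonal-device
spec:
  resourceSliceTemplates:
    - spec:
        driver: device.example.com
        nodeSelector:
          matchLabelKeys:
            - "topology.kubernetes.io/zone"
        devices:
          - name: device-0
```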

published when a given instance type registers with the cluster as a node. This RFC proposes extending the existing
`NodeOverlay` API to support specifying `ResourceSlices` in addition to conventional extended resources.

**Note:** `NodeOverlays` are an advanced Karpenter feature and are not intended to be the primary way for users to use

`ResourceSlice` schema. We have considered changing other fields in Karpenter to accept an unstructured object
(`spec.kubelet` and `spec.userData` (Bottlerocket) on the `EC2NodeClass`) for this reason. This would provide an
advantage in those cases because Karpenter passes those fields unmutated to other processes, e.g. Kubelet. Since
Karpenter's version is not coupled to those processes' versions, coupling the configuration schema can cause

Contributor:

Does this go away if we DO couple the graduation of NodeOverlay's field ResourceSlice with the graduation of the Kubernetes APIs?

Member Author:

My point here was that this argument doesn't apply since it is inherently coupled. We'll only support the subset of fields that are supported by our scheduler. I wanted to make sure I called this out since this is something we've considered for those other fields, but I don't believe it applies here.


- Karpenter won't support all of the fields that are present in the upstream schema, for example those backed by alpha
features.
- Some of the fields are redundant, e.g. `spec.nodeName`, since it will be implicitly associated with the instance

Contributor:

Can we just validate this out?

Member Author:

We can, but does it make sense to include a field in the CRD if we're going to have a validation rule that prevents it from being set? At that point, I'd rather omit the field altogether. If we do retain the field, I think we should continue to allow it to be set but ignore the value.

features.
- Some of the fields are redundant, e.g. `spec.nodeName`, since it will be implicitly associated with the instance
that's launched. Additionally, in this case it's impossible to know the concrete value ahead of launch.
- Some of the required fields don't provide useful information for Karpenter, e.g. `spec.pool`. Requiring users to

Contributor:

Can we validate this out?

Member Author:

We could change the validation rules, though at that point it's no longer a carbon copy of the upstream schema since I consider the validation rules to be part of the schema. If we're changing validation rules, I don't see why we shouldn't just remove the fields explicitly.
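
As a quick summary of how the trimmed `ResourceSliceTemplate` would treat the upstream fields discussed in these threads (the annotations are mine, not the RFC's, and the driver name is a placeholder):

```yaml
# Upstream ResourceSliceSpec fields and their proposed treatment.
spec:
  driver: device.example.com   # kept: influences scheduling
  nodeName: ""                 # omitted: implied by the instance Karpenter launches
  pool: {}                     # omitted: driver bookkeeping, no scheduling signal for Karpenter
  devices: []                  # kept: attributes and capacity drive allocation decisions
```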

When a DRA driver creates `ResourceSlices` for such a device, it already knows the zone the node has been provisioned in
and can inject that zonal requirement into the `ResourceSlice's` selector. On the other hand, the user does not
necessarily know the zone that Karpenter will provision the node into. They could create a `NodeOverlay` per topology
domain, but this becomes untenable with a large number of domains. This problem statement can be generalized: how does

Contributor:

Is there a large number of topology domains?
Do most DRA resources include topology domains?

Member Author:

> Do most DRA resources include topology domains?

I think it's far too early in DRA's lifecycle to make a claim one way or another. It was one of the motivating examples for DRA, and some of the motivation was to use arbitrary topology keys (like rack identifiers).

with regards to typing. The following three options are considered in this proposal:

- A slice of unstructured objects
- A slice of unmutated, upstream `ResourceSlice` objects

Contributor:

I'd like to see some specs/details on how we could achieve a (validated) subset of upstream ResourceSlice objects.

Comment on lines +267 to +269
While this approach does satisfy today's requirements, driven by the `matchAttribute` constraint, it may not offer the
flexibility required long-term. Specifically,
[KEP-5254: Constraints with CEL](https://github.com/kubernetes/enhancements/pull/5391) proposes the addition of a CEL

Contributor:

I am cautious about the ballooning complexity here, and wonder if we might be able to find simplification opportunities by working directly with DRA.
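
For readers unfamiliar with `matchAttribute`, a `ResourceClaim` using it looks roughly like the following; field names are from memory of the `v1` API, and the driver and attribute names are placeholders. The KEP-5254 CEL constraints referenced above would generalize this same `constraints` list.

```yaml
# Two GPU requests that must land on devices reporting the same value for a
# driver-defined attribute (e.g. the same NUMA node). Sketch only.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: two-gpus-same-numa
spec:
  devices:
    requests:
      - name: gpu-a
        exactly:
          deviceClassName: example-gpu
      - name: gpu-b
        exactly:
          deviceClassName: example-gpu
    constraints:
      - requests: ["gpu-a", "gpu-b"]
        matchAttribute: "gpu.example.com/numaNode"
```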

which instance types will satisfy a given set of requests. Cloudprovider implementations can get this information from
any number of sources, for example AWS EC2's `DescribeInstanceTypes` API, and they use their knowledge of well-known
device plugins to determine the resulting nodes' allocatable resources. The drawback of this approach is that each
cloudprovider needs to build integrations for specific drivers, and users are limited by those integrations.

Reviewer:

While I agree with this statement, there's an argument to be made that the rising adoption of GPUs (and their corresponding drivers) need to be treated as first-class citizens rather than advanced features.

@jmdeal (Member Author) commented Oct 23, 2025:

DRA isn't the advanced feature in this case, NodeOverlay is. The long-term vision is for cloudproviders to build native DRA integrations with their drivers for their instance types, in the same way we've built native integrations for existing device-plugin based drivers. With these integrations in place, users will be able to use DRA with Karpenter with zero config. However, there will always be a need for NodeOverlay support since any given cloudprovider implementation can't support every driver that exists, for example a custom driver for a specific company.

The reason we're implementing integration with NodeOverlay first is that it allows us to develop the fundamental building blocks for DRA support without being tied to specific driver integrations.

### Goals

- Enable end users to use DRA drivers with Karpenter without cloudprovider specific integrations
- Enable the description of complex topological relationships between nodes and `ResourceSlices`

Reviewer:

I'm struggling with having the UX centered on ResourceSlices because these are really the responsibility of the device driver as opposed to the cluster admin. The cluster admin is going to be creating DeviceClasses and won't have direct knowledge of the ResourceSlice definition without digging into the driver implementation.

Member Author:

That's why the NodeOverlay integration for DRA is considered an advanced user feature. NodeOverlays will be the primary way to integrate with DRA short-term while driver integrations are being developed, but long-term, cloudprovider specific driver integrations will be the primary way to interact with DRA. These integrations would require zero configuration, in the same way that the existing device-plugin driver integrations do. How these integrations would be developed is out of scope for this RFC since that's a per-cloudprovider implementation detail.

Member Author:

Also, as a cluster administrator you still need to understand the shape of ResourceSlices being created when creating DeviceClasses. In particular, you need to understand the following:

- What attributes devices will have
- What capacity is available for devices
- The driver that is provisioning the devices

The only detail you need to specify on a NodeOverlay-embedded ResourceSlice that you don't need to understand when creating a DeviceClass is the topology requirements. However, in the common case (node-local devices) you won't need to specify anything.

Information that's purely for the driver (e.g. pool) has been excluded from the proposed ResourceSliceTemplate spec.
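
As an illustration of that overlap, a typical `DeviceClass` already names the driver and references attributes that the driver's `ResourceSlices` will publish; for example (placeholder driver and attribute names):

```yaml
# A DeviceClass selecting devices from a specific driver by a published
# attribute. The admin writing this already depends on the driver's
# ResourceSlice shape, even without NodeOverlays. Sketch only.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-gpu
spec:
  selectors:
    - cel:
        expression: |
          device.driver == "gpu.example.com" &&
          device.attributes["gpu.example.com"].model == "example-model"
```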

@jmdeal (Member Author) commented Oct 23, 2025

Had a discussion with @ellistarn offline, and we're thinking we may table the NodeOverlay extension for the time being. This was a valuable exercise in determining what information we would need to encode to support the full feature-set of DRA, and I think most of the findings will still be valuable for informing the scheduling / cloudprovider integration. Currently we don't have a lot of concrete examples of drivers, making it difficult to determine exactly what flexibility we need in our API surface. Even though the NodeOverlay API is alpha, we want to wait until the DRA ecosystem matures before committing to additional API surface. This isn't to say we're pausing work on support for DRA in Karpenter - I'm going to continue to work on scheduling support and the cloudprovider integration point.

I'm going to continue to think on this and leave this RFC open for the time being. Even if tabled for now, there's a good chance that we revisit this down the road since the motivating examples for including DRA in NodeOverlay remain: cloudproviders won't be able to provide native support for every driver.
