
Conversation

@jmdeal (Member) commented Oct 2, 2025

Fixes #N/A

Description

Adds an RFC proposing an extension to the NodeOverlay CRD to support DRA. First step to addressing #2523.

How was this change tested?

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jmdeal

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 2, 2025
@jmdeal jmdeal changed the title docs: add DRA NodeOverlay extension RFC [WIP] docs: add DRA NodeOverlay extension RFC Oct 2, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 2, 2025
@jmdeal jmdeal marked this pull request as draft October 2, 2025 21:46
@jmdeal jmdeal changed the title [WIP] docs: add DRA NodeOverlay extension RFC docs: add DRA NodeOverlay extension RFC Oct 2, 2025

@coveralls commented Oct 2, 2025

Pull Request Test Coverage Report for Build 18579020644

Details

  - 0 of 0 changed or added relevant lines in 0 files are covered.
  - 2 unchanged lines in 1 file lost coverage.
  - Overall coverage increased (+0.06%) to 81.671%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/static/provisioning/controller.go | 2 | 58.54% |

Totals Coverage Status
Change from base Build 18201406291: 0.06%
Covered Lines: 11576
Relevant Lines: 14174

💛 - Coveralls

@alimaazamat alimaazamat moved this to In Progress in Karpenter + DRA Oct 2, 2025
There is some nuance to the type of the `resourceSlices` field. There are three options:
- A list of unstructured objects
- The upstream v1 schema
- A custom `ResourceSlice` schema

Contributor:

@towca what were some of the considerations that went into how CAS represents ResourceSlice data in its simulator flow?

Member Author:

IIUC CAS uses the upstream scheduler implementation directly for its simulation, so I assume it uses the upstream schema. If we used the upstream scheduler, that's what we'd do as well.
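
For concreteness, option two (embedding the upstream `v1` schema) could look roughly like this on a `NodeOverlay`. The API group/version, the `resourceSlices` field placement, and the `gpu.example.com` driver are illustrative assumptions, not part of the proposal text:

```yaml
# Sketch only: a NodeOverlay embedding ResourceSlice data using the upstream
# resource.k8s.io/v1 field shapes (driver, pool, devices, attributes, capacity).
# All names and values are placeholders.
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: example-gpu-overlay
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["example-gpu-instance"]
  resourceSlices:
    - driver: gpu.example.com
      pool:
        name: example-pool          # required upstream, but redundant for Karpenter (see later discussion)
        generation: 1
        resourceSliceCount: 1
      devices:
        - name: gpu-0
          attributes:
            model:
              string: "example-model"
          capacity:
            memory:
              value: 80Gi
```

The later threads about redundant fields (`pool`, `nodeName`) are what motivate the trimmed-down `ResourceSliceTemplate` option.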

@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 16, 2025
@jmdeal jmdeal marked this pull request as ready for review October 17, 2025 00:54
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 17, 2025

**Note:** `NodeOverlays` are an advanced Karpenter feature and are not intended to be the primary way for users to use
DRA. The expectation is that cloudproviders will build integrations for common use-cases and that the `NodeOverlay`
extension will serve advanced users with specific requirements. `NodeOverlays` are being used for initial implementation

Contributor:

Is there a reason to build on top of overlays if we plan on changing the implementation later?

Member Author:

I should clarify, just because NodeOverlays are being used for the initial implementation doesn't mean that NodeOverlay based support will go away long-term. Users will still be able to use NodeOverlays for drivers that aren't natively supported by their cloudproviders.

the onboarding process by allowing existing manifests to be copied into the `NodeOverlay` directly. However, this also
comes with some notable drawbacks:

- Karpenter won't support all of the fields that are present in the upstream schema, for example those backed by alpha

Contributor:

Maybe another drawback is that we can't validate unstructured objects, so there's a greater possibility of misconfiguration?

Member Author:

We can still validate unstructured objects; it would just be at runtime rather than at admission time. You're right, though, that this is another drawback - the self-documenting nature of a concrete schema is useful.

Member Author:

This section is talking about using the v1 schema directly though, it wouldn't be unstructured. We would ship with the version of the v1 schema that is available during that Karpenter release cycle.


### Non-goals

- Support for describing non-node-local devices

Reviewer:

Is an EBS volume a node-local device? Those are usually only attached to a single instance, but they have a lifetime that's distinct, and the storage size of an EBS volume (and a filesystem on top of that volume) can change over time. I'm guessing that an S3 Bucket or an EFS Filesystem is a non-node-local device, and I'm curious how we think about EBS volumes here.

Member Author:

Yes, but EBS volumes have their own system for lifecycle management in k8s: persistent volumes and persistent volume claims. I'm not aware of a desire to migrate from that framework to DRA since it has been purpose-built for storage. If this does happen in the future, it would be a non-node-local device. The rationale for not supporting that at this time, though, is that Karpenter is not responsible for provisioning anything other than nodes.

Reviewer:

Could you imagine a use case where customers used DRA to allocate a fixed percentage of an EBS volume to a pod that needs that, like 100GiB of storage? I would expect to be able to do something like that with DRA, but I also understand if the current implementation makes that tricky and it needs to be out-of-scope for now.

Member Author:

If it's a "node-local" EBS volume I would expect users to just use ephemeral storage requests rather than DRA. For non-node-local storage (i.e. EBS volumes that are provisioned at runtime rather than as part of the instance launch) it would be the responsibility of the EBS CSI driver to parse that ResourceClaim and determine the volume it needs to provision to satisfy those requests. It would also presumably create the ResourceSlice, and Karpenter's scheduling simulation would account for it.

Take the exact order of operations I laid out with a grain of salt - I'm speculating - but I don't think it would end up being part of Karpenter's purview. Requirements will probably evolve over time, though; the DRA ecosystem is seeing a lot of development right now and I expect our requirements to evolve with it.

configure these values would be confusing.

For these reasons, this proposal recommends the third option, `ResourceSliceTemplates`. The following principles would
be used to determine if fields should be included in Karpenter's `ResourceSliceTemplate` schema:

Contributor:

So we maintain a copy of the common ResourceSlice fields that we support? What if there is a cloud-provider-specific resource?
https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/

Member Author:

What do you mean by a cloudprovider specific resource? ResourceSlices shouldn't have any cloudprovider specific fields since it's an upstream manifest definition. The fields we support will be a direct reflection of the functionality supported in our scheduler.


We face the same challenge with DRA: CloudProviders will need to be made aware of the `ResourceSlices` that will be
published when a given instance type registers with the cluster as a node. This RFC proposes extending the existing
`NodeOverlay` API to support specifying `ResourceSlices` in addition to conventional extended resources.

Reviewer:

Can we link to definition of ResourceSlices, ideally in an RFC?

Member Author:

The definition in the RFC is quite out of date IIRC; the only source of truth I found for it was the generated swagger file in the upstream k8s repo. This website consumes and displays that doc, but it doesn't have the v1 version yet. I can see if I can get this site updated for 1.34 and link out to it from here: https://www.manifests.io/kubernetes/1.33/io.k8s.api.resource.v1beta2.ResourceSlice

Reviewer:

Got it. If you add that context into your doc, I think that would help people (like me) catch up more quickly.

Member Author:

For sure, if I can't get that site updated quickly I'll see if I can just copy the manifest definition into the doc.

Today `NodeOverlays` support two types of overrides, both of which have their own semantic for conflict resolution.
Conflict resolution for price is handled based on the weight of the `NodeOverlays`. If two `NodeOverlays` apply to the
same instance type and have different weights, the higher weighted overlay is used. If their weights are the same,
they're marked as conflicting. The semantic for capacity is similar, but it isn't treated as an atomic unit. Instead,

Reviewer:

What's the implication of conflicting NodeOverlays? Is that just an error message with random behavior? Or do we fail to launch nodes? Or maybe something else happens?

@ryan-mist (Contributor) commented Oct 17, 2025:

There is a Ready status condition on the NodeClaim that goes false and the overlay is not applied (https://karpenter.sh/docs/concepts/nodeoverlays/#common-status-conditions)

Member Author:

It would be the same behavior we have for conflicting NodeOverlays today, those NodeOverlays would be marked as conflicting and wouldn't be applied.
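
To make the weight semantics above concrete, here is a minimal sketch using the documented `NodeOverlay` fields (`weight`, `requirements`, `price`); the apiVersion and values are placeholders. Because the weights differ, the higher-weighted overlay's price wins; if both had the same weight they would be marked as conflicting and neither would be applied, as described in the thread above.

```yaml
# Two overlays matching the same instance types. Because their weights differ,
# "negotiated-price" (weight 20) takes precedence for the price override.
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: default-price
spec:
  weight: 10
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  price: "5.00"
---
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: negotiated-price
spec:
  weight: 20
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  price: "4.25"
```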

advantage in those cases because Karpenter passes those fields unmutated to other processes, e.g. Kubelet. Since
Karpenter's version is not coupled to those processes' versions, coupling the configuration schema can cause
unnecessary compatibility issues. On the other hand, Karpenter won't be passing the `ResourceSlices` embedded in
`NodeOverlays` to any other process. Additionally, Karpenter is coupled to the version of the schema supported by its

Contributor:

I could see a process passing ResourceSlices to NodeOverlays though

Member Author:

Potentially, but probably not directly. There are a number of fields you wouldn't want to include in the NodeOverlay definition (e.g. concrete attribute values that are specific to the individual device).

- Fields which don't influence scheduling should be excluded
- Fields backing feature gates unsupported by Karpenter should be excluded

Based on these principles, the following fields from the `v1` API in Kubernetes 1.34 would be **excluded**:

Contributor:

Link to the API spec would be helpful here :D

Member Author:

Yeah, alluded to this in a different comment but it doesn't exist as far as I can tell. The best source of truth would be to spin up a 1.34 cluster and describe the resource. There is a site I use for manifest definitions, I'm going to see if I can get it updated to 1.34 so I can link out to it.

The motivating example for this feature was very similar: users need a way to express constraints based on a pod's
template hash, but that hash won't be resolved until the pod is created. Therefore, the user can't express this directly
in the pod template. This is the same issue we face with DRA: we know the topology we want to express the constraint on
(zone) but won't know the domain until the node is provisioned. This RFC proposes adding `matchLabelKeys` to

Contributor:

To be clear, we're talking about the host node here?

Member Author:

Yes, I can update this to clarify.

spec:
  resourceSliceTemplates:
    - spec:
        nodeSelector:

Contributor:

I think we'll want to flatten some of this, unless there are other fields besides nodeSelectorTerms we want in the spec?

Member Author:

I'm only including the fields that are relevant to each section. There are other fields at each level that make sense to include.

unnecessary compatibility issues. On the other hand, Karpenter won't be passing the `ResourceSlices` embedded in
`NodeOverlays` to any other process. Additionally, Karpenter is coupled to the version of the schema supported by its
scheduler. Including a concrete schema in the `NodeOverlay` CRD definition self-documents the supported subset of DRA's
functionality.

Contributor:

I'm not sure this self-documentation is worth the overhead of creating a new CRD. If the goal is to get end users to be able to use karpenter+DRA before we can do the driver+CP specific integrations for the common ones, we might want to pick the approach with the least amount of work to get to that place.

If we find that driver+CP specific integrations are actually a lot more complicated than we think, maybe then we can revisit a ResourceSliceTemplate approach

Member Author:

That's only one of the advantages, the other advantage is that it gives us the flexibility to add or mutate existing fields when necessary. Also, creating our own definition is less work than ingesting the upstream CRD definition in my opinion, not more. We will need to dynamically update some validation rules, so it's not as simple as embedding the upstream struct into NodeOverlay.

- "topology.kubernetes.io/zone"
```

**Note:** This RFC does **not** propose extending the upstream `ResourceSlice` schema with `matchLabelKeys`. Although we

Contributor:

So the user now has to plan for having ResourceSlices on the cluster for kube-scheduler and for configuring overlays to make it work with Karpenter? Can't we directly use the CR the driver creates and read specific fields instead of maintaining one here?

Member Author:

The reason we're adding ResourceSlices to NodeOverlays is because they haven't been created on the cluster yet. We need to be able to anticipate what ResourceSlices will be generated for an instance type before we launch it to understand if that instance type will be able to support the pod's requests.

Member Author:

For nodes that have already been created, Karpenter will use the existing ResourceSlices.
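
Pulling the fragments above together, the zonal example would look roughly like this. The exact nesting of `matchLabelKeys` under the template's `nodeSelector` is my reading of the truncated hunks above, and the apiVersion, driver, and device names are placeholders:

```yaml
# Sketch: the node's resolved topology.kubernetes.io/zone label is copied into
# the generated ResourceSlice's node selector at launch time, instead of the
# user creating one NodeOverlay per zone.
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: zonal-device
spec:
  resourceSliceTemplates:
    - spec:
        driver: device.example.com
        nodeSelector:
          matchLabelKeys:
            - "topology.kubernetes.io/zone"
        devices:
          - name: device-0
```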

published when a given instance type registers with the cluster as a node. This RFC proposes extending the existing
`NodeOverlay` API to support specifying `ResourceSlices` in addition to conventional extended resources.

**Note:** `NodeOverlays` are an advanced Karpenter feature and are not intended to be the primary way for users to use

`ResourceSlice` schema. We have considered changing other fields in Karpenter to accept an unstructured object
(`spec.kubelet` and `spec.userData` (Bottlerocket) on the `EC2NodeClass`) for this reason. This would provide an
advantage in those cases because Karpenter passes those fields unmutated to other processes, e.g. Kubelet. Since
Karpenter's version is not coupled to those processes' versions, coupling the configuration schema can cause

Contributor:

Does this go away if we DO couple the graduation of NodeOverlay's field ResourceSlice with the graduation of the Kubernetes APIs?

Member Author:

My point here was that this argument doesn't apply since it is inherently coupled. We'll only support the subset of fields that are supported by our scheduler. I wanted to make sure I called this out since this is something we've considered for those other fields, but I don't believe it applies here.


- Karpenter won't support all of the fields that are present in the upstream schema, for example those backed by alpha
features.
- Some of the fields are redundant, e.g. `spec.nodeName`, since it will be implicitly associated with the instance

Contributor:

Can we just validate this out?

Member Author:

We can, but does it make sense to include a field in the CRD if we're going to have a validation rule that prevents it from being set? At that point, I'd rather omit the field altogether. If we do retain the field, I think we should continue to allow it to be set but ignore the value.

features.
- Some of the fields are redundant, e.g. `spec.nodeName`, since it will be implicitly associated with the instance
that's launched. Additionally, in this case it's impossible to know the concrete value ahead of launch.
- Some of the required fields don't provide useful information for Karpenter, e.g. `spec.pool`. Requiring users to

Contributor:

Can we validate this out?

Member Author:

We could change the validation rules, though at that point it's no longer a carbon copy of the upstream schema since I consider the validation rules to be part of the schema. If we're changing validation rules, I don't see why we shouldn't just remove the fields explicitly.
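
As a quick summary of how the trimmed `ResourceSliceTemplate` would treat the upstream fields discussed in these threads (the annotations are mine, not the RFC's, and the driver name is a placeholder):

```yaml
# Upstream ResourceSliceSpec fields and their proposed treatment.
spec:
  driver: device.example.com   # kept: influences scheduling
  nodeName: ""                 # omitted: implied by the instance Karpenter launches
  pool: {}                     # omitted: driver bookkeeping, no scheduling signal for Karpenter
  devices: []                  # kept: attributes and capacity drive allocation decisions
```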

When a DRA driver creates `ResourceSlices` for such a device, it already knows the zone the node has been provisioned in
and can inject that zonal requirement into the `ResourceSlice's` selector. On the other hand, the user does not
necessarily know the zone that Karpenter will provision the node into. They could create a `NodeOverlay` per topology
domain, but this becomes untenable with a large number of domains. This problem statement can be generalized: how does

Contributor:

Is there a large number of topology domains?
Do most DRA resources include topology domains?

Member Author:

> Do most DRA resources include topology domains?

I think it's far too early in DRA's lifecycle to make a claim one way or another. It was one of the motivating examples for DRA, and some of the motivation was to use arbitrary topology keys (like rack identifiers).

with regards to typing. The following three options are considered in this proposal:

- A slice of unstructured objects
- A slice of unmutated, upstream `ResourceSlice` objects

Contributor:

I'd like to see some specs/details on how we could achieve a (validated) subset of upstream ResourceSlice objects.

Comment on lines +267 to +269
While this approach does satisfy today's requirements, driven by the `matchAttribute` constraint, it may not offer the
flexibility required long-term. Specifically,
[KEP-5254: Constraints with CEL](https://github.com/kubernetes/enhancements/pull/5391) proposes the addition of a CEL

Contributor:

I am cautious about the ballooning complexity here, and wonder if we might be able to find simplification opportunities by working directly with DRA.
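
For readers unfamiliar with `matchAttribute`, a `ResourceClaim` using it looks roughly like the following; field names are from memory of the `v1` API, and the driver and attribute names are placeholders. The KEP-5254 CEL constraints referenced above would generalize this same `constraints` list.

```yaml
# Two GPU requests that must land on devices reporting the same value for a
# driver-defined attribute (e.g. the same NUMA node). Sketch only.
apiVersion: resource.k8s.io/v1
kind: ResourceClaim
metadata:
  name: two-gpus-same-numa
spec:
  devices:
    requests:
      - name: gpu-a
        exactly:
          deviceClassName: example-gpu
      - name: gpu-b
        exactly:
          deviceClassName: example-gpu
    constraints:
      - requests: ["gpu-a", "gpu-b"]
        matchAttribute: "gpu.example.com/numaNode"
```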

which instance types will satisfy a given set of requests. Cloudprovider implementations can get this information from
any number of sources, for example AWS EC2's `DescribeInstanceTypes` API, and they use their knowledge of well-known
device plugins to determine the resulting nodes' allocatable resources. The drawback of this approach is that each
cloudprovider needs to build integrations for specific drivers, and users are limited by those integrations.

Reviewer:

While I agree with this statement, there's an argument to be made that the rising adoption of GPUs (and their corresponding drivers) need to be treated as first-class citizens rather than advanced features.

@jmdeal (Member Author) commented Oct 23, 2025:

DRA isn't the advanced feature in this case, NodeOverlay is. The long-term vision is for cloudproviders to build native DRA integrations with their drivers for their instance types, in the same way we've built native integrations for existing device-plugin based drivers. With these integrations in place, users will be able to use DRA with Karpenter with zero config. However, there will always be a need for NodeOverlay support since any given cloudprovider implementation can't support every driver that exists, for example a custom driver for a specific company.

The reason we're implementing integration with NodeOverlay first is that it allows us to develop the fundamental building blocks for DRA support without being tied to specific driver integrations.

### Goals

- Enable end users to use DRA drivers with Karpenter without cloudprovider specific integrations
- Enable the description of complex topological relationships between nodes and `ResourceSlices`

Reviewer:

I'm struggling with having the UX centered on ResourceSlices because these are really the responsibility of the device driver as opposed to the cluster admin. The cluster admin is going to be creating DeviceClasses and won't have direct knowledge of the ResourceSlice definition without digging into the driver implementation.

Member Author:

That's why the NodeOverlay integration for DRA is considered an advanced user feature. NodeOverlays will be the primary way to integrate with DRA short-term while driver integrations are being developed, but long-term, cloudprovider specific driver integrations will be the primary way to interact with DRA. These integrations would require zero configuration, in the same way that the existing device-plugin driver integrations do. How these integrations would be developed is out of scope for this RFC since that's a per-cloudprovider implementation detail.

Member Author:

Also, as a cluster administrator you still need to understand the shape of ResourceSlices being created when creating DeviceClasses. In particular, you need to understand the following:

- What attributes devices will have
- What capacity is available for devices
- The driver that is provisioning the devices

The only detail you need to specify on a NodeOverlay-embedded ResourceSlice that you don't need to understand when creating a DeviceClass is the topology requirements. However, in the common case (node-local devices) you won't need to specify anything.

Information that's purely for the driver (e.g. pool) has been excluded from the proposed ResourceSliceTemplate spec.
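
As an illustration of that overlap, a typical `DeviceClass` already names the driver and references attributes that the driver's `ResourceSlices` will publish; for example (placeholder driver and attribute names):

```yaml
# A DeviceClass selecting devices from a specific driver by a published
# attribute. The admin writing this already depends on the driver's
# ResourceSlice shape, even without NodeOverlays. Sketch only.
apiVersion: resource.k8s.io/v1
kind: DeviceClass
metadata:
  name: example-gpu
spec:
  selectors:
    - cel:
        expression: |
          device.driver == "gpu.example.com" &&
          device.attributes["gpu.example.com"].model == "example-model"
```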

@jmdeal (Member Author) commented Oct 23, 2025

Had a discussion with @ellistarn offline, and we're thinking we may table the NodeOverlay extension for the time being. This was a valuable exercise in determining what information we would need to encode to support the full feature-set of DRA, and I think most of the findings will still be valuable for informing the scheduling / cloudprovider integration. Currently we don't have a lot of concrete examples of drivers, making it difficult to determine exactly what flexibility we need in our API surface. Even though the NodeOverlay API is alpha, we want to wait until the DRA ecosystem matures before committing to additional API surface. This isn't to say we're pausing work on support for DRA in Karpenter - I'm going to continue to work on scheduling support and the cloudprovider integration point.

I'm going to continue to think on this and leave this RFC open for the time being. Even if tabled for now, there's a good chance that we revisit this down the road since the motivating examples for including DRA in NodeOverlay remain: cloudproviders won't be able to provide native support for every driver.
