@rahulait
Contributor

Description

When a NvidiaDriver CR's nodeSelector is updated to be more restrictive (e.g., from region=us-east-1 to region=us-east-1, zone=us-east-1), the controller would:

  • Create DaemonSets for node pools matching the new nodeSelector
  • Leave behind DaemonSets for node pools that no longer match

These "orphaned" DaemonSets would keep running driver pods even though no NvidiaDriver CR wanted them.

For example, I had a cluster with 4 nodes (2 Ubuntu 22.04 and 2 Ubuntu 24.04), all labeled region=us-east-1. The NvidiaDriver CR was installed with nodeSelector region=us-east-1, so two DaemonSets were created (one for Ubuntu 22.04 and one for Ubuntu 24.04) and all driver pods came up fine.

k get nvidiadriver -o yaml | grep -A2 nodeSelector
    nodeSelector:
      region: us-east-1
      
#### DaemonSet status

nvidia-gpu-driver-ubuntu22.04-c99c48d99   2         2         2       2            2           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1                  23m
nvidia-gpu-driver-ubuntu24.04-c99c48d99   2         2         2       2            2           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=24.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1                  23m

I then added a zone=us-east-1 label to one of the Ubuntu 24.04 nodes and changed the NvidiaDriver CR's nodeSelector to also include the new label, so only one node in the cluster satisfies it. However, only one DaemonSet got updated. The other DaemonSet (for the Ubuntu 22.04 nodes) is now no longer wanted by any NvidiaDriver CR, even though it still has an owner reference to the one that created it.

k get nvidiadriver -o yaml | grep -A2 nodeSelector
    nodeSelector:
      region: us-east-1
      zone: us-east-1
      
#### DaemonSet status

nvidia-gpu-driver-ubuntu22.04-c99c48d99   2         2         2       2            2           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1                  23m
nvidia-gpu-driver-ubuntu24.04-c99c48d99   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=24.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1,zone=us-east-1   23m

The Root Cause

The reconciliation flow was:

  1. Clean up stale DaemonSets (based only on pod scheduling status)
  2. Get manifest objects for current node pools
  3. Create/update those objects

The cleanup function had no knowledge of which DaemonSets should exist, so it couldn't delete DaemonSets that were no longer desired.
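
To make the limitation concrete, here is a minimal sketch of a status-only cleanup of the kind described above, assuming a controller-runtime client; the package, function, and variable names are illustrative, not the actual controller code. Because the ubuntu22.04 DaemonSet's own node selector still matches two nodes, its DesiredNumberScheduled never drops to zero, so a check like this never deletes it.

```go
package controller // hypothetical package, for illustration only

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupByStatusOnly (hypothetical) deletes DaemonSets that no longer
// schedule any pods. It has no notion of which DaemonSets the current
// nodeSelector should produce, so a DaemonSet that was dropped from the
// desired manifests but still matches nodes is left behind.
func cleanupByStatusOnly(ctx context.Context, c client.Client, namespace string) error {
	var dsList appsv1.DaemonSetList
	if err := c.List(ctx, &dsList, client.InNamespace(namespace)); err != nil {
		return err
	}
	for i := range dsList.Items {
		ds := &dsList.Items[i]
		// Stale only if the DaemonSet currently schedules zero pods.
		if ds.Status.DesiredNumberScheduled == 0 {
			if err := c.Delete(ctx, ds); err != nil {
				return err
			}
		}
	}
	return nil
}
```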

The Fix

Updated the code to (see the sketch after this list):

  1. Get manifest objects first
  2. Perform cleanup
    a. Build a set of desired DaemonSet names from the manifest objects
    b. Compare all owned DaemonSets against the desired set
    c. Delete any DaemonSet not in the desired set
  3. Create/update objects
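
A minimal sketch of that reordered cleanup, again assuming a controller-runtime client; cleanupStaleDaemonSets, desiredObjs, and the other identifiers are illustrative rather than the actual gpu-operator code:

```go
package controller // hypothetical package, for illustration only

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupStaleDaemonSets (hypothetical) deletes every DaemonSet owned by the
// given CR that is not among the DaemonSets rendered for its current
// nodeSelector.
func cleanupStaleDaemonSets(ctx context.Context, c client.Client, cr metav1.Object,
	namespace string, desiredObjs []client.Object) error {
	// a. Build the set of desired DaemonSet names from the manifest objects.
	desired := map[string]struct{}{}
	for _, obj := range desiredObjs {
		if _, ok := obj.(*appsv1.DaemonSet); ok {
			desired[obj.GetName()] = struct{}{}
		}
	}

	// b. Compare all owned DaemonSets against the desired set.
	var dsList appsv1.DaemonSetList
	if err := c.List(ctx, &dsList, client.InNamespace(namespace)); err != nil {
		return err
	}
	for i := range dsList.Items {
		ds := &dsList.Items[i]
		if !metav1.IsControlledBy(ds, cr) {
			continue // not owned by this NvidiaDriver CR
		}
		if _, ok := desired[ds.Name]; ok {
			continue // still produced by the current nodeSelector
		}
		// c. Delete any DaemonSet not in the desired set.
		if err := c.Delete(ctx, ds); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```

Because the comparison is against the freshly rendered manifests rather than DaemonSet status, the ubuntu22.04 DaemonSet from the example above is deleted as soon as the tightened nodeSelector stops producing it.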

After the fix:

k get nvidiadriver -o yaml | grep -A2 nodeSelector
    nodeSelector:
      region: us-east-1
      zone: us-east-1
      
#### DaemonSet status

nvidia-gpu-driver-ubuntu24.04-c99c48d99   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=24.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1,zone=us-east-1   23m

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

@copy-pr-bot

copy-pr-bot bot commented Jan 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
@rahulait force-pushed the cleanup-stale-daemonsets branch from 5ab5d25 to 379f164 on January 29, 2026 19:10
@rahulait
Contributor Author

/ok to test 379f164
