@rahulait
Contributor

Description

When a NvidiaDriver CR's nodeSelector is updated to be more restrictive (e.g., from region=us-east-1 to region=us-east-1, zone=us-east-1), the controller would:

  • Create DaemonSets for node pools matching the new nodeSelector
  • Leave behind DaemonSets for node pools that no longer match

These "orphaned" DaemonSets would keep running driver pods even though no NvidiaDriver CR wanted them.

For example, I had a cluster with 4 nodes (2 Ubuntu 22.04 and 2 Ubuntu 24.04), all labeled region=us-east-1. The NvidiaDriver CR was installed with nodeSelector region=us-east-1, so two DaemonSets were created (one for Ubuntu 22.04 and one for Ubuntu 24.04) and all driver pods came up fine.

k get nvidiadriver -o yaml | grep -A2 nodeSelector
    nodeSelector:
      region: us-east-1
      
#### DaemonSet status

nvidia-gpu-driver-ubuntu22.04-c99c48d99   2         2         2       2            2           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1                  23m
nvidia-gpu-driver-ubuntu24.04-c99c48d99   2         2         2       2            2           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=24.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1                  23m

I then added a zone=us-east-1 label to one of the Ubuntu 24.04 nodes and changed the NvidiaDriver CR's nodeSelector to also include the new label, so only one node in the cluster satisfies it. However, only one DaemonSet got updated. The other DaemonSet (for the Ubuntu 22.04 nodes) is now no longer wanted by any NvidiaDriver CR, even though it still has an owner reference to the one that created it.

k get nvidiadriver -o yaml | grep -A2 nodeSelector
    nodeSelector:
      region: us-east-1
      zone: us-east-1
      
#### DaemonSet status

nvidia-gpu-driver-ubuntu22.04-c99c48d99   2         2         2       2            2           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1                  23m
nvidia-gpu-driver-ubuntu24.04-c99c48d99   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=24.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1,zone=us-east-1   23m

The Root Cause

The reconciliation flow was:

  1. Clean up stale DaemonSets (based only on pod scheduling status)
  2. Get manifest objects for current node pools
  3. Create/update those objects

The cleanup function had no knowledge of which DaemonSets should exist, so it couldn't delete DaemonSets that were no longer desired.
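
To make the limitation concrete, here is a minimal sketch of a status-only cleanup of the kind described above, assuming a controller-runtime client; the package, function, and variable names are illustrative, not the actual controller code. Because the ubuntu22.04 DaemonSet's own node selector still matches two nodes, its DesiredNumberScheduled never drops to zero, so a check like this never deletes it.

```go
package controller // hypothetical package, for illustration only

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupByStatusOnly (hypothetical) deletes DaemonSets that no longer
// schedule any pods. It has no notion of which DaemonSets the current
// nodeSelector should produce, so a DaemonSet that was dropped from the
// desired manifests but still matches nodes is left behind.
func cleanupByStatusOnly(ctx context.Context, c client.Client, namespace string) error {
	var dsList appsv1.DaemonSetList
	if err := c.List(ctx, &dsList, client.InNamespace(namespace)); err != nil {
		return err
	}
	for i := range dsList.Items {
		ds := &dsList.Items[i]
		// Stale only if the DaemonSet currently schedules zero pods.
		if ds.Status.DesiredNumberScheduled == 0 {
			if err := c.Delete(ctx, ds); err != nil {
				return err
			}
		}
	}
	return nil
}
```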

The Fix

Updated the code to (see the sketch after this list):

  1. Get manifest objects first
  2. Perform cleanup
    a. Build a set of desired DaemonSet names from the manifest objects
    b. Compare all owned DaemonSets against the desired set
    c. Delete any DaemonSet not in the desired set
  3. Create/update objects
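
A minimal sketch of that reordered cleanup, again assuming a controller-runtime client; cleanupStaleDaemonSets, desiredObjs, and the other identifiers are illustrative rather than the actual gpu-operator code:

```go
package controller // hypothetical package, for illustration only

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanupStaleDaemonSets (hypothetical) deletes every DaemonSet owned by the
// given CR that is not among the DaemonSets rendered for its current
// nodeSelector.
func cleanupStaleDaemonSets(ctx context.Context, c client.Client, cr metav1.Object,
	namespace string, desiredObjs []client.Object) error {
	// a. Build the set of desired DaemonSet names from the manifest objects.
	desired := map[string]struct{}{}
	for _, obj := range desiredObjs {
		if _, ok := obj.(*appsv1.DaemonSet); ok {
			desired[obj.GetName()] = struct{}{}
		}
	}

	// b. Compare all owned DaemonSets against the desired set.
	var dsList appsv1.DaemonSetList
	if err := c.List(ctx, &dsList, client.InNamespace(namespace)); err != nil {
		return err
	}
	for i := range dsList.Items {
		ds := &dsList.Items[i]
		if !metav1.IsControlledBy(ds, cr) {
			continue // not owned by this NvidiaDriver CR
		}
		if _, ok := desired[ds.Name]; ok {
			continue // still produced by the current nodeSelector
		}
		// c. Delete any DaemonSet not in the desired set.
		if err := c.Delete(ctx, ds); err != nil && !apierrors.IsNotFound(err) {
			return err
		}
	}
	return nil
}
```

Because the comparison is against the freshly rendered manifests rather than DaemonSet status, the ubuntu22.04 DaemonSet from the example above is deleted as soon as the tightened nodeSelector stops producing it.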

After the fix:

k get nvidiadriver -o yaml | grep -A2 nodeSelector
    nodeSelector:
      region: us-east-1
      zone: us-east-1
      
#### DaemonSet status

nvidia-gpu-driver-ubuntu24.04-c99c48d99   1         1         1       1            1           feature.node.kubernetes.io/system-os_release.ID=ubuntu,feature.node.kubernetes.io/system-os_release.VERSION_ID=24.04,nvidia.com/gpu.deploy.driver=true,nvidia.com/gpu.present=true,region=us-east-1,zone=us-east-1   23m

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

@copy-pr-bot

copy-pr-bot bot commented Jan 29, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Rahul Sharma <rahulsharm@nvidia.com>
@rahulait force-pushed the cleanup-stale-daemonsets branch from 5ab5d25 to 379f164 on January 29, 2026 19:10
@rahulait
Contributor Author

/ok to test 379f164
