[fix] : Cleanup stale daemonsets not managed by any nvidiadriver CR #2081
+24
−5
Description
When an NvidiaDriver CR's nodeSelector is updated to be more restrictive (e.g., from region=us-east-1 to region=us-east-1, zone=us-east-1), the controller would:
a. Create DaemonSets for node pools matching the new nodeSelector
b. Leave behind DaemonSets for node pools that no longer match
These "orphaned" DaemonSets would continue running pods even though no NvidiaDriver CR wanted them.
For example, I had a cluster with 4 nodes (2 Ubuntu 22.04 and 2 Ubuntu 24.04), and all nodes had the label region=us-east-1. The NvidiaDriver CR was installed with the nodeSelector region=us-east-1, so two DaemonSets were created (one for Ubuntu 22.04 and one for Ubuntu 24.04) and all driver pods came up fine.
I then added another label (zone=us-east-1) to one of the Ubuntu 24.04 nodes and changed the NvidiaDriver CR's nodeSelector to also include the new label, so only one node in the cluster satisfied the selector. However, only one DaemonSet got updated. The other DaemonSet (for the Ubuntu 22.04 nodes) was no longer managed by any NvidiaDriver CR, even though it still had an owner reference to the CR that created it.
The Root Cause
The reconciliation flow was:
The cleanup function had no knowledge of which DaemonSets should exist, so it couldn't delete DaemonSets that were no longer desired.
The Fix
Updated the code to:
a. Build a set of desired DaemonSet names from the manifest objects
b. Compare all owned DaemonSets against the desired set
c. Delete any DaemonSet not in the desired set
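The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `ownedDaemonSet` type and the `staleDaemonSets` function are hypothetical stand-ins, and the real controller would work with Kubernetes DaemonSet objects and issue delete calls through its client rather than return names.

```go
package main

import "sort"

// ownedDaemonSet is a hypothetical, simplified stand-in for a DaemonSet
// owned by an NvidiaDriver CR; only the name matters for this sketch.
type ownedDaemonSet struct {
	Name string
}

// staleDaemonSets returns the names of owned DaemonSets that do not appear
// in the desired set built from the rendered manifest objects. The caller
// would then delete each returned DaemonSet.
func staleDaemonSets(desiredNames []string, owned []ownedDaemonSet) []string {
	// Step a: build a set of desired DaemonSet names.
	desired := make(map[string]struct{}, len(desiredNames))
	for _, name := range desiredNames {
		desired[name] = struct{}{}
	}

	// Steps b and c: compare every owned DaemonSet against the desired set
	// and collect the ones that are no longer wanted.
	var stale []string
	for _, ds := range owned {
		if _, ok := desired[ds.Name]; !ok {
			stale = append(stale, ds.Name)
		}
	}

	// Sort for deterministic output; the order of deletion does not matter.
	sort.Strings(stale)
	return stale
}
```

In the scenario described above, the desired set would contain only the Ubuntu 24.04 DaemonSet, so the Ubuntu 22.04 DaemonSet would be reported as stale even though it still carries an owner reference to the CR.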
After the fix:
Checklist
make lint
make validate-generated-assets
make validate-modules
Testing