[multicast] fix cache invalidation to work across Nexus instances #9742
+235
−82
The previous implementation used an AtomicBool to signal cache invalidation from sled_expunge()/sled_upsert() to the multicast background task. This only affected the Nexus that received the request, while other Nexus instances would never know to invalidate their caches.
As per discussion, this pattern of out-of-band communication with background tasks is something we ought to avoid: if a task accepts input to determine what to do, it must separately handle cases where that input is unavailable (after a restart, or when another Nexus instance receives the request). This bifurcates the code between the common case and rarely exercised fallback paths.
This PR replaces the AtomicBool approach with inventory collection ID tracking. The reconciler now checks the inventory collection ID in the database and invalidates its caches when it detects a new collection. This works across all Nexus instances since the DB is the source of truth.
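For reviewers who want the shape of the idea without reading the diff, here is a minimal sketch of collection-ID-based invalidation. It is not the code in this PR: `CollectionId`, `SledCache`, and `observe` are hypothetical names, and the database lookup that would supply `latest` is left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an inventory collection ID read from the DB.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct CollectionId(u64);

/// Hypothetical reconciler-side cache, keyed by an illustrative sled number.
#[derive(Default)]
struct SledCache {
    /// Collection ID seen on the previous reconciler pass, if any.
    last_seen: Option<CollectionId>,
    /// Cached per-sled data (contents are illustrative only).
    entries: HashMap<u32, String>,
}

impl SledCache {
    /// Compare the latest collection ID against the one seen last time.
    /// If it changed, drop all cached entries and remember the new ID.
    /// Returns true when the cache was invalidated.
    fn observe(&mut self, latest: Option<CollectionId>) -> bool {
        if latest == self.last_seen {
            return false;
        }
        self.entries.clear();
        self.last_seen = latest;
        true
    }
}
```

Because the only input is a value read from the database, every Nexus instance reaches the same decision on its next pass, and a freshly restarted instance simply starts with an empty cache and no `last_seen` value.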
The cache refresh now always uses SledFilter::InService, which excludes expunged sleds. This provides a safety net even when cache invalidation is delayed, as the next cache refresh will exclude any expunged sleds.
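The safety-net behavior can be illustrated with a simplified model. The types below (`SledState`, `SledRow`, `refresh_in_service`) are hypothetical and only mimic the effect of filtering to in-service sleds; they are not the real `SledFilter` API or datastore query.

```rust
/// Hypothetical, simplified sled state; the real policy model is richer.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum SledState {
    InService,
    Expunged,
}

/// Hypothetical row as a cache refresh might see it.
struct SledRow {
    id: u32,
    state: SledState,
}

/// Rebuild the cached sled list from scratch, keeping only in-service sleds.
/// An expunged sled drops out on the next refresh even if no explicit
/// invalidation happened in between.
fn refresh_in_service(rows: &[SledRow]) -> Vec<u32> {
    rows.iter()
        .filter(|row| row.state == SledState::InService)
        .map(|row| row.id)
        .collect()
}
```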
This update also includes a new test and test extensions around cache invalidation involving the multicast background task.