Skip to content

Conversation

@zeeshanlakhani
Copy link
Collaborator

The previous implementation used an AtomicBool to signal cache invalidation from sled_expunge()/sled_upsert() to the multicast background task. This only affected the Nexus that received the request, while other Nexus instances would never know to invalidate their caches.

As per discussion, this pattern of out-of-band communication with background tasks is something we aught to avoid: if a task accepts input to determine what to do, it must separately handle cases where that input is unavailable (after restart, or when another Nexus instance receives the request). This bifurcates the code between the common case and rarely-exercised fallback paths.

This PR replaces the AtomicBool approach with inventory collection ID tracking. The reconciler now checks if the inventory collection ID has changed (database-driven), and invalidates caches when it detects a new collection. This works across all Nexus instances since the DB is the source of truth.

Now, the cache refresh always uses SledFilter::InService, which excludes expunged sleds. This provides a safety net even if/when cache invalidation is delayed, as the next cache refresh will exclude any expunged sleds.

This update also includes a new test and test extensions around cache invalidation involving the multicast background task.

The previous implementation used an AtomicBool to signal cache invalidation from
sled_expunge()/sled_upsert() to the multicast  background task.
This only affected the Nexus that received the request, while other
Nexus instances would never know to invalidate their caches.

As per discussion, this pattern of out-of-band communication with background
tasks is something we aught to avoid: if a task accepts input to determine what
to do, it must separately handle cases where that input is unavailable
(after restart, or when another Nexus instance receives the request).
This bifurcates the code between the common case and rarely-exercised
fallback paths.

This PR replaces the AtomicBool approach with inventory collection ID
tracking. The reconciler now checks if the inventory collection ID
has changed (database-driven), and invalidates caches when it detects a new
collection. This works across all Nexus instances since the DB is the source
of truth.

Now, the cache refresh always uses SledFilter::InService, which excludes
expunged sleds. This provides a safety net even if/when cache invalidation is
delayed, as the next cache refresh will exclude any expunged sleds.

This update also includes a new test and test extensions around cache invalidation
involving the multicast background task.
Copy link
Contributor

@jgallagher jgallagher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple tiny nits. Thanks for the quick turnaround here and putting up with all my questions.

@zeeshanlakhani zeeshanlakhani force-pushed the zl/mcast-task-cache-handle-fix branch from 1291dbb to 4d8a579 Compare January 29, 2026 19:13
…nnel

- Track sled baseboard->sp_slot mappings instead of collection IDs
- Only invalidate when sled physical locations change (rare), not on
  every inventory collection (frequent)
- Use existing inventory watch channel pattern shared by background tasks (not DB)
- Update tests to activate inventory_loader before reconciler
- Update TODO with DDM information
- fix nits
@zeeshanlakhani zeeshanlakhani force-pushed the zl/mcast-task-cache-handle-fix branch from bc05306 to 6885e4d Compare January 29, 2026 19:25
@zeeshanlakhani zeeshanlakhani merged commit 8e30605 into main Jan 30, 2026
16 checks passed
@zeeshanlakhani zeeshanlakhani deleted the zl/mcast-task-cache-handle-fix branch January 30, 2026 08:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants