@zeeshanlakhani
Collaborator

Fixes #9729

The test test_multicast_group_dpd_communication_failure_recovery can panic during cleanup because it intentionally stops DPD to test failure recovery. When cleanup then tries to simulate instance state transitions, the sled agent cannot communicate with the stopped DPD, causing a panic.

The fix is to add fallible versions of simulation helpers for test cleanup:

  • try_vmm_finish_transition in sled-agent-client: returns Result instead of panicking on communication failure
  • try_instance_simulate in instances.rs: wraps the fallible client method, includes VMM ID in error messages for debugging

Then, we update cleanup_instances and stop_instances in multicast/mod.rs to use these versions for cleanup.
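
As a rough illustration of the pattern described above (not the code in this PR), here is a minimal sketch in which a fallible `try_*` helper returns a `Result` and cleanup collects errors instead of panicking. `MockSledAgent`, `SimulateError`, the `dpd_running` flag, the integer VMM IDs, and all signatures are hypothetical stand-ins; the real helpers are async, take different arguments, and use UUID-based IDs.

```rust
use std::fmt;

/// Hypothetical stand-in for the simulated sled agent client.
struct MockSledAgent {
    /// `false` models the state after the test has stopped DPD.
    dpd_running: bool,
}

/// Hypothetical error type that carries the VMM ID for debugging.
#[derive(Debug)]
struct SimulateError {
    vmm_id: u32,
    reason: String,
}

impl fmt::Display for SimulateError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "failed to simulate VMM {}: {}", self.vmm_id, self.reason)
    }
}

impl MockSledAgent {
    /// Fallible variant: reports a communication failure as an `Err`.
    fn try_vmm_finish_transition(&self, _vmm_id: u32) -> Result<(), String> {
        if self.dpd_running {
            Ok(())
        } else {
            Err("cannot reach DPD".to_string())
        }
    }

    /// Panicking variant, as used by tests that expect healthy infrastructure.
    fn vmm_finish_transition(&self, vmm_id: u32) {
        self.try_vmm_finish_transition(vmm_id)
            .expect("sled agent failed to finish VMM transition");
    }
}

/// Wraps the fallible client call and attaches the VMM ID to any error.
fn try_instance_simulate(agent: &MockSledAgent, vmm_id: u32) -> Result<(), SimulateError> {
    agent
        .try_vmm_finish_transition(vmm_id)
        .map_err(|reason| SimulateError { vmm_id, reason })
}

/// Cleanup attempts every VMM and returns the errors rather than panicking.
fn cleanup_instances(agent: &MockSledAgent, vmm_ids: &[u32]) -> Vec<SimulateError> {
    vmm_ids
        .iter()
        .filter_map(|&id| try_instance_simulate(agent, id).err())
        .collect()
}
```

With `dpd_running: false`, `cleanup_instances` returns one error per VMM and the caller can log them and move on. The panicking `vmm_finish_transition` remains available for the body of a test, where a communication failure should still fail the test.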

I had a version of this in another PR, but that's still in review. Let's bring this in now.

Real failures are still caught during test execution. Only cleanup tolerates errors, because some cleanup failures are expected when a test intentionally breaks infrastructure.

Development

Successfully merging this pull request may close these issues.

test failed in CI: test_multicast_group_dpd_communication_failure_recovery
