-
Notifications
You must be signed in to change notification settings - Fork 66
Local storage [4/4]: Use raw zvols #9593
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Request the new *raw* zvol type when creating disks backed by local storage. Using raw zvols requires waiting for them to be initialized, which the instance start saga is now responsible for polling. In the instance start saga, ensure each raw zvol in parallel because this can take some time. This PR also explicitly removes the inherited encryption for these zvols. This is accomplished by changing `EncryptionDetails` to be an enum where the call site now has to explicitly choose between inheriting the parent dataset's encryption, using `aes-256-gcm`, and explicitly setting encryption off. Also chopped out is setting `volblocksize`: this was a mistake to set, and it should instead be left to the default value. Still to do is to revisit the overhead reserved for zvols: the previous emperical value was not for the new raw zvol type. This is tracked by oxidecomputer#9591. Fixes oxidecomputer#9519.
illumos-utils/src/zfs.rs
Outdated
| match execute_async(cmd).await { | ||
| Ok(_) => Ok(()), | ||
| Ok(_) => { | ||
| // no-op |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this just mean "Command succeeded, nothing else to do"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was a little confused that "no-op" meant "yeah I did it and it worked"
illumos-utils/src/zfs.rs
Outdated
| // Open the raw zvol and read from it. If EINPROGRESS is seen from | ||
| // the read, then it's not done initializing yet. We can't delete it | ||
| // until it's fully done, so return the `NotReady` variant to signal | ||
| // upstack that it should poll. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that's a bummer; @grwilson is this something we could change in the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There will be an ioctl interface that will allow for the appstack to stop the initialization so that the process can be canceled or aborted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8c4376b changes to use DKIOCRAWVOLSTOP to stop the init
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and 529ab04 fixes that commit...
illumos-utils/src/zfs.rs
Outdated
| // Open the raw zvol and read from it. If EINPROGRESS is seen from | ||
| // the read, then it's not done initializing yet. We can't delete it | ||
| // until it's fully done, so return the `NotReady` variant to signal | ||
| // upstack that it should poll. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be worth creating a fn that delete and create both call rather than having the duplicated code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed to use DKIOCRAWVOLSTOP so this code looks less duplicated now
illumos-utils/src/zfs.rs
Outdated
| // The numeric type of these constants is not specified due to being | ||
| // different for illumos and linux. They are also `let dkioc` and | ||
| // not `const DKIOC` because `const` requires a type. | ||
| // | ||
| // XXX is there a way to pull this constant in from | ||
| // /usr/include/sys/dkio.h | ||
| let dkioc = 0x04 << 8; | ||
| let dkiocrawvolstop = dkioc | 31; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| // The numeric type of these constants is not specified due to being | |
| // different for illumos and linux. They are also `let dkioc` and | |
| // not `const DKIOC` because `const` requires a type. | |
| // | |
| // XXX is there a way to pull this constant in from | |
| // /usr/include/sys/dkio.h | |
| let dkioc = 0x04 << 8; | |
| let dkiocrawvolstop = dkioc | 31; | |
| #[cfg(target_os = "illumos")] | |
| const DKIOCRAWVOLSTOP: libc::c_int = (0x04 << 8) | 0x1f; | |
| #[cfg(not(target_os = "illumos"))] | |
| const DKIOCRAWVOLSTOP: libc::c_ulong = (0x04 << 8) | 0x1f; |
eh?
You could use bindgen--I don't think it's worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this works. I was thinking to do the cfg specialization like this but forgot about not and was thinking that I'd have to add a target for each OS. Done in ac3d056
|
Do not merge this yet, it depends on a fix for #9607 |
|
Note to self: must change this PR to set volblocksize = 128k again. |
|
|
||
| pub async fn delete_dataset_volume( | ||
| params: DatasetVolumeDeleteArgs<'_>, | ||
| ) -> Result<(), DeleteDatasetVolumeError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random comment here to remind us to figure out why the delete failed on berlin over the weekend.
Sorry for this message not actually having any useful information regarding that failure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
94345a3 fixes one of the bugs you saw, but doesn't address why the parent dataset could not be deleted originally
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try again deleting and see what shakes out.
If the /dev/zvol/rdsk/... path is gone, then skip trying to delete what is probably already deleted. Check the DKIOCRAWVOLSTATUS output to see if the init thread even needs stopping, and return early accordingly. In the face of a transient EINTR stopping the raw zvol init thread, retry the ioctl once. In the face of a transient "dataset is busy" when deleting the parent dataset, retry once. Other parts of sled-agent (like inventory collection) maybe caused this, we never nailed it down. Refactor waiting logic into separate `wait_for_raw_volume_ok_to_delete` function to ensure that the fd is dropped and this part of sled agent isn't holding an open file descriptor to the raw zvol while trying to delete it.
|
I have no idea how that got closed...? |
leftwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly questions :)
| } | ||
|
|
||
| _ => { | ||
| // The zero init thread encountered an error, we can proceed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this where we might change the current code to handle an init that runs forever retrying the init? Or, in this case are we okay and the init will have failed with an error and we will just go ahead and delete it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's an error running the zero init thread (which I'm assuming also does the allocation), I'm assuming we fail and are not retrying, yeah. We'll have to delete the raw vol and try again.
| pub async fn delete_dataset_volume( | ||
| params: DatasetVolumeDeleteArgs<'_>, | ||
| /// Returns Ok when a raw volume is ok to delete, and an error otherwise | ||
| pub async fn wait_for_raw_volume_ok_to_delete( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we really waiting here? We do a few retries but I don't see a loop that will continue checking until the conditions are satisfied. If I'm correct, then maybe name this check_for_raw_volume_ok_to_delete()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it will wait in the STOP ioctl for the thread to stop, so I wanted that to be reflected in the function name :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we throw a comment in then, so next time I read this I won't think this is will wait for the deletion before returning? :)
illumos-utils/src/zfs.rs
Outdated
| } | ||
|
|
||
| // Execution should never reach here? | ||
| Ok(()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we panic if we do get here? How could this happen
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b3a2168 changes this, have a look :)
grwilson
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The raw volume status check logic looks good to me (including the EFRAGS). I also verified that the current code will report back EFRAGS if we encounter gang blocks with a blocksize of 128k.

Request the new raw zvol type when creating disks backed by local storage. Using raw zvols requires waiting for them to be initialized, which the instance start saga is now responsible for polling.
In the instance start saga, ensure each raw zvol in parallel because this can take some time.
This PR also explicitly removes the inherited encryption for these zvols. This is accomplished by changing
EncryptionDetailsto be an enum where the call site now has to explicitly choose between inheriting the parent dataset's encryption, usingaes-256-gcm, and explicitly setting encryption off.Also chopped out is setting
volblocksize: this was a mistake to set, and it should instead be left to the default value.Still to do is to revisit the overhead reserved for zvols: the previous emperical value was not for the new raw zvol type. This is tracked by #9591.
Fixes #9519.