
Conversation

@hawkw (Member) commented Jan 27, 2026

At the time of writing, both Crucible and Propolis have the necessary
machinery to support read-only disks. Propolis can expose a block device
to the guest as read-only, and Crucible VCRs can contain a read-only
flag, which is already necessary in order to support snapshots. However,
the control plane does not know how to create and manage read-only
disks. This branch adds control plane support for read-only virtual
disks.

Theory of Operation

The read-onlyness of read-only disks created by the implementation in
this branch is enforced both at the Propolis level (by making the
virtual block device read-only) and at the Crucible level (by making
the regions themselves read-only). In order to use read-only Crucible
regions to back such disks, all read-only disks are backed by existing
read-only volumes. This way, the regions backing the disk are created
as read-only from the start. If we did not do this, we would need a way
to create read-write regions in order to write the disk's initial data,
and then make those regions read-only afterwards, which would be
substantially more complex and require changes in Crucible.
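To make the layering concrete, here is a much-simplified sketch of the
relevant shapes (the real types are Crucible's VolumeConstructionRequest
and CrucibleOpts, with most fields elided here):

    // Simplified stand-ins for Crucible's real types.
    struct CrucibleOpts {
        read_only: bool, // enforced by every downstairs region
        // ...target addresses, encryption keys, etc.
    }

    enum Vcr {
        Region { opts: CrucibleOpts },
        Volume { sub_volumes: Vec<Vcr>, read_only_parent: Option<Box<Vcr>> },
    }

A read-only disk's VCR points at the existing read-only volume of its
source image or snapshot, so every region it references was created
read-only in the first place, and enforcement does not depend on either
layer alone.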

This means that read-only disks may only be created from existing
images or snapshots, as the read-only volume backing the image or
snapshot is used as the backend for the read-only disk. Eventually, we
could also support creating a read-only disk by importing blocks
directly, by having the saga create a transient read/write disk, take a
snapshot of that disk once it has been finalized, use the snapshot to
create a read-only disk, and then delete the temporary read/write disk
and snapshot. I didn't do this here since it didn't really seem that
useful --- you could just upload a project image instead, and then use
it to make as many read-only disks as you'd like. On the other hand, it
probably never makes sense to allow creating a read-only blank disk,
since...well, you could never make it not be blank. While we could make
such a disk serve reads BLAZINGLY FAST, most Unixes already come with
such a device in the form of /dev/zero, and it's probably unnecessary
to let you make your own Oxide distributed-disk version of /dev/zero.
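For illustration, requesting such a disk looks roughly like this from
the params side (the Snapshot variant's fields match the snippet quoted
later in this conversation; the surrounding DiskCreate shape is
approximate):

    let params = params::DiskCreate {
        identity: IdentityMetadataCreateParams {
            name: "my-ro-disk".parse().unwrap(),
            description: "read-only disk backed by a snapshot".to_string(),
        },
        disk_source: params::DiskSource::Snapshot {
            snapshot_id,     // UUID of the source snapshot
            read_only: true, // new in this branch
        },
        size: snapshot_size, // must equal the parent snapshot's size
    };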

Virtual Resource Accounting

Note that creating a read-only disk currently requires a disk_create
saga, but it's a very small one. Since the read-only volumes always
already exist, we don't actually need a saga to create the new
read-only disk --- it's really just something that exists exclusively in
CockroachDB and doesn't require actually creating new Crucible regions
out there in the Real World. Most of this database work is done in a
transaction, since it's all just in CRDB. However, currently, we must do
virtual resource accounting for every read-only disk, and the current
virtual provisioning CTE was a bit annoying to hack up into something
that can run in that transaction, so we just make a saga that does the
virtual resource accounting in one saga node, and then does "everything
else" in another one.

With that said, it's actually pretty stupid for a read-only disk of
size s to require s bytes of the storage quota, since...it's really
just a pointer to the volume that was already provisioned for the source
image or snapshot, and any number of read-only disks can be created from
that same image or snapshot without actually storing multiple copies of
the data on a physical disk. The only overhead is the handful of
database records representing the read-only disk and its VCR. So, in an
ideal world, we wouldn't count them against the quota. But, currently,
deleting such a disk releases virtual storage, so if we didn't allocate
it at creation time, there would be an unfortunate quota
leak. @jmpesp and I discussed this a bit, and we agreed that if and when
we implement the physical storage accounting described in RFD 639,
this should all change so that the physical disk space is allocated
once, by the image/snapshot, rather than repeatedly by each read-only
disk. But for now, the current behavior of allocating virtual storage to
each read-only disk is unfortunately correct.
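To put numbers on it: under the current scheme, ten read-only disks
created from a single 4 GiB image would charge 40 GiB against the
project's virtual storage quota, even though the underlying data is
stored only once (modulo Crucible's replication).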

Other Notes

While adding tests for this change, I made some changes to the
resource_helpers::create_disk_from_snapshot function in order to

  1. allow passing the read_only flag to the params::DiskCreate it
    constructs, and
  2. always use the parent snapshot's size for the child disk.

This necessitated a diff that touched basically all the other tests that
used that function. Sorry about that, those changes should be trivially
ignorable in code review.
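For reference, the updated helper's shape is roughly the following (a
signature sketch; parameter names other than read_only are assumed, and
the body is elided):

    pub async fn create_disk_from_snapshot(
        client: &ClientTestContext,
        project_name: &str,
        disk_name: &str,
        snapshot: &views::Snapshot,
        read_only: bool, // (1) forwarded into the params::DiskCreate it builds
    ) -> Disk {
        // (2) the child disk's size is now always taken from the parent
        //     snapshot, rather than being supplied by the caller.
        todo!()
    }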

if let Some(sub_err) = sub_err.take() {
    err.bail(
        Error::from(sub_err)
            .internal_context("failed to checkout snapshot volume"),
    );
}
Contributor:

could we do something with InlineErrorChain here and the other spots that have the sub err?

@hawkw (Member, Author):

i'm not really sure what that gets us here, since none of the sub errors we get back have complex source() chains in them?
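(For reference, slog-error-chain's InlineErrorChain flattens an error
and its source() chain into a single Display string, roughly:

    use slog_error_chain::InlineErrorChain;

    // given some `sub_err: impl std::error::Error`:
    let msg = InlineErrorChain::new(&sub_err).to_string();

which mostly pays off when the chain is deep.)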

@hawkw (Member, Author):

i think the VolumeGetError and VolumeCreateError types could probably be made much nicer but I wasn't sure if i wanted to mess with the way they formatted themselves in this otherwise-unrelated PR...

@hawkw hawkw marked this pull request as ready for review January 28, 2026 21:04
@hawkw (Member, Author) commented Jan 28, 2026

This change still needs some proper end-to-end testing on a racklette with real Crucibles and Propolises, but I think it should be more or less ready for review. I'll do real end-to-end testing once I get a TUF repo of the latest commit built.

@hawkw hawkw requested a review from jmpesp January 28, 2026 21:13
@jmpesp (Contributor) left a comment:

It's also worth adding a test for read-only region replacement succeeding for these new types of disks (see crucible_replacements.rs). I'm 99.9% positive the existing machinery will just work but it's worth it to check.

@hawkw (Member, Author) commented Jan 28, 2026

> It's also worth adding a test for read-only region replacement succeeding for these new types of disks (see crucible_replacements.rs)

Will do, thanks!

hawkw added a commit that referenced this pull request Jan 28, 2026
@hawkw hawkw requested a review from jmpesp January 29, 2026 23:31
@hawkw (Member, Author) commented Jan 30, 2026

This failure of test_instance_start_creates_networking_state sure looks like a flake; there's a previous issue about a possible flake in that test (#6540), but that was determined to be an actual bug and fixed. I'll make a new ticket for this one.

@hawkw hawkw enabled auto-merge (squash) January 30, 2026 17:57
@hawkw hawkw disabled auto-merge January 30, 2026 17:58
@hawkw hawkw changed the title from "[nexus] create read-only disks from snapshots" to "[nexus] create read-only disks from images or snapshots" Jan 30, 2026
@jmpesp (Contributor) left a comment:

🛳️

@hawkw hawkw mentioned this pull request Jan 30, 2026
@hawkw hawkw enabled auto-merge (squash) January 30, 2026 21:54
@hawkw hawkw merged commit 3dfb80f into main Jan 31, 2026
16 checks passed
@hawkw hawkw deleted the eliza/ro branch January 31, 2026 01:20
Snapshot {
    snapshot_id: Uuid,
    /// If `true`, the disk created from this snapshot will be read-only.
    read_only: bool,
Contributor:

Since 99.99% of the time people will not be provisioning read-only disks, can we make this either Option<bool> or #[serde(default)]?

Contributor:

Agree, was slightly surprised by this while doing the web console implementation but I forgot to bring it up

@hawkw (Member, Author):

Yeah, I'm happy to change this. It seems like adding a #[serde(default)] ought to not require a whole API version bump if it makes it in before the release, but I think it probably does, since the tooling will get mad if the existing generated API spec changes?

Also, as we discussed on Matrix, there is the potential to revisit whether this should be on DiskSource and not DiskBackend or at the top level of the DiskCreate params. @rmustacc pushed back on it being on DiskSource, and I think he was conceptually right that it is a property of the disk rather than of the source...but, in practice, putting it on one of the higher level types means that there are more invalid DiskCreate params that users can construct and which we will reject with a 400 error. I felt like it was worth having the OpenAPI document make it clear which params are valid, as it felt more self-documenting than having it always be part of the API type but invalid in many cases. But, I'm open to being convinced that's wrong, so I wanted to bring it up while we were discussing changing the API again.

Contributor:

  1. First thinking about where the read_only property appears. I think it might make sense to have it on "attach" in the future to enforce r/o semantics on an otherwise r/w disk. But it also makes sense on these particular variants because we're creating a fundamentally distinct disk type i.e. one whose contents are known to be fixed for all time. In other words, I think having read_only on these variants makes sense and in the future we might also have this setting at the "attach" level... probably as an Option<bool> where e.g. read_only = Some(false) on a read-only disk would produce an error?
  2. I can't think of a strong argument for Option<bool> vs. #[serde(default)]. The generated code is no easier or harder on the rust side. I do think that #[serde(default)] is slightly cleaner in that for the Option there could be ambiguity about what None (i.e. an absent) field means.
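(For reference, the #[serde(default)] option on the variant quoted
above would look roughly like this; the enum's serde tag attributes are
an assumption:

    use serde::Deserialize;
    use uuid::Uuid;

    #[derive(Debug, Deserialize, PartialEq)]
    #[serde(tag = "type", rename_all = "snake_case")]
    pub enum DiskSource {
        Snapshot {
            snapshot_id: Uuid,
            /// If `true`, the disk created from this snapshot will be read-only.
            #[serde(default)] // an absent field deserializes as `false`
            read_only: bool,
        },
        // ...other variants elided
    }
)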

@david-crespo (Contributor) commented Jan 31, 2026:

It would be nice to have the types enforce the right combination of things, but I think I agree with rm that it's confusing to make it sound like read-only is a property of the source, especially since it is at top level in the Disk view. Ideally, we would be able to put the field at top level and still have the types enforce that it is only allowed to be true when the disk source is an image or snapshot. I can write that type in TypeScript, but we don't have a nice way of expressing it in OpenAPI, let alone generating the right type from that in all the SDKs. So I think I lean toward putting it at top level on the create body and letting 400s and doc comments clarify the rules. But I am fine with either way.

On the new API version question, it’s fine if it requires that — seems like we are resigned to essentially any change requiring that.

hawkw added a commit that referenced this pull request Jan 31, 2026
@ahl rightly [points out][1] that the common case for the `disk_create`
APIs is to create read-write disks, and it seems unfortunate to require
all calls to create a read-write disk from an image or snapshot to
require a `"read_only": false` field in the JSON body. Thus, this commit
adds `#[serde(default)]` so that disk source params _without_ a
`"read_only"` are the same as `"read_only": false`.

Unfortunately, we didn't catch this until _after_ #9731 merged, so
fixing it the proper way requires a whole new API version. Since the
conversion between the Rust params types in `nexus-types` is trivial, it
might have been possible to bend the rules a bit here and just tweak the
previously-generated API version to make the field optional --- clients
built against the slightly older version of the API would still always
be accepted, since they would _always_ send the field. But,
bending the rules makes me scared I'm gonna get in trouble, so I did it
the proper way, at the expense of a whole lot of code that does
basically nothing.
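(A quick illustration of the equivalence this commit describes, reusing
the sketch above; the "snapshot" tag is an assumption about the wire
format:

    let with_field: DiskSource = serde_json::from_value(serde_json::json!({
        "type": "snapshot",
        "snapshot_id": snapshot_id,
        "read_only": false,
    }))
    .unwrap();
    let without_field: DiskSource = serde_json::from_value(serde_json::json!({
        "type": "snapshot",
        "snapshot_id": snapshot_id,
    }))
    .unwrap();
    // with #[serde(default)], the absent field fills in as `false`
    assert_eq!(with_field, without_field);
)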