Skip to content

Conversation

@smklein
Copy link
Collaborator

@smklein smklein commented Jan 8, 2026

Adds images as a virtual provisioning resource. Images are now tracked as a fourth resource type alongside Instance, Disk, and Snapshot.

Resource Accounting

  • Image creation increments virtual_disk_bytes_provisioned at the appropriate collection levels
  • Image deletion decrements it
  • Project-scoped images update: Project -> Silo -> Fleet collections
  • Silo-scoped images update: Silo -> Fleet collections

New Code

  • Added Image variant to ResourceTypeProvisioned and StorageType enums
  • Added silo-scoped CTE queries (new_insert_silo_storage, new_delete_silo_storage) for images without a project
  • Added SPACE_ACCOUNT saga action to both image_create and image_delete sagas
  • Data migration backfills accounting for all existing non-deleted images

Testing

  • Expectorate & Explain tests for all new SQL queries
  • Migration validation test for project and silo scoped images
  • Added a test_image_virtual_provisioning_collection integration test

Fixes #9602

@smklein smklein marked this pull request as ready for review January 9, 2026 00:08
@david-crespo
Copy link
Contributor

Maybe this is a silly question, but are images replicated the way disk volumes are? I ask because when you see virtual disk storage you can sort of mentally triple or quadruple it to reason about the physical disk size (as dubious an endeavor as that is) but if images are not replicated, then that math no longer works. Again that is probably fine because it is unwise to attempt to reason about physical capacity on the basis of these metrics anyway.

@jmpesp
Copy link
Contributor

jmpesp commented Jan 9, 2026

Yeah, images are backed by volumes in the same way that snapshots and crucible disks are, and the same replication factor applies

@smklein
Copy link
Collaborator Author

smklein commented Jan 9, 2026

Maybe this is a silly question, but are images replicated the way disk volumes are? I ask because when you see virtual disk storage you can sort of mentally triple or quadruple it to reason about the physical disk size (as dubious an endeavor as that is) but if images are not replicated, then that math no longer works. Again that is probably fine because it is unwise to attempt to reason about physical capacity on the basis of these metrics anyway.

This is a good question, and kinda the crux of the discussion in RFD 624 (which we still need to resolve - I'm just trying to track all these resources consistently to start). Local disks can/will violate this property, so we will need to adapt the accounting system to cope.

However, images are currently stored with triplicate redundancy, like distributed disks and snapshots.

@smklein smklein requested review from david-crespo and jmpesp January 9, 2026 22:02
project_id
) AS totals
WHERE
vpc.id = totals.project_id;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here in the migration you account for existing images in their current projects, but when these images are promoted to silo-level or demoted from silo to project there is nothing updating the accounting for that.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point, totally forgot about image promotion / demotion.

I've updated promotion/demotion to handle this case by "subtracting the old accounting, then adding the new accounting" within a transaction.

Added tests for this, as well as tests that "double demotion" and "double promotion" returns errors (as expected).

),
time_modified = NOW()
WHERE
collection_type = 'Fleet';
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These operations are not idempotent: if run more than once, they will add the sums to the total again every time. How bad is that? Feels a little complicated to avoid it, but we might have to.

@david-crespo
Copy link
Contributor

Two issues found by Codex and confirmed by Claude. First one checks out, seems straightforward. Second one is pretty subtle but it sounds real. The robot is correct that the easy thing to do is just say you can't promote/demote images when you're over quota, even though technically you should be able to.

Issue 1: Missing Clickhouse Metrics

The helper methods in virtual_provisioning_collection.rs (e.g., virtual_provisioning_collection_insert_storage, virtual_provisioning_collection_delete_storage, etc.) call append_disk_metrics(&provisions) after running the VirtualProvisioningCollectionUpdate query (lines 223-224, 293-294, 320-321, 346-347).

The promote/demote code in image.rs calls VirtualProvisioningCollectionUpdate::new_* directly and runs get_results_async without ever calling append_disk_metrics. This means Clickhouse won't see these accounting moves.

Issue 2: Quota Check Blocking Over-Quota Silos

Consider silo_image_demote:

  1. new_delete_silo_storage(image_id, size, silo_id) decrements silo and fleet by size
  2. new_insert_storage(image_id, size, project_id) increments project, silo, and fleet by size

The net silo change is zero. However, the insert query (apply_update at line 106) checks:

quota >= current_silo_storage + size

After step 1 commits within the transaction, current_silo_storage = original - size. So step 2 checks:

quota >= (original - size) + size = original

If the silo was already over quota (original > quota), this check fails even though the operation wouldn't increase silo usage. The same issue affects project_image_promote.

Possible Fixes

For issue 1: Either call the helper methods instead of the raw queries, or explicitly call append_disk_metrics on the results. The challenge is that the helpers don't have _on_connection variants for use in transactions, so you'd need to add those or call append_disk_metrics manually after each query in the transaction.

For issue 2: This is more complex. Options include:

  • Create dedicated move/transfer queries that do the delete+insert atomically without quota checks (since net usage is unchanged)
  • Add a skip_quota_check flag to the insert queries for use in move operations
  • Document that over-quota silos cannot promote/demote images and require quota increases first

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Track image storage usage

4 participants