Skip to content

HDDS-13572. SCMSnapshotProvider should not auto-create Ratis directory#9689

Open
echonesis wants to merge 1 commit intoapache:masterfrom
echonesis:HDDS-13572
Open

HDDS-13572. SCMSnapshotProvider should not auto-create Ratis directory#9689
echonesis wants to merge 1 commit intoapache:masterfrom
echonesis:HDDS-13572

Conversation

@echonesis
Copy link
Contributor

What changes were proposed in this pull request?

This PR prevents SCMSnapshotProvider from auto-creating Ratis directory when it is not configured.

  • Add validation to check Ratis directory existence before snapshot operations
  • Throw clear error when Ratis directory is missing instead of silently creating it

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-13572

How was this patch tested?

GitHub Actions CI: https://github.com/echonesis/ozone/actions/runs/21465081605

Copy link
Contributor

@szetszwo szetszwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@echonesis , thanks a lot for working on this!

Both the ratisStorageDir and ratisSnapshotDir should only be created only if either --init or --bootstrap option is set in StorageContainerManagerStarter. Without those options, the directories should already exist. If not, we should fail the command.

Please see also the comments inlined.

Comment on lines 97 to +99
@VisibleForTesting
protected SCMSnapshotProvider newScmSnapshotProvider(StorageContainerManager storageContainerManager) {
return new SCMSnapshotProvider(storageContainerManager.getConfiguration(),
protected SCMSnapshotProvider newScmSnapshotProvider(
StorageContainerManager storageContainerManager) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is not used in test. Let's remove @ VisibleForTesting

Comment on lines 83 to 85
public SCMSnapshotProvider(ConfigurationSource conf,
List<SCMNodeDetails> peerNodes,
CertificateClient scmCertificateClient) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is used only in TestSCMSnapshotProvider, let's remove it and change the test.

File snapshotDirFile = new File(snapshotDir);

SCMSnapshotProvider.StartupOption startupOption;
if (snapshotDirFile.exists()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use

    if (Files.isDirectory(Paths.get(snapshotDir))) {

Comment on lines +101 to +109
wait_for_om_leader

echo "Deleting test keys to create tombstones"
docker exec "${om_container}" bash -c "kinit -kt /etc/security/keytabs/testuser.keytab testuser/om@EXAMPLE.COM && ozone fs -rm -R -skipTrash ofs://omservice/vol1/bucket1"
echo "Test keys deleted"

echo "Restarting OM after key deletion to flush tombstones to sst files"
docker restart "${om_container}"
wait_for_om_leader
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated? If yes, please revert it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These changes are related - the stricter Ratis directory checks affect startup timing.
The wait_for_om_leader ensures proper ordering after container restarts.

Comment on lines +39 to +55
SST Size Should Be Reduced
[Arguments] ${original_size}
${current_size} = Get OM DB SST Files Size
Should Be True ${current_size} < ${original_size}
... OM DB size should be reduced after compaction. Before: ${original_size}, After: ${current_size}

*** Test Cases ***
Testing OM DB Size Reduction After Compaction
# Test keys are already created and flushed
# Delete keys to create tombstones that need compaction
Delete Test Keys

# Test keys are already created, deleted and flushed by the test script
${size_before_compaction} = Get OM DB SST Files Size

Compact OM DB Column Family fileTable
Compact OM DB Column Family deletedTable
Compact OM DB Column Family deletedDirectoryTable

${size_after_compaction} = Get OM DB SST Files Size

Should Be True ${size_after_compaction} < ${size_before_compaction}
... OM DB size should be reduced after compaction. Before: ${size_before_compaction}, After: ${size_after_compaction}
# Compaction is asynchronous, poll until size is reduced
Wait Until Keyword Succeeds 60sec 5sec SST Size Should Be Reduced ${size_before_compaction}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change seems unrelated? If yes, please revert it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as #9689 (comment)
This ensures proper ordering after container restarts.


// Create Ratis snapshot directory during initialization
// This ensures the directory exists before SCM starts normally
String snapshotDir = SCMHAUtils.getSCMRatisSnapshotDirectory(conf);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

snapshotDir should be created only for scmInit or scmBootstrap. So, it should not be created here.

@echonesis
Copy link
Contributor Author

@echonesis , thanks a lot for working on this!

Both the ratisStorageDir and ratisSnapshotDir should only be created only if either --init or --bootstrap option is set in StorageContainerManagerStarter. Without those options, the directories should already exist. If not, we should fail the command.

Please see also the comments inlined.

Thanks @szetszwo
I agree with this principle. However, there's an upgrade scenario I'd like to clarify:
In StorageContainerManager constructor (L376-378), when an old cluster (without Ratis/HA) starts normally and detects !isSCMHAEnabled(), it calls initializeRatis() to auto-enable Ratis. This also needs to create the directories.
Should this upgrade path:

  1. Continue to auto-create directories (current behavior), or
  2. Require users to run --init first before upgrading?

@szetszwo
Copy link
Contributor

szetszwo commented Feb 3, 2026

... there's an upgrade scenario I'd like to clarify ...

Thanks for pointing out the upgrade scenario. The initializeRatis(..) does not auto-create directories. When the directories are created for upgrade?

We probably should have --upgrade option. Any "auto-create directories" behavior is error prone and leads to data loss.

@echonesis
Copy link
Contributor Author

... there's an upgrade scenario I'd like to clarify ...

Thanks for pointing out the upgrade scenario. The initializeRatis(..) does not auto-create directories. When the directories are created for upgrade?

We probably should have --upgrade option. Any "auto-create directories" behavior is error prone and leads to data loss.

In initializeRatis() (StorageContainerManager.java L1354-L1358), I added explicit creation of snapshotDir:

String snapshotDir = SCMHAUtils.getSCMRatisSnapshotDirectory(conf);
HddsUtils.createDir(snapshotDir);

For ratisStorageDir, it's auto-created by Ratis internally when SCMRatisServerImpl.initialize() calls RaftServer.build(). The upgrade scenario (L376-378) calls initializeRatis() during normal startup when !isSCMHAEnabled(), which triggers both directory creations.

I agree that adding an --upgrade option would be a cleaner approach.

@szetszwo
Copy link
Contributor

szetszwo commented Feb 4, 2026

In initializeRatis() (StorageContainerManager.java L1354-L1358), I added explicit creation of snapshotDir:

Sorry, I mean to ask -- When is the snapshotDir created for upgrade before this change?

@echonesis
Copy link
Contributor Author

In initializeRatis() (StorageContainerManager.java L1354-L1358), I added explicit creation of snapshotDir:

Sorry, I mean to ask -- When is the snapshotDir created for upgrade before this change?

Before this change, both directories were auto-created in SCMSnapshotProvider constructor (L71, L74-75) via HddsUtils.createDir(). So for upgrade, when SCMHAManagerImpl creates SCMSnapshotProvider, it auto-creates both directories there.

@szetszwo
Copy link
Contributor

szetszwo commented Feb 5, 2026

Before this change, both directories were auto-created in SCMSnapshotProvider constructor ...

@echonesis , thanks for the info. For now, let's add the auto-create only for upgrade case but not inside initializeRatis(..) since initializeRatis(..) is also used in the non-upgrade cases. Would it work?

@echonesis
Copy link
Contributor Author

Before this change, both directories were auto-created in SCMSnapshotProvider constructor ...

@echonesis , thanks for the info. For now, let's add the auto-create only for upgrade case but not inside initializeRatis(..) since initializeRatis(..) is also used in the non-upgrade cases. Would it work?

Yes, that works. I'll move the auto-create logic to the upgrade case instead of initializeRatis().
One clarification: should we still plan for a --upgrade option in a follow-up JIRA to make this explicit, or is auto-create in the upgrade path acceptable as a long-term solution?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants