HBASE-29806: master procedure executor fail due to NPE in RegionRemoteProcedureBase.afterReplay() #7667
|
🎊 +1 overall
This message was automatically generated. |
|
A parent procedure can not be completed while a sub procedure is still pending. |
|
Thank you for the feedback. You're correct that under normal operation, a parent procedure cannot complete while a sub procedure is still pending - this is the intended invariant. However, this NPE occurs specifically during crash recovery (afterReplay()), not during normal execution. During crash recovery, the procedure store may contain inconsistent state that violates this invariant due to:
The HBase codebase already handles this scenario in other places. For example, in ProcedureExecutor.countDownChildren() (lines 1985-1989):

    Procedure<TEnvironment> parent = procedures.get(procedure.getParentProcId());
    if (parent == null) {
      assert procStack.isRollingback();
      return;
    }

This shows that null parent scenarios are already recognized and handled during rollback. The proposed fix is a defensive null check, consistent with the pattern already used in countDownChildren(). This ensures the master can recover even from unexpected procedure store states, rather than failing startup with an NPE. |
A fix for HBASE-29806: RegionRemoteProcedureBase.afterReplay() causes NPE when parent procedure has already completed.
Root Cause
When the master restarts after a crash:
Changes Made
File: hbase-server/src/main/java/org/apache/hadoop/hbase/master/assignment/RegionRemoteProcedureBase.java
- Added null check for parent procedure
- If parent is null, logs a warning and returns gracefully
- The orphaned child procedure will be cleaned up by the procedure executor
- Added null check for parent procedure
- If parent is null, silently skips unattach since parent has already completed
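For illustration, the guard described above can be sketched in isolation. This is a minimal, self-contained model and not the actual patch: ParentProc, ProcStore, ChildProc and the boolean return value are hypothetical stand-ins for the real HBase types, and the warning is printed to stderr here instead of going through a logger.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a parent procedure; not the real HBase class.
final class ParentProc {
    final long procId;
    boolean childAttached = false;

    ParentProc(long procId) {
        this.procId = procId;
    }
}

// Hypothetical stand-in for the executor's procedure map.
final class ProcStore {
    private final Map<Long, ParentProc> procedures = new HashMap<>();

    void add(ParentProc parent) {
        procedures.put(parent.procId, parent);
    }

    // Returns null when the parent has already completed and been removed.
    ParentProc get(long procId) {
        return procedures.get(procId);
    }
}

// Hypothetical stand-in for a remote child procedure.
final class ChildProc {
    private final long parentProcId;

    ChildProc(long parentProcId) {
        this.parentProcId = parentProcId;
    }

    // Re-attaches this child to its parent after replay.
    // Returns false (assumption: the real method returns void) when the
    // parent is missing, instead of dereferencing null.
    boolean afterReplay(ProcStore store) {
        ParentProc parent = store.get(parentProcId);
        if (parent == null) {
            System.err.println("WARN: parent procedure " + parentProcId
                + " not found during replay; skipping attach");
            return false;
        }
        parent.childAttached = true;
        return true;
    }
}

final class OrphanReplayDemo {
    public static void main(String[] args) {
        ProcStore store = new ProcStore();
        store.add(new ParentProc(1L));

        // Parent present: the child attaches normally.
        System.out.println(new ChildProc(1L).afterReplay(store));
        // Parent missing: the orphaned child is skipped without an NPE.
        System.out.println(new ChildProc(99L).afterReplay(store));
    }
}
```

Without the null check, the second call would throw a NullPointerException at the point where the parent is dereferenced, which is the startup failure this PR addresses.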
Test Added
File: hbase-server/src/test/java/org/apache/hadoop/hbase/master/assignment/TestRegionRemoteProcedureBaseOrphanAfterReplay.java
A unit test that:
Files Changed