-
Notifications
You must be signed in to change notification settings - Fork 587
fix(pt/pd): fix incompatibility between AutoBatchSize and eval hooks #5181
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
📝 WalkthroughWalkthroughThese changes implement an OOM (Out of Memory) retry mechanism in the DeePMD-kit inference pipeline. The modifications enable auto batch size adjustments during descriptor and fitting-last-layer evaluations by introducing retry signaling and toggling OOM retry mode before and after evaluation calls. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This pull request attempts to fix an incompatibility between AutoBatchSize and evaluation hooks (eval_descriptor and eval_fitting_last_layer) that caused mismatched descriptor output when OOM errors occurred during batch processing. The issue manifested as descriptors having more frames than the input system (e.g., 241 frames vs 175 in the reported issue).
Changes:
- Introduces a retry mechanism via
RetrySignalexception and@retrydecorator to restart processing from the beginning when OOM occurs during hook-based evaluation - Adds
oom_retry_modeflag toAutoBatchSizeto control whether OOM errors trigger a full retry - Enables retry mode in
eval_descriptorandeval_fitting_last_layermethods for both PyTorch (pt) and Paddle (pd) backends
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| deepmd/utils/batch_size.py | Adds RetrySignal exception, retry decorator, oom_retry_mode flag, and logic to raise RetrySignal on OOM when retry mode is enabled |
| deepmd/pt/infer/deep_eval.py | Enables/disables oom_retry_mode around eval calls in eval_descriptor and eval_fitting_last_layer methods |
| deepmd/pd/infer/deep_eval.py | Enables/disables oom_retry_mode around eval calls in eval_descriptor and eval_fitting_last_layer methods |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
deepmd/pt/infer/deep_eval.py (1)
796-812: Ensure hooks & OOM retry mode are reset on exceptions.
Lines 796-812 and 855-871 enable hooks/retry mode but only disable them on the success path. Ifself.eval(...)ormodel.eval_*throws, the flags remain enabled and can corrupt subsequent calls.✅ Safer pattern (apply to both methods)
- if self.auto_batch_size is not None: - self.auto_batch_size.set_oom_retry_mode(True) - model.set_eval_descriptor_hook(True) - self.eval( - coords, - cells, - atom_types, - atomic=False, - fparam=fparam, - aparam=aparam, - **kwargs, - ) - descriptor = model.eval_descriptor() - model.set_eval_descriptor_hook(False) - if self.auto_batch_size is not None: - self.auto_batch_size.set_oom_retry_mode(False) + if self.auto_batch_size is not None: + self.auto_batch_size.set_oom_retry_mode(True) + model.set_eval_descriptor_hook(True) + try: + self.eval( + coords, + cells, + atom_types, + atomic=False, + fparam=fparam, + aparam=aparam, + **kwargs, + ) + descriptor = model.eval_descriptor() + finally: + model.set_eval_descriptor_hook(False) + if self.auto_batch_size is not None: + self.auto_batch_size.set_oom_retry_mode(False)Also applies to: 855-871
deepmd/pd/infer/deep_eval.py (1)
823-842: Ensure hooks & OOM retry mode are reset on exceptions.
Lines 823-842 and 884-901 toggle hooks/retry mode without afinally. Any exception duringself.eval(...)ormodel.eval_*can leave the backend in a bad state.✅ Safer pattern (apply to both methods)
- if self.auto_batch_size is not None: - self.auto_batch_size.set_oom_retry_mode(True) - model.set_eval_descriptor_hook(True) - self.eval( - coords, - cells, - atom_types, - atomic=False, - fparam=fparam, - aparam=aparam, - **kwargs, - ) - descriptor = model.eval_descriptor() - model.set_eval_descriptor_hook(False) - if self.auto_batch_size is not None: - self.auto_batch_size.set_oom_retry_mode(False) + if self.auto_batch_size is not None: + self.auto_batch_size.set_oom_retry_mode(True) + model.set_eval_descriptor_hook(True) + try: + self.eval( + coords, + cells, + atom_types, + atomic=False, + fparam=fparam, + aparam=aparam, + **kwargs, + ) + descriptor = model.eval_descriptor() + finally: + model.set_eval_descriptor_hook(False) + if self.auto_batch_size is not None: + self.auto_batch_size.set_oom_retry_mode(False)Also applies to: 884-901
🤖 Fix all issues with AI agents
In `@deepmd/utils/batch_size.py`:
- Around line 161-162: The OOM handler incorrectly checks the method object
self.set_oom_retry_mode rather than its boolean result, so every OOM triggers a
retry; change the condition to call the method (if self.set_oom_retry_mode():
...) so it evaluates the returned bool before raising RetrySignal, and ensure
the method returns a proper bool.
🧹 Nitpick comments (1)
deepmd/utils/batch_size.py (1)
32-54: Clarify retry semantics (docstring vs behavior).
Line 32 says retries happen for “certain times,” but the wrapper loops forever. Either document it as unbounded or add a cap/backoff.✏️ Minimal doc fix
- """Decorator to retry the function until it succeeds or fails for certain times. + """Decorator to retry the function until it succeeds (no max retry cap).
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #5181 +/- ##
==========================================
- Coverage 81.95% 81.93% -0.02%
==========================================
Files 714 714
Lines 73441 73477 +36
Branches 3616 3617 +1
==========================================
+ Hits 60187 60205 +18
- Misses 12091 12110 +19
+ Partials 1163 1162 -1 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
…deepmd-kit into eval_desc_auto_batch_size
for more information, see https://pre-commit.ci
Fix #5180.
Summary by CodeRabbit
Release Notes
✏️ Tip: You can customize this high-level summary in your review settings.