
Conversation

njzjz (Member) commented Jan 29, 2026

Fix #5180.

Summary by CodeRabbit

Release Notes

  • New Features
    • Enhanced automatic batch size adjustment with out-of-memory (OOM) retry capability for more reliable model evaluation during descriptor and fitting operations.
    • Added configurable OOM retry mode to provide control over memory error recovery behavior.
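
A hedged usage sketch of the new toggle (API names come from the walkthrough below; the concrete AutoBatchSize subclass, the eval_func callable, and the execute_all arguments shown here are illustrative assumptions, not the exact deepmd API):

    # auto_batch is a backend-specific AutoBatchSize instance (hypothetical setup)
    auto_batch.set_oom_retry_mode(True)   # an OOM now restarts the whole evaluation
    try:
        # eval_func(batch_size, start_index) is a placeholder evaluation callable
        results = auto_batch.execute_all(eval_func, nframes, natoms, coords)
    finally:
        auto_batch.set_oom_retry_mode(False)  # always restore the default behavior

Resetting the mode in a finally block mirrors the reviewers' suggestion further down this thread.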


Copilot AI review requested due to automatic review settings January 29, 2026 15:39
dosubot bot added the bug label Jan 29, 2026
coderabbitai bot (Contributor) commented Jan 29, 2026

📝 Walkthrough

These changes add an out-of-memory (OOM) retry mechanism to the DeePMD-kit inference pipeline. By introducing retry signaling and toggling an OOM retry mode on before and off after each evaluation call, they make automatic batch size adjustment safe during descriptor and fitting-last-layer evaluations.

Changes

  • OOM Retry Infrastructure (deepmd/utils/batch_size.py): Introduces a RetrySignal exception for OOM retry signaling, adds a retry decorator that re-runs a function with logging, implements an oom_retry_mode flag in the AutoBatchSize class, modifies the execute method to raise RetrySignal on OOM when retry mode is enabled, decorates execute_all with @retry, and adds a set_oom_retry_mode public API method to toggle the behavior.
  • Descriptor and Fitting Layer Evaluation (deepmd/pd/infer/deep_eval.py, deepmd/pt/infer/deep_eval.py): Toggles AutoBatchSize's OOM retry mode on before and off after descriptor evaluation and fitting-last-layer evaluation in both the Paddle and PyTorch inference implementations.
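
For orientation, a minimal sketch of the infrastructure described in the first bullet above, assuming simplified internals (the real deepmd/utils/batch_size.py differs in detail, e.g. execute_all also assembles the per-batch results):

    import functools
    import logging

    log = logging.getLogger(__name__)


    class RetrySignal(Exception):
        """Signals that the whole batched evaluation should restart from scratch."""


    def retry(func):
        """Re-run func from the beginning whenever it raises RetrySignal."""

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            while True:
                try:
                    return func(*args, **kwargs)
                except RetrySignal as e:
                    log.warning("OOM retry requested, restarting evaluation: %s", e)

        return wrapper


    class AutoBatchSize:
        def __init__(self, initial_batch_size: int = 1024) -> None:
            self.current_batch_size = initial_batch_size
            self.oom_retry_mode = False

        def set_oom_retry_mode(self, oom_retry_mode: bool) -> None:
            """Toggle whether an OOM aborts and retries the whole evaluation."""
            self.oom_retry_mode = oom_retry_mode

        @retry
        def execute_all(self, callable, total_size, *args, **kwargs):
            # iterate over batches; each batch may raise RetrySignal via execute()
            ...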

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • #4214: Adds PyTorch eval_descriptor support which is directly relevant to the eval_descriptor retry toggling introduced in this PR.

Suggested reviewers

  • wanghan-iapcm
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'fix(pt/pd): fix incompatibility between AutoBatchSize and eval hooks' directly and clearly summarizes the main change: fixing the incompatibility between AutoBatchSize and evaluation hooks across the PyTorch and PaddlePaddle backends.
  • Linked Issues Check: ✅ Passed. The PR addresses the root cause of issue #5180 by toggling an OOM retry mode in AutoBatchSize and the evaluation functions, preventing descriptor frame mismatches when batch sizes are adjusted automatically during evaluation.
  • Out of Scope Changes Check: ✅ Passed. All changes are directly related to the linked issue: the OOM retry mechanism in batch_size.py, and the toggling of this mode in eval_descriptor and eval_fitting_last_layer in both the PyTorch and PaddlePaddle implementations.



Copilot AI (Contributor) left a comment

Pull request overview

This pull request attempts to fix an incompatibility between AutoBatchSize and evaluation hooks (eval_descriptor and eval_fitting_last_layer) that caused mismatched descriptor output when OOM errors occurred during batch processing. The issue manifested as descriptors having more frames than the input system (e.g., 241 frames vs 175 in the reported issue).

Changes:

  • Introduces a retry mechanism via RetrySignal exception and @retry decorator to restart processing from the beginning when OOM occurs during hook-based evaluation
  • Adds oom_retry_mode flag to AutoBatchSize to control whether OOM errors trigger a full retry
  • Enables retry mode in eval_descriptor and eval_fitting_last_layer methods for both PyTorch (pt) and Paddle (pd) backends
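
The behavioral core of the first bullet above, sketched under the assumption of a simplified execute method (is_oom_error stands in for the backend-specific OOM check; the real method differs in detail):

    def execute(self, callable, start_index, natoms):
        try:
            return callable(self.current_batch_size, start_index)
        except Exception as e:
            if not self.is_oom_error(e):
                raise
            # shrink the batch size as before ...
            self.current_batch_size = max(1, self.current_batch_size // 2)
            if self.oom_retry_mode:
                # ... but in retry mode, restart the whole evaluation so that
                # eval hooks do not accumulate duplicate frames from re-run batches
                raise RetrySignal("OOM encountered, retrying from the start") from e
            return self.execute(callable, start_index, natoms)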

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

  • deepmd/utils/batch_size.py: Adds the RetrySignal exception, retry decorator, oom_retry_mode flag, and logic to raise RetrySignal on OOM when retry mode is enabled
  • deepmd/pt/infer/deep_eval.py: Enables/disables oom_retry_mode around eval calls in the eval_descriptor and eval_fitting_last_layer methods
  • deepmd/pd/infer/deep_eval.py: Enables/disables oom_retry_mode around eval calls in the eval_descriptor and eval_fitting_last_layer methods


coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
deepmd/pt/infer/deep_eval.py (1)

796-812: Ensure hooks & OOM retry mode are reset on exceptions.
Lines 796-812 and 855-871 enable hooks/retry mode but only disable them on the success path. If self.eval(...) or model.eval_* throws, the flags remain enabled and can corrupt subsequent calls.

✅ Safer pattern (apply to both methods)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(True)
-        model.set_eval_descriptor_hook(True)
-        self.eval(
-            coords,
-            cells,
-            atom_types,
-            atomic=False,
-            fparam=fparam,
-            aparam=aparam,
-            **kwargs,
-        )
-        descriptor = model.eval_descriptor()
-        model.set_eval_descriptor_hook(False)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(False)
+        if self.auto_batch_size is not None:
+            self.auto_batch_size.set_oom_retry_mode(True)
+        model.set_eval_descriptor_hook(True)
+        try:
+            self.eval(
+                coords,
+                cells,
+                atom_types,
+                atomic=False,
+                fparam=fparam,
+                aparam=aparam,
+                **kwargs,
+            )
+            descriptor = model.eval_descriptor()
+        finally:
+            model.set_eval_descriptor_hook(False)
+            if self.auto_batch_size is not None:
+                self.auto_batch_size.set_oom_retry_mode(False)

Also applies to: 855-871

deepmd/pd/infer/deep_eval.py (1)

823-842: Ensure hooks & OOM retry mode are reset on exceptions.
Lines 823-842 and 884-901 toggle hooks/retry mode without a finally. Any exception during self.eval(...) or model.eval_* can leave the backend in a bad state.

✅ Safer pattern (apply to both methods)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(True)
-        model.set_eval_descriptor_hook(True)
-        self.eval(
-            coords,
-            cells,
-            atom_types,
-            atomic=False,
-            fparam=fparam,
-            aparam=aparam,
-            **kwargs,
-        )
-        descriptor = model.eval_descriptor()
-        model.set_eval_descriptor_hook(False)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(False)
+        if self.auto_batch_size is not None:
+            self.auto_batch_size.set_oom_retry_mode(True)
+        model.set_eval_descriptor_hook(True)
+        try:
+            self.eval(
+                coords,
+                cells,
+                atom_types,
+                atomic=False,
+                fparam=fparam,
+                aparam=aparam,
+                **kwargs,
+            )
+            descriptor = model.eval_descriptor()
+        finally:
+            model.set_eval_descriptor_hook(False)
+            if self.auto_batch_size is not None:
+                self.auto_batch_size.set_oom_retry_mode(False)

Also applies to: 884-901

🤖 Fix all issues with AI agents
In `deepmd/utils/batch_size.py`:
- Around line 161-162: The OOM handler tests the bound method self.set_oom_retry_mode, which is always truthy, rather than the boolean flag it sets, so every OOM triggers a retry regardless of the mode; change the condition to test the flag itself (if self.oom_retry_mode: ...) before raising RetrySignal.
🧹 Nitpick comments (1)
deepmd/utils/batch_size.py (1)

32-54: Clarify retry semantics (docstring vs behavior).
Line 32 says retries happen for “certain times,” but the wrapper loops forever. Either document it as unbounded or add a cap/backoff.

✏️ Minimal doc fix
-    """Decorator to retry the function until it succeeds or fails for certain times.
+    """Decorator to retry the function until it succeeds (no max retry cap).

@njzjz njzjz marked this pull request as draft January 29, 2026 16:12

codecov bot commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 47.91667% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.93%. Comparing base (8787b45) to head (f785d61).

Files with missing lines:
  • deepmd/pd/infer/deep_eval.py: 4.76% patch coverage, 20 lines missing ⚠️
  • deepmd/pt/infer/deep_eval.py: 80.95% patch coverage, 4 lines missing ⚠️
  • deepmd/utils/batch_size.py: 83.33% patch coverage, 1 line missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5181      +/-   ##
==========================================
- Coverage   81.95%   81.93%   -0.02%     
==========================================
  Files         714      714              
  Lines       73441    73477      +36     
  Branches     3616     3617       +1     
==========================================
+ Hits        60187    60205      +18     
- Misses      12091    12110      +19     
+ Partials     1163     1162       -1     

☔ View full report in Codecov by Sentry.

Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.



Development

Successfully merging this pull request may close these issues:

  • #5180: [BUG] dp eval-desc give mismatch descriptors on certain System in certain GPU node
