
Conversation

njzjz (Member) commented Jan 29, 2026

Fix #5180.

Summary by CodeRabbit

Release Notes

  • New Features
    • Enhanced automatic batch size adjustment with out-of-memory (OOM) retry capability for more reliable model evaluation during descriptor and fitting operations.
    • Added configurable OOM retry mode to provide control over memory error recovery behavior.
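
A hedged usage sketch of the new toggle (API names come from the walkthrough below; the concrete AutoBatchSize subclass, the eval_func callable, and the execute_all arguments shown here are illustrative assumptions, not the exact deepmd API):

    # auto_batch is a backend-specific AutoBatchSize instance (hypothetical setup)
    auto_batch.set_oom_retry_mode(True)   # an OOM now restarts the whole evaluation
    try:
        # eval_func(batch_size, start_index) is a placeholder evaluation callable
        results = auto_batch.execute_all(eval_func, nframes, natoms, coords)
    finally:
        auto_batch.set_oom_retry_mode(False)  # always restore the default behavior

Resetting the mode in a finally block mirrors the reviewers' suggestion further down this thread.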


Copilot AI review requested due to automatic review settings January 29, 2026 15:39
dosubot bot added the bug label Jan 29, 2026
coderabbitai bot (Contributor) commented Jan 29, 2026

📝 Walkthrough

These changes add an out-of-memory (OOM) retry mechanism to the DeePMD-kit inference pipeline. By introducing retry signaling and toggling an OOM retry mode on before and off after each evaluation call, they make automatic batch size adjustment safe during descriptor and fitting-last-layer evaluations.

Changes

  • OOM Retry Infrastructure (deepmd/utils/batch_size.py): Introduces a RetrySignal exception for OOM retry signaling, adds a retry decorator that re-runs a function with logging, implements an oom_retry_mode flag in the AutoBatchSize class, modifies the execute method to raise RetrySignal on OOM when retry mode is enabled, decorates execute_all with @retry, and adds a set_oom_retry_mode public API method to toggle the behavior.
  • Descriptor and Fitting Layer Evaluation (deepmd/pd/infer/deep_eval.py, deepmd/pt/infer/deep_eval.py): Toggles AutoBatchSize's OOM retry mode on before and off after descriptor evaluation and fitting-last-layer evaluation in both the Paddle and PyTorch inference implementations.
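
For orientation, a minimal sketch of the infrastructure described in the first bullet above, assuming simplified internals (the real deepmd/utils/batch_size.py differs in detail, e.g. execute_all also assembles the per-batch results):

    import functools
    import logging

    log = logging.getLogger(__name__)


    class RetrySignal(Exception):
        """Signals that the whole batched evaluation should restart from scratch."""


    def retry(func):
        """Re-run func from the beginning whenever it raises RetrySignal."""

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            while True:
                try:
                    return func(*args, **kwargs)
                except RetrySignal as e:
                    log.warning("OOM retry requested, restarting evaluation: %s", e)

        return wrapper


    class AutoBatchSize:
        def __init__(self, initial_batch_size: int = 1024) -> None:
            self.current_batch_size = initial_batch_size
            self.oom_retry_mode = False

        def set_oom_retry_mode(self, oom_retry_mode: bool) -> None:
            """Toggle whether an OOM aborts and retries the whole evaluation."""
            self.oom_retry_mode = oom_retry_mode

        @retry
        def execute_all(self, callable, total_size, *args, **kwargs):
            # iterate over batches; each batch may raise RetrySignal via execute()
            ...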

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • #4214: Adds PyTorch eval_descriptor support which is directly relevant to the eval_descriptor retry toggling introduced in this PR.

Suggested reviewers

  • wanghan-iapcm
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'fix(pt/pd): fix incompatibility between AutoBatchSize and eval hooks' directly and clearly summarizes the main change: fixing the incompatibility between AutoBatchSize and evaluation hooks across the PyTorch and PaddlePaddle backends.
  • Linked Issues Check: ✅ Passed. The PR addresses the root cause of issue #5180 by toggling an OOM retry mode in AutoBatchSize and the evaluation functions, preventing descriptor frame mismatches when batch sizes are adjusted automatically during evaluation.
  • Out of Scope Changes Check: ✅ Passed. All changes are directly related to the linked issue: the OOM retry mechanism in batch_size.py, and the toggling of this mode in eval_descriptor and eval_fitting_last_layer in both the PyTorch and PaddlePaddle implementations.



Copilot AI (Contributor) left a comment

Pull request overview

This pull request attempts to fix an incompatibility between AutoBatchSize and evaluation hooks (eval_descriptor and eval_fitting_last_layer) that caused mismatched descriptor output when OOM errors occurred during batch processing. The issue manifested as descriptors having more frames than the input system (e.g., 241 frames vs 175 in the reported issue).

Changes:

  • Introduces a retry mechanism via RetrySignal exception and @retry decorator to restart processing from the beginning when OOM occurs during hook-based evaluation
  • Adds oom_retry_mode flag to AutoBatchSize to control whether OOM errors trigger a full retry
  • Enables retry mode in eval_descriptor and eval_fitting_last_layer methods for both PyTorch (pt) and Paddle (pd) backends
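
The behavioral core of the first bullet above, sketched under the assumption of a simplified execute method (is_oom_error stands in for the backend-specific OOM check; the real method differs in detail):

    def execute(self, callable, start_index, natoms):
        try:
            return callable(self.current_batch_size, start_index)
        except Exception as e:
            if not self.is_oom_error(e):
                raise
            # shrink the batch size as before ...
            self.current_batch_size = max(1, self.current_batch_size // 2)
            if self.oom_retry_mode:
                # ... but in retry mode, restart the whole evaluation so that
                # eval hooks do not accumulate duplicate frames from re-run batches
                raise RetrySignal("OOM encountered, retrying from the start") from e
            return self.execute(callable, start_index, natoms)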

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 7 comments.

  • deepmd/utils/batch_size.py: Adds the RetrySignal exception, retry decorator, oom_retry_mode flag, and logic to raise RetrySignal on OOM when retry mode is enabled
  • deepmd/pt/infer/deep_eval.py: Enables/disables oom_retry_mode around eval calls in the eval_descriptor and eval_fitting_last_layer methods
  • deepmd/pd/infer/deep_eval.py: Enables/disables oom_retry_mode around eval calls in the eval_descriptor and eval_fitting_last_layer methods


coderabbitai bot (Contributor) left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
deepmd/pt/infer/deep_eval.py (1)

796-812: Ensure hooks & OOM retry mode are reset on exceptions.
Lines 796-812 and 855-871 enable hooks/retry mode but only disable them on the success path. If self.eval(...) or model.eval_* throws, the flags remain enabled and can corrupt subsequent calls.

✅ Safer pattern (apply to both methods)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(True)
-        model.set_eval_descriptor_hook(True)
-        self.eval(
-            coords,
-            cells,
-            atom_types,
-            atomic=False,
-            fparam=fparam,
-            aparam=aparam,
-            **kwargs,
-        )
-        descriptor = model.eval_descriptor()
-        model.set_eval_descriptor_hook(False)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(False)
+        if self.auto_batch_size is not None:
+            self.auto_batch_size.set_oom_retry_mode(True)
+        model.set_eval_descriptor_hook(True)
+        try:
+            self.eval(
+                coords,
+                cells,
+                atom_types,
+                atomic=False,
+                fparam=fparam,
+                aparam=aparam,
+                **kwargs,
+            )
+            descriptor = model.eval_descriptor()
+        finally:
+            model.set_eval_descriptor_hook(False)
+            if self.auto_batch_size is not None:
+                self.auto_batch_size.set_oom_retry_mode(False)

Also applies to: 855-871

deepmd/pd/infer/deep_eval.py (1)

823-842: Ensure hooks & OOM retry mode are reset on exceptions.
Lines 823-842 and 884-901 toggle hooks/retry mode without a finally. Any exception during self.eval(...) or model.eval_* can leave the backend in a bad state.

✅ Safer pattern (apply to both methods)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(True)
-        model.set_eval_descriptor_hook(True)
-        self.eval(
-            coords,
-            cells,
-            atom_types,
-            atomic=False,
-            fparam=fparam,
-            aparam=aparam,
-            **kwargs,
-        )
-        descriptor = model.eval_descriptor()
-        model.set_eval_descriptor_hook(False)
-        if self.auto_batch_size is not None:
-            self.auto_batch_size.set_oom_retry_mode(False)
+        if self.auto_batch_size is not None:
+            self.auto_batch_size.set_oom_retry_mode(True)
+        model.set_eval_descriptor_hook(True)
+        try:
+            self.eval(
+                coords,
+                cells,
+                atom_types,
+                atomic=False,
+                fparam=fparam,
+                aparam=aparam,
+                **kwargs,
+            )
+            descriptor = model.eval_descriptor()
+        finally:
+            model.set_eval_descriptor_hook(False)
+            if self.auto_batch_size is not None:
+                self.auto_batch_size.set_oom_retry_mode(False)

Also applies to: 884-901

🤖 Fix all issues with AI agents
In `deepmd/utils/batch_size.py`:
- Around line 161-162: The OOM handler tests the bound method self.set_oom_retry_mode, which is always truthy, rather than the boolean flag it sets, so every OOM triggers a retry regardless of the mode; change the condition to test the flag itself (if self.oom_retry_mode: ...) before raising RetrySignal.
🧹 Nitpick comments (1)
deepmd/utils/batch_size.py (1)

32-54: Clarify retry semantics (docstring vs behavior).
Line 32 says retries happen for “certain times,” but the wrapper loops forever. Either document it as unbounded or add a cap/backoff.

✏️ Minimal doc fix
-    """Decorator to retry the function until it succeeds or fails for certain times.
+    """Decorator to retry the function until it succeeds (no max retry cap).

@njzjz njzjz marked this pull request as draft January 29, 2026 16:12

codecov bot commented Jan 29, 2026

Codecov Report

❌ Patch coverage is 47.91667% with 25 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.93%. Comparing base (8787b45) to head (f785d61).

Files with missing lines:
  • deepmd/pd/infer/deep_eval.py: 4.76% patch coverage, 20 lines missing ⚠️
  • deepmd/pt/infer/deep_eval.py: 80.95% patch coverage, 4 lines missing ⚠️
  • deepmd/utils/batch_size.py: 83.33% patch coverage, 1 line missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5181      +/-   ##
==========================================
- Coverage   81.95%   81.93%   -0.02%     
==========================================
  Files         714      714              
  Lines       73441    73477      +36     
  Branches     3616     3617       +1     
==========================================
+ Hits        60187    60205      +18     
- Misses      12091    12110      +19     
+ Partials     1163     1162       -1     

☔ View full report in Codecov by Sentry.

Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.



Development

Successfully merging this pull request may close these issues:

  • #5180: [BUG] dp eval-desc give mismatch descriptors on certain System in certain GPU node
